Assignment Week 5 ~ Functions and Loops

Nazwa Nur Ramadhani

Undergraduate Student in Data Science at Institut Teknologi Sains Bandung

Introduction

In an increasingly data-driven world, the ability to understand and process data is no longer an optional skill but a fundamental necessity. This practicum explores the use of R across the data analysis process, from data creation and processing to visualization, and shows how data can be turned into meaningful information. Each task focuses not only on the final outcome but also on the logical thinking behind the solution, using functions, loops, and various data processing techniques.

1. Dynamic Multi-Formula Function

# Library
library(ggplot2)
library(tidyr)
library(plotly)
 
# Color
PASTEL <- c('#FFB3C1','#FFD6A5','#A0C4FF','#BDB2FF')
 
# Function
compute_formula <- function(x, formula) {
  valid <- c("linear", "quadratic", "cubic", "exponential")
  if (!(formula %in% valid)) {
    stop(sprintf("Formula '%s' is not valid. Choose one of: %s",
                 formula, paste(valid, collapse = ", ")))
  }
  if      (formula == "linear")      return(2*x + 3)
  else if (formula == "quadratic")   return(x^2 - 4*x + 4)
  else if (formula == "cubic")       return(x^3 - 3*x^2 + 2*x)
  else if (formula == "exponential") return(exp(0.3 * x))
}
 
# Nested loop: compute every formula for x = 1..20
x_vals   <- 1:20
formulas <- c("linear", "quadratic", "cubic", "exponential")
results  <- data.frame(x = x_vals)
 
for (formula in formulas) {          # outer loop: formulas
  y_vals <- numeric(length(x_vals))  # preallocate instead of growing the vector
  for (i in seq_along(x_vals)) {     # inner loop: x values
    y_vals[i] <- compute_formula(x_vals[i], formula)
  }
  results[[formula]] <- y_vals
}
 
# Reshape to long format for ggplot
df_long <- pivot_longer(results, cols = -x,
                        names_to  = "formula",
                        values_to = "y")
 
# Plot
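p <- ggplot(df_long, aes(x = x, y = y, color = formula, group = formula)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  # Pass labels as a function so each label is derived from its own break;
  # a plain vector is matched by position to the alphabetically sorted breaks
  scale_color_manual(values = setNames(PASTEL, formulas),
                     labels = tools::toTitleCase) +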
  labs(
    title  = "Multi-Formula Comparison",
    x      = "x",
    y      = "y = f(x)",
    color  = "Formula"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.background    = element_rect(fill = "#FFF9F9", color = NA),
    panel.background   = element_rect(fill = "#FFF9F9", color = NA),
    panel.grid.major   = element_line(color = "#F0E6EE", linewidth = 0.8),
    panel.grid.minor   = element_blank(),
    axis.text          = element_text(color = "#9B7BB8"),
    axis.title         = element_text(color = "#9B7BB8"),
    plot.title         = element_text(color = "#7B5EA7", face = "bold", size = 13),
    legend.background  = element_rect(fill = "#FFF9F9", color = "#CCAACC"),
    legend.text        = element_text(color = "#7B5EA7"),
    legend.title       = element_text(color = "#7B5EA7")
  )

ggplotly(p)

Interpretation:

The chart compares four functions (linear, quadratic, cubic, and exponential) for x values from 1 to 20. The linear function forms a straight line: each unit increase in x raises y by a constant 2. The quadratic function curves upward and exceeds the linear function from x = 6 onward, reaching 324 at x = 20. The exponential function starts slowly, staying below the linear function until about x = 11, but then accelerates and overtakes the quadratic near x = 19, reaching about 403 at x = 20.

The cubic function rises far more sharply than the others and gives the largest value at high x (6,840 at x = 20). Overall, all four functions increase over most of this range, but at very different rates: the linear function grows at a constant rate, the quadratic and exponential grow faster, and the cubic grows fastest over this x range.
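
The ranking of the four curves can be checked numerically. The following is a minimal sketch, not the code used in the report: it evaluates the same four formulas in vectorized form at a few x values. The names `f`, `x_check`, and `vals` are introduced here only for illustration.

```r
# Same four formulas as compute_formula(), stored as vectorized functions
# (the list name `f` is an illustrative assumption)
f <- list(
  linear      = function(x) 2 * x + 3,
  quadratic   = function(x) x^2 - 4 * x + 4,
  cubic       = function(x) x^3 - 3 * x^2 + 2 * x,
  exponential = function(x) exp(0.3 * x)
)

# Evaluate every formula at the start, middle, and end of the x range
x_check <- c(1, 10, 20)
vals <- sapply(f, function(fun) fun(x_check))
rownames(vals) <- paste0("x=", x_check)
round(vals, 1)

# Formula with the largest value at x = 20
names(which.max(vals["x=20", ]))
```

With these formulas, the values at x = 20 are 43 (linear), 324 (quadratic), 6840 (cubic), and about 403.4 (exponential), so the cubic curve dominates the upper end of the range.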

2. Nested Simulation: Multi-Sales & Discounts

#Library
library(dplyr)
library(ggplot2)
library(plotly)
library(DT)

# Load Data
df23 <- read.csv("C:/Users/Asus/OneDrive/Desktop/Assignment Week 5/dataset 2, 3.csv")

datatable(df23,
          caption = "Dataset",
          options = list(pageLength = 10, scrollX = TRUE),
          rownames = FALSE)
# Color
PASTEL <- c('#FFB3C1','#FFD6A5','#CAFFBF','#9BF6FF','#BDB2FF')

# Function: Discount
apply_discount <- function(x) {
  if (x > 900) return(0.20)
  else if (x > 700) return(0.15)
  else if (x > 500) return(0.10)
  else if (x > 300) return(0.05)
  else return(0.00)
}

# Nested Function: Cumulative Sales
cumulative_sales_func <- function(sales_vec) {
  total <- 0
  result <- c()
  
  for (s in sales_vec) {
    total <- total + s
    result <- c(result, total)
  }
  
  return(result)
}

# Loop per Salesperson
sales_ids <- unique(df23$sales_id)

final_data <- data.frame()

for (s in sales_ids) {
  
  # Sort by day so the cumulative sum runs in chronological order
  temp <- df23 %>% filter(sales_id == s) %>% arrange(day)
  
  # Apply discount
  temp$discount_rate <- sapply(temp$sales_amount, apply_discount)
  temp$net_sales <- temp$sales_amount * (1 - temp$discount_rate)
  
  # Cumulative (nested function)
  temp$cumulative_sales <- cumulative_sales_func(temp$net_sales)
  
  final_data <- rbind(final_data, temp)
}

# Summary Stats
summary_sales <- final_data %>%
  group_by(sales_id) %>%
  summarise(
    total_sales = sum(net_sales),
    avg_sales = mean(sales_amount),
    max_sales = max(sales_amount),
    avg_discount = mean(discount_rate)
  )

# Table Summary
datatable(summary_sales)
# Plot
p <- ggplot(final_data, aes(x = day, y = cumulative_sales,
                           color = sales_id, group = sales_id)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  scale_color_manual(values = setNames(PASTEL, unique(final_data$sales_id))) +
  labs(
    title = "Cumulative Sales per Salesperson",
    x = "Day",
    y = "Cumulative Net Sales"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.background  = element_rect(fill = "#FFF9F9", color = NA),
    panel.background = element_rect(fill = "#FFF9F9", color = NA),
    panel.grid.major = element_line(color = "#F0E6EE"),
    axis.text        = element_text(color = "#9B7BB8"),
    plot.title       = element_text(color = "#7B5EA7", face = "bold")
  )

ggplotly(p)

Interpretation:

The cumulative sales graph shows each salesperson's total net sales (after discounts) rising from day to day, since every point adds that day's net sales to the running total. The slope of a line reflects daily net sales: a steeper segment means higher sales over that period, while a flatter segment means lower sales. Salespeople whose lines climb faster therefore have stronger sales performance than those whose lines stay flatter.
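
The discount tiers and the running total can also be written without explicit loops. The following is a minimal sketch on made-up sales figures, using base R `findInterval()` and `cumsum()`; it is not the method used in the report. The helper name `discount_rate_vec` and the vector `sales` are hypothetical.

```r
# Tier lookup equivalent to apply_discount(), vectorized over x.
# left.open = TRUE makes each interval (lower, upper], so a sale of
# exactly 900 falls in the 15% tier, as in the if/else chain.
discount_rate_vec <- function(x) {
  rates <- c(0.00, 0.05, 0.10, 0.15, 0.20)
  rates[findInterval(x, c(300, 500, 700, 900), left.open = TRUE) + 1]
}

# Made-up daily sales for one salesperson (illustrative only)
sales <- c(250, 300, 450, 900, 950)

rate <- discount_rate_vec(sales)
net  <- sales * (1 - rate)

# cumsum() gives the same running total as cumulative_sales_func()
data.frame(sales, rate, net, cumulative = cumsum(net))
```

In this example a sale of 950 nets 760 after the 20% discount, slightly less than a sale of 900 at 15% (765), which illustrates how stepped tiers can create small drops in net sales just above a boundary.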

3. Multi-Level Performance Categorization

# Library
library(readr)
library(dplyr)
library(plotly)
library(DT)

# Load Data
data <- read_csv("C:/Users/Asus/OneDrive/Desktop/Assignment Week 5/dataset 2, 3.csv")

datatable(data,
          caption = "Dataset",
          options = list(pageLength = 10, scrollX = TRUE),
          rownames = FALSE)
# Category column
kategori <- data$performance_category

# Loop to count the frequency of each category
freq <- c()

for(i in unique(kategori)){
  freq[i] <- sum(kategori == i)
}

# Compute percentages
persentase <- (freq / sum(freq)) * 100

# Build the summary table
tabel <- data.frame(
  Category = names(freq),
  Frequency = as.numeric(freq),
  Percentage = as.numeric(persentase)
)

tabel <- tabel %>%
  arrange(Frequency)

tabel$Category <- factor(tabel$Category, levels = tabel$Category)

# Bar Chart
bar_plot <- plot_ly(
  tabel,
  x = ~Category,
  y = ~Frequency,
  type = "bar",
  color = ~Category,
  text = ~Frequency,
  textposition = "outside",
  hovertext = ~paste("Category:", Category,
                "<br>Frequency:", Frequency,
                "<br>Percentage:", round(Percentage,2), "%"),
  hoverinfo = "text"
) %>%
  layout(
    title = "Bar Plot Distribution of Category",
    xaxis = list(title = "Category"),
    yaxis = list(title = "Frequency"),
    showlegend = FALSE
  )

bar_plot
# Pie Chart
pie_chart <- plot_ly(
  tabel,
  labels = ~Category,
  values = ~Frequency,
  type = "pie",
  textinfo = "percent",
  hoverinfo = "label+value+percent"
  
) %>%
  layout(
    title = "Pie Chart Distribution of Category"
  )

pie_chart

Interpretation:

The bar plot shows that “Poor” has the highest frequency, followed by “Very Good” and “Average”, while “Good” and especially “Excellent” have the lowest counts. The pie chart gives the corresponding proportions: “Poor” takes the largest share (32%), followed by “Very Good” (24%) and “Average” (20%), while “Excellent” is the smallest at about 10%. Both charts consistently show an uneven performance distribution in which the lowest category, “Poor”, is the single largest group.
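
The frequency loop can be cross-checked against base R's built-in counting functions. The following is a minimal sketch on made-up labels (the vector `kategori_demo` is hypothetical and is not the report's dataset); it compares the loop result with `table()` and `prop.table()`.

```r
# Made-up category labels (illustrative only)
kategori_demo <- c("Poor", "Good", "Poor", "Average", "Excellent",
                   "Poor", "Very Good", "Average", "Very Good", "Poor")

# Loop-based count, following the same pattern as the report
freq_loop <- c()
for (k in unique(kategori_demo)) {
  freq_loop[k] <- sum(kategori_demo == k)
}

# Built-in equivalent: table() counts, prop.table() converts to shares
freq_tab <- table(kategori_demo)
pct_tab  <- round(100 * prop.table(freq_tab), 1)

freq_tab
pct_tab

# Both approaches give the same counts once the names are aligned
all(freq_loop[names(freq_tab)] == as.vector(freq_tab))
```

The loop makes the counting logic explicit, while `table()` is shorter and returns the categories in alphabetical order.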

4. Multi-Company Dataset Simulation

# Library
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(DT)

# Load Data
data <- read_csv("C:/Users/Asus/OneDrive/Desktop/Assignment Week 5/dataset 4,6.csv")

datatable(data,
          caption = "Dataset",
          options = list(pageLength = 10, scrollX = TRUE),
          rownames = FALSE)
# Nested Loop Company & Employee
companies <- unique(data$company_id)
top_performers <- data.frame()

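for(comp in companies){                # outer loop: companies (avoids masking c())
  
  company_data <- data[data$company_id == comp, ]
  employees <- unique(company_data$employee_id)
  
  for(e in employees){                 # inner loop: employees
    
    emp_data <- company_data[company_data$employee_id == e, ]
    
    # Keep only rows with KPI > 90; which() drops NA values and also
    # handles an employee that appears in more than one row
    top_rows <- emp_data[which(emp_data$KPI_score > 90), ]
    if(nrow(top_rows) > 0){
      top_performers <- rbind(top_performers, top_rows)
    }
  }
}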

# Top Performers Table
datatable(top_performers,
          caption = "Top Performers (KPI > 90)",
          options = list(pageLength = 10, scrollX = TRUE),
          rownames = FALSE)
# Summary per Company
summary_company <- data %>%
  group_by(company_id) %>%
  summarise(
    Avg_Salary = mean(salary, na.rm = TRUE),
    Avg_Performance = mean(performance_score, na.rm = TRUE),
    Max_KPI = max(KPI_score, na.rm = TRUE)
  )

# Summary Table
datatable(summary_company,
          caption = "Summary per Company",
          options = list(pageLength = 5, scrollX = TRUE),
          rownames = FALSE)
# Plot 1: AVG SALARY
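sc1 <- summary_company %>% arrange(Avg_Salary)
# Fix the category order so the bars follow the sorted values
sc1$company_id <- factor(sc1$company_id, levels = sc1$company_id)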

plot_salary <- plot_ly(
  sc1,
  x = ~company_id,
  y = ~Avg_Salary,
  type = "bar",
  color = ~company_id,
  text = ~round(Avg_Salary, 0),
  textposition = "outside"
) %>%
  layout(
    title = "Average Salary per Company",
    xaxis = list(title = "Company"),
    yaxis = list(title = "Average Salary",
                 range = c(0, max(sc1$Avg_Salary) * 1.2)),
    showlegend = FALSE
  )

plot_salary
# Plot 2: AVG PERFORMANCE
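sc2 <- summary_company %>% arrange(Avg_Performance)
# Fix the category order so the bars follow the sorted values
sc2$company_id <- factor(sc2$company_id, levels = sc2$company_id)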

plot_perf <- plot_ly(
  sc2,
  x = ~company_id,
  y = ~Avg_Performance,
  type = "bar",
  color = ~company_id,
  text = ~round(Avg_Performance, 1),
  textposition = "outside"
) %>%
  layout(
    title = "Average Performance per Company",
    xaxis = list(title = "Company"),
    yaxis = list(title = "Average Performance",
                 range = c(0, max(sc2$Avg_Performance) * 1.2)),
    showlegend = FALSE
  )

plot_perf
# Plot 3: MAX KPI
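sc3 <- summary_company %>% arrange(Max_KPI)
# Fix the category order so the bars follow the sorted values
sc3$company_id <- factor(sc3$company_id, levels = sc3$company_id)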

plot_kpi <- plot_ly(
  sc3,
  x = ~company_id,
  y = ~Max_KPI,
  type = "bar",
  color = ~company_id,
  text = ~Max_KPI,
  textposition = "outside"
) %>%
  layout(
    title = "Maximum KPI per Company",
    xaxis = list(title = "Company"),
    yaxis = list(title = "Max KPI",
                 range = c(0, max(sc3$Max_KPI) * 1.2)),
    showlegend = FALSE
  )

plot_kpi

Interpretation:

  • Average Salary per Company: The highest average salary is found in Company 03, followed by Company 01, Company 04, and the lowest is Company 02. This indicates that Company 03 offers the highest compensation among the companies, while Company 02 provides the lowest.

  • Average Performance per Company: In terms of average performance, Company 03 also ranks the highest at around 76.2, followed by Company 01 and Company 04, which have similar values, while Company 02 is the lowest. This shows that employees in Company 03 generally perform better compared to those in other companies, whereas Company 02 lags behind.

  • Maximum KPI per Company: The highest maximum KPI belongs to Company 04 (~99.8), followed by Company 03 (~99.2), and then Company 01 and Company 02 (~98.8). This means that although Company 04 doesn’t have the highest average performance, it has the top-performing individual.
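
The top-performer selection can also be expressed as a single logical filter. The following is a minimal sketch on made-up records (the data frame `demo` and its values are hypothetical; only the column names follow the report). It is offered as a cross-check, not as the report's method.

```r
# Made-up employee records (illustrative only)
demo <- data.frame(
  company_id  = c("C01", "C01", "C02", "C02", "C03"),
  employee_id = c("E1", "E2", "E3", "E4", "E5"),
  KPI_score   = c(95.0, 72.5, NA, 91.2, 90.0)
)

# which() keeps rows where the condition is TRUE and drops NA,
# so no per-employee loop or if() is needed
top_demo <- demo[which(demo$KPI_score > 90), ]
top_demo

# Number of top performers per company
table(top_demo$company_id)
```

Because the threshold is strict (`> 90`), the employee with a score of exactly 90 is not selected, and the missing score is skipped rather than causing an error.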

5. Monte Carlo Simulation: Pi & Probability

# Library
library(ggplot2)
library(plotly)

# Number of iterations
n <- 5000

# Generate random points in the square [-1, 1] x [-1, 1]
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)

inside_circle <- numeric(n)
inside_square <- numeric(n)

for(i in 1:n){
  
  # Check whether the point is inside the unit circle
  if(x[i]^2 + y[i]^2 <= 1){
    inside_circle[i] <- 1
  } else {
    inside_circle[i] <- 0
  }
  
  # Check whether the point is inside the small sub-square [-0.5, 0.5]^2
  if(x[i] >= -0.5 && x[i] <= 0.5 && y[i] >= -0.5 && y[i] <= 0.5){
    inside_square[i] <- 1
  } else {
    inside_square[i] <- 0
  }
}

# Estimate pi
pi_estimate <- 4 * sum(inside_circle) / n
pi_estimate
## [1] 3.1424
# Probability that a point falls inside the sub-square
prob_square <- sum(inside_square) / n
prob_square
## [1] 0.2416
# Data for the plot
points_data <- data.frame(
  x = x,
  y = y,
  inside_circle = as.factor(inside_circle)
)

# Plot
p <- ggplot(points_data, aes(x = x, y = y, color = inside_circle)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Monte Carlo Simulation",
    color = "Inside Circle"
  ) +
  theme_minimal()

ggplotly(p)

Interpretation:

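The graph shows the random points generated in the Monte Carlo simulation: the green points lie inside the unit circle, and the orange points lie outside the circle but inside the square. Because the circle has area π and the square has area 4, the share of points inside the circle approximates π/4, so multiplying it by 4 gives the estimate 3.1424, close to the true value of about 3.1416. Likewise, the sub-square covers 1/4 of the square, and the simulated probability of 0.2416 is close to the theoretical 0.25. The more points that are used, the more accurate the estimates tend to be.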
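
The effect of the sample size can be explored with a vectorized version of the simulation. The following is a minimal sketch, not the report's code: the function name `estimate`, the seed, and the chosen sample sizes are all assumptions made for illustration.

```r
set.seed(123)   # seed chosen arbitrarily so the sketch is reproducible

# Estimate pi and the sub-square probability from n uniform points
# in the square [-1, 1] x [-1, 1]
estimate <- function(n) {
  x <- runif(n, -1, 1)
  y <- runif(n, -1, 1)
  in_circle <- x^2 + y^2 <= 1
  in_square <- abs(x) <= 0.5 & abs(y) <= 0.5
  c(pi_hat = 4 * mean(in_circle), p_square = mean(in_square))
}

# Estimates for increasing sample sizes
sizes <- c(100, 1000, 10000, 100000)
est <- t(sapply(sizes, estimate))
cbind(n = sizes, est, abs_error = abs(est[, "pi_hat"] - pi))
```

The standard error of such an estimate shrinks roughly in proportion to 1/sqrt(n), so the error tends to fall as n grows, although an individual run with more points can still land further from π than a lucky run with fewer.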

6. Advanced Data Transformation & Feature Engineering

# Library
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(DT)

# Load Data
company <- read_csv("C:/Users/Asus/OneDrive/Desktop/Assignment Week 5/dataset 4,6.csv")

datatable(company,
          caption = "Dataset",
          options = list(pageLength = 10, scrollX = TRUE),
          rownames = FALSE)
# Normalization Function
normalize <- function(x){
  (x - min(x)) / (max(x) - min(x))
}

# Apply normalization
company$salary_norm <- normalize(company$salary)

# Feature Engineering
company$salary_bracket <- cut(
  company$salary,
  breaks = 3,
  labels = c("Low", "Medium", "High")
)

company$performance_category <- cut(
  company$performance_score,
  breaks = 3,
  labels = c("Low", "Medium", "High")
)

# Table
datatable(
  company %>% select(salary, salary_norm, salary_bracket),
  caption = "Salary Transformation Table",
  options = list(pageLength = 10),
  rownames = FALSE
)
# Histogram
df_plot <- data.frame(
  value = c(company$salary, company$salary_norm * max(company$salary)),
  type = c(rep("Before", nrow(company)),
           rep("After", nrow(company)))
)

p1 <- ggplot(df_plot, aes(x=value, fill=type)) +
  geom_histogram(alpha=0.5, bins=30, position="identity") +
  labs(
    title="Salary Distribution: Before vs After Normalization",
    x="Salary",
    y="Count",
    fill="Condition"
  ) +
  scale_fill_manual(values = c("#FFB3C1", "#A0C4FF")) +
  theme_minimal()

ggplotly(p1)
# Boxplot 
df_box <- data.frame(
  value = c(company$salary, company$salary_norm * max(company$salary)),
  type = c(rep("Before", nrow(company)),
           rep("After", nrow(company)))
)

p2 <- ggplot(df_box, aes(x=type, y=value, fill=type)) +
  geom_boxplot(alpha=0.7) +
  labs(
    title="Salary Distribution: Before vs After Normalization",
    x="Condition",
    y="Salary",
    fill="Condition"
  ) +
  scale_fill_manual(values = c("#FFB3C1", "#A0C4FF")) +
  theme_minimal()

ggplotly(p2)

Interpretation:

  • Histogram: The histogram overlays the salary distribution before and after min-max normalization. Normalization is a linear rescaling that maps the smallest salary to 0 and the largest to 1, so it changes the scale of the data but not the shape of its distribution. In this plot the normalized values are multiplied by the maximum salary so that both series share one axis; the “After” series therefore runs from 0 up to the maximum salary and appears shifted to the left of the “Before” series, while the overall pattern of the bars stays broadly similar. The benefit of normalization is a common 0 to 1 scale that makes salary easier to compare with other variables, not a reduction in the underlying variation.

  • Boxplot: The boxplot compares the median, spread, and range of salaries before and after normalization. The median after normalization is lower than before because the rescaling shifts every value below the maximum toward 0. The relative position of the quartiles and whiskers is preserved, and any point flagged as an outlier before normalization is still flagged afterwards, since a linear rescaling does not change which values are extreme.
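
What min-max normalization does and does not change can be checked on a small example. The following is a minimal sketch on made-up salaries (the vector `salary_demo` is hypothetical); it reuses the same `normalize()` definition as the report.

```r
# Made-up salaries in IDR (illustrative only)
salary_demo <- c(5e6, 8e6, 12e6, 15e6, 25e6)

# Min-max normalization, as defined in the report
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

salary_norm_demo <- normalize(salary_demo)

# The normalized values span exactly 0 to 1
range(salary_norm_demo)

# The transformation is linear, so the correlation with the original is 1
cor(salary_demo, salary_norm_demo)

# The ranking of employees by salary is unchanged
identical(order(salary_demo), order(salary_norm_demo))

# The standard deviation shrinks by exactly the factor (max - min)
all.equal(sd(salary_norm_demo) * diff(range(salary_demo)), sd(salary_demo))
```

All relative differences between salaries are preserved; only the unit of measurement changes, which is why normalization is useful for putting variables on a common scale rather than for changing the shape of a distribution.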

7. Mini Project: Company KPI Dashboard & Simulation

# Library
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(DT)

# Load Data
data <- read_csv("C:/Users/Asus/OneDrive/Desktop/Assignment Week 5/dataset task 7.csv")

datatable(data,
          caption = "Dataset",
          options = list(pageLength = 10, scrollX = TRUE),
          rownames = FALSE)
# Color
PASTEL <- c(
  "#FFB3C1", "#FFD6A5", "#A0C4FF",
  "#BDB2FF", "#CAFFBF", "#FFC6FF",
  "#9BF6FF", "#FDFFB6", "#CAFFBF",
  "#E7C6FF"
)

# Summary per Company
company_summary <- data %>%
  group_by(company_id) %>%
  summarise(
    avg_salary = mean(salary),
    avg_KPI = mean(KPI_score),
    top_performers = sum(KPI_score > 90)
  )

datatable(company_summary, caption = "Company Summary")
# KPI Tier (Loop)
KPI_tier <- character(nrow(data))   # preallocate one label per row

for(i in seq_len(nrow(data))){
  if(data$KPI_score[i] >= 90){
    KPI_tier[i] <- "Excellent"
  } else if(data$KPI_score[i] >= 75){
    KPI_tier[i] <- "Good"
  } else if(data$KPI_score[i] >= 60){
    KPI_tier[i] <- "Average"
  } else {
    KPI_tier[i] <- "Low"
  }
}

data$KPI_tier <- KPI_tier

datatable(data %>% select(employee_id, KPI_score, KPI_tier),
          caption = "KPI Tier Table")
# Top Performers
top_perf_summary <- data %>%
  filter(KPI_score > 90) %>%
  group_by(company_id) %>%
  summarise(count = n()) %>%
  arrange(count)

top_perf_summary$company_id <- factor(
  top_perf_summary$company_id,
  levels = top_perf_summary$company_id
)

bar_plot <- plot_ly(
  top_perf_summary,
  x = ~company_id,
  y = ~count,
  type = "bar",
  color = ~company_id,
  text = ~count,          
  textposition = "outside"    
) %>%
  layout(
    title = "Top Performers per Company",
    xaxis = list(title = "Company"),
    yaxis = list(title = "Count",
                 range = c(0, max(top_perf_summary$count) * 1.2)),
    showlegend = FALSE
  )

bar_plot
# AVG Salary per Department
dept_summary <- data %>%
  group_by(department) %>%
  summarise(avg_salary = mean(salary)) %>%
  arrange(avg_salary)

dept_summary$department <- factor(
  dept_summary$department,
  levels = dept_summary$department
)

bar_plot2 <- plot_ly(
  dept_summary,
  x = ~department,
  y = ~avg_salary,
  type = "bar",
  color = ~department,
  text = ~round(avg_salary, 0),
  textposition = "outside"
) %>%
  layout(
    title = "Average Salary per Department",
    xaxis = list(title = "Department"),
    yaxis = list(title = "Average Salary",
                 range = c(0, max(dept_summary$avg_salary) * 1.2)),
    showlegend = FALSE
  )

bar_plot2
# Salary Distribution
p3 <- ggplot(data, aes(x=salary)) +
  geom_histogram(fill=PASTEL[1], bins=30, alpha=0.7) +
  labs(
    title="Salary Distribution",
    x="Salary",
    y="Count"
  ) +
  theme_minimal()

ggplotly(p3)
# Scatter + Regression
p4 <- ggplot(data,
             aes(x=salary, y=performance_score)) +
  geom_point(color=PASTEL[3], size=2) +
  geom_smooth(method="lm", se=FALSE, color=PASTEL[4]) +
  labs(
    title="Salary vs Performance Score",
    x="Salary",
    y="Performance Score"
  ) +
  theme_minimal()

ggplotly(p4)

Interpretation:

  • Top Performers per Company: The chart shows the number of top performers (KPI score above 90) in each company. C02 has the most top performers, followed by C05 and C03, while C01 has the fewest. This indicates that more of the best-performing employees come from C02 than from any other company.

  • Average Salary per Department: The chart shows the average salary in each department. The Finance department has the highest average salary, while Operations has the lowest average salary. Other departments such as Engineering, HR, Marketing, and Sales fall in between, with differences that are not too significant.

  • Salary Distribution: The histogram shows the distribution of employee salaries. Salaries range from around 5 million IDR to 25 million IDR, with most employees between 10 million IDR and 20 million IDR. The distribution is therefore concentrated in the middle of the range and not heavily skewed to one side.

  • Salary vs Performance Score: The scatter plot shows the relationship between salary and performance score. The points are fairly randomly distributed and the trend line appears almost flat, indicating that there is no strong relationship between salary and performance score. This means that employees with higher salaries do not necessarily have higher performance scores, and vice versa.
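
The KPI tier assignment can also be written without a row-by-row loop. The following is a minimal sketch on made-up scores (the vector `kpi_demo` is hypothetical), using base R `cut()` with the same thresholds as the loop in this section.

```r
# Made-up KPI scores, including the tier boundaries (illustrative only)
kpi_demo <- c(59.9, 60, 74.9, 75, 89.9, 90, 100)

# right = FALSE makes each interval [lower, upper), matching the
# >= comparisons in the loop; Inf as the top break keeps a score of 100
tier_demo <- cut(
  kpi_demo,
  breaks = c(-Inf, 60, 75, 90, Inf),
  labels = c("Low", "Average", "Good", "Excellent"),
  right  = FALSE
)

data.frame(kpi_demo, tier_demo)
```

Note that the tier loop treats a score of exactly 90 as "Excellent" (`>= 90`), whereas the top-performer counts in this section use `KPI_score > 90`, so an employee scoring exactly 90 is "Excellent" but is not counted as a top performer.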

Conclusion

In conclusion, this practicum provides a comprehensive understanding of fundamental data processing and analysis using R. Through a series of tasks, various programming concepts such as functions, loops, and data manipulation techniques have been effectively applied to handle and analyze data. Furthermore, the practicum demonstrates how raw data can be transformed into meaningful information through preprocessing, feature engineering, and visualization. The use of simulated datasets, including employee and sales data, also highlights the practical application of these concepts in real-world scenarios. Overall, this practicum not only strengthens technical programming skills but also enhances analytical thinking in interpreting data and generating relevant insights.