Naisya

Naisya Hafizh Mufidah

Student ID (NIM) = 52250040

Lecturer = Mr. Bakti Siregar, M.Sc., CDS.

Institut Teknologi Sains Bandung 🔬 Data Science 📈 Basic Statistics


1. Dynamic Multi-Formula Function

Objective

This task develops a dynamic function to compute multiple mathematical formulas using nested loops and conditional statements. The function evaluates linear, quadratic, cubic, and exponential formulas for a range of x values. The results are generated iteratively, validated for correct input, and visualized to compare the behavior of each formula.

# fix the random seed so results are reproducible
set.seed(123)

# function
compute_formula <- function(x, formula) {
  
  # choose the formula type based on the input
  if (formula == "linear") {
    result <- x   # linear formula
    
  } else if (formula == "quadratic") {
    result <- x^2   # quadratic formula
    
  } else if (formula == "cubic") {
    result <- x^3   # cubic formula
    
  } else if (formula == "exponential") {
    result <- 2^x   # exponential formula
    
  } else {
    stop("Invalid formula input")  # validation for unknown input
  }
  
  return(result)  # return the computed value
}
# nested loop

library(knitr)
library(kableExtra)

# list of formulas to evaluate
formulas <- c("linear", "quadratic", "cubic", "exponential")

# x values from 1 to 20
x_values <- 1:20

# empty data frame to store the results
results <- data.frame(
  x = numeric(),
  y = numeric(),
  formula = character()
)

# outer loop over formulas, inner loop over x values
for (f in formulas) {
  for (x in x_values) {
    
    y <- compute_formula(x, f)
    
    results <- rbind(results, data.frame(
      x = x,
      y = y,
      formula = f
    ))
  }
}

results %>%
  head() %>%
  knitr::kable(caption = "Preview of Results") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Preview of Results
x y formula
1 1 linear
2 2 linear
3 3 linear
4 4 linear
5 5 linear
6 6 linear
# Plot
library(ggplot2)
library(plotly)

# build the plot
# (linewidth replaces size for lines in ggplot2 >= 3.4.0)
p <- ggplot(results, aes(x = x, y = y, color = formula)) +
  geom_line(linewidth = 1) +
  geom_point(size = 1.5) +
  labs(
    title = "Plot of Multiple Formulas",
    x = "x",
    y = "y"
  )

# make it interactive
ggplotly(p)

Interpretation

The plot compares four mathematical functions across values of x from 1 to 20.

The linear function increases at a constant rate. The quadratic and cubic functions grow faster, with the cubic rising more steeply than the quadratic.

The exponential function 2^x grows fastest at higher x values: it stays below the cubic for x from 2 to 9, overtakes it at x = 10 (1,024 versus 1,000), and reaches 1,048,576 at x = 20, compared with 8,000 for the cubic. On a linear y-axis this makes the other three curves look almost flat.
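
The nested loop above grows `results` one row at a time. As an alternative sketch (not the author's implementation), the same table can be built in one step with `expand.grid()` and a vectorized lookup. The helper name `compute_formula_vec` and the objects `grid`, `x_seq`, and `exp_above_cubic` are hypothetical and introduced only for illustration. The last lines list the x values at which the exponential curve lies above the cubic one.

```r
# Hypothetical vectorized variant of compute_formula(): x and formula are
# vectors of equal length, and each element is evaluated with its own formula.
compute_formula_vec <- function(x, formula) {
  valid <- c("linear", "quadratic", "cubic", "exponential")
  if (!all(formula %in% valid)) stop("Invalid formula input")
  ifelse(formula == "linear", x,
  ifelse(formula == "quadratic", x^2,
  ifelse(formula == "cubic", x^3, 2^x)))
}

# All combinations of x and formula in one data frame (no rbind in a loop)
grid <- expand.grid(
  x = 1:20,
  formula = c("linear", "quadratic", "cubic", "exponential"),
  stringsAsFactors = FALSE
)
grid$y <- compute_formula_vec(grid$x, grid$formula)

# x values at which 2^x exceeds x^3
x_seq <- 1:20
exp_above_cubic <- x_seq[2^x_seq > x_seq^3]
exp_above_cubic
```

Apart from x = 1, the exponential only exceeds the cubic from x = 10 onward, which is why the two curves cross in the plot.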

2. Nested Simulation: Multi-Sales & Discounts

Objective

This task simulates sales data using nested loops and functions. Each salesperson generates daily sales values, with discounts applied based on sales amount. A nested function is used to compute cumulative sales, and results are summarized and visualized.

# sales simulation function
simulate_sales <- function(n_salesperson, days) {
  
  # nested function for cumulative sales
  cumulative_sales <- function(sales_vector) {
    return(cumsum(sales_vector))
  }
  
  # empty data frame to store the results
  results <- data.frame(
    sales_id = integer(),
    day = integer(),
    sales_amount = numeric(),
    discount_rate = numeric(),
    salesperson = integer(),
    cumulative = numeric()
  )
  
  sales_id_counter <- 1
  
  # nested loop: salespeople (outer) and days (inner)
  for (sp in 1:n_salesperson) {
    sales_vec <- c()
    
    for (d in 1:days) {
      
      # generate a random sales amount
      sales_amount <- round(runif(1, 100, 1000), 2)
      
      # conditional discount
      if (sales_amount > 800) {
        discount <- 0.5
      } else if (sales_amount > 500) {
        discount <- 0.25
      } else {
        discount <- 0.1
      }
      
      sales_vec <- c(sales_vec, sales_amount)
      
      # append the row to the results
      results <- rbind(results, data.frame(
        sales_id = sales_id_counter,
        day = d,
        sales_amount = sales_amount,
        discount_rate = discount,
        salesperson = sp,
        cumulative = NA   # placeholder, filled in below
      ))
      
      sales_id_counter <- sales_id_counter + 1
    }
    
    # fill in cumulative sales for this salesperson
    results$cumulative[results$salesperson == sp] <- cumulative_sales(sales_vec)
  }
  
  return(results)
}
# run the simulation

results_sales <- simulate_sales(4, 8)

results_sales %>%
  head() %>%
  knitr::kable(caption = "Preview Sales Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Preview Sales Data
sales_id day sales_amount discount_rate salesperson cumulative
1 1 358.82 0.1 1 358.82
2 2 809.47 0.5 1 1168.29
3 3 468.08 0.1 1 1636.37
4 4 894.72 0.5 1 2531.09
5 5 946.42 0.5 1 3477.51
6 6 141.00 0.1 1 3618.51
# summary

library(dplyr)
library(knitr)
library(kableExtra)

summary_table <- results_sales %>%
  group_by(salesperson) %>%
  summarise(
    avg_sales = mean(sales_amount),
    total_sales = sum(sales_amount),
    min_sales = min(sales_amount),
    max_sales = max(sales_amount)
  )

summary_table %>%
  knitr::kable(caption = "Summary Statistics per Salesperson") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Summary Statistics per Salesperson
salesperson avg_sales total_sales min_sales max_sales
1 637.1225 5096.98 141.00 946.42
2 625.5050 5004.04 192.63 961.15
3 638.6150 5108.92 137.85 994.84
4 640.4550 5123.64 232.40 966.72

Interpretation

The summary statistics table presents overall sales for each salesperson. Average daily sales are close (about 626 to 640) and totals differ by less than 120, so the four salespeople perform similarly. Because every value is drawn from the same uniform distribution between 100 and 1000, these gaps reflect random variation rather than real differences in ability. The minimum and maximum values show that each salesperson's daily sales span a wide range, from below 250 to above 940.

# plot

library(ggplot2)
library(plotly)

p <- ggplot(results_sales, aes(x = day, y = cumulative, color = as.factor(salesperson))) +
  geom_line(linewidth = 1.2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point(size = 2) +
  labs(
    title = "Cumulative Sales per Salesperson",
    x = "Day",
    y = "Cumulative Sales",
    color = "Salesperson"
  )

ggplotly(p)

Interpretation

The graph tracks cumulative sales over 8 days for four salespeople. All lines rise at every step because each daily sale is positive.

  • Salesperson 4 leads after the first two days (1,427.82), followed by Salespersons 1 and 2.

  • Salesperson 3 starts slowest (459.33 after Day 2) but closes the gap with strong sales from Day 4 onward.

  • By Day 8 the totals converge to between 5,004 and 5,124, so overall performance is fairly balanced.
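
The discount rule and the per-salesperson running total can also be expressed without explicit loops. The sketch below is one possible vectorized formulation, applied to a small hand-made data frame. The `toy` values are copied from the tables in this report, and the helper name `discount_for` is hypothetical.

```r
# Hypothetical helper: same thresholds as the if/else chain in simulate_sales()
discount_for <- function(sales_amount) {
  ifelse(sales_amount > 800, 0.5,
         ifelse(sales_amount > 500, 0.25, 0.1))
}

toy <- data.frame(
  salesperson  = c(1, 1, 1, 2, 2),
  day          = c(1, 2, 3, 1, 2),
  sales_amount = c(358.82, 809.47, 468.08, 596.29, 510.95)
)

toy$discount_rate <- discount_for(toy$sales_amount)

# ave() applies cumsum() within each salesperson and keeps the row order
toy$cumulative <- ave(toy$sales_amount, toy$salesperson, FUN = cumsum)
toy
```

Using `ave()` restarts the running total for each salesperson, which is the behaviour the nested `cumulative_sales()` function provides inside the loop.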

3. Multi-Level Performance Categorization

Objective

This task categorizes sales performance into multiple levels using a function and loops. Each sales amount is assigned to a category based on predefined thresholds, and the distribution of categories is analyzed using percentages and visualized with bar and pie charts.

library(knitr)
library(dplyr)
library(kableExtra)

# function categorization
categorize_performance <- function(sales_amount) {
  
  categories <- character(length(sales_amount))
  
  # loop through the vector
  for (i in seq_along(sales_amount)) {
    
    value <- sales_amount[i]
    
    # multi-level categorization
    if (value >= 800) {
      categories[i] <- "Excellent"
    } else if (value >= 650) {
      categories[i] <- "Very Good"
    } else if (value >= 500) {
      categories[i] <- "Good"
    } else if (value >= 350) {
      categories[i] <- "Average"
    } else {
      categories[i] <- "Poor"
    }
  }
  
  return(categories)
}
# apply function

results_sales$performance <- categorize_performance(results_sales$sales_amount)
results_sales %>%
  kable(caption = "Data After Categorization") %>%
  kable_styling()
Data After Categorization
sales_id day sales_amount discount_rate salesperson cumulative performance
1 1 358.82 0.10 1 358.82 Average
2 2 809.47 0.50 1 1168.29 Excellent
3 3 468.08 0.10 1 1636.37 Average
4 4 894.72 0.50 1 2531.09 Excellent
5 5 946.42 0.50 1 3477.51 Excellent
6 6 141.00 0.10 1 3618.51 Poor
7 7 575.29 0.25 1 4193.80 Good
8 8 903.18 0.50 1 5096.98 Excellent
9 1 596.29 0.25 2 596.29 Good
10 2 510.95 0.25 2 1107.24 Good
11 3 961.15 0.50 2 2068.39 Excellent
12 4 508.00 0.25 2 2576.39 Good
13 5 709.81 0.25 2 3286.20 Very Good
14 6 615.37 0.25 2 3901.57 Good
15 7 192.63 0.10 2 4094.20 Poor
16 8 909.84 0.50 2 5004.04 Excellent
17 1 321.48 0.10 3 321.48 Poor
18 2 137.85 0.10 3 459.33 Poor
19 3 395.13 0.10 3 854.46 Average
20 4 959.05 0.50 3 1813.51 Excellent
21 5 900.59 0.50 3 2714.10 Excellent
22 6 723.52 0.25 3 3437.62 Very Good
23 7 676.46 0.25 3 4114.08 Very Good
24 8 994.84 0.50 3 5108.92 Excellent
25 1 690.14 0.25 4 690.14 Very Good
26 2 737.68 0.25 4 1427.82 Very Good
27 3 589.66 0.25 4 2017.48 Good
28 4 634.73 0.25 4 2652.21 Good
29 5 360.24 0.10 4 3012.45 Average
30 6 232.40 0.10 4 3244.85 Poor
31 7 966.72 0.50 4 4211.57 Excellent
32 8 912.07 0.50 4 5123.64 Excellent

Interpretation

After categorization, the data are easier to interpret: raw sales values are grouped into five performance levels, which makes it easier to analyze and compare sales outcomes across observations.
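
The same thresholds can be written with base R's `cut()` instead of a loop. This is a sketch of an alternative, not the function used above; the name `categorize_with_cut` and the vector `example_sales` are hypothetical.

```r
# Sketch: the same thresholds expressed with cut(); right = FALSE makes each
# interval closed on the left, so 800 is "Excellent" and 650 is "Very Good".
categorize_with_cut <- function(sales_amount) {
  cut(sales_amount,
      breaks = c(-Inf, 350, 500, 650, 800, Inf),
      labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
      right  = FALSE)
}

example_sales <- c(141.00, 358.82, 575.29, 650, 709.81, 800, 809.47)
example_categories <- as.character(categorize_with_cut(example_sales))
example_categories
```

Because `cut()` returns a factor whose levels run from "Poor" to "Excellent", a later `table()` call would list the categories in performance order rather than alphabetically.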

library(knitr)
library(kableExtra)

# compute frequency and percentage
freq_table <- table(results_sales$performance)

percent_table <- round(100 * freq_table / sum(freq_table), 2)

summary_performance <- data.frame(
  Category = names(freq_table),
  Count = as.vector(freq_table),
  Percentage = as.vector(percent_table)
)

summary_performance %>%
  knitr::kable(caption = "Performance Distribution (%)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Performance Distribution (%)
Category Count Percentage
Average 4 12.50
Excellent 11 34.38
Good 7 21.88
Poor 5 15.62
Very Good 5 15.62

Interpretation

This performance distribution table shows that:

  • The Excellent category has the highest count and percentage (11 sales, 34.38%), so it is the largest single group, although it covers only about one third of all sales.

  • The Good category comes second (21.88%), indicating that a substantial share of sales is at a solid but not top level.

  • The Poor and Very Good categories each account for 15.62%, so low and moderately high performance are equally frequent.

  • The Average category has the lowest percentage (12.50%), so only a small portion of sales falls in the 350 to 500 band.

Overall, sales are most concentrated in the Excellent category, while the other categories are smaller and fairly even.
# bar plot
summary_performance <- summary_performance %>%
  arrange(desc(Count))

plot_ly(
  summary_performance,
  x = ~reorder(Category, -Count),
  y = ~Count,
  type = "bar",
  text = ~Count,
  textposition = "auto",
  marker = list(
    color = c("#FF9AA2", "#A0E7E5", "#B4F8C8", "#FFB7B2", "#B5EAD7")
  )
) %>%
layout(
  title = "Bar Chart Performance Distribution",
  xaxis = list(title = "Category"),
  yaxis = list(title = "Count")
)

Interpretation

This bar chart shows the distribution of performance: the Excellent category has the highest count (11), while Average has the lowest (4). Good, Very Good, and Poor fall in between with counts of 7, 5, and 5. The chart highlights that Excellent is the largest single category.

# pie chart
plot_ly(
  summary_performance,
  labels = ~Category,
  values = ~Percentage,
  type = "pie",
  textinfo = "label+percent",
  showlegend = FALSE
) %>%
layout(
  title = "Pie Chart Performance Distribution"
)

Interpretation

This pie chart shows the proportion of each performance category. The Excellent category has the largest share (34.38%), while Average has the smallest (12.50%). The Good, Poor, and Very Good categories fall in between with similar shares. About one third of sales are at the highest level, and the remaining two thirds are spread fairly evenly across the other four categories.

4. Multi-Company Dataset Simulation

Objective

This task simulates a multi-company dataset using a function with nested loops. Each company consists of multiple employees with randomly generated salary, department, performance score, and KPI score. Conditional logic is applied to identify top performers based on KPI scores above 90.

The dataset is then summarized by company, including average salary, average performance, and maximum KPI. Finally, the results are presented in a summary table and visualized using bar charts to compare performance across companies.

# function to simulate a multi-company dataset
generate_company_data <- function(n_company, n_employees) {
  
  # vectors to store the data
  company_id <- c()
  employee_id <- c()
  salary <- c()
  department <- c()
  performance_score <- c()
  KPI_score <- c()
  
  departments <- c("HR", "Finance", "IT", "Marketing")
  
  # nested loop 
  for (i in 1:n_company) {
    
    for (j in 1:n_employees) {
      
      company_id <- c(company_id, paste0("Company_", i))
      employee_id <- c(employee_id, paste0("Emp_", i, "_", j))
      
      salary <- c(salary, round(runif(1, 4000, 10000)))
      department <- c(department, sample(departments, 1))
      
      performance_score <- c(performance_score, round(runif(1, 60, 100), 1))
      KPI_score <- c(KPI_score, round(runif(1, 50, 100), 1))
    }
  }
  
  data.frame(
    Company = company_id,
    Employee = employee_id,
    Salary = salary,
    Department = department,
    Performance = performance_score,
    KPI = KPI_score
  )
}
library(dplyr)
library(knitr)
library(kableExtra)

# generate data
data_company <- generate_company_data(n_company = 3, n_employees = 50)

# summary per company 
summary_company <- data_company %>%
  group_by(Company) %>%
  summarise(
    Avg_Salary = round(mean(Salary), 2),
    Avg_Performance = round(mean(Performance), 2),
    Max_KPI = max(KPI)
  )

# display the table
summary_company %>%
  kable(
    caption = "Summary Per Company",
    align = "c"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    position = "center"
  ) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE)
Summary Per Company
Company Avg_Salary Avg_Performance Max_KPI
Company_1 6932.46 80.40 97.7
Company_2 7056.90 78.07 99.6
Company_3 7213.20 79.16 99.6

Interpretation

This table shows a comparison of three companies based on average salary, average performance, and maximum KPI.

Company_3 has the highest average salary (7,213.20), while Company_1 has the highest average performance (80.40), so in this sample higher salary does not go together with higher performance.

Because salary and performance are generated independently from the same uniform distributions for every company, these differences reflect random sampling variation rather than factors such as compensation strategy or management quality.

In terms of KPI, Company_2 and Company_3 share the same maximum score (99.6) and Company_1 is close behind (97.7), showing that top-scoring individuals appear in every company regardless of differences in average performance.
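
To judge how large a salary gap can arise by chance, the standard error of a company mean can be worked out from the simulation settings. The sketch below assumes only what `generate_company_data()` specifies (salaries uniform between 4000 and 10000, 50 employees per company); the variable names are hypothetical, and the rounding of salaries to whole numbers is ignored.

```r
# Assumptions taken from generate_company_data(): Salary ~ Uniform(4000, 10000),
# 50 employees per company, companies generated independently.
n_employees <- 50
salary_min  <- 4000
salary_max  <- 10000

# Standard deviation of a uniform distribution: (max - min) / sqrt(12)
sd_salary <- (salary_max - salary_min) / sqrt(12)

# Standard error of one company's mean salary, and of a difference of two means
se_mean <- sd_salary / sqrt(n_employees)
se_diff <- sqrt(2) * se_mean

# Largest observed gap in the summary table (Company_3 vs Company_1)
observed_gap <- 7213.20 - 6932.46

round(c(se_mean = se_mean, se_diff = se_diff, observed_gap = observed_gap), 2)
```

The observed gap of about 281 is smaller than the standard error of a difference between two company means (about 346), so the salary ranking could easily change with a different seed.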

library(dplyr)

# flag top performers
data_company$Top_Performer <- ifelse(data_company$KPI > 90, "Yes", "No")

top_summary <- data_company %>%
  group_by(Company, Top_Performer) %>%
  summarise(Count = n(), .groups = "drop")

top_summary %>%
  knitr::kable(caption = "Top Performer Count per Company") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Top Performer Count per Company
Company Top_Performer Count
Company_1 No 45
Company_1 Yes 5
Company_2 No 35
Company_2 Yes 15
Company_3 No 39
Company_3 Yes 11

Interpretation

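Company_2 has the highest number of top performers (15 of 50), compared with 11 for Company_3 and 5 for Company_1. Since KPI scores are drawn from the same uniform distribution (50 to 100) for all companies, about 20% of employees (10 per company) are expected to exceed 90, so differences of this size can arise from random variation alone and do not by themselves indicate better evaluation practices or KPI alignment.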

library(dplyr)
library(ggplot2)
library(plotly)

# make sure the summary exists and is sorted by average salary
summary_company <- summary_company %>%
  arrange(desc(Avg_Salary))

# bar chart
p_salary <- ggplot(summary_company, aes(
  x = reorder(Company, Avg_Salary),
  y = Avg_Salary,
  fill = Company
)) +
  geom_bar(stat = "identity", show.legend = FALSE, width = 0.6) +
  coord_flip() +  
  labs(
    title = "Average Salary per Company",
    x = "Company",
    y = "Average Salary"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 14, face = "bold"),
    axis.title = element_text(size = 11)
  )

# make it interactive
ggplotly(p_salary)

Interpretation

The visualization shows that Company_3 has the highest average salary, followed by Company_2 and Company_1. This lead does not coincide with the highest average performance, which belongs to Company_1. In this simulation salary and performance are generated independently, so no relationship between them is expected.

library(dplyr)
library(ggplot2)
library(plotly)

# summary data 
summary_company <- data_company %>%
  group_by(Company) %>%
  summarise(
    Avg_Performance = round(mean(Performance), 2),
    Max_KPI = max(KPI),
    .groups = "drop"
  )

# bar chart of average performance
p_perf <- ggplot(summary_company, aes(
  x = Company,
  y = Avg_Performance,
  fill = Company
)) +
  geom_bar(stat = "identity", width = 0.6) +
  geom_text(aes(label = Avg_Performance),
            vjust = -0.5, size = 3.5) +
  labs(
    title = "Average Performance per Company",
    x = "Company",
    y = "Average Performance"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c(
    "Company_1" = "#FF69B4",
    "Company_2" = "#FFD700",
    "Company_3" = "#00CED1"
  )) +
  theme(legend.position = "none")

ggplotly(p_perf)

Interpretation

This graph shows the average employee performance at each company. Company_1 has the highest average score (80.40), ahead of Company_3 (79.16) and Company_2 (78.07). The gaps are small, about two points on a scale that runs from 60 to 100, so the bars are of similar height and the text labels are the easiest way to compare them. An average alone says nothing about how consistent performance is within a company.

library(dplyr)
library(ggplot2)
library(plotly)

# summary data 
summary_company <- data_company %>%
  group_by(Company) %>%
  summarise(
    Avg_Performance = round(mean(Performance), 2),
    Max_KPI = max(KPI),
    .groups = "drop"
  )

# Bar chart Max KPI
p_kpi <- ggplot(summary_company, aes(
  x = Company,
  y = Max_KPI,
  fill = Company
)) +
  geom_bar(stat = "identity", width = 0.6) +
  geom_text(aes(label = Max_KPI),
            vjust = -0.5, size = 3.5) +
  labs(
    title = "Maximum KPI per Company",
    x = "Company",
    y = "Max KPI"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c(
    "Company_1" = "#8A2BE2",  # purple
    "Company_2" = "#32CD32",  # lime green
    "Company_3" = "#FF4500"   # bright orange
  )) +
  theme(legend.position = "none")

ggplotly(p_kpi)

Interpretation

This graph displays the highest KPI score in each company. Company_2 and Company_3 both reach 99.6, slightly above Company_1 at 97.7, so every company has at least one employee near the top of the scale. This visualization shows that while average performance differs, peak KPI scores are similar across companies.

5. Monte Carlo Simulation: Pi & Probability

Objective

Monte Carlo simulation is a random sampling-based computational method for estimating values or probabilities. By generating a large number of random points, we can calculate the ratio of points falling within a circle to the total number of points to approximate the value of \(\pi\). This method is also used to calculate the probability of a point falling within a specific area, and the more random trials are performed, the closer the results become to the theoretical value.

library(dplyr)
library(knitr)
library(kableExtra)

# Monte Carlo Simulation function
monte_carlo_pi <- function(n_points) {
  
  # generate random points
  x <- runif(n_points, -1, 1)
  y <- runif(n_points, -1, 1)
  
  # check if inside circle
  inside <- (x^2 + y^2) <= 1
  
  # estimate pi
  pi_est <- 4 * sum(inside) / n_points
  
  # probability of falling in sub-square (0 ≤ x ≤ 0.25, 0 ≤ y ≤ 0.25)
  sub_square <- (x >= 0 & x <= 0.25 & y >= 0 & y <= 0.25)
  prob_sub <- sum(sub_square) / n_points
  
  # return list 
  return(list(
    pi_estimate = pi_est,
    prob_sub_square = prob_sub,
    data = data.frame(x = x, y = y, inside = inside)
  ))
}

# run simulation
set.seed(123)
sim_results <- monte_carlo_pi(10000)

# build the summary table
summary_table <- data.frame(
  Estimasi_Pi = sim_results$pi_estimate,
  Prob_SubSquare = sim_results$prob_sub_square,
  Jumlah_Titik = nrow(sim_results$data)
)

summary_table %>%
  knitr::kable(caption = "Monte Carlo Simulation Results") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Monte Carlo Simulation Results
Estimasi_Pi Prob_SubSquare Jumlah_Titik
3.1576 0.0168 10000

Interpretation

The table shows the main results of the Monte Carlo simulation with 10,000 random points. The estimated value of π is 3.1576, about 0.5% above the true value of 3.14159.

This demonstrates that Monte Carlo simulation is effective for approximating mathematical constants and probabilities, especially when analytical solutions are difficult to obtain.

The estimated probability of a point falling in the sub-square is 0.0168, close to the theoretical value of 0.0156, which is the ratio of the sub-square area (0.25 × 0.25 = 0.0625) to the total square area (4).

Both the \(\pi\) estimation and probability calculation rely on the same principle of random sampling, showing how Monte Carlo methods can be applied to different types of problems.
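
How close the estimate should be to π can be quantified. The sketch below computes the theoretical probabilities and the standard error of the π estimate under the setup used above (points uniform on the square from -1 to 1), then re-estimates π for several sample sizes. The helper names `se_pi` and `estimate_pi` and the chosen sample sizes are hypothetical.

```r
# Theoretical probabilities for points uniform on [-1, 1] x [-1, 1]
p_circle <- pi / 4             # area of unit circle / area of square
p_sub    <- (0.25 * 0.25) / 4  # area of sub-square / area of square

# Standard error of the pi estimate 4 * mean(inside) for n points
se_pi <- function(n) 4 * sqrt(p_circle * (1 - p_circle) / n)

n_values <- c(100, 10000, 1000000)
data.frame(n_points = n_values, se_pi = se_pi(n_values))

# Re-estimate pi for increasing sample sizes
estimate_pi <- function(n) {
  x <- runif(n, -1, 1)
  y <- runif(n, -1, 1)
  4 * mean(x^2 + y^2 <= 1)
}

set.seed(123)
estimates <- sapply(n_values, estimate_pi)
estimates
```

For 10,000 points the standard error is about 0.016, so the error of roughly 0.016 in the reported estimate of 3.1576 is about one standard error, which is a typical outcome. The standard error shrinks in proportion to one over the square root of the number of points.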

library(ggplot2)
library(plotly)

# plot visualize
p <- ggplot(sim_results$data, aes(x = x, y = y, color = inside)) +
  geom_point(alpha = 0.6, size = 1.2) +
  labs(
    title = "Monte Carlo Simulation: Points Inside vs Outside Circle",
    x = "X",
    y = "Y",
    color = "Inside Circle"
  ) +
  theme_minimal()
  
# showlegend is a layout option, so it is set through layout()
ggplotly(p) %>% layout(showlegend = FALSE)

Interpretation

This interactive Monte Carlo graph displays random points in the \([-1,1] \times [-1,1]\) square. The color of the points indicates whether they are inside the circle (inside = TRUE) or outside the circle (inside = FALSE). Because the points are spread evenly over the square, the share of points inside the circle approximates the area ratio \(\pi/4\), and four times that share approximates \(\pi\). This visualization reinforces the intuition that the more random points are used, the more accurate and stable the estimate of \(\pi\) will be.

6. Advanced Data Transformation & Feature Engineering

Objective

This task performs advanced data transformations: min-max normalization of columns, z-score standardization, and the creation of new features for analysis. The goal is to see how the scale of the data changes after each transformation and how additional features can aid classification or segmentation.

library(dplyr)
library(knitr)
library(kableExtra)

# example dummy data
df <- data.frame(
  performance = c(60, 75, 90, 55, 80, 95),
  salary = c(3.5, 4.2, 5.0, 3.0, 4.8, 6.0)
)

# min-max normalization function
normalize_columns <- function(df) {
  as.data.frame(lapply(df, function(x) (x - min(x)) / (max(x) - min(x))))
}

# z-score function
z_score <- function(df) {
  as.data.frame(lapply(df, function(x) (x - mean(x)) / sd(x)))
}

# apply the transformations
df_norm <- normalize_columns(df)
df_z <- z_score(df)

# create new features
df$performance_category <- ifelse(df$performance >= 80, "High",
                           ifelse(df$performance >= 60, "Medium", "Low"))

df$salary_bracket <- cut(df$salary,
                         breaks = c(0, 4, 5, 10),
                         labels = c("Low", "Medium", "High"))


# combine all results into one data frame
output_table <- data.frame(
  Performance_Original = df$performance,
  Salary_Original = df$salary,
  Performance_Normalized = df_norm$performance,
  Salary_Normalized = df_norm$salary,
  Performance_Zscore = df_z$performance,
  Salary_Zscore = df_z$salary,
  Performance_Category = df$performance_category,
  Salary_Bracket = df$salary_bracket
)

# display a tidy table
output_table %>%
  knitr::kable(caption = "Data Transformation & Feature Engineering Results") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "100%", height = "400px")
Data Transformation & Feature Engineering Results
Performance_Original Salary_Original Performance_Normalized Salary_Normalized Performance_Zscore Salary_Zscore Performance_Category Salary_Bracket
60 3.5 0.125 0.1666667 -0.9931459 -0.8446956 Medium Low
75 4.2 0.500 0.4000000 -0.0522708 -0.1996553 Medium Medium
90 5.0 0.875 0.6666667 0.8886042 0.5375336 High Medium
55 3.0 0.000 0.0000000 -1.3067709 -1.3054387 Low Low
80 4.8 0.625 0.6000000 0.2613542 0.3532364 High Medium
95 6.0 1.000 1.0000000 1.2022293 1.4590197 High High

Interpretation

The table displays the original data (performance, salary), the normalized results (values are converted to the range [0,1]), the z-score results (mean 0, standard deviation 1), and new features in the form of performance categories (performance_category) and salary brackets (salary_bracket). The table shows:

  • Original Performance & Salary are still on their raw scales, so performance values (55 to 95) are much larger than salary values (3.0 to 6.0).

  • Normalization puts all variables on the same scale (0–1), facilitating comparisons between variables.

  • Z-scores standardize each variable so that its mean is 0 and variation is measured in standard deviation units.

  • New categories (High, Medium, Low) provide a more understandable qualitative context; for example, a salary of 3.5 is in the “Low” category, while 6.0 is in the “High” category.

This table visualization shows how preprocessing rescales the data without changing the relative patterns between values, while also adding a categorical dimension for segmentation analysis. This way, the data is better prepared for statistical analysis and machine learning.
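
The values in the table can be checked by hand. The sketch below recomputes the performance column with the two formulas, compares the z-scores with base R's `scale()`, and shows how `cut()` assigns the salary brackets. It uses the same six data points as above; the variable names are hypothetical.

```r
performance <- c(60, 75, 90, 55, 80, 95)
salary      <- c(3.5, 4.2, 5.0, 3.0, 4.8, 6.0)

# Min-max normalization: (x - min) / (max - min)
perf_norm <- (performance - min(performance)) /
  (max(performance) - min(performance))

# Z-score: (x - mean) / sd; scale() gives the same values as a one-column matrix
perf_z       <- (performance - mean(performance)) / sd(performance)
perf_z_scale <- as.numeric(scale(performance))

# cut() uses right-closed intervals by default, so 5.0 falls in (4, 5] = "Medium"
salary_bracket <- cut(salary, breaks = c(0, 4, 5, 10),
                      labels = c("Low", "Medium", "High"))

data.frame(performance, perf_norm, perf_z = round(perf_z, 4),
           salary, salary_bracket)
```

For the first row, normalization gives (60 - 55) / (95 - 55) = 0.125, which matches the table. Note that a salary of exactly 5.0 is classified as "Medium" because the upper bound of each interval is inclusive.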

library(ggplot2)
library(plotly)

# histogram performance original
p1 <- ggplot(df, aes(x = performance)) +
  geom_histogram(fill = "skyblue", color = "black", bins = 5) +
  labs(title = "Performance Distribution (Original)") +
  theme_minimal()

ggplotly(p1)

Interpretation

  • This graph shows the distribution of actual performance scores.
  • Scores are spread between 55–95, with a concentration in the Medium–High categories.
# histogram performance normalized
p2 <- ggplot(df_norm, aes(x = performance)) +
  geom_histogram(fill = "orange", color = "black", bins = 5) +
  labs(title = "Performance Distribution (Normalized)") +
  theme_minimal()

ggplotly(p2)

Interpretation

After normalization, the shape of the distribution remains the same, but the scale changes to [0,1]. This facilitates comparison between variables.

# boxplot salary original
p3 <- ggplot(df, aes(y = salary)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Salary (Original)") +
  theme_minimal()

ggplotly(p3)

Interpretation

The original salary boxplot shows a salary range of 3.0–6.0, with a median of 4.5. The values are still on the raw scale.

# boxplot salary z-score
p4 <- ggplot(df_z, aes(y = salary)) +
  geom_boxplot(fill = "pink") +
  labs(title = "Salary (Z-score)") +
  theme_minimal()

ggplotly(p4)

Interpretation

After z-score standardization, the mean shifts to 0 (the median is about 0.08) and the scale is measured in standard deviations. The shape of the distribution remains the same, but the values are now directly comparable with other standardized variables.

Summary

# summary before transformation
summary_original <- data.frame(
  Variable = names(df[,1:2]),
  Mean = sapply(df[,1:2], mean),
  SD   = sapply(df[,1:2], sd),
  Min  = sapply(df[,1:2], min),
  Max  = sapply(df[,1:2], max)
)

# summary after normalization
summary_norm <- data.frame(
  Variable = names(df_norm),
  Mean = sapply(df_norm, mean),
  SD   = sapply(df_norm, sd),
  Min  = sapply(df_norm, min),
  Max  = sapply(df_norm, max)
)

# summary after z-score standardization
summary_z <- data.frame(
  Variable = names(df_z),
  Mean = sapply(df_z, mean),
  SD   = sapply(df_z, sd),
  Min  = sapply(df_z, min),
  Max  = sapply(df_z, max)
)

# combine into a single table
summary_all <- rbind(
  cbind(Method = "Original", summary_original),
  cbind(Method = "Normalized", summary_norm),
  cbind(Method = "Z-score", summary_z)
)

summary_all %>%
  knitr::kable(caption = "Summary Statistics Before & After Transformation",
               row.names = FALSE) %>%  # drop the row names created by rbind()
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Summary Statistics Before & After Transformation
Method Variable Mean SD Min Max
Original performance 75.8333333 15.9426054 55.000000 95.000000
Original salary 4.4166667 1.0852035 3.000000 6.000000
Normalized performance 0.5208333 0.3985651 0.000000 1.000000
Normalized salary 0.4722222 0.3617345 0.000000 1.000000
Z-score performance 0.0000000 1.0000000 -1.306771 1.202229
Z-score salary 0.0000000 1.0000000 -1.305439 1.459020

Interpretation

This table shows how basic statistics change after transformation. The original variables are on different scales: performance ranges from 55 to 95, while salary ranges from 3 to 6. Normalization maps all values into the range \([0,1]\), so the minimum is \(0\) and the maximum is \(1\), making variables with different scales directly comparable. Z-scores standardize each variable to a mean of \(0\) and a standard deviation of \(1\) without changing the shape of its distribution. This comparison confirms that both transformations put the variables on a common scale, which facilitates multivariate analysis.

7. Mini Project: Company KPI Dashboard & Simulation

Objective

This mini project builds a simulated dataset for several companies, then creates KPI summaries, departmental analyses, salary distributions, and interactive visualizations. The goal is to practice data wrangling, feature engineering, and dashboarding skills.

library(dplyr)
library(ggplot2)
library(plotly)
library(DT)
library(htmltools)

set.seed(123)

# generate dataset: 5 companies with 100 employees each
generate_kpi_data <- function(n_company, n_employees) {
  
  departments <- c("HR", "Finance", "IT", "Sales", "Marketing")
  
  data <- data.frame()
  
  for (i in 1:n_company) {
    for (j in 1:n_employees) {
      
      new_row <- data.frame(
        employee_id = paste0("E", i, "_", j),
        company_id = paste0("C", i),
        salary = round(runif(1, 3000, 10000), 2),
        performance_score = sample(50:100, 1),
        KPI_score = sample(60:100, 1),
        department = sample(departments, 1)
      )
      
      data <- rbind(data, new_row)
    }
  }
  
  return(data)
}

df_kpi <- generate_kpi_data(5, 100)

# categorize KPI scores into tiers
categorize_kpi <- function(kpi_scores) {
  
  tiers <- character(length(kpi_scores))
  
  for (i in seq_along(kpi_scores)) {
    
    if (kpi_scores[i] >= 90) {
      tiers[i] <- "Excellent"
    } else if (kpi_scores[i] >= 75) {
      tiers[i] <- "Good"
    } else {
      tiers[i] <- "Average"
    }
  }
  
  tiers <- factor(tiers, levels = c("Excellent", "Good", "Average"))
  
  return(tiers)
}

df_kpi$KPI_tier <- categorize_kpi(df_kpi$KPI_score)

# summary per company (rounded to 2 digits)

summary_df <- df_kpi %>%
  group_by(company_id) %>%
  summarise(
    avg_salary = round(mean(salary), 2),
    avg_KPI = round(mean(KPI_score), 2),
    top_performers = sum(KPI_score > 90, na.rm = TRUE)
  )


# employee data table
datatable(
  df_kpi %>%
    select(company_id, employee_id, salary, KPI_score, KPI_tier, performance_score, department),
  rownames  = FALSE,
  options = list(
    scrollX   = FALSE,
    autoWidth = FALSE,
    dom       = 'ftp',
    columnDefs = list(
      list(className = 'dt-center', targets = "_all"),
      list(width = '12%', targets = 0),
      list(width = '18%', targets = 1),
      list(width = '14%', targets = 2),
      list(width = '13%', targets = 3),
      list(width = '13%', targets = 4),
      list(width = '16%', targets = 5),
      list(width = '14%', targets = 6)
    )
  ),
  class   = "stripe hover compact",
  width   = "100%",
  caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: center; font-weight: bold;',
    'Employee Dataset with KPI Tiers'
  )
)
# summary table
datatable(
  summary_df,
  rownames  = FALSE,
  options = list(
    scrollX   = FALSE,
    autoWidth = FALSE,
    dom       = 't',
    columnDefs = list(
      list(className = 'dt-center', width = '25%', targets = "_all")
    )
  ),
  class   = "stripe hover compact",
  width   = "100%",
  caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: center; font-weight: bold;',
    'Company KPI Summary'
  )
)

Interpretation

The Company KPI Summary table compares five companies (C1–C5) across three metrics: average salary, average KPI score, and number of top performers.

  • Average Salary: All companies have similar salary levels, ranging between about 6,370 and 6,615. Company C4 shows the highest average salary (≈ 6,614), while C3 has the lowest (≈ 6,371).

  • Average KPI: The KPI averages are fairly close, between 78.9 and 82.0. Company C1 leads with the highest KPI average (≈ 82.04), while C4 has the lowest (≈ 78.86).

  • Top Performers: The number of employees with a KPI score above 90 varies. Company C1 has the most top performers (29), while C5 has the fewest (19).

Key insight:

  • Company C1 stands out as strong overall, with the highest KPI average and the largest pool of top performers.

  • Company C4 pays the highest average salary but has the lowest KPI average and fewer top performers, suggesting compensation alone doesn’t guarantee higher performance.

  • Company C5 has relatively modest salary and KPI averages, and the fewest top performers, indicating potential areas for improvement.
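
One detail worth keeping in mind when reading the top-performer counts: the summary uses KPI_score > 90, while the Excellent tier starts at KPI_score >= 90. Employees scoring exactly 90 are therefore Excellent but are not counted as top performers. A small sketch with made-up scores (the vector kpi is purely illustrative) shows the difference:

```r
# Hypothetical scores, chosen only to illustrate the boundary case at 90
kpi <- c(88, 90, 90, 91, 95, 72)

top_performers  <- sum(kpi > 90)    # strict threshold used in the summary table
excellent_count <- sum(kpi >= 90)   # threshold used for the "Excellent" tier

# the gap equals the number of scores that are exactly 90
gap <- excellent_count - top_performers
```

If the two numbers are meant to agree, either threshold could be changed. Which one is intended is not stated in the report.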

library(dplyr)

# data department summary
dept_summary <- df_kpi %>%
  group_by(company_id, department) %>%
  summarise(
    avg_KPI = mean(KPI_score),
    .groups = "drop"
  )

datatable(dept_summary)

Interpretation

  • Company-to-Company Performance: C1 leads on average KPI and on the number of top performers, while C4 pays the highest average salary but has the lowest average KPI. Higher pay is therefore not linked to better results in this data.

  • Employee Rating Distribution: Most employees fall in the Average and Good tiers, while Excellent is the smallest group. This is expected, because the Excellent tier covers only scores of 90–100 out of the simulated 60–100 range.

  • Department Comparison: Sales and IT frequently appear as departments with high KPIs, while HR tends to have lower ones. Because departments are assigned at random in the simulation, these gaps reflect sampling variation rather than a structural difference.

  • π Simulation: The estimated result is close to the true value, confirming that probabilistic methods can be effective with large sample sizes.

This data depicts organizations with fairly good average performance but relatively few truly excellent employees. Higher compensation does not go hand in hand with better KPIs here (C4 pays the most yet has the lowest KPI average), which is consistent with salary and KPI being simulated independently. In a real setting, focusing on talent development and quality improvement in weak departments would be the way to drive more Excellent results.

# scatter plot: performance score vs KPI score
p_scatter <- ggplot(df_kpi, aes(
  x = performance_score,
  y = KPI_score,
  color = company_id
)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Performance vs KPI with Regression Line",
    x = "Performance Score",
    y = "KPI Score"
  )

ggplotly(p_scatter)

Interpretation

The chart compares Performance Score vs KPI Score for five companies. Each company’s points are shown in different colors with its own regression line.

  • Most regression lines are slightly positive, meaning higher performance scores generally align with higher KPI scores.

  • Some lines are nearly flat or slightly negative, showing weaker or inconsistent relationships in certain companies.

  • Overall, the plot highlights that the strength of the link between performance and KPI varies across companies.
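
The direction and strength of each line can also be checked numerically instead of by eye. The sketch below is a base R illustration on freshly simulated data with the same value ranges as the report (the data frame demo is an assumption made for self-containment, not the report's df_kpi). It fits one linear model per company and extracts the slope and the correlation.

```r
# Simulated stand-in for df_kpi (assumption: same ranges, 100 employees per company)
set.seed(123)
demo <- data.frame(
  company_id        = rep(paste0("C", 1:5), each = 100),
  performance_score = sample(50:100, 500, replace = TRUE),
  KPI_score         = sample(60:100, 500, replace = TRUE)
)

by_company <- split(demo, demo$company_id)

# slope of KPI_score on performance_score, one value per company
slopes <- sapply(by_company, function(d) {
  coef(lm(KPI_score ~ performance_score, data = d))[["performance_score"]]
})

# Pearson correlation per company
cors <- sapply(by_company, function(d) cor(d$performance_score, d$KPI_score))
```

Since the two scores are drawn independently, slopes and correlations near zero with mixed signs are what one would expect to see.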

# histogram of salaries by company
p_salary <- ggplot(df_kpi, aes(
  x = salary,
  fill = company_id
)) +
  geom_histogram(bins = 30, alpha = 0.6) +
  labs(
    title = "Salary Distribution",
    x = "Salary",
    y = "Frequency"
  )

ggplotly(p_salary)

Interpretation

The chart shows how salaries are spread across five companies. Most salaries cluster between 4,000–8,000. The stacked colors reveal overlaps, but we can see differences:

  • C1 and C4 appear more concentrated in the mid‑range.

  • C2 and C5 have wider spreads, reaching higher values.

  • Overall, the distributions are fairly similar, with no company standing out as extreme.
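
Overlapping histograms can be hard to compare visually, so a numeric summary is a useful cross-check. The following sketch computes salary quartiles per company in base R. It runs on simulated stand-in data (the data frame demo is an assumption, drawn from the same uniform range as the report), so the numbers it produces are not the report's results.

```r
# Simulated stand-in for the salary column (assumption: uniform on 3000-10000)
set.seed(123)
demo <- data.frame(
  company_id = rep(paste0("C", 1:5), each = 100),
  salary     = round(runif(500, 3000, 10000), 2)
)

# one row per company, columns are the 25%, 50% and 75% quantiles
salary_summary <- do.call(
  rbind,
  lapply(split(demo$salary, demo$company_id), function(s) {
    quantile(s, probs = c(0.25, 0.5, 0.75))
  })
)
```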

# keep only top performers (KPI score above 90)
top_data <- df_kpi %>%
  filter(KPI_score > 90)

# visualize top performers per company & department
p_top <- ggplot(top_data, aes(
  x = company_id,
  fill = department
)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Top Performers Distribution",
    x = "Company",
    y = "Count"
  )

ggplotly(p_top)

Interpretation

The chart shows how top performers are spread across departments in each company.

  • Sales consistently has the highest count of top performers, especially in C1 and C2.

  • IT also shows strong numbers in some companies.

  • HR and Marketing generally have fewer top performers compared to other departments.
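
The bar heights in a chart like this correspond to a simple cross-tabulation. As a rough sketch on simulated stand-in data (the data frame demo is an assumption and does not reproduce the report's numbers), the counts per company and department can be obtained with table():

```r
# Simulated stand-in for df_kpi (assumption: same departments and KPI range)
set.seed(123)
departments <- c("HR", "Finance", "IT", "Sales", "Marketing")
demo <- data.frame(
  company_id = rep(paste0("C", 1:5), each = 100),
  department = sample(departments, 500, replace = TRUE),
  KPI_score  = sample(60:100, 500, replace = TRUE)
)

# same filter as the chart: strictly above 90
top_demo <- demo[demo$KPI_score > 90, ]

# rows are companies, columns are departments
counts <- table(company = top_demo$company_id, department = top_demo$department)
```

Because departments are assigned at random, which department comes out on top is expected to change from one seed to the next.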

Insight Summary

This dashboard presents a comprehensive analysis of employee data across multiple companies, including KPI performance, salary distribution, and departmental contributions.

The results show that KPI scores are only weakly related to performance scores: most company-level regression lines are slightly positive, but some are nearly flat or slightly negative. This is expected, because the two scores are simulated independently of each other.

Salary distributions across companies are relatively similar, with most values concentrated in a consistent range. Any observed differences are likely due to random variation in the simulated data.

Department-level analysis reveals that performance varies within companies, highlighting the importance of evaluating specific departments rather than relying solely on overall company averages.

Additionally, the top-performer counts show that some companies and departments account for more high-achieving employees than others.

Overall, this analysis demonstrates how data visualization and simulation can be used to generate meaningful insights and support data-driven decision making.

8. Automated Report Generation

Objective

This task focuses on building an automated reporting system using functions and loops. The goal is to generate summary reports for each company, including key metrics, tables, and visualizations in a structured format.

By automating the reporting process, this task demonstrates how repetitive analysis can be efficiently handled, allowing scalable and consistent insights across multiple companies. Additionally, it introduces the concept of exporting results into formats such as HTML, CSV, or PDF for practical business use.

# function report per company
generate_company_report <- function(data, company_name) {
  
  library(dplyr)
  library(ggplot2)
  library(plotly)
  library(DT)
  library(htmltools)
  
  # filter data
  df <- data %>% filter(company_id == company_name)
  
  # summary (named summary_tbl to avoid masking base::summary)
  summary_tbl <- df %>%
    summarise(
      avg_salary = round(mean(salary), 2),
      avg_KPI = round(mean(KPI_score), 2),
      total_employee = n()
    )
  
  # table
  table_out <- datatable(
    summary_tbl,
    options = list(dom = 't', autoWidth = TRUE),
    rownames = FALSE
  )
  
  # plot
  p <- ggplot(df, aes(x = KPI_score)) +
    geom_histogram(bins = 20, fill = "skyblue") +
    labs(
      title = paste("KPI Distribution -", company_name),
      x = "KPI Score",
      y = "Count"
    )
  
  plot_out <- ggplotly(p)
  
  # return all outputs together
  tagList(
    h3(paste("Report for", company_name)),
    table_out,
    plot_out
  )
}
# try it for one company
generate_company_report(df_kpi, "C1")

Report for C1

library(htmltools)

# get all company IDs
companies <- unique(df_kpi$company_id)

# loop over all companies automatically
all_reports <- lapply(companies, function(comp) {
  tagList(
    generate_company_report(df_kpi, comp),
    br(), br()
  )
})

tagList(all_reports)

Report for C1



Report for C2



Report for C3



Report for C4



Report for C5



library(dplyr)
library(knitr)
library(kableExtra)
library(DT)

# export data per company
companies <- unique(df_kpi$company_id)

# keep track of the exported file names
file_list <- data.frame(
  Company = character(),
  File_Name = character()
)

for (comp in companies) {
  
  df_temp <- df_kpi %>% filter(company_id == comp)
  
  file_name <- paste0("report_", comp, ".csv")
  
  write.csv(df_temp, file_name, row.names = FALSE)
  
  file_list <- rbind(file_list, data.frame(
    Company = comp,
    File_Name = file_name
  ))
}

# display the table
file_list %>%
  kable(caption = "Exported Report Files") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Exported Report Files
Company File_Name
C1 report_C1.csv
C2 report_C2.csv
C3 report_C3.csv
C4 report_C4.csv
C5 report_C5.csv
datatable(summary_df,
  extensions = 'Buttons',   
  options = list(
    dom = 'Bfrtip',        
    buttons = c('csv'),
    autoWidth = TRUE
  ),
  rownames = FALSE
)

Interpretation

The table displays the list of automatically generated report files for each company. Each company has a corresponding CSV file containing its respective dataset, which has been created using a loop-based approach.

This demonstrates how repetitive tasks, such as generating and exporting reports for multiple companies, can be efficiently automated using programming techniques.

By applying loops and functions, the reporting process becomes scalable, consistent, and less prone to manual errors. Overall, this approach reflects practical data workflows where automation is essential for handling large and repetitive datasets.
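
A small variation on the export loop is to write into a dedicated folder and confirm afterwards that every file exists. The sketch below is a base R illustration on a tiny made-up data frame. The folder name "reports" and the use of tempdir() are assumptions chosen so the example does not touch the working directory.

```r
# Tiny hypothetical dataset: two companies, three employees each
demo <- data.frame(
  company_id = rep(c("C1", "C2"), each = 3),
  KPI_score  = c(91, 78, 66, 85, 92, 70)
)

# write into a temporary folder (assumption: any writable folder would do)
out_dir <- file.path(tempdir(), "reports")
dir.create(out_dir, showWarnings = FALSE)

# one CSV per company; vapply returns the paths as a named character vector
paths <- vapply(split(demo, demo$company_id), function(d) {
  path <- file.path(out_dir, paste0("report_", d$company_id[1], ".csv"))
  write.csv(d, path, row.names = FALSE)
  path
}, character(1))

# verify that every expected file was actually written
all_written <- all(file.exists(paths))
```

Building paths with file.path() keeps the code portable across operating systems, and the final check turns a silent failure into something visible.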

Conclusion

This project demonstrates the use of programming concepts such as functions, loops, and simulation in analyzing structured datasets.

The workflow includes generating synthetic data, performing data transformations, and creating interactive visualizations to explore patterns in performance and salary distributions. Additionally, automated report generation highlights how repetitive analytical tasks can be efficiently scaled across multiple entities.

Overall, the project shows how combining data processing, visualization, and automation can support efficient and reproducible data analysis.

References

[1] Data Science Labs. (n.d.). Functions and Loops. Retrieved from https://bookdown.org/dsciencelabs/data_science_programming/03-Functions-and-Loops.html

[2] StatQuest with Josh Starmer. (n.d.). Statistics Fundamentals Playlist. Retrieved from https://www.youtube.com/playlist?list=PL9dABXznEOVLa2K-KTuV9OH78zeQI7991