Assignment Week 5 ~ Functions and loops

Lulu Najla Salsabila

INSTITUT TEKNOLOGI SAINS BANDUNG

1 Introduction

This practicum implements functions, loops, and conditionals to solve a range of data science problems. The tasks progress from basic to advanced and cover data simulation, data transformation, statistical analysis, and interactive visualization. The practicum also emphasizes automating data science workflows in R.

2 Task 1 — Dynamic Multi-Formula Function

Objective: Build a flexible function that evaluates several kinds of mathematical formulas.

Description:

The function compute_formula(x, formula) takes a value x and a formula type ("linear", "quadratic", "cubic", or "exponential") and returns the result for the chosen formula. It validates the input first and stops with an error if the formula name is not one of the four valid options. A nested loop then evaluates all four formulas over x = 1:20. The results are plotted as overlaid lines in a single chart so that the growth pattern of each function can be compared directly.

Output:

  • Table of results for each formula

  • Combined chart (linear, quadratic, cubic, exponential) in a single plot

2.1 Function & Computation

# ── Function definition ──────────────────────────────────────────────────────
compute_formula <- function(x, formula) {
  valid_formulas <- c("linear", "quadratic", "cubic", "exponential")
  if (!formula %in% valid_formulas) {
    stop(paste("Invalid formula. Choose from:", paste(valid_formulas, collapse = ", ")))
  }
  switch(formula,
    "linear"      = 2 * x + 3,
    "quadratic"   = x^2 - 3 * x + 2,
    "cubic"       = x^3 - 4 * x^2 + x + 6,
    "exponential" = exp(0.3 * x)
  )
}

# ── Evaluate every formula for x = 1:20 with a nested loop ──────────────────
x_values   <- 1:20
formulas   <- c("linear", "quadratic", "cubic", "exponential")
results_df <- data.frame()

for (x in x_values) {
  for (f in formulas) {
    y          <- compute_formula(x, f)
    results_df <- rbind(results_df, data.frame(x = x, formula = f, y = y))
  }
}

# ── Pivot table: one column per formula ─────────────────────────────────────
# pivot_wider() creates the columns in order of first appearance in results_df:
# linear, quadratic, cubic, exponential. Labels and colors below follow that order.
results_wide        <- tidyr::pivot_wider(results_df, names_from = formula, values_from = y)
results_wide[, 2:5] <- round(results_wide[, 2:5], 3)

results_wide %>%
  kable(
    caption   = "Table 1.1: Values of f(x) for Each Formula (x = 1 to 20)",
    col.names = c("x", "Linear", "Quadratic", "Cubic", "Exponential"),
    align     = c("c", "r", "r", "r", "r")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = TRUE,
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#424242") %>%
  column_spec(2, color = "#2196F3") %>%
  column_spec(3, color = "#4CAF50") %>%
  column_spec(4, color = "#FF5722") %>%
  column_spec(5, color = "#9C27B0", bold = TRUE)
Table 1.1: Values of f(x) for Each Formula (x = 1 to 20)
x Linear Quadratic Cubic Exponential
1 5 0 4 1.350
2 7 0 0 1.822
3 9 2 0 2.460
4 11 6 10 3.320
5 13 12 36 4.482
6 15 20 84 6.050
7 17 30 160 8.166
8 19 42 270 11.023
9 21 56 420 14.880
10 23 72 616 20.086
11 25 90 864 27.113
12 27 110 1170 36.598
13 29 132 1540 49.402
14 31 156 1980 66.686
15 33 182 2496 90.017
16 35 210 3094 121.510
17 37 240 3780 164.022
18 39 272 4560 221.406
19 41 306 5440 298.867
20 43 342 6426 403.429
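
The nested loop grows results_df with one rbind() call per (x, formula) pair, 80 calls in total. Because every formula in compute_formula() is ordinary vector arithmetic, the same values can be produced with one call per formula. The block below is a minimal sketch of that alternative, not the method used in this report; compute_formula() is copied from above so the block runs on its own, and the names pieces and results_vec are illustrative.

```r
# Copy of the report's formula definitions, repeated so the block runs on its own
compute_formula <- function(x, formula) {
  valid_formulas <- c("linear", "quadratic", "cubic", "exponential")
  if (!formula %in% valid_formulas) {
    stop(paste("Invalid formula. Choose from:", paste(valid_formulas, collapse = ", ")))
  }
  switch(formula,
    "linear"      = 2 * x + 3,
    "quadratic"   = x^2 - 3 * x + 2,
    "cubic"       = x^3 - 4 * x^2 + x + 6,
    "exponential" = exp(0.3 * x)
  )
}

x_values <- 1:20
formulas <- c("linear", "quadratic", "cubic", "exponential")

# One data frame per formula (each call is vectorized over x),
# then a single bind instead of 80 incremental rbind() calls
pieces <- lapply(formulas, function(f) {
  data.frame(x = x_values, formula = f, y = compute_formula(x_values, f))
})
results_vec <- do.call(rbind, pieces)

head(results_vec)
```

The rows come out grouped by formula rather than by x, which affects neither the pivoted table nor the plot.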

2.2 Visualization

ggplot(results_df, aes(x = x, y = y, color = formula)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2, alpha = 0.7) +
  scale_color_manual(values = c(
    "linear"      = "#2196F3",
    "quadratic"   = "#4CAF50",
    "cubic"       = "#FF5722",
    "exponential" = "#9C27B0"
  )) +
  labs(
    title    = "Comparison of Mathematical Formulas (x = 1 to 20)",
    subtitle = "Linear, Quadratic, Cubic, and Exponential",
    x        = "x",
    y        = "f(x)",
    color    = "Formula"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

2.3 Interpretation

The chart shows four very different growth patterns over x = 1 to 20. The linear formula (2x + 3) rises at a constant, gentle rate, from 5 to 43. The quadratic formula (x^2 - 3x + 2) is a parabola with its minimum at x = 1.5, so it equals 0 at x = 1 and x = 2 and rises steadily afterwards, reaching 342. The cubic formula (x^3 - 4x^2 + x + 6) dips from 4 to 0 at small x and then grows fastest on this range, reaching 6426 at x = 20. The exponential formula (e^(0.3x)) starts slowest and then accelerates: it passes 100 after x = 15 and overtakes the quadratic only at x = 20 (403.4 versus 342), while still far below the cubic. An exponential eventually outgrows any polynomial, but with a rate of 0.3 that happens beyond the plotted range.
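
To see where the exponential formula overtakes the cubic one, a short search over larger x values is enough. This is a minimal sketch: the two formulas are restated from compute_formula(), while the function names and the search range 21 to 60 are assumptions made for illustration.

```r
# The report's cubic and exponential formulas, restated as standalone functions
cubic_fn       <- function(x) x^3 - 4 * x^2 + x + 6
exponential_fn <- function(x) exp(0.3 * x)

# Assumed search range: integers just past the plotted range, up to 60
x_search <- 21:60

# First x in the range where the exponential value exceeds the cubic value
crossover <- x_search[which(exponential_fn(x_search) > cubic_fn(x_search))[1]]
crossover
```

With these definitions the search returns 36, well past the x = 20 limit of the chart.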

3 Task 2 — Nested Simulation: Multi-Sales & Discounts

Objective: Simulate sales data for multiple salespeople with a dynamic discount scheme.

Description:

The dataset contains randomly generated daily sales for each salesperson over a fixed number of days and is loaded here from sales_data.csv. Discounts are conditional on the sales amount: the larger the sale, the larger the discount rate. The function simulate_sales(df) loops over every salesperson and calls the nested helper get_cumulative_sales(), which applies the discount to each day's sales and accumulates net sales per salesperson. Summary statistics such as total gross sales, total net sales, average discount, and the best single-day net sales are then reported.

Output:

  • Simulated dataset (Sales ID, Day, Sales Amount, Discount Rate)

  • Summary statistics per salesperson

  • Cumulative sales chart per salesperson

3.1 Load Data

sales_df <- read.csv("sales_data.csv")
cat("Data dimensions:", nrow(sales_df), "rows,", ncol(sales_df), "columns\n")
## Data dimensions: 150 rows, 4 columns
head(sales_df, 10) %>%
  kable(
    caption     = "Table 2.1: Preview of sales_data.csv (first 10 rows)",
    col.names   = c("Sales ID", "Day", "Sales Amount", "Discount Rate"),
    align       = c("c", "c", "r", "r"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#1565C0") %>%
  column_spec(3, color = "#1B5E20", bold = TRUE) %>%
  column_spec(4, color = "#E65100")
Table 2.1: Preview of sales_data.csv (first 10 rows)
Sales ID Day Sales Amount Discount Rate
1 1 1,314.91 0.15
1 2 147.52 0.05
1 3 622.56 0.10
1 4 524.10 0.10
1 5 1,499.30 0.15
1 6 1,385.73 0.15
1 7 1,795.14 0.20
1 8 265.18 0.05
1 9 901.65 0.10
1 10 156.61 0.05
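
The discount_rate column arrives precomputed in sales_data.csv, so the tiering rule itself is not shown in this report. The block below is a minimal sketch of what such a rule could look like. The function name assign_discount and the cut-offs 500, 1000, and 1500 are assumptions, chosen only because they reproduce the ten preview rows above; they are not confirmed as the rule that generated the file.

```r
# Hypothetical tiering rule: the cut-offs are assumptions that happen to match
# the ten preview rows; the real rule behind sales_data.csv is not shown
assign_discount <- function(amount) {
  cutoffs <- c(500, 1000, 1500)          # assumed tier boundaries
  rates   <- c(0.05, 0.10, 0.15, 0.20)   # rates that appear in the preview
  # findInterval() returns 0..3: how many cut-offs each amount has reached
  rates[findInterval(amount, cutoffs) + 1]
}

# Sales amounts from the ten preview rows
preview_amounts <- c(1314.91, 147.52, 622.56, 524.10, 1499.30,
                     1385.73, 1795.14, 265.18, 901.65, 156.61)

data.frame(sales_amount  = preview_amounts,
           discount_rate = assign_discount(preview_amounts))
```

Using findInterval() keeps the rule vectorized, so it can be applied to a whole column without a loop.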

3.2 Functions & Simulation

# ── Nested function: compute net and cumulative sales per salesperson ───────
apply_discount <- function(amount, rate) {
  amount * (1 - rate)
}

get_cumulative_sales <- function(df, salesperson_id) {
  sub_df                <- df[df$sales_id == salesperson_id, ]
  sub_df                <- sub_df[order(sub_df$day), ]
  sub_df$net_sales      <- round(mapply(apply_discount,
                                        sub_df$sales_amount,
                                        sub_df$discount_rate), 2)
  sub_df$cumulative_net <- cumsum(sub_df$net_sales)
  return(sub_df)
}

simulate_sales <- function(df) {
  all_cumulative  <- data.frame()
  salesperson_ids <- unique(df$sales_id)
  for (sid in salesperson_ids) {
    all_cumulative <- rbind(all_cumulative, get_cumulative_sales(df, sid))
  }
  return(all_cumulative)
}

# ── Run the simulation ───────────────────────────────────────────────────────
sim_result <- simulate_sales(sales_df)

head(sim_result, 10) %>%
  kable(
    caption     = "Table 2.2: Simulation Results, Net Sales and Cumulative Net (first 10 rows)",
    col.names   = c("Sales ID", "Day", "Sales Amount", "Discount Rate",
                    "Net Sales", "Cumulative Net"),
    align       = c("c", "c", "r", "r", "r", "r"),
    digits      = 2,
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#1565C0") %>%
  column_spec(5, color = "#1B5E20", bold = TRUE) %>%
  column_spec(6, color = "#6A1B9A", bold = TRUE)
Table 2.2: Simulation Results, Net Sales and Cumulative Net (first 10 rows)
Sales ID Day Sales Amount Discount Rate Net Sales Cumulative Net
1 1 1,314.91 0.15 1,117.67 1,117.67
1 2 147.52 0.05 140.14 1,257.81
1 3 622.56 0.10 560.30 1,818.11
1 4 524.10 0.10 471.69 2,289.80
1 5 1,499.30 0.15 1,274.40 3,564.20
1 6 1,385.73 0.15 1,177.87 4,742.07
1 7 1,795.14 0.20 1,436.11 6,178.18
1 8 265.18 0.05 251.92 6,430.10
1 9 901.65 0.10 811.48 7,241.58
1 10 156.61 0.05 148.78 7,390.36
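
Both the discount and the running total are vectorized operations, so the same result can be obtained without mapply() or an explicit loop over salespeople. The block below is a minimal sketch on a toy data frame: the three rows for salesperson 1 are taken from the results above, while the two rows for salesperson 2 are made up for illustration.

```r
toy <- data.frame(
  sales_id      = c(1, 1, 1, 2, 2),
  day           = c(1, 2, 3, 1, 2),
  sales_amount  = c(1314.91, 147.52, 622.56, 1000, 500),  # last two are hypothetical
  discount_rate = c(0.15, 0.05, 0.10, 0.10, 0.10)
)

# Sort by salesperson and day so the running total follows calendar order
toy <- toy[order(toy$sales_id, toy$day), ]

# Arithmetic on whole columns applies the discount to every row at once
toy$net_sales <- round(toy$sales_amount * (1 - toy$discount_rate), 2)

# ave() applies cumsum() within each sales_id group and keeps the row order
toy$cumulative_net <- ave(toy$net_sales, toy$sales_id, FUN = cumsum)
toy
```

The first three cumulative values reproduce the figures shown for salesperson 1 (1,117.67, 1,257.81, 1,818.11), and the running total restarts for salesperson 2.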

3.3 Summary Statistics

summary_sales <- sim_result %>%
  group_by(sales_id) %>%
  summarise(
    Total_Gross  = round(sum(sales_amount), 2),
    Total_Net    = round(sum(net_sales), 2),
    Avg_Discount = round(mean(discount_rate) * 100, 1),
    Max_Net_Day  = round(max(net_sales), 2),
    .groups      = "drop"
  )

summary_sales %>%
  kable(
    caption     = "Table 2.3: Sales Summary per Salesperson",
    col.names   = c("Sales ID", "Total Gross (IDR)", "Total Net (IDR)",
                    "Avg Discount (%)", "Max Net/Day"),
    align       = c("c", "r", "r", "c", "r"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#1565C0") %>%
  column_spec(2, color = "#424242") %>%
  column_spec(3, bold = TRUE, color = "#1B5E20") %>%
  column_spec(4, color = "#E65100")
Table 2.3: Sales Summary per Salesperson
Sales ID Total Gross (IDR) Total Net (IDR) Avg Discount (%) Max Net/Day
1 27,150.32 23,098.05 11.7 1,534.96
2 30,183.54 25,728.56 12.3 1,559.14
3 31,852.49 26,809.95 13.3 1,596.26
4 33,147.16 27,901.67 13.2 1,594.10
5 30,854.48 25,759.86 13.3 1,519.36

3.4 Visualization

ggplot(sim_result, aes(x = day, y = cumulative_net,
                       color = factor(sales_id), group = sales_id)) +
  geom_line(linewidth = 1.1) +
  scale_color_brewer(palette = "Set1", name = "Salesperson ID") +
  labs(
    title    = "Cumulative Net Sales per Salesperson (30 Days)",
    subtitle = "After applying conditional discount rates",
    x        = "Day",
    y        = "Cumulative Net Sales (IDR)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

3.5 Interpretation

Over the 30 simulated days, every salesperson's cumulative curve rises monotonically, as expected when there is a sale every day. Because the discount rate grows with the sales amount, net totals in the summary table sit roughly 15 to 17 percent below gross totals (for example, 23,098.05 net versus 27,150.32 gross for salesperson 1), but the upward pattern is unchanged. Differences in slope between salespeople reflect how often each one closes high-value sales.

4 Task 3 — Multi-Level Performance Categorization

Objective: Group sales performance into tiered categories.

Description:

The function categorize_performance(sales_amount) takes a vector of sales amounts and assigns each one to one of five categories: Excellent (≥ 1800), Very Good (1400 to < 1800), Good (900 to < 1400), Average (400 to < 900), or Poor (< 400). A loop processes each element of the vector with an if/else chain. Once categorization is complete, the count and percentage of observations in each category are computed. The results are shown as a bar plot and a pie chart to make the performance distribution easy to read.

Output:

  • Frequency and percentage table per category

  • Bar plot of the category distribution

  • Pie chart of category proportions

4.1 Function & Categorization

# ── Performance categorization function ──────────────────────────────────────
categorize_performance <- function(sales_amount) {
  categories <- c()
  for (amt in sales_amount) {
    cat_label <- if (amt >= 1800) {
      "Excellent"
    } else if (amt >= 1400) {
      "Very Good"
    } else if (amt >= 900) {
      "Good"
    } else if (amt >= 400) {
      "Average"
    } else {
      "Poor"
    }
    categories <- c(categories, cat_label)
  }
  return(categories)
}

# ── Apply to the data and compute percentages ────────────────────────────────
sales_df$performance_cat <- categorize_performance(sales_df$sales_amount)

cat_order   <- c("Excellent", "Very Good", "Good", "Average", "Poor")
cat_summary <- sales_df %>%
  group_by(performance_cat) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(percentage = round(count / sum(count) * 100, 1)) %>%
  arrange(factor(performance_cat, levels = cat_order))

cat_summary %>%
  kable(
    caption   = "Table 3.1: Distribution of Sales Performance Categories",
    col.names = c("Category", "Count", "Percentage (%)"),
    align     = c("l", "c", "c")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2, color = "#1565C0", bold = TRUE) %>%
  column_spec(3, color = "#6A1B9A") %>%
  row_spec(which(cat_summary$performance_cat == "Excellent"), background = "#E8F5E9") %>%
  row_spec(which(cat_summary$performance_cat == "Poor"),      background = "#FFEBEE")
Table 3.1: Distribution of Sales Performance Categories
Category Count Percentage (%)
Excellent 14 9.3
Very Good 30 20.0
Good 40 26.7
Average 41 27.3
Poor 25 16.7
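
The element-by-element loop in categorize_performance() can also be written as a single call to cut(), which bins a whole vector at once. The block below is a minimal sketch of that alternative using the same thresholds as the if/else chain; the function name categorize_performance_cut and the boundary test values are illustrative.

```r
# Same thresholds as the if/else chain in categorize_performance()
categorize_performance_cut <- function(sales_amount) {
  labels <- cut(
    sales_amount,
    breaks = c(-Inf, 400, 900, 1400, 1800, Inf),
    labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
    right  = FALSE   # intervals are [lower, upper), so 1800 itself is "Excellent"
  )
  as.character(labels)
}

# Values on either side of each threshold
boundary_values <- c(399.99, 400, 899.99, 900, 1399.99, 1400, 1799.99, 1800)
boundary_result <- categorize_performance_cut(boundary_values)
data.frame(amount = boundary_values, category = boundary_result)
```

Setting right = FALSE matters here: with the default right-closed intervals, a value of exactly 400 would fall into "Poor" instead of "Average".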

4.2 Visualization

palette_cat <- c(
  "Excellent" = "#1B5E20",
  "Very Good" = "#388E3C",
  "Good"      = "#FBC02D",
  "Average"   = "#F57C00",
  "Poor"      = "#C62828"
)

cat_summary$performance_cat <- factor(cat_summary$performance_cat, levels = cat_order)

p_bar <- ggplot(cat_summary, aes(x = performance_cat, y = count,
                                  fill = performance_cat)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = paste0(count, "\n(", percentage, "%)")),
            vjust = -0.3, size = 3.8, fontface = "bold") +
  scale_fill_manual(values = palette_cat) +
  labs(title = "Sales Performance Distribution — Bar Chart",
       x = "Category", y = "Count") +
  theme_minimal(base_size = 13) +
  ylim(0, max(cat_summary$count) * 1.2)

p_pie <- ggplot(cat_summary, aes(x = "", y = count, fill = performance_cat)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y") +
  scale_fill_manual(values = palette_cat, name = "Category") +
  geom_text(aes(label = paste0(percentage, "%")),
            position = position_stack(vjust = 0.5),
            size = 4, color = "white", fontface = "bold") +
  labs(title = "Sales Performance Distribution — Pie Chart") +
  theme_void(base_size = 13)

grid.arrange(p_bar, p_pie, ncol = 2)

4.3 Interpretation

Across the 150 daily sales records, the categories are fairly evenly spread because the amounts were generated uniformly at random between 100 and 2000. Good (26.7%) and Average (27.3%) are the largest groups because each covers the widest band of values, 500 units, together spanning 400 to 1400. Excellent (9.3%) and Poor (16.7%) are the smallest because they cover the narrow ends of the range. This resembles a typical sales picture: most transactions fall in the middle, and only a small share are outstanding or very low.

5 Task 4 — Multi-Company Dataset Simulation

Objective: Simulate employee data for multiple companies with a full set of attributes.

Description:

The dataset, loaded here from company_data.csv, has the columns company_id, employee_id, salary, department, performance_score (0–100), and KPI_score (0–100). The function generate_company_summary(df) uses nested loops: the outer loop iterates over companies and the inner loop over each company's employees. A conditional flags top performers (KPI > 90). For each company the function reports the number of employees, average salary, average performance score, maximum KPI, and the number of top performers.

Output:

  • Simulated employee dataset

  • Summary table per company

  • Bar plot comparing average salary across companies

  • Bar plot of the number of top performers per company

5.1 Load Data

company_df <- read.csv("company_data.csv")
cat("Dimensions:", nrow(company_df), "rows,", ncol(company_df), "columns\n")
## Dimensions: 160 rows, 6 columns
head(company_df, 10) %>%
  kable(
    caption     = "Table 4.1: Preview of company_data.csv (first 10 rows)",
    col.names   = c("Company ID", "Employee ID", "Salary", "Department",
                    "Performance Score", "KPI Score"),
    align       = c("c", "c", "r", "l", "r", "r"),
    digits      = 2,
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#0277BD") %>%
  column_spec(4, italic = TRUE, color = "#4A148C") %>%
  column_spec(5, color = "#1B5E20") %>%
  column_spec(6, bold = TRUE, color = "#E65100")
Table 4.1: Preview of company_data.csv (first 10 rows)
Company ID Employee ID Salary Department Performance Score KPI Score
1 1 8,257.20 Marketing 75.9 45.4
1 2 3,768.31 Finance 51.1 64.9
1 3 5,642.61 Operations 53.5 68.4
1 4 3,808.80 Finance 51.6 54.9
1 5 5,856.06 Operations 83.4 49.6
1 6 11,680.23 Finance 94.1 93.6
1 7 12,415.43 Finance 90.4 87.9
1 8 10,907.79 HR 67.7 58.5
1 9 11,080.37 HR 99.2 86.5
1 10 11,738.56 Engineering 90.0 86.6

5.2 Function & Summary

# ── Nested loops + conditional: summary per company ──────────────────────────
generate_company_summary <- function(df) {
  company_ids  <- unique(df$company_id)
  summary_list <- list()

  for (cid in company_ids) {
    sub_df         <- df[df$company_id == cid, ]
    n_emp          <- nrow(sub_df)
    top_performers <- 0

    for (i in 1:n_emp) {
      if (sub_df$KPI_score[i] > 90) {
        top_performers <- top_performers + 1
      }
    }

    summary_list[[length(summary_list) + 1]] <- data.frame(
      Company         = paste("Company", cid),
      N_Employees     = n_emp,
      Avg_Salary      = round(mean(sub_df$salary), 2),
      Avg_Performance = round(mean(sub_df$performance_score), 2),
      Max_KPI         = round(max(sub_df$KPI_score), 2),
      Top_Performers  = top_performers
    )
  }
  return(do.call(rbind, summary_list))
}

company_summary <- generate_company_summary(company_df)

company_summary %>%
  kable(
    caption     = "Table 4.2: Summary per Company (Salary, Performance, and Top Performers)",
    col.names   = c("Company", "No. of Employees", "Avg Salary (IDR)",
                    "Avg Performance", "Max KPI", "Top Performers"),
    align       = c("l", "c", "r", "r", "r", "c"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(3, color = "#1565C0", bold = TRUE) %>%
  column_spec(5, color = "#E65100", bold = TRUE) %>%
  column_spec(6, bold = TRUE, color = "white",
              background = ifelse(company_summary$Top_Performers >= 3,
                                  "#1B5E20", "#E53935"))
Table 4.2: Summary per Company (Salary, Performance, and Top Performers)
Company No. of Employees Avg Salary (IDR) Avg Performance Max KPI Top Performers
Company 1 40 8,091.14 77.09 99.9 6
Company 2 40 8,983.58 78.89 98.4 12
Company 3 40 8,568.23 75.11 99.5 10
Company 4 40 9,087.17 74.92 99.5 7
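
Average KPI and the department with the highest KPI are not part of generate_company_summary(). The block below is a minimal sketch of how they could be computed, and of how the inner counting loop reduces to one vectorized expression. It uses only the ten preview rows of company_data.csv shown in section 5.1, typed in by hand, so the numbers describe that sample and not the full 160-row dataset; the variable names are illustrative.

```r
# Department and KPI_score from the ten preview rows (Company 1 only)
preview <- data.frame(
  department = c("Marketing", "Finance", "Operations", "Finance", "Operations",
                 "Finance", "Finance", "HR", "HR", "Engineering"),
  KPI_score  = c(45.4, 64.9, 68.4, 54.9, 49.6, 93.6, 87.9, 58.5, 86.5, 86.6)
)

# Counting top performers needs no inner loop: a logical vector sums to a count
top_performers <- sum(preview$KPI_score > 90)

# Average KPI over all rows, then the mean per department with tapply()
avg_kpi     <- mean(preview$KPI_score)
kpi_by_dept <- tapply(preview$KPI_score, preview$department, mean)

# Department whose mean KPI is highest in this sample
best_dept <- names(kpi_by_dept)[which.max(kpi_by_dept)]

list(top_performers = top_performers, avg_kpi = avg_kpi, best_dept = best_dept)
```

On these ten rows there is one top performer, the average KPI is 69.63, and Engineering has the highest department mean, although that rests on a single employee.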

5.3 Visualization

p1 <- ggplot(company_summary, aes(x = Company, y = Avg_Salary, fill = Company)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = scales::comma(Avg_Salary)), vjust = -0.4, size = 3.5) +
  scale_fill_brewer(palette = "Blues", direction = -1) +
  labs(title = "Average Salary per Company", x = "", y = "Avg Salary") +
  theme_minimal(base_size = 13) +
  ylim(0, max(company_summary$Avg_Salary) * 1.15)

p2 <- ggplot(company_summary, aes(x = Company, y = Top_Performers, fill = Company)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = Top_Performers), vjust = -0.4, size = 4, fontface = "bold") +
  scale_fill_brewer(palette = "Oranges", direction = -1) +
  labs(title = "Top Performers (KPI > 90) per Company", x = "", y = "Count") +
  theme_minimal(base_size = 13) +
  ylim(0, max(company_summary$Top_Performers) * 1.2)

grid.arrange(p1, p2, ncol = 2)

5.4 Interpretation

The per-company summary shows clear variation between companies. Average salary ranges from 8,091.14 (Company 1) to 9,087.17 (Company 4), which is expected because salaries were generated at random over a wide range (3000–15000). The number of top performers (KPI > 90) varies even more, from 6 in Company 1 to 12 in Company 2. Max KPI, by contrast, is nearly identical everywhere (98.4 to 99.9) and does not track the top-performer count: Company 2 has the most top performers but the lowest Max KPI. Together these figures reflect the kind of heterogeneous performance distribution often seen in real organizations.

6 Task 5 — Monte Carlo Simulation: π & Probability

Objective: Estimate π with the Monte Carlo method and analyze a related probability.

Description:

The function monte_carlo_pi(n_points) draws n_points random points (x, y) uniformly from the square [-1, 1] × [-1, 1]. A point lies inside the unit circle centered at the origin when x^2 + y^2 ≤ 1. Because the circle has area π and the square has area 4, π is estimated as:

\[ \pi \approx 4 \times \frac{\text{number of points inside the circle}}{\text{total number of points}} \]

The function also computes the probability that a random point falls inside a sub-area, here the square |x| ≤ 0.5, |y| ≤ 0.5. The simulation is run in a loop over several sample sizes to observe how the π estimate converges. A scatter plot distinguishes points inside the circle (blue) from points outside it (red).

Output:

  • Estimate of π

  • Probability of a point falling in the sub-area

  • Scatter plot of points inside versus outside the circle

6.1 Function & Simulation

# ── Monte Carlo function ──────────────────────────────────────────────────────
monte_carlo_pi <- function(n_points) {
  set.seed(123)  # fixed seed for reproducibility; every call restarts the same random stream
  x <- runif(n_points, -1, 1)
  y <- runif(n_points, -1, 1)

  inside_circle  <- (x^2 + y^2) <= 1
  pi_estimate    <- 4 * sum(inside_circle) / n_points

  in_subsquare   <- abs(x) <= 0.5 & abs(y) <= 0.5
  prob_subsquare <- sum(in_subsquare) / n_points

  return(list(
    pi_estimate    = pi_estimate,
    prob_subsquare = prob_subsquare,
    x              = x,
    y              = y,
    inside_circle  = inside_circle
  ))
}

# ── Convergence across several sample sizes n ────────────────────────────────
n_values   <- c(100, 500, 1000, 5000, 10000)
pi_results <- data.frame()

for (n in n_values) {
  res        <- monte_carlo_pi(n)
  pi_results <- rbind(pi_results, data.frame(
    n_points       = n,
    pi_estimate    = round(res$pi_estimate, 5),
    error          = round(abs(res$pi_estimate - pi), 5),
    prob_subsquare = round(res$prob_subsquare, 4)
  ))
}

cat("Actual pi:", pi, "\n")
## Actual pi: 3.141593
pi_results %>%
  kable(
    caption     = "Table 5.1: Monte Carlo Estimates of π (various sample sizes)",
    col.names   = c("N Points", "π Estimate", "Absolute Error", "P(Sub-Square)"),
    align       = c("r", "r", "r", "r"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "#1565C0") %>%
  column_spec(2, bold = TRUE) %>%
  column_spec(3, bold = TRUE,
              color = ifelse(pi_results$error < 0.01, "#1B5E20", "#E65100")) %>%
  row_spec(nrow(pi_results), background = "#E8F5E9")
Table 5.1: Monte Carlo Estimates of π (various sample sizes)
N Points π Estimate Absolute Error P(Sub-Square)
100 3.4000 0.25841 0.2600
500 3.2000 0.05841 0.2400
1,000 3.2000 0.05841 0.2650
5,000 3.1632 0.02161 0.2630
10,000 3.1576 0.01601 0.2507
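
Because monte_carlo_pi() calls set.seed(123) internally, every call restarts the same random stream, so the rows above are reproducible but are not independent runs. The block below is a minimal sketch of the alternative, seeding once outside the function and repeating the experiment. The helper mc_pi_once() is hypothetical, and the seed value 2024 and the choice of 20 replicates are arbitrary.

```r
# Hypothetical helper: same estimator as monte_carlo_pi(), but without set.seed()
mc_pi_once <- function(n_points) {
  x <- runif(n_points, -1, 1)
  y <- runif(n_points, -1, 1)
  4 * mean(x^2 + y^2 <= 1)   # share of points inside the unit circle, times 4
}

set.seed(2024)                                  # seed once, outside the function
estimates <- replicate(20, mc_pi_once(10000))   # 20 independent runs at n = 10,000

# The spread of the estimates shows how much a single run can vary
c(mean = mean(estimates), sd = sd(estimates),
  min = min(estimates), max = max(estimates))
```

Seeding at the top of the script keeps the whole analysis reproducible while still letting each call draw fresh points.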

6.2 Visualization

res_plot <- monte_carlo_pi(2000)
plot_df  <- data.frame(
  x      = res_plot$x,
  y      = res_plot$y,
  status = ifelse(res_plot$inside_circle, "Inside Circle", "Outside Circle")
)

p_scatter <- ggplot(plot_df, aes(x = x, y = y, color = status)) +
  geom_point(alpha = 0.5, size = 0.8) +
  scale_color_manual(values = c("Inside Circle"  = "#1565C0",
                                "Outside Circle" = "#E53935")) +
  coord_fixed() +
  annotate("path",
           x = cos(seq(0, 2 * pi, length.out = 300)),
           y = sin(seq(0, 2 * pi, length.out = 300)),
           color = "black", linewidth = 0.8) +
  labs(
    title    = "Monte Carlo Simulation (n = 2000)",
    subtitle = paste0("pi estimate = ", round(res_plot$pi_estimate, 4)),
    color    = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

p_conv <- ggplot(pi_results, aes(x = n_points, y = pi_estimate)) +
  geom_line(color = "#1565C0", linewidth = 1.2) +
  geom_point(color = "#E53935", size = 3) +
  geom_hline(yintercept = pi, linetype = "dashed", color = "gray40") +
  geom_text(aes(label = round(pi_estimate, 4)), vjust = -0.8, size = 3.5) +
  annotate("text", x = max(pi_results$n_points) * 0.8, y = pi + 0.01,
           label = "True pi", color = "gray40") +
  scale_x_log10(labels = scales::comma) +
  labs(
    title = "Convergence of pi Estimate vs Sample Size",
    x     = "Number of Points (log scale)",
    y     = "pi Estimate"
  ) +
  theme_minimal(base_size = 13)

grid.arrange(p_scatter, p_conv, ncol = 2)

6.3 Interpretation

The Monte Carlo simulation scatters random points over a square and counts how many land inside the inscribed circle. Multiplying that ratio by 4 gives the estimate of π. In the results table the absolute error shrinks from 0.258 at n = 100 to 0.016 at n = 10,000, although not at every step: n = 500 and n = 1,000 both give 3.2000. The probability of landing in the sub-square is about 25%, which matches the geometry: its area (1 × 1) is one quarter of the area of the full square (2 × 2).
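
How small the error should be at a given n can be checked against theory. Each point lands inside the circle with probability p = π/4, so the hit count is binomial and the estimator 4 × hits / n has standard error 4 × sqrt(p(1 − p)/n). The block below is a minimal sketch that evaluates this for the sample sizes used above; the variable names are illustrative.

```r
n_values <- c(100, 500, 1000, 5000, 10000)

# Probability that a uniform point in the 2 x 2 square lands in the unit circle
p_inside <- pi / 4

# Standard error of the estimator 4 * (hits / n), from the binomial variance
se_pi <- 4 * sqrt(p_inside * (1 - p_inside) / n_values)

data.frame(n_points = n_values, standard_error = round(se_pi, 4))
```

At n = 10,000 the standard error is about 0.016, so the observed error of 0.01601 is roughly one standard error. Since the standard error falls with the square root of n, halving it takes about four times as many points.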

7 Task 6 — Advanced Data Transformation & Feature Engineering

Objective: Transform data and create new features using a loop-based approach.

Description:

Two transformation functions are defined: normalize_columns(df, cols) for min-max normalization (range 0–1) and z_score(df, cols) for standardization (mean = 0, sd = 1). Both loop over the numeric columns named in cols. After the transformations, two new features are added: performance_category (derived from performance_score) and salary_bracket (derived from fixed salary thresholds). The distributions before and after transformation are compared side by side with histograms and boxplots.

Output:

  • Normalized dataframe

  • Z-score dataframe

  • Histograms comparing the distributions

  • Boxplots comparing the distributions

7.1 Load Data & Functions

df6          <- read.csv("company_data.csv")
numeric_cols <- c("salary", "performance_score", "KPI_score")
cat("Dimensions:", nrow(df6), "rows,", ncol(df6), "columns\n")
## Dimensions: 160 rows, 6 columns
# ── Min-Max Normalization (loop-based) ───────────────────────────────────────
normalize_columns <- function(df, cols) {
  df_norm <- df
  for (col in cols) {
    mn             <- min(df[[col]], na.rm = TRUE)
    mx             <- max(df[[col]], na.rm = TRUE)
    df_norm[[col]] <- round((df[[col]] - mn) / (mx - mn), 4)
  }
  return(df_norm)
}

# ── Z-Score Standardization (loop-based) ─────────────────────────────────────
z_score <- function(df, cols) {
  df_z <- df
  for (col in cols) {
    mu          <- mean(df[[col]], na.rm = TRUE)
    sigma       <- sd(df[[col]], na.rm = TRUE)
    df_z[[col]] <- round((df[[col]] - mu) / sigma, 4)
  }
  return(df_z)
}

df_norm <- normalize_columns(df6, numeric_cols)
df_z    <- z_score(df6, numeric_cols)

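# ── Feature Engineering ───────────────────────────────────────────────────────
# categorize_performance() uses the sales-amount thresholds from Task 3 (400 to 1800),
# so every performance_score on the 0-100 scale is labeled "Poor".
df6$performance_category <- categorize_performance(df6$performance_score)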
df6$salary_bracket <- cut(
  df6$salary,
  breaks         = c(0, 5000, 8000, 11000, Inf),
  labels         = c("Low", "Medium", "High", "Very High"),
  include.lowest = TRUE
)

cat("New features added: performance_category and salary_bracket.\n")
## New features added: performance_category and salary_bracket.
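
Categories for a 0–100 score and quartile-based salary brackets can both be built with cut(). The block below is a minimal sketch run on the ten preview rows of company_data.csv. The score cut-offs (60, 70, 80, 90) are hypothetical values for a 0–100 scale and are not taken from this report, and the quartile breaks come from quantile() on this small sample rather than on the full dataset.

```r
# Salary and performance_score from the ten preview rows of company_data.csv
salary <- c(8257.20, 3768.31, 5642.61, 3808.80, 5856.06,
            11680.23, 12415.43, 10907.79, 11080.37, 11738.56)
score  <- c(75.9, 51.1, 53.5, 51.6, 83.4, 94.1, 90.4, 67.7, 99.2, 90.0)

# Hypothetical cut-offs for a 0-100 score (not taken from the report)
performance_category <- cut(
  score,
  breaks = c(-Inf, 60, 70, 80, 90, Inf),
  labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
  right  = FALSE   # [lower, upper): a score of exactly 90 counts as "Excellent"
)

# Quartile-based brackets: breaks at the 0%, 25%, 50%, 75%, and 100% quantiles
salary_bracket <- cut(
  salary,
  breaks         = quantile(salary, probs = seq(0, 1, 0.25)),
  labels         = c("Low", "Medium", "High", "Very High"),
  include.lowest = TRUE   # keep the minimum salary inside the first bracket
)

data.frame(salary, score, performance_category, salary_bracket)
```

Quartile breaks adapt to the data and give brackets of roughly equal size, whereas fixed thresholds keep the same meaning across datasets; which is preferable depends on how the brackets will be used.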

7.2 Preview Data

# ── Original data ─────────────────────────────────────────────────────────────
head(df6, 10) %>%
  kable(
    caption     = "Table 6.1: Original Data (before transformation)",
    col.names   = c("Company ID", "Employee ID", "Salary", "Department",
                    "Performance Score", "KPI Score",
                    "Performance Category", "Salary Bracket"),
    align       = c("c", "c", "r", "l", "r", "r", "l", "l"),
    digits      = 2,
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = TRUE,
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#0277BD") %>%
  column_spec(3, color = "#1B5E20", bold = TRUE) %>%
  column_spec(5, color = "#6A1B9A") %>%
  column_spec(7, italic = TRUE, color = "#E65100") %>%
  column_spec(8, italic = TRUE, color = "#0277BD")
Table 6.1: Original Data (before transformation)
Company ID Employee ID Salary Department Performance Score KPI Score Performance Category Salary Bracket
1 1 8,257.20 Marketing 75.9 45.4 Poor High
1 2 3,768.31 Finance 51.1 64.9 Poor Low
1 3 5,642.61 Operations 53.5 68.4 Poor Medium
1 4 3,808.80 Finance 51.6 54.9 Poor Low
1 5 5,856.06 Operations 83.4 49.6 Poor Medium
1 6 11,680.23 Finance 94.1 93.6 Poor Very High
1 7 12,415.43 Finance 90.4 87.9 Poor Very High
1 8 10,907.79 HR 67.7 58.5 Poor High
1 9 11,080.37 HR 99.2 86.5 Poor Very High
1 10 11,738.56 Engineering 90.0 86.6 Poor Very High
# ── After Min-Max Normalization ───────────────────────────────────────────────
head(df_norm[, numeric_cols], 10) %>%
  kable(
    caption   = "Table 6.2: After Min-Max Normalization (range [0, 1])",
    col.names = c("Salary (norm)", "Performance Score (norm)", "KPI Score (norm)"),
    align     = c("r", "r", "r"),
    digits    = 4
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, color = "#1565C0", bold = TRUE) %>%
  column_spec(2, color = "#1B5E20") %>%
  column_spec(3, color = "#E65100")
Table 6.2: After Min-Max Normalization (range [0, 1])
Salary (norm) Performance Score (norm) KPI Score (norm)
0.4408 0.5181 0.0886
0.0576 0.0201 0.4147
0.2176 0.0683 0.4732
0.0610 0.0301 0.2475
0.2358 0.6687 0.1589
0.7330 0.8835 0.8946
0.7958 0.8092 0.7993
0.6671 0.3534 0.3077
0.6818 0.9859 0.7759
0.7380 0.8012 0.7776
# ── After Z-Score Standardization ────────────────────────────────────────────
head(df_z[, numeric_cols], 10) %>%
  kable(
    caption   = "Table 6.3: After Z-Score Standardization (in standard deviation units)",
    col.names = c("Salary (z)", "Performance Score (z)", "KPI Score (z)"),
    align     = c("r", "r", "r"),
    digits    = 4
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, color = "#1565C0", bold = TRUE) %>%
  column_spec(2, color = "#1B5E20") %>%
  column_spec(3, color = "#E65100")
Table 6.3: After Z-Score Standardization (in standard deviation units)
Salary (z)   Performance Score (z)   KPI Score (z)
   -0.1203                 -0.0386         -1.5170
   -1.3894                 -1.6285         -0.4422
   -0.8595                 -1.4747         -0.2493
   -1.3780                 -1.5965         -0.9934
   -0.7991                  0.4422         -1.2855
    0.8475                  1.1282          1.1398
    1.0554                  0.8910          0.8256
    0.6292                 -0.5643         -0.7950
    0.6779                  1.4551          0.7484
    0.8640                  0.8653          0.7540

7.3 Visualization

p_sal_before <- ggplot(df6, aes(x = salary)) +
  geom_histogram(bins = 20, fill = "#1565C0", alpha = 0.8, color = "white") +
  labs(title = "Salary — Original", x = "Salary", y = "Count") +
  theme_minimal(base_size = 12)

p_sal_after <- ggplot(df_norm, aes(x = salary)) +
  geom_histogram(bins = 20, fill = "#43A047", alpha = 0.8, color = "white") +
  labs(title = "Salary — Normalized [0, 1]", x = "Normalized Salary", y = "Count") +
  theme_minimal(base_size = 12)

p_perf_before <- ggplot(df6, aes(y = performance_score, x = factor(company_id),
                                  fill = factor(company_id))) +
  geom_boxplot(show.legend = FALSE, alpha = 0.8) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(title = "Performance Score — Original", x = "Company", y = "Score") +
  theme_minimal(base_size = 12)

p_perf_after <- ggplot(df_z, aes(y = performance_score, x = factor(company_id),
                                  fill = factor(company_id))) +
  geom_boxplot(show.legend = FALSE, alpha = 0.8) +
  scale_fill_brewer(palette = "Pastel2") +
  labs(title = "Performance Score — Z-Score", x = "Company", y = "Z-Score") +
  theme_minimal(base_size = 12)

pc_sum <- df6 %>%
  group_by(performance_category) %>%
  summarise(n = n()) %>%
  mutate(pct = round(n / sum(n) * 100, 1))

p_bracket <- ggplot(df6, aes(x = salary_bracket, fill = salary_bracket)) +
  geom_bar(show.legend = FALSE, width = 0.6, color = "white") +
  scale_fill_manual(values = c(
    "Low"       = "#EF9A9A",
    "Medium"    = "#FFF176",
    "High"      = "#A5D6A7",
    "Very High" = "#90CAF9"
  )) +
  labs(title = "Salary Bracket Distribution", x = "Bracket", y = "Count") +
  theme_minimal(base_size = 12)

p_perf_cat <- ggplot(pc_sum, aes(x = "", y = n, fill = performance_category)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y") +
  geom_text(aes(label = paste0(pct, "%")),
            position = position_stack(vjust = 0.5),
            size = 3.5, color = "white", fontface = "bold") +
  scale_fill_brewer(palette = "Set2", name = "Category") +
  labs(title = "Performance Category Distribution") +
  theme_void(base_size = 12)

grid.arrange(p_sal_before, p_sal_after,
             p_perf_before, p_perf_after,
             p_bracket, p_perf_cat, ncol = 2)

7.4 Interpretation

Normalization and standardization change the scale of the data without changing the shape of its distribution. After min-max normalization, all salary values fall in the range [0, 1], which makes them easier to compare with variables measured in different units. Z-score standardization yields a distribution with mean 0 and standard deviation 1, which is useful for outlier detection. Feature engineering adds categorical information that can be used directly for segmentation analysis.
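
As a minimal sketch of the two transformations discussed above, the base R helpers below rescale a small made-up vector. The names min_max and z_score and the sample values are illustrative assumptions, not necessarily the functions used earlier in this report.

```r
# Hypothetical helpers for illustration; the report's own implementation may differ.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant column has zero range; return zeros instead of dividing by zero.
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

salary_demo <- c(4000, 6500, 8000, 12000)  # made-up values
norm_demo   <- min_max(salary_demo)
z_demo      <- z_score(salary_demo)

range(norm_demo)   # smallest value maps to 0, largest to 1
mean(z_demo)       # approximately 0
sd(z_demo)         # approximately 1

# Both transformations are monotonic, so the ranking of employees is unchanged.
identical(order(salary_demo), order(norm_demo))
identical(order(salary_demo), order(z_demo))
```

Because both helpers are linear rescalings, a histogram of the transformed column has the same shape as the original; only the axis changes.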


8 Task 7 — Mini Project: Company KPI Dashboard

Objective: Build a comprehensive KPI dashboard for multi-company analysis.

Description:

This mini project combines the concepts from all previous tasks. The dataset is generated for 5–10 companies with 50–200 employees each and has the columns employee_id, company_id, salary, performance_score, KPI_score, and department. The program loops over employees to assign each one a KPI tier (Tier 1 - Elite: KPI ≥ 90; Tier 2 - High: 75 ≤ KPI < 90; Tier 3 - Average: 60 ≤ KPI < 75; Tier 4 - Low: KPI < 60), computes a summary per company, and analyzes results per department. The advanced visualizations include a grouped bar chart comparing average salary by department across companies and a scatter plot with a regression line showing the relationship between performance_score and KPI_score.

Output:

  • Summary table per company

  • Top performers per company (KPI ≥ 90)

  • Bar charts of average KPI and top performers per company, plus a grouped bar chart of average salary by department and company

  • Scatter plot of performance_score vs KPI_score with a regression line

  • Per-department analysis (lowest and highest average KPI)
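
The script that produced dashboard_data.csv is not shown in this section. As a rough sketch of how a dataset with the structure described above could be simulated in base R, consider the function below. The name simulate_dashboard, the seed, the value ranges, the five department names, and the way KPI_score is derived from performance_score are all assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical generator; every distribution and range below is an assumption.
simulate_dashboard <- function(n_companies = 7, min_emp = 50, max_emp = 200) {
  departments <- c("R&D", "Legal", "Customer Service", "Supply Chain", "IT")
  rows    <- vector("list", n_companies)  # one data frame per company
  next_id <- 1
  for (i in seq_len(n_companies)) {
    # Draw a head count between min_emp and max_emp (inclusive).
    n_emp <- min_emp + sample.int(max_emp - min_emp + 1, 1) - 1
    perf  <- round(runif(n_emp, 40, 100), 1)
    # KPI follows performance plus noise, clipped to [0, 100].
    kpi   <- round(pmin(100, pmax(0, perf + rnorm(n_emp, mean = 0, sd = 10))), 1)
    rows[[i]] <- data.frame(
      employee_id       = seq(next_id, length.out = n_emp),
      company_id        = paste0("CO", i),
      salary            = round(runif(n_emp, 3000, 15000), 2),
      department        = sample(departments, n_emp, replace = TRUE),
      performance_score = perf,
      KPI_score         = kpi,
      stringsAsFactors  = FALSE
    )
    next_id <- next_id + n_emp
  }
  do.call(rbind, rows)  # stack the per-company frames into one dataset
}

demo_df <- simulate_dashboard()
table(demo_df$company_id)  # employees per company, each between 50 and 200
```

Collecting the per-company frames in a preallocated list and binding them once at the end avoids repeatedly copying a growing data frame inside the loop.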

8.1 Load Data

dash_df <- read.csv("dashboard_data.csv")
cat("Total karyawan  :", nrow(dash_df), "\n")
## Total karyawan  : 988
cat("Total perusahaan:", length(unique(dash_df$company_id)), "\n")
## Total perusahaan: 7
head(dash_df, 10) %>%
  kable(
    caption     = "Tabel 7.1 — Preview dashboard_data.csv (10 baris pertama)",
    col.names   = c("Employee ID", "Company ID", "Salary", "Department",
                    "Performance Score", "KPI Score"),
    align       = c("c", "c", "r", "l", "r", "r"),
    digits      = 2,
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(2, bold = TRUE, color = "white", background = "#0277BD") %>%
  column_spec(3, color = "#1B5E20", bold = TRUE) %>%
  column_spec(4, italic = TRUE, color = "#4A148C") %>%
  column_spec(5, color = "#424242") %>%
  column_spec(6, bold = TRUE, color = "#E65100")
Tabel 7.1 — Preview dashboard_data.csv (10 baris pertama)
Employee ID  Company ID    Salary  Department        Performance Score  KPI Score
          1  CO1         7,790.52  R&D                            58.7       68.5
          2  CO1         8,395.58  Legal                          85.1       77.9
          3  CO1         4,062.58  Customer Service               71.9       66.9
          4  CO1         6,648.91  Supply Chain                   77.4       80.6
          5  CO1         8,809.21  Legal                          56.7       58.9
          6  CO1         4,820.13  IT                             91.7       76.5
          7  CO1         4,953.22  Customer Service               70.8       65.8
          8  CO1         7,035.43  Legal                          75.7       69.3
          9  CO1         9,350.49  Legal                          70.4       79.4
         10  CO1         7,378.56  Supply Chain                   70.4       55.7

8.2 KPI Tier Categorization & Summary

# ── KPI tier categorization loop ──────────────────────────────────────────────
kpi_tier <- character(nrow(dash_df))  # preallocate instead of growing the vector
for (i in seq_along(kpi_tier)) {
  kpi <- dash_df$KPI_score[i]
  kpi_tier[i] <- if (kpi >= 90) {
    "Tier 1 - Elite"
  } else if (kpi >= 75) {
    "Tier 2 - High"
  } else if (kpi >= 60) {
    "Tier 3 - Average"
  } else {
    "Tier 4 - Low"
  }
}
dash_df$kpi_tier <- kpi_tier

# ── Summary per company ───────────────────────────────────────────────────────
company_kpi_summary <- dash_df %>%
  group_by(company_id) %>%
  summarise(
    N_Employees    = n(),
    Avg_Salary     = round(mean(salary), 2),
    Avg_KPI        = round(mean(KPI_score), 2),
    Top_Performers = sum(KPI_score >= 90),
    .groups        = "drop"
  )

company_kpi_summary %>%
  kable(
    caption     = "Tabel 7.2 — Ringkasan KPI per Perusahaan",
    col.names   = c("Company", "Jml Karyawan", "Avg Salary (IDR)",
                    "Avg KPI", "Top Performers (KPI >= 90)"),
    align       = c("c", "c", "r", "r", "c"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#0277BD") %>%
  column_spec(3, color = "#1B5E20", bold = TRUE) %>%
  column_spec(4, bold = TRUE,
              color = ifelse(company_kpi_summary$Avg_KPI >= 70, "#1B5E20", "#E65100")) %>%
  column_spec(5, bold = TRUE, color = "white",
              background = ifelse(company_kpi_summary$Top_Performers >= 5,
                                  "#1B5E20", "#E53935"))
Tabel 7.2 — Ringkasan KPI per Perusahaan
Company  Jml Karyawan  Avg Salary (IDR)  Avg KPI  Top Performers (KPI >= 90)
CO1               153          7,553.79    67.93                           6
CO2               192          7,986.33    68.65                          13
CO3               164          8,488.27    69.36                          12
CO4               148          8,810.85    70.31                           9
CO5               114          9,173.87    68.52                           3
CO6                75          9,916.63    70.64                           3
CO7               142         10,624.84    68.96                          11
# ── KPI Tier Distribution ─────────────────────────────────────────────────────
tier_order   <- c("Tier 1 - Elite", "Tier 2 - High", "Tier 3 - Average", "Tier 4 - Low")
tier_summary <- dash_df %>%
  group_by(kpi_tier) %>%
  summarise(
    Jumlah     = n(),
    Avg_KPI    = round(mean(KPI_score), 2),
    Avg_Salary = round(mean(salary), 2),
    .groups    = "drop"
  ) %>%
  mutate(
    Persentase = round(Jumlah / sum(Jumlah) * 100, 1),
    kpi_tier   = factor(kpi_tier, levels = tier_order)
  ) %>%
  arrange(kpi_tier)

tier_summary %>%
  kable(
    caption     = "Tabel 7.3 — Distribusi Karyawan per KPI Tier (seluruh perusahaan)",
    col.names   = c("KPI Tier", "Jumlah", "Avg KPI", "Avg Salary (IDR)", "Persentase (%)"),
    align       = c("l", "c", "r", "r", "c"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2, color = "#1565C0", bold = TRUE) %>%
  column_spec(3, bold = TRUE,
              color = c("#1B5E20", "#388E3C", "#E65100", "#C62828")) %>%
  column_spec(4, color = "#424242") %>%
  column_spec(5, color = "#6A1B9A") %>%
  row_spec(1, background = "#E8F5E9") %>%
  row_spec(4, background = "#FFEBEE")
Tabel 7.3 — Distribusi Karyawan per KPI Tier (seluruh perusahaan)
KPI Tier          Jumlah  Avg KPI  Avg Salary (IDR)  Persentase (%)
Tier 1 - Elite        57    93.34          9,132.32             5.8
Tier 2 - High        281    81.12          8,703.80            28.4
Tier 3 - Average     397    67.77          8,721.79            40.2
Tier 4 - Low         253    52.31          8,911.54            25.6
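
The tier loop in this section can also be written without an explicit loop. The sketch below shows one possible vectorized alternative using base R's cut(), checked against a loop-style classifier on made-up scores. The names kpi_demo, classify_one, and tier_demo are hypothetical; the breakpoints are copied from the thresholds in the loop.

```r
kpi_demo <- c(95, 90, 89.9, 75, 74.9, 60, 59.9)  # made-up scores around each threshold

# Loop-style reference: same if/else chain as the tier loop, for a single value.
classify_one <- function(kpi) {
  if (kpi >= 90) {
    "Tier 1 - Elite"
  } else if (kpi >= 75) {
    "Tier 2 - High"
  } else if (kpi >= 60) {
    "Tier 3 - Average"
  } else {
    "Tier 4 - Low"
  }
}
tier_loop <- vapply(kpi_demo, classify_one, character(1))

# Vectorized version: right = FALSE makes each interval [lower, upper),
# so a score of exactly 90 falls in Tier 1 and exactly 60 in Tier 3.
tier_demo <- cut(
  kpi_demo,
  breaks = c(-Inf, 60, 75, 90, Inf),
  labels = c("Tier 4 - Low", "Tier 3 - Average", "Tier 2 - High", "Tier 1 - Elite"),
  right  = FALSE
)

identical(as.character(tier_demo), tier_loop)  # both approaches agree
table(tier_demo)
```

cut() returns a factor, so the tiers carry an explicit order that can be reused when sorting summary tables or arranging plot axes.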

8.3 Visualizations

p_avg_kpi <- ggplot(company_kpi_summary,
                    aes(x = company_id, y = Avg_KPI, fill = company_id)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = round(Avg_KPI, 1)), vjust = -0.4, fontface = "bold", size = 4) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Average KPI Score per Company", x = "Company", y = "Avg KPI") +
  theme_minimal(base_size = 13) +
  ylim(0, 100)

p_top <- ggplot(company_kpi_summary,
                aes(x = company_id, y = Top_Performers, fill = company_id)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = Top_Performers), vjust = -0.4, fontface = "bold", size = 4) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Top Performers (KPI >= 90) per Company",
       x = "Company", y = "Count") +
  theme_minimal(base_size = 13) +
  ylim(0, max(company_kpi_summary$Top_Performers) * 1.2)

grid.arrange(p_avg_kpi, p_top, ncol = 2)

dept_summary <- dash_df %>%
  group_by(company_id, department) %>%
  summarise(Avg_Salary = round(mean(salary), 2), .groups = "drop")

ggplot(dept_summary, aes(x = department, y = Avg_Salary, fill = company_id)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  scale_fill_brewer(palette = "Dark2", name = "Company") +
  labs(
    title    = "Average Salary by Department and Company",
    subtitle = "Grouped Bar Chart",
    x        = "Department",
    y        = "Avg Salary"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x     = element_text(angle = 30, hjust = 1),
    legend.position = "bottom"
  )

p_scatter <- ggplot(dash_df, aes(x = performance_score, y = KPI_score,
                                  color = company_id)) +
  geom_point(alpha = 0.35, size = 1.5) +
  geom_smooth(aes(group = 1), method = "lm", color = "black",
              se = TRUE, linewidth = 1.2) +
  scale_color_brewer(palette = "Set2", name = "Company") +
  labs(
    title    = "Performance Score vs KPI Score",
    subtitle = "Scatter plot with regression line",
    x        = "Performance Score",
    y        = "KPI Score"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

p_sal_tier <- ggplot(dash_df, aes(x = kpi_tier, y = salary, fill = kpi_tier)) +
  geom_boxplot(show.legend = FALSE, alpha = 0.8, outlier.size = 1) +
  scale_fill_manual(values = c(
    "Tier 1 - Elite"   = "#1B5E20",
    "Tier 2 - High"    = "#388E3C",
    "Tier 3 - Average" = "#FBC02D",
    "Tier 4 - Low"     = "#C62828"
  )) +
  labs(title = "Salary Distribution by KPI Tier",
       x = "KPI Tier", y = "Salary") +
  theme_minimal(base_size = 13) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

grid.arrange(p_scatter, p_sal_tier, ncol = 2)

8.4 Interpretation

The KPI dashboard covers 7 companies and 988 employees. Average KPI differs little between companies (67.93 to 70.64 in Tabel 7.2), which indicates broadly balanced performance. The scatter plot of performance score against KPI score shows a clear positive correlation: the higher the performance score, the higher the KPI score, with some noise because KPI is also influenced by other factors. In the boxplot of salary by KPI tier, the salary distributions do not differ drastically between tiers (average salary ranges from 8,703.80 to 9,132.32 in Tabel 7.3), because in this dataset salary was generated with variation that does not depend directly on KPI. The differences in average salary between departments in the grouped bar chart could be a starting point for further analysis of compensation policy.
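
The correlation and the per-department comparison can also be checked numerically rather than read off the plots. The sketch below does this in base R on synthetic data. The data frame demo, its three departments, and the 0.8 slope used to generate KPI_score are assumptions for illustration; with the real data, dash_df would take the place of demo.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Synthetic stand-in for dash_df; all generating parameters are assumptions.
demo <- data.frame(
  department        = sample(c("IT", "Legal", "R&D"), 300, replace = TRUE),
  performance_score = runif(300, 40, 100)
)
demo$KPI_score <- pmin(100, pmax(0, 0.8 * demo$performance_score + rnorm(300, 10, 8)))

# Strength and direction of the linear relationship.
r     <- cor(demo$performance_score, demo$KPI_score)
fit   <- lm(KPI_score ~ performance_score, data = demo)
slope <- coef(fit)[["performance_score"]]
r
slope

# Average KPI per department, then the highest and lowest department.
dept_kpi <- aggregate(KPI_score ~ department, data = demo, FUN = mean)
dept_kpi[which.max(dept_kpi$KPI_score), ]
dept_kpi[which.min(dept_kpi$KPI_score), ]
```

The fitted slope is what geom_smooth(method = "lm") draws in the scatter plot, and the sign of r confirms the direction of the relationship.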

9 Conclusion

This practicum applied functions, loops, conditionals, and visualization across seven data science tasks. Functions make code reusable, nested loops handle hierarchical data, conditionals enable dynamic decision-making, and visualizations present insights intuitively. Together, these four form an important foundation for building efficient, well-structured data science workflows.