Assignment Week 5 ~ Functions and loops

Lulu Najla Salsabila

INSTITUT TEKNOLOGI SAINS BANDUNG

1 Introduction

This practicum implements functions, loops, and conditionals to solve a range of data science problems. The tasks progress from basic to advanced and cover data simulation, data transformation, statistical analysis, and interactive visualization. The practicum also emphasizes automating data science workflows in R.

2 Task 1 — Dynamic Multi-Formula Function

Objective: Build a flexible function that can evaluate several kinds of mathematical formulas.

Description:

The function compute_formula(x, formula) takes a value x and a formula type (“linear”, “quadratic”, “cubic”, or “exponential”) and returns the result of the selected formula. It validates the input first and stops with an error if the formula name is not one of the four. A nested loop then evaluates all four formulas for x = 1:20, and the results are drawn in a single overlay plot so that the growth pattern of each function can be compared directly.

Output:

Table of results for each formula
Combined plot (linear, quadratic, cubic, exponential) in one figure

2.1 Function & Computation

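# ── Packages used throughout the report ─────────────────────────────────────
library(dplyr)       # %>%, group_by(), summarise(), mutate()
library(tidyr)       # pivot_wider()
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), column_spec(), row_spec()
library(ggplot2)
library(gridExtra)   # grid.arrange()

# ── Function definition ─────────────────────────────────────────────────────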
compute_formula <- function(x, formula) {
  valid_formulas <- c("linear", "quadratic", "cubic", "exponential")
  if (!formula %in% valid_formulas) {
    stop(paste("Invalid formula. Choose from:", paste(valid_formulas, collapse = ", ")))
  }
  switch(formula,
    "linear"      = 2 * x + 3,
    "quadratic"   = x^2 - 3 * x + 2,
    "cubic"       = x^3 - 4 * x^2 + x + 6,
    "exponential" = exp(0.3 * x)
  )
}

# ── Evaluate every formula for x = 1:20 with a nested loop ──────────────────
x_values   <- 1:20
formulas   <- c("linear", "quadratic", "cubic", "exponential")
results_df <- data.frame()

for (x in x_values) {
  for (f in formulas) {
    y          <- compute_formula(x, f)
    results_df <- rbind(results_df, data.frame(x = x, formula = f, y = y))
  }
}

# ── Pivot table: one column per formula ─────────────────────────────────────
results_wide        <- tidyr::pivot_wider(results_df, names_from = formula, values_from = y)
results_wide[, 2:5] <- round(results_wide[, 2:5], 3)

# pivot_wider() keeps the new columns in order of first appearance:
# x, linear, quadratic, cubic, exponential
results_wide %>%
  kable(
    caption   = "Table 1.1: Values of f(x) for Each Formula (x = 1 to 20)",
    col.names = c("x", "Linear", "Quadratic", "Cubic", "Exponential"),
    align     = c("c", "r", "r", "r", "r")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = TRUE,
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#424242") %>%
  column_spec(2, color = "#2196F3") %>%
  column_spec(3, color = "#4CAF50") %>%
  column_spec(4, color = "#FF5722") %>%
  column_spec(5, color = "#9C27B0", bold = TRUE)

Table 1.1: Values of f(x) for Each Formula (x = 1 to 20)
x	Linear	Quadratic	Cubic	Exponential
1	5	0	4	1.350
2	7	0	0	1.822
3	9	2	0	2.460
4	11	6	10	3.320
5	13	12	36	4.482
6	15	20	84	6.050
7	17	30	160	8.166
8	19	42	270	11.023
9	21	56	420	14.880
10	23	72	616	20.086
11	25	90	864	27.113
12	27	110	1170	36.598
13	29	132	1540	49.402
14	31	156	1980	66.686
15	33	182	2496	90.017
16	35	210	3094	121.510
17	37	240	3780	164.022
18	39	272	4560	221.406
19	41	306	5440	298.867
20	43	342	6426	403.429

2.2 Visualization

ggplot(results_df, aes(x = x, y = y, color = formula)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2, alpha = 0.7) +
  scale_color_manual(values = c(
    "linear"      = "#2196F3",
    "quadratic"   = "#4CAF50",
    "cubic"       = "#FF5722",
    "exponential" = "#9C27B0"
  )) +
  labs(
    title    = "Comparison of Mathematical Formulas (x = 1 to 20)",
    subtitle = "Linear, Quadratic, Cubic, and Exponential",
    x        = "x",
    y        = "f(x)",
    color    = "Formula"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

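2.3 Interpretation

The plot shows four very different growth patterns. The linear formula (2x + 3) rises at a constant, gentle rate, from 5 at x = 1 to 43 at x = 20. The quadratic formula (x^2 - 3x + 2) is a parabola with its minimum at x = 1.5, so it equals 0 at both x = 1 and x = 2 and rises from there to 342 at x = 20. The cubic formula has a more complex shape because of its third degree: it dips from 4 at x = 1 to 0 at x = 2 and x = 3, then grows fastest of the four over this range, reaching 6,426 at x = 20. The exponential formula (e^(0.3x)) starts slowest and reaches only 403.429 at x = 20, where it has just overtaken the quadratic. On x = 1:20 it stays far below the cubic, although an exponential eventually outgrows any polynomial once x is large enough. The comparison shows how strongly the behavior of a function depends on its type, especially as x grows.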
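To see how the ranking of the formulas changes beyond the plotted range, the sketch below restates the cubic and exponential formulas from compute_formula() and extends the grid to x = 1:60. The grid limit and the names cubic_fn, exp_fn, and x_grid are assumptions made for illustration; they are not part of the original analysis.

```r
# Formulas restated from compute_formula() so the snippet runs on its own
cubic_fn <- function(x) x^3 - 4 * x^2 + x + 6
exp_fn   <- function(x) exp(0.3 * x)

# Assumption: extend the grid well past the report's x = 1:20
x_grid <- 1:60

# TRUE wherever the exponential value exceeds the cubic value
exp_leads <- exp_fn(x_grid) > cubic_fn(x_grid)

# Last grid point at which the cubic is still at least as large
last_cubic_lead <- max(x_grid[!exp_leads])
first_exp_lead  <- last_cubic_lead + 1

cat("Cubic still leads at x =", last_cubic_lead, "\n")
cat("Exponential leads from x =", first_exp_lead, "onward\n")
```

On this grid the cubic stays ahead through x = 35 and the exponential leads from x = 36, well outside the plotted range.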

3 Task 2 — Nested Simulation: Multi-Sales & Discounts

Objective: Simulate sales data for multiple salespeople with a dynamic discount scheme.

Description:

The simulated dataset holds randomly generated daily sales for each salesperson over a fixed number of days, with a discount rate assigned conditionally on the sale amount (the larger the sale, the larger the discount). In this report the dataset is loaded from sales_data.csv. The function simulate_sales(df) loops over every salesperson and calls the nested helper get_cumulative_sales(), which applies the discount to each day's sale and accumulates net sales per salesperson. Summary statistics such as total sales, average discount, and the highest single-day net sale are then reported.

Output:

Simulated dataset (Sales ID, Day, Sales Amount, Discount Rate)
Summary statistics per salesperson
Cumulative sales plot per salesperson
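The report loads its sales data from sales_data.csv rather than generating it inline. A minimal sketch of a generator that takes n_salesperson and days might look like the following. The uniform range of 100 to 2000, the discount thresholds (500, 1000, 1500), the seed argument, and the names assign_discount and simulate_sales_sketch are assumptions chosen to agree with the preview rows of the dataset; they are not the author's actual code.

```r
# Hypothetical helper: the discount rate grows with the sale amount
# (thresholds are assumptions, not taken from the original report)
assign_discount <- function(amount) {
  if (amount >= 1500) {
    0.20
  } else if (amount >= 1000) {
    0.15
  } else if (amount >= 500) {
    0.10
  } else {
    0.05
  }
}

# Sketch of a generator: one row per salesperson per day
simulate_sales_sketch <- function(n_salesperson, days, seed = 42) {
  set.seed(seed)  # fixed seed so the sketch is reproducible
  rows <- vector("list", n_salesperson * days)
  k <- 1
  for (sid in seq_len(n_salesperson)) {   # outer loop: salesperson
    for (d in seq_len(days)) {            # inner loop: day
      amount <- round(runif(1, min = 100, max = 2000), 2)
      rows[[k]] <- data.frame(
        sales_id      = sid,
        day           = d,
        sales_amount  = amount,
        discount_rate = assign_discount(amount)
      )
      k <- k + 1
    }
  }
  do.call(rbind, rows)
}

sketch_df <- simulate_sales_sketch(n_salesperson = 5, days = 30)
dim(sketch_df)
```

Collecting the rows in a preallocated list and binding them once at the end avoids repeatedly copying a growing data frame inside the nested loop.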

3.1 Load Data

sales_df <- read.csv("sales_data.csv")
cat("Data dimensions:", nrow(sales_df), "rows,", ncol(sales_df), "columns\n")

## Data dimensions: 150 rows, 4 columns

head(sales_df, 10) %>%
  kable(
    caption     = "Table 2.1: Preview of sales_data.csv (first 10 rows)",
    col.names   = c("Sales ID", "Day", "Sales Amount", "Discount Rate"),
    align       = c("c", "c", "r", "r"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#1565C0") %>%
  column_spec(3, color = "#1B5E20", bold = TRUE) %>%
  column_spec(4, color = "#E65100")

Table 2.1: Preview of sales_data.csv (first 10 rows)
Sales ID	Day	Sales Amount	Discount Rate
1	1	1,314.91	0.15
1	2	147.52	0.05
1	3	622.56	0.10
1	4	524.10	0.10
1	5	1,499.30	0.15
1	6	1,385.73	0.15
1	7	1,795.14	0.20
1	8	265.18	0.05
1	9	901.65	0.10
1	10	156.61	0.05

3.2 Functions & Simulation

# ── Nested functions: net sales and cumulative total per salesperson ────────
apply_discount <- function(amount, rate) {
  amount * (1 - rate)
}

get_cumulative_sales <- function(df, salesperson_id) {
  sub_df                <- df[df$sales_id == salesperson_id, ]
  sub_df                <- sub_df[order(sub_df$day), ]
  sub_df$net_sales      <- round(mapply(apply_discount,
                                        sub_df$sales_amount,
                                        sub_df$discount_rate), 2)
  sub_df$cumulative_net <- cumsum(sub_df$net_sales)
  return(sub_df)
}

simulate_sales <- function(df) {
  all_cumulative  <- data.frame()
  salesperson_ids <- unique(df$sales_id)
  for (sid in salesperson_ids) {
    all_cumulative <- rbind(all_cumulative, get_cumulative_sales(df, sid))
  }
  return(all_cumulative)
}

# ── Run the simulation ───────────────────────────────────────────────────────
sim_result <- simulate_sales(sales_df)

head(sim_result, 10) %>%
  kable(
    caption     = "Table 2.2: Simulated Net Sales and Cumulative Net (first 10 rows)",
    col.names   = c("Sales ID", "Day", "Sales Amount", "Discount Rate",
                    "Net Sales", "Cumulative Net"),
    align       = c("c", "c", "r", "r", "r", "r"),
    digits      = 2,
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#1565C0") %>%
  column_spec(5, color = "#1B5E20", bold = TRUE) %>%
  column_spec(6, color = "#6A1B9A", bold = TRUE)

Table 2.2: Simulated Net Sales and Cumulative Net (first 10 rows)
Sales ID	Day	Sales Amount	Discount Rate	Net Sales	Cumulative Net
1	1	1,314.91	0.15	1,117.67	1,117.67
1	2	147.52	0.05	140.14	1,257.81
1	3	622.56	0.10	560.30	1,818.11
1	4	524.10	0.10	471.69	2,289.80
1	5	1,499.30	0.15	1,274.40	3,564.20
1	6	1,385.73	0.15	1,177.87	4,742.07
1	7	1,795.14	0.20	1,436.11	6,178.18
1	8	265.18	0.05	251.92	6,430.10
1	9	901.65	0.10	811.48	7,241.58
1	10	156.61	0.05	148.78	7,390.36

3.3 Summary Statistics

summary_sales <- sim_result %>%
  group_by(sales_id) %>%
  summarise(
    Total_Gross  = round(sum(sales_amount), 2),
    Total_Net    = round(sum(net_sales), 2),
    Avg_Discount = round(mean(discount_rate) * 100, 1),
    Max_Net_Day  = round(max(net_sales), 2),
    .groups      = "drop"
  )

summary_sales %>%
  kable(
    caption     = "Table 2.3: Sales Summary per Salesperson",
    col.names   = c("Sales ID", "Total Gross (IDR)", "Total Net (IDR)",
                    "Avg Discount (%)", "Max Net/Day"),
    align       = c("c", "r", "r", "c", "r"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#1565C0") %>%
  column_spec(2, color = "#424242") %>%
  column_spec(3, bold = TRUE, color = "#1B5E20") %>%
  column_spec(4, color = "#E65100")

Table 2.3: Sales Summary per Salesperson
Sales ID	Total Gross (IDR)	Total Net (IDR)	Avg Discount (%)	Max Net/Day
1	27,150.32	23,098.05	11.7	1,534.96
2	30,183.54	25,728.56	12.3	1,559.14
3	31,852.49	26,809.95	13.3	1,596.26
4	33,147.16	27,901.67	13.2	1,594.10
5	30,854.48	25,759.86	13.3	1,519.36

3.4 Visualization

ggplot(sim_result, aes(x = day, y = cumulative_net,
                       color = factor(sales_id), group = sales_id)) +
  geom_line(linewidth = 1.1) +
  scale_color_brewer(palette = "Set1", name = "Salesperson ID") +
  labs(
    title    = "Cumulative Net Sales per Salesperson (30 Days)",
    subtitle = "After applying conditional discount rates",
    x        = "Day",
    y        = "Cumulative Net Sales (IDR)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

3.5 Interpretation

Over the 30 simulated days, every salesperson's cumulative curve rises steadily, as expected when sales occur every day. Because the discount grows with the sale amount, total net sales end up roughly 15 to 17 percent below total gross (for example 23,098.05 against 27,150.32 for salesperson 1). That gap is larger than the simple average discount rate of 11.7 to 13.3 percent because the largest sales carry the largest discounts. The upward pattern itself is unchanged. Differences in slope between salespeople reflect how often each one closed high-value sales.

4 Task 3 — Multi-Level Performance Categorization

Objective: Group sales performance into tiered categories.

Description:

The function categorize_performance(sales_amount) takes a vector of sales amounts and assigns each one to one of five categories: Excellent (≥ 1800), Very Good (1400–1799), Good (900–1399), Average (400–899), Poor (< 400). A loop processes each element of the vector. After categorization, the share of records in each category is computed as a percentage. The result is shown as a bar plot and a pie chart to make the distribution of performance easier to interpret.

Output:

Frequency and percentage table per category
Bar plot of the category distribution
Pie chart of the category proportions

4.1 Function & Categorization

# ── Performance categorization function ──────────────────────────────────────
categorize_performance <- function(sales_amount) {
  categories <- c()
  for (amt in sales_amount) {
    cat_label <- if (amt >= 1800) {
      "Excellent"
    } else if (amt >= 1400) {
      "Very Good"
    } else if (amt >= 900) {
      "Good"
    } else if (amt >= 400) {
      "Average"
    } else {
      "Poor"
    }
    categories <- c(categories, cat_label)
  }
  return(categories)
}

# ── Apply to the data and compute percentages ────────────────────────────────
sales_df$performance_cat <- categorize_performance(sales_df$sales_amount)

cat_order   <- c("Excellent", "Very Good", "Good", "Average", "Poor")
cat_summary <- sales_df %>%
  group_by(performance_cat) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(percentage = round(count / sum(count) * 100, 1)) %>%
  arrange(factor(performance_cat, levels = cat_order))

cat_summary %>%
  kable(
    caption   = "Table 3.1: Distribution of Sales Performance Categories",
    col.names = c("Category", "Count", "Percentage (%)"),
    align     = c("l", "c", "c")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2, color = "#1565C0", bold = TRUE) %>%
  column_spec(3, color = "#6A1B9A") %>%
  row_spec(which(cat_summary$performance_cat == "Excellent"), background = "#E8F5E9") %>%
  row_spec(which(cat_summary$performance_cat == "Poor"),      background = "#FFEBEE")

Table 3.1: Distribution of Sales Performance Categories
Category	Count	Percentage (%)
Excellent	14	9.3
Very Good	30	20.0
Good	40	26.7
Average	41	27.3
Poor	25	16.7
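The element-by-element loop in categorize_performance() can also be written without a loop. The sketch below uses base R's cut() with the same thresholds as the code above (400, 900, 1400, 1800). The name categorize_performance_vec and the sample amounts are assumptions for illustration.

```r
# Vectorized variant; thresholds copied from categorize_performance()
categorize_performance_vec <- function(sales_amount) {
  as.character(cut(
    sales_amount,
    breaks = c(-Inf, 400, 900, 1400, 1800, Inf),
    labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
    right  = FALSE   # intervals are [lower, upper), matching the >= tests
  ))
}

# Hypothetical amounts, including values exactly on each threshold
sample_amounts <- c(399.99, 400, 899, 900, 1400, 1799.5, 1800, 1950)
categorize_performance_vec(sample_amounts)
```

Setting right = FALSE matters here: with the default right-closed intervals, a value of exactly 400 would fall into "Poor" instead of "Average".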

4.2 Visualization

palette_cat <- c(
  "Excellent" = "#1B5E20",
  "Very Good" = "#388E3C",
  "Good"      = "#FBC02D",
  "Average"   = "#F57C00",
  "Poor"      = "#C62828"
)

cat_summary$performance_cat <- factor(cat_summary$performance_cat, levels = cat_order)

p_bar <- ggplot(cat_summary, aes(x = performance_cat, y = count,
                                  fill = performance_cat)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = paste0(count, "\n(", percentage, "%)")),
            vjust = -0.3, size = 3.8, fontface = "bold") +
  scale_fill_manual(values = palette_cat) +
  labs(title = "Sales Performance Distribution — Bar Chart",
       x = "Category", y = "Count") +
  theme_minimal(base_size = 13) +
  ylim(0, max(cat_summary$count) * 1.2)

p_pie <- ggplot(cat_summary, aes(x = "", y = count, fill = performance_cat)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y") +
  scale_fill_manual(values = palette_cat, name = "Category") +
  geom_text(aes(label = paste0(percentage, "%")),
            position = position_stack(vjust = 0.5),
            size = 4, color = "white", fontface = "bold") +
  labs(title = "Sales Performance Distribution — Pie Chart") +
  theme_void(base_size = 13)

grid.arrange(p_bar, p_pie, ncol = 2)

4.3 Interpretation

The 150 daily sales records spread across all five categories, which is expected because the amounts were generated uniformly at random between 100 and 2000. Good and Average are the largest groups (26.7% and 27.3%) because they cover the widest intervals (900 to 1399 and 400 to 899, 500 units each). Excellent (9.3%) and Poor (16.7%) are smaller because their intervals at the two extremes are narrower (200 and 300 units). With uniform data, the share in each category mainly reflects the width of its interval. The resulting picture still resembles a typical sales profile, where most transactions fall in the middle range and only a small share are exceptionally high or very low.
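If the amounts are uniform on 100 to 2000, as stated above, the expected share of each category is its interval width divided by 1900. The short calculation below makes that explicit, using the thresholds from categorize_performance(); the variable names are illustrative.

```r
# Category boundaries on the assumed uniform range [100, 2000]
breaks <- c(100, 400, 900, 1400, 1800, 2000)
labels <- c("Poor", "Average", "Good", "Very Good", "Excellent")

# Expected share (%) = interval width / total range
expected_pct <- round(diff(breaks) / (2000 - 100) * 100, 1)
names(expected_pct) <- labels
expected_pct
```

The expected shares (15.8%, 26.3%, 26.3%, 21.1%, and 10.5% for Poor through Excellent) are close to the observed 16.7%, 27.3%, 26.7%, 20.0%, and 9.3% in the frequency table above.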

5 Task 4 — Multi-Company Dataset Simulation

Objective: Simulate employee data for multiple companies with a full set of attributes.

Description:

The function generate_company_data(n_company, n_employees) produces a dataset with the columns company_id, employee_id, salary, department, performance_score (0–100), and KPI_score (0–100), using an outer loop over companies and an inner loop over employees. In this report the resulting data are loaded from company_data.csv. The function generate_company_summary(df) then loops over each company and each of its employees, applies a conditional to count top performers (KPI > 90), and reports the number of employees, average salary, average performance score, maximum KPI, and number of top performers per company.

Output:

Simulated employee dataset
Summary table per company
Bar plot of average salary per company
Bar plot of top performers per company
5.1 Load Data

company_df <- read.csv("company_data.csv")
cat("Dimensions:", nrow(company_df), "rows,", ncol(company_df), "columns\n")

## Dimensions: 160 rows, 6 columns

head(company_df, 10) %>%
  kable(
    caption     = "Table 4.1: Preview of company_data.csv (first 10 rows)",
    col.names   = c("Company ID", "Employee ID", "Salary", "Department",
                    "Performance Score", "KPI Score"),
    align       = c("c", "c", "r", "l", "r", "r"),
    digits      = 2,
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#0277BD") %>%
  column_spec(4, italic = TRUE, color = "#4A148C") %>%
  column_spec(5, color = "#1B5E20") %>%
  column_spec(6, bold = TRUE, color = "#E65100")

Table 4.1: Preview of company_data.csv (first 10 rows)
Company ID	Employee ID	Salary	Department	Performance Score	KPI Score
1	1	8,257.20	Marketing	75.9	45.4
1	2	3,768.31	Finance	51.1	64.9
1	3	5,642.61	Operations	53.5	68.4
1	4	3,808.80	Finance	51.6	54.9
1	5	5,856.06	Operations	83.4	49.6
1	6	11,680.23	Finance	94.1	93.6
1	7	12,415.43	Finance	90.4	87.9
1	8	10,907.79	HR	67.7	58.5
1	9	11,080.37	HR	99.2	86.5
1	10	11,738.56	Engineering	90.0	86.6

5.2 Function & Summary

# ── Nested loops + conditional: summary per company ──────────────────────────
generate_company_summary <- function(df) {
  company_ids  <- unique(df$company_id)
  summary_list <- list()

  for (cid in company_ids) {
    sub_df         <- df[df$company_id == cid, ]
    n_emp          <- nrow(sub_df)
    top_performers <- 0

    for (i in seq_len(n_emp)) {  # seq_len() skips the loop safely when n_emp is 0
      if (sub_df$KPI_score[i] > 90) {
        top_performers <- top_performers + 1
      }
    }

    summary_list[[length(summary_list) + 1]] <- data.frame(
      Company         = paste("Company", cid),
      N_Employees     = n_emp,
      Avg_Salary      = round(mean(sub_df$salary), 2),
      Avg_Performance = round(mean(sub_df$performance_score), 2),
      Max_KPI         = round(max(sub_df$KPI_score), 2),
      Top_Performers  = top_performers
    )
  }
  return(do.call(rbind, summary_list))
}

company_summary <- generate_company_summary(company_df)

company_summary %>%
  kable(
    caption     = "Table 4.2: Per-Company Summary of Salary, Performance, and Top Performers",
    col.names   = c("Company", "Employees", "Avg Salary (IDR)",
                    "Avg Performance", "Max KPI", "Top Performers"),
    align       = c("l", "c", "r", "r", "r", "c"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(3, color = "#1565C0", bold = TRUE) %>%
  column_spec(5, color = "#E65100", bold = TRUE) %>%
  column_spec(6, bold = TRUE, color = "white",
              background = ifelse(company_summary$Top_Performers >= 3,
                                  "#1B5E20", "#E53935"))

Table 4.2: Per-Company Summary of Salary, Performance, and Top Performers
Company	Employees	Avg Salary (IDR)	Avg Performance	Max KPI	Top Performers
Company 1	40	8,091.14	77.09	99.9	6
Company 2	40	8,983.58	78.89	98.4	12
Company 3	40	8,568.23	75.11	99.5	10
Company 4	40	9,087.17	74.92	99.5	7

5.3 Visualization

p1 <- ggplot(company_summary, aes(x = Company, y = Avg_Salary, fill = Company)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = scales::comma(Avg_Salary)), vjust = -0.4, size = 3.5) +
  scale_fill_brewer(palette = "Blues", direction = -1) +
  labs(title = "Average Salary per Company", x = "", y = "Avg Salary") +
  theme_minimal(base_size = 13) +
  ylim(0, max(company_summary$Avg_Salary) * 1.15)

p2 <- ggplot(company_summary, aes(x = Company, y = Top_Performers, fill = Company)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = Top_Performers), vjust = -0.4, size = 4, fontface = "bold") +
  scale_fill_brewer(palette = "Oranges", direction = -1) +
  labs(title = "Top Performers (KPI > 90) per Company", x = "", y = "Count") +
  theme_minimal(base_size = 13) +
  ylim(0, max(company_summary$Top_Performers) * 1.2)

grid.arrange(p1, p2, ncol = 2)

5.4 Interpretation

The per-company summary shows clear variation between companies. Average salaries differ (from 8,091.14 to 9,087.17) because the salaries were generated at random over a wide range (3000 to 15000). The number of top performers (KPI > 90) also varies, from 6 in Company 1 to 12 in Company 2; this depends on how many employees have a performance score above 85. Max KPI, by contrast, is nearly identical across companies (98.4 to 99.9) and does not track the number of top performers: Company 2 has the most top performers but the lowest Max KPI. Taken together, the results show the kind of heterogeneous performance distribution that is common in practice.

6 Task 5 — Monte Carlo Simulation: π & Probability

Objective: Estimate the value of π with the Monte Carlo method and analyze a related probability.

Description:

The function monte_carlo_pi(n_points) generates n_points random points (x, y), with each coordinate uniform on [-1, 1]. A point lies inside the unit circle centered at the origin (0, 0) when x^2 + y^2 <= 1. The value of π is estimated with:

\[ \pi \approx 4 \times \frac{\text{number of points inside the circle}}{\text{total number of points}} \]

The function also computes the probability that a random point falls inside a sub-area, here the central square |x| <= 0.5, |y| <= 0.5. The simulation is run in a loop over several values of n_points to observe how the estimate converges. A scatter plot distinguishes points inside the circle (blue) from points outside it (red).

Output:

Estimate of π
Probability of a point falling in the sub-area
Scatter plot of points inside vs outside the circle

6.1 Function & Simulation

# ── Monte Carlo function ──────────────────────────────────────────────────────
monte_carlo_pi <- function(n_points) {
  set.seed(123)  # reset on every call: reproducible, but runs for different n share one random stream
  x <- runif(n_points, -1, 1)
  y <- runif(n_points, -1, 1)

  inside_circle  <- (x^2 + y^2) <= 1
  pi_estimate    <- 4 * sum(inside_circle) / n_points

  in_subsquare   <- abs(x) <= 0.5 & abs(y) <= 0.5
  prob_subsquare <- sum(in_subsquare) / n_points

  return(list(
    pi_estimate    = pi_estimate,
    prob_subsquare = prob_subsquare,
    x              = x,
    y              = y,
    inside_circle  = inside_circle
  ))
}

# ── Convergence across several sample sizes n ────────────────────────────────
n_values   <- c(100, 500, 1000, 5000, 10000)
pi_results <- data.frame()

for (n in n_values) {
  res        <- monte_carlo_pi(n)
  pi_results <- rbind(pi_results, data.frame(
    n_points       = n,
    pi_estimate    = round(res$pi_estimate, 5),
    error          = round(abs(res$pi_estimate - pi), 5),
    prob_subsquare = round(res$prob_subsquare, 4)
  ))
}

cat("True pi:", pi, "\n")

## True pi: 3.141593

pi_results %>%
  kable(
    caption     = "Table 5.1: Monte Carlo Estimates of π (various sample sizes)",
    col.names   = c("N Points", "π Estimate", "Abs. Error vs True π", "P(Sub-Square)"),
    align       = c("r", "r", "r", "r"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "#1565C0") %>%
  column_spec(2, bold = TRUE) %>%
  column_spec(3, bold = TRUE,
              color = ifelse(pi_results$error < 0.01, "#1B5E20", "#E65100")) %>%
  row_spec(nrow(pi_results), background = "#E8F5E9")

Table 5.1: Monte Carlo Estimates of π (various sample sizes)
N Points	π Estimate	Abs. Error vs True π	P(Sub-Square)
100	3.4000	0.25841	0.2600
500	3.2000	0.05841	0.2400
1,000	3.2000	0.05841	0.2650
5,000	3.1632	0.02161	0.2630
10,000	3.1576	0.01601	0.2507

6.2 Visualization

res_plot <- monte_carlo_pi(2000)
plot_df  <- data.frame(
  x      = res_plot$x,
  y      = res_plot$y,
  status = ifelse(res_plot$inside_circle, "Inside Circle", "Outside Circle")
)

p_scatter <- ggplot(plot_df, aes(x = x, y = y, color = status)) +
  geom_point(alpha = 0.5, size = 0.8) +
  scale_color_manual(values = c("Inside Circle"  = "#1565C0",
                                "Outside Circle" = "#E53935")) +
  coord_fixed() +
  annotate("path",
           x = cos(seq(0, 2 * pi, length.out = 300)),
           y = sin(seq(0, 2 * pi, length.out = 300)),
           color = "black", linewidth = 0.8) +
  labs(
    title    = "Monte Carlo Simulation (n = 2000)",
    subtitle = paste0("pi estimate = ", round(res_plot$pi_estimate, 4)),
    color    = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

p_conv <- ggplot(pi_results, aes(x = n_points, y = pi_estimate)) +
  geom_line(color = "#1565C0", linewidth = 1.2) +
  geom_point(color = "#E53935", size = 3) +
  geom_hline(yintercept = pi, linetype = "dashed", color = "gray40") +
  geom_text(aes(label = round(pi_estimate, 4)), vjust = -0.8, size = 3.5) +
  annotate("text", x = max(pi_results$n_points) * 0.8, y = pi + 0.01,
           label = "True pi", color = "gray40") +
  scale_x_log10(labels = scales::comma) +
  labs(
    title = "Convergence of pi Estimate vs Sample Size",
    x     = "Number of Points (log scale)",
    y     = "pi Estimate"
  ) +
  theme_minimal(base_size = 13)

grid.arrange(p_scatter, p_conv, ncol = 2)

6.3 Interpretation

The Monte Carlo simulation throws random points into a square and counts how many land inside the inscribed circle. Multiplying that ratio by 4 gives an estimate of π. The estimate moves closer to the true value (3.14159) as n grows: the absolute error falls from 0.258 at n = 100 to 0.016 at n = 10,000, about 0.5% of π. The probability of a point landing in the sub-square is about 25%, which makes geometric sense because its area (1 x 1) is one quarter of the area of the full square (2 x 2).
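The size of these errors can be compared with the theoretical standard error of the estimator. Each point lands inside the circle with probability p = π/4, so the estimate 4 × (hits / n) has standard deviation 4 × sqrt(p(1 − p) / n). The snippet below evaluates this for the sample sizes used above; the variable names are illustrative.

```r
# Probability that a uniform point in the square lands inside the circle
p_inside <- pi / 4

# Sample sizes used in the convergence loop
n_values <- c(100, 500, 1000, 5000, 10000)

# Standard error of 4 * (hits / n): a binomial proportion scaled by 4
se_pi <- 4 * sqrt(p_inside * (1 - p_inside) / n_values)

data.frame(n_points = n_values, std_error = round(se_pi, 4))
```

The standard error shrinks in proportion to 1/sqrt(n), so 100 times more points reduces the typical error only by a factor of 10. At n = 10,000 it is about 0.016, the same order as the observed error.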

7 Task 6 — Advanced Data Transformation & Feature Engineering

Objective: Transform data and create new features with a loop-based approach.

Description:

Two transformation functions are defined: normalize_columns(df, cols) performs min-max normalization (range 0 to 1) and z_score(df, cols) performs standardization (mean = 0, sd = 1). Both loop over the numeric columns named in cols. After the transformations, the program adds two new features: performance_category (derived from performance_score) and salary_bracket (derived from fixed salary thresholds at 5000, 8000, and 11000). The distributions before and after transformation are compared side by side with histograms and boxplots.

Output:

Normalized dataframe
Z-score dataframe
Histograms comparing the distributions
Boxplots comparing the distributions
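As a quick sanity check on the two transformations described above, the sketch below applies the same formulas to a small made-up vector and confirms the properties each result should have. The vector toy_salary and the helper names min_max and z_std are hypothetical.

```r
# Hypothetical values, used only to check the formulas
toy_salary <- c(3000, 4500, 8000, 12000, 15000)

# Min-max normalization: (x - min) / (max - min)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Z-score standardization: (x - mean) / sd, with the sample standard deviation
z_std <- function(x) (x - mean(x)) / sd(x)

toy_norm <- min_max(toy_salary)
toy_z    <- z_std(toy_salary)

range(toy_norm)             # smallest value maps to 0, largest to 1
c(mean(toy_z), sd(toy_z))   # mean is 0 up to rounding error, sd is 1
```

Min-max scaling preserves the shape of the distribution inside a fixed interval, while the z-score expresses each value as a distance from the mean in standard deviations.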

7.1 Load Data & Functions

df6          <- read.csv("company_data.csv")
numeric_cols <- c("salary", "performance_score", "KPI_score")
cat("Dimensions:", nrow(df6), "rows,", ncol(df6), "columns\n")

## Dimensions: 160 rows, 6 columns

# ── Min-Max Normalization (loop-based) ───────────────────────────────────────
normalize_columns <- function(df, cols) {
  df_norm <- df
  for (col in cols) {
    mn             <- min(df[[col]], na.rm = TRUE)
    mx             <- max(df[[col]], na.rm = TRUE)
    df_norm[[col]] <- round((df[[col]] - mn) / (mx - mn), 4)
  }
  return(df_norm)
}

# ── Z-Score Standardization (loop-based) ─────────────────────────────────────
z_score <- function(df, cols) {
  df_z <- df
  for (col in cols) {
    mu          <- mean(df[[col]], na.rm = TRUE)
    sigma       <- sd(df[[col]], na.rm = TRUE)
    df_z[[col]] <- round((df[[col]] - mu) / sigma, 4)
  }
  return(df_z)
}

df_norm <- normalize_columns(df6, numeric_cols)
df_z    <- z_score(df6, numeric_cols)

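# ── Feature Engineering ───────────────────────────────────────────────────────
# Caution: categorize_performance() uses the sales thresholds (400 to 1800), so
# every performance_score on the 0-100 scale is labeled "Poor" here.
df6$performance_category <- categorize_performance(df6$performance_score)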
df6$salary_bracket <- cut(
  df6$salary,
  breaks         = c(0, 5000, 8000, 11000, Inf),
  labels         = c("Low", "Medium", "High", "Very High"),
  include.lowest = TRUE
)

cat("New features performance_category and salary_bracket added.\n")

## New features performance_category and salary_bracket added.
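Because categorize_performance() was written for sales amounts, a separate categorizer on the 0 to 100 scale may suit performance_score better, and quantile() can supply the break points if salary brackets are meant to follow quartiles. The sketch below illustrates both. The cut-offs (60, 70, 80, 90), the function name categorize_score, and the Q1 to Q4 labels are assumptions, not values from the original report; the sample values are the first five rows of the company data preview.

```r
# Hypothetical cut-offs on a 0-100 score scale
categorize_score <- function(score) {
  as.character(cut(
    score,
    breaks = c(-Inf, 60, 70, 80, 90, Inf),
    labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
    right  = FALSE   # [lower, upper): a score of exactly 90 is "Excellent"
  ))
}

# First five performance scores and salaries from the data preview
scores   <- c(75.9, 51.1, 53.5, 51.6, 83.4)
salaries <- c(8257.20, 3768.31, 5642.61, 3808.80, 5856.06)

score_cat <- categorize_score(scores)
score_cat

# Quartile-based brackets: the break points come from the data itself
q <- quantile(salaries, probs = c(0, 0.25, 0.5, 0.75, 1))
salary_quartile <- cut(
  salaries,
  breaks         = q,
  labels         = c("Q1", "Q2", "Q3", "Q4"),
  include.lowest = TRUE   # keep the minimum salary inside the first bracket
)
salary_quartile
```

Unlike fixed thresholds, quartile breaks adapt to the data, so each bracket holds roughly a quarter of the rows. cut() will stop with an error if two quartiles coincide, which can happen with heavily tied data.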

7.2 Preview Data

# ── Original data ─────────────────────────────────────────────────────────────
head(df6, 10) %>%
  kable(
    caption     = "Table 6.1: Original Data (before transformation)",
    col.names   = c("Company ID", "Employee ID", "Salary", "Department",
                    "Performance Score", "KPI Score",
                    "Performance Category", "Salary Bracket"),
    align       = c("c", "c", "r", "l", "r", "r", "l", "l"),
    digits      = 2,
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = TRUE,
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#0277BD") %>%
  column_spec(3, color = "#1B5E20", bold = TRUE) %>%
  column_spec(5, color = "#6A1B9A") %>%
  column_spec(7, italic = TRUE, color = "#E65100") %>%
  column_spec(8, italic = TRUE, color = "#0277BD")

Table 6.1: Original Data (before transformation)
Company ID	Employee ID	Salary	Department	Performance Score	KPI Score	Performance Category	Salary Bracket
1	1	8,257.20	Marketing	75.9	45.4	Poor	High
1	2	3,768.31	Finance	51.1	64.9	Poor	Low
1	3	5,642.61	Operations	53.5	68.4	Poor	Medium
1	4	3,808.80	Finance	51.6	54.9	Poor	Low
1	5	5,856.06	Operations	83.4	49.6	Poor	Medium
1	6	11,680.23	Finance	94.1	93.6	Poor	Very High
1	7	12,415.43	Finance	90.4	87.9	Poor	Very High
1	8	10,907.79	HR	67.7	58.5	Poor	High
1	9	11,080.37	HR	99.2	86.5	Poor	Very High
1	10	11,738.56	Engineering	90.0	86.6	Poor	Very High

# ── After Min-Max Normalization ───────────────────────────────────────────────
head(df_norm[, numeric_cols], 10) %>%
  kable(
    caption   = "Table 6.2: After Min-Max Normalization (range [0, 1])",
    col.names = c("Salary (norm)", "Performance Score (norm)", "KPI Score (norm)"),
    align     = c("r", "r", "r"),
    digits    = 4
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, color = "#1565C0", bold = TRUE) %>%
  column_spec(2, color = "#1B5E20") %>%
  column_spec(3, color = "#E65100")

Table 6.2: After Min-Max Normalization (range [0, 1])
Salary (norm)	Performance Score (norm)	KPI Score (norm)
0.4408	0.5181	0.0886
0.0576	0.0201	0.4147
0.2176	0.0683	0.4732
0.0610	0.0301	0.2475
0.2358	0.6687	0.1589
0.7330	0.8835	0.8946
0.7958	0.8092	0.7993
0.6671	0.3534	0.3077
0.6818	0.9859	0.7759
0.7380	0.8012	0.7776

# ── After Z-Score Standardization ────────────────────────────────────────────
head(df_z[, numeric_cols], 10) %>%
  kable(
    caption   = "Table 6.3: After Z-Score Standardization (units of standard deviation)",
    col.names = c("Salary (z)", "Performance Score (z)", "KPI Score (z)"),
    align     = c("r", "r", "r"),
    digits    = 4
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, color = "#1565C0", bold = TRUE) %>%
  column_spec(2, color = "#1B5E20") %>%
  column_spec(3, color = "#E65100")

Table 6.3: After Z-Score Standardization (units of standard deviation)
Salary (z)	Performance Score (z)	KPI Score (z)
-0.1203	-0.0386	-1.5170
-1.3894	-1.6285	-0.4422
-0.8595	-1.4747	-0.2493
-1.3780	-1.5965	-0.9934
-0.7991	0.4422	-1.2855
0.8475	1.1282	1.1398
1.0554	0.8910	0.8256
0.6292	-0.5643	-0.7950
0.6779	1.4551	0.7484
0.8640	0.8653	0.7540

7.3 Visualization

p_sal_before <- ggplot(df6, aes(x = salary)) +
  geom_histogram(bins = 20, fill = "#1565C0", alpha = 0.8, color = "white") +
  labs(title = "Salary — Original", x = "Salary", y = "Count") +
  theme_minimal(base_size = 12)

p_sal_after <- ggplot(df_norm, aes(x = salary)) +
  geom_histogram(bins = 20, fill = "#43A047", alpha = 0.8, color = "white") +
  labs(title = "Salary — Normalized [0, 1]", x = "Normalized Salary", y = "Count") +
  theme_minimal(base_size = 12)

p_perf_before <- ggplot(df6, aes(y = performance_score, x = factor(company_id),
                                  fill = factor(company_id))) +
  geom_boxplot(show.legend = FALSE, alpha = 0.8) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(title = "Performance Score — Original", x = "Company", y = "Score") +
  theme_minimal(base_size = 12)

p_perf_after <- ggplot(df_z, aes(y = performance_score, x = factor(company_id),
                                  fill = factor(company_id))) +
  geom_boxplot(show.legend = FALSE, alpha = 0.8) +
  scale_fill_brewer(palette = "Pastel2") +
  labs(title = "Performance Score — Z-Score", x = "Company", y = "Z-Score") +
  theme_minimal(base_size = 12)

pc_sum <- df6 %>%
  group_by(performance_category) %>%
  summarise(n = n()) %>%
  mutate(pct = round(n / sum(n) * 100, 1))

p_bracket <- ggplot(df6, aes(x = salary_bracket, fill = salary_bracket)) +
  geom_bar(show.legend = FALSE, width = 0.6, color = "white") +
  scale_fill_manual(values = c(
    "Low"       = "#EF9A9A",
    "Medium"    = "#FFF176",
    "High"      = "#A5D6A7",
    "Very High" = "#90CAF9"
  )) +
  labs(title = "Salary Bracket Distribution", x = "Bracket", y = "Count") +
  theme_minimal(base_size = 12)

p_perf_cat <- ggplot(pc_sum, aes(x = "", y = n, fill = performance_category)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y") +
  geom_text(aes(label = paste0(pct, "%")),
            position = position_stack(vjust = 0.5),
            size = 3.5, color = "white", fontface = "bold") +
  scale_fill_brewer(palette = "Set2", name = "Category") +
  labs(title = "Performance Category Distribution") +
  theme_void(base_size = 12)

grid.arrange(p_sal_before, p_sal_after,
             p_perf_before, p_perf_after,
             p_bracket, p_perf_cat, ncol = 2)

7.4 Interpretation

Min-max normalization and z-score standardization are linear rescalings: they change the scale of the data without changing the shape of its distribution. After min-max normalization, every salary value lies in the range [0, 1], which makes it easier to compare with variables measured in different units. Z-score standardization yields a mean of 0 and a standard deviation of 1, which is useful for outlier detection because each value is expressed as its distance from the mean in standard deviations. Feature engineering adds categorical information (such as salary_bracket and performance_category) that can be used directly for segmentation analysis.
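
The properties claimed for the two rescalings can be checked numerically. The following is a minimal base R sketch on synthetic data. The helper names min_max and z_score and the simulated salary vector are assumptions made for illustration and are not taken from the report's own code.

```r
# Hypothetical helpers (illustrative names, not from the report)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
z_score <- function(x) (x - mean(x)) / sd(x)

set.seed(42)
# Synthetic, right-skewed stand-in for a salary column
salary <- rlnorm(200, meanlog = 9, sdlog = 0.4)

salary_norm <- min_max(salary)
salary_z    <- z_score(salary)

# Min-max output spans exactly [0, 1]
range(salary_norm)

# Z-scores have mean ~0 and standard deviation ~1 (up to floating-point error)
mean(salary_z)
sd(salary_z)

# Both transforms are linear with positive slope, so the correlation
# with the original values is 1 and the distribution shape is unchanged
cor(salary, salary_norm)
cor(salary, salary_z)

# A common rule of thumb (an assumption here, not from the report):
# values with |z| > 3 are flagged as potential outliers
which(abs(salary_z) > 3)
```

Because only the scale changes, histograms of salary and salary_norm have the same shape and differ only in their x-axis values.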

8 Task 7 — Mini Project: Company KPI Dashboard

Goal: Build a comprehensive KPI dashboard for multi-company analysis.

Description:

This mini project combines all the concepts from the previous tasks. The dataset is generated for 5–10 companies with 50–200 employees per company. The available columns are employee_id, company_id, salary, department, performance_score, and KPI_score. The program loops over the data to assign each employee to a KPI tier (Tier 1 - Elite: KPI ≥ 90; Tier 2 - High: 75 ≤ KPI < 90; Tier 3 - Average: 60 ≤ KPI < 75; Tier 4 - Low: KPI < 60), computes a summary per company, and compares departments. The advanced visualizations include bar charts comparing average KPI and top performer counts across companies, a grouped bar chart of average salary by department and company, and a scatter plot with a regression line showing the relationship between performance_score and KPI_score.

Output:

Summary table per company
Top performer counts per company (KPI ≥ 90)
Bar charts of average KPI and top performers per company
Scatter plot of performance score vs KPI score with a regression line
Per-department analysis (average salary by department and company)
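
The per-company summary in this task can also be written as an explicit loop. The following is a minimal base R sketch of that idea. The toy_df data frame is synthetic and stands in for the real data, and the names toy_df, rows, and loop_summary are hypothetical. The summary columns mirror those used in section 8.2.

```r
set.seed(1)
# Synthetic stand-in; the report's real data comes from dashboard_data.csv
toy_df <- data.frame(
  company_id = rep(c("CO1", "CO2", "CO3"), times = c(4, 3, 5)),
  salary     = round(runif(12, 4000, 12000), 2),
  KPI_score  = round(runif(12, 40, 100), 1)
)

companies <- unique(toy_df$company_id)
rows <- vector("list", length(companies))  # preallocate one slot per company

for (i in seq_along(companies)) {
  # Subset the rows belonging to the current company
  sub <- toy_df[toy_df$company_id == companies[i], ]
  rows[[i]] <- data.frame(
    company_id     = companies[i],
    N_Employees    = nrow(sub),
    Avg_Salary     = round(mean(sub$salary), 2),
    Avg_KPI        = round(mean(sub$KPI_score), 2),
    Top_Performers = sum(sub$KPI_score >= 90)
  )
}

# Stack the one-row data frames into a single summary table
loop_summary <- do.call(rbind, rows)
loop_summary
```

Collecting the rows in a preallocated list and binding them once at the end avoids repeatedly copying a growing data frame inside the loop.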

8.1 Load Data

dash_df <- read.csv("dashboard_data.csv")
cat("Total employees:", nrow(dash_df), "\n")

## Total employees: 988

cat("Total companies:", length(unique(dash_df$company_id)), "\n")

## Total companies: 7

head(dash_df, 10) %>%
  kable(
    caption     = "Table 7.1: Preview of dashboard_data.csv (first 10 rows)",
    col.names   = c("Employee ID", "Company ID", "Salary", "Department",
                    "Performance Score", "KPI Score"),
    align       = c("c", "c", "r", "l", "r", "r"),
    digits      = 2,
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(2, bold = TRUE, color = "white", background = "#0277BD") %>%
  column_spec(3, color = "#1B5E20", bold = TRUE) %>%
  column_spec(4, italic = TRUE, color = "#4A148C") %>%
  column_spec(5, color = "#424242") %>%
  column_spec(6, bold = TRUE, color = "#E65100")

Table 7.1: Preview of dashboard_data.csv (first 10 rows)
Employee ID	Company ID	Salary	Department	Performance Score	KPI Score
1	CO1	7,790.52	R&D	58.7	68.5
2	CO1	8,395.58	Legal	85.1	77.9
3	CO1	4,062.58	Customer Service	71.9	66.9
4	CO1	6,648.91	Supply Chain	77.4	80.6
5	CO1	8,809.21	Legal	56.7	58.9
6	CO1	4,820.13	IT	91.7	76.5
7	CO1	4,953.22	Customer Service	70.8	65.8
8	CO1	7,035.43	Legal	75.7	69.3
9	CO1	9,350.49	Legal	70.4	79.4
10	CO1	7,378.56	Supply Chain	70.4	55.7

8.2 KPI Tier Categorization & Summary

# ── Loop that assigns each employee to a KPI tier ─────────────────────────────
# Preallocate the result vector instead of growing it inside the loop
kpi_tier <- character(nrow(dash_df))
for (i in seq_along(dash_df$KPI_score)) {
  kpi <- dash_df$KPI_score[i]
  kpi_tier[i] <- if (kpi >= 90) {
    "Tier 1 - Elite"
  } else if (kpi >= 75) {
    "Tier 2 - High"
  } else if (kpi >= 60) {
    "Tier 3 - Average"
  } else {
    "Tier 4 - Low"
  }
}
dash_df$kpi_tier <- kpi_tier

# ── Summary per company ───────────────────────────────────────────────────────
company_kpi_summary <- dash_df %>%
  group_by(company_id) %>%
  summarise(
    N_Employees    = n(),
    Avg_Salary     = round(mean(salary), 2),
    Avg_KPI        = round(mean(KPI_score), 2),
    Top_Performers = sum(KPI_score >= 90),
    .groups        = "drop"
  )

company_kpi_summary %>%
  kable(
    caption     = "Table 7.2: KPI Summary per Company",
    col.names   = c("Company", "Employees", "Avg Salary (IDR)",
                    "Avg KPI", "Top Performers (KPI >= 90)"),
    align       = c("c", "c", "r", "r", "c"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#0277BD") %>%
  column_spec(3, color = "#1B5E20", bold = TRUE) %>%
  column_spec(4, bold = TRUE,
              color = ifelse(company_kpi_summary$Avg_KPI >= 70, "#1B5E20", "#E65100")) %>%
  column_spec(5, bold = TRUE, color = "white",
              background = ifelse(company_kpi_summary$Top_Performers >= 5,
                                  "#1B5E20", "#E53935"))

Table 7.2: KPI Summary per Company
Company	Employees	Avg Salary (IDR)	Avg KPI	Top Performers (KPI >= 90)
CO1	153	7,553.79	67.93	6
CO2	192	7,986.33	68.65	13
CO3	164	8,488.27	69.36	12
CO4	148	8,810.85	70.31	9
CO5	114	9,173.87	68.52	3
CO6	75	9,916.63	70.64	3
CO7	142	10,624.84	68.96	11

# ── KPI Tier Distribution ─────────────────────────────────────────────────────
tier_order   <- c("Tier 1 - Elite", "Tier 2 - High", "Tier 3 - Average", "Tier 4 - Low")
tier_summary <- dash_df %>%
  group_by(kpi_tier) %>%
  summarise(
    Jumlah     = n(),
    Avg_KPI    = round(mean(KPI_score), 2),
    Avg_Salary = round(mean(salary), 2),
    .groups    = "drop"
  ) %>%
  mutate(
    Persentase = round(Jumlah / sum(Jumlah) * 100, 1),
    kpi_tier   = factor(kpi_tier, levels = tier_order)
  ) %>%
  arrange(kpi_tier)

tier_summary %>%
  kable(
    caption     = "Table 7.3: Distribution of Employees per KPI Tier (all companies)",
    col.names   = c("KPI Tier", "Count", "Avg KPI", "Avg Salary (IDR)", "Percentage (%)"),
    align       = c("l", "c", "r", "r", "c"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2, color = "#1565C0", bold = TRUE) %>%
  column_spec(3, bold = TRUE,
              color = c("#1B5E20", "#388E3C", "#E65100", "#C62828")) %>%
  column_spec(4, color = "#424242") %>%
  column_spec(5, color = "#6A1B9A") %>%
  row_spec(1, background = "#E8F5E9") %>%
  row_spec(4, background = "#FFEBEE")

Table 7.3: Distribution of Employees per KPI Tier (all companies)
KPI Tier	Count	Avg KPI	Avg Salary (IDR)	Percentage (%)
Tier 1 - Elite	57	93.34	9,132.32	5.8
Tier 2 - High	281	81.12	8,703.80	28.4
Tier 3 - Average	397	67.77	8,721.79	40.2
Tier 4 - Low	253	52.31	8,911.54	25.6
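
The tier loop in section 8.2 can also be expressed without an explicit loop. The sketch below uses base R's cut() with the same thresholds (90, 75, 60). The kpi vector is synthetic and was chosen to sit on and just below each boundary, and the variable names are illustrative rather than taken from the report.

```r
# Synthetic KPI scores on and just below each tier boundary
kpi <- c(95, 90, 89.9, 75, 74.9, 60, 59.9, 30)

# Labels must follow the order of the intervals, from lowest to highest
tier_labels <- c("Tier 4 - Low", "Tier 3 - Average",
                 "Tier 2 - High", "Tier 1 - Elite")

# right = FALSE makes each interval closed on the left:
# [-Inf, 60), [60, 75), [75, 90), [90, Inf)
kpi_tier <- cut(kpi,
                breaks = c(-Inf, 60, 75, 90, Inf),
                labels = tier_labels,
                right  = FALSE)

data.frame(kpi, kpi_tier)
table(kpi_tier)
```

With right = FALSE, a score of exactly 90 falls in Tier 1 and a score of exactly 75 falls in Tier 2, which matches the >= comparisons used in the loop. The result is a factor, so its level order is already fixed for tables and plots.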

8.3 Visualizations

p_avg_kpi <- ggplot(company_kpi_summary,
                    aes(x = company_id, y = Avg_KPI, fill = company_id)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = round(Avg_KPI, 1)), vjust = -0.4, fontface = "bold", size = 4) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Average KPI Score per Company", x = "Company", y = "Avg KPI") +
  theme_minimal(base_size = 13) +
  ylim(0, 100)

p_top <- ggplot(company_kpi_summary,
                aes(x = company_id, y = Top_Performers, fill = company_id)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = Top_Performers), vjust = -0.4, fontface = "bold", size = 4) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Top Performers (KPI >= 90) per Company",
       x = "Company", y = "Count") +
  theme_minimal(base_size = 13) +
  ylim(0, max(company_kpi_summary$Top_Performers) * 1.2)

grid.arrange(p_avg_kpi, p_top, ncol = 2)

dept_summary <- dash_df %>%
  group_by(company_id, department) %>%
  summarise(Avg_Salary = round(mean(salary), 2), .groups = "drop")

ggplot(dept_summary, aes(x = department, y = Avg_Salary, fill = company_id)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  scale_fill_brewer(palette = "Dark2", name = "Company") +
  labs(
    title    = "Average Salary by Department and Company",
    subtitle = "Grouped Bar Chart",
    x        = "Department",
    y        = "Avg Salary"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x     = element_text(angle = 30, hjust = 1),
    legend.position = "bottom"
  )

p_scatter <- ggplot(dash_df, aes(x = performance_score, y = KPI_score,
                                  color = company_id)) +
  geom_point(alpha = 0.35, size = 1.5) +
  geom_smooth(aes(group = 1), method = "lm", color = "black",
              se = TRUE, linewidth = 1.2) +
  scale_color_brewer(palette = "Set2", name = "Company") +
  labs(
    title    = "Performance Score vs KPI Score",
    subtitle = "Scatter plot with regression line",
    x        = "Performance Score",
    y        = "KPI Score"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

p_sal_tier <- ggplot(dash_df, aes(x = kpi_tier, y = salary, fill = kpi_tier)) +
  geom_boxplot(show.legend = FALSE, alpha = 0.8, outlier.size = 1) +
  scale_fill_manual(values = c(
    "Tier 1 - Elite"   = "#1B5E20",
    "Tier 2 - High"    = "#388E3C",
    "Tier 3 - Average" = "#FBC02D",
    "Tier 4 - Low"     = "#C62828"
  )) +
  labs(title = "Salary Distribution by KPI Tier",
       x = "KPI Tier", y = "Salary") +
  theme_minimal(base_size = 13) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

grid.arrange(p_scatter, p_sal_tier, ncol = 2)

8.4 Interpretation

The KPI dashboard covers 7 companies and 988 employees. Average KPI differs little across companies, ranging from 67.93 (CO1) to 70.64 (CO6), which indicates relatively balanced performance. The scatter plot of performance score against KPI score shows a clear positive correlation: higher performance tends to come with higher KPI, with some noise because KPI is also influenced by other factors. In the salary boxplot by KPI tier, the salary distributions do not differ drastically between tiers (the tier averages range only from 8,703.80 to 9,132.32) because in this dataset salary was generated with variation that does not depend directly on KPI. The differences in average salary across departments in the grouped bar chart could serve as material for further analysis of compensation policy.
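
The two relationships described above, a positive link between performance and KPI and no direct link between salary and KPI, can be quantified with cor() and lm(). The sketch below runs on synthetic data. The data-generating process (KPI as a linear function of performance plus noise, salary drawn independently) is an assumption made for illustration and is not the process that produced dashboard_data.csv.

```r
set.seed(123)
n <- 500

performance_score <- runif(n, 40, 100)
# Assumed process: KPI tracks performance, plus noise from other factors
KPI_score <- 20 + 0.7 * performance_score + rnorm(n, sd = 8)
# Assumed process: salary is drawn independently of KPI
salary <- runif(n, 4000, 12000)

# Pearson correlations
r_perf_kpi <- cor(performance_score, KPI_score)
r_sal_kpi  <- cor(salary, KPI_score)

# Simple linear regression, the same model that
# geom_smooth(method = "lm") draws on the scatter plot
fit <- lm(KPI_score ~ performance_score)

r_perf_kpi   # strongly positive
r_sal_kpi    # close to zero
coef(fit)    # intercept and slope of the regression line
```

On the real data, the same calls applied to dash_df$performance_score, dash_df$KPI_score, and dash_df$salary would put numbers behind what the plots show visually.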

9 Conclusion

This practicum implemented functions, loops, conditionals, and visualization across seven data science tasks. Functions make code reusable, nested loops handle hierarchical data, conditionals enable dynamic decision-making, and visualization presents insights intuitively. Together, these four elements form an important foundation for building efficient, well-structured data science workflows.