Assignment Week 5 ~ Functions and loops
Lulu Najla Salsabila
INSTITUT TEKNOLOGI SAINS BANDUNG
1 Introduction
This practicum implements functions, loops, and conditionals to solve a range of data science problems. The tasks progress from basic to advanced and cover data simulation, data transformation, statistical analysis, and interactive visualization. The practicum also emphasizes automating data science workflows in R.
2 Task 1 — Dynamic Multi-Formula Function
Objective: Build a flexible function that can evaluate several kinds of mathematical formulas.
Description:
The function compute_formula(x, formula) takes a value x and a formula type ("linear", "quadratic", "cubic", or "exponential") and returns the result for the selected formula. A nested loop evaluates all four formulas over x = 1:20. The function validates its input and stops with an error if the formula name is not one of the four supported types. The results for all four formulas are overlaid in a single plot so that their growth patterns can be compared directly.
Output:
Table of computed values for each formula
Combined plot (linear, quadratic, cubic, exponential) in a single chart
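As a hedged aside, the input validation described above can also be written with base R's match.arg(), which checks a string against the choices listed in the function's default argument. The sketch below is an alternative and not the function used in this report; the name compute_formula_alt is hypothetical. Note that match.arg() also accepts unambiguous partial matches such as "lin".

```r
# Hypothetical alternative to compute_formula(): validation via match.arg().
compute_formula_alt <- function(x,
                                formula = c("linear", "quadratic",
                                            "cubic", "exponential")) {
  # Stops with an error if `formula` is not one of the listed choices.
  formula <- match.arg(formula)
  switch(formula,
    linear      = 2 * x + 3,
    quadratic   = x^2 - 3 * x + 2,
    cubic       = x^3 - 4 * x^2 + x + 6,
    exponential = exp(0.3 * x)
  )
}

linear_values <- compute_formula_alt(1:3, "linear")  # 5 7 9
cubic_at_2    <- compute_formula_alt(2, "cubic")     # 8 - 16 + 2 + 6 = 0

# An unsupported name raises an error; capture its message instead of halting.
bad_call <- tryCatch(
  compute_formula_alt(2, "sine"),
  error = function(e) conditionMessage(e)
)
```

Keeping the valid choices in the default argument means the list of supported formulas lives in one place, at the cost of the partial-matching behavior mentioned above.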
2.1 Function & Computation
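# ── Packages used throughout this report ─────────────────────────────────────
library(dplyr)       # %>%, group_by(), summarise(), mutate()
library(tidyr)       # pivot_wider()
library(ggplot2)     # plots
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), column_spec(), row_spec()
library(gridExtra)   # grid.arrange()
# ── Function definition ──────────────────────────────────────────────────────
compute_formula <- function(x, formula) {
  valid_formulas <- c("linear", "quadratic", "cubic", "exponential")
  if (!formula %in% valid_formulas) {
    stop(paste("Invalid formula. Choose from:", paste(valid_formulas, collapse = ", ")))
  }
  switch(formula,
    "linear" = 2 * x + 3,
    "quadratic" = x^2 - 3 * x + 2,
    "cubic" = x^3 - 4 * x^2 + x + 6,
    "exponential" = exp(0.3 * x)
  )
}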
# ── Compute every formula for x = 1:20 with a nested loop ────────────────────
x_values <- 1:20
formulas <- c("linear", "quadratic", "cubic", "exponential")
results_df <- data.frame()
for (x in x_values) {
  for (f in formulas) {
    y <- compute_formula(x, f)
    results_df <- rbind(results_df, data.frame(x = x, formula = f, y = y))
  }
}
# ── Pivot table: one column per formula ──────────────────────────────────────
results_wide <- tidyr::pivot_wider(results_df, names_from = formula, values_from = y)
results_wide[, 2:5] <- round(results_wide[, 2:5], 3)
# pivot_wider() keeps the formulas in order of first appearance:
# linear, quadratic, cubic, exponential
results_wide %>%
  kable(
    caption = "Table 1.1: Values of f(x) for each formula (x = 1 to 20)",
    col.names = c("x", "Linear", "Quadratic", "Cubic", "Exponential"),
    align = c("c", "r", "r", "r", "r")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width = TRUE,
    font_size = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#424242") %>%
  column_spec(2, color = "#2196F3") %>%
  column_spec(3, color = "#4CAF50") %>%
  column_spec(4, color = "#FF5722") %>%
  column_spec(5, color = "#9C27B0", bold = TRUE)
| x | Linear | Quadratic | Cubic | Exponential |
|---|---|---|---|---|
| 1 | 5 | 0 | 4 | 1.350 |
| 2 | 7 | 0 | 0 | 1.822 |
| 3 | 9 | 2 | 0 | 2.460 |
| 4 | 11 | 6 | 10 | 3.320 |
| 5 | 13 | 12 | 36 | 4.482 |
| 6 | 15 | 20 | 84 | 6.050 |
| 7 | 17 | 30 | 160 | 8.166 |
| 8 | 19 | 42 | 270 | 11.023 |
| 9 | 21 | 56 | 420 | 14.880 |
| 10 | 23 | 72 | 616 | 20.086 |
| 11 | 25 | 90 | 864 | 27.113 |
| 12 | 27 | 110 | 1170 | 36.598 |
| 13 | 29 | 132 | 1540 | 49.402 |
| 14 | 31 | 156 | 1980 | 66.686 |
| 15 | 33 | 182 | 2496 | 90.017 |
| 16 | 35 | 210 | 3094 | 121.510 |
| 17 | 37 | 240 | 3780 | 164.022 |
| 18 | 39 | 272 | 4560 | 221.406 |
| 19 | 41 | 306 | 5440 | 298.867 |
| 20 | 43 | 342 | 6426 | 403.429 |
2.2 Visualization
ggplot(results_df, aes(x = x, y = y, color = formula)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2, alpha = 0.7) +
  scale_color_manual(values = c(
    "linear" = "#2196F3",
    "quadratic" = "#4CAF50",
    "cubic" = "#FF5722",
    "exponential" = "#9C27B0"
  )) +
  labs(
    title = "Comparison of Mathematical Formulas (x = 1 to 20)",
    subtitle = "Linear, Quadratic, Cubic, and Exponential",
    x = "x",
    y = "f(x)",
    color = "Formula"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")
2.3 Interpretation
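The plot shows that the four formulas grow in very different ways. The linear formula (2x + 3) rises at a constant, gentle rate. The quadratic formula (x^2 - 3x + 2) traces a parabola: it equals 0 at both x = 1 and x = 2 and then climbs steadily. The cubic formula (x^3 - 4x^2 + x + 6) dips from 4 at x = 1 to 0 at x = 2 and x = 3 before rising steeply, and it dominates this range, reaching 6,426 at x = 20. The exponential formula (e^(0.3x)) grows slowly at first and accelerates late: it overtakes the linear values between x = 10 and x = 11, sits below the quadratic from x = 4 through x = 19 and exceeds it only at x = 20 (403.4 versus 342), and stays far below the cubic throughout. Because an exponential eventually outgrows any polynomial, it would overtake the cubic only at x values well beyond the range plotted here. The comparison shows how strongly a function's behavior depends on its type, especially as x grows.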
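The growth comparison can be checked numerically. The following is a minimal sketch that evaluates the same four formulas with sapply() instead of growing a data frame inside a nested loop; the list formula_fns and the object names are hypothetical helpers, not part of the report's code.

```r
# Formulas copied from compute_formula(); the list itself is a hypothetical helper.
formula_fns <- list(
  linear      = function(x) 2 * x + 3,
  quadratic   = function(x) x^2 - 3 * x + 2,
  cubic       = function(x) x^3 - 4 * x^2 + x + 6,
  exponential = function(x) exp(0.3 * x)
)
x_values <- 1:20

# sapply() returns a 20 x 4 matrix with one named column per formula.
values <- sapply(formula_fns, function(f) f(x_values))
results_wide_alt <- data.frame(x = x_values, round(values, 3))

# First x at which the exponential exceeds the linear formula.
first_above_linear <- min(x_values[values[, "exponential"] > values[, "linear"]])

# All x at which the exponential exceeds the quadratic formula.
above_quadratic <- x_values[values[, "exponential"] > values[, "quadratic"]]
```

Building the whole matrix in one call avoids the repeated rbind() copies of the loop version, which matters little for 80 values but becomes noticeable for long ranges.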
3 Task 2 — Nested Simulation: Multi-Sales & Discounts
Objective: Simulate sales data for multiple salespeople with a dynamic discount scheme.
Description:
The task specifies a function simulate_sales(n_salesperson, days) that generates a random sales dataset in which every salesperson has one sales record per day for the given number of days, built with a nested loop over salespeople and days. Discounts are conditional on the sales amount: the larger the sale, the larger the discount rate. In this report the simulated dataset is loaded from sales_data.csv (150 rows: 5 salespeople over 30 days), and simulate_sales(df) loops over the salespeople and calls the helper get_cumulative_sales() to compute net and cumulative sales for each one. Summary statistics such as total sales, total net sales, and the average discount are then reported.
Output:
Simulated dataset (Sales ID, Day, Sales Amount, Discount Rate)
Summary statistics per salesperson
Plot of cumulative sales per salesperson
3.1 Load Data
sales_df <- read.csv("sales_data.csv")
cat("Data dimensions:", nrow(sales_df), "rows,", ncol(sales_df), "columns\n")
## Data dimensions: 150 rows, 4 columns
head(sales_df, 10) %>%
  kable(
    caption = "Table 2.1: Preview of sales_data.csv (first 10 rows)",
    col.names = c("Sales ID", "Day", "Sales Amount", "Discount Rate"),
    align = c("c", "c", "r", "r"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#1565C0") %>%
  column_spec(3, color = "#1B5E20", bold = TRUE) %>%
  column_spec(4, color = "#E65100")
| Sales ID | Day | Sales Amount | Discount Rate |
|---|---|---|---|
| 1 | 1 | 1,314.91 | 0.15 |
| 1 | 2 | 147.52 | 0.05 |
| 1 | 3 | 622.56 | 0.10 |
| 1 | 4 | 524.10 | 0.10 |
| 1 | 5 | 1,499.30 | 0.15 |
| 1 | 6 | 1,385.73 | 0.15 |
| 1 | 7 | 1,795.14 | 0.20 |
| 1 | 8 | 265.18 | 0.05 |
| 1 | 9 | 901.65 | 0.10 |
| 1 | 10 | 156.61 | 0.05 |
3.2 Functions & Simulation
# ── Helper functions: net sales and cumulative net sales per salesperson ─────
apply_discount <- function(amount, rate) {
  amount * (1 - rate)
}
get_cumulative_sales <- function(df, salesperson_id) {
  sub_df <- df[df$sales_id == salesperson_id, ]
  sub_df <- sub_df[order(sub_df$day), ]
  sub_df$net_sales <- round(mapply(apply_discount,
                                   sub_df$sales_amount,
                                   sub_df$discount_rate), 2)
  sub_df$cumulative_net <- cumsum(sub_df$net_sales)
  return(sub_df)
}
simulate_sales <- function(df) {
  all_cumulative <- data.frame()
  salesperson_ids <- unique(df$sales_id)
  for (sid in salesperson_ids) {
    all_cumulative <- rbind(all_cumulative, get_cumulative_sales(df, sid))
  }
  return(all_cumulative)
}
# ── Run the simulation ───────────────────────────────────────────────────────
sim_result <- simulate_sales(sales_df)
head(sim_result, 10) %>%
  kable(
    caption = "Table 2.2: Simulation result with Net Sales and Cumulative Net (first 10 rows)",
    col.names = c("Sales ID", "Day", "Sales Amount", "Discount Rate",
                  "Net Sales", "Cumulative Net"),
    align = c("c", "c", "r", "r", "r", "r"),
    digits = 2,
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#1565C0") %>%
  column_spec(5, color = "#1B5E20", bold = TRUE) %>%
  column_spec(6, color = "#6A1B9A", bold = TRUE)
| Sales ID | Day | Sales Amount | Discount Rate | Net Sales | Cumulative Net |
|---|---|---|---|---|---|
| 1 | 1 | 1,314.91 | 0.15 | 1,117.67 | 1,117.67 |
| 1 | 2 | 147.52 | 0.05 | 140.14 | 1,257.81 |
| 1 | 3 | 622.56 | 0.10 | 560.30 | 1,818.11 |
| 1 | 4 | 524.10 | 0.10 | 471.69 | 2,289.80 |
| 1 | 5 | 1,499.30 | 0.15 | 1,274.40 | 3,564.20 |
| 1 | 6 | 1,385.73 | 0.15 | 1,177.87 | 4,742.07 |
| 1 | 7 | 1,795.14 | 0.20 | 1,436.11 | 6,178.18 |
| 1 | 8 | 265.18 | 0.05 | 251.92 | 6,430.10 |
| 1 | 9 | 901.65 | 0.10 | 811.48 | 7,241.58 |
| 1 | 10 | 156.61 | 0.05 | 148.78 | 7,390.36 |
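The per-salesperson cumulative sum can also be computed without an explicit loop. The sketch below uses base R's ave() on a small toy data frame, assuming the rows are already sorted by day within each salesperson. The first three rows are taken from the preview of sales_data.csv above; the two rows for salesperson 2 are made-up illustration values.

```r
# Rows for sales_id 1 come from the preview table;
# the rows for sales_id 2 are hypothetical.
toy_sales <- data.frame(
  sales_id      = c(1, 1, 1, 2, 2),
  day           = c(1, 2, 3, 1, 2),
  sales_amount  = c(1314.91, 147.52, 622.56, 1000.00, 200.00),
  discount_rate = c(0.15, 0.05, 0.10, 0.10, 0.05)
)

# Vectorized discount: `*` already works element-wise, so mapply() is optional.
toy_sales$net_sales <- round(
  toy_sales$sales_amount * (1 - toy_sales$discount_rate), 2
)

# ave() applies cumsum() within each sales_id group and keeps the row order.
toy_sales$cumulative_net <- ave(
  toy_sales$net_sales, toy_sales$sales_id, FUN = cumsum
)
```

For salesperson 1 this reproduces the cumulative values 1,117.67, 1,257.81, and 1,818.11 shown in the table above.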
3.3 Summary Statistics
summary_sales <- sim_result %>%
  group_by(sales_id) %>%
  summarise(
    Total_Gross = round(sum(sales_amount), 2),
    Total_Net = round(sum(net_sales), 2),
    Avg_Discount = round(mean(discount_rate) * 100, 1),
    Max_Net_Day = round(max(net_sales), 2),
    .groups = "drop"
  )
summary_sales %>%
  kable(
    caption = "Table 2.3: Sales summary per salesperson",
    col.names = c("Sales ID", "Total Gross (IDR)", "Total Net (IDR)",
                  "Avg Discount (%)", "Max Net/Day"),
    align = c("c", "r", "r", "c", "r"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#1565C0") %>%
  column_spec(2, color = "#424242") %>%
  column_spec(3, bold = TRUE, color = "#1B5E20") %>%
  column_spec(4, color = "#E65100")
| Sales ID | Total Gross (IDR) | Total Net (IDR) | Avg Discount (%) | Max Net/Day |
|---|---|---|---|---|
| 1 | 27,150.32 | 23,098.05 | 11.7 | 1,534.96 |
| 2 | 30,183.54 | 25,728.56 | 12.3 | 1,559.14 |
| 3 | 31,852.49 | 26,809.95 | 13.3 | 1,596.26 |
| 4 | 33,147.16 | 27,901.67 | 13.2 | 1,594.10 |
| 5 | 30,854.48 | 25,759.86 | 13.3 | 1,519.36 |
3.4 Visualization
ggplot(sim_result, aes(x = day, y = cumulative_net,
                       color = factor(sales_id), group = sales_id)) +
  geom_line(linewidth = 1.1) +
  scale_color_brewer(palette = "Set1", name = "Salesperson ID") +
  labs(
    title = "Cumulative Net Sales per Salesperson (30 Days)",
    subtitle = "After applying conditional discount rates",
    x = "Day",
    y = "Cumulative Net Sales (IDR)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")
3.5 Interpretation
Over the 30 simulated days, every salesperson's cumulative curve rises monotonically, which is expected because a positive sale is recorded every day. After the conditional discounts are applied (the larger the sale, the larger the discount rate), total net sales end up roughly 15 to 17 percent below gross, for example 23,098.05 versus 27,150.32 for salesperson 1, but the upward pattern is unchanged. Differences in slope between salespeople reflect how often each one closes high-value sales.
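The data-generating step itself is not shown in this report, since the data is read from a CSV file. As a rough sketch of what a nested-loop generator of this kind could look like, the code below draws one amount per salesperson per day and assigns a discount tier. The tier cut-offs (500, 1000, 1500), the seed, and the function names are assumptions made for illustration; they are consistent with the preview rows shown earlier but are not confirmed by the source. The uniform range 100 to 2000 follows the range mentioned in the Task 3 interpretation.

```r
# Hypothetical discount tiers: the report does not state the actual cut-offs.
discount_for <- function(amount) {
  if (amount >= 1500) {
    0.20
  } else if (amount >= 1000) {
    0.15
  } else if (amount >= 500) {
    0.10
  } else {
    0.05
  }
}

simulate_sales_sketch <- function(n_salesperson, days, seed = 42) {
  set.seed(seed)                                  # reproducible draws
  rows <- vector("list", n_salesperson * days)    # pre-allocated result list
  k <- 0
  for (sid in seq_len(n_salesperson)) {           # outer loop: salespeople
    for (d in seq_len(days)) {                    # inner loop: days
      amount <- round(runif(1, min = 100, max = 2000), 2)
      k <- k + 1
      rows[[k]] <- data.frame(
        sales_id      = sid,
        day           = d,
        sales_amount  = amount,
        discount_rate = discount_for(amount)
      )
    }
  }
  do.call(rbind, rows)                            # bind once at the end
}

sim_sketch <- simulate_sales_sketch(n_salesperson = 5, days = 30)
```

Collecting rows in a pre-allocated list and binding once keeps the loop structure the task asks for while avoiding a growing data frame.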
4 Task 3 — Multi-Level Performance Categorization
Objective: Group sales performance into tiered categories.
Description:
The function categorize_performance(sales_amount) takes a vector of sales amounts and assigns each one to one of five categories. As implemented below, the thresholds are: Excellent (>= 1800), Very Good (1400 to 1799), Good (900 to 1399), Average (400 to 899), and Poor (< 400). The function loops over every element of the vector. After categorization, the share of observations in each category is computed, and the result is shown as a bar plot and a pie chart to make the performance distribution easier to read.
Output:
Frequency and percentage table per category
Bar plot of the category distribution
Pie chart of the category proportions
4.1 Function & Categorization
# ── Performance categorization function ──────────────────────────────────────
categorize_performance <- function(sales_amount) {
  categories <- c()
  for (amt in sales_amount) {
    cat_label <- if (amt >= 1800) {
      "Excellent"
    } else if (amt >= 1400) {
      "Very Good"
    } else if (amt >= 900) {
      "Good"
    } else if (amt >= 400) {
      "Average"
    } else {
      "Poor"
    }
    categories <- c(categories, cat_label)
  }
  return(categories)
}
# ── Apply to the data and compute percentages ────────────────────────────────
sales_df$performance_cat <- categorize_performance(sales_df$sales_amount)
cat_order <- c("Excellent", "Very Good", "Good", "Average", "Poor")
cat_summary <- sales_df %>%
  group_by(performance_cat) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(percentage = round(count / sum(count) * 100, 1)) %>%
  arrange(factor(performance_cat, levels = cat_order))
cat_summary %>%
  kable(
    caption = "Table 3.1: Distribution of sales performance categories",
    col.names = c("Category", "Count", "Percentage (%)"),
    align = c("l", "c", "c")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2, color = "#1565C0", bold = TRUE) %>%
  column_spec(3, color = "#6A1B9A") %>%
  row_spec(which(cat_summary$performance_cat == "Excellent"), background = "#E8F5E9") %>%
  row_spec(which(cat_summary$performance_cat == "Poor"), background = "#FFEBEE")
| Category | Count | Percentage (%) |
|---|---|---|
| Excellent | 14 | 9.3 |
| Very Good | 30 | 20.0 |
| Good | 40 | 26.7 |
| Average | 41 | 27.3 |
| Poor | 25 | 16.7 |
4.2 Visualization
palette_cat <- c(
  "Excellent" = "#1B5E20",
  "Very Good" = "#388E3C",
  "Good" = "#FBC02D",
  "Average" = "#F57C00",
  "Poor" = "#C62828"
)
cat_summary$performance_cat <- factor(cat_summary$performance_cat, levels = cat_order)
p_bar <- ggplot(cat_summary, aes(x = performance_cat, y = count,
                                 fill = performance_cat)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = paste0(count, "\n(", percentage, "%)")),
            vjust = -0.3, size = 3.8, fontface = "bold") +
  scale_fill_manual(values = palette_cat) +
  labs(title = "Sales Performance Distribution: Bar Chart",
       x = "Category", y = "Count") +
  theme_minimal(base_size = 13) +
  ylim(0, max(cat_summary$count) * 1.2)
p_pie <- ggplot(cat_summary, aes(x = "", y = count, fill = performance_cat)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y") +
  scale_fill_manual(values = palette_cat, name = "Category") +
  geom_text(aes(label = paste0(percentage, "%")),
            position = position_stack(vjust = 0.5),
            size = 4, color = "white", fontface = "bold") +
  labs(title = "Sales Performance Distribution: Pie Chart") +
  theme_void(base_size = 13)
grid.arrange(p_bar, p_pie, ncol = 2)
4.3 Interpretation
Across the 150 daily sales records, the share of each category follows the width of its band, because the data were generated uniformly at random between 100 and 2000. Good and Average dominate (26.7% and 27.3%) because their bands are the widest, 500 units each, together spanning 400 to 1400. Excellent (9.3%) and Poor (16.7%) are the smallest groups because their bands sit at the two extremes and are the narrowest. This resembles a typical sales picture in which most transactions fall in the middle range and only a small share are either outstanding or very low.
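The same categorization can be expressed without an element-by-element loop using base R's cut(). This is a sketch of an equivalent vectorized form, with the thresholds copied from categorize_performance(); the function name categorize_performance_cut and the test values are hypothetical.

```r
# Thresholds copied from categorize_performance(); the name is hypothetical.
categorize_performance_cut <- function(sales_amount) {
  as.character(cut(
    sales_amount,
    breaks = c(-Inf, 400, 900, 1400, 1800, Inf),
    labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
    right  = FALSE   # intervals are [lower, upper), matching the >= tests
  ))
}

# Values chosen to sit on or just below each threshold.
boundary_check <- categorize_performance_cut(
  c(399.99, 400, 899.99, 900, 1400, 1800)
)
# "Poor" "Average" "Average" "Good" "Very Good" "Excellent"

# Percentage share per category, as in the summary table.
shares <- round(prop.table(table(boundary_check)) * 100, 1)
```

Setting right = FALSE is the detail that matters here: with the default right-closed intervals, a value of exactly 400 would be labeled "Poor" rather than "Average".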
5 Task 4 — Multi-Company Dataset Simulation
Objective: Simulate employee data for multiple companies with a full set of attributes.
Description:
The task specifies a function generate_company_data(n_company, n_employees) that produces a dataset with the columns company_id, employee_id, salary, department, performance_score (0 to 100), and KPI_score (0 to 100), using nested loops: an outer loop over companies and an inner loop over employees. In this report the dataset is loaded from company_data.csv (160 rows: 4 companies with 40 employees each), and generate_company_summary(df) applies the same nested-loop pattern to it. A conditional inside the inner loop flags top performers (KPI > 90). The summary reports, per company, the number of employees, the average salary, the average performance score, the maximum KPI, and the number of top performers. The task also calls for the average KPI and the department with the highest KPI, which the summary below does not compute.
Output:
Simulated employee dataset
Summary table per company
Bar plot of average salary per company
Bar plot of top performers (KPI > 90) per company
5.1 Load Data
company_df <- read.csv("company_data.csv")
cat("Dimensions:", nrow(company_df), "rows,", ncol(company_df), "columns\n")
## Dimensions: 160 rows, 6 columns
head(company_df, 10) %>%
  kable(
    caption = "Table 4.1: Preview of company_data.csv (first 10 rows)",
    col.names = c("Company ID", "Employee ID", "Salary", "Department",
                  "Performance Score", "KPI Score"),
    align = c("c", "c", "r", "l", "r", "r"),
    digits = 2,
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#0277BD") %>%
  column_spec(4, italic = TRUE, color = "#4A148C") %>%
  column_spec(5, color = "#1B5E20") %>%
  column_spec(6, bold = TRUE, color = "#E65100")
| Company ID | Employee ID | Salary | Department | Performance Score | KPI Score |
|---|---|---|---|---|---|
| 1 | 1 | 8,257.20 | Marketing | 75.9 | 45.4 |
| 1 | 2 | 3,768.31 | Finance | 51.1 | 64.9 |
| 1 | 3 | 5,642.61 | Operations | 53.5 | 68.4 |
| 1 | 4 | 3,808.80 | Finance | 51.6 | 54.9 |
| 1 | 5 | 5,856.06 | Operations | 83.4 | 49.6 |
| 1 | 6 | 11,680.23 | Finance | 94.1 | 93.6 |
| 1 | 7 | 12,415.43 | Finance | 90.4 | 87.9 |
| 1 | 8 | 10,907.79 | HR | 67.7 | 58.5 |
| 1 | 9 | 11,080.37 | HR | 99.2 | 86.5 |
| 1 | 10 | 11,738.56 | Engineering | 90.0 | 86.6 |
5.2 Function & Summary
# ── Nested loops + conditional: summary per company ──────────────────────────
generate_company_summary <- function(df) {
  company_ids <- unique(df$company_id)
  summary_list <- list()
  for (cid in company_ids) {
    sub_df <- df[df$company_id == cid, ]
    n_emp <- nrow(sub_df)
    top_performers <- 0
    for (i in 1:n_emp) {
      if (sub_df$KPI_score[i] > 90) {
        top_performers <- top_performers + 1
      }
    }
    summary_list[[length(summary_list) + 1]] <- data.frame(
      Company = paste("Company", cid),
      N_Employees = n_emp,
      Avg_Salary = round(mean(sub_df$salary), 2),
      Avg_Performance = round(mean(sub_df$performance_score), 2),
      Max_KPI = round(max(sub_df$KPI_score), 2),
      Top_Performers = top_performers
    )
  }
  return(do.call(rbind, summary_list))
}
company_summary <- generate_company_summary(company_df)
company_summary %>%
  kable(
    caption = "Table 4.2: Summary per company (salary, performance, and top performers)",
    col.names = c("Company", "Employees", "Avg Salary (IDR)",
                  "Avg Performance", "Max KPI", "Top Performers"),
    align = c("l", "c", "r", "r", "r", "c"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(3, color = "#1565C0", bold = TRUE) %>%
  column_spec(5, color = "#E65100", bold = TRUE) %>%
  column_spec(6, bold = TRUE, color = "white",
              background = ifelse(company_summary$Top_Performers >= 3,
                                  "#1B5E20", "#E53935"))
| Company | Employees | Avg Salary (IDR) | Avg Performance | Max KPI | Top Performers |
|---|---|---|---|---|---|
| Company 1 | 40 | 8,091.14 | 77.09 | 99.9 | 6 |
| Company 2 | 40 | 8,983.58 | 78.89 | 98.4 | 12 |
| Company 3 | 40 | 8,568.23 | 75.11 | 99.5 | 10 |
| Company 4 | 40 | 9,087.17 | 74.92 | 99.5 | 7 |
5.3 Visualization
p1 <- ggplot(company_summary, aes(x = Company, y = Avg_Salary, fill = Company)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = scales::comma(Avg_Salary)), vjust = -0.4, size = 3.5) +
  scale_fill_brewer(palette = "Blues", direction = -1) +
  labs(title = "Average Salary per Company", x = "", y = "Avg Salary") +
  theme_minimal(base_size = 13) +
  ylim(0, max(company_summary$Avg_Salary) * 1.15)
p2 <- ggplot(company_summary, aes(x = Company, y = Top_Performers, fill = Company)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = Top_Performers), vjust = -0.4, size = 4, fontface = "bold") +
  scale_fill_brewer(palette = "Oranges", direction = -1) +
  labs(title = "Top Performers (KPI > 90) per Company", x = "", y = "Count") +
  theme_minimal(base_size = 13) +
  ylim(0, max(company_summary$Top_Performers) * 1.2)
grid.arrange(p1, p2, ncol = 2)
5.4 Interpretation
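The per-company summary shows clear variation between companies. Average salaries differ, from 8,091.14 in Company 1 to 9,087.17 in Company 4, because salaries were generated at random over a wide range (3000 to 15000). The number of top performers also varies, from 6 in Company 1 to 12 in Company 2; this count is determined by the KPI > 90 rule applied in the inner loop. The top-performer count does not track the maximum KPI: Company 1 has the fewest top performers but the highest Max KPI (99.9), while Company 2 has the most top performers but the lowest Max KPI (98.4). The maximum reflects a single employee, whereas the count reflects the upper tail of the whole distribution, which is the kind of heterogeneous performance spread often seen in real organizations.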
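The task description also asks for the average KPI per company and the department with the highest KPI, which the summary function above does not report. A minimal sketch of how these could be computed with base R follows. The toy data frame and all of its values are made up for illustration; only the column names company_id, department, and KPI_score come from the report.

```r
# Toy data: every value here is hypothetical.
toy_company <- data.frame(
  company_id = c(1, 1, 1, 2, 2, 2),
  department = c("HR", "Finance", "Finance", "HR", "HR", "Engineering"),
  KPI_score  = c(60, 95, 85, 92, 70, 50)
)

# Average KPI per company.
avg_kpi <- tapply(toy_company$KPI_score, toy_company$company_id, mean)

# Number of top performers (KPI > 90) per company.
top_count <- tapply(toy_company$KPI_score > 90, toy_company$company_id, sum)

# Department with the highest mean KPI within each company.
best_dept <- sapply(split(toy_company, toy_company$company_id), function(d) {
  dept_means <- tapply(d$KPI_score, d$department, mean)
  names(dept_means)[which.max(dept_means)]
})
```

In this toy example Company 1's best department is Finance (mean 90) and Company 2's is HR (mean 81), even though both companies have exactly one top performer.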
6 Task 5 — Monte Carlo Simulation: π & Probability
Objective: Estimate the value of π with the Monte Carlo method and analyze a related probability.
Description:
The function monte_carlo_pi(n_points) draws n_points random points (x, y), with each coordinate uniform on [-1, 1]. A point lies inside the unit circle centered at the origin (radius 1) when x^2 + y^2 <= 1. Because the circle has area π and the enclosing square has area 4, π is estimated as:
\[ \pi \approx 4 \times \frac{\text{number of points inside the circle}}{\text{total number of points}} \]
The function also estimates the probability that a random point falls inside a sub-area, here the central sub-square |x| <= 0.5, |y| <= 0.5. A loop runs the simulation for several sample sizes to show how the estimate of π converges. A scatter plot distinguishes points inside the circle (blue) from points outside it (red).
Output:
Estimate of π
Probability of a point falling in the sub-area
Scatter plot of points inside versus outside the circle
6.1 Function & Simulation
# ── Monte Carlo function ─────────────────────────────────────────────────────
monte_carlo_pi <- function(n_points) {
  set.seed(123)  # fixed seed: every call restarts the same random stream
  x <- runif(n_points, -1, 1)
  y <- runif(n_points, -1, 1)
  inside_circle <- (x^2 + y^2) <= 1
  pi_estimate <- 4 * sum(inside_circle) / n_points
  in_subsquare <- abs(x) <= 0.5 & abs(y) <= 0.5
  prob_subsquare <- sum(in_subsquare) / n_points
  return(list(
    pi_estimate = pi_estimate,
    prob_subsquare = prob_subsquare,
    x = x,
    y = y,
    inside_circle = inside_circle
  ))
}
# ── Convergence across several sample sizes n ────────────────────────────────
n_values <- c(100, 500, 1000, 5000, 10000)
pi_results <- data.frame()
for (n in n_values) {
  res <- monte_carlo_pi(n)
  pi_results <- rbind(pi_results, data.frame(
    n_points = n,
    pi_estimate = round(res$pi_estimate, 5),
    error = round(abs(res$pi_estimate - pi), 5),
    prob_subsquare = round(res$prob_subsquare, 4)
  ))
}
cat("Actual pi:", pi, "\n")
## Actual pi: 3.141593
pi_results %>%
  kable(
    caption = "Table 5.1: Monte Carlo estimates of π (various sample sizes)",
    col.names = c("N Points", "π Estimate", "Absolute Error", "P(Sub-Square)"),
    align = c("r", "r", "r", "r"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "#1565C0") %>%
  column_spec(2, bold = TRUE) %>%
  column_spec(3, bold = TRUE,
              color = ifelse(pi_results$error < 0.01, "#1B5E20", "#E65100")) %>%
  row_spec(nrow(pi_results), background = "#E8F5E9")
| N Points | π Estimate | Absolute Error | P(Sub-Square) |
|---|---|---|---|
| 100 | 3.4000 | 0.25841 | 0.2600 |
| 500 | 3.2000 | 0.05841 | 0.2400 |
| 1,000 | 3.2000 | 0.05841 | 0.2650 |
| 5,000 | 3.1632 | 0.02161 | 0.2630 |
| 10,000 | 3.1576 | 0.01601 | 0.2507 |
6.2 Visualization
res_plot <- monte_carlo_pi(2000)
plot_df <- data.frame(
  x = res_plot$x,
  y = res_plot$y,
  status = ifelse(res_plot$inside_circle, "Inside Circle", "Outside Circle")
)
p_scatter <- ggplot(plot_df, aes(x = x, y = y, color = status)) +
  geom_point(alpha = 0.5, size = 0.8) +
  scale_color_manual(values = c("Inside Circle" = "#1565C0",
                                "Outside Circle" = "#E53935")) +
  coord_fixed() +
  annotate("path",
           x = cos(seq(0, 2 * pi, length.out = 300)),
           y = sin(seq(0, 2 * pi, length.out = 300)),
           color = "black", linewidth = 0.8) +
  labs(
    title = "Monte Carlo Simulation (n = 2000)",
    subtitle = paste0("pi estimate = ", round(res_plot$pi_estimate, 4)),
    color = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")
p_conv <- ggplot(pi_results, aes(x = n_points, y = pi_estimate)) +
  geom_line(color = "#1565C0", linewidth = 1.2) +
  geom_point(color = "#E53935", size = 3) +
  geom_hline(yintercept = pi, linetype = "dashed", color = "gray40") +
  geom_text(aes(label = round(pi_estimate, 4)), vjust = -0.8, size = 3.5) +
  annotate("text", x = max(pi_results$n_points) * 0.8, y = pi + 0.01,
           label = "True pi", color = "gray40") +
  scale_x_log10(labels = scales::comma) +
  labs(
    title = "Convergence of pi Estimate vs Sample Size",
    x = "Number of Points (log scale)",
    y = "pi Estimate"
  ) +
  theme_minimal(base_size = 13)
grid.arrange(p_scatter, p_conv, ncol = 2)
6.3 Interpretation
The Monte Carlo simulation throws random points into a square and counts how many land inside the inscribed circle. Multiplying that ratio by 4 gives an estimate of π. In this run the absolute error does not increase as n grows: it falls from 0.258 at n = 100 to 0.016 at n = 10,000, which is about 0.5% of the true value (3.14159). The probability of a point falling in the sub-square is about 25% (0.2507 at n = 10,000), which matches the geometry: the sub-square's area (1 x 1) is one quarter of the full square's area (2 x 2).
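How fast the estimate should converge can be stated explicitly. Each point is a Bernoulli trial with success probability p = π/4, so the estimator 4 * (fraction inside) has standard error 4 * sqrt(p(1 - p)/n). The sketch below computes this for the sample sizes used above and runs an independent simulation with the seed set once outside the loop; the seed value 2024 and the object names are arbitrary choices made for illustration, not part of the report.

```r
n_values <- c(100, 500, 1000, 5000, 10000)
p_inside <- pi / 4   # true probability that a point lands inside the circle

# Theoretical standard error of the pi estimate for each sample size.
std_error <- 4 * sqrt(p_inside * (1 - p_inside) / n_values)

# Hypothetical seed, set once so that each sample size draws fresh points.
set.seed(2024)
estimates <- sapply(n_values, function(n) {
  x <- runif(n, -1, 1)
  y <- runif(n, -1, 1)
  4 * mean(x^2 + y^2 <= 1)
})

convergence <- data.frame(
  n_points    = n_values,
  pi_estimate = estimates,
  std_error   = round(std_error, 4)
)
```

The standard error shrinks in proportion to 1/sqrt(n): about 0.164 at n = 100 and about 0.016 at n = 10,000, so the error of 0.016 observed at n = 10,000 in the table above is roughly one standard error. Gaining one more decimal digit of accuracy requires about 100 times as many points.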
7 Task 6 — Advanced Data Transformation & Feature Engineering
Objective: Transform data and create new features using a loop-based approach.
Description:
Two transformation functions are defined: normalize_columns(df, cols) performs min-max normalization (range 0 to 1), and z_score(df, cols) performs standardization (mean = 0, sd = 1). Both functions loop over the numeric columns named in cols. After the transformations, the program creates two new features: performance_category (from performance_score) and salary_bracket (from fixed salary thresholds at 5,000, 8,000, and 11,000). The distributions before and after transformation are compared side by side using histograms and boxplots.
Output:
Normalized data frame
Z-score data frame
Histograms comparing the distributions
Boxplots comparing the distributions
7.1 Load Data & Functions
df6 <- read.csv("company_data.csv")
numeric_cols <- c("salary", "performance_score", "KPI_score")
cat("Dimensions:", nrow(df6), "rows,", ncol(df6), "columns\n")
## Dimensions: 160 rows, 6 columns
# ── Min-Max Normalization (loop-based) ───────────────────────────────────────
normalize_columns <- function(df, cols) {
  df_norm <- df
  for (col in cols) {
    mn <- min(df[[col]], na.rm = TRUE)
    mx <- max(df[[col]], na.rm = TRUE)
    df_norm[[col]] <- round((df[[col]] - mn) / (mx - mn), 4)
  }
  return(df_norm)
}
# ── Z-Score Standardization (loop-based) ─────────────────────────────────────
z_score <- function(df, cols) {
  df_z <- df
  for (col in cols) {
    mu <- mean(df[[col]], na.rm = TRUE)
    sigma <- sd(df[[col]], na.rm = TRUE)
    df_z[[col]] <- round((df[[col]] - mu) / sigma, 4)
  }
  return(df_z)
}
df_norm <- normalize_columns(df6, numeric_cols)
df_z <- z_score(df6, numeric_cols)
# ── Feature Engineering ───────────────────────────────────────────────────────
# Note: categorize_performance() uses the sales thresholds from Task 3 (400 and up),
# so every score on the 0-100 scale is labeled "Poor".
df6$performance_category <- categorize_performance(df6$performance_score)
df6$salary_bracket <- cut(
  df6$salary,
  breaks = c(0, 5000, 8000, 11000, Inf),
  labels = c("Low", "Medium", "High", "Very High"),
  include.lowest = TRUE
)
cat("New features added: performance_category and salary_bracket.\n")
## New features added: performance_category and salary_bracket.
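Two variations on the feature engineering step may be worth sketching: salary brackets derived from quartiles of the data rather than fixed thresholds, and performance categories defined on the 0 to 100 score scale. The example below uses the ten employees shown in the preview of company_data.csv; the score bands (60, 70, 80, 90) and all object names are hypothetical choices for illustration, not values taken from the report.

```r
# Salaries and scores of the first ten employees in the preview table.
salary <- c(8257.20, 3768.31, 5642.61, 3808.80, 5856.06,
            11680.23, 12415.43, 10907.79, 11080.37, 11738.56)
performance_score <- c(75.9, 51.1, 53.5, 51.6, 83.4,
                       94.1, 90.4, 67.7, 99.2, 90.0)

# Quartile-based brackets: the breaks come from the data itself.
quartiles <- quantile(salary, probs = seq(0, 1, by = 0.25))
salary_quartile <- cut(salary,
                       breaks = quartiles,
                       labels = c("Q1", "Q2", "Q3", "Q4"),
                       include.lowest = TRUE)

# Hypothetical bands on the 0-100 score scale (not taken from the report).
score_category <- cut(performance_score,
                      breaks = c(0, 60, 70, 80, 90, 100),
                      labels = c("Poor", "Average", "Good",
                                 "Very Good", "Excellent"),
                      right = FALSE, include.lowest = TRUE)
```

Quartile breaks give groups of nearly equal size by construction, whereas fixed thresholds keep the bracket meaning stable when the data changes; which is preferable depends on how the brackets will be used.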
7.2 Preview Data
# ── Original data ────────────────────────────────────────────────────────────
head(df6, 10) %>%
  kable(
    caption = "Table 6.1: Original data (before transformation)",
    col.names = c("Company ID", "Employee ID", "Salary", "Department",
                  "Performance Score", "KPI Score",
                  "Performance Category", "Salary Bracket"),
    align = c("c", "c", "r", "l", "r", "r", "l", "l"),
    digits = 2,
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width = TRUE,
    font_size = 13
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "#0277BD") %>%
  column_spec(3, color = "#1B5E20", bold = TRUE) %>%
  column_spec(5, color = "#6A1B9A") %>%
  column_spec(7, italic = TRUE, color = "#E65100") %>%
  column_spec(8, italic = TRUE, color = "#0277BD")
| Company ID | Employee ID | Salary | Department | Performance Score | KPI Score | Performance Category | Salary Bracket |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 8,257.20 | Marketing | 75.9 | 45.4 | Poor | High |
| 1 | 2 | 3,768.31 | Finance | 51.1 | 64.9 | Poor | Low |
| 1 | 3 | 5,642.61 | Operations | 53.5 | 68.4 | Poor | Medium |
| 1 | 4 | 3,808.80 | Finance | 51.6 | 54.9 | Poor | Low |
| 1 | 5 | 5,856.06 | Operations | 83.4 | 49.6 | Poor | Medium |
| 1 | 6 | 11,680.23 | Finance | 94.1 | 93.6 | Poor | Very High |
| 1 | 7 | 12,415.43 | Finance | 90.4 | 87.9 | Poor | Very High |
| 1 | 8 | 10,907.79 | HR | 67.7 | 58.5 | Poor | High |
| 1 | 9 | 11,080.37 | HR | 99.2 | 86.5 | Poor | Very High |
| 1 | 10 | 11,738.56 | Engineering | 90.0 | 86.6 | Poor | Very High |
# ── After Min-Max Normalization ──────────────────────────────────────────────
head(df_norm[, numeric_cols], 10) %>%
  kable(
    caption = "Table 6.2: After Min-Max Normalization (range [0, 1])",
    col.names = c("Salary (norm)", "Performance Score (norm)", "KPI Score (norm)"),
    align = c("r", "r", "r"),
    digits = 4
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  column_spec(1, color = "#1565C0", bold = TRUE) %>%
  column_spec(2, color = "#1B5E20") %>%
  column_spec(3, color = "#E65100")
| Salary (norm) | Performance Score (norm) | KPI Score (norm) |
|---|---|---|
| 0.4408 | 0.5181 | 0.0886 |
| 0.0576 | 0.0201 | 0.4147 |
| 0.2176 | 0.0683 | 0.4732 |
| 0.0610 | 0.0301 | 0.2475 |
| 0.2358 | 0.6687 | 0.1589 |
| 0.7330 | 0.8835 | 0.8946 |
| 0.7958 | 0.8092 | 0.7993 |
| 0.6671 | 0.3534 | 0.3077 |
| 0.6818 | 0.9859 | 0.7759 |
| 0.7380 | 0.8012 | 0.7776 |
# ── After Z-Score Standardization ────────────────────────────────────────────
head(df_z[, numeric_cols], 10) %>%
  kable(
    caption = "Table 6.3: After Z-Score Standardization (units of standard deviation)",
    col.names = c("Salary (z)", "Performance Score (z)", "KPI Score (z)"),
    align = c("r", "r", "r"),
    digits = 4
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  column_spec(1, color = "#1565C0", bold = TRUE) %>%
  column_spec(2, color = "#1B5E20") %>%
  column_spec(3, color = "#E65100")
| Salary (z) | Performance Score (z) | KPI Score (z) |
|---|---|---|
| -0.1203 | -0.0386 | -1.5170 |
| -1.3894 | -1.6285 | -0.4422 |
| -0.8595 | -1.4747 | -0.2493 |
| -1.3780 | -1.5965 | -0.9934 |
| -0.7991 | 0.4422 | -1.2855 |
| 0.8475 | 1.1282 | 1.1398 |
| 1.0554 | 0.8910 | 0.8256 |
| 0.6292 | -0.5643 | -0.7950 |
| 0.6779 | 1.4551 | 0.7484 |
| 0.8640 | 0.8653 | 0.7540 |
7.3 Visualization
p_sal_before <- ggplot(df6, aes(x = salary)) +
geom_histogram(bins = 20, fill = "#1565C0", alpha = 0.8, color = "white") +
labs(title = "Salary — Original", x = "Salary", y = "Count") +
theme_minimal(base_size = 12)
p_sal_after <- ggplot(df_norm, aes(x = salary)) +
geom_histogram(bins = 20, fill = "#43A047", alpha = 0.8, color = "white") +
labs(title = "Salary — Normalized [0, 1]", x = "Normalized Salary", y = "Count") +
theme_minimal(base_size = 12)
p_perf_before <- ggplot(df6, aes(y = performance_score, x = factor(company_id),
fill = factor(company_id))) +
geom_boxplot(show.legend = FALSE, alpha = 0.8) +
scale_fill_brewer(palette = "Pastel1") +
labs(title = "Performance Score — Original", x = "Company", y = "Score") +
theme_minimal(base_size = 12)
p_perf_after <- ggplot(df_z, aes(y = performance_score, x = factor(company_id),
fill = factor(company_id))) +
geom_boxplot(show.legend = FALSE, alpha = 0.8) +
scale_fill_brewer(palette = "Pastel2") +
labs(title = "Performance Score — Z-Score", x = "Company", y = "Z-Score") +
theme_minimal(base_size = 12)
pc_sum <- df6 %>%
group_by(performance_category) %>%
summarise(n = n()) %>%
mutate(pct = round(n / sum(n) * 100, 1))
p_bracket <- ggplot(df6, aes(x = salary_bracket, fill = salary_bracket)) +
geom_bar(show.legend = FALSE, width = 0.6, color = "white") +
scale_fill_manual(values = c(
"Low" = "#EF9A9A",
"Medium" = "#FFF176",
"High" = "#A5D6A7",
"Very High" = "#90CAF9"
)) +
labs(title = "Salary Bracket Distribution", x = "Bracket", y = "Count") +
theme_minimal(base_size = 12)
p_perf_cat <- ggplot(pc_sum, aes(x = "", y = n, fill = performance_category)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y") +
geom_text(aes(label = paste0(pct, "%")),
position = position_stack(vjust = 0.5),
size = 3.5, color = "white", fontface = "bold") +
scale_fill_brewer(palette = "Set2", name = "Category") +
labs(title = "Performance Category Distribution") +
theme_void(base_size = 12)
grid.arrange(p_sal_before, p_sal_after,
p_perf_before, p_perf_after,
p_bracket, p_perf_cat, ncol = 2)
7.4 Interpretation
Normalization and standardization rescale the data without changing the shape of its distribution. After min-max normalization, every salary value falls in the range 0 to 1, which makes it easier to compare variables measured in different units. The z-score produces a distribution with mean 0 and standard deviation 1, which is useful for outlier detection. Feature engineering adds categorical information that can be used directly for segmentation analysis.
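The two rescaling steps described above can be sketched in a few lines of base R. This is only an illustration under assumptions: the helper names `min_max` and `z_score` and the sample salaries are hypothetical and are not taken from the report's own code.

```r
# Hypothetical helpers; the report's own implementation may differ.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant column has zero range; return zeros instead of dividing by 0
  if (rng[1] == rng[2]) return(x - rng[1])
  (x - rng[1]) / (rng[2] - rng[1])
}

z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

salary <- c(4000, 5500, 7000, 8500, 10000)  # illustrative values only

salary_norm <- min_max(salary)  # rescaled to [0, 1]
salary_z    <- z_score(salary)  # centered at 0 with unit standard deviation

# Both transformations are linear, so the ordering of observations is unchanged
same_order <- identical(order(salary), order(salary_norm)) &&
  identical(order(salary), order(salary_z))
```

Because both transformations are linear, histograms before and after differ only in their axis scale, which is why the shape of the distribution is preserved.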
8 Task 7 — Mini Project: Company KPI Dashboard
Objective: Build a comprehensive KPI dashboard for multi-company analysis.
Description:
This mini project combines the concepts from the previous tasks. The dataset is generated for 5–10 companies with 50–200 employees each. The available columns are employee_id, company_id, salary, performance_score, KPI_score, and department. The program loops over the data to compute a per-company summary, groups employees into KPI tiers (High: KPI ≥ 80, Medium: 60–79, Low: < 60), and analyzes performance by department. Advanced visualizations include a grouped bar chart comparing KPI across companies and a scatter plot with a regression line showing the relationship between salary and performance_score.
Output:
Per-company summary table
Top performers table (KPI > 90)
Grouped bar chart of KPI per company
Scatter plot of salary vs performance with a regression line
Per-department analysis (lowest and highest average KPI)
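The dataset description above could be realized roughly as follows. This is a minimal sketch, not the generator used for the report: the seed, the salary range, the score distributions, and the names `sim_df` and `kpi_tier3` are all assumptions, and the department list contains only the names visible in the data preview.

```r
set.seed(42)  # arbitrary seed, used only to make this sketch reproducible

# Department names seen in the preview table; the real data may contain more
departments <- c("R&D", "Legal", "Customer Service", "Supply Chain", "IT")
n_companies <- sample(5:10, 1)

# Loop over companies, generating 50-200 employees for each
company_list <- vector("list", n_companies)
for (i in seq_len(n_companies)) {
  n_emp <- sample(50:200, 1)
  company_list[[i]] <- data.frame(
    company_id        = paste0("CO", i),
    salary            = round(runif(n_emp, 4000, 12000), 2),  # assumed range
    department        = sample(departments, n_emp, replace = TRUE),
    # Scores are clamped to [0, 100]; the means and spreads are assumptions
    performance_score = round(pmin(pmax(rnorm(n_emp, 70, 12), 0), 100), 1),
    KPI_score         = round(pmin(pmax(rnorm(n_emp, 70, 13), 0), 100), 1)
  )
}
sim_df <- do.call(rbind, company_list)
sim_df$employee_id <- seq_len(nrow(sim_df))

# Three tiers from the description: Low < 60, Medium 60 to under 80, High >= 80
sim_df$kpi_tier3 <- cut(
  sim_df$KPI_score,
  breaks = c(-Inf, 60, 80, Inf),
  labels = c("Low", "Medium", "High"),
  right  = FALSE  # intervals are closed on the left: [60, 80)
)

tier_counts <- table(sim_df$company_id, sim_df$kpi_tier3)
```

With `right = FALSE`, a score of exactly 80 falls in the High tier and a score such as 79.5 stays in Medium, which matches the thresholds as stated.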
8.1 Load Data
## Total employees : 988
## Total companies: 7
head(dash_df, 10) %>%
kable(
caption = "Table 7.1: Preview of dashboard_data.csv (first 10 rows)",
col.names = c("Employee ID", "Company ID", "Salary", "Department",
"Performance Score", "KPI Score"),
align = c("c", "c", "r", "l", "r", "r"),
digits = 2,
format.args = list(big.mark = ",")
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "bordered"),
full_width = FALSE,
position = "center",
font_size = 13
) %>%
column_spec(2, bold = TRUE, color = "white", background = "#0277BD") %>%
column_spec(3, color = "#1B5E20", bold = TRUE) %>%
column_spec(4, italic = TRUE, color = "#4A148C") %>%
column_spec(5, color = "#424242") %>%
column_spec(6, bold = TRUE, color = "#E65100")
| Employee ID | Company ID | Salary | Department | Performance Score | KPI Score |
|---|---|---|---|---|---|
| 1 | CO1 | 7,790.52 | R&D | 58.7 | 68.5 |
| 2 | CO1 | 8,395.58 | Legal | 85.1 | 77.9 |
| 3 | CO1 | 4,062.58 | Customer Service | 71.9 | 66.9 |
| 4 | CO1 | 6,648.91 | Supply Chain | 77.4 | 80.6 |
| 5 | CO1 | 8,809.21 | Legal | 56.7 | 58.9 |
| 6 | CO1 | 4,820.13 | IT | 91.7 | 76.5 |
| 7 | CO1 | 4,953.22 | Customer Service | 70.8 | 65.8 |
| 8 | CO1 | 7,035.43 | Legal | 75.7 | 69.3 |
| 9 | CO1 | 9,350.49 | Legal | 70.4 | 79.4 |
| 10 | CO1 | 7,378.56 | Supply Chain | 70.4 | 55.7 |
8.2 KPI Tier Categorization & Summary
# ── Loop: KPI tier categorization ─────────────────────────────────────────────
# Preallocate the result vector instead of growing it on every iteration
kpi_tier <- character(nrow(dash_df))
for (i in seq_along(dash_df$KPI_score)) {
kpi <- dash_df$KPI_score[i]
kpi_tier[i] <- if (kpi >= 90) {
"Tier 1 - Elite"
} else if (kpi >= 75) {
"Tier 2 - High"
} else if (kpi >= 60) {
"Tier 3 - Average"
} else {
"Tier 4 - Low"
}
}
dash_df$kpi_tier <- kpi_tier
# ── Summary per company ───────────────────────────────────────────────────────
company_kpi_summary <- dash_df %>%
group_by(company_id) %>%
summarise(
N_Employees = n(),
Avg_Salary = round(mean(salary), 2),
Avg_KPI = round(mean(KPI_score), 2),
Top_Performers = sum(KPI_score >= 90),
.groups = "drop"
)
company_kpi_summary %>%
kable(
caption = "Table 7.2: KPI Summary per Company",
col.names = c("Company", "Employees", "Avg Salary (IDR)",
"Avg KPI", "Top Performers (KPI >= 90)"),
align = c("c", "c", "r", "r", "c"),
format.args = list(big.mark = ",")
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "bordered"),
full_width = FALSE,
position = "center",
font_size = 13
) %>%
column_spec(1, bold = TRUE, color = "white", background = "#0277BD") %>%
column_spec(3, color = "#1B5E20", bold = TRUE) %>%
column_spec(4, bold = TRUE,
color = ifelse(company_kpi_summary$Avg_KPI >= 70, "#1B5E20", "#E65100")) %>%
column_spec(5, bold = TRUE, color = "white",
background = ifelse(company_kpi_summary$Top_Performers >= 5,
"#1B5E20", "#E53935"))
| Company | Employees | Avg Salary (IDR) | Avg KPI | Top Performers (KPI >= 90) |
|---|---|---|---|---|
| CO1 | 153 | 7,553.79 | 67.93 | 6 |
| CO2 | 192 | 7,986.33 | 68.65 | 13 |
| CO3 | 164 | 8,488.27 | 69.36 | 12 |
| CO4 | 148 | 8,810.85 | 70.31 | 9 |
| CO5 | 114 | 9,173.87 | 68.52 | 3 |
| CO6 | 75 | 9,916.63 | 70.64 | 3 |
| CO7 | 142 | 10,624.84 | 68.96 | 11 |
# ── KPI tier distribution ─────────────────────────────────────────────────────
tier_order <- c("Tier 1 - Elite", "Tier 2 - High", "Tier 3 - Average", "Tier 4 - Low")
tier_summary <- dash_df %>%
group_by(kpi_tier) %>%
summarise(
Jumlah = n(),
Avg_KPI = round(mean(KPI_score), 2),
Avg_Salary = round(mean(salary), 2),
.groups = "drop"
) %>%
mutate(
Persentase = round(Jumlah / sum(Jumlah) * 100, 1),
kpi_tier = factor(kpi_tier, levels = tier_order)
) %>%
arrange(kpi_tier)
tier_summary %>%
kable(
caption = "Table 7.3: Distribution of Employees by KPI Tier (all companies)",
col.names = c("KPI Tier", "Count", "Avg KPI", "Avg Salary (IDR)", "Percentage (%)"),
align = c("l", "c", "r", "r", "c"),
format.args = list(big.mark = ",")
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "bordered"),
full_width = FALSE,
position = "center",
font_size = 13
) %>%
column_spec(1, bold = TRUE) %>%
column_spec(2, color = "#1565C0", bold = TRUE) %>%
column_spec(3, bold = TRUE,
color = c("#1B5E20", "#388E3C", "#E65100", "#C62828")) %>%
column_spec(4, color = "#424242") %>%
column_spec(5, color = "#6A1B9A") %>%
row_spec(1, background = "#E8F5E9") %>%
row_spec(4, background = "#FFEBEE")
| KPI Tier | Count | Avg KPI | Avg Salary (IDR) | Percentage (%) |
|---|---|---|---|---|
| Tier 1 - Elite | 57 | 93.34 | 9,132.32 | 5.8 |
| Tier 2 - High | 281 | 81.12 | 8,703.80 | 28.4 |
| Tier 3 - Average | 397 | 67.77 | 8,721.79 | 40.2 |
| Tier 4 - Low | 253 | 52.31 | 8,911.54 | 25.6 |
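A per-department view, listed among the outputs for this task (lowest and highest average KPI), could be computed along the following lines. The sketch uses base R and a small toy data frame standing in for `dash_df`. The values and the names `toy_df`, `dept_kpi`, `lowest`, and `highest` are illustrative assumptions, not results from the report.

```r
# Toy data standing in for dash_df; the values are illustrative only
toy_df <- data.frame(
  department = c("IT", "IT", "Legal", "Legal", "R&D", "R&D"),
  KPI_score  = c(76.5, 80.5, 58.9, 69.3, 68.5, 90.5)
)

# Average KPI per department
dept_kpi <- aggregate(KPI_score ~ department, data = toy_df, FUN = mean)
names(dept_kpi)[2] <- "Avg_KPI"

# Departments with the lowest and highest average KPI
lowest  <- dept_kpi[which.min(dept_kpi$Avg_KPI), ]
highest <- dept_kpi[which.max(dept_kpi$Avg_KPI), ]
```

The same result could be obtained with `group_by(department)` and `summarise()` as in the other summaries. Note that `which.min()` and `which.max()` return only the first match when several departments tie.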
8.3 Visualizations
p_avg_kpi <- ggplot(company_kpi_summary,
aes(x = company_id, y = Avg_KPI, fill = company_id)) +
geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
geom_text(aes(label = round(Avg_KPI, 1)), vjust = -0.4, fontface = "bold", size = 4) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Average KPI Score per Company", x = "Company", y = "Avg KPI") +
theme_minimal(base_size = 13) +
ylim(0, 100)
p_top <- ggplot(company_kpi_summary,
aes(x = company_id, y = Top_Performers, fill = company_id)) +
geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
geom_text(aes(label = Top_Performers), vjust = -0.4, fontface = "bold", size = 4) +
scale_fill_brewer(palette = "Set1") +
labs(title = "Top Performers (KPI >= 90) per Company",
x = "Company", y = "Count") +
theme_minimal(base_size = 13) +
ylim(0, max(company_kpi_summary$Top_Performers) * 1.2)
grid.arrange(p_avg_kpi, p_top, ncol = 2)
dept_summary <- dash_df %>%
group_by(company_id, department) %>%
summarise(Avg_Salary = round(mean(salary), 2), .groups = "drop")
ggplot(dept_summary, aes(x = department, y = Avg_Salary, fill = company_id)) +
geom_bar(stat = "identity", position = "dodge", width = 0.7) +
scale_fill_brewer(palette = "Dark2", name = "Company") +
labs(
title = "Average Salary by Department and Company",
subtitle = "Grouped Bar Chart",
x = "Department",
y = "Avg Salary"
) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 30, hjust = 1),
legend.position = "bottom"
)
p_scatter <- ggplot(dash_df, aes(x = performance_score, y = KPI_score,
color = company_id)) +
geom_point(alpha = 0.35, size = 1.5) +
geom_smooth(aes(group = 1), method = "lm", color = "black",
se = TRUE, linewidth = 1.2) +
scale_color_brewer(palette = "Set2", name = "Company") +
labs(
title = "Performance Score vs KPI Score",
subtitle = "Scatter plot with regression line",
x = "Performance Score",
y = "KPI Score"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "bottom")
p_sal_tier <- ggplot(dash_df, aes(x = kpi_tier, y = salary, fill = kpi_tier)) +
geom_boxplot(show.legend = FALSE, alpha = 0.8, outlier.size = 1) +
scale_fill_manual(values = c(
"Tier 1 - Elite" = "#1B5E20",
"Tier 2 - High" = "#388E3C",
"Tier 3 - Average" = "#FBC02D",
"Tier 4 - Low" = "#C62828"
)) +
labs(title = "Salary Distribution by KPI Tier",
x = "KPI Tier", y = "Salary") +
theme_minimal(base_size = 13) +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
grid.arrange(p_scatter, p_sal_tier, ncol = 2)
8.4 Interpretation
The KPI dashboard covers 7 companies and 988 employees and gives a fairly comprehensive picture. Average KPI differs little across companies (67.93 to 70.64), which indicates relatively balanced performance. The scatter plot of performance score against KPI score shows a clear positive correlation: the higher the performance, the higher the KPI, although there is noise because KPI is also influenced by other factors. In the boxplot of salary by KPI tier, the salary distributions do not differ drastically between tiers, because in this dataset salaries were generated with variation that does not depend directly on KPI. The differences in average salary between departments shown in the grouped bar chart could feed further analysis of compensation policy.
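The positive relationship read off the scatter plot can also be quantified. The sketch below uses simulated data, since the relationship between the two scores in `dash_df` is not documented here. The slope, intercept, and noise level are assumptions chosen only to produce a noisy positive trend.

```r
set.seed(1)  # arbitrary seed for reproducibility of this sketch
n <- 200
performance_score <- runif(n, 40, 100)
# Assumed linear relationship plus noise standing in for "other factors"
KPI_score <- 20 + 0.7 * performance_score + rnorm(n, sd = 8)

# Pearson correlation between the two scores
r <- cor(performance_score, KPI_score)

# Same model that geom_smooth(method = "lm") fits for the regression line
fit       <- lm(KPI_score ~ performance_score)
slope     <- unname(coef(fit)["performance_score"])
r_squared <- summary(fit)$r.squared
```

For a simple linear regression with one predictor, `r_squared` equals `r^2`, so the correlation coefficient and the fitted line describe the same strength of association.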
9 Conclusion
This lab implemented functions, loops, conditionals, and visualization across seven data science tasks. Functions make code reusable, nested loops handle hierarchical data, conditionals enable dynamic decision-making, and visualization presents insights intuitively. Together, these four elements form an important foundation for building efficient and well-structured data science workflows.