LSP Politeknik Statistika STIS — 2026
Skema: Data Scientist (DS)
Studi Kasus: Prediksi Risiko Stunting Balita
Nama Asesi: Fajar Fithra Ramadhan
NIM: 212212594
Kelas/No Absen: 4SK2 / 10
Kode Unit: J.62DMI00.001.1 | Judul Unit: Menentukan Objektif Bisnis
Bukti ini berhubungan dengan pengetahuan, keterampilan, dan sikap kerja yang dibutuhkan dalam menentukan objektif bisnis dari proyek data science prediksi risiko stunting pada balita di 50 kabupaten/kota prioritas Indonesia, sebagai bagian dari Program Percepatan Penurunan Stunting (PPS) Nasional 2025–2026.
Stunting pada balita merupakan kondisi gagal tumbuh akibat kekurangan gizi kronis yang berdampak jangka panjang pada kualitas sumber daya manusia Indonesia. Berdasarkan data terkini, prevalensi stunting nasional masih berada di angka 21,5%, jauh dari target nasional sebesar <14% pada tahun 2026. Salah satu kendala utama adalah intervensi PMT (Pemberian Makanan Tambahan) yang belum tepat sasaran akibat tidak adanya sistem deteksi dini berbasis data untuk mengidentifikasi balita yang berisiko stunting secara lebih awal dan akurat. Kementerian Kesehatan bekerja sama dengan BPS membentuk Tim Data Science guna membangun model prediksi risiko stunting berbasis data survei gizi balita di 50 kabupaten/kota prioritas.
| No | Objektif Bisnis | Metrik Kesuksesan (Kuantitatif) |
|---|---|---|
| 1 | Membangun model prediksi risiko stunting yang akurat untuk mendeteksi balita berisiko secara dini | Recall/Sensitivity ≥ 80% pada data uji; AUC-ROC ≥ 0,80 |
| 2 | Menurunkan proporsi balita stunting yang tidak terdeteksi (False Negative) agar intervensi PMT lebih tepat sasaran | False Negative Rate < 20%; cakupan intervensi tepat sasaran meningkat ≥ 30% |
| 3 | Mengidentifikasi faktor-faktor risiko dominan stunting di 50 kab/kota prioritas sebagai dasar kebijakan | Minimal 5 variabel prediktor signifikan (p < 0,05); model menjelaskan pola risiko per wilayah |
| 4 | Mendukung percepatan penurunan prevalensi stunting nasional dari 21,5% menuju target <14% pada 2026 | Penurunan prevalensi ≥ 7,5 poin persentase dalam 2 tahun dengan dukungan sistem deteksi |
| Terminologi | Definisi |
|---|---|
| Stunting | Kondisi gagal tumbuh pada balita akibat kekurangan gizi kronis, ditandai Z-score TB/U < -2 SD (WHO) |
| Z-score TB/U | Nilai standar tinggi badan menurut umur; indikator utama status stunting |
| PMT | Pemberian Makanan Tambahan; program intervensi gizi bagi balita berisiko |
| Class Imbalance | Kondisi distribusi kelas target yang tidak seimbang dalam dataset |
| CRISP-DM | Cross Industry Standard Process for Data Mining; metodologi standar proyek data science |
| Recall/Sensitivity | Metrik evaluasi kemampuan model mendeteksi kasus positif (Stunting) yang sebenarnya |
| False Negative | Kondisi saat balita Stunting diprediksi sebagai Normal oleh model |
| Feature Engineering | Proses pembuatan fitur baru dari data mentah untuk meningkatkan performa model |
data_stunting_raw.csv) dianggap representatif untuk mewakili populasi balita di 50 kab/kota prioritas| No | Jenis Risiko | Deskripsi Risiko | Rencana Mitigasi |
|---|---|---|---|
| 1 | Data | Missing values pada bb_lahir (20), pend_ibu (15), kunjungan_posyandu (10); outlier ekstrem pada zscore_tbu | Imputasi median per kelompok usia (numerik) & modus (kategorik); penanganan outlier dengan batas WHO; dokumentasi lengkap |
| 2 | Kebijakan | Perubahan kebijakan intervensi stunting selama proyek dapat membuat model tidak relevan | Koordinasi rutin dengan Kemenkes & BPS; penyusunan klausul pembaruan model berkala (min. 1×/tahun) |
| 3 | Teknis | Overfitting akibat data terbatas (500 obs) & class imbalance Normal:Stunting = 2,88:1 | 5-fold stratified cross-validation; teknik oversampling (ROSE/SMOTE); metrik utama: Recall & AUC-ROC |
| 4 | Etika | Data balita bersifat sensitif; penggunaan tidak bertanggung jawab berpotensi menimbulkan stigmatisasi | Anonimisasi data; izin penggunaan dari instansi berwenang; tidak publikasi data individual; mematuhi regulasi PDP |
| 5 | Operasional | Keterbatasan kapasitas SDM kab/kota dalam mengoperasikan sistem prediksi berbasis ML | Penyusunan user manual; pelatihan petugas Puskesmas & Dinkes; dashboard sederhana berbasis model |
| No | Komponen | Estimasi Biaya |
|---|---|---|
| 1 | SDM | |
| NA | — Project Manager (12 bln) | Rp 120.000.000 |
| NA | — 2 Data Scientist (12 bln) | Rp 240.000.000 |
| NA | — 1 Data Engineer + 1 Domain Expert | Rp 180.000.000 |
| 2 | Infrastruktur | |
| NA | — Server cloud & storage | Rp 60.000.000 |
| NA | — Laptop/workstation (5 unit) | Rp 50.000.000 |
| NA | — Maintenance IT (12 bln) | Rp 24.000.000 |
| 3 | Data | |
| NA | — Pengumpulan data lapangan | Rp 80.000.000 |
| NA | — Validasi & quality control | Rp 20.000.000 |
| 4 | Lisensi & Lain-lain | |
| NA | — Lisensi software analitik | Rp 15.000.000 |
| NA | — Pelatihan & sosialisasi | Rp 40.000.000 |
| NA | — Dokumentasi & pelaporan | Rp 21.000.000 |
| TOTAL ESTIMASI BIAYA | Rp 850.000.000 |
Kode Unit: J.62DMI00.002.1
library(readr); library(tidyverse); library(knitr); library(kableExtra)
data_stunting_raw <- read_csv("C:/Users/User/Downloads/data_stunting_raw.csv")tabel_target <- data_stunting_raw %>%
count(status_stunting) %>%
mutate(persentase = round(n / sum(n) * 100, 2))
kable(tabel_target,
caption = "Distribusi Frekuensi Variabel Target status_stunting",
col.names = c("Status Stunting","Frekuensi (n)","Persentase (%)"),
align = "lrr") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE, position = "center") %>%
row_spec(which(tabel_target$status_stunting == "Stunting"),
background = "#FFEBEE", color = "#B71C1C", bold = TRUE) %>%
row_spec(which(tabel_target$status_stunting == "Normal"),
background = "#E3F2FD", color = "#0D47A1") %>%
row_spec(0, background = "#1A237E", color = "white", bold = TRUE)| Status Stunting | Frekuensi (n) | Persentase (%) |
|---|---|---|
| Normal | 371 | 74.2 |
| Stunting | 129 | 25.8 |
pct_min <- tabel_target %>% filter(status_stunting == "Stunting") %>% pull(persentase)
cat(sprintf(
"Kelas minoritas (Stunting) : %s%%\nKelas mayoritas (Normal) : %s%%\nRasio imbalance : %.2f : 1\nKategori imbalance : Mild Imbalance (20–40%%)\n",
pct_min, 100 - pct_min, (100 - pct_min) / pct_min))#> Kelas minoritas (Stunting) : 25.8%
#> Kelas mayoritas (Normal) : 74.2%
#> Rasio imbalance : 2.88 : 1
#> Kategori imbalance : Mild Imbalance (20–40%)
ggplot(data_stunting_raw, aes(x = status_stunting, fill = status_stunting)) +
geom_bar(width = 0.5, colour = "white", linewidth = 0.5) +
geom_text(stat = "count",
aes(label = paste0(..count.., "\n(",
round(..count.. / nrow(data_stunting_raw) * 100, 1), "%)")),
vjust = -0.3, size = 4.5, fontface = "bold") +
scale_fill_manual(values = c("Normal" = COL_NORMAL, "Stunting" = COL_STUNTING)) +
scale_y_continuous(expand = expansion(mult = c(0, 0.18)), limits = c(0, NA)) +
labs(title = "Distribusi Variabel Target: Status Stunting",
subtitle = "Dataset data_stunting_raw.csv — 500 Balita, 50 Kab/Kota Prioritas",
x = "Status Stunting", y = "Jumlah Balita", fill = "Status",
caption = "Sumber: data_stunting_raw.csv | Rasio Imbalance 2,88 : 1 (Mild Imbalance)") +
annotate("label", x = 1.5, y = 420,
label = "Rasio Imbalance\n2,88 : 1\n(Mild Imbalance)",
fill = "#FFF9C4", colour = "#E65100", size = 3.2,
label.size = 0.4, fontface = "italic")Figure 2.1: Distribusi Variabel Target Status Stunting (n = 500 Balita)
Dari 500 balita, 371 berstatus Normal (74,2%) dan 129 berstatus Stunting (25,8%). Rasio ketidakseimbangan adalah 2,88 : 1 yang tergolong Mild Imbalance (20–40%). Kondisi ini tetap perlu diperhatikan karena dapat membuat model bias ke kelas mayoritas dalam mendeteksi kelas Stunting yang paling penting dideteksi dalam konteks kesehatan publik.
| Metrik | Rumus | Threshold | Alasan |
|---|---|---|---|
| Recall / Sensitivity | TP / (TP + FN) | ≥ 80% | METRIK UTAMA — Mendeteksi semua balita Stunting yang sebenarnya |
| Precision | TP / (TP + FP) | ≥ 70% | Ketepatan prediksi agar intervensi tidak terbuang sia-sia |
| F1-Score | 2 × (P × R) / (P + R) | ≥ 75% | Keseimbangan antara Recall dan Precision |
| AUC-ROC | Area under ROC curve | ≥ 0,80 | Kemampuan diskriminasi model secara keseluruhan |
| Specificity | TN / (TN + FP) | Sekunder | Mendeteksi balita Normal yang sebenarnya |
False Negative jauh lebih kritis dibanding False Positive. Balita Stunting yang tidak terdeteksi tidak mendapat intervensi PMT → gagal tumbuh permanen, target nasional tidak tercapai. Sebaliknya, False Positive hanya menyebabkan inefisiensi anggaran PMT tanpa bahaya medis.
Kode Unit: J.62DMI00.005.1
#> spc_tbl_ [500 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ id_balita : chr [1:500] "BAL0001" "BAL0002" "BAL0003" "BAL0004" ...
#> $ kab_kota : chr [1:500] "KAB_02" "KAB_07" "KAB_27" "KAB_14" ...
#> $ jenis_kelamin : chr [1:500] "L" "L" "L" "L" ...
#> $ usia_bulan : num [1:500] 33 24 10 27 55 29 21 32 26 23 ...
#> $ bb_lahir : num [1:500] 2.16 2.95 3.03 3 3.07 2.76 3.11 3.08 2.52 3.71 ...
#> $ tb_sekarang : num [1:500] 78.4 65.2 61.1 74.6 90.3 73.1 79.7 70.8 65.2 72 ...
#> $ bb_sekarang : num [1:500] 9.39 10.61 8.45 11.99 12.78 ...
#> $ pend_ibu : chr [1:500] "SD" "SMP" "SD" "SMA" ...
#> $ pendapatan : num [1:500] 2828197 1981942 779264 7148622 7289472 ...
#> $ akses_air : chr [1:500] "Tidak" "Ya" "Ya" "Ya" ...
#> $ sanitasi : chr [1:500] "Baik" "Baik" "Kurang" "Kurang" ...
#> $ asi_eksklusif : chr [1:500] "Tidak" "Ya" "Ya" "Ya" ...
#> $ imunisasi : chr [1:500] "Lengkap" "Tidak Lengkap" "Lengkap" "Lengkap" ...
#> $ diare_3bln : num [1:500] 0 2 0 1 0 0 1 0 0 0 ...
#> $ kunjungan_posyandu: num [1:500] 11 11 1 5 7 5 4 7 8 3 ...
#> $ zscore_tbu : num [1:500] 0.246 -2.549 -2.2 -0.189 1.257 ...
#> $ status_stunting : chr [1:500] "Normal" "Stunting" "Stunting" "Normal" ...
#> - attr(*, "spec")=
#> .. cols(
#> .. id_balita = col_character(),
#> .. kab_kota = col_character(),
#> .. jenis_kelamin = col_character(),
#> .. usia_bulan = col_double(),
#> .. bb_lahir = col_double(),
#> .. tb_sekarang = col_double(),
#> .. bb_sekarang = col_double(),
#> .. pend_ibu = col_character(),
#> .. pendapatan = col_double(),
#> .. akses_air = col_character(),
#> .. sanitasi = col_character(),
#> .. asi_eksklusif = col_character(),
#> .. imunisasi = col_character(),
#> .. diare_3bln = col_double(),
#> .. kunjungan_posyandu = col_double(),
#> .. zscore_tbu = col_double(),
#> .. status_stunting = col_character()
#> .. )
#> - attr(*, "problems")=<externalptr>
#> # A tibble: 5 × 17
#> id_balita kab_kota jenis_kelamin usia_bulan bb_lahir tb_sekarang bb_sekarang
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 BAL0001 KAB_02 L 33 2.16 78.4 9.39
#> 2 BAL0002 KAB_07 L 24 2.95 65.2 10.6
#> 3 BAL0003 KAB_27 L 10 3.03 61.1 8.45
#> 4 BAL0004 KAB_14 L 27 3 74.6 12.0
#> 5 BAL0005 KAB_14 P 55 3.07 90.3 12.8
#> # ℹ 10 more variables: pend_ibu <chr>, pendapatan <dbl>, akses_air <chr>,
#> # sanitasi <chr>, asi_eksklusif <chr>, imunisasi <chr>, diare_3bln <dbl>,
#> # kunjungan_posyandu <dbl>, zscore_tbu <dbl>, status_stunting <chr>
| No | Variabel | Tipe | Skala | Contoh Nilai | Keterangan |
|---|---|---|---|---|---|
| 1 | id_balita | Kategorik | Nominal | BAL0001 | ID unik; tidak dipakai pemodelan |
| 2 | kab_kota | Kategorik | Nominal | KAB_02 | 50 wilayah; tidak dipakai pemodelan |
| 3 | jenis_kelamin | Kategorik | Nominal | L, P | Perlu binary encoding |
| 4 | usia_bulan | Numerik | Rasio | 33, 55 | Usia balita dalam bulan |
| 5 | bb_lahir | Numerik | Rasio | 2.16, 3.03 | Berat lahir (kg); ada 20 NA |
| 6 | tb_sekarang | Numerik | Rasio | 78.4, 65.2 | Tinggi badan saat ini (cm) |
| 7 | bb_sekarang | Numerik | Rasio | 9.39, 12.78 | Berat badan saat ini (kg) |
| 8 | pend_ibu | Kategorik | Ordinal | SD, SMP, SMA | Bertingkat; perlu ordinal encoding; ada 15 NA |
| 9 | pendapatan | Numerik | Rasio | 779264, 7148622 | Pendapatan keluarga (Rp) |
| 10 | akses_air | Kategorik | Nominal | Ya, Tidak | Perlu binary encoding |
| 11 | sanitasi | Kategorik | Nominal | Baik, Kurang | Perlu binary encoding |
| 12 | asi_eksklusif | Kategorik | Nominal | Ya, Tidak | Perlu binary encoding |
| 13 | imunisasi | Kategorik | Nominal | Lengkap, Tidak Lengkap | Perlu binary encoding |
| 14 | diare_3bln | Numerik | Rasio | 0, 1, 2 | Episode diare 3 bulan terakhir |
| 15 | kunjungan_posyandu | Numerik | Rasio | 1, 5, 11 | Frekuensi kunjungan; ada 10 NA |
| 16 | zscore_tbu | Numerik | Interval | -2.549, 0.246 | Z-score TB/U WHO; ada outlier ekstrem |
| 17 | status_stunting | Kategorik | Nominal | Normal, Stunting | Variabel TARGET |
library(corrplot)
numerik_vars <- data_stunting_raw %>%
select(usia_bulan, bb_lahir, tb_sekarang, bb_sekarang,
pendapatan, diare_3bln, kunjungan_posyandu, zscore_tbu)
cor_matrix <- cor(numerik_vars, use = "complete.obs")
corrplot(cor_matrix,
method = "color",
type = "upper",
order = "hclust",
addCoef.col = "#37474F",
tl.col = "#1A237E",
tl.srt = 40,
tl.cex = 0.85,
number.cex = 0.72,
col = colorRampPalette(c(COL_STUNTING, "white", COL_NORMAL))(200),
cl.cex = 0.75,
title = "Matriks Korelasi Variabel Numerik — Data Stunting",
mar = c(0, 0, 2.5, 0))Figure 3.1: Matriks Korelasi Variabel Numerik — Dataset Stunting
cor_df <- as.data.frame(as.table(cor_matrix)) %>%
filter(Var1 != Var2) %>%
mutate(abs_cor = abs(Freq)) %>%
arrange(desc(abs_cor)) %>%
filter(!duplicated(abs_cor)) %>%
head(5)
kable(cor_df, digits = 3,
caption = "Top 5 Pasangan Variabel dengan Korelasi Tertinggi",
col.names = c("Variabel 1","Variabel 2","Nilai Korelasi","|Korelasi|")) %>%
kable_styling(bootstrap_options = c("striped","hover"),
full_width = FALSE, position = "center") %>%
row_spec(1, background = "#FFEBEE", bold = TRUE, color = "#B71C1C") %>%
row_spec(0, background = "#37474F", color = "white", bold = TRUE)| Variabel 1 | Variabel 2 | Nilai Korelasi | |Korelasi| |
|---|---|---|---|
| zscore_tbu | tb_sekarang | 0.880 | 0.880 |
| bb_sekarang | usia_bulan | 0.493 | 0.493 |
| tb_sekarang | usia_bulan | 0.470 | 0.470 |
| bb_sekarang | tb_sekarang | 0.261 | 0.261 |
| bb_sekarang | bb_lahir | -0.091 | 0.091 |
Korelasi tertinggi:
zscore_tbudantb_sekarang(r = 0,880) — sangat logis secara medis karena Z-score TB/U diturunkan dari nilai tinggi badan aktual. Perlu diwaspadai potensi multikolinearitas antara kedua variabel ini dalam pemodelan.
library(e1071)
stat_desk <- data_stunting_raw %>%
select(usia_bulan, bb_lahir, tb_sekarang, bb_sekarang,
pendapatan, diare_3bln, kunjungan_posyandu, zscore_tbu) %>%
summarise(across(everything(), list(
Min = ~min(., na.rm = TRUE),
Maks = ~max(., na.rm = TRUE),
Mean = ~mean(., na.rm = TRUE),
Median = ~median(., na.rm = TRUE),
SD = ~sd(., na.rm = TRUE),
P25 = ~quantile(., 0.25, na.rm = TRUE),
P75 = ~quantile(., 0.75, na.rm = TRUE),
Skew = ~skewness(., na.rm = TRUE)
), .names = "{.col}__{.fn}")) %>%
pivot_longer(everything(), names_to = c("Variabel","Statistik"), names_sep = "__") %>%
pivot_wider(names_from = Statistik, values_from = value)
kable(stat_desk, digits = 2,
caption = "Statistik Deskriptif Variabel Numerik",
align = c("l", rep("r", 8))) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
column_spec(1, bold = TRUE, background = "#E8EAF6") %>%
column_spec(9, background = "#FFF9C4",
color = ifelse(abs(as.numeric(stat_desk$Skew)) >= 1, COL_STUNTING, "black"),
bold = ifelse(abs(as.numeric(stat_desk$Skew)) >= 1, TRUE, FALSE)) %>%
add_header_above(c(" " = 1, "Nilai Ekstrem" = 2, "Pemusatan" = 2,
"Sebaran" = 3, "Bentuk" = 1)) %>%
row_spec(0, background = "#1A237E", color = "white", bold = TRUE)| Variabel | Min | Maks | Mean | Median | SD | P25 | P75 | Skew |
|---|---|---|---|---|---|---|---|---|
| usia_bulan | 6.00 | 59.00 | 32.96 | 32.50 | 15.77 | 19.00 | 48.00 | 0.04 |
| bb_lahir | 1.54 | 4.43 | 3.10 | 3.10 | 0.51 | 2.76 | 3.45 | -0.07 |
| tb_sekarang | 5.00 | 200.00 | 74.93 | 74.70 | 12.47 | 68.60 | 81.00 | 2.45 |
| bb_sekarang | 0.50 | 50.00 | 12.10 | 12.02 | 3.17 | 10.37 | 13.77 | 4.04 |
| pendapatan | 500000.00 | 14881706.00 | 3063535.35 | 2650597.50 | 1959533.69 | 1697066.25 | 3879258.00 | 1.94 |
| diare_3bln | 0.00 | 5.00 | 0.99 | 0.50 | 1.32 | 0.00 | 2.00 | 1.42 |
| kunjungan_posyandu | 1.00 | 12.00 | 6.62 | 7.00 | 3.56 | 3.00 | 10.00 | -0.05 |
| zscore_tbu | -20.29 | 36.51 | -0.74 | -0.77 | 3.13 | -2.02 | 0.61 | 3.69 |
ggplot(data_stunting_raw, aes(x = zscore_tbu, fill = status_stunting)) +
geom_histogram(bins = 42, alpha = 0.75, position = "identity",
colour = "white", linewidth = 0.15) +
geom_vline(xintercept = -2, linetype = "dashed", colour = COL_ORANGE, linewidth = 1) +
annotate("rect", xmin = -Inf, xmax = -2, ymin = -Inf, ymax = Inf,
fill = "#FFCCBC", alpha = 0.12) +
annotate("label", x = -4.2, y = 25, label = "Zona Stunting\n(< −2 SD)",
fill = "#FFF3E0", colour = COL_ORANGE, size = 3, label.size = 0.35) +
scale_fill_manual(values = c("Normal" = COL_NORMAL, "Stunting" = COL_STUNTING)) +
scale_x_continuous(breaks = seq(-6, 4, 1)) +
coord_cartesian(xlim = c(-8, 6)) +
labs(title = "Distribusi Z-Score TB/U berdasarkan Status Stunting",
subtitle = "Outlier ekstrem (di luar rentang ±6 SD) terpotong untuk keterbacaan",
x = "Z-Score TB/U", y = "Frekuensi", fill = "Status Stunting",
caption = "Garis putus-putus oranye: cut-off WHO −2 SD")Figure 3.2: Distribusi Z-Score TB/U berdasarkan Status Stunting
library(patchwork)
make_box <- function(var, label, unit) {
ggplot(data_stunting_raw, aes(x = status_stunting, y = .data[[var]],
fill = status_stunting)) +
geom_violin(alpha = 0.25, trim = TRUE, colour = NA) +
geom_boxplot(width = 0.3, outlier.colour = COL_ORANGE, outlier.shape = 21,
outlier.size = 1.8, outlier.fill = "#FFB300", linewidth = 0.45) +
stat_summary(fun = mean, geom = "point", shape = 23,
size = 3, fill = "white", colour = COL_ACCENT) +
scale_fill_manual(values = c("Normal" = COL_NORMAL, "Stunting" = COL_STUNTING)) +
labs(title = label, x = "Status Stunting", y = unit) +
theme(legend.position = "none")
}
p1 <- make_box("tb_sekarang", "Tinggi Badan (cm)", "Tinggi Badan (cm)")
p2 <- make_box("bb_sekarang", "Berat Badan (kg)", "Berat Badan (kg)")
p3 <- make_box("zscore_tbu", "Z-Score TB/U", "Z-Score TB/U")
(p1 | p2 | p3) +
plot_annotation(
title = "Distribusi Variabel Antropometri per Status Stunting",
subtitle = "Violin + Boxplot | Berlian putih = rata-rata | Titik oranye = outlier",
theme = theme(plot.title = element_text(face = "bold", colour = "#1A237E", size = 13),
plot.subtitle = element_text(colour = "#546E7A", size = 9.5))
)Figure 3.3: Distribusi Tinggi & Berat Badan per Status Stunting
ggplot(data_stunting_raw %>% filter(zscore_tbu > -8, zscore_tbu < 6),
aes(x = tb_sekarang, y = zscore_tbu, colour = status_stunting)) +
geom_point(alpha = 0.5, size = 1.8, shape = 16) +
geom_smooth(aes(group = status_stunting), method = "lm", se = TRUE,
linewidth = 0.9) +
geom_hline(yintercept = -2, linetype = "dashed", colour = COL_ORANGE, linewidth = 0.9) +
annotate("label", x = 115, y = -1.55, label = "Cut-off −2 SD",
fill = "#FFF3E0", colour = COL_ORANGE, size = 3, label.size = 0.3) +
scale_colour_manual(values = c("Normal" = COL_NORMAL, "Stunting" = COL_STUNTING)) +
labs(title = "Scatter Plot: Tinggi Badan vs Z-Score TB/U",
subtitle = "Korelasi positif sangat kuat (r = 0,880) | Outlier ekstrem dipotong",
x = "Tinggi Badan Sekarang (cm)", y = "Z-Score TB/U",
colour = "Status Stunting",
caption = "Sumber: data_stunting_raw.csv")Figure 3.4: Scatter Plot: Tinggi Badan vs Z-Score TB/U
skew_df <- data_stunting_raw %>%
select(usia_bulan, bb_lahir, tb_sekarang, bb_sekarang,
pendapatan, diare_3bln, kunjungan_posyandu, zscore_tbu) %>%
summarise(across(everything(), ~round(skewness(., na.rm = TRUE), 3))) %>%
pivot_longer(everything(), names_to = "Variabel", values_to = "Skewness") %>%
mutate(Kategori = case_when(
abs(Skewness) < 0.5 ~ "Simetris",
abs(Skewness) < 1 ~ "Skewed Sedang",
TRUE ~ "Skewed Berat"
))
kable(skew_df, caption = "Nilai Skewness Variabel Numerik",
col.names = c("Variabel","Nilai Skewness","Kategori")) %>%
kable_styling(bootstrap_options = c("striped","hover"),
full_width = FALSE, position = "center") %>%
row_spec(which(skew_df$Kategori == "Skewed Berat"),
background = "#FFEBEE", color = COL_STUNTING, bold = TRUE) %>%
row_spec(which(skew_df$Kategori == "Simetris"),
background = "#E8F5E9") %>%
row_spec(0, background = "#37474F", color = "white", bold = TRUE)| Variabel | Nilai Skewness | Kategori |
|---|---|---|
| usia_bulan | 0.045 | Simetris |
| bb_lahir | -0.073 | Simetris |
| tb_sekarang | 2.450 | Skewed Berat |
| bb_sekarang | 4.039 | Skewed Berat |
| pendapatan | 1.939 | Skewed Berat |
| diare_3bln | 1.422 | Skewed Berat |
| kunjungan_posyandu | -0.049 | Simetris |
| zscore_tbu | 3.695 | Skewed Berat |
buat_tabel_proporsi <- function(data, var) {
data %>%
group_by(across(all_of(c(var, "status_stunting")))) %>%
summarise(n = n(), .groups = "drop") %>%
group_by(across(all_of(var))) %>%
mutate(proporsi = round(n / sum(n) * 100, 1)) %>%
ungroup()
}
for (v in c("pend_ibu","akses_air","sanitasi","asi_eksklusif","imunisasi")) {
tbl <- buat_tabel_proporsi(data_stunting_raw, v)
cat("\n\n**Tabel Proporsi:", v, "vs status_stunting**\n\n")
print(kable(tbl, caption = paste("Proporsi:", v, "vs Status Stunting")) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE, position = "center"))
}Tabel Proporsi: pend_ibu vs status_stunting
| pend_ibu | status_stunting | n | proporsi |
|---|---|---|---|
| PT | Normal | 37 | 84.1 |
| PT | Stunting | 7 | 15.9 |
| SD | Normal | 75 | 68.2 |
| SD | Stunting | 35 | 31.8 |
| SMA | Normal | 107 | 69.9 |
| SMA | Stunting | 46 | 30.1 |
| SMP | Normal | 123 | 80.4 |
| SMP | Stunting | 30 | 19.6 |
| Tidak Sekolah | Normal | 19 | 76.0 |
| Tidak Sekolah | Stunting | 6 | 24.0 |
| NA | Normal | 10 | 66.7 |
| NA | Stunting | 5 | 33.3 |
Tabel Proporsi: akses_air vs status_stunting
| akses_air | status_stunting | n | proporsi |
|---|---|---|---|
| Tidak | Normal | 101 | 78.3 |
| Tidak | Stunting | 28 | 21.7 |
| Ya | Normal | 270 | 72.8 |
| Ya | Stunting | 101 | 27.2 |
Tabel Proporsi: sanitasi vs status_stunting
| sanitasi | status_stunting | n | proporsi |
|---|---|---|---|
| Baik | Normal | 220 | 74.3 |
| Baik | Stunting | 76 | 25.7 |
| Kurang | Normal | 151 | 74.0 |
| Kurang | Stunting | 53 | 26.0 |
Tabel Proporsi: asi_eksklusif vs status_stunting
| asi_eksklusif | status_stunting | n | proporsi |
|---|---|---|---|
| Tidak | Normal | 166 | 73.1 |
| Tidak | Stunting | 61 | 26.9 |
| Ya | Normal | 205 | 75.1 |
| Ya | Stunting | 68 | 24.9 |
Tabel Proporsi: imunisasi vs status_stunting
| imunisasi | status_stunting | n | proporsi |
|---|---|---|---|
| Lengkap | Normal | 254 | 72.8 |
| Lengkap | Stunting | 95 | 27.2 |
| Tidak Lengkap | Normal | 117 | 77.5 |
| Tidak Lengkap | Stunting | 34 | 22.5 |
data_stunting_raw %>%
filter(!is.na(pend_ibu)) %>%
mutate(pend_ibu = factor(pend_ibu,
levels = c("Tidak Sekolah","SD","SMP","SMA","PT"))) %>%
count(pend_ibu, status_stunting) %>%
group_by(pend_ibu) %>%
mutate(pct = round(n / sum(n) * 100, 1)) %>%
ggplot(aes(x = pend_ibu, y = pct, fill = status_stunting)) +
geom_col(position = "fill", colour = "white", linewidth = 0.4, width = 0.65) +
geom_text(aes(label = paste0(pct, "%")),
position = position_fill(vjust = 0.5),
colour = "white", size = 3.5, fontface = "bold") +
scale_fill_manual(values = c("Normal" = COL_NORMAL, "Stunting" = COL_STUNTING)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Proporsi Status Stunting per Tingkat Pendidikan Ibu",
subtitle = "Semakin tinggi pendidikan ibu → prevalensi stunting cenderung menurun",
x = "Tingkat Pendidikan Ibu", y = "Proporsi (%)", fill = "Status Stunting",
caption = "Prevalensi stunting tertinggi: SD (31,8%) | terendah: PT (15,9%)")Figure 3.5: Proporsi Status Stunting per Pendidikan Ibu
data_stunting_raw %>%
select(akses_air, sanitasi, asi_eksklusif, imunisasi, status_stunting) %>%
pivot_longer(-status_stunting, names_to = "variabel", values_to = "kategori") %>%
mutate(variabel = recode(variabel,
akses_air = "Akses Air Bersih",
sanitasi = "Sanitasi",
asi_eksklusif = "ASI Eksklusif",
imunisasi = "Imunisasi")) %>%
count(variabel, kategori, status_stunting) %>%
group_by(variabel, kategori) %>%
mutate(pct = round(n / sum(n) * 100, 1)) %>%
ggplot(aes(x = kategori, y = pct, fill = status_stunting)) +
geom_col(position = "fill", colour = "white", linewidth = 0.35, width = 0.6) +
geom_text(aes(label = paste0(pct, "%")),
position = position_fill(vjust = 0.5),
colour = "white", size = 3.2, fontface = "bold") +
facet_wrap(~variabel, scales = "free_x", nrow = 2) +
scale_fill_manual(values = c("Normal" = COL_NORMAL, "Stunting" = COL_STUNTING)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Proporsi Status Stunting per Variabel Kategorik",
subtitle = "Faktor akses layanan kesehatan dan perilaku kesehatan",
x = NULL, y = "Proporsi (%)", fill = "Status Stunting") +
theme(axis.text.x = element_text(angle = 20, hjust = 1))Figure 3.6: Proporsi Status Stunting per Variabel Kategorik (Panel)
Kode Unit: J.62DMI00.006.1
| Variabel | Batas Bawah (−3SD) | Batas Atas (+3SD) | Dasar Referensi |
|---|---|---|---|
| tb_sekarang | 61,2 cm | 124,0 cm | WHO Height-for-Age 6–59 bln |
| bb_sekarang | 5,5 kg | 24,0 kg | WHO Weight-for-Age 6–59 bln |
| bb_lahir | 2,0 kg | 4,5 kg | WHO: BBLR <2500g; LGA >4500g |
| zscore_tbu | −6,0 SD | +6,0 SD | WHO biologically implausible |
outlier_tb <- data_stunting_raw %>%
filter(tb_sekarang < 61.2 | tb_sekarang > 124.0) %>%
select(id_balita, usia_bulan, tb_sekarang, status_stunting)
outlier_bb <- data_stunting_raw %>%
filter(bb_sekarang < 5.5 | bb_sekarang > 24.0) %>%
select(id_balita, usia_bulan, bb_sekarang, status_stunting)
outlier_bbl <- data_stunting_raw %>%
filter(!is.na(bb_lahir), bb_lahir < 2.0 | bb_lahir > 4.5) %>%
select(id_balita, usia_bulan, bb_lahir, status_stunting)
outlier_zscore <- data_stunting_raw %>%
filter(zscore_tbu < -6 | zscore_tbu > 6) %>%
select(id_balita, usia_bulan, zscore_tbu, status_stunting)
cat("=== Outlier tb_sekarang (WHO: 61.2–124.0 cm) ===\n")#> === Outlier tb_sekarang (WHO: 61.2–124.0 cm) ===
#> Jumlah outlier: 42
#> # A tibble: 10 × 4
#> id_balita usia_bulan tb_sekarang status_stunting
#> <chr> <dbl> <dbl> <chr>
#> 1 BAL0003 10 61.1 Stunting
#> 2 BAL0023 6 61 Normal
#> 3 BAL0040 18 60.8 Stunting
#> 4 BAL0060 12 60.5 Stunting
#> 5 BAL0071 24 56.2 Stunting
#> 6 BAL0078 11 53.2 Stunting
#> 7 BAL0079 7 52.1 Stunting
#> 8 BAL0085 11 58.1 Stunting
#> 9 BAL0102 16 58.9 Stunting
#> 10 BAL0109 16 58.7 Stunting
#>
#> === Outlier bb_sekarang (WHO: 5.5–24.0 kg) ===
#> Jumlah outlier: 3
#> # A tibble: 3 × 4
#> id_balita usia_bulan bb_sekarang status_stunting
#> <chr> <dbl> <dbl> <chr>
#> 1 BAL0006 29 35 Normal
#> 2 BAL0028 52 0.5 Normal
#> 3 BAL0212 56 50 Normal
#>
#> === Outlier bb_lahir (WHO: 2.0–4.5 kg) ===
#> Jumlah outlier: 9
#> # A tibble: 9 × 4
#> id_balita usia_bulan bb_lahir status_stunting
#> <chr> <dbl> <dbl> <chr>
#> 1 BAL0013 28 1.72 Normal
#> 2 BAL0132 9 1.95 Normal
#> 3 BAL0155 21 1.83 Normal
#> 4 BAL0156 16 1.85 Normal
#> 5 BAL0249 57 1.54 Normal
#> 6 BAL0326 45 1.7 Stunting
#> 7 BAL0358 42 1.98 Normal
#> 8 BAL0370 37 1.66 Stunting
#> 9 BAL0494 9 1.78 Normal
#>
#> === Outlier zscore_tbu (WHO: -6 s/d +6) ===
#> Jumlah outlier: 5
#> # A tibble: 5 × 4
#> id_balita usia_bulan zscore_tbu status_stunting
#> <chr> <dbl> <dbl> <chr>
#> 1 BAL0129 51 -18.4 Stunting
#> 2 BAL0301 29 -20.3 Stunting
#> 3 BAL0341 11 -15.5 Stunting
#> 4 BAL0390 54 27.0 Normal
#> 5 BAL0467 19 36.5 Normal
rekap_outlier <- data.frame(
Variabel = c("tb_sekarang","bb_sekarang","bb_lahir","zscore_tbu","TOTAL"),
Jumlah = c(nrow(outlier_tb), nrow(outlier_bb), nrow(outlier_bbl),
nrow(outlier_zscore),
nrow(outlier_tb)+nrow(outlier_bb)+nrow(outlier_bbl)+nrow(outlier_zscore)),
Persen = round(c(nrow(outlier_tb), nrow(outlier_bb), nrow(outlier_bbl),
nrow(outlier_zscore),
nrow(outlier_tb)+nrow(outlier_bb)+nrow(outlier_bbl)+nrow(outlier_zscore)
) / 500 * 100, 1),
Batas_WHO = c("61,2–124,0 cm","5,5–24,0 kg","2,0–4,5 kg","−6 s/d +6","—")
)
kable(rekap_outlier, caption = "Rekap Total Nilai Tidak Valid Berdasarkan Batas WHO",
col.names = c("Variabel","Jumlah Tidak Valid","Persentase (%)","Batas WHO")) %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE,
position = "center") %>%
row_spec(nrow(rekap_outlier), background = "#B71C1C", color = "white", bold = TRUE) %>%
row_spec(0, background = "#37474F", color = "white", bold = TRUE)| Variabel | Jumlah Tidak Valid | Persentase (%) | Batas WHO |
|---|---|---|---|
| tb_sekarang | 42 | 8.4 | 61,2–124,0 cm |
| bb_sekarang | 3 | 0.6 | 5,5–24,0 kg |
| bb_lahir | 9 | 1.8 | 2,0–4,5 kg |
| zscore_tbu | 5 | 1.0 | −6 s/d +6 |
| TOTAL | 59 | 11.8 | — |
#>
#> === Missing Values per Variabel ===
data_stunting_raw %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Variabel", values_to = "NA_Count") %>%
filter(NA_Count > 0) %>%
mutate(Persen = round(NA_Count/500*100, 1)) %>%
print()#> # A tibble: 3 × 3
#> Variabel NA_Count Persen
#> <chr> <int> <dbl>
#> 1 bb_lahir 20 4
#> 2 pend_ibu 15 3
#> 3 kunjungan_posyandu 10 2
n_obs <- nrow(data_stunting_raw); n_pred <- 14; min_obs <- n_pred * 10
cat(sprintf(
"\n=== Kecukupan Data (Rule of Thumb: 10 obs/prediktor) ===\nObservasi : %d\nPrediktor aktif : %d\nMinimum (10x) : %d\nRasio : %.2f\nKesimpulan : %s\n",
n_obs, n_pred, min_obs, n_obs/n_pred,
ifelse(n_obs >= min_obs, "CUKUP ✓ (rasio 35,71 ≥ 10)", "TIDAK CUKUP ✗")))#>
#> === Kecukupan Data (Rule of Thumb: 10 obs/prediktor) ===
#> Observasi : 500
#> Prediktor aktif : 14
#> Minimum (10x) : 140
#> Rasio : 35.71
#> Kesimpulan : CUKUP ✓ (rasio 35,71 ≥ 10)
Kode Unit: J.62DMI00.007.1
#> === Dimensi Dataset ===
#> Jumlah baris (observasi): 500
#> Jumlah kolom (variabel) : 17
#>
#> === Distribusi Target: status_stunting ===
data_stunting_raw %>%
count(status_stunting) %>%
mutate(persen = round(n/sum(n)*100, 2)) %>% print()#> # A tibble: 2 × 3
#> status_stunting n persen
#> <chr> <int> <dbl>
#> 1 Normal 371 74.2
#> 2 Stunting 129 25.8
#>
#> === Proporsi Stunting per Jenis Kelamin ===
data_stunting_raw %>%
group_by(jenis_kelamin, status_stunting) %>%
summarise(n = n(), .groups = "drop") %>%
group_by(jenis_kelamin) %>%
mutate(proporsi = round(n/sum(n)*100, 2)) %>% print()#> # A tibble: 4 × 4
#> # Groups: jenis_kelamin [2]
#> jenis_kelamin status_stunting n proporsi
#> <chr> <chr> <int> <dbl>
#> 1 L Normal 184 69.4
#> 2 L Stunting 81 30.6
#> 3 P Normal 187 79.6
#> 4 P Stunting 48 20.4
#>
#> === Proporsi Stunting per Kelompok Usia ===
data_stunting_raw %>%
mutate(kel_usia = case_when(
usia_bulan <= 11 ~ "6-11 bln",
usia_bulan <= 23 ~ "12-23 bln",
usia_bulan <= 35 ~ "24-35 bln",
usia_bulan <= 47 ~ "36-47 bln",
TRUE ~ "48-59 bln")) %>%
group_by(kel_usia, status_stunting) %>%
summarise(n = n(), .groups = "drop") %>%
group_by(kel_usia) %>%
mutate(proporsi = round(n/sum(n)*100, 2)) %>%
arrange(kel_usia) %>% print()#> # A tibble: 10 × 4
#> # Groups: kel_usia [5]
#> kel_usia status_stunting n proporsi
#> <chr> <chr> <int> <dbl>
#> 1 12-23 bln Normal 90 77.6
#> 2 12-23 bln Stunting 26 22.4
#> 3 24-35 bln Normal 87 73.7
#> 4 24-35 bln Stunting 31 26.3
#> 5 36-47 bln Normal 66 72.5
#> 6 36-47 bln Stunting 25 27.5
#> 7 48-59 bln Normal 95 74.8
#> 8 48-59 bln Stunting 32 25.2
#> 9 6-11 bln Normal 33 68.8
#> 10 6-11 bln Stunting 15 31.2
#>
#> === Simulasi Stratified Split 80:20 ===
#> Training (80%): Normal ≈ 297 | Stunting ≈ 103
#> Test (20%): Normal ≈ 74 | Stunting ≈ 26
#>
#> === Missing Values per Variabel Prediktor ===
data_stunting_raw %>%
select(-id_balita, -kab_kota, -status_stunting) %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Variabel", values_to = "Jumlah_NA") %>%
print()#> # A tibble: 14 × 2
#> Variabel Jumlah_NA
#> <chr> <int>
#> 1 jenis_kelamin 0
#> 2 usia_bulan 0
#> 3 bb_lahir 20
#> 4 tb_sekarang 0
#> 5 bb_sekarang 0
#> 6 pend_ibu 15
#> 7 pendapatan 0
#> 8 akses_air 0
#> 9 sanitasi 0
#> 10 asi_eksklusif 0
#> 11 imunisasi 0
#> 12 diare_3bln 0
#> 13 kunjungan_posyandu 10
#> 14 zscore_tbu 0
Kode Unit: J.62DMI00.008.1
# ── Kondisi sebelum cleaning ──────────────────────────────────────────────────
cat("=== KONDISI DATA SEBELUM CLEANING ===\n")#> === KONDISI DATA SEBELUM CLEANING ===
#> Jumlah baris: 500
#>
#> --- Missing Values ---
data_stunting_raw %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Variabel", values_to = "NA_Count") %>%
filter(NA_Count > 0) %>%
mutate(Persen = round(NA_Count / 500 * 100, 1)) %>% print()#> # A tibble: 3 × 3
#> Variabel NA_Count Persen
#> <chr> <int> <dbl>
#> 1 bb_lahir 20 4
#> 2 pend_ibu 15 3
#> 3 kunjungan_posyandu 10 2
n_tb <- sum(data_stunting_raw$tb_sekarang < 30 | data_stunting_raw$tb_sekarang > 130, na.rm=TRUE)
n_bb <- sum(data_stunting_raw$bb_sekarang < 2 | data_stunting_raw$bb_sekarang > 20, na.rm=TRUE)
n_bbl <- sum(data_stunting_raw$bb_lahir < 0.5| data_stunting_raw$bb_lahir > 5.5, na.rm=TRUE)
cat(sprintf("\n--- Outlier (Batas Soal) ---\ntb_sekarang (30-130 cm): %d\nbb_sekarang (2-20 kg) : %d\nbb_lahir (0.5-5.5 kg): %d\nDuplikat: %d\n",
n_tb, n_bb, n_bbl, sum(duplicated(data_stunting_raw))))#>
#> --- Outlier (Batas Soal) ---
#> tb_sekarang (30-130 cm): 5
#> bb_sekarang (2-20 kg) : 3
#> bb_lahir (0.5-5.5 kg): 0
#> Duplikat: 0
# ── Langkah 1: Hapus outlier ──────────────────────────────────────────────────
data_step1 <- data_stunting_raw %>%
filter(tb_sekarang >= 30 & tb_sekarang <= 130,
bb_sekarang >= 2 & bb_sekarang <= 20,
(bb_lahir >= 0.5 & bb_lahir <= 5.5) | is.na(bb_lahir))
cat("\nSetelah hapus outlier:", nrow(data_step1), "baris (dihapus:",
nrow(data_stunting_raw) - nrow(data_step1), ")\n")#>
#> Setelah hapus outlier: 492 baris (dihapus: 8 )
# ── Langkah 2: Imputasi ───────────────────────────────────────────────────────
data_step2 <- data_step1 %>%
mutate(kelompok_usia = case_when(
usia_bulan <= 12 ~ "0-12", usia_bulan <= 24 ~ "13-24",
usia_bulan <= 36 ~ "25-36", usia_bulan <= 48 ~ "37-48", TRUE ~ "49-59"))
median_bb <- data_step2 %>% group_by(kelompok_usia) %>%
summarise(med_bb = median(bb_lahir, na.rm=TRUE))
median_kun <- data_step2 %>% group_by(kelompok_usia) %>%
summarise(med_kun = median(kunjungan_posyandu, na.rm=TRUE))
modus_pend <- data_step2 %>% filter(!is.na(pend_ibu)) %>%
count(pend_ibu, sort=TRUE) %>% slice(1) %>% pull(pend_ibu)
cat("\nModus pend_ibu:", modus_pend, "\n")#>
#> Modus pend_ibu: SMP
data_step2 <- data_step2 %>%
left_join(median_bb, by = "kelompok_usia") %>%
left_join(median_kun, by = "kelompok_usia") %>%
mutate(
bb_lahir = ifelse(is.na(bb_lahir), med_bb, bb_lahir),
kunjungan_posyandu = ifelse(is.na(kunjungan_posyandu), med_kun, kunjungan_posyandu),
pend_ibu = ifelse(is.na(pend_ibu), modus_pend, pend_ibu)
) %>% select(-med_bb, -med_kun)
cat("Missing setelah imputasi:\n")#> Missing setelah imputasi:
data_step2 %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Variabel", values_to = "NA") %>%
filter(NA > 0) %>% print()#> # A tibble: 0 × 2
#> # ℹ 2 variables: Variabel <chr>, NA <int>
#> (Tidak ada output = tidak ada missing tersisa)
# ── Langkah 3: Hapus duplikat ─────────────────────────────────────────────────
data_step3 <- data_step2 %>% distinct()
cat("\nSetelah hapus duplikat:", nrow(data_step3), "baris\n")#>
#> Setelah hapus duplikat: 492 baris
#>
#> === SUMMARY SEBELUM vs SESUDAH ===
#> --- Sebelum ---
data_stunting_raw %>%
select(tb_sekarang, bb_sekarang, bb_lahir, kunjungan_posyandu) %>%
summary() %>% print()#> tb_sekarang bb_sekarang bb_lahir kunjungan_posyandu
#> Min. : 5.00 Min. : 0.50 Min. :1.540 Min. : 1.000
#> 1st Qu.: 68.60 1st Qu.:10.37 1st Qu.:2.760 1st Qu.: 3.000
#> Median : 74.70 Median :12.02 Median :3.100 Median : 7.000
#> Mean : 74.93 Mean :12.10 Mean :3.102 Mean : 6.618
#> 3rd Qu.: 81.00 3rd Qu.:13.77 3rd Qu.:3.450 3rd Qu.:10.000
#> Max. :200.00 Max. :50.00 Max. :4.430 Max. :12.000
#> NA's :20 NA's :10
#> --- Sesudah ---
data_step3 %>%
select(tb_sekarang, bb_sekarang, bb_lahir, kunjungan_posyandu) %>%
summary() %>% print()#> tb_sekarang bb_sekarang bb_lahir kunjungan_posyandu
#> Min. : 50.20 Min. : 5.88 Min. :1.540 Min. : 1.00
#> 1st Qu.: 68.67 1st Qu.:10.37 1st Qu.:2.788 1st Qu.: 3.00
#> Median : 74.70 Median :12.01 Median :3.100 Median : 7.00
#> Mean : 74.80 Mean :12.00 Mean :3.105 Mean : 6.62
#> 3rd Qu.: 80.90 3rd Qu.:13.66 3rd Qu.:3.440 3rd Qu.:10.00
#> Max. :104.10 Max. :19.34 Max. :4.430 Max. :12.00
# ── Simpan ────────────────────────────────────────────────────────────────────
data_stunting_clean <- data_step3 %>% select(-kelompok_usia)
write_csv(data_stunting_clean, "C:/Users/User/Downloads/data_stunting_clean.csv")
cat("\n✓ data_stunting_clean.csv tersimpan |",
nrow(data_stunting_clean), "baris ×", ncol(data_stunting_clean), "kolom\n")#>
#> ✓ data_stunting_clean.csv tersimpan | 492 baris × 17 kolom
make_box_clean <- function(df, var, unit, label, col_fill) {
ggplot(df, aes(y = .data[[var]])) +
geom_boxplot(fill = col_fill, colour = darken_col(col_fill, 0.3),
outlier.colour = COL_ORANGE, outlier.shape = 21,
outlier.size = 1.5, width = 0.45) +
labs(title = label, y = unit) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
}
darken_col <- function(col, factor = 0.3) {
rgb_col <- col2rgb(col) / 255
rgb(rgb_col[1]*factor, rgb_col[2]*factor, rgb_col[3]*factor)
}
p1a <- ggplot(data_stunting_raw, aes(y = tb_sekarang)) +
geom_boxplot(fill = "#FFCDD2", colour = COL_STUNTING,
outlier.colour = COL_ORANGE, outlier.shape = 21, width = 0.45) +
labs(title = "Tinggi Badan\n(Sebelum)", y = "cm") +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
p1b <- ggplot(data_step3, aes(y = tb_sekarang)) +
geom_boxplot(fill = "#C8E6C9", colour = COL_GREEN,
outlier.colour = COL_ORANGE, outlier.shape = 21, width = 0.45) +
labs(title = "Tinggi Badan\n(Sesudah)", y = "cm") +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
p2a <- ggplot(data_stunting_raw, aes(y = bb_sekarang)) +
geom_boxplot(fill = "#FFCDD2", colour = COL_STUNTING,
outlier.colour = COL_ORANGE, outlier.shape = 21, width = 0.45) +
labs(title = "Berat Badan\n(Sebelum)", y = "kg") +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
p2b <- ggplot(data_step3, aes(y = bb_sekarang)) +
geom_boxplot(fill = "#C8E6C9", colour = COL_GREEN,
outlier.colour = COL_ORANGE, outlier.shape = 21, width = 0.45) +
labs(title = "Berat Badan\n(Sesudah)", y = "kg") +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
(p1a | p1b | p2a | p2b) +
plot_annotation(
title = "Perbandingan Distribusi Antropometri Sebelum dan Sesudah Cleaning",
subtitle = "Merah = sebelum cleaning (outlier terlihat) | Hijau = sesudah cleaning (bersih)",
theme = theme(plot.title = element_text(face = "bold", colour = "#1A237E", size = 12),
plot.subtitle = element_text(colour = "#546E7A", size = 9))
)Figure 6.1: Perbandingan Distribusi Sebelum dan Sesudah Cleaning
prev_sblm <- data_stunting_raw %>%
count(status_stunting) %>%
mutate(persen = round(n/sum(n)*100, 2), dataset = "Sebelum Cleaning")
prev_ssdh <- data_stunting_clean %>%
count(status_stunting) %>%
mutate(persen = round(n/sum(n)*100, 2), dataset = "Sesudah Cleaning")
perbandingan <- bind_rows(prev_sblm, prev_ssdh) %>%
select(dataset, status_stunting, n, persen)
kable(perbandingan, caption = "Perbandingan Prevalensi Stunting Sebelum dan Sesudah Cleaning",
col.names = c("Dataset","Status","n","%"), align = "llrr") %>%
kable_styling(bootstrap_options = c("striped","hover"),
full_width = FALSE, position = "center") %>%
collapse_rows(columns = 1, valign = "middle") %>%
row_spec(0, background = "#37474F", color = "white", bold = TRUE)| Dataset | Status | n | % |
|---|---|---|---|
| Sebelum Cleaning | Normal | 371 | 74.20 |
| Stunting | 129 | 25.80 | |
| Sesudah Cleaning | Normal | 366 | 74.39 |
| Stunting | 126 | 25.61 |
pct_s <- prev_sblm %>% filter(status_stunting == "Stunting") %>% pull(persen)
pct_a <- prev_ssdh %>% filter(status_stunting == "Stunting") %>% pull(persen)
cat(sprintf("Prevalensi Stunting sebelum : %s%%\nPrevalensi Stunting sesudah : %s%%\nSelisih : %s poin persentase\n",
pct_s, pct_a, round(pct_a - pct_s, 2)))#> Prevalensi Stunting sebelum : 25.8%
#> Prevalensi Stunting sesudah : 25.61%
#> Selisih : -0.19 poin persentase
bind_rows(prev_sblm, prev_ssdh) %>%
ggplot(aes(x = dataset, y = persen, fill = status_stunting)) +
geom_col(position = "fill", colour = "white", linewidth = 0.5, width = 0.5) +
geom_text(aes(label = paste0(persen, "%")),
position = position_fill(vjust = 0.5),
colour = "white", size = 4.2, fontface = "bold") +
scale_fill_manual(values = c("Normal" = COL_NORMAL, "Stunting" = COL_STUNTING)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Perbandingan Prevalensi Stunting",
subtitle = "Sebelum vs Sesudah Pembersihan Data",
x = NULL, y = "Proporsi (%)", fill = "Status Stunting",
caption = "Sumber: data_stunting_raw.csv & data_stunting_clean.csv")Figure 6.2: Perbandingan Prevalensi Stunting Sebelum dan Sesudah Cleaning
Kode Unit: J.62DMI00.009.1
data_stunting_clean <- read_csv("C:/Users/User/Downloads/data_stunting_clean.csv")
data_encoded <- data_stunting_clean %>%
mutate(
pend_ibu_enc = case_when(
pend_ibu == "Tidak Sekolah" ~ 0, pend_ibu == "SD" ~ 1,
pend_ibu == "SMP" ~ 2, pend_ibu == "SMA" ~ 3,
pend_ibu == "PT" ~ 4, TRUE ~ NA_real_),
jenis_kelamin_enc = ifelse(jenis_kelamin == "L", 1, 0),
akses_air_enc = ifelse(akses_air == "Ya", 1, 0),
sanitasi_enc = ifelse(sanitasi == "Baik", 1, 0),
asi_eksklusif_enc = ifelse(asi_eksklusif == "Ya", 1, 0),
imunisasi_enc = ifelse(imunisasi == "Lengkap", 1, 0)
)
cat("=== Hasil Encoding pend_ibu (Ordinal) ===\n")#> === Hasil Encoding pend_ibu (Ordinal) ===
#> # A tibble: 5 × 3
#> pend_ibu pend_ibu_enc n
#> <chr> <dbl> <int>
#> 1 Tidak Sekolah 0 25
#> 2 SD 1 109
#> 3 SMP 2 165
#> 4 SMA 3 149
#> 5 PT 4 44
#>
#> === Hasil Encoding Variabel Binary ===
data_encoded %>%
select(jenis_kelamin, jenis_kelamin_enc, akses_air, akses_air_enc,
sanitasi, sanitasi_enc, asi_eksklusif, asi_eksklusif_enc,
imunisasi, imunisasi_enc) %>%
distinct() %>% arrange(jenis_kelamin_enc) %>% print(n = 20)#> # A tibble: 32 × 10
#> jenis_kelamin jenis_kelamin_enc akses_air akses_air_enc sanitasi sanitasi_enc
#> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 P 0 Tidak 0 Baik 1
#> 2 P 0 Ya 1 Kurang 0
#> 3 P 0 Ya 1 Baik 1
#> 4 P 0 Ya 1 Baik 1
#> 5 P 0 Tidak 0 Kurang 0
#> 6 P 0 Tidak 0 Baik 1
#> 7 P 0 Ya 1 Kurang 0
#> 8 P 0 Tidak 0 Baik 1
#> 9 P 0 Ya 1 Kurang 0
#> 10 P 0 Ya 1 Baik 1
#> 11 P 0 Tidak 0 Kurang 0
#> 12 P 0 Tidak 0 Kurang 0
#> 13 P 0 Tidak 0 Baik 1
#> 14 P 0 Tidak 0 Kurang 0
#> 15 P 0 Ya 1 Baik 1
#> 16 P 0 Ya 1 Kurang 0
#> 17 L 1 Tidak 0 Baik 1
#> 18 L 1 Ya 1 Baik 1
#> 19 L 1 Ya 1 Kurang 0
#> 20 L 1 Tidak 0 Kurang 0
#> # ℹ 12 more rows
#> # ℹ 4 more variables: asi_eksklusif <chr>, asi_eksklusif_enc <dbl>,
#> # imunisasi <chr>, imunisasi_enc <dbl>
data_encoded <- data_encoded %>%
mutate(
bmi = bb_sekarang / (tb_sekarang / 100)^2,
rasio_bb = bb_sekarang / bb_lahir
)
cat("=== Ringkasan 2 Fitur Baru per Status Stunting ===\n")#> === Ringkasan 2 Fitur Baru per Status Stunting ===
data_encoded %>%
select(bmi, rasio_bb, status_stunting) %>%
group_by(status_stunting) %>%
summarise(
n = n(),
BMI_mean = round(mean(bmi, na.rm=TRUE), 3),
BMI_median = round(median(bmi, na.rm=TRUE), 3),
BMI_sd = round(sd(bmi, na.rm=TRUE), 3),
RasioBB_mean = round(mean(rasio_bb, na.rm=TRUE), 3),
RasioBB_median = round(median(rasio_bb, na.rm=TRUE), 3),
RasioBB_sd = round(sd(rasio_bb, na.rm=TRUE), 3)
) %>% print()#> # A tibble: 2 × 8
#> status_stunting n BMI_mean BMI_median BMI_sd RasioBB_mean RasioBB_median
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Normal 366 20.0 19.6 4.02 3.95 3.85
#> 2 Stunting 126 27.9 27.1 5.97 4.11 3.86
#> # ℹ 1 more variable: RasioBB_sd <dbl>
#>
#> === Uji Mann-Whitney: BMI vs status_stunting ===
#>
#> Wilcoxon rank sum test with continuity correction
#>
#> data: bmi by status_stunting
#> W = 5824, p-value < 2.2e-16
#> alternative hypothesis: true location shift is not equal to 0
#>
#> === Uji Mann-Whitney: rasio_bb vs status_stunting ===
#>
#> Wilcoxon rank sum test with continuity correction
#>
#> data: rasio_bb by status_stunting
#> W = 22114, p-value = 0.4928
#> alternative hypothesis: true location shift is not equal to 0
library(patchwork)
make_violin <- function(var, label, unit) {
ggplot(data_encoded, aes(x = status_stunting, y = .data[[var]],
fill = status_stunting)) +
geom_violin(alpha = 0.3, trim = FALSE, colour = NA) +
geom_boxplot(width = 0.22, outlier.colour = COL_ORANGE, outlier.shape = 21,
outlier.size = 1.8, outlier.fill = "#FFB300", linewidth = 0.45) +
stat_summary(fun = mean, geom = "point", shape = 23, size = 3.5,
fill = "white", colour = COL_ACCENT) +
scale_fill_manual(values = c("Normal" = COL_NORMAL, "Stunting" = COL_STUNTING)) +
labs(title = label, x = "Status Stunting", y = unit) +
theme(legend.position = "none")
}
(make_violin("bmi", "Indeks Massa Tubuh (BMI)", "BMI (kg/m²)") |
make_violin("rasio_bb", "Rasio BB Sekarang / BB Lahir", "Rasio BB")) +
plot_annotation(
title = "Distribusi Fitur Rekayasa per Status Stunting",
subtitle = "Violin = distribusi | Box = kuartil | Berlian putih = rata-rata",
theme = theme(plot.title = element_text(face = "bold", colour = "#1A237E", size = 12),
plot.subtitle = element_text(colour = "#546E7A", size = 9))
)Figure 7.1: Distribusi Fitur Rekayasa Baru per Status Stunting (Violin + Boxplot)
minmax <- function(x) (x - min(x, na.rm=TRUE)) / (max(x, na.rm=TRUE) - min(x, na.rm=TRUE))
vars_norm <- c("usia_bulan","bb_lahir","tb_sekarang","bb_sekarang","pendapatan",
"diare_3bln","kunjungan_posyandu","zscore_tbu","bmi","rasio_bb","pend_ibu_enc")
data_normalized <- data_encoded %>%
mutate(across(all_of(vars_norm), minmax, .names = "{.col}_norm"))
cat("=== Verifikasi Range Setelah Min-Max Normalisasi ===\n")#> === Verifikasi Range Setelah Min-Max Normalisasi ===
data_normalized %>%
select(ends_with("_norm")) %>%
summarise(across(everything(), list(
Min = ~round(min(., na.rm=TRUE), 3),
Max = ~round(max(., na.rm=TRUE), 3)))) %>%
pivot_longer(everything(), names_to = c("Variabel","Stat"),
names_sep = "_(?=[^_]+$)") %>%
pivot_wider(names_from = Stat, values_from = value) %>%
print(n = 15)#> # A tibble: 11 × 3
#> Variabel Min Max
#> <chr> <dbl> <dbl>
#> 1 usia_bulan_norm 0 1
#> 2 bb_lahir_norm 0 1
#> 3 tb_sekarang_norm 0 1
#> 4 bb_sekarang_norm 0 1
#> 5 pendapatan_norm 0 1
#> 6 diare_3bln_norm 0 1
#> 7 kunjungan_posyandu_norm 0 1
#> 8 zscore_tbu_norm 0 1
#> 9 bmi_norm 0 1
#> 10 rasio_bb_norm 0 1
#> 11 pend_ibu_enc_norm 0 1
Dimensi: TRS + JRES | KUK: 2.3–2.4
# ── Library ───────────────────────────────────────────────────────────────────
library(car)
library(dplyr)
library(kableExtra)
# ── Persiapan data & helper ───────────────────────────────────────────────────
fitur_model <- data_encoded %>%
select(usia_bulan, bb_lahir, tb_sekarang, bb_sekarang,
pendapatan, diare_3bln, kunjungan_posyandu,
bmi, rasio_bb, pend_ibu_enc,
jenis_kelamin_enc, akses_air_enc, sanitasi_enc,
asi_eksklusif_enc, imunisasi_enc,
status_stunting) %>%
mutate(status_stunting_num = ifelse(status_stunting == "Stunting", 1, 0)) %>%
drop_na()
semua_prediktor <- c("usia_bulan", "bb_lahir", "tb_sekarang", "bb_sekarang",
"pendapatan", "diare_3bln", "kunjungan_posyandu",
"bmi", "rasio_bb", "pend_ibu_enc",
"jenis_kelamin_enc", "akses_air_enc", "sanitasi_enc",
"asi_eksklusif_enc", "imunisasi_enc")
kandidat_drop <- c("tb_sekarang", "bb_sekarang", "bmi", "rasio_bb")
hitung_vif <- function(data, prediktor) {
formula_str <- paste("status_stunting_num ~", paste(prediktor, collapse = " + "))
vif(lm(as.formula(formula_str), data = data))
}
render_vif_table <- function(vif_vals, caption_text) {
df <- data.frame(Fitur = names(vif_vals), VIF = round(vif_vals, 3)) %>%
arrange(desc(VIF))
kable(df, caption = caption_text,
col.names = c("Fitur", "VIF"), align = "lr", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, position = "center") %>%
row_spec(0, background = "#37474F", color = "white", bold = TRUE) %>%
row_spec(which(df$VIF > 10), background = "#FFEBEE",
color = COL_STUNTING, bold = TRUE) %>%
row_spec(which(df$VIF <= 5), background = "#E8F5E9")
}
# ══════════════════════════════════════════════════════════════════════════════
# [0] BASELINE — Semua variabel
# ══════════════════════════════════════════════════════════════════════════════
vif_base <- hitung_vif(fitur_model, semua_prediktor)
vif_base_df <- data.frame(Fitur = names(vif_base), VIF = round(vif_base, 3)) %>%
arrange(desc(VIF))
render_vif_table(vif_base, "Baseline — Nilai VIF Semua Prediktor")| Fitur | VIF | |
|---|---|---|
| bb_sekarang | bb_sekarang | 21.610 |
| rasio_bb | rasio_bb | 20.178 |
| tb_sekarang | tb_sekarang | 18.240 |
| bmi | bmi | 17.548 |
| bb_lahir | bb_lahir | 8.717 |
| usia_bulan | usia_bulan | 2.390 |
| akses_air_enc | akses_air_enc | 1.042 |
| asi_eksklusif_enc | asi_eksklusif_enc | 1.039 |
| diare_3bln | diare_3bln | 1.038 |
| pend_ibu_enc | pend_ibu_enc | 1.035 |
| imunisasi_enc | imunisasi_enc | 1.029 |
| kunjungan_posyandu | kunjungan_posyandu | 1.026 |
| sanitasi_enc | sanitasi_enc | 1.026 |
| jenis_kelamin_enc | jenis_kelamin_enc | 1.023 |
| pendapatan | pendapatan | 1.021 |
Baseline: Max VIF = 21.61 | Jumlah VIF > 10 = 4 — ❌ Terdeteksi multikolinearitas berat.
# ══════════════════════════════════════════════════════════════════════════════
# [1] ITERASI 1 — Keluarkan 1 variabel
# ══════════════════════════════════════════════════════════════════════════════
hasil_iter1 <- lapply(kandidat_drop, function(var) {
v <- hitung_vif(fitur_model, setdiff(semua_prediktor, var))
data.frame(Variabel_Dikeluarkan = var,
Max_VIF = round(max(v), 3),
N_VIF_gt10 = sum(v > 10))
}) %>% bind_rows() %>% arrange(N_VIF_gt10, Max_VIF)
kable(hasil_iter1,
caption = "Iterasi 1 — Efek Pengeluaran 1 Variabel terhadap VIF",
align = "lrr", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, position = "center") %>%
row_spec(0, background = "#37474F", color = "white", bold = TRUE) %>%
row_spec(1, background = "#E3F2FD", bold = TRUE)| Variabel_Dikeluarkan | Max_VIF | N_VIF_gt10 |
|---|---|---|
| bb_sekarang | 11.488 | 2 |
| tb_sekarang | 20.146 | 2 |
| bmi | 20.160 | 2 |
| rasio_bb | 18.210 | 3 |
# ══════════════════════════════════════════════════════════════════════════════
# [2] ITERASI 2 — Keluarkan 2 variabel
# ══════════════════════════════════════════════════════════════════════════════
hasil_iter2 <- combn(kandidat_drop, 2, simplify = FALSE) %>%
lapply(function(combo) {
v <- hitung_vif(fitur_model, setdiff(semua_prediktor, combo))
data.frame(Variabel_Dikeluarkan = paste(combo, collapse = " + "),
Max_VIF = round(max(v), 3),
N_VIF_gt10 = sum(v > 10))
}) %>% bind_rows() %>% arrange(N_VIF_gt10, Max_VIF)
kable(hasil_iter2,
caption = "Iterasi 2 — Efek Pengeluaran 2 Variabel terhadap VIF",
align = "lrr", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, position = "center") %>%
row_spec(0, background = "#37474F", color = "white", bold = TRUE) %>%
row_spec(1, background = "#E3F2FD", bold = TRUE)| Variabel_Dikeluarkan | Max_VIF | N_VIF_gt10 |
|---|---|---|
| bmi + rasio_bb | 2.386 | 0 |
| tb_sekarang + rasio_bb | 2.684 | 0 |
| bb_sekarang + bmi | 2.836 | 0 |
| bb_sekarang + rasio_bb | 3.973 | 0 |
| tb_sekarang + bb_sekarang | 4.241 | 0 |
| tb_sekarang + bmi | 20.134 | 2 |
# ══════════════════════════════════════════════════════════════════════════════
# [3] ITERASI 3 — Keluarkan 3 variabel
# ══════════════════════════════════════════════════════════════════════════════
hasil_iter3 <- combn(kandidat_drop, 3, simplify = FALSE) %>%
lapply(function(combo) {
v <- hitung_vif(fitur_model, setdiff(semua_prediktor, combo))
data.frame(Variabel_Dikeluarkan = paste(combo, collapse = " + "),
Max_VIF = round(max(v), 3),
N_VIF_gt10 = sum(v > 10))
}) %>% bind_rows() %>% arrange(N_VIF_gt10, Max_VIF)
kable(hasil_iter3,
caption = "Iterasi 3 — Efek Pengeluaran 3 Variabel terhadap VIF",
align = "lrr", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, position = "center") %>%
row_spec(0, background = "#37474F", color = "white", bold = TRUE) %>%
row_spec(1, background = "#E3F2FD", bold = TRUE)| Variabel_Dikeluarkan | Max_VIF | N_VIF_gt10 |
|---|---|---|
| tb_sekarang + bb_sekarang + rasio_bb | 1.058 | 0 |
| tb_sekarang + bmi + rasio_bb | 1.623 | 0 |
| bb_sekarang + bmi + rasio_bb | 1.817 | 0 |
| tb_sekarang + bb_sekarang + bmi | 2.836 | 0 |
# ══════════════════════════════════════════════════════════════════════════════
# [4] RINGKASAN KOMPARATIF & KEPUTUSAN OPTIMAL
# ══════════════════════════════════════════════════════════════════════════════
ringkasan <- data.frame(
Skenario = c("Baseline (0 dikeluarkan)",
paste0("Iterasi 1: keluarkan ", hasil_iter1$Variabel_Dikeluarkan[1]),
paste0("Iterasi 2: keluarkan ", hasil_iter2$Variabel_Dikeluarkan[1]),
paste0("Iterasi 3: keluarkan ", hasil_iter3$Variabel_Dikeluarkan[1])),
Fitur_Tersisa = c(length(semua_prediktor),
length(semua_prediktor) - 1,
length(semua_prediktor) - 2,
length(semua_prediktor) - 3),
Max_VIF = c(round(max(vif_base), 3),
hasil_iter1$Max_VIF[1],
hasil_iter2$Max_VIF[1],
hasil_iter3$Max_VIF[1]),
N_VIF_gt10 = c(sum(vif_base > 10),
hasil_iter1$N_VIF_gt10[1],
hasil_iter2$N_VIF_gt10[1],
hasil_iter3$N_VIF_gt10[1]),
Status = c("❌ Multikolinear berat",
ifelse(hasil_iter1$N_VIF_gt10[1] == 0, "✅ Bersih", "⚠️ Masih ada"),
ifelse(hasil_iter2$N_VIF_gt10[1] == 0, "✅ Bersih", "⚠️ Masih ada"),
ifelse(hasil_iter3$N_VIF_gt10[1] == 0, "✅ Bersih", "⚠️ Masih ada")),
stringsAsFactors = FALSE
)
optimal_idx <- which(ringkasan$N_VIF_gt10 == 0 &
ringkasan$Fitur_Tersisa ==
max(ringkasan$Fitur_Tersisa[ringkasan$N_VIF_gt10 == 0]))
if (length(optimal_idx) == 0) optimal_idx <- which.min(ringkasan$Max_VIF)
kable(ringkasan,
caption = "Ringkasan Komparatif Skenario Pengeluaran Variabel",
align = "lrrrc", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "bordered"),
full_width = FALSE, position = "center") %>%
row_spec(0, background = "#37474F", color = "white", bold = TRUE) %>%
row_spec(1, background = "#FFEBEE") %>%
row_spec(optimal_idx, background = "#C8E6C9", bold = TRUE)| Skenario | Fitur_Tersisa | Max_VIF | N_VIF_gt10 | Status |
|---|---|---|---|---|
| Baseline (0 dikeluarkan) | 15 | 21.610 | 4 | ❌ Multikolinear berat | |
| Iterasi 1: keluarkan bb_sekarang | 14 | 11.488 | 2 | ⚠️ Masih ada |
| Iterasi 2: keluarkan bmi + rasio_bb | 13 | 2.386 | 0 | ✅ Bersih | |
| Iterasi 3: keluarkan tb_sekarang + bb_sekarang + rasio_bb | 12 | 1.058 | 0 | ✅ Bersih | |
✅ Keputusan Optimal: Keluarkan Iterasi 2: keluarkan bmi + rasio_bb → Fitur tersisa: 13 | Max VIF: 2.386 | Semua VIF ≤ 10: YA ✅
# ══════════════════════════════════════════════════════════════════════════════
# [5] VIF FINAL & KORELASI
# ══════════════════════════════════════════════════════════════════════════════
n_drop_optimal <- length(semua_prediktor) - ringkasan$Fitur_Tersisa[optimal_idx]
vars_optimal_drop <- switch(as.character(n_drop_optimal),
"1" = hasil_iter1$Variabel_Dikeluarkan[1],
"2" = unlist(strsplit(hasil_iter2$Variabel_Dikeluarkan[1], " \\+ ")),
"3" = unlist(strsplit(hasil_iter3$Variabel_Dikeluarkan[1], " \\+ "))
)
pred_final <- setdiff(semua_prediktor, vars_optimal_drop)
vif_final <- hitung_vif(fitur_model, pred_final)
render_vif_table(vif_final,
paste("VIF Final — Variabel dikeluarkan:",
paste(vars_optimal_drop, collapse = ", ")))| Fitur | VIF | |
|---|---|---|
| usia_bulan | usia_bulan | 2.386 |
| tb_sekarang | tb_sekarang | 1.817 |
| bb_sekarang | bb_sekarang | 1.623 |
| akses_air_enc | akses_air_enc | 1.041 |
| asi_eksklusif_enc | asi_eksklusif_enc | 1.038 |
| bb_lahir | bb_lahir | 1.033 |
| pend_ibu_enc | pend_ibu_enc | 1.033 |
| diare_3bln | diare_3bln | 1.030 |
| imunisasi_enc | imunisasi_enc | 1.025 |
| kunjungan_posyandu | kunjungan_posyandu | 1.024 |
| sanitasi_enc | sanitasi_enc | 1.022 |
| pendapatan | pendapatan | 1.021 |
| jenis_kelamin_enc | jenis_kelamin_enc | 1.013 |
# ── Matriks korelasi (pasangan r > 0.9) ──────────────────────────────────────
cor_mat <- cor(fitur_model %>%
select(usia_bulan, bb_lahir, tb_sekarang, bb_sekarang,
pendapatan, diare_3bln, kunjungan_posyandu,
bmi, rasio_bb),
use = "complete.obs")
cor_tinggi <- as.data.frame(as.table(cor_mat)) %>%
filter(Var1 != Var2, abs(Freq) > 0.9) %>%
arrange(desc(abs(Freq)))
if (nrow(cor_tinggi) == 0) {
cat("\n✅ Tidak ada pasangan variabel dengan korelasi r > 0.9.",
"\n Multikolinearitas bersifat multivariat, bukan berpasangan.\n")
} else {
kable(cor_tinggi,
caption = "Pasangan Variabel dengan Korelasi r > 0.9",
col.names = c("Var 1", "Var 2", "Korelasi (r)"),
align = "llr", digits = 3, booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE, position = "center") %>%
row_spec(0, background = "#37474F", color = "white", bold = TRUE)
}#>
#> ✅ Tidak ada pasangan variabel dengan korelasi r > 0.9.
#> Multikolinearitas bersifat multivariat, bukan berpasangan.
data_stunting_model <- data_encoded %>%
select(usia_bulan, jenis_kelamin_enc, bb_lahir, tb_sekarang, bb_sekarang,
pend_ibu_enc, pendapatan, akses_air_enc, sanitasi_enc,
kunjungan_posyandu, asi_eksklusif_enc, imunisasi_enc, diare_3bln,
bmi, rasio_bb, status_stunting) %>%
drop_na()
cat("=== Dataset Final untuk Modeling ===\n")#> === Dataset Final untuk Modeling ===
#> Dimensi : 492 16
#> Variabel : usia_bulan jenis_kelamin_enc bb_lahir tb_sekarang bb_sekarang pend_ibu_enc pendapatan akses_air_enc sanitasi_enc kunjungan_posyandu asi_eksklusif_enc imunisasi_enc diare_3bln bmi rasio_bb status_stunting
#> Missing : 0
write_csv(data_stunting_model, "C:/Users/User/Downloads/data_stunting_model.csv")
cat("\n✓ data_stunting_model.csv tersimpan!\n")#>
#> ✓ data_stunting_model.csv tersimpan!
Kode Unit: J.62DMI00.012.1
| Algoritma | Tipe | Asumsi | Keunggulan | Kelemahan | Dipilih |
|---|---|---|---|---|---|
| Logistic Regression | Parametrik / Linear | Linearitas log-odds; tidak ada multikolinearitas berat | Interpretabel; efisien; output probabilitas; baseline yang baik | Sensitif multikolinearitas; kurang baik untuk hubungan non-linear | ✓ Ya (Baseline) |
| Decision Tree | Non-parametrik / Tree | Tidak ada asumsi distribusi; bisa menangani data campuran | Mudah diinterpretasi secara visual; fast training; seleksi fitur otomatis | Cenderung overfit; tidak stabil (sensitif terhadap perubahan data) | ✓ Ya |
| Random Forest | Ensemble / Bagging | Tidak ada asumsi distribusi; robust terhadap outlier & imbalance | Performa tinggi; robust terhadap noise & imbalance; feature importance | Black-box; lambat untuk data besar; butuh tuning hyperparameter | ✓ Ya (Terbaik) |
library(caret)
data_stunting_model <- read_csv("C:/Users/User/Downloads/data_stunting_model.csv") %>%
mutate(status_stunting = factor(status_stunting, levels = c("Normal","Stunting")))
cat("Dimensi data:", dim(data_stunting_model), "\n")#> Dimensi data: 492 16
#> Distribusi target:
#>
#> Normal Stunting
#> 0.744 0.256
# ── Skenario 1: 80:20 Split ───────────────────────────────────────────────────
set.seed(42)
split_idx <- createDataPartition(data_stunting_model$status_stunting, p = 0.8, list = FALSE)
train_set <- data_stunting_model[split_idx,]
test_set <- data_stunting_model[-split_idx,]
cat("\n=== SKENARIO 1: Percentage Split 80:20 ===\n")#>
#> === SKENARIO 1: Percentage Split 80:20 ===
#> Training: 394 | Test: 98
#> Distribusi Training:
#>
#> Normal Stunting
#> 74.4 25.6
#> Distribusi Test:
#>
#> Normal Stunting
#> 74.5 25.5
# ── Skenario 2: 5-Fold CV ────────────────────────────────────────────────────
set.seed(42)
cv_folds <- createFolds(data_stunting_model$status_stunting, k = 5,
list = TRUE, returnTrain = FALSE)
cat("\n=== SKENARIO 2: Stratified 5-Fold Cross Validation ===\n")#>
#> === SKENARIO 2: Stratified 5-Fold Cross Validation ===
for (i in 1:5) {
ft <- data_stunting_model[cv_folds[[i]],]
fr <- data_stunting_model[-cv_folds[[i]],]
cat(sprintf("Fold %d | Train: %d | Test: %d | Stunting di test: %d (%.1f%%)\n",
i, nrow(fr), nrow(ft),
sum(ft$status_stunting == "Stunting"),
mean(ft$status_stunting == "Stunting") * 100))
}#> Fold 1 | Train: 392 | Test: 100 | Stunting di test: 26 (26.0%)
#> Fold 2 | Train: 394 | Test: 98 | Stunting di test: 25 (25.5%)
#> Fold 3 | Train: 394 | Test: 98 | Stunting di test: 25 (25.5%)
#> Fold 4 | Train: 394 | Test: 98 | Stunting di test: 25 (25.5%)
#> Fold 5 | Train: 394 | Test: 98 | Stunting di test: 25 (25.5%)
ctrl_cv <- trainControl(method = "cv", number = 5, classProbs = TRUE,
summaryFunction = twoClassSummary,
savePredictions = "final", sampling = NULL)
saveRDS(train_set, "C:/Users/User/Downloads/train_set.rds")
saveRDS(test_set, "C:/Users/User/Downloads/test_set.rds")
saveRDS(ctrl_cv, "C:/Users/User/Downloads/ctrl_cv.rds")
cat("\n✓ train_set.rds, test_set.rds, ctrl_cv.rds tersimpan!\n")#>
#> ✓ train_set.rds, test_set.rds, ctrl_cv.rds tersimpan!
Kode Unit: J.62DMI00.013.1
# ── Muat data & set factor ────────────────────────────────────────────────────
data_stunting_model <- read_csv("C:/Users/User/Downloads/data_stunting_model.csv") %>%
mutate(status_stunting = factor(status_stunting, levels = c("Normal", "Stunting")))
train_set <- readRDS("C:/Users/User/Downloads/train_set.rds") %>%
mutate(status_stunting = factor(status_stunting, levels = c("Normal", "Stunting")))
test_set <- readRDS("C:/Users/User/Downloads/test_set.rds") %>%
mutate(status_stunting = factor(status_stunting, levels = c("Normal", "Stunting")))
cat("Train:", nrow(train_set), "| Test:", nrow(test_set), "\n")#> Train: 394 | Test: 98
# ── Bangun Logistic Regression ────────────────────────────────────────────────
set.seed(42)
model_lr <- glm(status_stunting ~ . - bmi - rasio_bb,
data = train_set,
family = binomial(link = "logit"))
cat("\n=== RINGKASAN MODEL LOGISTIC REGRESSION ===\n")#>
#> === RINGKASAN MODEL LOGISTIC REGRESSION ===
#>
#> Call:
#> glm(formula = status_stunting ~ . - bmi - rasio_bb, family = binomial(link = "logit"),
#> data = train_set)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 3.983e+03 3.382e+05 0.012 0.991
#> usia_bulan 2.557e+01 2.677e+03 0.010 0.992
#> jenis_kelamin_enc 1.160e+01 2.181e+04 0.001 1.000
#> bb_lahir -9.238e+00 2.255e+04 0.000 1.000
#> tb_sekarang -6.811e+01 5.504e+03 -0.012 0.990
#> bb_sekarang 5.446e+00 2.109e+04 0.000 1.000
#> pend_ibu_enc -1.817e+01 1.066e+04 -0.002 0.999
#> pendapatan -5.384e-06 4.766e-03 -0.001 0.999
#> akses_air_enc 1.096e+01 7.466e+04 0.000 1.000
#> sanitasi_enc -1.467e+01 7.683e+04 0.000 1.000
#> kunjungan_posyandu 8.950e-01 3.924e+03 0.000 1.000
#> asi_eksklusif_enc 1.060e+01 2.307e+04 0.000 1.000
#> imunisasi_enc -9.341e+00 1.395e+04 -0.001 0.999
#> diare_3bln 4.636e-01 5.128e+03 0.000 1.000
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 4.4853e+02 on 393 degrees of freedom
#> Residual deviance: 3.8287e-07 on 380 degrees of freedom
#> AIC: 28
#>
#> Number of Fisher Scoring iterations: 25
# ── Cek Perfect Separation ────────────────────────────────────────────────────
cat("\n=== CEK PERFECT SEPARATION ===\n")#>
#> === CEK PERFECT SEPARATION ===
#> Null deviance : 448.5290 (df = 393)
#> Residual deviance: 3.828667e-07 (df = 380)
#> Iterasi Fisher : 25
if (model_lr$deviance < 1e-3) {
cat(" ⚠️ Perfect/quasi-complete separation terdeteksi!\n")
cat(" SE sangat besar → p-value tidak reliable.\n")
cat(" CI dihitung via Wald (±1.96×SE) sebagai aproksimasi.\n")
}#> ⚠️ Perfect/quasi-complete separation terdeteksi!
#> SE sangat besar → p-value tidak reliable.
#> CI dihitung via Wald (±1.96×SE) sebagai aproksimasi.
# ── Koefisien + p-value, diurutkan dari Pr(>|z|) terkecil ────────────────────
cat("\n=== KOEFISIEN MODEL (Diurutkan: p-value terkecil → terbesar) ===\n")#>
#> === KOEFISIEN MODEL (Diurutkan: p-value terkecil → terbesar) ===
smry <- summary(model_lr)$coefficients
coef_df <- as.data.frame(smry) %>%
rownames_to_column("Variabel") %>%
rename(Estimate = Estimate,
Std_Error = `Std. Error`,
z_value = `z value`,
p_value = `Pr(>|z|)`) %>%
arrange(p_value) %>% # ← urutkan terkecil
mutate(
Signifikan = case_when(
p_value < 0.001 ~ "***",
p_value < 0.01 ~ "**",
p_value < 0.05 ~ "*",
p_value < 0.1 ~ ".",
TRUE ~ ""
)
)
# Tampilkan semua, highlight top-3
kable(coef_df %>%
mutate(across(c(Estimate, Std_Error, z_value), ~round(., 4)),
p_value = formatC(p_value, format = "e", digits = 4)),
caption = "Koefisien LR — Diurutkan Berdasarkan p-value (Terkecil ke Terbesar)",
align = "lrrrrc") %>%
kable_styling(bootstrap_options = c("striped","hover","bordered"),
full_width = FALSE, position = "center") %>%
row_spec(1:3, background = "#C8E6C9", bold = TRUE) %>% # top-3 hijau
row_spec(which(coef_df$p_value < 0.05 & !((1:nrow(coef_df)) %in% 1:3)),
background = "#FFF9C4") %>% # sig lain kuning
add_header_above(c(" " = 1, "Estimasi" = 3, "Uji" = 2))| Variabel | Estimate | Std_Error | z_value | p_value | Signifikan |
|---|---|---|---|---|---|
| tb_sekarang | -68.1059 | 5503.7636 | -0.0124 | 9.9013e-01 | |
| (Intercept) | 3982.8677 | 338213.5118 | 0.0118 | 9.9060e-01 | |
| usia_bulan | 25.5675 | 2677.4763 | 0.0095 | 9.9238e-01 | |
| pend_ibu_enc | -18.1671 | 10657.0762 | -0.0017 | 9.9864e-01 | |
| pendapatan | 0.0000 | 0.0048 | -0.0011 | 9.9910e-01 | |
| imunisasi_enc | -9.3415 | 13953.9854 | -0.0007 | 9.9947e-01 | |
| jenis_kelamin_enc | 11.6017 | 21810.9579 | 0.0005 | 9.9958e-01 | |
| asi_eksklusif_enc | 10.6046 | 23066.6935 | 0.0005 | 9.9963e-01 | |
| bb_lahir | -9.2382 | 22549.4959 | -0.0004 | 9.9967e-01 | |
| bb_sekarang | 5.4457 | 21089.7194 | 0.0003 | 9.9979e-01 | |
| kunjungan_posyandu | 0.8950 | 3924.2929 | 0.0002 | 9.9982e-01 | |
| sanitasi_enc | -14.6695 | 76833.7371 | -0.0002 | 9.9985e-01 | |
| akses_air_enc | 10.9590 | 74658.0120 | 0.0001 | 9.9988e-01 | |
| diare_3bln | 0.4636 | 5127.9207 | 0.0001 | 9.9993e-01 |
# ── Cetak Top-3 Paling Signifikan ─────────────────────────────────────────────
cat("\n=== TOP 3 VARIABEL PALING SIGNIFIKAN (p-value terkecil) ===\n")#>
#> === TOP 3 VARIABEL PALING SIGNIFIKAN (p-value terkecil) ===
top3 <- coef_df %>% slice(1:3)
for (i in 1:3) {
cat(sprintf(" [%d] %-25s | p = %s | z = %.4f | Estimate = %.4f\n",
i,
top3$Variabel[i],
formatC(top3$p_value[i], format = "e", digits = 4),
top3$z_value[i],
top3$Estimate[i]))
}#> [1] tb_sekarang | p = 9.9013e-01 | z = -0.0124 | Estimate = -68.1059
#> [2] (Intercept) | p = 9.9060e-01 | z = 0.0118 | Estimate = 3982.8677
#> [3] usia_bulan | p = 9.9238e-01 | z = 0.0095 | Estimate = 25.5675
# ── Odds Ratio dengan CI Wald (robust terhadap separation) ───────────────────
cat("\n=== ODDS RATIO + 95% CI Wald ===\n")#>
#> === ODDS RATIO + 95% CI Wald ===
or_df <- data.frame(
Variabel = names(coef(model_lr)),
Koefisien = round(coef(model_lr), 4),
OddsRatio = round(exp(coef(model_lr)), 4),
CI_Lower = round(exp(coef(model_lr) - 1.96 * sqrt(diag(vcov(model_lr)))), 4),
CI_Upper = round(exp(coef(model_lr) + 1.96 * sqrt(diag(vcov(model_lr)))), 4)
) %>%
left_join(coef_df %>% select(Variabel, p_value), by = "Variabel") %>%
arrange(p_value) # ← ikut urutan signifikansi
kable(or_df %>% select(-p_value),
caption = "Odds Ratio + 95% CI Wald (diurutkan: paling signifikan)",
align = "lrrrr") %>%
kable_styling(bootstrap_options = c("striped","hover"),
full_width = FALSE, position = "center") %>%
row_spec(1:3, background = "#C8E6C9", bold = TRUE)| Variabel | Koefisien | OddsRatio | CI_Lower | CI_Upper |
|---|---|---|---|---|
| tb_sekarang | -68.1059 | 0.000000e+00 | 0.0000 | Inf |
| (Intercept) | 3982.8677 | Inf | 0.0000 | Inf |
| usia_bulan | 25.5675 | 1.270041e+11 | 0.0000 | Inf |
| pend_ibu_enc | -18.1671 | 0.000000e+00 | 0.0000 | Inf |
| pendapatan | 0.0000 | 1.000000e+00 | 0.9907 | 1.0094 |
| imunisasi_enc | -9.3415 | 1.000000e-04 | 0.0000 | Inf |
| jenis_kelamin_enc | 11.6017 | 1.092788e+05 | 0.0000 | Inf |
| asi_eksklusif_enc | 10.6046 | 4.032096e+04 | 0.0000 | Inf |
| bb_lahir | -9.2382 | 1.000000e-04 | 0.0000 | Inf |
| bb_sekarang | 5.4457 | 2.317620e+02 | 0.0000 | Inf |
| kunjungan_posyandu | 0.8950 | 2.447400e+00 | 0.0000 | Inf |
| sanitasi_enc | -14.6695 | 0.000000e+00 | 0.0000 | Inf |
| akses_air_enc | 10.9590 | 5.746826e+04 | 0.0000 | Inf |
| diare_3bln | 0.4636 | 1.589800e+00 | 0.0000 | Inf |
library(rpart); library(rpart.plot)
set.seed(42)
ctrl_dt <- trainControl(method = "cv", number = 5, classProbs = TRUE,
summaryFunction = twoClassSummary)
cp_grid <- expand.grid(cp = seq(0.001, 0.05, by = 0.001))
model_dt_cv <- train(status_stunting ~ ., data = train_set, method = "rpart",
trControl = ctrl_dt, tuneGrid = cp_grid, metric = "ROC")
cat("=== HASIL TUNING cp ===\n")#> === HASIL TUNING cp ===
#> cp ROC Sens Spec
#> 1 0.001 0.8495003 0.9147282 0.7119048
#> 2 0.002 0.8495003 0.9147282 0.7119048
#> 3 0.003 0.8495003 0.9147282 0.7119048
#> 4 0.004 0.8495003 0.9147282 0.7119048
#> 5 0.005 0.8495003 0.9147282 0.7119048
#> 6 0.006 0.8495003 0.9147282 0.7119048
#> 7 0.007 0.8495003 0.9147282 0.7119048
#> 8 0.008 0.8495003 0.9147282 0.7119048
#> 9 0.009 0.8495003 0.9147282 0.7119048
#> 10 0.010 0.8495003 0.9147282 0.7119048
#> 11 0.011 0.8495003 0.9147282 0.7119048
#> 12 0.012 0.8495003 0.9147282 0.7119048
#> 13 0.013 0.8439730 0.9216248 0.7023810
#> 14 0.014 0.8439730 0.9216248 0.7023810
#> 15 0.015 0.8439730 0.9216248 0.7023810
#> 16 0.016 0.8439730 0.9216248 0.7023810
#> 17 0.017 0.8439730 0.9216248 0.7023810
#> 18 0.018 0.8439730 0.9216248 0.7023810
#> 19 0.019 0.8439730 0.9216248 0.7023810
#> 20 0.020 0.8439730 0.9216248 0.7023810
#> 21 0.021 0.8439730 0.9216248 0.7023810
#> 22 0.022 0.8439730 0.9216248 0.7023810
#> 23 0.023 0.8439730 0.9216248 0.7023810
#> 24 0.024 0.8439730 0.9216248 0.7023810
#> 25 0.025 0.8310069 0.9419638 0.6823810
#> 26 0.026 0.8310069 0.9419638 0.6823810
#> 27 0.027 0.8310069 0.9419638 0.6823810
#> 28 0.028 0.8310069 0.9419638 0.6823810
#> 29 0.029 0.8310069 0.9419638 0.6823810
#> 30 0.030 0.8310069 0.9419638 0.6823810
#> 31 0.031 0.8310069 0.9419638 0.6823810
#> 32 0.032 0.8310069 0.9419638 0.6823810
#> 33 0.033 0.8310069 0.9419638 0.6823810
#> 34 0.034 0.8310069 0.9419638 0.6823810
#> 35 0.035 0.8310069 0.9419638 0.6823810
#> 36 0.036 0.8310069 0.9419638 0.6823810
#> 37 0.037 0.8310069 0.9419638 0.6823810
#> 38 0.038 0.8323629 0.9453536 0.6723810
#> 39 0.039 0.8323629 0.9453536 0.6723810
#> 40 0.040 0.8323629 0.9453536 0.6723810
#> 41 0.041 0.8323629 0.9453536 0.6723810
#> 42 0.042 0.8323629 0.9453536 0.6723810
#> 43 0.043 0.8323629 0.9453536 0.6723810
#> 44 0.044 0.8277866 0.9284044 0.6723810
#> 45 0.045 0.8277866 0.9284044 0.6723810
#> 46 0.046 0.8277866 0.9284044 0.6723810
#> 47 0.047 0.8277866 0.9284044 0.6723810
#> 48 0.048 0.8277866 0.9284044 0.6723810
#> 49 0.049 0.8277866 0.9284044 0.6723810
#> 50 0.050 0.8277866 0.9284044 0.6723810
#>
#> cp Optimal: 0.012
res_dt <- model_dt_cv$results
best_roc <- max(res_dt$ROC)
se_roc <- sd(res_dt$ROC) / sqrt(5)
cp_1se <- res_dt$cp[res_dt$ROC >= (best_roc - se_roc)][1]
cat(sprintf("Best ROC: %.4f | 1-SE threshold: %.4f | cp (1-SE): %s\n",
best_roc, best_roc - se_roc, cp_1se))#> Best ROC: 0.8495 | 1-SE threshold: 0.8457 | cp (1-SE): 0.001
set.seed(42)
model_dt_final <- rpart(status_stunting ~ ., data = train_set, method = "class",
control = rpart.control(cp = cp_1se))
printcp(model_dt_final)#>
#> Classification tree:
#> rpart(formula = status_stunting ~ ., data = train_set, method = "class",
#> control = rpart.control(cp = cp_1se))
#>
#> Variables actually used in tree construction:
#> [1] bmi rasio_bb tb_sekarang usia_bulan
#>
#> Root node error: 101/394 = 0.25635
#>
#> n= 394
#>
#> CP nsplit rel error xerror xstd
#> 1 0.356436 0 1.00000 1.00000 0.085807
#> 2 0.089109 1 0.64356 0.69307 0.075119
#> 3 0.054455 2 0.55446 0.58416 0.070126
#> 4 0.019802 5 0.37624 0.54455 0.068110
#> 5 0.009901 7 0.33663 0.53465 0.067588
#> 6 0.001000 8 0.32673 0.54455 0.068110
#> bmi tb_sekarang bb_sekarang rasio_bb
#> 61.0341758 41.1727881 13.8814786 12.4942513
#> usia_bulan bb_lahir pendapatan pend_ibu_enc
#> 10.3824263 3.4376910 1.4257306 0.8554017
#> kunjungan_posyandu
#> 0.7136781
rpart.plot(model_dt_final, type = 4, extra = 104, under = TRUE,
fallen.leaves = TRUE, branch.lty = 3, shadow.col = "gray75",
col = "#1A237E", main = paste0("Decision Tree Prediksi Stunting (cp = ", cp_1se, ")"),
cex = 0.78, box.palette = c(COL_NORMAL, COL_STUNTING))Figure 9.1: Decision Tree Prediksi Stunting (cp Optimal — 1-SE Rule)
library(randomForest)
set.seed(42)
ctrl_rf <- trainControl(method = "cv", number = 5, classProbs = TRUE,
summaryFunction = twoClassSummary)
mtry_grid <- expand.grid(mtry = 2:8)
model_rf_cv <- train(status_stunting ~ ., data = train_set, method = "rf",
trControl = ctrl_rf, tuneGrid = mtry_grid,
metric = "ROC", ntree = 200)
cat("=== HASIL TUNING mtry ===\n")#> === HASIL TUNING mtry ===
#> mtry ROC Sens Spec
#> 1 2 0.9215732 0.9794857 0.6028571
#> 2 3 0.9382779 0.9692577 0.6223810
#> 3 4 0.9401904 0.9657510 0.6623810
#> 4 5 0.9470196 0.9589714 0.7023810
#> 5 6 0.9543966 0.9590298 0.7219048
#> 6 7 0.9521226 0.9589129 0.7319048
#> 7 8 0.9527794 0.9657510 0.7409524
#>
#> mtry Optimal: 6
Figure 9.2: Tuning Hyperparameter mtry — Random Forest
imp_df <- varImp(model_rf_cv)$importance %>%
as.data.frame() %>%
tibble::rownames_to_column("Fitur") %>%
rename(Importance = Overall) %>%
arrange(Importance)
ggplot(imp_df, aes(x = reorder(Fitur, Importance), y = Importance, fill = Importance)) +
geom_col(colour = "white", width = 0.72) +
geom_text(aes(label = round(Importance, 1)), hjust = -0.1,
size = 3.2, colour = "#37474F") +
scale_fill_gradient(low = "#BBDEFB", high = COL_NORMAL) +
coord_flip() +
scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
labs(title = "Feature Importance — Random Forest",
subtitle = paste0("mtry optimal = ", model_rf_cv$bestTune$mtry, " | ntree = 200"),
x = NULL, y = "Importance Score (ROC)",
caption = "Semakin tinggi skor → semakin penting fitur dalam prediksi") +
theme(legend.position = "none")Figure 9.3: Feature Importance — Random Forest
#> Distribusi kelas training (Sebelum ROSE):
#>
#> Normal Stunting
#> 293 101
train_rose <- ROSE(status_stunting ~ ., data = train_set, seed = 42)$data
cat("\nDistribusi kelas training (Sesudah ROSE):\n")#>
#> Distribusi kelas training (Sesudah ROSE):
#>
#> Normal Stunting
#> 203 191
pred_rf_no <- predict(model_rf_cv, newdata = test_set)
cm_no <- confusionMatrix(pred_rf_no, test_set$status_stunting, positive = "Stunting")
set.seed(42)
model_rf_rose <- train(status_stunting ~ ., data = train_rose, method = "rf",
trControl = trainControl(method = "cv", number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary),
tuneGrid = expand.grid(mtry = model_rf_cv$bestTune$mtry),
metric = "ROC", ntree = 200)
pred_rf_rose <- predict(model_rf_rose, newdata = test_set)
cm_rose <- confusionMatrix(pred_rf_rose, test_set$status_stunting, positive = "Stunting")
cat("\n=== RF TANPA ROSE ===\n"); print(cm_no)#>
#> === RF TANPA ROSE ===
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Normal Stunting
#> Normal 72 5
#> Stunting 1 20
#>
#> Accuracy : 0.9388
#> 95% CI : (0.8715, 0.9772)
#> No Information Rate : 0.7449
#> P-Value [Acc > NIR] : 6.059e-07
#>
#> Kappa : 0.83
#>
#> Mcnemar's Test P-Value : 0.2207
#>
#> Sensitivity : 0.8000
#> Specificity : 0.9863
#> Pos Pred Value : 0.9524
#> Neg Pred Value : 0.9351
#> Prevalence : 0.2551
#> Detection Rate : 0.2041
#> Detection Prevalence : 0.2143
#> Balanced Accuracy : 0.8932
#>
#> 'Positive' Class : Stunting
#>
#>
#> === RF DENGAN ROSE ===
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Normal Stunting
#> Normal 66 7
#> Stunting 7 18
#>
#> Accuracy : 0.8571
#> 95% CI : (0.7719, 0.9196)
#> No Information Rate : 0.7449
#> P-Value [Acc > NIR] : 0.005306
#>
#> Kappa : 0.6241
#>
#> Mcnemar's Test P-Value : 1.000000
#>
#> Sensitivity : 0.7200
#> Specificity : 0.9041
#> Pos Pred Value : 0.7200
#> Neg Pred Value : 0.9041
#> Prevalence : 0.2551
#> Detection Rate : 0.1837
#> Detection Prevalence : 0.2551
#> Balanced Accuracy : 0.8121
#>
#> 'Positive' Class : Stunting
#>
df_rose_cmp <- data.frame(
Metrik = c("Sensitivity","Specificity","Accuracy","F1-Score"),
Tanpa_ROSE = c(round(cm_no$byClass["Sensitivity"], 4),
round(cm_no$byClass["Specificity"], 4),
round(cm_no$overall["Accuracy"], 4),
round(cm_no$byClass["F1"], 4)),
Dengan_ROSE = c(round(cm_rose$byClass["Sensitivity"], 4),
round(cm_rose$byClass["Specificity"], 4),
round(cm_rose$overall["Accuracy"], 4),
round(cm_rose$byClass["F1"], 4))
)
kable(df_rose_cmp, caption = "Perbandingan Performa RF: Tanpa vs Dengan ROSE",
col.names = c("Metrik","Tanpa ROSE","Dengan ROSE")) %>%
kable_styling(bootstrap_options = c("striped","hover"),
full_width = FALSE, position = "center") %>%
row_spec(1, background = "#FCE4EC", bold = TRUE) %>%
row_spec(0, background = "#37474F", color = "white", bold = TRUE)| Metrik | Tanpa ROSE | Dengan ROSE | |
|---|---|---|---|
| Sensitivity | Sensitivity | 0.8000 | 0.7200 |
| Specificity | Specificity | 0.9863 | 0.9041 |
| Accuracy | Accuracy | 0.9388 | 0.8571 |
| F1 | F1-Score | 0.8696 | 0.7200 |
Kode Unit: J.62DMI00.014.1
library(pROC)
test_set <- readRDS("C:/Users/User/Downloads/test_set.rds") %>%
mutate(status_stunting = factor(status_stunting, levels = c("Normal","Stunting")))
train_set <- readRDS("C:/Users/User/Downloads/train_set.rds") %>%
mutate(status_stunting = factor(status_stunting, levels = c("Normal","Stunting")))
# ── Model 1: Logistic Regression ─────────────────────────────────────────────
set.seed(42)
model_lr <- glm(status_stunting ~ . - bmi - rasio_bb,
data = train_set,
family = binomial(link = "logit"))
pred_lr_prob <- predict(model_lr, newdata = test_set, type = "response")
pred_lr_lbl <- factor(ifelse(pred_lr_prob >= 0.5, "Stunting","Normal"),
levels = c("Normal","Stunting"))
cm_lr <- confusionMatrix(pred_lr_lbl, test_set$status_stunting, positive = "Stunting")
roc_lr <- roc(test_set$status_stunting, pred_lr_prob,
levels = c("Normal","Stunting"), quiet = TRUE)
cat("=== Logistic Regression ===\n"); print(cm_lr)#> === Logistic Regression ===
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Normal Stunting
#> Normal 72 1
#> Stunting 1 24
#>
#> Accuracy : 0.9796
#> 95% CI : (0.9282, 0.9975)
#> No Information Rate : 0.7449
#> P-Value [Acc > NIR] : 1.729e-10
#>
#> Kappa : 0.9463
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9600
#> Specificity : 0.9863
#> Pos Pred Value : 0.9600
#> Neg Pred Value : 0.9863
#> Prevalence : 0.2551
#> Detection Rate : 0.2449
#> Detection Prevalence : 0.2551
#> Balanced Accuracy : 0.9732
#>
#> 'Positive' Class : Stunting
#>
#> AUC-ROC: 0.9989
# ── Model 2: Decision Tree ────────────────────────────────────────────────────
set.seed(42)
model_dt_fin <- rpart(status_stunting ~ ., data = train_set,
method = "class", control = rpart.control(cp = 0.001))
pred_dt_prob <- predict(model_dt_fin, newdata = test_set, type = "prob")[,"Stunting"]
pred_dt_lbl <- predict(model_dt_fin, newdata = test_set, type = "class")
cm_dt <- confusionMatrix(pred_dt_lbl, test_set$status_stunting, positive = "Stunting")
roc_dt <- roc(test_set$status_stunting, pred_dt_prob,
levels = c("Normal","Stunting"), quiet = TRUE)
cat("\n=== Decision Tree ===\n"); print(cm_dt)#>
#> === Decision Tree ===
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Normal Stunting
#> Normal 68 7
#> Stunting 5 18
#>
#> Accuracy : 0.8776
#> 95% CI : (0.7959, 0.9351)
#> No Information Rate : 0.7449
#> P-Value [Acc > NIR] : 0.0009983
#>
#> Kappa : 0.6691
#>
#> Mcnemar's Test P-Value : 0.7728300
#>
#> Sensitivity : 0.7200
#> Specificity : 0.9315
#> Pos Pred Value : 0.7826
#> Neg Pred Value : 0.9067
#> Prevalence : 0.2551
#> Detection Rate : 0.1837
#> Detection Prevalence : 0.2347
#> Balanced Accuracy : 0.8258
#>
#> 'Positive' Class : Stunting
#>
#> AUC-ROC: 0.9099
# ── Model 3: Random Forest ────────────────────────────────────────────────────
set.seed(42)
model_rf_fin <- randomForest(status_stunting ~ ., data = train_set, mtry = 6, ntree = 200)
pred_rf_prob <- predict(model_rf_fin, newdata = test_set, type = "prob")[,"Stunting"]
pred_rf_lbl <- predict(model_rf_fin, newdata = test_set, type = "class")
cm_rf <- confusionMatrix(pred_rf_lbl, test_set$status_stunting, positive = "Stunting")
roc_rf <- roc(test_set$status_stunting, pred_rf_prob,
levels = c("Normal","Stunting"), quiet = TRUE)
cat("\n=== Random Forest ===\n"); print(cm_rf)#>
#> === Random Forest ===
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Normal Stunting
#> Normal 73 5
#> Stunting 0 20
#>
#> Accuracy : 0.949
#> 95% CI : (0.8849, 0.9832)
#> No Information Rate : 0.7449
#> P-Value [Acc > NIR] : 1.099e-07
#>
#> Kappa : 0.8563
#>
#> Mcnemar's Test P-Value : 0.07364
#>
#> Sensitivity : 0.8000
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9359
#> Prevalence : 0.2551
#> Detection Rate : 0.2041
#> Detection Prevalence : 0.2041
#> Balanced Accuracy : 0.9000
#>
#> 'Positive' Class : Stunting
#>
#> AUC-ROC: 0.9816
rekap <- data.frame(
Model = c("Logistic Regression","Decision Tree","Random Forest"),
Accuracy = c(round(cm_lr$overall["Accuracy"], 4),
round(cm_dt$overall["Accuracy"], 4),
round(cm_rf$overall["Accuracy"], 4)),
Sensitivity = c(round(cm_lr$byClass["Sensitivity"], 4),
round(cm_dt$byClass["Sensitivity"], 4),
round(cm_rf$byClass["Sensitivity"], 4)),
Specificity = c(round(cm_lr$byClass["Specificity"], 4),
round(cm_dt$byClass["Specificity"], 4),
round(cm_rf$byClass["Specificity"], 4)),
Precision = c(round(cm_lr$byClass["Precision"], 4),
round(cm_dt$byClass["Precision"], 4),
round(cm_rf$byClass["Precision"], 4)),
F1_Score = c(round(cm_lr$byClass["F1"], 4),
round(cm_dt$byClass["F1"], 4),
round(cm_rf$byClass["F1"], 4)),
AUC_ROC = c(round(auc(roc_lr), 4),
round(auc(roc_dt), 4),
round(auc(roc_rf), 4)),
Kappa = c(round(cm_lr$overall["Kappa"], 4),
round(cm_dt$overall["Kappa"], 4),
round(cm_rf$overall["Kappa"], 4))
)
kable(rekap, caption = "Rekap Perbandingan Metrik Evaluasi — Ketiga Model",
align = c("l", rep("r",7))) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
column_spec(1, bold = TRUE, background = "#E8EAF6") %>%
column_spec(3, background = "#E8F5E9",
bold = rekap$Sensitivity == max(rekap$Sensitivity),
color = ifelse(rekap$Sensitivity == max(rekap$Sensitivity), COL_GREEN, "black")) %>%
column_spec(7, background = "#E3F2FD",
bold = rekap$AUC_ROC == max(rekap$AUC_ROC),
color = ifelse(rekap$AUC_ROC == max(rekap$AUC_ROC), COL_NORMAL, "black")) %>%
add_header_above(c(" " = 1, "Performa Model" = 7)) %>%
row_spec(0, background = "#1A237E", color = "white", bold = TRUE)| Model | Accuracy | Sensitivity | Specificity | Precision | F1_Score | AUC_ROC | Kappa |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.9796 | 0.96 | 0.9863 | 0.9600 | 0.9600 | 0.9989 | 0.9463 |
| Decision Tree | 0.8776 | 0.72 | 0.9315 | 0.7826 | 0.7500 | 0.9099 | 0.6691 |
| Random Forest | 0.9490 | 0.80 | 1.0000 | 1.0000 | 0.8889 | 0.9816 | 0.8563 |
roc_data <- bind_rows(
data.frame(fpr = 1 - roc_lr$specificities, tpr = roc_lr$sensitivities,
Model = paste0("Logistic Regression (AUC = ", round(auc(roc_lr),3), ")")),
data.frame(fpr = 1 - roc_dt$specificities, tpr = roc_dt$sensitivities,
Model = paste0("Decision Tree (AUC = ", round(auc(roc_dt),3), ")")),
data.frame(fpr = 1 - roc_rf$specificities, tpr = roc_rf$sensitivities,
Model = paste0("Random Forest (AUC = ", round(auc(roc_rf),3), ")"))
)
col_roc <- setNames(
c(COL_NORMAL, COL_GREEN, COL_STUNTING),
c(paste0("Logistic Regression (AUC = ", round(auc(roc_lr),3), ")"),
paste0("Decision Tree (AUC = ", round(auc(roc_dt),3), ")"),
paste0("Random Forest (AUC = ", round(auc(roc_rf),3), ")"))
)
ggplot(roc_data, aes(x = fpr, y = tpr, colour = Model, linetype = Model)) +
geom_line(linewidth = 1.1) +
geom_abline(slope = 1, intercept = 0,
linetype = "dashed", colour = "gray55", linewidth = 0.65) +
geom_point(data = roc_data %>% group_by(Model) %>%
filter(abs(fpr - 0.2) == min(abs(fpr - 0.2))),
size = 3, shape = 16) +
annotate("text", x = 0.6, y = 0.35, label = "Garis Acak (AUC = 0.5)",
colour = "gray50", size = 3, fontface = "italic") +
scale_colour_manual(values = col_roc) +
scale_linetype_manual(values = c("solid","longdash","dotdash")) +
coord_equal(xlim = c(0,1), ylim = c(0,1)) +
labs(title = "ROC Curve — Perbandingan Ketiga Model",
subtitle = "Prediksi Risiko Stunting Balita | Test Set (20%)",
x = "1 − Specificity (False Positive Rate)",
y = "Sensitivity (True Positive Rate)",
colour = "Model", linetype = "Model",
caption = "Sumber: data_stunting_model.csv (test set)") +
theme(legend.position = "bottom",
legend.direction = "vertical",
legend.key.width = unit(2.2, "cm"))Figure 10.1: ROC Curve — Perbandingan Ketiga Model Klasifikasi Stunting
Kode Unit: J.62DMI00.015.1
| No | Tahapan | Status | Kesesuaian | Catatan |
|---|---|---|---|---|
| 1 | Pengumpulan Data | ✓ Selesai | ✓ Sesuai Rencana | Dataset 500 balita, 17 variabel tersedia; 3 variabel memiliki missing values |
| 2 | Telaah Data (EDA) | ✓ Selesai | ✓ Sesuai Rencana | Korelasi, distribusi, visualisasi, hipotesis — semua tersedia |
| 3 | Validasi Data | ✓ Selesai | ✓ Sesuai Rencana | 59 nilai tidak valid (batas WHO); 45 sel missing; kecukupan data terpenuhi (rasio 35,71) |
| 4 | Pembersihan Data | ✓ Selesai | ✓ Sesuai Rencana | Outlier dihapus (batas soal); imputasi per kelompok usia; data bersih tersimpan |
| 5 | Konstruksi Data | ✓ Selesai | ✓ Sesuai Rencana | Encoding, feature engineering (BMI, rasio BB), normalisasi, VIF — selesai |
| 6 | Skenario Model | ✓ Selesai | ✓ Sesuai Rencana | 2 skenario (80:20 split & 5-fold CV); metrik evaluasi ditetapkan |
| 7 | Pembangunan Model | ✓ Selesai | ✓ Sesuai Rencana | 3 model dibangun: LR, DT, RF; hyperparameter tuning dilakukan dengan CV |
| 8 | Evaluasi Model | ✓ Selesai | ✓ Sesuai Rencana | Confusion matrix, Sensitivity, Specificity, F1, AUC-ROC dihitung untuk ketiga model |
Laporan ini dibuat menggunakan R Markdown — diknit ke format Word dan Material HTML.
Tanggal: 30 April 2026 | LSP Politeknik Statistika STIS 2026