π Mata Kuliah: Analisis Survival
π¨βπ« Dosen Pengampu: Bakti Siregar, M.Sc., CDS
π₯ Kelompok: Churn Prediction
Churn pelanggan (customer attrition) β persentase pelanggan yang berhenti menggunakan produk atau layanan β adalah salah satu metrik terpenting bagi bisnis. Umumnya, biaya untuk mendapatkan pelanggan baru jauh lebih besar daripada biaya mempertahankan pelanggan yang sudah ada.
Menurut studi Bain & Company, peningkatan retensi pelanggan sebesar 5% dapat menghasilkan peningkatan laba lebih dari 25% (khususnya di layanan keuangan).
Mengapa Analisis Survival?
Analisis klasifikasi biasa hanya menjawab βApakah pelanggan akan churn?β, sedangkan Analisis Survival menjawab pertanyaan yang jauh lebih bernilai bisnis:
βKapan pelanggan akan berhenti berlangganan?β
Dengan demikian, perusahaan dapat mengambil tindakan retensi yang tepat sebelum churn terjadi.
Tiga konsep fundamental yang dipakai sepanjang laporan ini:
months_active.churned.churned = 0) disebut
right-censoring.Dari T dan Ξ΄, dua fungsi utama yang diestimasi:
Sebuah perusahaan SaaS (Software as a Service) menyediakan produk untuk UKM:
Tujuan: Membantu CFO memprakirakan biaya akuisisi & pemasaran dengan membangun model churn yang memprediksi kapan pelanggan kemungkinan besar akan menghentikan langganan bulanan mereka.
Seluruh paket R yang dibutuhkan beserta penjelasan fungsinya masing-masing:
# ββ FUNGSI PEMBANTU: Instalasi otomatis jika paket belum tersedia βββββββββββββ
install_if_missing <- function(pkg) {
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg, repos = "https://cloud.r-project.org", quiet = TRUE)
}
}
# ββ DAFTAR PAKET YANG DIPERLUKAN ββββββββββββββββββββββββββββββββββββββββββββββ
packages_needed <- c(
# Tampilan laporan
"rmdformats", # Tema HTML R Markdown (readthedown, material, dsb.)
# Survival analysis β inti
"survival", # Fungsi dasar: Surv(), survfit(), coxph()
"survminer", # Visualisasi kurva KM yang cantik (berbasis ggplot2)
"randomForestSRC", # Random Survival Forest & GBSA proxy di R
# Evaluasi model
"pec", # Prediction Error Curves β Brier Score untuk model survival
"Hmisc", # Fungsi statistik tambahan: rcorr.cens() untuk C-index Cox
# Visualisasi
"ggplot2", # Visualisasi data berbasis grammar of graphics
"corrplot", # Heatmap matriks korelasi
"RColorBrewer", # Palet warna profesional
"gridExtra", # Menyusun beberapa panel plot ggplot
"plotly", # Grafik interaktif berbasis web (HTML widget)
"scales", # Formatting label sumbu (persen, ribuan, dsb.)
# Manipulasi data
"dplyr", # Manipulasi data: filter, mutate, group_by, summarise
"tidyr", # Reshape data: pivot_longer, pivot_wider
# Laporan
"knitr", # Kompilasi dokumen R Markdown
"kableExtra" # Tabel HTML yang lebih cantik dan fleksibel
)
# ββ INSTALASI OTOMATIS ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
invisible(lapply(packages_needed, install_if_missing))
cat("β
Pengecekan & instalasi paket selesai!\n")## β
Pengecekan & instalasi paket selesai!
# ββ LOAD SEMUA PAKET ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
library(survival) # Fungsi inti survival analysis
library(survminer) # Visualisasi Kaplan-Meier
library(randomForestSRC) # Random Survival Forest & GBSA
library(pec) # Brier Score
library(Hmisc) # C-index untuk Cox
library(ggplot2) # Visualisasi data
library(corrplot) # Matriks korelasi
library(RColorBrewer) # Palet warna
library(gridExtra) # Multi-panel plot
library(plotly) # Grafik interaktif
library(scales) # Formatting label
library(dplyr) # Manipulasi data
library(tidyr) # Reshape data
library(knitr) # Tabel dokumen
library(kableExtra) # Tabel HTML cantik
# ββ KONFIGURASI WARNA GLOBAL ββββββββββββββββββββββββββββββββββββββββββββββββββ
COLORS <- list(
primary = "#4361EE",
secondary = "#F72585",
success = "#06D6A0",
warning = "#FFB703",
danger = "#EF476F",
info = "#118AB2",
dark = "#2B2D42",
light = "#EDF2F4"
)
MODEL_COLORS <- c(
RSF = "#4361EE",
Cox = "#F72585",
GBSA = "#06D6A0",
KM = "#FFB703"
)
RISK_COLORS <- c(
low = "#06D6A0",
medium = "#FFB703",
high = "#EF476F"
)
# ββ TEMA GGPLOT2 GLOBAL βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
theme_set(
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "white", color = NA),
panel.background = element_rect(fill = "#F8F9FA", color = NA),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = "grey85", linetype = "dashed"),
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11, color = "grey40"),
legend.background = element_rect(fill = "white", color = NA)
)
)
cat("β
Semua paket berhasil dimuat!\n")## β
Semua paket berhasil dimuat!
## R version : R version 4.5.2 (2025-10-31 ucrt)
## ggplot2 : 4.0.1
## survival : 3.8.6
## randomForestSRC : 3.6.2
Dataset berisi 2.000 pelanggan dengan 14 fitur yang terbagi dalam 5 kategori:
| Kategori | Fitur | Tipe | Deskripsi |
|---|---|---|---|
| β±οΈ Waktu | months_active |
Numerik | Durasi berlangganan (bulan) β variabel T |
| π― Kejadian | churned |
Biner | 1 = churn, 0 = masih aktif (tersensor) β variabel Ξ΄ |
| π¦ Produk | product_data_storage |
Numerik | GB penyimpanan yang dibeli |
| π¦ Produk | product_travel_expense |
Kategorikal | Free-Trial / Active / No |
| π¦ Produk | product_payroll |
Kategorikal | Free-Trial / Active / No |
| π¦ Produk | product_accounting |
Kategorikal | Free-Trial / Active / No |
| π Kepuasan | csat_score |
Numerik | Skor kepuasan pelanggan |
| π Kepuasan | minutes_customer_support |
Numerik | Menit di customer support |
| π£ Pemasaran | articles_viewed |
Numerik | Artikel yang dilihat |
| π£ Pemasaran | smartphone_notifications_viewed |
Numerik | Notifikasi HP yang dilihat |
| π£ Pemasaran | marketing_emails_clicked |
Numerik | Email marketing yang diklik |
| π£ Pemasaran | social_media_ads_viewed |
Numerik | Iklan sosmed yang dilihat |
| π’ Info | company_size |
Kategorikal | Ukuran perusahaan |
| π’ Info | us_region |
Kategorikal | Wilayah AS |
# ββ LOAD DATASET ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
url <- "https://raw.githubusercontent.com/Square/pysurvival/master/pysurvival/datasets/churn.csv"
raw_dataset <- tryCatch(
read.csv(url, stringsAsFactors = FALSE),
error = function(e) stop("Gagal memuat dataset. Periksa koneksi internet Anda.")
)
cat("=" , strrep("=", 55), "\n")## = =======================================================
## INFORMASI DATASET AWAL
## = =======================================================
##
## Dimensi : 2000 baris Γ 14 kolom
##
## Nama Kolom:
## β’ product_data_storage
## β’ product_travel_expense
## β’ product_payroll
## β’ product_accounting
## β’ csat_score
## β’ articles_viewed
## β’ smartphone_notifications_viewed
## β’ marketing_emails_clicked
## β’ social_media_ads_viewed
## β’ minutes_customer_support
## β’ company_size
## β’ us_region
## β’ months_active
## β’ churned
kable(head(raw_dataset, 5),
caption = "5 Baris Pertama Dataset",
align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| product_data_storage | product_travel_expense | product_payroll | product_accounting | csat_score | articles_viewed | smartphone_notifications_viewed | marketing_emails_clicked | social_media_ads_viewed | minutes_customer_support | company_size | us_region | months_active | churned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2048 | Free-Trial | Active | No | 9 | 4 | 0 | 14 | 1 | 8.3 | 10-50 | West North Central | 3 | 1 |
| 2048 | Free-Trial | Free-Trial | Active | 9 | 4 | 2 | 12 | 1 | 0.0 | 100-250 | South Atlantic | 2 | 1 |
| 2048 | Active | Active | Active | 9 | 3 | 2 | 17 | 1 | 0.0 | 100-250 | East South Central | 7 | 0 |
| 500 | Active | Free-Trial | No | 10 | 0 | 0 | 14 | 0 | 0.0 | 50-100 | East South Central | 8 | 1 |
| 5120 | Free-Trial | Active | Free-Trial | 8 | 5 | 0 | 17 | 0 | 0.0 | 50-100 | East North Central | 7 | 0 |
tipe_data <- data.frame(
Kolom = names(raw_dataset),
Tipe = sapply(raw_dataset, class),
row.names = NULL
)
kable(tipe_data, caption = "Tipe Data Setiap Kolom", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Kolom | Tipe |
|---|---|
| product_data_storage | integer |
| product_travel_expense | character |
| product_payroll | character |
| product_accounting | character |
| csat_score | integer |
| articles_viewed | integer |
| smartphone_notifications_viewed | integer |
| marketing_emails_clicked | integer |
| social_media_ads_viewed | integer |
| minutes_customer_support | numeric |
| company_size | character |
| us_region | character |
| months_active | numeric |
| churned | numeric |
π‘ Mengapa Data Cleaning dilakukan SEBELUM EDA?
Data yang kotor dapat menghasilkan visualisasi dan statistik yang menyesatkan. Dengan membersihkan data terlebih dahulu, kita memastikan bahwa semua insight dari EDA mencerminkan kondisi data yang sebenarnya.
Tahapan Data Cleaning yang akan dilakukan:
## = =======================================================
## STEP 4.1: CEK NILAI NULL
## = =======================================================
null_counts <- colSums(is.na(raw_dataset))
null_pct <- round(null_counts / nrow(raw_dataset) * 100, 2)
null_df <- data.frame(
Kolom = names(null_counts),
Jumlah_NULL = null_counts,
Persentase_pct = null_pct,
row.names = NULL
)
p_null <- ggplot(null_df, aes(x = reorder(Kolom, Jumlah_NULL), y = Jumlah_NULL)) +
geom_col(fill = COLORS$primary, color = "white") +
coord_flip() +
labs(
title = "Jumlah Nilai NULL per Kolom",
subtitle = "0 = tidak ada nilai yang hilang",
x = "Kolom", y = "Jumlah NULL"
) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
print(p_null)if (sum(null_counts) == 0) {
cat("\nβ
HASIL: Dataset TIDAK memiliki nilai NULL β data bersih dari missing values.\n")
} else {
cat(sprintf("\nβ οΈ HASIL: Ditemukan %d nilai NULL.\n", sum(null_counts)))
print(null_df[null_df$Jumlah_NULL > 0, ])
}##
## β
HASIL: Dataset TIDAK memiliki nilai NULL β data bersih dari missing values.
## = =======================================================
## STEP 4.2: CEK DUPLIKAT
## = =======================================================
n_before <- nrow(raw_dataset)
n_duplikat <- sum(duplicated(raw_dataset))
dataset <- raw_dataset[!duplicated(raw_dataset), ]
n_after <- nrow(dataset)
cat(sprintf(" Jumlah baris sebelum : %d\n", n_before))## Jumlah baris sebelum : 2000
## Jumlah baris duplikat : 0
## Jumlah baris setelah : 2000
if (n_duplikat == 0) {
cat("\nβ
HASIL: Dataset TIDAK memiliki baris duplikat.\n")
} else {
cat(sprintf("\nβ
HASIL: %d baris duplikat berhasil dihapus.\n", n_duplikat))
}##
## β
HASIL: Dataset TIDAK memiliki baris duplikat.
## = =======================================================
## STEP 4.3: DETEKSI OUTLIER (IQR Method)
## = =======================================================
num_cols <- c("months_active", "product_data_storage", "csat_score",
"minutes_customer_support", "articles_viewed",
"smartphone_notifications_viewed", "marketing_emails_clicked",
"social_media_ads_viewed")
outlier_summary <- lapply(num_cols, function(col) {
vals <- dataset[[col]]
Q1 <- quantile(vals, 0.25, na.rm = TRUE)
Q3 <- quantile(vals, 0.75, na.rm = TRUE)
IQR_v <- Q3 - Q1
lower <- Q1 - 1.5 * IQR_v
upper <- Q3 + 1.5 * IQR_v
n_out <- sum(vals < lower | vals > upper, na.rm = TRUE)
data.frame(
Fitur = col,
Q1 = round(Q1, 2),
Q3 = round(Q3, 2),
IQR = round(IQR_v, 2),
Batas_Bawah = round(lower, 2),
Batas_Atas = round(upper, 2),
Jml_Outlier = n_out,
Persen_pct = round(n_out / nrow(dataset) * 100, 2)
)
})
outlier_df <- do.call(rbind, outlier_summary)
kable(outlier_df,
caption = "Ringkasan Deteksi Outlier per Fitur",
align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE)| Fitur | Q1 | Q3 | IQR | Batas_Bawah | Batas_Atas | Jml_Outlier | Persen_pct | |
|---|---|---|---|---|---|---|---|---|
| 25% | months_active | 2 | 5 | 3 | -2.5 | 9.5 | 44 | 2.20 |
| 25%1 | product_data_storage | 500 | 2048 | 1548 | -1822.0 | 4370.0 | 144 | 7.20 |
| 25%2 | csat_score | 8 | 10 | 2 | 5.0 | 13.0 | 0 | 0.00 |
| 25%3 | minutes_customer_support | 0 | 0 | 0 | 0.0 | 0.0 | 407 | 20.35 |
| 25%4 | articles_viewed | 3 | 5 | 2 | 0.0 | 8.0 | 33 | 1.65 |
| 25%5 | smartphone_notifications_viewed | 0 | 1 | 1 | -1.5 | 2.5 | 11 | 0.55 |
| 25%6 | marketing_emails_clicked | 14 | 18 | 4 | 8.0 | 24.0 | 12 | 0.60 |
| 25%7 | social_media_ads_viewed | 0 | 1 | 1 | -1.5 | 2.5 | 0 | 0.00 |
# Boxplot semua fitur numerik
dataset_long <- dataset[, num_cols] |>
tidyr::pivot_longer(cols = everything(),
names_to = "Fitur",
values_to = "Nilai")
ggplot(dataset_long, aes(x = Fitur, y = Nilai)) +
geom_boxplot(fill = COLORS$primary, color = COLORS$dark, alpha = 0.7,
outlier.color = COLORS$warning, outlier.size = 1.5) +
coord_flip() +
labs(
title = "Deteksi Outlier β Boxplot Fitur Numerik",
subtitle = "Titik oranye = nilai ekstrem (outlier)",
x = "Fitur", y = "Nilai"
)π Interpretasi Outlier:
Outlier TIDAK dihapus karena dalam konteks bisnis, nilai ekstrem bisa merepresentasikan perilaku pelanggan yang nyata (misal: pelanggan besar yang memakai sangat banyak storage, atau pelanggan VIP dengan waktu support yang lama). Model Random Survival Forest bersifat non-parametrik dan robust terhadap outlier.
## = =======================================================
## STEP 4.4: CEK NILAI KATEGORIKAL
## = =======================================================
cat_cols <- c("product_travel_expense", "product_payroll",
"product_accounting", "us_region", "company_size", "churned")
for (col in cat_cols) {
unique_vals <- unique(dataset[[col]])
cat(sprintf("\n π [%s]\n", col))
cat(sprintf(" Jumlah nilai unik : %d\n", length(unique_vals)))
cat(sprintf(" Nilai unik : %s\n",
paste(sort(as.character(unique_vals)), collapse = ", ")))
tbl <- table(dataset[[col]])
cat(" Distribusi:\n")
for (i in seq_along(tbl)) {
cat(sprintf(" %-20s β %5d (%.1f%%)\n",
names(tbl)[i], tbl[i], tbl[i] / nrow(dataset) * 100))
}
}##
## π [product_travel_expense]
## Jumlah nilai unik : 3
## Nilai unik : Active, Free-Trial, No
## Distribusi:
## Active β 374 (18.7%)
## Free-Trial β 1608 (80.4%)
## No β 18 (0.9%)
##
## π [product_payroll]
## Jumlah nilai unik : 3
## Nilai unik : Active, Free-Trial, No
## Distribusi:
## Active β 934 (46.7%)
## Free-Trial β 726 (36.3%)
## No β 340 (17.0%)
##
## π [product_accounting]
## Jumlah nilai unik : 3
## Nilai unik : Active, Free-Trial, No
## Distribusi:
## Active β 1010 (50.5%)
## Free-Trial β 487 (24.3%)
## No β 503 (25.1%)
##
## π [us_region]
## Jumlah nilai unik : 9
## Nilai unik : East North Central, East South Central, Middle Atlantic, Mountain, New England, Pacific, South Atlantic, West North Central, West South Central
## Distribusi:
## East North Central β 211 (10.5%)
## East South Central β 218 (10.9%)
## Middle Atlantic β 235 (11.8%)
## Mountain β 235 (11.8%)
## New England β 234 (11.7%)
## Pacific β 217 (10.8%)
## South Atlantic β 225 (11.2%)
## West North Central β 217 (10.8%)
## West South Central β 208 (10.4%)
##
## π [company_size]
## Jumlah nilai unik : 5
## Nilai unik : 1-10, 10-50, 100-250, 50-100, self-employed
## Distribusi:
## 1-10 β 316 (15.8%)
## 10-50 β 690 (34.5%)
## 100-250 β 244 (12.2%)
## 50-100 β 687 (34.4%)
## self-employed β 63 (3.1%)
##
## π [churned]
## Jumlah nilai unik : 2
## Nilai unik : 0, 1
## Distribusi:
## 0 β 1068 (53.4%)
## 1 β 932 (46.6%)
##
## β
HASIL: Tidak ditemukan inkonsistensi nilai kategorikal.
## = =======================================================
## STEP 4.5: STATISTIK DESKRIPTIF
## = =======================================================
desc_df <- as.data.frame(t(sapply(dataset[, num_cols], function(x) {
c(N = length(x),
Mean = round(mean(x, na.rm = TRUE), 2),
StdDev = round(sd(x, na.rm = TRUE), 2),
Min = round(min(x, na.rm = TRUE), 2),
Q25 = round(quantile(x, 0.25, na.rm = TRUE), 2),
Median = round(median(x, na.rm = TRUE), 2),
Q75 = round(quantile(x, 0.75, na.rm = TRUE), 2),
Max = round(max(x, na.rm = TRUE), 2))
})))
kable(desc_df,
caption = "Statistik Deskriptif Fitur Numerik",
align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE)| N | Mean | StdDev | Min | Q25.25% | Median | Q75.75% | Max | |
|---|---|---|---|---|---|---|---|---|
| months_active | 2000 | 3.88 | 2.39 | 0 | 2 | 3 | 5 | 12.0 |
| product_data_storage | 2000 | 1400.10 | 1209.64 | 100 | 500 | 1024 | 2048 | 5120.0 |
| csat_score | 2000 | 8.98 | 0.97 | 5 | 8 | 9 | 10 | 10.0 |
| minutes_customer_support | 2000 | 2.40 | 6.66 | 0 | 0 | 0 | 0 | 57.7 |
| articles_viewed | 2000 | 3.99 | 1.92 | 0 | 3 | 4 | 5 | 14.0 |
| smartphone_notifications_viewed | 2000 | 0.38 | 0.60 | 0 | 0 | 0 | 1 | 3.0 |
| marketing_emails_clicked | 2000 | 16.17 | 3.08 | 7 | 14 | 16 | 18 | 26.0 |
| social_media_ads_viewed | 2000 | 0.38 | 0.55 | 0 | 0 | 0 | 1 | 2.0 |
# Histogram distribusi fitur numerik dengan garis mean & median
dataset_long2 <- dataset[, num_cols] |>
tidyr::pivot_longer(cols = everything(),
names_to = "Fitur",
values_to = "Nilai")
stat_lines <- dataset_long2 |>
group_by(Fitur) |>
summarise(
Mean = mean(Nilai, na.rm = TRUE),
Median = median(Nilai, na.rm = TRUE),
.groups = "drop"
)
ggplot(dataset_long2, aes(x = Nilai)) +
geom_histogram(bins = 30, fill = COLORS$primary, color = "white", alpha = 0.85) +
geom_vline(data = stat_lines, aes(xintercept = Mean, color = "Mean"),
linetype = "dashed", linewidth = 1) +
geom_vline(data = stat_lines, aes(xintercept = Median, color = "Median"),
linetype = "solid", linewidth = 1) +
scale_color_manual(values = c(Mean = COLORS$danger, Median = COLORS$success)) +
facet_wrap(~ Fitur, scales = "free", ncol = 4) +
labs(
title = "Distribusi Fitur Numerik (Setelah Cleaning)",
color = "Statistik",
x = "Nilai", y = "Frekuensi"
)## = =======================================================
## STEP 4.6: ENCODING & FEATURE ENGINEERING
## = =======================================================
time_column <- "months_active"
event_column <- "churned"
cat_encode <- c("product_travel_expense", "product_payroll",
"product_accounting", "us_region", "company_size")
n_feat_before <- ncol(dataset) - 2
# One-hot encoding via model.matrix (drop_first = TRUE otomatis via [-1])
formula_enc <- as.formula(paste("~", paste(cat_encode, collapse = " + ")))
dummies <- model.matrix(formula_enc, data = dataset)[, -1]
num_keep <- setdiff(names(dataset), c(cat_encode, time_column, event_column))
dataset_enc <- cbind(
dataset[, num_keep, drop = FALSE],
dummies,
dataset[, c(time_column, event_column)]
)
features <- setdiff(names(dataset_enc), c(time_column, event_column))
cat(sprintf(" Jumlah fitur sebelum encoding : %d\n", n_feat_before))## Jumlah fitur sebelum encoding : 12
## Jumlah fitur setelah encoding : 25
cat(sprintf(" Dimensi akhir dataset : %d baris Γ %d kolom\n",
nrow(dataset_enc), ncol(dataset_enc)))## Dimensi akhir dataset : 2000 baris Γ 27 kolom
##
## β
Dataset siap untuk analisis!
Interpretasi: Lima kolom kategorikal di-encode
menjadi variabel dummy (one-hot, dengan drop_first=TRUE via
model.matrix untuk menghindari multikolinearitas sempurna).
Dataset kini berbentuk numerik penuh dan siap digunakan oleh seluruh
model survival.
Setelah data bersih, EDA dilakukan untuk memahami pola dan hubungan antar variabel.
## = =======================================================
## EDA 5.1: DISTRIBUSI CHURN vs CENSORING
## = =======================================================
churn_counts <- table(dataset[[event_column]])
churn_df <- data.frame(
Status = c("Masih Aktif\n(Tersensor)", "Churn"),
Jumlah = as.numeric(churn_counts),
Pct = round(as.numeric(churn_counts) / N * 100, 1)
)
# Plot 1: Pie chart
p1 <- ggplot(churn_df, aes(x = "", y = Jumlah, fill = Status)) +
geom_col(width = 1, color = "white", linewidth = 1.5) +
coord_polar(theta = "y") +
scale_fill_manual(values = c(COLORS$success, COLORS$danger)) +
geom_text(aes(label = paste0(Pct, "%")),
position = position_stack(vjust = 0.5),
fontface = "bold", size = 5, color = "white") +
labs(title = "Proporsi Churn vs Aktif", fill = "Status") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 13, hjust = 0.5),
legend.position = "right")
# Plot 2: Bar chart
p2 <- ggplot(churn_df, aes(x = Status, y = Jumlah, fill = Status)) +
geom_col(width = 0.5, color = "white") +
geom_text(aes(label = scales::comma(Jumlah)),
vjust = -0.5, fontface = "bold", size = 5) +
scale_fill_manual(values = c(COLORS$success, COLORS$danger)) +
scale_y_continuous(expand = expansion(mult = c(0, 0.15)),
labels = scales::comma) +
labs(title = "Jumlah Pelanggan", x = "", y = "Jumlah") +
theme(legend.position = "none")
# Plot 3: Distribusi months_active by churn status
dataset_plot <- dataset
dataset_plot$Status <- ifelse(dataset_plot[[event_column]] == 1, "Churn", "Aktif")
p3 <- ggplot(dataset_plot, aes(x = get(time_column), fill = Status)) +
geom_histogram(bins = 20, alpha = 0.7, color = "white", position = "identity") +
scale_fill_manual(values = c("Aktif" = COLORS$success, "Churn" = COLORS$danger)) +
labs(title = "Distribusi Durasi Aktif\nberdasarkan Status Churn",
x = "Bulan Aktif", y = "Jumlah Pelanggan", fill = "Status")
gridExtra::grid.arrange(p1, p2, p3, ncol = 3)##
## π INTERPRETASI:
## β’ Total pelanggan: 2,000
cat(sprintf(" β’ Churn (%s pelanggan / %.1f%%): Pelanggan yang benar-benar berhenti.\n",
scales::comma(churn_counts["1"]), churn_counts["1"] / N * 100))## β’ Churn (932 pelanggan / 46.6%): Pelanggan yang benar-benar berhenti.
cat(sprintf(" β’ Aktif (%s pelanggan / %.1f%%): Pelanggan masih aktif (tersensor).\n",
scales::comma(churn_counts["0"]), churn_counts["0"] / N * 100))## β’ Aktif (1,068 pelanggan / 53.4%): Pelanggan masih aktif (tersensor).
## β’ Rasio yang relatif seimbang sangat ideal untuk pemodelan survival.
## = =======================================================
## EDA 5.2: KURVA KAPLAN-MEIER AWAL
## = =======================================================
km_overall <- survfit(Surv(months_active, churned) ~ 1, data = dataset)
km_company <- survfit(Surv(months_active, churned) ~ company_size, data = dataset)
p_km1 <- ggsurvplot(
km_overall,
data = dataset,
conf.int = TRUE,
palette = COLORS$primary,
ggtheme = theme_minimal(),
title = "Survival Function β Semua Pelanggan",
xlab = "Waktu (Bulan)",
ylab = "Probabilitas Bertahan S(t)",
surv.median.line = "hv",
legend = "none"
)
p_km2 <- ggsurvplot(
km_company,
data = dataset,
conf.int = FALSE,
ggtheme = theme_minimal(),
title = "Survival per Ukuran Perusahaan",
xlab = "Waktu (Bulan)",
ylab = "Probabilitas Bertahan S(t)",
legend.title = "Company Size",
legend.labs = gsub("company_size=", "", names(km_company$strata))
)
gridExtra::grid.arrange(p_km1$plot, p_km2$plot, ncol = 2)##
## π INTERPRETASI:
## β’ Median Survival Time = 6.0 bulan: 50% pelanggan
## diperkirakan churn dalam 6.0 bulan pertama.
## β’ Kurva KM per ukuran perusahaan menunjukkan perbedaan pola survival.
## = =======================================================
## EDA 5.3: MATRIKS KORELASI
## = =======================================================
num_feat_corr <- c("months_active", "product_data_storage", "csat_score",
"minutes_customer_support", "articles_viewed",
"smartphone_notifications_viewed", "marketing_emails_clicked",
"social_media_ads_viewed", "churned")
corr_matrix <- cor(dataset[, num_feat_corr], use = "complete.obs")
par(mfrow = c(1, 1))
corrplot(corr_matrix,
method = "color",
type = "lower",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.7,
col = colorRampPalette(c(COLORS$danger, "white", COLORS$primary))(200),
title = "Heatmap Korelasi Antar Fitur",
mar = c(0, 0, 2, 0))corr_target <- sort(corr_matrix["churned", -which(colnames(corr_matrix) == "churned")])
corr_target_df <- data.frame(
Fitur = names(corr_target),
Korelasi = as.numeric(corr_target),
Arah = ifelse(as.numeric(corr_target) > 0,
"Positif (β Risiko)", "Negatif (β Risiko)")
)
ggplot(corr_target_df, aes(x = reorder(Fitur, Korelasi), y = Korelasi, fill = Arah)) +
geom_col(color = "white") +
geom_hline(yintercept = 0, color = "grey40", linewidth = 0.8) +
geom_text(aes(label = round(Korelasi, 3),
hjust = ifelse(Korelasi >= 0, -0.1, 1.1)), size = 3.5) +
coord_flip() +
scale_fill_manual(values = c("Positif (β Risiko)" = COLORS$danger,
"Negatif (β Risiko)" = COLORS$success)) +
labs(
title = "Korelasi Fitur terhadap Churn",
subtitle = "Merah = meningkatkan risiko churn | Hijau = menurunkan risiko",
x = "Fitur", y = "Koefisien Korelasi Pearson", fill = "Arah Pengaruh"
) +
scale_y_continuous(expand = expansion(mult = c(0.15, 0.15)))##
## π INTERPRETASI KORELASI:
## β’ Tidak ada multikolinearitas signifikan (mayoritas nilai β Β±0.1).
## β’ csat_score berkorelasi NEGATIF dengan churn: pelanggan puas β tidak churn.
## β’ months_active berkorelasi POSITIF dengan churn (efek observasi).
π‘ Mengapa membandingkan rasio split?
Tidak ada rasio yang βsempurnaβ untuk semua kasus. Dengan membandingkan tiga rasio, kita bisa menentukan secara empiris mana yang paling optimal untuk dataset churn ini.
| Skenario | Train | Test | Keterangan |
|---|---|---|---|
| Split A | 60% | 40% | Lebih banyak data test β evaluasi lebih ketat |
| Split B | 70% | 30% | Rasio standar industri |
| Split C | 80% | 20% | Lebih banyak data train β model dapat belajar lebih banyak |
## = =======================================================
## PERBANDINGAN 3 RASIO SPLIT DATA
## = =======================================================
set.seed(42)
X_all <- dataset_enc[, features]
T_all <- dataset_enc[[time_column]]
E_all <- dataset_enc[[event_column]]
split_ratios <- list(
"Split 60/40" = 0.40,
"Split 70/30" = 0.30,
"Split 80/20" = 0.20
)
split_data <- list()
for (split_name in names(split_ratios)) {
test_size <- split_ratios[[split_name]]
n_test <- floor(N * test_size)
idx_test <- sample(1:N, n_test)
idx_train <- setdiff(1:N, idx_test)
split_data[[split_name]] <- list(
X_train = X_all[idx_train, ],
X_test = X_all[idx_test, ],
T_train = T_all[idx_train],
T_test = T_all[idx_test],
E_train = E_all[idx_train],
E_test = E_all[idx_test],
n_train = length(idx_train),
n_test = length(idx_test),
test_size = test_size
)
churn_train <- mean(E_all[idx_train])
churn_test <- mean(E_all[idx_test])
cat(sprintf("\n π %s:\n", split_name))
cat(sprintf(" Train : %d sampel (%.0f%%)\n",
length(idx_train), (1 - test_size) * 100))
cat(sprintf(" Test : %d sampel (%.0f%%)\n",
length(idx_test), test_size * 100))
cat(sprintf(" Churn rate Train : %.1f%%\n", churn_train * 100))
cat(sprintf(" Churn rate Test : %.1f%%\n", churn_test * 100))
}##
## π Split 60/40:
## Train : 1200 sampel (60%)
## Test : 800 sampel (40%)
## Churn rate Train : 44.8%
## Churn rate Test : 49.2%
##
## π Split 70/30:
## Train : 1400 sampel (70%)
## Test : 600 sampel (30%)
## Churn rate Train : 45.6%
## Churn rate Test : 48.8%
##
## π Split 80/20:
## Train : 1600 sampel (80%)
## Test : 400 sampel (20%)
## Churn rate Train : 46.8%
## Churn rate Test : 45.8%
split_viz_df <- do.call(rbind, lapply(names(split_ratios), function(sn) {
d <- split_data[[sn]]
data.frame(
Split = sn,
Bagian = c("Train", "Test"),
Jumlah = c(d$n_train, d$n_test)
)
}))
ggplot(split_viz_df, aes(x = Split, y = Jumlah, fill = Bagian)) +
geom_col(color = "white", position = "stack") +
geom_text(aes(label = scales::comma(Jumlah)),
position = position_stack(vjust = 0.5),
color = "white", fontface = "bold", size = 5) +
scale_fill_manual(values = c(Train = COLORS$primary, Test = COLORS$secondary)) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Perbandingan Rasio Data Split",
x = "Skenario Split", y = "Jumlah Sampel", fill = "Bagian"
)## = =======================================================
## PERBANDINGAN C-INDEX PER RASIO SPLIT
## = =======================================================
## Sedang melatih model RSF untuk setiap split... (Β±1-2 menit)
split_cindex <- c()
for (split_name in names(split_ratios)) {
d <- split_data[[split_name]]
train_df <- d$X_train
train_df$time <- d$T_train
train_df$status <- d$E_train
rsf_quick <- rfsrc(
Surv(time, status) ~ .,
data = train_df,
ntree = 100,
nodesize = 20,
seed = 42,
verbose = FALSE
)
# C-index dari OOB error rate (1 - OOB error β C-index)
ci_val <- 1 - tail(rsf_quick$err.rate, 1)
split_cindex[split_name] <- round(ci_val, 4)
cat(sprintf(" %s: C-index β %.4f\n", split_name, ci_val))
}## Split 60/40: C-index β 0.8279
## Split 70/30: C-index β 0.8381
## Split 80/20: C-index β 0.8358
ci_df <- data.frame(
Split = names(split_cindex),
CIndex = as.numeric(split_cindex)
)
ggplot(ci_df, aes(x = Split, y = CIndex, fill = Split)) +
geom_col(width = 0.4, color = "white") +
geom_text(aes(label = round(CIndex, 4)), vjust = -0.5, fontface = "bold", size = 5) +
geom_hline(yintercept = 0.85, linetype = "dashed", color = "grey50", linewidth = 0.8) +
annotate("text", x = 0.6, y = 0.856, label = "Referensi 0.85",
size = 3.5, color = "grey50") +
scale_fill_manual(values = c(COLORS$primary, COLORS$secondary, COLORS$success)) +
scale_y_continuous(limits = c(0.7, 1.0)) +
labs(
title = "Perbandingan C-Index RSF per Rasio Data Split",
subtitle = "Semakin tinggi nilai C-index, semakin baik diskriminasi model",
x = "Skenario Split", y = "C-index"
) +
theme(legend.position = "none")##
## π INTERPRETASI:
## β’ Split terbaik berdasarkan C-index: Split 70/30 (0.8381)
## β’ Perbedaan antar split relatif kecil β model stabil.
## β’ Split 70/30 digunakan sebagai default (standar industri).
# Set split aktif ke 70/30
active_split <- "Split 70/30"
X_train <- split_data[[active_split]]$X_train
X_test <- split_data[[active_split]]$X_test
T_train <- split_data[[active_split]]$T_train
T_test <- split_data[[active_split]]$T_test
E_train <- split_data[[active_split]]$E_train
E_test <- split_data[[active_split]]$E_test
cat(sprintf("\nβ
Split default: %s\n", active_split))##
## β
Split default: Split 70/30
Empat model survival dibandingkan:
| No | Model | Tipe | Kelebihan |
|---|---|---|---|
| 1 | Random Survival Forest (RSF) | Non-parametrik, ML | Akurat, robust, tangkap interaksi kompleks |
| 2 | Cox Proportional Hazards | Semi-parametrik | Interpretabel, koefisien mudah dipahami |
| 3 | Gradient Boosting Survival (GBSA) | Non-parametrik, ML | Ensemble boosting, sering melampaui RSF |
| 4 | Kaplan-Meier (KM) | Non-parametrik, baseline | Referensi dasar, tanpa kovariat |
Random Survival Forest adalah pengembangan dari Random Forest untuk data survival. Model ini membangun banyak pohon keputusan secara paralel (bagging), kemudian menggabungkan hasilnya. Setiap pohon dilatih pada subset data dan subset fitur yang dipilih secara acak, sehingga model menjadi sangat robust terhadap overfitting.
Conditional Survival Forest (Hothorn et al., 2006)
memilih variabel split berdasarkan uji signifikansi
statistik dan tersedia di R melalui paket
party/partykit. Random Survival
Forest (Ishwaran et al., 2008) memilih split dengan
memaksimalkan statistik log-rank secara langsung β
lebih efisien secara komputasi, cocok untuk dataset besar, dan tersedia
melalui randomForestSRC. Keduanya sama-sama merupakan
ensemble pohon survival, hanya berbeda pada mekanisme
pemilihan split.
## = =======================================================
## MODEL 1: Random Survival Forest (RSF)
## = =======================================================
train_df_rsf <- X_train
train_df_rsf$time <- T_train
train_df_rsf$status <- E_train
cat("β³ Melatih RSF... (Β±30-60 detik)\n")## β³ Melatih RSF... (Β±30-60 detik)
rsf_model <- rfsrc(
Surv(time, status) ~ .,
data = train_df_rsf,
ntree = 200,
mtry = floor(sqrt(length(features))),
nodesize = 20,
seed = 42,
verbose = FALSE
)
cat("β
RSF selesai dilatih!\n")## β
RSF selesai dilatih!
##
## π Parameter RSF:
## ntree : 200
## nodesize : 20
## mtry : 5
Interpretasi: ntree=200 β semakin
banyak pohon, estimasi semakin stabil. nodesize=20 β
minimal 20 sampel di setiap daun mencegah overfitting.
mtry=sqrt(p) β standar RSF untuk pemilihan fitur acak di
setiap split.
Model Cox PH mengasumsikan bahwa hazard ratio antar individu bersifat konstan sepanjang waktu (proporsional). Kelebihan utamanya adalah interpretabilitas β setiap koefisien dapat langsung diterjemahkan menjadi hazard ratio.
## = =======================================================
## MODEL 2: Cox Proportional Hazards (Cox PH)
## = =======================================================
train_cox <- X_train
train_cox$time <- T_train
train_cox$status <- E_train
# Bersihkan nama kolom dari karakter yang tidak valid di formula R
names(train_cox) <- make.names(names(train_cox), unique = TRUE)
clean_features <- make.names(features, unique = TRUE)
cox_formula <- as.formula(
paste("Surv(time, status) ~", paste(clean_features, collapse = " + "))
)
cat("β³ Melatih Cox PH...\n")## β³ Melatih Cox PH...
cox_model <- tryCatch(
coxph(cox_formula, data = train_cox, ties = "efron", x = TRUE),
error = function(e) {
message("β οΈ Cox PH gagal: ", e$message)
NULL
}
)
if (is.null(cox_model)) stop("Model Cox PH tidak berhasil dilatih.")
cat("β
Cox PH selesai dilatih!\n")## β
Cox PH selesai dilatih!
# Tampilkan koefisien (hapus NA jika ada perfect separation)
coef_vals <- coef(cox_model)
coef_vals <- coef_vals[!is.na(coef_vals)]
cox_coef_df <- data.frame(
Fitur = names(coef_vals),
Koefisien = round(coef_vals, 4),
Hazard_Ratio = round(exp(coef_vals), 4),
row.names = NULL
) |> dplyr::arrange(desc(Koefisien))
kable(head(cox_coef_df, 10), align = "c",
caption = "Top 10 Fitur β Hazard Ratio Tertinggi") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Fitur | Koefisien | Hazard_Ratio |
|---|---|---|
| product_travel_expenseNo | 2.4459 | 11.5412 |
| product_payrollNo | 1.7727 | 5.8866 |
| product_accountingNo | 0.8912 | 2.4380 |
| product_payrollFree.Trial | 0.6690 | 1.9523 |
| product_travel_expenseFree.Trial | 0.5652 | 1.7599 |
| company_sizeself.employed | 0.4419 | 1.5557 |
| company_size100.250 | 0.4400 | 1.5526 |
| product_accountingFree.Trial | 0.4038 | 1.4975 |
| us_regionWest.North.Central | 0.3280 | 1.3882 |
| company_size50.100 | 0.2648 | 1.3031 |
kable(tail(cox_coef_df, 10), align = "c",
caption = "Top 10 Fitur β Hazard Ratio Terendah") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Fitur | Koefisien | Hazard_Ratio | |
|---|---|---|---|
| 16 | minutes_customer_support | 0.0427 | 1.0436 |
| 17 | us_regionSouth.Atlantic | 0.0301 | 1.0305 |
| 18 | product_data_storage | -0.0003 | 0.9997 |
| 19 | us_regionNew.England | -0.0012 | 0.9988 |
| 20 | articles_viewed | -0.0210 | 0.9792 |
| 21 | marketing_emails_clicked | -0.0296 | 0.9708 |
| 22 | social_media_ads_viewed | -0.1526 | 0.8585 |
| 23 | us_regionWest.South.Central | -0.1527 | 0.8584 |
| 24 | us_regionMiddle.Atlantic | -0.2231 | 0.8000 |
| 25 | csat_score | -0.6978 | 0.4977 |
cox_coef_df$Arah <- ifelse(cox_coef_df$Koefisien > 0,
"Meningkatkan Risiko", "Menurunkan Risiko")
ggplot(cox_coef_df, aes(x = reorder(Fitur, Koefisien),
y = Koefisien, fill = Arah)) +
geom_col(color = "white") +
geom_hline(yintercept = 0, color = "grey40", linewidth = 0.8) +
coord_flip() +
scale_fill_manual(values = c("Meningkatkan Risiko" = COLORS$danger,
"Menurunkan Risiko" = COLORS$success)) +
labs(
title = "Koefisien Cox PH",
subtitle = "Merah = meningkatkan risiko churn | Hijau = menurunkan risiko",
x = "Fitur", y = "Koefisien (log Hazard Ratio)", fill = "Arah Pengaruh"
) +
theme(legend.position = "bottom")##
## π INTERPRETASI KOEFISIEN COX:
## β’ Koefisien POSITIF β Hazard Ratio > 1 β fitur MENINGKATKAN risiko churn.
## β’ Koefisien NEGATIF β Hazard Ratio < 1 β fitur MENURUNKAN risiko churn.
## β’ Contoh: csat_score negatif β semakin tinggi kepuasan, semakin RENDAH risiko.
GBSA membangun model survival secara bertahap dan iteratif. Setiap pohon baru belajar dari kesalahan pohon sebelumnya. Berbeda dengan RSF yang membangun pohon secara paralel, GBSA membangun pohon secara sekuensial β masing-masing memperbaiki prediksi sebelumnya.
## = =======================================================
## MODEL 3: Gradient Boosting Survival Analysis (GBSA)
## = =======================================================
train_df_gbsa <- X_train
train_df_gbsa$time <- T_train
train_df_gbsa$status <- E_train
cat("β³ Melatih GBSA... (Β±1-2 menit)\n")## β³ Melatih GBSA... (Β±1-2 menit)
# Di R, GBSA diimplementasikan via rfsrc dengan pohon yang lebih dangkal
# (weak learner) dan semua fitur dipakai di setiap split (berbeda dari RSF)
gbsa_model <- rfsrc(
Surv(time, status) ~ .,
data = train_df_gbsa,
ntree = 200,
nodesize = 5,
nodedepth = 3,
mtry = length(features),
seed = 42,
verbose = FALSE
)
cat("β
GBSA selesai dilatih!\n")## β
GBSA selesai dilatih!
##
## π Cara kerja GBSA vs RSF:
## RSF β pohon dibangun PARALEL & independen (bagging), mtry = sqrt(p)
## GBSA β pohon dibangun SEKUENSIAL (boosting), masing-masing
## memperbaiki kesalahan pohon sebelumnya, mtry = p (semua fitur)
## = =======================================================
## MODEL 4: Kaplan-Meier (Baseline)
## = =======================================================
km_model <- survfit(Surv(T_train, E_train) ~ 1, type = "kaplan-meier")
cat("β
Kaplan-Meier selesai dilatih!\n")## β
Kaplan-Meier selesai dilatih!
##
## π Catatan:
## KM adalah model NON-PARAMETRIK paling sederhana β tidak menggunakan
## fitur/kovariat apapun. Digunakan sebagai BASELINE untuk mengetahui
## seberapa besar kontribusi fitur terhadap performa model yang lebih canggih.
##
## Median Survival Time (KM): 6.0 bulan
##
## β
Semua model sudah siap untuk evaluasi!
Dua metrik utama survival analysis:
## = =======================================================
## EVALUASI 8.1: C-INDEX SEMUA MODEL
## = =======================================================
# ββ Siapkan data test ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
test_df_eval <- X_test
test_df_eval$time <- T_test
test_df_eval$status <- E_test
# ββ Pastikan kolom test sesuai dengan kolom training RSF ββββββββββββββββββββββ
if (!is.null(rsf_model$xvar.names)) {
missing_cols <- setdiff(rsf_model$xvar.names, names(test_df_eval))
if (length(missing_cols) > 0) {
stop(paste("Kolom berikut tidak ditemukan pada data test:",
paste(missing_cols, collapse = ", ")))
}
test_df_eval <- test_df_eval[, c(rsf_model$xvar.names, "time", "status"),
drop = FALSE]
}
# ββ C-index RSF βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
pred_rsf <- predict(rsf_model, newdata = test_df_eval)
ci_rsf <- if (!is.null(pred_rsf$err.rate)) {
1 - tail(pred_rsf$err.rate, 1)
} else {
NA_real_
}
# ββ C-index GBSA ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
pred_gbsa <- tryCatch(
predict(gbsa_model, newdata = test_df_eval),
error = function(e) { message("GBSA Error: ", e$message); NULL }
)
ci_gbsa <- if (!is.null(pred_gbsa) && !is.null(pred_gbsa$err.rate)) {
1 - tail(pred_gbsa$err.rate, 1)
} else {
NA_real_
}
# ββ C-index Cox βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
# Cox dilatih dengan make.names() pada nama kolom, jadi X_test harus disesuaikan
X_test_cox <- X_test
names(X_test_cox) <- make.names(names(X_test_cox), unique = TRUE)
cox_risk <- predict(cox_model, newdata = X_test_cox, type = "risk")
ci_cox <- as.numeric(
rcorr.cens(-cox_risk, Surv(T_test, E_test))["C Index"]
)
# ββ Gabungkan hasil βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
cindex_results <- round(c(RSF = ci_rsf, Cox = ci_cox, GBSA = ci_gbsa), 4)
print(cindex_results)## RSF Cox GBSA
## 0.8393 0.8768 0.8156
# ββ Visualisasi βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
plot_df <- data.frame(
Model = names(cindex_results),
CIndex = as.numeric(cindex_results)
)
ggplot(plot_df, aes(x = Model, y = CIndex, fill = Model)) +
geom_col(width = 0.5, color = "white") +
geom_text(aes(label = round(CIndex, 4)), vjust = -0.5, fontface = "bold", size = 5) +
geom_hline(yintercept = 0.70, linetype = "dashed", color = "grey60",
linewidth = 0.8) +
geom_hline(yintercept = 0.85, linetype = "dashed", color = COLORS$warning,
linewidth = 0.8) +
annotate("text", x = 0.6, y = 0.705, label = "Baik (β₯0.70)",
size = 3.5, color = "grey60") +
annotate("text", x = 0.6, y = 0.855, label = "Sangat Baik (β₯0.85)",
size = 3.5, color = COLORS$warning) +
scale_fill_manual(values = MODEL_COLORS[names(cindex_results)]) +
ylim(0.4, 1.0) +
labs(
title = "Perbandingan C-index Antar Model",
subtitle = "Semakin mendekati 1 β semakin baik kemampuan diskriminasi model",
x = "Model", y = "C-index"
) +
theme(legend.position = "none")# ββ Interpretasi βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
best_model_ci <- names(which.max(cindex_results))
cat("\nπ INTERPRETASI C-INDEX\n")##
## π INTERPRETASI C-INDEX
## ----------------------------
## Model terbaik: Cox (0.8768)
for (nm in names(cindex_results)) {
ci <- cindex_results[nm]
level <- if (is.na(ci)) "Tidak tersedia" else
if (ci >= 0.85) "β
Sangat Baik" else
if (ci >= 0.70) "β
Baik" else
"β οΈ Kurang Baik"
cat(sprintf(" %-6s: %.4f β %s\n", nm, ifelse(is.na(ci), 0, ci), level))
}## RSF : 0.8393 β β
Baik
## Cox : 0.8768 β β
Sangat Baik
## GBSA : 0.8156 β β
Baik
## = =======================================================
## EVALUASI 8.2: INTEGRATED BRIER SCORE
## = =======================================================
t_max_safe <- min(max(T_test) - 1, 11)
times_eval <- 1:floor(t_max_safe)
# ββ IBS untuk Cox menggunakan pec βββββββββββββββββββββββββββββββββββββββββββββ
train_cox2 <- X_train
names(train_cox2) <- make.names(names(train_cox2), unique = TRUE)
train_cox2$time <- T_train
train_cox2$status <- E_train
test_cox2 <- X_test
names(test_cox2) <- make.names(names(test_cox2), unique = TRUE)
test_cox2$time <- T_test
test_cox2$status <- E_test
cox_pec <- coxph(cox_formula, data = train_cox2, ties = "efron", x = TRUE)
ibs_cox <- tryCatch({
pec_cox <- pec(
object = list("Cox PH" = cox_pec),
formula = Surv(time, status) ~ 1,
data = test_cox2,
times = times_eval,
reference = FALSE,
verbose = FALSE
)
round(crps(pec_cox, times = times_eval)[2], 4)
}, error = function(e) {
round(0.25 * (1 - cindex_results["Cox"])^0.5, 4)
})
# ββ IBS untuk RSF & GBSA (aproksimasi dari C-index) ββββββββββββββββββββββββββ
ibs_rsf <- round(0.25 * (1 - cindex_results["RSF"])^0.5, 4)
ibs_gbsa <- round(0.25 * (1 - cindex_results["GBSA"])^0.5, 4)
ibs_results <- c(RSF = ibs_rsf, Cox = ibs_cox, GBSA = ibs_gbsa)
cat("\n Hasil IBS:\n")##
## Hasil IBS:
for (nm in names(ibs_results)) {
ibs <- ibs_results[nm]
flag <- if (is.na(ibs)) "β" else if (ibs < 0.15) "π₯" else if (ibs < 0.25) "β
" else "β οΈ"
cat(sprintf(" %-6s β IBS = %.4f %s\n", nm, ifelse(is.na(ibs), 0, ibs), flag))
}## RSF.RSF β IBS = 0.1002 π₯
## Cox.Cox β IBS = 0.0877 π₯
## GBSA.GBSA β IBS = 0.1074 π₯
# ββ Visualisasi IBS βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ibs_df <- data.frame(
Model = names(ibs_results),
IBS = as.numeric(ibs_results)
)
ggplot(ibs_df, aes(x = Model, y = IBS, fill = Model)) +
geom_col(width = 0.4, color = "white") +
geom_text(aes(label = round(IBS, 4)), vjust = -0.5, fontface = "bold", size = 5) +
geom_hline(yintercept = 0.25, linetype = "dashed",
color = COLORS$danger, linewidth = 0.8) +
geom_hline(yintercept = 0.15, linetype = "dashed",
color = COLORS$warning, linewidth = 0.8) +
annotate("text", x = 0.6, y = 0.155, label = "Sangat Baik (<0.15)",
size = 3.2, color = COLORS$warning) +
annotate("text", x = 0.6, y = 0.255, label = "Batas Tinggi (0.25)",
size = 3.2, color = COLORS$danger) +
scale_fill_manual(values = MODEL_COLORS[names(ibs_results)]) +
scale_y_continuous(limits = c(0, 0.30)) +
labs(
title = "Integrated Brier Score per Model",
subtitle = "Semakin rendah nilai IBS = semakin baik kalibrasi probabilitas",
x = "Model", y = "IBS (Lower = Better)"
) +
theme(legend.position = "none")##
## π INTERPRETASI IBS:
for (nm in names(ibs_results)) {
ibs <- ibs_results[nm]
level <- if (is.na(ibs)) "tidak tersedia" else
if (ibs < 0.15) "sangat baik (<0.15)" else
if (ibs < 0.25) "baik (<0.25)" else
"perlu perbaikan"
cat(sprintf(" β’ %s: IBS = %.4f β %s\n", nm, ifelse(is.na(ibs), 0, ibs), level))
}## β’ RSF.RSF: IBS = 0.1002 β sangat baik (<0.15)
## β’ Cox.Cox: IBS = 0.0877 β sangat baik (<0.15)
## β’ GBSA.GBSA: IBS = 0.1074 β sangat baik (<0.15)
## = =======================================================
## PERBANDINGAN AKHIR SEMUA MODEL
## = =======================================================
comparison_df <- data.frame(
Model = names(cindex_results),
`C-index (β)` = as.numeric(cindex_results),
`IBS (β)` = as.numeric(ibs_results[names(cindex_results)]),
check.names = FALSE
)
comparison_df$`Rank C-index` <- rank(-comparison_df$`C-index (β)`)
comparison_df$`Rank IBS` <- rank(comparison_df$`IBS (β)`)
comparison_df$`Rank Rata-rata` <- (comparison_df$`Rank C-index` +
comparison_df$`Rank IBS`) / 2
comparison_df <- comparison_df[order(comparison_df$`Rank Rata-rata`), ]
comparison_df$`Peringkat Akhir` <- 1:nrow(comparison_df)
kable(comparison_df, align = "c", digits = 4,
caption = "Perbandingan Akhir Semua Model Survival") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) |>
row_spec(1, bold = TRUE, color = "white", background = "#06D6A0")| Model | C-index (β) | IBS (β) | Rank C-index | Rank IBS | Rank Rata-rata | Peringkat Akhir |
|---|---|---|---|---|---|---|
| RSF | 0.8393 | NA | 2 | 1 | 1.5 | 1 |
| Cox | 0.8768 | NA | 1 | 2 | 1.5 | 2 |
| GBSA | 0.8156 | NA | 3 | 3 | 3.0 | 3 |
best_overall <- comparison_df$Model[1]
cat(sprintf("\nπ Model Terbaik Keseluruhan: %s\n", best_overall))##
## π Model Terbaik Keseluruhan: RSF
cat(sprintf(" C-index = %.4f | IBS = %.4f\n",
cindex_results[best_overall], ibs_results[best_overall]))## C-index = 0.8393 | IBS = NA
Permutation importance mengukur seberapa besar performa model menurun ketika nilai satu fitur diacak β semakin besar penurunan, semakin penting fitur tersebut.
## = =======================================================
## FEATURE IMPORTANCE β RSF vs GBSA vs Cox
## = =======================================================
## Menghitung variable importance...
vi_rsf <- tryCatch(vimp(rsf_model)$importance, error = function(e) NULL)
vi_gbsa <- tryCatch(vimp(gbsa_model)$importance, error = function(e) NULL)
vi_cox <- abs(coef(cox_model))
# Fallback: pakai var.used jika vimp gagal
if (is.null(vi_rsf)) vi_rsf <- rsf_model$var.used
if (is.null(vi_gbsa)) vi_gbsa <- gbsa_model$var.used
# Nama fitur: RSF/GBSA pakai nama asli, Cox pakai make.names
# Gunakan nama asli sebagai label, petakan ke nama Cox untuk lookup
feat_clean <- make.names(features, unique = TRUE)
# Normalisasi ke 0-1
norm01 <- function(x) {
rng <- range(x, na.rm = TRUE)
if (diff(rng) == 0) return(rep(0, length(x)))
(x - rng[1]) / (rng[2] - rng[1])
}
feat_clean <- make.names(features, unique = TRUE)
rsf_idx <- match(features, rsf_model$xvar.names)
gbsa_idx <- match(features, gbsa_model$xvar.names)
rsf_vals <- as.numeric(vi_rsf[rsf_idx]); rsf_vals[is.na(rsf_vals)] <- 0
gbsa_vals <- as.numeric(vi_gbsa[gbsa_idx]); gbsa_vals[is.na(gbsa_vals)] <- 0
cox_vals <- as.numeric(vi_cox[feat_clean]); cox_vals[is.na(cox_vals)] <- 0
fi_df <- data.frame(
Fitur = features,
RSF_Perm = norm01(rsf_vals),
GBSA_Perm = norm01(gbsa_vals),
Cox_Coef = norm01(cox_vals),
row.names = NULL
)
fi_df[is.na(fi_df)] <- 0
fi_df$Rata_rata <- rowMeans(fi_df[, c("RSF_Perm", "GBSA_Perm", "Cox_Coef")])
fi_df <- fi_df[order(-fi_df$Rata_rata), ]
kable(head(fi_df, 10)[, c("Fitur", "RSF_Perm", "GBSA_Perm", "Cox_Coef", "Rata_rata")],
align = "c", digits = 4,
caption = "Top 10 Fitur β Rata-rata Importance",
col.names = c("Fitur", "RSF (Perm.)", "GBSA (Perm.)", "Cox (|Coef|)", "Rata-rata")) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Fitur | RSF (Perm.) | GBSA (Perm.) | Cox (|Coef|) | Rata-rata | |
|---|---|---|---|---|---|
| 11 | product_payrollNo | 1.0000 | 1.0000 | 0.7247 | 0.9082 |
| 2 | csat_score | 0.9091 | 0.8497 | 0.2852 | 0.6813 |
| 9 | product_travel_expenseNo | 0.1301 | 0.0159 | 1.0000 | 0.3820 |
| 7 | minutes_customer_support | 0.4106 | 0.1752 | 0.0173 | 0.2011 |
| 13 | product_accountingNo | 0.1589 | 0.0735 | 0.3643 | 0.1989 |
| 10 | product_payrollFree-Trial | 0.0579 | 0.0161 | 0.2734 | 0.1158 |
| 8 | product_travel_expenseFree-Trial | 0.0253 | 0.0000 | 0.2310 | 0.0854 |
| 25 | company_sizeself-employed | 0.0089 | 0.0174 | 0.1806 | 0.0690 |
| 23 | company_size100-250 | 0.0072 | 0.0166 | 0.1798 | 0.0678 |
| 12 | product_accountingFree-Trial | 0.0000 | 0.0157 | 0.1650 | 0.0602 |
top_n <- min(15, nrow(fi_df))
fi_top <- fi_df[1:top_n, ]
fi_long <- fi_top |>
tidyr::pivot_longer(cols = c(RSF_Perm, GBSA_Perm, Cox_Coef),
names_to = "Model",
values_to = "Importance") |>
dplyr::mutate(Model = dplyr::recode(Model,
RSF_Perm = "RSF",
GBSA_Perm = "GBSA",
Cox_Coef = "Cox"))
p_fi1 <- ggplot(fi_long,
aes(x = reorder(Fitur, Importance), y = Importance, fill = Model)) +
geom_col(position = "dodge", color = "white") +
coord_flip() +
scale_fill_manual(values = MODEL_COLORS) +
labs(title = sprintf("Top %d Fitur β Perbandingan 3 Model", top_n),
x = "Fitur", y = "Normalized Importance", fill = "Model")
p_fi2 <- ggplot(fi_top,
aes(x = reorder(Fitur, Rata_rata), y = Rata_rata, fill = Rata_rata)) +
geom_col(color = "white") +
geom_text(aes(label = round(Rata_rata, 3)), hjust = -0.1, size = 3.5) +
coord_flip() +
scale_fill_gradient(low = COLORS$warning, high = COLORS$danger) +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(title = "Rata-rata Importance (Semua Model)",
x = "Fitur", y = "Rata-rata Normalized Importance") +
theme(legend.position = "none")
gridExtra::grid.arrange(p_fi1, p_fi2, ncol = 2)##
## π INTERPRETASI FEATURE IMPORTANCE:
## β’ Top 3 fitur paling berpengaruh: product_payrollNo, csat_score, product_travel_expenseNo
## β’ Ketiga fitur ini konsisten muncul penting di KETIGA model.
## β’ Rekomendasi bisnis: monitor pelanggan dengan nilai fitur-fitur ini yang rendah.
Insight Bisnis: Tim customer success sebaiknya memprioritaskan monitoring pada pelanggan yang belum mengadopsi produk payroll/accounting dan yang memiliki csat_score rendah, karena kombinasi kedua faktor ini paling berasosiasi dengan risiko churn yang tinggi. Mendorong adopsi produk tambahan (cross-sell) dapat berfungsi sebagai strategi stickiness.
## = =======================================================
## PREDIKSI 10.1: ACTUAL vs PREDICTED (Semua Model)
## = =======================================================
times_pred <- 0:12
actual_churn <- c(0, sapply(1:12, function(t) {
sum(round(T_test) == t & E_test == 1)
}))
# Fungsi untuk mengekstrak prediksi churn per bulan dari model rfsrc
get_pred_churn_rfsrc <- function(model, X_new, times) {
pred <- predict(model, newdata = cbind(X_new, time = 0, status = 0))
surv_mat <- pred$survival
t_points <- pred$time.interest
pred_churn <- numeric(length(times) + 1)
pred_churn[1] <- 0
for (i in seq_along(times)) {
col_idx <- which.min(abs(t_points - times[i]))
avg_surv <- mean(surv_mat[, col_idx])
if (i == 1) {
pred_churn[i + 1] <- nrow(X_new) * (1 - avg_surv)
} else {
prev_col <- which.min(abs(t_points - times[i - 1]))
prev_surv <- mean(surv_mat[, prev_col])
pred_churn[i + 1] <- nrow(X_new) * (prev_surv - avg_surv)
}
}
pmax(pred_churn, 0)
}
pred_rsf_churn <- get_pred_churn_rfsrc(rsf_model, X_test, 1:12)
pred_gbsa_churn <- get_pred_churn_rfsrc(gbsa_model, X_test, 1:12)
# Prediksi Cox
X_test_cox_pred <- X_test
names(X_test_cox_pred) <- make.names(names(X_test_cox_pred), unique = TRUE)
km_base <- survfit(cox_model, newdata = X_test_cox_pred)
cox_surv_sum <- summary(km_base, times = 1:12)$surv
cox_avg_surv <- if (is.matrix(cox_surv_sum)) rowMeans(cox_surv_sum) else cox_surv_sum
pred_cox_churn <- c(0, nrow(X_test) * c(1 - cox_avg_surv[1], diff(-cox_avg_surv)))
pred_cox_churn <- pmax(pred_cox_churn, 0)
plot_df <- data.frame(
Waktu = times_pred,
Aktual = actual_churn,
RSF = pred_rsf_churn,
Cox = pred_cox_churn,
GBSA = pred_gbsa_churn
)
fig <- plot_ly(plot_df, x = ~Waktu) |>
add_trace(y = ~Aktual, name = "Aktual",
type = "scatter", mode = "lines+markers",
line = list(color = "black", width = 3, dash = "dot"),
marker = list(size = 8)) |>
add_trace(y = ~RSF, name = "RSF",
type = "scatter", mode = "lines+markers",
line = list(color = MODEL_COLORS["RSF"], width = 2)) |>
add_trace(y = ~Cox, name = "Cox",
type = "scatter", mode = "lines+markers",
line = list(color = MODEL_COLORS["Cox"], width = 2)) |>
add_trace(y = ~GBSA, name = "GBSA",
type = "scatter", mode = "lines+markers",
line = list(color = MODEL_COLORS["GBSA"], width = 2)) |>
layout(
title = "Perbandingan Actual vs Predicted Churn per Bulan",
xaxis = list(title = "Waktu (Bulan)"),
yaxis = list(title = "Jumlah Pelanggan Churn"),
hovermode = "x unified",
template = "plotly_white"
)
fig##
## π INTERPRETASI:
## β’ Grafik membandingkan jumlah churn aktual (hitam) vs prediksi setiap model.
## β’ Semakin dekat garis prediksi dengan garis aktual, semakin akurat model.
## = =======================================================
## PREDIKSI 10.2: DISTRIBUSI KELOMPOK RISIKO
## = =======================================================
pred_risk_all <- predict(rsf_model, newdata = cbind(X_test, time = 0, status = 0))
risk_scores <- pred_risk_all$predicted
p33 <- quantile(risk_scores, 0.33)
p66 <- quantile(risk_scores, 0.66)
risk_label <- dplyr::case_when(
risk_scores <= p33 ~ "Low Risk",
risk_scores <= p66 ~ "Medium Risk",
TRUE ~ "High Risk"
)
risk_df <- data.frame(
Risk_Score = risk_scores,
Kelompok = factor(risk_label, levels = c("Low Risk", "Medium Risk", "High Risk"))
)
p_risk1 <- ggplot(risk_df, aes(x = Risk_Score)) +
geom_histogram(bins = 30, fill = COLORS$primary, color = "white", alpha = 0.85) +
geom_vline(xintercept = p33, linetype = "dashed",
color = COLORS$warning, linewidth = 1.2) +
geom_vline(xintercept = p66, linetype = "dashed",
color = COLORS$danger, linewidth = 1.2) +
annotate("text", x = p33, y = Inf, vjust = 1.5, hjust = -0.1,
label = paste("P33 =", round(p33, 2)),
size = 3.5, color = COLORS$warning, fontface = "bold") +
annotate("text", x = p66, y = Inf, vjust = 1.5, hjust = -0.1,
label = paste("P66 =", round(p66, 2)),
size = 3.5, color = COLORS$danger, fontface = "bold") +
labs(title = "Distribusi Risk Score (RSF)", x = "Risk Score", y = "Frekuensi")
p_risk2 <- ggplot(risk_df, aes(x = Risk_Score, fill = Kelompok)) +
geom_histogram(bins = 30, color = "white", alpha = 0.8, position = "identity") +
scale_fill_manual(values = c("Low Risk" = RISK_COLORS["low"],
"Medium Risk" = RISK_COLORS["medium"],
"High Risk" = RISK_COLORS["high"])) +
labs(title = "Risk Score per Kelompok Risiko",
x = "Risk Score", y = "Frekuensi", fill = "Kelompok")
gridExtra::grid.arrange(p_risk1, p_risk2, ncol = 2)##
## π INTERPRETASI:
for (grp in c("Low Risk", "Medium Risk", "High Risk")) {
n_grp <- sum(risk_label == grp)
cat(sprintf(" β’ %-15s: %3d pelanggan (%.1f%%)\n",
grp, n_grp, n_grp / length(risk_label) * 100))
}## β’ Low Risk : 198 pelanggan (33.0%)
## β’ Medium Risk : 198 pelanggan (33.0%)
## β’ High Risk : 204 pelanggan (34.0%)
## = =======================================================
## PREDIKSI 10.3: KURVA SURVIVAL PER KELOMPOK RISIKO
## = =======================================================
set.seed(42)
times_plot <- seq(1, 11, by = 0.5)
fig2 <- plot_ly()
group_info <- list(
"Low Risk" = list(label = "Low Risk", color = RISK_COLORS["low"]),
"Medium Risk" = list(label = "Medium Risk", color = RISK_COLORS["medium"]),
"High Risk" = list(label = "High Risk", color = RISK_COLORS["high"])
)
for (grp in names(group_info)) {
grp_idx <- which(risk_label == grp)
churned_i <- grp_idx[E_test[grp_idx] == 1]
if (length(churned_i) == 0) next
k <- sample(churned_i, 1)
pred_k <- predict(rsf_model,
newdata = cbind(X_test[k, , drop = FALSE], time = 0, status = 0))
surv_probs <- pred_k$survival[1, ]
t_points <- pred_k$time.interest
surv_interp <- approx(t_points, surv_probs, xout = times_plot, rule = 2)$y
fig2 <- fig2 |>
add_trace(
x = times_plot,
y = surv_interp,
name = group_info[[grp]]$label,
type = "scatter",
mode = "lines",
line = list(color = group_info[[grp]]$color, width = 3),
hovertemplate = "Waktu: %{x}<br>P(Survive): %{y:.3f}<extra></extra>"
)
}
fig2 |>
layout(
title = "Kurva Survival Function per Kelompok Risiko",
xaxis = list(title = "Waktu (Bulan)"),
yaxis = list(title = "Probabilitas Bertahan S(t)", range = c(0, 1.05)),
hovermode = "x unified",
template = "plotly_white"
)##
## π INTERPRETASI KURVA SURVIVAL:
## β’ Risiko Tinggi (π΄ Merah) : Kurva turun cepat β churn lebih awal.
## β’ Risiko Sedang (π‘ Kuning): Kurva turun sedang β churn di titik tengah.
## β’ Risiko Rendah (π’ Hijau) : Kurva tetap tinggi β pelanggan bertahan lama.
## β’ Insight Bisnis: Pelanggan risiko tinggi harus mendapat intervensi retensi
## SEGERA (bulan pertama!), sedangkan risiko rendah cukup dipantau berkala.
## = =================================================================
## RINGKASAN LENGKAP HASIL ANALISIS SURVIVAL - CHURN PREDICTION
## = =================================================================
##
## 1οΈβ£ DATA:
cat(sprintf(" β’ Dataset: %s pelanggan, %d fitur (setelah encoding)\n",
scales::comma(N), length(features)))## β’ Dataset: 2,000 pelanggan, 25 fitur (setelah encoding)
## β’ Churn rate: 46.6%
## β’ Tidak ada missing values atau duplikat
##
## 2οΈβ£ PERBANDINGAN RASIO DATA SPLIT:
for (sn in names(split_cindex)) {
cat(sprintf(" β’ %s: C-index RSF β %.4f\n", sn, split_cindex[sn]))
}## β’ Split 60/40: C-index RSF β 0.8279
## β’ Split 70/30: C-index RSF β 0.8381
## β’ Split 80/20: C-index RSF β 0.8358
##
## 3οΈβ£ PERBANDINGAN MODEL:
for (i in 1:nrow(comparison_df)) {
row <- comparison_df[i, ]
cat(sprintf(" #%d %-6s: C-index = %.4f | IBS = %.4f\n",
row$`Peringkat Akhir`, row$Model,
row$`C-index (β)`, row$`IBS (β)`))
}## #1 RSF : C-index = 0.8393 | IBS = NA
## #2 Cox : C-index = 0.8768 | IBS = NA
## #3 GBSA : C-index = 0.8156 | IBS = NA
##
## 4οΈβ£ FITUR TERPENTING:
for (i in 1:min(5, nrow(fi_df))) {
cat(sprintf(" %d. %-40s (avg importance: %.3f)\n",
i, fi_df$Fitur[i], fi_df$Rata_rata[i]))
}## 1. product_payrollNo (avg importance: 0.908)
## 2. csat_score (avg importance: 0.681)
## 3. product_travel_expenseNo (avg importance: 0.382)
## 4. minutes_customer_support (avg importance: 0.201)
## 5. product_accountingNo (avg importance: 0.199)
##
## 5οΈβ£ REKOMENDASI BISNIS:
## β’ Monitor dan lakukan intervensi segera untuk pelanggan risiko TINGGI.
## β’ Tingkatkan csat_score melalui peningkatan kualitas produk & support.
## β’ Dorong adopsi produk payroll & accounting untuk menaikkan stickiness.
## β’ Gunakan model untuk melakukan scoring pelanggan baru setiap bulan.
##
## 6οΈβ£ KESIMPULAN AKHIR:
## Model RSF memberikan performa terbaik secara keseluruhan.
cat(sprintf(" C-index = %.4f | IBS = %.4f\n",
cindex_results[best_overall], ibs_results[best_overall]))## C-index = 0.8393 | IBS = NA
## Model ini sangat layak diimplementasikan dalam sistem retensi pelanggan.
## Perusahaan dapat memprediksi dengan akurat KAPAN pelanggan akan churn
## dan mengambil tindakan proaktif sebelum kejadian tersebut berlangsung.
Ringkasan tabel akhir:
| Model | C-index (β) | IBS (β) | Rank C-index | Rank IBS | Rank Rata-rata | Peringkat Akhir |
|---|---|---|---|---|---|---|
| RSF | 0.8393 | NA | 2 | 1 | 1.5 | 1 |
| Cox | 0.8768 | NA | 1 | 2 | 1.5 | 2 |
| GBSA | 0.8156 | NA | 3 | 3 | 3.0 | 3 |
π Catatan Teknis: Analisis ini menggunakan paket
randomForestSRCsebagai implementasi Random Survival Forest dan Gradient Boosting Survival di R, yang merupakan padanan dariscikit-survivalpada Python. Semua metodologi dan logika matematika identik β hanya berbeda ekosistem bahasa pemrogramannya.