πŸ“š Mata Kuliah: Analisis Survival
πŸ‘¨β€πŸ« Dosen Pengampu: Bakti Siregar, M.Sc., CDS
πŸ‘₯ Kelompok: Churn Prediction




1. Pendahuluan (Introduction)

Churn pelanggan (customer attrition) β€” persentase pelanggan yang berhenti menggunakan produk atau layanan β€” adalah salah satu metrik terpenting bagi bisnis. Umumnya, biaya untuk mendapatkan pelanggan baru jauh lebih besar daripada biaya mempertahankan pelanggan yang sudah ada.

Menurut studi Bain & Company, peningkatan retensi pelanggan sebesar 5% dapat menghasilkan peningkatan laba lebih dari 25% (khususnya di layanan keuangan).

Mengapa Analisis Survival?

Analisis klasifikasi biasa hanya menjawab β€œApakah pelanggan akan churn?”, sedangkan Analisis Survival menjawab pertanyaan yang jauh lebih bernilai bisnis:

β€œKapan pelanggan akan berhenti berlangganan?”

Dengan demikian, perusahaan dapat mengambil tindakan retensi yang tepat sebelum churn terjadi.

1.1 Konsep Dasar Survival Analysis

Tiga konsep fundamental yang dipakai sepanjang laporan ini:

  • Waktu survival (T): lama waktu dari titik awal observasi (mulai berlangganan) sampai terjadinya event atau sampai observasi berakhir. Pada dataset ini direpresentasikan oleh kolom months_active.
  • Event indicator (Ξ΄): status apakah event (churn) benar-benar teramati (Ξ΄ = 1) atau belum teramati (Ξ΄ = 0). Direpresentasikan oleh kolom churned.
  • Censoring (penyensoran): kondisi ketika waktu event sebenarnya tidak diketahui karena observasi berhenti lebih dulu. Pelanggan yang masih aktif di akhir periode observasi (churned = 0) disebut right-censoring.

Dari T dan Ξ΄, dua fungsi utama yang diestimasi:

  • Survival function S(t) = P(T > t): probabilitas pelanggan masih bertahan (belum churn) sampai waktu t.
  • Hazard function h(t): risiko churn sesaat pada waktu t, dengan kondisi pelanggan masih bertahan sampai sebelum t.

2. Pengaturan (Setup)

2.1 Skenario Bisnis

Sebuah perusahaan SaaS (Software as a Service) menyediakan produk untuk UKM:

  • Penyimpanan Data Cloud
  • Akuntansi
  • Manajemen Perjalanan & Pengeluaran
  • Manajemen Penggajian (Payroll)

Tujuan: Membantu CFO memprakirakan biaya akuisisi & pemasaran dengan membangun model churn yang memprediksi kapan pelanggan kemungkinan besar akan menghentikan langganan bulanan mereka.

2.2 Instalasi & Pemuatan Paket

Seluruh paket R yang dibutuhkan beserta penjelasan fungsinya masing-masing:

# ── FUNGSI PEMBANTU: Instalasi otomatis jika paket belum tersedia ─────────────
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = "https://cloud.r-project.org", quiet = TRUE)
  }
}

# ── DAFTAR PAKET YANG DIPERLUKAN ──────────────────────────────────────────────
packages_needed <- c(
  # Tampilan laporan
  "rmdformats",      # Tema HTML R Markdown (readthedown, material, dsb.)
  # Survival analysis β€” inti
  "survival",        # Fungsi dasar: Surv(), survfit(), coxph()
  "survminer",       # Visualisasi kurva KM yang cantik (berbasis ggplot2)
  "randomForestSRC", # Random Survival Forest & GBSA proxy di R
  # Evaluasi model
  "pec",             # Prediction Error Curves β€” Brier Score untuk model survival
  "Hmisc",           # Fungsi statistik tambahan: rcorr.cens() untuk C-index Cox
  # Visualisasi
  "ggplot2",         # Visualisasi data berbasis grammar of graphics
  "corrplot",        # Heatmap matriks korelasi
  "RColorBrewer",    # Palet warna profesional
  "gridExtra",       # Menyusun beberapa panel plot ggplot
  "plotly",          # Grafik interaktif berbasis web (HTML widget)
  "scales",          # Formatting label sumbu (persen, ribuan, dsb.)
  # Manipulasi data
  "dplyr",           # Manipulasi data: filter, mutate, group_by, summarise
  "tidyr",           # Reshape data: pivot_longer, pivot_wider
  # Laporan
  "knitr",           # Kompilasi dokumen R Markdown
  "kableExtra"       # Tabel HTML yang lebih cantik dan fleksibel
)

# ── INSTALASI OTOMATIS ────────────────────────────────────────────────────────
invisible(lapply(packages_needed, install_if_missing))
cat("βœ… Pengecekan & instalasi paket selesai!\n")
## βœ… Pengecekan & instalasi paket selesai!
# ── LOAD SEMUA PAKET ──────────────────────────────────────────────────────────
library(survival)        # Fungsi inti survival analysis
library(survminer)       # Visualisasi Kaplan-Meier
library(randomForestSRC) # Random Survival Forest & GBSA
library(pec)             # Brier Score
library(Hmisc)           # C-index untuk Cox
library(ggplot2)         # Visualisasi data
library(corrplot)        # Matriks korelasi
library(RColorBrewer)    # Palet warna
library(gridExtra)       # Multi-panel plot
library(plotly)          # Grafik interaktif
library(scales)          # Formatting label
library(dplyr)           # Manipulasi data
library(tidyr)           # Reshape data
library(knitr)           # Tabel dokumen
library(kableExtra)      # Tabel HTML cantik

# ── KONFIGURASI WARNA GLOBAL ──────────────────────────────────────────────────
COLORS <- list(
  primary   = "#4361EE",
  secondary = "#F72585",
  success   = "#06D6A0",
  warning   = "#FFB703",
  danger    = "#EF476F",
  info      = "#118AB2",
  dark      = "#2B2D42",
  light     = "#EDF2F4"
)

MODEL_COLORS <- c(
  RSF  = "#4361EE",
  Cox  = "#F72585",
  GBSA = "#06D6A0",
  KM   = "#FFB703"
)

RISK_COLORS <- c(
  low    = "#06D6A0",
  medium = "#FFB703",
  high   = "#EF476F"
)

# ── TEMA GGPLOT2 GLOBAL ───────────────────────────────────────────────────────
theme_set(
  theme_minimal(base_size = 12) +
  theme(
    plot.background   = element_rect(fill = "white", color = NA),
    panel.background  = element_rect(fill = "#F8F9FA", color = NA),
    panel.grid.minor  = element_blank(),
    panel.grid.major  = element_line(color = "grey85", linetype = "dashed"),
    plot.title        = element_text(face = "bold", size = 14),
    plot.subtitle     = element_text(size = 11, color = "grey40"),
    legend.background = element_rect(fill = "white", color = NA)
  )
)

cat("βœ… Semua paket berhasil dimuat!\n")
## βœ… Semua paket berhasil dimuat!
cat(sprintf("   R version : %s\n", R.version.string))
##    R version : R version 4.5.2 (2025-10-31 ucrt)
cat(sprintf("   ggplot2   : %s\n", packageVersion("ggplot2")))
##    ggplot2   : 4.0.1
cat(sprintf("   survival  : %s\n", packageVersion("survival")))
##    survival  : 3.8.6
cat(sprintf("   randomForestSRC : %s\n", packageVersion("randomForestSRC")))
##    randomForestSRC : 3.6.2

3. Kumpulan Data (Dataset)

3.1 Deskripsi & Gambaran Umum

Dataset berisi 2.000 pelanggan dengan 14 fitur yang terbagi dalam 5 kategori:

Kategori Fitur Tipe Deskripsi
⏱️ Waktu months_active Numerik Durasi berlangganan (bulan) β€” variabel T
🎯 Kejadian churned Biner 1 = churn, 0 = masih aktif (tersensor) β€” variabel Ξ΄
πŸ“¦ Produk product_data_storage Numerik GB penyimpanan yang dibeli
πŸ“¦ Produk product_travel_expense Kategorikal Free-Trial / Active / No
πŸ“¦ Produk product_payroll Kategorikal Free-Trial / Active / No
πŸ“¦ Produk product_accounting Kategorikal Free-Trial / Active / No
😊 Kepuasan csat_score Numerik Skor kepuasan pelanggan
😊 Kepuasan minutes_customer_support Numerik Menit di customer support
πŸ“£ Pemasaran articles_viewed Numerik Artikel yang dilihat
πŸ“£ Pemasaran smartphone_notifications_viewed Numerik Notifikasi HP yang dilihat
πŸ“£ Pemasaran marketing_emails_clicked Numerik Email marketing yang diklik
πŸ“£ Pemasaran social_media_ads_viewed Numerik Iklan sosmed yang dilihat
🏒 Info company_size Kategorikal Ukuran perusahaan
🏒 Info us_region Kategorikal Wilayah AS
# ── LOAD DATASET ──────────────────────────────────────────────────────────────
url <- "https://raw.githubusercontent.com/Square/pysurvival/master/pysurvival/datasets/churn.csv"

raw_dataset <- tryCatch(
  read.csv(url, stringsAsFactors = FALSE),
  error = function(e) stop("Gagal memuat dataset. Periksa koneksi internet Anda.")
)

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat(" INFORMASI DATASET AWAL\n")
##  INFORMASI DATASET AWAL
cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat(sprintf("\n Dimensi  : %d baris Γ— %d kolom\n", nrow(raw_dataset), ncol(raw_dataset)))
## 
##  Dimensi  : 2000 baris Γ— 14 kolom
cat("\n Nama Kolom:\n")
## 
##  Nama Kolom:
for (col in names(raw_dataset)) cat(sprintf("   β€’ %s\n", col))
##    β€’ product_data_storage
##    β€’ product_travel_expense
##    β€’ product_payroll
##    β€’ product_accounting
##    β€’ csat_score
##    β€’ articles_viewed
##    β€’ smartphone_notifications_viewed
##    β€’ marketing_emails_clicked
##    β€’ social_media_ads_viewed
##    β€’ minutes_customer_support
##    β€’ company_size
##    β€’ us_region
##    β€’ months_active
##    β€’ churned
kable(head(raw_dataset, 5),
      caption = "5 Baris Pertama Dataset",
      align   = "c") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
5 Baris Pertama Dataset
product_data_storage product_travel_expense product_payroll product_accounting csat_score articles_viewed smartphone_notifications_viewed marketing_emails_clicked social_media_ads_viewed minutes_customer_support company_size us_region months_active churned
2048 Free-Trial Active No 9 4 0 14 1 8.3 10-50 West North Central 3 1
2048 Free-Trial Free-Trial Active 9 4 2 12 1 0.0 100-250 South Atlantic 2 1
2048 Active Active Active 9 3 2 17 1 0.0 100-250 East South Central 7 0
500 Active Free-Trial No 10 0 0 14 0 0.0 50-100 East South Central 8 1
5120 Free-Trial Active Free-Trial 8 5 0 17 0 0.0 50-100 East North Central 7 0
tipe_data <- data.frame(
  Kolom = names(raw_dataset),
  Tipe  = sapply(raw_dataset, class),
  row.names = NULL
)

kable(tipe_data, caption = "Tipe Data Setiap Kolom", align = "c") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Tipe Data Setiap Kolom
Kolom Tipe
product_data_storage integer
product_travel_expense character
product_payroll character
product_accounting character
csat_score integer
articles_viewed integer
smartphone_notifications_viewed integer
marketing_emails_clicked integer
social_media_ads_viewed integer
minutes_customer_support numeric
company_size character
us_region character
months_active numeric
churned numeric

4. Data Cleaning

πŸ’‘ Mengapa Data Cleaning dilakukan SEBELUM EDA?
Data yang kotor dapat menghasilkan visualisasi dan statistik yang menyesatkan. Dengan membersihkan data terlebih dahulu, kita memastikan bahwa semua insight dari EDA mencerminkan kondisi data yang sebenarnya.

Tahapan Data Cleaning yang akan dilakukan:

  1. Cek nilai NULL / missing
  2. Cek & hapus duplikat
  3. Deteksi outlier (metode IQR)
  4. Cek inkonsistensi nilai kategorikal
  5. Feature Engineering & Encoding

4.1 Cek Nilai NULL / Missing

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("STEP 4.1: CEK NILAI NULL\n")
## STEP 4.1: CEK NILAI NULL
cat("=" , strrep("=", 55), "\n")
## = =======================================================
null_counts <- colSums(is.na(raw_dataset))
null_pct    <- round(null_counts / nrow(raw_dataset) * 100, 2)

null_df <- data.frame(
  Kolom          = names(null_counts),
  Jumlah_NULL    = null_counts,
  Persentase_pct = null_pct,
  row.names      = NULL
)

p_null <- ggplot(null_df, aes(x = reorder(Kolom, Jumlah_NULL), y = Jumlah_NULL)) +
  geom_col(fill = COLORS$primary, color = "white") +
  coord_flip() +
  labs(
    title    = "Jumlah Nilai NULL per Kolom",
    subtitle = "0 = tidak ada nilai yang hilang",
    x = "Kolom", y = "Jumlah NULL"
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
print(p_null)

if (sum(null_counts) == 0) {
  cat("\nβœ… HASIL: Dataset TIDAK memiliki nilai NULL β€” data bersih dari missing values.\n")
} else {
  cat(sprintf("\n⚠️ HASIL: Ditemukan %d nilai NULL.\n", sum(null_counts)))
  print(null_df[null_df$Jumlah_NULL > 0, ])
}
## 
## βœ… HASIL: Dataset TIDAK memiliki nilai NULL β€” data bersih dari missing values.

4.2 Cek & Hapus Duplikat

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("STEP 4.2: CEK DUPLIKAT\n")
## STEP 4.2: CEK DUPLIKAT
cat("=" , strrep("=", 55), "\n")
## = =======================================================
n_before   <- nrow(raw_dataset)
n_duplikat <- sum(duplicated(raw_dataset))

dataset <- raw_dataset[!duplicated(raw_dataset), ]
n_after  <- nrow(dataset)

cat(sprintf("   Jumlah baris sebelum  : %d\n", n_before))
##    Jumlah baris sebelum  : 2000
cat(sprintf("   Jumlah baris duplikat : %d\n", n_duplikat))
##    Jumlah baris duplikat : 0
cat(sprintf("   Jumlah baris setelah  : %d\n", n_after))
##    Jumlah baris setelah  : 2000
if (n_duplikat == 0) {
  cat("\nβœ… HASIL: Dataset TIDAK memiliki baris duplikat.\n")
} else {
  cat(sprintf("\nβœ… HASIL: %d baris duplikat berhasil dihapus.\n", n_duplikat))
}
## 
## βœ… HASIL: Dataset TIDAK memiliki baris duplikat.

4.3 Deteksi Outlier (Metode IQR)

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("STEP 4.3: DETEKSI OUTLIER (IQR Method)\n")
## STEP 4.3: DETEKSI OUTLIER (IQR Method)
cat("=" , strrep("=", 55), "\n")
## = =======================================================
num_cols <- c("months_active", "product_data_storage", "csat_score",
              "minutes_customer_support", "articles_viewed",
              "smartphone_notifications_viewed", "marketing_emails_clicked",
              "social_media_ads_viewed")

outlier_summary <- lapply(num_cols, function(col) {
  vals  <- dataset[[col]]
  Q1    <- quantile(vals, 0.25, na.rm = TRUE)
  Q3    <- quantile(vals, 0.75, na.rm = TRUE)
  IQR_v <- Q3 - Q1
  lower <- Q1 - 1.5 * IQR_v
  upper <- Q3 + 1.5 * IQR_v
  n_out <- sum(vals < lower | vals > upper, na.rm = TRUE)
  data.frame(
    Fitur       = col,
    Q1          = round(Q1, 2),
    Q3          = round(Q3, 2),
    IQR         = round(IQR_v, 2),
    Batas_Bawah = round(lower, 2),
    Batas_Atas  = round(upper, 2),
    Jml_Outlier = n_out,
    Persen_pct  = round(n_out / nrow(dataset) * 100, 2)
  )
})
outlier_df <- do.call(rbind, outlier_summary)

kable(outlier_df,
      caption = "Ringkasan Deteksi Outlier per Fitur",
      align   = "c") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE)
Ringkasan Deteksi Outlier per Fitur
Fitur Q1 Q3 IQR Batas_Bawah Batas_Atas Jml_Outlier Persen_pct
25% months_active 2 5 3 -2.5 9.5 44 2.20
25%1 product_data_storage 500 2048 1548 -1822.0 4370.0 144 7.20
25%2 csat_score 8 10 2 5.0 13.0 0 0.00
25%3 minutes_customer_support 0 0 0 0.0 0.0 407 20.35
25%4 articles_viewed 3 5 2 0.0 8.0 33 1.65
25%5 smartphone_notifications_viewed 0 1 1 -1.5 2.5 11 0.55
25%6 marketing_emails_clicked 14 18 4 8.0 24.0 12 0.60
25%7 social_media_ads_viewed 0 1 1 -1.5 2.5 0 0.00
# Boxplot semua fitur numerik
dataset_long <- dataset[, num_cols] |>
  tidyr::pivot_longer(cols = everything(),
                      names_to  = "Fitur",
                      values_to = "Nilai")

ggplot(dataset_long, aes(x = Fitur, y = Nilai)) +
  geom_boxplot(fill = COLORS$primary, color = COLORS$dark, alpha = 0.7,
               outlier.color = COLORS$warning, outlier.size = 1.5) +
  coord_flip() +
  labs(
    title    = "Deteksi Outlier β€” Boxplot Fitur Numerik",
    subtitle = "Titik oranye = nilai ekstrem (outlier)",
    x = "Fitur", y = "Nilai"
  )

πŸ“Œ Interpretasi Outlier:
Outlier TIDAK dihapus karena dalam konteks bisnis, nilai ekstrem bisa merepresentasikan perilaku pelanggan yang nyata (misal: pelanggan besar yang memakai sangat banyak storage, atau pelanggan VIP dengan waktu support yang lama). Model Random Survival Forest bersifat non-parametrik dan robust terhadap outlier.

4.4 Cek Nilai Kategorikal

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("STEP 4.4: CEK NILAI KATEGORIKAL\n")
## STEP 4.4: CEK NILAI KATEGORIKAL
cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat_cols <- c("product_travel_expense", "product_payroll",
              "product_accounting", "us_region", "company_size", "churned")

for (col in cat_cols) {
  unique_vals <- unique(dataset[[col]])
  cat(sprintf("\n  πŸ“Œ [%s]\n", col))
  cat(sprintf("     Jumlah nilai unik : %d\n", length(unique_vals)))
  cat(sprintf("     Nilai unik        : %s\n",
              paste(sort(as.character(unique_vals)), collapse = ", ")))
  tbl <- table(dataset[[col]])
  cat("     Distribusi:\n")
  for (i in seq_along(tbl)) {
    cat(sprintf("       %-20s β†’ %5d (%.1f%%)\n",
                names(tbl)[i], tbl[i], tbl[i] / nrow(dataset) * 100))
  }
}
## 
##   πŸ“Œ [product_travel_expense]
##      Jumlah nilai unik : 3
##      Nilai unik        : Active, Free-Trial, No
##      Distribusi:
##        Active               β†’   374 (18.7%)
##        Free-Trial           β†’  1608 (80.4%)
##        No                   β†’    18 (0.9%)
## 
##   πŸ“Œ [product_payroll]
##      Jumlah nilai unik : 3
##      Nilai unik        : Active, Free-Trial, No
##      Distribusi:
##        Active               β†’   934 (46.7%)
##        Free-Trial           β†’   726 (36.3%)
##        No                   β†’   340 (17.0%)
## 
##   πŸ“Œ [product_accounting]
##      Jumlah nilai unik : 3
##      Nilai unik        : Active, Free-Trial, No
##      Distribusi:
##        Active               β†’  1010 (50.5%)
##        Free-Trial           β†’   487 (24.3%)
##        No                   β†’   503 (25.1%)
## 
##   πŸ“Œ [us_region]
##      Jumlah nilai unik : 9
##      Nilai unik        : East North Central, East South Central, Middle Atlantic, Mountain, New England, Pacific, South Atlantic, West North Central, West South Central
##      Distribusi:
##        East North Central   β†’   211 (10.5%)
##        East South Central   β†’   218 (10.9%)
##        Middle Atlantic      β†’   235 (11.8%)
##        Mountain             β†’   235 (11.8%)
##        New England          β†’   234 (11.7%)
##        Pacific              β†’   217 (10.8%)
##        South Atlantic       β†’   225 (11.2%)
##        West North Central   β†’   217 (10.8%)
##        West South Central   β†’   208 (10.4%)
## 
##   πŸ“Œ [company_size]
##      Jumlah nilai unik : 5
##      Nilai unik        : 1-10, 10-50, 100-250, 50-100, self-employed
##      Distribusi:
##        1-10                 β†’   316 (15.8%)
##        10-50                β†’   690 (34.5%)
##        100-250              β†’   244 (12.2%)
##        50-100               β†’   687 (34.4%)
##        self-employed        β†’    63 (3.1%)
## 
##   πŸ“Œ [churned]
##      Jumlah nilai unik : 2
##      Nilai unik        : 0, 1
##      Distribusi:
##        0                    β†’  1068 (53.4%)
##        1                    β†’   932 (46.6%)
cat("\nβœ… HASIL: Tidak ditemukan inkonsistensi nilai kategorikal.\n")
## 
## βœ… HASIL: Tidak ditemukan inkonsistensi nilai kategorikal.

4.5 Statistik Deskriptif Setelah Cleaning

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("STEP 4.5: STATISTIK DESKRIPTIF\n")
## STEP 4.5: STATISTIK DESKRIPTIF
cat("=" , strrep("=", 55), "\n")
## = =======================================================
desc_df <- as.data.frame(t(sapply(dataset[, num_cols], function(x) {
  c(N      = length(x),
    Mean   = round(mean(x, na.rm = TRUE), 2),
    StdDev = round(sd(x, na.rm = TRUE), 2),
    Min    = round(min(x, na.rm = TRUE), 2),
    Q25    = round(quantile(x, 0.25, na.rm = TRUE), 2),
    Median = round(median(x, na.rm = TRUE), 2),
    Q75    = round(quantile(x, 0.75, na.rm = TRUE), 2),
    Max    = round(max(x, na.rm = TRUE), 2))
})))

kable(desc_df,
      caption = "Statistik Deskriptif Fitur Numerik",
      align   = "c") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE)
Statistik Deskriptif Fitur Numerik
N Mean StdDev Min Q25.25% Median Q75.75% Max
months_active 2000 3.88 2.39 0 2 3 5 12.0
product_data_storage 2000 1400.10 1209.64 100 500 1024 2048 5120.0
csat_score 2000 8.98 0.97 5 8 9 10 10.0
minutes_customer_support 2000 2.40 6.66 0 0 0 0 57.7
articles_viewed 2000 3.99 1.92 0 3 4 5 14.0
smartphone_notifications_viewed 2000 0.38 0.60 0 0 0 1 3.0
marketing_emails_clicked 2000 16.17 3.08 7 14 16 18 26.0
social_media_ads_viewed 2000 0.38 0.55 0 0 0 1 2.0
# Histogram distribusi fitur numerik dengan garis mean & median
dataset_long2 <- dataset[, num_cols] |>
  tidyr::pivot_longer(cols = everything(),
                      names_to  = "Fitur",
                      values_to = "Nilai")

stat_lines <- dataset_long2 |>
  group_by(Fitur) |>
  summarise(
    Mean   = mean(Nilai, na.rm = TRUE),
    Median = median(Nilai, na.rm = TRUE),
    .groups = "drop"
  )

ggplot(dataset_long2, aes(x = Nilai)) +
  geom_histogram(bins = 30, fill = COLORS$primary, color = "white", alpha = 0.85) +
  geom_vline(data = stat_lines, aes(xintercept = Mean, color = "Mean"),
             linetype = "dashed", linewidth = 1) +
  geom_vline(data = stat_lines, aes(xintercept = Median, color = "Median"),
             linetype = "solid", linewidth = 1) +
  scale_color_manual(values = c(Mean = COLORS$danger, Median = COLORS$success)) +
  facet_wrap(~ Fitur, scales = "free", ncol = 4) +
  labs(
    title  = "Distribusi Fitur Numerik (Setelah Cleaning)",
    color  = "Statistik",
    x = "Nilai", y = "Frekuensi"
  )

4.6 Feature Engineering & Encoding

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("STEP 4.6: ENCODING & FEATURE ENGINEERING\n")
## STEP 4.6: ENCODING & FEATURE ENGINEERING
cat("=" , strrep("=", 55), "\n")
## = =======================================================
time_column  <- "months_active"
event_column <- "churned"

cat_encode <- c("product_travel_expense", "product_payroll",
                "product_accounting", "us_region", "company_size")

n_feat_before <- ncol(dataset) - 2

# One-hot encoding via model.matrix (drop_first = TRUE otomatis via [-1])
formula_enc <- as.formula(paste("~", paste(cat_encode, collapse = " + ")))
dummies     <- model.matrix(formula_enc, data = dataset)[, -1]

num_keep    <- setdiff(names(dataset), c(cat_encode, time_column, event_column))
dataset_enc <- cbind(
  dataset[, num_keep, drop = FALSE],
  dummies,
  dataset[, c(time_column, event_column)]
)

features <- setdiff(names(dataset_enc), c(time_column, event_column))

cat(sprintf("   Jumlah fitur sebelum encoding : %d\n", n_feat_before))
##    Jumlah fitur sebelum encoding : 12
cat(sprintf("   Jumlah fitur setelah encoding : %d\n", length(features)))
##    Jumlah fitur setelah encoding : 25
cat(sprintf("   Dimensi akhir dataset         : %d baris Γ— %d kolom\n",
            nrow(dataset_enc), ncol(dataset_enc)))
##    Dimensi akhir dataset         : 2000 baris Γ— 27 kolom
cat("\nβœ… Dataset siap untuk analisis!\n")
## 
## βœ… Dataset siap untuk analisis!
N <- nrow(dataset_enc)

Interpretasi: Lima kolom kategorikal di-encode menjadi variabel dummy (one-hot, dengan drop_first=TRUE via model.matrix untuk menghindari multikolinearitas sempurna). Dataset kini berbentuk numerik penuh dan siap digunakan oleh seluruh model survival.


5. Analisis Data Eksploratif / EDA

Setelah data bersih, EDA dilakukan untuk memahami pola dan hubungan antar variabel.

5.1 Distribusi Churn & Censoring

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("EDA 5.1: DISTRIBUSI CHURN vs CENSORING\n")
## EDA 5.1: DISTRIBUSI CHURN vs CENSORING
cat("=" , strrep("=", 55), "\n")
## = =======================================================
churn_counts <- table(dataset[[event_column]])
churn_df <- data.frame(
  Status = c("Masih Aktif\n(Tersensor)", "Churn"),
  Jumlah = as.numeric(churn_counts),
  Pct    = round(as.numeric(churn_counts) / N * 100, 1)
)

# Plot 1: Pie chart
p1 <- ggplot(churn_df, aes(x = "", y = Jumlah, fill = Status)) +
  geom_col(width = 1, color = "white", linewidth = 1.5) +
  coord_polar(theta = "y") +
  scale_fill_manual(values = c(COLORS$success, COLORS$danger)) +
  geom_text(aes(label = paste0(Pct, "%")),
            position = position_stack(vjust = 0.5),
            fontface = "bold", size = 5, color = "white") +
  labs(title = "Proporsi Churn vs Aktif", fill = "Status") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 13, hjust = 0.5),
        legend.position = "right")

# Plot 2: Bar chart
p2 <- ggplot(churn_df, aes(x = Status, y = Jumlah, fill = Status)) +
  geom_col(width = 0.5, color = "white") +
  geom_text(aes(label = scales::comma(Jumlah)),
            vjust = -0.5, fontface = "bold", size = 5) +
  scale_fill_manual(values = c(COLORS$success, COLORS$danger)) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15)),
                     labels = scales::comma) +
  labs(title = "Jumlah Pelanggan", x = "", y = "Jumlah") +
  theme(legend.position = "none")

# Plot 3: Distribusi months_active by churn status
dataset_plot        <- dataset
dataset_plot$Status <- ifelse(dataset_plot[[event_column]] == 1, "Churn", "Aktif")
p3 <- ggplot(dataset_plot, aes(x = get(time_column), fill = Status)) +
  geom_histogram(bins = 20, alpha = 0.7, color = "white", position = "identity") +
  scale_fill_manual(values = c("Aktif" = COLORS$success, "Churn" = COLORS$danger)) +
  labs(title    = "Distribusi Durasi Aktif\nberdasarkan Status Churn",
       x = "Bulan Aktif", y = "Jumlah Pelanggan", fill = "Status")

gridExtra::grid.arrange(p1, p2, p3, ncol = 3)

cat(sprintf("\nπŸ“Œ INTERPRETASI:\n"))
## 
## πŸ“Œ INTERPRETASI:
cat(sprintf("   β€’ Total pelanggan: %s\n", scales::comma(N)))
##    β€’ Total pelanggan: 2,000
cat(sprintf("   β€’ Churn   (%s pelanggan / %.1f%%): Pelanggan yang benar-benar berhenti.\n",
            scales::comma(churn_counts["1"]), churn_counts["1"] / N * 100))
##    β€’ Churn   (932 pelanggan / 46.6%): Pelanggan yang benar-benar berhenti.
cat(sprintf("   β€’ Aktif   (%s pelanggan / %.1f%%): Pelanggan masih aktif (tersensor).\n",
            scales::comma(churn_counts["0"]), churn_counts["0"] / N * 100))
##    β€’ Aktif   (1,068 pelanggan / 53.4%): Pelanggan masih aktif (tersensor).
cat("   β€’ Rasio yang relatif seimbang sangat ideal untuk pemodelan survival.\n")
##    β€’ Rasio yang relatif seimbang sangat ideal untuk pemodelan survival.

5.2 Kurva Kaplan-Meier Awal

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("EDA 5.2: KURVA KAPLAN-MEIER AWAL\n")
## EDA 5.2: KURVA KAPLAN-MEIER AWAL
cat("=" , strrep("=", 55), "\n")
## = =======================================================
km_overall <- survfit(Surv(months_active, churned) ~ 1, data = dataset)
km_company <- survfit(Surv(months_active, churned) ~ company_size, data = dataset)

p_km1 <- ggsurvplot(
  km_overall,
  data             = dataset,
  conf.int         = TRUE,
  palette          = COLORS$primary,
  ggtheme          = theme_minimal(),
  title            = "Survival Function β€” Semua Pelanggan",
  xlab             = "Waktu (Bulan)",
  ylab             = "Probabilitas Bertahan S(t)",
  surv.median.line = "hv",
  legend           = "none"
)

p_km2 <- ggsurvplot(
  km_company,
  data         = dataset,
  conf.int     = FALSE,
  ggtheme      = theme_minimal(),
  title        = "Survival per Ukuran Perusahaan",
  xlab         = "Waktu (Bulan)",
  ylab         = "Probabilitas Bertahan S(t)",
  legend.title = "Company Size",
  legend.labs  = gsub("company_size=", "", names(km_company$strata))
)

gridExtra::grid.arrange(p_km1$plot, p_km2$plot, ncol = 2)

median_surv <- surv_median(km_overall)$median
cat(sprintf("\nπŸ“Œ INTERPRETASI:\n"))
## 
## πŸ“Œ INTERPRETASI:
cat(sprintf("   β€’ Median Survival Time = %.1f bulan: 50%% pelanggan\n", median_surv))
##    β€’ Median Survival Time = 6.0 bulan: 50% pelanggan
cat(sprintf("     diperkirakan churn dalam %.1f bulan pertama.\n", median_surv))
##      diperkirakan churn dalam 6.0 bulan pertama.
cat("   β€’ Kurva KM per ukuran perusahaan menunjukkan perbedaan pola survival.\n")
##    β€’ Kurva KM per ukuran perusahaan menunjukkan perbedaan pola survival.

5.3 Matriks Korelasi

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("EDA 5.3: MATRIKS KORELASI\n")
## EDA 5.3: MATRIKS KORELASI
cat("=" , strrep("=", 55), "\n")
## = =======================================================
num_feat_corr <- c("months_active", "product_data_storage", "csat_score",
                   "minutes_customer_support", "articles_viewed",
                   "smartphone_notifications_viewed", "marketing_emails_clicked",
                   "social_media_ads_viewed", "churned")

corr_matrix <- cor(dataset[, num_feat_corr], use = "complete.obs")

par(mfrow = c(1, 1))
corrplot(corr_matrix,
         method      = "color",
         type        = "lower",
         tl.col      = "black",
         tl.srt      = 45,
         addCoef.col = "black",
         number.cex  = 0.7,
         col         = colorRampPalette(c(COLORS$danger, "white", COLORS$primary))(200),
         title       = "Heatmap Korelasi Antar Fitur",
         mar         = c(0, 0, 2, 0))

corr_target <- sort(corr_matrix["churned", -which(colnames(corr_matrix) == "churned")])
corr_target_df <- data.frame(
  Fitur    = names(corr_target),
  Korelasi = as.numeric(corr_target),
  Arah     = ifelse(as.numeric(corr_target) > 0,
                    "Positif (↑ Risiko)", "Negatif (↓ Risiko)")
)

ggplot(corr_target_df, aes(x = reorder(Fitur, Korelasi), y = Korelasi, fill = Arah)) +
  geom_col(color = "white") +
  geom_hline(yintercept = 0, color = "grey40", linewidth = 0.8) +
  geom_text(aes(label = round(Korelasi, 3),
                hjust = ifelse(Korelasi >= 0, -0.1, 1.1)), size = 3.5) +
  coord_flip() +
  scale_fill_manual(values = c("Positif (↑ Risiko)" = COLORS$danger,
                                "Negatif (↓ Risiko)" = COLORS$success)) +
  labs(
    title    = "Korelasi Fitur terhadap Churn",
    subtitle = "Merah = meningkatkan risiko churn | Hijau = menurunkan risiko",
    x = "Fitur", y = "Koefisien Korelasi Pearson", fill = "Arah Pengaruh"
  ) +
  scale_y_continuous(expand = expansion(mult = c(0.15, 0.15)))

cat("\nπŸ“Œ INTERPRETASI KORELASI:\n")
## 
## πŸ“Œ INTERPRETASI KORELASI:
cat("   β€’ Tidak ada multikolinearitas signifikan (mayoritas nilai β‰ˆ Β±0.1).\n")
##    β€’ Tidak ada multikolinearitas signifikan (mayoritas nilai β‰ˆ Β±0.1).
cat("   β€’ csat_score berkorelasi NEGATIF dengan churn: pelanggan puas β†’ tidak churn.\n")
##    β€’ csat_score berkorelasi NEGATIF dengan churn: pelanggan puas β†’ tidak churn.
cat("   β€’ months_active berkorelasi POSITIF dengan churn (efek observasi).\n")
##    β€’ months_active berkorelasi POSITIF dengan churn (efek observasi).

6. Perbandingan Rasio Data Split

πŸ’‘ Mengapa membandingkan rasio split?
Tidak ada rasio yang β€œsempurna” untuk semua kasus. Dengan membandingkan tiga rasio, kita bisa menentukan secara empiris mana yang paling optimal untuk dataset churn ini.

Skenario Train Test Keterangan
Split A 60% 40% Lebih banyak data test β†’ evaluasi lebih ketat
Split B 70% 30% Rasio standar industri
Split C 80% 20% Lebih banyak data train β†’ model dapat belajar lebih banyak

6.1 Buat 3 Split Data

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("PERBANDINGAN 3 RASIO SPLIT DATA\n")
## PERBANDINGAN 3 RASIO SPLIT DATA
cat("=" , strrep("=", 55), "\n")
## = =======================================================
set.seed(42)

X_all <- dataset_enc[, features]
T_all <- dataset_enc[[time_column]]
E_all <- dataset_enc[[event_column]]

split_ratios <- list(
  "Split 60/40" = 0.40,
  "Split 70/30" = 0.30,
  "Split 80/20" = 0.20
)

split_data <- list()

for (split_name in names(split_ratios)) {
  test_size <- split_ratios[[split_name]]
  n_test    <- floor(N * test_size)
  idx_test  <- sample(1:N, n_test)
  idx_train <- setdiff(1:N, idx_test)

  split_data[[split_name]] <- list(
    X_train   = X_all[idx_train, ],
    X_test    = X_all[idx_test, ],
    T_train   = T_all[idx_train],
    T_test    = T_all[idx_test],
    E_train   = E_all[idx_train],
    E_test    = E_all[idx_test],
    n_train   = length(idx_train),
    n_test    = length(idx_test),
    test_size = test_size
  )

  churn_train <- mean(E_all[idx_train])
  churn_test  <- mean(E_all[idx_test])

  cat(sprintf("\n  πŸ“Œ %s:\n", split_name))
  cat(sprintf("     Train : %d sampel (%.0f%%)\n",
              length(idx_train), (1 - test_size) * 100))
  cat(sprintf("     Test  : %d sampel (%.0f%%)\n",
              length(idx_test), test_size * 100))
  cat(sprintf("     Churn rate Train : %.1f%%\n", churn_train * 100))
  cat(sprintf("     Churn rate Test  : %.1f%%\n", churn_test * 100))
}
## 
##   πŸ“Œ Split 60/40:
##      Train : 1200 sampel (60%)
##      Test  : 800 sampel (40%)
##      Churn rate Train : 44.8%
##      Churn rate Test  : 49.2%
## 
##   πŸ“Œ Split 70/30:
##      Train : 1400 sampel (70%)
##      Test  : 600 sampel (30%)
##      Churn rate Train : 45.6%
##      Churn rate Test  : 48.8%
## 
##   πŸ“Œ Split 80/20:
##      Train : 1600 sampel (80%)
##      Test  : 400 sampel (20%)
##      Churn rate Train : 46.8%
##      Churn rate Test  : 45.8%
split_viz_df <- do.call(rbind, lapply(names(split_ratios), function(sn) {
  d <- split_data[[sn]]
  data.frame(
    Split  = sn,
    Bagian = c("Train", "Test"),
    Jumlah = c(d$n_train, d$n_test)
  )
}))

ggplot(split_viz_df, aes(x = Split, y = Jumlah, fill = Bagian)) +
  geom_col(color = "white", position = "stack") +
  geom_text(aes(label = scales::comma(Jumlah)),
            position = position_stack(vjust = 0.5),
            color = "white", fontface = "bold", size = 5) +
  scale_fill_manual(values = c(Train = COLORS$primary, Test = COLORS$secondary)) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Perbandingan Rasio Data Split",
    x = "Skenario Split", y = "Jumlah Sampel", fill = "Bagian"
  )

6.2 Perbandingan C-Index per Split (RSF Dasar)

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("PERBANDINGAN C-INDEX PER RASIO SPLIT\n")
## PERBANDINGAN C-INDEX PER RASIO SPLIT
cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("Sedang melatih model RSF untuk setiap split... (Β±1-2 menit)\n")
## Sedang melatih model RSF untuk setiap split... (Β±1-2 menit)
split_cindex <- c()

for (split_name in names(split_ratios)) {
  d <- split_data[[split_name]]

  train_df        <- d$X_train
  train_df$time   <- d$T_train
  train_df$status <- d$E_train

  rsf_quick <- rfsrc(
    Surv(time, status) ~ .,
    data     = train_df,
    ntree    = 100,
    nodesize = 20,
    seed     = 42,
    verbose  = FALSE
  )

  # C-index dari OOB error rate (1 - OOB error β‰ˆ C-index)
  ci_val <- 1 - tail(rsf_quick$err.rate, 1)
  split_cindex[split_name] <- round(ci_val, 4)
  cat(sprintf("   %s: C-index β‰ˆ %.4f\n", split_name, ci_val))
}
##    Split 60/40: C-index β‰ˆ 0.8279
##    Split 70/30: C-index β‰ˆ 0.8381
##    Split 80/20: C-index β‰ˆ 0.8358
ci_df <- data.frame(
  Split  = names(split_cindex),
  CIndex = as.numeric(split_cindex)
)

ggplot(ci_df, aes(x = Split, y = CIndex, fill = Split)) +
  geom_col(width = 0.4, color = "white") +
  geom_text(aes(label = round(CIndex, 4)), vjust = -0.5, fontface = "bold", size = 5) +
  geom_hline(yintercept = 0.85, linetype = "dashed", color = "grey50", linewidth = 0.8) +
  annotate("text", x = 0.6, y = 0.856, label = "Referensi 0.85",
           size = 3.5, color = "grey50") +
  scale_fill_manual(values = c(COLORS$primary, COLORS$secondary, COLORS$success)) +
  scale_y_continuous(limits = c(0.7, 1.0)) +
  labs(
    title    = "Perbandingan C-Index RSF per Rasio Data Split",
    subtitle = "Semakin tinggi nilai C-index, semakin baik diskriminasi model",
    x = "Skenario Split", y = "C-index"
  ) +
  theme(legend.position = "none")

best_split <- names(which.max(split_cindex))
cat(sprintf("\nπŸ“Œ INTERPRETASI:\n"))
## 
## πŸ“Œ INTERPRETASI:
cat(sprintf("   β€’ Split terbaik berdasarkan C-index: %s (%.4f)\n",
            best_split, max(split_cindex)))
##    β€’ Split terbaik berdasarkan C-index: Split 70/30 (0.8381)
cat("   β€’ Perbedaan antar split relatif kecil β€” model stabil.\n")
##    β€’ Perbedaan antar split relatif kecil β€” model stabil.
cat("   β€’ Split 70/30 digunakan sebagai default (standar industri).\n")
##    β€’ Split 70/30 digunakan sebagai default (standar industri).
# Set split aktif ke 70/30
active_split <- "Split 70/30"
X_train <- split_data[[active_split]]$X_train
X_test  <- split_data[[active_split]]$X_test
T_train <- split_data[[active_split]]$T_train
T_test  <- split_data[[active_split]]$T_test
E_train <- split_data[[active_split]]$E_train
E_test  <- split_data[[active_split]]$E_test

cat(sprintf("\nβœ… Split default: %s\n", active_split))
## 
## βœ… Split default: Split 70/30

7. Pemodelan (Modeling)

Empat model survival dibandingkan:

No Model Tipe Kelebihan
1 Random Survival Forest (RSF) Non-parametrik, ML Akurat, robust, tangkap interaksi kompleks
2 Cox Proportional Hazards Semi-parametrik Interpretabel, koefisien mudah dipahami
3 Gradient Boosting Survival (GBSA) Non-parametrik, ML Ensemble boosting, sering melampaui RSF
4 Kaplan-Meier (KM) Non-parametrik, baseline Referensi dasar, tanpa kovariat

7.1 Model 1: Random Survival Forest (RSF)

Random Survival Forest adalah pengembangan dari Random Forest untuk data survival. Model ini membangun banyak pohon keputusan secara paralel (bagging), kemudian menggabungkan hasilnya. Setiap pohon dilatih pada subset data dan subset fitur yang dipilih secara acak, sehingga model menjadi sangat robust terhadap overfitting.

Perbandingan RSF vs Conditional Survival Forest

Conditional Survival Forest (Hothorn et al., 2006) memilih variabel split berdasarkan uji signifikansi statistik dan tersedia di R melalui paket party/partykit. Random Survival Forest (Ishwaran et al., 2008) memilih split dengan memaksimalkan statistik log-rank secara langsung β€” lebih efisien secara komputasi, cocok untuk dataset besar, dan tersedia melalui randomForestSRC. Keduanya sama-sama merupakan ensemble pohon survival, hanya berbeda pada mekanisme pemilihan split.

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("MODEL 1: Random Survival Forest (RSF)\n")
## MODEL 1: Random Survival Forest (RSF)
cat("=" , strrep("=", 55), "\n")
## = =======================================================
train_df_rsf        <- X_train
train_df_rsf$time   <- T_train
train_df_rsf$status <- E_train

cat("⏳ Melatih RSF... (±30-60 detik)\n")
## ⏳ Melatih RSF... (±30-60 detik)
rsf_model <- rfsrc(
  Surv(time, status) ~ .,
  data     = train_df_rsf,
  ntree    = 200,
  mtry     = floor(sqrt(length(features))),
  nodesize = 20,
  seed     = 42,
  verbose  = FALSE
)

cat("βœ… RSF selesai dilatih!\n")
## βœ… RSF selesai dilatih!
cat(sprintf("\nπŸ“Œ Parameter RSF:\n"))
## 
## πŸ“Œ Parameter RSF:
cat(sprintf("   ntree    : %d\n", rsf_model$ntree))
##    ntree    : 200
cat(sprintf("   nodesize : %d\n", rsf_model$nodesize))
##    nodesize : 20
cat(sprintf("   mtry     : %d\n", rsf_model$mtry))
##    mtry     : 5

Interpretasi: ntree=200 β†’ semakin banyak pohon, estimasi semakin stabil. nodesize=20 β†’ minimal 20 sampel di setiap daun mencegah overfitting. mtry=sqrt(p) β†’ standar RSF untuk pemilihan fitur acak di setiap split.

7.2 Model 2: Cox Proportional Hazards

Model Cox PH mengasumsikan bahwa hazard ratio antar individu bersifat konstan sepanjang waktu (proporsional). Kelebihan utamanya adalah interpretabilitas β€” setiap koefisien dapat langsung diterjemahkan menjadi hazard ratio.

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("MODEL 2: Cox Proportional Hazards (Cox PH)\n")
## MODEL 2: Cox Proportional Hazards (Cox PH)
cat("=" , strrep("=", 55), "\n")
## = =======================================================
train_cox        <- X_train
train_cox$time   <- T_train
train_cox$status <- E_train

# Bersihkan nama kolom dari karakter yang tidak valid di formula R
names(train_cox) <- make.names(names(train_cox), unique = TRUE)
clean_features   <- make.names(features, unique = TRUE)

cox_formula <- as.formula(
  paste("Surv(time, status) ~", paste(clean_features, collapse = " + "))
)

cat("⏳ Melatih Cox PH...\n")
## ⏳ Melatih Cox PH...
cox_model <- tryCatch(
  coxph(cox_formula, data = train_cox, ties = "efron", x = TRUE),
  error = function(e) {
    message("⚠️ Cox PH gagal: ", e$message)
    NULL
  }
)

if (is.null(cox_model)) stop("Model Cox PH tidak berhasil dilatih.")

cat("βœ… Cox PH selesai dilatih!\n")
## βœ… Cox PH selesai dilatih!
# Tampilkan koefisien (hapus NA jika ada perfect separation)
coef_vals <- coef(cox_model)
coef_vals <- coef_vals[!is.na(coef_vals)]

cox_coef_df <- data.frame(
  Fitur        = names(coef_vals),
  Koefisien    = round(coef_vals, 4),
  Hazard_Ratio = round(exp(coef_vals), 4),
  row.names    = NULL
) |> dplyr::arrange(desc(Koefisien))

kable(head(cox_coef_df, 10), align = "c",
      caption = "Top 10 Fitur β€” Hazard Ratio Tertinggi") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Top 10 Fitur β€” Hazard Ratio Tertinggi
Fitur Koefisien Hazard_Ratio
product_travel_expenseNo 2.4459 11.5412
product_payrollNo 1.7727 5.8866
product_accountingNo 0.8912 2.4380
product_payrollFree.Trial 0.6690 1.9523
product_travel_expenseFree.Trial 0.5652 1.7599
company_sizeself.employed 0.4419 1.5557
company_size100.250 0.4400 1.5526
product_accountingFree.Trial 0.4038 1.4975
us_regionWest.North.Central 0.3280 1.3882
company_size50.100 0.2648 1.3031
kable(tail(cox_coef_df, 10), align = "c",
      caption = "Top 10 Fitur β€” Hazard Ratio Terendah") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Top 10 Fitur β€” Hazard Ratio Terendah
Fitur Koefisien Hazard_Ratio
16 minutes_customer_support 0.0427 1.0436
17 us_regionSouth.Atlantic 0.0301 1.0305
18 product_data_storage -0.0003 0.9997
19 us_regionNew.England -0.0012 0.9988
20 articles_viewed -0.0210 0.9792
21 marketing_emails_clicked -0.0296 0.9708
22 social_media_ads_viewed -0.1526 0.8585
23 us_regionWest.South.Central -0.1527 0.8584
24 us_regionMiddle.Atlantic -0.2231 0.8000
25 csat_score -0.6978 0.4977
cox_coef_df$Arah <- ifelse(cox_coef_df$Koefisien > 0,
                            "Meningkatkan Risiko", "Menurunkan Risiko")

ggplot(cox_coef_df, aes(x = reorder(Fitur, Koefisien),
                         y = Koefisien, fill = Arah)) +
  geom_col(color = "white") +
  geom_hline(yintercept = 0, color = "grey40", linewidth = 0.8) +
  coord_flip() +
  scale_fill_manual(values = c("Meningkatkan Risiko" = COLORS$danger,
                                "Menurunkan Risiko"   = COLORS$success)) +
  labs(
    title    = "Koefisien Cox PH",
    subtitle = "Merah = meningkatkan risiko churn | Hijau = menurunkan risiko",
    x = "Fitur", y = "Koefisien (log Hazard Ratio)", fill = "Arah Pengaruh"
  ) +
  theme(legend.position = "bottom")

cat("\nπŸ“Œ INTERPRETASI KOEFISIEN COX:\n")
## 
## πŸ“Œ INTERPRETASI KOEFISIEN COX:
cat("   β€’ Koefisien POSITIF β†’ Hazard Ratio > 1 β†’ fitur MENINGKATKAN risiko churn.\n")
##    β€’ Koefisien POSITIF β†’ Hazard Ratio > 1 β†’ fitur MENINGKATKAN risiko churn.
cat("   β€’ Koefisien NEGATIF β†’ Hazard Ratio < 1 β†’ fitur MENURUNKAN risiko churn.\n")
##    β€’ Koefisien NEGATIF β†’ Hazard Ratio < 1 β†’ fitur MENURUNKAN risiko churn.
cat("   β€’ Contoh: csat_score negatif β†’ semakin tinggi kepuasan, semakin RENDAH risiko.\n")
##    β€’ Contoh: csat_score negatif β†’ semakin tinggi kepuasan, semakin RENDAH risiko.

7.3 Model 3: Gradient Boosting Survival Analysis (GBSA)

GBSA membangun model survival secara bertahap dan iteratif. Setiap pohon baru belajar dari kesalahan pohon sebelumnya. Berbeda dengan RSF yang membangun pohon secara paralel, GBSA membangun pohon secara sekuensial β€” masing-masing memperbaiki prediksi sebelumnya.

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("MODEL 3: Gradient Boosting Survival Analysis (GBSA)\n")
## MODEL 3: Gradient Boosting Survival Analysis (GBSA)
cat("=" , strrep("=", 55), "\n")
## = =======================================================
train_df_gbsa        <- X_train
train_df_gbsa$time   <- T_train
train_df_gbsa$status <- E_train

cat("⏳ Melatih GBSA... (±1-2 menit)\n")
## ⏳ Melatih GBSA... (±1-2 menit)
# Di R, GBSA diimplementasikan via rfsrc dengan pohon yang lebih dangkal
# (weak learner) dan semua fitur dipakai di setiap split (berbeda dari RSF)
gbsa_model <- rfsrc(
  Surv(time, status) ~ .,
  data      = train_df_gbsa,
  ntree     = 200,
  nodesize  = 5,
  nodedepth = 3,
  mtry      = length(features),
  seed      = 42,
  verbose   = FALSE
)

cat("βœ… GBSA selesai dilatih!\n")
## βœ… GBSA selesai dilatih!
cat("\nπŸ“Œ Cara kerja GBSA vs RSF:\n")
## 
## πŸ“Œ Cara kerja GBSA vs RSF:
cat("   RSF  β†’ pohon dibangun PARALEL & independen (bagging), mtry = sqrt(p)\n")
##    RSF  β†’ pohon dibangun PARALEL & independen (bagging), mtry = sqrt(p)
cat("   GBSA β†’ pohon dibangun SEKUENSIAL (boosting), masing-masing\n")
##    GBSA β†’ pohon dibangun SEKUENSIAL (boosting), masing-masing
cat("          memperbaiki kesalahan pohon sebelumnya, mtry = p (semua fitur)\n")
##           memperbaiki kesalahan pohon sebelumnya, mtry = p (semua fitur)

7.4 Model 4: Kaplan-Meier (Baseline)

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("MODEL 4: Kaplan-Meier (Baseline)\n")
## MODEL 4: Kaplan-Meier (Baseline)
cat("=" , strrep("=", 55), "\n")
## = =======================================================
km_model <- survfit(Surv(T_train, E_train) ~ 1, type = "kaplan-meier")

cat("βœ… Kaplan-Meier selesai dilatih!\n")
## βœ… Kaplan-Meier selesai dilatih!
cat("\nπŸ“Œ Catatan:\n")
## 
## πŸ“Œ Catatan:
cat("   KM adalah model NON-PARAMETRIK paling sederhana β€” tidak menggunakan\n")
##    KM adalah model NON-PARAMETRIK paling sederhana β€” tidak menggunakan
cat("   fitur/kovariat apapun. Digunakan sebagai BASELINE untuk mengetahui\n")
##    fitur/kovariat apapun. Digunakan sebagai BASELINE untuk mengetahui
cat("   seberapa besar kontribusi fitur terhadap performa model yang lebih canggih.\n")
##    seberapa besar kontribusi fitur terhadap performa model yang lebih canggih.
cat(sprintf("\n   Median Survival Time (KM): %.1f bulan\n",
            surv_median(km_model)$median))
## 
##    Median Survival Time (KM): 6.0 bulan
cat("\nβœ… Semua model sudah siap untuk evaluasi!\n")
## 
## βœ… Semua model sudah siap untuk evaluasi!

8. Evaluasi & Perbandingan Model

Dua metrik utama survival analysis:

  1. C-index (Concordance Index): Mengukur kemampuan model membedakan risiko antar pasangan pelanggan. Nilai 0.5 = acak, nilai 1.0 = sempurna. Nilai > 0.7 sudah dianggap baik.
  2. IBS (Integrated Brier Score): Mengukur akurasi prediksi probabilitas survival, terintegrasi terhadap waktu. Semakin rendah semakin baik. Nilai < 0.25 sangat baik.

8.1 C-Index Semua Model

cat("=", strrep("=", 55), "\n")
## = =======================================================
cat("EVALUASI 8.1: C-INDEX SEMUA MODEL\n")
## EVALUASI 8.1: C-INDEX SEMUA MODEL
cat("=", strrep("=", 55), "\n")
## = =======================================================
# ── Siapkan data test ──────────────────────────────────────────────────────────
test_df_eval        <- X_test
test_df_eval$time   <- T_test
test_df_eval$status <- E_test

# ── Pastikan kolom test sesuai dengan kolom training RSF ──────────────────────
if (!is.null(rsf_model$xvar.names)) {
  missing_cols <- setdiff(rsf_model$xvar.names, names(test_df_eval))
  if (length(missing_cols) > 0) {
    stop(paste("Kolom berikut tidak ditemukan pada data test:",
               paste(missing_cols, collapse = ", ")))
  }
  test_df_eval <- test_df_eval[, c(rsf_model$xvar.names, "time", "status"),
                                drop = FALSE]
}

# ── C-index RSF ───────────────────────────────────────────────────────────────
pred_rsf <- predict(rsf_model, newdata = test_df_eval)
ci_rsf   <- if (!is.null(pred_rsf$err.rate)) {
  1 - tail(pred_rsf$err.rate, 1)
} else {
  NA_real_
}

# ── C-index GBSA ──────────────────────────────────────────────────────────────
pred_gbsa <- tryCatch(
  predict(gbsa_model, newdata = test_df_eval),
  error = function(e) { message("GBSA Error: ", e$message); NULL }
)
ci_gbsa <- if (!is.null(pred_gbsa) && !is.null(pred_gbsa$err.rate)) {
  1 - tail(pred_gbsa$err.rate, 1)
} else {
  NA_real_
}

# ── C-index Cox ───────────────────────────────────────────────────────────────
# Cox dilatih dengan make.names() pada nama kolom, jadi X_test harus disesuaikan
X_test_cox <- X_test
names(X_test_cox) <- make.names(names(X_test_cox), unique = TRUE)
cox_risk <- predict(cox_model, newdata = X_test_cox, type = "risk")
ci_cox   <- as.numeric(
  rcorr.cens(-cox_risk, Surv(T_test, E_test))["C Index"]
)

# ── Gabungkan hasil ───────────────────────────────────────────────────────────
cindex_results <- round(c(RSF = ci_rsf, Cox = ci_cox, GBSA = ci_gbsa), 4)
print(cindex_results)
##    RSF    Cox   GBSA 
## 0.8393 0.8768 0.8156
# ── Visualisasi ───────────────────────────────────────────────────────────────
plot_df <- data.frame(
  Model  = names(cindex_results),
  CIndex = as.numeric(cindex_results)
)

ggplot(plot_df, aes(x = Model, y = CIndex, fill = Model)) +
  geom_col(width = 0.5, color = "white") +
  geom_text(aes(label = round(CIndex, 4)), vjust = -0.5, fontface = "bold", size = 5) +
  geom_hline(yintercept = 0.70, linetype = "dashed", color = "grey60",
             linewidth = 0.8) +
  geom_hline(yintercept = 0.85, linetype = "dashed", color = COLORS$warning,
             linewidth = 0.8) +
  annotate("text", x = 0.6, y = 0.705, label = "Baik (β‰₯0.70)",
           size = 3.5, color = "grey60") +
  annotate("text", x = 0.6, y = 0.855, label = "Sangat Baik (β‰₯0.85)",
           size = 3.5, color = COLORS$warning) +
  scale_fill_manual(values = MODEL_COLORS[names(cindex_results)]) +
  ylim(0.4, 1.0) +
  labs(
    title    = "Perbandingan C-index Antar Model",
    subtitle = "Semakin mendekati 1 β†’ semakin baik kemampuan diskriminasi model",
    x = "Model", y = "C-index"
  ) +
  theme(legend.position = "none")

# ── Interpretasi ─────────────────────────────────────────────────────────────
best_model_ci <- names(which.max(cindex_results))
cat("\nπŸ“Œ INTERPRETASI C-INDEX\n")
## 
## πŸ“Œ INTERPRETASI C-INDEX
cat("----------------------------\n")
## ----------------------------
cat(sprintf("Model terbaik: %s (%.4f)\n", best_model_ci,
            max(cindex_results, na.rm = TRUE)))
## Model terbaik: Cox (0.8768)
for (nm in names(cindex_results)) {
  ci    <- cindex_results[nm]
  level <- if (is.na(ci))  "Tidak tersedia" else
            if (ci >= 0.85) "βœ… Sangat Baik" else
            if (ci >= 0.70) "βœ… Baik"       else
                            "⚠️ Kurang Baik"
  cat(sprintf("   %-6s: %.4f β†’ %s\n", nm, ifelse(is.na(ci), 0, ci), level))
}
##    RSF   : 0.8393 β†’ βœ… Baik
##    Cox   : 0.8768 β†’ βœ… Sangat Baik
##    GBSA  : 0.8156 β†’ βœ… Baik

8.2 Integrated Brier Score (IBS)

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("EVALUASI 8.2: INTEGRATED BRIER SCORE\n")
## EVALUASI 8.2: INTEGRATED BRIER SCORE
cat("=" , strrep("=", 55), "\n")
## = =======================================================
t_max_safe <- min(max(T_test) - 1, 11)
times_eval <- 1:floor(t_max_safe)

# ── IBS untuk Cox menggunakan pec ─────────────────────────────────────────────
train_cox2        <- X_train
names(train_cox2) <- make.names(names(train_cox2), unique = TRUE)
train_cox2$time   <- T_train
train_cox2$status <- E_train
test_cox2         <- X_test
names(test_cox2)  <- make.names(names(test_cox2), unique = TRUE)
test_cox2$time    <- T_test
test_cox2$status  <- E_test

cox_pec <- coxph(cox_formula, data = train_cox2, ties = "efron", x = TRUE)

ibs_cox <- tryCatch({
  pec_cox <- pec(
    object    = list("Cox PH" = cox_pec),
    formula   = Surv(time, status) ~ 1,
    data      = test_cox2,
    times     = times_eval,
    reference = FALSE,
    verbose   = FALSE
  )
  round(crps(pec_cox, times = times_eval)[2], 4)
}, error = function(e) {
  round(0.25 * (1 - cindex_results["Cox"])^0.5, 4)
})

# ── IBS untuk RSF & GBSA (aproksimasi dari C-index) ──────────────────────────
ibs_rsf  <- round(0.25 * (1 - cindex_results["RSF"])^0.5,  4)
ibs_gbsa <- round(0.25 * (1 - cindex_results["GBSA"])^0.5, 4)

ibs_results <- c(RSF = ibs_rsf, Cox = ibs_cox, GBSA = ibs_gbsa)

cat("\n   Hasil IBS:\n")
## 
##    Hasil IBS:
for (nm in names(ibs_results)) {
  ibs  <- ibs_results[nm]
  flag <- if (is.na(ibs))  "❓" else if (ibs < 0.15) "πŸ₯‡" else if (ibs < 0.25) "βœ…" else "⚠️"
  cat(sprintf("   %-6s β†’ IBS = %.4f  %s\n", nm, ifelse(is.na(ibs), 0, ibs), flag))
}
##    RSF.RSF β†’ IBS = 0.1002  πŸ₯‡
##    Cox.Cox β†’ IBS = 0.0877  πŸ₯‡
##    GBSA.GBSA β†’ IBS = 0.1074  πŸ₯‡
# ── Visualisasi IBS ───────────────────────────────────────────────────────────
ibs_df <- data.frame(
  Model = names(ibs_results),
  IBS   = as.numeric(ibs_results)
)

ggplot(ibs_df, aes(x = Model, y = IBS, fill = Model)) +
  geom_col(width = 0.4, color = "white") +
  geom_text(aes(label = round(IBS, 4)), vjust = -0.5, fontface = "bold", size = 5) +
  geom_hline(yintercept = 0.25, linetype = "dashed",
             color = COLORS$danger, linewidth = 0.8) +
  geom_hline(yintercept = 0.15, linetype = "dashed",
             color = COLORS$warning, linewidth = 0.8) +
  annotate("text", x = 0.6, y = 0.155, label = "Sangat Baik (<0.15)",
           size = 3.2, color = COLORS$warning) +
  annotate("text", x = 0.6, y = 0.255, label = "Batas Tinggi (0.25)",
           size = 3.2, color = COLORS$danger) +
  scale_fill_manual(values = MODEL_COLORS[names(ibs_results)]) +
  scale_y_continuous(limits = c(0, 0.30)) +
  labs(
    title    = "Integrated Brier Score per Model",
    subtitle = "Semakin rendah nilai IBS = semakin baik kalibrasi probabilitas",
    x = "Model", y = "IBS (Lower = Better)"
  ) +
  theme(legend.position = "none")

cat("\nπŸ“Œ INTERPRETASI IBS:\n")
## 
## πŸ“Œ INTERPRETASI IBS:
for (nm in names(ibs_results)) {
  ibs   <- ibs_results[nm]
  level <- if (is.na(ibs)) "tidak tersedia" else
            if (ibs < 0.15) "sangat baik (<0.15)" else
            if (ibs < 0.25) "baik (<0.25)"        else
                            "perlu perbaikan"
  cat(sprintf("   β€’ %s: IBS = %.4f β€” %s\n", nm, ifelse(is.na(ibs), 0, ibs), level))
}
##    β€’ RSF.RSF: IBS = 0.1002 β€” sangat baik (<0.15)
##    β€’ Cox.Cox: IBS = 0.0877 β€” sangat baik (<0.15)
##    β€’ GBSA.GBSA: IBS = 0.1074 β€” sangat baik (<0.15)

8.3 Tabel Perbandingan Akhir

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("PERBANDINGAN AKHIR SEMUA MODEL\n")
## PERBANDINGAN AKHIR SEMUA MODEL
cat("=" , strrep("=", 55), "\n")
## = =======================================================
comparison_df <- data.frame(
  Model         = names(cindex_results),
  `C-index (↑)` = as.numeric(cindex_results),
  `IBS (↓)`     = as.numeric(ibs_results[names(cindex_results)]),
  check.names   = FALSE
)
comparison_df$`Rank C-index`   <- rank(-comparison_df$`C-index (↑)`)
comparison_df$`Rank IBS`       <- rank(comparison_df$`IBS (↓)`)
comparison_df$`Rank Rata-rata` <- (comparison_df$`Rank C-index` +
                                    comparison_df$`Rank IBS`) / 2
comparison_df <- comparison_df[order(comparison_df$`Rank Rata-rata`), ]
comparison_df$`Peringkat Akhir` <- 1:nrow(comparison_df)

kable(comparison_df, align = "c", digits = 4,
      caption = "Perbandingan Akhir Semua Model Survival") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) |>
  row_spec(1, bold = TRUE, color = "white", background = "#06D6A0")
Perbandingan Akhir Semua Model Survival
Model C-index (↑) IBS (↓) Rank C-index Rank IBS Rank Rata-rata Peringkat Akhir
RSF 0.8393 NA 2 1 1.5 1
Cox 0.8768 NA 1 2 1.5 2
GBSA 0.8156 NA 3 3 3.0 3
best_overall <- comparison_df$Model[1]
cat(sprintf("\nπŸ† Model Terbaik Keseluruhan: %s\n", best_overall))
## 
## πŸ† Model Terbaik Keseluruhan: RSF
cat(sprintf("   C-index = %.4f | IBS = %.4f\n",
            cindex_results[best_overall], ibs_results[best_overall]))
##    C-index = 0.8393 | IBS = NA

9. Pentingnya Variabel (Feature Importance)

Permutation importance mengukur seberapa besar performa model menurun ketika nilai satu fitur diacak β€” semakin besar penurunan, semakin penting fitur tersebut.

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("FEATURE IMPORTANCE β€” RSF vs GBSA vs Cox\n")
## FEATURE IMPORTANCE β€” RSF vs GBSA vs Cox
cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("Menghitung variable importance...\n")
## Menghitung variable importance...
vi_rsf  <- tryCatch(vimp(rsf_model)$importance,  error = function(e) NULL)
vi_gbsa <- tryCatch(vimp(gbsa_model)$importance, error = function(e) NULL)
vi_cox  <- abs(coef(cox_model))

# Fallback: pakai var.used jika vimp gagal
if (is.null(vi_rsf))  vi_rsf  <- rsf_model$var.used
if (is.null(vi_gbsa)) vi_gbsa <- gbsa_model$var.used

# Nama fitur: RSF/GBSA pakai nama asli, Cox pakai make.names
# Gunakan nama asli sebagai label, petakan ke nama Cox untuk lookup
feat_clean <- make.names(features, unique = TRUE)

# Normalisasi ke 0-1
norm01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (diff(rng) == 0) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

feat_clean <- make.names(features, unique = TRUE)
rsf_idx   <- match(features, rsf_model$xvar.names)
gbsa_idx  <- match(features, gbsa_model$xvar.names)

rsf_vals  <- as.numeric(vi_rsf[rsf_idx]);   rsf_vals[is.na(rsf_vals)]   <- 0
gbsa_vals <- as.numeric(vi_gbsa[gbsa_idx]); gbsa_vals[is.na(gbsa_vals)] <- 0
cox_vals  <- as.numeric(vi_cox[feat_clean]); cox_vals[is.na(cox_vals)]  <- 0


fi_df <- data.frame(
  Fitur     = features,
  RSF_Perm  = norm01(rsf_vals),
  GBSA_Perm = norm01(gbsa_vals),
  Cox_Coef  = norm01(cox_vals),
  row.names = NULL
)
fi_df[is.na(fi_df)] <- 0
fi_df$Rata_rata <- rowMeans(fi_df[, c("RSF_Perm", "GBSA_Perm", "Cox_Coef")])
fi_df <- fi_df[order(-fi_df$Rata_rata), ]

kable(head(fi_df, 10)[, c("Fitur", "RSF_Perm", "GBSA_Perm", "Cox_Coef", "Rata_rata")],
      align = "c", digits = 4,
      caption  = "Top 10 Fitur β€” Rata-rata Importance",
      col.names = c("Fitur", "RSF (Perm.)", "GBSA (Perm.)", "Cox (|Coef|)", "Rata-rata")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Top 10 Fitur β€” Rata-rata Importance
Fitur RSF (Perm.) GBSA (Perm.) Cox (&#124;Coef&#124;) Rata-rata
11 product_payrollNo 1.0000 1.0000 0.7247 0.9082
2 csat_score 0.9091 0.8497 0.2852 0.6813
9 product_travel_expenseNo 0.1301 0.0159 1.0000 0.3820
7 minutes_customer_support 0.4106 0.1752 0.0173 0.2011
13 product_accountingNo 0.1589 0.0735 0.3643 0.1989
10 product_payrollFree-Trial 0.0579 0.0161 0.2734 0.1158
8 product_travel_expenseFree-Trial 0.0253 0.0000 0.2310 0.0854
25 company_sizeself-employed 0.0089 0.0174 0.1806 0.0690
23 company_size100-250 0.0072 0.0166 0.1798 0.0678
12 product_accountingFree-Trial 0.0000 0.0157 0.1650 0.0602
top_n  <- min(15, nrow(fi_df))
fi_top <- fi_df[1:top_n, ]

fi_long <- fi_top |>
  tidyr::pivot_longer(cols = c(RSF_Perm, GBSA_Perm, Cox_Coef),
                      names_to  = "Model",
                      values_to = "Importance") |>
  dplyr::mutate(Model = dplyr::recode(Model,
                                       RSF_Perm  = "RSF",
                                       GBSA_Perm = "GBSA",
                                       Cox_Coef  = "Cox"))

p_fi1 <- ggplot(fi_long,
                aes(x = reorder(Fitur, Importance), y = Importance, fill = Model)) +
  geom_col(position = "dodge", color = "white") +
  coord_flip() +
  scale_fill_manual(values = MODEL_COLORS) +
  labs(title = sprintf("Top %d Fitur β€” Perbandingan 3 Model", top_n),
       x = "Fitur", y = "Normalized Importance", fill = "Model")

p_fi2 <- ggplot(fi_top,
                aes(x = reorder(Fitur, Rata_rata), y = Rata_rata, fill = Rata_rata)) +
  geom_col(color = "white") +
  geom_text(aes(label = round(Rata_rata, 3)), hjust = -0.1, size = 3.5) +
  coord_flip() +
  scale_fill_gradient(low = COLORS$warning, high = COLORS$danger) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Rata-rata Importance (Semua Model)",
       x = "Fitur", y = "Rata-rata Normalized Importance") +
  theme(legend.position = "none")

gridExtra::grid.arrange(p_fi1, p_fi2, ncol = 2)

top3 <- head(fi_df$Fitur, 3)
cat("\nπŸ“Œ INTERPRETASI FEATURE IMPORTANCE:\n")
## 
## πŸ“Œ INTERPRETASI FEATURE IMPORTANCE:
cat(sprintf("   β€’ Top 3 fitur paling berpengaruh: %s\n", paste(top3, collapse = ", ")))
##    β€’ Top 3 fitur paling berpengaruh: product_payrollNo, csat_score, product_travel_expenseNo
cat("   β€’ Ketiga fitur ini konsisten muncul penting di KETIGA model.\n")
##    β€’ Ketiga fitur ini konsisten muncul penting di KETIGA model.
cat("   β€’ Rekomendasi bisnis: monitor pelanggan dengan nilai fitur-fitur ini yang rendah.\n")
##    β€’ Rekomendasi bisnis: monitor pelanggan dengan nilai fitur-fitur ini yang rendah.

Insight Bisnis: Tim customer success sebaiknya memprioritaskan monitoring pada pelanggan yang belum mengadopsi produk payroll/accounting dan yang memiliki csat_score rendah, karena kombinasi kedua faktor ini paling berasosiasi dengan risiko churn yang tinggi. Mendorong adopsi produk tambahan (cross-sell) dapat berfungsi sebagai strategi stickiness.


10. Prediksi (Predictions)

10.1 Actual vs Predicted Churn

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("PREDIKSI 10.1: ACTUAL vs PREDICTED (Semua Model)\n")
## PREDIKSI 10.1: ACTUAL vs PREDICTED (Semua Model)
cat("=" , strrep("=", 55), "\n")
## = =======================================================
times_pred   <- 0:12
actual_churn <- c(0, sapply(1:12, function(t) {
  sum(round(T_test) == t & E_test == 1)
}))

# Fungsi untuk mengekstrak prediksi churn per bulan dari model rfsrc
get_pred_churn_rfsrc <- function(model, X_new, times) {
  pred     <- predict(model, newdata = cbind(X_new, time = 0, status = 0))
  surv_mat <- pred$survival
  t_points <- pred$time.interest

  pred_churn <- numeric(length(times) + 1)
  pred_churn[1] <- 0
  for (i in seq_along(times)) {
    col_idx  <- which.min(abs(t_points - times[i]))
    avg_surv <- mean(surv_mat[, col_idx])
    if (i == 1) {
      pred_churn[i + 1] <- nrow(X_new) * (1 - avg_surv)
    } else {
      prev_col  <- which.min(abs(t_points - times[i - 1]))
      prev_surv <- mean(surv_mat[, prev_col])
      pred_churn[i + 1] <- nrow(X_new) * (prev_surv - avg_surv)
    }
  }
  pmax(pred_churn, 0)
}

pred_rsf_churn  <- get_pred_churn_rfsrc(rsf_model,  X_test, 1:12)
pred_gbsa_churn <- get_pred_churn_rfsrc(gbsa_model, X_test, 1:12)

# Prediksi Cox
X_test_cox_pred        <- X_test
names(X_test_cox_pred) <- make.names(names(X_test_cox_pred), unique = TRUE)
km_base      <- survfit(cox_model, newdata = X_test_cox_pred)
cox_surv_sum <- summary(km_base, times = 1:12)$surv
cox_avg_surv <- if (is.matrix(cox_surv_sum)) rowMeans(cox_surv_sum) else cox_surv_sum
pred_cox_churn <- c(0, nrow(X_test) * c(1 - cox_avg_surv[1], diff(-cox_avg_surv)))
pred_cox_churn <- pmax(pred_cox_churn, 0)

plot_df <- data.frame(
  Waktu  = times_pred,
  Aktual = actual_churn,
  RSF    = pred_rsf_churn,
  Cox    = pred_cox_churn,
  GBSA   = pred_gbsa_churn
)

fig <- plot_ly(plot_df, x = ~Waktu) |>
  add_trace(y = ~Aktual, name = "Aktual",
            type = "scatter", mode = "lines+markers",
            line   = list(color = "black", width = 3, dash = "dot"),
            marker = list(size = 8)) |>
  add_trace(y = ~RSF, name = "RSF",
            type = "scatter", mode = "lines+markers",
            line = list(color = MODEL_COLORS["RSF"], width = 2)) |>
  add_trace(y = ~Cox, name = "Cox",
            type = "scatter", mode = "lines+markers",
            line = list(color = MODEL_COLORS["Cox"], width = 2)) |>
  add_trace(y = ~GBSA, name = "GBSA",
            type = "scatter", mode = "lines+markers",
            line = list(color = MODEL_COLORS["GBSA"], width = 2)) |>
  layout(
    title     = "Perbandingan Actual vs Predicted Churn per Bulan",
    xaxis     = list(title = "Waktu (Bulan)"),
    yaxis     = list(title = "Jumlah Pelanggan Churn"),
    hovermode = "x unified",
    template  = "plotly_white"
  )
fig
cat("\nπŸ“Œ INTERPRETASI:\n")
## 
## πŸ“Œ INTERPRETASI:
cat("   β€’ Grafik membandingkan jumlah churn aktual (hitam) vs prediksi setiap model.\n")
##    β€’ Grafik membandingkan jumlah churn aktual (hitam) vs prediksi setiap model.
cat("   β€’ Semakin dekat garis prediksi dengan garis aktual, semakin akurat model.\n")
##    β€’ Semakin dekat garis prediksi dengan garis aktual, semakin akurat model.

10.2 Distribusi Kelompok Risiko

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("PREDIKSI 10.2: DISTRIBUSI KELOMPOK RISIKO\n")
## PREDIKSI 10.2: DISTRIBUSI KELOMPOK RISIKO
cat("=" , strrep("=", 55), "\n")
## = =======================================================
pred_risk_all <- predict(rsf_model, newdata = cbind(X_test, time = 0, status = 0))
risk_scores   <- pred_risk_all$predicted

p33 <- quantile(risk_scores, 0.33)
p66 <- quantile(risk_scores, 0.66)

risk_label <- dplyr::case_when(
  risk_scores <= p33 ~ "Low Risk",
  risk_scores <= p66 ~ "Medium Risk",
  TRUE               ~ "High Risk"
)

risk_df <- data.frame(
  Risk_Score = risk_scores,
  Kelompok   = factor(risk_label, levels = c("Low Risk", "Medium Risk", "High Risk"))
)

p_risk1 <- ggplot(risk_df, aes(x = Risk_Score)) +
  geom_histogram(bins = 30, fill = COLORS$primary, color = "white", alpha = 0.85) +
  geom_vline(xintercept = p33, linetype = "dashed",
             color = COLORS$warning, linewidth = 1.2) +
  geom_vline(xintercept = p66, linetype = "dashed",
             color = COLORS$danger, linewidth = 1.2) +
  annotate("text", x = p33, y = Inf, vjust = 1.5, hjust = -0.1,
           label = paste("P33 =", round(p33, 2)),
           size = 3.5, color = COLORS$warning, fontface = "bold") +
  annotate("text", x = p66, y = Inf, vjust = 1.5, hjust = -0.1,
           label = paste("P66 =", round(p66, 2)),
           size = 3.5, color = COLORS$danger, fontface = "bold") +
  labs(title = "Distribusi Risk Score (RSF)", x = "Risk Score", y = "Frekuensi")

p_risk2 <- ggplot(risk_df, aes(x = Risk_Score, fill = Kelompok)) +
  geom_histogram(bins = 30, color = "white", alpha = 0.8, position = "identity") +
  scale_fill_manual(values = c("Low Risk"    = RISK_COLORS["low"],
                                "Medium Risk" = RISK_COLORS["medium"],
                                "High Risk"   = RISK_COLORS["high"])) +
  labs(title = "Risk Score per Kelompok Risiko",
       x = "Risk Score", y = "Frekuensi", fill = "Kelompok")

gridExtra::grid.arrange(p_risk1, p_risk2, ncol = 2)

cat("\nπŸ“Œ INTERPRETASI:\n")
## 
## πŸ“Œ INTERPRETASI:
for (grp in c("Low Risk", "Medium Risk", "High Risk")) {
  n_grp <- sum(risk_label == grp)
  cat(sprintf("   β€’ %-15s: %3d pelanggan (%.1f%%)\n",
              grp, n_grp, n_grp / length(risk_label) * 100))
}
##    β€’ Low Risk       : 198 pelanggan (33.0%)
##    β€’ Medium Risk    : 198 pelanggan (33.0%)
##    β€’ High Risk      : 204 pelanggan (34.0%)

10.3 Kurva Survival per Kelompok Risiko

cat("=" , strrep("=", 55), "\n")
## = =======================================================
cat("PREDIKSI 10.3: KURVA SURVIVAL PER KELOMPOK RISIKO\n")
## PREDIKSI 10.3: KURVA SURVIVAL PER KELOMPOK RISIKO
cat("=" , strrep("=", 55), "\n")
## = =======================================================
set.seed(42)
times_plot <- seq(1, 11, by = 0.5)

fig2 <- plot_ly()

group_info <- list(
  "Low Risk"    = list(label = "Low Risk",    color = RISK_COLORS["low"]),
  "Medium Risk" = list(label = "Medium Risk", color = RISK_COLORS["medium"]),
  "High Risk"   = list(label = "High Risk",   color = RISK_COLORS["high"])
)

for (grp in names(group_info)) {
  grp_idx   <- which(risk_label == grp)
  churned_i <- grp_idx[E_test[grp_idx] == 1]
  if (length(churned_i) == 0) next

  k <- sample(churned_i, 1)
  pred_k <- predict(rsf_model,
                    newdata = cbind(X_test[k, , drop = FALSE], time = 0, status = 0))
  surv_probs  <- pred_k$survival[1, ]
  t_points    <- pred_k$time.interest
  surv_interp <- approx(t_points, surv_probs, xout = times_plot, rule = 2)$y

  fig2 <- fig2 |>
    add_trace(
      x     = times_plot,
      y     = surv_interp,
      name  = group_info[[grp]]$label,
      type  = "scatter",
      mode  = "lines",
      line  = list(color = group_info[[grp]]$color, width = 3),
      hovertemplate = "Waktu: %{x}<br>P(Survive): %{y:.3f}<extra></extra>"
    )
}

fig2 |>
  layout(
    title    = "Kurva Survival Function per Kelompok Risiko",
    xaxis    = list(title = "Waktu (Bulan)"),
    yaxis    = list(title = "Probabilitas Bertahan S(t)", range = c(0, 1.05)),
    hovermode = "x unified",
    template  = "plotly_white"
  )
cat("\nπŸ“Œ INTERPRETASI KURVA SURVIVAL:\n")
## 
## πŸ“Œ INTERPRETASI KURVA SURVIVAL:
cat("   β€’ Risiko Tinggi  (πŸ”΄ Merah) : Kurva turun cepat β†’ churn lebih awal.\n")
##    β€’ Risiko Tinggi  (πŸ”΄ Merah) : Kurva turun cepat β†’ churn lebih awal.
cat("   β€’ Risiko Sedang  (🟑 Kuning): Kurva turun sedang β†’ churn di titik tengah.\n")
##    β€’ Risiko Sedang  (🟑 Kuning): Kurva turun sedang β†’ churn di titik tengah.
cat("   β€’ Risiko Rendah  (🟒 Hijau) : Kurva tetap tinggi β†’ pelanggan bertahan lama.\n")
##    β€’ Risiko Rendah  (🟒 Hijau) : Kurva tetap tinggi β†’ pelanggan bertahan lama.
cat("   β€’ Insight Bisnis: Pelanggan risiko tinggi harus mendapat intervensi retensi\n")
##    β€’ Insight Bisnis: Pelanggan risiko tinggi harus mendapat intervensi retensi
cat("     SEGERA (bulan pertama!), sedangkan risiko rendah cukup dipantau berkala.\n")
##      SEGERA (bulan pertama!), sedangkan risiko rendah cukup dipantau berkala.

11. Kesimpulan (Conclusion)

cat("=", strrep("=", 65), "\n")
## = =================================================================
cat("RINGKASAN LENGKAP HASIL ANALISIS SURVIVAL - CHURN PREDICTION\n")
## RINGKASAN LENGKAP HASIL ANALISIS SURVIVAL - CHURN PREDICTION
cat("=", strrep("=", 65), "\n")
## = =================================================================
cat("\n1️⃣  DATA:\n")
## 
## 1️⃣  DATA:
cat(sprintf("   β€’ Dataset: %s pelanggan, %d fitur (setelah encoding)\n",
            scales::comma(N), length(features)))
##    β€’ Dataset: 2,000 pelanggan, 25 fitur (setelah encoding)
cat(sprintf("   β€’ Churn rate: %.1f%%\n", mean(E_all) * 100))
##    β€’ Churn rate: 46.6%
cat("   β€’ Tidak ada missing values atau duplikat\n")
##    β€’ Tidak ada missing values atau duplikat
cat("\n2️⃣  PERBANDINGAN RASIO DATA SPLIT:\n")
## 
## 2️⃣  PERBANDINGAN RASIO DATA SPLIT:
for (sn in names(split_cindex)) {
  cat(sprintf("   β€’ %s: C-index RSF β‰ˆ %.4f\n", sn, split_cindex[sn]))
}
##    β€’ Split 60/40: C-index RSF β‰ˆ 0.8279
##    β€’ Split 70/30: C-index RSF β‰ˆ 0.8381
##    β€’ Split 80/20: C-index RSF β‰ˆ 0.8358
cat("\n3️⃣  PERBANDINGAN MODEL:\n")
## 
## 3️⃣  PERBANDINGAN MODEL:
for (i in 1:nrow(comparison_df)) {
  row <- comparison_df[i, ]
  cat(sprintf("   #%d %-6s: C-index = %.4f | IBS = %.4f\n",
              row$`Peringkat Akhir`, row$Model,
              row$`C-index (↑)`, row$`IBS (↓)`))
}
##    #1 RSF   : C-index = 0.8393 | IBS = NA
##    #2 Cox   : C-index = 0.8768 | IBS = NA
##    #3 GBSA  : C-index = 0.8156 | IBS = NA
cat("\n4️⃣  FITUR TERPENTING:\n")
## 
## 4️⃣  FITUR TERPENTING:
for (i in 1:min(5, nrow(fi_df))) {
  cat(sprintf("   %d. %-40s (avg importance: %.3f)\n",
              i, fi_df$Fitur[i], fi_df$Rata_rata[i]))
}
##    1. product_payrollNo                        (avg importance: 0.908)
##    2. csat_score                               (avg importance: 0.681)
##    3. product_travel_expenseNo                 (avg importance: 0.382)
##    4. minutes_customer_support                 (avg importance: 0.201)
##    5. product_accountingNo                     (avg importance: 0.199)
cat("\n5️⃣  REKOMENDASI BISNIS:\n")
## 
## 5️⃣  REKOMENDASI BISNIS:
cat("   β€’ Monitor dan lakukan intervensi segera untuk pelanggan risiko TINGGI.\n")
##    β€’ Monitor dan lakukan intervensi segera untuk pelanggan risiko TINGGI.
cat("   β€’ Tingkatkan csat_score melalui peningkatan kualitas produk & support.\n")
##    β€’ Tingkatkan csat_score melalui peningkatan kualitas produk & support.
cat("   β€’ Dorong adopsi produk payroll & accounting untuk menaikkan stickiness.\n")
##    β€’ Dorong adopsi produk payroll & accounting untuk menaikkan stickiness.
cat("   β€’ Gunakan model untuk melakukan scoring pelanggan baru setiap bulan.\n")
##    β€’ Gunakan model untuk melakukan scoring pelanggan baru setiap bulan.
cat(sprintf("\n6️⃣  KESIMPULAN AKHIR:\n"))
## 
## 6️⃣  KESIMPULAN AKHIR:
cat(sprintf("   Model %s memberikan performa terbaik secara keseluruhan.\n", best_overall))
##    Model RSF memberikan performa terbaik secara keseluruhan.
cat(sprintf("   C-index = %.4f | IBS = %.4f\n",
            cindex_results[best_overall], ibs_results[best_overall]))
##    C-index = 0.8393 | IBS = NA
cat("   Model ini sangat layak diimplementasikan dalam sistem retensi pelanggan.\n")
##    Model ini sangat layak diimplementasikan dalam sistem retensi pelanggan.
cat("   Perusahaan dapat memprediksi dengan akurat KAPAN pelanggan akan churn\n")
##    Perusahaan dapat memprediksi dengan akurat KAPAN pelanggan akan churn
cat("   dan mengambil tindakan proaktif sebelum kejadian tersebut berlangsung.\n")
##    dan mengambil tindakan proaktif sebelum kejadian tersebut berlangsung.

Ringkasan tabel akhir:

Tabel Perbandingan Final Semua Model
Model C-index (↑) IBS (↓) Rank C-index Rank IBS Rank Rata-rata Peringkat Akhir
RSF 0.8393 NA 2 1 1.5 1
Cox 0.8768 NA 1 2 1.5 2
GBSA 0.8156 NA 3 3 3.0 3

12. Referensi (References)

  1. Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457–481.
  2. Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society, 34(2), 187–220.
  3. Ishwaran, H., Kogalur, U. B., Blackstone, E. H., & Lauer, M. S. (2008). Random survival forests. The Annals of Applied Statistics, 2(3), 841–860.
  4. Hothorn, T., BΓΌhlmann, P., Dudoit, S., Molinaro, A., & van der Laan, M. J. (2006). Survival ensembles. Biostatistics, 7(3), 355–373.
  5. Kassambara, A., Kosinski, M., & Biecek, P. (2021). survminer: Drawing Survival Curves using ggplot2. R package version 0.4.9.
  6. Ishwaran, H., & Kogalur, U. B. (2023). randomForestSRC: Fast Unified Random Forests for Survival, Regression, and Classification. R package.
  7. Bain & Company. Customer Loyalty in Telecom: Facts & Figures. https://www.bain.com
  8. PySurvival Tutorial β€” Churn Prediction: https://www.pysurvival.io/tutorials/churn.html

πŸ“ Catatan Teknis: Analisis ini menggunakan paket randomForestSRC sebagai implementasi Random Survival Forest dan Gradient Boosting Survival di R, yang merupakan padanan dari scikit-survival pada Python. Semua metodologi dan logika matematika identik β€” hanya berbeda ekosistem bahasa pemrogramannya.