Survival Models
Mid Term - Week 8
Email           : vanessasupit0910@gmail.com      | je070601@gmail.com
RPubs        : https://rpubs.com/vanessasupit/   | https://rpubs.com/invokerarts/
Github       : https://github.com/vanessasupit/   | https://github.com/invokerarts
Majors       : Business Statistics
Address    : ARA Center, Matana University Tower Jl. CBD Barat Kav, RT.1, Curug Sangereng,
            Kelapa Dua, Tangerang, Banten 15810.
library(dplyr)
library(ggplot2)
library(corrplot)
library(ggpubr)
library(tidyverse)
library(lubridate)
library(survival)
library(KMsurv)
library(skimr)
library(knitr)
library(survminer)
library(fastDummies)1 Pendahuluan
Perpindahan pelanggan (Churn) alias persentase pelanggan yang berhenti menggunakan produk atau layanan perusahaan adalah salah satu metrik yang terpenting untuk bisnis karena lebih susah mempertahankan customer lama dibandingkan mencari customer baru. Menurut study Bian & Company, pelanggan yang sudah ada cendrung membeli lebih banyak dari sebuah perusahaan. Customer Churn terjadi ketika pelanggan perusahaan pindah secara bertahap selama periode waktu karena berbagai alasan seperti ketidakpuasan dari produk atau layanan, penawaran kompetitif yang lebih baik, penurunan layanan sementara karena masalah teknis, tidak merasakan nilai layanan terhadap biaya, kematian pelanggan. , relokasi pelanggan, dll. Perpindahan pelanggan juga dikenal sebagai pergantian pelanggan atau pembelotan pelanggan atau kehilangan klien.
Memprediksi churn pelanggan dapat membantu perusahaan untuk meningkatkan retensi pelanggan karena membuat perusahaan proaktif dan perusahaan dapat merumuskan strategi yang lebih efektif untuk mempertahankan pelanggan. Ini dapat membantu perusahaan untuk melakukan rencana khusus yang lebih tepat untuk mempertahankan pelanggan. Dengan menggunakan Survival Analysis, tidak hanya perusahaan yang dapat memprediksi apakah pelanggan akan melakukan churn, tetapi juga kapan peristiwa itu mungkin terjadi.
2 Data
2.1 Sumber Data
Data diambil dari kaggle yang dapat diakses dengan cara klik disini
2.2 Variabel Data
| Variabel | Deskripsi |
|---|---|
| customerID | Customer ID |
| gender | Jenis Kelamin Customer (male or a female) |
| SeniorCitizen | Apakah customer merupakan lanjut usia (1,0) |
| Partner | Apakah customer memiliki pasangan (Yes, No) |
| Dependents | Apakah customer memiliki tanggungan (Yes, No) |
| Tenure | Lama dalam bulan customer tetap menjadi pelanggan di perusahaan |
| PhoneService | Apakah customer memiliki layanan telepon (Yes, No) |
| MultipleLines | Apakah customer memiliki lebih dari 1 layanan ato tidak (Yes, No, No phone service) |
| InternetService | Internet Provider Customer (DSL, Fiber Optic, No) |
| OnlineSecurity | Apakah customer memiliki online security (Yes, No, No phone service) |
| OnlineBackup | Apakah customer berlangganan online backup service lain yang disediakan oleh perusahaan (Yes, No, No phone service) |
| DeviceProtection | Apakah customer berlangganan paket perlindungan yang disediakan oleh perusahaan(Yes, No, No phone service) |
| TechSupport | Apakah customer berlangganan paket technical support yang disediakan oleh perusahaan (Yes, No, No phone service) |
| StreamingTV | Apakah customer menggunakan Internet Service mereka untuk menonton TV dari pihak ke-3 (Yes, No, No phone service) |
| StreamingMovies | Apakah customer menggunakan Internet Service mereka untuk menonton film dari pihak ke-3 (Yes, No, No phone service) |
| Contract | Apa tipe kontrak customer (Month-to-Month, One Year, Two Year) |
| PaperlessBilling | Apakah customer memilih paperless billing (Yes, No) |
| Payment Method | Bagaimana cara customer membayar tagihan (Bank Transfer[Auto], Credit Card[Auto], Electronic Check, Mailed Check) |
| MonthlyCharges | Menandakan biaya bulanan customer untuk semua service yang disediakan oleh perusahaan |
| TotalCharges | Menandakan total biaya customer untuk semua service yang disediakan oleh perusahaan |
| Churn | Menandakan keadaan customer (1,0) |
2.3 Import Data
data <- read.csv("Data.csv")
str(data)## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
data <- data %>% mutate_if(is.character, as.factor)
data$SeniorCitizen <- as.factor(data$SeniorCitizen)
str(data)## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
datasummary(data)## customerID gender SeniorCitizen Partner Dependents
## 0002-ORFBO: 1 Female:3488 0:5901 No :3641 No :4933
## 0003-MKNFE: 1 Male :3555 1:1142 Yes:3402 Yes:2110
## 0004-TLHLJ: 1
## 0011-IGKFF: 1
## 0013-EXCHZ: 1
## 0013-MHZWF: 1
## (Other) :7037
## tenure PhoneService MultipleLines InternetService
## Min. : 0.00 No : 682 No :3390 DSL :2421
## 1st Qu.: 9.00 Yes:6361 No phone service: 682 Fiber optic:3096
## Median :29.00 Yes :2971 No :1526
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## OnlineSecurity OnlineBackup
## No :3498 No :3088
## No internet service:1526 No internet service:1526
## Yes :2019 Yes :2429
##
##
##
##
## DeviceProtection TechSupport
## No :3095 No :3473
## No internet service:1526 No internet service:1526
## Yes :2422 Yes :2044
##
##
##
##
## StreamingTV StreamingMovies Contract
## No :2810 No :2785 Month-to-month:3875
## No internet service:1526 No internet service:1526 One year :1473
## Yes :2707 Yes :2732 Two year :1695
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges
## No :2872 Bank transfer (automatic):1544 Min. : 18.25
## Yes:4171 Credit card (automatic) :1522 1st Qu.: 35.50
## Electronic check :2365 Median : 70.35
## Mailed check :1612 Mean : 64.76
## 3rd Qu.: 89.85
## Max. :118.75
##
## TotalCharges Churn
## Min. : 18.8 No :5174
## 1st Qu.: 401.4 Yes:1869
## Median :1397.5
## Mean :2283.3
## 3rd Qu.:3794.7
## Max. :8684.8
## NA's :11
3 EDA
3.1 Penanganan Data Duplikat
check.duplicate <- data.frame(
row_of_data = data %>% nrow (),
row_of_unique.data = data %>% distinct() %>% nrow())
check.duplicateJadi tidak ada data duplikat
3.2 Penanganan Data Hilang
Periksa jumlah baris NA jika jumlahnya relatif kecil maka abaikan baris tersebut dari analisis
ColNA <- colnames(data)[ apply(data, 2, anyNA) ] # Melihat Kolom dengan daya yang hilang
ColNA## [1] "TotalCharges"
data <- na.omit(data)
dataJadi data berkurang dari 7043 menjadi 7032 (11 data memiliki variabel yang hilang)
3.3 Penanganan Data Outlier
bp1 <- data %>%
select(MonthlyCharges) %>%
ggplot(aes(MonthlyCharges)) + geom_boxplot()
bp2 <- data %>%
select(TotalCharges) %>%
ggplot(aes(TotalCharges)) + geom_boxplot()
bp3 <- data %>%
select(tenure) %>%
ggplot(aes(tenure)) + geom_boxplot()
subplot4 <- ggarrange(bp1, bp2, bp3,
ncol = 1, nrow = 3,
common.legend = TRUE, legend = "bottom")
subplot43.4 Visualisasi Data Kategorikal
# Membuat ggplot
ggpGender <- ggplot(data, aes(x = gender, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpSenior <- ggplot(data, aes(x = SeniorCitizen, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpPartner <- ggplot(data, aes(x = Partner, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpDependents <- ggplot(data, aes(x = Dependents, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpPhoneService <- ggplot(data, aes(x = PhoneService, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpLines <- ggplot(data, aes(x = MultipleLines, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpInternet <- ggplot(data, aes(x = InternetService, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpSecurity <- ggplot(data, aes(x = OnlineSecurity, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpBackup <- ggplot(data, aes(x = OnlineBackup, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpProtection <- ggplot(data, aes(x = DeviceProtection, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpSupport <- ggplot(data, aes(x = TechSupport, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpTV <- ggplot(data, aes(x = StreamingTV, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpMovies <- ggplot(data, aes(x = StreamingMovies, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpContract <- ggplot(data, aes(x = Contract, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpBilling <- ggplot(data, aes(x = PaperlessBilling, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpMethod <- ggplot(data, aes(x = PaymentMethod, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single")) +
theme(axis.text.x = element_text(angle=-45, vjust=1, hjust=0))
ggpChurn <- ggplot(data, aes(x = Churn, fill = Churn)) + theme_minimal() +
geom_bar(alpha = 0.8, position = position_dodge(preserve = "single"))# Menggabungkan ggplot
figure1 <- ggarrange(ggpGender, ggpSenior, ggpPartner, ggpDependents,
labels = c("Gender", "Senior", "Partner", "Dependents"),
ncol = 2, nrow = 2)
figure2 <- ggarrange(ggpPhoneService, ggpLines, ggpInternet, ggpSecurity,
labels = c("PhoneService", "Lines", "Internet", "Security"),
ncol = 2, nrow = 2)
figure3 <- ggarrange(ggpBackup, ggpProtection, ggpSupport, ggpTV,
labels = c("Backup", "Protection", "Support", "TV"),
ncol = 2, nrow = 2)
figure4 <- ggarrange(ggpMovies, ggpContract, ggpBilling,ggpMethod,
labels = c("Movies", "Contract", "Billing", "Method"),
ncol = 2, nrow = 2)3.5 Visualisasi Data Numerik
ggpTenure <- ggplot(data, aes(x=tenure, fill = Churn, color=Churn)) +
geom_histogram(alpha=0.5)
ggpMonthly <- ggplot(data, aes(x=MonthlyCharges, fill = Churn, color=Churn)) +
geom_histogram(alpha=0.5)
ggpTotal <- ggplot(data, aes(x=TotalCharges, fill = Churn, color=Churn)) +
geom_histogram(alpha=0.5)
figure5 <- ggarrange(ggpTenure, ggpMonthly, ggpTotal,
labels = c("Tenure", "Monthly", "Total"),
ncol = 1, nrow = 3)
figure53.6 Korelasi
DataNum <- dplyr::select_if(data, is.numeric)
Correlation <- cor(DataNum) %>% corrplot(method = "color")4 Survival Analysis
4.1 Kaplan-Meier Survival Function
data$gender_d <- ifelse(data$gender != "Male", 1,0 )
data$Partner_d <- ifelse(data$Partner == "Yes", 1,0 )
data$Dependents_d <- ifelse(data$Dependents == "Yes", 1,0 )
data$PaperlessBilling_d <- ifelse(data$PaperlessBilling == "Yes", 1,0 )
data$PhoneService_d <- ifelse(data$PhoneService == "Yes", 1,0 )
data$MultipleLines_d <- ifelse(data$MultipleLines == "Yes", 1,0 )
data$OnlineSecurity_d <- ifelse(data$OnlineSecurity == "Yes", 1,0 )
data$OnlineBackup_d <- ifelse(data$OnlineBackup == "Yes", 1,0 )
data$DeviceProtection_d <- ifelse(data$DeviceProtection == "Yes", 1,0 )
data$TechSupport_d <- ifelse(data$TechSupport == "Yes", 1,0 )
data$StreamingTV_d <- ifelse(data$StreamingTV == "Yes", 1,0 )
data$StreamingMovies_d <- ifelse(data$StreamingMovies == "Yes", 1,0 )
data$has_InternetService_d <- ifelse(data$InternetService != "No", 1, 0)
data_tres_levels <- data %>%
select(InternetService,Contract,PaymentMethod)
data_tres_levels_d <-dummy_cols(data_tres_levels)
data_tres_levels_d<- data_tres_levels_d %>%
select(-c(InternetService,Contract,PaymentMethod))
attach(data)
data_final <- data.frame(customerID,gender_d,SeniorCitizen,Partner_d,Dependents_d,tenure,
PaperlessBilling_d,PhoneService_d,MultipleLines_d,
OnlineSecurity_d,OnlineBackup_d,DeviceProtection_d,
TechSupport_d,StreamingTV_d,StreamingMovies_d,has_InternetService_d,
data_tres_levels_d,MonthlyCharges,TotalCharges,Churn)data$Churn <- ifelse(data$Churn=='Yes',1,0 )
data_surv <- Surv(data$tenure, data$Churn)
plot(data_surv, xlab="Months", ylab="Survival Function", main="Kaplan-Meier's Survival Function", col=1:3)fit <- survfit(data_surv ~ gender_d, data = data)
ggsurvplot(fit, data = data,
pval = TRUE,
conf.int = TRUE,
)fit <- survfit(data_surv ~ SeniorCitizen, data = data)
ggsurvplot(fit, data = data,
pval = TRUE,
conf.int = TRUE,
)4.2 Kaplan-Meier Life Table
m1 <- survfit(Surv(tenure, Churn) ~ gender, data = data)
sm1 <- surv_summary(m1, data = data)
sm1ggsurvplot(m1, data = data,
pval = TRUE, conf.int = TRUE,
risk.table = "abs_pct",
risk.table.col = "strata",
linetype= "strata",
break.time.by = 12,
surv.median.line = "hv",
ncensor.plot = TRUE,
ggtheme = theme_bw(),
palette = c("#2E9FDF", "#9B59B6"))5 Kesimpulan
Data yang kita gunakan merupakan data customer berlangganan suatu jasa dalam suatu perusahaan dengan waktu dalam bulan (0-72 bulan). Data awal yang kita miliki adalah 7043 dan dilakukan cleaning sehingga data menjadi 7032 (memiliki data hilang). Karena data cukup banyak dan memiliki nilai hilang sedikit maka kita dapat melakukan penghapusan data. Data yang kita gunakan juga tidak memiliki outliers sehingga tidak perlu dilakukan penanganan data outlier. Analysis survival kelompok kami menggunakan Metode Kaplan-Meier Life Tavel yang mana menghasilkan hasil sebagai berikut.
data.frame(subset(sm1, time == 60))Dari grafik Kaplan-Meier diatas, dapat disumpulkan bahwa pada tahun ke 5 (60 bulan), peluang bertahannya customer di perusahaan berdasarkan variabel gender adalah
- 66.03705% untuk gender Perempuan
- 66.85883% untuk gender Laki-laki