Dataset ini bercerita tentang Memprediksi perilaku untuk mempertahankan pelanggan. Dapat dilakukan untuk menganalisis semua data pelanggan yang relevan dan mengembangkan program retensi pelanggan.
Setiap baris mewakili pelanggan, setiap kolom berisi atribut pelanggan yang dijelaskan pada kolom Metadata.
Kumpulan data mencakup informasi tentang:
Pelanggan yang keluar dalam sebulan terakhir – kolomnya disebut dengan Churn. Layanan yang telah didaftarkan oleh setiap pelanggan – telepon, banyak saluran, internet, keamanan online, pencadangan online, perlindungan perangkat, dukungan teknis, serta streaming TV dan film Informasi akun pelanggan – berapa lama mereka menjadi pelanggan, kontrak, metode pembayaran, penagihan tanpa kertas, tagihan bulanan, dan total tagihan Informasi demografis tentang pelanggan – jenis kelamin, rentang usia, dan apakah mereka memiliki mitra dan tanggungan inspirasi untuk menjelajahi model jenis ini dan mempelajari lebih lanjut tentang subjeknya.
churn <- read.csv("data_input/churn.csv", stringsAsFactors = T)
churn_clean <- churn %>%
select(-customerID) %>%
mutate(
SeniorCitizen = as.factor(SeniorCitizen),
MonthlyCharges = as.numeric(MonthlyCharges),
TotalCharges = as.numeric(TotalCharges)
)
glimpse(churn_clean)
## Rows: 7,043
## Columns: 20
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines <fct> No phone service, No, No, No phone service, No, Yes, …
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No …
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No in…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No inte…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No int…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
Dilakukan data wrangling dengan membuang kolom customerID, mengubah tipe data SeniorCitizen menjadi factor, mengubah tipe data MonthlyCharge dan TotalCharges menjadi numeric.
head(churn_clean)
levels(churn_clean$Churn)
## [1] "No" "Yes"
Memeriksa level pada variabel target Churn dengan jumlah levels 2 yaitu “No” dan “Yes”
Dilakukan model logistic linear model tanpa prediktor
churn_null <- glm(formula = Churn ~1,
data = churn_clean,
family = "binomial")
summary(churn_null)
##
## Call:
## glm(formula = Churn ~ 1, family = "binomial", data = churn_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.01824 0.02699 -37.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8150.1 on 7042 degrees of freedom
## Residual deviance: 8150.1 on 7042 degrees of freedom
## AIC: 8152.1
##
## Number of Fisher Scoring iterations: 4
didapat nilai intercept = -1.018
table(churn_clean$Churn)
##
## No Yes
## 5174 1869
Memeriksa jumlah nilai pada target Churn
Dilakukan logistic liner Model dengan numerical prediktor
churn_num <- glm(formula = Churn ~TotalCharges,
data = churn_clean,
family = "binomial")
summary(churn_num)
##
## Call:
## glm(formula = Churn ~ TotalCharges, family = "binomial", data = churn_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.448e-01 3.720e-02 -14.65 <2e-16 ***
## TotalCharges -2.362e-04 1.457e-05 -16.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8143.4 on 7031 degrees of freedom
## Residual deviance: 7833.7 on 7030 degrees of freedom
## (11 observations deleted due to missingness)
## AIC: 7837.7
##
## Number of Fisher Scoring iterations: 4
didapatkan nilai Intercept = -0.544 dan TotalCharge -0.00023
colSums(is.na(churn_clean))
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Churn
## 0 0 11 0
memeriksa kolom yang tidak memiliki nilai.
churn_clean2 <- churn_clean %>%
drop_na()
anyNA(churn_clean2)
## [1] FALSE
colSums(is.na(churn_clean2))
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Churn
## 0 0 0 0
sudah dilakukan penghapusan nilai yang kosong
melakukan logistic linear Model dengan semua predictor
churn_all <- glm(formula = Churn ~.,
data = churn_clean2,
family = "binomial")
summary(churn_all)
##
## Call:
## glm(formula = Churn ~ ., family = "binomial", data = churn_clean2)
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.165e+00 8.151e-01 1.430 0.15284
## genderMale -2.183e-02 6.480e-02 -0.337 0.73619
## SeniorCitizen1 2.168e-01 8.453e-02 2.564 0.01033 *
## PartnerYes -3.840e-04 7.783e-02 -0.005 0.99606
## DependentsYes -1.485e-01 8.973e-02 -1.655 0.09796 .
## tenure -6.059e-02 6.236e-03 -9.716 < 2e-16 ***
## PhoneServiceYes 1.715e-01 6.487e-01 0.264 0.79153
## MultipleLinesNo phone service NA NA NA NA
## MultipleLinesYes 4.484e-01 1.773e-01 2.530 0.01142 *
## InternetServiceFiber optic 1.747e+00 7.981e-01 2.190 0.02855 *
## InternetServiceNo -1.786e+00 8.073e-01 -2.213 0.02691 *
## OnlineSecurityNo internet service NA NA NA NA
## OnlineSecurityYes -2.054e-01 1.787e-01 -1.150 0.25031
## OnlineBackupNo internet service NA NA NA NA
## OnlineBackupYes 2.604e-02 1.754e-01 0.148 0.88197
## DeviceProtectionNo internet service NA NA NA NA
## DeviceProtectionYes 1.474e-01 1.764e-01 0.836 0.40339
## TechSupportNo internet service NA NA NA NA
## TechSupportYes -1.805e-01 1.806e-01 -0.999 0.31759
## StreamingTVNo internet service NA NA NA NA
## StreamingTVYes 5.905e-01 3.263e-01 1.810 0.07035 .
## StreamingMoviesNo internet service NA NA NA NA
## StreamingMoviesYes 5.993e-01 3.267e-01 1.834 0.06658 .
## ContractOne year -6.608e-01 1.076e-01 -6.142 8.15e-10 ***
## ContractTwo year -1.357e+00 1.764e-01 -7.691 1.46e-14 ***
## PaperlessBillingYes 3.424e-01 7.450e-02 4.596 4.31e-06 ***
## PaymentMethodCredit card (automatic) -8.779e-02 1.141e-01 -0.770 0.44156
## PaymentMethodElectronic check 3.045e-01 9.450e-02 3.222 0.00127 **
## PaymentMethodMailed check -5.759e-02 1.149e-01 -0.501 0.61627
## MonthlyCharges -4.034e-02 3.176e-02 -1.270 0.20392
## TotalCharges 3.289e-04 7.063e-05 4.657 3.20e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8143.4 on 7031 degrees of freedom
## Residual deviance: 5826.3 on 7008 degrees of freedom
## AIC: 5874.3
##
## Number of Fisher Scoring iterations: 6
set.seed(111)
# index sampling
index <- sample(x = nrow(churn_clean2),size = 0.8*nrow(churn_clean2))
# splitting data
churn_train <- churn_clean2[index,] #ambil data 80%
churn_test <- churn_clean2[-index,] #ambil data 20%
melakukan sampling index serta splitting data dengan pembagian data train sebanyak 80% dan data train 20%
table(churn_train$Churn) %>% prop.table()
##
## No Yes
## 0.7333333 0.2666667
memeriksa jumlah ratio pada nilai Churn dan didapat nilai No sebesar 73,42% dan nilai Yes sebesar 26,57%
# prediktor
churn_train_x <- churn_train %>% select_if(is.numeric)
churn_test_x <- churn_test %>% select_if(is.numeric)
# target
churn_train_y <- churn_train[,"Churn"]
churn_test_y <- churn_test[,"Churn"]
# scaling data
# train
churn_train_xs <- scale(x=churn_train_x)
# test
churn_test_xs <- scale(x=churn_test_x,
center = attr(churn_train_xs,"scaled:center") ,
scale= attr(churn_train_xs,"scaled:scale"))
melakukan prediksi dengan menentukan nilai k yang optimum
# find optimum k
sqrt(nrow(churn_train_xs))
## [1] 75
# package untuk fungsi `knn()`
churn_pred <- knn(train = churn_train_xs,
test = churn_test_xs,
cl = churn_train_y,
k = 75)
head(churn_pred)
## [1] No No No No Yes No
## Levels: No Yes
# confusion matrix
confusionMatrix(data = churn_pred,
reference = churn_test_y,
positive = "No")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 936 192
## Yes 102 177
##
## Accuracy : 0.791
## 95% CI : (0.7688, 0.812)
## No Information Rate : 0.7377
## P-Value [Acc > NIR] : 1.918e-06
##
## Kappa : 0.4139
##
## Mcnemar's Test P-Value : 2.096e-07
##
## Sensitivity : 0.9017
## Specificity : 0.4797
## Pos Pred Value : 0.8298
## Neg Pred Value : 0.6344
## Prevalence : 0.7377
## Detection Rate : 0.6652
## Detection Prevalence : 0.8017
## Balanced Accuracy : 0.6907
##
## 'Positive' Class : No
##
Kesimpulan: Diperoleh nilai recall sebesar 91%, artinya KNN dapat memprediksi dengan baik kelas target