Dataset Churn

Dataset ini bercerita tentang Memprediksi perilaku untuk mempertahankan pelanggan. Dapat dilakukan untuk menganalisis semua data pelanggan yang relevan dan mengembangkan program retensi pelanggan.

Setiap baris mewakili pelanggan, setiap kolom berisi atribut pelanggan yang dijelaskan pada kolom Metadata.

Kumpulan data mencakup informasi tentang:

Pelanggan yang keluar dalam sebulan terakhir – kolomnya disebut dengan Churn. Layanan yang telah didaftarkan oleh setiap pelanggan – telepon, banyak saluran, internet, keamanan online, pencadangan online, perlindungan perangkat, dukungan teknis, serta streaming TV dan film Informasi akun pelanggan – berapa lama mereka menjadi pelanggan, kontrak, metode pembayaran, penagihan tanpa kertas, tagihan bulanan, dan total tagihan Informasi demografis tentang pelanggan – jenis kelamin, rentang usia, dan apakah mereka memiliki mitra dan tanggungan inspirasi untuk menjelajahi model jenis ini dan mempelajari lebih lanjut tentang subjeknya.

churn <- read.csv("data_input/churn.csv", stringsAsFactors = T)
churn_clean <- churn %>% 
  select(-customerID) %>% 
  mutate(
    SeniorCitizen = as.factor(SeniorCitizen),
    MonthlyCharges = as.numeric(MonthlyCharges),
    TotalCharges = as.numeric(TotalCharges)
   
  )
glimpse(churn_clean)
## Rows: 7,043
## Columns: 20
## $ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines    <fct> No phone service, No, No, No phone service, No, Yes, …
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No …
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No in…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in…
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No inte…
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No int…
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte…
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…

Dilakukan data wrangling dengan membuang kolom customerID, mengubah tipe data SeniorCitizen menjadi factor, mengubah tipe data MonthlyCharge dan TotalCharges menjadi numeric.

head(churn_clean)
levels(churn_clean$Churn)
## [1] "No"  "Yes"

Memeriksa level pada variabel target Churn dengan jumlah levels 2 yaitu “No” dan “Yes”

Dilakukan model logistic linear model tanpa prediktor

churn_null <- glm(formula = Churn ~1,
                  data = churn_clean,
                  family = "binomial")

summary(churn_null)
## 
## Call:
## glm(formula = Churn ~ 1, family = "binomial", data = churn_clean)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.01824    0.02699  -37.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8150.1  on 7042  degrees of freedom
## Residual deviance: 8150.1  on 7042  degrees of freedom
## AIC: 8152.1
## 
## Number of Fisher Scoring iterations: 4

didapat nilai intercept = -1.018

table(churn_clean$Churn)
## 
##   No  Yes 
## 5174 1869

Memeriksa jumlah nilai pada target Churn

Dilakukan logistic liner Model dengan numerical prediktor

churn_num <- glm(formula = Churn ~TotalCharges,
                  data = churn_clean,
                  family = "binomial")
summary(churn_num)
## 
## Call:
## glm(formula = Churn ~ TotalCharges, family = "binomial", data = churn_clean)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.448e-01  3.720e-02  -14.65   <2e-16 ***
## TotalCharges -2.362e-04  1.457e-05  -16.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8143.4  on 7031  degrees of freedom
## Residual deviance: 7833.7  on 7030  degrees of freedom
##   (11 observations deleted due to missingness)
## AIC: 7837.7
## 
## Number of Fisher Scoring iterations: 4

didapatkan nilai Intercept = -0.544 dan TotalCharge -0.00023

colSums(is.na(churn_clean))
##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
##                0                0               11                0

memeriksa kolom yang tidak memiliki nilai.

churn_clean2 <- churn_clean %>% 
  drop_na()
anyNA(churn_clean2)
## [1] FALSE
colSums(is.na(churn_clean2))
##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
##                0                0                0                0

sudah dilakukan penghapusan nilai yang kosong

melakukan logistic linear Model dengan semua predictor

churn_all <- glm(formula = Churn ~.,
                  data = churn_clean2,
                  family = "binomial")
summary(churn_all)
## 
## Call:
## glm(formula = Churn ~ ., family = "binomial", data = churn_clean2)
## 
## Coefficients: (7 not defined because of singularities)
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           1.165e+00  8.151e-01   1.430  0.15284    
## genderMale                           -2.183e-02  6.480e-02  -0.337  0.73619    
## SeniorCitizen1                        2.168e-01  8.453e-02   2.564  0.01033 *  
## PartnerYes                           -3.840e-04  7.783e-02  -0.005  0.99606    
## DependentsYes                        -1.485e-01  8.973e-02  -1.655  0.09796 .  
## tenure                               -6.059e-02  6.236e-03  -9.716  < 2e-16 ***
## PhoneServiceYes                       1.715e-01  6.487e-01   0.264  0.79153    
## MultipleLinesNo phone service                NA         NA      NA       NA    
## MultipleLinesYes                      4.484e-01  1.773e-01   2.530  0.01142 *  
## InternetServiceFiber optic            1.747e+00  7.981e-01   2.190  0.02855 *  
## InternetServiceNo                    -1.786e+00  8.073e-01  -2.213  0.02691 *  
## OnlineSecurityNo internet service            NA         NA      NA       NA    
## OnlineSecurityYes                    -2.054e-01  1.787e-01  -1.150  0.25031    
## OnlineBackupNo internet service              NA         NA      NA       NA    
## OnlineBackupYes                       2.604e-02  1.754e-01   0.148  0.88197    
## DeviceProtectionNo internet service          NA         NA      NA       NA    
## DeviceProtectionYes                   1.474e-01  1.764e-01   0.836  0.40339    
## TechSupportNo internet service               NA         NA      NA       NA    
## TechSupportYes                       -1.805e-01  1.806e-01  -0.999  0.31759    
## StreamingTVNo internet service               NA         NA      NA       NA    
## StreamingTVYes                        5.905e-01  3.263e-01   1.810  0.07035 .  
## StreamingMoviesNo internet service           NA         NA      NA       NA    
## StreamingMoviesYes                    5.993e-01  3.267e-01   1.834  0.06658 .  
## ContractOne year                     -6.608e-01  1.076e-01  -6.142 8.15e-10 ***
## ContractTwo year                     -1.357e+00  1.764e-01  -7.691 1.46e-14 ***
## PaperlessBillingYes                   3.424e-01  7.450e-02   4.596 4.31e-06 ***
## PaymentMethodCredit card (automatic) -8.779e-02  1.141e-01  -0.770  0.44156    
## PaymentMethodElectronic check         3.045e-01  9.450e-02   3.222  0.00127 ** 
## PaymentMethodMailed check            -5.759e-02  1.149e-01  -0.501  0.61627    
## MonthlyCharges                       -4.034e-02  3.176e-02  -1.270  0.20392    
## TotalCharges                          3.289e-04  7.063e-05   4.657 3.20e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8143.4  on 7031  degrees of freedom
## Residual deviance: 5826.3  on 7008  degrees of freedom
## AIC: 5874.3
## 
## Number of Fisher Scoring iterations: 6
set.seed(111)

# index sampling 

index <- sample(x = nrow(churn_clean2),size = 0.8*nrow(churn_clean2)) 


# splitting data 
churn_train <- churn_clean2[index,] #ambil data 80%
churn_test  <- churn_clean2[-index,]  #ambil data 20%

melakukan sampling index serta splitting data dengan pembagian data train sebanyak 80% dan data train 20%

table(churn_train$Churn) %>% prop.table()
## 
##        No       Yes 
## 0.7333333 0.2666667

memeriksa jumlah ratio pada nilai Churn dan didapat nilai No sebesar 73,42% dan nilai Yes sebesar 26,57%

# prediktor
churn_train_x <- churn_train %>% select_if(is.numeric)

churn_test_x <- churn_test %>% select_if(is.numeric)

# target
churn_train_y <- churn_train[,"Churn"]

churn_test_y <- churn_test[,"Churn"]
# scaling data
# train
churn_train_xs <- scale(x=churn_train_x)

# test
churn_test_xs <- scale(x=churn_test_x,
                      center = attr(churn_train_xs,"scaled:center") ,
                      scale= attr(churn_train_xs,"scaled:scale"))

Predict

melakukan prediksi dengan menentukan nilai k yang optimum

# find optimum k
sqrt(nrow(churn_train_xs))
## [1] 75
 # package untuk fungsi `knn()`

churn_pred <- knn(train = churn_train_xs,
                 test = churn_test_xs,
                 cl = churn_train_y,
                 k = 75)

head(churn_pred)
## [1] No  No  No  No  Yes No 
## Levels: No Yes
# confusion matrix

confusionMatrix(data = churn_pred,              
                reference = churn_test_y,
                positive = "No")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  936 192
##        Yes 102 177
##                                          
##                Accuracy : 0.791          
##                  95% CI : (0.7688, 0.812)
##     No Information Rate : 0.7377         
##     P-Value [Acc > NIR] : 1.918e-06      
##                                          
##                   Kappa : 0.4139         
##                                          
##  Mcnemar's Test P-Value : 2.096e-07      
##                                          
##             Sensitivity : 0.9017         
##             Specificity : 0.4797         
##          Pos Pred Value : 0.8298         
##          Neg Pred Value : 0.6344         
##              Prevalence : 0.7377         
##          Detection Rate : 0.6652         
##    Detection Prevalence : 0.8017         
##       Balanced Accuracy : 0.6907         
##                                          
##        'Positive' Class : No             
## 

Kesimpulan: Diperoleh nilai recall sebesar 91%, artinya KNN dapat memprediksi dengan baik kelas target