1 Read and Inspect Data

telco <- read.csv("telcochurn.csv", stringsAsFactors = TRUE)
glimpse(telco)
#> Rows: 7,043
#> Columns: 22
#> $ X                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16~
#> $ customerID       <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 9237-~
#> $ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Femal~
#> $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
#> $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye~
#> $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No~
#> $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2~
#> $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y~
#> $ MultipleLines    <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye~
#> $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o~
#> $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,~
#> $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N~
#> $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y~
#> $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,~
#> $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye~
#> $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes~
#> $ Contract         <fct> Month-to-month, One year, Month-to-month, One year, M~
#> $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No~
#> $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank tr~
#> $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7~
#> $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949~
#> $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~

Data Explanation:
Overall, the column in dataset is self explanatory. Maybe only a few that needed further explanation.

  • Tenure = Duration of usage
  • Contract = Type of subscription
  • Churn = Whether the customer is stop using phone service or not

Business Question:
We assume ourself to be a data analyst in a Telco company. We were asked to build the best model to predict the customer that intent to stop using our services, so we could ask them for their feedback of our services in order to improve our services.

2 Data Wrangling

  • Select only columns needed, then check missing values.
telco <- telco %>% 
  select(-c(X, customerID))

colSums(is.na(telco))
#>           gender    SeniorCitizen          Partner       Dependents 
#>                0                0                0                0 
#>           tenure     PhoneService    MultipleLines  InternetService 
#>                0                0                0                0 
#>   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
#>                0                0                0                0 
#>      StreamingTV  StreamingMovies         Contract PaperlessBilling 
#>                0                0                0                0 
#>    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
#>                0                0               11                0
  • Remove missing value with complete case, since missing value is < 5% of data.
telco_clean <- telco %>% 
  na.omit()

colSums(is.na(telco_clean))
#>           gender    SeniorCitizen          Partner       Dependents 
#>                0                0                0                0 
#>           tenure     PhoneService    MultipleLines  InternetService 
#>                0                0                0                0 
#>   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
#>                0                0                0                0 
#>      StreamingTV  StreamingMovies         Contract PaperlessBilling 
#>                0                0                0                0 
#>    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
#>                0                0                0                0

3 Cross Validation

3.1 Check proportion of target variable.

prop.table(table(telco_clean$Churn))
#> 
#>       No      Yes 
#> 0.734215 0.265785
table(telco_clean$Churn)
#> 
#>   No  Yes 
#> 5163 1869

Note:
The proportion of target variable is imbalance, but we keep going for now. Will do the upsampling and/or downsampling later.

3.2 Splitting

Split the original data to data train and data test, and then check proportion of target variable in data train.

RNGkind(sample.kind= "Rounding")
set.seed(417)
idx <- sample(nrow(telco_clean), nrow(telco_clean)*0.8)
telco_train <- telco_clean[idx,]
telco_test <- telco_clean[-idx,]
prop.table(table(telco_train$Churn))
#> 
#>        No       Yes 
#> 0.7347556 0.2652444

3.3 Modelling

3.3.1 Naive Bayes

1. Making Naive Bayes model using all predictor variables, without upsampling or downsampling the data train.

  • Making model.
model_nb <- naiveBayes(Churn~., telco_train)
pred <- predict(model_nb, telco_test, type = "class")
  • Model Evaluation
confusionMatrix(pred, telco_test$Churn, positive = "Yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No  829 113
#>        Yes 201 264
#>                                           
#>                Accuracy : 0.7768          
#>                  95% CI : (0.7542, 0.7983)
#>     No Information Rate : 0.7321          
#>     P-Value [Acc > NIR] : 0.0000641500    
#>                                           
#>                   Kappa : 0.4703          
#>                                           
#>  Mcnemar's Test P-Value : 0.0000009122    
#>                                           
#>             Sensitivity : 0.7003          
#>             Specificity : 0.8049          
#>          Pos Pred Value : 0.5677          
#>          Neg Pred Value : 0.8800          
#>              Prevalence : 0.2679          
#>          Detection Rate : 0.1876          
#>    Detection Prevalence : 0.3305          
#>       Balanced Accuracy : 0.7526          
#>                                           
#>        'Positive' Class : Yes             
#> 

Note:

  • Model Summary :
    • Accuracy = 0.7768
    • Sensitivity = 0.7003

2. Making Naive Bayes model using all predictor variables, and downsampling the data train.

  • Downsampling data train to have target variable proportion that is balance.
telco_train_down <- downSample(x = telco_train %>% select(-Churn), y = telco_train$Churn, yname = "Churn")
table(telco_train_down$Churn)
#> 
#>   No  Yes 
#> 1492 1492
prop.table(table(telco_train_down$Churn))
#> 
#>  No Yes 
#> 0.5 0.5
  • Making down sample model.
model_nb_down <- naiveBayes(Churn~., telco_train_down)
pred_down <- predict(model_nb_down, telco_test, type = "class")
  • Model Evaluation.
confusionMatrix(pred_down, telco_test$Churn, positive = "Yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No  755  82
#>        Yes 275 295
#>                                              
#>                Accuracy : 0.7463             
#>                  95% CI : (0.7227, 0.7688)   
#>     No Information Rate : 0.7321             
#>     P-Value [Acc > NIR] : 0.1199             
#>                                              
#>                   Kappa : 0.4435             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.7825             
#>             Specificity : 0.7330             
#>          Pos Pred Value : 0.5175             
#>          Neg Pred Value : 0.9020             
#>              Prevalence : 0.2679             
#>          Detection Rate : 0.2097             
#>    Detection Prevalence : 0.4051             
#>       Balanced Accuracy : 0.7578             
#>                                              
#>        'Positive' Class : Yes                
#> 

Note:

  • Model Summary :
    • Accuracy = 0.7463
    • Sensitivity = 0.7825

3. Making Naive Bayes model using all predictor variables, and upsampling the data train.

  • Upsampling data train to have target variable proportion that is balance.
telco_train_up <- upSample(x = telco_train %>% select(-Churn), y = telco_train$Churn, yname = "Churn")
table(telco_train_up$Churn)
#> 
#>   No  Yes 
#> 4133 4133
prop.table(table(telco_train_up$Churn))
#> 
#>  No Yes 
#> 0.5 0.5
  • Making model.
model_nb_up <- naiveBayes(Churn~., telco_train_up)
pred_up <- predict(model_nb_up, telco_test, type = "class")
  • Model Evaluation.
confusionMatrix(pred_up, telco_test$Churn, positive = "Yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No  755  83
#>        Yes 275 294
#>                                              
#>                Accuracy : 0.7456             
#>                  95% CI : (0.7219, 0.7681)   
#>     No Information Rate : 0.7321             
#>     P-Value [Acc > NIR] : 0.1325             
#>                                              
#>                   Kappa : 0.4416             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.7798             
#>             Specificity : 0.7330             
#>          Pos Pred Value : 0.5167             
#>          Neg Pred Value : 0.9010             
#>              Prevalence : 0.2679             
#>          Detection Rate : 0.2090             
#>    Detection Prevalence : 0.4044             
#>       Balanced Accuracy : 0.7564             
#>                                              
#>        'Positive' Class : Yes                
#> 

Note:

  • Model Summary : 
    • Accuracy = 0.7456
    • Sensitivity = 0.7798

ROC and AUC

  • Data default before downsample.
pred_prob <- predict(model_nb, telco_test, type = "raw")
head(pred_prob)
#>              No           Yes
#> [1,] 0.02250140 0.97749860197
#> [2,] 0.01464784 0.98535215574
#> [3,] 0.99998922 0.00001078168
#> [4,] 0.99420460 0.00579539609
#> [5,] 0.99942708 0.00057292442
#> [6,] 0.99841864 0.00158136367
pred_rocr <- prediction(predictions = pred_prob[,2], 
                        labels = as.numeric(ifelse(test = telco_test$Churn == "Yes", 1, 0)))

perform <- performance(prediction.obj = pred_rocr, measure = "tpr", x.measure = "fpr")
  • Plotting ROC and find AUC values.
# roc plot
plot(perform)

# auc
auc <- performance(pred_rocr, "auc")

auc@y.values
#> [[1]]
#> [1] 0.8408424
  • Data downsample.
pred_prob_down <- predict(model_nb_down, telco_test, type = "raw")
head(pred_prob_down)
#>               No           Yes
#> [1,] 0.008816022 0.99118397761
#> [2,] 0.005479125 0.99452087518
#> [3,] 0.999969267 0.00003073294
#> [4,] 0.984380651 0.01561934880
#> [5,] 0.998128070 0.00187192969
#> [6,] 0.995777509 0.00422249090
pred_rocr_down <- prediction(predictions = pred_prob_down[,2], 
                        labels = as.numeric(ifelse(test = telco_test$Churn == "Yes", 1, 0)))

perf_down <- performance(prediction.obj = pred_rocr_down, measure = "tpr", x.measure = "fpr")
  • Plotting ROC and find AUC values.
# roc plot
plot(perf_down)

# auc
auc_down <- performance(pred_rocr_down, "auc")

auc_down@y.values
#> [[1]]
#> [1] 0.8408115

Naive Bayes model evaluation:

  • Based on Model Summary, Naive Bayes with data train that has been downsampled or model_nb_down is the better model. The details is below:
    • Model Summary :
      • Accuracy = 0.7463
      • Sensitivity = 0.7825
  • Even though model_nb_down produced less AUC value rather than model_nb, which is 0.8408115 compared to 0.8408424, but it has the better Recall / Sensitivity as a Matrix.
    We use Recall / Sensitivity as an Evaluation Matrix because we want to have a model that is good in predicting class for true positive, as described in our business question.

3.3.2 Decision Tree

1. Making Decision Tree model.

  • Making model.
model_dt <- ctree(Churn ~ ., telco_train_down)
  • Inspect model Decision Tree.
plot(model_dt, type = "simple")

  • Decision Tree Model Evaluation.
class_predict_dt <- predict(model_dt, telco_test, type = "response")
confusionMatrix(as.factor(class_predict_dt),telco_test$Churn, positive = "Yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No  780  86
#>        Yes 250 291
#>                                              
#>                Accuracy : 0.7612             
#>                  95% CI : (0.738, 0.7833)    
#>     No Information Rate : 0.7321             
#>     P-Value [Acc > NIR] : 0.0069             
#>                                              
#>                   Kappa : 0.465              
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.7719             
#>             Specificity : 0.7573             
#>          Pos Pred Value : 0.5379             
#>          Neg Pred Value : 0.9007             
#>              Prevalence : 0.2679             
#>          Detection Rate : 0.2068             
#>    Detection Prevalence : 0.3845             
#>       Balanced Accuracy : 0.7646             
#>                                              
#>        'Positive' Class : Yes                
#> 

2. Pruning Decision Tree model.

  • Making model.
model_dt_pruning <- ctree(Churn ~ ., telco_train_down,
                          control = ctree_control(minsplit = 100,
                                                  minbucket = 50))
  • Inspect model Decision Tree.
plot(model_dt_pruning, type = "simple")

  • Decision Tree Model Evaluation.
class_predict_dt <- predict(model_dt_pruning, telco_test, type = "response")
confusionMatrix(as.factor(class_predict_dt),telco_test$Churn, positive = "Yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No  775  78
#>        Yes 255 299
#>                                                
#>                Accuracy : 0.7633               
#>                  95% CI : (0.7402, 0.7853)     
#>     No Information Rate : 0.7321               
#>     P-Value [Acc > NIR] : 0.004052             
#>                                                
#>                   Kappa : 0.4749               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.7931               
#>             Specificity : 0.7524               
#>          Pos Pred Value : 0.5397               
#>          Neg Pred Value : 0.9086               
#>              Prevalence : 0.2679               
#>          Detection Rate : 0.2125               
#>    Detection Prevalence : 0.3937               
#>       Balanced Accuracy : 0.7728               
#>                                                
#>        'Positive' Class : Yes                  
#> 

Notes:

  • Model Decision Tree / model_dt:
    • Accuracy = 0.7612
    • Sensitivity = 0.7719
  • Model Decision Tree after pruning / model_dt_pruning:
    • Accuracy : 0.7633 
    • Sensitivity : 0.7931
  • Model Decision Tree after pruning or model_dt_pruning is better than models generated by Naive Bayes because it produce better Recall / Sensitivity, which is 0.7931

3.3.3 Random Forest

  • Making Random Forest model.
set.seed(417)

ctrl <- trainControl(method="repeatedcv", number=4, repeats=4)
# model_rf <- train(Churn ~ ., data = telco_train_down, method="rf", trControl = ctrl)

# saveRDS(model_rf, file = "model_rf.rds")
model_rf <- readRDS("model_rf.RDS")
  • Inspect Random Forest model.
# Print Output 
model_rf
#> Random Forest 
#> 
#> 2984 samples
#>   19 predictor
#>    2 classes: 'No', 'Yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (4 fold, repeated 4 times) 
#> Summary of sample sizes: 2238, 2238, 2238, 2238, 2238, 2238, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>    2    0.7615617  0.5231233
#>   12    0.7554457  0.5108914
#>   23    0.7533512  0.5067024
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
# Model Configuration
model_rf$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 2
#> 
#>         OOB estimate of  error rate: 23.59%
#> Confusion matrix:
#>       No  Yes class.error
#> No  1084  408   0.2734584
#> Yes  296 1196   0.1983914
# Model Visualization
plot(model_rf)

# Important predictor variables to predict the target variables
varImp(model_rf)
#> rf variable importance
#> 
#>   only 20 most important variables shown (out of 23)
#> 
#>                                      Overall
#> tenure                               100.000
#> TotalCharges                          82.679
#> MonthlyCharges                        78.872
#> InternetServiceFiber optic            46.534
#> ContractTwo year                      44.927
#> PaymentMethodElectronic check         29.326
#> ContractOne year                      24.326
#> InternetServiceNo                     20.647
#> TechSupportYes                        17.443
#> OnlineSecurityYes                     16.276
#> PaperlessBillingYes                   13.717
#> OnlineBackupYes                        7.447
#> SeniorCitizen                          7.060
#> DependentsYes                          6.853
#> PartnerYes                             6.587
#> PaymentMethodCredit card (automatic)   4.827
#> StreamingMoviesYes                     4.684
#> DeviceProtectionYes                    4.587
#> MultipleLinesYes                       4.525
#> genderMale                             4.417
  • Random Forest model evaluation.
pred_rf <- predict(model_rf, newdata = telco_test, positive = "Yes")
confusionMatrix(pred_rf, reference = telco_test$Churn, positive = "Yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No  763  65
#>        Yes 267 312
#>                                                
#>                Accuracy : 0.764                
#>                  95% CI : (0.741, 0.786)       
#>     No Information Rate : 0.7321               
#>     P-Value [Acc > NIR] : 0.003369             
#>                                                
#>                   Kappa : 0.4858               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.8276               
#>             Specificity : 0.7408               
#>          Pos Pred Value : 0.5389               
#>          Neg Pred Value : 0.9215               
#>              Prevalence : 0.2679               
#>          Detection Rate : 0.2217               
#>    Detection Prevalence : 0.4115               
#>       Balanced Accuracy : 0.7842               
#>                                                
#>        'Positive' Class : Yes                  
#> 

Random Forest Model tuning.

  • Making Random Forest model tuning by setting mtry.
set.seed(417)
grid <- expand.grid(.mtry = c(3,4,5))

# model_rf_tune <- train(Churn ~ ., data = telco_train_down,
#                   method="rf",
#                   trControl = ctrl,
#                   tuneGrid = grid)
# saveRDS(model_rf_tune, file = "model_rf_tune.rds")
model_rf_tune <- readRDS("model_rf_tune.RDS")
  • Inspect tuned Random Forest model.
# Print Output 
model_rf_tune
#> Random Forest 
#> 
#> 2984 samples
#>   19 predictor
#>    2 classes: 'No', 'Yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (4 fold, repeated 4 times) 
#> Summary of sample sizes: 2238, 2238, 2238, 2238, 2238, 2238, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>   3     0.7647453  0.5294906
#>   4     0.7620643  0.5241287
#>   5     0.7592158  0.5184316
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 3.
# Model Configuration
model_rf_tune$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 3
#> 
#>         OOB estimate of  error rate: 24.3%
#> Confusion matrix:
#>       No  Yes class.error
#> No  1083  409   0.2741287
#> Yes  316 1176   0.2117962
# Model Visualization
plot(model_rf_tune)

# Important predictor variables to predict the target variables
varImp(model_rf_tune)
#> rf variable importance
#> 
#>   only 20 most important variables shown (out of 23)
#> 
#>                               Overall
#> tenure                        100.000
#> TotalCharges                   94.292
#> MonthlyCharges                 89.705
#> ContractTwo year               39.427
#> InternetServiceFiber optic     37.174
#> PaymentMethodElectronic check  20.812
#> ContractOne year               18.846
#> InternetServiceNo              14.023
#> TechSupportYes                 12.830
#> OnlineSecurityYes              11.967
#> PaperlessBillingYes            10.857
#> genderMale                      8.004
#> OnlineBackupYes                 6.998
#> PartnerYes                      6.608
#> DependentsYes                   6.561
#> SeniorCitizen                   6.205
#> MultipleLinesYes                5.531
#> StreamingMoviesYes              4.962
#> StreamingTVYes                  4.611
#> DeviceProtectionYes             4.345
  • Tuned Random Forest model evaluation.
pred_rf_tune <- predict(model_rf_tune, newdata = telco_test, positive = "Yes")
confusionMatrix(pred_rf_tune, reference = telco_test$Churn, positive = "Yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No  770  71
#>        Yes 260 306
#>                                                
#>                Accuracy : 0.7647               
#>                  95% CI : (0.7417, 0.7867)     
#>     No Information Rate : 0.7321               
#>     P-Value [Acc > NIR] : 0.002791             
#>                                                
#>                   Kappa : 0.4826               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.8117               
#>             Specificity : 0.7476               
#>          Pos Pred Value : 0.5406               
#>          Neg Pred Value : 0.9156               
#>              Prevalence : 0.2679               
#>          Detection Rate : 0.2175               
#>    Detection Prevalence : 0.4023               
#>       Balanced Accuracy : 0.7796               
#>                                                
#>        'Positive' Class : Yes                  
#> 

Note:

  • Model Summary of model_rf:
    • Accuracy = 0.764
    • Sensitivity = 0.8276
  • Model Summary of model_rf_tune
    • Accuracy : 0.7647 
    • Sensitivity : 0.8117 
  • We know that we can tune Random Forest model by changing mtry and trees, but in this case it won’t do anything good to our model. Therefore the best model is still the model that generated by Random Forest. It has better accuracy and Sensitivity.

4 Model Evaluation

Recall / Sensitivity comparison of all model:

5 Conclusions

Here’s some conclusions we can take from this study case:

  • For this dataset, we have to down sampling the data train in order to make a better model.
  • The chosen Matrix Evaluation for this study case is Recall / Sensitivity, because we want to focus on class positive. In this case, as we want to improve our services regarding to customer’s feedback, firstly we have to know which customer is going to stop using our services.
  • Model Random Forest or model_rf is the best model for this study case. It generate sensitivity of 0.8276.
  • The most important variable in predicting whether the customer is going to stop using our services or not is the tenure or their duration of using our services.