Intro

Customer churn, also known as customer attrition, occurs when customers stop doing business with a company or stop using its services. By monitoring the churn rate, companies can measure how well they retain customers and identify strategies for improvement. We will use machine learning models to understand the customer behaviors and attributes that signal the risk and timing of churn. In this project, I want to predict Telco customer churn based on customer behavior. The dataset is taken from Kaggle.

Import Dataset
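
A minimal import sketch, assuming the Kaggle CSV has been saved locally (the file name and path are assumptions):

# Read the Telco churn dataset; keep strings as factors for the models below
telco <- read.csv("telco-customer-churn.csv", stringsAsFactors = TRUE)

# Inspect the structure of the data
str(telco)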

Exploratory Data Analysis

A. Visualization

1. Customer Behavior Based on Churn Status

2. Churn based on Customer Characteristics

From the customer characteristics, I can draw several insights (a plotting sketch follows the list):

  • By gender, males and females seem to churn in roughly the same proportion.

  • Senior citizens are more likely to churn.

  • Customers without a partner are more likely to churn than customers with a partner.

  • Customers without dependents are more likely to churn than customers with dependents.
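
As a hedged sketch of one of these comparisons, the churn proportion by senior-citizen status can be drawn with ggplot2; the data frame name telco is an assumption, while SeniorCitizen and Churn are the column names in the Kaggle file:

library(ggplot2)

# Churn proportion by senior-citizen status; swap the x aesthetic
# for gender, Partner, or Dependents to reproduce the other plots
ggplot(telco, aes(x = factor(SeniorCitizen), fill = Churn)) +
  geom_bar(position = "fill") +
  labs(x = "Senior citizen (0 = No, 1 = Yes)", y = "Proportion", fill = "Churn")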

B. Checking Correlation

Inspect the correlation between the numerical columns of the dataset.

Because TotalCharges is strongly correlated with both tenure and MonthlyCharges, I can drop the TotalCharges column.
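
A sketch of this check and the column removal, assuming the telco frame from the import step (in the raw Kaggle file TotalCharges contains blanks, so it is coerced to numeric first):

# Coerce TotalCharges to numeric, then inspect pairwise correlations
telco$TotalCharges <- as.numeric(as.character(telco$TotalCharges))
cor(telco[, c("tenure", "MonthlyCharges", "TotalCharges")], use = "complete.obs")

# Drop TotalCharges, since tenure and MonthlyCharges already carry its signal
telco$TotalCharges <- NULL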

Modelling

In this project, I use three machine learning models: Naive Bayes, Decision Tree, and Random Forest.

A. Cross Validation

To build the models, I have to split the dataset into training and test data: the training data is used to fit each model, and the test data serves as unseen data for evaluating predictions.
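
A minimal split sketch; the sample counts printed later (5636 train, 1407 test) suggest roughly an 80/20 split, so that ratio is assumed here, as is the seed:

set.seed(100)  # seed value is an assumption, kept for reproducibility

idx   <- sample(nrow(telco), size = 0.8 * nrow(telco))
train <- telco[idx, ]
test  <- telco[-idx, ]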

B. Machine Learning Models

1. Naive Bayes Model

Create confusion matrices for the training and test data; comparing the two shows whether the model overfits.
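
A sketch of the fit and both confusion matrices, assuming the e1071 and caret packages and the train/test frames from the split above:

library(e1071)
library(caret)

# Fit naive Bayes on the training data
nb_model <- naiveBayes(Churn ~ ., data = train)

# Class predictions on both sets
nb_pred_train <- predict(nb_model, train, type = "class")
nb_pred_test  <- predict(nb_model, test, type = "class")

# Confusion matrices with "Yes" (churn) as the positive class
confusionMatrix(nb_pred_train, train$Churn, positive = "Yes")
confusionMatrix(nb_pred_test, test$Churn, positive = "Yes")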

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  2844  284
       Yes 1296 1212
                                          
               Accuracy : 0.7197          
                 95% CI : (0.7077, 0.7314)
    No Information Rate : 0.7346          
    P-Value [Acc > NIR] : 0.9944          
                                          
                  Kappa : 0.4088          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8102          
            Specificity : 0.6870          
         Pos Pred Value : 0.4833          
         Neg Pred Value : 0.9092          
             Prevalence : 0.2654          
         Detection Rate : 0.2150          
   Detection Prevalence : 0.4450          
      Balanced Accuracy : 0.7486          
                                          
       'Positive' Class : Yes             
                                          
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  712  58
       Yes 322 315
                                         
               Accuracy : 0.7299         
                 95% CI : (0.7059, 0.753)
    No Information Rate : 0.7349         
    P-Value [Acc > NIR] : 0.6761         
                                         
                  Kappa : 0.4347         
                                         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.8445         
            Specificity : 0.6886         
         Pos Pred Value : 0.4945         
         Neg Pred Value : 0.9247         
             Prevalence : 0.2651         
         Detection Rate : 0.2239         
   Detection Prevalence : 0.4527         
      Balanced Accuracy : 0.7665         
                                         
       'Positive' Class : Yes            
                                         

Visualize ROC Curve for Naive Bayes Model

The ROC curve illustrates how well the model can distinguish between the positive and negative classes. The plot has two axes:

  • x: False Positive Rate (1 - specificity)

  • y: True Positive Rate (recall / sensitivity)

Checking AUC (Area under the curve) value
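
One way to compute both the curve and the value, assuming the pROC package (ROCR would work equally well) and the naive Bayes model above:

library(pROC)

# Probability of the positive class "Yes" on the test set
nb_prob <- predict(nb_model, test, type = "raw")[, "Yes"]

# Build the ROC object, plot the curve, and extract the AUC as a number
nb_roc <- roc(response = test$Churn, predictor = nb_prob, levels = c("No", "Yes"))
plot(nb_roc)
as.numeric(auc(nb_roc))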

[1] 0.8412656

2. Decision Tree Model

Make predictions with the decision tree model.

Create confusion matrices for the training and test data.
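
A sketch using rpart; the choice of tree implementation is an assumption (partykit's ctree is another common option):

library(rpart)

# Fit a classification tree on the training data
dt_model <- rpart(Churn ~ ., data = train, method = "class")

# Class predictions for both sets; confusionMatrix() is applied as before
dt_pred_train <- predict(dt_model, train, type = "class")
dt_pred_test  <- predict(dt_model, test, type = "class")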

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  3818  816
       Yes  322  680
                                          
               Accuracy : 0.7981          
                 95% CI : (0.7874, 0.8085)
    No Information Rate : 0.7346          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4212          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.4545          
            Specificity : 0.9222          
         Pos Pred Value : 0.6786          
         Neg Pred Value : 0.8239          
             Prevalence : 0.2654          
         Detection Rate : 0.1207          
   Detection Prevalence : 0.1778          
      Balanced Accuracy : 0.6884          
                                          
       'Positive' Class : Yes             
                                          
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  937 196
       Yes  97 177
                                          
               Accuracy : 0.7918          
                 95% CI : (0.7696, 0.8127)
    No Information Rate : 0.7349          
    P-Value [Acc > NIR] : 4.327e-07       
                                          
                  Kappa : 0.416           
                                          
 Mcnemar's Test P-Value : 1.033e-08       
                                          
            Sensitivity : 0.4745          
            Specificity : 0.9062          
         Pos Pred Value : 0.6460          
         Neg Pred Value : 0.8270          
             Prevalence : 0.2651          
         Detection Rate : 0.1258          
   Detection Prevalence : 0.1947          
      Balanced Accuracy : 0.6904          
                                          
       'Positive' Class : Yes             
                                          

From the two confusion matrices for the decision tree, the model has slightly higher accuracy on the training data (79.81%) than on the test data (79.18%). That is expected, since the training data was used to build the model. Because the difference is small, the model does not appear to overfit. The confusion matrices also show low sensitivity but high specificity, which means the decision tree is good at identifying customers who do not churn but poor at catching those who do.

Visualize ROC Curve for Decision Tree Model

Checking AUC (Area under the curve) value

[1] 0.8365921

3. Random Forest Model

I create a random forest model on the training data using k-fold cross-validation, with 5 folds and 3 repetitions.

Below is a demonstration of k-fold cross-validation for the random forest model.
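
A setup consistent with the printout below, assuming caret (the seed value is an assumption):

library(caret)

set.seed(100)

# 5-fold cross-validation, repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# Train the random forest; caret tunes mtry across the resamples
rf_model <- train(Churn ~ ., data = train, method = "rf", trControl = ctrl)
rf_model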

Random Forest 

5636 samples
  18 predictor
   2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 4509, 4508, 4509, 4509, 4509, 4508, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.7884444  0.3429087
  15    0.7947143  0.4342739
  29    0.7921121  0.4299515

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 15.

Call:
 randomForest(x = x, y = y, mtry = param$mtry) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 15

        OOB estimate of  error rate: 20.53%
Confusion matrix:
      No Yes class.error
No  3716 424   0.1024155
Yes  733 763   0.4899733

The model shows the best accuracy at mtry = 15.

Evaluate the random forest model on the training and test data.
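
Evaluation mirrors the earlier models, assuming the caret object from the previous step:

# Class predictions from the tuned random forest
rf_pred_train <- predict(rf_model, train)
rf_pred_test  <- predict(rf_model, test)

confusionMatrix(rf_pred_train, train$Churn, positive = "Yes")
confusionMatrix(rf_pred_test, test$Churn, positive = "Yes")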

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  4051  183
       Yes   89 1313
                                          
               Accuracy : 0.9517          
                 95% CI : (0.9458, 0.9572)
    No Information Rate : 0.7346          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8737          
                                          
 Mcnemar's Test P-Value : 1.711e-08       
                                          
            Sensitivity : 0.8777          
            Specificity : 0.9785          
         Pos Pred Value : 0.9365          
         Neg Pred Value : 0.9568          
             Prevalence : 0.2654          
         Detection Rate : 0.2330          
   Detection Prevalence : 0.2488          
      Balanced Accuracy : 0.9281          
                                          
       'Positive' Class : Yes             
                                          
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1004   37
       Yes   30  336
                                          
               Accuracy : 0.9524          
                 95% CI : (0.9399, 0.9629)
    No Information Rate : 0.7349          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8771          
                                          
 Mcnemar's Test P-Value : 0.4635          
                                          
            Sensitivity : 0.9008          
            Specificity : 0.9710          
         Pos Pred Value : 0.9180          
         Neg Pred Value : 0.9645          
             Prevalence : 0.2651          
         Detection Rate : 0.2388          
   Detection Prevalence : 0.2601          
      Balanced Accuracy : 0.9359          
                                          
       'Positive' Class : Yes             
                                          

From the random forest model, we can inspect the most important variables used across the folds and repetitions.
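
caret exposes this through varImp(); a minimal sketch:

# Variable importance aggregated over the fitted model
varImp(rf_model)
plot(varImp(rf_model))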

Visualize ROC Curve for Random Forest Model

Checking AUC (Area under the curve) value

[1] 0.9863787

Conclusion

From the three models created above, I obtained the following accuracy, recall (sensitivity), and AUC on the test data:

  • Naive Bayes model: Accuracy = 72.99%, Recall = 84.45%, AUC = 84.13%

  • Decision Tree model: Accuracy = 79.18%, Recall = 47.45%, AUC = 83.66%

  • Random Forest model: Accuracy = 95.24%, Recall = 90.08%, AUC = 98.64%

I conclude that the random forest model gives the best predictions of the three: it scores highest on accuracy, recall, and AUC, so it is both the most accurate overall and the best at catching the customers who actually churn.