Telecommunication Churn Prediction

Introduction

Customer churn is the loss of clients or customers. In order to avoid losing customers, a company needs to examine why its customers have left in the past and which features are more important to determine who will churn in the future. Our task is therefore to predict whether customers are about to churn and which are the most important features to get that prediction right. As in most prediction problems, we will use machine learning.

Companies usually define two types of customer churn, namely voluntary and involuntary. Voluntary churn occurs when a customer intentionally quits and switches to another company, while involuntary churn occurs when a customer quits due to external reasons such as relocation, death, or other circumstances.

Between the two types, voluntary churn is the easier one to anticipate, because we can study the customer characteristics visible in customer profiles. This problem can be addressed by building a machine learning model that predicts whether a customer will churn or not. The hope is that, with such a model, telecommunication companies can take preventive action for customers who have a high chance of churning.

Import Data

library(tidyverse)    # data wrangling and plotting
library(e1071)        # naive bayes
library(caret)        # confusion matrix, upSample/downSample, train()
library(ROCR)         # ROC curve and AUC
library(partykit)     # ctree decision tree
library(rsample)      # initial_split for train/test splitting
library(randomForest) # random forest
library(inspectdf)    # quick EDA of categorical and numeric columns

The data used is customer profile data from a telecommunications company, obtained from Kaggle. The dataset covers 7,043 customers and includes customer demographics, account and payment information, and the service products each customer has registered. From this information, we want to predict whether a customer will Churn or not.

# read data
churn <- read.csv("data/telcochurn.csv")
head(churn)
  X customerID gender SeniorCitizen Partner Dependents tenure PhoneService
1 1 7590-VHVEG Female             0     Yes         No      1           No
2 2 5575-GNVDE   Male             0      No         No     34          Yes
3 3 3668-QPYBK   Male             0      No         No      2          Yes
4 4 7795-CFOCW   Male             0      No         No     45           No
5 5 9237-HQITU Female             0      No         No      2          Yes
6 6 9305-CDSKC Female             0      No         No      8          Yes
  MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
1            No             DSL             No          Yes               No
2            No             DSL            Yes           No              Yes
3            No             DSL            Yes          Yes               No
4            No             DSL            Yes           No              Yes
5            No     Fiber optic             No           No               No
6           Yes     Fiber optic             No           No              Yes
  TechSupport StreamingTV StreamingMovies       Contract PaperlessBilling
1          No          No              No Month-to-month              Yes
2          No          No              No       One year               No
3          No          No              No Month-to-month              Yes
4         Yes          No              No       One year               No
5          No          No              No Month-to-month              Yes
6          No         Yes             Yes Month-to-month              Yes
              PaymentMethod MonthlyCharges TotalCharges Churn
1          Electronic check          29.85        29.85    No
2              Mailed check          56.95      1889.50    No
3              Mailed check          53.85       108.15   Yes
4 Bank transfer (automatic)          42.30      1840.75    No
5          Electronic check          70.70       151.65   Yes
6          Electronic check          99.65       820.50   Yes

The following is a description of each variable:

  • customerID: Customer ID
  • gender: Customer gender (Female, Male)
  • SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
  • Partner: Whether the customer has a partner or not (Yes, No)
  • Dependents: Whether the customer has dependents or not (Yes, No)
  • tenure: Number of months the customer has stayed with the company
  • PhoneService: Whether the customer has a phone service or not (Yes, No)
  • MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
  • InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
  • OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
  • OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
  • DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
  • TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
  • StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
  • StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
  • Contract: The contract term of the customer (Month-to-month, One year, Two year)
  • PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
  • PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card ( automatic))
  • MonthlyCharges: The amount charged to the customer monthly
  • TotalCharges: The total amount charged to the customer
  • Churn: Whether the customer churned or not (Yes or No)

Data Cleansing

We will convert all of the character columns above to the factor data type. We can use the stringsAsFactors = TRUE argument of read.csv() so that all character fields are automatically read in as factors.

# change data type into factor
churn <- read.csv("data/telcochurn.csv", stringsAsFactors = TRUE)
glimpse(churn)
Rows: 7,043
Columns: 22
$ X                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16~
$ customerID       <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 9237-~
$ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Femal~
$ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye~
$ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No~
$ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2~
$ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y~
$ MultipleLines    <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye~
$ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o~
$ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,~
$ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N~
$ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y~
$ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,~
$ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye~
$ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes~
$ Contract         <fct> Month-to-month, One year, Month-to-month, One year, M~
$ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No~
$ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank tr~
$ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7~
$ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949~
$ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~

The data has 7,043 observations and 22 columns. Next, check whether there is any missing data in the telcochurn dataset:

colSums(is.na(churn))
               X       customerID           gender    SeniorCitizen 
               0                0                0                0 
         Partner       Dependents           tenure     PhoneService 
               0                0                0                0 
   MultipleLines  InternetService   OnlineSecurity     OnlineBackup 
               0                0                0                0 
DeviceProtection      TechSupport      StreamingTV  StreamingMovies 
               0                0                0                0 
        Contract PaperlessBilling    PaymentMethod   MonthlyCharges 
               0                0                0                0 
    TotalCharges            Churn 
              11                0 

Out of 7,043 observations, there are 11 missing values, all in the TotalCharges column. Since this is only a small fraction of the data, we can simply drop those rows. There are also columns that are not relevant for modeling, namely X and customerID, so we remove these two columns from the dataset. Finally, we adjust the data type of the SeniorCitizen column, which was previously numeric, to categorical.

# remove column X & customerID, and missing value
churn <- churn %>% 
  select(-c(X,customerID)) %>% 
  na.omit()

# assigning label for seniorCitizen column
churn$SeniorCitizen <- factor(churn$SeniorCitizen, levels = c("0", "1"), labels = c("No", "Yes"))

We can check again whether the data is clean.

glimpse(churn)
Rows: 7,032
Columns: 20
$ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Femal~
$ SeniorCitizen    <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N~
$ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye~
$ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No~
$ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2~
$ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y~
$ MultipleLines    <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye~
$ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o~
$ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,~
$ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N~
$ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y~
$ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,~
$ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye~
$ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes~
$ Contract         <fct> Month-to-month, One year, Month-to-month, One year, M~
$ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No~
$ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank tr~
$ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7~
$ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949~
$ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
anyNA(churn)
[1] FALSE

The data is clean now, so we can continue to the next steps.

Exploratory Data Analysis

Next, let’s explore the data for both categorical and numeric columns. To find out the class proportion for each categorical variable, we can use the inspect_cat function from the inspectdf package as below:

churn %>% 
  inspect_cat() %>% 
  show_plot()

From the visualization above, it can be seen that the Churn target variable has noticeably more observations in the No category than in Yes, while the class proportions of most other variables are fairly balanced.

We can also inspect the numeric variables using the inspect_num function from the inspectdf package, as below:

churn %>% 
  inspect_num() %>% 
  show_plot()

From the visualization above, the distributions of the numeric variables differ considerably from one variable to another.

Before we create a model, we must divide the dataset into train and test data, with a proportion of 80% train and 20% test. Make sure to call the RNGkind() and set.seed() functions before splitting the data and run them together with the splitting code. set.seed() ensures that you get the same result each time you run the same process from the same seed.

RNGkind(sample.kind = "Rounding")
set.seed(100)
# splitting data
idx <- initial_split(data = churn,
                     prop = 0.8,
                     strata = "Churn")
data_train <- training(idx)
data_test <- testing(idx)

Let’s check the class proportion of the target variable:

prop.table(table(data_train$Churn))

       No       Yes 
0.7342222 0.2657778 

Based on the result above, the proportion of the target class is unbalanced, so we must balance it before modeling. Note that sub-sampling is applied only to the train data. There are several methods for resampling, including downsampling, upsampling, and adding data from other available sources. Here I will use upsampling to balance the proportion of the target class.

RNGkind(sample.kind = "Rounding")
set.seed(100)
# upsampling data train
data_train_up <- upSample(x = data_train %>% select(-Churn),
                              y = data_train$Churn, 
                              yname = "Churn")

table(data_train_up$Churn)

  No  Yes 
4130 4130 
prop.table(table(data_train_up$Churn))

 No Yes 
0.5 0.5 

The class proportion of the target variable is now balanced, so we can continue building models with data_train_up.
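
For reference, the downsampling alternative mentioned above would look like the sketch below, using caret's downSample(); we do not use it further here, since it discards rows from the majority class.

# sketch only: downsampling the majority class instead of upsampling
set.seed(100)
data_train_down <- downSample(x = data_train %>% select(-Churn),
                              y = data_train$Churn,
                              yname = "Churn")
table(data_train_down$Churn)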

Modeling

Naive Bayes

Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.

Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
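
To make the independence assumption concrete, here is a minimal hand-rolled sketch of how a class prior combines with one feature’s class-conditional probabilities into a posterior. This only mirrors what naiveBayes() does internally across all features; Contract is just an illustrative choice.

# sketch: posterior for one categorical feature under Bayes' rule
prior <- prop.table(table(data_train_up$Churn))
cond  <- prop.table(table(data_train_up$Churn, data_train_up$Contract),
                    margin = 1)
# unnormalized posterior for a month-to-month customer, then normalize
post <- prior * cond[, "Month-to-month"]
post / sum(post)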

After dividing the data into train and test sets and upsampling the train data, let’s create the first model using the Naive Bayes algorithm. To avoid zero probabilities caused by scarce feature values, we use Laplace smoothing (laplace = 1).

# model naive bayes
model_nb <- naiveBayes(Churn ~., data = data_train_up, laplace = 1)
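
A quick way to sanity-check the fit is to peek at the pieces the e1071 object stores: class counts in model_nb$apriori and per-feature conditional tables in model_nb$tables (shown here for one predictor):

# class priors (as counts) and the conditional table for one predictor
model_nb$apriori
model_nb$tables$Contract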

Decision Tree

Decision tree is a type of supervised learning algorithm that can be used in both regression and classification problems. It works for both categorical and continuous input and output variables.

Let’s identify the important terminology of a Decision Tree:

  • Root Node represents the entire population or sample. It further gets divided into two or more homogeneous sets.
  • Splitting is a process of dividing a node into two or more sub-nodes.
  • When a sub-node splits into further sub-nodes, it is called a Decision Node.
  • Nodes that do not split are called a Terminal Node or a Leaf.
  • When you remove sub-nodes of a decision node, this process is called Pruning. The opposite of pruning is Splitting.
  • A sub-section of an entire tree is called Branch.
  • A node, which is divided into sub-nodes is called a parent node of the sub-nodes; whereas the sub-nodes are called the child of the parent node.

Let’s create the decision tree model using the ctree() function.

# model dt
RNGkind(sample.kind = "Rounding")
set.seed(100)
model_dt <- ctree(Churn ~ ., data = data_train_up)

Visualize the decision tree model using the plot() function with type = "simple".

plot(model_dt, type = "simple")

Random Forest

The random forest algorithm is a supervised classification and regression algorithm. As the name suggests, it randomly creates a forest of many trees. Generally, the more trees in the forest, the more robust it is; in a random forest classifier, a higher number of trees tends to give more accurate results.

In simple words, Random Forest builds multiple decision trees (called the forest) and glues them together to get a more accurate and stable prediction. The forest it builds is a collection of decision trees trained with the bagging method. What is the difference between a random forest and the decision tree we used in the modeling before? Here is an explanation with a sample case:

Let’s say that you’re looking to buy a house, but you’re unable to decide which one to buy. So, you consult a few agents and they give you a list of parameters that you should consider before buying a house. The list includes:

  • Price of the house
  • Locality
  • Number of bedrooms
  • Parking space
  • Available facilities

These parameters are known as predictor variables, which are used to find the response variable. A single decision tree would represent this problem as one tree built from all of these parameters.

An important point to note here is that a single decision tree is built on the entire data set, making use of all the predictor variables. A random forest, as mentioned earlier, is an ensemble of decision trees: it randomly selects a set of parameters and creates a decision tree for each chosen set.

Suppose, for example, we create 3 decision trees, each taking only 3 parameters from the entire data set. Each decision tree predicts the outcome based on the predictor variables used in that tree, and the random forest then combines the results from all of the trees.

In simple words, after creating multiple decision trees with this method, each tree votes for a class (in this case, whether or not a house is bought), and the class receiving the most votes by simple majority is the predicted class.

To conclude, a single decision tree is built on the entire data set using all the predictor variables, whereas a random forest creates multiple decision trees, each built on only part of the data set.
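
As a minimal illustration of these ideas, a direct randomForest() call exposes the two key knobs: the number of trees and the number of predictors sampled at each split. This is a sketch only; the model we actually use below is trained through caret.

# sketch: ntree = number of trees, mtry = predictors tried per split
set.seed(100)
rf_sketch <- randomForest(Churn ~ ., data = data_train_up,
                          ntree = 500,
                          mtry  = 4)
rf_sketch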

Next we will build the model using the Random Forest algorithm (via the caret package), specifying the cross-validation setup along with the target variable and the predictors from the upsampled train data. Here I create a random forest model with 5-fold cross validation repeated 3 times, as below:

# set.seed(100)
# ctrl <- trainControl(method = "repeatedcv",
#                      number = 5,
#                      repeats = 3)
# model_forest <- train(Churn ~ .,
#                 data = data_train_up,
#                 method = "rf",
#                 trControl = ctrl)
# 
# saveRDS(model_forest, "assets/model_forest.rds")

The chunk above takes time to process, which is why it is commented out. After creating and saving the model, we can load it back using the readRDS() function, as below:

model_forest <- readRDS("assets/model_forest.rds")
model_forest
Random Forest 

8260 samples
  19 predictor
   2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 6608, 6608, 6608, 6608, 6608, 6608, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.8248991  0.6497982
  12    0.8962066  0.7924132
  23    0.8902744  0.7805488

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 12.

From the information above, the model reaches its highest cross-validated accuracy on the upsampled train data, 89.62%, at the optimal mtry = 12. Next, let’s evaluate all the models we created.

Model Evaluation

Naive Bayes

I will predict on the test data using the Naive Bayes model with the parameter type = "class" to obtain the predicted target class.

pred_naive <- predict(model_nb, data_test, type = "class")
# evaluate the naive bayes model
conf_nb <- confusionMatrix(pred_naive, reference = data_test$Churn, positive = "Yes")
conf_nb
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  740  93
       Yes 293 281
                                             
               Accuracy : 0.7257             
                 95% CI : (0.7015, 0.7488)   
    No Information Rate : 0.7342             
    P-Value [Acc > NIR] : 0.7753             
                                             
                  Kappa : 0.3995             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.7513             
            Specificity : 0.7164             
         Pos Pred Value : 0.4895             
         Neg Pred Value : 0.8884             
             Prevalence : 0.2658             
         Detection Rate : 0.1997             
   Detection Prevalence : 0.4080             
      Balanced Accuracy : 0.7338             
                                             
       'Positive' Class : Yes                
                                             

In this case, we want to maximize the recall (Sensitivity) metric so that our model can detect as many customers who actually churn as possible. From the confusion matrix above, we get a Sensitivity of 75.13% with an Accuracy of 72.57%.
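
As a sanity check, Sensitivity can be recomputed by hand from the confusion matrix counts, since recall = TP / (TP + FN):

# with "Yes" as the positive class: TP = 281 (predicted Yes, actual Yes),
# FN = 93 (predicted No, actual Yes)
281 / (281 + 93)   # ~0.7513, matching the Sensitivity above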

Another way to evaluate model performance is with the ROC curve and AUC value, using the ROCR package as below:

# predict class probabilities on the test data
prob_test <- predict(model_nb, data_test, type = "raw")
head(prob_test)
             No           Yes
[1,] 0.61879545 0.38120454513
[2,] 0.99998711 0.00001288535
[3,] 0.01690873 0.98309126962
[4,] 0.66678895 0.33321105217
[5,] 0.28037194 0.71962805639
[6,] 0.54824980 0.45175020033

To build the ROC curve, we provide our model’s predicted probabilities and the actual classes.

pred_rocr <- prediction(predictions = prob_test[, 2],
                       labels = as.numeric(data_test$Churn))
# to define measurement of x&y axis
perf <- performance(prediction.obj = pred_rocr, measure = "tpr", x.measure = "fpr")
  • tpr (true positive rate) = recall/sensitivity
  • fpr (false positive rate) = 1 - specificity

The ROC curve is a plot of Sensitivity against 1 - Specificity over all possible thresholds. We can then visualize the ROC curve.

plot(perf)

Check the AUC value, as below:

# check AUC value 
auc <- performance(pred_rocr, "auc")

auc@y.values
[[1]]
[1] 0.8059038

The AUC value above indicates that our model’s performance in separating the positive Churn class distribution from the negative one on the test data is 80.6%. I think that is quite good performance for the model.
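
One intuitive reading of AUC is the probability that a randomly chosen churner receives a higher predicted churn score than a randomly chosen non-churner. Here is a quick simulation sketch to check this against our predictions (illustrative only; ties are ignored):

# estimate P(score of a random "Yes" > score of a random "No")
set.seed(100)
yes_scores <- prob_test[data_test$Churn == "Yes", "Yes"]
no_scores  <- prob_test[data_test$Churn == "No",  "Yes"]
mean(sample(yes_scores, 1e5, replace = TRUE) >
     sample(no_scores,  1e5, replace = TRUE))

Next, let’s evaluate the Decision Tree model.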

Decision Tree

Model Decision Tree

I will predict with the decision tree model using the parameter type = "response" to obtain categorical predictions.

# prediction dt model
pred_dt <- predict(model_dt, data_test, type = "response")
# evaluate model dt
conf_dt <- confusionMatrix(as.factor(pred_dt), data_test$Churn, positive = "Yes")
conf_dt
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  765  96
       Yes 268 278
                                             
               Accuracy : 0.7413             
                 95% CI : (0.7176, 0.764)    
    No Information Rate : 0.7342             
    P-Value [Acc > NIR] : 0.2843             
                                             
                  Kappa : 0.422              
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.7433             
            Specificity : 0.7406             
         Pos Pred Value : 0.5092             
         Neg Pred Value : 0.8885             
             Prevalence : 0.2658             
         Detection Rate : 0.1976             
   Detection Prevalence : 0.3881             
      Balanced Accuracy : 0.7419             
                                             
       'Positive' Class : Yes                
                                             

From the confusion matrix above, we get a Sensitivity of 74.33% with an Accuracy of 74.13%. This accuracy is slightly higher than the Naive Bayes model’s, but the sensitivity is slightly lower.

Then check the ROC curve and AUC value of the decision tree model.

prob_test_dt <- predict(model_dt, data_test, type = "prob")
head(prob_test_dt)
          No       Yes
11 0.5765550 0.4234450
16 0.7580645 0.2419355
20 0.2036728 0.7963272
23 0.3671498 0.6328502
27 0.2036728 0.7963272
28 0.2128906 0.7871094
# get the positive class and predict
pred_rocr_dt <- prediction(predictions = prob_test_dt[, 2],
                       labels = as.numeric(data_test$Churn))
# performance
perf_dt <- performance(prediction.obj = pred_rocr_dt, measure = "tpr", x.measure = "fpr")

plot(perf_dt)

# Check AUC value
auc_dt <- performance(pred_rocr_dt, "auc")

auc_dt@y.values[[1]]
[1] 0.8177547

The AUC value above indicates that our model’s performance in separating the positive Churn class distribution from the negative one on the test data is 81.77%, just a bit higher than the Naive Bayes AUC value. Next, let’s see whether pruning improves the decision tree model.

Model Decision Trees & Pruning

As shown above, the unpruned decision tree gives a Sensitivity of 74.33% with an Accuracy of 74.13%: slightly better accuracy than Naive Bayes, but slightly lower sensitivity. Hence we can apply the pruning method to the model we created.

Pruning is part of the decision tree formation process. When a decision tree is grown, some nodes reflect outliers or data noise; applying pruning can reduce their influence on the initial tree and thus increase the accuracy of the classification. The pruning controls have the following criteria:

  • mincriterion: The value is 1 - alpha and works as a “regulator” of tree depth. The smaller the value, the more complex the resulting tree. For example, with mincriterion = 0.8, a node is split only when the test’s p-value is < 0.2.
  • minsplit: The minimum number of observations in a node before splitting. For example, with minsplit = 50, a node is not split if it contains fewer than 50 observations.
  • minbucket: The minimum number of observations in a terminal node. For example, with minbucket = 3, each terminal node formed must contain at least 3 observations.

Let’s create a decision tree model using these pruning controls, as below:

RNGkind(sample.kind = "Rounding")
set.seed(100)
model_dt_prun <- ctree(Churn ~ ., data = data_train_up,
                       control = ctree_control(mincriterion = 0.99,
                                               minsplit = 16,
                                               minbucket = 3))

plot(model_dt_prun, type = "simple")

Predict on both the train and test data to check whether our model is overfitting.

pred_dt_prun_train <- predict(model_dt_prun, data_train_up, type = "response")

pred_dt_prun_test <- predict(model_dt_prun, data_test, type = "response")

Let’s evaluate the model.

conf_dt_prun_train <- confusionMatrix(as.factor(pred_dt_prun_train), data_train_up$Churn, positive = "Yes")
conf_dt_prun_train
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  3089  790
       Yes 1041 3340
                                               
               Accuracy : 0.7783               
                 95% CI : (0.7692, 0.7872)     
    No Information Rate : 0.5                  
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.5567               
                                               
 Mcnemar's Test P-Value : 0.000000005144       
                                               
            Sensitivity : 0.8087               
            Specificity : 0.7479               
         Pos Pred Value : 0.7624               
         Neg Pred Value : 0.7963               
             Prevalence : 0.5000               
         Detection Rate : 0.4044               
   Detection Prevalence : 0.5304               
      Balanced Accuracy : 0.7783               
                                               
       'Positive' Class : Yes                  
                                               
conf_dt_prun <- confusionMatrix(as.factor(pred_dt_prun_test), data_test$Churn, positive = "Yes")
conf_dt_prun
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  755  87
       Yes 278 287
                                             
               Accuracy : 0.7406             
                 95% CI : (0.7168, 0.7633)   
    No Information Rate : 0.7342             
    P-Value [Acc > NIR] : 0.3052             
                                             
                  Kappa : 0.4285             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.7674             
            Specificity : 0.7309             
         Pos Pred Value : 0.5080             
         Neg Pred Value : 0.8967             
             Prevalence : 0.2658             
         Detection Rate : 0.2040             
   Detection Prevalence : 0.4016             
      Balanced Accuracy : 0.7491             
                                             
       'Positive' Class : Yes                
                                             

After trying several values of mincriterion, minsplit, and minbucket, I found that pruning produced slightly better results with mincriterion = 0.99, minsplit = 16, and minbucket = 3. Performance on the test data is a little lower than on the train data: on the train data we get a Sensitivity of 80.87% with an Accuracy of 77.83%, while on the test data we get a Sensitivity of 76.74% with an Accuracy of 74.06%. Still, the test results improve on the first model without pruning, and since the train and test metrics are reasonably close, I conclude that the model is not overfitting.
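
A compact way to eyeball the train/test gap is to line up the key metrics from the two confusion matrix objects created above (a quick sketch):

# compare train vs test metrics side by side
rbind(train = conf_dt_prun_train$byClass[c("Sensitivity", "Specificity")],
      test  = conf_dt_prun$byClass[c("Sensitivity", "Specificity")])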

Then check the ROC curve and AUC value of the pruned decision tree model.

prob_test_dt_prun <- predict(model_dt_prun, data_test, type = "prob")
head(prob_test_dt_prun)
          No       Yes
11 0.5765550 0.4234450
16 0.7580645 0.2419355
20 0.2437870 0.7562130
23 0.3671498 0.6328502
27 0.2437870 0.7562130
28 0.2128906 0.7871094
# get the positive class and predict
pred_rocr_dt_prun <- prediction(predictions = prob_test_dt_prun[, 2],
                       labels = as.numeric(data_test$Churn))
# performance
perf_dt_prun <- performance(prediction.obj = pred_rocr_dt_prun, measure = "tpr", x.measure = "fpr")

plot(perf_dt_prun)

# Check AUC value
auc_dt_prun <- performance(pred_rocr_dt_prun, "auc")

auc_dt_prun@y.values[[1]]
[1] 0.8224565

The AUC value above indicates that our model’s performance in separating the positive Churn class distribution from the negative one on the test data is 82.24%, the highest so far, above both Naive Bayes and the decision tree without pruning. Now let’s continue with the evaluation of the random forest model.

Random Forest

Check our random forest model, which we saved earlier with saveRDS().

model_forest
Random Forest 

8260 samples
  19 predictor
   2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 6608, 6608, 6608, 6608, 6608, 6608, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.8248991  0.6497982
  12    0.8962066  0.7924132
  23    0.8902744  0.7805488

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 12.

As before, the model reaches its highest accuracy on the upsampled train data, 89.62%, at the optimal mtry = 12. Now take a look at the final model summary of the Random Forest using model_forest$finalModel:

model_forest$finalModel

Call:
 randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 12

        OOB estimate of  error rate: 8.62%
Confusion matrix:
      No  Yes class.error
No  3512  618  0.14963680
Yes   94 4036  0.02276029

The OOB (out-of-bag) estimate shows the error on data that each tree did not see during training; from the model summary, we have an 8.62% error rate on this unseen data.
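
To see how the OOB error behaves as trees are added to the forest, we can call plot() on the final randomForest object (a quick sketch; the error curves typically flatten well before 500 trees):

# OOB and per-class error rates as a function of the number of trees
plot(model_forest$finalModel)
legend("topright", colnames(model_forest$finalModel$err.rate),
       lty = 1:3, col = 1:3)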

We can also use the variable importance information to get a list of the important variables in our Random Forest model: call the varImp() function, then pass the result to plot() to get the visualization.

plot(varImp(model_forest))

From the visualization, the top three important variables in our random forest model are TotalCharges, MonthlyCharges, and tenure.
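
If we want the numbers behind the plot rather than the visualization, the importance scores can be pulled out of the varImp() result (a sketch; caret scales the scores to 0-100):

# top variables by importance score
vi <- varImp(model_forest)$importance
head(vi[order(-vi$Overall), , drop = FALSE], 5)

Next, we predict on the test data and evaluate with a custom threshold.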

# predict churn probabilities on the test data
pred <- predict(model_forest, newdata = data_test, type = "prob")
# classify as Yes when the churn probability exceeds a 0.45 cutoff
pred$result <- as.factor(ifelse(pred$Yes > 0.45, "Yes", "No"))
# evaluate against the actual classes
conf_rf <- confusionMatrix(as.factor(pred$result), as.factor(data_test$Churn), positive = "Yes")
conf_rf
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  831 141
       Yes 202 233
                                          
               Accuracy : 0.7562          
                 95% CI : (0.7329, 0.7785)
    No Information Rate : 0.7342          
    P-Value [Acc > NIR] : 0.031981        
                                          
                  Kappa : 0.4063          
                                          
 Mcnemar's Test P-Value : 0.001197        
                                          
            Sensitivity : 0.6230          
            Specificity : 0.8045          
         Pos Pred Value : 0.5356          
         Neg Pred Value : 0.8549          
             Prevalence : 0.2658          
         Detection Rate : 0.1656          
   Detection Prevalence : 0.3092          
      Balanced Accuracy : 0.7137          
                                          
       'Positive' Class : Yes             
                                          

By using a threshold of 0.45, a sensitivity of 62.30% was obtained with an accuracy of 75.62%.
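
The 0.45 cutoff is a judgment call. Here is a small sketch to scan a few candidate thresholds and see the sensitivity/accuracy trade-off (assumes the pred object above):

# scan candidate cutoffs: lower cutoffs trade accuracy for sensitivity
for (cut in c(0.35, 0.45, 0.50)) {
  lab <- factor(ifelse(pred$Yes > cut, "Yes", "No"), levels = c("No", "Yes"))
  cm  <- confusionMatrix(lab, data_test$Churn, positive = "Yes")
  cat(sprintf("cutoff %.2f: sensitivity %.4f, accuracy %.4f\n",
              cut, cm$byClass["Sensitivity"], cm$overall["Accuracy"]))
}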

In addition to using the confusion matrix, we can form the ROC curve along with the AUC value by using the ROCR package as follows:

prob_test_rf <- predict(model_forest, data_test, type = "prob")
pred_rocr_rf <- prediction(prob_test_rf[,2], labels = data_test$Churn)
perf_rf <- performance(prediction.obj = pred_rocr_rf, measure = "tpr", x.measure = "fpr")

plot(perf_rf)

auc_rf <- performance(pred_rocr_rf, measure = "auc")
auc_rf@y.values[[1]]
[1] 0.8009639

The AUC value above indicates that our model’s performance is 80.09% in separating the positive Churn class distribution from the negative on the test data.

Conclusion

Here is a comparison of the models we created: Naive Bayes, Decision Trees, Decision Trees with pruning, and Random Forest.

Model                    Accuracy  Sensitivity
Naive Bayes                0.7257       0.7513
Decision Trees             0.7413       0.7433
Decision Trees Pruning     0.7406       0.7674
Random Forest              0.7562       0.6230

As I mentioned before, in this case we want to maximize the recall (Sensitivity) metric so that our model can detect as many customers who actually churn as possible. In the comparison above the values are fairly close, but the highest Sensitivity, 76.74% (with an Accuracy of 74.06%), comes from the Decision Tree with pruning. That model also produces the highest AUC value among the models, at 82.24%.

The following visualization shows the predicted results for two customers. Both customers have a fairly high chance of churning, and we can also see which variables support and which contradict the model’s prediction.

library(lime)
# keep only the predictors; lime should not see the target
test_x <- data_test %>% 
  dplyr::select(-Churn)

# build an explainer from the test predictors and the random forest model
explainer <- lime(test_x, model_forest)
# explain the "Yes" (churn) prediction for the first two customers
explanation <- lime::explain(test_x[1:2,],
                             explainer, 
                             labels = c("Yes"),
                             n_features = 10)
plot_features(explanation)

We can conclude from the visualization that the strongest reason these two customers have a high chance of churning is that they are on a month-to-month contract; customers with internet service, especially fiber optic, are also more likely to churn.

The models created here are not yet optimal. We could try to obtain better performance, with a higher Sensitivity or Accuracy, by rebuilding the models and searching for better parameters, particularly the cross-validation setup for the random forest model. We could also explore the pruning controls of the decision tree model more deeply to reach the optimal metric.

Resources