Telecommunication Churn Prediction
Introduction
Customer churn is the loss of clients or customers. To avoid losing customers, a company needs to examine why customers have left in the past and which features matter most for determining who will churn in the future. Our task is therefore to predict whether customers are about to churn and to identify the most important features for getting that prediction right. As in most prediction problems, we will use machine learning.
Companies usually define two types of customer churn, namely voluntary and involuntary. Voluntary churn occurs when a customer intentionally quits and switches to another company, while involuntary churn occurs when a customer quits for external reasons such as relocation, death, or other circumstances.
Of the two types, voluntary churn is the one we can realistically analyze and predict, because we can study the customer characteristics visible in customer profiles. This problem can be addressed by creating a machine learning model that predicts whether a customer will churn or not. With such a model, telecommunication companies can take preventive action for customers who have a high chance of churning.
Import Data
library(tidyverse)
library(e1071)
library(caret)
library(ROCR)
library(partykit)
library(rsample)
library(randomForest)
library(inspectdf)
The data used is customer profile data from a telecommunications company, obtained from Kaggle. The dataset contains records for 7043 customers, covering customer demographics, account and payment information, and the services each customer has registered for. From this information, we want to predict whether a customer will churn or not.
# read data
churn <- read.csv("data/telcochurn.csv")
head(churn)
X customerID gender SeniorCitizen Partner Dependents tenure PhoneService
1 1 7590-VHVEG Female 0 Yes No 1 No
2 2 5575-GNVDE Male 0 No No 34 Yes
3 3 3668-QPYBK Male 0 No No 2 Yes
4 4 7795-CFOCW Male 0 No No 45 No
5 5 9237-HQITU Female 0 No No 2 Yes
6 6 9305-CDSKC Female 0 No No 8 Yes
MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
1 No DSL No Yes No
2 No DSL Yes No Yes
3 No DSL Yes Yes No
4 No DSL Yes No Yes
5 No Fiber optic No No No
6 Yes Fiber optic No No Yes
TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
1 No No No Month-to-month Yes
2 No No No One year No
3 No No No Month-to-month Yes
4 Yes No No One year No
5 No No No Month-to-month Yes
6 No Yes Yes Month-to-month Yes
PaymentMethod MonthlyCharges TotalCharges Churn
1 Electronic check 29.85 29.85 No
2 Mailed check 56.95 1889.50 No
3 Mailed check 53.85 108.15 Yes
4 Bank transfer (automatic) 42.30 1840.75 No
5 Electronic check 70.70 151.65 Yes
6 Electronic check 99.65 820.50 Yes
The following is a description of each variable:
- customerID: Customer ID
- gender: Customer gender (female, male)
- SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
- Partner: Whether the customer has a partner or not (Yes, No)
- Dependents: Whether the customer has dependents or not (Yes, No)
- tenure: Number of months the customer has stayed with the company
- PhoneService: Whether the customer has a phone service or not (Yes, No)
- MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
- InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
- OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
- OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
- DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
- TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
- StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
- StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
- Contract: The contract term of the customer (Month-to-month, One year, Two year)
- PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
- PaymentMethod: The customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- MonthlyCharges: The amount charged to the customer monthly
- TotalCharges: The total amount charged to the customer
- Churn: Whether the customer churned or not (Yes or No)
Data Cleansing
All character columns above will be converted to the factor data type. We can use the stringsAsFactors = TRUE parameter of read.csv() so that all character fields are automatically read in as factors.
# change data type into factor
churn <- read.csv("data/telcochurn.csv", stringsAsFactors = TRUE)
glimpse(churn)
Rows: 7,043
Columns: 22
$ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16~
$ customerID <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 9237-~
$ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal~
$ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye~
$ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No~
$ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2~
$ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y~
$ MultipleLines <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye~
$ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o~
$ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,~
$ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N~
$ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y~
$ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,~
$ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye~
$ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes~
$ Contract <fct> Month-to-month, One year, Month-to-month, One year, M~
$ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No~
$ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr~
$ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7~
$ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949~
$ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
The data has 7043 observations and 22 columns. Next, check whether there is any missing data in the dataset:
colSums(is.na(churn))
X customerID gender SeniorCitizen
0 0 0 0
Partner Dependents tenure PhoneService
0 0 0 0
MultipleLines InternetService OnlineSecurity OnlineBackup
0 0 0 0
DeviceProtection TechSupport StreamingTV StreamingMovies
0 0 0 0
Contract PaperlessBilling PaymentMethod MonthlyCharges
0 0 0 0
TotalCharges Churn
11 0
Out of 7043 observations, there are 11 missing values, all in the TotalCharges column. Since this is only a small fraction of the data, we can simply drop those rows. There are also some columns that are not relevant for modeling, namely X and customerID, so we remove these two columns from the dataset. Finally, we convert the SeniorCitizen column, which was previously numeric, to a categorical type.
# remove columns X & customerID, and rows with missing values
churn <- churn %>%
select(-c(X,customerID)) %>%
na.omit()
# assign labels for the SeniorCitizen column
churn$SeniorCitizen <- factor(churn$SeniorCitizen, levels = c("0", "1"), labels = c("No", "Yes"))
We can check again whether the data is clean now.
glimpse(churn)
Rows: 7,032
Columns: 20
$ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal~
$ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N~
$ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye~
$ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No~
$ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2~
$ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y~
$ MultipleLines <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye~
$ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o~
$ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,~
$ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N~
$ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y~
$ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,~
$ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye~
$ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes~
$ Contract <fct> Month-to-month, One year, Month-to-month, One year, M~
$ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No~
$ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr~
$ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7~
$ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949~
$ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
anyNA(churn)
[1] FALSE
The data is now clean, so we can continue to the next steps.
Exploratory Data Analysis
Next, let's explore the data for both the categorical and numeric columns. To find out the class proportions of each categorical variable, we can use the inspect_cat() function from the inspectdf package, as below:
churn %>%
inspect_cat() %>%
show_plot()
From the visualization above, it can be seen that for the Churn target variable the No class outnumbers Yes, while the class proportions of most other variables are fairly balanced.
We can likewise inspect the numeric variables using the inspect_num() function from the inspectdf package, as below:
churn %>%
inspect_num() %>%
show_plot()
From the visualization above, we can see that the distributions of the numeric variables are quite diverse.
Before we create a model, we must divide the dataset into training and test data, here with a proportion of 80% train and 20% test. Make sure to call RNGkind() and set.seed() before splitting, and run them together with the splitting code; set.seed() ensures that you get the same split each time you run the same process with the same seed.
RNGkind(sample.kind = "Rounding")
set.seed(100)
# splitting data
idx <- initial_split(data = churn,
prop = 0.8,
strata = "Churn")
data_train <- training(idx)
data_test <- testing(idx)
Let's check the class proportions of the target variable:
prop.table(table(data_train$Churn))
No Yes
0.7342222 0.2657778
Based on the result above, the target class proportions are imbalanced, so we should balance them before modeling. Resampling is applied only to the training data. There are several methods for resampling, including downsampling, upsampling, and adding data from other available sources. Here I will use upsampling to balance the target class proportions.
RNGkind(sample.kind = "Rounding")
set.seed(100)
# upsampling data train
data_train_up <- upSample(x = data_train %>% select(-Churn),
y = data_train$Churn,
yname = "Churn")
table(data_train_up$Churn)
No Yes
4130 4130
prop.table(table(data_train_up$Churn))
No Yes
0.5 0.5
The target class proportions are now balanced, so we can continue building models using data_train_up.
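For comparison, here is a minimal sketch of the downsampling alternative (not used in the rest of this analysis): caret's downSample() randomly discards majority-class rows instead of replicating minority-class rows.
# downsampling alternative (sketch only): both classes end up with
# as many rows as the original minority class
set.seed(100)
data_train_down <- downSample(x = data_train %>% select(-Churn),
                              y = data_train$Churn,
                              yname = "Churn")
table(data_train_down$Churn)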
Modeling
Naive Bayes
Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.
A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes is known to sometimes outperform even highly sophisticated classification methods.
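To make the "naive" factorization concrete, here is a minimal hand-computed sketch on our upsampled training data (the predictors Contract and SeniorCitizen are chosen purely for illustration): the posterior is proportional to the class prior times the product of per-feature class-conditional probabilities.
# class prior P(Churn)
prior <- prop.table(table(data_train_up$Churn))
# per-feature conditionals P(feature | Churn), one table per predictor
p_contract <- prop.table(table(data_train_up$Churn, data_train_up$Contract), margin = 1)
p_senior <- prop.table(table(data_train_up$Churn, data_train_up$SeniorCitizen), margin = 1)
# unnormalized posterior for a senior citizen on a month-to-month contract
post <- prior * p_contract[, "Month-to-month"] * p_senior[, "Yes"]
post / sum(post) # normalized P(Churn | features) under the independence assumption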
After splitting the data and upsampling the training set, let's create the first model using the Naive Bayes algorithm. To avoid zero probabilities caused by scarce feature levels, we apply Laplace smoothing.
# model naive bayes
model_nb <- naiveBayes(Churn ~., data = data_train_up, laplace = 1)
Decision Tree
A decision tree is a type of supervised learning algorithm that can be used for both regression and classification problems. It works with both categorical and continuous input and output variables.
Let's identify the important terminology of a decision tree:
- Root Node: represents the entire population or sample; it further gets divided into two or more homogeneous sets.
- Splitting: the process of dividing a node into two or more sub-nodes.
- Decision Node: a sub-node that splits into further sub-nodes.
- Terminal Node (Leaf): a node that does not split.
- Pruning: removing sub-nodes of a decision node; the opposite of pruning is splitting.
- Branch: a sub-section of the entire tree.
- Parent and Child: a node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.
Let's create a decision tree model using the ctree() function.
# model dt
RNGkind(sample.kind = "Rounding")
set.seed(100)
model_dt <- ctree(Churn ~ ., data = data_train_up)
We can visualize the decision tree using the plot() function with type = "simple".
plot(model_dt, type = "simple")
Random Forest
The random forest algorithm is a supervised classification and regression algorithm. As the name suggests, it randomly creates a forest of several trees. Generally, the more trees in the forest, the more robust the predictions; in a random forest classifier, a higher number of trees tends to give more accurate results.
In simple words, a random forest builds multiple decision trees (the forest) and combines them to get a more accurate and stable prediction. The forest it builds is a collection of decision trees trained with the bagging method. What is the difference between a random forest and the single decision tree we modeled before? Here is an explanation with a sample case:
Let’s say that you’re looking to buy a house, but you’re unable to decide which one to buy. So, you consult a few agents and they give you a list of parameters that you should consider before buying a house. The list includes:
- Price of the house
- Locality
- Number of bedrooms
- Parking space
- Available facilities
These parameters are known as predictor variables, which are used to find the response variable. A single decision tree would represent this problem statement as one tree built from all of these predictors.
An important point to note here is that a single decision tree is built on the entire data set, using all the predictor variables. Now let's see how a random forest approaches the same problem. As mentioned earlier, a random forest is an ensemble of decision trees: it randomly selects a set of parameters and creates a decision tree for each set of chosen parameters.
For example, imagine 3 decision trees where each tree uses only 3 parameters from the entire data set. Each decision tree predicts the outcome based on the predictor variables used in that tree, and the random forest then combines the results from all of the trees.
In simple words, after creating multiple decision trees using this method, each tree votes for a class (in this case, whether or not a house is bought), and the class receiving the most votes by simple majority is the predicted class.
To conclude: a single decision tree is built on the entire data set using all the predictor variables, whereas a random forest builds multiple decision trees, each using only a part of the data set.
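The randomForest() function exposes both ideas directly: bootstrap samples of the rows (bagging) and a random subset of mtry predictors considered at each split. A minimal standalone sketch, with ntree and mtry chosen arbitrarily for illustration (this is not the tuned model trained below):
set.seed(100)
rf_sketch <- randomForest(Churn ~ ., data = data_train_up,
                          ntree = 100, # number of bagged trees
                          mtry = 4)    # predictors sampled at each split
rf_sketch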
Next we will build the model with the Random Forest algorithm via the caret package, specifying the cross-validation scheme and number of repeats, along with the target variable and the predictors from data_train_up. Here I create a random forest model with 5-fold cross-validation repeated 3 times, as below:
# set.seed(100)
# ctrl <- trainControl(method = "repeatedcv",
# number = 5,
# repeats = 3)
# model_forest <- train(Churn ~ .,
# data = data_train_up,
# method = "rf",
# trControl = ctrl)
#
# saveRDS(model_forest, "assets/model_forest.rds")
The chunk above takes a long time to run, so the model was trained once and saved. We can load the saved model back with the readRDS() function, as below:
model_forest <- readRDS("assets/model_forest.rds")
model_forest
Random Forest
8260 samples
19 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 6608, 6608, 6608, 6608, 6608, 6608, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8248991 0.6497982
12 0.8962066 0.7924132
23 0.8902744 0.7805488
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 12.
From the information above, the best cross-validated accuracy on the upsampled training data is 89.62%, at the optimum mtry = 12. Next, let's evaluate all the models we've created.
Model Evaluation
Naive Bayes
First, I predict on the test data using the Naive Bayes model with type = "class" to obtain the predicted class labels.
pred_naive <- predict(model_nb, data_test, type = "class")
# evaluate model nb
conf_nb <- confusionMatrix(pred_naive, reference = data_test$Churn, positive = "Yes")
conf_nb
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 740 93
Yes 293 281
Accuracy : 0.7257
95% CI : (0.7015, 0.7488)
No Information Rate : 0.7342
P-Value [Acc > NIR] : 0.7753
Kappa : 0.3995
Mcnemar's Test P-Value : <0.0000000000000002
Sensitivity : 0.7513
Specificity : 0.7164
Pos Pred Value : 0.4895
Neg Pred Value : 0.8884
Prevalence : 0.2658
Detection Rate : 0.1997
Detection Prevalence : 0.4080
Balanced Accuracy : 0.7338
'Positive' Class : Yes
In this case, we want to maximize recall (Sensitivity) so that our model detects as many customers who actually churn as possible. From the confusion matrix above, we get a Sensitivity of 75.13% with an Accuracy of 72.57%.
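As a side note, these individual metrics can also be extracted programmatically from the caret confusionMatrix object, which is convenient when comparing several models later:
# extract single metrics from the confusionMatrix object
conf_nb$byClass["Sensitivity"]
conf_nb$overall["Accuracy"]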
Another way to evaluate model performance is with the ROC curve and AUC value from the ROCR package, as below:
# predicted probabilities on the test data
prob_test <- predict(model_nb, data_test, type = "raw")
head(prob_test)
No Yes
[1,] 0.61879545 0.38120454513
[2,] 0.99998711 0.00001288535
[3,] 0.01690873 0.98309126962
[4,] 0.66678895 0.33321105217
[5,] 0.28037194 0.71962805639
[6,] 0.54824980 0.45175020033
As required for the ROC curve, we provide the model's predicted probabilities and the actual classes.
pred_rocr <- prediction(predictions = prob_test[, 2],
                       labels = as.numeric(data_test$Churn))
# define the measurements for the x & y axes
perf <- performance(prediction.obj = pred_rocr, measure = "tpr", x.measure = "fpr")
- tpr (true positive rate) = recall/sensitivity
- fpr (false positive rate) = 1 - specificity
The ROC curve plots sensitivity against 1 - specificity over all possible thresholds. We can now visualize the ROC curve.
plot(perf)
Then check the AUC value, as below:
# check AUC value
auc <- performance(pred_rocr, "auc")
auc@y.values[[1]]
[1] 0.8059038
The AUC value indicates that our model achieves 80.6% separation between the positive and negative Churn class distributions on the test data, which is quite good. Next, let's evaluate the decision tree model.
Decision Tree
Model Decision Tree
I will generate predictions from the decision tree model using type = "response" to get categorical predictions.
# prediction dt model
pred_dt <- predict(model_dt, data_test, type = "response")
# evaluate model dt
conf_dt <- confusionMatrix(as.factor(pred_dt), data_test$Churn, positive = "Yes")
conf_dt
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 765 96
Yes 268 278
Accuracy : 0.7413
95% CI : (0.7176, 0.764)
No Information Rate : 0.7342
P-Value [Acc > NIR] : 0.2843
Kappa : 0.422
Mcnemar's Test P-Value : <0.0000000000000002
Sensitivity : 0.7433
Specificity : 0.7406
Pos Pred Value : 0.5092
Neg Pred Value : 0.8885
Prevalence : 0.2658
Detection Rate : 0.1976
Detection Prevalence : 0.3881
Balanced Accuracy : 0.7419
'Positive' Class : Yes
From the confusion matrix above, we get a Sensitivity of 74.33% with an Accuracy of 74.13%. The accuracy is slightly higher than the Naive Bayes model's, but the sensitivity is slightly lower.
Then check the ROC curve and AUC value of the decision tree model.
prob_test_dt <- predict(model_dt, data_test, type = "prob")
head(prob_test_dt)
No Yes
11 0.5765550 0.4234450
16 0.7580645 0.2419355
20 0.2036728 0.7963272
23 0.3671498 0.6328502
27 0.2036728 0.7963272
28 0.2128906 0.7871094
# get the positive class and predict
pred_rocr_dt <- prediction(predictions = prob_test_dt[, 2],
                          labels = as.numeric(data_test$Churn))
# performance
perf_dt <- performance(prediction.obj = pred_rocr_dt, measure = "tpr", x.measure = "fpr")
plot(perf_dt)
# check AUC value
auc_dt <- performance(pred_rocr_dt, "auc")
auc_dt@y.values[[1]]
[1] 0.8177547
The AUC value above indicates 81.77% separation between the positive and negative Churn classes on the test data, just a bit higher than the Naive Bayes AUC. Next, let's see whether pruning improves the decision tree.
Model Decision Trees & Pruning
As shown above, the unpruned tree achieved a Sensitivity of 74.33% and an Accuracy of 74.13%: slightly higher accuracy but slightly lower sensitivity than Naive Bayes. Hence we can apply the pruning method to the model we created.
Pruning is part of the decision tree formation process. When forming a decision tree, some nodes reflect outliers or noise in the data. Applying pruning can reduce the influence of these outliers and noise, which can increase classification accuracy. The pruning controls have the following criteria:
- mincriterion: equals 1 - alpha and acts as a "regulator" of tree depth; the smaller the value, the more complex the resulting tree. For example, with mincriterion = 0.8, a p-value < 0.2 is required to split a node.
- minsplit: the minimum number of observations in a node before splitting. For example, with minsplit = 50, a node will not be split if it contains fewer than 50 observations.
- minbucket: the minimum number of observations in a terminal node. For example, with minbucket = 3, every terminal node formed must have at least 3 observations.
Let's create a pruned decision tree model, as below:
RNGkind(sample.kind = "Rounding")
set.seed(100)
model_dt_prun <- ctree(Churn ~ ., data = data_train_up,
control = ctree_control(mincriterion = 0.99,
minsplit = 16,
minbucket = 3))
plot(model_dt_prun, type = "simple")
We predict on both the training and the test data to check whether our model overfits.
pred_dt_prun_train <- predict(model_dt_prun, data_train_up, type = "response")
pred_dt_prun_test <- predict(model_dt_prun, data_test, type = "response")
Let's evaluate the model.
conf_dt_prun_train <- confusionMatrix(as.factor(pred_dt_prun_train), data_train_up$Churn, positive = "Yes")
conf_dt_prun_train
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 3089 790
Yes 1041 3340
Accuracy : 0.7783
95% CI : (0.7692, 0.7872)
No Information Rate : 0.5
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.5567
Mcnemar's Test P-Value : 0.000000005144
Sensitivity : 0.8087
Specificity : 0.7479
Pos Pred Value : 0.7624
Neg Pred Value : 0.7963
Prevalence : 0.5000
Detection Rate : 0.4044
Detection Prevalence : 0.5304
Balanced Accuracy : 0.7783
'Positive' Class : Yes
conf_dt_prun <- confusionMatrix(as.factor(pred_dt_prun_test), data_test$Churn, positive = "Yes")
conf_dt_prun
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 755 87
Yes 278 287
Accuracy : 0.7406
95% CI : (0.7168, 0.7633)
No Information Rate : 0.7342
P-Value [Acc > NIR] : 0.3052
Kappa : 0.4285
Mcnemar's Test P-Value : <0.0000000000000002
Sensitivity : 0.7674
Specificity : 0.7309
Pos Pred Value : 0.5080
Neg Pred Value : 0.8967
Prevalence : 0.2658
Detection Rate : 0.2040
Detection Prevalence : 0.4016
Balanced Accuracy : 0.7491
'Positive' Class : Yes
After trying several values of mincriterion, minsplit, and minbucket, pruning produced slightly better results with mincriterion = 0.99, minsplit = 16, and minbucket = 3. Performance on the test data is a little lower than on the training data: on the training data we get a Sensitivity of 80.87% with an Accuracy of 77.83%, while on the test data we get a Sensitivity of 76.74% with an Accuracy of 74.06%. Still, the test results improve on the unpruned model, and since the train and test metrics are reasonably close, there is no sign of overfitting. The train/test comparison can also be tabulated directly, as sketched below.
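A minimal sketch that puts the train and test metrics side by side, using the confusionMatrix objects created above:
# tabulate train vs test metrics to eyeball overfitting
data.frame(split = c("train", "test"),
           accuracy = c(unname(conf_dt_prun_train$overall["Accuracy"]),
                        unname(conf_dt_prun$overall["Accuracy"])),
           sensitivity = c(unname(conf_dt_prun_train$byClass["Sensitivity"]),
                           unname(conf_dt_prun$byClass["Sensitivity"])))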
Then check ROC curve and AUC value of decision tree pruning model.
prob_test_dt_prun <- predict(model_dt_prun, data_test, type = "prob")
head(prob_test_dt_prun)
No Yes
11 0.5765550 0.4234450
16 0.7580645 0.2419355
20 0.2437870 0.7562130
23 0.3671498 0.6328502
27 0.2437870 0.7562130
28 0.2128906 0.7871094
# get the positive class and predict
pred_rocr_dt_prun <- prediction(predictions = prob_test_dt_prun[, 2],
                               labels = as.numeric(data_test$Churn))
# performance
perf_dt_prun <- performance(prediction.obj = pred_rocr_dt_prun, measure = "tpr", x.measure = "fpr")
plot(perf_dt_prun)
# check AUC value
auc_dt_prun <- performance(pred_rocr_dt_prun, "auc")
auc_dt_prun@y.values[[1]]
[1] 0.8224565
The AUC value above indicates 82.24% separation between the positive and negative Churn classes on the test data, the highest so far, above both Naive Bayes and the unpruned decision tree. Now let's check the evaluation of the random forest model.
Random Forest
Check the random forest model that we saved earlier with saveRDS().
model_forest
Random Forest
8260 samples
19 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 6608, 6608, 6608, 6608, 6608, 6608, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8248991 0.6497982
12 0.8962066 0.7924132
23 0.8902744 0.7805488
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 12.
As seen before, the best cross-validated accuracy on the upsampled training data is 89.62% at the optimum mtry = 12. Then, take a look at the final model summary using model_forest$finalModel:
model_forest$finalModel
Call:
randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 12
OOB estimate of error rate: 8.62%
Confusion matrix:
No Yes class.error
No 3512 618 0.14963680
Yes 94 4036 0.02276029
The OOB (out-of-bag) estimate reports the error on observations each tree did not see during training; from the model summary, this error on unseen data is 8.62%.
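As a quick check, randomForest stores the OOB error per number of trees in finalModel$err.rate; the last entry of the OOB column corresponds to the 8.62% quoted in the summary above.
# OOB error after the final (500th) tree
tail(model_forest$finalModel$err.rate[, "OOB"], 1)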
We can also use the variable importance information to get a list of the important variables in our random forest model: call the varImp() function, then pass the result to plot() for a visualization.
plot(varImp(model_forest))
From the visualization, the top three important variables in our random forest model are TotalCharges, MonthlyCharges, and tenure.
pred <- predict(model_forest, newdata = data_test, type = "prob")
pred$result <- as.factor(ifelse(pred$Yes > 0.45, "Yes", "No"))
conf_rf <- confusionMatrix(as.factor(pred$result), as.factor(data_test$Churn), positive = "Yes")
conf_rfConfusion Matrix and Statistics
Reference
Prediction No Yes
No 831 141
Yes 202 233
Accuracy : 0.7562
95% CI : (0.7329, 0.7785)
No Information Rate : 0.7342
P-Value [Acc > NIR] : 0.031981
Kappa : 0.4063
Mcnemar's Test P-Value : 0.001197
Sensitivity : 0.6230
Specificity : 0.8045
Pos Pred Value : 0.5356
Neg Pred Value : 0.8549
Prevalence : 0.2658
Detection Rate : 0.1656
Detection Prevalence : 0.3092
Balanced Accuracy : 0.7137
'Positive' Class : Yes
By using a threshold of 0.45 instead of the default 0.5, we obtain a sensitivity of 62.30% with an accuracy of 75.62%.
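Lowering the threshold trades specificity for sensitivity. A minimal sketch (the threshold grid here is an illustrative assumption) that scans candidate cut-offs to show this trade-off:
# sensitivity/accuracy across a grid of classification thresholds
ths <- seq(0.30, 0.60, by = 0.05)
sapply(ths, function(t) {
  cls <- factor(ifelse(pred$Yes > t, "Yes", "No"), levels = c("No", "Yes"))
  cm <- confusionMatrix(cls, data_test$Churn, positive = "Yes")
  c(threshold = t,
    sensitivity = unname(cm$byClass["Sensitivity"]),
    accuracy = unname(cm$overall["Accuracy"]))
})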
In addition to the confusion matrix, we can build the ROC curve and AUC value with the ROCR package as follows:
prob_test_rf <- predict(model_forest, data_test, type = "prob")
pred_rocr_rf <- prediction(prob_test_rf[,2], labels = data_test$Churn)
perf_rf <- performance(prediction.obj = pred_rocr_rf, measure = "tpr", x.measure = "fpr")
plot(perf_rf)
auc_rf <- performance(pred_rocr_rf, measure = "auc")
auc_rf@y.values[[1]]
[1] 0.8009639
The AUC value above indicates 80.09% separation between the positive and negative Churn classes on the test data.
Conclusion
Here is a comparison of the models we created: Naive Bayes, Decision Trees, Decision Trees with pruning, and Random Forest.
| Model | Accuracy | Sensitivity |
|---|---|---|
| Naive Bayes | 0.7257 | 0.7513 |
| Decision Trees | 0.7413 | 0.7433 |
| Decision Trees Pruning | 0.7406 | 0.7674 |
| Random Forest | 0.7562 | 0.6230 |
As mentioned before, in this case we want to maximize recall (Sensitivity) so that our model detects as many customers who actually churn as possible. The values in the comparison above are fairly close, but the highest Sensitivity is 76.74% (with an Accuracy of 74.06%), from the Decision Tree with pruning. The pruned decision tree also produces the highest AUC among the models, at 82.24%.
The following visualization shows the prediction results for two customers. Both customers have a fairly high chance of churning, and we can also see which variables support and which contradict the model's prediction.
library(lime)
test_x <- data_test %>%
dplyr::select(-Churn)
explainer <- lime(test_x, model_forest)
explanation <- lime::explain(test_x[1:2,],
explainer,
labels = c("Yes"),
n_features = 10)
plot_features(explanation)
We can conclude from the visualization that the strongest reason these two customers have a high chance of churn is that they are on a month-to-month contract, and that customers with internet service, especially fiber optic, are more likely to churn.
This model is not yet optimal. We could try to obtain better performance, with higher Sensitivity or Accuracy, by re-tuning the random forest (in particular the resampling scheme and the mtry values searched) and by deeper pruning of the decision tree, as sketched below.
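A minimal sketch of one such extension; the settings here are illustrative assumptions, and the chunk is left commented out because, like the original training chunk, it is slow to run:
# ctrl_big <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
# model_forest_tuned <- train(Churn ~ .,
#                             data = data_train_up,
#                             method = "rf",
#                             tuneGrid = data.frame(mtry = c(4, 8, 12, 16)),
#                             trControl = ctrl_big)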
Resources
- Reference : edureka
- Dataset: Kaggle: Telco Customer Churn