Customer churn, also known as customer attrition, occurs when customers stop doing business with a company or stop using its services. By monitoring the churn rate, companies can measure how well they retain customers and identify strategies for improvement. A machine learning model can help reveal the customer behaviors and attributes that signal the risk of churn. In this project, I predict Telco customer churn based on customer behavior, using a dataset taken from Kaggle.
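For reference, the code in this post assumes the following setup. The package list is inferred from the functions used below, and the CSV file name is the standard one from the Kaggle Telco churn dataset, so both are assumptions that may differ from the original environment.
library(dplyr)         # data wrangling (select, mutate, group_by)
library(ggplot2)       # plots
library(stringr)       # str_wrap() for long axis labels
library(cowplot)       # plot_grid()
library(DT)            # datatable()
library(e1071)         # naiveBayes()
library(partykit)      # ctree()
library(caret)         # train(), confusionMatrix(), createDataPartition()
library(randomForest)  # random forest backend for caret
library(ROCR)          # prediction(), performance()
library(animation)     # cv.ani() demo
# assumed file name, as distributed on Kaggle
churn <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv", stringsAsFactors = TRUE)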
# drop the ID column and convert SeniorCitizen (coded 0/1) to a factor
churn <- churn %>%
  select(-customerID) %>%
  mutate(SeniorCitizen = as.factor(SeniorCitizen))
# interactive preview of the cleaned data
datatable(churn, rownames = FALSE, filter = "top",
          options = list(pageLength = 6, scrollX = TRUE))
churn %>%
  group_by(Churn) %>%
  summarise(Number = n()) %>%
  mutate(Percent = prop.table(Number) * 100) %>%
  ggplot(aes(Churn, Percent)) +
  geom_col(aes(fill = Churn)) +
  geom_text(aes(label = sprintf("%.2f%%", Percent)), vjust = -0.5, size = 4) +
  labs(title = "Churn Percentage") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))  # placed after theme_minimal() so the centering is not overridden
From the plot, we can see that 26.54% of customers churned, while the remaining 73.46% did not.
churn %>%
  group_by(tenure, Churn) %>%
  summarise(Number = n()) %>%
  ggplot(aes(tenure, Number)) +
  geom_line(aes(col = Churn)) +
  labs(x = "Tenure (month)",
       y = "Number of Customers",
       title = "Churn Based on Tenure") +
  scale_x_continuous(breaks = seq(0, 70, 10)) +
  theme_minimal()
From the graph, we can see that churn is highest within the first month of tenure.
plot_grid(
  ggplot(churn, aes(x = gender, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = SeniorCitizen, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = Partner, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = Dependents, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage") +
    theme_minimal() +
    scale_x_discrete(labels = function(x) str_wrap(x, width = 10)),
  align = "h"
)
From the customer characteristics, several insights emerge:
Male and female customers churn at roughly the same rate.
Senior citizens are more likely to churn.
Customers without a partner are more likely to churn than customers who have one.
Customers without dependents are more likely to churn than customers with dependents.
plot_grid(
  ggplot(churn, aes(x = InternetService, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = OnlineSecurity, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = OnlineBackup, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = DeviceProtection, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = PhoneService, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = MultipleLines, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = TechSupport, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = StreamingTV, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage") +
    theme_minimal() +
    scale_x_discrete(labels = function(x) str_wrap(x, width = 10)),
  align = "h"
)
From the services used by the customers, several insights emerge:
Customers with fiber optic internet service are more likely to churn than others.
Customers who do not use online security, online backup, device protection, streaming TV, or tech support are more likely to churn than others.
Next, inspect the correlation between the numerical columns of the dataset. Because TotalCharges is strongly correlated with both tenure and MonthlyCharges, the TotalCharges column can be dropped, as sketched below.
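A minimal sketch of this check; the exact calls are assumed rather than taken from the original code.
# correlation matrix of the numeric columns; TotalCharges contains a few NAs
churn %>%
  select(tenure, MonthlyCharges, TotalCharges) %>%
  cor(use = "complete.obs")
# drop the redundant column
churn <- churn %>% select(-TotalCharges)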
In this project, I use three machine learning models: Naive Bayes, Decision Tree, and Random Forest.
To build the models, I split the dataset into training and test data: the training data is used to fit the models, and the test data serves as unseen data for evaluating the predictions.
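The split code itself is not shown in the original, but the reported row counts (5,636 train / 1,407 test, with identical churn prevalence in both) are consistent with a stratified 80/20 split such as caret's createDataPartition; the seed is an assumption.
set.seed(123)  # assumed seed, for reproducibility
idx <- createDataPartition(churn$Churn, p = 0.8, list = FALSE)
churn_train <- churn[idx, ]
churn_test  <- churn[-idx, ]
test_label  <- churn_test$Churn  # used later by ROCR::prediction()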
# fit a Naive Bayes classifier on the training data
model_naive <- naiveBayes(formula = Churn ~ ., data = churn_train)
# class predictions on the train and test sets
predict1 <- predict(object = model_naive, newdata = churn_train)
predict2 <- predict(object = model_naive, newdata = churn_test)
Next, create confusion matrices for the train and test data, to check whether the model overfits.
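The confusionMatrix() calls producing the output below are not shown in the original; presumably they follow the same pattern as in the decision tree section:
# for data train
confusionMatrix(data = predict1, reference = churn_train$Churn, positive = "Yes")
# for data test
confusionMatrix(data = predict2, reference = churn_test$Churn, positive = "Yes")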
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 2844 284
Yes 1296 1212
Accuracy : 0.7197
95% CI : (0.7077, 0.7314)
No Information Rate : 0.7346
P-Value [Acc > NIR] : 0.9944
Kappa : 0.4088
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8102
Specificity : 0.6870
Pos Pred Value : 0.4833
Neg Pred Value : 0.9092
Prevalence : 0.2654
Detection Rate : 0.2150
Detection Prevalence : 0.4450
Balanced Accuracy : 0.7486
'Positive' Class : Yes
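As a sanity check on where these numbers come from: with Yes as the positive class, sensitivity = TP / (TP + FN) = 1212 / (1212 + 284) = 0.8102, and specificity = TN / (TN + FP) = 2844 / (2844 + 1296) = 0.6870, matching the summary above. The corresponding output for the test data: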
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 712 58
Yes 322 315
Accuracy : 0.7299
95% CI : (0.7059, 0.753)
No Information Rate : 0.7349
P-Value [Acc > NIR] : 0.6761
Kappa : 0.4347
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8445
Specificity : 0.6886
Pos Pred Value : 0.4945
Neg Pred Value : 0.9247
Prevalence : 0.2651
Detection Rate : 0.2239
Detection Prevalence : 0.4527
Balanced Accuracy : 0.7665
'Positive' Class : Yes
Visualize ROC curve for Naive-Bayes Model
ROC is a curve that illustrates how well the model separates the positive and negative classes. The plot has two axes:
x: False Positive Rate (1 - specificity)
y: True Positive Rate (sensitivity, i.e. recall)
# predicted churn probabilities on the test set
pred_prob_naive <- predict(object = model_naive, newdata = churn_test, type = "raw")
# ROCR objects: column 2 holds P(Churn = "Yes")
pred.naive <- prediction(pred_prob_naive[,2], labels = test_label)
perf.naive <- performance(prediction.obj = pred.naive, measure = "tpr", "fpr")
plot(perf.naive, colorize = T)
Checking AUC (Area under the curve) value
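The AUC call itself is not shown in the original; with ROCR it is presumably computed like this (the same pattern applies to the decision tree and random forest models below):
auc_naive <- performance(prediction.obj = pred.naive, measure = "auc")
auc_naive@y.values[[1]]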
[1] 0.8412656
# conditional inference tree; mincriterion = 0.95 requires each split to be
# statistically significant at the 1 - 0.95 = 0.05 level
model_dt <- ctree(formula = Churn ~ ., data = churn_train,
                  control = ctree_control(mincriterion = 0.95))
plot(model_dt, type = "simple")
Make predictions with the decision tree model
pred_train_dt <- predict(object = model_dt, newdata = churn_train)
pred_test_dt <- predict(object = model_dt, newdata = churn_test)
Create confusion matrices for the train and test data.
# for data train
confusionMatrix(data = pred_train_dt, reference = churn_train$Churn, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 3818 816
Yes 322 680
Accuracy : 0.7981
95% CI : (0.7874, 0.8085)
No Information Rate : 0.7346
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4212
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.4545
Specificity : 0.9222
Pos Pred Value : 0.6786
Neg Pred Value : 0.8239
Prevalence : 0.2654
Detection Rate : 0.1207
Detection Prevalence : 0.1778
Balanced Accuracy : 0.6884
'Positive' Class : Yes
# for data test
confusionMatrix(data = pred_test_dt, reference = churn_test$Churn, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 937 196
Yes 97 177
Accuracy : 0.7918
95% CI : (0.7696, 0.8127)
No Information Rate : 0.7349
P-Value [Acc > NIR] : 4.327e-07
Kappa : 0.416
Mcnemar's Test P-Value : 1.033e-08
Sensitivity : 0.4745
Specificity : 0.9062
Pos Pred Value : 0.6460
Neg Pred Value : 0.8270
Prevalence : 0.2651
Detection Rate : 0.1258
Detection Prevalence : 0.1947
Balanced Accuracy : 0.6904
'Positive' Class : Yes
Comparing the two decision tree confusion matrices, the model is slightly more accurate on the training data than on the test data, which is expected because the model was fit on the training data. Since the difference is small, the model does not appear to overfit. The matrices show a poor sensitivity score but a good specificity score, meaning the decision tree model is good at identifying customers who do not churn but weak at catching the ones who do.
Visualize ROC curve for Decision Tree Model
pred_prob_dt <- predict(object = model_dt, newdata = churn_test, type = "prob")
pred.dt <- prediction(pred_prob_dt[,2], labels = test_label)
perf.dt <- performance(prediction.obj = pred.dt, measure = "tpr", "fpr")
plot(perf.dt, colorize = T)
Checking AUC (Area under the curve) value
[1] 0.8365921
Next, I create a random forest model on the training data with k-fold cross-validation, using 5 folds and 3 repetitions.
# training is slow, so the call is kept commented out here
# ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# model_forest <- train(Churn ~ ., data = churn_train, method = "rf", trControl = ctrl)
Below is a demonstration of k-fold cross-validation, animated with the animation package:
ani.options(interval = 1, nmax = 25)
cv.ani(main = "Demonstration of the k-fold Cross Validation", bty = "l")
Random Forest
5636 samples
18 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 4509, 4508, 4509, 4509, 4509, 4508, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.7884444 0.3429087
15 0.7947143 0.4342739
29 0.7921121 0.4299515
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 15.
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 15
OOB estimate of error rate: 20.53%
Confusion matrix:
No Yes class.error
No 3716 424 0.1024155
Yes 733 763 0.4899733
The model shows the best accuracy at mtry = 15, i.e. with 15 candidate predictors randomly sampled at each split.
Next, evaluate the random forest model on the train and test data.
# for data train
pred_train_rf <- predict(model_forest, churn_train)
confusionMatrix(data = pred_train_rf, reference = churn_train$Churn, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 4051 183
Yes 89 1313
Accuracy : 0.9517
95% CI : (0.9458, 0.9572)
No Information Rate : 0.7346
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8737
Mcnemar's Test P-Value : 1.711e-08
Sensitivity : 0.8777
Specificity : 0.9785
Pos Pred Value : 0.9365
Neg Pred Value : 0.9568
Prevalence : 0.2654
Detection Rate : 0.2330
Detection Prevalence : 0.2488
Balanced Accuracy : 0.9281
'Positive' Class : Yes
# for data test
churn_pred <- predict(model_forest, churn_test)
confusionMatrix(data = churn_pred, reference = churn_test$Churn, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 1004 37
Yes 30 336
Accuracy : 0.9524
95% CI : (0.9399, 0.9629)
No Information Rate : 0.7349
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8771
Mcnemar's Test P-Value : 0.4635
Sensitivity : 0.9008
Specificity : 0.9710
Pos Pred Value : 0.9180
Neg Pred Value : 0.9645
Prevalence : 0.2651
Detection Rate : 0.2388
Detection Prevalence : 0.2601
Balanced Accuracy : 0.9359
'Positive' Class : Yes
From the random forest model, we can also inspect the most important variables used across each fold and repetition, as sketched below.
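The importance output itself is not shown in the original; with a caret-trained model it would presumably be produced along these lines:
# variable importance from the final random forest
varImp(model_forest)
plot(varImp(model_forest), top = 10)  # top 10 predictors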
Visualize ROC Curve for Random Forest Model
pred_prob_rf <- predict(object = model_forest, newdata = churn_test, type = "prob")
pred.forest <- prediction(pred_prob_rf[,2], labels = test_label)
perf.rf <- performance(prediction.obj = pred.forest, measure = "tpr", "fpr")
plot(perf.rf, colorize = T)
Checking AUC (Area under the curve) value
[1] 0.9863787
From the three models, I obtained the following accuracy, recall, and AUC on the test data:
Naive Bayes model: Accuracy = 72.99%, Recall = 84.45%, AUC = 84.13%
Decision Tree model: Accuracy = 79.18%, Recall = 47.45%, AUC = 83.66%
Random Forest model: Accuracy = 95.24%, Recall = 90.08%, AUC = 98.64%
I conclude that the random forest model gives the best predictions among the three, scoring highest on accuracy, recall, and AUC.