Customer churn, also known as customer attrition, is the loss of clients or customers. Churn is an important business metric for subscription-based services such as telecommunications companies. This project demonstrates a churn analysis using data downloaded from the IBM sample data sets. We will use the R statistical programming language to identify variables associated with customer churn.
In this project, we will carry out the following tasks:
1. Load the data and the relevant R libraries.
2. Preprocess the data with various cleaning and recoding techniques.
3. Provide data visualizations of descriptive statistics of the data.
4. Fit models using statistical classification methods commonly used in churn analysis:
   - Decision tree analysis
   - Random forest analysis
   - Logistic regression
5. Examine additional data visualizations of selected variables based on our modeling results.
library(plyr)
library(rpart.plot)
library(caret)
library(gridExtra)
library(tidyverse)
library(rsample)
library(e1071)
library(GGally)
library(data.table)
library(DT)
library(readr)
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrplot)
library(rms)
library(MASS)
library(ROCR)
library(gplots)
library(pROC)
library(rpart)
library(randomForest)
library(ggpubr)
setwd("C:/Users/Asus/Documents/DataQuest/Algoritma Data Science/R/My Projects/Telco Churn")
## [1] "C:/Users/Asus/Documents/DataQuest/Algoritma Data Science/R/My Projects/Telco Churn"
churn <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
glimpse(churn)
## Observations: 7,043
## Variables: 21
## $ customerID <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 92...
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Fe...
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No,...
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No,...
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49...
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes...
## $ MultipleLines <fct> No phone service, No, No, No phone service, No, Ye...
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fibe...
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, ...
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No...
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No...
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No i...
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No ...
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No i...
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year...
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes,...
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank...
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 2...
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1...
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No...
The IBM sample data set website provides a data dictionary describing each of these variables. Before exploring the data, let's check each variable for missing values.
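The counts below come from a simple NA check; a sketch of the likely chunk:
sapply(churn, function(x) sum(is.na(x)))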
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
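All 11 missing values fall in the TotalCharges variable. We can inspect those rows (a sketch of the likely chunk):
churn[is.na(churn$TotalCharges), ]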
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 489 4472-LVYGI Female 0 Yes Yes 0 No
## 754 3115-CZMZD Male 0 No Yes 0 Yes
## 937 5709-LVOEQ Female 0 Yes Yes 0 Yes
## 1083 4367-NUYAO Male 0 Yes Yes 0 Yes
## 1341 1371-DWPAZ Female 0 Yes Yes 0 No
## 3332 7644-OMVMY Male 0 Yes Yes 0 Yes
## 3827 3213-VVOLG Male 0 Yes Yes 0 Yes
## 4381 2520-SGTTA Female 0 Yes Yes 0 Yes
## 5219 2923-ARZLG Male 0 Yes Yes 0 Yes
## 6671 4075-WKNIU Female 0 Yes Yes 0 Yes
## 6755 2775-SEFEE Male 0 No Yes 0 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup
## 489 No phone service DSL Yes No
## 754 No No No internet service No internet service
## 937 No DSL Yes Yes
## 1083 Yes No No internet service No internet service
## 1341 No phone service DSL Yes Yes
## 3332 No No No internet service No internet service
## 3827 Yes No No internet service No internet service
## 4381 No No No internet service No internet service
## 5219 No No No internet service No internet service
## 6671 Yes DSL No Yes
## 6755 Yes DSL Yes Yes
## DeviceProtection TechSupport StreamingTV
## 489 Yes Yes Yes
## 754 No internet service No internet service No internet service
## 937 Yes No Yes
## 1083 No internet service No internet service No internet service
## 1341 Yes Yes Yes
## 3332 No internet service No internet service No internet service
## 3827 No internet service No internet service No internet service
## 4381 No internet service No internet service No internet service
## 5219 No internet service No internet service No internet service
## 6671 Yes Yes Yes
## 6755 No Yes No
## StreamingMovies Contract PaperlessBilling PaymentMethod
## 489 No Two year Yes Bank transfer (automatic)
## 754 No internet service Two year No Mailed check
## 937 Yes Two year No Mailed check
## 1083 No internet service Two year No Mailed check
## 1341 No Two year No Credit card (automatic)
## 3332 No internet service Two year No Mailed check
## 3827 No internet service Two year No Mailed check
## 4381 No internet service Two year No Mailed check
## 5219 No internet service One year Yes Mailed check
## 6671 No Two year No Mailed check
## 6755 No Two year Yes Bank transfer (automatic)
## MonthlyCharges TotalCharges Churn
## 489 52.55 NA No
## 754 20.25 NA No
## 937 80.85 NA No
## 1083 25.75 NA No
## 1341 56.05 NA No
## 3332 19.85 NA No
## 3827 25.35 NA No
## 4381 20.00 NA No
## 5219 19.70 NA No
## 6671 73.35 NA No
## 6755 61.90 NA No
All of these customers have a tenure of 0, suggesting they are new customers who have not yet been billed. Let's check what proportion of the sample these cases represent.
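A quick computation (a reconstruction of the chunk behind the output below):
sum(is.na(churn$TotalCharges)) / nrow(churn)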
## [1] 0.001561834
This subset is only 0.16% of our data, which is quite small. We will remove these cases to simplify the analyses that follow. Let's call the cleaned data churn_clean.
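A minimal sketch of the cleaning step:
# Drop the 11 rows with missing TotalCharges
churn_clean <- na.omit(churn)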
The SeniorCitizen variable is coded ‘0/1’ rather than yes/no. We can recode this to ease our interpretation of later graphs and models.
churn_clean$SeniorCitizen <- as.factor(mapvalues(churn_clean$SeniorCitizen,
from=c("0","1"),
to=c("No", "Yes")))
The MultipleLines variable depends on the PhoneService variable: a 'No' for the latter automatically means 'No phone service' for the former. We can further simplify our graphics and modeling by recoding the 'No phone service' response to 'No' for the MultipleLines variable.
churn_clean$MultipleLines <- as.factor(mapvalues(churn_clean$MultipleLines,
from=c("No phone service"),
to=c("No")))
Similarly, the OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies variables all depend on the InternetService variable. We will recode the responses from 'No internet service' to 'No' for these variables.
for(i in 10:15){
churn_clean[,i] <- as.factor(mapvalues(churn_clean[,i],
from= c("No internet service"), to= c("No")))
}
We will not need the customerID variable for graphs or modeling, so it can be removed.
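A sketch of the removal step (dplyr::select is written out explicitly because the MASS package, loaded above, also defines a select function):
churn_clean <- churn_clean %>% dplyr::select(-customerID)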
Before we begin modeling our data, let’s examine descriptive statistics of our data within some simple plots.
Here are barplots of demographic data of our sample.
#Gender plot
p1 <- ggplot(churn_clean, aes(x = gender)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Senior citizen plot
p2 <- ggplot(churn_clean, aes(x = SeniorCitizen)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Partner plot
p3 <- ggplot(churn_clean, aes(x = Partner)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Dependents plot
p4 <- ggplot(churn_clean, aes(x = Dependents)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Plot demographic data within a grid
grid.arrange(p1, p2, p3, p4, ncol=2)
From these demographic plots, we notice that the sample is evenly split across gender and partner status. A minority of the sample are senior citizens, and a minority have dependents.
The various offered services are plotted below.
#Phone service plot
p5 <- ggplot(churn_clean, aes(x = PhoneService)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Multiple phone lines plot
p6 <- ggplot(churn_clean, aes(x = MultipleLines)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Internet service plot
p7 <- ggplot(churn_clean, aes(x = InternetService)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Online security service plot
p8 <- ggplot(churn_clean, aes(x = OnlineSecurity)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Online backup service plot
p9 <- ggplot(churn_clean, aes(x = OnlineBackup)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Device Protection service plot
p10 <- ggplot(churn_clean, aes(x = DeviceProtection)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Tech Support service plot
p11 <- ggplot(churn_clean, aes(x = TechSupport)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Streaming TV service plot
p12 <- ggplot(churn_clean, aes(x = StreamingTV)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Streaming Movies service plot
p13 <- ggplot(churn_clean, aes(x = StreamingMovies)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Plot service data within a grid
grid.arrange(p5, p6, p7,
p8, p9, p10,
p11, p12, p13,
ncol=3)
Most of the sample have phone service with a single phone line. Fiber optic internet service is more popular than DSL, and each of the online add-on services is used by a minority of customers.
The remaining categorical variables relate to contract and payment status.
#Contract status plot
p14 <- ggplot(churn_clean, aes(x = Contract)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Paperless billing plot
p15 <- ggplot(churn_clean, aes(x = PaperlessBilling)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Payment method plot
p16 <- ggplot(churn_clean, aes(x = PaymentMethod)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Plot contract data within a grid
grid.arrange(p14, p15, p16, ncol=1)
Finally, let's plot the distributions of the three quantitative variables: tenure, monthly charges, and total charges.
#Tenure histogram
p17 <- ggplot(data = churn_clean, aes(tenure, color = Churn))+
geom_freqpoly(binwidth = 5, size = 1)
#Monthly charges histogram
p18 <- ggplot(data = churn_clean, aes(MonthlyCharges, color = Churn))+
geom_freqpoly(binwidth = 5, size = 1)
#Total charges histogram
p19 <- ggplot(data = churn_clean, aes(TotalCharges, color = Churn))+
geom_freqpoly(binwidth = 200, size = 1)
#Plot quantitative data within a grid
grid.arrange(p17, p18, p19, ncol=1)
The tenure variable is stacked at the tails, so a large proportion of customers have either the shortest (1 month) or the longest (72 months) tenure. The MonthlyCharges variable appears roughly normally distributed around $80 per month, with a large stack near the lowest rates. The TotalCharges variable is positively skewed, with a large stack near the lower amounts.
Lastly, let’s examine our main outcome variable of interest, churn.
p20 <- ggplot(churn_clean, aes(x = Churn)) +
geom_bar(aes(fill = Churn)) +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
p20
Roughly a quarter of our sample are no longer customers. Let's try to predict those that churn with some classification modeling techniques.
churn_clean %>%
dplyr::select (TotalCharges, MonthlyCharges, tenure) %>%
cor() %>%
corrplot.mixed(upper = "circle", tl.col = "black", number.cex = 0.7)
The plot shows strong correlations between TotalCharges and tenure, and between TotalCharges and MonthlyCharges. We should pay attention to these variables when training models later. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set, but it does affect the estimates for individual predictors.
In order to assess the performance of our various modeling techniques, we can split the data into training and test subsets. We will model the training data and use these model parameters to make predictions with the test data. Let’s call these data subsets dtrain and dtest.
We will randomly sample from the full data set to create these subsets. The argument to the set.seed() function can be changed to reset the random number generator used for sampling. The training subset will be roughly 70% of the original sample, with the remainder forming the test subset.
set.seed(56)
split_train_test <- createDataPartition(churn_clean$Churn,p=0.7,list=FALSE)
dtrain<- churn_clean[split_train_test,]
dtest<- churn_clean[-split_train_test,]
# Remove Total Charges from the training dataset
dtrain <- dtrain[,-19]
dtest <- dtest[,-19]
The tenure variable measures time as a customer in months. To better capture patterns over time, we can convert it to a factor with five levels, each level representing a bin of tenure in years.
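A sketch of one way to create such bins; the cut points and labels here are illustrative assumptions:
# Bin tenure (in months) into five year-based groups (hypothetical cut points)
dtrain$tenure_group <- cut(dtrain$tenure,
                           breaks = c(0, 12, 24, 48, 60, Inf),
                           labels = c("0-1 year", "1-2 years", "2-4 years", "4-5 years", "5+ years"),
                           include.lowest = TRUE)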
In this project we will fit three classification models: a decision tree, a random forest, and a logistic regression.
Decision tree analysis is a classification method that uses tree-like models of decisions and their possible outcomes, and it is one of the most commonly used tools in machine learning. We will use the rpart library, which implements recursive partitioning for decision trees. This exploratory method will identify the most important variables related to churn in a hierarchical format.
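The tree object tr_fit is used in the prediction code below; a minimal sketch of the fitting and plotting step, assuming default rpart settings:
tr_fit <- rpart(Churn ~ ., data = dtrain, method = "class")
rpart.plot(tr_fit)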
From this decision tree, we can interpret the following:
- The Contract variable is the most important. Customers with month-to-month contracts are more likely to churn.
- Customers with DSL internet service are less likely to churn.
- Customers who have stayed longer than 15 months are less likely to churn.

Now let's assess the prediction accuracy of the decision tree model by investigating how well it predicts churn in the test subset. We will begin with the confusion matrix, which is a useful display of classification accuracy. It displays the following information:
- True positives (TP): cases in which we predicted yes (they churned), and they did churn.
- True negatives (TN): we predicted no, and they didn't churn.
- False positives (FP): we predicted yes, but they didn't actually churn. (Also known as a "Type I error.")
- False negatives (FN): we predicted no, but they actually churned. (Also known as a "Type II error.")

Let's examine the confusion matrix for our decision tree model.
tr_prob1 <- predict(tr_fit, dtest)
tr_pred1 <- ifelse(tr_prob1[,2] > 0.5,"Yes","No")
table(Predicted = tr_pred1, Actual = dtest$Churn)
## Actual
## Predicted No Yes
## No 1406 284
## Yes 142 276
The diagonal entries give our correct predictions, with the upper left being TN and the lower right being TP. The upper right gives the FN while the lower left gives the FP. From this confusion matrix, we can see that the model performs well at predicting non-churning customers (1406 correct vs. 142 incorrect) but does not perform as well at predicting churning customers (276 correct vs. 284 incorrect).
How about the overall accuracy of the decision tree model?
tr_prob2 <- predict(tr_fit, dtrain)
tr_pred2 <- ifelse(tr_prob2[,2] > 0.5,"Yes","No")
tr_tab1 <- table(Predicted = tr_pred2, Actual = dtrain$Churn)
tr_tab2 <- table(Predicted = tr_pred1, Actual = dtest$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3272 674
## Yes 343 635
##
## Accuracy : 0.7935
## 95% CI : (0.7819, 0.8047)
## No Information Rate : 0.7342
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4245
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.4851
## Specificity : 0.9051
## Pos Pred Value : 0.6493
## Neg Pred Value : 0.8292
## Prevalence : 0.2658
## Detection Rate : 0.1290
## Detection Prevalence : 0.1986
## Balanced Accuracy : 0.6951
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1406 284
## Yes 142 276
##
## Accuracy : 0.7979
## 95% CI : (0.7801, 0.8149)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 0.000000000006398
##
## Kappa : 0.4364
##
## Mcnemar's Test P-Value : 0.000000000008405
##
## Sensitivity : 0.4929
## Specificity : 0.9083
## Pos Pred Value : 0.6603
## Neg Pred Value : 0.8320
## Prevalence : 0.2657
## Detection Rate : 0.1309
## Detection Prevalence : 0.1983
## Balanced Accuracy : 0.7006
##
## 'Positive' Class : Yes
##
## [1] 0.7979127
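These summaries match the output format of caret's confusionMatrix(); a sketch of the likely calls, with the final line computing the overall test accuracy reported above (the same pattern applies to the random forest and logistic regression summaries later):
confusionMatrix(tr_tab1, positive = "Yes") # training subset
confusionMatrix(tr_tab2, positive = "Yes") # test subset
sum(diag(tr_tab2)) / sum(tr_tab2)          # overall test accuracy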
The decision tree model is fairly accurate, correctly predicting the churn status of customers in the test subset 79.8% of the time.
Random forest analysis is another machine learning classification method often used in customer churn analysis. It operates by constructing many decision trees and aggregating their predictions.
We will begin by identifying the number of variables randomly sampled as candidates at each split of the algorithm. In the randomForest package, this is referred to as the ‘mtry’ parameter or argument.
#Set control parameters for random forest model selection
ctrl <- trainControl(method = "cv", number=5,
classProbs = TRUE, summaryFunction = twoClassSummary)
#Exploratory random forest model selection
# rf_fit1 <- train(Churn ~., data = dtrain,
# method = "rf",
# ntree = 75,
# tuneLength = 5,
# metric = "ROC",
# trControl = ctrl)
# saveRDS(rf_fit1, "Churn.RDS")
rf_fit1 <- readRDS("Churn.RDS")
The model found that the optimal value for 'mtry' is 2. From this model we can investigate the relative importance of the churn predictor variables.
#Run optimal model
rf_fit2 <- randomForest(Churn ~., data = dtrain,
ntree = 75, mtry = 2,
importance = TRUE, proximity = TRUE)
#Display variable importance from random tree
varImpPlot(rf_fit2, sort=T, n.var = 10,
main = 'Top 10 important variables')
Similar to the decision tree, this random forest model has identified contract status and tenure length as important predictors of churn. Internet service status appears less important in this model, while the total charges variable is now highly emphasized.
Let’s examine the performance of this random forest model. We’ll begin with the confusion matrix.
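The table below is based on the random forest's test-set predictions; a reconstruction of the chunk (rf_pred1 is reused in the accuracy code further down):
rf_pred1 <- predict(rf_fit2, dtest)
table(Predicted = rf_pred1, Actual = dtest$Churn)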
## Actual
## Predicted No Yes
## No 1429 283
## Yes 119 277
The performance is somewhat similar to that of the decision tree model. The model predicts non-churning customers well (1429 correct vs. 119 incorrect) but churning customers poorly (277 correct vs. 283 incorrect). What about the overall accuracy?
rf_pred2 <- predict(rf_fit2, dtrain)
rf_tab1 <- table(Predicted = rf_pred2, Actual = dtrain$Churn)
rf_tab2 <- table(Predicted = rf_pred1, Actual = dtest$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3459 439
## Yes 156 870
##
## Accuracy : 0.8792
## 95% CI : (0.8697, 0.8881)
## No Information Rate : 0.7342
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.6675
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.6646
## Specificity : 0.9568
## Pos Pred Value : 0.8480
## Neg Pred Value : 0.8874
## Prevalence : 0.2658
## Detection Rate : 0.1767
## Detection Prevalence : 0.2084
## Balanced Accuracy : 0.8107
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1429 283
## Yes 119 277
##
## Accuracy : 0.8093
## 95% CI : (0.7919, 0.8259)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 0.0000000000000004604
##
## Kappa : 0.4608
##
## Mcnemar's Test P-Value : 0.0000000000000004304
##
## Sensitivity : 0.4946
## Specificity : 0.9231
## Pos Pred Value : 0.6995
## Neg Pred Value : 0.8347
## Prevalence : 0.2657
## Detection Rate : 0.1314
## Detection Prevalence : 0.1879
## Balanced Accuracy : 0.7089
##
## 'Positive' Class : Yes
##
## [1] 0.8092979
As we can see from the confusion matrix analysis, the random forest model predicts the training data very well (87.9% accuracy) but performs noticeably worse on the test data (80.9%), indicating that the model is overfitting.
Still, the random forest model is slightly more accurate than the decision tree model, correctly predicting the churn status of customers in the test subset 80.9% of the time.
Our final statistical method is logistic regression, a more classical method than the two machine-learning-based methods above. Logistic regression regresses a binary outcome on predictor variables using a binomial family with a logit link. Let's fit the model using R's base generalized linear modeling function, glm.
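The model object lr_fit is used for the predictions below; a reconstruction of the fitting call, based on the Call line in the summary output:
lr_fit <- glm(Churn ~ ., family = binomial(link = "logit"), data = dtrain)
summary(lr_fit)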
##
## Call:
## glm(formula = Churn ~ ., family = binomial(link = "logit"), data = dtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8966 -0.6818 -0.2767 0.7457 3.2597
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 1.23152883 0.96365570 1.278
## genderMale -0.01242416 0.07748499 -0.160
## SeniorCitizenYes 0.23575590 0.10011663 2.355
## PartnerYes -0.07698419 0.09254966 -0.832
## DependentsYes -0.18956650 0.10708647 -1.770
## tenure -0.06959758 0.00772755 -9.006
## PhoneServiceYes -0.13680249 0.76776087 -0.178
## MultipleLinesYes 0.38236850 0.20877114 1.832
## InternetServiceFiber optic 1.31558759 0.94221565 1.396
## InternetServiceNo -1.70028438 0.95694358 -1.777
## OnlineSecurityYes -0.22175750 0.21021421 -1.055
## OnlineBackupYes -0.11890320 0.20742964 -0.573
## DeviceProtectionYes 0.03831068 0.20800957 0.184
## TechSupportYes -0.30327692 0.21364870 -1.420
## StreamingTVYes 0.42712767 0.38537693 1.108
## StreamingMoviesYes 0.43230845 0.38628263 1.119
## ContractOne year -0.62277873 0.12931299 -4.816
## ContractTwo year -1.38858477 0.21452923 -6.473
## PaperlessBillingYes 0.32051922 0.08915123 3.595
## PaymentMethodCredit card (automatic) -0.21018605 0.13551365 -1.551
## PaymentMethodElectronic check 0.23654707 0.11248087 2.103
## PaymentMethodMailed check -0.23447565 0.13843393 -1.694
## MonthlyCharges -0.02911674 0.03751901 -0.776
## TotalCharges 0.00046359 0.00008674 5.344
## Pr(>|z|)
## (Intercept) 0.201258
## genderMale 0.872611
## SeniorCitizenYes 0.018532 *
## PartnerYes 0.405513
## DependentsYes 0.076691 .
## tenure < 0.0000000000000002 ***
## PhoneServiceYes 0.858579
## MultipleLinesYes 0.067023 .
## InternetServiceFiber optic 0.162633
## InternetServiceNo 0.075603 .
## OnlineSecurityYes 0.291466
## OnlineBackupYes 0.566495
## DeviceProtectionYes 0.853874
## TechSupportYes 0.155750
## StreamingTVYes 0.267716
## StreamingMoviesYes 0.263076
## ContractOne year 0.0000014642283 ***
## ContractTwo year 0.0000000000963 ***
## PaperlessBillingYes 0.000324 ***
## PaymentMethodCredit card (automatic) 0.120894
## PaymentMethodElectronic check 0.035466 *
## PaymentMethodMailed check 0.090308 .
## MonthlyCharges 0.437718
## TotalCharges 0.0000000907512 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5702.8 on 4923 degrees of freedom
## Residual deviance: 4074.2 on 4900 degrees of freedom
## AIC: 4122.2
##
## Number of Fisher Scoring iterations: 6
By examining the significance values, we see similar predictor variables of importance. Tenure length, contract status, and total charges have the lowest p-values and can be identified as the best predictors of customer churn.
Let’s examine the confusion matrix based on our logistic regression model.
lr_prob1 <- predict(lr_fit, dtest, type="response")
lr_pred1 <- ifelse(lr_prob1 > 0.5,"Yes","No")
table(Predicted = lr_pred1, Actual = dtest$Churn)
## Actual
## Predicted No Yes
## No 1403 247
## Yes 145 313
Similar to the machine learning algorithms, the model predicts non-churning customers well (1403 correct vs. 145 incorrect). More notably, it predicts churning customers correctly more than half the time (313 correct vs. 247 incorrect), a better rate than either of the machine learning models achieved.
The overall prediction accuracy can be obtained in the same way as for the previous models.
lr_prob2 <- predict(lr_fit, dtrain, type="response")
lr_pred2 <- ifelse(lr_prob2 > 0.5,"Yes","No")
lr_tab1 <- table(Predicted = lr_pred2, Actual = dtrain$Churn)
lr_tab2 <- table(Predicted = lr_pred1, Actual = dtest$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3227 595
## Yes 388 714
##
## Accuracy : 0.8004
## 95% CI : (0.7889, 0.8115)
## No Information Rate : 0.7342
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4614
##
## Mcnemar's Test P-Value : 0.00000000005019
##
## Sensitivity : 0.5455
## Specificity : 0.8927
## Pos Pred Value : 0.6479
## Neg Pred Value : 0.8443
## Prevalence : 0.2658
## Detection Rate : 0.1450
## Detection Prevalence : 0.2238
## Balanced Accuracy : 0.7191
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1403 247
## Yes 145 313
##
## Accuracy : 0.814
## 95% CI : (0.7968, 0.8304)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.494
##
## Mcnemar's Test P-Value : 0.0000003374
##
## Sensitivity : 0.5589
## Specificity : 0.9063
## Pos Pred Value : 0.6834
## Neg Pred Value : 0.8503
## Prevalence : 0.2657
## Detection Rate : 0.1485
## Detection Prevalence : 0.2173
## Balanced Accuracy : 0.7326
##
## 'Positive' Class : Yes
##
## [1] 0.8140417
The 81.4% accuracy rate of the logistic regression model slightly outperforms the decision tree and random forest models.
Now that we have fit several models and identified some important predictor variables for customer churn, let’s explore some further graphs based on our findings.
Our modeling efforts pointed to several important churn predictors: contract status, internet status, tenure length, and total charges. Let’s examine how these variables split by churn status.
We will begin with the contract status variable.
p21 <- ggplot(churn_clean, aes(x = Contract, fill = Churn)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3) +
labs(title="Churn rate by contract status")
p21
As would be expected, the churn rate of month-to-month contract customers is much higher than that of customers on longer contracts. Customers who are willing to commit to longer contracts are less likely to leave.
What does this look like for the internet service status of the customer?
p22 <- ggplot(churn_clean, aes(x = InternetService, fill = Churn)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3) +
labs(title="Churn rate by internet service status")
p22
It appears that customers with internet service are more likely to churn than those without it. This is most pronounced for customers with fiber optic internet service, who are the most likely to churn.
Let’s examine the churn split for the tenure length distribution.
p23 <- ggplot(churn_clean, aes(x = tenure, fill = Churn)) +
geom_histogram(binwidth = 1) +
labs(x = "Months",
title = "Churn rate by tenure")
p23
As expected, a longer time as a customer decreases the likelihood of churn. There is a large spike at 1 month, indicating that a large portion of customers leave after just one month of service.
How about the distribution for total charges split by churn?
p24 <- ggplot(churn_clean, aes(x = TotalCharges, fill = Churn)) +
geom_histogram(binwidth = 100) +
labs(x = "Dollars (binwidth=100)",
title = "Churn rate by total charges")
p24
Similar to the tenure trend, customers who have spent more with the company tend not to leave. This could simply reflect the tenure effect, or it could be due to financial characteristics of the customers: those who are more financially well off may be less likely to leave.
After going through various preparatory steps including data/library loading and preprocessing, we carried out three statistical classification methods common in churn analysis. We identified several important churn predictor variables from these models and compared these models on accuracy measures.
Here is a summary of our findings:
Logistic regression, although a less complicated method, outperformed the machine-learning-based decision tree and random forest methods. Although logistic regression produced slightly more false positives, it had a better false negative rate and was more accurate overall.