By: Lawrence Lo
Email: ollecnerwal180@gmail.com
Customer churn, also known as customer attrition, is the loss of clients or customers. Churn is an important business metric for subscription-based services such as telecommunications companies. This project demonstrates a churn analysis using data downloaded from IBM sample data sets. We will use the R statistical programming languange in order to identify variables associated with customer churn.
In this project, we will carry out the following tasks:
We will begin by loading the R libraries we will need for the project.
library(plyr)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(rpart)
library(rpart.plot)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
##
## combine
Now let’s read in the data. You will need to modify the file name within the quotations based on the file location of the data within your computer. The first few rows are depicted below.
#Read data file
dat <- read.csv("C:\\Users\\Lawrence\\Documents\\Portfolio1\\TelcoChurn\\WA_Fn-UseC_-Telco-Customer-Churn.csv")
#Examine data
head(dat)
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup
## 1 No phone service DSL No Yes
## 2 No DSL Yes No
## 3 No DSL Yes Yes
## 4 No phone service DSL Yes No
## 5 No Fiber optic No No
## 6 Yes Fiber optic No No
## DeviceProtection TechSupport StreamingTV StreamingMovies Contract
## 1 No No No No Month-to-month
## 2 Yes No No No One year
## 3 No No No No Month-to-month
## 4 Yes Yes No No One year
## 5 No No No No Month-to-month
## 6 Yes No Yes Yes Month-to-month
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 1 Yes Electronic check 29.85 29.85
## 2 No Mailed check 56.95 1889.50
## 3 Yes Mailed check 53.85 108.15
## 4 No Bank transfer (automatic) 42.30 1840.75
## 5 Yes Electronic check 70.70 151.65
## 6 Yes Electronic check 99.65 820.50
## Churn
## 1 No
## 2 No
## 3 Yes
## 4 No
## 5 Yes
## 6 Yes
The IBM sample data set website gives the following data dicitionary, or description of the variables:
Let’s begin by seeing if there is any missing data.
sapply(dat, function(x) sum(is.na(x)))
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
There are 11 cases with missing values in the TotalCharges variable. Let’s see these particular cases.
dat[is.na(dat$TotalCharges),]
## customerID gender SeniorCitizen Partner Dependents tenure
## 489 4472-LVYGI Female 0 Yes Yes 0
## 754 3115-CZMZD Male 0 No Yes 0
## 937 5709-LVOEQ Female 0 Yes Yes 0
## 1083 4367-NUYAO Male 0 Yes Yes 0
## 1341 1371-DWPAZ Female 0 Yes Yes 0
## 3332 7644-OMVMY Male 0 Yes Yes 0
## 3827 3213-VVOLG Male 0 Yes Yes 0
## 4381 2520-SGTTA Female 0 Yes Yes 0
## 5219 2923-ARZLG Male 0 Yes Yes 0
## 6671 4075-WKNIU Female 0 Yes Yes 0
## 6755 2775-SEFEE Male 0 No Yes 0
## PhoneService MultipleLines InternetService OnlineSecurity
## 489 No No phone service DSL Yes
## 754 Yes No No No internet service
## 937 Yes No DSL Yes
## 1083 Yes Yes No No internet service
## 1341 No No phone service DSL Yes
## 3332 Yes No No No internet service
## 3827 Yes Yes No No internet service
## 4381 Yes No No No internet service
## 5219 Yes No No No internet service
## 6671 Yes Yes DSL No
## 6755 Yes Yes DSL Yes
## OnlineBackup DeviceProtection TechSupport
## 489 No Yes Yes
## 754 No internet service No internet service No internet service
## 937 Yes Yes No
## 1083 No internet service No internet service No internet service
## 1341 Yes Yes Yes
## 3332 No internet service No internet service No internet service
## 3827 No internet service No internet service No internet service
## 4381 No internet service No internet service No internet service
## 5219 No internet service No internet service No internet service
## 6671 Yes Yes Yes
## 6755 Yes No Yes
## StreamingTV StreamingMovies Contract PaperlessBilling
## 489 Yes No Two year Yes
## 754 No internet service No internet service Two year No
## 937 Yes Yes Two year No
## 1083 No internet service No internet service Two year No
## 1341 Yes No Two year No
## 3332 No internet service No internet service Two year No
## 3827 No internet service No internet service Two year No
## 4381 No internet service No internet service Two year No
## 5219 No internet service No internet service One year Yes
## 6671 Yes No Two year No
## 6755 No No Two year Yes
## PaymentMethod MonthlyCharges TotalCharges Churn
## 489 Bank transfer (automatic) 52.55 NA No
## 754 Mailed check 20.25 NA No
## 937 Mailed check 80.85 NA No
## 1083 Mailed check 25.75 NA No
## 1341 Credit card (automatic) 56.05 NA No
## 3332 Mailed check 19.85 NA No
## 3827 Mailed check 25.35 NA No
## 4381 Mailed check 20.00 NA No
## 5219 Mailed check 19.70 NA No
## 6671 Mailed check 73.35 NA No
## 6755 Bank transfer (automatic) 61.90 NA No
Inspection of the Churn variable shows that these are all still subscribing customers. What proportion of our sample is this subset with missing values?
sum(is.na(dat$TotalCharges))/nrow(dat)
## [1] 0.001561834
This subset is 0.16% of our data and is quite small. We will remove these cases in order to accomodate our further analyses. Let’s call this cleaned data datc.
datc <- dat[complete.cases(dat), ]
The SeniorCitizen variable is coded ‘0/1’ rather than yes/no. We can recode this to ease our interpretation of later graphs and models.
datc$SeniorCitizen <- as.factor(mapvalues(datc$SeniorCitizen,
from=c("0","1"),
to=c("No", "Yes")))
The MultipleLines variable is dependent on the PhoneService variable, where a ‘no’ for the latter variable automatically means a ‘no’ for the former variable. We can again further ease our graphics and modeling by recoding the ‘No phone service’ response to ‘No’ for the MultipleLines variable.
datc$MultipleLines <- as.factor(mapvalues(datc$MultipleLines,
from=c("No phone service"),
to=c("No")))
Similiarly, the OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies variables are all dependent on the OnlineService variable. We will recode the responses from ‘No internet service’ to ‘No’ for these variables.
for(i in 10:15){
datc[,i] <- as.factor(mapvalues(datc[,i],
from= c("No internet service"), to= c("No")))
}
We will not need the customerID variable for graphs or modeling, so it can be removed.
datc$customerID <- NULL
In order to assess the performance of our various modeling techniques, we can split the data into training and test subsets. We will model the training data and use these model parameters to make predictions with the test data. Let’s call these data subsets dtrain and dtest.
We will randomly sample from the entire sample to create these subsets. The ‘set.seed()’ function argument can be changed in order to reset the random number generator used for sampling. The training subset will be roughly 70% of the original sample, with the remaining being the test subset.
set.seed(56)
split_train_test <- createDataPartition(datc$Churn,p=0.7,list=FALSE)
dtrain<- datc[split_train_test,]
dtest<- datc[-split_train_test,]
Before we begin modeling our data, let’s examine descriptive statistics of our data within some simple plots.
Here are barplots of demographic data of our sample.
#Gender plot
p1 <- ggplot(datc, aes(x = gender)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Senior citizen plot
p2 <- ggplot(datc, aes(x = SeniorCitizen)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Partner plot
p3 <- ggplot(datc, aes(x = Partner)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Dependents plot
p4 <- ggplot(datc, aes(x = Dependents)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Plot demographic data within a grid
grid.arrange(p1, p2, p3, p4, ncol=2)
From these demographic plots, we notice that the sample is evenly split across gender and partner status. A minority of the sample are senior citizens, and a minority have dependents.
The various offered services are plotted below.
#Phone service plot
p5 <- ggplot(datc, aes(x = PhoneService)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Multiple phone lines plot
p6 <- ggplot(datc, aes(x = MultipleLines)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Internet service plot
p7 <- ggplot(datc, aes(x = InternetService)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Online security service plot
p8 <- ggplot(datc, aes(x = OnlineSecurity)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Online backup service plot
p9 <- ggplot(datc, aes(x = OnlineBackup)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Device Protection service plot
p10 <- ggplot(datc, aes(x = DeviceProtection)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Tech Support service plot
p11 <- ggplot(datc, aes(x = TechSupport)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Streaming TV service plot
p12 <- ggplot(datc, aes(x = StreamingTV)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Streaming Movies service plot
p13 <- ggplot(datc, aes(x = StreamingMovies)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Plot service data within a grid
grid.arrange(p5, p6, p7,
p8, p9, p10,
p11, p12, p13,
ncol=3)
Most of the sample have phone service with a single phone line. Fiber optic internet connection is more popular than DSL internet service, and each online service has a minority of users.
The remaining categorical variables are related to contract and payment status.
#Contract status plot
p14 <- ggplot(datc, aes(x = Contract)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Paperless billing plot
p15 <- ggplot(datc, aes(x = PaperlessBilling)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Payment method plot
p16 <- ggplot(datc, aes(x = PaymentMethod)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
#Plot contract data within a grid
grid.arrange(p14, p15, p16, ncol=1)
Roughly half of the sample are on month-to-month contracts with the remaining split between one and two year contracts. Most of the sample are on paperless billing, and pay by electronic check.
Let’s look at distributions of the quantitative variables.
#Tenure histogram
p17 <- ggplot(datc, aes(x = tenure)) +
geom_histogram(binwidth = 1) +
labs(x = "Months",
title = "Tenure Distribtion")
#Monthly charges histogram
p18 <- ggplot(datc, aes(x = MonthlyCharges)) +
geom_histogram(binwidth = 5) +
labs(x = "Dollars (binwidth = 5)",
title = "Monthly charges Distribtion")
#Total charges histogram
p19 <- ggplot(datc, aes(x = TotalCharges)) +
geom_histogram(binwidth = 100) +
labs(x = "Dollars (binwidth = 100)",
title = "Total charges Distribtion")
#Plot quantitative data within a grid
grid.arrange(p17, p18, p19, ncol=1)
The tenure variable is stacked at the tails, so a large proportin of customers have either been had the shortest (1 month) or longest (72 month) tenure. It appears as if the MonthlyCharges variable is roughly normaly distribued around $80 per month with a large stack near the lowest rates. The TotalCharges variable is positively skewed with a large stack near the lower amounts.
Lastly, let’s examine our main outcome variable of interest, churn.
p20 <- ggplot(datc, aes(x = Churn)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
p20
Roughly a quarter of our sample are no longer customers. Let’s try to predict those that churn with some classification modeling techniques.
Three modeling techniques will be implemented on the training data subset. Model parameters from these techniques will make predictions on the test subset. We will examine accuracy in terms of both the percentage of correct predictions as well as confustion matrices.
Decision tree analysis is a classification method that uses tree-like models of decisions and their possible outcomes. This method is one of the most commonly used tools in machine learning analysis. We will use the rpart library in order to use recursive partitioning methods for decision trees. This exploratory method will identify the most important variables related to churn in a hierarchical format.
tr_fit <- rpart(Churn ~., data = dtrain, method="class")
rpart.plot(tr_fit)
From this decision tree, we can interpret the following:
Now let’s assess the prediction accuracy of the decision tree model by investigating how well it predicts churn in the test subset. We will begin with the confustion matrix, which is a useful display of classification accuracy. It displays the following information:
Let’s examine the confusion matrix for our decision tree model.
tr_prob1 <- predict(tr_fit, dtest)
tr_pred1 <- ifelse(tr_prob1[,2] > 0.5,"Yes","No")
table(Predicted = tr_pred1, Actual = dtest$Churn)
## Actual
## Predicted No Yes
## No 1466 328
## Yes 82 232
The diagonal entries give our correct predictions, with the upper left being TN and the lower right being TP. The upper right gives the FN while the lower left gives the FP. From this confusion matrix, we can see that the model performs well at predicting non-churning customers (1466 correct vs. 82 incorrect) but does not perform as well at predicting churning customers (232 correct vs. 328 incorrect).
How about the overall accuracy of the decision tree model?
tr_prob2 <- predict(tr_fit, dtrain)
tr_pred2 <- ifelse(tr_prob2[,2] > 0.5,"Yes","No")
tr_tab1 <- table(Predicted = tr_pred2, Actual = dtrain$Churn)
tr_tab2 <- table(Predicted = tr_pred1, Actual = dtest$Churn)
tr_acc <- sum(diag(tr_tab2))/sum(tr_tab2)
tr_acc
## [1] 0.8055028
The decision tree model is fairly accurate, correctly predicting the churn status of customers in the test subset 80.55% of the time.
Random forest analysis is another machine learning classification method that is often used in customer churn analysis. The method operates by constructing multiple decision trees and constructing models based on summary statistics of these decision trees.
We will begin by identifying the number of variables randomly sampled as candidates at each split of the algorithm. In the randomForest package, this is referred to as the ‘mtry’ parameter or argument.
#Set control parameters for random forest model selection
ctrl <- trainControl(method = "cv", number=5,
classProbs = TRUE, summaryFunction = twoClassSummary)
#Exploratory random forest model selection
rf_fit1 <- train(Churn ~., data = dtrain,
method = "rf",
ntree = 75,
tuneLength = 5,
metric = "ROC",
trControl = ctrl)
rf_fit1
## Random Forest
##
## 4924 samples
## 19 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3939, 3939, 3939, 3939, 3940
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.8268924 0.9186722 0.4453921
## 7 0.8136580 0.8946058 0.4805534
## 12 0.8085244 0.8879668 0.4767132
## 17 0.8087431 0.8865837 0.4767336
## 23 0.8067040 0.8915629 0.4790149
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
The model found that the optimal value for ‘mtry’ is 2. From this model we can investigate the relative importance of the churn predictor variables.
#Run optimal model
rf_fit2 <- randomForest(Churn ~., data = dtrain,
ntree = 75, mtry = 2,
importance = TRUE, proximity = TRUE)
#Display variable importance from random tree
varImpPlot(rf_fit2, sort=T, n.var = 10,
main = 'Top 10 important variables')
Similar to the decision tree, this random forest model has identified contract status and tenure length as important predictors for churn. Internet service status does not appear as important in this model, and the total charges variable is now highly emphasized.
Let’s examine the performance of this random forest model. We’ll begin with the confusion matrix.
rf_pred1 <- predict(rf_fit2, dtest)
table(Predicted = rf_pred1, Actual = dtest$Churn)
## Actual
## Predicted No Yes
## No 1445 288
## Yes 103 272
The performance is somewhat similar to the decision tree model. The false negative rate is low (1445 correct vs. 103 incorrect) but the false positive rate is rather high (272 correct vs. 288 incorrect). What about the overall accuracy?
rf_pred2 <- predict(rf_fit2, dtrain)
rf_tab1 <- table(Predicted = rf_pred2, Actual = dtrain$Churn)
rf_tab2 <- table(Predicted = rf_pred1, Actual = dtest$Churn)
rf_acc <- sum(diag(rf_tab2))/sum(rf_tab2)
rf_acc
## [1] 0.8145161
The random forest model is slightly more accurate than the decision tree model, being able to correctly predict the churn status of a customer in the test subset with 81.45% accuracy.
Our final statistical method will be logistic regression, a more classic method compared to the two above machine learning based methods. Logistic regression involves regressing predictor variables on a binary outcome using a binomial link function. Let’s fit the model using the base general linear modeling function in R, glm.
lr_fit <- glm(Churn ~., data = dtrain,
family=binomial(link='logit'))
summary(lr_fit)
##
## Call:
## glm(formula = Churn ~ ., family = binomial(link = "logit"), data = dtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9220 -0.6926 -0.2899 0.7488 3.4425
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 2.056e+00 9.710e-01 2.118
## genderMale -3.932e-02 7.715e-02 -0.510
## SeniorCitizenYes 3.158e-01 1.004e-01 3.146
## PartnerYes 4.608e-03 9.331e-02 0.049
## DependentsYes -1.024e-01 1.087e-01 -0.942
## tenure -6.429e-02 7.447e-03 -8.633
## PhoneServiceYes 9.624e-01 7.721e-01 1.247
## MultipleLinesYes 6.519e-01 2.125e-01 3.068
## InternetServiceFiber optic 2.482e+00 9.474e-01 2.620
## InternetServiceNo -2.738e+00 9.606e-01 -2.851
## OnlineSecurityYes -2.438e-02 2.128e-01 -0.115
## OnlineBackupYes 2.669e-01 2.095e-01 1.274
## DeviceProtectionYes 3.036e-01 2.102e-01 1.444
## TechSupportYes -1.026e-01 2.141e-01 -0.479
## StreamingTVYes 9.326e-01 3.880e-01 2.404
## StreamingMoviesYes 9.986e-01 3.906e-01 2.557
## ContractOne year -7.273e-01 1.288e-01 -5.645
## ContractTwo year -1.341e+00 2.084e-01 -6.434
## PaperlessBillingYes 3.138e-01 8.775e-02 3.576
## PaymentMethodCredit card (automatic) -1.043e-01 1.360e-01 -0.767
## PaymentMethodElectronic check 3.037e-01 1.119e-01 2.713
## PaymentMethodMailed check 1.935e-02 1.360e-01 0.142
## MonthlyCharges -7.633e-02 3.776e-02 -2.021
## TotalCharges 3.890e-04 8.404e-05 4.628
## Pr(>|z|)
## (Intercept) 0.034213 *
## genderMale 0.610245
## SeniorCitizenYes 0.001654 **
## PartnerYes 0.960608
## DependentsYes 0.345944
## tenure < 2e-16 ***
## PhoneServiceYes 0.212551
## MultipleLinesYes 0.002156 **
## InternetServiceFiber optic 0.008802 **
## InternetServiceNo 0.004364 **
## OnlineSecurityYes 0.908790
## OnlineBackupYes 0.202642
## DeviceProtectionYes 0.148627
## TechSupportYes 0.631601
## StreamingTVYes 0.016227 *
## StreamingMoviesYes 0.010562 *
## ContractOne year 1.65e-08 ***
## ContractTwo year 1.25e-10 ***
## PaperlessBillingYes 0.000349 ***
## PaymentMethodCredit card (automatic) 0.442884
## PaymentMethodElectronic check 0.006670 **
## PaymentMethodMailed check 0.886877
## MonthlyCharges 0.043250 *
## TotalCharges 3.69e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5702.8 on 4923 degrees of freedom
## Residual deviance: 4119.0 on 4900 degrees of freedom
## AIC: 4167
##
## Number of Fisher Scoring iterations: 6
By examining the significance values, we see similar predictor variables of importance. Tenure length, contract status, and total charges have the lowest p-values and can be identified as the best predictors of customer churn.
Let’s examine the confusion matrix based on our logistic regression model.
lr_prob1 <- predict(lr_fit, dtest, type="response")
lr_pred1 <- ifelse(lr_prob1 > 0.5,"Yes","No")
table(Predicted = lr_pred1, Actual = dtest$Churn)
## Actual
## Predicted No Yes
## No 1395 228
## Yes 153 332
Similar to the machine learning algorithms, the false negative rate is low (1395 correct vs. 153 incorrect) while not as quite as low. In contrast, the false positive rate (332 correct vs. 228 incorrect) is actually above 50%, therefore performing better than the machine learning algorithms.
The overal prediction accuracy can be obtained similar to the previous models.
lr_prob2 <- predict(lr_fit, dtrain, type="response")
lr_pred2 <- ifelse(lr_prob2 > 0.5,"Yes","No")
lr_tab1 <- table(Predicted = lr_pred2, Actual = dtrain$Churn)
lr_tab2 <- table(Predicted = lr_pred1, Actual = dtest$Churn)
lr_acc <- sum(diag(lr_tab2))/sum(lr_tab2)
lr_acc
## [1] 0.81926
The 81.93% accuracy rate of the logistic regression model slightly outperforms the decision tree and random forest models.
Now that we have fit several models and identified some important predictor variables for customer churn, let’s explore some further graphs based on our findings.
Our modeling efforts pointed to several important churn predictors: contract status, internet status, tenure length, and total charges. Let’s examine how these variables split by churn status.
We will begin with the contract status variable.
p21 <- ggplot(datc, aes(x = Contract, fill = Churn)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3) +
labs(title="Churn rate by contract status")
p21
As would be expected, the churn rate of month-to-month contract customers is much higher than the longer contract customers. Customers who are more willing to commit to longer contracts are less likely to leave.
What does this look like for the internet service status of the customer?
p22 <- ggplot(datc, aes(x = InternetService, fill = Churn)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3) +
labs(title="Churn rate by internet service status")
p22
It appears as if customers with internet service are more likely to churn than those that don’t. This is more pronounced for customers with fiber optic internet service, who are the most likely to churn.
Let’s examine the churn split for the tenure length distribution.
p23 <- ggplot(datc, aes(x = tenure, fill = Churn)) +
geom_histogram(binwidth = 1) +
labs(x = "Months",
title = "Churn rate by tenure")
p23
As expected, the length of time as a customer decreases the likelihood of churn. There is a large spike at 1 month, indicating that there are a large portion of customers that will leave the after just one month of service.
How about the distribution for total charges split by churn?
p24 <- ggplot(datc, aes(x = TotalCharges, fill = Churn)) +
geom_histogram(binwidth = 100) +
labs(x = "Dollars (binwidth=100)",
title = "Churn rate by tenure")
p24
Similar to the tenure trend, customers who have spent more with the company tend not to leave. This could just be a reflection of the tenure effect, or it could be due to financial characteristics of the customer: customers who are more financially well off are less likely to leave.
After going through various preparatory steps including data/library loading and preprocessing, we carried out three statistical classification methods common in churn analysis. We identified several important churn predictor variables from these models and compared these models on accuracy measures.
Here is a summary of our findings: