1 Introduction

Customer churn, also known as customer attrition, is the loss of clients or customers. Churn is an important business metric for subscription-based services such as telecommunications companies. This project demonstrates a churn analysis using data downloaded from IBM sample data sets. We will use the R statistical programming languange in order to identify variables associated with customer churn.

In this project, we will carry out the following tasks:

Load the data and the relevant R libraries. Preprocess the data with various cleaning and recoding techniques. Provide data visualizations of descriptive statistics of the data Fit models using statistical classification methods commonly used in churn analysis. Decision tree analysis Random forest analysis Logistic regression Examine additional data visualization of selected variables based on our modeling techniques.

2 Read Library

library(plyr)  
library(rpart.plot) 
library(caret)
library(gridExtra) 
library(tidyverse) 
library(rsample)
library(e1071) 
library(GGally)
library(data.table)
library(DT)
library(readr)
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrplot)
library(rms)
library(MASS)
library(e1071)
library(ROCR)
library(gplots)
library(pROC)
library(rpart)
library(randomForest)
library(ggpubr)

3 Read Data

getwd()

## [1] "C:/Users/Asus/Documents/DataQuest/Algoritma Data Science/R/My Projects/Telco Churn"

setwd("C:/Users/Asus/Documents/DataQuest/Algoritma Data Science/R/My Projects/Telco Churn")
churn <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

glimpse(churn)

## Observations: 7,043
## Variables: 21
## $ customerID       <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 92...
## $ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Fe...
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No,...
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No,...
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49...
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes...
## $ MultipleLines    <fct> No phone service, No, No, No phone service, No, Ye...
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fibe...
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, ...
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No...
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No...
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No i...
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No ...
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No i...
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year...
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes,...
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank...
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 2...
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1...
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No...

The IBM sample data set website gives the following data dicitionary, or description of the variables:

customerID: Customer ID
genderCustomer: gender (female, male)
SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
PartnerWhether: the customer has a partner or not (Yes, No)
Dependents: Whether the customer has dependents or not (Yes, No)
tenure: Number of months the customer has stayed with the company
PhoneService: Whether the customer has a phone service or not (Yes, No)
MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
Contract: The contract term of the customer (Month-to-month, One year, Two year)
PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card ( automatic))
MonthlyCharges: The amount charged to the customer monthly
TotalCharges: The total amount charged to the customer
Churn: Whether the customer churned or not (Yes or No)

4 Data Preprocessing

sapply(churn, function(x) sum(is.na(x)))

##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0

churn[is.na(churn$TotalCharges),]

##      customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 489  4472-LVYGI Female             0     Yes        Yes      0           No
## 754  3115-CZMZD   Male             0      No        Yes      0          Yes
## 937  5709-LVOEQ Female             0     Yes        Yes      0          Yes
## 1083 4367-NUYAO   Male             0     Yes        Yes      0          Yes
## 1341 1371-DWPAZ Female             0     Yes        Yes      0           No
## 3332 7644-OMVMY   Male             0     Yes        Yes      0          Yes
## 3827 3213-VVOLG   Male             0     Yes        Yes      0          Yes
## 4381 2520-SGTTA Female             0     Yes        Yes      0          Yes
## 5219 2923-ARZLG   Male             0     Yes        Yes      0          Yes
## 6671 4075-WKNIU Female             0     Yes        Yes      0          Yes
## 6755 2775-SEFEE   Male             0      No        Yes      0          Yes
##         MultipleLines InternetService      OnlineSecurity        OnlineBackup
## 489  No phone service             DSL                 Yes                  No
## 754                No              No No internet service No internet service
## 937                No             DSL                 Yes                 Yes
## 1083              Yes              No No internet service No internet service
## 1341 No phone service             DSL                 Yes                 Yes
## 3332               No              No No internet service No internet service
## 3827              Yes              No No internet service No internet service
## 4381               No              No No internet service No internet service
## 5219               No              No No internet service No internet service
## 6671              Yes             DSL                  No                 Yes
## 6755              Yes             DSL                 Yes                 Yes
##         DeviceProtection         TechSupport         StreamingTV
## 489                  Yes                 Yes                 Yes
## 754  No internet service No internet service No internet service
## 937                  Yes                  No                 Yes
## 1083 No internet service No internet service No internet service
## 1341                 Yes                 Yes                 Yes
## 3332 No internet service No internet service No internet service
## 3827 No internet service No internet service No internet service
## 4381 No internet service No internet service No internet service
## 5219 No internet service No internet service No internet service
## 6671                 Yes                 Yes                 Yes
## 6755                  No                 Yes                  No
##          StreamingMovies Contract PaperlessBilling             PaymentMethod
## 489                   No Two year              Yes Bank transfer (automatic)
## 754  No internet service Two year               No              Mailed check
## 937                  Yes Two year               No              Mailed check
## 1083 No internet service Two year               No              Mailed check
## 1341                  No Two year               No   Credit card (automatic)
## 3332 No internet service Two year               No              Mailed check
## 3827 No internet service Two year               No              Mailed check
## 4381 No internet service Two year               No              Mailed check
## 5219 No internet service One year              Yes              Mailed check
## 6671                  No Two year               No              Mailed check
## 6755                  No Two year              Yes Bank transfer (automatic)
##      MonthlyCharges TotalCharges Churn
## 489           52.55           NA    No
## 754           20.25           NA    No
## 937           80.85           NA    No
## 1083          25.75           NA    No
## 1341          56.05           NA    No
## 3332          19.85           NA    No
## 3827          25.35           NA    No
## 4381          20.00           NA    No
## 5219          19.70           NA    No
## 6671          73.35           NA    No
## 6755          61.90           NA    No

Checking proportion

sum(is.na(churn$TotalCharges))/nrow(churn)

## [1] 0.001561834

This subset is 0.16% of our data and is quite small. We will remove these cases in order to accomodate our further analyses. Let’s call this cleaned data churn_clean.

churn_clean <- churn[complete.cases(churn), ]

The SeniorCitizen variable is coded ‘0/1’ rather than yes/no. We can recode this to ease our interpretation of later graphs and models.

churn_clean$SeniorCitizen <- as.factor(mapvalues(churn_clean$SeniorCitizen,
                                          from=c("0","1"),
                                          to=c("No", "Yes")))

The MultipleLines variable is dependent on the PhoneService variable, where a ‘no’ for the latter variable automatically means a ‘no’ for the former variable. We can again further ease our graphics and modeling by recoding the ‘No phone service’ response to ‘No’ for the MultipleLines variable.

churn_clean$MultipleLines <- as.factor(mapvalues(churn_clean$MultipleLines, 
                                           from=c("No phone service"),
                                           to=c("No")))

Similiarly, the OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies variables are all dependent on the OnlineService variable. We will recode the responses from ‘No internet service’ to ‘No’ for these variables.

for(i in 10:15){
  churn_clean[,i] <- as.factor(mapvalues(churn_clean[,i],
                                  from= c("No internet service"), to= c("No")))
}

We will not need the customerID variable for graphs or modeling, so it can be removed.

churn_clean$customerID <- NULL

5 2. Data Visualization for Descriptive Statistics

Before we begin modeling our data, let’s examine descriptive statistics of our data within some simple plots.

5.1 Demographic Data

Here are barplots of demographic data of our sample.

#Gender plot
p1 <- ggplot(churn_clean, aes(x = gender)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Senior citizen plot
p2 <- ggplot(churn_clean, aes(x = SeniorCitizen)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Partner plot
p3 <- ggplot(churn_clean, aes(x = Partner)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Dependents plot
p4 <- ggplot(churn_clean, aes(x = Dependents)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Plot demographic data within a grid
grid.arrange(p1, p2, p3, p4, ncol=2)

From these demographic plots, we notice that the sample is evenly split across gender and partner status. A minority of the sample are senior citizens, and a minority have dependents.

5.2 Offered Services

The various offered services are plotted below.

#Phone service plot
p5 <- ggplot(churn_clean, aes(x = PhoneService)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Multiple phone lines plot
p6 <- ggplot(churn_clean, aes(x = MultipleLines)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Internet service plot
p7 <- ggplot(churn_clean, aes(x = InternetService)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Online security service plot
p8 <- ggplot(churn_clean, aes(x = OnlineSecurity)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Online backup service plot
p9 <- ggplot(churn_clean, aes(x = OnlineBackup)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Device Protection service plot
p10 <- ggplot(churn_clean, aes(x = DeviceProtection)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Tech Support service plot
p11 <- ggplot(churn_clean, aes(x = TechSupport)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Streaming TV service plot
p12 <- ggplot(churn_clean, aes(x = StreamingTV)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Streaming Movies service plot
p13 <- ggplot(churn_clean, aes(x = StreamingMovies)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Plot service data within a grid
grid.arrange(p5, p6, p7,
             p8, p9, p10,
             p11, p12, p13,
             ncol=3)

Most of the sample have phone service with a single phone line. Fiber optic internet connection is more popular than DSL internet service, and each online service has a minority of users.

5.3 Other Categorical Variables

The remaining categorical variables are related to contract and payment status

#Contract status plot
p14 <- ggplot(churn_clean, aes(x = Contract)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Paperless billing plot
p15 <- ggplot(churn_clean, aes(x = PaperlessBilling)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Payment method plot
p16 <- ggplot(churn_clean, aes(x = PaymentMethod)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Plot contract data within a grid
grid.arrange(p14, p15, p16, ncol=1)

5.4 Quantitative Variables

#Tenure histogram
p17 <- ggplot(data = churn_clean, aes(tenure, color = Churn))+
  geom_freqpoly(binwidth = 5, size = 1)

#Monthly charges histogram
p18 <- ggplot(data = churn_clean, aes(MonthlyCharges, color = Churn))+
  geom_freqpoly(binwidth = 5, size = 1)

#Total charges histogram
p19 <- ggplot(data = churn_clean, aes(TotalCharges, color = Churn))+
  geom_freqpoly(binwidth = 200, size = 1)

#Plot quantitative data within a grid
grid.arrange(p17, p18, p19, ncol=1)

The tenure variable is stacked at the tails, so a large proportin of customers have either been had the shortest (1 month) or longest (72 month) tenure. It appears as if the MonthlyCharges variable is roughly normaly distribued around $80 per month with a large stack near the lowest rates. The TotalCharges variable is positively skewed with a large stack near the lower amounts.

5.5 Churn : Variable of interest

Lastly, let’s examine our main outcome variable of interest, churn.

p20 <- ggplot(churn_clean, aes(x = Churn)) +
  geom_bar(aes(fill = Churn)) +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)
p20

Roughly a quarter of our sample are no longer customers. Let’s try to predict those that churn with some classification modeling techniques.

6 Checking Correlation

churn_clean %>%
  dplyr::select (TotalCharges, MonthlyCharges, tenure) %>%
  cor() %>%
  corrplot.mixed(upper = "circle", tl.col = "black", number.cex = 0.7)

The plot shows high correlations between Totalcharges & tenure and between TotalCharges & MonthlyCharges. Pay attention to these variables while training models later. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set. But it affects calculations regarding individual predictors.

In order to assess the performance of our various modeling techniques, we can split the data into training and test subsets. We will model the training data and use these model parameters to make predictions with the test data. Let’s call these data subsets dtrain and dtest.

We will randomly sample from the entire sample to create these subsets. The ‘set.seed()’ function argument can be changed in order to reset the random number generator used for sampling. The training subset will be roughly 70% of the original sample, with the remaining being the test subset.

set.seed(56)
split_train_test <- createDataPartition(churn_clean$Churn,p=0.7,list=FALSE)
dtrain<- churn_clean[split_train_test,]
dtest<-  churn_clean[-split_train_test,]

# Remove Total Charges from the training dataset

dtrain <- dtrain[,-19]
dtest <- dtest[,-19]

The tenure represents time period in months. To better find patterns with time, I change it to a factor with 5 levels, with each level represents a bin of tenure in years.

7 Modelling

In this project I will use 3 machine learning models (Naive Bayes, Decision Tree, and Random Forest)

7.1 Decision tree analysis

Decision tree analysis is a classification method that uses tree-like models of decisions and their possible outcomes. This method is one of the most commonly used tools in machine learning analysis. We will use the rpart library in order to use recursive partitioning methods for decision trees. This exploratory method will identify the most important variables related to churn in a hierarchical format.

tr_fit <- rpart(Churn ~., data = dtrain, method="class")
rpart.plot(tr_fit)

From this decision tree, we can interpret the following:

The contract variable is the most important. Customers with month-to-month contracts are more likely to churn. Customers with DSL internet service are less likely to churn. Customers who have stayed longer than 15 months are less likely to churn. Now let’s assess the prediction accuracy of the decision tree model by investigating how well it predicts churn in the test subset. We will begin with the confustion matrix, which is a useful display of classification accuracy. It displays the following information:

true positives (TP): These are cases in which we predicted yes (they churned), and they did churn. true negatives (TN): We predicted no, and they didn’t churn. false positives (FP): We predicted yes, but they didn’t actually churn. (Also known as a “Type I error.”) false negatives (FN): We predicted no, but they actually churned. (Also known as a “Type II error.”) Let’s examine the confusion matrix for our decision tree model.

tr_prob1 <- predict(tr_fit, dtest)
tr_pred1 <- ifelse(tr_prob1[,2] > 0.5,"Yes","No")
table(Predicted = tr_pred1, Actual = dtest$Churn)

##          Actual
## Predicted   No  Yes
##       No  1406  284
##       Yes  142  276

The diagonal entries give our correct predictions, with the upper left being TN and the lower right being TP. The upper right gives the FN while the lower left gives the FP. From this confusion matrix, we can see that the model performs well at predicting non-churning customers (1466 correct vs. 82 incorrect) but does not perform as well at predicting churning customers (232 correct vs. 328 incorrect).

How about the overall accuracy of the decision tree model?

tr_prob2 <- predict(tr_fit, dtrain)
tr_pred2 <- ifelse(tr_prob2[,2] > 0.5,"Yes","No")
tr_tab1 <- table(Predicted = tr_pred2, Actual = dtrain$Churn)
tr_tab2 <- table(Predicted = tr_pred1, Actual = dtest$Churn)

# Train
confusionMatrix(
  as.factor(tr_pred2),
  as.factor(dtrain$Churn),
  positive = "Yes" 
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  3272  674
##        Yes  343  635
##                                                
##                Accuracy : 0.7935               
##                  95% CI : (0.7819, 0.8047)     
##     No Information Rate : 0.7342               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.4245               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.4851               
##             Specificity : 0.9051               
##          Pos Pred Value : 0.6493               
##          Neg Pred Value : 0.8292               
##              Prevalence : 0.2658               
##          Detection Rate : 0.1290               
##    Detection Prevalence : 0.1986               
##       Balanced Accuracy : 0.6951               
##                                                
##        'Positive' Class : Yes                  
##

# Test
confusionMatrix(
as.factor(tr_pred1),
  as.factor(dtest$Churn),
  positive = "Yes" 
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1406  284
##        Yes  142  276
##                                            
##                Accuracy : 0.7979           
##                  95% CI : (0.7801, 0.8149) 
##     No Information Rate : 0.7343           
##     P-Value [Acc > NIR] : 0.000000000006398
##                                            
##                   Kappa : 0.4364           
##                                            
##  Mcnemar's Test P-Value : 0.000000000008405
##                                            
##             Sensitivity : 0.4929           
##             Specificity : 0.9083           
##          Pos Pred Value : 0.6603           
##          Neg Pred Value : 0.8320           
##              Prevalence : 0.2657           
##          Detection Rate : 0.1309           
##    Detection Prevalence : 0.1983           
##       Balanced Accuracy : 0.7006           
##                                            
##        'Positive' Class : Yes              
##

tr_acc <- sum(diag(tr_tab2))/sum(tr_tab2)
tr_acc

## [1] 0.7979127

The decision tree model is fairly accurate, correctly predicting the churn status of customers in the test subset 79% of the time.

7.2 Random forest analysis

Random forest analysis is another machine learning classification method that is often used in customer churn analysis. The method operates by constructing multiple decision trees and constructing models based on summary statistics of these decision trees.

We will begin by identifying the number of variables randomly sampled as candidates at each split of the algorithm. In the randomForest package, this is referred to as the ‘mtry’ parameter or argument.

#Set control parameters for random forest model selection
ctrl <- trainControl(method = "cv", number=5, 
                     classProbs = TRUE, summaryFunction = twoClassSummary)

#Exploratory random forest model selection
# rf_fit1 <- train(Churn ~., data = dtrain,
#                  method = "rf",
#                  ntree = 75,
#                  tuneLength = 5,
#                  metric = "ROC",
#                  trControl = ctrl)

# saveRDS(rf_fit1, "Churn.RDS")

rf_fit1 <- readRDS("Churn.RDS")

The model found that the optimal value for ‘mtry’ is 2. From this model we can investigate the relative importance of the churn predictor variables.

#Run optimal model
rf_fit2 <- randomForest(Churn ~., data = dtrain, 
                        ntree = 75, mtry = 2, 
                        importance = TRUE, proximity = TRUE)

#Display variable importance from random tree
varImpPlot(rf_fit2, sort=T, n.var = 10, 
           main = 'Top 10 important variables')

Similar to the decision tree, this random forest model has identified contract status and tenure length as important predictors for churn. Internet service status does not appear as important in this model, and the total charges variable is now highly emphasized.

Let’s examine the performance of this random forest model. We’ll begin with the confusion matrix.

rf_pred1 <- predict(rf_fit2, dtest)
table(Predicted = rf_pred1, Actual = dtest$Churn)

##          Actual
## Predicted   No  Yes
##       No  1429  283
##       Yes  119  277

plot(rf_fit2)

The performance is somewhat similar to the decision tree model. The false negative rate is low (1445 correct vs. 103 incorrect) but the false positive rate is rather high (272 correct vs. 288 incorrect). What about the overall accuracy?

rf_pred2 <- predict(rf_fit2, dtrain)
rf_tab1 <- table(Predicted = rf_pred2, Actual = dtrain$Churn)
rf_tab2 <- table(Predicted = rf_pred1, Actual = dtest$Churn)

# Train
confusionMatrix(
  as.factor(rf_pred2),
  as.factor(dtrain$Churn),
  positive = "Yes" 
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  3459  439
##        Yes  156  870
##                                                
##                Accuracy : 0.8792               
##                  95% CI : (0.8697, 0.8881)     
##     No Information Rate : 0.7342               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6675               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.6646               
##             Specificity : 0.9568               
##          Pos Pred Value : 0.8480               
##          Neg Pred Value : 0.8874               
##              Prevalence : 0.2658               
##          Detection Rate : 0.1767               
##    Detection Prevalence : 0.2084               
##       Balanced Accuracy : 0.8107               
##                                                
##        'Positive' Class : Yes                  
##

# Test
confusionMatrix(
as.factor(rf_pred1),
  as.factor(dtest$Churn),
  positive = "Yes" 
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1429  283
##        Yes  119  277
##                                                
##                Accuracy : 0.8093               
##                  95% CI : (0.7919, 0.8259)     
##     No Information Rate : 0.7343               
##     P-Value [Acc > NIR] : 0.0000000000000004604
##                                                
##                   Kappa : 0.4608               
##                                                
##  Mcnemar's Test P-Value : 0.0000000000000004304
##                                                
##             Sensitivity : 0.4946               
##             Specificity : 0.9231               
##          Pos Pred Value : 0.6995               
##          Neg Pred Value : 0.8347               
##              Prevalence : 0.2657               
##          Detection Rate : 0.1314               
##    Detection Prevalence : 0.1879               
##       Balanced Accuracy : 0.7089               
##                                                
##        'Positive' Class : Yes                  
##

rf_acc <- sum(diag(rf_tab2))/sum(rf_tab2)
rf_acc

## [1] 0.8092979

As we can see from the confusion matrix analysis. Random forest model predict very good for the train data, but not very much on the test data. This indicating an onverfit on the model.

The random forest model is slightly more accurate than the decision tree model, being able to correctly predict the churn status of a customer in the test subset with 80% accuracy.

7.3 Logistic Regression Analysis

Our final statistical method will be logistic regression, a more classic method compared to the two above machine learning based methods. Logistic regression involves regressing predictor variables on a binary outcome using a binomial link function. Let’s fit the model using the base general linear modeling function in R, glm.

lr_fit <- glm(Churn ~., data = dtrain,
          family=binomial(link='logit'))
summary(lr_fit)

## 
## Call:
## glm(formula = Churn ~ ., family = binomial(link = "logit"), data = dtrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8966  -0.6818  -0.2767   0.7457   3.2597  
## 
## Coefficients:
##                                         Estimate  Std. Error z value
## (Intercept)                           1.23152883  0.96365570   1.278
## genderMale                           -0.01242416  0.07748499  -0.160
## SeniorCitizenYes                      0.23575590  0.10011663   2.355
## PartnerYes                           -0.07698419  0.09254966  -0.832
## DependentsYes                        -0.18956650  0.10708647  -1.770
## tenure                               -0.06959758  0.00772755  -9.006
## PhoneServiceYes                      -0.13680249  0.76776087  -0.178
## MultipleLinesYes                      0.38236850  0.20877114   1.832
## InternetServiceFiber optic            1.31558759  0.94221565   1.396
## InternetServiceNo                    -1.70028438  0.95694358  -1.777
## OnlineSecurityYes                    -0.22175750  0.21021421  -1.055
## OnlineBackupYes                      -0.11890320  0.20742964  -0.573
## DeviceProtectionYes                   0.03831068  0.20800957   0.184
## TechSupportYes                       -0.30327692  0.21364870  -1.420
## StreamingTVYes                        0.42712767  0.38537693   1.108
## StreamingMoviesYes                    0.43230845  0.38628263   1.119
## ContractOne year                     -0.62277873  0.12931299  -4.816
## ContractTwo year                     -1.38858477  0.21452923  -6.473
## PaperlessBillingYes                   0.32051922  0.08915123   3.595
## PaymentMethodCredit card (automatic) -0.21018605  0.13551365  -1.551
## PaymentMethodElectronic check         0.23654707  0.11248087   2.103
## PaymentMethodMailed check            -0.23447565  0.13843393  -1.694
## MonthlyCharges                       -0.02911674  0.03751901  -0.776
## TotalCharges                          0.00046359  0.00008674   5.344
##                                                  Pr(>|z|)    
## (Intercept)                                      0.201258    
## genderMale                                       0.872611    
## SeniorCitizenYes                                 0.018532 *  
## PartnerYes                                       0.405513    
## DependentsYes                                    0.076691 .  
## tenure                               < 0.0000000000000002 ***
## PhoneServiceYes                                  0.858579    
## MultipleLinesYes                                 0.067023 .  
## InternetServiceFiber optic                       0.162633    
## InternetServiceNo                                0.075603 .  
## OnlineSecurityYes                                0.291466    
## OnlineBackupYes                                  0.566495    
## DeviceProtectionYes                              0.853874    
## TechSupportYes                                   0.155750    
## StreamingTVYes                                   0.267716    
## StreamingMoviesYes                               0.263076    
## ContractOne year                          0.0000014642283 ***
## ContractTwo year                          0.0000000000963 ***
## PaperlessBillingYes                              0.000324 ***
## PaymentMethodCredit card (automatic)             0.120894    
## PaymentMethodElectronic check                    0.035466 *  
## PaymentMethodMailed check                        0.090308 .  
## MonthlyCharges                                   0.437718    
## TotalCharges                              0.0000000907512 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5702.8  on 4923  degrees of freedom
## Residual deviance: 4074.2  on 4900  degrees of freedom
## AIC: 4122.2
## 
## Number of Fisher Scoring iterations: 6

By examining the significance values, we see similar predictor variables of importance. Tenure length, contract status, and total charges have the lowest p-values and can be identified as the best predictors of customer churn.

Let’s examine the confusion matrix based on our logistic regression model.

lr_prob1 <- predict(lr_fit, dtest, type="response")
lr_pred1 <- ifelse(lr_prob1 > 0.5,"Yes","No")
table(Predicted = lr_pred1, Actual = dtest$Churn)

##          Actual
## Predicted   No  Yes
##       No  1403  247
##       Yes  145  313

Similar to the machine learning algorithms, the false negative rate is low (1395 correct vs. 153 incorrect) while not as quite as low. In contrast, the false positive rate (332 correct vs. 228 incorrect) is actually above 50%, therefore performing better than the machine learning algorithms.

The overal prediction accuracy can be obtained similar to the previous models.

lr_prob2 <- predict(lr_fit, dtrain, type="response")
lr_pred2 <- ifelse(lr_prob2 > 0.5,"Yes","No")
lr_tab1 <- table(Predicted = lr_pred2, Actual = dtrain$Churn)
lr_tab2 <- table(Predicted = lr_pred1, Actual = dtest$Churn)

# Train
confusionMatrix(
  as.factor(lr_pred2),
  as.factor(dtrain$Churn),
  positive = "Yes" 
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  3227  595
##        Yes  388  714
##                                                
##                Accuracy : 0.8004               
##                  95% CI : (0.7889, 0.8115)     
##     No Information Rate : 0.7342               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.4614               
##                                                
##  Mcnemar's Test P-Value : 0.00000000005019     
##                                                
##             Sensitivity : 0.5455               
##             Specificity : 0.8927               
##          Pos Pred Value : 0.6479               
##          Neg Pred Value : 0.8443               
##              Prevalence : 0.2658               
##          Detection Rate : 0.1450               
##    Detection Prevalence : 0.2238               
##       Balanced Accuracy : 0.7191               
##                                                
##        'Positive' Class : Yes                  
##

# Test
confusionMatrix(
as.factor(lr_pred1),
  as.factor(dtest$Churn),
  positive = "Yes" 
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1403  247
##        Yes  145  313
##                                                
##                Accuracy : 0.814                
##                  95% CI : (0.7968, 0.8304)     
##     No Information Rate : 0.7343               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.494                
##                                                
##  Mcnemar's Test P-Value : 0.0000003374         
##                                                
##             Sensitivity : 0.5589               
##             Specificity : 0.9063               
##          Pos Pred Value : 0.6834               
##          Neg Pred Value : 0.8503               
##              Prevalence : 0.2657               
##          Detection Rate : 0.1485               
##    Detection Prevalence : 0.2173               
##       Balanced Accuracy : 0.7326               
##                                                
##        'Positive' Class : Yes                  
##

lr_acc <- sum(diag(lr_tab2))/sum(lr_tab2)
lr_acc

## [1] 0.8140417

The 81.4% accuracy rate of the logistic regression model slightly outperforms the decision tree and random forest models.

Now that we have fit several models and identified some important predictor variables for customer churn, let’s explore some further graphs based on our findings.

8 Data Visualization based on Models

Our modeling efforts pointed to several important churn predictors: contract status, internet status, tenure length, and total charges. Let’s examine how these variables split by churn status.

We will begin with the contract status variable.

p21 <- ggplot(churn_clean, aes(x = Contract, fill = Churn)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3) +
  labs(title="Churn rate by contract status")

p21

As would be expected, the churn rate of month-to-month contract customers is much higher than the longer contract customers. Customers who are more willing to commit to longer contracts are less likely to leave.

What does this look like for the internet service status of the customer?

p22 <- ggplot(churn_clean, aes(x = InternetService, fill = Churn)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3) +
  labs(title="Churn rate by internet service status")

p22

It appears as if customers with internet service are more likely to churn than those that don’t. This is more pronounced for customers with fiber optic internet service, who are the most likely to churn.

Let’s examine the churn split for the tenure length distribution.

p23 <- ggplot(churn_clean, aes(x = tenure, fill = Churn)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Months",
       title = "Churn rate by tenure")
p23

As expected, the length of time as a customer decreases the likelihood of churn. There is a large spike at 1 month, indicating that there are a large portion of customers that will leave the after just one month of service.

How about the distribution for total charges split by churn?

p24 <- ggplot(churn_clean, aes(x = TotalCharges, fill = Churn)) +
  geom_histogram(binwidth = 100) +
  labs(x = "Dollars (binwidth=100)",
       title = "Churn rate by tenure")
p24

Similar to the tenure trend, customers who have spent more with the company tend not to leave. This could just be a reflection of the tenure effect, or it could be due to financial characteristics of the customer: customers who are more financially well off are less likely to leave.

9 Conclusion

After going through various preparatory steps including data/library loading and preprocessing, we carried out three statistical classification methods common in churn analysis. We identified several important churn predictor variables from these models and compared these models on accuracy measures.

Here is a summary of our findings:

Customers with month-to-month contracts are less likely to churn.
Customers with internet service, in particular fiber optic service, are more likely to churn.
Customers who have been with the company longer or have paid more in total are less likely to churn.

Logistic regression, although a less complicated method, outperformed machine-learning based methods of decion tree and random forest analysis. Although logistic regression performed slightly less in terms of false negatives, it had a better false positive rate and was more accurate overall.

Tellecomunication Churn Prediction

Ezra Soterion Nugroho

4/7/2020