By: Lawrence Lo

Introduction

Customer churn, also known as customer attrition, is the loss of clients or customers. Churn is an important business metric for subscription-based services such as telecommunications companies. This project demonstrates a churn analysis using data downloaded from the IBM sample data sets. We will use the R statistical programming language to identify variables associated with customer churn.

In this project, we will carry out the following tasks:

  1. Load the data and the relevant R libraries.
  2. Preprocess the data with various cleaning and recoding techniques.
  3. Provide data visualizations of descriptive statistics of the data.
  4. Fit models using statistical classification methods commonly used in churn analysis.
    • Decision tree analysis
    • Random forest analysis
    • Logistic regression
  5. Examine additional data visualizations of selected variables based on our modeling results.

1. Loading in data and R libraries

We will begin by loading the R libraries we will need for the project.

library(plyr)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(rpart)
library(rpart.plot)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
## 
##     combine

Now let’s read in the data. You will need to modify the file path within the quotation marks to match the location of the data on your computer. The first few rows are shown below.

#Read data file
dat <- read.csv("C:\\Users\\Lawrence\\Documents\\Portfolio1\\TelcoChurn\\WA_Fn-UseC_-Telco-Customer-Churn.csv")

#Examine data
head(dat)
##   customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female             0     Yes         No      1           No
## 2 5575-GNVDE   Male             0      No         No     34          Yes
## 3 3668-QPYBK   Male             0      No         No      2          Yes
## 4 7795-CFOCW   Male             0      No         No     45           No
## 5 9237-HQITU Female             0      No         No      2          Yes
## 6 9305-CDSKC Female             0      No         No      8          Yes
##      MultipleLines InternetService OnlineSecurity OnlineBackup
## 1 No phone service             DSL             No          Yes
## 2               No             DSL            Yes           No
## 3               No             DSL            Yes          Yes
## 4 No phone service             DSL            Yes           No
## 5               No     Fiber optic             No           No
## 6              Yes     Fiber optic             No           No
##   DeviceProtection TechSupport StreamingTV StreamingMovies       Contract
## 1               No          No          No              No Month-to-month
## 2              Yes          No          No              No       One year
## 3               No          No          No              No Month-to-month
## 4              Yes         Yes          No              No       One year
## 5               No          No          No              No Month-to-month
## 6              Yes          No         Yes             Yes Month-to-month
##   PaperlessBilling             PaymentMethod MonthlyCharges TotalCharges
## 1              Yes          Electronic check          29.85        29.85
## 2               No              Mailed check          56.95      1889.50
## 3              Yes              Mailed check          53.85       108.15
## 4               No Bank transfer (automatic)          42.30      1840.75
## 5              Yes          Electronic check          70.70       151.65
## 6              Yes          Electronic check          99.65       820.50
##   Churn
## 1    No
## 2    No
## 3   Yes
## 4    No
## 5   Yes
## 6   Yes

The IBM sample data set website provides a data dictionary describing each of these variables.
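
If you would like to verify the variable types directly in R, the str() function gives a compact overview (a quick optional check; its output is omitted here):

#Optional: inspect variable names and types
str(dat)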

2. Data preprocessing

Let’s begin by seeing if there is any missing data.

sapply(dat, function(x) sum(is.na(x)))
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0

There are 11 cases with missing values in the TotalCharges variable. Let’s see these particular cases.

dat[is.na(dat$TotalCharges),]
##      customerID gender SeniorCitizen Partner Dependents tenure
## 489  4472-LVYGI Female             0     Yes        Yes      0
## 754  3115-CZMZD   Male             0      No        Yes      0
## 937  5709-LVOEQ Female             0     Yes        Yes      0
## 1083 4367-NUYAO   Male             0     Yes        Yes      0
## 1341 1371-DWPAZ Female             0     Yes        Yes      0
## 3332 7644-OMVMY   Male             0     Yes        Yes      0
## 3827 3213-VVOLG   Male             0     Yes        Yes      0
## 4381 2520-SGTTA Female             0     Yes        Yes      0
## 5219 2923-ARZLG   Male             0     Yes        Yes      0
## 6671 4075-WKNIU Female             0     Yes        Yes      0
## 6755 2775-SEFEE   Male             0      No        Yes      0
##      PhoneService    MultipleLines InternetService      OnlineSecurity
## 489            No No phone service             DSL                 Yes
## 754           Yes               No              No No internet service
## 937           Yes               No             DSL                 Yes
## 1083          Yes              Yes              No No internet service
## 1341           No No phone service             DSL                 Yes
## 3332          Yes               No              No No internet service
## 3827          Yes              Yes              No No internet service
## 4381          Yes               No              No No internet service
## 5219          Yes               No              No No internet service
## 6671          Yes              Yes             DSL                  No
## 6755          Yes              Yes             DSL                 Yes
##             OnlineBackup    DeviceProtection         TechSupport
## 489                   No                 Yes                 Yes
## 754  No internet service No internet service No internet service
## 937                  Yes                 Yes                  No
## 1083 No internet service No internet service No internet service
## 1341                 Yes                 Yes                 Yes
## 3332 No internet service No internet service No internet service
## 3827 No internet service No internet service No internet service
## 4381 No internet service No internet service No internet service
## 5219 No internet service No internet service No internet service
## 6671                 Yes                 Yes                 Yes
## 6755                 Yes                  No                 Yes
##              StreamingTV     StreamingMovies Contract PaperlessBilling
## 489                  Yes                  No Two year              Yes
## 754  No internet service No internet service Two year               No
## 937                  Yes                 Yes Two year               No
## 1083 No internet service No internet service Two year               No
## 1341                 Yes                  No Two year               No
## 3332 No internet service No internet service Two year               No
## 3827 No internet service No internet service Two year               No
## 4381 No internet service No internet service Two year               No
## 5219 No internet service No internet service One year              Yes
## 6671                 Yes                  No Two year               No
## 6755                  No                  No Two year              Yes
##                  PaymentMethod MonthlyCharges TotalCharges Churn
## 489  Bank transfer (automatic)          52.55           NA    No
## 754               Mailed check          20.25           NA    No
## 937               Mailed check          80.85           NA    No
## 1083              Mailed check          25.75           NA    No
## 1341   Credit card (automatic)          56.05           NA    No
## 3332              Mailed check          19.85           NA    No
## 3827              Mailed check          25.35           NA    No
## 4381              Mailed check          20.00           NA    No
## 5219              Mailed check          19.70           NA    No
## 6671              Mailed check          73.35           NA    No
## 6755 Bank transfer (automatic)          61.90           NA    No

Inspection of the Churn variable shows that these are all still subscribing customers. What proportion of our sample do these cases with missing values represent?

sum(is.na(dat$TotalCharges))/nrow(dat)
## [1] 0.001561834

These cases, all customers with a tenure of 0 months, make up only 0.16% of our data. We will remove them before further analysis. Let’s call this cleaned data datc.

datc <- dat[complete.cases(dat), ]

The SeniorCitizen variable is coded ‘0/1’ rather than yes/no. We can recode this to ease our interpretation of later graphs and models.

datc$SeniorCitizen <- as.factor(mapvalues(datc$SeniorCitizen,
                                          from=c("0","1"),
                                          to=c("No", "Yes")))

The MultipleLines variable is dependent on the PhoneService variable: a ‘No’ for PhoneService automatically means ‘No phone service’ for MultipleLines. We can further simplify our graphics and modeling by recoding the ‘No phone service’ response to ‘No’ for the MultipleLines variable.

datc$MultipleLines <- as.factor(mapvalues(datc$MultipleLines, 
                                           from=c("No phone service"),
                                           to=c("No")))

Similarly, the OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies variables are all dependent on the InternetService variable. We will recode the responses from ‘No internet service’ to ‘No’ for these variables.

for(i in 10:15){
  datc[,i] <- as.factor(mapvalues(datc[,i],
                                  from= c("No internet service"), to= c("No")))
}

We will not need the customerID variable for graphs or modeling, so it can be removed.

datc$customerID <- NULL

In order to assess the performance of our various modeling techniques, we can split the data into training and test subsets. We will model the training data and use these model parameters to make predictions with the test data. Let’s call these data subsets dtrain and dtest.

We will randomly sample from the entire data set to create these subsets. Setting a seed with set.seed() makes the random sampling reproducible; change its argument to draw a different split. The training subset will be roughly 70% of the original sample, with the remainder forming the test subset.

set.seed(56)
split_train_test <- createDataPartition(datc$Churn,p=0.7,list=FALSE)
dtrain<- datc[split_train_test,]
dtest<- datc[-split_train_test,]

3. Data visualization of descriptive statistics

Before we begin modeling our data, let’s examine descriptive statistics of our data within some simple plots.

Here are barplots of demographic data of our sample.

#Gender plot
p1 <- ggplot(datc, aes(x = gender)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Senior citizen plot
p2 <- ggplot(datc, aes(x = SeniorCitizen)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Partner plot
p3 <- ggplot(datc, aes(x = Partner)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Dependents plot
p4 <- ggplot(datc, aes(x = Dependents)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Plot demographic data within a grid
grid.arrange(p1, p2, p3, p4, ncol=2)

From these demographic plots, we notice that the sample is evenly split across gender and partner status. A minority of the sample are senior citizens, and a minority have dependents.

The various offered services are plotted below.

#Phone service plot
p5 <- ggplot(datc, aes(x = PhoneService)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Multiple phone lines plot
p6 <- ggplot(datc, aes(x = MultipleLines)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Internet service plot
p7 <- ggplot(datc, aes(x = InternetService)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Online security service plot
p8 <- ggplot(datc, aes(x = OnlineSecurity)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Online backup service plot
p9 <- ggplot(datc, aes(x = OnlineBackup)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Device Protection service plot
p10 <- ggplot(datc, aes(x = DeviceProtection)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Tech Support service plot
p11 <- ggplot(datc, aes(x = TechSupport)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Streaming TV service plot
p12 <- ggplot(datc, aes(x = StreamingTV)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Streaming Movies service plot
p13 <- ggplot(datc, aes(x = StreamingMovies)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Plot service data within a grid
grid.arrange(p5, p6, p7,
             p8, p9, p10,
             p11, p12, p13,
             ncol=3)

Most of the sample have phone service with a single phone line. Fiber optic internet connection is more popular than DSL internet service, and each online service has a minority of users.

The remaining categorical variables are related to contract and payment status.

#Contract status plot
p14 <- ggplot(datc, aes(x = Contract)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Paperless billing plot
p15 <- ggplot(datc, aes(x = PaperlessBilling)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Payment method plot
p16 <- ggplot(datc, aes(x = PaymentMethod)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)

#Plot contract data within a grid
grid.arrange(p14, p15, p16, ncol=1)

Roughly half of the sample are on month-to-month contracts, with the remainder split between one- and two-year contracts. Most of the sample are on paperless billing and pay by electronic check.

Let’s look at distributions of the quantitative variables.

#Tenure histogram
p17 <- ggplot(datc, aes(x = tenure)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Months",
       title = "Tenure Distribtion")

#Monthly charges histogram
p18 <- ggplot(datc, aes(x = MonthlyCharges)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Dollars (binwidth = 5)",
       title = "Monthly charges Distribtion")

#Total charges histogram
p19 <- ggplot(datc, aes(x = TotalCharges)) +
  geom_histogram(binwidth = 100) +
  labs(x = "Dollars (binwidth = 100)",
       title = "Total charges Distribtion")

#Plot quantitative data within a grid
grid.arrange(p17, p18, p19, ncol=1)

The tenure variable is stacked at the tails, so a large proportion of customers have either the shortest (1 month) or the longest (72 months) tenure. The MonthlyCharges variable appears roughly normally distributed around $80 per month, with a large stack near the lowest rates. The TotalCharges variable is positively skewed, with a large stack near the lower amounts.
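
These observations can be checked against the numeric summaries (a quick optional sketch, not part of the original output):

#Optional: numeric summaries of the quantitative variables
summary(datc[, c("tenure", "MonthlyCharges", "TotalCharges")])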

Lastly, let’s examine our main outcome variable of interest, churn.

p20 <- ggplot(datc, aes(x = Churn)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3)
p20

Roughly a quarter of our sample are no longer customers. Let’s try to predict those that churn with some classification modeling techniques.

4. Statistical modeling

Three modeling techniques will be applied to the training data subset, and the fitted models will then be used to make predictions on the test subset. We will examine accuracy in terms of both the percentage of correct predictions and confusion matrices.

Decision tree analysis

Decision tree analysis is a classification method that uses tree-like models of decisions and their possible outcomes. It is one of the most commonly used tools in machine learning. We will use the rpart library, which implements recursive partitioning for decision trees. This exploratory method will identify the most important variables related to churn in a hierarchical format.

tr_fit <- rpart(Churn ~., data = dtrain, method="class")
rpart.plot(tr_fit)

From this decision tree, we can interpret the following:

  • The contract variable is the most important. Customers with month-to-month contracts are more likely to churn.
  • Customers with DSL internet service are less likely to churn.
  • Customers who have stayed longer than 15 months are less likely to churn.

Now let’s assess the prediction accuracy of the decision tree model by investigating how well it predicts churn in the test subset. We will begin with the confusion matrix, a useful display of classification accuracy. It contains the following information:

  • true positives (TP): These are cases in which we predicted yes (they churned), and they did churn.
  • true negatives (TN): We predicted no, and they didn’t churn.
  • false positives (FP): We predicted yes, but they didn’t actually churn. (Also known as a “Type I error.”)
  • false negatives (FN): We predicted no, but they actually churned. (Also known as a “Type II error.”)

Let’s examine the confusion matrix for our decision tree model.

tr_prob1 <- predict(tr_fit, dtest)
tr_pred1 <- ifelse(tr_prob1[,2] > 0.5,"Yes","No")
table(Predicted = tr_pred1, Actual = dtest$Churn)
##          Actual
## Predicted   No  Yes
##       No  1466  328
##       Yes   82  232

The diagonal entries give our correct predictions, with the upper left being TN and the lower right being TP. The upper right gives the FN while the lower left gives the FP. From this confusion matrix, we can see that the model performs well at predicting non-churning customers (1466 correct vs. 82 incorrect) but does not perform as well at predicting churning customers (232 correct vs. 328 incorrect).
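To quantify this asymmetry, the sensitivity (proportion of churners correctly identified) and specificity (proportion of non-churners correctly identified) can be computed directly from the table, or with caret’s confusionMatrix() function (an optional sketch; output omitted, and "Yes" is treated as the positive class):

#Optional: sensitivity and specificity for the decision tree predictions
confusionMatrix(factor(tr_pred1, levels = c("No", "Yes")),
                dtest$Churn, positive = "Yes")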

How about the overall accuracy of the decision tree model?

tr_prob2 <- predict(tr_fit, dtrain)
tr_pred2 <- ifelse(tr_prob2[,2] > 0.5,"Yes","No")
tr_tab1 <- table(Predicted = tr_pred2, Actual = dtrain$Churn)
tr_tab2 <- table(Predicted = tr_pred1, Actual = dtest$Churn)
tr_acc <- sum(diag(tr_tab2))/sum(tr_tab2)
tr_acc
## [1] 0.8055028

The decision tree model is fairly accurate, correctly predicting the churn status of customers in the test subset 80.55% of the time.

Random forest analysis

Random forest analysis is another machine learning classification method often used in customer churn analysis. It operates by constructing many decision trees and aggregating their predictions.

We will begin by identifying the number of variables randomly sampled as candidates at each split of the algorithm. In the randomForest package, this is referred to as the ‘mtry’ parameter or argument.

#Set control parameters for random forest model selection
ctrl <- trainControl(method = "cv", number=5, 
                     classProbs = TRUE, summaryFunction = twoClassSummary)

#Exploratory random forest model selection
rf_fit1 <- train(Churn ~., data = dtrain,
                 method = "rf",
                 ntree = 75,
                 tuneLength = 5,
                 metric = "ROC",
                 trControl = ctrl)
rf_fit1
## Random Forest 
## 
## 4924 samples
##   19 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3939, 3939, 3939, 3939, 3940 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.8268924  0.9186722  0.4453921
##    7    0.8136580  0.8946058  0.4805534
##   12    0.8085244  0.8879668  0.4767132
##   17    0.8087431  0.8865837  0.4767336
##   23    0.8067040  0.8915629  0.4790149
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

The model found that the optimal value for ‘mtry’ is 2. From this model we can investigate the relative importance of the churn predictor variables.

#Run optimal model
rf_fit2 <- randomForest(Churn ~., data = dtrain, 
                        ntree = 75, mtry = 2, 
                        importance = TRUE, proximity = TRUE)

#Display variable importance from random tree
varImpPlot(rf_fit2, sort=T, n.var = 10, 
           main = 'Top 10 important variables')

Similar to the decision tree, this random forest model has identified contract status and tenure length as important predictors for churn. Internet service status does not appear as important in this model, and the total charges variable is now highly emphasized.
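
The numeric importance measures behind this plot can also be inspected directly (an optional sketch, assuming we sort by the mean decrease in Gini index):

#Optional: list the top variables by mean decrease in Gini index
imp <- importance(rf_fit2)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ][1:10, ]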

Let’s examine the performance of this random forest model. We’ll begin with the confusion matrix.

rf_pred1 <- predict(rf_fit2, dtest)
table(Predicted = rf_pred1, Actual = dtest$Churn)
##          Actual
## Predicted   No  Yes
##       No  1445  288
##       Yes  103  272

The performance is broadly similar to the decision tree model. The model correctly identifies most non-churning customers (1445 correct vs. 103 incorrect) but misses more than half of the churning customers (272 correct vs. 288 incorrect). What about the overall accuracy?

rf_pred2 <- predict(rf_fit2, dtrain)
rf_tab1 <- table(Predicted = rf_pred2, Actual = dtrain$Churn)
rf_tab2 <- table(Predicted = rf_pred1, Actual = dtest$Churn)
rf_acc <- sum(diag(rf_tab2))/sum(rf_tab2)
rf_acc
## [1] 0.8145161

The random forest model is slightly more accurate than the decision tree model, being able to correctly predict the churn status of a customer in the test subset with 81.45% accuracy.

Logistic regression analysis

Our final statistical method is logistic regression, a more classic technique than the two machine learning methods above. Logistic regression models a binary outcome as a function of the predictor variables using a logit link. Let’s fit the model using R’s base generalized linear modeling function, glm.

lr_fit <- glm(Churn ~., data = dtrain,
          family=binomial(link='logit'))
summary(lr_fit)
## 
## Call:
## glm(formula = Churn ~ ., family = binomial(link = "logit"), data = dtrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9220  -0.6926  -0.2899   0.7488   3.4425  
## 
## Coefficients:
##                                        Estimate Std. Error z value
## (Intercept)                           2.056e+00  9.710e-01   2.118
## genderMale                           -3.932e-02  7.715e-02  -0.510
## SeniorCitizenYes                      3.158e-01  1.004e-01   3.146
## PartnerYes                            4.608e-03  9.331e-02   0.049
## DependentsYes                        -1.024e-01  1.087e-01  -0.942
## tenure                               -6.429e-02  7.447e-03  -8.633
## PhoneServiceYes                       9.624e-01  7.721e-01   1.247
## MultipleLinesYes                      6.519e-01  2.125e-01   3.068
## InternetServiceFiber optic            2.482e+00  9.474e-01   2.620
## InternetServiceNo                    -2.738e+00  9.606e-01  -2.851
## OnlineSecurityYes                    -2.438e-02  2.128e-01  -0.115
## OnlineBackupYes                       2.669e-01  2.095e-01   1.274
## DeviceProtectionYes                   3.036e-01  2.102e-01   1.444
## TechSupportYes                       -1.026e-01  2.141e-01  -0.479
## StreamingTVYes                        9.326e-01  3.880e-01   2.404
## StreamingMoviesYes                    9.986e-01  3.906e-01   2.557
## ContractOne year                     -7.273e-01  1.288e-01  -5.645
## ContractTwo year                     -1.341e+00  2.084e-01  -6.434
## PaperlessBillingYes                   3.138e-01  8.775e-02   3.576
## PaymentMethodCredit card (automatic) -1.043e-01  1.360e-01  -0.767
## PaymentMethodElectronic check         3.037e-01  1.119e-01   2.713
## PaymentMethodMailed check             1.935e-02  1.360e-01   0.142
## MonthlyCharges                       -7.633e-02  3.776e-02  -2.021
## TotalCharges                          3.890e-04  8.404e-05   4.628
##                                      Pr(>|z|)    
## (Intercept)                          0.034213 *  
## genderMale                           0.610245    
## SeniorCitizenYes                     0.001654 ** 
## PartnerYes                           0.960608    
## DependentsYes                        0.345944    
## tenure                                < 2e-16 ***
## PhoneServiceYes                      0.212551    
## MultipleLinesYes                     0.002156 ** 
## InternetServiceFiber optic           0.008802 ** 
## InternetServiceNo                    0.004364 ** 
## OnlineSecurityYes                    0.908790    
## OnlineBackupYes                      0.202642    
## DeviceProtectionYes                  0.148627    
## TechSupportYes                       0.631601    
## StreamingTVYes                       0.016227 *  
## StreamingMoviesYes                   0.010562 *  
## ContractOne year                     1.65e-08 ***
## ContractTwo year                     1.25e-10 ***
## PaperlessBillingYes                  0.000349 ***
## PaymentMethodCredit card (automatic) 0.442884    
## PaymentMethodElectronic check        0.006670 ** 
## PaymentMethodMailed check            0.886877    
## MonthlyCharges                       0.043250 *  
## TotalCharges                         3.69e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5702.8  on 4923  degrees of freedom
## Residual deviance: 4119.0  on 4900  degrees of freedom
## AIC: 4167
## 
## Number of Fisher Scoring iterations: 6

By examining the significance values, we see similar predictor variables of importance. Tenure length, contract status, and total charges have the lowest p-values and can be identified as the best predictors of customer churn.
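
Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret (an optional sketch; output omitted):

#Optional: odds ratios for the logistic regression coefficients
exp(coef(lr_fit))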

Let’s examine the confusion matrix based on our logistic regression model.

lr_prob1 <- predict(lr_fit, dtest, type="response")
lr_pred1 <- ifelse(lr_prob1 > 0.5,"Yes","No")
table(Predicted = lr_pred1, Actual = dtest$Churn)
##          Actual
## Predicted   No  Yes
##       No  1395  228
##       Yes  153  332

Similar to the machine learning algorithms, the model identifies non-churning customers well (1395 correct vs. 153 incorrect), though slightly less accurately than the previous two models. In contrast, it correctly classifies more than half of the churning customers (332 correct vs. 228 incorrect), outperforming the machine learning algorithms on this measure.

The overall prediction accuracy can be obtained in the same way as for the previous models.

lr_prob2 <- predict(lr_fit, dtrain, type="response")
lr_pred2 <- ifelse(lr_prob2 > 0.5,"Yes","No")
lr_tab1 <- table(Predicted = lr_pred2, Actual = dtrain$Churn)
lr_tab2 <- table(Predicted = lr_pred1, Actual = dtest$Churn)
lr_acc <- sum(diag(lr_tab2))/sum(lr_tab2)
lr_acc
## [1] 0.81926

The 81.93% accuracy rate of the logistic regression model slightly outperforms the decision tree and random forest models.
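
For a side-by-side comparison, the three test-set accuracies can be collected into a single table (an optional sketch using only the objects computed above):

#Optional: compare test-set accuracy across the three models
data.frame(Model = c("Decision tree", "Random forest", "Logistic regression"),
           Accuracy = c(tr_acc, rf_acc, lr_acc))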

Now that we have fit several models and identified some important predictor variables for customer churn, let’s explore some further graphs based on our findings.

5. Data visualization based on models

Our modeling efforts pointed to several important churn predictors: contract status, internet service status, tenure length, and total charges. Let’s examine how these variables split by churn status.

We will begin with the contract status variable.

p21 <- ggplot(datc, aes(x = Contract, fill = Churn)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3) +
  labs(title="Churn rate by contract status")

p21

As would be expected, the churn rate of month-to-month contract customers is much higher than that of customers on longer contracts. Customers who are willing to commit to longer contracts are less likely to leave.
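
The within-group churn rates behind this plot can also be tabulated directly (an optional sketch; each row sums to 1):

#Optional: churn rate by contract status
prop.table(table(datc$Contract, datc$Churn), margin = 1)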

What does this look like for the internet service status of the customer?

p22 <- ggplot(datc, aes(x = InternetService, fill = Churn)) +
  geom_bar() +
  geom_text(aes(y = ..count.. -200, 
                label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', 
            position = position_dodge(.1), 
            size = 3) +
  labs(title="Churn rate by internet service status")

p22

It appears that customers with internet service are more likely to churn than those without it. This is most pronounced for customers with fiber optic internet service, who are the most likely to churn.

Let’s examine the churn split for the tenure length distribution.

p23 <- ggplot(datc, aes(x = tenure, fill = Churn)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Months",
       title = "Churn rate by tenure")
p23

As expected, the likelihood of churn decreases with the length of time as a customer. There is a large spike at 1 month, indicating that a sizable portion of customers leave after just one month of service.

How about the distribution for total charges split by churn?

p24 <- ggplot(datc, aes(x = TotalCharges, fill = Churn)) +
  geom_histogram(binwidth = 100) +
  labs(x = "Dollars (binwidth=100)",
       title = "Churn rate by tenure")
p24

Similar to the tenure trend, customers who have spent more with the company tend not to leave. This could just be a reflection of the tenure effect, or it could be due to financial characteristics of the customer: customers who are more financially well off are less likely to leave.
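
A quick check of how closely total charges track tenure lends some support to the first explanation (an optional sketch, not part of the original analysis):

#Optional: correlation between tenure and total charges
cor(datc$tenure, datc$TotalCharges)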

Conclusion

After going through various preparatory steps including data/library loading and preprocessing, we carried out three statistical classification methods common in churn analysis. We identified several important churn predictor variables from these models and compared these models on accuracy measures.

Here is a summary of our findings: