Introduction

Customer churn, also known as customer attrition, occurs when customers or subscribers stop doing business with a company or service. One industry in which churn rates are particularly useful is telecommunications, because most customers have multiple providers to choose from within a geographic location.

Similar in concept to predicting employee turnover, we are going to predict customer churn using a telecom dataset. We will use Logistic Regression, Decision Tree, and Random Forest, and this time we will do all of it in R. Let's get started.

This analysis is taken from here

Data Preprocessing

The data was downloaded from IBM Sample Data Sets. Each row represents a customer, and each column contains one of that customer's attributes:

library(plyr)
library(corrplot)
## corrplot 0.84 loaded
library(ggplot2)
library(gridExtra)
library(ggthemes)
library(caret)
## Loading required package: lattice
library(MASS)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## 
## Attaching package: 'modeltools'
## The following object is masked from 'package:plyr':
## 
##     empty
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
churn <- read.csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
head(churn)
##   customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female             0     Yes         No      1           No
## 2 5575-GNVDE   Male             0      No         No     34          Yes
## 3 3668-QPYBK   Male             0      No         No      2          Yes
## 4 7795-CFOCW   Male             0      No         No     45           No
## 5 9237-HQITU Female             0      No         No      2          Yes
## 6 9305-CDSKC Female             0      No         No      8          Yes
##      MultipleLines InternetService OnlineSecurity OnlineBackup
## 1 No phone service             DSL             No          Yes
## 2               No             DSL            Yes           No
## 3               No             DSL            Yes          Yes
## 4 No phone service             DSL            Yes           No
## 5               No     Fiber optic             No           No
## 6              Yes     Fiber optic             No           No
##   DeviceProtection TechSupport StreamingTV StreamingMovies       Contract
## 1               No          No          No              No Month-to-month
## 2              Yes          No          No              No       One year
## 3               No          No          No              No Month-to-month
## 4              Yes         Yes          No              No       One year
## 5               No          No          No              No Month-to-month
## 6              Yes          No         Yes             Yes Month-to-month
##   PaperlessBilling             PaymentMethod MonthlyCharges TotalCharges
## 1              Yes          Electronic check          29.85        29.85
## 2               No              Mailed check          56.95      1889.50
## 3              Yes              Mailed check          53.85       108.15
## 4               No Bank transfer (automatic)          42.30      1840.75
## 5              Yes          Electronic check          70.70       151.65
## 6              Yes          Electronic check          99.65       820.50
##   Churn
## 1    No
## 2    No
## 3   Yes
## 4    No
## 5   Yes
## 6   Yes
dim(churn)
## [1] 7043   21
str(churn)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

The raw data contains 7043 rows (customers) and 21 columns (features). The Churn column is our target.

We use sapply to check the number of missing values in each column. There are 11 missing values in the "TotalCharges" column, so let's remove all rows with missing values.

sapply(churn, function(x) sum(is.na(x)))
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0
churn <- churn[complete.cases(churn), ]  # remove rows with missing values
sapply(churn, function(x) sum(is.na(x)))
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0                0 
##            Churn 
##                0
dim(churn)
## [1] 7032   21
#unique(churn)
unique(churn['OnlineSecurity'])
##         OnlineSecurity
## 1                   No
## 2                  Yes
## 12 No internet service
Looking at the variables, we can see that we have some wrangling to do.
  1. Change ‘No internet service’ to ‘No’ for six columns: ‘OnlineSecurity’, ‘OnlineBackup’, ‘DeviceProtection’, ‘TechSupport’, ‘StreamingTV’, ‘StreamingMovies’.
cols_recode1 <- c(10:15)
for (i in 1:ncol(churn[, cols_recode1])) {
      churn[, cols_recode1][, i] <- as.factor(mapvalues(churn[, cols_recode1][, i], from = c("No internet service"), to = c("No")))
}
  2. Change ‘No phone service’ to ‘No’ in the ‘MultipleLines’ column.
churn$MultipleLines <- as.factor(mapvalues(churn$MultipleLines, from = c("No phone service"), to = c("No")))
#str(churn)
  3. Since the minimum tenure is 1 month and the maximum is 72 months, we can group customers into five tenure groups: ‘0 - 12 Month’, ‘12 - 24 Month’, ‘24 - 48 Month’, ‘48 - 60 Month’, ‘> 60 Month’. (A vectorized alternative using cut() is sketched at the end of this list.)
min(churn$tenure); max(churn$tenure)
## [1] 1
## [1] 72
group_tenure <- function(tenure){
      if(tenure >= 0 & tenure <= 12){
            return('0 - 12 Month')
      }else if(tenure > 12 & tenure <= 24){
            return('12 - 24 Month')
      }else if(tenure > 24 & tenure <= 48){
            return('24 - 48 Month')
      }else if(tenure > 48 & tenure <= 60){
            return('48 - 60 Month')
      }else if(tenure > 60){
            return('> 60 Month')
      }
}

churn$tenure_group <- sapply(churn$tenure, group_tenure)
churn$tenure_group <- as.factor(churn$tenure_group)
#str(churn)
  4. Change the values in the ‘SeniorCitizen’ column from 0 or 1 to ‘No’ or ‘Yes’.
churn$SeniorCitizen <- as.factor(mapvalues(churn$SeniorCitizen, from = c("0", "1"), to = c("No", "Yes")))
#str(churn)
  5. Remove the columns we do not need for the analysis.
churn$customerID <- NULL
churn$tenure <- NULL
#str(churn)
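
As mentioned in step 3, the same tenure grouping can be done without a helper function by using cut(). This is only an alternative sketch, not part of the original pipeline; it would have to run before the tenure column is dropped in step 5, and note that cut() keeps the labels in the order given, whereas as.factor() above sorts them alphabetically (which changes the reference level in later models).

# Alternative sketch: vectorized tenure grouping with cut()
# (run while churn$tenure still exists, i.e. before step 5 above).
churn$tenure_group <- cut(churn$tenure,
                          breaks = c(0, 12, 24, 48, 60, Inf),
                          labels = c('0 - 12 Month', '12 - 24 Month',
                                     '24 - 48 Month', '48 - 60 Month', '> 60 Month'))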

Exploratory data analysis and feature selection

Correlation between numeric variables

numeric_var <- sapply(churn, is.numeric)
corr_matrix <- cor(churn[, numeric_var])
corrplot(corr_matrix, main = "\n\nCorrelation Plot for Numerical Variables", method = "number")

MonthlyCharges and TotalCharges are correlated, so one of them should be removed from the model. We remove TotalCharges.

churn$TotalCharges <- NULL
str(churn)
## 'data.frame':    7032 obs. of  19 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
##  $ OnlineBackup    : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
##  $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
##  $ TechSupport     : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
##  $ StreamingTV     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
##  $ tenure_group    : Factor w/ 5 levels "> 60 Month","0 - 12 Month",..: 2 4 2 4 2 2 3 2 4 1 ...

Bar plots of categorical variables

p1 <- ggplot(churn, aes(x=gender)) + ggtitle("Gender") + xlab("Gender") +
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
      ylab("Percentage") + coord_flip() + theme_minimal()

p2 <- ggplot(churn, aes(x=SeniorCitizen)) + ggtitle("Senior Citizen") + 
      xlab("Senior Citizen") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()

p3 <- ggplot(churn, aes(x=Partner)) + ggtitle("Partner") + xlab("Partner") + 
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + 
      ylab("Percentage") + coord_flip() + theme_minimal()

p4 <- ggplot(churn, aes(x=Dependents)) + ggtitle("Dependents") + xlab("Dependents") + 
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + 
      ylab("Percentage") + coord_flip() + theme_minimal()

grid.arrange(p1, p2, p3, p4, ncol=2)

p5 <- ggplot(churn, aes(x=PhoneService)) + ggtitle("Phone Service") + 
      xlab("Phone Service") + 
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + 
      ylab("Percentage") + coord_flip() + theme_minimal()

p6 <- ggplot(churn, aes(x=MultipleLines)) + ggtitle("Multiple Lines") + 
      xlab("Multiple Lines") + 
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + 
      ylab("Percentage") + coord_flip() + theme_minimal()

p7 <- ggplot(churn, aes(x=InternetService)) + ggtitle("Internet Service") + 
      xlab("Internet Service") + 
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + 
      ylab("Percentage") + coord_flip() + theme_minimal()

p8 <- ggplot(churn, aes(x=OnlineSecurity)) + ggtitle("Online Security") + 
      xlab("Online Security") + 
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + 
      ylab("Percentage") + coord_flip() + theme_minimal()

grid.arrange(p5, p6, p7, p8, ncol=2)

p9 <- ggplot(churn, aes(x=OnlineBackup)) + ggtitle("Online Backup") + 
      xlab("Online Backup") + 
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + 
      ylab("Percentage") + coord_flip() + theme_minimal()

p10 <- ggplot(churn, aes(x=DeviceProtection)) + ggtitle("Device Protection") + 
      xlab("Device Protection") + 
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + 
      ylab("Percentage") + coord_flip() + theme_minimal()

p11 <- ggplot(churn, aes(x=TechSupport)) + ggtitle("Tech Support") + 
      xlab("Tech Support") + 
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + 
      ylab("Percentage") + coord_flip() + theme_minimal()

p12 <- ggplot(churn, aes(x=StreamingTV)) + ggtitle("Streaming TV") + 
      xlab("Streaming TV") + 
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + 
      ylab("Percentage") + coord_flip() + theme_minimal()

grid.arrange(p9, p10, p11, p12, ncol=2)

p13 <- ggplot(churn, aes(x=StreamingMovies)) + ggtitle("Streaming Movies") + 
      xlab("Streaming Movies") + 
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + 
      ylab("Percentage") + coord_flip() + theme_minimal()

p14 <- ggplot(churn, aes(x=Contract)) + ggtitle("Contract") + 
      xlab("Contract") + 
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + 
      ylab("Percentage") + coord_flip() + theme_minimal()

p15 <- ggplot(churn, aes(x=PaperlessBilling)) + ggtitle("Paperless Billing") + 
      xlab("Paperless Billing") + 
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + 
      ylab("Percentage") + coord_flip() + theme_minimal()

p16 <- ggplot(churn, aes(x=PaymentMethod)) + ggtitle("Payment Method") + 
      xlab("Payment Method") + 
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + 
      ylab("Percentage") + coord_flip() + theme_minimal()

p17 <- ggplot(churn, aes(x=tenure_group)) + ggtitle("Tenure Group") + 
      xlab("Tenure Group") + 
      geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + 
      ylab("Percentage") + coord_flip() + theme_minimal()

grid.arrange(p13, p14, p15, p16, p17, ncol=2)

All of the categorical variables seem to have a reasonably broad distribution, so all of them will be kept for further analysis.
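
As a quick numerical cross-check (a sketch, not from the original post), the class balance of the target and the frequencies of any categorical predictor can also be inspected directly:

# Sketch: proportion tables for the target and one categorical predictor.
prop.table(table(churn$Churn))
prop.table(table(churn$Contract))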

Modeling

Logistic Regression

First, we split the data into training and testing sets. We set the seed before creating the partition so that the split is reproducible:

set.seed(2018)
intrain <- createDataPartition(churn$Churn, p = 0.7, list = FALSE)
training <- churn[intrain, ]
testing <- churn[-intrain, ]

Confirm the splitting is correct:

dim(training); dim(testing)
## [1] 4924   19
## [1] 2108   19

Fitting the Logistic Regression Model:

LogModel <- glm(Churn ~ ., family = binomial(link = "logit"), data = training)
print(summary(LogModel))
## 
## Call:
## glm(formula = Churn ~ ., family = binomial(link = "logit"), data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9271  -0.6723  -0.2885   0.6611   3.1462  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)                          -1.57938    1.00312  -1.574 0.115378
## genderMale                            0.03504    0.07762   0.451 0.651644
## SeniorCitizenYes                      0.14552    0.10094   1.442 0.149386
## PartnerYes                           -0.05034    0.09223  -0.546 0.585220
## DependentsYes                        -0.08635    0.10632  -0.812 0.416701
## PhoneServiceYes                      -0.17060    0.78773  -0.217 0.828538
## MultipleLinesYes                      0.24070    0.21404   1.125 0.260772
## InternetServiceFiber optic            1.11870    0.96816   1.155 0.247887
## InternetServiceNo                    -1.20721    0.97904  -1.233 0.217554
## OnlineSecurityYes                    -0.37231    0.21674  -1.718 0.085849
## OnlineBackupYes                      -0.12490    0.21108  -0.592 0.554038
## DeviceProtectionYes                  -0.07139    0.21130  -0.338 0.735455
## TechSupportYes                       -0.40162    0.21861  -1.837 0.066187
## StreamingTVYes                        0.40570    0.39499   1.027 0.304372
## StreamingMoviesYes                    0.50048    0.39549   1.265 0.205705
## ContractOne year                     -0.73808    0.12799  -5.767 8.09e-09
## ContractTwo year                     -1.68364    0.22107  -7.616 2.62e-14
## PaperlessBillingYes                   0.30753    0.08942   3.439 0.000583
## PaymentMethodCredit card (automatic) -0.11405    0.13727  -0.831 0.406063
## PaymentMethodElectronic check         0.29425    0.11237   2.619 0.008828
## PaymentMethodMailed check             0.09455    0.13500   0.700 0.483702
## MonthlyCharges                       -0.01084    0.03849  -0.282 0.778231
## tenure_group0 - 12 Month              1.63108    0.20623   7.909 2.59e-15
## tenure_group12 - 24 Month             0.79792    0.20232   3.944 8.02e-05
## tenure_group24 - 48 Month             0.31361    0.18637   1.683 0.092429
## tenure_group48 - 60 Month             0.11587    0.20198   0.574 0.566211
##                                         
## (Intercept)                             
## genderMale                              
## SeniorCitizenYes                        
## PartnerYes                              
## DependentsYes                           
## PhoneServiceYes                         
## MultipleLinesYes                        
## InternetServiceFiber optic              
## InternetServiceNo                       
## OnlineSecurityYes                    .  
## OnlineBackupYes                         
## DeviceProtectionYes                     
## TechSupportYes                       .  
## StreamingTVYes                          
## StreamingMoviesYes                      
## ContractOne year                     ***
## ContractTwo year                     ***
## PaperlessBillingYes                  ***
## PaymentMethodCredit card (automatic)    
## PaymentMethodElectronic check        ** 
## PaymentMethodMailed check               
## MonthlyCharges                          
## tenure_group0 - 12 Month             ***
## tenure_group12 - 24 Month            ***
## tenure_group24 - 48 Month            .  
## tenure_group48 - 60 Month               
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5702.8  on 4923  degrees of freedom
## Residual deviance: 4088.6  on 4898  degrees of freedom
## AIC: 4140.6
## 
## Number of Fisher Scoring iterations: 6
Feature Analysis:

The top three most-relevant features include Contract, tenure_group and PaperlessBilling.
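
One way to see this at a glance (a small sketch, not part of the original code) is to sort the fitted coefficients by p-value:

# Sketch: rank the logistic regression terms by p-value.
coefs <- summary(LogModel)$coefficients
head(coefs[order(coefs[, "Pr(>|z|)"]), ], 10)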

anova(LogModel, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Churn
## 
## Terms added sequentially (first to last)
## 
## 
##                  Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                              4923     5702.8              
## gender            1     0.00      4922     5702.8 0.9729897    
## SeniorCitizen     1    93.25      4921     5609.5 < 2.2e-16 ***
## Partner           1   128.77      4920     5480.7 < 2.2e-16 ***
## Dependents        1    25.45      4919     5455.3 4.543e-07 ***
## PhoneService      1     1.24      4918     5454.1 0.2646866    
## MultipleLines     1     2.95      4917     5451.1 0.0857753 .  
## InternetService   2   447.57      4915     5003.5 < 2.2e-16 ***
## OnlineSecurity    1   185.40      4914     4818.1 < 2.2e-16 ***
## OnlineBackup      1    86.41      4913     4731.7 < 2.2e-16 ***
## DeviceProtection  1    59.18      4912     4672.5 1.438e-14 ***
## TechSupport       1    88.99      4911     4583.5 < 2.2e-16 ***
## StreamingTV       1     2.54      4910     4581.0 0.1109645    
## StreamingMovies   1     2.73      4909     4578.3 0.0987267 .  
## Contract          2   287.56      4907     4290.7 < 2.2e-16 ***
## PaperlessBilling  1    10.97      4906     4279.8 0.0009269 ***
## PaymentMethod     3    32.01      4903     4247.7 5.197e-07 ***
## MonthlyCharges    1     0.34      4902     4247.4 0.5595692    
## tenure_group      4   158.84      4898     4088.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analyzing the deviance table, we can see the drop in deviance from adding each variable one at a time. Adding InternetService, Contract and tenure_group reduces the residual deviance the most. Other variables, such as PaymentMethod and Dependents, improve the model less, even though they also have low p-values.
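
The overall drop can also be summarized in a single number; for instance (a sketch, not in the original analysis), McFadden's pseudo R-squared compares the residual deviance of the full model with the null deviance:

# Sketch: McFadden's pseudo R-squared for the fitted logistic regression.
1 - LogModel$deviance / LogModel$null.deviance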

Assessing the predictive ability of the Logistic Regression model

testing$Churn <- as.character(testing$Churn)
testing$Churn[testing$Churn == "No"] <- "0"
testing$Churn[testing$Churn == "Yes"] <- "1"
fitted_results <- predict(LogModel, newdata = testing, type = "response")
fitted_results <- ifelse(fitted_results > 0.5, 1, 0)
misClasificError <- mean(fitted_results != testing$Churn)
print(paste('Logistic Regression Accuracy', 1- misClasificError))
## [1] "Logistic Regression Accuracy 0.80123339658444"
Logistic Regression Confusion Matrix
print("Confusion Matrix for Logistic Regression");
## [1] "Confusion Matrix for Logistic Regression"
table(testing$Churn, fitted_results > 0.5)
##    
##     FALSE TRUE
##   0  1399  149
##   1   270  290
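
For reference, the usual classification metrics can be read off this table. The following is a small sketch (not part of the original post) computing accuracy, sensitivity and specificity with churn ("1") as the positive class:

# Sketch: accuracy, churner sensitivity and specificity from the table above.
cm_log <- table(testing$Churn, fitted_results > 0.5)
c(accuracy    = sum(diag(cm_log)) / sum(cm_log),
  sensitivity = cm_log["1", "TRUE"]  / sum(cm_log["1", ]),
  specificity = cm_log["0", "FALSE"] / sum(cm_log["0", ]))
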
Odds Ratio

One of the interesting measures from a logistic regression is the odds ratio. Basically, an odds ratio tells us how the odds of the event (here, churn) change when a predictor increases by one unit or switches level, holding the other variables fixed. For example, the odds ratio of about 5.1 for tenure_group 0 - 12 Month below means that customers in their first year have roughly five times the odds of churning compared with the reference tenure group (> 60 Month).

library(MASS)
exp(cbind(OR = coef(LogModel), confint(LogModel)))
## Waiting for profiling to be done...
##                                             OR      2.5 %     97.5 %
## (Intercept)                          0.2061027 0.02879826  1.4709260
## genderMale                           1.0356662 0.88953429  1.2059624
## SeniorCitizenYes                     1.1566464 0.94872251  1.4093716
## PartnerYes                           0.9509072 0.79367731  1.1394696
## DependentsYes                        0.9172761 0.74424867  1.1292225
## PhoneServiceYes                      0.8431553 0.18002496  3.9515909
## MultipleLinesYes                     1.2721406 0.83639059  1.9359887
## InternetServiceFiber optic           3.0608839 0.45957766 20.4670984
## InternetServiceNo                    0.2990299 0.04381827  2.0364549
## OnlineSecurityYes                    0.6891435 0.45035040  1.0535297
## OnlineBackupYes                      0.8825834 0.58340611  1.3348317
## DeviceProtectionYes                  0.9310966 0.61522874  1.4088296
## TechSupportYes                       0.6692364 0.43565985  1.0266487
## StreamingTVYes                       1.5003467 0.69212568  3.2570409
## StreamingMoviesYes                   1.6495113 0.76029678  3.5848594
## ContractOne year                     0.4780309 0.37091140  0.6127889
## ContractTwo year                     0.1856963 0.11872472  0.2829360
## PaperlessBillingYes                  1.3600629 1.14176107  1.6212233
## PaymentMethodCredit card (automatic) 0.8922170 0.68134019  1.1672654
## PaymentMethodElectronic check        1.3421242 1.07757170  1.6742625
## PaymentMethodMailed check            1.0991613 0.84404234  1.4330800
## MonthlyCharges                       0.9892197 0.91727952  1.0666996
## tenure_group0 - 12 Month             5.1093838 3.42329065  7.6873006
## tenure_group12 - 24 Month            2.2209275 1.49821292  3.3134161
## tenure_group24 - 48 Month            1.3683604 0.95259556  1.9793333
## tenure_group48 - 60 Month            1.1228453 0.75574202  1.6697610

Decision Tree

Decision Tree visualization

For illustration purposes, we are going to use only three variables to plot the decision tree: ‘Contract’, ‘tenure_group’ and ‘PaperlessBilling’.

tree <- ctree(Churn ~ Contract+tenure_group+PaperlessBilling, training)
plot(tree)

  1. Of the three variables we used, Contract is the most important for predicting whether a customer will churn.

  2. A customer on a one-year or two-year contract is less likely to churn, regardless of whether they use PaperlessBilling.

  3. On the other hand, a customer on a month-to-month contract, in the 0 - 12 month tenure group, and using PaperlessBilling is more likely to churn.

Decision Tree Confusion Matrix

We now use the fitted tree to make predictions on the test set and produce the confusion matrix.

pred_tree <- predict(tree, testing)
print("Confusion Matrix for Decision Tree"); table(Predicted = pred_tree, Actual = testing$Churn)
## [1] "Confusion Matrix for Decision Tree"
##          Actual
## Predicted    0    1
##       No  1406  336
##       Yes  142  224

Decision Tree Accuracy

p1 <- predict(tree, training)                                # predictions on the training set
tab1 <- table(Predicted = p1, Actual = training$Churn)       # training confusion matrix
tab2 <- table(Predicted = pred_tree, Actual = testing$Churn) # test confusion matrix
print(paste('Decision Tree Accuracy', sum(diag(tab2)) / sum(tab2)))
## [1] "Decision Tree Accuracy 0.773244781783681"

The Decision Tree accuracy (about 0.77) is actually slightly lower than the Logistic Regression accuracy (about 0.80). Let's see if we can do better using Random Forest.

Random Forest

Random Forest Initial Model

rfModel <- randomForest(Churn ~., data = training)
print(rfModel)
## 
## Call:
##  randomForest(formula = Churn ~ ., data = training) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 21.24%
## Confusion matrix:
##       No Yes class.error
## No  3237 378   0.1045643
## Yes  668 641   0.5103132

The out-of-bag error rate is relatively low when predicting ‘No’ (about 10%), but much higher when predicting ‘Yes’ (about 51%).

Random Forest Prediction and Confusion Matrix

pred_rf <- predict(rfModel, testing)
#caret::confusionMatrix(pred_rf, testing$Churn)
table(Predicted = pred_rf, Actual = testing$Churn)
##          Actual
## Predicted    0    1
##       No  1393  278
##       Yes  155  282
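
The commented-out caret::confusionMatrix call above would not work as-is, because testing$Churn was recoded to "0"/"1" for the logistic regression step while pred_rf uses "No"/"Yes". A hedged sketch that maps the labels back before calling it:

# Sketch: recode the test labels so the factor levels match the predictions.
actual <- factor(testing$Churn, levels = c("0", "1"), labels = c("No", "Yes"))
caret::confusionMatrix(pred_rf, actual, positive = "Yes")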

Random Forest Error Rate

plot(rfModel)

We use this plot to help determine the number of trees. As the number of trees increases, the OOB error rate decreases and then becomes almost constant; it does not decrease further after roughly 100 to 200 trees.

Tune Random Forest Model

t <- tuneRF(training[, -18], training[, 18], stepFactor = 0.5, plot = TRUE,
            ntreeTry = 200, trace = TRUE, improve = 0.05)
## mtry = 4  OOB error = 21.77% 
## Searching left ...
## mtry = 8     OOB error = 23.01% 
## -0.05690299 0.05 
## Searching right ...
## mtry = 2     OOB error = 20.9% 
## 0.04011194 0.05

We use this output to choose the value of mtry: the OOB error rate is lowest when mtry is 2, so we choose mtry = 2.
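
Since tuneRF's result is stored in t above, the best value can also be picked programmatically (a small sketch rather than part of the original code):

# Sketch: tuneRF returns a matrix of (mtry, OOBError) pairs; take the row with the lowest error.
t[which.min(t[, "OOBError"]), ]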

Fit the Random Forest Model After Tuning

rfModel_new <- randomForest(Churn ~., data = training, ntree = 200,
                            mtry = 2, importance = TRUE, proximity = TRUE)
print(rfModel_new)
## 
## Call:
##  randomForest(formula = Churn ~ ., data = training, ntree = 200,      mtry = 2, importance = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.27%
## Confusion matrix:
##       No Yes class.error
## No  3333 282   0.0780083
## Yes  716 593   0.5469824

The OOB error rate decreased to 20.27% from the 21.24% of the initial model.

Random Forest Predictions and Confusion Matrix After Tuning

pred_rf_new <- predict(rfModel_new, testing)
#caret::confusionMatrix(pred_rf_new, testing$Churn)
table(Predicted = pred_rf_new, Actual = testing$Churn)
##          Actual
## Predicted    0    1
##       No  1426  305
##       Yes  122  255

Accuracy improves slightly compared with the initial random forest model, mainly because fewer non-churners are misclassified (122 versus 155), although slightly more churners are now missed (305 versus 278).
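
A small sketch (not in the original post) to compute accuracy and churner recall directly from the table above, keeping in mind that testing$Churn is still coded "0"/"1":

# Sketch: accuracy and churner sensitivity for the tuned random forest.
cm_rf <- table(Predicted = pred_rf_new, Actual = testing$Churn)
c(accuracy    = sum(diag(cm_rf)) / sum(cm_rf),
  sensitivity = cm_rf["Yes", "1"] / sum(cm_rf[, "1"]))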

Random Forest Feature Importance

varImpPlot(rfModel_new, sort=T, n.var = 10, main = 'Top 10 Feature Importance')
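
The same scores can also be inspected as a table; a short sketch using the importance measures stored in the fitted model (importance = TRUE was set above):

# Sketch: top features ranked by mean decrease in Gini impurity.
imp <- randomForest::importance(rfModel_new)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ], 10)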

Summary

From the above example, we can see that Logistic Regression, Decision Tree and Random Forest all perform comparably well for customer churn analysis on this particular dataset.

Throughout the analysis, I have learned several important things:

* There does not seem to be a relationship between gender and churn.

Source code that created this post can be found here. I would be pleased to receive feedback or questions on any of the above.