Customer churn occurs when customers or subscribers stop doing business with a company or service, also known as customer attrition. It is also referred as loss of clients or customers. One industry in which churn rates are particularly useful is the telecommunications industry, because most customers have multiple options from which to choose within a geographic location.
Similar concept with predicting employee turnover, we are going to predict customer churn using telecom dataset. We will introduce Logistic Regression, Decision Tree, and Random Forest. But this time, we will do all of the above in R. Lets get started.
This analysis taken from here
The data was downloaded from IBM Sample Data Sets. Each row represents a customer, each column contains that customers attributes:
library(plyr)
library(corrplot)
## corrplot 0.84 loaded
library(ggplot2)
library(gridExtra)
library(ggthemes)
library(caret)
## Loading required package: lattice
library(MASS)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
##
## Attaching package: 'modeltools'
## The following object is masked from 'package:plyr':
##
## empty
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
churn <- read.csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
head(churn)
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup
## 1 No phone service DSL No Yes
## 2 No DSL Yes No
## 3 No DSL Yes Yes
## 4 No phone service DSL Yes No
## 5 No Fiber optic No No
## 6 Yes Fiber optic No No
## DeviceProtection TechSupport StreamingTV StreamingMovies Contract
## 1 No No No No Month-to-month
## 2 Yes No No No One year
## 3 No No No No Month-to-month
## 4 Yes Yes No No One year
## 5 No No No No Month-to-month
## 6 Yes No Yes Yes Month-to-month
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 1 Yes Electronic check 29.85 29.85
## 2 No Mailed check 56.95 1889.50
## 3 Yes Mailed check 53.85 108.15
## 4 No Bank transfer (automatic) 42.30 1840.75
## 5 Yes Electronic check 70.70 151.65
## 6 Yes Electronic check 99.65 820.50
## Churn
## 1 No
## 2 No
## 3 Yes
## 4 No
## 5 Yes
## 6 Yes
dim(churn)
## [1] 7043 21
str(churn)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
customerID
gender (female, male)
SeniorCitizen (Whether the customer is a senior citizen or not (1, 0))
Partner (Whether the customer has a partner or not (Yes, No))
Dependents (Whether the customer has dependents or not (Yes, No))
tenure (Number of months the customer has stayed with the company)
PhoneService (Whether the customer has a phone service or not (Yes, No))
MultipleLines (Whether the customer has multiple lines r not (Yes, No, No phone service)
InternetService (Customers internet service provider (DSL, Fiber optic, No)
OnlineSecurity (Whether the customer has online security or not (Yes, No, No internet service)
OnlineBackup (Whether the customer has online backup or not (Yes, No, No internet service)
DeviceProtection (Whether the customer has device protection or not (Yes, No, No internet service)
TechSupport (Whether the customer has tech support or not (Yes, No, No internet service)
streamingTV (Whether the customer has streaming TV or not (Yes, No, No internet service)
streamingMovies (Whether the customer has streaming movies or not (Yes, No, No internet service)
Contract (The contract term of the customer (Month-to-month, One year, Two year)
PaperlessBilling (Whether the customer has paperless billing or not (Yes, No))
PaymentMethod (The customers payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)))
MonthlyCharges (The amount charged to the customer monthly - numeric)
TotalCharges (The total amount charged to the customer - numeric)
Churn ( Whether the customer churned or not (Yes or No))
The raw data contains 7043 rows (customers) and 21 columns (features). The Churn column is our target.
We use sapply to check the number if missing values in each columns. We found that there are 11 missing values in “TotalCharges” columns. So, lets remove all rows with missing values.
sapply(churn, function(x) sum(is.na(x)))
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
churn <- churn[complete.cases(churn),] ## to remove which has null values
sapply(churn, function(x) sum(is.na(x)))
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 0
## Churn
## 0
dim(churn)
## [1] 7032 21
#unique(churn)
unique(churn['OnlineSecurity'])
## OnlineSecurity
## 1 No
## 2 Yes
## 12 No internet service
cols_recode1 <- c(10:15)
for (i in 1:ncol(churn[, cols_recode1])) {
churn[, cols_recode1][, i] <- as.factor(mapvalues(churn[, cols_recode1][, i], from = c("No internet service"), to = c("No")))
}
churn$MultipleLines <- as.factor(mapvalues(churn$MultipleLines, from = c("No phone service"), to = c("No")))
#str(churn)
min(churn$tenure); max(churn$tenure)
## [1] 1
## [1] 72
group_tenure <- function(tenure){
if(tenure >= 0 & tenure <= 12){
return('0 - 12 Month')
}else if(tenure > 12 & tenure <= 24){
return('12 - 24 Month')
}else if(tenure > 24 & tenure <= 48){
return('24 - 48 Month')
}else if(tenure > 48 & tenure <= 60){
return('48 - 60 Month')
}else if(tenure > 60){
return('> 60 Month')
}
}
churn$tenure_group <- sapply(churn$tenure, group_tenure)
churn$tenure_group <- as.factor(churn$tenure_group)
#str(churn)
churn$SeniorCitizen <- as.factor(mapvalues(churn$SeniorCitizen, from = c("0", "1"), to = c("No", "Yes")))
#str(churn)
churn$customerID <- NULL
churn$tenure <- NULL
#str(churn)
numeric_var <- sapply(churn, is.numeric)
corr_matrix <- cor(churn[, numeric_var])
corrplot(corr_matrix, main = "\n\nCorrelation Plot for Numerical Variables", method = "number")
The Monthly Charges and Total Charges are correlated. So one of them will be removed from the model. We remove Total Charges.
churn$TotalCharges <- NULL
str(churn)
## 'data.frame': 7032 obs. of 19 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
## $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
## $ tenure_group : Factor w/ 5 levels "> 60 Month","0 - 12 Month",..: 2 4 2 4 2 2 3 2 4 1 ...
p1 <- ggplot(churn, aes(x=gender)) + ggtitle("Gender") + xlab("Gender") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
p2 <- ggplot(churn, aes(x=SeniorCitizen)) + ggtitle("Senior Citizen") +
xlab("Senior Citizen") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p3 <- ggplot(churn, aes(x=Partner)) + ggtitle("Partner") + xlab("Partner") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
p4 <- ggplot(churn, aes(x=Dependents)) + ggtitle("Dependents") + xlab("Dependents") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p1, p2, p3, p4, ncol=2)
p5 <- ggplot(churn, aes(x=PhoneService)) + ggtitle("Phone Service") +
xlab("Phone Service") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
p6 <- ggplot(churn, aes(x=MultipleLines)) + ggtitle("Multiple Lines") +
xlab("Multiple Lines") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
p7 <- ggplot(churn, aes(x=InternetService)) + ggtitle("Internet Service") +
xlab("Internet Service") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
p8 <- ggplot(churn, aes(x=OnlineSecurity)) + ggtitle("Online Security") +
xlab("Online Security") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p5, p6, p7, p8, ncol=2)
p9 <- ggplot(churn, aes(x=OnlineBackup)) + ggtitle("Online Backup") +
xlab("Online Backup") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
p10 <- ggplot(churn, aes(x=DeviceProtection)) + ggtitle("Device Protection") +
xlab("Device Protection") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
p11 <- ggplot(churn, aes(x=TechSupport)) + ggtitle("Tech Support") +
xlab("Tech Support") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
p12 <- ggplot(churn, aes(x=StreamingTV)) + ggtitle("Streaming TV") +
xlab("Streaming TV") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p9, p10, p11, p12, ncol=2)
p13 <- ggplot(churn, aes(x=StreamingMovies)) + ggtitle("Streaming Movies") +
xlab("Streaming Movies") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
p14 <- ggplot(churn, aes(x=Contract)) + ggtitle("Contract") +
xlab("Contract") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
p15 <- ggplot(churn, aes(x=PaperlessBilling)) + ggtitle("Paperless Billing") +
xlab("Paperless Billing") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
p16 <- ggplot(churn, aes(x=PaymentMethod)) + ggtitle("Payment Method") +
xlab("Payment Method") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
p17 <- ggplot(churn, aes(x=tenure_group)) + ggtitle("Tenure Group") +
xlab("Tenure Group") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p13, p14, p15, p16, p17, ncol=2)
All of the categorical variables seem to have a reasonably broad distribution, therefore, all of them will be kept for the further analysis.
intrain <- createDataPartition(churn$Churn, p = 0.7, list = FALSE)
set.seed(2018)
training <- churn[intrain, ]
testing <- churn[- intrain, ]
Confirm the splitting is correct:
dim(training); dim(testing)
## [1] 4924 19
## [1] 2108 19
LogModel <- glm(Churn ~ ., family = binomial(link = "logit"), data = training)
print(summary(LogModel))
##
## Call:
## glm(formula = Churn ~ ., family = binomial(link = "logit"), data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9271 -0.6723 -0.2885 0.6611 3.1462
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.57938 1.00312 -1.574 0.115378
## genderMale 0.03504 0.07762 0.451 0.651644
## SeniorCitizenYes 0.14552 0.10094 1.442 0.149386
## PartnerYes -0.05034 0.09223 -0.546 0.585220
## DependentsYes -0.08635 0.10632 -0.812 0.416701
## PhoneServiceYes -0.17060 0.78773 -0.217 0.828538
## MultipleLinesYes 0.24070 0.21404 1.125 0.260772
## InternetServiceFiber optic 1.11870 0.96816 1.155 0.247887
## InternetServiceNo -1.20721 0.97904 -1.233 0.217554
## OnlineSecurityYes -0.37231 0.21674 -1.718 0.085849
## OnlineBackupYes -0.12490 0.21108 -0.592 0.554038
## DeviceProtectionYes -0.07139 0.21130 -0.338 0.735455
## TechSupportYes -0.40162 0.21861 -1.837 0.066187
## StreamingTVYes 0.40570 0.39499 1.027 0.304372
## StreamingMoviesYes 0.50048 0.39549 1.265 0.205705
## ContractOne year -0.73808 0.12799 -5.767 8.09e-09
## ContractTwo year -1.68364 0.22107 -7.616 2.62e-14
## PaperlessBillingYes 0.30753 0.08942 3.439 0.000583
## PaymentMethodCredit card (automatic) -0.11405 0.13727 -0.831 0.406063
## PaymentMethodElectronic check 0.29425 0.11237 2.619 0.008828
## PaymentMethodMailed check 0.09455 0.13500 0.700 0.483702
## MonthlyCharges -0.01084 0.03849 -0.282 0.778231
## tenure_group0 - 12 Month 1.63108 0.20623 7.909 2.59e-15
## tenure_group12 - 24 Month 0.79792 0.20232 3.944 8.02e-05
## tenure_group24 - 48 Month 0.31361 0.18637 1.683 0.092429
## tenure_group48 - 60 Month 0.11587 0.20198 0.574 0.566211
##
## (Intercept)
## genderMale
## SeniorCitizenYes
## PartnerYes
## DependentsYes
## PhoneServiceYes
## MultipleLinesYes
## InternetServiceFiber optic
## InternetServiceNo
## OnlineSecurityYes .
## OnlineBackupYes
## DeviceProtectionYes
## TechSupportYes .
## StreamingTVYes
## StreamingMoviesYes
## ContractOne year ***
## ContractTwo year ***
## PaperlessBillingYes ***
## PaymentMethodCredit card (automatic)
## PaymentMethodElectronic check **
## PaymentMethodMailed check
## MonthlyCharges
## tenure_group0 - 12 Month ***
## tenure_group12 - 24 Month ***
## tenure_group24 - 48 Month .
## tenure_group48 - 60 Month
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5702.8 on 4923 degrees of freedom
## Residual deviance: 4088.6 on 4898 degrees of freedom
## AIC: 4140.6
##
## Number of Fisher Scoring iterations: 6
The top three most-relevant features include Contract, tenure_group and PaperlessBilling.
anova(LogModel, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Churn
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 4923 5702.8
## gender 1 0.00 4922 5702.8 0.9729897
## SeniorCitizen 1 93.25 4921 5609.5 < 2.2e-16 ***
## Partner 1 128.77 4920 5480.7 < 2.2e-16 ***
## Dependents 1 25.45 4919 5455.3 4.543e-07 ***
## PhoneService 1 1.24 4918 5454.1 0.2646866
## MultipleLines 1 2.95 4917 5451.1 0.0857753 .
## InternetService 2 447.57 4915 5003.5 < 2.2e-16 ***
## OnlineSecurity 1 185.40 4914 4818.1 < 2.2e-16 ***
## OnlineBackup 1 86.41 4913 4731.7 < 2.2e-16 ***
## DeviceProtection 1 59.18 4912 4672.5 1.438e-14 ***
## TechSupport 1 88.99 4911 4583.5 < 2.2e-16 ***
## StreamingTV 1 2.54 4910 4581.0 0.1109645
## StreamingMovies 1 2.73 4909 4578.3 0.0987267 .
## Contract 2 287.56 4907 4290.7 < 2.2e-16 ***
## PaperlessBilling 1 10.97 4906 4279.8 0.0009269 ***
## PaymentMethod 3 32.01 4903 4247.7 5.197e-07 ***
## MonthlyCharges 1 0.34 4902 4247.4 0.5595692
## tenure_group 4 158.84 4898 4088.6 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analyzing the deviance table we can see the drop in deviance when adding each variable one at a time. Adding InternetService, Contract and tenure_group significantly reduces the residual deviance. The other variables such as PaymentMethod and Dependents seem to improve the model less even though they all have low p-values.
Assessing the predictive ability of the Logistic Regression model
testing$Churn <- as.character(testing$Churn)
testing$Churn[testing$Churn == "No"] <- "0"
testing$Churn[testing$Churn == "Yes"] <- "1"
fitted_results <- predict(LogModel, newdata = testing, type = "response")
fitted_results <- ifelse(fitted_results > 0.5, 1, 0)
misClasificError <- mean(fitted_results != testing$Churn)
print(paste('Logistic Regression Accuracy', 1- misClasificError))
## [1] "Logistic Regression Accuracy 0.80123339658444"
print("Confusion Matrix for Logistic Regression");
## [1] "Confusion Matrix for Logistic Regression"
table(testing$Churn, fitted_results > 0.5)
##
## FALSE TRUE
## 0 1399 149
## 1 270 290
One of the interesting performance measurements in logistic regression is Odds Ratio.Basically, Odds ratio is what the odds of an event is happening.
library(MASS)
exp(cbind(OR = coef(LogModel), confint(LogModel)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.2061027 0.02879826 1.4709260
## genderMale 1.0356662 0.88953429 1.2059624
## SeniorCitizenYes 1.1566464 0.94872251 1.4093716
## PartnerYes 0.9509072 0.79367731 1.1394696
## DependentsYes 0.9172761 0.74424867 1.1292225
## PhoneServiceYes 0.8431553 0.18002496 3.9515909
## MultipleLinesYes 1.2721406 0.83639059 1.9359887
## InternetServiceFiber optic 3.0608839 0.45957766 20.4670984
## InternetServiceNo 0.2990299 0.04381827 2.0364549
## OnlineSecurityYes 0.6891435 0.45035040 1.0535297
## OnlineBackupYes 0.8825834 0.58340611 1.3348317
## DeviceProtectionYes 0.9310966 0.61522874 1.4088296
## TechSupportYes 0.6692364 0.43565985 1.0266487
## StreamingTVYes 1.5003467 0.69212568 3.2570409
## StreamingMoviesYes 1.6495113 0.76029678 3.5848594
## ContractOne year 0.4780309 0.37091140 0.6127889
## ContractTwo year 0.1856963 0.11872472 0.2829360
## PaperlessBillingYes 1.3600629 1.14176107 1.6212233
## PaymentMethodCredit card (automatic) 0.8922170 0.68134019 1.1672654
## PaymentMethodElectronic check 1.3421242 1.07757170 1.6742625
## PaymentMethodMailed check 1.0991613 0.84404234 1.4330800
## MonthlyCharges 0.9892197 0.91727952 1.0666996
## tenure_group0 - 12 Month 5.1093838 3.42329065 7.6873006
## tenure_group12 - 24 Month 2.2209275 1.49821292 3.3134161
## tenure_group24 - 48 Month 1.3683604 0.95259556 1.9793333
## tenure_group48 - 60 Month 1.1228453 0.75574202 1.6697610
For illustration purpose, we are going to use only three variables for plotting Decision Trees, they are ‘Contract’, ‘tenure_group’ and ‘PaperlessBilling’.
tree <- ctree(Churn ~ Contract+tenure_group+PaperlessBilling, training)
plot(tree)
Out of three variables we use, Contract is the most important variable to predict customer churn or not churn.
If a customer in a one-year or two-year contract, no matter he (she) has PapelessBilling or not, he (she) is less likely to churn.
On the other hand, if a customer is in a month-to-month contract, and in the tenure group of 0 - 12 month, and using PaperlessBilling, then this customer is more likely to churn.
We are using all the variables to product confusion matrix table and make predictions.
pred_tree <- predict(tree, testing)
print("Confusion Matrix for Decision Tree"); table(Predicted = pred_tree, Actual = testing$Churn)
## [1] "Confusion Matrix for Decision Tree"
## Actual
## Predicted 0 1
## No 1406 336
## Yes 142 224
p1 <- predict(tree, training)
tab1 <- table(Predicted = p1, Actual = training$Churn)
tab2 <- table(Predicted = pred_tree, Actual = testing$Churn)
print(paste('Decision Tree Accuracy',sum(diag(tab2))/sum(tab2)))
## [1] "Decision Tree Accuracy 0.773244781783681"
The accuracy for Decision Tree has hardly improved. Lets see if we can do better using Random Forest.
rfModel <- randomForest(Churn ~., data = training)
print(rfModel)
##
## Call:
## randomForest(formula = Churn ~ ., data = training)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 21.24%
## Confusion matrix:
## No Yes class.error
## No 3237 378 0.1045643
## Yes 668 641 0.5103132
The error rate is relatively low when predicting ‘No’, and the error rate is much higher when predicting ‘Yes’.
pred_rf <- predict(rfModel, testing)
#caret::confusionMatrix(pred_rf, testing$Churn)
table(Predicted = pred_rf, Actual = testing$Churn)
## Actual
## Predicted 0 1
## No 1393 278
## Yes 155 282
plot(rfModel)
We use this plot to help us determine the number of trees. As the number of trees increases, the OOB error rate decreases, and then becomes almost constant. We are not able to decrease the OOB error rate after about 100 to 200 trees.
t <- tuneRF(training[, -18], training[, 18], stepFactor = 0.5, plot = TRUE,
ntreeTry = 200, trace = TRUE, improve = 0.05)
## mtry = 4 OOB error = 21.77%
## Searching left ...
## mtry = 8 OOB error = 23.01%
## -0.05690299 0.05
## Searching right ...
## mtry = 2 OOB error = 20.9%
## 0.04011194 0.05
We use this plot to give us some ideas on the number of mtry to choose. OOB error rate is at the lowest when mtry is 2. Therefore, we choose mtry=2.
rfModel_new <- randomForest(Churn ~., data = training, ntree = 200,
mtry = 2, importance = TRUE, proximity = TRUE)
print(rfModel_new)
##
## Call:
## randomForest(formula = Churn ~ ., data = training, ntree = 200, mtry = 2, importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.27%
## Confusion matrix:
## No Yes class.error
## No 3333 282 0.0780083
## Yes 716 593 0.5469824
OOB error rate decreased to 20.41% from 20.61% on Figure 14.
pred_rf_new <- predict(rfModel_new, testing)
#caret::confusionMatrix(pred_rf_new, testing$Churn)
table(Predicted = pred_rf_new, Actual = testing$Churn)
## Actual
## Predicted 0 1
## No 1426 305
## Yes 122 255
Both accuracy and sensitivity are improved, compare with Figure 15.
varImpPlot(rfModel_new, sort=T, n.var = 10, main = 'Top 10 Feature Importance')
From the above example, we can see that Logistic Regression, Decision Tree and Random Forest can be used for customer churn analysis for this particular dataset equally fine.
Throughout the analysis, I have learned several important things:
*There does not seem to be a relationship between gender and churn.
Source code that created this post can be found here. I would be pleased to receive feedback or questions on any of the above.