Customer churn is the loss of customers. Many businesses use predictions of customer churn as a key business metric because the cost of acquiring new customers is much higher than the cost of retaining existing customers.
We obtained a dataset from a telecommunications company (downloaded in October 2018 from https://www.ibm.com/communities/analytics/watson-analytics-blog/predictive-insights-in-the-telco-customer-churn-data-set/), which comprised demographic and account information on 7,032 customers. A subset of the variables are included in the dataset below.
Dataset: telco-churn.csv
Here is a detailed description of the variables:
gender: the gender of the customer (Female or Male).
SeniorCitizen: 1 if the customer is a senior citizen, 0 otherwise
Partner: whether the customer has a partner
Dependents: whether the customer has dependents tenure: how long the customer has been with the company (in months)
PhoneService: whether the customer has phone service
InternetService: what type of internet service the customer has (DSL, Fiber optic, or No)
Contract: what type of contract the customer has (Month-to-month, One year, Two year)
PaperlessBilling: whether the customer has set up paperless billing
PaymentType: how the customer makes payments (Bank transfer (automatic), Credit card (automatic), Electronic check, Mailed check)
MonthlyCharges: how much money the customer is charged a month
TotalCharges: how much the customer has been charged over their tenure
Churn: 1 if the customer has left the business, 0 otherwise
In this problem, we will use various classification methods to try to predict customer churn.
Use the read.csv function to load the contents of telco-churn.csv into a data frame called customers.
# Read the dataset
customers = read.csv("telco-churn.csv")
# Calculate the number of customers that have churned
table(customers$Churn)
##
## 0 1
## 5163 1869
1869 customers have churned.
# Tabulate the amount of customers for each internet service
table(customers$InternetService)
##
## DSL Fiber optic No
## 2416 3096 1520
Fiber optic is the most common type of internet service in the dataset.
# Tabulate the mean monthly chargers amongst customers with month-to-month contracts
tapply(customers$MonthlyCharges, customers$Contract =="Month-to-month", mean)
## FALSE TRUE
## 62.83397 66.39849
66.39849 is the mean monthly charges amongst customers with month-to-month contracts.
Set your random seed to 1 and create a training and test split using the sample.split() function in the caTools library, with 70% of the observations in the training set and 30% in the testing set.
set.seed(1)
library(caTools)
split = sample.split(customers$Churn, SplitRatio = 0.7)
train = subset(customers, split ==TRUE)
test = subset(customers, split == FALSE)
It balances the dependent variable between the training and testing sets
# Tabulate the accuracy of the baseline model
z = table(test$Churn)
# Compute Accuracy
z[1]/sum(z)
## 0
## 0.7341232
Accuracy = 0.7341232.
# Tabulate true postiive rate of this baseline model
z = table (test$Churn)
# Compute accuracy
z[2]/sum(z)
## 1
## 0.2658768
True Positive Rate = 0.7341232.
# Tabulate false postiive rate of this baseline model
z = table (test$Churn)
# Compute accuracy
z[1]/sum(z)
## 0
## 0.7341232
False Positive Rate = 0.2658768
# Train a logistic regression
ChurnLog = glm(Churn ~ tenure, data=train, family=binomial)
summary(ChurnLog)
##
## Call:
## glm(formula = Churn ~ tenure, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1858 -0.8440 -0.4689 1.1690 2.3850
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.059313 0.050868 1.166 0.244
## tenure -0.039493 0.001685 -23.435 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5699.5 on 4921 degrees of freedom
## Residual deviance: 5005.1 on 4920 degrees of freedom
## AIC: 5009.1
##
## Number of Fisher Scoring iterations: 4
Tenure = -0.039493
# Make predictions
predictTrain = predict(ChurnLog, newdata = test, type="response")
# Confusion matrix
table(test$Churn, predictTrain > 0.5)
##
## FALSE TRUE
## 0 1478 71
## 1 442 119
z = table(test$Churn, predictTrain > 0.5)
# Compute Accuracy
sum(diag(z))/sum(z)
## [1] 0.756872
Accuracy = 0.756872
# True Positive Rate
z[4]/(z[4]+z[2])
## [1] 0.2121212
True Positive Rate = 0.2121212
# False Positive Rate
z[3]/(z[3]+z[1])
## [1] 0.04583602
False Positive Rate = 0.04583602
Train a logistic regression model now using all of the variables in the training set.
# Train a logistic regression
ChurnLog2 = glm(Churn ~ ., data=train, family=binomial)
# Summary of logistic regression
summary(ChurnLog2)
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7135 -0.7003 -0.2996 0.7876 3.4895
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.01103107 0.26349469 0.042 0.966607
## genderMale 0.00428618 0.07672852 0.056 0.955452
## SeniorCitizen 0.17213188 0.09934357 1.733 0.083150 .
## PartnerYes 0.09758946 0.09189902 1.062 0.288272
## DependentsYes -0.13859303 0.10535359 -1.316 0.188341
## tenure -0.06069722 0.00737151 -8.234 < 2e-16 ***
## PhoneServiceYes -0.76862871 0.17157053 -4.480 7.47e-06 ***
## InternetServiceFiber optic 0.97852962 0.15912182 6.150 7.77e-10 ***
## InternetServiceNo -0.40565551 0.22467661 -1.806 0.070995 .
## ContractOne year -0.76591409 0.12663458 -6.048 1.46e-09 ***
## ContractTwo year -1.60480345 0.21040218 -7.627 2.40e-14 ***
## PaperlessBillingYes 0.33899080 0.08796066 3.854 0.000116 ***
## PaymentMethodCredit card (automatic) -0.21441973 0.13496542 -1.589 0.112128
## PaymentMethodElectronic check 0.29739693 0.11098050 2.680 0.007368 **
## PaymentMethodMailed check -0.13248020 0.13485137 -0.982 0.325895
## MonthlyCharges 0.00203949 0.00480544 0.424 0.671265
## TotalCharges 0.00032017 0.00008358 3.831 0.000128 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5699.5 on 4921 degrees of freedom
## Residual deviance: 4139.3 on 4905 degrees of freedom
## AIC: 4173.3
##
## Number of Fisher Scoring iterations: 6
tenure is significant.
When the customer is a senior citizen, the odds of the customer churning are 18.8% higher than if the customer is not.
# Make predictions
predictTrain = predict(ChurnLog2, newdata = test, type="response")
# Confusion matrix
table(test$Churn, predictTrain > 0.5)
##
## FALSE TRUE
## 0 1398 151
## 1 251 310
z = table(test$Churn, predictTrain > 0.5)
# Compute Accuracy
sum(diag(z))/sum(z)
## [1] 0.8094787
Accuracy = 0.8094787
# Calculate AUC
library(ROCR)
pred = prediction(predictTrain, test$Churn)
as.numeric(performance(pred, "auc")@y.values)
## [1] 0.8435803
AUC = 0.8435803
Set the random seed to 2.
Then use the caret package and the train function to perform 10-fold cross validation with the training data set to select the best cp value for a CART model that predicts the dependent variable Churn using all of the possible independent variables. Select the cp value from a grid consisting of the 50 values 0.001, 0.002, …, 0.05.
Remember to convert the Churn column to a factor variable. If you have called your training set trainset, use the following code:
trainset$Churn = as.factor(trainset$Churn) Important Note: The train() function in caret does not handle factor variables well when they are used in a formula via the method that was shown in the recitation (e.g. Churn ~ . ). Because there are many factor variables in this dataset, please use the following workaround (assuming you have called your training set trainset) to cross-validate properly:
cv = train(y = trainset$Churn, x = subset(trainset, select=-c(Churn)), method = “rpart”, trControl = …, tuneGrid = …)
where the trControl and tuneGrid arguments can be handled as you have done throughout the course (recall that we are performing 10-fold cross-validation on cp values 0.001, 0.002, …, 0.05).
# Convert to a factor variable
train$Churn = as.factor(train$Churn)
# Set random seed
set.seed(2)
# Cross-Validation
library(caret)
library(e1071)
# Define cross-validation experiment
numFolds = trainControl( method = "cv", number = 10 )
cpGrid = expand.grid( .cp = seq(0.001,0.05,0.001))
# Perform the cross validation
cv = train(y = train$Churn, x = subset(train, select=-c(Churn)), method = "rpart", trControl = numFolds, tuneGrid = cpGrid)
# Create a new CART model
ChurnTreeCV = rpart(Churn ~ ., data = train, method="class", cp = 0.005)
# CART Tree
prp(ChurnTreeCV)
Contract and tenure are presented in the tree.
# Make predictions
PredictCV = predict(ChurnTreeCV, newdata = test, type = "class")
z = table(test$Churn, PredictCV)
# Compute Accuracy
sum(diag(z))/sum(z)
## [1] 0.7905213
Accuracy = 0.7905213
# True Positive Rate
z[4]/(z[4]+z[2])
## [1] 0.5846702
True Positive Rate = 0.5846702
# False Positive Rate
z[3]/(z[3]+z[1])
## [1] 0.1349258
False Positive Rate = 0.1349258
From the CART tree, the customer doesn’t churn (0)