Predicting Customer Churn

Background Information on the Dataset

Customer churn is the loss of existing customers. Many businesses treat churn as a key business metric and try to predict it, because the cost of acquiring a new customer is much higher than the cost of retaining an existing one.

We obtained a dataset from a telecommunications company (downloaded in October 2018 from https://www.ibm.com/communities/analytics/watson-analytics-blog/predictive-insights-in-the-telco-customer-churn-data-set/), which comprised demographic and account information on 7,032 customers. A subset of the variables is included in the dataset below.

Dataset: telco-churn.csv

Here is a detailed description of the variables:

  • gender: the gender of the customer (Female or Male).

  • SeniorCitizen: 1 if the customer is a senior citizen, 0 otherwise

  • Partner: whether the customer has a partner

  • Dependents: whether the customer has dependents

  • tenure: how long the customer has been with the company (in months)

  • PhoneService: whether the customer has phone service

  • InternetService: what type of internet service the customer has (DSL, Fiber optic, or No)

  • Contract: what type of contract the customer has (Month-to-month, One year, Two year)

  • PaperlessBilling: whether the customer has set up paperless billing

  • PaymentMethod: how the customer makes payments (Bank transfer (automatic), Credit card (automatic), Electronic check, Mailed check)

  • MonthlyCharges: how much money the customer is charged a month

  • TotalCharges: how much the customer has been charged over their tenure

  • Churn: 1 if the customer has left the business, 0 otherwise

In this problem, we will use various classification methods to try to predict customer churn.

Exploratory Data Analysis

Use the read.csv function to load the contents of telco-churn.csv into a data frame called customers.

# Read the dataset
customers = read.csv("telco-churn.csv")
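
To confirm that the loaded data frame matches the variable description above, its structure can be inspected (an optional check; output not shown):

# Inspect the variables and number of observations
str(customers)
nrow(customers)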

How many customers churned?

# Calculate the number of customers that have churned
table(customers$Churn)
## 
##    0    1 
## 5163 1869

1869 customers have churned.

What is the most common type of internet service in the dataset?

# Tabulate the number of customers for each type of internet service
table(customers$InternetService)
## 
##         DSL Fiber optic          No 
##        2416        3096        1520

Fiber optic is the most common type of internet service in the dataset.

What is the mean monthly charges amongst customers with month-to-month contracts?

# Compute the mean monthly charges for month-to-month contracts versus all other contracts
tapply(customers$MonthlyCharges, customers$Contract =="Month-to-month", mean)
##    FALSE     TRUE 
## 62.83397 66.39849

The mean of MonthlyCharges amongst customers with month-to-month contracts (the TRUE group) is 66.39849.

Simple Logistic Regression

Set your random seed to 1 and create a training and test split using the sample.split() function in the caTools library, with 70% of the observations in the training set and 30% in the testing set.

set.seed(1)
library(caTools)
split = sample.split(customers$Churn, SplitRatio = 0.7)
train = subset(customers, split == TRUE)
test = subset(customers, split == FALSE)

Why do we use the sample.split() function?

It ensures that the proportion of churners (the dependent variable) is approximately the same in the training and testing sets.
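
As a quick sanity check (optional, not part of the original question), the churn rate in each split can be compared:

# Churn proportions in the training and testing sets should be nearly identical
prop.table(table(train$Churn))
prop.table(table(test$Churn))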

What is the (test) accuracy of this baseline model?

# The baseline model predicts the most frequent outcome (no churn) for every customer
z = table(test$Churn)
# Accuracy = proportion of test-set customers who did not churn
z[1]/sum(z)
##         0 
## 0.7341232

Accuracy = 0.7341232.

What is the (test) true positive rate of this baseline model?

# The baseline model never predicts churn, so it has no true positives
# True Positive Rate = TP / (TP + FN) = 0 / (0 + 561) = 0

True Positive Rate = 0.

What is the (test) false positive rate of this baseline model?

# The baseline model never predicts churn, so it also has no false positives
# False Positive Rate = FP / (FP + TN) = 0 / (0 + 1549) = 0

False Positive Rate = 0.
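
To make the baseline's confusion matrix explicit, one can build an all-zero prediction vector (baselinePred is a name introduced here purely for illustration):

# The baseline predicts 0 (no churn) for every customer in the test set
baselinePred = rep(0, nrow(test))
table(test$Churn, baselinePred)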

Train a logistic regression model using tenure as the independent variable. What is the coefficient of tenure?

# Train a logistic regression
ChurnLog = glm(Churn ~ tenure,  data=train, family=binomial)
summary(ChurnLog)
## 
## Call:
## glm(formula = Churn ~ tenure, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1858  -0.8440  -0.4689   1.1690   2.3850  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.059313   0.050868   1.166    0.244    
## tenure      -0.039493   0.001685 -23.435   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5699.5  on 4921  degrees of freedom
## Residual deviance: 5005.1  on 4920  degrees of freedom
## AIC: 5009.1
## 
## Number of Fisher Scoring iterations: 4

Tenure = -0.039493
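
To interpret this coefficient on the odds scale (an optional check, not part of the original answer), exponentiate it: each additional month of tenure multiplies the odds of churning by exp(-0.039493) ≈ 0.96, roughly a 4% decrease per month.

# Odds multiplier per additional month of tenure
exp(coef(ChurnLog)["tenure"])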

Using your logistic regression model, obtain predictions on the test set. Then, using a probability threshold of 0.5, create a confusion matrix for the test set. What is the accuracy of your logistic regression model?

# Make predictions on the test set
predictTest = predict(ChurnLog, newdata = test, type="response")
# Confusion matrix at a threshold of 0.5
table(test$Churn, predictTest > 0.5)
##    
##     FALSE TRUE
##   0  1478   71
##   1   442  119
z = table(test$Churn, predictTest > 0.5)
# Compute Accuracy
sum(diag(z))/sum(z)
## [1] 0.756872

Accuracy = 0.756872

What is the true positive rate of your logistic regression model?

# True Positive Rate: z[4] = true positives, z[2] = false negatives
z[4]/(z[4]+z[2])
## [1] 0.2121212

True Positive Rate = 0.2121212

What is the false positive rate of your logistic regression model?

# False Positive Rate: z[3] = false positives, z[1] = true negatives
z[3]/(z[3]+z[1])
## [1] 0.04583602

False Positive Rate = 0.04583602
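
To build intuition for the next question, the confusion matrix can be recomputed at a lower cutoff (an illustrative sketch; the 0.3 threshold is arbitrary and the output is not shown):

# Confusion matrix at a lower threshold of 0.3
table(test$Churn, predictTest > 0.3)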

Suppose we wanted to lower the prediction threshold (currently 0.5). Which metrics would be guaranteed to either improve or stay the same?

Lowering the threshold means the model predicts churn for more customers, so the number of true positives cannot decrease: the true positive rate is guaranteed to improve or stay the same. (The false positive rate will also rise or stay the same, which is not an improvement, and the accuracy could move in either direction.)

Adding more variables

Train a logistic regression model now using all of the variables in the training set.

# Train a logistic regression
ChurnLog2 = glm(Churn ~ .,  data=train, family=binomial)

Which of the following variables are significant at a level of 0.05 or less?

# Summary of logistic regression
summary(ChurnLog2)
## 
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7135  -0.7003  -0.2996   0.7876   3.4895  
## 
## Coefficients:
##                                         Estimate  Std. Error z value Pr(>|z|)    
## (Intercept)                           0.01103107  0.26349469   0.042 0.966607    
## genderMale                            0.00428618  0.07672852   0.056 0.955452    
## SeniorCitizen                         0.17213188  0.09934357   1.733 0.083150 .  
## PartnerYes                            0.09758946  0.09189902   1.062 0.288272    
## DependentsYes                        -0.13859303  0.10535359  -1.316 0.188341    
## tenure                               -0.06069722  0.00737151  -8.234  < 2e-16 ***
## PhoneServiceYes                      -0.76862871  0.17157053  -4.480 7.47e-06 ***
## InternetServiceFiber optic            0.97852962  0.15912182   6.150 7.77e-10 ***
## InternetServiceNo                    -0.40565551  0.22467661  -1.806 0.070995 .  
## ContractOne year                     -0.76591409  0.12663458  -6.048 1.46e-09 ***
## ContractTwo year                     -1.60480345  0.21040218  -7.627 2.40e-14 ***
## PaperlessBillingYes                   0.33899080  0.08796066   3.854 0.000116 ***
## PaymentMethodCredit card (automatic) -0.21441973  0.13496542  -1.589 0.112128    
## PaymentMethodElectronic check         0.29739693  0.11098050   2.680 0.007368 ** 
## PaymentMethodMailed check            -0.13248020  0.13485137  -0.982 0.325895    
## MonthlyCharges                        0.00203949  0.00480544   0.424 0.671265    
## TotalCharges                          0.00032017  0.00008358   3.831 0.000128 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5699.5  on 4921  degrees of freedom
## Residual deviance: 4139.3  on 4905  degrees of freedom
## AIC: 4173.3
## 
## Number of Fisher Scoring iterations: 6

From the summary, tenure, PhoneService, InternetService (Fiber optic), Contract (One year and Two year), PaperlessBilling, PaymentMethod (Electronic check), and TotalCharges are significant at the 0.05 level.
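
The significant coefficients can also be listed programmatically (a small sketch that relies on the "Pr(>|z|)" column of the coefficient table; coefs is a name introduced here):

# Names of the coefficients with p-values below 0.05
coefs = summary(ChurnLog2)$coefficients
rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]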

How would you interpret the coefficient of SeniorCitizen in the model?

Holding the other variables fixed, the odds of churning for a senior citizen are about 18.8% higher than for a customer who is not a senior citizen (exp(0.172) ≈ 1.188).
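
This odds ratio can be computed directly from the fitted model (a minimal check, not part of the original answer):

# Exponentiating the SeniorCitizen coefficient gives the odds ratio (about 1.188)
exp(coef(ChurnLog2)["SeniorCitizen"])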

Using your new logistic regression model, obtain predictions on the test set. Then, using a probability threshold of 0.5, create a confusion matrix for the test set. What is the accuracy of your logistic regression model?

# Make predictions on the test set
predictTest = predict(ChurnLog2, newdata = test, type="response")
# Confusion matrix at a threshold of 0.5
table(test$Churn, predictTest > 0.5)
##    
##     FALSE TRUE
##   0  1398  151
##   1   251  310
z = table(test$Churn, predictTest > 0.5)
# Compute Accuracy
sum(diag(z))/sum(z)
## [1] 0.8094787

Accuracy = 0.8094787

AUC

# Calculate AUC
library(ROCR)

pred = prediction(predictTest, test$Churn)

as.numeric(performance(pred, "auc")@y.values)
## [1] 0.8435803

AUC = 0.8435803
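
As an optional visual complement (not required by the question), the same ROCR prediction object can be used to plot the ROC curve whose area was just computed (perfROC is a name introduced here):

# Plot the ROC curve (true positive rate vs. false positive rate)
perfROC = performance(pred, "tpr", "fpr")
plot(perfROC, colorize = TRUE)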

CART

Set the random seed to 2.

Then use the caret package and the train function to perform 10-fold cross validation with the training data set to select the best cp value for a CART model that predicts the dependent variable Churn using all of the possible independent variables. Select the cp value from a grid consisting of the 50 values 0.001, 0.002, …, 0.05.

Remember to convert the Churn column to a factor variable. If you have called your training set trainset, use the following code:

trainset$Churn = as.factor(trainset$Churn)

Important Note: The train() function in caret does not handle factor variables well when they are used in a formula via the method that was shown in the recitation (e.g. Churn ~ .). Because there are many factor variables in this dataset, please use the following workaround (assuming you have called your training set trainset) to cross-validate properly:

cv = train(y = trainset$Churn, x = subset(trainset, select=-c(Churn)), method = "rpart", trControl = …, tuneGrid = …)

where the trControl and tuneGrid arguments can be handled as you have done throughout the course (recall that we are performing 10-fold cross-validation on cp values 0.001, 0.002, …, 0.05).

# Convert to a factor variable
train$Churn = as.factor(train$Churn)
# Set random seed
set.seed(2)
# Cross-Validation
library(caret)
library(e1071)

# Define cross-validation experiment
numFolds = trainControl( method = "cv", number = 10 )
cpGrid = expand.grid( .cp = seq(0.001,0.05,0.001)) 

# Perform the cross validation
cv = train(y = train$Churn, x = subset(train, select=-c(Churn)), method = "rpart", trControl = numFolds, tuneGrid = cpGrid)
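
# Optional sanity check (an addition to this write-up, not in the original
# solution): inspect the cp value selected by the cross-validation; it should
# correspond to the cp = 0.005 used to build the final tree below
cv$bestTune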

# Build the final CART model using the cp value selected by cross-validation
library(rpart)
ChurnTreeCV = rpart(Churn ~ ., data = train, method="class", cp = 0.005)

Which of the following variables appear in the tree?

# Plot the CART tree (prp is provided by the rpart.plot package)
library(rpart.plot)
prp(ChurnTreeCV)

Contract and tenure appear in the tree.
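
Because the plot itself is not reproduced here, the splits can also be listed in text form (an optional sketch):

# Print a text representation of the fitted tree's splits
print(ChurnTreeCV)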

What is the (test) accuracy of your CART model?

# Make predictions
PredictCV = predict(ChurnTreeCV, newdata = test, type = "class")
z = table(test$Churn, PredictCV)
# Compute Accuracy
sum(diag(z))/sum(z)
## [1] 0.7905213

Accuracy = 0.7905213

What is the (test) true positive rate of your CART model?

# True Positive Rate
z[4]/(z[4]+z[2])
## [1] 0.5846702

True Positive Rate = 0.5846702

What is the (test) false positive rate of your CART model?

# False Positive Rate
z[3]/(z[3]+z[1])
## [1] 0.1349258

False Positive Rate = 0.1349258

What does the CART model predict for a customer with a one-year contract and a tenure of 12?

From the CART tree, the model predicts that the customer does not churn (0).
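
This prediction can be checked against actual observations (a small sketch; it assumes the test set contains at least one customer with a one-year contract and a tenure of 12 months):

# CART predictions for test customers with a one-year contract and tenure of 12
predict(ChurnTreeCV,
        newdata = subset(test, Contract == "One year" & tenure == 12),
        type = "class")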