The telecom market in the US is saturated and customer growth rates are low. They key focus of market players therefore is on retention and churn control. This project explores the churn dataset in the C50 library to identify the key drivers of churn and builds the best predictive model to predict churn. A strategy to reduce churn is presented and the proposal is evaluated against the model. Finally, the model is tuned to maximise the financial benefit of the proposed strategy.
library(ggplot2)
library(gridExtra)
library(dplyr)
library(corrplot)
library(pROC)
library(C50)
library(caret)
library(rpart)
data(churn)
dfChurnTrain <- churnTrain
dfChurnTest <- churnTest
The churn dataset is split into churnTrain (3333 obs.) and churnTest (1667 obs.) of 19 predictor variables and 1 response variable (churn = yes/no). The proportion of churned customers (churn = yes) is close to 14% and is evenly distributed across the 2 sets. churnTrain will be used for data exploration and model building, while churnTest will be used to measure model performance.
exp1 <- ggplot(dfChurnTrain, aes(area_code, fill = churn)) + geom_bar(position = "fill") + labs(x = "Area code", y = "") + theme(legend.position = "none")
exp2 <- ggplot(dfChurnTrain, aes(international_plan, fill = churn)) + geom_bar(position = "fill") + labs(x = "International?", y = "") + theme(legend.position = "none")
exp3 <- ggplot(dfChurnTrain, aes(voice_mail_plan, fill = churn)) + geom_bar(position = "fill") + labs(x = "Voicemail?", y = "") + theme(legend.position = "none")
exp4 <- ggplot(dfChurnTrain, aes(number_customer_service_calls, fill = churn)) + geom_bar(position = "fill") + labs(x = "Customer calls", y = "") + theme(legend.position = "none")
grid.arrange(exp1, exp2, exp3, exp4, ncol = 4, nrow = 1, top = "Churn/Non-churn Proportion")
The figure above shows the proportion of churn across of the categorical predictors (red = churned). The churn rate by area code does not differ from the overall churn ratio while churn ratio for customers not on Voicemail plans appears to be higher, although this needs to be tested for spurious correlation. More importantly, churn ratio on the International plan is 4 times the average and there is a significant spike in churn as customer calls increases beyond 4 customer calls.
The churn ratio varies across states. Any proposal to reduce churn must focus more on states having above average churn. The States with above average churn rates are:
top20 <- ct %>% arrange(desc(ChurnPercent)) %>% filter(ChurnPercent >= 15) %>% select(state)
print(toupper(unlist(top20$state)))
[1] "CALIFORNIA" "NEW JERSEY" "TEXAS" "MARYLAND"
[5] "SOUTH CAROLINA" "MICHIGAN" "MISSISSIPPI" "NEVADA"
[9] "WASHINGTON" "MAINE" "MONTANA" "ARKANSAS"
[13] "KANSAS" "NEW YORK" "MINNESOTA" "PENNSYLVANIA"
[17] "MASSACHUSETTS" "CONNECTICUT" "NORTH CAROLINA" "NEW HAMPSHIRE"
daymin <- ggplot(dfChurnTrain, aes(churn, total_day_minutes, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "null")
evemin <- ggplot(dfChurnTrain, aes(churn, total_eve_minutes, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "null")
nitmin <- ggplot(dfChurnTrain, aes(churn, total_night_minutes, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "null")
intmin <- ggplot(dfChurnTrain, aes(churn, total_intl_minutes, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "null")
daycal <- ggplot(dfChurnTrain, aes(churn, total_day_calls, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "null")
evecal <- ggplot(dfChurnTrain, aes(churn, total_eve_calls, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "null")
nitcal <- ggplot(dfChurnTrain, aes(churn, total_night_calls, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "null")
intcal <- ggplot(dfChurnTrain, aes(churn, total_intl_calls, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "null")
grid.arrange(daymin, evemin, nitmin, intmin,
daycal, evecal, nitcal, intcal,
ncol = 4, nrow = 2)
The continuos predictors include total minutes, number of calls & charges across day, evening, night and international calls. As call minutes and charges were strongly correlated, call charges has been excluded from the figure. Note that the total number of calls is roughly the same across churn and non-churn population. However, the duration of calls across the churn population is slightly higher than the non-churn population.
exp6 <- ggplot(dfChurnTrain, aes(account_length, fill = churn)) + geom_density(alpha = 0.7) + theme(legend.position = "null")
exp7 <- ggplot(dfChurnTrain, aes(number_vmail_messages, fill = churn)) + geom_density(alpha = 0.7) + theme(legend.position = "null")
grid.arrange(exp6, exp7, ncol = 2, nrow = 1)
Finally, the probability of churn is nearly the same across the account length or the number of voice mail predictors. From the figure, there is no indication that these predictors contribute much to churn predictability.
To identify the factors driving churn, we’ll build an interpretable model. A logistic model is usually a good model for interpeting the key factors.
Check for missing values
anyNA(dfChurnTrain)
[1] FALSE
Check for collinearity.
corrplot(cor(dfChurnTrain[sapply(dfChurnTrain, is.numeric)]))
There are no missing values. Charge variables are highly correlated with call minutes variables. Remove the highly correlated charge variables. Also, as the exploratory analysis already revealed which states have higher than average churn, the state predictor will not be considered in this model. Instead, this model will be used to interpret the remaining contributors of churn
dfChurnTrain$total_day_charge <- NULL
dfChurnTrain$total_eve_charge <- NULL
dfChurnTrain$total_night_charge <- NULL
dfChurnTrain$total_intl_charge <- NULL
dfChurnTrain$state <-NULL
dfChurnTest$state <- NULL
dfChurnTest$total_day_charge <- NULL
dfChurnTest$total_eve_charge <- NULL
dfChurnTest$total_night_charge <- NULL
dfChurnTest$total_intl_charge <- NULL
All models will be built using 3-fold cross validation.
modelCtrl <- trainControl(method = "cv", number = 3)
Logistic classification models are fast and highly interpretable. It doesn’t require input features to be scaled, doesn’t require any tuning, its easy to regularize and it outputs well calibrated probabilities.
#Logistic
set.seed(123)
glm.model <- train(churn ~ ., method = "glm", family = "binomial", data = dfChurnTrain, trControl = modelCtrl)
summary(glm.model)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-3.2671 0.1952 0.3387 0.5131 2.1642
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.5985158 0.7283542 11.805 < 2e-16 ***
account_length -0.0008001 0.0013920 -0.575 0.565415
area_codearea_code_415 0.0832653 0.1367445 0.609 0.542583
area_codearea_code_510 0.1035515 0.1571203 0.659 0.509858
international_planyes -2.0409043 0.1456274 -14.015 < 2e-16 ***
voice_mail_planyes 2.0048044 0.5730385 3.499 0.000468 ***
number_vmail_messages -0.0353252 0.0179801 -1.965 0.049451 *
total_day_minutes -0.0129945 0.0010841 -11.987 < 2e-16 ***
total_day_calls -0.0032180 0.0027580 -1.167 0.243294
total_eve_minutes -0.0072503 0.0011433 -6.341 2.28e-10 ***
total_eve_calls -0.0010881 0.0027813 -0.391 0.695640
total_night_minutes -0.0036918 0.0011099 -3.326 0.000880 ***
total_night_calls -0.0007539 0.0028415 -0.265 0.790771
total_intl_minutes -0.0877617 0.0203934 -4.303 1.68e-05 ***
total_intl_calls 0.0913425 0.0250074 3.653 0.000260 ***
number_customer_service_calls -0.5144152 0.0392785 -13.097 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2758.3 on 3332 degrees of freedom
Residual deviance: 2159.2 on 3317 degrees of freedom
AIC: 2191.2
Number of Fisher Scoring iterations: 6
The logistic model summary reveals the key drivers of churn. Hypothesis tests, at a significance level of 5% and for a sample size of 3333, of the variables marked *** in the model summary will conclude that there is a significant relationship between churn and the predictor variable. Specifically, these variables and their degree of importance to churn are:
varImp(glm.model)
glm variable importance
Overall
international_planyes 100.0000
number_customer_service_calls 93.3235
total_day_minutes 85.2497
total_eve_minutes 44.1914
total_intl_minutes 29.3698
total_intl_calls 24.6363
voice_mail_planyes 23.5158
total_night_minutes 22.2633
number_vmail_messages 12.3597
total_day_calls 6.5566
area_codearea_code_510 2.8638
area_codearea_code_415 2.4991
account_length 2.2511
total_eve_calls 0.9157
total_night_calls 0.0000
The statistical findings here ties back nicely to our own observations in the initial exploratory analysis phase. We noted that customers on the International plan had a significantly higher churn than customers on other plans. We also noted previously that churn spiked among customers who exceeded 3 customer service calls. Finally, churning customers had a higher call duration average than non-churning customers. A good strategy to address churn can focus on these three drivers to predict and address customer churn.
Based on the above, the following proposals could reduce customer churn:
* Survey International plan customers to understand pain points and identify root causes for churn intent. Then take steps to address those concerns.
* Escalate all calls beyond the 1st customer call to ensure that any issues that the customer faces is fixed before the next call. Proactively check on customers to confirm that their issue is fixed.
* Survey customers whose call minutes/charges are above average to check for churn intent. Identify root causes for churn intent and take steps to alleviate those concerns.
The test of any model is to evaluate it against an unseen dataset. We will predict the model against the churnTest dataset and measure the performance
glm.pred <- predict(glm.model, newdata = dfChurnTest)
glm.accuracy <- round(1-mean(glm.pred!=dfChurnTest$churn),2)
table(actual=dfChurnTest$churn, predicted=glm.pred)
predicted
actual yes no
yes 43 181
no 33 1410
The confusion matrix is used to review the prediction vs actual. It is important to note that while the accuracy of the model appears to be high, a lot of that accuracy is driven by a disproportionate number of non-churn customers predicted correctly. To benchmark the logistic model, we’ll compare its performance against another model, a single decision tree model (CART).
#CART
set.seed(123)
cart.model <- train(churn ~ ., method = "rpart2", data = dfChurnTrain, trControl = modelCtrl)
cart.pred <- predict(cart.model, newdata = dfChurnTest)
cart.accuracy <- 1-mean(cart.pred != dfChurnTest$churn)
table(actual=dfChurnTest$churn, predicted=cart.pred)
predicted
actual yes no
yes 120 104
no 19 1424
Note that in the CART model, the accuracy has improved. More importantly, a higher degree of true positives make up the this accuracy, a significant improvement over the previous logistic model. Thus, there is scope for improvement in the predictive performance and particularly in the true positive rate (specificity) given our core objective is to reduce customer. This leads us to search for better performing models in the next part of the project.
For the purpose of identifying the best predictive model, we’ll pick the best among the following tree based models: Single Decision Tree (CART), Bagged Trees, Boosted Tree and Random Forest
#CART - gini
set.seed(123)
gini.cart.model <- train(churn ~ ., method = "rpart", data = dfChurnTrain, parms = list(split = "gini"), trControl = modelCtrl)
#Bagging
set.seed(123)
treebag.model <- train(churn ~ ., method = "treebag", data = dfChurnTrain, trControl = modelCtrl)
# Boosted Tree
set.seed(123)
C5.model <- train(churn ~ ., method = "C5.0", data = dfChurnTrain, trControl = modelCtrl)
#Random Forest
set.seed(123)
rf.model <- train(churn ~ ., method = "rf", data = dfChurnTrain, trControl = modelCtrl)
results <- resamples(list(CART=gini.cart.model, BAG=treebag.model, BOOST=C5.model, RF=rf.model))
dotplot(results)
Both Boosted tree and Random Forest models appear to have the highest accuracy. We’ll put both these models to the test to pick the winner.
gb.pred <- predict(C5.model, newdata = dfChurnTest)
gb.accuracy <- 1-mean(gb.pred != dfChurnTest$churn)
rf.pred1 <- predict(rf.model, newdata = dfChurnTest)
rf.accuracy <- 1-mean(rf.pred1 != dfChurnTest$churn)
glm.roc <- roc(response = dfChurnTest$churn, predictor = as.numeric(glm.pred))
gb.roc <- roc(response = dfChurnTest$churn, predictor = as.numeric(gb.pred))
rf.roc <- roc(response = dfChurnTest$churn, predictor = as.numeric(rf.pred1))
plot(glm.roc, legacy.axes = TRUE, print.auc.y = 0.4, print.auc = TRUE)
plot(gb.roc, col = "blue", add = TRUE, print.auc.y = 0.5, print.auc = TRUE)
plot(rf.roc, col = "red" , add = TRUE, print.auc.y = 0.6, print.auc = TRUE)
legend(0.0,0.2, c("Random Forest", "Boosted Tree", "Logistic"), lty = c(1,1), lwd = c(2, 2), col = c("red", "blue", "black"), cex = 0.75)
The ROC curve is a graphical way to review the performance of classification models. The ROC is a better indicator of performance especially as we’re interested in the true positive rate of our models. The Area Under Curve (AUC) measures the models ability to correctly classify those who churned and those who did not. The higher the AUC the better. The ROC curve above shows that the Random Forest outperformed the Boosted Tree model marginally. Both models are a marked improvement over the Logistic model, which is useful enough for model interpretability but not so much for prediction performance. The Random Forest is the best tree based predictive model
The Company plans a promotion campaign geared specifically at the customers predicted to churn. The profitability of the campaign is measured in ‘benefit’, which is some function of profitability. Any promotional campaign must result not only in a positive benefit value but also must be tuned to achieve the highest benefit value. Subsequently, the marketing department has identified costs and benefit associated with each category of customers outlined above. The promotional campaign is highly effective at reducing churn but also a very expensive one. So it is important that the promo is presented ONLY to customers predicted to have churn-intent. There is a positive net benefit if churn-intent customers respond positively to the churn but there is also a cost (negative benefit) if churn-non-intent customers that are presented the promo sign up to the promo.
These assumption are captured in the ‘benefit’ function below. The function takes a classication table (tp, fn, fp, tn) as input and returns the computed benefit based on the assumptions:
benefit <- function(confMatrix) {
tp <- confMatrix[1,1]; fn <- confMatrix[1,2]
fp <- confMatrix[2,1]; tn <- confMatrix[2,2]
tp.bnft <- 1000 ; fn.cost <- -2000
fp.cost <- -250 ; tn.bnft <- 50
return(0.75*tp*tp.bnft +
(1-0.75)*tp*fn.cost + fn*fn.cost +
tn*tn.bnft +
0.90*fp*fp.cost + (1-0.9)*fp*tn.bnft)
}
Now we can compute the realized benefit for the best predictive tree based model built previously:
benefit(table(actual = dfChurnTest$churn, predicted = rf.pred1))
[1] -10720
At a net benefit loss of $10720, our plan is not profitable. We now need to tune our predictive model to achieve the maximum benefit gain. To achieve this, we will predict the model again but this time use the probability values instead of the class values used so far. Our goal will be to identify the best cutoff value for “churn=yes” to obtain the maximum benefit gain. Loop through each of the cutoff values and compute the benefit:
rf.pred2 <- predict(rf.model, newdata = dfChurnTest, type = "prob")
dfCB <- data.frame(numeric(), numeric())
for (i in seq(0,1,0.1)) {
newValue <- factor(ifelse( rf.pred2[,1] > i, "yes", "no"), levels = levels(dfChurnTest$churn))
cm <- table(actual=dfChurnTest$churn, predicted=newValue)
dfCB <- rbind(dfCB,c(i, benefit(cm)))
}
colnames(dfCB) <- c("CutOff","Benefit($)")
dfCB
CutOff Benefit($)
1 0.0 -249940
2 0.1 3860
3 0.2 27440
4 0.3 28340
5 0.4 6560
6 0.5 -10720
7 0.6 -46180
8 0.7 -83620
9 0.8 -117100
10 0.9 -184600
11 1.0 -375850
Using a cutoff value of 0.3 in the predictive model leads to the maximum benefit of $28340. That is, at this cutoff, our proposed strategy is not only profitable but also derives the most benefit for the conditions assumed.
Given the assumptions made in our financial model, the benefit decreases sharply outside the cutoff range of 0.2 to 0.3 and goes into the negative. Thus, tuning the cutoff of the Random Forest model to the final value of 0.3 makes the financial model most profitable.
rf2.pred <- factor(ifelse( rf.pred2[,1] > 0.3, "yes", "no"), levels = levels(dfChurnTest$churn))
rf2.accuracy <- round(1-mean(rf2.pred != dfChurnTest$churn),2)
glm.roc <- roc(response = dfChurnTest$churn, predictor = as.numeric(glm.pred))
rf2.roc <- roc(response = dfChurnTest$churn, predictor = as.numeric(rf2.pred))
plot(glm.roc, col = "blue", legacy.axes = TRUE, print.auc.y = 0.6, print.auc = TRUE)
plot(rf2.roc, col = "red" , add = TRUE, print.auc.x = 0.8, print.auc.y = 0.9, print.auc = TRUE)
legend(0.05,0.95, c("Predictive (rf)", "Interpretative (glm)"), lty = c(1,1), lwd = c(2, 2), col = c("red", "blue"), cex = 0.75)
Using a cutoff value of 0.3 in the predictive model leads to the maximum predicted benefit of $28340. That is, at this cutoff, our proposed strategy is not only profitable but also derives the most benefit for the conditions assumed. ROC curves comparing our interpretive and predictive models is shown above. The best predictive model is a Random Forest model with a cutoff of 0.3 (“churn=yes”) and the interpretative model is a Logistic model. The accuracy of the predictive model is 0.96 and that of the interpretative model is 0.87, although in this case the AUC would be a better indicator of model performance. The AUC for the predictive model is 0.58 and that of the interpretative model is 0.9.