1.0 Synopsis

The telecom market in the US is saturated and customer growth rates are low. The key focus of market players is therefore on retention and churn control. This project explores the churn dataset in the C50 library to identify the key drivers of churn and to build the best model for predicting it. A strategy to reduce churn is presented and evaluated against the model. Finally, the model is tuned to maximise the financial benefit of the proposed strategy.

library(ggplot2)    # plotting
library(gridExtra)  # arranging multiple plots on a grid
library(dplyr)      # data manipulation
library(corrplot)   # correlation matrix plots
library(pROC)       # ROC curves and AUC
library(C50)        # C5.0 models and the churn dataset
library(caret)      # unified model training and evaluation
library(rpart)      # CART models
data(churn)
dfChurnTrain <- churnTrain
dfChurnTest <- churnTest

2.0 Exploratory data analysis

The churn dataset is split into churnTrain (3333 obs.) and churnTest (1667 obs.), each with 19 predictor variables and one response variable (churn = yes/no). The proportion of churned customers (churn = yes) is close to 14% and is roughly the same in both sets. churnTrain will be used for data exploration and model building, while churnTest will be used to measure model performance.
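
A quick two-line check of the class balance in each split (not in the original listing):

# Proportion of churn = yes/no in each split; both should be close to 14% "yes"
prop.table(table(dfChurnTrain$churn))
prop.table(table(dfChurnTest$churn))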

2.1 Churn ratio by categorical predictors
exp1 <- ggplot(dfChurnTrain, aes(area_code, fill = churn)) + geom_bar(position = "fill") + labs(x = "Area code", y = "") + theme(legend.position = "none")
exp2 <- ggplot(dfChurnTrain, aes(international_plan, fill = churn)) + geom_bar(position = "fill") + labs(x = "International?", y = "") + theme(legend.position = "none")
exp3 <- ggplot(dfChurnTrain, aes(voice_mail_plan, fill = churn)) + geom_bar(position = "fill") + labs(x = "Voicemail?", y = "") + theme(legend.position = "none") 
exp4 <- ggplot(dfChurnTrain, aes(number_customer_service_calls, fill = churn)) + geom_bar(position = "fill") + labs(x = "Customer calls", y = "") + theme(legend.position = "none") 
grid.arrange(exp1, exp2, exp3, exp4, ncol = 4, nrow = 1, top = "Churn/Non-churn Proportion")

The figure above shows the proportion of churn across the categorical predictors (red = churned). The churn rate by area code does not differ from the overall churn ratio, while the churn ratio for customers not on a voicemail plan appears to be higher, although this needs to be tested for spurious correlation. More importantly, the churn ratio on the International plan is four times the average, and there is a significant spike in churn once customers make four or more customer service calls.
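
The spike can be quantified directly from the training data; for example:

# Churn rate by number of customer service calls (the rate jumps sharply at 4+)
dfChurnTrain %>%
  group_by(number_customer_service_calls) %>%
  summarise(churn_rate = mean(churn == "yes"), customers = n())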

2.2 Churn ratio by State

The churn ratio varies across states. Any proposal to reduce churn must focus more on states having above average churn. The States with above average churn rates are:
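
The summary table `ct` used below is not defined in the listing; a plausible reconstruction, assuming the dataset's two-letter state codes are mapped to full names via R's built-in state.abb and state.name vectors (the printed output shows full names):

# Churn percentage by state, with abbreviations translated to full names
# (note: DC has no match in state.abb and would map to NA)
ct <- dfChurnTrain %>%
  group_by(state) %>%
  summarise(ChurnPercent = 100 * mean(churn == "yes")) %>%
  mutate(state = state.name[match(as.character(state), state.abb)])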

top20 <- ct %>% arrange(desc(ChurnPercent)) %>% filter(ChurnPercent >= 15) %>% select(state)
print(toupper(unlist(top20$state)))
 [1] "CALIFORNIA"     "NEW JERSEY"     "TEXAS"          "MARYLAND"      
 [5] "SOUTH CAROLINA" "MICHIGAN"       "MISSISSIPPI"    "NEVADA"        
 [9] "WASHINGTON"     "MAINE"          "MONTANA"        "ARKANSAS"      
[13] "KANSAS"         "NEW YORK"       "MINNESOTA"      "PENNSYLVANIA"  
[17] "MASSACHUSETTS"  "CONNECTICUT"    "NORTH CAROLINA" "NEW HAMPSHIRE" 

2.3 Explore distributions by continuous predictors
daymin <- ggplot(dfChurnTrain, aes(churn, total_day_minutes, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "none")
evemin <- ggplot(dfChurnTrain, aes(churn, total_eve_minutes, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "none")
nitmin <- ggplot(dfChurnTrain, aes(churn, total_night_minutes, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "none")
intmin <- ggplot(dfChurnTrain, aes(churn, total_intl_minutes, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "none")
daycal <- ggplot(dfChurnTrain, aes(churn, total_day_calls, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "none")
evecal <- ggplot(dfChurnTrain, aes(churn, total_eve_calls, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "none")
nitcal <- ggplot(dfChurnTrain, aes(churn, total_night_calls, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "none")
intcal <- ggplot(dfChurnTrain, aes(churn, total_intl_calls, fill = churn)) + geom_boxplot(alpha = 0.8) + theme(legend.position = "none")
grid.arrange(daymin, evemin, nitmin, intmin, 
             daycal, evecal, nitcal, intcal, 
             ncol = 4, nrow = 2)

The continuous predictors include total minutes, number of calls and charges across day, evening, night and international calls. As call minutes and charges are strongly correlated, the charge variables have been excluded from the figure. Note that the total number of calls is roughly the same across the churn and non-churn populations. However, the call duration of the churn population is slightly higher than that of the non-churn population.
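
The strong correlation is easy to verify; for instance, for day calls:

# Charges are essentially a fixed rate times minutes, so the correlation is ~1
cor(dfChurnTrain$total_day_minutes, dfChurnTrain$total_day_charge)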

exp6 <- ggplot(dfChurnTrain, aes(account_length, fill = churn)) + geom_density(alpha = 0.7) + theme(legend.position = "none")
exp7 <- ggplot(dfChurnTrain, aes(number_vmail_messages, fill = churn)) + geom_density(alpha = 0.7) + theme(legend.position = "none")
grid.arrange(exp6, exp7, ncol = 2, nrow = 1)

Finally, the distribution of churn is nearly identical across the range of account length and the number of voicemail messages. The figure gives no indication that these predictors contribute much to churn predictability.

3.0 Factors driving churn

To identify the factors driving churn, we'll build an interpretable model. A logistic regression model is usually a good choice for interpreting the key factors.

3.1 Data Cleaning

Check for missing values

anyNA(dfChurnTrain)
[1] FALSE

Check for collinearity.

corrplot(cor(dfChurnTrain[sapply(dfChurnTrain, is.numeric)]))

There are no missing values. The charge variables are highly correlated with the corresponding call-minutes variables, so the charge variables are removed. Also, as the exploratory analysis already revealed which states have higher than average churn, the state predictor will not be considered in this model. Instead, the model will be used to interpret the remaining contributors to churn.
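
caret can flag such predictors programmatically; an equivalent check (the 0.9 cutoff here is an illustrative choice):

# Names of numeric predictors with pairwise |correlation| above 0.9
numCols <- sapply(dfChurnTrain, is.numeric)
findCorrelation(cor(dfChurnTrain[numCols]), cutoff = 0.9, names = TRUE)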

dfChurnTrain$total_day_charge <- NULL
dfChurnTrain$total_eve_charge <- NULL
dfChurnTrain$total_night_charge <- NULL
dfChurnTrain$total_intl_charge <- NULL
dfChurnTrain$state <-NULL
dfChurnTest$state <- NULL
dfChurnTest$total_day_charge <- NULL
dfChurnTest$total_eve_charge <- NULL
dfChurnTest$total_night_charge <- NULL
dfChurnTest$total_intl_charge <- NULL

3.2 Build and interpret the model

All models will be built using 3-fold cross validation.

modelCtrl <- trainControl(method = "cv", number = 3)
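
As an aside, if model selection were driven by ROC rather than accuracy, the control object could instead be set up as follows (a sketch, not used in the rest of this analysis; models would then be trained with metric = "ROC"):

# Alternative: cross-validation with class probabilities for ROC-based selection
rocCtrl <- trainControl(method = "cv", number = 3, classProbs = TRUE,
                        summaryFunction = twoClassSummary)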

Logistic classification models are fast and highly interpretable. They don't require input features to be scaled, need no tuning, are easy to regularize, and output well-calibrated probabilities.

#Logistic
set.seed(123)
glm.model <- train(churn ~ ., method = "glm", family = "binomial", data = dfChurnTrain, trControl = modelCtrl)
summary(glm.model)

Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.2671   0.1952   0.3387   0.5131   2.1642  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    8.5985158  0.7283542  11.805  < 2e-16 ***
account_length                -0.0008001  0.0013920  -0.575 0.565415    
area_codearea_code_415         0.0832653  0.1367445   0.609 0.542583    
area_codearea_code_510         0.1035515  0.1571203   0.659 0.509858    
international_planyes         -2.0409043  0.1456274 -14.015  < 2e-16 ***
voice_mail_planyes             2.0048044  0.5730385   3.499 0.000468 ***
number_vmail_messages         -0.0353252  0.0179801  -1.965 0.049451 *  
total_day_minutes             -0.0129945  0.0010841 -11.987  < 2e-16 ***
total_day_calls               -0.0032180  0.0027580  -1.167 0.243294    
total_eve_minutes             -0.0072503  0.0011433  -6.341 2.28e-10 ***
total_eve_calls               -0.0010881  0.0027813  -0.391 0.695640    
total_night_minutes           -0.0036918  0.0011099  -3.326 0.000880 ***
total_night_calls             -0.0007539  0.0028415  -0.265 0.790771    
total_intl_minutes            -0.0877617  0.0203934  -4.303 1.68e-05 ***
total_intl_calls               0.0913425  0.0250074   3.653 0.000260 ***
number_customer_service_calls -0.5144152  0.0392785 -13.097  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2758.3  on 3332  degrees of freedom
Residual deviance: 2159.2  on 3317  degrees of freedom
AIC: 2191.2

Number of Fisher Scoring iterations: 6

The logistic model summary reveals the key drivers of churn. At a significance level of 5% and with a sample size of 3333, hypothesis tests on the variables marked *** in the summary conclude that there is a significant relationship between churn and those predictors. Specifically, these variables and their degree of importance to churn are:

varImp(glm.model)
glm variable importance

                               Overall
international_planyes         100.0000
number_customer_service_calls  93.3235
total_day_minutes              85.2497
total_eve_minutes              44.1914
total_intl_minutes             29.3698
total_intl_calls               24.6363
voice_mail_planyes             23.5158
total_night_minutes            22.2633
number_vmail_messages          12.3597
total_day_calls                 6.5566
area_codearea_code_510          2.8638
area_codearea_code_415          2.4991
account_length                  2.2511
total_eve_calls                 0.9157
total_night_calls               0.0000
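
A note on interpretation: with the churn factor levels ordered (yes, no), glm models the log-odds of churn = "no", so negative coefficients indicate higher churn. Exponentiating a coefficient gives an odds ratio; for example (a sketch, using caret's stored glm fit in glm.model$finalModel):

# Odds ratio for the international plan, on the odds of NOT churning
exp(coef(glm.model$finalModel)["international_planyes"])
# ~0.13: the odds of not churning are ~87% lower, i.e. far higher churn odds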

The statistical findings here tie back nicely to our observations in the exploratory analysis. We noted that customers on the International plan had a significantly higher churn rate than other customers. We also noted that churn spiked among customers who made four or more customer service calls. Finally, churning customers had a higher average call duration than non-churning customers. A good strategy can focus on these three drivers to predict and address customer churn.

3.3 Strategies for Churn reduction

Based on the above, the following proposals could reduce customer churn:
* Survey International plan customers to understand pain points and identify root causes for churn intent. Then take steps to address those concerns.
* Escalate every call beyond a customer's first service call to ensure that any issues the customer faces are resolved before the next call. Proactively check with customers to confirm that their issue is fixed.
* Survey customers whose call minutes/charges are above average to check for churn intent. Identify root causes for churn intent and take steps to alleviate those concerns.

3.4 Model Performance

The test of any model is to evaluate it against an unseen dataset. We will generate predictions for the churnTest dataset and measure the model's performance.

glm.pred <- predict(glm.model, newdata = dfChurnTest)
glm.accuracy <- round(1-mean(glm.pred!=dfChurnTest$churn),2)
table(actual=dfChurnTest$churn, predicted=glm.pred)
      predicted
actual  yes   no
   yes   43  181
   no    33 1410
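
caret's confusionMatrix reports the same table along with per-class metrics; for example:

# With "yes" as the positive class, sensitivity is only 43/224, about 0.19
confusionMatrix(glm.pred, dfChurnTest$churn, positive = "yes")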

The confusion matrix compares predictions against actuals. While the accuracy of the model appears high, much of it is driven by the disproportionate number of non-churn customers predicted correctly; the model catches only 43 of the 224 actual churners. To benchmark the logistic model, we'll compare its performance against a single decision tree model (CART).

#CART
set.seed(123)
cart.model <- train(churn ~ ., method = "rpart2", data = dfChurnTrain, trControl = modelCtrl)
cart.pred <- predict(cart.model, newdata = dfChurnTest)
cart.accuracy <- 1-mean(cart.pred != dfChurnTest$churn)
table(actual=dfChurnTest$churn, predicted=cart.pred)
      predicted
actual  yes   no
   yes  120  104
   no    19 1424

Note that the CART model's accuracy has improved. More importantly, a higher share of true positives makes up this accuracy, a significant improvement over the logistic model. There is thus scope for improving predictive performance, particularly the true positive rate (sensitivity), given that our core objective is to reduce customer churn. This leads us to search for better performing models in the next part of the project.

4.0 Building the Best Predictive Model

4.1 Model Selection

For the purpose of identifying the best predictive model, we'll pick the best among the following tree-based models: Single Decision Tree (CART), Bagged Trees, Boosted Tree and Random Forest.

#CART - gini
set.seed(123)
gini.cart.model <- train(churn ~ ., method = "rpart", data = dfChurnTrain,  parms = list(split = "gini"), trControl = modelCtrl)
#Bagging
set.seed(123)
treebag.model <- train(churn ~ ., method = "treebag", data = dfChurnTrain, trControl = modelCtrl)
# Boosted Tree
set.seed(123)
C5.model <- train(churn ~ ., method = "C5.0", data = dfChurnTrain, trControl = modelCtrl)
#Random Forest
set.seed(123)
rf.model <- train(churn ~ ., method = "rf", data = dfChurnTrain, trControl = modelCtrl)
results <- resamples(list(CART=gini.cart.model, BAG=treebag.model, BOOST=C5.model, RF=rf.model))
dotplot(results)

Both the Boosted Tree and Random Forest models appear to have the highest accuracy. We'll put both to the test on the unseen data to pick the winner.

gb.pred <- predict(C5.model, newdata = dfChurnTest)
gb.accuracy <- 1-mean(gb.pred != dfChurnTest$churn)

rf.pred1 <- predict(rf.model, newdata = dfChurnTest)
rf.accuracy <- 1-mean(rf.pred1 != dfChurnTest$churn)

glm.roc <- roc(response = dfChurnTest$churn, predictor = as.numeric(glm.pred))
gb.roc <- roc(response = dfChurnTest$churn, predictor = as.numeric(gb.pred))
rf.roc <- roc(response = dfChurnTest$churn, predictor = as.numeric(rf.pred1))

plot(glm.roc,      legacy.axes = TRUE, print.auc.y = 0.4, print.auc = TRUE)
plot(gb.roc, col = "blue", add = TRUE, print.auc.y = 0.5, print.auc = TRUE)
plot(rf.roc, col = "red" , add = TRUE, print.auc.y = 0.6, print.auc = TRUE)

legend(0.0,0.2, c("Random Forest", "Boosted Tree", "Logistic"), lty = c(1,1), lwd = c(2, 2), col = c("red", "blue", "black"), cex = 0.75)

The ROC curve is a graphical way to review the performance of classification models, and a better indicator here because we're interested in the true positive rate. The Area Under the Curve (AUC) measures a model's ability to correctly separate those who churned from those who did not; the higher the AUC, the better. The curve above shows that the Random Forest marginally outperformed the Boosted Tree model. Both are a marked improvement over the logistic model, which is useful for interpretability but not so much for prediction performance. The Random Forest is the best tree-based predictive model.
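
One caveat: the curves above are built from hard class labels, which gives a single operating point per model. Curves built from predicted class probabilities trace the full trade-off; a sketch:

# ROC from predicted probabilities rather than predicted classes
rf.prob <- predict(rf.model, newdata = dfChurnTest, type = "prob")
rf.roc.prob <- roc(response = dfChurnTest$churn, predictor = rf.prob[, "yes"])
plot(rf.roc.prob, legacy.axes = TRUE, print.auc = TRUE)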

4.2 Model Tuning (based on business cost benefit criterion)

The Company plans a promotion campaign geared specifically at the customers predicted to churn. The profitability of the campaign is measured in 'benefit', a dollar figure computed from the costs and gains below. Any promotional campaign must not only yield a positive benefit but also be tuned to achieve the highest benefit value. The marketing department has identified the costs and benefits associated with each category of customer outlined below. The promotional campaign is highly effective at reducing churn but is also very expensive, so it is important that the promo is presented ONLY to customers predicted to have churn intent. There is a positive net benefit when churn-intent customers respond positively to the promo, but there is a cost (negative benefit) when churn-non-intent customers that are presented the promo sign up to it.

Assumptions:
  • ‘tp’ represents the number of customers that were correctly predicted as intending to churn
  • ‘tn’ represents the number of customers that were correctly predicted as not intending to churn
  • ‘fn’ represents the number of customers that were incorrectly predicted as not intending to churn but actually did churn
  • ‘fp’ represents the number of customers that were incorrectly predicted as intending to churn but never churned or intended to either
  • Each churn-intent customer that is predicted to churn but responds positively to the campaign, and so does not churn, contributes a gain in benefit of $1000 (tp.bnft) over the next year
  • Each churn-intent customer that eventually churns contributes a loss of benefit of -$2000 (fn.cost), due to lost revenue over the next year and the marketing cost to win that customer back in the future
  • Each churn-non-intent customer incorrectly predicted to have churn intent that signs up to the promotional campaign contributes a loss of benefit of -$250 (fp.cost), relative to the positive benefit they would have contributed had they not been presented the promo
  • Each churn-non-intent customer will continue to contribute a net positive benefit of $50 over the next year
  • 75% of churn-intent customers presented the promo will respond positively while the rest will churn
  • 100% of churn-intent customers not presented the promo will churn
  • 90% of churn-non-intent customers presented the promo will sign up to the promo
  • 100% of churn-non-intent customers will continue to be customers

These assumptions are captured in the 'benefit' function below. The function takes a classification table (tp, fn, fp, tn) as input and returns the computed benefit based on the assumptions:

benefit <- function(confMatrix) {
  # confMatrix = table(actual, predicted) with factor levels ordered (yes, no)
  tp <- confMatrix[1,1]; fn <- confMatrix[1,2]
  fp <- confMatrix[2,1]; tn <- confMatrix[2,2]
  
  tp.bnft <- 1000      ; fn.cost <- -2000
  fp.cost <- -250      ; tn.bnft <- 50
  
  # 75% of targeted churn-intent customers respond; the rest churn anyway.
  # 90% of wrongly targeted non-churners sign up; the rest keep their $50 value.
  return(0.75*tp*tp.bnft + 
           (1-0.75)*tp*fn.cost + fn*fn.cost + 
           tn*tn.bnft + 
           0.90*fp*fp.cost + (1-0.9)*fp*tn.bnft)
}

Now we can compute the realized benefit for the best predictive tree based model built previously:

benefit(table(actual = dfChurnTest$churn, predicted = rf.pred1))
[1] -10720
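
For context, applying the same function to the logistic model's confusion matrix from section 3.4 (tp = 43, fn = 181, fp = 33, tn = 1410) gives a far larger loss, driven mostly by its 181 missed churners:

benefit(table(actual = dfChurnTest$churn, predicted = glm.pred))
# 0.75*43*1000 + 0.25*43*(-2000) + 181*(-2000) + 1410*50 + 0.9*33*(-250) + 0.1*33*50 = -288010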

At a net benefit loss of $10720, our plan is not profitable. We now need to tune the predictive model to achieve the maximum benefit. To do this, we predict again, this time using probability values instead of the class values used so far. Our goal is to identify the cutoff value for "churn = yes" that yields the maximum benefit. Loop through the cutoff values and compute the benefit at each:

rf.pred2 <- predict(rf.model, newdata = dfChurnTest, type = "prob")
dfCB <- data.frame(numeric(), numeric())
# Sweep the "churn = yes" probability cutoff from 0 to 1 and record the benefit
for (i in seq(0,1,0.1)) {
  newValue <- factor(ifelse( rf.pred2[,1] > i, "yes", "no"), levels = levels(dfChurnTest$churn))
  cm <- table(actual=dfChurnTest$churn, predicted=newValue)
  dfCB <- rbind(dfCB,c(i, benefit(cm)))
}
colnames(dfCB) <- c("CutOff","Benefit($)")
dfCB
   CutOff Benefit($)
1     0.0    -249940
2     0.1       3860
3     0.2      27440
4     0.3      28340
5     0.4       6560
6     0.5     -10720
7     0.6     -46180
8     0.7     -83620
9     0.8    -117100
10    0.9    -184600
11    1.0    -375850

Using a cutoff value of 0.3 in the predictive model yields the maximum benefit of $28340. At this cutoff, our proposed strategy is not only profitable but also derives the most benefit under the assumed conditions. Given the assumptions of the financial model, the benefit falls off sharply outside the 0.2 to 0.3 cutoff range and turns negative. Tuning the Random Forest cutoff to a final value of 0.3 therefore makes the financial model most profitable.
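
Since the benefit peaks between cutoffs of 0.2 and 0.3, a finer sweep around that region could refine the choice; a sketch (results not shown here):

# Refine the cutoff search on a 0.01 grid around the best coarse value
cutoffs <- seq(0.15, 0.40, by = 0.01)
fineBenefit <- sapply(cutoffs, function(i) {
  pred <- factor(ifelse(rf.pred2[, 1] > i, "yes", "no"),
                 levels = levels(dfChurnTest$churn))
  benefit(table(actual = dfChurnTest$churn, predicted = pred))
})
cutoffs[which.max(fineBenefit)]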

5.0 Conclusion

rf2.pred <- factor(ifelse( rf.pred2[,1] > 0.3, "yes", "no"), levels = levels(dfChurnTest$churn))
rf2.accuracy <- round(1-mean(rf2.pred != dfChurnTest$churn),2)

glm.roc <- roc(response = dfChurnTest$churn, predictor = as.numeric(glm.pred))
rf2.roc <- roc(response = dfChurnTest$churn, predictor = as.numeric(rf2.pred))

plot(glm.roc, col = "blue", legacy.axes = TRUE, print.auc.y = 0.6, print.auc = TRUE)
plot(rf2.roc, col = "red" , add = TRUE, print.auc.x = 0.8, print.auc.y = 0.9, print.auc = TRUE)

legend(0.05,0.95, c("Predictive (rf)", "Interpretative (glm)"), lty = c(1,1), lwd = c(2, 2), col = c("red", "blue"), cex = 0.75)

The ROC curves above compare our interpretative and predictive models. The best predictive model is the Random Forest with a probability cutoff of 0.3 for "churn = yes"; the interpretative model is the logistic model. The accuracy of the predictive model is 0.96 and that of the interpretative model is 0.87, although the AUC is a better indicator of performance here: 0.9 for the predictive model against 0.58 for the interpretative model. Tuned this way, the predictive model delivers the maximum predicted benefit of $28340 for the proposed churn-reduction strategy under the stated assumptions.