Introduction

This document outlines a machine learning workflow to predict customer turnover. We start by loading the necessary packages and reading the data. We then go into some basic data manipulations to make the data more malleable for us to use later on. We then plan to make a few models to determine key insights on the customers. with these insights we can make a plan for the company to use in order to further advance business.

Libraries and Prepping the data

Load the required libraries and read the dataset. Make sure that character columns are factors, convert TotalCharges to numeric values, and handle missing values.

# Load the required libraries for the project
library(dplyr)         
library(ggplot2)      
library(caret)          
library(randomForest)  
library(pROC)         
library(rpart)        
library(tidymodels)     
library(vip)           

# Read the dataset 
data <- read.csv("customer_retention.csv")

# Convert all the columns to factors, convert to numeric and get rid of any NAs
data <- data %>% mutate(across(where(is.character), as.factor))
data$TotalCharges <- as.numeric(as.character(data$TotalCharges))
data <- na.omit(data)

Examine Target Distribution and Baseline Rates

Looking at “Status”, we must determine both the distribution and proportions of churn and retention rate in each category. This is done to gather a baseline.

table(data$Status)
## 
## Current    Left 
##    5132    1856
prop.table(table(data$Status))
## 
##   Current      Left 
## 0.7344018 0.2655982

Visualizations

Use ggplot to create visualizations for both tenure distribution and internet service by status. This gives us visuals of how long people stay with our internet service, and how each service performs for us.

ggplot(data, aes(x = Tenure, fill = Status)) +
  geom_histogram(binwidth = 5, position = "dodge") +
  labs(title = "Tenure Distribution by Status", x = "Tenure (Months)", y = "Count")

ggplot(data, aes(x = InternetService, fill = Status)) +
  geom_bar(position = "dodge") +
  labs(title = "Internet Service Type by Status", x = "Internet Service", y = "Count")

Graph Analysis

Distribution Graph

We can see by the 1st graph that the largest group of current customers has been with the company for 70 months or more, and the distribution is relatively level before that. For the customers that left the overwhelming majority left within the first 20 months of service.

Services Graph

The services graph shows us that most of the people with the DSL service are keeping their plan, while a little under half of the fiber optic customers have chosen to go with a different provider.

Split the Data

set.seed(123)
train_index <- createDataPartition(data$Status, p = 0.7, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

Explaination

We believe using the 70/30 split will allow us to get the best estimate for reality while avoiding overfitting, which would defeat the purpose of the data.

Logistic Regression, Decosion Tree, and Random Forest

# Logistic Regression
log_model <- glm(Status ~ ., data = train_data, family = "binomial")
log_pred <- predict(log_model, test_data, type = "response")
log_pred_class <- ifelse(log_pred > 0.5, "Left", "Current")
confusionMatrix(as.factor(log_pred_class), test_data$Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Current Left
##    Current    1363  236
##    Left        176  320
##                                           
##                Accuracy : 0.8033          
##                  95% CI : (0.7857, 0.8202)
##     No Information Rate : 0.7346          
##     P-Value [Acc > NIR] : 1.21e-13        
##                                           
##                   Kappa : 0.4776          
##                                           
##  Mcnemar's Test P-Value : 0.003652        
##                                           
##             Sensitivity : 0.8856          
##             Specificity : 0.5755          
##          Pos Pred Value : 0.8524          
##          Neg Pred Value : 0.6452          
##              Prevalence : 0.7346          
##          Detection Rate : 0.6506          
##    Detection Prevalence : 0.7632          
##       Balanced Accuracy : 0.7306          
##                                           
##        'Positive' Class : Current         
## 
roc_log <- roc(as.numeric(test_data$Status == "Left"), as.numeric(log_pred))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(roc_log)
## Area under the curve: 0.8344
# Decision Tree
tree_model <- rpart(Status ~ ., data = train_data, method = "class")
tree_pred <- predict(tree_model, test_data, type = "class")
confusionMatrix(tree_pred, test_data$Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Current Left
##    Current    1399  305
##    Left        140  251
##                                           
##                Accuracy : 0.7876          
##                  95% CI : (0.7694, 0.8049)
##     No Information Rate : 0.7346          
##     P-Value [Acc > NIR] : 1.115e-08       
##                                           
##                   Kappa : 0.3982          
##                                           
##  Mcnemar's Test P-Value : 7.584e-15       
##                                           
##             Sensitivity : 0.9090          
##             Specificity : 0.4514          
##          Pos Pred Value : 0.8210          
##          Neg Pred Value : 0.6419          
##              Prevalence : 0.7346          
##          Detection Rate : 0.6678          
##    Detection Prevalence : 0.8134          
##       Balanced Accuracy : 0.6802          
##                                           
##        'Positive' Class : Current         
## 
tree_pred_prob <- predict(tree_model, test_data, type = "prob")[, 2]
roc_tree <- roc(as.numeric(test_data$Status == "Left"), tree_pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(roc_tree)
## Area under the curve: 0.7953
# Random Forest
rf_model <- randomForest(Status ~ ., data = train_data, importance = TRUE)
rf_pred <- predict(rf_model, test_data)
confusionMatrix(rf_pred, test_data$Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Current Left
##    Current    1368  275
##    Left        171  281
##                                           
##                Accuracy : 0.7871          
##                  95% CI : (0.7689, 0.8045)
##     No Information Rate : 0.7346          
##     P-Value [Acc > NIR] : 1.502e-08       
##                                           
##                   Kappa : 0.4193          
##                                           
##  Mcnemar's Test P-Value : 1.076e-06       
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.5054          
##          Pos Pred Value : 0.8326          
##          Neg Pred Value : 0.6217          
##              Prevalence : 0.7346          
##          Detection Rate : 0.6530          
##    Detection Prevalence : 0.7842          
##       Balanced Accuracy : 0.6971          
##                                           
##        'Positive' Class : Current         
## 

Explaination

Logistic Regression

Logistic regression had the highest area under the curve which tells us that it is best at determining weather a customer chooses to stay or not. It also had the highest accuracy indicating that Logistic regression is our best option when predicting not only someones likelihood to stay, but also who is most likely to stay.

Decision Tree

The Decision tree had slightly lower statistics than Logistic Regression. We see that since sensitivity is a positive rate and specificity is a negative rate, the decision tree is very good at predicting who stays, but not so good at predicting who leaves.

Random Forest

We can assume using the confusion matrix that AUC on the Random Forest had similar if not slightly lower performace compared to Regression.

Random Forest Tuning

We have to define the control, the paramater, and evaluate the model. Parameter tuning for a Random Forest model is intended to find the best combination that will yield our best results. Since a random forest is multiple decision trees, it allows for more customization and tuning.

# Define the control
control <- trainControl(
  method = "cv",
  number = 5,
  summaryFunction = twoClassSummary,
  classProbs = TRUE
)

# Define the parameter
tune_grid <- expand.grid(
  mtry = c(2, 4, 6, 8),
  splitrule = "gini",
  min.node.size = c(1, 5, 10)
)

set.seed(123)
rf_tuned <- train(
  Status ~ .,
  data = train_data,
  method = "ranger",
  metric = "ROC",
  trControl = control,
  tuneGrid = tune_grid,
  importance = "impurity"
)

print(rf_tuned$bestTune)
##   mtry splitrule min.node.size
## 6    4      gini            10
# Evaluate tuned model
rf_pred <- predict(rf_tuned, test_data)
confusionMatrix(rf_pred, test_data$Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Current Left
##    Current    1384  274
##    Left        155  282
##                                           
##                Accuracy : 0.7952          
##                  95% CI : (0.7773, 0.8123)
##     No Information Rate : 0.7346          
##     P-Value [Acc > NIR] : 6.457e-11       
##                                           
##                   Kappa : 0.4363          
##                                           
##  Mcnemar's Test P-Value : 1.219e-08       
##                                           
##             Sensitivity : 0.8993          
##             Specificity : 0.5072          
##          Pos Pred Value : 0.8347          
##          Neg Pred Value : 0.6453          
##              Prevalence : 0.7346          
##          Detection Rate : 0.6606          
##    Detection Prevalence : 0.7914          
##       Balanced Accuracy : 0.7032          
##                                           
##        'Positive' Class : Current         
## 
rf_pred_prob <- predict(rf_tuned, test_data, type = "prob")[, "Left"]
roc_rf <- roc(as.numeric(test_data$Status == "Left"), rf_pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_rf <- auc(roc_rf)
cat("Test Set AUC for Tuned Random Forest:", auc_rf, "\n")
## Test Set AUC for Tuned Random Forest: 0.831013
varImp(rf_tuned) %>% plot()

Explaination

We ended up having around a 90% sensitivity, which means it correctly predicts 90% of people that are going to stay. We have a 50% specificity which means it can correctly predict around 50% of the people that are going to be leaving. Our balanced accuracy is around 70% which is the average of both. Our Kappa was .44 which indicates that we can confidently assume that the Random Forest is better than just randomly guessing.

Final Model

We take our most successful parameters and plot features by importance.

best_params <- rf_tuned$bestTune

set.seed(123)
final_rf_model <- randomForest(
  Status ~ .,
  data = train_data,
  mtry = best_params$mtry,
  ntree = 100,
  importance = TRUE
)

var_imp <- importance(final_rf_model)
var_imp_df <- data.frame(
  Feature = rownames(var_imp),
  Importance = var_imp[, 1]
) %>%
  arrange(desc(Importance))

ggplot(var_imp_df[1:10, ], aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 10 Most Important Features - Random Forest",
    x = "Features",
    y = "Importance (Mean Decrease in Accuracy)"
  ) +
  theme_minimal()

Explaination

We see in the graph that Total Charges, Tenure, and Monthly Charges are the top 3 most important things in predicting customer turnover. Service quality is the rest of the criteria that we see in the graph. While Service quality is important, we can assume that taking out tenure and fees it would be virtually impossible to predict who will leave and who will stay.

Revenue and Risk Retention Strategy.

We identify predicted “churners”, calculate revenue lost from at-risk customers, and assume a $50 incentive cost per at risk customer.

churned_customers <- test_data[rf_pred == "Left", ]


predicted_loss <- sum(churned_customers$MonthlyCharges)
cat("Predicted monthly revenue loss:", predicted_loss, "\n")
## Predicted monthly revenue loss: 33436.55
retention_cost <- nrow(churned_customers) * 50
benefit <- predicted_loss - retention_cost

cat("Net benefit of retention plan:", benefit, "\n")
## Net benefit of retention plan: 11586.55

Explaination

We see that the company is set to lose around 33,000 per month in revenue with no intervention. With our plan the company is set to make around 11,000 more in revenue. This extra 11000 comes from the added money from the plans that are retained, while taking into account the extra money that each customer is is saving.

Conclusion

We looked into the company’s data and manipulated it to show us the demographics that are most likely to stay, and who are most likely to leave. With this data we set up a few models to determine what is most important to customers who leave, and to predict who could be leaving in the future. We discovered that the most important things to the customer are fees, tenure, and services. Using this we were able to create a plan for the company to implement that would involve giving discounts to people who are most vulnerable to leaving. Our plan involved giving a little money back to the customers in order to retain their subscriptions, and ultimately could end up making them money instead of losing them money like they are now.