This document outlines a machine learning workflow to predict customer turnover. We start by loading the necessary packages and reading the data. We then go into some basic data manipulations to make the data more malleable for us to use later on. We then plan to make a few models to determine key insights on the customers. with these insights we can make a plan for the company to use in order to further advance business.
Load the required libraries and read the dataset. Make sure that
character columns are factors, convert TotalCharges to
numeric values, and handle missing values.
# Load the required libraries for the project
library(dplyr)
library(ggplot2)
library(caret)
library(randomForest)
library(pROC)
library(rpart)
library(tidymodels)
library(vip)
# Read the dataset
data <- read.csv("customer_retention.csv")
# Convert all the columns to factors, convert to numeric and get rid of any NAs
data <- data %>% mutate(across(where(is.character), as.factor))
data$TotalCharges <- as.numeric(as.character(data$TotalCharges))
data <- na.omit(data)
Looking at “Status”, we must determine both the distribution and proportions of churn and retention rate in each category. This is done to gather a baseline.
table(data$Status)
##
## Current Left
## 5132 1856
prop.table(table(data$Status))
##
## Current Left
## 0.7344018 0.2655982
Use ggplot to create visualizations for both tenure distribution and internet service by status. This gives us visuals of how long people stay with our internet service, and how each service performs for us.
ggplot(data, aes(x = Tenure, fill = Status)) +
geom_histogram(binwidth = 5, position = "dodge") +
labs(title = "Tenure Distribution by Status", x = "Tenure (Months)", y = "Count")
ggplot(data, aes(x = InternetService, fill = Status)) +
geom_bar(position = "dodge") +
labs(title = "Internet Service Type by Status", x = "Internet Service", y = "Count")
We can see by the 1st graph that the largest group of current customers has been with the company for 70 months or more, and the distribution is relatively level before that. For the customers that left the overwhelming majority left within the first 20 months of service.
The services graph shows us that most of the people with the DSL service are keeping their plan, while a little under half of the fiber optic customers have chosen to go with a different provider.
set.seed(123)
train_index <- createDataPartition(data$Status, p = 0.7, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
We believe using the 70/30 split will allow us to get the best estimate for reality while avoiding overfitting, which would defeat the purpose of the data.
# Logistic Regression
log_model <- glm(Status ~ ., data = train_data, family = "binomial")
log_pred <- predict(log_model, test_data, type = "response")
log_pred_class <- ifelse(log_pred > 0.5, "Left", "Current")
confusionMatrix(as.factor(log_pred_class), test_data$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Current Left
## Current 1363 236
## Left 176 320
##
## Accuracy : 0.8033
## 95% CI : (0.7857, 0.8202)
## No Information Rate : 0.7346
## P-Value [Acc > NIR] : 1.21e-13
##
## Kappa : 0.4776
##
## Mcnemar's Test P-Value : 0.003652
##
## Sensitivity : 0.8856
## Specificity : 0.5755
## Pos Pred Value : 0.8524
## Neg Pred Value : 0.6452
## Prevalence : 0.7346
## Detection Rate : 0.6506
## Detection Prevalence : 0.7632
## Balanced Accuracy : 0.7306
##
## 'Positive' Class : Current
##
roc_log <- roc(as.numeric(test_data$Status == "Left"), as.numeric(log_pred))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(roc_log)
## Area under the curve: 0.8344
# Decision Tree
tree_model <- rpart(Status ~ ., data = train_data, method = "class")
tree_pred <- predict(tree_model, test_data, type = "class")
confusionMatrix(tree_pred, test_data$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Current Left
## Current 1399 305
## Left 140 251
##
## Accuracy : 0.7876
## 95% CI : (0.7694, 0.8049)
## No Information Rate : 0.7346
## P-Value [Acc > NIR] : 1.115e-08
##
## Kappa : 0.3982
##
## Mcnemar's Test P-Value : 7.584e-15
##
## Sensitivity : 0.9090
## Specificity : 0.4514
## Pos Pred Value : 0.8210
## Neg Pred Value : 0.6419
## Prevalence : 0.7346
## Detection Rate : 0.6678
## Detection Prevalence : 0.8134
## Balanced Accuracy : 0.6802
##
## 'Positive' Class : Current
##
tree_pred_prob <- predict(tree_model, test_data, type = "prob")[, 2]
roc_tree <- roc(as.numeric(test_data$Status == "Left"), tree_pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(roc_tree)
## Area under the curve: 0.7953
# Random Forest
rf_model <- randomForest(Status ~ ., data = train_data, importance = TRUE)
rf_pred <- predict(rf_model, test_data)
confusionMatrix(rf_pred, test_data$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Current Left
## Current 1368 275
## Left 171 281
##
## Accuracy : 0.7871
## 95% CI : (0.7689, 0.8045)
## No Information Rate : 0.7346
## P-Value [Acc > NIR] : 1.502e-08
##
## Kappa : 0.4193
##
## Mcnemar's Test P-Value : 1.076e-06
##
## Sensitivity : 0.8889
## Specificity : 0.5054
## Pos Pred Value : 0.8326
## Neg Pred Value : 0.6217
## Prevalence : 0.7346
## Detection Rate : 0.6530
## Detection Prevalence : 0.7842
## Balanced Accuracy : 0.6971
##
## 'Positive' Class : Current
##
Logistic regression had the highest area under the curve which tells us that it is best at determining weather a customer chooses to stay or not. It also had the highest accuracy indicating that Logistic regression is our best option when predicting not only someones likelihood to stay, but also who is most likely to stay.
The Decision tree had slightly lower statistics than Logistic Regression. We see that since sensitivity is a positive rate and specificity is a negative rate, the decision tree is very good at predicting who stays, but not so good at predicting who leaves.
We can assume using the confusion matrix that AUC on the Random Forest had similar if not slightly lower performace compared to Regression.
We have to define the control, the paramater, and evaluate the model. Parameter tuning for a Random Forest model is intended to find the best combination that will yield our best results. Since a random forest is multiple decision trees, it allows for more customization and tuning.
# Define the control
control <- trainControl(
method = "cv",
number = 5,
summaryFunction = twoClassSummary,
classProbs = TRUE
)
# Define the parameter
tune_grid <- expand.grid(
mtry = c(2, 4, 6, 8),
splitrule = "gini",
min.node.size = c(1, 5, 10)
)
set.seed(123)
rf_tuned <- train(
Status ~ .,
data = train_data,
method = "ranger",
metric = "ROC",
trControl = control,
tuneGrid = tune_grid,
importance = "impurity"
)
print(rf_tuned$bestTune)
## mtry splitrule min.node.size
## 6 4 gini 10
# Evaluate tuned model
rf_pred <- predict(rf_tuned, test_data)
confusionMatrix(rf_pred, test_data$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Current Left
## Current 1384 274
## Left 155 282
##
## Accuracy : 0.7952
## 95% CI : (0.7773, 0.8123)
## No Information Rate : 0.7346
## P-Value [Acc > NIR] : 6.457e-11
##
## Kappa : 0.4363
##
## Mcnemar's Test P-Value : 1.219e-08
##
## Sensitivity : 0.8993
## Specificity : 0.5072
## Pos Pred Value : 0.8347
## Neg Pred Value : 0.6453
## Prevalence : 0.7346
## Detection Rate : 0.6606
## Detection Prevalence : 0.7914
## Balanced Accuracy : 0.7032
##
## 'Positive' Class : Current
##
rf_pred_prob <- predict(rf_tuned, test_data, type = "prob")[, "Left"]
roc_rf <- roc(as.numeric(test_data$Status == "Left"), rf_pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_rf <- auc(roc_rf)
cat("Test Set AUC for Tuned Random Forest:", auc_rf, "\n")
## Test Set AUC for Tuned Random Forest: 0.831013
varImp(rf_tuned) %>% plot()
We ended up having around a 90% sensitivity, which means it correctly predicts 90% of people that are going to stay. We have a 50% specificity which means it can correctly predict around 50% of the people that are going to be leaving. Our balanced accuracy is around 70% which is the average of both. Our Kappa was .44 which indicates that we can confidently assume that the Random Forest is better than just randomly guessing.
We take our most successful parameters and plot features by importance.
best_params <- rf_tuned$bestTune
set.seed(123)
final_rf_model <- randomForest(
Status ~ .,
data = train_data,
mtry = best_params$mtry,
ntree = 100,
importance = TRUE
)
var_imp <- importance(final_rf_model)
var_imp_df <- data.frame(
Feature = rownames(var_imp),
Importance = var_imp[, 1]
) %>%
arrange(desc(Importance))
ggplot(var_imp_df[1:10, ], aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(
title = "Top 10 Most Important Features - Random Forest",
x = "Features",
y = "Importance (Mean Decrease in Accuracy)"
) +
theme_minimal()
We see in the graph that Total Charges, Tenure, and Monthly Charges are the top 3 most important things in predicting customer turnover. Service quality is the rest of the criteria that we see in the graph. While Service quality is important, we can assume that taking out tenure and fees it would be virtually impossible to predict who will leave and who will stay.
We identify predicted “churners”, calculate revenue lost from at-risk customers, and assume a $50 incentive cost per at risk customer.
churned_customers <- test_data[rf_pred == "Left", ]
predicted_loss <- sum(churned_customers$MonthlyCharges)
cat("Predicted monthly revenue loss:", predicted_loss, "\n")
## Predicted monthly revenue loss: 33436.55
retention_cost <- nrow(churned_customers) * 50
benefit <- predicted_loss - retention_cost
cat("Net benefit of retention plan:", benefit, "\n")
## Net benefit of retention plan: 11586.55
We see that the company is set to lose around 33,000 per month in revenue with no intervention. With our plan the company is set to make around 11,000 more in revenue. This extra 11000 comes from the added money from the plans that are retained, while taking into account the extra money that each customer is is saving.
We looked into the company’s data and manipulated it to show us the demographics that are most likely to stay, and who are most likely to leave. With this data we set up a few models to determine what is most important to customers who leave, and to predict who could be leaving in the future. We discovered that the most important things to the customer are fees, tenure, and services. Using this we were able to create a plan for the company to implement that would involve giving discounts to people who are most vulnerable to leaving. Our plan involved giving a little money back to the customers in order to retain their subscriptions, and ultimately could end up making them money instead of losing them money like they are now.