I was hired as a data scientist to research Regork’s problem with customer churn. Many customers are leaving for competitors and it is causing a lot of financial losses for Regork. Acquiring new customers is generally more expensive than keeping a current one, which is why customer churn can become a long term issue for Regork. My goal is to build a predictive model to identify which customers are likely to leave and Regork can use that model to take action before the predicted customers actually leave. After my analysis and creation of a predicitve model, I will provide the data I found and some recommendations for Regork to solve their issue with customer churn.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(rpart)
library(rpart.plot)
library(readr)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
data <- read_csv("Desktop/customer_retention.csv")
## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data$TotalCharges <- as.numeric(data$TotalCharges)
data <- data %>% drop_na(TotalCharges)
data <- data %>%
mutate(across(c(Status, Gender, Partner, Dependents, PhoneService, MultipleLines,
InternetService, OnlineSecurity, OnlineBackup, DeviceProtection,
TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling,
PaymentMethod), as.factor))
data %>%
count(Status) %>%
mutate(percentage = n / sum(n) * 100)
## # A tibble: 2 × 3
## Status n percentage
## <fct> <int> <dbl>
## 1 Current 5132 73.4
## 2 Left 1856 26.6
In this chart, I noticed that tenure is has a lot of association with customer churn. Customers with short tenure, such as a couple of months, show larger churn rates than the customers who have very long tenure with Regork. There is a less than 5% churn rate for the customers who have been with Regork the longest.
ggplot(data, aes(x = Tenure, fill = Status)) +
geom_histogram(position = "fill", bins = 30) +
labs(title = "Churn Rate by Tenure", y = "Proportion")
Based off of this chart, contract type is a major predictor in customer churn. It shows that customers with month-to-month contracts are more likely to leave compared to those with year long contracts. Month-to-month allows the customers to opt out quicker than those who are bound to the company by long-term contracts. Customers should be incentivized to sign a long term contract. This can include discounts, rewards for loyalty or premium services at a discount.
ggplot(data, aes(x = Contract, fill = Status)) +
geom_bar(position = "fill") +
labs(title = "Churn Rate by Contract Type", y = "Proportion")
set.seed(123)
splitIndex <- createDataPartition(data$Status, p = 0.8, list = FALSE)
trainData <- data[splitIndex, ]
testData <- data[-splitIndex, ]
Here I will begin to implement a new model for Regork
log_model <- glm(Status ~ ., data = trainData, family = binomial)
tree_model <- rpart(Status ~ ., data = trainData, method = "class")
rpart.plot(tree_model)
This decision tree model highlights both contract type and tenure and it can be used as a prediction model to predict the chances of their current customers leaving. This tree matches the data that we saw in the Tenure vs Churn chart and the Contract Type vs Churn chart. The alignment between exploratory data analysis and machine learning make this model worth using.
rf_model <- randomForest(Status ~ ., data = trainData, ntree = 500, importance = TRUE)
The exploratory data analysis charts allowed us to visualize the relationships between individual predictors and customer churn, while the logistic regression model allows us to see all variables and quantifies it to predict its effect on customer churn.
log_pred_prob <- predict(log_model, newdata = testData, type = "response")
log_pred <- ifelse(log_pred_prob > 0.5, "Left", "Current")
confusionMatrix(factor(log_pred), testData$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Current Left
## Current 917 158
## Left 109 213
##
## Accuracy : 0.8089
## 95% CI : (0.7873, 0.8292)
## No Information Rate : 0.7344
## P-Value [Acc > NIR] : 4.452e-11
##
## Kappa : 0.4885
##
## Mcnemar's Test P-Value : 0.003308
##
## Sensitivity : 0.8938
## Specificity : 0.5741
## Pos Pred Value : 0.8530
## Neg Pred Value : 0.6615
## Prevalence : 0.7344
## Detection Rate : 0.6564
## Detection Prevalence : 0.7695
## Balanced Accuracy : 0.7339
##
## 'Positive' Class : Current
##
log_roc <- roc(testData$Status, as.numeric(log_pred_prob))
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
auc(log_roc)
## Area under the curve: 0.8363
tree_pred <- predict(tree_model, newdata = testData, type = "class")
confusionMatrix(tree_pred, testData$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Current Left
## Current 912 192
## Left 114 179
##
## Accuracy : 0.781
## 95% CI : (0.7583, 0.8024)
## No Information Rate : 0.7344
## P-Value [Acc > NIR] : 3.413e-05
##
## Kappa : 0.3981
##
## Mcnemar's Test P-Value : 1.074e-05
##
## Sensitivity : 0.8889
## Specificity : 0.4825
## Pos Pred Value : 0.8261
## Neg Pred Value : 0.6109
## Prevalence : 0.7344
## Detection Rate : 0.6528
## Detection Prevalence : 0.7903
## Balanced Accuracy : 0.6857
##
## 'Positive' Class : Current
##
tree_prob <- predict(tree_model, newdata = testData, type = "prob")[, "Left"]
tree_roc <- roc(testData$Status, tree_prob)
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
auc(tree_roc)
## Area under the curve: 0.8074
rf_pred <- predict(rf_model, newdata = testData)
confusionMatrix(rf_pred, testData$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Current Left
## Current 916 188
## Left 110 183
##
## Accuracy : 0.7867
## 95% CI : (0.7643, 0.8079)
## No Information Rate : 0.7344
## P-Value [Acc > NIR] : 3.574e-06
##
## Kappa : 0.4138
##
## Mcnemar's Test P-Value : 8.177e-06
##
## Sensitivity : 0.8928
## Specificity : 0.4933
## Pos Pred Value : 0.8297
## Neg Pred Value : 0.6246
## Prevalence : 0.7344
## Detection Rate : 0.6557
## Detection Prevalence : 0.7903
## Balanced Accuracy : 0.6930
##
## 'Positive' Class : Current
##
rf_prob <- predict(rf_model, newdata = testData, type = "prob")[, "Left"]
rf_roc <- roc(testData$Status, rf_prob)
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
auc(rf_roc)
## Area under the curve: 0.8136
varImpPlot(rf_model)
The random forest model is very straight forward and it also confirms
that Contract Type, Tenure, and Monthly Charges are the three main
factors controlling customer churn. This model allows analysts at Regork
to identify the problems and target certain customers with efforts to
bring them back into the company. I would say that is model is the most
helpful for Regork as it allows them to focus their resources in the
right parts of their services.
plot(log_roc, col = "blue", lwd = 2, main = "ROC Curves Comparison")
plot(tree_roc, col = "green", lwd = 2, add = TRUE)
plot(rf_roc, col = "red", lwd = 2, add = TRUE)
legend("bottomright",
legend = c(paste("Logistic Regression (AUC =", round(auc(log_roc), 2), ")"),
paste("Decision Tree (AUC =", round(auc(tree_roc), 2), ")"),
paste("Random Forest (AUC =", round(auc(rf_roc), 2), ")")),
col = c("blue", "green", "red"),
lwd = 2)
I included this chart to show the effectiveness of all of the models that I used during my research. Every model performed very well with AUC values. This chart also gives Regork the opportunity to choose whichever model they think suits their needs the best and will help them with their efforts in retention.
testData$predicted_status <- predict(rf_model, testData)
churn_customers <- testData %>% filter(predicted_status == "Left")
nrow(churn_customers)
## [1] 294
Customers predicted to leave : 294
predicted_loss_per_month <- sum(churn_customers$MonthlyCharges)
predicted_loss_per_month
## [1] 22984.6
Monthly Revenue Loss Estimation : $22,984.60
cost_per_customer <- 10
total_incentive_cost_per_month <- nrow(churn_customers) * cost_per_customer
net_monthly_savings <- predicted_loss_per_month - total_incentive_cost_per_month
net_monthly_savings
## [1] 20044.6
Monthly Savings After Incentive : $20,044.60
In my analysis, my main goal was to help Regork find the key factors in their problem with customer churn. My data analysis found that the leading factors were: Contract Type, Tenure, and Monthly Charges. Customers that had short tenure, and were on monthly contracts were more likely to leave Regork’s system. Since all models performed with nearly the same AUC value, they’re all good choices to use to predict if customers will leave or not. After analyzing which factors affected churn the most, I gave a recommendation to offer a $10 discount per month to customers on a month-to-month contract with shorter tenure. Focusing on the at-risk customers, it is best to offer them a financial incentive that can help keep them with Regork. This discount would reduce churn, but also raise customer satisfaction with the company. After looking at the decision tree, this discount has the chance of reducing customer churn by nearly 30%. This discount could also lead the month-to-month customers to a long-term contract with tenure. In conclusion, if Regork were to implement these machine learning models, they would effectively be able to reduce their churn rate and save themselves from significant revenue loss.