Introduction

I was hired as a data scientist to research Regork’s problem with customer churn. Many customers are leaving for competitors and it is causing a lot of financial losses for Regork. Acquiring new customers is generally more expensive than keeping a current one, which is why customer churn can become a long term issue for Regork. My goal is to build a predictive model to identify which customers are likely to leave and Regork can use that model to take action before the predicted customers actually leave. After my analysis and creation of a predicitve model, I will provide the data I found and some recommendations for Regork to solve their issue with customer churn.

Load Neccesary Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(rpart)
library(rpart.plot)
library(readr)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Load the “Customer Retention” dataset

data <- read_csv("Desktop/customer_retention.csv")
## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl  (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Get rid of missing values and make sure all data is numeric

data$TotalCharges <- as.numeric(data$TotalCharges)
data <- data %>% drop_na(TotalCharges)

Convert appropriate variables to factors

data <- data %>%
  mutate(across(c(Status, Gender, Partner, Dependents, PhoneService, MultipleLines,
                  InternetService, OnlineSecurity, OnlineBackup, DeviceProtection,
                  TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling,
                  PaymentMethod), as.factor))

Exploratory Data Analysis

data %>%
  count(Status) %>%
  mutate(percentage = n / sum(n) * 100)
## # A tibble: 2 × 3
##   Status      n percentage
##   <fct>   <int>      <dbl>
## 1 Current  5132       73.4
## 2 Left     1856       26.6

Tenure vs Churn

In this chart, I noticed that tenure is has a lot of association with customer churn. Customers with short tenure, such as a couple of months, show larger churn rates than the customers who have very long tenure with Regork. There is a less than 5% churn rate for the customers who have been with Regork the longest.

ggplot(data, aes(x = Tenure, fill = Status)) +
  geom_histogram(position = "fill", bins = 30) +
  labs(title = "Churn Rate by Tenure", y = "Proportion")

Contract Type vs Churn

Based off of this chart, contract type is a major predictor in customer churn. It shows that customers with month-to-month contracts are more likely to leave compared to those with year long contracts. Month-to-month allows the customers to opt out quicker than those who are bound to the company by long-term contracts. Customers should be incentivized to sign a long term contract. This can include discounts, rewards for loyalty or premium services at a discount.

ggplot(data, aes(x = Contract, fill = Status)) +
  geom_bar(position = "fill") +
  labs(title = "Churn Rate by Contract Type", y = "Proportion")

Train/Test Split

set.seed(123)
splitIndex <- createDataPartition(data$Status, p = 0.8, list = FALSE)
trainData <- data[splitIndex, ]
testData <- data[-splitIndex, ]

Model Building

Here I will begin to implement a new model for Regork

log_model <- glm(Status ~ ., data = trainData, family = binomial)

Decision Tree

tree_model <- rpart(Status ~ ., data = trainData, method = "class")
rpart.plot(tree_model)

This decision tree model highlights both contract type and tenure and it can be used as a prediction model to predict the chances of their current customers leaving. This tree matches the data that we saw in the Tenure vs Churn chart and the Contract Type vs Churn chart. The alignment between exploratory data analysis and machine learning make this model worth using.

Random Forest

rf_model <- randomForest(Status ~ ., data = trainData, ntree = 500, importance = TRUE) 

Logistic Regression

The exploratory data analysis charts allowed us to visualize the relationships between individual predictors and customer churn, while the logistic regression model allows us to see all variables and quantifies it to predict its effect on customer churn.

log_pred_prob <- predict(log_model, newdata = testData, type = "response")
log_pred <- ifelse(log_pred_prob > 0.5, "Left", "Current")
confusionMatrix(factor(log_pred), testData$Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Current Left
##    Current     917  158
##    Left        109  213
##                                           
##                Accuracy : 0.8089          
##                  95% CI : (0.7873, 0.8292)
##     No Information Rate : 0.7344          
##     P-Value [Acc > NIR] : 4.452e-11       
##                                           
##                   Kappa : 0.4885          
##                                           
##  Mcnemar's Test P-Value : 0.003308        
##                                           
##             Sensitivity : 0.8938          
##             Specificity : 0.5741          
##          Pos Pred Value : 0.8530          
##          Neg Pred Value : 0.6615          
##              Prevalence : 0.7344          
##          Detection Rate : 0.6564          
##    Detection Prevalence : 0.7695          
##       Balanced Accuracy : 0.7339          
##                                           
##        'Positive' Class : Current         
## 
log_roc <- roc(testData$Status, as.numeric(log_pred_prob))
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
auc(log_roc)
## Area under the curve: 0.8363
tree_pred <- predict(tree_model, newdata = testData, type = "class")
confusionMatrix(tree_pred, testData$Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Current Left
##    Current     912  192
##    Left        114  179
##                                           
##                Accuracy : 0.781           
##                  95% CI : (0.7583, 0.8024)
##     No Information Rate : 0.7344          
##     P-Value [Acc > NIR] : 3.413e-05       
##                                           
##                   Kappa : 0.3981          
##                                           
##  Mcnemar's Test P-Value : 1.074e-05       
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.4825          
##          Pos Pred Value : 0.8261          
##          Neg Pred Value : 0.6109          
##              Prevalence : 0.7344          
##          Detection Rate : 0.6528          
##    Detection Prevalence : 0.7903          
##       Balanced Accuracy : 0.6857          
##                                           
##        'Positive' Class : Current         
## 
tree_prob <- predict(tree_model, newdata = testData, type = "prob")[, "Left"]
tree_roc <- roc(testData$Status, tree_prob)
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
auc(tree_roc)
## Area under the curve: 0.8074
rf_pred <- predict(rf_model, newdata = testData)
confusionMatrix(rf_pred, testData$Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Current Left
##    Current     916  188
##    Left        110  183
##                                           
##                Accuracy : 0.7867          
##                  95% CI : (0.7643, 0.8079)
##     No Information Rate : 0.7344          
##     P-Value [Acc > NIR] : 3.574e-06       
##                                           
##                   Kappa : 0.4138          
##                                           
##  Mcnemar's Test P-Value : 8.177e-06       
##                                           
##             Sensitivity : 0.8928          
##             Specificity : 0.4933          
##          Pos Pred Value : 0.8297          
##          Neg Pred Value : 0.6246          
##              Prevalence : 0.7344          
##          Detection Rate : 0.6557          
##    Detection Prevalence : 0.7903          
##       Balanced Accuracy : 0.6930          
##                                           
##        'Positive' Class : Current         
## 
rf_prob <- predict(rf_model, newdata = testData, type = "prob")[, "Left"]
rf_roc <- roc(testData$Status, rf_prob)
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
auc(rf_roc)
## Area under the curve: 0.8136
varImpPlot(rf_model)

The random forest model is very straight forward and it also confirms that Contract Type, Tenure, and Monthly Charges are the three main factors controlling customer churn. This model allows analysts at Regork to identify the problems and target certain customers with efforts to bring them back into the company. I would say that is model is the most helpful for Regork as it allows them to focus their resources in the right parts of their services.

Plotting All 3 Curves Together

plot(log_roc, col = "blue", lwd = 2, main = "ROC Curves Comparison")
plot(tree_roc, col = "green", lwd = 2, add = TRUE)
plot(rf_roc, col = "red", lwd = 2, add = TRUE)

legend("bottomright",
       legend = c(paste("Logistic Regression (AUC =", round(auc(log_roc), 2), ")"),
                  paste("Decision Tree (AUC =", round(auc(tree_roc), 2), ")"),
                  paste("Random Forest (AUC =", round(auc(rf_roc), 2), ")")),
       col = c("blue", "green", "red"),
       lwd = 2)

I included this chart to show the effectiveness of all of the models that I used during my research. Every model performed very well with AUC values. This chart also gives Regork the opportunity to choose whichever model they think suits their needs the best and will help them with their efforts in retention.

Business Analyst ; Customers Predicted to Leave

testData$predicted_status <- predict(rf_model, testData)
churn_customers <- testData %>% filter(predicted_status == "Left")
nrow(churn_customers)
## [1] 294

Customers predicted to leave : 294

Monthly Revenue Loss Estimation

predicted_loss_per_month <- sum(churn_customers$MonthlyCharges)
predicted_loss_per_month
## [1] 22984.6

Monthly Revenue Loss Estimation : $22,984.60

My Incentive Proposal : $10 Discount per Month

cost_per_customer <- 10
total_incentive_cost_per_month <- nrow(churn_customers) * cost_per_customer
net_monthly_savings <- predicted_loss_per_month - total_incentive_cost_per_month

net_monthly_savings
## [1] 20044.6

Monthly Savings After Incentive : $20,044.60

Conclusion

In my analysis, my main goal was to help Regork find the key factors in their problem with customer churn. My data analysis found that the leading factors were: Contract Type, Tenure, and Monthly Charges. Customers that had short tenure, and were on monthly contracts were more likely to leave Regork’s system. Since all models performed with nearly the same AUC value, they’re all good choices to use to predict if customers will leave or not. After analyzing which factors affected churn the most, I gave a recommendation to offer a $10 discount per month to customers on a month-to-month contract with shorter tenure. Focusing on the at-risk customers, it is best to offer them a financial incentive that can help keep them with Regork. This discount would reduce churn, but also raise customer satisfaction with the company. After looking at the decision tree, this discount has the chance of reducing customer churn by nearly 30%. This discount could also lead the month-to-month customers to a long-term contract with tenure. In conclusion, if Regork were to implement these machine learning models, they would effectively be able to reduce their churn rate and save themselves from significant revenue loss.