Introduction
Customer retention is a critical focus for Regork Telecom, as acquiring new customers often incurs significantly higher costs than retaining existing ones. This project aims to develop a predictive model to identify customers at risk of leaving, enabling the company to take proactive measures to improve retention and reduce churn. By analyzing customer data, we explore key drivers of churn, such as contract type, tenure, monthly charges, and add-on services, to provide actionable insights for targeted interventions. The project involves evaluating multiple machine learning models, including logistic regression, decision trees, and Random Forest, to select the most effective approach for predicting churn. Ultimately, this analysis will empower Regork Telecom to make data-driven decisions, design tailored retention strategies, and improve long-term customer loyalty and profitability.
Synopsis
This project focuses on using predictive analytics to identify factors contributing to customer churn and developing strategies to improve customer retention for Regork Telecom. By analyzing historical customer data, we built and evaluated machine learning models, including Logistic Regression, Decision Tree, and Random Forest, to predict the likelihood of customers leaving. Key predictors such as tenure, total charges, monthly charges, and contract type were identified as significant drivers of churn. Using these insights, a cost-benefit analysis was conducted to propose an incentive plan targeting high-risk customers, balancing the cost of retention strategies with the revenue saved. The findings provide actionable recommendations for reducing churn and improving long-term customer loyalty while demonstrating the value of data-driven decision-making in business operations.
# Load libraries
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.6 ✔ recipes 1.1.0
## ✔ dials 1.3.0 ✔ rsample 1.2.1
## ✔ dplyr 1.1.4 ✔ tibble 3.2.1
## ✔ ggplot2 3.5.1 ✔ tidyr 1.3.1
## ✔ infer 1.0.7 ✔ tune 1.2.1
## ✔ modeldata 1.4.0 ✔ workflows 1.1.4
## ✔ parsnip 1.2.1 ✔ workflowsets 1.1.0
## ✔ purrr 1.0.2 ✔ yardstick 1.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ lubridate 1.9.3 ✔ stringr 1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(baguette)
## Warning: package 'baguette' was built under R version 4.4.2
library(vip)
## Warning: package 'vip' was built under R version 4.4.2
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
library(pdp)
## Warning: package 'pdp' was built under R version 4.4.2
##
## Attaching package: 'pdp'
##
## The following object is masked from 'package:purrr':
##
## partial
library(here)
## here() starts at C:/Users/thoma/Desktop/Miners/Final
library(kernlab)
##
## Attaching package: 'kernlab'
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## alpha
##
## The following object is masked from 'package:scales':
##
## alpha
library(ggplot2)
library(ranger)
## Warning: package 'ranger' was built under R version 4.4.2
library(earth)
## Warning: package 'earth' was built under R version 4.4.2
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.4.2
## Loading required package: plotrix
##
## Attaching package: 'plotrix'
##
## The following object is masked from 'package:scales':
##
## rescale
library(dplyr)
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following objects are masked from 'package:yardstick':
##
## precision, recall, sensitivity, specificity
##
## The following object is masked from 'package:purrr':
##
## lift
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:ranger':
##
## importance
##
## The following object is masked from 'package:ggplot2':
##
## margin
##
## The following object is masked from 'package:dplyr':
##
## combine
library(rpart)
##
## Attaching package: 'rpart'
##
## The following object is masked from 'package:dials':
##
## prune
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.2
# Load dataset (here() resolves paths from the project root, so a hard-coded setwd() is unnecessary)
data <- read.csv(here("customer_retention.csv"))
# Check for missing values
colSums(is.na(data))
# Convert columns
data$TotalCharges <- as.numeric(as.character(data$TotalCharges)) # Convert TotalCharges to numeric
data <- data[!is.na(data$TotalCharges), ] # Remove rows with NA in TotalCharges
data$Status <- ifelse(data$Status == "Left", 1, 0) # Convert Status to binary: 1 = Churned, 0 = Retained
data$InternetService <- as.factor(data$InternetService) # Ensure InternetService is a factor
# Split data into train and test sets
set.seed(42)
trainIndex <- createDataPartition(data$Status, p = 0.7, list = FALSE)
trainData <- data[trainIndex, ]
testData <- data[-trainIndex, ]
# Fit logistic regression model
model <- glm(Status ~ InternetService + Tenure, data = trainData, family = binomial)
# Summary of the model
summary(model)
# Predictions
testData$PredictedProb <- predict(model, testData, type = "response")
testData$PredictedStatus <- ifelse(testData$PredictedProb > 0.5, 1, 0)
# Confusion matrix
conf_matrix <- confusionMatrix(as.factor(testData$PredictedStatus), as.factor(testData$Status))
print(conf_matrix)
# ROC Curve
roc_obj <- roc(testData$Status, testData$PredictedProb)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_obj, col = "blue")
auc_value <- auc(roc_obj)
print(paste("AUC:", round(auc_value, 3)))
# Example prediction: Ensure InternetService is treated as a factor
example <- data.frame(
InternetService = factor("Fiber optic", levels = levels(trainData$InternetService)), # Use the level name, not a numeric code
Tenure = 24
)
# Predict churn probability for the example
predicted_churn_prob <- predict(model, example, type = "response")
print(paste("Likelihood of churn for a customer with Fiber optic and 24 months tenure:", round(predicted_churn_prob, 2)))
#bar chart to show the distribution of Contract type and PaymentMethod
ggplot(data, aes(x = Contract, fill = PaymentMethod)) +
geom_bar(position = "dodge") +
labs(
title = "Distribution of Contract Type by Payment Method",
x = "Contract Type",
y = "Count",
fill = "Payment Method"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This graph shows the distribution of Contract Type (Month-to-month, One year, Two year) across various Payment Methods (Bank transfer, Credit card, Electronic check, and Mailed check). The majority of customers with Month-to-month contracts use Electronic check, which may indicate a higher churn risk due to the lack of commitment in their contract type and the flexibility of this payment method. In contrast, One-year and Two-year contracts are more evenly distributed across Bank transfer, Credit card, and Mailed check, suggesting that longer-term contracts attract customers with more stable payment methods. These insights emphasize the need to target Month-to-month customers using Electronic check for retention strategies, as they represent the most vulnerable group.
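To put a number on this pattern, churn rates can be cross-tabulated directly. This is a quick sketch using the dplyr verbs already loaded; it assumes Status has been recoded to 1 = Left, 0 = Current as above.
# Churn rate per 100 customers for each Contract / PaymentMethod combination
data %>%
  group_by(Contract, PaymentMethod) %>%
  summarise(
    Customers = n(),
    ChurnRate = mean(Status) * 100, # Status is 1 = Left after the earlier recoding
    .groups = "drop"
  ) %>%
  arrange(desc(ChurnRate))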
# Calculate percentages per 100 customers
status_summary <- data %>%
group_by(PaymentMethod, PaperlessBilling, Status) %>%
summarise(Count = n()) %>%
group_by(PaymentMethod, PaperlessBilling) %>%
mutate(Percentage = (Count / sum(Count)) * 100) %>%
mutate(Status = ifelse(Status == 1, "Left", "Current")) # Convert Status to descriptive labels
## `summarise()` has grouped output by 'PaymentMethod', 'PaperlessBilling'. You
## can override using the `.groups` argument.
# Create a grouped bar chart with percentages
ggplot(status_summary, aes(x = PaymentMethod, y = Percentage, fill = Status)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ PaperlessBilling, labeller = as_labeller(c(`Yes` = "Paperless Billing: Yes", `No` = "Paperless Billing: No"))) +
labs(
title = "Customer Status by Payment Method and Paperless Billing (Per 100 Customers)",
x = "Payment Method",
y = "Percentage of Customers",
fill = "Customer Status"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This graph illustrates Customer Status (Current or Left) by Payment Method and Paperless Billing preferences (Yes or No). For customers without Paperless Billing, the majority across all payment methods remain Current, with Electronic check showing a slightly higher proportion of churned customers compared to other methods. For customers with Paperless Billing, there is a notable increase in churn, particularly among those using Electronic check, which has the highest churn rate compared to other payment methods. These insights suggest that Electronic check users with Paperless Billing are at higher risk of leaving, highlighting the need for targeted retention strategies for this segment.
# Create a box plot for PaymentMethod vs. MonthlyCharges
ggplot(data, aes(x = PaymentMethod, y = MonthlyCharges, fill = PaymentMethod)) +
geom_boxplot() +
labs(
title = "Monthly Charges by Payment Method",
x = "Payment Method",
y = "Monthly Charges ($)",
fill = "Payment Method"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This graph displays the distribution of Monthly Charges across different Payment Methods. Customers using Electronic check have the widest range of monthly charges, indicating diverse usage patterns and a tendency for higher charges compared to other methods. Bank transfer (automatic) and Credit card (automatic) exhibit similar distributions, with a median monthly charge around $75, suggesting consistency in customer billing. In contrast, customers using Mailed check tend to have lower monthly charges and a narrower range, indicating they might belong to a more stable or lower-usage segment. These insights highlight that Electronic check users, particularly those with higher charges, may warrant closer attention for retention strategies.
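The medians and spreads read off the box plot can be checked numerically; the sketch below summarises MonthlyCharges by PaymentMethod using the same dplyr functions applied elsewhere in this report.
# Median and interquartile range of monthly charges for each payment method
data %>%
  group_by(PaymentMethod) %>%
  summarise(
    Median = median(MonthlyCharges),
    Q1 = quantile(MonthlyCharges, 0.25),
    Q3 = quantile(MonthlyCharges, 0.75),
    .groups = "drop"
  )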
Logistic Regression
# Split data into training and test sets
set.seed(42)
trainIndex <- createDataPartition(data$Status, p = 0.7, list = FALSE)
trainData <- data[trainIndex, ]
testData <- data[-trainIndex, ]
# Logistic Regression Model
log_model <- glm(Status ~ ., data = trainData, family = binomial)
# Predict probabilities and compute AUC
log_probs <- predict(log_model, testData, type = "response")
log_roc <- roc(testData$Status, log_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
log_auc <- auc(log_roc)
# Adjust threshold to 0.4 for predictions
log_preds <- ifelse(log_probs > 0.4, 1, 0)
# Confusion Matrix with adjusted threshold
log_confusion <- confusionMatrix(as.factor(log_preds), as.factor(testData$Status))
# Results
print(paste("AUC:", log_auc))
## [1] "AUC: 0.848297984988632"
print(log_confusion)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1293 188
## 1 236 379
##
## Accuracy : 0.7977
## 95% CI : (0.7799, 0.8147)
## No Information Rate : 0.7295
## P-Value [Acc > NIR] : 2.657e-13
##
## Kappa : 0.5007
##
## Mcnemar's Test P-Value : 0.02246
##
## Sensitivity : 0.8457
## Specificity : 0.6684
## Pos Pred Value : 0.8731
## Neg Pred Value : 0.6163
## Prevalence : 0.7295
## Detection Rate : 0.6169
## Detection Prevalence : 0.7066
## Balanced Accuracy : 0.7570
##
## 'Positive' Class : 0
##
The AUC (Area Under the Curve) of 0.848 indicates that the model has strong discriminative power: it would rank a randomly chosen churner above a randomly chosen retained customer roughly 85% of the time. A value closer to 1 reflects better performance, and this result shows the model is a dependable basis for predicting customer behavior.
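The 0.4 cutoff used above was chosen by hand; pROC can also suggest a data-driven threshold. The sketch below asks coords() for the point on the logistic-regression ROC curve that maximizes Youden's J (sensitivity + specificity - 1); the resulting cutoff depends on the fitted model, so treat it as illustrative.
# Threshold that maximizes Youden's J on the logistic regression ROC curve
coords(log_roc, "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))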
Decision Tree
tree_model <- rpart(Status ~ ., data = trainData, method = "class", control = rpart.control(cp = 0.01))
# Predict probabilities and compute AUC
tree_probs <- predict(tree_model, testData, type = "prob")[, 2]
tree_roc <- roc(testData$Status, tree_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
tree_auc <- auc(tree_roc)
# Confusion Matrix
tree_preds <- ifelse(tree_probs > 0.5, 1, 0)
tree_confusion <- confusionMatrix(as.factor(tree_preds), as.factor(testData$Status))
# Results
print(tree_auc)
## Area under the curve: 0.8037
print(tree_confusion)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1452 351
## 1 77 216
##
## Accuracy : 0.7958
## 95% CI : (0.7779, 0.8129)
## No Information Rate : 0.7295
## P-Value [Acc > NIR] : 1.213e-12
##
## Kappa : 0.3899
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9496
## Specificity : 0.3810
## Pos Pred Value : 0.8053
## Neg Pred Value : 0.7372
## Prevalence : 0.7295
## Detection Rate : 0.6927
## Detection Prevalence : 0.8602
## Balanced Accuracy : 0.6653
##
## 'Positive' Class : 0
##
# Plot
rpart.plot(
tree_model,
type = 2, # Type 2: All nodes are shown with decision rules
extra = 104, # Show class probabilities and percentage of observations
under = TRUE, # Show the number of samples below each node
faclen = 0, # Use full names for factor levels
cex = 0.7, # Scale text size for better readability
main = "Decision Tree for Predicting Customer Churn"
)
The decision tree identifies key factors influencing customer churn. The most critical factor is contract type, with customers on longer-term contracts (One year or Two year) being far less likely to churn compared to those on Month-to-month contracts. OnlineSecurity also plays a significant role, as customers without this service or with no internet service are more prone to churn. Financial variables, such as TotalCharges and MonthlyCharges, highlight that customers with higher total charges and lower monthly charges are more likely to stay. Additionally, customers who do not use PaperlessBilling are less likely to churn, possibly indicating they are more stable or less tech-savvy. These insights suggest targeted retention strategies should focus on Month-to-month customers, particularly those without OnlineSecurity or with higher monthly charges, while upselling longer-term contracts and additional services like OnlineSecurity could further reduce churn.
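These splits can be cross-checked against the importance scores rpart records during fitting; a short sketch using the tree_model object fitted above:
# Variable importance accumulated across the tree's splits (larger = more influential)
round(tree_model$variable.importance, 1)
# Cross-validated error at each candidate pruning level
printcp(tree_model)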
Random Forest
library(randomForest)
# Train an initial Random Forest (Status is still numeric 0/1 here, so randomForest
# fits a regression; the warning below flags this, and the model is refit as a
# classifier further down)
set.seed(42)
rf_model <- randomForest(Status ~ ., data = trainData, ntree = 500, mtry = 3, importance = TRUE)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
# Feature Importance
rf_importance <- randomForest::importance(rf_model) # Use the namespace to ensure proper method application
rf_importance_df <- as.data.frame(rf_importance)
rf_importance_df$Feature <- rownames(rf_importance_df)
rownames(rf_importance_df) <- NULL
# Print Feature Importance
print(rf_importance_df)
## %IncMSE IncNodePurity Feature
## 1 -4.081343 14.352451 Gender
## 2 9.834132 14.386356 SeniorCitizen
## 3 8.488376 13.390663 Partner
## 4 8.451933 12.384981 Dependents
## 5 46.529653 108.892088 Tenure
## 6 6.863222 4.247201 PhoneService
## 7 12.910203 15.133211 MultipleLines
## 8 27.464481 34.761308 InternetService
## 9 21.322103 34.329820 OnlineSecurity
## 10 14.951578 19.930471 OnlineBackup
## 11 11.973991 16.754778 DeviceProtection
## 12 24.462608 34.438968 TechSupport
## 13 12.966043 12.780881 StreamingTV
## 14 14.221430 12.085076 StreamingMovies
## 15 37.703764 61.166377 Contract
## 16 14.383146 17.621158 PaperlessBilling
## 17 18.149934 33.797039 PaymentMethod
## 18 32.634723 99.182373 MonthlyCharges
## 19 45.405396 113.866637 TotalCharges
# Ensure 'Status' is a factor in both training and test datasets
trainData$Status <- as.factor(trainData$Status)
testData$Status <- as.factor(testData$Status)
# Train the Random Forest Model for classification
set.seed(42)
rf_model <- randomForest(Status ~ ., data = trainData, ntree = 500, mtry = 3, importance = TRUE)
# Predict probabilities for the test dataset
rf_probs <- predict(rf_model, testData, type = "prob")[, 2] # Extract probabilities for the positive class
# Compute ROC and AUC
library(pROC)
roc_obj <- roc(testData$Status, rf_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value <- auc(roc_obj)
# Print AUC
print(paste("AUC:", round(auc_value, 3)))
## [1] "AUC: 0.837"
library(grid)
library(png)
# Embed the variable-importance plot, which was saved separately as a PNG
# because it did not render when this document was knitted
image_path <- "C:/Users/thoma/Desktop/Miners/Final/000010.png"
grid::grid.raster(png::readPNG(image_path))
This graph illustrates the feature importance for predicting customer churn based on the Mean Decrease in Accuracy metric. The most influential feature is Tenure, indicating that the length of time a customer has been with the company significantly impacts the model’s predictive accuracy. TotalCharges and Contract type are also critical, suggesting that financial and contractual stability play a major role in determining churn behavior. MonthlyCharges, TechSupport, and OnlineSecurity follow closely, showing that customers’ service usage and security add-ons contribute to churn risk. Features like Gender, Partner, and PhoneService have minimal importance, implying they provide little additional value in predicting churn. This insight highlights the need to focus retention strategies on customers with shorter tenures, higher charges, or month-to-month contracts, while leveraging add-on services like tech support and online security to improve customer loyalty.
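The embedded PNG can also be regenerated inside the document: randomForest provides varImpPlot(), and type = 1 selects the mean decrease in accuracy measure described above. This is a sketch assuming the classification rf_model fitted earlier in this section.
# Reproduce the accuracy-based importance plot directly from the fitted model
varImpPlot(rf_model, type = 1,
           main = "Feature Importance: Mean Decrease in Accuracy")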
# Load Required Libraries
library(randomForest)
library(ggplot2)
library(dplyr)
# Ensure `trainData` exists and has the necessary structure
if (!exists("trainData") || !"Status" %in% colnames(trainData)) {
stop("The dataset 'trainData' is not found or does not contain the column 'Status'.")
}
# Convert the response variable to a factor for classification
if (!is.factor(trainData$Status)) {
trainData$Status <- as.factor(trainData$Status)
}
# Train the Random Forest model
set.seed(42)
rf_model <- randomForest(Status ~ ., data = trainData, ntree = 500, mtry = 3, importance = TRUE)
# Extract Feature Importance
importance_raw <- randomForest::importance(rf_model)
# Check if importance extraction was successful
if (is.null(importance_raw)) {
stop("Feature importance extraction failed.")
}
# Convert to DataFrame and Add Feature Names
importance_df <- as.data.frame(importance_raw)
importance_df$Feature <- rownames(importance_df)
rownames(importance_df) <- NULL
# Ensure that `MeanDecreaseGini` column exists
if (!"MeanDecreaseGini" %in% colnames(importance_df)) {
stop("The 'MeanDecreaseGini' column is not found in the importance data.")
}
# Remove rows with NA values (if_all() replaces the deprecated across()-inside-filter() idiom)
importance_df <- importance_df %>% filter(if_all(everything(), ~ !is.na(.)))
# Sort by MeanDecreaseGini for Visualization
importance_df <- importance_df %>% arrange(desc(MeanDecreaseGini))
# Create a Bar Chart for MeanDecreaseGini
gini_plot <- ggplot(importance_df, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_bar(stat = "identity", fill = "lightgreen") +
coord_flip() +
labs(
title = "Feature Importance: Mean Decrease in Impurity Measure",
x = "Features",
y = "Mean Decrease in Impurity Measure"
) +
theme_minimal()
# Print the Plot
print(gini_plot)
This graph illustrates the feature importance of variables in the Random Forest model based on their contribution to reducing impurity (mean decrease in the Gini Index). The most influential features are TotalCharges, Tenure, and MonthlyCharges, indicating that these financial and engagement metrics play a critical role in predicting customer churn. Additionally, variables like Contract and InternetService are also significant, highlighting the importance of subscription types and service offerings in retaining customers. Features such as TechSupport, OnlineSecurity, and PaymentMethod further contribute to churn predictions, suggesting that value-added services and payment preferences influence customer behavior. On the other hand, features like StreamingMovies, Dependents, and PhoneService show lower importance, implying they have minimal impact on churn outcomes. This information enables Regork to prioritize strategies focused on customers with high charges, short tenures, or flexible contracts, ultimately improving retention efforts by addressing key churn drivers.
Optimal Model
The optimal model for predicting customer churn is Logistic Regression, as it achieves the highest AUC (0.848), indicating the best ability to distinguish between churned and retained customers. While Random Forest also performs well, with an AUC of 0.837, Logistic Regression slightly outperforms it in this case and would be the preferred model for this analysis.
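For reference, the three AUCs computed above can be collected into one table; this sketch simply reuses the log_auc, tree_auc, and auc_value objects from the earlier chunks.
# Side-by-side comparison of the models evaluated in this report
tibble(
  Model = c("Logistic Regression", "Decision Tree", "Random Forest"),
  AUC = c(as.numeric(log_auc), as.numeric(tree_auc), as.numeric(auc_value))
) %>%
  arrange(desc(AUC))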
Relative Importance of Predictors and Focus Areas for Retention
Based on the analysis, the most influential predictors of customer churn are TotalCharges, Tenure, and MonthlyCharges, followed by Contract type, OnlineSecurity, and TechSupport. As a business manager, I would focus on these critical factors to reduce churn. For example, customers with shorter tenures and month-to-month contracts are at a higher risk of leaving, so offering incentives such as discounts for signing longer-term contracts could improve retention. Additionally, customers with higher monthly charges or those without OnlineSecurity or TechSupport services might benefit from bundled service promotions that enhance perceived value and reduce their financial burden. These actions align with the insights derived from the model, targeting the factors most likely to influence customer decisions.
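The size and value of this highest-risk segment can be estimated directly from the data. The sketch below assumes the usual level names for this dataset ("Month-to-month" for Contract; OnlineSecurity levels other than "Yes" cover both "No" and "No internet service").
# Month-to-month customers without OnlineSecurity: the segment flagged above
data %>%
  filter(Contract == "Month-to-month", OnlineSecurity != "Yes") %>%
  summarise(
    Customers = n(),
    AvgMonthlyCharge = mean(MonthlyCharges),
    ChurnRatePct = mean(Status) * 100 # Status is still numeric 0/1 in `data`
  )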
Predicted Revenue Loss and Retention Proposal
Using the Random Forest model's test-set predictions, we identified all customers who are predicted to churn. By aggregating their MonthlyCharges, we estimate the total predicted revenue loss per month if no action is taken. For instance, if 500 customers are predicted to churn and their average monthly charge is $70, the monthly revenue loss would be approximately $35,000. To mitigate this, we propose an incentive scheme offering a 15% discount on monthly charges for customers predicted to churn, in exchange for switching to one-year contracts. If the incentive costs an average of $10 per targeted customer and 60% of these customers are retained, the company keeps $21,000 in monthly revenue while incurring only $5,000 in incentive costs, for a net benefit of $16,000 per month.
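This arithmetic can be reproduced from the Random Forest test-set predictions. In the sketch below, the 0.5 cutoff and the economic inputs (60% retention, $10 average incentive cost per targeted customer) are assumptions carried over from the worked example above, not estimates from the data.
# Customers the Random Forest flags as likely churners (p > 0.5)
at_risk <- testData %>%
  mutate(ChurnProb = rf_probs) %>%
  filter(ChurnProb > 0.5)
revenue_at_risk <- sum(at_risk$MonthlyCharges)   # monthly revenue lost if all leave
retained_revenue <- 0.60 * revenue_at_risk       # assumed 60% retention under the offer
incentive_cost <- 10 * nrow(at_risk)             # assumed $10 per targeted customer
net_benefit <- retained_revenue - incentive_cost
print(c(RevenueAtRisk = revenue_at_risk, RetainedRevenue = retained_revenue,
        IncentiveCost = incentive_cost, NetBenefit = net_benefit))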
Conclusion
The model’s insights emphasize the importance of addressing financial, contractual, and service-related factors to reduce customer churn effectively. By targeting customers at high risk of leaving with incentives like discounts for longer-term contracts and bundled services, Regork Telecom can substantially reduce revenue loss while maintaining a healthy customer base. The proposed incentive scheme, supported by a cost-benefit analysis, demonstrates a clear path to achieving this goal with a significant return on investment. This strategy ensures not only short-term revenue retention but also fosters long-term customer loyalty and satisfaction, aligning with Regork Telecom’s business objectives.