Introduction

Business Problem

In the competitive telecommunications market, customer churn presents a critical challenge for Regork. Retaining existing customers is far more cost-effective than acquiring new ones, as the costs of marketing, promotions, and onboarding significantly outweigh the cost of retention efforts. Churn not only disrupts revenue streams but also reduces customer lifetime value and weakens Regork’s ability to build a loyal customer base. As Regork expands its offerings (such as internet service, phone service, and online streaming), it becomes imperative to proactively identify and retain customers at risk of leaving. This report addresses this challenge by providing actionable insights and a predictive framework for managing churn effectively.

How the Problem Was Addressed

To tackle the churn problem, customer data was analyzed to uncover key patterns and predictors of churn. Using a combination of exploratory data analysis (EDA) and machine learning techniques, three classifiers were compared, and a logistic regression model (test AUC of 0.84) performed best at predicting which customers are most likely to leave. Key variables such as Tenure, Contract Type, and Monthly Charges emerged as the primary drivers of churn. The analysis not only pinpointed at-risk customers but also quantified the potential revenue loss if no action is taken, estimated at $26,147.25 per month.

Insights

This analysis equips Regork leadership with a data-driven strategy to mitigate churn and improve retention. We propose a targeted retention program built on three incentives: a tiered discount system, limited-time value-added services, and loyalty rewards. Implementing these measures is expected to reduce revenue loss and improve customer loyalty. These insights, along with the predictive model, provide a robust foundation for actionable decision-making.


knitr::opts_chunk$set(
  echo = TRUE,         # Show code in the final output
  warning = FALSE,     # Suppress warnings
  message = FALSE,     # Suppress messages
  error = FALSE        # Stop knitting on errors instead of printing them
)

Packages Required

# Load all required libraries
library(dplyr)
library(ggplot2)
library(caret)
library(pROC)
library(randomForest)

Data Preparation

# Load dataset
data <- read.csv("C:/Users/sszuber/Downloads/customer_retention.csv")

# Handle missing values
data$TotalCharges[is.na(data$TotalCharges)] <- median(data$TotalCharges, na.rm = TRUE)

# Convert Contract to factor
data$Contract <- as.factor(data$Contract)

# Convert Status to numeric binary
data$Status <- ifelse(data$Status == "Left", 1, 0)

The dataset includes customer demographics, subscription details, and churn status. Missing values in TotalCharges are replaced with the column median so that no customer records are dropped. The Contract variable is converted to a factor so it is treated as categorical, and Status is transformed into a binary numeric variable (1 for churned, 0 for retained) to facilitate modeling.
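The median imputation step described above can be illustrated on a small example. This is a minimal sketch using a made-up data frame (`toy` and its values are illustrative assumptions, not rows from the customer data):

```r
# Toy data frame standing in for the customer data (values are illustrative)
toy <- data.frame(
  Tenure = c(1, 12, 0, 48),
  TotalCharges = c(29.85, 1889.50, NA, 4512.10)
)

# Count missing values per column before imputing
missing_before <- colSums(is.na(toy))
print(missing_before)

# Median of the observed values, ignoring NA
med <- median(toy$TotalCharges, na.rm = TRUE)

# Replace only the missing entries with that median
toy$TotalCharges[is.na(toy$TotalCharges)] <- med
print(toy)
```

Checking the missing-value counts first shows how many rows the imputation actually touches, which helps judge how much it could influence the model.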

Exploratory Data Analysis

# Plot status distribution
ggplot(data, aes(x = factor(Status, labels = c("Current", "Left")), fill = factor(Status, labels = c("Current", "Left")))) +
  geom_bar() +
  theme_minimal() +
  ggtitle("Customer Status Distribution") +
  labs(x = "Customer Status", y = "Count", fill = "Status")

The plot shows the distribution of customers who stayed versus those who churned, providing a baseline understanding of the churn rate.

Key Variables and Their Impact on Churn

# Boxplot of tenure vs. status
ggplot(data, aes(x = factor(Status, labels = c("Current", "Left")), y = Tenure, fill = factor(Status, labels = c("Current", "Left")))) +
  geom_boxplot() +
  theme_minimal() +
  ggtitle("Tenure vs. Customer Status") +
  labs(x = "Customer Status", y = "Tenure (Months)", fill = "Status")

The boxplot reveals that customers with shorter tenures are more prone to churn, suggesting that early engagement is crucial for retention.

# Bar plot for contract types
ggplot(data, aes(x = Contract, fill = factor(Status, labels = c("Current", "Left")))) +
  geom_bar(position = "fill") +
  theme_minimal() +
  ggtitle("Contract Type vs. Customer Status") +
  labs(x = "Contract Type", y = "Proportion", fill = "Status")

This plot indicates that customers on month-to-month contracts exhibit higher churn rates compared to those with long-term contracts, highlighting the stabilizing effect of longer commitments.

Machine Learning

Data Splitting

set.seed(42)

data$Status <- factor(data$Status, levels = c(0, 1), labels = c("Current", "Left"))

# Split into training and testing datasets
trainIndex <- createDataPartition(data$Status, p = .8, list = FALSE)
trainData <- data[trainIndex, ]
testData <- data[-trainIndex, ]

The dataset is split into training (80%) and testing (20%) sets to build and evaluate models.
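Because `createDataPartition()` samples within each level of a factor outcome, the class balance should be roughly the same in both sets. The idea can be sketched in base R on synthetic labels (the 70/30 class mix and the names `status`, `train_idx` are assumptions for illustration only):

```r
set.seed(42)

# Synthetic outcome: 70 retained and 30 churned customers (illustrative)
status <- factor(c(rep("Current", 70), rep("Left", 30)))

# Group row indices by class, then sample 80% of the rows within each class
idx_by_class <- split(seq_along(status), status)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.8 * length(i)))
}))

train <- status[train_idx]
test <- status[-train_idx]

# Class proportions are preserved in both sets
print(prop.table(table(train)))
print(prop.table(table(test)))
```

Running `prop.table(table(trainData$Status))` and the same for `testData` is a quick way to confirm this on the real split.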

Logistic Regression Model

# Logistic Regression with cross-validation
logistic_model <- train(
  Status ~ .,
  data = trainData,
  method = "glm",
  family = "binomial",
  trControl = trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary),
  metric = "ROC"
)

# Evaluate on test data
logistic_preds <- predict(logistic_model, newdata = testData, type = "prob")[,2]
logistic_roc <- roc(testData$Status, logistic_preds)
auc(logistic_roc)
## Area under the curve: 0.8417

The logistic regression model, trained with 5-fold cross-validation, achieves an AUC of 0.842 on the test data, the highest of the three models tested.

Decision Tree

# Decision Tree with cross-validation
decision_tree_model <- train(
  Status ~ .,
  data = trainData,
  method = "rpart",
  trControl = trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary),
  metric = "ROC",
  tuneLength = 10
)

# Evaluate on test data
tree_preds <- predict(decision_tree_model, newdata = testData, type = "prob")[,2]
tree_roc <- roc(testData$Status, tree_preds)
auc(tree_roc)
## Area under the curve: 0.7985

The decision tree model, also trained with 5-fold cross-validation, yields an AUC of 0.799 on the test data, suggesting moderate predictive performance.

Random Forest Model

# Ensure Status is a factor (redundancy added for debugging)
trainData$Status <- factor(trainData$Status, levels = c("Current", "Left"))
testData$Status <- factor(testData$Status, levels = c("Current", "Left"))

# Train Random Forest with cross-validation
random_forest_model <- train(
  Status ~ ., 
  data = trainData, 
  method = "rf", 
  trControl = trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary), 
  tuneLength = 3, # Reduce to 3 for faster computation
  metric = "ROC"
)

# Evaluate on test data
rf_preds <- predict(random_forest_model, newdata = testData, type = "prob")[,2]
rf_roc <- roc(testData$Status, rf_preds)
auc(rf_roc)
## Area under the curve: 0.8311

The random forest model, trained with 5-fold cross-validation and hyperparameter tuning, achieves an AUC of 0.831 on the test data. This is good predictive performance, second only to logistic regression (0.842) among the models tested.

Model Comparison and Analysis

# Compare AUCs for all models
results <- data.frame(
  Model = c("Logistic Regression", "Decision Tree", "Random Forest"),
  AUC = c(auc(logistic_roc), auc(tree_roc), auc(rf_roc))
)
results
##                 Model       AUC
## 1 Logistic Regression 0.8417465
## 2       Decision Tree 0.7984559
## 3       Random Forest 0.8310723

Comparing test AUC across the three models, logistic regression performs best (0.842), followed closely by random forest (0.831), with the decision tree last (0.798).
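For readers less familiar with the metric, AUC is the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen retained customer. A minimal base R sketch of that definition, using the rank-sum formulation on made-up scores (the function name `auc_by_rank` and the data are hypothetical, and this is not how `pROC` is implemented internally):

```r
# AUC via the rank-sum (Mann-Whitney) formulation
# labels: logical vector, TRUE for the positive class (churned)
# scores: predicted probability of churn
auc_by_rank <- function(labels, scores) {
  r <- rank(scores)            # ties receive average ranks
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Illustrative data: 3 churners, 3 retained customers
labels <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
scores <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)

# 8 of the 9 churner/retained pairs are ordered correctly
toy_auc <- auc_by_rank(labels, scores)
print(toy_auc)
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for the 0.80 to 0.84 range reported here.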

Feature Importance

# Feature importance for the best model (logistic regression)
importance <- varImp(logistic_model) 
plot(importance, main = "Feature Importance: Best Model")

The most influential features driving churn are: Contract Type, Charges, and Tenure. These insights help focus retention strategies on contract terms, pricing, and early customer engagement.

Generalization Errors

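# Classify test customers with the random forest at a 0.5 probability threshold
predicted <- predict(random_forest_model, newdata = testData, type = "prob")[,2]
testData$Predicted <- ifelse(predicted > 0.5, "Left", "Current")

# Confusion matrix (explicit levels keep both classes even if one is never predicted)
confusionMatrix(
  factor(testData$Predicted, levels = c("Current", "Left")),
  testData$Status
)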
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Current Left
##    Current     984  249
##    Left         44  122
##                                           
##                Accuracy : 0.7906          
##                  95% CI : (0.7683, 0.8116)
##     No Information Rate : 0.7348          
##     P-Value [Acc > NIR] : 7.677e-07       
##                                           
##                   Kappa : 0.3474          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9572          
##             Specificity : 0.3288          
##          Pos Pred Value : 0.7981          
##          Neg Pred Value : 0.7349          
##              Prevalence : 0.7348          
##          Detection Rate : 0.7034          
##    Detection Prevalence : 0.8813          
##       Balanced Accuracy : 0.6430          
##                                           
##        'Positive' Class : Current         
## 

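The confusion matrix summarizes the random forest model’s performance at a 0.5 probability threshold. Overall accuracy is 79.1%, but the class-level results differ sharply. Because Current is treated as the positive class, the reported sensitivity (0.957) is the share of retained customers classified correctly, while the specificity (0.329) is the share of actual churners the model flags: only 122 of 371. At this threshold the model misses about two-thirds of churners, so a lower threshold may be worth considering when the goal is retention targeting.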
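The counts in the confusion matrix above can be re-expressed from the churner's point of view, which is often more useful for retention planning. A short base R sketch using those four counts (the variable names are illustrative):

```r
# Counts copied from the confusion matrix output above
cm <- matrix(
  c(984, 44, 249, 122),
  nrow = 2,
  dimnames = list(
    Prediction = c("Current", "Left"),
    Reference = c("Current", "Left")
  )
)

# Share of actual churners that were flagged (recall for "Left")
recall_left <- cm["Left", "Left"] / sum(cm[, "Left"])

# Share of flagged customers who actually churned (precision for "Left")
precision_left <- cm["Left", "Left"] / sum(cm["Left", ])

# Overall accuracy
accuracy <- sum(diag(cm)) / sum(cm)

print(round(c(recall = recall_left, precision = precision_left, accuracy = accuracy), 4))
```

The recall for churners equals the Specificity line in the output and the precision equals the Neg Pred Value line, because the report treats Current as the positive class.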

Business Analysis and Conclusion

Most Important Predictor Variables

The analysis indicates that Tenure, Contract Type, and Charges are the most influential predictors of customer churn.

  • Tenure: Tenure emerged as the most critical predictor variable. Customers with longer tenures are significantly less likely to churn, reflecting their satisfaction and loyalty to the company. Conversely, customers with shorter tenures are at a higher risk of leaving. This highlights the importance of engaging new customers early and fostering long-term relationships. Regork should focus on improving customer retention efforts during the first few months of service by offering onboarding programs, personalized support, and loyalty incentives to reduce churn and increase tenure.
  • Contract Type: Contract Type was the second most influential predictor variable. Customers on month-to-month contracts are much more likely to churn compared to those on longer-term contracts. This finding underscores the importance of promoting annual or multi-year contracts. Regork should incentivize customers to switch from month-to-month plans to longer-term contracts by offering discounts, rewards, or exclusive benefits. This strategy would not only reduce churn rates but also stabilize revenue streams and build stronger customer relationships.
  • Monthly Charges and Total Charges: Both Monthly Charges and Total Charges were consistently ranked as significant predictors. Customers paying higher monthly charges or accumulating significant total charges appear more likely to churn, likely due to dissatisfaction with perceived value or affordability. This underscores the need for Regork to carefully evaluate its pricing structures and ensure they align with customer expectations. Offering tailored packages, loyalty discounts, or bundling services could help mitigate churn among high-paying customers while enhancing their perceived value.

Predicted Churn and Revenue Lost

Using the best-performing logistic regression model (test AUC of 0.84), 256 customers in the test dataset are predicted to leave. If no action is taken, the estimated monthly revenue loss is $26,147.25, calculated as the sum of the monthly charges of these at-risk customers.

When analyzing the demographics of the customers predicted to leave, several trends emerge. The majority are younger (not senior citizens) and single. Additionally, a significant portion have no dependents and are on month-to-month contracts with higher monthly charges. These findings suggest that financial stability and commitment flexibility are critical factors in churn.
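The revenue-at-risk figure is a sum of monthly charges over the customers flagged as likely to leave. A minimal sketch of that calculation on made-up data follows; the column names `MonthlyCharges` and `churn_prob` and the 0.5 threshold are assumptions for illustration, not taken from the report's code:

```r
# Illustrative customers with a predicted churn probability
toy <- data.frame(
  MonthlyCharges = c(95.50, 20.25, 104.80, 70.00),
  churn_prob = c(0.81, 0.12, 0.55, 0.43)
)

# Assumed classification threshold
threshold <- 0.5

# Customers predicted to leave
at_risk <- toy[toy$churn_prob > threshold, ]

# Number of at-risk customers and the monthly revenue they represent
n_at_risk <- nrow(at_risk)
revenue_at_risk <- sum(at_risk$MonthlyCharges)

print(n_at_risk)
print(revenue_at_risk)
```

Both outputs depend on the chosen threshold: lowering it flags more customers and raises the revenue-at-risk estimate, so the threshold should be reported alongside the figure.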

Incentives

Tiered Discount System: Offer a 15% discount for customers agreeing to switch to a 1-year contract or a 25% discount for committing to a 2-year contract. This approach encourages longer commitments while offering flexibility to those hesitant about extended contracts.

Value-Added Services: Include free access to premium services (e.g., streaming add-ons or tech support) for 3 months to enhance perceived value without increasing direct costs.

Loyalty Rewards: Customers who remain on the new 1-2 year contract for its full duration receive a $50 credit on their next bill, further incentivizing retention.
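To see what these incentives cost per customer, the discounts and loyalty credit can be combined into a simple revenue calculation. This is a rough sketch under stated assumptions: the discount applies to every month of the contract, the $50 credit is paid once at the end, and the cost of the 3-month value-added services is ignored. The function name `contract_revenue` is hypothetical.

```r
# Incentive terms taken from the proposal above
discount_rate <- c("1-year" = 0.15, "2-year" = 0.25)
contract_months <- c("1-year" = 12, "2-year" = 24)

# Revenue from one customer over a full contract, net of discount and credit
contract_revenue <- function(monthly_charge, term, credit = 50) {
  months <- contract_months[[term]]
  discounted <- monthly_charge * (1 - discount_rate[[term]])
  months * discounted - credit
}

# Example: a customer currently paying $100 per month
rev_1yr <- contract_revenue(100, "1-year")  # 12 * 85 - 50
rev_2yr <- contract_revenue(100, "2-year")  # 24 * 75 - 50

print(c(one_year = rev_1yr, two_year = rev_2yr))
```

Comparing these totals with the revenue expected from a month-to-month customer who is likely to leave early gives a first estimate of whether each tier pays for itself.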

Conclusion Statement

In conclusion, Regork’s venture into telecommunications shows strong potential for success, provided the company addresses key drivers of churn. Our analysis identified Tenure, Contract Type, and Charges as the most critical predictors of customer churn. These findings provide actionable insights to guide retention efforts:

  1. Strengthen Early Customer Engagement: Focus on improving customer experiences during the critical early months to increase tenure and reduce churn.
  2. Promote Long-Term Contracts: Incentivize customers to adopt longer-term contracts to reduce the likelihood of churn.
  3. Reevaluate Pricing Strategies: Ensure pricing is competitive while delivering value to customers, particularly those with higher monthly charges.

The analysis estimates that without intervention, Regork stands to lose $26,147.25 per month due to churn. By implementing the proposed incentive scheme, the company could retain a significant portion of at-risk customers and recover part of that revenue each month.

While our best model (logistic regression) achieves an AUC of 0.84, indicating strong predictive performance, there is room for further refinement to improve accuracy. Overall, these results provide a solid foundation for data-driven decision-making and customer retention strategies that support Regork’s long-term success in the telecommunications market.