Customer churn is a significant issue for Regork Telecommunications. Acquiring new customers is substantially more expensive than retaining existing ones, making customer retention a high-priority business objective. The CEO should be interested in this analysis because understanding and addressing churn not only prevents revenue loss but also strengthens customer relationships, leading to improved customer lifetime value and market share.
# Load necessary libraries
library(tidyverse)
library(caret)
library(pROC)
library(earth)
library(randomForest)
# Load the dataset
data <- read.csv("customer_retention.csv")
# Handle missing values in TotalCharges
data$TotalCharges[is.na(data$TotalCharges)] <- median(data$TotalCharges, na.rm = TRUE)
# Encode categorical variables as factors
categorical_vars <- c("Gender", "Partner", "Dependents", "PhoneService",
"MultipleLines", "InternetService", "OnlineSecurity",
"OnlineBackup", "DeviceProtection", "TechSupport",
"StreamingTV", "StreamingMovies", "Contract",
"PaperlessBilling", "PaymentMethod")
data[categorical_vars] <- lapply(data[categorical_vars], as.factor)
data$Status <- as.factor(data$Status)
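As a quick sanity check on the preprocessing above (a minimal sketch, assuming the chunk above has run), we can confirm that the imputation left no missing values and that the encoding produced factors:
# Verify preprocessing: no remaining NAs and factor-encoded columns
sum(is.na(data$TotalCharges))                    # expected: 0
all(sapply(data[categorical_vars], is.factor))   # expected: TRUE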
# Exploratory Data Analysis {.tabset}
## Distribution of Customer Status
ggplot(data, aes(x = Status)) +
geom_bar(fill = "skyblue") +
labs(
title = "Distribution of Customer Status",
x = "Customer Status",
y = "Count"
) +
theme_minimal()
Observation: Approximately 73.5% of customers are current, while 26.5% have churned. This baseline churn rate highlights the need for strategies to retain at-risk customers.
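The baseline quoted above can be reproduced directly from the Status column (a one-line check, not part of the original analysis):
# Churn baseline: share of current vs. churned customers
prop.table(table(data$Status))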
## Tenure Distribution by Status
ggplot(data, aes(x = Tenure, fill = Status)) +
geom_histogram(bins = 30, position = "dodge", alpha = 0.7) +
labs(
title = "Tenure Distribution by Customer Status",
x = "Tenure (Months)",
y = "Count"
) +
theme_minimal()
Customers with shorter tenures are more likely to churn. This trend suggests that newer customers may need additional support or incentives to remain engaged.
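To quantify that gap, a minimal sketch comparing typical tenure across the two groups (column names taken from the plotting code above):
# Median and mean tenure for current vs. churned customers
data %>%
  group_by(Status) %>%
  summarise(median_tenure = median(Tenure), mean_tenure = mean(Tenure))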
## Monthly Charges by Status
ggplot(data, aes(x = Status, y = MonthlyCharges, fill = Status)) +
geom_boxplot() +
labs(
title = "Monthly Charges by Customer Status",
x = "Customer Status",
y = "Monthly Charges ($)"
) +
theme_minimal()
Customers who churn tend to have higher monthly charges compared to those who stay. This suggests that pricing may be a factor influencing churn.
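The same kind of summary makes the pricing gap concrete (a sketch using the columns already in the data):
# Typical monthly charges for current vs. churned customers
data %>%
  group_by(Status) %>%
  summarise(median_charges = median(MonthlyCharges), mean_charges = mean(MonthlyCharges))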
# Machine Learning {.tabset}
## Splitting the Data
set.seed(42)
split <- createDataPartition(data$Status, p = 0.7, list = FALSE)
train <- data[split, ]
test <- data[-split, ]
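Because createDataPartition samples within each outcome class, the churn rate should be nearly identical in both partitions; a quick check (not in the original code) confirms the stratification:
# Stratification check: class balance in train vs. test
prop.table(table(train$Status))
prop.table(table(test$Status))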
## Logistic Regression
# Train logistic regression on all predictors with 5-fold cross-validation
log_model <- train(
Status ~ .,
data = train,
method = "glm",
family = "binomial",
trControl = trainControl(method = "cv", number = 5, savePredictions = "final")
)
# Predict and calculate confusion matrix
log_predictions <- predict(log_model, newdata = test, type = "raw")
log_cm <- confusionMatrix(log_predictions, test$Status)
# Calculate AUC
log_prob <- predict(log_model, newdata = test, type = "prob")[, 2]
log_auc <- roc(test$Status, log_prob)
# Extract coefficients
log_coefficients <- summary(log_model$finalModel)$coefficients
log_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Current Left
## Current 1387 251
## Left 155 305
##
## Accuracy : 0.8065
## 95% CI : (0.7889, 0.8232)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 1.09e-14
##
## Kappa : 0.4742
##
## Mcnemar's Test P-Value : 2.42e-06
##
## Sensitivity : 0.8995
## Specificity : 0.5486
## Pos Pred Value : 0.8468
## Neg Pred Value : 0.6630
## Prevalence : 0.7350
## Detection Rate : 0.6611
## Detection Prevalence : 0.7807
## Balanced Accuracy : 0.7240
##
## 'Positive' Class : Current
##
auc(log_auc)
## Area under the curve: 0.8551
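The coefficient table extracted above is never displayed; one simple way to surface the strongest effects (a sketch, keeping in mind that coefficients are on the log-odds scale and only directly comparable across similarly scaled predictors) is to rank them by magnitude:
# Top 10 predictors by absolute coefficient size
head(log_coefficients[order(abs(log_coefficients[, "Estimate"]), decreasing = TRUE), ], 10)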
## MARS Model
# Train the MARS model
mars_model <- train(
Status ~ .,
data = train,
method = "earth",
trControl = trainControl(method = "cv", number = 5, savePredictions = "final"),
tuneGrid = expand.grid(degree = c(1, 2), nprune = seq(2, 50, by = 2))
)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (the same warning was emitted repeatedly across cross-validation fits)
# Predict and calculate confusion matrix
mars_predictions <- predict(mars_model, newdata = test, type = "raw")
mars_cm <- confusionMatrix(mars_predictions, test$Status)
# Calculate AUC
mars_prob <- predict(mars_model, newdata = test, type = "prob")[, 2]
mars_auc <- roc(test$Status, mars_prob)
mars_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Current Left
## Current 1402 260
## Left 140 296
##
## Accuracy : 0.8093
## 95% CI : (0.7919, 0.8259)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 8.705e-16
##
## Kappa : 0.4743
##
## Mcnemar's Test P-Value : 2.681e-09
##
## Sensitivity : 0.9092
## Specificity : 0.5324
## Pos Pred Value : 0.8436
## Neg Pred Value : 0.6789
## Prevalence : 0.7350
## Detection Rate : 0.6683
## Detection Prevalence : 0.7922
## Balanced Accuracy : 0.7208
##
## 'Positive' Class : Current
##
auc(mars_auc)
## Area under the curve: 0.8579
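For the MARS model, caret's varImp wrapper exposes the importance scores that earth computes during pruning; a short sketch to visualize the top predictors:
# Variable importance for the tuned MARS model
plot(varImp(mars_model), top = 10)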
## Random Forest
# Train the Random Forest model
rf_model <- train(
Status ~ .,
data = train,
method = "rf",
trControl = trainControl(method = "cv", number = 5, savePredictions = "final"),
tuneGrid = expand.grid(mtry = c(2, 4, 6, 8)),
ntree = 500
)
# Predict and calculate confusion matrix
rf_predictions <- predict(rf_model, newdata = test, type = "raw")
rf_cm <- confusionMatrix(rf_predictions, test$Status)
# Calculate AUC
rf_prob <- predict(rf_model, newdata = test, type = "prob")[, 2]
rf_auc <- roc(test$Status, rf_prob)
rf_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Current Left
## Current 1384 258
## Left 158 298
##
## Accuracy : 0.8017
## 95% CI : (0.784, 0.8186)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 5.790e-13
##
## Kappa : 0.46
##
## Mcnemar's Test P-Value : 1.211e-06
##
## Sensitivity : 0.8975
## Specificity : 0.5360
## Pos Pred Value : 0.8429
## Neg Pred Value : 0.6535
## Prevalence : 0.7350
## Detection Rate : 0.6597
## Detection Prevalence : 0.7827
## Balanced Accuracy : 0.7168
##
## 'Positive' Class : Current
##
auc(rf_auc)
## Area under the curve: 0.8465
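The key-factors discussion below leans on the random forest's feature importance, so the same varImp sketch applies here:
# Variable importance for the random forest (referenced in the recommendations below)
plot(varImp(rf_model), top = 10)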
# Model Comparison {.tabset}
## Confusion Matrix Summary
data.frame(
Model = c("Logistic Regression", "MARS", "Random Forest"),
Accuracy = c(log_cm$overall['Accuracy'], mars_cm$overall['Accuracy'], rf_cm$overall['Accuracy']),
AUC = c(auc(log_auc), auc(mars_auc), auc(rf_auc))
)
## Model Accuracy AUC
## 1 Logistic Regression 0.8064824 0.8550729
## 2 MARS 0.8093422 0.8578793
## 3 Random Forest 0.8017159 0.8464645
Logistic Regression provides strong interpretability with competitive accuracy and AUC. MARS captures non-linear relationships and achieves the highest accuracy and AUC of the three models. Random Forest trails slightly on both metrics but offers built-in feature importance analysis, making it useful for operational targeting. The three ROC curves are compared visually below.
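The comparison can be visualized in one plot using the roc objects already computed (a plotting sketch; the colors are arbitrary choices):
# Overlay ROC curves for the three models
plot(log_auc, col = "steelblue", legacy.axes = TRUE)
plot(mars_auc, col = "darkorange", add = TRUE)
plot(rf_auc, col = "forestgreen", add = TRUE)
legend("bottomright",
       legend = c("Logistic Regression", "MARS", "Random Forest"),
       col = c("steelblue", "darkorange", "forestgreen"), lwd = 2)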
## Key Factors to Focus On
- **Tenure:** Customers with shorter tenure are at higher risk of churn. Focus on engaging and retaining newer customers.
- **MonthlyCharges:** High monthly charges correlate with churn. Offering discounts or added value can help retain high-paying customers.
- **Contract Type:** Month-to-month contracts have higher churn rates. Incentivizing longer-term contracts can stabilize retention.
## Potential Revenue Loss
# Estimate potential revenue loss
predicted_churn <- test[predict(rf_model, test) == "Left", ]
potential_revenue_loss <- sum(predicted_churn$MonthlyCharges)
potential_revenue_loss
## [1] 36259.45
If no action is taken, Regork stands to lose approximately $36,259 per month from the test-set customers the random forest flags as likely to churn.
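Whether the incentive scheme below pays for itself depends on inputs we can only assume here; a back-of-envelope sketch with the discount rate and retention uplift as explicit assumptions (neither is estimated from the data):
# ASSUMPTIONS (illustrative, not estimated from the data):
# - a 10% discount is offered to every customer the model flags as likely to churn
# - 30% of flagged customers are retained as a result
discount_rate <- 0.10
assumed_save_rate <- 0.30
discount_cost <- discount_rate * sum(predicted_churn$MonthlyCharges)  # cost if every targeted customer takes the offer
revenue_saved <- assumed_save_rate * potential_revenue_loss
revenue_saved - discount_cost  # rough net monthly benefit of the campaign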
## Proposed Incentive Scheme
**Target Group:** Customers the model predicts will churn.

**Incentives:**
- Offer a 10-15% discount for high-paying customers.
- Provide one-time discounts or free services for customers switching to annual contracts.
- Enhance service offerings, such as faster internet or improved customer support.

# Summary for the CEO
**Key Insights:** Tenure, MonthlyCharges, and Contract Type are the top predictors of churn. Retention efforts targeting high-risk customers can significantly reduce revenue loss.

**Proposed Actions:** Implement the outlined incentive scheme and enhance service satisfaction. Focus on retaining high-paying and newer customers.

**Expected Outcome:** A significant reduction in churn rate and preservation of monthly revenue.