1 Introduction

1.1 Business Problem

Customer churn is a significant issue for Regork Telecommunications. Acquiring new customers is substantially more expensive than retaining existing ones, making customer retention a high-priority business objective. The CEO should be interested in this analysis because understanding and addressing churn not only prevents revenue loss but also strengthens customer relationships, leading to improved customer lifetime value and market share.


2 Exploratory Data Analysis

2.1 Loading Libraries and Data

# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(earth)
## Warning: package 'earth' was built under R version 4.4.2
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.4.2
## Loading required package: plotrix
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Load the dataset
data <- read.csv("customer_retention.csv")

# Handle missing values in TotalCharges
data$TotalCharges[is.na(data$TotalCharges)] <- median(data$TotalCharges, na.rm = TRUE)

# Encode categorical variables as factors
categorical_vars <- c("Gender", "Partner", "Dependents", "PhoneService", 
                      "MultipleLines", "InternetService", "OnlineSecurity", 
                      "OnlineBackup", "DeviceProtection", "TechSupport", 
                      "StreamingTV", "StreamingMovies", "Contract", 
                      "PaperlessBilling", "PaymentMethod")
data[categorical_vars] <- lapply(data[categorical_vars], as.factor)
data$Status <- as.factor(data$Status)

Distribution of Customer Status

ggplot(data, aes(x = Status)) +
  geom_bar(fill = "skyblue") +
  labs(
    title = "Distribution of Customer Status",
    x = "Customer Status",
    y = "Count"
  ) +
  theme_minimal()

Observation: Approximately 73.5% of customers are current, while 26.5% have churned. This baseline churn rate highlights the need for strategies to retain at-risk customers.

Tenure Distribution by Status

ggplot(data, aes(x = Tenure, fill = Status)) +
  geom_histogram(bins = 30, position = "dodge", alpha = 0.7) +
  labs(
    title = "Tenure Distribution by Customer Status",
    x = "Tenure (Months)",
    y = "Count"
  ) +
  theme_minimal()

Customers with shorter tenures are more likely to churn. This trend suggests that newer customers may need additional support or incentives to remain engaged.

Monthly Charges by Status

ggplot(data, aes(x = Status, y = MonthlyCharges, fill = Status)) +
  geom_boxplot() +
  labs(
    title = "Monthly Charges by Customer Status",
    x = "Customer Status",
    y = "Monthly Charges ($)"
  ) +
  theme_minimal()

Customers who churn tend to have higher monthly charges compared to those who stay. This suggests that pricing may be a factor influencing churn.

# Machine Learning {.tabset}

Splitting the Data

set.seed(42)
split <- createDataPartition(data$Status, p = 0.7, list = FALSE)
train <- data[split, ]
test <- data[-split, ]

Logistic Regression Logistic Regression

# Train logistic regression with reduced predictors
log_model <- train(
  Status ~ ., 
  data = train, 
  method = "glm", 
  family = "binomial", 
  trControl = trainControl(method = "cv", number = 5, savePredictions = "final")
)

# Predict and calculate confusion matrix
log_predictions <- predict(log_model, newdata = test, type = "raw")
log_cm <- confusionMatrix(log_predictions, test$Status)

# Calculate AUC
log_prob <- predict(log_model, newdata = test, type = "prob")[, 2]
log_auc <- roc(test$Status, log_prob)
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
# Extract coefficients
log_coefficients <- summary(log_model$finalModel)$coefficients

log_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Current Left
##    Current    1387  251
##    Left        155  305
##                                           
##                Accuracy : 0.8065          
##                  95% CI : (0.7889, 0.8232)
##     No Information Rate : 0.735           
##     P-Value [Acc > NIR] : 1.09e-14        
##                                           
##                   Kappa : 0.4742          
##                                           
##  Mcnemar's Test P-Value : 2.42e-06        
##                                           
##             Sensitivity : 0.8995          
##             Specificity : 0.5486          
##          Pos Pred Value : 0.8468          
##          Neg Pred Value : 0.6630          
##              Prevalence : 0.7350          
##          Detection Rate : 0.6611          
##    Detection Prevalence : 0.7807          
##       Balanced Accuracy : 0.7240          
##                                           
##        'Positive' Class : Current         
## 
auc(log_auc)
## Area under the curve: 0.8551

MARS Model

# Train the MARS model
mars_model <- train(
  Status ~ ., 
  data = train, 
  method = "earth", 
  trControl = trainControl(method = "cv", number = 5, savePredictions = "final"),
  tuneGrid = expand.grid(degree = c(1, 2), nprune = seq(2, 50, by = 2))
)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Predict and calculate confusion matrix
mars_predictions <- predict(mars_model, newdata = test, type = "raw")
mars_cm <- confusionMatrix(mars_predictions, test$Status)

# Calculate AUC
mars_prob <- predict(mars_model, newdata = test, type = "prob")[, 2]
mars_auc <- roc(test$Status, mars_prob)
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
mars_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Current Left
##    Current    1402  260
##    Left        140  296
##                                           
##                Accuracy : 0.8093          
##                  95% CI : (0.7919, 0.8259)
##     No Information Rate : 0.735           
##     P-Value [Acc > NIR] : 8.705e-16       
##                                           
##                   Kappa : 0.4743          
##                                           
##  Mcnemar's Test P-Value : 2.681e-09       
##                                           
##             Sensitivity : 0.9092          
##             Specificity : 0.5324          
##          Pos Pred Value : 0.8436          
##          Neg Pred Value : 0.6789          
##              Prevalence : 0.7350          
##          Detection Rate : 0.6683          
##    Detection Prevalence : 0.7922          
##       Balanced Accuracy : 0.7208          
##                                           
##        'Positive' Class : Current         
## 
auc(mars_auc)
## Area under the curve: 0.8579

Random Forest

# Train the Random Forest model
rf_model <- train(
  Status ~ ., 
  data = train, 
  method = "rf", 
  trControl = trainControl(method = "cv", number = 5, savePredictions = "final"),
  tuneGrid = expand.grid(mtry = c(2, 4, 6, 8)),
  ntree = 500
)

# Predict and calculate confusion matrix
rf_predictions <- predict(rf_model, newdata = test, type = "raw")
rf_cm <- confusionMatrix(rf_predictions, test$Status)

# Calculate AUC
rf_prob <- predict(rf_model, newdata = test, type = "prob")[, 2]
rf_auc <- roc(test$Status, rf_prob)
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
rf_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Current Left
##    Current    1384  258
##    Left        158  298
##                                          
##                Accuracy : 0.8017         
##                  95% CI : (0.784, 0.8186)
##     No Information Rate : 0.735          
##     P-Value [Acc > NIR] : 5.790e-13      
##                                          
##                   Kappa : 0.46           
##                                          
##  Mcnemar's Test P-Value : 1.211e-06      
##                                          
##             Sensitivity : 0.8975         
##             Specificity : 0.5360         
##          Pos Pred Value : 0.8429         
##          Neg Pred Value : 0.6535         
##              Prevalence : 0.7350         
##          Detection Rate : 0.6597         
##    Detection Prevalence : 0.7827         
##       Balanced Accuracy : 0.7168         
##                                          
##        'Positive' Class : Current        
## 
auc(rf_auc)
## Area under the curve: 0.8465

# Model Comparison {.tabset}

Confusion Matrix Summary

data.frame(
  Model = c("Logistic Regression", "MARS", "Random Forest"),
  Accuracy = c(log_cm$overall['Accuracy'], mars_cm$overall['Accuracy'], rf_cm$overall['Accuracy']),
  AUC = c(auc(log_auc), auc(mars_auc), auc(rf_auc))
)
##                 Model  Accuracy       AUC
## 1 Logistic Regression 0.8064824 0.8550729
## 2                MARS 0.8093422 0.8578793
## 3       Random Forest 0.8017159 0.8464645

Logistic Regression provides strong interpretability with competitive accuracy and AUC. MARS captures non-linear relationships but has slightly lower accuracy compared to Random Forest. Random Forest balances accuracy, AUC, and feature importance analysis, making it suitable for operational use.

3 Business Analysis & Summary

Key Factors to Focus On Tenure: Customers with shorter tenure are at higher risk of churn. Focus on engaging and retaining newer customers. MonthlyCharges: High monthly charges correlate with churn. Offering discounts or added value can help retain high-paying customers. Contract Type: Month-to-month contracts have higher churn rates. Incentivizing longer-term contracts can stabilize retention.

Potential Revenue Loss

# Estimate potential revenue loss
predicted_churn <- test[predict(rf_model, test) == "Left", ]
potential_revenue_loss <- sum(predicted_churn$MonthlyCharges)
potential_revenue_loss
## [1] 36259.45

If no action is taken, Regork stands to lose approximately $35826 in monthly revenue.

Proposed Incentive Scheme Target Group: Customers predicted to churn based on model predictions. Incentives: Offer a 10-15% discount for high-paying customers. Provide one-time discounts or free services for customers switching to annual contracts. Enhance service offerings, such as faster internet or improved customer support. Summary for the CEO Key Insights: Tenure, MonthlyCharges, and Contract Type are the top predictors of churn. Retention efforts targeting high-risk customers can significantly reduce revenue loss. Proposed Actions: Implement the outlined incentive scheme and enhance service satisfaction. Focus on retaining high-paying and newer customers. Expected Outcome: A significant reduction in churn rate and preservation of monthly revenue.