Analyzing Customer Retention Using Machine Learning

Introduction

Business Problem

In today’s fast-evolving technology landscape, customer expectations for smart devices are rapidly increasing. To stay competitive, Regork Telecom must understand how to retain customers and offer products that meet modern demands for connectivity, entertainment, and convenience. The CEO should be particularly interested in this analysis, as customer retention is directly tied to long-term profitability and brand loyalty in an increasingly saturated market.

Approach

To address this challenge, we conducted a machine learning-based analysis of customer retention. We used a dataset containing customer behavior, service usage, and demographic information to identify patterns and key drivers behind customer churn. By applying classification models and data visualization techniques, we were able to extract actionable insights that informed product development and pricing strategy.

Proposed Solution

Our analysis supports the launch of a new generation of Smart Home TVs, designed not only to enhance user experience but also to strengthen customer engagement and retention. The TVs offer advanced features including personalized content, seamless access to popular platforms like Netflix and Spotify, and direct human-device interaction. A complimentary three-month premium subscription, followed by a competitive pricing model, further encourages long-term customer loyalty. These strategies aim to reduce churn and position Regork Telecom as an industry leader in smart home innovation.

Packages Required

library(tidyverse)       # attaches ggplot2, dplyr, tidyr, readr
library(completejourney)
library(caret)
library(corrplot)
library(randomForest)
library(earth)           # engine for mars()
library(ranger)          # engine for rand_forest()
library(vip)             # variable importance plots
library(tidymodels)      # attaches rsample, recipes, parsnip, tune, yardstick

Data Preparation

data <- read_csv("customer_retention.csv")

class(data)
summary(data)
str(data)
data <- na.omit(data)
table(data$Status)
table(data$Gender)
table(data$InternetService)
summary(data$Tenure)
summary(data$MonthlyCharges)

# Convert every categorical column, including the outcome, to a factor
factor_cols <- c(
  "Gender", "SeniorCitizen", "Partner", "Dependents", "PhoneService",
  "MultipleLines", "InternetService", "OnlineSecurity", "OnlineBackup",
  "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies",
  "Contract", "PaperlessBilling", "PaymentMethod", "Status"
)
data <- data %>%
  mutate(across(all_of(factor_cols), as.factor))
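# Tenure, MonthlyCharges, and TotalCharges stay in their original units
# (months and dollars) so the plots below read naturally. scale() would also
# return one-column matrices rather than plain numeric vectors, and none of
# the models fitted below requires standardized inputs.
sum(is.na(data))  # expected to be 0: incomplete rows were dropped above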

Exploratory Analysis

ggplot(data, aes(x = Tenure, fill = Status)) +
  geom_histogram(position = "dodge", bins = 30) +
  labs(title = "Distribution of Tenure by Customer Status", x = "Tenure (months)", y = "Count") +
  theme_minimal()

The histogram “Distribution of Tenure by Customer Status” shows that loyalty to Regork Premium grows over time. Customers who stay subscribed longer invest more time and money in the service, and those who have used the Premium Pack for 70 months or more make up the largest group of current users. In contrast, many former subscribers canceled within their first few months.
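
One way to put numbers behind this reading is to tabulate the share of departed customers in each tenure band. The sketch below runs on a small made-up data frame (the values are illustrative and not drawn from customer_retention.csv) and assumes that Tenure is measured in months and that Status takes the values Current and Left, as in the confusion matrix later in this report.

```r
# Toy data for illustration only; not drawn from customer_retention.csv.
toy <- data.frame(
  Tenure = c(1, 2, 3, 5, 8, 14, 20, 30, 45, 60, 70, 72),
  Status = c("Left", "Left", "Current", "Left", "Current", "Left",
             "Current", "Current", "Current", "Current", "Current", "Current")
)

# Bucket tenure (months) into right-closed bands: (0,12], (12,24], ...
toy$TenureBand <- cut(toy$Tenure,
                      breaks = c(0, 12, 24, 48, 72),
                      labels = c("0-12", "13-24", "25-48", "49-72"))

# Share of customers in each band who have left.
left_share <- tapply(toy$Status == "Left", toy$TenureBand, mean)
print(round(left_share, 2))
```

Applied to the real data, the same tapply() call on data$Tenure and data$Status would show whether departures really concentrate in the first year.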

ggplot(data, aes(x = TotalCharges)) +
  geom_histogram(bins = 30, fill = "cyan", color = "pink") +
  labs(title = "Distribution of Total Charges by Price Range", 
       x = "Total Charges ($)", y = "Count") +
  theme_minimal()

The histogram of total charges suggests that most of our customers are comfortable spending close to $1,000. The most common level, with around 1,000 customers, sits near $970, and the next most common is about $1,050. Based on these insights, our team has chosen to price our Smart TVs between $1,050 and $1,250, depending on features and memory, to align with what customers are looking for.

ggplot(data, aes(x = PaymentMethod)) +
  geom_bar(fill = "purple") +
  labs(title = "Distribution of Payment Methods", 
       x = "Payment Method", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # tilt labels so they do not overlap

Surprisingly, the bar chart shows that electronic checks are the most frequently used of the four payment methods. Mailed checks come in second, while the two automatic methods, bank transfer and credit card, have roughly the same number of users.

ggplot(data, aes(x = SeniorCitizen, fill = DeviceProtection)) +
  geom_bar(position = "dodge") +
  labs(title = "Senior Citizens and Device Protection", x = "Senior Citizen", y = "Count")

If older customers are more likely to choose Device Protection because they’re more aware of potential risks, this opens up a great chance for targeted marketing.

We can boost our promotional efforts for Regork Premium by emphasizing that it includes Device Protection, specifically geared toward senior customers. This approach would not only offer them extra value and reassurance but also encourage more of them to choose our Smart Home TV and Premium package.
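
If the two age groups differ in size, the raw counts in a dodged bar chart can hide differences in uptake, so within-group shares are a fairer basis for this comparison. A minimal sketch, using hypothetical counts (not taken from the dataset) and assuming SeniorCitizen is coded 0/1:

```r
# Hypothetical counts for illustration only; not taken from the dataset.
counts <- matrix(c(300, 700,    # non-senior: with / without protection
                    60,  40),   # senior:     with / without protection
                 nrow = 2, byrow = TRUE,
                 dimnames = list(SeniorCitizen = c("0", "1"),
                                 DeviceProtection = c("Yes", "No")))

# Row-wise proportions: uptake within each age group.
uptake <- prop.table(counts, margin = 1)
print(round(uptake, 2))
```

On the real data, prop.table(table(data$SeniorCitizen, data$DeviceProtection), margin = 1) gives the same view, and geom_bar(position = "fill") draws it.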

Machine Learning

Logistic Regression

set.seed(42)
split_data <- initial_split(data, prop = 0.75, strata = Status)

train_data <- training(split_data)
test_data <- testing(split_data)

set.seed(42)  
cv_folds <- vfold_cv(train_data, v = 10, strata = Status)

logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%  
  set_mode("classification")

logistic_resamples <- fit_resamples(logistic_model, Status ~ ., resamples = cv_folds)
logistic_resamples %>%
  collect_metrics()

The model performs well, particularly in distinguishing between different classes, as shown by its high ROC AUC score. Its accuracy and Brier score further suggest that the predictions are reliable and the probability estimates are reasonably well-calibrated.
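
For readers less familiar with these metrics, the sketch below computes a Brier score and a ROC AUC by hand on six toy predictions. The numbers are invented for illustration, and the formulas are the usual binary definitions rather than a reproduction of the yardstick internals.

```r
# Toy predictions for illustration only.
truth <- c(1, 1, 1, 0, 0, 0)              # 1 = event (e.g. customer left)
prob  <- c(0.9, 0.8, 0.4, 0.6, 0.2, 0.1)  # predicted probability of the event

# Brier score: mean squared gap between predicted probability and outcome.
# Lower is better; 0 is a perfect forecast.
brier <- mean((prob - truth)^2)

# ROC AUC via the rank (Mann-Whitney) formulation: the probability that a
# randomly chosen event receives a higher score than a randomly chosen non-event.
r <- rank(prob)
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc <- (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(round(c(brier = brier, roc_auc = auc), 3))
```

Here eight of the nine event/non-event pairs are ranked correctly, so the AUC is 8/9; a value of 0.5 would correspond to random ordering.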

Multivariate Adaptive Regression Spline (MARS)

set.seed(123)
data_split <- initial_split(data, prop = 0.7, strata = "Status")
train_data <- training(data_split)
test_data <- testing(data_split)

mars_recipe <- recipe(Status ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors())

set.seed(123)
mars_kfolds <- vfold_cv(train_data, v = 7, strata = "Status")

mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification") %>%
  set_engine("earth")

mars_grid <- grid_regular(num_terms(range = c(1, 20)), prod_degree(), levels = 5)

mars_wf <- workflow() %>%
  add_recipe(mars_recipe) %>%
  add_model(mars_mod)

mars_results <- mars_wf %>%
  tune_grid(resamples = mars_kfolds, grid = mars_grid)

mars_results %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))

The top-performing setups use 20 or 15 terms with a prod_degree of 1, consistently achieving high ROC AUC scores, which indicate strong classification performance. Raising the prod_degree slightly lowers performance, while using fewer than 10 terms significantly weakens the model’s ability to recognize key patterns in the data. Configurations with only one term perform poorly and underfit the data, producing a ROC AUC of 0.5—no better than random guessing.

autoplot(mars_results)

The MARS model delivers the best results with 15 to 20 terms and an interaction degree of 1, showing high accuracy, a low Brier score, and a strong ROC AUC. This configuration strikes a good balance between model complexity and predictive power, making it well-suited for real-world use. Adding more interaction complexity offers little improvement and isn’t needed for this dataset.
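
The trade-off between complexity and predictive power can be made explicit with a one-standard-error rule: pick the simplest configuration whose score is within one standard error of the best. This is a different selection rule from the select_best() call used in this report, and the cross-validation summary below is made up for illustration, not copied from mars_results.

```r
# Made-up cross-validation summary for illustration; not the report's results.
cv <- data.frame(
  num_terms = c(1, 5, 10, 15, 20),
  mean      = c(0.500, 0.800, 0.830, 0.842, 0.843),  # mean ROC AUC per setting
  std_err   = c(0.000, 0.012, 0.011, 0.010, 0.010)
)

# Best mean score and the threshold one standard error below it.
best <- cv[which.max(cv$mean), ]
threshold <- best$mean - best$std_err

# Simplest configuration whose mean score still clears that threshold.
within <- cv[cv$mean >= threshold, ]
chosen <- within[which.min(within$num_terms), ]
print(chosen)
```

With these toy numbers the rule picks 15 terms over 20. For real tuning results, the tune package provides select_by_one_std_err() to apply the same idea.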

mars_best <- select_best(mars_results, metric = "roc_auc")
mars_final_wf <- workflow() %>%
  add_model(mars_mod) %>%
  add_recipe(mars_recipe) %>%
  finalize_workflow(mars_best)

mars_final_model <- mars_final_wf %>%
  fit(data = train_data)

mars_final_model %>%
  extract_fit_parsnip() %>%
  vip(10, type = "rss")

The variable importance plot highlights behavioral and financial factors, namely tenure, total charges, and monthly charges, as key drivers of customer retention. Value-added services such as online security and tech support, along with long-term contracts, also rank among the strongest predictors of whether a customer stays.

Random Forest Model

set.seed(123)
rf_split <- initial_split(data, prop = 0.7, strata = Status)
rf_train <- training(rf_split)
rf_test <- testing(rf_split)

rf_recipe <- recipe(Status ~ ., data = rf_train) %>%
  step_dummy(all_nominal_predictors())

rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger")

set.seed(123)
rf_kfold <- vfold_cv(rf_train, v = 5)

results <- fit_resamples(rf_mod, rf_recipe, rf_kfold)

collect_metrics(results)

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.800     5 0.00591 Preprocessor1_Model1
## 2 brier_class binary     0.137     5 0.00350 Preprocessor1_Model1
## 3 roc_auc     binary     0.842     5 0.0105  Preprocessor1_Model1

The model performs well across the five folds: accuracy is 80.0%, the Brier score of 0.137 suggests reasonably well-calibrated probabilities, and the ROC AUC of 0.842 indicates strong discriminative ability. Small adjustments in calibration or feature engineering could further enhance performance.

rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger", importance = "impurity")

rf_param_grid <- grid_regular(
  trees(range = c(200, 2000)),  
  mtry(range = c(2, 20)),        
  min_n(range = c(2, 15)),       
  levels = 5                     
)

rf_tune_results <- tune_grid(
  rf_mod, 
  rf_recipe, 
  resamples = rf_kfold, 
  grid = rf_param_grid
)

show_best(rf_tune_results)

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.841     5  0.0108 Preprocessor1_Model1

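show_best() returns a single configuration rather than a ranking of grid candidates: rf_mod contains no tune() placeholders, so tune_grid() did not vary trees, mtry, or min_n and simply re-evaluated the default forest. Its ROC AUC of 0.841 is essentially the same as the 0.842 from fit_resamples() above. The model is effective for the classification task, but a genuine grid search requires setting those three arguments to tune() in rand_forest().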
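
A minimal sketch of a model spec with the three grid parameters marked for tuning, assuming the same ranger engine. The name rf_tunable is illustrative and is not an object from this report.

```r
library(tidymodels)

# Mark the three hyperparameters for tuning. Without tune() placeholders,
# tune_grid() has nothing to vary and evaluates a single configuration.
rf_tunable <- rand_forest(
  mode  = "classification",
  trees = tune(),
  mtry  = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity")

# List the parameters that are now flagged for tuning.
rf_params <- extract_parameter_set_dials(rf_tunable)
print(rf_params)
```

Passing rf_tunable instead of rf_mod to tune_grid() would evaluate the rows of rf_param_grid; with up to 125 combinations and 5 folds, expect a considerably longer run.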

rf_best <- select_best(rf_tune_results, metric = "roc_auc")

final_rf_wf <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_mod) %>%
  finalize_workflow(rf_best)

final_rf_fit <- final_rf_wf %>% fit(data = rf_train)

final_rf_fit %>% 
  extract_fit_parsnip() %>%
  vip::vip(num_features = 10) + 
  theme_minimal() + 
  labs(
    title = "Top 10 Important Features for Random Forest Model",
    x = "Importance",
    y = "Features"
  )

The feature importance analysis shows that customer commitment (tenure and contract type), financial investment (monthly and total charges), and service engagement (internet services, online security) are the key factors in predicting customer retention. These findings offer practical strategies for businesses to reduce churn and increase customer satisfaction. By prioritizing the retention of long-term, loyal customers and offering value-added services, businesses can strengthen customer loyalty and improve their profitability.

Confusion Matrix

rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger", importance = "impurity")

rf_predictions <- rf_mod %>%
  fit(Status ~ ., data = rf_train) %>%
  predict(rf_test)

rf_predictions %>%
  bind_cols(rf_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1379  251
##    Left        161  306

The Random Forest model performs fairly well, with an accuracy of 80.4% on the test set ((1379 + 306) / 2097), but there is room to improve its ability to predict churn. For the “Left” class, recall is only about 55% (306 of 557 departed customers were identified) and precision is about 66% (306 of 467 predicted departures were correct). Emphasis should be placed on raising both, particularly if catching churners is crucial for the business. Further model adjustments and methods for addressing class imbalance could improve performance.
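
The headline metrics can be recomputed directly from the confusion matrix printed above. A short base R sketch, with the four cell counts copied from that output:

```r
# Confusion matrix as reported above (rows = prediction, columns = truth).
cm <- matrix(c(1379, 251,
                161, 306),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("Current", "Left"),
                             Truth      = c("Current", "Left")))

accuracy       <- sum(diag(cm)) / sum(cm)                 # correct / all
recall_left    <- cm["Left", "Left"] / sum(cm[, "Left"])  # found / actually left
precision_left <- cm["Left", "Left"] / sum(cm["Left", ])  # correct / flagged as left

print(round(c(accuracy = accuracy,
              recall_left = recall_left,
              precision_left = precision_left), 3))
```

This gives an accuracy of 0.804, a recall of 0.549, and a precision of 0.655 for the “Left” class.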

Business Analysis & Conclusion

Key Factors Influencing Customer Retention

Our analysis identified five critical variables that significantly influence customer churn: Tenure, Total Charges, Monthly Charges, Payment Method, and Online Security. Among these, Tenure emerged as the most influential factor. Customers with longer tenure tend to show stronger loyalty, making them a key focus for retention efforts. By analyzing how these variables interact, we gained insights into which customer segments are at higher risk and how to better tailor our services.

Estimated Revenue Loss if No Action is Taken

To highlight the financial risk associated with customer churn, we estimated the potential monthly revenue loss:

\[ \text{Estimated Monthly Loss} = \frac{1000 + 1250}{2} \times 1000 = \$1,125,000 \]

Without a proactive retention strategy, Regork Telecom risks losing roughly $1.125 million per month, which emphasizes the urgency of acting on these insights.
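
The estimate is easy to re-run under different inputs. In the sketch below the three numbers are copied from the formula above; the variable names, and the reading of the final multiplier as a number of customers lost per month, are assumptions made for illustration.

```r
# Inputs copied from the formula above; their interpretation is an assumption.
price_low  <- 1000   # lower price point (USD)
price_high <- 1250   # upper price point (USD)
units      <- 1000   # multiplier in the formula, read here as customers lost per month

# Midpoint price times the multiplier.
estimated_monthly_loss <- (price_low + price_high) / 2 * units
print(format(estimated_monthly_loss, big.mark = ","))
```

Changing any of the three inputs shows how sensitive the headline figure is to the assumed price range and volume.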

Retention Incentive Strategy

Based on our findings, we propose the following incentive schemes to reduce churn:

Three-month free trial of the premium package for new customers.
Discounted subscriptions and added security features for long-tenured users.
Device protection plans focused on senior customers, a group that showed high sensitivity to security features.

These strategies are designed to increase customer satisfaction, boost perceived value, and encourage long-term engagement.

Implications & CEO Proposal

Our analysis shows that matching customer needs with product features—particularly in areas like device protection and subscription flexibility—can significantly improve retention. We recommend that the CEO support:

Personalized product bundling
Dynamic, data-driven pricing
Targeted marketing to high-risk segments

These actions will not only reduce churn but also position Regork Telecom as a customer-first, innovative brand in the Smart Home market.

Limitations & Opportunities for Improvement

While our Logistic Regression model achieved around 80% accuracy, it has limitations due to:

The unpredictability of customer behavior
Limited historical data for Smart Home TVs
Time and budget constraints

To enhance future analysis, we suggest:

Collecting more detailed behavioral data
Exploring ensemble or deep learning models
Conducting real-time A/B testing of retention strategies

These steps will help improve model accuracy and refine customer retention initiatives moving forward.

Presentation

https://lcob.mediaspace.kaltura.com/media/2025SpringSemester_BANA4080_Group7_KhoiPham_TrucHuynh_Final_Project/1_jynygxjx

Final Project Group 7

Truc Huynh, Khoi Pham

2025-04-22