1. Introduction

In order to ensure Regork’s success in its launching of a new telecommunications offering, we undertook comprehensive data analysis to derive customer retention initiatives. Retaining existing customers is arguably one of the most potent challenges in the telecommunications industry, wherein the cost of acquisition is usually either higher than or much higher than the cost of retention. The aim of this project includes predictive model formulation that depicted at-risk customers, giving Regork the avenue to intervene in such cases and avoid customer attrition.

In order to reach this goal, we began with exploratory data analysis (EDA) in order to extract out trends and patterns existing in the data supplied. This step includes visual insight into the individual and combined effects of various customer attributes in terms of their relation to churn behavior. After the EDA, we engaged in the completion of building, training, and testing three models with machine learning: logistic regression, regularized regression, and random forest. These models were chosen because of their different handling of data complexities and interpretations that matter for business.

The models and their mistake rates were assessed on an AUC-based evaluation. This way, we could effectively compare their predictive accuracy and reliability. By way of these, the major predictors influencing customer retention were to be identified, from which presentation and recommendations were made for the Regork leadership teams when making strategic decisions.

2. Packages required

Here’s a list packages used in our analysis

library(tidymodels)
library(tidyverse)
library(vip)
library(kernlab)
library(ggplot2)
library(dplyr)
if (!require("ggmosaic")) install.packages("ggmosaic", dependencies = TRUE)
library(ggmosaic)
if (!require("ranger")) install.packages("ggmosaic", dependencies = TRUE)
library(ranger)
if (!require("yardstick")) install.packages("ggmosaic", dependencies = TRUE)
library(yardstick)
if (!require("tune")) install.packages("ggmosaic", dependencies = TRUE)
library(tune)

3. Data preparation & Exploratory Data Analysis

A. Data preparation

The following data will be stored to global environment and removed records containing missing values, ensuring a cleaner data set for analysis

retention <- read.csv("C:/Users/thanh/OneDrive - University of Cincinnati/Nie/F24_Data mining/customer_retention.csv")
retention <- mutate(retention, Status = factor(Status))
retention <- na.omit(retention)

B. Exploratory Data Analysis

We begin by exploring some underlying trends within the dataset.

# Aggregate data for Senior Citizen and Status
senior_citizen_data <- retention %>%
  group_by(SeniorCitizen, Status) %>%
  summarise(Count = n())

# Grouped bar chart
ggplot(senior_citizen_data, aes(x = SeniorCitizen, y = Count, fill = Status)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Senior Citizen Distribution by Customer Status",
    x = "Senior Citizen (0 = No, 1 = Yes)",
    y = "Count"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("#6699CC", "#FF6666"))

The chart displays how the disengagement credit varies between senior citizens and others. It presents whether the senior category has any obvious influence in staying or departing from the company. Most sales showed that, if there are higher churn rates among senior citizens, it would denote their dissatisfaction with the services or that they found them less accessible. Recork will then want to devise targeted campaigns around improving retention among senior customers, perhaps by making plans a bit more affordable or providing special discounts, or through dedicated service truly to enhance customer satisfaction.

# Prepare data for Streaming Services Usage
streaming_data <- retention %>%
  mutate(StreamingUsage = case_when(
    StreamingTV == "Yes" & StreamingMovies == "Yes" ~ "Both",
    StreamingTV == "Yes" ~ "StreamingTV Only",
    StreamingMovies == "Yes" ~ "StreamingMovies Only",
    TRUE ~ "None"
  )) %>%
  drop_na(StreamingUsage, Status)  # Remove rows with missing values

# Ensure factors
streaming_data$StreamingUsage <- factor(streaming_data$StreamingUsage)
streaming_data$Status <- factor(streaming_data$Status)

# Create a mosaic plot using ggmosaic
ggplot(data = streaming_data) +
  geom_mosaic(aes(weight = 1, x = product(StreamingUsage, Status), fill = Status)) +
  labs(
    title = "Streaming Services Usage by Customer Status",
    x = "Streaming Services",
    y = "Proportion"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("#88CCEE", "#CC6677"))

This chart examines how customers’ choice to stay with or to leave a streaming subscription (either TV or movies) is conditioned by their view of streaming services. For example, people with both unsubscribed TV and streaming movies might churn less frequently, in a manner indicating that bundled entertainment options appeal to them. Conversely, if those who have no streaming services are churning at a higher rate, there may be an opportunity here for Regork to upsell or promote those services in order to retain customers. Learning about those preferences would assist the company in marketing the packages and improving service delivery.

# Aggregate data for Tech Support
tech_support_data <- retention %>%
  group_by(TechSupport, Status) %>%
  summarise(Count = n())

# Dot plot
ggplot(tech_support_data, aes(x = TechSupport, y = Count, color = Status)) +
  geom_point(size = 4) +
  labs(
    title = "Tech Support Subscription by Customer Status",
    x = "Tech Support (Yes/No)",
    y = "Count"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("#117733", "#882255"))

The chart looks into the correlation between tech helping subscriptions and customer retention. When customers without technology help show higher churn rates, it may lead to the conclusion that the lack of assistance brings about their dissatisfaction. Conversely, those who subscribe to tech helping and stay could show that assistance resolving technical issues is vital. Regork could look to promote the tech helping program to be a value-adding service on an included trial basis to allow customers to get converted to use it on a continued basis.

4. Machine learning

Logistics regression

# Set random seed for reproducibility
set.seed(123)

# Split data into training and test sets
split <- initial_split(retention, prop = 0.7, strata = Status)
train_data <- training(split)
test_data <- testing(split)

# Logistic Regression Model
logistic_model <- logistic_reg(mode = "classification") %>%
  set_engine("glm")

# 5-fold Cross-validation
logistic_cv <- vfold_cv(train_data, v = 5, strata = Status)

# Train the model with resampling
logistic_results <- fit_resamples(
  logistic_model,
  Status ~ ., 
  resamples = logistic_cv,
  metrics = metric_set(roc_auc, accuracy)
)

# Collect metrics
logistic_metrics <- collect_metrics(logistic_results)
logistic_metrics
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.799     5 0.00449 Preprocessor1_Model1
## 2 roc_auc  binary     0.845     5 0.00590 Preprocessor1_Model1
# Final logistic regression model
final_logistic_fit <- logistic_model %>% fit(Status ~ ., data = train_data)

With an AUC of 0.8445, the model can reliably identify customers likely to churn. However, the AUC value for logistic regression will need to be compared against other models (e.g., Random Forest and Decision Tree) to determine if it’s the best-performing model.

Random Forest Model

# Specify the Random Forest Model
rf_model <- rand_forest(
  mode = "classification",  # Classification task
  mtry = tune(),            # Tune the number of predictors to sample
  trees = tune()            # Tune the number of trees
) %>%
  set_engine("ranger")       # Use the ranger engine for Random Forest

# Set up Cross-Validation for Random Forest
rf_cv <- vfold_cv(train_data, v = 5, strata = Status)  # 5-fold cross-validation with stratification

# Define a grid of hyperparameters to tune
rf_grid <- grid_random(
  trees(range = c(100, 500)),  # Number of trees
  mtry(range = c(1, 10)),      # Number of predictors sampled
  size = 10                    # Number of random combinations
)

# Tune the Random Forest model with the grid
set.seed(123)
rf_results <- tune_grid(
  rf_model,
  Status ~ .,              # Model formula
  resamples = rf_cv,       # Cross-validation folds
  grid = rf_grid,          # Hyperparameter grid
  metrics = metric_set(roc_auc, accuracy)  # Evaluation metrics
)

# Collect and display the metrics for all parameter combinations
rf_metrics <- collect_metrics(rf_results)
print(rf_metrics)  # Print the collected metrics to identify the AUC
## # A tibble: 20 × 8
##     mtry trees .metric  .estimator  mean     n std_err .config              
##    <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
##  1     6   479 accuracy binary     0.797     5 0.00386 Preprocessor1_Model01
##  2     6   479 roc_auc  binary     0.835     5 0.00339 Preprocessor1_Model01
##  3     9   230 accuracy binary     0.794     5 0.00411 Preprocessor1_Model02
##  4     9   230 roc_auc  binary     0.831     5 0.00239 Preprocessor1_Model02
##  5     4   378 accuracy binary     0.797     5 0.00243 Preprocessor1_Model03
##  6     4   378 roc_auc  binary     0.838     5 0.00401 Preprocessor1_Model03
##  7     9   453 accuracy binary     0.794     5 0.00299 Preprocessor1_Model04
##  8     9   453 roc_auc  binary     0.831     5 0.00310 Preprocessor1_Model04
##  9     1   191 accuracy binary     0.785     5 0.00464 Preprocessor1_Model05
## 10     1   191 roc_auc  binary     0.839     5 0.00456 Preprocessor1_Model05
## 11     8   214 accuracy binary     0.794     5 0.00394 Preprocessor1_Model06
## 12     8   214 roc_auc  binary     0.833     5 0.00381 Preprocessor1_Model06
## 13     2   298 accuracy binary     0.800     5 0.00484 Preprocessor1_Model07
## 14     2   298 roc_auc  binary     0.843     5 0.00397 Preprocessor1_Model07
## 15     1   231 accuracy binary     0.787     5 0.00398 Preprocessor1_Model08
## 16     1   231 roc_auc  binary     0.840     5 0.00526 Preprocessor1_Model08
## 17     6   311 accuracy binary     0.798     5 0.00331 Preprocessor1_Model09
## 18     6   311 roc_auc  binary     0.835     5 0.00396 Preprocessor1_Model09
## 19     2   158 accuracy binary     0.799     5 0.00265 Preprocessor1_Model10
## 20     2   158 roc_auc  binary     0.842     5 0.00464 Preprocessor1_Model10
# Select the best parameters based on AUC
rf_best <- select_best(rf_results, metric = "roc_auc")

# Finalize the Random Forest Workflow with the best parameters
final_rf_wf <- workflow() %>%
  add_model(rf_model) %>%
  add_formula(Status ~ .) %>%
  finalize_workflow(rf_best)

# Fit the finalized model to the training data
final_rf_fit <- final_rf_wf %>% fit(data = train_data)

# Evaluate the final model's AUC on the test set
rf_preds <- predict(final_rf_fit, new_data = test_data, type = "prob")
rf_auc <- rf_preds %>%
  bind_cols(test_data %>% select(Status)) %>%
  roc_auc(truth = Status, .pred_Current)

# Print the AUC for the test set
print(rf_auc)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.843

An AUC of 0.8421 is a good result, indicating that the model is effective at distinguishing between classes.This suggests the model has a high true positive rate and a low false positive rate for most thresholds

Decision Tree Model

# Decision Tree Model
dt_model <- decision_tree(cost_complexity = tune(), tree_depth = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# Cross-validation for Decision Tree
dt_cv <- vfold_cv(train_data, v = 5, strata = Status)

# Hyperparameter grid for Decision Tree
dt_grid <- grid_regular(
  cost_complexity(),
  tree_depth(range = c(1, 10)), # Set a range for tree depth
  levels = 5
)

# Tune Decision Tree
set.seed(123)
dt_results <- tune_grid(
  dt_model,
  Status ~ .,
  resamples = dt_cv,
  grid = dt_grid,
  metrics = metric_set(roc_auc, accuracy) # Ensure AUC is included as a metric
)

# Collect metrics to review AUC for each combination
dt_metrics <- collect_metrics(dt_results)

# Print metrics to see AUC values
print(dt_metrics)
## # A tibble: 50 × 8
##    cost_complexity tree_depth .metric  .estimator  mean     n  std_err .config  
##              <dbl>      <int> <chr>    <chr>      <dbl> <int>    <dbl> <chr>    
##  1    0.0000000001          1 accuracy binary     0.734     5 0.000137 Preproce…
##  2    0.0000000001          1 roc_auc  binary     0.5       5 0        Preproce…
##  3    0.0000000178          1 accuracy binary     0.734     5 0.000137 Preproce…
##  4    0.0000000178          1 roc_auc  binary     0.5       5 0        Preproce…
##  5    0.00000316            1 accuracy binary     0.734     5 0.000137 Preproce…
##  6    0.00000316            1 roc_auc  binary     0.5       5 0        Preproce…
##  7    0.000562              1 accuracy binary     0.734     5 0.000137 Preproce…
##  8    0.000562              1 roc_auc  binary     0.5       5 0        Preproce…
##  9    0.1                   1 accuracy binary     0.734     5 0.000137 Preproce…
## 10    0.1                   1 roc_auc  binary     0.5       5 0        Preproce…
## # ℹ 40 more rows
# Select the best parameters based on AUC
dt_best <- select_best(dt_results, metric = "roc_auc")

# Finalize Decision Tree Workflow with Best Parameters
final_dt_wf <- workflow() %>%
  add_model(dt_model) %>%
  add_formula(Status ~ .) %>%
  finalize_workflow(dt_best)

# Fit the final Decision Tree model
final_dt_fit <- final_dt_wf %>% fit(data = train_data)

# Predict on test data
dt_preds <- predict(final_dt_fit, new_data = test_data, type = "prob")

# Calculate AUC on the test set
dt_auc <- dt_preds %>%
  bind_cols(test_data %>% select(Status)) %>%
  roc_auc(truth = Status, .pred_Current)

# Print the test set AUC
print(dt_auc)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.826

The AUC values for the Decision Tree model in this table are consistently 0.5000, which is the worst-case scenario for a classification model.

Model Evaluation

# Load necessary libraries
library(yardstick)
library(dplyr)

# Ensure the truth column is a factor with correct levels
test_data <- test_data %>%
  mutate(Status = factor(Status, levels = c("Current", "Left")))

# Logistic Regression: Predict on test data and compute ROC AUC
logistic_preds <- predict(final_logistic_fit, new_data = test_data, type = "prob")
logistic_auc <- logistic_preds %>%
  bind_cols(test_data %>% select(Status)) %>%
  roc_auc(truth = Status, .pred_Current)

# Random Forest: Predict on test data and compute ROC AUC
rf_preds <- predict(final_rf_fit, new_data = test_data, type = "prob")
rf_auc <- rf_preds %>%
  bind_cols(test_data %>% select(Status)) %>%
  roc_auc(truth = Status, .pred_Current)

# Decision Tree: Predict on test data and compute ROC AUC
dt_preds <- predict(final_dt_fit, new_data = test_data, type = "prob")
dt_auc <- dt_preds %>%
  bind_cols(test_data %>% select(Status)) %>%
  roc_auc(truth = Status, .pred_Current)

# Combine results into a summary table
evaluation_results <- tibble(
  Model = c("Logistic Regression", "Random Forest", "Decision Tree"),
  ROC_AUC = c(logistic_auc$.estimate, rf_auc$.estimate, dt_auc$.estimate)
)

# Print evaluation results
print(evaluation_results)

Based on the provided AUC values, Logistic Regression has the best performance as the AUC value of 0.8447428 is the highest for customer churn prediction. Interpretability is important for actionable insights in customer churn predictions. This allows for recommendations such as, “tenure has the highest impact on retention.” Logistic Regression is less computationally intensive with lower operational overhead than Random Forest and, therefore, easier to deploy and maintain.

To evaluate the accuracy and reliability of the logistic regression model, we constructed a confusion matrix to examine its performance by identifying false positives and false negatives in the predictions compared to the actual data. Confusion Matrix for Final Model

# Generate predictions for the test data
logistic_pred_class <- predict(final_logistic_fit, new_data = test_data, type = "class")

# Construct a confusion matrix
logistic_conf_matrix <- logistic_pred_class %>%
  bind_cols(test_data %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

# Print the confusion matrix
print(logistic_conf_matrix)

Optimal Model: Logistic Regression Top 3 Most Influential Features of Customer Behavior

# Feature importance (optional with vip)
vip(final_logistic_fit)

Generalization error

# Evaluate model on test set
test_predictions <- predict(final_logistic_fit, new_data = test_data, type = "class")
test_accuracy <- mean(test_predictions$.pred_class == test_data$Status)

# Retrieve cross-validation accuracy
cv_accuracy <- logistic_metrics %>% filter(.metric == "accuracy") %>% pull(mean)

# Print results
cat("Test Set Accuracy:", test_accuracy, "\n")
cat("Cross-Validation Accuracy:", cv_accuracy, "\n")

A slight drop (~1.4%) is acceptable and indicates the model generalizes well to unseen data.

As a person responsible for making business decisions, what else are you learning from the observations in this section? The analysis shows that the chosen logistic regression model performed incredibly well on training data, indicating a good ability to capture distinct patterns of customer behavior. On the other hand, when evaluated on the test set, the model gave some consistent results, indicating limited generalization error, which means it had not overfitted on training data and would behave more reliably while predicting unseen data. These results inspire confidence on the robust prediction and guided decisions based on the data to unfold in further scenarios.

5. Business Analysis and Conclusion

Most Influential Predictors and Business Focus

Per our model, the critical predictors influencing customers’ behavior are:

  • Tenure: Customers in the initial stage of their lifetime in your segment are prone to churn. This indicates that latter users have established a certain level of loyalty and are thus less sensitive to churn. Focus: Onboarding and personalized programs to foster better first interaction with new customers.

  • Contract Type (One-Year and Two-Year) : Churn occurs predominantly in month-to-month contracts Focus: Encourage customers to shift from month-to-month to either one-year or two-year contracts through enticing discounts and incentives, such as adding extra features free of charge.

  • Monthly Charges: High monthly charges can contribute to higher churn incidence. Focus: Reassess pricing strategies and offer tailored discounts or customized plans for high-spending customers to maintain satisfaction. Focus: Rethink price optimization strategies and offer special promotional discounts or customized services to always high-value customers so as to keep them satisfied.

Customers predicted to leave

Using our model, we filtered the test data set for customers who are highly likely to leave. The main traits of such customers are:

-They tend to have shorter tenure with the company. -Customers on month-to-month contracts. -Customers paying high monthly charges. This group presents substantial risk for churn because of the combination of higher costs and shorter commitment—all retention efforts.

Predicted loss in Revenue

With respect to revenue, the expected monthly loss is approximately $450,375.20 in the event that no action is undertaken. This estimation is achieved by summing up the monthly revenues from each customer likely to churn. Though this is only an estimate, it nevertheless alerts us of the considerable importance of acting as soon as possible with respect to addressing customer dissatisfaction if we are to avert significant financial consequences.

Proposed incentive scheme

To keep these high-risk customers, the following strategies are being proposed:

  • Loyalty Discounts: Tiers for a reward program that offer customers discounts depending on time spent with the company or sums spent. For instance, customers on a one-year contract would receive a 10% discount; a 15% discount would apply to two-year contracts.
  • Personalized Offers: Provide specific, data-driven promotions based on a customer’s unique usage patterns (services most frequently selected).
  • Recognition Campaigns: ecognition campaigns to celebrate customer milestones (like anniversaries or birthdays) via small rewards, like discounts or gift cards, enhance good-will.
  • Feedback for Rewards: Get customer feedback from those who are at risk of leaving in exchange for minor rewards, such as discounts on bills. Such engagement reinforces the customer’s feeling of being appreciated and may provide an opportunity to remedy a situation leading towards churn.

Conclusion

Our study shows that tenure, contract type, and monthly charges are the main drivers of churn. Implementing retention mechanisms through various incentives to engage with the customer would address these problems and potentially enhance their retention. While this would entail an initial outlay, the savings on churn and an improved customer relationship would far outweigh the associated costs in the long run.

Implications - It will be necessary to monitor, work with Finance to determine the maximum discount possible without seriously undermining profit. - Retention strategies should always be tweaked according to customer feedback and continuous model predictions.

Limitation Although with useful insights, this analysis does not go deeper to explore specific demographic qualities attached to age, income levels or regional behaviours which might further give insight into churn behaviour. Adding data sources could help refine future models with better precision and higher granularity.