BANA 4080 - Final Project

Introduction

Customer relation management is undoubtedly one of the key identifiers of any successful company. This includes not only incentives to attract new segments of customers but also retain existing client base.

As such, this project specifically aims to enhance customer retention for Regork’s telecommunication services as they have recently entered the field.

To realize this goal, I analyzed the dataset provided and visualized outstanding trends between the most influential predictor variables and the response variable. Following that direction, I assessed the data by harnessing 4 different machine learning algorithms including logistic regression, MARS (Multivariate Adaptive Regression Splines…), bagging, and random forest, then came up with the best model after comparing their respective AUC (Area Under the Curve).

$Regork - AI-generated illustration$

Regork - AI-generated illustration

Data Preparation & Exploratory Data Analysis

Data Preparation

Prior to visualizing the data and building the models, I installed and loaded the following packages:

tidyverse: to load suits of tools for data science, including the following packages
dplyr: to filter, summarize, join, reshape data
stringr: to simplify working with strings
tidymodels: to build workflows, preprocess data, and tune models
parsnip: to provide a tidy, consistent interface to different model engines
tune: to handle hyperparameter tuning using grid or Bayesian search, integrated with tidymodels workflows
yardstick: to compute performance metrics for classification and regression models
vip: to visualize feature importance for any model
pdp: to generate Partial Dependence Plots to show how features influence model predictions
glmnet: to fir regularized linear models
earth: to implement Multivariate Adaptive Regression Splines
baguette: to provide access to ensemble models like bagged trees
ranger: to implement random forests
ggplot2: to create graphics
scales: to control axis labels and legends in plots (formating units)
viridis: to provide colorblind-friendly color palettes for plots

library(tidyverse)
library(dplyr)
library(stringr)
library(lubridate)
library(ggplot2)
library(scales)  
library(viridis)
library(tidymodels)
library(vip)
library(glmnet)
library(earth)
library(yardstick)
library(baguette)
library(ranger)
library(tune)
library(parsnip)
library(pdp)

Additionally, I performed some data cleaning steps such as mutating the response variable, Status, and removing null values. This will more or less minimize errors and inconsistencies as well as amplify the performance of my models:

rdata <- read_csv("C:/Users/linhk/OneDrive/Desktop/BANA 4080 (DM) - SS25/Final Project/customer_retention.csv")
rdata <- mutate(rdata, Status = factor(Status))
rdata <- na.omit(rdata)

Exploratory Data Analysis

To commence with, I examined the number of customers who either stay or leave Regork over the course of their tenure. It appears that the company struggles with early customer retention. Long-term customers are loyal, and churn significantly drops as tenure increases.

# Aggregating data by Tenure and Status
agg_data <- rdata %>%
  group_by(Tenure, Status) %>%
  summarise(count = n()) %>%
  ungroup()
# Making sure Status is a factor with desired order
agg_data$Status <- factor(agg_data$Status, levels = c("Left", "Current"))
# Plotting the line graph
ggplot(agg_data, aes(x = Tenure, y = count, color = Status, group = Status)) +
  geom_line(size = 1) +  # Line graph
  geom_point(size = 1.4) +  # Optional: Add points for clarity
  labs(x = "Tenure (months)", y = "Customer Count", title = "Trends of Current and Left Customers by Tenure") +
  scale_color_viridis_d() +  
  theme(axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 13))

Building off of that, I faceted both figures by contract type. Most customers who leave the company early in their tenure tend to have signed Month-to-month contracts, meaning low tenure can trigger high churn risk. The situation for customers who stay is more divergent as their data stretches across all types of contracts. However, longer contracts (especially 2-year) typically seem to encourage long-term commitment.

ggplot(rdata, aes(x = Tenure, fill = Status)) +
  geom_bar(position = "dodge") +
  facet_wrap(~Contract) +
  labs(x = "Tenure (months)", y = "Customer Count", title = "Customer Tenure by Status and Contract Type") +
  scale_fill_viridis_d(option = "cividis") +  
  theme(axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 13))

When it comes to telecommunications, I believe it is helpful that we consider such factors as online security or payment methods. The following graph illustrates how many senior and non-senior customers who either do or do not have online security, according to whether or not they are senior citizens, stay or leave Regork. It is clear that senior customers who set up online security are proportionally less prone to churning than those who do not. The same goes for customers who are non-senior citizens, except that the their churn rate is significantly higher, especially when comparing seniors and non-seniors who neglect online security. Still, both segments primarily choose not to protect over protect their online ids.

ggplot(rdata, aes(x = OnlineSecurity, fill = Status)) +
  geom_bar() +
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(~SeniorCitizen) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Online Security", y = "Customer Count", title = "Online Security by Seniority of Citizenship") +
  scale_fill_viridis_d(option = "inferno") +  
  theme(axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 13))

This plot below depicts the number of clients who use different payment Methods based on whether or not they stream movies. Customers who stream movies are more prone to automatic and electronic payments than those who do not. Customers who don’t stream movies tend to use more traditional methods like through mailed checks.

ggplot(rdata, aes(x = StreamingMovies)) + 
  geom_bar(aes(fill = StreamingMovies)) +  # add aes(fill=...) to map colors
  facet_wrap(~PaymentMethod)+
  labs(x = "Streaming Movies", y = "Customer Count", title = "Whether Customers Stream Movies by Payment Methods") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_fill_viridis_d(option = "plasma") +  
  theme(axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 13))

Machine Learning

For each of the 4 models, I began by splitting the data into designated training and test sets and used 5-fold cross-validation to assess its performance. I also performed hyperparameter tuning on all the models, except for the logistical regression model, so as to achieve the maximum AUCs possible. Subsequently, I took into consideration their confusion matrices, explained how the models behave, and plotted their respective Receiver Operator Curve (ROC).

Logistic Regression - Most Optimal Model

The first and also most optimal model in records is the logistical regression model, with an AUC of approximately 0.865, the highest among all 4 models. The confusion matrix of this model shows that it does a better job at predicting customers who left compared to those who sticked around. However, it is also more biased towards at predicting “Current” customers who actually left (high False Positive) than predicting “Left” customers who ended up staying (lower False Negative).

# Splititng the data sets
set.seed(123)
logistic_split <- initial_split(rdata, prop = .7, strata = Status)
logistic_train <- training(logistic_split)
logistic_test <- testing(logistic_split)

# K-fold Validations
set.seed(123)
logistic_kfolds <- vfold_cv(logistic_train, v = 5, strata = Status)

logistic_reg() %>% 
  fit_resamples(Status ~ ., logistic_kfolds) %>%
  collect_metrics(summarize = FALSE) %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(.estimate))

# Final fit
logistic_final_fit <- logistic_reg() %>%
  fit(Status ~ ., data = logistic_train)

# Confusion matrix
logistic_final_fit %>%
   predict(logistic_test) %>%
   bind_cols(logistic_test %>% select(Status)) %>% 
   conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332

# AUC graph
logistic_final_fit %>%
  predict(logistic_train, type = "prob") %>%
  mutate(truth = logistic_train$Status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

I plotted the feature importance and found that Tenure is the most influential factor affecting the customer churn rate. As analyzed in the previous EDA section, Tenure bears a direct relation with whether or not customers stay in that the number of current customers spikes as they enter the high-tenure zone. Inversely, the number of customers who leave rather starts out high into the short-tenure range then sharply decreases towards the end. Regork can therefore focus on incentivizing long-tenure customers through better loyalty membership benefits, while enforcing more attractive offers to retain the low-tenure segments.

vip::vip(logistic_final_fit)

Demonstrated in this Partial Dependence Plot (PDP) as a smooth, consistent upward curve, the positive non-linear relation between Tenure and the probability of customers who stay is transparent. This once again assures that Tenure is a strong predictor of customer retention rate.

logistic_train$Status <- factor(logistic_train$Status, levels = c("Left", "Current"))
# Prediction function
# Step 1: Defining prediction function for 'Current'
pdp_pred_fun <- function(object, newdata) {
  probs <- predict(logistic_final_fit, new_data = newdata, type = "prob")
  mean(probs$.pred_Current)
}
# Step 2: Run partial() with dummy object (anything works here)
pdp_result <- pdp::partial(
  object = NULL,  # Dummy object
  pred.var = "Tenure",
  pred.fun = pdp_pred_fun,
  train = logistic_train,
  grid.resolution = 10
)
# Step 3: Plot
ggplot(pdp_result, aes(x = Tenure, y = yhat)) +
  geom_line(size = 1.2, color = "#2C77B8") +
  scale_y_continuous(name = "Predicted Probability of 'Current'") +
  labs(title = "Partial Dependence of Tenure on 'Current' Status") +
  theme_minimal()

Using cross-validation, the model balances well between underfitting and overfitting with an AUC of 0.865 and a generalization error of ~19%, providing meaningful insights on customer retention strategies Regork can start employing to thrive in this new field of telecommunications.

# Generalization Error on Test Set (Misclassification Rate)
logistic_test_preds <- logistic_final_fit %>%
  predict(new_data = logistic_test, type = "class") %>%
  bind_cols(logistic_test %>% select(Status))

# Computing misclassification rate
gen_error <- mean(logistic_test_preds$.pred_class != logistic_test_preds$Status)

# Displaying it
cat("Generalization Error (Misclassification Rate) on Test Set:", round(gen_error, 4), "\n")

## Generalization Error (Misclassification Rate) on Test Set: 0.1922

The possible loss in revenue on a monthly basis is as follows:

# Calculate the potential monthly loss
# Step 1: Filter churned customers
# 1. Generate predicted class for the entire rdata
# Get predicted class
test_pred_class <- predict(logistic_final_fit, new_data = logistic_test, type = "class")

# Mutate into the test set
logistic_test_with_class <- logistic_test %>%
  mutate(.pred_class = test_pred_class$.pred_class)

monthly_revenue_loss <- logistic_test_with_class %>%
  filter(.pred_class == "Left") %>%
  summarise(monthly_revenue_loss = sum(MonthlyCharges, na.rm = TRUE))

monthly_revenue_loss

Multivariate Adaptive Regression Splines (MARS)

This model achieves a comparably high AUC of 0.850 with hyperparameter tuning. The confusion matrix also shows that the model is better at detecting “Current” customers than “Left” customers, yet can be improved to identify churners more accurately (FP is higher than FN).

# Splitting the data sets
set.seed(123)
mars_split <- initial_split(rdata, prop = .7, strata = Status) 
mars_train <- training(mars_split)
mars_test <- testing(mars_split)
# Recipe
mars_recipe <- recipe(Status ~ ., data = mars_train)
# K-fold Validations
set.seed(123)
mars_kfolds <- vfold_cv(mars_train, v = 5, strata = "Status")
# Creating the model
mars_mod <- mars(num_terms = tune(),
                 prod_degree = tune()) %>%  
            set_mode("classification")
#Tuning
mars_grid <- grid_regular(num_terms(range = c(1,30)), prod_degree(), levels = 50)

mars_wf <- workflow() %>% add_recipe(mars_recipe) %>% add_model(mars_mod)

mars_results <- mars_wf %>% tune_grid(resamples = mars_kfolds, grid = mars_grid)

mars_results %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))

# Confusion matrix
# Step 1: Best Parameters
best_params <- mars_results %>% 
  select_best(metric = "roc_auc")
# Step 2: Finalize model with the best parameters
final_mars <- finalize_model(mars_mod, best_params)
# Step 3: Fit the finalized model
final_fit <- final_mars %>%
  fit(Status ~ ., data = mars_train)
final_fit %>%
   predict(mars_test) %>%
   bind_cols(mars_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1383  242
##    Left        157  315

mars_best <- select_best(mars_results, metric = "roc_auc")
mars_final_wf <- workflow() %>% 
  add_model(mars_mod) %>% add_formula(Status ~ .) %>% 
  finalize_workflow(mars_best)
# AUC graph
final_fit %>%
  predict(mars_train, type = "prob") %>%
  mutate(truth = mars_train$Status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

Tenure continues to be the most impactful predictor variables, followed closely by Total Charges and Monthly Charges.

# Most influential variables
mars_final_wf %>% 
  fit(data = mars_train) %>%
  extract_fit_parsnip() %>%
  vip(10, type = "rss")

Bagging

This model achieves a slightly lower AUC of 0.826 compared to the formers, even with hyperparameter tuning. The confusion matrix still shows that the model is better at detecting “Current” customers than “Left” customers, yet, similar to the earlier models, can be improved to identify churners more accurately (FP is higher than FN).

# Splitting the data sets
set.seed(123)
bag_split <- initial_split(rdata, prop = .7, strata = Status)
bag_train <- training(bag_split)
bag_test <- testing(bag_split)
# Creating the model
bag_mod <- bag_tree() %>%
  set_engine("rpart", times = 5) %>%
  set_mode("classification")
# Recipe
bag_recipe <- recipe(Status ~ ., data = bag_train)
# Cross-validation
set.seed(123)
bag_kfolds <- vfold_cv(bag_train, v = 5, strata = "Status")
# Training the model
bag_results <- fit_resamples(bag_mod, bag_recipe, bag_kfolds)
# Model results
collect_metrics(bag_results) %>% 
  filter(.metric == "roc_auc")

# Tuning option set for number of bagged trees
bag_mod <- bag_tree() %>%
  set_engine("rpart", times = tune()) %>%
  set_mode("classification")
# creating the hyperparameter grid
bag_hyper_grid <- expand.grid(times = c(5, 25, 50, 100, 200, 300))
# Training our model across the hyper parameter grid
set.seed(123)
bag_results <- tune_grid(bag_mod, bag_recipe, resamples = bag_kfolds, grid = bag_hyper_grid)
# Model results
show_best(bag_results, metric = "roc_auc")

# Getting best hyperparameter values
bag_best_model <- select_best(bag_results, metric = 'roc_auc')
# Putting together final workflow
bag_final_wf <- workflow() %>%
  add_recipe(bag_recipe) %>%
  add_model(bag_mod) %>%
  finalize_workflow(bag_best_model)
# Fitting final workflow across entire training data
bag_final_fit <- bag_final_wf %>%
  fit(data = bag_train)
# Determining model quality
bag_final_fit %>%
   predict(bag_test) %>%
   bind_cols(bag_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1333  260
##    Left        207  297

bag_final_fit %>% 
   predict(bag_train, type = "prob") %>%
   mutate(truth = bag_train$Status) %>%
   roc_curve(truth, .pred_Current) %>%
   autoplot()

For this model, Total Charges, Monthly Charges, and Tenure are respectively the 3 most influential predictors, slightly differentiating itself from the previous models.

vi <- bag_final_fit %>%
   extract_fit_parsnip() %>%
   .[['fit']] %>%
   var_imp() %>%
   slice(1:10)

ggplot(vi, aes(value, reorder(term, value))) +
   geom_col() +
   ylab(NULL) +
   xlab("Feature importance")

Random Forest

This model achieves a relatively high AUC of 0.841 with hyperparameter tuning. The confusion matrix shows that the model is far better at detecting “Current” customers than “Left” customers, yet, similar to the previous models, can be improved to identify churners more accurately (FP is much higher than FN).

# creating random forest model object
rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger") 
# Splitting the datasets
set.seed(123)
rf_split <- initial_split(rdata, prop = .7, strata = Status)
rf_train <- training(rf_split)
rf_test <- testing(rf_split)
# Recipe
rf_recipe <- recipe(Status ~ ., data = rf_train)
# Cross-validation
set.seed(123)
rf_kfolds <- vfold_cv(rf_train, v = 5, strata = "Status")
# Training the model
rf_results <- fit_resamples(rf_mod, rf_recipe, rf_kfolds)
# Model results
collect_metrics(rf_results) %>% 
  filter(.metric == "roc_auc")

# Tuning
rf_mod <- rand_forest(
  mode = "classification",
  trees = tune(),
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity")
# Creating the hyperparameter grid
rf_hyper_grid <- grid_regular(
   trees(range = c(100, 1000)),
   mtry(range = c(1, 50)),
   min_n(range = c(1, 20)),        
   levels = 5
   )
# Training model across the hyper parameter grid
set.seed(123)
rf_results <- tune_grid(rf_mod, rf_recipe, resamples = rf_kfolds, grid = rf_hyper_grid)
# Model results
show_best(rf_results, metric = "roc_auc")

# Getting best hyperparameter values
rf_best_model <- select_best(rf_results, metric = 'roc_auc')
# Putting together final workflow
rf_final_wf <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_mod) %>%
  finalize_workflow(rf_best_model)
# Fitting final workflow across entire training data
rf_final_fit <- rf_final_wf %>%
  fit(data = rf_train)
# Determining model quality
rf_final_fit %>%
   predict(rf_test) %>%
   bind_cols(rf_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1481  384
##    Left         59  173

rf_final_fit %>% 
   predict(rf_train, type = "prob") %>%
   mutate(truth = rf_train$Status) %>%
   roc_curve(truth, .pred_Current) %>%
   autoplot()

Unlike the previous models, Contract is the most impactful predictor variable, followed by other preponderant predictors like Tenure, Total Charges, and Monthly Charges.

rf_final_fit %>%
  extract_fit_parsnip() %>%
  vip::vip(num_features = 10)

Business Analysis & Conclusion

In terms of relative importance, the most influential predictor variable is Tenure, followed by Contract – Two year and Contract – One year, as explored in the data visualization step. This is important to know as a business manager, since there is proof as to how these variables may be associated with one another. Potential approaches could be that Regork targets more engaging incentives at customers in the short-tenure segment that sign month-to-month contracts since they are more likely to churn, while simultaneously offers exclusive member perks to retain the high-tenure, year-to-year contract group. This asks for careful cost-benefit analysis and long-run planning from Regork to calculate the exact tradeoffs and finalize feasible profit margins.

If no action is taken, Regork is potentially subject to loosing $40,283.05 per month. This is only a predictive figure based on the customers who may not eventually but rather are most likely to leave Regork.

To retain customers who are predicted to be leaving at some point, I suggest Regork proactively takes actions. Essentially, they should leverage their segmented client base well to understand different demands or patterns in how customers are using their services, thereby devise appropriate incentives for each customer segment. For example, they ought to configure anniversary gifts or loyalty privileges such as free add-ons, personalized packages for retained customers using a point-tracking or tier-tracking system or provide quicker telecommunication support through VIP-only channels. From the previous visualizations, they can also tailor their current services to different groups of customers, e.g., offer loyalty services bundling streamable movies and online security for senior clients, encourage contracts upgrade from month-to-month to longer-term with lower monthly costs and free add-ons, incentivize digital payments with first-month discounts or other benefits based on how Regork negotiates with their banking partners.

In conclusion, Regork Telecom services have a bright, profitable future ahead based on the modelling metrics presented in this report. Some factors they can consider to enforce amendments include customer tenure as well as the telecom service costs and other aspects like security, payment methods, etc. With more strategic incentives focusing closely on different consumer groups, Regork will undoubtedly observe lower customer churn rate.

Much as this analysis is thorough, it can be improved with more input surrounding customers who churned like the timing or reasons they left, to ensure my logics are not biased and external factors are not overlooked. Additionally, I’d like to explore more powerful modeling approaches that can capture and predict customer patterns more accurately.