Customer relation management is undoubtedly one of
the key identifiers of any successful company. This includes not only
incentives to attract new segments of customers but also retain existing
client base.
As such, this project specifically aims to enhance customer retention
for Regorkās telecommunication services as they have
recently entered the field.
To realize this goal, I analyzed the dataset provided and visualized
outstanding trends between the most influential predictor variables and
the response variable. Following that direction, I assessed the data by
harnessing 4 different machine learning algorithms including logistic
regression, MARS (Multivariate Adaptive Regression Splinesā¦), bagging,
and random forest, then came up with the best model after comparing
their respective AUC (Area Under the Curve).
Prior to visualizing the data and building the models, I installed
and loaded the following packages:
tidyverse: to load suits of tools for data
science, including the following packages
dplyr: to filter, summarize, join, reshape data
stringr: to simplify working with strings
tidymodels: to build workflows, preprocess data,
and tune models
parsnip: to provide a tidy, consistent interface
to different model engines
tune: to handle hyperparameter tuning using grid
or Bayesian search, integrated with tidymodels workflows
yardstick: to compute performance metrics for
classification and regression models
vip: to visualize feature importance for any
model
pdp: to generate Partial Dependence Plots to
show how features influence model predictions
glmnet: to fir regularized linear models
earth: to implement Multivariate Adaptive
Regression Splines
baguette: to provide access to ensemble models
like bagged trees
ranger: to implement random forests
ggplot2: to create graphics
scales: to control axis labels and legends in
plots (formating units)
viridis: to provide colorblind-friendly color palettes for plots
library(tidyverse)
library(dplyr)
library(stringr)
library(lubridate)
library(ggplot2)
library(scales)
library(viridis)
library(tidymodels)
library(vip)
library(glmnet)
library(earth)
library(yardstick)
library(baguette)
library(ranger)
library(tune)
library(parsnip)
library(pdp)
Additionally, I performed some data cleaning steps such as
mutating the response variable, Status, and removing null values. This
will more or less minimize errors and inconsistencies as well as amplify
the performance of my models:
rdata <- read_csv("C:/Users/linhk/OneDrive/Desktop/BANA 4080 (DM) - SS25/Final Project/customer_retention.csv")
rdata <- mutate(rdata, Status = factor(Status))
rdata <- na.omit(rdata)
To commence with, I examined the number of customers who either stay or leave Regork over the course of their tenure. It appears that the company struggles with early customer retention. Long-term customers are loyal, and churn significantly drops as tenure increases.
# Aggregating data by Tenure and Status
agg_data <- rdata %>%
group_by(Tenure, Status) %>%
summarise(count = n()) %>%
ungroup()
# Making sure Status is a factor with desired order
agg_data$Status <- factor(agg_data$Status, levels = c("Left", "Current"))
# Plotting the line graph
ggplot(agg_data, aes(x = Tenure, y = count, color = Status, group = Status)) +
geom_line(size = 1) + # Line graph
geom_point(size = 1.4) + # Optional: Add points for clarity
labs(x = "Tenure (months)", y = "Customer Count", title = "Trends of Current and Left Customers by Tenure") +
scale_color_viridis_d() +
theme(axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
plot.title = element_text(face = "bold", size = 13))
Building off of that, I faceted both figures by contract type. Most
customers who leave the company early in their tenure tend to have
signed Month-to-month contracts, meaning low tenure can trigger high
churn risk. The situation for customers who stay is more divergent as
their data stretches across all types of contracts. However, longer
contracts (especially 2-year) typically seem to encourage long-term
commitment.
ggplot(rdata, aes(x = Tenure, fill = Status)) +
geom_bar(position = "dodge") +
facet_wrap(~Contract) +
labs(x = "Tenure (months)", y = "Customer Count", title = "Customer Tenure by Status and Contract Type") +
scale_fill_viridis_d(option = "cividis") +
theme(axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
plot.title = element_text(face = "bold", size = 13))
When it comes to telecommunications, I believe it is helpful that
we consider such factors as online security or payment methods. The
following graph illustrates how many senior and non-senior customers who
either do or do not have online security, according to whether or not
they are senior citizens, stay or leave Regork. It is clear that senior
customers who set up online security are proportionally less prone to
churning than those who do not. The same goes for customers who are
non-senior citizens, except that the their churn rate is significantly
higher, especially when comparing seniors and non-seniors who neglect
online security. Still, both segments primarily choose not to protect
over protect their online ids.
ggplot(rdata, aes(x = OnlineSecurity, fill = Status)) +
geom_bar() +
scale_y_continuous(labels = scales::comma) +
facet_wrap(~SeniorCitizen) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Online Security", y = "Customer Count", title = "Online Security by Seniority of Citizenship") +
scale_fill_viridis_d(option = "inferno") +
theme(axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
plot.title = element_text(face = "bold", size = 13))
This plot below depicts the number of clients who use different
payment Methods based on whether or not they stream movies. Customers
who stream movies are more prone to automatic and electronic payments
than those who do not. Customers who donāt stream movies tend to use
more traditional methods like through mailed checks.
ggplot(rdata, aes(x = StreamingMovies)) +
geom_bar(aes(fill = StreamingMovies)) + # add aes(fill=...) to map colors
facet_wrap(~PaymentMethod)+
labs(x = "Streaming Movies", y = "Customer Count", title = "Whether Customers Stream Movies by Payment Methods") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_viridis_d(option = "plasma") +
theme(axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
plot.title = element_text(face = "bold", size = 13))
For each of the 4 models, I began by splitting the data into
designated training and test sets and used 5-fold cross-validation to
assess its performance. I also performed hyperparameter tuning on all
the models, except for the logistical regression model, so as to achieve
the maximum AUCs possible. Subsequently, I took into consideration their
confusion matrices, explained how the models behave, and plotted their
respective Receiver Operator Curve (ROC).
The first and also most optimal model in records is the logistical
regression model, with an AUC of approximately 0.865, the highest among
all 4 models. The confusion matrix of this model shows that it does a
better job at predicting customers who left compared to those who
sticked around. However, it is also more biased towards at predicting
āCurrentā customers who actually left (high False Positive) than
predicting āLeftā customers who ended up staying (lower False Negative).
# Splititng the data sets
set.seed(123)
logistic_split <- initial_split(rdata, prop = .7, strata = Status)
logistic_train <- training(logistic_split)
logistic_test <- testing(logistic_split)
# K-fold Validations
set.seed(123)
logistic_kfolds <- vfold_cv(logistic_train, v = 5, strata = Status)
logistic_reg() %>%
fit_resamples(Status ~ ., logistic_kfolds) %>%
collect_metrics(summarize = FALSE) %>%
filter(.metric == "roc_auc") %>%
arrange(desc(.estimate))
# Final fit
logistic_final_fit <- logistic_reg() %>%
fit(Status ~ ., data = logistic_train)
# Confusion matrix
logistic_final_fit %>%
predict(logistic_test) %>%
bind_cols(logistic_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1362 225
## Left 178 332
# AUC graph
logistic_final_fit %>%
predict(logistic_train, type = "prob") %>%
mutate(truth = logistic_train$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
I plotted the feature importance and found that Tenure is the most
influential factor affecting the customer churn rate. As analyzed in the
previous EDA section, Tenure bears a direct relation with whether or not
customers stay in that the number of current customers spikes as they
enter the high-tenure zone. Inversely, the number of customers who leave
rather starts out high into the short-tenure range then sharply
decreases towards the end. Regork can therefore focus on incentivizing
long-tenure customers through better loyalty membership benefits, while
enforcing more attractive offers to retain the low-tenure segments.
vip::vip(logistic_final_fit)
Demonstrated in this Partial Dependence Plot (PDP) as a smooth,
consistent upward curve, the positive non-linear relation between Tenure
and the probability of customers who stay is transparent. This once
again assures that Tenure is a strong predictor of customer retention
rate.
logistic_train$Status <- factor(logistic_train$Status, levels = c("Left", "Current"))
# Prediction function
# Step 1: Defining prediction function for 'Current'
pdp_pred_fun <- function(object, newdata) {
probs <- predict(logistic_final_fit, new_data = newdata, type = "prob")
mean(probs$.pred_Current)
}
# Step 2: Run partial() with dummy object (anything works here)
pdp_result <- pdp::partial(
object = NULL, # Dummy object
pred.var = "Tenure",
pred.fun = pdp_pred_fun,
train = logistic_train,
grid.resolution = 10
)
# Step 3: Plot
ggplot(pdp_result, aes(x = Tenure, y = yhat)) +
geom_line(size = 1.2, color = "#2C77B8") +
scale_y_continuous(name = "Predicted Probability of 'Current'") +
labs(title = "Partial Dependence of Tenure on 'Current' Status") +
theme_minimal()
Using cross-validation, the model balances well between underfitting
and overfitting with an AUC of 0.865 and a generalization error of ~19%,
providing meaningful insights on customer retention strategies Regork
can start employing to thrive in this new field of telecommunications.
# Generalization Error on Test Set (Misclassification Rate)
logistic_test_preds <- logistic_final_fit %>%
predict(new_data = logistic_test, type = "class") %>%
bind_cols(logistic_test %>% select(Status))
# Computing misclassification rate
gen_error <- mean(logistic_test_preds$.pred_class != logistic_test_preds$Status)
# Displaying it
cat("Generalization Error (Misclassification Rate) on Test Set:", round(gen_error, 4), "\n")
## Generalization Error (Misclassification Rate) on Test Set: 0.1922
The possible loss in revenue on a monthly basis is as follows:
# Calculate the potential monthly loss
# Step 1: Filter churned customers
# 1. Generate predicted class for the entire rdata
# Get predicted class
test_pred_class <- predict(logistic_final_fit, new_data = logistic_test, type = "class")
# Mutate into the test set
logistic_test_with_class <- logistic_test %>%
mutate(.pred_class = test_pred_class$.pred_class)
monthly_revenue_loss <- logistic_test_with_class %>%
filter(.pred_class == "Left") %>%
summarise(monthly_revenue_loss = sum(MonthlyCharges, na.rm = TRUE))
monthly_revenue_loss
This model achieves a comparably high AUC of 0.850 with
hyperparameter tuning. The confusion matrix also shows that the model is
better at detecting āCurrentā customers than āLeftā customers, yet can
be improved to identify churners more accurately (FP is higher than FN).
# Splitting the data sets
set.seed(123)
mars_split <- initial_split(rdata, prop = .7, strata = Status)
mars_train <- training(mars_split)
mars_test <- testing(mars_split)
# Recipe
mars_recipe <- recipe(Status ~ ., data = mars_train)
# K-fold Validations
set.seed(123)
mars_kfolds <- vfold_cv(mars_train, v = 5, strata = "Status")
# Creating the model
mars_mod <- mars(num_terms = tune(),
prod_degree = tune()) %>%
set_mode("classification")
#Tuning
mars_grid <- grid_regular(num_terms(range = c(1,30)), prod_degree(), levels = 50)
mars_wf <- workflow() %>% add_recipe(mars_recipe) %>% add_model(mars_mod)
mars_results <- mars_wf %>% tune_grid(resamples = mars_kfolds, grid = mars_grid)
mars_results %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean))
# Confusion matrix
# Step 1: Best Parameters
best_params <- mars_results %>%
select_best(metric = "roc_auc")
# Step 2: Finalize model with the best parameters
final_mars <- finalize_model(mars_mod, best_params)
# Step 3: Fit the finalized model
final_fit <- final_mars %>%
fit(Status ~ ., data = mars_train)
final_fit %>%
predict(mars_test) %>%
bind_cols(mars_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1383 242
## Left 157 315
mars_best <- select_best(mars_results, metric = "roc_auc")
mars_final_wf <- workflow() %>%
add_model(mars_mod) %>% add_formula(Status ~ .) %>%
finalize_workflow(mars_best)
# AUC graph
final_fit %>%
predict(mars_train, type = "prob") %>%
mutate(truth = mars_train$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
Tenure continues to be the most impactful predictor variables,
followed closely by Total Charges and Monthly Charges.
# Most influential variables
mars_final_wf %>%
fit(data = mars_train) %>%
extract_fit_parsnip() %>%
vip(10, type = "rss")
This model achieves a slightly lower AUC of 0.826 compared to the
formers, even with hyperparameter tuning. The confusion matrix still
shows that the model is better at detecting āCurrentā customers than
āLeftā customers, yet, similar to the earlier models, can be improved to
identify churners more accurately (FP is higher than FN).
# Splitting the data sets
set.seed(123)
bag_split <- initial_split(rdata, prop = .7, strata = Status)
bag_train <- training(bag_split)
bag_test <- testing(bag_split)
# Creating the model
bag_mod <- bag_tree() %>%
set_engine("rpart", times = 5) %>%
set_mode("classification")
# Recipe
bag_recipe <- recipe(Status ~ ., data = bag_train)
# Cross-validation
set.seed(123)
bag_kfolds <- vfold_cv(bag_train, v = 5, strata = "Status")
# Training the model
bag_results <- fit_resamples(bag_mod, bag_recipe, bag_kfolds)
# Model results
collect_metrics(bag_results) %>%
filter(.metric == "roc_auc")
# Tuning option set for number of bagged trees
bag_mod <- bag_tree() %>%
set_engine("rpart", times = tune()) %>%
set_mode("classification")
# creating the hyperparameter grid
bag_hyper_grid <- expand.grid(times = c(5, 25, 50, 100, 200, 300))
# Training our model across the hyper parameter grid
set.seed(123)
bag_results <- tune_grid(bag_mod, bag_recipe, resamples = bag_kfolds, grid = bag_hyper_grid)
# Model results
show_best(bag_results, metric = "roc_auc")
# Getting best hyperparameter values
bag_best_model <- select_best(bag_results, metric = 'roc_auc')
# Putting together final workflow
bag_final_wf <- workflow() %>%
add_recipe(bag_recipe) %>%
add_model(bag_mod) %>%
finalize_workflow(bag_best_model)
# Fitting final workflow across entire training data
bag_final_fit <- bag_final_wf %>%
fit(data = bag_train)
# Determining model quality
bag_final_fit %>%
predict(bag_test) %>%
bind_cols(bag_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1333 260
## Left 207 297
bag_final_fit %>%
predict(bag_train, type = "prob") %>%
mutate(truth = bag_train$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
For this model, Total Charges, Monthly Charges, and Tenure are
respectively the 3 most influential predictors, slightly differentiating
itself from the previous models.
vi <- bag_final_fit %>%
extract_fit_parsnip() %>%
.[['fit']] %>%
var_imp() %>%
slice(1:10)
ggplot(vi, aes(value, reorder(term, value))) +
geom_col() +
ylab(NULL) +
xlab("Feature importance")
This model achieves a relatively high AUC of 0.841 with
hyperparameter tuning. The confusion matrix shows that the model is far
better at detecting āCurrentā customers than āLeftā customers, yet,
similar to the previous models, can be improved to identify churners
more accurately (FP is much higher than FN).
# creating random forest model object
rf_mod <- rand_forest(mode = "classification") %>%
set_engine("ranger")
# Splitting the datasets
set.seed(123)
rf_split <- initial_split(rdata, prop = .7, strata = Status)
rf_train <- training(rf_split)
rf_test <- testing(rf_split)
# Recipe
rf_recipe <- recipe(Status ~ ., data = rf_train)
# Cross-validation
set.seed(123)
rf_kfolds <- vfold_cv(rf_train, v = 5, strata = "Status")
# Training the model
rf_results <- fit_resamples(rf_mod, rf_recipe, rf_kfolds)
# Model results
collect_metrics(rf_results) %>%
filter(.metric == "roc_auc")
# Tuning
rf_mod <- rand_forest(
mode = "classification",
trees = tune(),
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "impurity")
# Creating the hyperparameter grid
rf_hyper_grid <- grid_regular(
trees(range = c(100, 1000)),
mtry(range = c(1, 50)),
min_n(range = c(1, 20)),
levels = 5
)
# Training model across the hyper parameter grid
set.seed(123)
rf_results <- tune_grid(rf_mod, rf_recipe, resamples = rf_kfolds, grid = rf_hyper_grid)
# Model results
show_best(rf_results, metric = "roc_auc")
# Getting best hyperparameter values
rf_best_model <- select_best(rf_results, metric = 'roc_auc')
# Putting together final workflow
rf_final_wf <- workflow() %>%
add_recipe(rf_recipe) %>%
add_model(rf_mod) %>%
finalize_workflow(rf_best_model)
# Fitting final workflow across entire training data
rf_final_fit <- rf_final_wf %>%
fit(data = rf_train)
# Determining model quality
rf_final_fit %>%
predict(rf_test) %>%
bind_cols(rf_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1481 384
## Left 59 173
rf_final_fit %>%
predict(rf_train, type = "prob") %>%
mutate(truth = rf_train$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
Unlike the previous models, Contract is the most impactful predictor
variable, followed by other preponderant predictors like Tenure, Total
Charges, and Monthly Charges.
rf_final_fit %>%
extract_fit_parsnip() %>%
vip::vip(num_features = 10)
In terms of relative importance, the most influential
predictor variable is Tenure, followed by Contract ā
Two year and Contract ā One year, as explored in the data visualization
step. This is important to know as a business manager, since there is
proof as to how these variables may be associated with one another.
Potential approaches could be that Regork targets more engaging
incentives at customers in the short-tenure segment that sign
month-to-month contracts since they are more likely to churn, while
simultaneously offers exclusive member perks to retain the high-tenure,
year-to-year contract group. This asks for careful cost-benefit analysis
and long-run planning from Regork to calculate the exact tradeoffs and
finalize feasible profit margins.
If no action is taken, Regork is potentially subject to loosing
$40,283.05 per month. This is only a predictive figure
based on the customers who may not eventually but rather are most likely
to leave Regork.
To retain customers who are predicted to be leaving at some point, I
suggest Regork proactively takes actions. Essentially, they should
leverage their segmented client base well to understand different
demands or patterns in how customers are using their services, thereby
devise appropriate incentives for each customer segment. For example,
they ought to configure anniversary gifts or loyalty privileges such as
free add-ons, personalized packages for retained customers using a
point-tracking or tier-tracking system or provide quicker
telecommunication support through VIP-only channels. From the previous
visualizations, they can also tailor their current services to different
groups of customers, e.g., offer loyalty services bundling streamable
movies and online security for senior clients, encourage contracts
upgrade from month-to-month to longer-term with lower monthly costs and
free add-ons, incentivize digital payments with first-month discounts or
other benefits based on how Regork negotiates with their banking
partners.
In conclusion, Regork Telecom services have a bright, profitable
future ahead based on the modelling metrics presented in this report.
Some factors they can consider to enforce amendments include customer
tenure as well as the telecom service costs and other aspects like
security, payment methods, etc. With more strategic incentives focusing
closely on different consumer groups, Regork will undoubtedly observe
lower customer churn rate.
Much as this analysis is thorough, it can be improved with more input
surrounding customers who churned like the timing or reasons they left,
to ensure my logics are not biased and external factors are not
overlooked. Additionally, Iād like to explore more powerful modeling
approaches that can capture and predict customer patterns more
accurately.