To support Regork with their new telecommunications service, we developed multiple models to predict customer retention. Our approach began with exploring trends within the provided dataset. From there, we built, trained, and tested four distinct machine learning models: logistic regression, regularized regression, decision tree, and random forest. The performance of each model was evaluated using the area under the curve (AUC) metric.
This is every package that was used for the analysis.
library(tidymodels)
library(tidyverse)
library(ranger)
library(baguette)
library(earth)
library(pdp)
library(vip)
library(ggplot2)
library(kernlab)
library(here)
tidyverse: A collection of packages designed for data
science, providing tools for data manipulation, visualization, and
analysis.
vip: A package for constructing variable importance plots to
interpret machine learning models.
tidymodels: Allows you to perform discrete parts of the
machine learning workflow with discrete packages, focusing on model
building, tuning, and evaluation.
pdp: A package for constructing partial dependence plots and
individual conditional expectation curves to understand model
predictions.
ranger: A fast implementation of random forests and other
tree-based models for regression and classification tasks.
baguette: A package for tree and rule-based models,
particularly focused on bagging methods like random forests and bagged
decision trees.
earth: A package for fitting multivariate adaptive
regression splines (MARS) models, used for flexible non-linear
regression.
ggplot2: A data visualization package that implements the
grammar of graphics, allowing for flexible and elegant creation of
plots.
kernlab: A package for kernel-based machine learning
methods, particularly for support vector machines and other kernel-based
models.
here: A package for managing file paths in R, ensuring code
portability by creating relative paths based on the project’s root
directory.
retention <- read.csv("customer_retention.csv")
retention <- mutate(retention, Status = factor(Status))
retention <- na.omit(retention)
Reading the customer data and saving as “retention”.
The graphs below show the most important trends the current data has. The first one shows the correlation between payment method and the length of tenure. Credit card and bank transfer tend to go the longest due to how easy and convenient it is to pay.
ggplot(retention, aes(PaymentMethod)) +
geom_bar(fill = "gold") +
facet_wrap(~Contract) +
coord_flip() +
ggtitle("Customer Payment Method and Length of Contract") +
labs(y = "Count of Contract Type", x = "Customer Payment Method")
The second and third graphs display how gender is divided up
among the internet service plans and phone service plans. Overall they
generally seem to be evenly split.
retention %>%
ggplot(aes(x=InternetService, fill=Gender))+
geom_bar() +
labs(title = "Gender Count by Internet Service Use")
retention %>%
ggplot(aes(PhoneService, fill=Gender))+
geom_bar() +
labs(title = "Gender Count and Phone Service Use")
This final graph shows us the actual commitment customers made
based on the contract they signed up for. Customers who did month by
month dropped off early then the other two. While many customers who
opted for the 2 year contract ended up making it to five plus years.
ggplot(retention, aes(Tenure)) +
geom_bar(fill = "steelblue2") +
facet_wrap(~Contract) +
ggtitle("Tenure vs. Type of Contract ") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(y = "Count of Customers", x = "Length of Tenure (months)")
We ran 4 distinct statistical models to predict retention rate:
Logistic Regression
In this section, we build and evaluate a logistic regression model for customer retention.
set.seed(456)
retention_split <- initial_split(retention, prop = .7, strata = Status)
retention_train <- training(retention_split)
retention_test <- testing(retention_split)
set.seed(456)
logistic_kfolds <- vfold_cv(retention_train, v = 5, strata = Status)
logistic_reg() %>%
fit_resamples(Status ~ ., logistic_kfolds) %>%
collect_metrics()
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.802 5 0.00408 Preprocessor1_Model1
## 2 brier_class binary 0.135 5 0.00172 Preprocessor1_Model1
## 3 roc_auc binary 0.847 5 0.00355 Preprocessor1_Model1
The area under the curve for this logistic regression model was 0.847
Multivariate Adaptive Regression Spline
MARS creates piecewise linear regression models by recursively partitioning the data and fitting linear models within each region.
set.seed(456)
mars_split <- initial_split(retention, prop = .7, strata = Status)
mars_train <- training(mars_split)
mars_test <- testing(mars_split)
mars_recipe <- recipe(Status ~ ., data = mars_train)
set.seed(456)
mars_kfolds <- vfold_cv(mars_train, v = 5, strata = "Status")
mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
set_mode("classification")
mars_grid <- grid_regular(num_terms(range = c(1,30)), prod_degree(), levels = 50)
mars_wf <- workflow() %>% add_recipe(mars_recipe) %>% add_model(mars_mod)
mars_results <- mars_wf %>% tune_grid(resamples = mars_kfolds, grid = mars_grid)
mars_results %>% collect_metrics() %>% filter(.metric == "roc_auc") %>%
arrange(desc(mean))
## # A tibble: 60 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 1 roc_auc binary 0.850 5 0.00404 Preprocessor1_M…
## 2 16 1 roc_auc binary 0.850 5 0.00410 Preprocessor1_M…
## 3 19 1 roc_auc binary 0.850 5 0.00417 Preprocessor1_M…
## 4 20 1 roc_auc binary 0.850 5 0.00417 Preprocessor1_M…
## 5 21 1 roc_auc binary 0.850 5 0.00417 Preprocessor1_M…
## 6 22 1 roc_auc binary 0.850 5 0.00417 Preprocessor1_M…
## 7 23 1 roc_auc binary 0.850 5 0.00417 Preprocessor1_M…
## 8 24 1 roc_auc binary 0.850 5 0.00417 Preprocessor1_M…
## 9 25 1 roc_auc binary 0.850 5 0.00417 Preprocessor1_M…
## 10 26 1 roc_auc binary 0.850 5 0.00417 Preprocessor1_M…
## # ℹ 50 more rows
The area under the curve for this Multivariate Adaptive
Regression Spline model was 0.850
autoplot(mars_results)
These plot the results of the grid search to analyze how the performance
of the model changes with different hyperparameter values
mars_best <- select_best(mars_results, metric = "roc_auc")
mars_final_wf <- workflow() %>%
add_model(mars_mod) %>% add_formula(Status ~ .) %>%
finalize_workflow(mars_best)
mars_final_wf %>%
fit(data = mars_train) %>%
extract_fit_parsnip() %>%
vip(10, type = "rss")
Here we use the vip function to display the top 10 features that were most important in predicting customer retention based on our model. This helps us understand which variables (such as tenure, charges, or payment method) had the most influence on the model’s predictions.
Decision Tree
In this section, we build and evaluate a decision tree model for customer retention.
retention_mod <- decision_tree(mode = "classification") %>%
set_engine("rpart")
retention_recipe <- recipe(Status ~ ., data = retention_train)
retention_fit <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(retention_mod) %>%
fit(data = retention_train)
retention_fit
rpart.plot::rpart.plot(retention_fit$fit$fit$fit, roundint = FALSE)
Here, we visualize the trained decision tree. This helps us
understand how the decision tree makes its predictions. Each node in the
tree represents a decision based on a feature (like “Tenure” or
“Contract type”), and the branches show how the decision splits based on
the feature values. The tree helps us interpret how various customer
features affect retention.
# create resampling procedure
set.seed(456)
kfold <- vfold_cv(retention_train, v = 5)
# train model
retention_results <- fit_resamples(retention_mod, retention_recipe, kfold)
# model results
collect_metrics(retention_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.796 5 0.00396 Preprocessor1_Model1
## 2 brier_class binary 0.144 5 0.00231 Preprocessor1_Model1
## 3 roc_auc binary 0.809 5 0.00407 Preprocessor1_Model1
We set up a 5-fold cross-validation to evaluate the model’s
performance. This means the data is split into 5 subsets, and the model
is trained and tested 5 times, each time with a different subset of data
used as the test set and the remaining data used for training. The
results are collected, and we evaluate model performance using AUC.
Tuning
Next, we perform hyperparameter
tuning to optimize the performance of the decision tree.
retention_mod <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
retention_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
set.seed(456)
retention_results <- tune_grid(retention_mod, retention_recipe, resamples = kfold, grid = retention_hyper_grid)
show_best(retention_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 8 30 roc_auc binary 0.814 5 0.00472
## 2 0.0000000178 8 30 roc_auc binary 0.814 5 0.00472
## 3 0.00000316 8 30 roc_auc binary 0.814 5 0.00472
## 4 0.000562 8 30 roc_auc binary 0.814 5 0.00486
## 5 0.0000000001 11 30 roc_auc binary 0.814 5 0.00474
## # ℹ 1 more variable: .config <chr>
After tuning, we display the top 5 hyperparameter combinations based
on the ROC AUC score, which is a measure of the model’s ability to
distinguish between the classes.
The area under the curve for
this decision tree model was 0.814
retention_best_model <- select_best(retention_results, metric = 'roc_auc')
retention_final_wf <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(retention_mod) %>%
finalize_workflow(retention_best_model)
retention_final_fit <- retention_final_wf %>%
fit(data = retention_train)
retention_final_fit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1346 273
## Left 194 284
We evaluate the model’s performance on the test set
(retention_test) using a confusion matrix. The confusion matrix compares
the predicted class (.pred_class) to the actual class (Status) and gives
us insights into how well the model is performing (the number of false
positives, true positives, false negatives, and true negatives).
retention_final_fit %>%
extract_fit_parsnip() %>%
vip(10)
Finally, we use the vip function to display the top 10 features that were most important in predicting customer retention.
Random Forest
A Random Forest is an ensemble learning method that uses multiple trees to improve prediction accuracy.
set.seed(456)
model_recipe <- recipe(Status ~ ., data = retention_train )
kfold <- vfold_cv(retention_train , v = 5)
rf_mod <- rand_forest(
mode = "classification",
trees = tune(),
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "impurity")
rf_hyper_grid <- grid_regular(trees(range = c(100, 1000)), mtry(range = c(1, 50)), min_n(range = c(1, 20)), levels = 5)
set.seed(456)
rf_results <- tune_grid(rf_mod, model_recipe, resamples = kfold, grid = rf_hyper_grid)
show_best(rf_results)
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 550 10 roc_auc binary 0.841 5 0.00296 Preprocessor1_Model0…
## 2 1 1000 10 roc_auc binary 0.840 5 0.00289 Preprocessor1_Model0…
## 3 1 1000 20 roc_auc binary 0.840 5 0.00308 Preprocessor1_Model1…
## 4 1 550 15 roc_auc binary 0.840 5 0.00333 Preprocessor1_Model0…
## 5 1 775 15 roc_auc binary 0.840 5 0.00262 Preprocessor1_Model0…
This displays the best performing hyperparameter combination based on
model performance.
The area under the curve for this logistic
regression model was 0.841
rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")
final_rf_wf <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(rf_mod) %>%
finalize_workflow(rf_best_hyperparameters)
rf_final_fit <- final_rf_wf %>%
fit(data = retention_train)
rf_final_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 10)
After selecting the best hyperparameters, we finalize the Random
Forest model and fit it to the training data. We then use the vip
function to visualize the top 10 features that most influence customer
retention.
The area under the curve for each of the 4 models were:
0.847
for logistic regression
0.850 for MARS
0.814 for the Decision
Tree
0.841 for Random Forest
We decided that the MARS Model is the optimal model as it has the highest AUC.
Findings
Since Total Charges and contracts are the most significant factors,
it’s essential for a business manager to understand its impact. I
believe that implementing promotions or incentives would greatly benefit
Regork in terms of retaining customers. Customers are leaving due to the
high charges, so offering these programs would give them more reason to
stay. However, this should be rolled out gradually to avoid a sudden
strain on the Finance team. Over time, though, this approach could lead
to long-term benefits by encouraging customer loyalty and repeat
business. We can implement deals for those who sign a two year contract
compared to those who do one year.
When we analyze the demographics of customers leaving due to charges
and tenure, we find that most are single males and not dependents. We
believe this is related to income and financial independence. Single
individuals typically rely on a single income, and those who are not
dependents have limited freedom to spend as they rely on their parents
for financial support.
Total Loss: Our calculations estimate that Regork’s total loss from churned customers is approximately $445,000.
Recomendations
Based on our analysis, we identified that Total Charges and the type of contract are the most significant factors influencing customer churn. Given that many customers are leaving due to high charges, we recommend that Regork implement promotions or incentives to mitigate this issue and retain more customers. Specifically, offering discounts or flexible payment plans could give customers more incentive to stay. However, it is essential to roll out these programs gradually to avoid placing undue strain on the Finance team, especially in the short term. Over time, this could lead to greater customer loyalty and repeat business.
We also suggest implementing a loyalty program to reward long-term customers. Offering perks or discounts based on tenure could strengthen the bond between Regork and its customers, making it more difficult for them to switch to competitors. A structured loyalty program could especially benefit customers who are on longer-term contracts, like two-year agreements, encouraging others to consider the stability of a longer commitment.
In addition, our analysis suggests that targeting single
males may help reduce churn. We observed that many of the
customers leaving Regork are single males who are not dependents. This
is likely linked to financial independence: single individuals often
rely on a single income, and those without dependents may have less
disposable income or financial flexibility. Offering targeted promotions
to this demographic could address their specific needs, such as more
affordable plans or budgeting tools, which may reduce their likelihood
of leaving.
Limitations
It would be beneficial to explore the demographics in greater depth at a later time. While we touched on a few aspects, a more detailed examination of factors such as age, race, occupation, and specific income levels could provide a clearer picture. This deeper analysis would help us better understand exactly what the reasons are for leaving Regork. Also, we only ran a few models and with more time we can find a more optimal model to be put in use.