Final Project: Group 30

Synopsis

To support Regork with their new telecommunications service, we developed multiple models to predict customer retention. Our approach began with exploring trends within the provided dataset. From there, we built, trained, and tested four distinct machine learning models: logistic regression, regularized regression, decision tree, and random forest. The performance of each model was evaluated using the area under the curve (AUC) metric.

Data Preparation

This is every package that was used for the analysis.

library(tidymodels)
library(tidyverse)
library(ranger)
library(baguette)
library(earth)
library(pdp)
library(vip)
library(ggplot2)
library(kernlab)
library(here)

tidyverse: A collection of packages designed for data science, providing tools for data manipulation, visualization, and analysis.

vip: A package for constructing variable importance plots to interpret machine learning models.

tidymodels: Allows you to perform discrete parts of the machine learning workflow with discrete packages, focusing on model building, tuning, and evaluation.

pdp: A package for constructing partial dependence plots and individual conditional expectation curves to understand model predictions.

ranger: A fast implementation of random forests and other tree-based models for regression and classification tasks.

baguette: A package for tree and rule-based models, particularly focused on bagging methods like random forests and bagged decision trees.

earth: A package for fitting multivariate adaptive regression splines (MARS) models, used for flexible non-linear regression.

ggplot2: A data visualization package that implements the grammar of graphics, allowing for flexible and elegant creation of plots.

kernlab: A package for kernel-based machine learning methods, particularly for support vector machines and other kernel-based models.

here: A package for managing file paths in R, ensuring code portability by creating relative paths based on the project’s root directory.

retention <- read.csv("customer_retention.csv")
retention <- mutate(retention, Status = factor(Status))
retention <- na.omit(retention)

Reading the customer data and saving as “retention”.

Exploratory Analysis

The graphs below show the most important trends the current data has. The first one shows the correlation between payment method and the length of tenure. Credit card and bank transfer tend to go the longest due to how easy and convenient it is to pay.

ggplot(retention, aes(PaymentMethod)) + 
  geom_bar(fill = "gold") +
  facet_wrap(~Contract) +
  coord_flip() +
  ggtitle("Customer Payment Method and Length of Contract") +
  labs(y = "Count of Contract Type", x = "Customer Payment Method")

The second and third graphs display how gender is divided up among the internet service plans and phone service plans. Overall they generally seem to be evenly split.

retention %>% 
  ggplot(aes(x=InternetService, fill=Gender))+
  geom_bar() +
  labs(title = "Gender Count by Internet Service Use")

retention %>% 
  ggplot(aes(PhoneService, fill=Gender))+
  geom_bar() +
  labs(title = "Gender Count and Phone Service Use")

This final graph shows us the actual commitment customers made based on the contract they signed up for. Customers who did month by month dropped off early then the other two. While many customers who opted for the 2 year contract ended up making it to five plus years.

ggplot(retention, aes(Tenure)) +
  geom_bar(fill = "steelblue2") +
  facet_wrap(~Contract) +
  ggtitle("Tenure vs. Type of Contract ") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(y = "Count of Customers", x = "Length of Tenure (months)")

Machine Learning

We ran 4 distinct statistical models to predict retention rate:

Logistic Regression

Logistic Regression

In this section, we build and evaluate a logistic regression model for customer retention.

set.seed(456)
retention_split <- initial_split(retention, prop = .7, strata = Status)
retention_train <- training(retention_split)
retention_test <- testing(retention_split)

set.seed(456)
logistic_kfolds <- vfold_cv(retention_train, v = 5, strata = Status)

logistic_reg() %>% 
  fit_resamples(Status ~ ., logistic_kfolds) %>%
  collect_metrics()

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.802     5 0.00408 Preprocessor1_Model1
## 2 brier_class binary     0.135     5 0.00172 Preprocessor1_Model1
## 3 roc_auc     binary     0.847     5 0.00355 Preprocessor1_Model1

The area under the curve for this logistic regression model was 0.847

MARS

Multivariate Adaptive Regression Spline

MARS creates piecewise linear regression models by recursively partitioning the data and fitting linear models within each region.

set.seed(456)
mars_split <- initial_split(retention, prop = .7, strata = Status)
mars_train <- training(mars_split)
mars_test <- testing(mars_split)

mars_recipe <- recipe(Status ~ ., data = mars_train)

set.seed(456)
mars_kfolds <- vfold_cv(mars_train, v = 5, strata = "Status")

mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification")

mars_grid <- grid_regular(num_terms(range = c(1,30)), prod_degree(), levels = 50)

mars_wf <- workflow() %>% add_recipe(mars_recipe) %>% add_model(mars_mod)

mars_results <- mars_wf %>% tune_grid(resamples = mars_kfolds, grid = mars_grid)

mars_results %>% collect_metrics() %>% filter(.metric == "roc_auc") %>%
arrange(desc(mean))

## # A tibble: 60 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        15           1 roc_auc binary     0.850     5 0.00404 Preprocessor1_M…
##  2        16           1 roc_auc binary     0.850     5 0.00410 Preprocessor1_M…
##  3        19           1 roc_auc binary     0.850     5 0.00417 Preprocessor1_M…
##  4        20           1 roc_auc binary     0.850     5 0.00417 Preprocessor1_M…
##  5        21           1 roc_auc binary     0.850     5 0.00417 Preprocessor1_M…
##  6        22           1 roc_auc binary     0.850     5 0.00417 Preprocessor1_M…
##  7        23           1 roc_auc binary     0.850     5 0.00417 Preprocessor1_M…
##  8        24           1 roc_auc binary     0.850     5 0.00417 Preprocessor1_M…
##  9        25           1 roc_auc binary     0.850     5 0.00417 Preprocessor1_M…
## 10        26           1 roc_auc binary     0.850     5 0.00417 Preprocessor1_M…
## # ℹ 50 more rows

The area under the curve for this Multivariate Adaptive Regression Spline model was 0.850

autoplot(mars_results)

These plot the results of the grid search to analyze how the performance of the model changes with different hyperparameter values

mars_best <- select_best(mars_results, metric = "roc_auc")

mars_final_wf <- workflow() %>% 
  add_model(mars_mod) %>% add_formula(Status ~ .) %>% 
  finalize_workflow(mars_best)

mars_final_wf %>% 
  fit(data = mars_train) %>%
  extract_fit_parsnip() %>%
  vip(10, type = "rss")

Here we use the vip function to display the top 10 features that were most important in predicting customer retention based on our model. This helps us understand which variables (such as tenure, charges, or payment method) had the most influence on the model’s predictions.

Decision Tree

Decision Tree

In this section, we build and evaluate a decision tree model for customer retention.

retention_mod <- decision_tree(mode = "classification") %>%
set_engine("rpart")

retention_recipe <- recipe(Status ~ ., data = retention_train)

retention_fit <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(retention_mod) %>%
fit(data = retention_train)

retention_fit

rpart.plot::rpart.plot(retention_fit$fit$fit$fit, roundint = FALSE)

Here, we visualize the trained decision tree. This helps us understand how the decision tree makes its predictions. Each node in the tree represents a decision based on a feature (like “Tenure” or “Contract type”), and the branches show how the decision splits based on the feature values. The tree helps us interpret how various customer features affect retention.

# create resampling procedure
set.seed(456)
kfold <- vfold_cv(retention_train, v = 5)
# train model
retention_results <- fit_resamples(retention_mod, retention_recipe, kfold)
# model results
collect_metrics(retention_results)

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.796     5 0.00396 Preprocessor1_Model1
## 2 brier_class binary     0.144     5 0.00231 Preprocessor1_Model1
## 3 roc_auc     binary     0.809     5 0.00407 Preprocessor1_Model1

We set up a 5-fold cross-validation to evaluate the model’s performance. This means the data is split into 5 subsets, and the model is trained and tested 5 times, each time with a different subset of data used as the test set and the remaining data used for training. The results are collected, and we evaluate model performance using AUC.

Tuning

Next, we perform hyperparameter tuning to optimize the performance of the decision tree.

retention_mod <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")

retention_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)

set.seed(456)
retention_results <- tune_grid(retention_mod, retention_recipe, resamples = kfold, grid = retention_hyper_grid)

show_best(retention_results, metric = "roc_auc", n = 5)

## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001          8    30 roc_auc binary     0.814     5 0.00472
## 2    0.0000000178          8    30 roc_auc binary     0.814     5 0.00472
## 3    0.00000316            8    30 roc_auc binary     0.814     5 0.00472
## 4    0.000562              8    30 roc_auc binary     0.814     5 0.00486
## 5    0.0000000001         11    30 roc_auc binary     0.814     5 0.00474
## # ℹ 1 more variable: .config <chr>

After tuning, we display the top 5 hyperparameter combinations based on the ROC AUC score, which is a measure of the model’s ability to distinguish between the classes.

The area under the curve for this decision tree model was 0.814

retention_best_model <- select_best(retention_results, metric = 'roc_auc')

retention_final_wf <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(retention_mod) %>%
finalize_workflow(retention_best_model)

retention_final_fit <- retention_final_wf %>%
fit(data = retention_train)

retention_final_fit %>%
   predict(retention_test) %>%
   bind_cols(retention_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1346  273
##    Left        194  284

We evaluate the model’s performance on the test set (retention_test) using a confusion matrix. The confusion matrix compares the predicted class (.pred_class) to the actual class (Status) and gives us insights into how well the model is performing (the number of false positives, true positives, false negatives, and true negatives).

retention_final_fit %>%
extract_fit_parsnip() %>%
vip(10)

Finally, we use the vip function to display the top 10 features that were most important in predicting customer retention.

Random Forest

Random Forest

A Random Forest is an ensemble learning method that uses multiple trees to improve prediction accuracy.

set.seed(456)
model_recipe <- recipe(Status ~ ., data = retention_train )

kfold <- vfold_cv(retention_train , v = 5)


rf_mod <- rand_forest(
mode = "classification",
trees = tune(),
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "impurity")



rf_hyper_grid <- grid_regular(trees(range = c(100, 1000)), mtry(range = c(1, 50)), min_n(range = c(1, 20)), levels = 5)



set.seed(456)
rf_results <- tune_grid(rf_mod, model_recipe, resamples = kfold, grid = rf_hyper_grid)



show_best(rf_results)

## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     1   550    10 roc_auc binary     0.841     5 0.00296 Preprocessor1_Model0…
## 2     1  1000    10 roc_auc binary     0.840     5 0.00289 Preprocessor1_Model0…
## 3     1  1000    20 roc_auc binary     0.840     5 0.00308 Preprocessor1_Model1…
## 4     1   550    15 roc_auc binary     0.840     5 0.00333 Preprocessor1_Model0…
## 5     1   775    15 roc_auc binary     0.840     5 0.00262 Preprocessor1_Model0…

This displays the best performing hyperparameter combination based on model performance.

The area under the curve for this logistic regression model was 0.841

rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")

final_rf_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(rf_mod) %>%
  finalize_workflow(rf_best_hyperparameters)

rf_final_fit <- final_rf_wf %>%
  fit(data = retention_train)

rf_final_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

After selecting the best hyperparameters, we finalize the Random Forest model and fit it to the training data. We then use the vip function to visualize the top 10 features that most influence customer retention.

Results

The area under the curve for each of the 4 models were:
0.847 for logistic regression
0.850 for MARS
0.814 for the Decision Tree
0.841 for Random Forest

We decided that the MARS Model is the optimal model as it has the highest AUC.

Business Analysis

Findings

Since Total Charges and contracts are the most significant factors, it’s essential for a business manager to understand its impact. I believe that implementing promotions or incentives would greatly benefit Regork in terms of retaining customers. Customers are leaving due to the high charges, so offering these programs would give them more reason to stay. However, this should be rolled out gradually to avoid a sudden strain on the Finance team. Over time, though, this approach could lead to long-term benefits by encouraging customer loyalty and repeat business. We can implement deals for those who sign a two year contract compared to those who do one year.

When we analyze the demographics of customers leaving due to charges and tenure, we find that most are single males and not dependents. We believe this is related to income and financial independence. Single individuals typically rely on a single income, and those who are not dependents have limited freedom to spend as they rely on their parents for financial support.

Total Loss: Our calculations estimate that Regork’s total loss from churned customers is approximately $445,000.

Recomendations

Based on our analysis, we identified that Total Charges and the type of contract are the most significant factors influencing customer churn. Given that many customers are leaving due to high charges, we recommend that Regork implement promotions or incentives to mitigate this issue and retain more customers. Specifically, offering discounts or flexible payment plans could give customers more incentive to stay. However, it is essential to roll out these programs gradually to avoid placing undue strain on the Finance team, especially in the short term. Over time, this could lead to greater customer loyalty and repeat business.

We also suggest implementing a loyalty program to reward long-term customers. Offering perks or discounts based on tenure could strengthen the bond between Regork and its customers, making it more difficult for them to switch to competitors. A structured loyalty program could especially benefit customers who are on longer-term contracts, like two-year agreements, encouraging others to consider the stability of a longer commitment.

In addition, our analysis suggests that targeting single males may help reduce churn. We observed that many of the customers leaving Regork are single males who are not dependents. This is likely linked to financial independence: single individuals often rely on a single income, and those without dependents may have less disposable income or financial flexibility. Offering targeted promotions to this demographic could address their specific needs, such as more affordable plans or budgeting tools, which may reduce their likelihood of leaving.

Limitations

It would be beneficial to explore the demographics in greater depth at a later time. While we touched on a few aspects, a more detailed examination of factors such as age, race, occupation, and specific income levels could provide a clearer picture. This deeper analysis would help us better understand exactly what the reasons are for leaving Regork. Also, we only ran a few models and with more time we can find a more optimal model to be put in use.