Kyle Rex, Gabe Logan, Ryan Thomas

Introduction

When entering a new industry, gaining a deep understanding of customer behavior is crucial for long-term success. In the analysis that follows, we examined customer retention data from the “customer_retention.csv” file and developed a model that effectively predicts customer retention. The model developed here will play an important role in future analyses across a variety of departments within your company. For instance, the marketing team could leverage it to design targeted promotions for existing customers, while the finance team might use it to forecast customer retention trends, helping to assess company risk, revenue, and bookings. As you review the analysis, you’ll find details about the necessary packages used for the report, along with background information on the data source and the preparation steps taken. We also conducted an in-depth exploratory analysis before building the model, uncovering trends and relationships between the predictor variables and the response variable. A significant portion of the analysis focuses on evaluating different machine learning models. Specifically, we compared three classification methods: logistic regression, bagging, and a decision tree.

Packages Required

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(vip)

## 
## Attaching package: 'vip'
## 
## The following object is masked from 'package:utils':
## 
##     vi

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.8     ✔ rsample      1.3.0
## ✔ dials        1.4.0     ✔ tune         1.3.0
## ✔ infer        1.0.8     ✔ workflows    1.2.0
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.3.1     ✔ yardstick    1.3.2
## ✔ recipes      1.3.0     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()

library(pdp)

## 
## Attaching package: 'pdp'
## 
## The following object is masked from 'package:purrr':
## 
##     partial

library(kernlab)

## 
## Attaching package: 'kernlab'
## 
## The following object is masked from 'package:dials':
## 
##     buffer
## 
## The following object is masked from 'package:scales':
## 
##     alpha
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha

library(baguette)
library(dplyr)

Data Prep and Exploration

retention <- read.csv("customer_retention.csv")
retention <- retention %>% 
  dplyr::mutate(Status = as.factor(Status))
retention <- drop_na(retention)
retention %>% is.na() %>% sum()

## [1] 0

set.seed(123)
retention_split <- initial_split(retention, prop = .7, strata = "Status")
retention_train <- training(retention_split)
retention_test  <- testing(retention_split)
dim(retention_test)

## [1] 2097   20

dim(retention_train)

## [1] 4891   20

retention_train %>% 
  mutate(id = 'retention_train') %>% 
  bind_rows(retention_test %>% mutate(id = 'retention_test')) %>%
  ggplot(aes(Status, color = id)) +
  geom_density() + labs(title = "Retention Sample Distribution")

table(retention$Status) %>% prop.table()

## 
##   Current      Left 
## 0.7344018 0.2655982

table(retention_train$Status) %>% prop.table()

## 
##   Current      Left 
## 0.7344101 0.2655899

table(retention_test$Status) %>% prop.table()

## 
##   Current      Left 
## 0.7343825 0.2656175

head(retention$Status)

## [1] Current Current Left    Current Left    Left   
## Levels: Current Left

retention %>% 
  ggplot(aes(PaymentMethod, fill=Gender))+
  geom_bar() +
  labs(title = "Payment Method and Gender")

retention %>% 
  ggplot(aes(PaymentMethod, fill=Partner))+
  geom_bar()+
  labs(title = "Payment Method and Partner Count")

retention %>% 
  ggplot(aes(PaymentMethod, fill=Dependents))+
  geom_bar()+
  labs(title = "Payment Method Dependents Count")

The following analysis examines the full retention dataset, highlighting the relationships between various demographic factors and payment methods. As you can see in these graphs there is no large deviation in any of the categories in terms of gender. However, there are stronger connections when looking at other factors.

We also wanted to understand the Payment Method and Partner count and how they reacted with each other.

Analysis Logistic Regression

retention_recipe <- recipe(Status ~ ., data = retention_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
table(retention$Status) %>% prop.table()

## 
##   Current      Left 
## 0.7344018 0.2655982

lr_mod <- logistic_reg()
set.seed(123)
kfolds <- vfold_cv(retention_train, v = 5, strata = Status)
log_results <- lr_mod %>% 
  fit_resamples(Status ~ ., kfolds)
collect_metrics(log_results) %>% filter(.metric == "roc_auc")

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.845     5 0.00521 Preprocessor1_Model1

logfit <-lr_mod %>%
  fit(Status ~ ., data = retention_train)
logfit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332

logfit %>%
  predict(retention_train, type = "prob") %>%
  mutate(truth = retention_train$Status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

Bagging

bagging <- bag_tree() %>%
  set_engine("rpart", times = 5) %>%
  set_mode("classification")
bag_results <- fit_resamples(bagging, retention_recipe, kfolds)
collect_metrics(bag_results)

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.760     5 0.00543 Preprocessor1_Model1
## 2 brier_class binary     0.172     5 0.00169 Preprocessor1_Model1
## 3 roc_auc     binary     0.767     5 0.00448 Preprocessor1_Model1

bagging <- bag_tree() %>%
  set_engine("rpart", times = tune()) %>%
  set_mode("classification")
bag_hyper_grid <- expand.grid(times = c(5, 25, 50, 100, 200, 300))
set.seed(123)
bag_results <- tune_grid(bagging, retention_recipe, resamples = kfolds, grid = bag_hyper_grid)
show_best(bag_results, metric = "roc_auc")

## # A tibble: 5 × 7
##   times .metric .estimator  mean     n std_err .config             
##   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1   200 roc_auc binary     0.822     5 0.00352 Preprocessor1_Model5
## 2   300 roc_auc binary     0.821     5 0.00432 Preprocessor1_Model6
## 3   100 roc_auc binary     0.821     5 0.00462 Preprocessor1_Model4
## 4    50 roc_auc binary     0.815     5 0.00361 Preprocessor1_Model3
## 5    25 roc_auc binary     0.809     5 0.00457 Preprocessor1_Model2

bag_best_model <- select_best(bag_results, metric = 'roc_auc')
bag_final_wf <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(bagging) %>%
  finalize_workflow(bag_best_model)
bag_final_fit <- bag_final_wf %>%
  fit(data = retention_train)
bag_final_fit %>%
   predict(retention_test) %>%
   bind_cols(retention_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1342  263
##    Left        198  294

bag_final_fit %>% 
   predict(retention_train, type = "prob") %>%
   mutate(truth = retention_train$Status) %>%
   roc_curve(truth, .pred_Current) %>%
   autoplot()

Decision Tree

tree_mod <- decision_tree(mode = "classification") %>%
  set_engine("rpart")
tree_fit <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(tree_mod) %>%
  fit(data = retention_train)
tree_results <- fit_resamples(tree_mod, retention_recipe, kfolds)
collect_metrics(tree_results)

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.788     5 0.00399 Preprocessor1_Model1
## 2 brier_class binary     0.161     5 0.00171 Preprocessor1_Model1
## 3 roc_auc     binary     0.710     5 0.00529 Preprocessor1_Model1

tree_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
 ) %>%
  set_engine("rpart")
tree_hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)
set.seed(123)
tree_results <- tune_grid(tree_mod, retention_recipe, resamples = kfolds, grid = tree_hyper_grid)
show_best(tree_results, metric = "roc_auc", n = 5)

## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001          8    30 roc_auc binary     0.816     5 0.00552
## 2    0.0000000178          8    30 roc_auc binary     0.816     5 0.00552
## 3    0.00000316            8    30 roc_auc binary     0.816     5 0.00552
## 4    0.0000000001          8    40 roc_auc binary     0.815     5 0.00420
## 5    0.0000000178          8    40 roc_auc binary     0.815     5 0.00420
## # ℹ 1 more variable: .config <chr>

tree_best_model <- select_best(tree_results, metric = 'roc_auc')
tree_final_wf <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(tree_mod) %>%
  finalize_workflow(tree_best_model)
tree_final_fit <- tree_final_wf %>%
  fit(data = retention_train)
tree_final_fit %>%
   predict(retention_test) %>%
   bind_cols(retention_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1316  243
##    Left        224  314

tree_final_fit %>% 
   predict(retention_train, type = "prob") %>%
   mutate(truth = retention_train$Status) %>%
   roc_curve(truth, .pred_Current) %>%
   autoplot()

Business Analysis and Conclusion

In terms of relative importance, how would you be the predictors in your model? As a business manager, which factors would you focus on to decrease the chances of customers leaving?

In our model, payment methods emerged as one of the most influential predictors of customer retention, with customers who pay via electronic check being far more likely to leave compared to those using other payment methods. Partner count (whether a customer has a partner) also showed significant importance, with single customers showing higher churn rates than those with partners.

Propose an incentive scheme to your manager to retain these customers

As a business manager, I would prioritize strategies around improving the payment experience for customers using electronic checks — potentially offering easier, more secure, or incentivized payment alternatives — and explore relationship-based marketing strategies for single customers who may feel less “tied” to the service. Collect all the customers from the test dataset that you predict are going to leave.

In conclusion, our analysis suggests that improving the payment experience is key to retaining customers. Since a large portion of customers use electronic checks and self-serve stations, we recommend that Regork expand the number of these stations to enhance convenience and overall satisfaction. Our insights show that making the payment process easier and more accessible would have a direct, positive impact on customer retention. Additionally, investing in better payment options signals to customers that Regork values their time and loyalty. This move could also help differentiate Regork from competitors by offering a smoother, more customer-centric experience. Overall, we believe this is a strong, strategic step forward for the company.

Kyle Rex, Gabe Logan, Ryan Thomas - Final Project

Kyle Rex

2025-04-27