Introduction
When entering a new industry, gaining a deep understanding of customer behavior is crucial for long-term success. In the analysis that follows, we examined customer retention data from the “customer_retention.csv” file and developed a model that effectively predicts customer retention. The model developed here will play an important role in future analyses across a variety of departments within your company. For instance, the marketing team could leverage it to design targeted promotions for existing customers, while the finance team might use it to forecast customer retention trends, helping to assess company risk, revenue, and bookings. As you review the analysis, you’ll find details about the necessary packages used for the report, along with background information on the data source and the preparation steps taken. We also conducted an in-depth exploratory analysis before building the model, uncovering trends and relationships between the predictor variables and the response variable. A significant portion of the analysis focuses on evaluating different machine learning models. Specifically, we compared three classification methods: logistic regression, bagging, and a decision tree.
Packages Required
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(vip)
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.8 ✔ rsample 1.3.0
## ✔ dials 1.4.0 ✔ tune 1.3.0
## ✔ infer 1.0.8 ✔ workflows 1.2.0
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.3.1 ✔ yardstick 1.3.2
## ✔ recipes 1.3.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
library(pdp)
##
## Attaching package: 'pdp'
##
## The following object is masked from 'package:purrr':
##
## partial
library(kernlab)
##
## Attaching package: 'kernlab'
##
## The following object is masked from 'package:dials':
##
## buffer
##
## The following object is masked from 'package:scales':
##
## alpha
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## alpha
library(baguette)
library(dplyr)
Data Prep and Exploration
retention <- read.csv("customer_retention.csv")
retention <- retention %>%
dplyr::mutate(Status = as.factor(Status))
retention <- drop_na(retention)
retention %>% is.na() %>% sum()
## [1] 0
set.seed(123)
retention_split <- initial_split(retention, prop = .7, strata = "Status")
retention_train <- training(retention_split)
retention_test <- testing(retention_split)
dim(retention_test)
## [1] 2097 20
dim(retention_train)
## [1] 4891 20
retention_train %>%
mutate(id = 'retention_train') %>%
bind_rows(retention_test %>% mutate(id = 'retention_test')) %>%
ggplot(aes(Status, color = id)) +
geom_density() + labs(title = "Retention Sample Distribution")
table(retention$Status) %>% prop.table()
##
## Current Left
## 0.7344018 0.2655982
table(retention_train$Status) %>% prop.table()
##
## Current Left
## 0.7344101 0.2655899
table(retention_test$Status) %>% prop.table()
##
## Current Left
## 0.7343825 0.2656175
head(retention$Status)
## [1] Current Current Left Current Left Left
## Levels: Current Left
retention %>%
ggplot(aes(PaymentMethod, fill=Gender))+
geom_bar() +
labs(title = "Payment Method and Gender")
retention %>%
ggplot(aes(PaymentMethod, fill=Partner))+
geom_bar()+
labs(title = "Payment Method and Partner Count")
retention %>%
ggplot(aes(PaymentMethod, fill=Dependents))+
geom_bar()+
labs(title = "Payment Method Dependents Count")
The following analysis examines the full retention dataset, highlighting the relationships between various demographic factors and payment methods. As you can see in these graphs there is no large deviation in any of the categories in terms of gender. However, there are stronger connections when looking at other factors.
We also wanted to understand the Payment Method and Partner count and how they reacted with each other.
Analysis Logistic Regression
retention_recipe <- recipe(Status ~ ., data = retention_train) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
table(retention$Status) %>% prop.table()
##
## Current Left
## 0.7344018 0.2655982
lr_mod <- logistic_reg()
set.seed(123)
kfolds <- vfold_cv(retention_train, v = 5, strata = Status)
log_results <- lr_mod %>%
fit_resamples(Status ~ ., kfolds)
collect_metrics(log_results) %>% filter(.metric == "roc_auc")
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.845 5 0.00521 Preprocessor1_Model1
logfit <-lr_mod %>%
fit(Status ~ ., data = retention_train)
logfit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1362 225
## Left 178 332
logfit %>%
predict(retention_train, type = "prob") %>%
mutate(truth = retention_train$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
Bagging
bagging <- bag_tree() %>%
set_engine("rpart", times = 5) %>%
set_mode("classification")
bag_results <- fit_resamples(bagging, retention_recipe, kfolds)
collect_metrics(bag_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.760 5 0.00543 Preprocessor1_Model1
## 2 brier_class binary 0.172 5 0.00169 Preprocessor1_Model1
## 3 roc_auc binary 0.767 5 0.00448 Preprocessor1_Model1
bagging <- bag_tree() %>%
set_engine("rpart", times = tune()) %>%
set_mode("classification")
bag_hyper_grid <- expand.grid(times = c(5, 25, 50, 100, 200, 300))
set.seed(123)
bag_results <- tune_grid(bagging, retention_recipe, resamples = kfolds, grid = bag_hyper_grid)
show_best(bag_results, metric = "roc_auc")
## # A tibble: 5 × 7
## times .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 200 roc_auc binary 0.822 5 0.00352 Preprocessor1_Model5
## 2 300 roc_auc binary 0.821 5 0.00432 Preprocessor1_Model6
## 3 100 roc_auc binary 0.821 5 0.00462 Preprocessor1_Model4
## 4 50 roc_auc binary 0.815 5 0.00361 Preprocessor1_Model3
## 5 25 roc_auc binary 0.809 5 0.00457 Preprocessor1_Model2
bag_best_model <- select_best(bag_results, metric = 'roc_auc')
bag_final_wf <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(bagging) %>%
finalize_workflow(bag_best_model)
bag_final_fit <- bag_final_wf %>%
fit(data = retention_train)
bag_final_fit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1342 263
## Left 198 294
bag_final_fit %>%
predict(retention_train, type = "prob") %>%
mutate(truth = retention_train$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
Decision Tree
tree_mod <- decision_tree(mode = "classification") %>%
set_engine("rpart")
tree_fit <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(tree_mod) %>%
fit(data = retention_train)
tree_results <- fit_resamples(tree_mod, retention_recipe, kfolds)
collect_metrics(tree_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.788 5 0.00399 Preprocessor1_Model1
## 2 brier_class binary 0.161 5 0.00171 Preprocessor1_Model1
## 3 roc_auc binary 0.710 5 0.00529 Preprocessor1_Model1
tree_mod <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
tree_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
set.seed(123)
tree_results <- tune_grid(tree_mod, retention_recipe, resamples = kfolds, grid = tree_hyper_grid)
show_best(tree_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 8 30 roc_auc binary 0.816 5 0.00552
## 2 0.0000000178 8 30 roc_auc binary 0.816 5 0.00552
## 3 0.00000316 8 30 roc_auc binary 0.816 5 0.00552
## 4 0.0000000001 8 40 roc_auc binary 0.815 5 0.00420
## 5 0.0000000178 8 40 roc_auc binary 0.815 5 0.00420
## # ℹ 1 more variable: .config <chr>
tree_best_model <- select_best(tree_results, metric = 'roc_auc')
tree_final_wf <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(tree_mod) %>%
finalize_workflow(tree_best_model)
tree_final_fit <- tree_final_wf %>%
fit(data = retention_train)
tree_final_fit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1316 243
## Left 224 314
tree_final_fit %>%
predict(retention_train, type = "prob") %>%
mutate(truth = retention_train$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
Business Analysis and Conclusion
In terms of relative importance, how would you be the predictors in your model? As a business manager, which factors would you focus on to decrease the chances of customers leaving?
In our model, payment methods emerged as one of the most influential predictors of customer retention, with customers who pay via electronic check being far more likely to leave compared to those using other payment methods. Partner count (whether a customer has a partner) also showed significant importance, with single customers showing higher churn rates than those with partners.
Propose an incentive scheme to your manager to retain these customers
As a business manager, I would prioritize strategies around improving the payment experience for customers using electronic checks — potentially offering easier, more secure, or incentivized payment alternatives — and explore relationship-based marketing strategies for single customers who may feel less “tied” to the service. Collect all the customers from the test dataset that you predict are going to leave.
In conclusion, our analysis suggests that improving the payment experience is key to retaining customers. Since a large portion of customers use electronic checks and self-serve stations, we recommend that Regork expand the number of these stations to enhance convenience and overall satisfaction. Our insights show that making the payment process easier and more accessible would have a direct, positive impact on customer retention. Additionally, investing in better payment options signals to customers that Regork values their time and loyalty. This move could also help differentiate Regork from competitors by offering a smoother, more customer-centric experience. Overall, we believe this is a strong, strategic step forward for the company.