To help Regork optimize their telecommunication service offerings, we conducted an analysis to predict customer retention. The goal was to identify key trends and relationships in the dataset that would inform a model capable of predicting customer churn.
We began by exploring the dataset to uncover valuable insights, such as the impact of customer demographics, service usage, and tenure on retention. After preparing the data, we built and evaluated several machine learning models, including a decision tree, a MARS model, and a logistic regression.
This model will serve as a foundational tool for Regork’s customer retention strategy. It can help teams across different functions target retention efforts more effectively, forecast customer trends and make data-driven decisions. In this report, you’ll find a detailed explanation of the data preparation process, key exploratory findings, and an evaluation of the machine learning models used.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.6 ✔ rsample 1.2.1
## ✔ dials 1.3.0 ✔ tune 1.2.1
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(vip)
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
library(rpart.plot)
## Loading required package: rpart
##
## Attaching package: 'rpart'
##
## The following object is masked from 'package:dials':
##
## prune
library(here)
## here() starts at /Users/maxwellbean/Desktop/uc-bana-4080
library(pdp)
##
## Attaching package: 'pdp'
##
## The following object is masked from 'package:purrr':
##
## partial
library(kernlab)
##
## Attaching package: 'kernlab'
##
## The following object is masked from 'package:scales':
##
## alpha
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## alpha
library(baguette)
library(dplyr)
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-8
data <- read.csv(here("Final Project","customer_retention.csv"))
data <- mutate(data, Status = factor(Status))
data <- na.omit(data)
set.seed(123)
split <- initial_split(data, prop = 0.7, strata = "Status")
training <- training(split)
testing <- testing(split)
recipe <- recipe(Status ~., data = training) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
set.seed(123)
kfolds <- vfold_cv(training, v = 5, strata = Status)
ggplot(data, aes(Tenure, fill = Partner)) +
geom_bar() +
ggtitle("Tenure of Customers") +
labs(y = "Number of Customers", x = "Number of Months With Regork")
ggplot(data, aes(x = Contract, fill = Status)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
labs(title = "Churn Rate by Contract Type", y = "Percentage", x = "Contract Type") +
theme_minimal()
Description: The churn rate by contract type graph shows that customers
with shorter contracts tend to leave more often than those with longer
contracts. The month-to-month contracts relate to a significantly higher
churn rate than the one-year and two-year contracts. The churn rate
associated with a one-year contract is also greater than a two-year
contract.
ggplot(data, aes(x = Status, y = MonthlyCharges, fill = Status)) +
geom_boxplot() +
labs(title = "Monthly Charges by Customer Status", y = "Monthly Charges ($)", x = "Status") +
theme_minimal()
Description:
The customers who left had higher monthly charges compared to those who haven’t left. The first quartile, median, and third quartile are all larger for those who left. Thus, there is a correlation between higher charges and customers leaving.
DECISION TREE
set.seed(123)
dt_split <- initial_split(data, prop = 0.7, strata = "Status")
dt_train <- training(dt_split)
dt_test <- testing(dt_split)
dt_model <- decision_tree(mode = "classification") %>%
set_engine("rpart")
dt_fit <- workflow() %>%
add_recipe(recipe) %>%
add_model(dt_model) %>%
fit(data = dt_train)
dt_results <- fit_resamples(dt_model, recipe, kfolds)
collect_metrics(dt_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.788 5 0.00399 Preprocessor1_Model1
## 2 brier_class binary 0.161 5 0.00171 Preprocessor1_Model1
## 3 roc_auc binary 0.710 5 0.00529 Preprocessor1_Model1
dt_model <- decision_tree(mode = "classification", cost_complexity = tune(),
tree_depth = tune(), min_n = tune()) %>%
set_engine("rpart")
dt_hyper_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 5)
set.seed(123)
dt_tune_results <- tune_grid(dt_model, recipe, kfolds, dt_hyper_grid)
## Warning: The `...` are not used in this function but one or more objects were
## passed: ''
show_best(dt_tune_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 2.63e- 5 9 37 roc_auc binary 0.815 5 0.00522
## 2 4.48e- 6 13 35 roc_auc binary 0.810 5 0.00605
## 3 1.22e- 6 14 23 roc_auc binary 0.806 5 0.00507
## 4 2.19e- 3 8 25 roc_auc binary 0.804 5 0.00721
## 5 3.55e-10 11 20 roc_auc binary 0.801 5 0.00319
## # ℹ 1 more variable: .config <chr>
best_dt_model <- select_best(dt_tune_results, metric = "roc_auc")
dt_final_wf <- workflow() %>%
add_recipe(recipe) %>%
add_model(dt_model) %>%
finalize_workflow(best_dt_model)
dt_final_fit <- dt_final_wf %>%
fit(data = dt_train)
dt_final_fit %>%
predict(dt_test) %>%
bind_cols(dt_test %>%
select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1337 261
## Left 203 296
dt_final_fit %>%
extract_fit_parsnip() %>%
vip(20)
LOGISTIC REGRESSION
set.seed(123)
log_split <- initial_split(data, prop = 0.7, strata = "Status")
log_train <- training(log_split)
log_test <- testing(log_split)
set.seed(123)
kfolds <- vfold_cv(training, v = 5, strata = Status)
log_results <- logistic_reg() %>%
fit_resamples(Status ~., kfolds)
collect_metrics(log_results) %>%
filter(.metric == "roc_auc")
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.845 5 0.00521 Preprocessor1_Model1
MARS
set.seed(123)
mars_split <- initial_split(data, prop = .7, strata = Status)
mars_train <- training(mars_split)
mars_test <- testing(mars_split)
mars_fit <- mars(mode = "classification", prod_degree = 1) %>%
fit(Status ~ ., mars_train)
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:scales':
##
## rescale
mars_mod <- mars(
mode = "classification",
num_terms = tune(), #<<
prod_degree = tune() #<<
)
##### K-Fold Cross
set.seed(123)
folds <- vfold_cv(mars_train, v = 5)
##### Recipe
mars_recipe <- recipe(Status ~ ., data = mars_train)
##### Hyperparameter tuning grid
mars_grid <- grid_regular( #<<
num_terms(range = c(10,50)), #<<
prod_degree(), #<<
levels = 10 #<<
) #<<
###### Train across the hyperparameter grid
mars_results <- tune_grid(mars_mod, mars_recipe, resamples = folds, grid = mars_grid) #<<
###### Best Results
show_best(mars_results, metric = "roc_auc", n=10)
## # A tibble: 10 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 23 1 roc_auc binary 0.847 5 0.00930 Preprocessor1_M…
## 2 27 1 roc_auc binary 0.847 5 0.00930 Preprocessor1_M…
## 3 32 1 roc_auc binary 0.847 5 0.00930 Preprocessor1_M…
## 4 36 1 roc_auc binary 0.847 5 0.00930 Preprocessor1_M…
## 5 41 1 roc_auc binary 0.847 5 0.00930 Preprocessor1_M…
## 6 45 1 roc_auc binary 0.847 5 0.00930 Preprocessor1_M…
## 7 50 1 roc_auc binary 0.847 5 0.00930 Preprocessor1_M…
## 8 18 1 roc_auc binary 0.847 5 0.00928 Preprocessor1_M…
## 9 14 1 roc_auc binary 0.844 5 0.00906 Preprocessor1_M…
## 10 10 1 roc_auc binary 0.842 5 0.0106 Preprocessor1_M…
mars_roc <- show_best(mars_results, metric = "roc_auc", n=1) %>%
arrange(desc(mean))
mars_roc
## # A tibble: 1 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 23 1 roc_auc binary 0.847 5 0.00930 Preprocessor1_Mo…
autoplot(mars_results)
mars_best <- select_best(mars_results, metric = "roc_auc")
mars_final_wf <- workflow() %>%
add_model(mars_mod) %>%
add_recipe(mars_recipe) %>%
finalize_workflow(mars_best)
##### Variable Influence
mars_final_wf %>%
fit(data = mars_train) %>%
extract_fit_parsnip() %>%
vip(13)
Important Predictors
When looking at our best model, the main three factors that stand out in importance compared to the rest are TotalCharges, Tenure, MonthlyCharges.
With Total Charges being the most important predictor variable for this model, it shows just how important price can be to customers. This information could predict that customers base their decision of telecommunications plans based on price as the leading factor. To make sure that you can retain as many customers as possible, it is important to keep prices low and competitive with other companies to make sure your price is reasonable for the customer.
Tenure being the second most important predictor variable is also very important to Regork. This is common knowledge within the business and marketing world that it is much easier to keep a long time customer compared to reeling in a brand new customer. This should show to Regork that they should cater their telecommunications offerings to their longest tenure customers to make sure they are satisfied and do not switch companies.
Conclusion
To conclude this analysis, we recommend that Regork focuses on the main predictor variables of Total Charges, Tenure, and Monthly Charges to ensure they are maximizing their retention of customers. Building these long relationships with customers can help ensure that Regork retains customers year after year and in turn also lowers the cost of marketing to new customers, because as we know, it is much more expensive to get a new customer compared to retaining a current customer. Regork also need to focus on keeping Total AND Monthly charges in line with the current telecommunications market. With both Total and Monthly charges being high in importance, this shows that the average customer looks at price first and will be one of the final deciders when it comes to their choice in provider.