The Business Problem

The business problem we are addressing is improving customer retention by developing a predictive model that determines the likelihood of a customer staying or leaving. This report should be particularly valuable to the CEO of Regork Telecom, given the potential impact on customer retention strategies. Our approach involved analyzing customer data to identify trends associated with churn and building a model to predict the reasons behind customer departures. This analysis offers key insights that can help enhance customer retention efforts and may also support initiatives to attract new customers.

Packages

tidymodels is a package used to perform Machine Learning.
tidyverse is a package used to analyse and edit data.
earth is a package used for performing the Multivariate Adaptive Regression Splines (MARS).
vip is a package used for creating variable importance plots.
pdp is a package used for creating partial dependency plots.
rpart.plot is a package used to visualize a Decision Tree.

suppressWarnings(suppressMessages(library(tidymodels)))
suppressWarnings(suppressMessages(library(tidyverse)))
suppressWarnings(suppressMessages(library(earth)))
suppressWarnings(suppressMessages(library(vip)))
suppressWarnings(suppressMessages(library(pdp)))
suppressWarnings(suppressMessages(library(rpart.plot)))

Adjusted Data

Before we start to explore the data, we need to edit it slightly. We removed all null values, and changed the response variable, Status, to be the factor for our classification models.

suppressWarnings(suppressMessages(
  retention <- read_csv('customer_retention.csv') %>%
  mutate(Status = factor(Status)) %>% 
  na.omit()))

Exploratory Data Analysis

ggplot(retention, aes(Status)) +
  geom_bar(fill = "lightpink") +
  facet_wrap(~Contract) +
  ggtitle("Customers Contract Types by Status") +
  labs(y = "Customers", x = "Status")

This plot shows that customers tend to sign longer contracts with Regork when intending to stay, where as majority of customers leaving are taking advantage of the monthly contract. This can help us make assumptions that data linked to long contracts are current customers.

yes_partner <- retention %>%
  filter(str_detect(Partner, regex("Yes", ignore_case = TRUE)))
ggplot(yes_partner, aes(Dependents)) +
  geom_bar(fill = "navy") +
  facet_wrap(~Contract) +
  ggtitle("Customers with a Partner's Contract Type") +
  labs(subtitle = "Faceted by Dependants", y = "Customers")

This chart shows that couples are likely to be retained and sign into longer contracts, regardless of whether they have dependents or not.

no_partner <- retention %>%
  filter(str_detect(Partner, regex("No", ignore_case = TRUE)))
ggplot(no_partner, aes(Dependents)) +
  geom_bar(fill = "brown") +
  facet_wrap(~Contract) +
  ggtitle("Customers without a Partner's Contract Type") +
  labs(subtitle = "Faceted by Dependants", y = "Customers")

This chart shows that singles are more likely to stay on a monthly contract, and are therefore more likely to leave. When comparing this to the previous chart, we can infer customers with partners and dependents are more likely to commit to longer contracts, while customers without partners and dependents favor flexible, short-term month-to-month plans.

ggplot(retention, aes(Tenure)) +
  geom_density(fill = "darkgreen") +
  facet_wrap(~Status) +
  ggtitle("Customer Retention per Tenure") +
  labs(y = "Percentage of Customers", x = "Length of Tenure (months)")

This density graph shows the distribution of customer tenure faceted by status. For current customers the distribution is fairly stable, with slight peaks at the beginning and around 60 months, suggesting a steady loyalty with a small spike in long-term retention. In contrast, the customers who have left show a steep decline after the first few months, most departures happen early, with very few customers leaving after 40+ months. This suggests that if customers stay past the early period, they are more likely to remain long-term.

ggplot(retention, aes(Tenure)) + 
  geom_density(fill = "red") + 
  facet_wrap(~Contract) + 
  ggtitle("Customers vs. Type of Contract ") + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  labs(y = "Percentage of Customers", x = "Length of Tenure (months)")

This density graph shows the percentage of customers based on tenure, faceted by contract type. Customers who have been around for 30+ months are more likely to sign a longer contract, while new customers are predominantly going to sign monthly contracts. This indicates that the longer we retain a customer, the more likely we are to retain them in the future.

intrnt <- retention %>%
  filter(str_detect(InternetService, regex("DSL|Fiber optic", ignore_case = TRUE)))
ggplot(intrnt, aes(TechSupport)) + 
  geom_bar(fill = "blue") + 
  facet_wrap(~ InternetService) +
  ggtitle("Tech Support usuage by Internet Service") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  labs(y = "Customers", x = "Tech Support")

This graph shows that customers generally need tech support less when using Fiber Optics compared to DSL. This indicates that fiber optics performs at a higher rate than DSL. # Machine Learning The first step to machine learning is setting up the test and train data sets. We do this after setting the seed, and then also create our cross validated data.

set.seed(123)
split <- initial_split(retention, prop = .7, strata = Status)
train <- training(split)
test <- testing(split)
kfolds <- vfold_cv(train, v = 5, strata = Status)

Logisitic Regression

Our first machine learning model we tested was logistic regression.

log <- logistic_reg() %>% 
  fit_resamples(Status ~ ., kfolds)
collect_metrics(log)

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.799     5 0.00449 Preprocessor1_Model1
## 2 brier_class binary     0.136     5 0.00243 Preprocessor1_Model1
## 3 roc_auc     binary     0.845     5 0.00590 Preprocessor1_Model1

fit <- logistic_reg() %>%
  fit(Status ~ ., data = train)
fit %>%
  predict(test) %>%
  bind_cols(test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332

fit %>% 
  predict(train, type = "prob") %>%
  mutate(truth = train$Status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

From this model we can see that there are a lot of false positives in its prediction. This goes to show that logistic regression is unfavorable when compared to other machine learning models we can use. ### Multivariate Adaptive Regression Splines The next machine learning model we will use is the MARS model. ##### MARS Model For this one we need to create a recipe for our data, in which we need to tune our numeric and nominal values.

set.seed(123)
recipe <- recipe(Status ~ ., data = train) %>%
  step_normalize(all_numeric_predictors()) %>% 
  step_dummy(all_nominal_predictors())
model <- mars(mode = "classification", num_terms = tune(), prod_degree = tune())
grid <- grid_regular(num_terms(range = c(1,30)), prod_degree(), levels = 50)
wflow <- workflow() %>% 
  add_recipe(recipe) %>% 
  add_model(model)
results <- wflow %>% 
  tune_grid(resamples = kfolds, grid = grid)
results %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))

## # A tibble: 60 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        18           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  2        19           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  3        20           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  4        21           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  5        22           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  6        23           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  7        24           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  8        25           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  9        26           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
## 10        27           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
## # ℹ 50 more rows

autoplot(results)

From this we learned that the MARS model has a slightly better AUC mean than our logistic regression model. We can now look into this further to find our most important variables and hyperparameters. ##### Feature Interpretation

best <- select_best(results, metric = "roc_auc")
final_wf <- workflow() %>% 
  add_model(model) %>% 
  add_formula(Status ~ .) %>% 
  finalize_workflow(best)
final_wf %>% 
  fit(data = train) %>%
  extract_fit_parsnip() %>%
  vip(10, type = "rss")

This shows that Tenure is the most important variable when determining if a customer is going to stay or leave Regork. ### Decision Tree For our last machine learning model we decided to use a decision tree. We are going to use the same recipe we made for the MARS model.

set.seed(123)
dt_model <- decision_tree( 
  mode = "classification", 
  tree_depth = 5, 
  min_n = 2,
  cost_complexity = 0.001) %>% 
  set_engine("rpart")
dt_fit <- workflow() %>% 
  add_recipe(recipe) %>% 
  add_model(dt_model) %>% 
  fit(data = train)
dt_results <- fit_resamples(dt_model, recipe, kfolds)
collect_metrics(dt_results)

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.784     5 0.00301 Preprocessor1_Model1
## 2 brier_class binary     0.152     5 0.00323 Preprocessor1_Model1
## 3 roc_auc     binary     0.765     5 0.0225  Preprocessor1_Model1

rpart.plot::rpart.plot(dt_fit$fit$fit$fit, roundint = FALSE)

This shows once again that Tenure is our most important variable. We can see that the AUC mean is the lowest of the three machine learning models, therefore is not as reliable.

Business Analysis & Conclusion

Our analysis identified MARS (Multivariate Adaptive Regression Splines) as the most effective predictive model, achieving the highest mean AUC score. We learned that customer tenure is the most important factor when determining if a customer will leave or stay.

A key insight was that customer loyalty significantly increases after the first year, suggesting that early retention strategies are critical. Customers with partners tend to have a more balanced distribution across different contract types, while customers without partners or dependents are more drawn to month-to-month contracts. Additionally, we found that customer churn occurs most frequently within the first few months. Our demographic analysis revealed that customers who leave due to service charges are predominantly single, male, and without dependents. To address this, Regork should consider offering discounts during the initial months to encourage longer-term retention, as well as include a special package or deal for singles.

To further strengthen our findings, future work could involve the use of additional machine learning models and deeper exploratory data analysis.

Regork Telecom

Evan Baumaster, Andrew Moore

2025-04-22