Synopsis

# As Regork faces a high customer churn rate, this threatens both short term and long term growth. And because finding new customers uses more resources and is more expensive than retaining existing customers it is important to understand what causes customers to leave and figure out how to prevent it. Customer churn rates directly impact the bottom line and even a small reduction in churn rates would greatly increase monthly revenue retention. By identifying at-risk customers and using strategic implementations we can improve customer lifetime value and gain a competitive edge. Using the customer retention data given I analyzed customer behavior and demographics to explore the contributing factors to churn. I applied three machine learning models, logistic regression, MARS, and random forest, to predict churn with high accuracy. My analysis dives into key churn indicators like contract type, tenure, and service subscriptions. These insight help regork to be able to predict which customers are most likely to churn, understand the driving factors behind the churn, and also take targeted action like offering long-term contracts or bundled services to retain high risk customers.

Data Preparation

library(tidymodels)

## Warning: package 'tidymodels' was built under R version 4.3.3

## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──

## ✔ broom        1.0.7     ✔ recipes      1.1.1
## ✔ dials        1.4.0     ✔ rsample      1.2.1
## ✔ dplyr        1.1.4     ✔ tibble       3.2.1
## ✔ ggplot2      3.5.1     ✔ tidyr        1.3.1
## ✔ infer        1.0.7     ✔ tune         1.3.0
## ✔ modeldata    1.4.0     ✔ workflows    1.2.0
## ✔ parsnip      1.3.0     ✔ workflowsets 1.1.0
## ✔ purrr        1.0.4     ✔ yardstick    1.3.2

## Warning: package 'broom' was built under R version 4.3.3

## Warning: package 'dials' was built under R version 4.3.3

## Warning: package 'scales' was built under R version 4.3.3

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'infer' was built under R version 4.3.3

## Warning: package 'modeldata' was built under R version 4.3.3

## Warning: package 'parsnip' was built under R version 4.3.3

## Warning: package 'purrr' was built under R version 4.3.3

## Warning: package 'recipes' was built under R version 4.3.3

## Warning: package 'rsample' was built under R version 4.3.3

## Warning: package 'tibble' was built under R version 4.3.3

## Warning: package 'tidyr' was built under R version 4.3.3

## Warning: package 'tune' was built under R version 4.3.3

## Warning: package 'workflows' was built under R version 4.3.3

## Warning: package 'workflowsets' was built under R version 4.3.3

## Warning: package 'yardstick' was built under R version 4.3.3

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()

# -  helps streamline the process of building, evaluating, and comparing machine learning models
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'readr' was built under R version 4.3.3

## Warning: package 'stringr' was built under R version 4.3.3

## Warning: package 'forcats' was built under R version 4.3.3

## Warning: package 'lubridate' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.4     ✔ stringr   1.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ stringr::fixed()    masks recipes::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::spec()       masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# - makes it easy to explore, manipulate, and visualize data quickly
library(vip)

## Warning: package 'vip' was built under R version 4.3.3

## 
## Attaching package: 'vip'
## 
## The following object is masked from 'package:utils':
## 
##     vi

# - visualize feature importance from machine learning models
library(baguette)

## Warning: package 'baguette' was built under R version 4.3.3

# - Reduces over fitting and improves generalization
library(ggplot2)
# - creating elegant, informative, and highly customizable graphics using the grammar of graphics framework
library(pdp)

## Warning: package 'pdp' was built under R version 4.3.3

## 
## Attaching package: 'pdp'
## 
## The following object is masked from 'package:purrr':
## 
##     partial

# - visualize the marginal effect of one or two features on the predicted outcome of a machine learning model
library(kernlab)

## Warning: package 'kernlab' was built under R version 4.3.3

## 
## Attaching package: 'kernlab'
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha
## 
## The following object is masked from 'package:dials':
## 
##     buffer
## 
## The following object is masked from 'package:scales':
## 
##     alpha

# - Great for modeling non-linear relationships in your data.
library(dplyr)
# - provides a consistent and intuitive set of verbs that make it easy to transform, filter, and summarize data
library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.3

## corrplot 0.95 loaded

# - helps visualize correlation matrices 
library(ranger)

## Warning: package 'ranger' was built under R version 4.3.3

# - fast implementation of random forests
df <- read.csv("customer_retention.csv")
df <- mutate (df, Status = factor(Status))
df <-  na.omit(df)

After loading the data and required libraries, I converted the churn status into a factor for classification modeling and removed any missing values.

Exploratory Data Analysis

#To better understand customer behavior and identify churn risk factors, I explored key trends in the dataset using visualization. This includes comparing churn status across tenure, charges, and correlations among numeric variables.

# Visualize churn distribution
ggplot(df, aes(x = Status)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Customer Churn Distribution", x = "Churn Status", y = "Count")

# The Churn Distribution graph visually represents the proportion of customers who have churned versus those who have remained with Regork Telecom. It highlights that a significant portion of the customer base is still retained, but the number of churned customers presents an opportunity for improvement. This emphasizes the need for targeted retention strategies.

numeric_vars <- df %>%
  select(where(is.numeric))

cor_matrix <- cor(numeric_vars, use = "complete.obs")
corrplot(cor_matrix, method = "circle")

# The correlation matrix reveals a strong positive relationship between Tenure and TotalCharges, confirming that longer-tenured customers accumulate more charges over time. MonthlyCharges also correlates positively with TotalCharges, though less strongly, suggesting both time and pricing contribute to total spending. SeniorCitizen shows only weak correlations with other variables, indicating that age status alone is not a strong driver of churn or billing behavior.

ggplot(df, aes(x = Status, y = Tenure)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Tenure by Churn Status", x = "Churn Status", y = "Tenure (months)")

ggplot(df, aes(x = Status, y = MonthlyCharges, fill = Status)) +
  geom_boxplot() +
  labs(title = "Monthly Charges by Churn Status", x = "Churn Status", y = "Monthly Charges")

# Tenure vs churn graph hows how long customers have been with Regork (Tenure) compared between those who have churned vs. those who have stayed.Customers who churn tend to have significantly shorter tenures. This shows that the first few months are critical for customer retention.

#The montly charges by churn shows the average amount customers pay per month broken down by churn status. Customers with higher monthly charges are more likely to churn, which may suggest pricing concerns or dissatisfaction with premium plans.

Machine Learning

Logistic Regression

set.seed(123)  
split <- initial_split(df, prop = 0.7, strata = Status)
train_data <- training(split)
test_data  <- testing(split)

log_recipe <- recipe(Status ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

log_model <- logistic_reg(mode = "classification") %>%
  set_engine("glm")

log_wf <- workflow() %>%
  add_model(log_model) %>%
  add_recipe(log_recipe)

set.seed(123)
log_folds <- vfold_cv(train_data, v = 5, strata = Status)

log_results <- log_wf %>%
  fit_resamples(resamples = log_folds,
                metrics = metric_set(roc_auc, accuracy),
                control = control_resamples(save_pred = TRUE))

## → A | warning: prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases

## There were issues with some computations   A: x1There were issues with some computations   A: x2There were issues with some computations   A: x3There were issues with some computations   A: x4There were issues with some computations   A: x5There were issues with some computations   A: x5

collect_metrics(log_results)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.799     5 0.00401 Preprocessor1_Model1
## 2 roc_auc  binary     0.845     5 0.00521 Preprocessor1_Model1

final_log_fit <- log_wf %>%
  fit(data = train_data)

log_preds <- predict(final_log_fit, new_data = test_data, type = "class") %>%
  bind_cols(test_data %>% select(Status))

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases

conf_mat(log_preds, truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332

log_probs <- predict(final_log_fit, new_data = test_data, type = "prob") %>%
  bind_cols(test_data %>% select(Status))

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases

roc_auc(log_probs, truth = Status, .pred_Current)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.845

The logistic regression model served as my baseline classifier. It achieved an AUC of approximately 0.845 and an accuracy of 79.9% based on 5-fold cross-validation. This indicates that the model is quite effective at distinguishing between customers who stay and those who churn. However, the model did show a moderate bias toward predicting that a customer would stay, resulting in some false positives such as predicting Current when the customer actually left. Despite that, logistic regression is very interpretable and provides valuable insight into key churn drivers such as Tenure, Contract Type, and TotalCharges.

#MARS Model

mars_model <- mars(mode = "classification",
                   num_terms = tune(),
                   prod_degree = tune()) %>%
  set_engine("earth")

mars_wf <- workflow() %>%
  add_model(mars_model) %>%
  add_recipe(log_recipe)

mars_grid <- grid_regular(
  num_terms(range = c(5, 30)),
  prod_degree(range = c(1, 2)),
  levels = 5
)

set.seed(123)
mars_results <- mars_wf %>%
  tune_grid(
    resamples = log_folds,
    grid = mars_grid,
    metrics = metric_set(roc_auc, accuracy)
  )

## Warning: package 'earth' was built under R version 4.3.3

## Warning: package 'plotmo' was built under R version 4.3.3

collect_metrics(mars_results) %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))

## # A tibble: 10 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        23           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  2        30           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  3        17           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  4        11           1 roc_auc binary     0.843     5 0.00493 Preprocessor1_M…
##  5        23           2 roc_auc binary     0.841     5 0.00532 Preprocessor1_M…
##  6        30           2 roc_auc binary     0.841     5 0.00540 Preprocessor1_M…
##  7        17           2 roc_auc binary     0.840     5 0.00572 Preprocessor1_M…
##  8        11           2 roc_auc binary     0.837     5 0.00690 Preprocessor1_M…
##  9         5           1 roc_auc binary     0.823     5 0.00323 Preprocessor1_M…
## 10         5           2 roc_auc binary     0.820     5 0.00657 Preprocessor1_M…

mars_best <- select_best(mars_results, metric = "roc_auc")

mars_final_wf <- mars_wf %>%
  finalize_workflow(mars_best)

mars_final_fit <- mars_final_wf %>%
  fit(data = train_data)

mars_preds <- predict(mars_final_fit, new_data = test_data, type = "class") %>%
  bind_cols(test_data %>% select(Status))

conf_mat(mars_preds, truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1383  242
##    Left        157  315

mars_probs <- predict(mars_final_fit, new_data = test_data, type = "prob") %>%
  bind_cols(test_data %>% select(Status))

roc_auc(mars_probs, truth = Status, .pred_Current)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.847

mars_final_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

#The MARS model slightly outperformed logistic regression, achieving a cross-validated AUC of 0.849 using 23 terms and a product degree of 1. This model’s ability to capture non-linear relationships allowed it to better differentiate between churned and retained customers, especially when interactions weren’t critical. Although the improvement over logistic regression is modest, MARS maintains good interpretability while introducing more flexibility. Key predictors remained consistent, with tenure, contract type, and totalcharges emerging as the most influential variables.

Random Forest

rf_model <- rand_forest(
  mode = "classification",
  mtry = tune(),
  min_n = tune(),
  trees = tune()
) %>%
  set_engine("ranger", importance = "impurity")

rf_wf <- workflow() %>%
  add_model(rf_model) %>%
  add_recipe(log_recipe)

rf_grid <- grid_regular(
  mtry(range = c(2, 10)),
  min_n(range = c(1, 10)),
  trees(range = c(100, 1000)),
  levels = 4
)

set.seed(123)
rf_results <- rf_wf %>%
  tune_grid(
    resamples = log_folds,
    grid = rf_grid,
    metrics = metric_set(roc_auc, accuracy)
  )

collect_metrics(rf_results) %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))

## # A tibble: 64 × 9
##     mtry trees min_n .metric .estimator  mean     n std_err .config             
##    <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
##  1     4   700    10 roc_auc binary     0.844     5 0.00464 Preprocessor1_Model…
##  2     4  1000    10 roc_auc binary     0.844     5 0.00450 Preprocessor1_Model…
##  3     4   700     7 roc_auc binary     0.844     5 0.00421 Preprocessor1_Model…
##  4     4   100    10 roc_auc binary     0.843     5 0.00416 Preprocessor1_Model…
##  5     4   400    10 roc_auc binary     0.843     5 0.00442 Preprocessor1_Model…
##  6     4  1000     7 roc_auc binary     0.843     5 0.00449 Preprocessor1_Model…
##  7     4   400     7 roc_auc binary     0.843     5 0.00464 Preprocessor1_Model…
##  8     4  1000     4 roc_auc binary     0.843     5 0.00450 Preprocessor1_Model…
##  9     4   700     1 roc_auc binary     0.843     5 0.00431 Preprocessor1_Model…
## 10     4   700     4 roc_auc binary     0.843     5 0.00392 Preprocessor1_Model…
## # ℹ 54 more rows

rf_best <- select_best(rf_results, metric = "roc_auc")

rf_final_wf <- rf_wf %>%
  finalize_workflow(rf_best)

rf_final_fit <- rf_final_wf %>%
  fit(data = train_data)

rf_preds <- predict(rf_final_fit, new_data = test_data, type = "class") %>%
  bind_cols(test_data %>% select(Status))

conf_mat(rf_preds, truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1385  265
##    Left        155  292

rf_probs <- predict(rf_final_fit, new_data = test_data, type = "prob") %>%
  bind_cols(test_data %>% select(Status))

roc_auc(rf_probs, truth = Status, .pred_Current)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.844

rf_final_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

# The Random Forest model achieved a cross-validated AUC of 0.844, matching the performance of logistic regression and slightly trailing the MARS model. The best performance was reached with 700 trees, 4 predictors randomly sampled at each split, and a minimum node size of 10. While the improvement over the other models was minimal, Random Forest provides valuable feature importance insights and tends to be more robust to overfitting. The model confirmed that Tenure, Contract Type, and MonthlyCharges were consistently the top predictors of customer churn.

Business Analysis

# Across all three models, the most influential predictors of customer churn are tenure, contract type, monthly charges and total charges, and payment method. Short-tenured customers are more likely to churn within the first 6-12 months. Customers on month-to-month contracts churn at much higher rates than those on a one year or two year agreement. Higher monthly bills are associated with higher churn, more so with customers who haven't been with the company long. Customers using electronic checks may be more likely to churn than those using automatic payment options. 

# One strategy I recommend be implemented is an on boarding program with rewards, check ins and personalized offers to keep them engaged since a high amount of customers who leave do so within 12 months. Another strategy I recommend is encouraging upgrades from month to month using loyalty discounts and device bundling. A third strategy is offering a tiered service bundle or usage based billing for high charging customers who have not committed long term yet. 

# My analysis indicates that short-tenured, high-bill, month-to-month customers are most likely to churn. By targeting these segments with contract incentives, price-based retention strategies, and proactive outreach we can use the Random Forest model for future churn prediction due to its high performance and inter predictability.

Final Project 2025

Jack Moeller

2025-04-27