# As Regork faces a high customer churn rate, this threatens both short term and long term growth. And because finding new customers uses more resources and is more expensive than retaining existing customers it is important to understand what causes customers to leave and figure out how to prevent it. Customer churn rates directly impact the bottom line and even a small reduction in churn rates would greatly increase monthly revenue retention. By identifying at-risk customers and using strategic implementations we can improve customer lifetime value and gain a competitive edge. Using the customer retention data given I analyzed customer behavior and demographics to explore the contributing factors to churn. I applied three machine learning models, logistic regression, MARS, and random forest, to predict churn with high accuracy. My analysis dives into key churn indicators like contract type, tenure, and service subscriptions. These insight help regork to be able to predict which customers are most likely to churn, understand the driving factors behind the churn, and also take targeted action like offering long-term contracts or bundled services to retain high risk customers.
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.3.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.7 ✔ recipes 1.1.1
## ✔ dials 1.4.0 ✔ rsample 1.2.1
## ✔ dplyr 1.1.4 ✔ tibble 3.2.1
## ✔ ggplot2 3.5.1 ✔ tidyr 1.3.1
## ✔ infer 1.0.7 ✔ tune 1.3.0
## ✔ modeldata 1.4.0 ✔ workflows 1.2.0
## ✔ parsnip 1.3.0 ✔ workflowsets 1.1.0
## ✔ purrr 1.0.4 ✔ yardstick 1.3.2
## Warning: package 'broom' was built under R version 4.3.3
## Warning: package 'dials' was built under R version 4.3.3
## Warning: package 'scales' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'infer' was built under R version 4.3.3
## Warning: package 'modeldata' was built under R version 4.3.3
## Warning: package 'parsnip' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'recipes' was built under R version 4.3.3
## Warning: package 'rsample' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'tune' was built under R version 4.3.3
## Warning: package 'workflows' was built under R version 4.3.3
## Warning: package 'workflowsets' was built under R version 4.3.3
## Warning: package 'yardstick' was built under R version 4.3.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
# - helps streamline the process of building, evaluating, and comparing machine learning models
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ lubridate 1.9.4 ✔ stringr 1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# - makes it easy to explore, manipulate, and visualize data quickly
library(vip)
## Warning: package 'vip' was built under R version 4.3.3
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
# - visualize feature importance from machine learning models
library(baguette)
## Warning: package 'baguette' was built under R version 4.3.3
# - Reduces over fitting and improves generalization
library(ggplot2)
# - creating elegant, informative, and highly customizable graphics using the grammar of graphics framework
library(pdp)
## Warning: package 'pdp' was built under R version 4.3.3
##
## Attaching package: 'pdp'
##
## The following object is masked from 'package:purrr':
##
## partial
# - visualize the marginal effect of one or two features on the predicted outcome of a machine learning model
library(kernlab)
## Warning: package 'kernlab' was built under R version 4.3.3
##
## Attaching package: 'kernlab'
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## alpha
##
## The following object is masked from 'package:dials':
##
## buffer
##
## The following object is masked from 'package:scales':
##
## alpha
# - Great for modeling non-linear relationships in your data.
library(dplyr)
# - provides a consistent and intuitive set of verbs that make it easy to transform, filter, and summarize data
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.95 loaded
# - helps visualize correlation matrices
library(ranger)
## Warning: package 'ranger' was built under R version 4.3.3
# - fast implementation of random forests
df <- read.csv("customer_retention.csv")
df <- mutate (df, Status = factor(Status))
df <- na.omit(df)
#To better understand customer behavior and identify churn risk factors, I explored key trends in the dataset using visualization. This includes comparing churn status across tenure, charges, and correlations among numeric variables.
# Visualize churn distribution
ggplot(df, aes(x = Status)) +
geom_bar(fill = "steelblue") +
labs(title = "Customer Churn Distribution", x = "Churn Status", y = "Count")
# The Churn Distribution graph visually represents the proportion of
customers who have churned versus those who have remained with Regork
Telecom. It highlights that a significant portion of the customer base
is still retained, but the number of churned customers presents an
opportunity for improvement. This emphasizes the need for targeted
retention strategies.
numeric_vars <- df %>%
select(where(is.numeric))
cor_matrix <- cor(numeric_vars, use = "complete.obs")
corrplot(cor_matrix, method = "circle")
# The correlation matrix reveals a strong positive relationship between
Tenure and TotalCharges, confirming that longer-tenured customers
accumulate more charges over time. MonthlyCharges also correlates
positively with TotalCharges, though less strongly, suggesting both time
and pricing contribute to total spending. SeniorCitizen shows only weak
correlations with other variables, indicating that age status alone is
not a strong driver of churn or billing behavior.
ggplot(df, aes(x = Status, y = Tenure)) +
geom_boxplot(fill = "lightblue") +
labs(title = "Tenure by Churn Status", x = "Churn Status", y = "Tenure (months)")
ggplot(df, aes(x = Status, y = MonthlyCharges, fill = Status)) +
geom_boxplot() +
labs(title = "Monthly Charges by Churn Status", x = "Churn Status", y = "Monthly Charges")
# Tenure vs churn graph hows how long customers have been with Regork
(Tenure) compared between those who have churned vs. those who have
stayed.Customers who churn tend to have significantly shorter tenures.
This shows that the first few months are critical for customer
retention.
#The montly charges by churn shows the average amount customers pay per month broken down by churn status. Customers with higher monthly charges are more likely to churn, which may suggest pricing concerns or dissatisfaction with premium plans.
set.seed(123)
split <- initial_split(df, prop = 0.7, strata = Status)
train_data <- training(split)
test_data <- testing(split)
log_recipe <- recipe(Status ~ ., data = train_data) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
log_model <- logistic_reg(mode = "classification") %>%
set_engine("glm")
log_wf <- workflow() %>%
add_model(log_model) %>%
add_recipe(log_recipe)
set.seed(123)
log_folds <- vfold_cv(train_data, v = 5, strata = Status)
log_results <- log_wf %>%
fit_resamples(resamples = log_folds,
metrics = metric_set(roc_auc, accuracy),
control = control_resamples(save_pred = TRUE))
## → A | warning: prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## There were issues with some computations A: x1There were issues with some computations A: x2There were issues with some computations A: x3There were issues with some computations A: x4There were issues with some computations A: x5There were issues with some computations A: x5
collect_metrics(log_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.799 5 0.00401 Preprocessor1_Model1
## 2 roc_auc binary 0.845 5 0.00521 Preprocessor1_Model1
final_log_fit <- log_wf %>%
fit(data = train_data)
log_preds <- predict(final_log_fit, new_data = test_data, type = "class") %>%
bind_cols(test_data %>% select(Status))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
conf_mat(log_preds, truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1362 225
## Left 178 332
log_probs <- predict(final_log_fit, new_data = test_data, type = "prob") %>%
bind_cols(test_data %>% select(Status))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
roc_auc(log_probs, truth = Status, .pred_Current)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.845
#MARS Model
mars_model <- mars(mode = "classification",
num_terms = tune(),
prod_degree = tune()) %>%
set_engine("earth")
mars_wf <- workflow() %>%
add_model(mars_model) %>%
add_recipe(log_recipe)
mars_grid <- grid_regular(
num_terms(range = c(5, 30)),
prod_degree(range = c(1, 2)),
levels = 5
)
set.seed(123)
mars_results <- mars_wf %>%
tune_grid(
resamples = log_folds,
grid = mars_grid,
metrics = metric_set(roc_auc, accuracy)
)
## Warning: package 'earth' was built under R version 4.3.3
## Warning: package 'plotmo' was built under R version 4.3.3
collect_metrics(mars_results) %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean))
## # A tibble: 10 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 23 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 2 30 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 3 17 1 roc_auc binary 0.849 5 0.00503 Preprocessor1_M…
## 4 11 1 roc_auc binary 0.843 5 0.00493 Preprocessor1_M…
## 5 23 2 roc_auc binary 0.841 5 0.00532 Preprocessor1_M…
## 6 30 2 roc_auc binary 0.841 5 0.00540 Preprocessor1_M…
## 7 17 2 roc_auc binary 0.840 5 0.00572 Preprocessor1_M…
## 8 11 2 roc_auc binary 0.837 5 0.00690 Preprocessor1_M…
## 9 5 1 roc_auc binary 0.823 5 0.00323 Preprocessor1_M…
## 10 5 2 roc_auc binary 0.820 5 0.00657 Preprocessor1_M…
mars_best <- select_best(mars_results, metric = "roc_auc")
mars_final_wf <- mars_wf %>%
finalize_workflow(mars_best)
mars_final_fit <- mars_final_wf %>%
fit(data = train_data)
mars_preds <- predict(mars_final_fit, new_data = test_data, type = "class") %>%
bind_cols(test_data %>% select(Status))
conf_mat(mars_preds, truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1383 242
## Left 157 315
mars_probs <- predict(mars_final_fit, new_data = test_data, type = "prob") %>%
bind_cols(test_data %>% select(Status))
roc_auc(mars_probs, truth = Status, .pred_Current)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.847
mars_final_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 10)
#The MARS model slightly outperformed logistic regression, achieving a
cross-validated AUC of 0.849 using 23 terms and a product degree of 1.
This model’s ability to capture non-linear relationships allowed it to
better differentiate between churned and retained customers, especially
when interactions weren’t critical. Although the improvement over
logistic regression is modest, MARS maintains good interpretability
while introducing more flexibility. Key predictors remained consistent,
with tenure, contract type, and totalcharges emerging as the most
influential variables.
rf_model <- rand_forest(
mode = "classification",
mtry = tune(),
min_n = tune(),
trees = tune()
) %>%
set_engine("ranger", importance = "impurity")
rf_wf <- workflow() %>%
add_model(rf_model) %>%
add_recipe(log_recipe)
rf_grid <- grid_regular(
mtry(range = c(2, 10)),
min_n(range = c(1, 10)),
trees(range = c(100, 1000)),
levels = 4
)
set.seed(123)
rf_results <- rf_wf %>%
tune_grid(
resamples = log_folds,
grid = rf_grid,
metrics = metric_set(roc_auc, accuracy)
)
collect_metrics(rf_results) %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean))
## # A tibble: 64 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 700 10 roc_auc binary 0.844 5 0.00464 Preprocessor1_Model…
## 2 4 1000 10 roc_auc binary 0.844 5 0.00450 Preprocessor1_Model…
## 3 4 700 7 roc_auc binary 0.844 5 0.00421 Preprocessor1_Model…
## 4 4 100 10 roc_auc binary 0.843 5 0.00416 Preprocessor1_Model…
## 5 4 400 10 roc_auc binary 0.843 5 0.00442 Preprocessor1_Model…
## 6 4 1000 7 roc_auc binary 0.843 5 0.00449 Preprocessor1_Model…
## 7 4 400 7 roc_auc binary 0.843 5 0.00464 Preprocessor1_Model…
## 8 4 1000 4 roc_auc binary 0.843 5 0.00450 Preprocessor1_Model…
## 9 4 700 1 roc_auc binary 0.843 5 0.00431 Preprocessor1_Model…
## 10 4 700 4 roc_auc binary 0.843 5 0.00392 Preprocessor1_Model…
## # ℹ 54 more rows
rf_best <- select_best(rf_results, metric = "roc_auc")
rf_final_wf <- rf_wf %>%
finalize_workflow(rf_best)
rf_final_fit <- rf_final_wf %>%
fit(data = train_data)
rf_preds <- predict(rf_final_fit, new_data = test_data, type = "class") %>%
bind_cols(test_data %>% select(Status))
conf_mat(rf_preds, truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1385 265
## Left 155 292
rf_probs <- predict(rf_final_fit, new_data = test_data, type = "prob") %>%
bind_cols(test_data %>% select(Status))
roc_auc(rf_probs, truth = Status, .pred_Current)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.844
rf_final_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 10)
# The Random Forest model achieved a cross-validated AUC of 0.844,
matching the performance of logistic regression and slightly trailing
the MARS model. The best performance was reached with 700 trees, 4
predictors randomly sampled at each split, and a minimum node size of
10. While the improvement over the other models was minimal, Random
Forest provides valuable feature importance insights and tends to be
more robust to overfitting. The model confirmed that Tenure, Contract
Type, and MonthlyCharges were consistently the top predictors of
customer churn.
# Across all three models, the most influential predictors of customer churn are tenure, contract type, monthly charges and total charges, and payment method. Short-tenured customers are more likely to churn within the first 6-12 months. Customers on month-to-month contracts churn at much higher rates than those on a one year or two year agreement. Higher monthly bills are associated with higher churn, more so with customers who haven't been with the company long. Customers using electronic checks may be more likely to churn than those using automatic payment options.
# One strategy I recommend be implemented is an on boarding program with rewards, check ins and personalized offers to keep them engaged since a high amount of customers who leave do so within 12 months. Another strategy I recommend is encouraging upgrades from month to month using loyalty discounts and device bundling. A third strategy is offering a tiered service bundle or usage based billing for high charging customers who have not committed long term yet.
# My analysis indicates that short-tenured, high-bill, month-to-month customers are most likely to churn. By targeting these segments with contract incentives, price-based retention strategies, and proactive outreach we can use the Random Forest model for future churn prediction due to its high performance and inter predictability.