Customer retention is a critical focus for Regork, especially as the company expands its footprint in the competitive telecommunications industry. Offering services such as internet, phone, and streaming, Regork recognizes that retaining current customers is more cost-effective than acquiring new ones. Understanding why customers leave (churn) is essential for maintaining revenue and fostering long-term loyalty.
The goal of this project is to analyze customer data and build a predictive model to identify individuals at risk of leaving. This model will help Regork implement proactive strategies, such as targeted incentives, to reduce churn and maintain customer satisfaction. By leveraging data-driven insights, we aim to support the company’s efforts to enhance customer loyalty and minimize revenue losses.
Our analysis involves exploring the data to uncover trends, building machine learning models to predict churn, and providing actionable recommendations based on our findings. This comprehensive approach ensures that Regork can address churn effectively and continue its growth in the telecommunications market.
# Load libraries
library(tidymodels)
library(tidyverse)
library(baguette)
library(vip)
library(pdp)
library(here)
library(kernlab)
library(ggplot2)
library(earth)
library(rpart.plot)
library(ranger)
# Load data
c_retention <- read.csv("customer_retention.csv")
str(c_retention)
## 'data.frame': 6999 obs. of 20 variables:
## $ Gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ Tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Status : chr "Current" "Current" "Left" "Current" ...
# Check for missing values
sum(is.na(c_retention))
## [1] 11
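Before dropping any rows, it helps to see which columns hold the missing values; a quick per-column check:
# Count missing values per column to locate the 11 NAs
colSums(is.na(c_retention))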
# Data preprocessing
c_retention <- c_retention %>%
mutate(Status = factor(Status)) %>%
na.omit()
sum(is.na(c_retention))
## [1] 0
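As an optional extra step, the remaining character predictors can be converted to factors up front so every modeling engine treats them consistently:
# Optional: convert character predictors to factors for consistent handling downstream
c_retention <- c_retention %>%
  mutate(across(where(is.character), as.factor))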
## Exploratory Data Analysis
### Device Protection
dp_counts <- c_retention %>%
  filter(Status == "Left") %>%
  count(DeviceProtection) %>%
  arrange(desc(n))
ggplot(dp_counts, aes(x = fct_reorder(DeviceProtection, n), y = n)) +
geom_col(fill = "black", width = 0.5) +
theme_minimal() +
labs(title = "Presence of Device Protection Among Users Who Left",
x = "Device Protection",
y = "Number of Users") +
coord_flip() +
theme(axis.text.y = element_text(size = 10))
### Contract Type
contract_counts <- c_retention %>%
  filter(Status == "Left") %>%
  count(Contract) %>%
  arrange(desc(n))
ggplot(contract_counts, aes(x = fct_reorder(Contract, n), y = n)) +
geom_col(fill = "red", width = 0.5) +
theme_minimal() +
labs(title = "Contract Types of Users Who Left",
x = "Contract Type",
y = "Number of Users") +
coord_flip() +
theme(axis.text.y = element_text(size = 10))
### Tenure
tenure_counts <- c_retention %>%
  filter(Status == "Left") %>%
  count(Tenure) %>%
  arrange(desc(n))
ggplot(tenure_counts, aes(x = Tenure, y = n)) +
geom_line(color = "yellow", linewidth = 1) +
geom_point(color = "yellow", size = 3) +
theme_minimal() +
labs(title = "Distribution of Tenure Among Departed Users",
x = "Months",
y = "Number of Users")
The graph reveals that customer churn decreases as tenure increases, reaching its lowest point around 60 months, after which it begins to rise again.
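To quantify this pattern, churn rates can be computed within tenure bands; a minimal sketch (the upper break of 72 months is an assumption about the maximum tenure):
# Churn rate within 12-month tenure bands
c_retention %>%
  mutate(TenureBand = cut(Tenure, breaks = seq(0, 72, by = 12), include.lowest = TRUE)) %>%
  group_by(TenureBand) %>%
  summarise(ChurnRate = mean(Status == "Left"), Users = n())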
### Partner and Tenure
ggplot(c_retention, aes(Tenure)) +
geom_bar(fill = "black") +
ggtitle("Partner and Tenure") +
facet_wrap(~Partner)
The graph indicates that customers with a partner tend to have longer tenures which, given the tenure pattern above, suggests they are less likely to churn than customers without one.
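A direct churn-rate comparison supports this reading; a quick sketch:
# Churn rate by partner status
c_retention %>%
  group_by(Partner) %>%
  summarise(ChurnRate = mean(Status == "Left"), Users = n())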
## Machine Learning
Our first model is logistic regression.
### Model Training and Evaluation
set.seed(123)
log_split <- initial_split(c_retention, prop = 0.5, strata = Status)
log_train <- training(log_split)
log_test <- testing(log_split)
set.seed(123)
log_kfolds <- vfold_cv(log_train, v = 5, strata = Status)
results <- logistic_reg() %>%
fit_resamples(Status ~ ., log_kfolds)
collect_metrics(results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.809 5 0.00841 Preprocessor1_Model1
## 2 brier_class binary 0.135 5 0.00329 Preprocessor1_Model1
## 3 roc_auc binary 0.847 5 0.00735 Preprocessor1_Model1
We obtained a cross-validated ROC-AUC of 0.847, a strong baseline that the later models will try to improve on.
### Final Model
final_fit <- logistic_reg() %>%
fit(Status ~ ., data = log_train)
tidy(final_fit)
## # A tibble: 31 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.84 1.16 1.59 1.12e- 1
## 2 GenderMale -0.0753 0.0924 -0.815 4.15e- 1
## 3 SeniorCitizen 0.239 0.121 1.98 4.78e- 2
## 4 PartnerYes -0.114 0.111 -1.03 3.02e- 1
## 5 DependentsYes -0.0595 0.127 -0.468 6.40e- 1
## 6 Tenure -0.0565 0.00876 -6.45 1.09e-10
## 7 PhoneServiceYes 0.738 0.923 0.800 4.24e- 1
## 8 MultipleLinesNo phone service NA NA NA NA
## 9 MultipleLinesYes 0.498 0.250 1.99 4.67e- 2
## 10 InternetServiceFiber optic 2.26 1.14 1.98 4.79e- 2
## # ℹ 21 more rows
final_fit %>%
predict(log_test) %>%
bind_cols(log_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 2272 412
## Left 294 516
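Beyond the raw confusion matrix, summary test-set metrics can be computed with yardstick (loaded via tidymodels); a sketch:
# Test-set accuracy and kappa for the logistic model
final_fit %>%
  predict(log_test) %>%
  bind_cols(log_test %>% select(Status)) %>%
  metrics(truth = Status, estimate = .pred_class)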
Our second model is multivariate adaptive regression splines (MARS).
### Data Preparation and Tuning
set.seed(123)
mars_split <- initial_split(c_retention, prop = 0.5, strata = Status)
mars_train <- training(mars_split)
mars_test <- testing(mars_split)
mars_recipe <- recipe(Status ~ ., data = mars_train) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors())
set.seed(123)
kfolds <- vfold_cv(mars_train, v = 5, strata = Status)
mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
set_mode('classification')
mars_grid <- grid_regular(num_terms(range = c(1,30)), prod_degree(), levels = 40)
mars_wf <- workflow() %>%
add_recipe(mars_recipe) %>%
add_model(mars_mod)
tuning_results <- mars_wf %>%
tune_grid(resamples = kfolds, grid = mars_grid)
tuning_results %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean))
## # A tibble: 60 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 2 16 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 3 17 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 4 18 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 5 19 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 6 20 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 7 21 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 8 22 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 9 23 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 10 24 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## # ℹ 50 more rows
We achieved a maximum ROC-AUC of 0.850, which is promising, though we believe there is still room for improvement. This value was obtained with 15 terms and a product degree of 1, i.e., a purely additive model with no interaction terms.
autoplot(tuning_results)
### Final Model
best_hyperparameters <- select_best(tuning_results, metric = "roc_auc")
final_wf <- workflow() %>%
add_recipe(mars_recipe) %>%
add_model(mars_mod) %>%
finalize_workflow(best_hyperparameters)
final_fit <- final_wf %>%
fit(data = mars_train)
final_fit %>%
extract_fit_parsnip() %>%
vip()
This model shows our most important features to be tenure, total charges, and payment method.
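To see which basis functions the final MARS model actually selected, the underlying earth fit can be pulled from the workflow; a sketch:
# Inspect the hinge functions retained by the fitted earth model
final_fit %>%
  extract_fit_engine() %>%
  summary()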
final_fit %>%
predict(mars_test) %>%
bind_cols(mars_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 2312 449
## Left 254 479
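For comparability with the cross-validated results, the test-set ROC-AUC can also be computed from class probabilities (yardstick treats the first factor level, Current, as the event by default); a sketch:
# Test-set ROC-AUC for the MARS model
final_fit %>%
  predict(mars_test, type = "prob") %>%
  bind_cols(mars_test %>% select(Status)) %>%
  roc_auc(truth = Status, .pred_Current)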
Our third model is a decision tree.
### Tuning and Feature Importance
dt_mod <- decision_tree(mode = 'classification') %>%
set_engine("rpart")
model_recipe <- recipe(Status ~ ., data = log_train)
dt_fit <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
fit(data = log_train)
rpart.plot::rpart.plot(extract_fit_engine(dt_fit))
The root node of the decision tree is based on the contract type, specifically whether it is a 1-year or 2-year contract.
set.seed(123)
kfold <- vfold_cv(log_train, v = 5, strata = Status)
dt_results <- fit_resamples(dt_mod, model_recipe, kfold)
collect_metrics(dt_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.788 5 0.00624 Preprocessor1_Model1
## 2 brier_class binary 0.148 5 0.00278 Preprocessor1_Model1
## 3 roc_auc binary 0.799 5 0.00604 Preprocessor1_Model1
The decision tree produced an ROC-AUC of 0.799, which is not ideal. With further tuning, we aim to develop a more effective model.
### Final Model
dt_mod <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
dt_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
set.seed(123)
dt_results <- tune_grid(dt_mod, model_recipe, resamples = kfold, grid = dt_hyper_grid)
show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 11 40 roc_auc binary 0.805 5 0.0111
## 2 0.0000000178 11 40 roc_auc binary 0.805 5 0.0111
## 3 0.00000316 11 40 roc_auc binary 0.805 5 0.0111
## 4 0.0000000001 15 40 roc_auc binary 0.805 5 0.0111
## 5 0.0000000178 15 40 roc_auc binary 0.805 5 0.0111
## # ℹ 1 more variable: .config <chr>
Tuning improves the ROC-AUC to 0.805, better than the untuned tree but still short of our strongest models.
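The tuning profile can be visualized the same way as for MARS; a one-line sketch:
# Visualize how performance varies across the decision tree's tuning grid
autoplot(dt_results)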
dt_best_model <- select_best(dt_results, metric = 'roc_auc')
dt_final_wf <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
finalize_workflow(dt_best_model)
dt_final_fit <- dt_final_wf %>%
fit(data = log_train)
dt_final_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 20)
With decision trees, the most influential features identified are tenure, contract type, and total charges.
dt_final_fit %>%
predict(log_test) %>%
bind_cols(log_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 2261 446
## Left 305 482
Lastly, we analyze a random forest model.
### Tuning and Feature Importance
For the random forest we reuse the same 50/50 training and test split as the earlier models, so results stay directly comparable and resampling never touches the test set.
rf_recipe <- recipe(Status ~ ., data = log_train)
rf_mod <- rand_forest(
mode = "classification",
trees = tune(),
mtry = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity")
rf_hyper_grid <- grid_regular(
trees(range = c(50, 200)),
mtry(range = c(2, 10)),
min_n(range = c(1, 5)),
levels = 3)
set.seed(123)
rf_kfold <- vfold_cv(log_train, v = 5, strata = Status)
set.seed(123)
rf_results <- tune_grid(
rf_mod,
rf_recipe,
resamples = rf_kfold,
grid = rf_hyper_grid
)
collect_metrics(rf_results)
## # A tibble: 81 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 50 1 accuracy binary 0.800 5 0.00373 Preprocessor1_M…
## 2 2 50 1 brier_class binary 0.138 5 0.00197 Preprocessor1_M…
## 3 2 50 1 roc_auc binary 0.839 5 0.00555 Preprocessor1_M…
## 4 2 125 1 accuracy binary 0.802 5 0.00392 Preprocessor1_M…
## 5 2 125 1 brier_class binary 0.137 5 0.00226 Preprocessor1_M…
## 6 2 125 1 roc_auc binary 0.842 5 0.00595 Preprocessor1_M…
## 7 2 200 1 accuracy binary 0.802 5 0.00355 Preprocessor1_M…
## 8 2 200 1 brier_class binary 0.137 5 0.00227 Preprocessor1_M…
## 9 2 200 1 roc_auc binary 0.841 5 0.00619 Preprocessor1_M…
## 10 6 50 1 accuracy binary 0.791 5 0.00426 Preprocessor1_M…
## # ℹ 71 more rows
show_best(rf_results, metric = "roc_auc")
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 200 3 roc_auc binary 0.843 5 0.00548 Preprocessor1_Model12
## 2 2 125 5 roc_auc binary 0.842 5 0.00545 Preprocessor1_Model20
## 3 2 200 5 roc_auc binary 0.842 5 0.00569 Preprocessor1_Model21
## 4 2 125 1 roc_auc binary 0.842 5 0.00545 Preprocessor1_Model02
## 5 2 125 3 roc_auc binary 0.841 5 0.00620 Preprocessor1_Model11
With hyperparameter tuning, the random forest reaches a cross-validated ROC-AUC of 0.843.
rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")
final_rf_wf <- workflow() %>%
add_recipe(rf_recipe) %>%
add_model(rf_mod) %>%
finalize_workflow(rf_best_hyperparameters)
rf_final_fit <- final_rf_wf %>%
fit(data = log_train)
rf_final_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 20)
Our random forest model identifies tenure, total charges, and contract type as the most important features.
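As with the earlier models, a single comparable test-set ROC-AUC can be computed from class probabilities; a sketch:
# Test-set ROC-AUC for the tuned random forest
rf_final_fit %>%
  predict(log_test, type = "prob") %>%
  bind_cols(log_test %>% select(Status)) %>%
  roc_auc(truth = Status, .pred_Current)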
rf_final_fit %>%
predict(log_test) %>%
bind_cols(log_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 2320 484
## Left 246 444
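Collecting the cross-validated ROC-AUC values reported above gives a compact side-by-side view of the four models; a small summary sketch:
# Cross-validated ROC-AUC by model (values reported in the sections above)
tribble(
  ~Model,                    ~ROC_AUC,
  "Logistic regression",     0.847,
  "MARS (15 terms, deg 1)",  0.850,
  "Decision tree (tuned)",   0.805,
  "Random forest (tuned)",   0.843
)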
Our analysis identified the key factors driving customer churn at Regork and produced predictive models to address it. Among the models tested, MARS performed best, achieving a cross-validated ROC-AUC of 0.850 with 15 terms and a product degree of 1. Tenure, total charges, contract type, and payment method were the dominant drivers, with tenure consistently emerging as the most important feature across all models.
Based on these findings, we recommend that Regork focus on several strategies to improve customer retention. First, incentivizing long-term contracts by offering transparent pricing structures and avoiding unexpected charges can enhance customer trust and satisfaction. Second, rewarding customer loyalty through discounts, free devices for long-term customers, or bundled services can foster stronger relationships. Finally, addressing payment-related concerns by simplifying and diversifying payment methods will reduce friction and further improve customer satisfaction.
These strategies, combined with targeted interventions for at-risk customers, can help Regork reduce churn, increase customer satisfaction, and build long-term brand loyalty. While our analysis demonstrated strong predictive capabilities, it was constrained by the dataset’s size and lack of temporal data. A larger, time-series dataset would provide a more comprehensive understanding of customer behavior and churn trends over time. Additionally, exploring a broader range of machine learning models could yield even greater accuracy.
Despite these limitations, our project provides actionable insights and a solid foundation for Regork to enhance its customer retention strategies. By adopting a data-driven approach, the company is well-positioned to thrive in the competitive telecommunications industry.