Introduction

Introduction

For this project, Regork has tasked us with analyzing customer retention for sustainable growth. As data scientists, our analysis needed to study the customer behavior and identify patterns related to churn. To begin, we structured our project to focus on key retention factors like payment methods, monthly charges, contract types, and additional services. Additionally, we employed machine learning models of logistic regression, decision trees, and random forest to predict future turn, identify important factors, and develop actionable recommendations.

Tenure Analysis: For our first analysis, we began examining the length of time customers remained with Regork and compared it to those who stayed versus left. This revealed that shorter tenure customers were more likely to churn and helped us focus on offering incentives for new customers to establish long term loyalty.

Payment and Contract Type: We then examined payment methods and contract types to understand how the options were correlated to churn. We identified that customers with certain payment behaviors and contract types displayed clear churn patterns. This will help Regork tailor its efforts to longer contracts and other payment types.

Billing Methods: We also wanted to examine how Regork billed its customers to identify areas of improvement on their end. This analysis focused on paperless vs mailed bills and churn rates. Customers using paperless billing were significantly more likely to leave and suggests that Regork

Gender Analysis: Finally, we examined the gender demographic to see if there was a correlation with churn. While our analysis and graph did not reveal any meaningful information, it helped us focus on more relevant factors and identify which factors had the largest impact.

General Analysis Finally, we wanted to expand and see if gender played a role in churn likelihood. Subsequently, we realized gender had little to no significance and allowed us to prioritize more impactful factors within customer retention.

Data Set-Up

Data Set-Up

# Required Packages ---------------------------------------------------------------
library(tidyverse)
library(tidymodels)
library(dplyr)
library(ggplot2)
library(lattice)
library(caret)
library(randomForest)
library(ranger)
library(vip)
library(rpart)
library(rpart.plot)
library(pROC)
library(e1071)

# Data Preparation ----------------------------------------------------------------
data <- read.csv("customer_retention.csv")
colSums(is.na(data))
data$TotalCharges <- ifelse(is.na(data$TotalCharges),
                            mean(data$TotalCharges, na.rm = TRUE),
                            data$TotalCharges)
data <- data %>%
  mutate(OnlineBackup = ifelse(OnlineBackup == "No internet service", "No", OnlineBackup),
         TechSupport = ifelse(TechSupport == "No internet service", "No", TechSupport))

Exploratory Analysis

Exploratory Analysis

# Bar Chart for Status Distribution
status_plot <- ggplot(data, aes(x = Status)) +
  geom_bar(fill = c("green", "red"), alpha = 0.7) +
  theme_minimal() +
  labs(title = "Customer Churn Distribution", x = "Customer Status", y = "Count") +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5)
print(status_plot)

# Tenure distribution by churn status (Histogram)
tenure_plot <- ggplot(data, aes(x = Tenure, fill = Status)) +
  geom_density(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Tenure Distribution by Customer Status", x = "Tenure (Months)", y = "Density") +
  scale_fill_manual(values = c("green", "red"))
print(tenure_plot)

# Boxplot for monthly charges vs. churn status
monthly_charges_plot <- ggplot(data, aes(x = Status, y = MonthlyCharges, fill = Status)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.2, position = position_dodge(width = 0.9), alpha = 0.5) +
  theme_minimal() +
  labs(title = "Monthly Charges by Customer Status", x = "Status", y = "Monthly Charges ($)") +
  scale_fill_manual(values = c("green", "red"))
print(monthly_charges_plot)

# Bar chart for payment method distribution
payment_method_plot <- ggplot(data, aes(x = PaymentMethod, fill = Status)) +
  geom_bar(position = "fill", alpha = 0.8) +
  theme_minimal() +
  labs(title = "Payment Method Proportions by Customer Status", x = "Payment Method", y = "Proportion") +
  scale_fill_manual(values = c("green", "red")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(payment_method_plot)

# Stacked bar chart for contract type by churn
contract_plot <- ggplot(data, aes(x = Contract, fill = Status)) +
  geom_bar(position = "fill", alpha = 0.8) +
  theme_minimal() +
  labs(title = "Contract Type by Churn Status", x = "Contract Type", y = "Proportion") +
  scale_fill_manual(values = c("green", "red"))
print(contract_plot)

# Bar chart for paperless billing
billing_plot <- ggplot(data, aes(x = PaperlessBilling, fill = Status)) +
  geom_bar(position = "dodge", alpha = 0.8) +
  theme_minimal() +
  labs(title = "Paperless Billing and Churn", x = "Paperless Billing", y = "Count") +
  scale_fill_manual(values = c("green", "red"))
print(billing_plot)

# Bar chart for gender distribution by churn
gender_plot <- ggplot(data, aes(x = Gender, fill = Status)) +
  geom_bar(position = "dodge", alpha = 0.8) +
  theme_minimal() +
  labs(title = "Gender Distribution by Churn Status", x = "Gender", y = "Count") +
  scale_fill_manual(values = c("green", "red"))
print(gender_plot)

Machine Learning

Machine Learning

# setting up our data
set.seed(123)
data_split <- initial_split(data, prop = 0.7, strata = Status)
ret_train <- training(data_split)
ret_test <- testing(data_split)
model_recipe <- recipe(Status ~ ., data = ret_train) %>%
  step_dummy(all_nominal_predictors(), -all_outcomes()) %>% # Encode categorical variables
  step_normalize(all_numeric_predictors())                 # Scale numeric variables
kfold <- vfold_cv(ret_train, v = 5, strata = Status)

# Logistic Regression -------------------------------------------------------------
log_mod <- logistic_reg(mode = "classification") %>%
  set_engine("glm")
log_workflow <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(log_mod)
log_results <- fit_resamples(log_workflow, kfold, metrics = metric_set(roc_auc, accuracy))
log_metrics <- collect_metrics(log_results)
print(log_metrics)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.798     5 0.00640 Preprocessor1_Model1
## 2 roc_auc  binary     0.843     5 0.00178 Preprocessor1_Model1
# Decision Tree  ------------------------------------------------------------------
dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")
dt_hyper_grid <- grid_regular(
  cost_complexity(range = c(-10, -1)),
  tree_depth(range = c(5, 15)),        
  min_n(range = c(10, 50)),            
  levels = 5                          
)
set.seed(123)
dt_results <- tune_grid(
  dt_mod,
  model_recipe,
  resamples = kfold,
  grid = dt_hyper_grid,
  metrics = metric_set(roc_auc)        
)
# model results
top_results <- show_best(dt_results, metric = "roc_auc", n = 5)
print(top_results)
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001          7    40 roc_auc binary     0.818     5 0.00392
## 2    0.0000000178          7    40 roc_auc binary     0.818     5 0.00392
## 3    0.00000316            7    40 roc_auc binary     0.818     5 0.00392
## 4    0.0000000001         10    30 roc_auc binary     0.817     5 0.00436
## 5    0.0000000178         10    30 roc_auc binary     0.817     5 0.00436
## # ℹ 1 more variable: .config <chr>
all_results <- collect_metrics(dt_results)
print(all_results)
## # A tibble: 125 × 9
##    cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##              <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
##  1    0.0000000001          5    10 roc_auc binary     0.794     5 0.00369
##  2    0.0000000178          5    10 roc_auc binary     0.794     5 0.00369
##  3    0.00000316            5    10 roc_auc binary     0.794     5 0.00369
##  4    0.000562              5    10 roc_auc binary     0.794     5 0.00369
##  5    0.1                   5    10 roc_auc binary     0.587     5 0.0533 
##  6    0.0000000001          7    10 roc_auc binary     0.807     5 0.00487
##  7    0.0000000178          7    10 roc_auc binary     0.807     5 0.00487
##  8    0.00000316            7    10 roc_auc binary     0.807     5 0.00487
##  9    0.000562              7    10 roc_auc binary     0.808     5 0.00575
## 10    0.1                   7    10 roc_auc binary     0.587     5 0.0533 
## # ℹ 115 more rows
## # ℹ 1 more variable: .config <chr>
# Random Forest Model -------------------------------------------------------------
rf_mod <- rand_forest(
  mode = "classification",
  trees = tune(),      
  mtry = tune(),      
  min_n = tune()      
) %>%
  set_engine("ranger", importance = "impurity")  
rf_hyper_grid <- grid_regular(
  trees(range = c(20, 300)),  
  mtry(range = c(2, 19)),      
  min_n(range = c(1, 20)),    
  levels = 5                  
)
set.seed(123)
rf_results <- tune_grid(
  rf_mod,                        
  model_recipe,                  
  resamples = kfold,              
  grid = rf_hyper_grid,          
  metrics = metric_set(roc_auc)  
)
# model results
top_rf_results <- show_best(rf_results, metric = "roc_auc", n = 5)
print(top_rf_results)
## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n  std_err .config             
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>    <dbl> <chr>               
## 1     6   300    20 roc_auc binary     0.843     5 0.00109  Preprocessor1_Model…
## 2     6   230    20 roc_auc binary     0.843     5 0.000589 Preprocessor1_Model…
## 3     6   300    15 roc_auc binary     0.841     5 0.000918 Preprocessor1_Model…
## 4     6   160    20 roc_auc binary     0.841     5 0.00132  Preprocessor1_Model…
## 5     2   300    15 roc_auc binary     0.841     5 0.00122  Preprocessor1_Model…
rf_all_results <- collect_metrics(rf_results)
print(rf_all_results)
## # A tibble: 125 × 9
##     mtry trees min_n .metric .estimator  mean     n  std_err .config            
##    <int> <int> <int> <chr>   <chr>      <dbl> <int>    <dbl> <chr>              
##  1     2    20     1 roc_auc binary     0.833     5 0.00178  Preprocessor1_Mode…
##  2     2    90     1 roc_auc binary     0.840     5 0.00149  Preprocessor1_Mode…
##  3     2   160     1 roc_auc binary     0.839     5 0.00100  Preprocessor1_Mode…
##  4     2   230     1 roc_auc binary     0.840     5 0.000922 Preprocessor1_Mode…
##  5     2   300     1 roc_auc binary     0.840     5 0.000785 Preprocessor1_Mode…
##  6     6    20     1 roc_auc binary     0.819     5 0.00167  Preprocessor1_Mode…
##  7     6    90     1 roc_auc binary     0.831     5 0.00186  Preprocessor1_Mode…
##  8     6   160     1 roc_auc binary     0.836     5 0.00102  Preprocessor1_Mode…
##  9     6   230     1 roc_auc binary     0.835     5 0.000910 Preprocessor1_Mode…
## 10     6   300     1 roc_auc binary     0.836     5 0.000822 Preprocessor1_Mode…
## # ℹ 115 more rows
# best hyperparameters
rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")
final_rf_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(rf_mod) %>%
  finalize_workflow(rf_best_hyperparameters)
rf_final_fit <- final_rf_wf %>%
  fit(data = ret_train)
rf_final_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 20)  

# The optimal model ---------------------------------------------------------------
log_roc_auc <- log_metrics %>%
  filter(.metric == "roc_auc") %>%
  summarise(roc_auc = mean(mean)) %>%
  pull(roc_auc)
dt_roc_auc <- top_results %>%
  filter(.metric == "roc_auc") %>%
  pull(mean) %>%
  .[1]  
rf_roc_auc <- top_rf_results %>%
  filter(.metric == "roc_auc") %>%
  pull(mean) %>%
  .[1]  
roc_auc_comparison <- tibble(
  Model = c("Logistic Regression", "Decision Tree", "Random Forest"),
  ROC_AUC = c(log_roc_auc, dt_roc_auc, rf_roc_auc)
)
print(roc_auc_comparison)
## # A tibble: 3 × 2
##   Model               ROC_AUC
##   <chr>                 <dbl>
## 1 Logistic Regression   0.843
## 2 Decision Tree         0.818
## 3 Random Forest         0.843
best_model <- roc_auc_comparison %>%
  filter(ROC_AUC == max(ROC_AUC)) %>%
  pull(Model)

cat("The best model is:", best_model)
## The best model is: Random Forest
# the confusion matrix
data <- data %>%
  mutate(Status = factor(Status, levels = c("Current", "Left")))
set.seed(123)
data_split <- initial_split(data, prop = 0.7, strata = Status)
ret_train <- training(data_split)
ret_test <- testing(data_split)
model_recipe <- recipe(Status ~ ., data = ret_train) %>%
  step_dummy(all_nominal_predictors(), -all_outcomes()) %>%
  step_normalize(all_numeric_predictors())
log_mod <- logistic_reg(mode = "classification") %>%
  set_engine("glm")
log_workflow <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(log_mod)
log_fit <- fit(log_workflow, data = ret_train)
log_predictions <- log_fit %>%
  predict(ret_test) %>%
  bind_cols(ret_test %>% select(Status))
log_conf_matrix <- log_predictions %>%
  conf_mat(truth = Status, estimate = .pred_class)
print(log_conf_matrix)
##           Truth
## Prediction Current Left
##    Current    1390  248
##    Left        153  309

Conclusion

Conclusion

In our analysis, we wanted to understand the factors contributing to customer churn at Regork and provide actionable insights for sustainable growth. By examining customer tenure, payment behavior, contract types, billing methods, and demographic patterns, we identified key drivers of churn and opportunities for improvement.

Tenure: Customers with shorter tenures were significantly more likely to churn, highlighting the importance of early engagement strategies.

Payment and Contract Types: Flexible contract options and improved payment experiences could foster retention, as customers with monthly contracts and paperless billing were more likely to leave.

Billing Methods: Addressing challenges associated with paperless billing by improving communication or offering incentives for traditional billing methods could reduce churn.

Machine Learning Models: Our analysis showed that the Random Forest model performed the best with the highest ROC AUC score, indicating its robustness in predicting churn. This model allowed us to identify critical features influencing churn and provided a reliable framework for targeting at-risk customers.

While gender did not prove to be a significant factor, focusing on the factors with the largest impact enables Regork to implement tailored retention strategies. These could include personalized offers, targeted communication, and enhanced onboarding experiences to improve customer satisfaction and loyalty.

Moving forward, Regork should consider continuous monitoring of churn-related metrics, implement pilot retention programs, and assess their impact using predictive models. This data-driven approach will help ensure long-term growth and customer loyalty.

Final Findings Following the completion of our analysis, we determined Random Forest to be the optimal model for predicting customer churn. It had the highest roc_auc score at 0.844 compared at 0.843 and 0.818. We also determined the most influential hyperparameters to be tenure, total charges, monthly charges, internet service, payment method, and finally contract length. These contributors helped us create our models and produce 3 key recommendations.

First, we recommend implementing tailored discounts for long-term contracts. We easily identified tenure as a leading factor along with contracts so focusing on signing long term contracts will help lengthen the tenure of the customer and hopefully reduce their likelihood of churning.

Second, we wanted to improve the paperless payment process as it seems the customers are not happy with this process. Both the electronic check as payment and paperless billing had higher levels of churn. This demonstrates a potential issue that may be occurring within these electronic payments.

Finally, we recommend increasing technical support and adding customer support. These individuals could be incentivized by increasing contracts and create a better environment for in person payments as opposed to electric billing and payments.

These findings helped us determine where the issues were originating and help create ways to reduce customer churn in the future. If Regork fails to take action we predict there will be over a $50,000 loss in monthly revenue. This is due to the lack of spending over long periods of time from the customers who left. While they make large purchases at times, their short tenure results in a large revenue loss. Overall, addressing these key factors and implementing most of the recommended strategies, we believe Regork will improve their customer retention. By focusing on longer contracts, better payment processes, and large customer service team, Regork can build strong relationships with its customers and successfully achieve sustainable growth in the future.