Customer churn is a major challenge for Regork Telecom: every lost customer means lost revenue plus the cost of acquiring a replacement. Keeping existing customers is cheaper than winning new ones, so it is important to understand why people leave.
I focused on customer tenure and monthly charges to find patterns tied to churn. Using data analysis and predictive models, I looked at what influences churn and built tools to spot customers most at risk.
This analysis shows how Regork can reduce churn by focusing on customers with short tenure or high monthly charges. Acting on these insights will help improve retention, keep customers happy, and protect revenue.
The packages required for this analysis are loaded below (install any that are missing first):
# Load libraries
library(tidymodels)
library(tidyverse)
library(kableExtra)
library(vip)
The analysis uses the customer retention dataset:
# Load the dataset
data <- read_csv("customer_retention.csv")
data %>%
  ggplot(aes(x = Status)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Customer Churn Distribution", x = "Status", y = "Count")
What It Shows: The bar chart shows that a majority of customers are retained (“Current”), while a smaller percentage have churned (“Left”). This baseline provides context for comparing churn trends across other variables.
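To put a number on that baseline, a quick tally of the Status column gives the raw churn rate:

# Baseline churn rate from the raw counts
data %>%
  count(Status) %>%
  mutate(proportion = n / sum(n))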
data %>%
  ggplot(aes(x = MonthlyCharges, fill = Status)) +
  geom_density(alpha = 0.6) +
  labs(title = "Monthly Charges by Customer Status", x = "Monthly Charges", y = "Density")
What It Shows: The density plot indicates that customers who churned (“Left”) tend to have higher monthly charges compared to retained customers. This suggests that pricing may play a role in customer dissatisfaction and churn.
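A quick summary backs up what the density plot suggests; the median is used here because it is robust to skewed charges:

# Typical monthly charges for retained vs. churned customers
data %>%
  group_by(Status) %>%
  summarize(median_monthly_charges = median(MonthlyCharges, na.rm = TRUE))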
data %>%
  ggplot(aes(x = Tenure, fill = Status)) +
  geom_histogram(binwidth = 5, position = "dodge", color = "black") +
  labs(title = "Tenure Distribution by Customer Status", x = "Tenure (Months)", y = "Count")
What It Shows: The histogram reveals that churned customers are more likely to have shorter tenures compared to retained customers. This suggests that customers may leave early if their needs or expectations are not met.
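One step is missing before modeling: the chunks below reference train_data and test_data, which are never created in this section. A minimal sketch using rsample (part of tidymodels) fills the gap; the 70/30 ratio and the NA handling are assumptions, and classification in tidymodels requires the outcome to be a factor:

# Convert the outcome to a factor and drop incomplete rows (assumed cleaning step)
data <- data %>%
  mutate(Status = as.factor(Status)) %>%
  drop_na()

# Stratified 70/30 train/test split (the ratio is an assumption)
set.seed(123)
data_split <- initial_split(data, prop = 0.7, strata = Status)
train_data <- training(data_split)
test_data  <- testing(data_split)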
# Logistic Regression Model
logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

logistic_wf <- workflow() %>%
  add_model(logistic_model) %>%
  add_formula(Status ~ .)

# Estimate performance with 5-fold cross-validation, stratified on the outcome
set.seed(123)
logistic_results <- fit_resamples(
  logistic_wf,
  vfold_cv(train_data, v = 5, strata = Status),
  metrics = metric_set(roc_auc, accuracy)
)
logistic_results %>% collect_metrics()
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy binary     0.798     5 0.00463 Preprocessor1_Model1
## 2 roc_auc  binary     0.844     5 0.00421 Preprocessor1_Model1
Explanation: Logistic regression provides a solid baseline for churn prediction, reaching a cross-validated AUC of 0.844. Because the model is linear, its coefficients also give a direct read on how predictors such as tenure and monthly charges relate to churn.
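To see which predictors actually carry weight in the linear model, the fitted coefficients can be pulled out with broom; logistic_fit is a new object introduced here, fit once on the training data:

# Fit the workflow on the training data and list coefficients by significance
logistic_fit <- logistic_wf %>% fit(data = train_data)
logistic_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(p.value)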
# Random Forest Model
rf_model <- rand_forest(mtry = tune(), trees = 500, min_n = tune()) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_model(rf_model) %>%
  add_formula(Status ~ .)

# Randomly sample 10 hyperparameter combinations to try
rf_grid <- grid_random(mtry(range = c(2, 10)), min_n(range = c(5, 20)), size = 10)

set.seed(123)
rf_tuned <- tune_grid(
  rf_wf,
  resamples = vfold_cv(train_data, v = 5, strata = Status),
  grid = rf_grid,
  metrics = metric_set(roc_auc, accuracy)
)

# Lock in the best hyperparameters and refit on the full training set
rf_best_params <- select_best(rf_tuned, metric = "roc_auc")
rf_final_wf <- rf_wf %>%
  finalize_workflow(rf_best_params)
rf_fit <- rf_final_wf %>%
  fit(data = train_data)

# Permutation-based variable importance from the fitted ranger model
vip::vip(extract_fit_parsnip(rf_fit)$fit)
What It Shows: After tuning, the Random Forest reaches a cross-validated AUC of 0.843, essentially tied with logistic regression. The permutation-importance plot highlights tenure, contract type, and monthly charges as the most influential predictors of churn, giving retention efforts concrete levers to pull.
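For reference, the hyperparameters the search settled on can be inspected directly from the tuning objects created above:

# Top tuning results and the selected mtry/min_n combination
rf_tuned %>% show_best(metric = "roc_auc", n = 3)
rf_best_params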
# Decision Tree Model
tree_model <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_wf <- workflow() %>%
  add_model(tree_model) %>%
  add_formula(Status ~ .)

set.seed(123)
tree_results <- fit_resamples(
  tree_wf,
  vfold_cv(train_data, v = 5, strata = Status),
  metrics = metric_set(roc_auc, accuracy)
)
tree_results %>% collect_metrics()
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy binary     0.784     5 0.00344 Preprocessor1_Model1
## 2 roc_auc  binary     0.803     5 0.00330 Preprocessor1_Model1
Explanation: The decision tree, while easy to interpret, achieves a lower AUC of 0.803. Its performance suggests it’s less effective for complex relationships compared to Random Forest and Logistic Regression.
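Because the tree is so interpretable, it can be worth drawing it outright. A minimal sketch, assuming the rpart.plot package is installed; tree_fit is a new object introduced here, fit once on the training data:

# Fit the tree on the full training set and plot its splits
tree_fit <- tree_wf %>% fit(data = train_data)
rpart.plot::rpart.plot(extract_fit_engine(tree_fit))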
# Extract cross-validated AUC for Logistic Regression and Decision Tree
logistic_auc <- logistic_results %>% collect_metrics() %>% filter(.metric == "roc_auc")
tree_auc <- tree_results %>% collect_metrics() %>% filter(.metric == "roc_auc")

# Extract AUC for the best Random Forest model
rf_best_auc <- rf_tuned %>%
  show_best(metric = "roc_auc", n = 1) %>%
  mutate(model = "Random Forest") %>%
  select(model, mean, std_err)

# Combine AUC metrics into one table
auc_comparison <- bind_rows(
  logistic_auc %>% mutate(model = "Logistic Regression") %>% select(model, mean, std_err),
  rf_best_auc,
  tree_auc %>% mutate(model = "Decision Tree") %>% select(model, mean, std_err)
)

# Display AUC comparison table, best model first
auc_comparison %>%
  arrange(desc(mean)) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
| model               | mean      | std_err   |
|---------------------|-----------|-----------|
| Logistic Regression | 0.8437671 | 0.0042101 |
| Random Forest       | 0.8431790 | 0.0037668 |
| Decision Tree       | 0.8029446 | 0.0032961 |
Explanation: The AUC comparison confirms that Logistic Regression and Random Forest clearly outperform the Decision Tree, while the difference between the top two (0.8438 vs. 0.8432) is well within one standard error. Random Forest remains the practical choice because it pairs competitive accuracy with feature-importance insights.
# Evaluate the best model (Random Forest) on the test set
rf_test_predictions <- predict(rf_fit, test_data, type = "prob") %>%
  bind_cols(test_data)

rf_test_auc <- rf_test_predictions %>%
  roc_auc(truth = Status, .pred_Left)
rf_test_auc
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.156
Explanation: A test AUC of 0.156 does not actually signal poor generalization. An AUC far below 0.5 means the model's ranking is inverted, which almost always indicates the metric is scoring the wrong class as the event rather than a failing model. Since 1 − 0.156 ≈ 0.844, test performance is in line with the cross-validated results once the event level is corrected.
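In yardstick, roc_auc() treats the first factor level of the truth column as the event by default. Assuming Status is ordered as ("Current", "Left"), scoring .pred_Left requires flipping the event level; under that assumption the corrected call should return roughly 1 − 0.156 ≈ 0.844:

# Score the "Left" class as the event instead of the default first level
rf_test_predictions %>%
  roc_auc(truth = Status, .pred_Left, event_level = "second")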
# Generate class predictions for the test dataset
final_predictions <- predict(rf_fit, test_data) %>%
  bind_cols(test_data)

# Generate the confusion matrix
confusion_matrix <- final_predictions %>%
  conf_mat(truth = Status, estimate = .pred_class)

# Convert the confusion matrix to a tibble for display
confusion_matrix_table <- as_tibble(confusion_matrix$table)

# Display the confusion matrix as a table
confusion_matrix_table %>%
  kable() %>%
  kable_styling(full_width = FALSE)
| Prediction | Truth   | n    |
|------------|---------|------|
| Current    | Current | 1425 |
| Left       | Current | 118  |
| Current    | Left    | 286  |
| Left       | Left    | 271  |
Explanation: The confusion matrix shows the Random Forest predicts churn with decent overall accuracy, but it misses 286 actual churners (predicted Current, truly Left: false negatives) and flags 118 loyal customers as churn risks (predicted Left, truly Current: false positives). Reducing the false negatives should be the priority, since each one is a departing customer the model failed to catch.
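yardstick can derive the standard rates from the same object; note that, as with roc_auc(), these default to treating the first factor level as the event:

# Accuracy, sensitivity, specificity, etc. from the confusion matrix
summary(confusion_matrix)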
Two of the biggest factors driving customer churn are tenure and monthly charges. Customers with shorter tenure and higher bills are more likely to leave. For example, newer customers in their first 6-12 months and those paying premium rates should be the focus for retention efforts.
If we don’t act, we’re looking at losing a significant chunk of monthly revenue from high-risk customers. Even saving a small percentage of these customers could make a noticeable difference to the bottom line over time.
Offering a discount on monthly charges or introducing a loyalty program could work wonders. For example, giving a 10-15% discount to customers in their first year or bundling additional services for free could help keep them around longer. It’s a small upfront cost that could pay off big in the long run.
To tackle churn, we should focus on easing pricing concerns for newer customers. A targeted discount or loyalty reward program could encourage them to stay longer. Running a small pilot program first to measure how effective these incentives are would be a smart way to start before rolling it out broadly.
This analysis mainly looks at tenure and charges but doesn’t include things like customer satisfaction or service usage, which could give even better insights. Adding that data in the future could make our predictions stronger. We could also experiment with more advanced models to see if they do a better job of identifying at-risk customers.