As the economic climate and customer demands change over time, understanding consumer behavior becomes one of the most important factors in maximizing retention and improving customer relationships. Using the dataset provided, we analyzed the factors that drive customers to stay with Regork and offer recommendations to keep unsatisfied customers from leaving.
Because telecommunications is a highly competitive sector, it is crucial that the decisions Regork makes are data-backed and geared toward maximizing customer retention. This report provides a detailed analysis and a predictive model of whether customers are likely to leave Regork in the future, giving Regork the most comprehensive information possible to make informed decisions about how to keep its customers.
Data Preparation
library(tidyverse)   # data wrangling and plotting
library(tidymodels)  # recipes, workflows, resampling, tuning
library(baguette)    # bag_tree() used for the bagged model
library(vip)         # variable importance plots
# Read the data, encode the response as a factor, and drop rows with missing values
df <- read.csv("C:/Users/Alek/Documents/BANA 4080/data/customer_retention.csv")
df <- df %>%
  mutate(Status = factor(Status))
df <- na.omit(df)
Our first analysis looked at which household types make up Regork's customer base. The graph below shows average customer tenure by partner and dependent status. Customers with a partner tend to stay longer, while having dependents appears to matter much less. Finding ways to market more effectively to single customers would therefore be one strategy for improving retention duration.
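The plot relies on a summary data frame, tenure_by_partner_dependents, holding the average tenure for each partner/dependent combination. That step was not shown above; a minimal sketch of how it could be built (assuming Partner, Dependents, and Tenure columns in the dataset) is:
# Hypothetical reconstruction: average tenure per partner/dependent combination
tenure_by_partner_dependents <- df %>%
  mutate(Partner_Dependents = paste("Partner:", Partner, "| Dependents:", Dependents)) %>%
  group_by(Partner_Dependents) %>%
  summarize(AvgTenure = mean(Tenure), .groups = "drop")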
Graph code:
ggplot(tenure_by_partner_dependents, aes(x = Partner_Dependents, y = AvgTenure, fill = Partner_Dependents)) +
  geom_bar(stat = "identity", alpha = 0.8) +
  labs(
    title = "Average Tenure by Partner & Dependents Status",
    x = "Partner & Dependents Status",
    y = "Average Tenure (Months)",
    fill = "Partner & Dependents Status"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x = element_blank())
The next analysis examined how long customers typically stay with Regork before leaving. Among customers who left, the mean tenure was about 18 months (median of 10 months). Retention efforts would therefore be most valuable early in a customer's tenure, since customers who make it past that point tend to stay.
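The histogram below is built from df_left, the subset of customers whose Status is "Left". That subset and the summary statistics printed after the plot were not shown; a minimal sketch of how they could be produced is:
# Subset of customers who have left, and summary statistics of their tenure
df_left <- df %>% filter(Status == "Left")
df_left %>%
  summarize(Mean_Tenure = mean(Tenure), Median_Tenure = median(Tenure),
            Min_Tenure = min(Tenure), Max_Tenure = max(Tenure))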
Graph code:
ggplot(df_left, aes(x = Tenure)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(
    title = "Distribution of Tenure for Customers Who Left",
    x = "Tenure (Months)",
    y = "Frequency"
  ) +
  theme_minimal()
## Mean_Tenure Median_Tenure Min_Tenure Max_Tenure
## 1 18.01293 10 1 72
The last factor we examined was which contract types customers were most likely to leave. Our analysis revealed that month-to-month customers churn the most, so finding ways to retain monthly-contract customers, or move them onto longer contracts, would be ideal.
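The bar chart below uses left_contract_counts, a count of departed customers per contract type. That step was not shown above; a minimal sketch of how it could be created is:
# Hypothetical reconstruction: number of departed customers by contract type
left_contract_counts <- df %>%
  filter(Status == "Left") %>%
  count(Contract)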
ggplot(left_contract_counts, aes(x = Contract, y = n, fill = Contract)) +
  geom_bar(stat = "identity", color = "black") +
  labs(
    title = "Number of Customers Who Left by Contract Type",
    x = "Contract Type",
    y = "Number of Customers",
    fill = "Contract Type"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Create recipe for preprocessing: normalize numeric predictors and
# dummy-encode nominal predictors
customer_retention <- recipe(Status ~ ., data = df) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
# Estimate (prep) the recipe on the data
customer_retention <- prep(customer_retention, training = df)
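# Note: prep() only estimates the preprocessing steps; bake() applies them.
# For example, the processed training data could be inspected with:
# bake(customer_retention, new_data = NULL)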
# Display proportion of Status categories
table(df$Status) %>% prop.table()
##
## Current Left
## 0.7344018 0.2655982
For our first machine learning model, we decided to perform a logistic regression:
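The code that produced these resampling metrics was not shown. A minimal sketch of a comparable tidymodels workflow, assuming the same 5-fold cross-validation and the preprocessing steps defined above (names such as logistic_wf are illustrative), is:
# Logistic regression evaluated with 5-fold cross-validation
logistic_recipe <- recipe(Status ~ ., data = df) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
logistic_mod <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
logistic_wf <- workflow() %>%
  add_recipe(logistic_recipe) %>%
  add_model(logistic_mod)
set.seed(123)
logistic_folds <- vfold_cv(df, v = 5, strata = Status)
logistic_results <- fit_resamples(logistic_wf, resamples = logistic_folds)
collect_metrics(logistic_results)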
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.799 5 0.00357 Preprocessor1_Model1
## 2 brier_class binary 0.136 5 0.00186 Preprocessor1_Model1
## 3 roc_auc binary 0.845 5 0.00514 Preprocessor1_Model1
Next, we performed a MARS (multivariate adaptive regression splines) analysis, tuning the number of model terms and the interaction degree with 5-fold cross-validation.
# 70/30 train/test split, stratified on Status
set.seed(123)
multivariate_split <- initial_split(df, prop = 0.7, strata = Status)
multivariate_train <- training(multivariate_split)
multivariate_test <- testing(multivariate_split)

multivariate_recipe <- recipe(Status ~ ., data = multivariate_train)

# 5-fold cross-validation on the training set
set.seed(123)
multivariate_kfolds <- vfold_cv(multivariate_train, v = 5, strata = "Status")

# MARS model with tunable number of terms and interaction degree
multivariate_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification")
multivariate_grid <- grid_regular(num_terms(range = c(1, 20)), prod_degree(), levels = 50)

multivariate_wf <- workflow() %>%
  add_recipe(multivariate_recipe) %>%
  add_model(multivariate_mod)

# Tune across the grid and rank candidates by cross-validated AUC
multivariate_results <- multivariate_wf %>%
  tune_grid(resamples = multivariate_kfolds, grid = multivariate_grid)
metrics <- multivariate_results %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))
print(metrics)
## # A tibble: 40 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 20 1 roc_auc binary 0.850 5 0.00486 Preprocessor1_M…
## 2 19 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 3 18 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 4 17 1 roc_auc binary 0.849 5 0.00503 Preprocessor1_M…
## 5 16 1 roc_auc binary 0.849 5 0.00490 Preprocessor1_M…
## 6 15 1 roc_auc binary 0.847 5 0.00486 Preprocessor1_M…
## 7 14 1 roc_auc binary 0.847 5 0.00455 Preprocessor1_M…
## 8 13 1 roc_auc binary 0.845 5 0.00493 Preprocessor1_M…
## 9 12 1 roc_auc binary 0.844 5 0.00465 Preprocessor1_M…
## 10 11 1 roc_auc binary 0.843 5 0.00493 Preprocessor1_M…
## # ℹ 30 more rows
autoplot(multivariate_results)
# Finalize the MARS workflow with the best hyperparameters, refit on the
# training data, and plot variable importance
multivariate_best <- select_best(multivariate_results, metric = "roc_auc")
multivariate_final_wf <- workflow() %>%
  add_model(multivariate_mod) %>%
  add_formula(Status ~ .) %>%
  finalize_workflow(multivariate_best)
multivariate_fit <- multivariate_final_wf %>%
  fit(data = multivariate_train)
multivariate_fit %>%
  extract_fit_parsnip() %>%
  vip()
Our last machine learning model was a bagged decision tree:
num_folds <- 5
set.seed(123)
kfolds <- vfold_cv(df, v = num_folds)

bagged_recipe <- recipe(Status ~ ., data = df) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# Bagged decision tree (rpart engine) with 5 bootstrap ensemble members
bagged_wf <- workflow() %>%
  add_recipe(bagged_recipe) %>%
  add_model(bag_tree() %>% set_engine("rpart", times = 5) %>% set_mode("classification"))

bagged_results <- fit_resamples(
  bagged_wf,
  resamples = kfolds
)
collect_metrics(bagged_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.759 5 0.00249 Preprocessor1_Model1
## 2 brier_class binary 0.177 5 0.00215 Preprocessor1_Model1
## 3 roc_auc binary 0.757 5 0.00615 Preprocessor1_Model1
### Tuning
num_folds <- 5
set.seed(123)
kfolds <- vfold_cv(df, v = num_folds)

bagged_recipe <- recipe(Status ~ ., data = df) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# Tune the number of bagged trees (bootstrap ensemble members)
bag_mod <- bag_tree() %>%
  set_engine("rpart", times = tune()) %>%
  set_mode("classification")
bag_hyper_grid <- expand.grid(times = c(5, 25, 50))

set.seed(123)
bag_results <- tune_grid(bag_mod, bagged_recipe, resamples = kfolds, grid = bag_hyper_grid)
show_best(bag_results, metric = "roc_auc")
## # A tibble: 3 × 7
## times .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 50 roc_auc binary 0.810 5 0.00488 Preprocessor1_Model3
## 2 25 roc_auc binary 0.802 5 0.00385 Preprocessor1_Model2
## 3 5 roc_auc binary 0.752 5 0.00344 Preprocessor1_Model1
bagged_ff_best_model <- select_best(bag_results, metric = "roc_auc")
bagged_ff_final_wf <- workflow() %>%
  add_recipe(bagged_recipe) %>%
  add_model(bag_mod) %>%
  finalize_workflow(bagged_ff_best_model)
bagged_ff_final_fit <- bagged_ff_final_wf %>%
  fit(data = df)

# Confusion matrix on the full data the model was fit on; this is an
# optimistic, in-sample view of performance rather than a test-set estimate
bagged_ff_final_fit %>%
  predict(new_data = df) %>%
  bind_cols(df %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 5125 13
## Left 7 1843
bagged_wf <- workflow() %>%
  add_recipe(bagged_recipe) %>%
  add_model(bag_mod) %>%
  finalize_workflow(bagged_ff_best_model)
bagged_final_fit <- bagged_wf %>%
  fit(data = df)

# ROC curve for the finalized bagged model
bagged_final_fit %>%
  predict(df, type = "prob") %>%
  bind_cols(df) %>%
  roc_curve(truth = Status, .pred_Current) %>%
  autoplot()
In today’s competitive telecommunications landscape, understanding customer behavior is crucial for maximizing retention and fostering strong relationships. Leveraging the provided dataset, this analysis uncovered patterns and trends to inform strategic decisions for Regork Telecom’s customer retention efforts. We explored the dataset, conducted machine learning modeling, and propose actionable insights based on our findings.
Data Preparation & Exploratory Data Analysis: Prior to diving into machine learning, it is imperative to explore the dataset and identify underlying trends. We examined individual predictor variables and assessed their relationships with the response variable “Status” (indicating churn). We also addressed data cleaning requirements, such as handling missing values and standardizing categorical variable levels.
Initial exploration reveals intriguing insights: Certain services may have higher tenure rates than others, suggesting varying customer preferences. Demographic factors might influence service usage patterns, warranting further investigation. The baseline churn rate provides context for model evaluation and business decision-making. Data cleaning steps involve handling missing values and ensuring consistency in categorical variable levels to maintain data integrity.
Machine Learning: Our machine learning process comprised:
Data splitting into training and test sets.
Feature engineering to enhance model performance.
Cross-validation and hyperparameter tuning for robustness.
Assessment of multiple algorithms, prioritizing AUC as the primary evaluation metric.
Selection and finalization of the optimal model based on performance metrics.
Interpretation of feature importance to identify influential predictors.
After rigorous evaluation, the optimal model was chosen based on its AUC performance and generalization ability. Feature importance analysis highlights key predictors such as tenure, contract type, and service charges, indicating their significant impact on customer behavior. These insights empower the Regork Telecom CEO to tailor retention strategies effectively, focusing on incentivizing longer-term contracts and offering competitive pricing.
Assessing the generalization error on the test set provides insights into model performance on new data, aiding in realistic expectations and decision-making. From this section, we learn not only about predictive accuracy but also about the underlying drivers of customer churn.
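The test-set evaluation itself was not shown. One way it could be done, assuming the MARS split and finalized workflow created earlier, is with last_fit(), which fits on the training portion and evaluates on the held-out test portion; this is a sketch, not the report's original code:
# Fit the finalized MARS workflow on the training set and score it on the test set
multivariate_last_fit <- last_fit(multivariate_final_wf, split = multivariate_split)
collect_metrics(multivariate_last_fit)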
Business Analysis & Conclusion: Using the optimal model, we address critical questions:
Assessing predictor importance guides focus areas for retention efforts, emphasizing incentives for longer contracts and competitive pricing.
Identifying predicted churners in the test dataset enables proactive retention strategies.
Estimating the predicted revenue loss from those churners highlights the urgency of action.
Proposing an incentive scheme involves a cost-benefit analysis, demonstrating the potential ROI of retention efforts.
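As an illustration of the second point, predicted churners could be pulled from the held-out test set using the finalized MARS fit; this is a hedged sketch (the name predicted_churners is illustrative), not the report's original code:
# Identify test-set customers the finalized MARS model predicts will leave
predicted_churners <- multivariate_fit %>%
  predict(new_data = multivariate_test) %>%
  bind_cols(multivariate_test) %>%
  filter(.pred_class == "Left")
nrow(predicted_churners)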
Summary of Findings:
Top Predictor Variables:
Tenure: Customers with longer tenures exhibit significantly lower churn rates. Retention efforts should therefore focus on newer customers so that more of them reach the longer tenures at which churn drops off.
Contract: Longer-term contracts correlate with reduced churn rates. Encouraging customers to opt for longer contract durations could mitigate churn.
Total Charges and Monthly Charges: Pricing is crucial for customer retention. Maintaining competitive prices relative to competitors is essential to prevent churn driven by pricing concerns.
Online Security: Robust online security measures are vital for retaining customers. Regork should prioritize enhancing data protection protocols to address privacy concerns and prevent churn.
Proposed Solution: Incentive Scheme:
Discounts: Offer discounts on monthly bills for customers with longer tenures or those willing to commit to longer contracts.
Contract Incentives: Provide incentives such as bonus data or service upgrades for customers opting for longer-term contracts.
Enhanced Security: Assure customers of robust online security measures and offer additional security features as incentives for retention.
Cost-Benefit Analysis:
Cost: Calculate the cost of implementing the incentive scheme, including discounts and additional services.
Benefit: Estimate the potential revenue gained from retained customers and reduced churn.
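To make the cost-benefit comparison concrete, here is a minimal, purely illustrative sketch. The predicted_churners data frame from the earlier sketch, the MonthlyCharges column name, and the 15% discount rate are assumptions for illustration, not figures from the analysis:
# Illustrative cost-benefit comparison (hypothetical discount rate and column name)
discount_rate <- 0.15  # assumed 15% monthly-bill discount offered to at-risk customers
cost_benefit <- predicted_churners %>%
  summarize(
    revenue_at_risk   = sum(MonthlyCharges) * 12,                  # annual revenue if all predicted churners leave
    incentive_cost    = sum(MonthlyCharges) * discount_rate * 12,  # annual cost of the discount if they all stay
    potential_benefit = revenue_at_risk - incentive_cost
  )
cost_benefit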
This comprehensive analysis provides actionable insights for Regork Telecom to develop effective retention strategies. In conclusion, by leveraging data-driven insights from machine learning, prioritizing the key predictors identified above, and implementing targeted incentives, Regork can mitigate churn, enhance customer satisfaction, and drive long-term business success in the telecommunications industry.