Introduction

The problem I am going to address is analyzing customer retention data in order to make informed decisions about how to retain customers for the Regork Telecommunications Service. The purpose of this analysis is to save Regork money, as the cost of attracting new customers greatly outweighs the cost of retaining current ones.

I have addressed this problem by examining several aspects of Regork's customer retention data. This includes splitting the data, performing feature engineering, using cross-validation, tuning hyperparameters, selecting and finalizing the optimal model, identifying important features, and assessing my final model.

My analysis will help the Regork Telecom CEO make better-informed decisions about the course of action needed to retain as many customers as possible.

Packages/Libraries Required

We need to load the following packages and import the following data in order to prepare our report.

library(tidymodels)  # modeling framework: rsample, recipes, parsnip, tune, yardstick
library(tidyverse)   # data manipulation and plotting
library(baguette)    # bagged tree and MARS models
library(vip)         # variable importance plots
library(pdp)         # partial dependence plots
library(here)        # project-relative file paths
library(kernlab)     # support vector machine engine
library(ggplot2)     # plotting (also attached by tidyverse)
library(ranger)      # random forest engine
library(earth)       # MARS engine

retention <- read_csv('customer_retention.csv')

Exploratory Data Analysis

retention %>%
  count(Contract, Status) %>%
  group_by(Contract) %>%
  mutate(perc = n / sum(n)) %>%
  ggplot(aes(x = Contract, y = perc, fill = Status)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Churn Rate by Contract Type",
       x = "Contract Type",
       y = "Proportion",
       fill = "Customer Status") +
  theme_minimal()

This chart shows that the customers most committed to their contracts are the least likely to leave: those on two-year contracts churn less than 5% of the time.
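To pin down the exact percentages behind this chart, the proportions can be pulled directly into a table (the same pattern works for the charts that follow):

# exact churn percentages by contract type, behind the proportions plotted above
retention %>%
  count(Contract, Status) %>%
  group_by(Contract) %>%
  mutate(perc = scales::percent(n / sum(n), accuracy = 0.1)) %>%
  filter(Status == "Left")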

retention <- read_csv("customer_retention.csv") %>%
  mutate(
    Status = factor(Status, levels = c("Current", "Left")),
    SeniorCitizen = case_when(
      SeniorCitizen == 1 ~ "Senior Citizen",
      SeniorCitizen == 0 ~ "Not Senior Citizen"
    )
  )

retention %>%
  count(SeniorCitizen, Status) %>%
  group_by(SeniorCitizen) %>%
  mutate(perc = n / sum(n)) %>%
  ggplot(aes(x = SeniorCitizen, y = perc, fill = Status)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Churn Rate by Senior Citizen or Not",
       x = "Customer Type",
       y = "Proportion",
       fill = "Customer Status") +
  theme_minimal()

This chart shows that senior citizens are considerably more likely to churn, leaving at an alarming rate of approximately 40%.

retention %>%
  count(PaymentMethod, Status) %>%
  group_by(PaymentMethod) %>%
  mutate(perc = n / sum(n)) %>%
  ggplot(aes(x = PaymentMethod, y = perc, fill = Status)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Churn Rate by Payment Method",
       x = "Payment Method",
       y = "Proportion",
       fill = "Customer Status") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This chart shows that customers who pay by electronic check are the most likely to leave. Comparing automatic and non-automatic payment methods, the automatic methods have a slightly lower churn rate.

ggplot(retention, aes(x = Status, y = MonthlyCharges, fill = Status)) +
  geom_violin() +
  labs(title = "Distribution of Monthly Charges by Churn Status",
       x = "Customer Status",
       y = "Monthly Charges") +
  theme_minimal() +
  scale_fill_manual(values = c("Current" = "blue", "Left" = "red")) 

This plot shows that the customers who leave Regork are generally paying higher monthly amounts, while customers with low monthly charges leave far less often.
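To quantify the gap visible in the violin plot, the medians can be compared directly:

# median monthly charge for customers who stayed vs. those who left
retention %>%
  group_by(Status) %>%
  summarize(median_monthly = median(MonthlyCharges, na.rm = TRUE))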

Machine Learning

Logistic Regression

set.seed(123)
split <- initial_split(retention, prop = 0.7, strata = "Status")
retention_train <- training(split)
retention_test  <- testing(split)

set.seed(123)
kfold <- vfold_cv(retention_train, v = 5)

results <- logistic_reg() %>%
  fit_resamples(Status ~ ., kfold)

collect_metrics(results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.796     5 0.00994 Preprocessor1_Model1
## 2 brier_class binary     0.137     5 0.00412 Preprocessor1_Model1
## 3 roc_auc     binary     0.844     5 0.00769 Preprocessor1_Model1

The logistic regression model gives us solid performance, with a cross-validated accuracy of about 0.80 and an ROC AUC of about 0.84, making it a good baseline option.
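As a quick follow-up sketch (separate from the resampling above), the model can be fit once on the training data to inspect which predictors carry the largest coefficients; this relies on the default glm engine dummy-coding the character predictors through the formula:

# one-off fit on the training data to inspect coefficient magnitudes
lr_fit <- logistic_reg() %>%
  fit(Status ~ ., data = retention_train)

tidy(lr_fit) %>%
  arrange(desc(abs(estimate)))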

Decision Tree

dt_mod <- decision_tree(mode = 'classification') %>%
  set_engine("rpart")

model_recipe <- recipe(Status ~ ., data = retention_train)

dt_fit <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(dt_mod) %>%
  fit(data = retention_train)
dt_fit %>%
  extract_fit_engine() %>%
  rpart.plot::rpart.plot(roundint = FALSE)

This tree predicts whether a customer will stay with or leave Regork Telecom based on the predictor variables it finds most informative.

dt_results <- fit_resamples(dt_mod, model_recipe, kfold)

collect_metrics(dt_results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.789     5 0.00537 Preprocessor1_Model1
## 2 brier_class binary     0.147     5 0.00318 Preprocessor1_Model1
## 3 roc_auc     binary     0.801     5 0.00651 Preprocessor1_Model1

Based on these values, the ROC AUC is not as good as that of the logistic regression model, but this model still performs reasonably well.
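To make the comparison concrete, the resampled ROC AUC values from the two models can be placed side by side:

# side-by-side comparison of cross-validated ROC AUC for the two models so far
bind_rows(
  collect_metrics(results) %>% mutate(model = "logistic regression"),
  collect_metrics(dt_results) %>% mutate(model = "decision tree")
) %>%
  filter(.metric == "roc_auc") %>%
  select(model, mean, std_err)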

# No hyperparameters were tuned for this tree, so the workflow fitted above is already final
dt_final_fit <- dt_fit

dt_final_fit %>%
  extract_fit_parsnip() %>%
  vip(5)

The decision tree model shows us that Contract, Tenure, TotalCharges, OnlineSecurity, and DeviceProtection are the most important predictor variables.
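To see how one of these predictors actually moves the prediction, here is a minimal manual partial-dependence sketch (assuming the tenure column is named Tenure, as in the importance plot; the pdp package loaded earlier offers a fuller version of the same idea):

# manual partial dependence: average predicted probability of leaving across tenure values
tenure_grid <- seq(min(retention_train$Tenure, na.rm = TRUE),
                   max(retention_train$Tenure, na.rm = TRUE), length.out = 20)

pd <- map_dfr(tenure_grid, function(t) {
  tmp <- retention_train
  tmp$Tenure <- t
  tibble(Tenure = t,
         avg_pred_left = mean(predict(dt_final_fit, new_data = tmp, type = "prob")$.pred_Left))
})

ggplot(pd, aes(Tenure, avg_pred_left)) +
  geom_line() +
  labs(y = "Average predicted probability of leaving")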

Multivariate Adaptive Regression Splines (MARS) Model

set.seed(123) 
split <- initial_split(retention, prop = 0.7, strata = Status) 
retentionmars_train <- training(split)
retentionmars_test <- testing(split)

retentionmars_recipe <- recipe(Status ~ ., data = retentionmars_train) %>%
  step_zv(all_predictors()) %>%                     
  step_impute_median(all_numeric_predictors()) %>%  
  step_YeoJohnson(all_numeric_predictors()) %>%     
  step_normalize(all_numeric_predictors()) 

set.seed(123)
kfolds <- vfold_cv(retentionmars_train, v = 5, strata = Status)

mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
   set_mode('classification')

mars_grid <- grid_regular(num_terms(c(1, 30)), prod_degree(), levels = 50)

retentionmars_wf <- workflow() %>%
   add_recipe(retentionmars_recipe) %>%
   add_model(mars_mod)

tuning_results <- retentionmars_wf %>%
  tune_grid(resamples = kfolds, grid = mars_grid)

tuning_results %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))
## # A tibble: 60 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        18           1 roc_auc binary     0.848     5 0.00397 Preprocessor1_M…
##  2        19           1 roc_auc binary     0.848     5 0.00397 Preprocessor1_M…
##  3        20           1 roc_auc binary     0.848     5 0.00397 Preprocessor1_M…
##  4        21           1 roc_auc binary     0.848     5 0.00397 Preprocessor1_M…
##  5        22           1 roc_auc binary     0.848     5 0.00397 Preprocessor1_M…
##  6        23           1 roc_auc binary     0.848     5 0.00397 Preprocessor1_M…
##  7        24           1 roc_auc binary     0.848     5 0.00397 Preprocessor1_M…
##  8        25           1 roc_auc binary     0.848     5 0.00397 Preprocessor1_M…
##  9        26           1 roc_auc binary     0.848     5 0.00397 Preprocessor1_M…
## 10        27           1 roc_auc binary     0.848     5 0.00397 Preprocessor1_M…
## # ℹ 50 more rows

The MARS model gives the highest cross-validated ROC AUC of the three so far, at roughly 0.848 versus 0.844 for logistic regression and 0.801 for the decision tree.

autoplot(tuning_results)

best_hyperparameters <- select_best(tuning_results, metric = "roc_auc")

final_wf <- workflow() %>%
   add_recipe(retentionmars_recipe) %>%
   add_model(mars_mod) %>%
   finalize_workflow(best_hyperparameters)

final_fit <- final_wf %>%
  fit(data = retentionmars_train)

final_fit %>%
  extract_fit_parsnip() %>%
  vip(5)

This shows that TotalCharges, MonthlyCharges, PhoneService, Tenure, and OnlineSecurity are the most important predictors in the MARS model.
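Before moving to the confusion matrix, it is worth a quick sketch of the finalized model's test-set ROC AUC using class probabilities (with yardstick's defaults, the first factor level, Current, is the event):

# test-set ROC AUC for the finalized MARS model
final_fit %>%
  predict(retentionmars_test, type = "prob") %>%
  bind_cols(retentionmars_test %>% select(Status)) %>%
  roc_auc(truth = Status, .pred_Current)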

Confusion Matrix

final_predictions <- final_fit %>%
  predict(retentionmars_test) %>%
  bind_cols(retentionmars_test %>% select(Status))  # Status is already a factor

conf_mat(final_predictions, truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1404  258
##    Left        139  299

This confusion matrix compares predictions to actual outcomes: 258 customers who actually left were predicted to stay (missed churners), while 139 customers who stayed were incorrectly predicted to leave.
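The same predictions also yield summary metrics beyond raw counts (again, yardstick treats the first level, Current, as the event by default):

# accuracy, sensitivity, and specificity from the test-set predictions
class_metrics <- metric_set(accuracy, sens, spec)
class_metrics(final_predictions, truth = Status, estimate = .pred_class)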

Business Analysis and Summary

As a business manager, the factors most prominent in retaining customers are clear: TotalCharges, MonthlyCharges, PhoneService, Tenure, and OnlineSecurity.

If no action is taken, then given the rate at which customers are leaving Regork's telecommunications service, the company will lose a large share of its monthly revenue, and replacing those customers will be expensive, since attracting new customers costs far more than keeping current ones.
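As a rough, illustrative estimate (not a precise forecast), the monthly revenue tied to the test-set customers the model expects to leave can be totaled directly:

# back-of-envelope monthly revenue at risk among predicted leavers in the test set
final_fit %>%
  predict(retentionmars_test) %>%
  bind_cols(retentionmars_test) %>%
  filter(.pred_class == "Left") %>%
  summarize(customers_at_risk = n(),
            monthly_revenue_at_risk = sum(MonthlyCharges, na.rm = TRUE))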

I believe our models show that an incentive scheme is necessary to retain as many customers as possible while maintaining high revenues. I would recommend reducing the number of charges customers face, and I would also suggest offering discounts and sales, as it is evident that Regork loses a large number of customers to frustration over high prices.

I would tell the Regork Telecom CEO to reduce the charges placed on customers and to offer discounts alongside them. The data points to these as two of the top reasons why people are leaving Regork Telecom.

While I feel that my model was successful, it was not perfect. I could have used additional machine learning models to produce information that might have been better or more accurate.