BANA 4080: Final Project

Introduction

The purpose of this report is to analyze the customer retention for Regork. Now that the company has entered the telecommunications market, they must focus on retaining existing customers as it is much more costly to attract new customers rather than retain existing ones. Thus, we will be analyzing customer data and modeling predictions of which customers will leave in the future.

As you will see through this analysis, tenure with the company is one of the biggest influencers on customer retention. Because of this, we believe that offering new customers – those with tenure of 12 months or less – a 25% discount on their services would be beneficial to Regork’s customer retention rate. This will be discussed further in later sections.

Load Packages

# Helper packages
library(tidyverse) # for data wrangling & plotting
library(readr) # for importing csv file
library(summarytools)
library(grid)
library(gridExtra)
library(gtable)

# Modeling packages
library(tidymodels)

# Model interpretability packages
library(vip)         # for variable importance
library(pdp)         # for variable relationships

Load Data

customer_retention <- read_csv('data/customer_retention.csv') %>% 
  mutate(Status = as.factor(Status))

Response Variable

Status: Whether the customer is Current or has Left.

Predictor Variables

Gender: Whether the customer is a male or a female
SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
Partner: Whether the customer has a partner or not (Yes, No)
Dependents: Whether the customer has dependents or not (Yes, No)
Tenure: Number of months the customer has stayed with the company
PhoneService: Whether the customer has a phone service or not (Yes, No)
MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
Contract: The contract term of the customer (Month-to-month, One year, Two year)
PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
MonthlyCharges: The amount charged to the customer monthly
TotalCharges: The total amount charged to the customer

Exploratory Data Analysis

Data Preparation

Dummy Encode

cus_rt <- customer_retention %>% 
  na.omit(cus_rt) %>% 
  mutate(Female = if_else(Gender == 'Female', 1, 0),
         Partner = if_else(Partner == 'Yes', 1, 0),
         PhoneService = if_else(PhoneService == 'Yes', 1, 0),
         HasInternet = if_else(InternetService == 'No', 0, 1),
         MultipleLines = if_else(MultipleLines == 'Yes', 1, 0),
         Dependents = if_else(Dependents == 'Yes', 1, 0),
         OnlineSecurity = if_else(OnlineSecurity == 'Yes', 1, 0),
         OnlineBackup = if_else(OnlineBackup == 'Yes', 1, 0),
         DeviceProtection = if_else(DeviceProtection == 'Yes', 1, 0),
         TechSupport = if_else(TechSupport == 'Yes', 1, 0),
         StreamingTV = if_else(StreamingTV == 'Yes', 1, 0),
         StreamingMovies = if_else(StreamingMovies == 'Yes', 1, 0),
         PaperlessBilling = if_else(PaperlessBilling == 'Yes', 1, 0)) %>% 
  select(-Gender)

cus_rt %>% 
  ggplot(aes(Tenure)) +
  geom_histogram() +
  facet_wrap(~Status)

Data Exploration

Overall proportion of Status:

prop.table(table(customer_retention$Status))
## 
##   Current      Left 
## 0.7348193 0.2651807

Proportion of Status based on Gender:

prop.table(table(customer_retention$Status, customer_retention$Gender))
##          
##              Female      Male
##   Current 0.3619088 0.3729104
##   Left    0.1334476 0.1317331

Relationship Between MonthlyCharges and Status:

cus_rt_binned <- cus_rt %>% 
  mutate(bin_MonthlyCharges = cut(MonthlyCharges, breaks=seq(15,130,by=15))) 

title_perc <- textGrob("Status")         

# calculate status percentage based on Monthly Charges
mth_chgs_prop <- table(cus_rt_binned$bin_MonthlyCharges, cus_rt_binned$Status) %>% 
  prop.table(1)

mth_chgs_perc <- tableGrob(round(mth_chgs_prop*100, digits = 2)) %>% 
  gtable_add_rows(
  heights = grobHeight(title_perc) + unit(5,'mm'), 
  pos = 0) %>% 
  gtable_add_grob(title_perc, 1, 2, 1, clip = "off")

# create histogram that shows proportion with status of current vs. left
mth_chgs_plot <- cus_rt_binned %>% 
  ggplot(aes(x=bin_MonthlyCharges, fill = Status)) +
  geom_bar() + 
  theme(legend.position="top") +
  geom_text(aes(label = ..count..), stat = "count", colour = "white", position=position_stack(vjust = 0.5))

# display table and histogram
grid.arrange(mth_chgs_perc, mth_chgs_plot,
             ncol=2, widths=c(6,12), top = "Monthly Charges")

Machine Learning

set.seed(123)
split  <- rsample::initial_split(cus_rt, prop = 0.7, strata = "Status")
cus_rt_train  <- rsample::training(split)
cus_rt_test   <- rsample::testing(split)

Models

MARS

mars_fit <- mars(mode = "classification", prod_degree = 3) %>%
  fit(Status ~ ., cus_rt_train)

# create MARS model object
mars_mod <- mars(
  mode = "classification", 
  num_terms = tune(), #<<
  prod_degree = tune() #<<
  )

# create k-fold cross validation object
set.seed(123)
folds <- vfold_cv(cus_rt_train, v = 5)

# create our model recipe
model_recipe <- recipe(Status ~ ., data = cus_rt_train)

# create a hyper parameter tuning grid
hyper_grid <- grid_regular(  #<<
  num_terms(range = c(10,50)), #<<
  prod_degree(), #<<
  levels = 10  #<<
 )  #<<

# train our model across the hyper parameter grid
MARS_results <- tune_grid(mars_mod, model_recipe, resamples = folds, grid = hyper_grid)  #<<

# get best results
show_best(MARS_results, metric = "roc_auc", n=10)


MARS_roc <- show_best(MARS_results, metric = "roc_auc", n=1)
MARS_roc


autoplot(MARS_results)

MARS_best_hyperparameters <- select_best(MARS_results)

MARS_final_wf <- workflow() %>%
  add_model(mars_mod) %>%
  add_recipe(model_recipe) %>% 
  finalize_workflow(MARS_best_hyperparameters)

# plot top 20 influential variables
MARS_final_wf %>%
   fit(data = cus_rt_train) %>% 
   extract_fit_parsnip() %>%
   vip(20)

Decision Trees


#create decision tree model object
#dt_mod <- decision_tree(mode = "classification") %>% set_engine("rpart")

dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
 ) %>% 
  set_engine("rpart")
 
# create the hyperparameter grid
hyper_grid <- grid_regular(
  cost_complexity(),  #<<
  tree_depth(), #<<
  min_n() #<<
 )

#create the model recipe
dt_recipe <- recipe(Status ~ ., data = cus_rt_train)

# 5-fold cross validation object
set.seed(123)
kfold <- vfold_cv(cus_rt_train, v = 5)

# train our model across the hyper parameter grid
dt_results <- tune_grid(dt_mod, dt_recipe, resamples = kfold, grid = hyper_grid)

# get best results
show_best(dt_results, metric = "roc_auc", n = 10)

DT_roc <- show_best(dt_results, metric = "roc_auc", n = 1)

DT_best_hyperparameters <- select_best(dt_results)

DT_final_wf <- workflow() %>%
  add_model(dt_mod) %>%
  add_recipe(dt_recipe) %>% 
  finalize_workflow(DT_best_hyperparameters)

# plot top 20 influential variables
DT_final_wf %>%
   fit(data = cus_rt_train) %>% 
   extract_fit_parsnip() %>%
   vip(20)

Random Forests

##  Fitting a default random forest model

# create random forest model object with tuning option
rf_mod <- rand_forest(
  mode = "classification", 
  trees = 1000,
  mtry = tune(), #<<
  min_n = tune() #<<
  ) %>%
  set_engine("ranger", importance = "permutation") #<<

# create resampling procedure
set.seed(123)
kfold <- vfold_cv(cus_rt_train, v = 5)

# create the hyperparameter grid
hyper_grid <- grid_regular(
   mtry(range = c(2, 80)), #<<
   min_n(range = c(1, 20)), #<<        
   levels = 2 #<<
   )

# train our model across the hyper parameter grid
set.seed(123)
results_RF <- tune_grid(rf_mod, model_recipe, resamples = kfold, grid = hyper_grid)

# model results
show_best(results_RF, metric = "roc_auc", n=10)

RF_roc <- show_best(results_RF, metric = "roc_auc", n=1)

RF_best_hyperparameters <- select_best(results_RF)

RF_final_wf <- workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(model_recipe) %>% 
  finalize_workflow(RF_best_hyperparameters)

# plot top 20 influential variables
RF_final_wf %>%
   fit(data = cus_rt_train) %>% 
   extract_fit_parsnip() %>%
   vip(20)

Optimal Model Selection

AUC

MARS_roc

DT_roc

RF_roc

Based on the area under the curve (AUC), the MARS model would be considered optimal.

Confusion Matrix

MARS

mars_mod_conf <- MARS_final_wf %>% 
  fit(data = cus_rt_train)

mars_mod_conf %>%
   predict(cus_rt_test) %>%
   bind_cols(cus_rt_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1383  242
##    Left        157  315

This model tends to underanticipate customers leaving more than it does overestimate it. In other words, there are more cases where people actually left but were predicted to have stayed.

Decision Trees

dt_mod_conf <- DT_final_wf %>% 
  fit(data = cus_rt_train)

dt_mod_conf %>%
   predict(cus_rt_test) %>%
   bind_cols(cus_rt_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1337  261
##    Left        203  296

This model also underestimates customers leaving more than it overestimates it.

Random Forests

rf_mod_conf <- RF_final_wf %>% 
  fit(data = cus_rt_train)

rf_mod_conf %>%
   predict(cus_rt_test) %>%
   bind_cols(cus_rt_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1401  281
##    Left        139  276

This model also underestimates customers leaving more than it overestimates it.

By looking at these confusion matrices, we can see that the MARS model is optimal as it has the highest number of accurate predictions (1383+315=1698).

Business Analysis + Summary

According to our model Tenure, TotalCharges, and MonthlyCharges are the most influential features.

Here are a few recommendations we, Group 22, have after analyzing the customer retention data. We saw quite a bit of variability in our model during the first 10 months of tenure. Most of the customers who left, left within this time frame and consequently our model is less accurate during this period. As the length of tenure increases our model consistently gets more accurate and it plateaus at around 20 months. The most common trend in our data is that customers seem to leave Regork either at the beginning of their tenure or once price points become too high.

Predictions

Here we predict all of the customers that will leave

cus_left <- mars_mod_conf %>%
  predict(cus_rt_test) %>%
  bind_cols(cus_rt_test) %>% 
  filter(.pred_class == "Left")
cus_left

Here we predict the monthly revenue that will be lost if no action is taken

cus_left %>% 
  summarize(lost_rev = sum(MonthlyCharges))

If no action is taken we predict Regork could lose $37,228.65 per month in revenue.

Recommendations

Below is a list of specific areas that we believe Regork could focus on to help retain their customers:

Regork should look into offering reduced prices for the first few months or so, or more specifically those who hold month-to-month plans.
- This could address the issue of the high number of customers who leave in the first few months.

cus_left %>% 
  filter(Tenure <= 12) %>% 
  mutate(new_cus_discount = 0.25 * MonthlyCharges) %>% 
  summarise(disc_lost_rev = sum(new_cus_discount))


cus_left %>% 
  filter(Tenure <= 12) %>% 
  summarise(new_cus_lost_rev = sum(MonthlyCharges))

Here, we can see that if Regork were to offer a 25% discount for all customers with tenure of 12 months or less, they would lose $6,456.64 in monthly revenue due to the discount.
This is much less, however, than the $25,826.55 we predict they will lose in revenue each month from new customers (Tenure <= 12) leaving.
Overall, this could save the company up to $19,369.91 per month in lost revenue.

Obviously, price is always the number one indicator of whether someone leaves Regork’s services or not. However, they seem to lose a majority of their customers as prices reach the middle to upper ranges of 60-100 dollars per month. This could mean customers add more lines to their plan, and consequently as the dollar amount increases, they are more likely to switch providers. 77 percent of customers lost were in this price range.
- One possible way Regork could mitigate these losses is by offering customers incremental discounts as they purchase additional lines. This could mean offering more attractive family plans, etc.
Online security is a large part of their business and one of the most important predictor variables in our model
- We think Regorks marketing team should use their strong online security presence as a selling point for their other services.

Conclusion

Overall, we think there are a few options to encourage customer retention. To us, however, the smartest decision would be to target new customers by offering them a discount for the first 12 months, which should help with the large loss of new customers. # {-}

BANA 4080: Final Project

Group 22: Meme Habel, Leo Magoto, + Mark Summers

2022-12-07

Introduction

Load Packages

Load Data

Response Variable

Predictor Variables

Exploratory Data Analysis

Data Preparation

Dummy Encode

Data Exploration

Machine Learning

Models

MARS

Decision Trees

Random Forests

Optimal Model Selection

AUC

Confusion Matrix

MARS

Decision Trees

Random Forests

Business Analysis + Summary

Predictions

Recommendations

Conclusion