The purpose of this report is to analyze customer retention for Regork. Now that the company has entered the telecommunications market, it must focus on retaining existing customers, since attracting a new customer costs far more than keeping an existing one. We therefore analyze customer data and build models that predict which customers are likely to leave.
As this analysis will show, tenure with the company is one of the strongest influences on customer retention. Because of this, we believe that offering new customers (those with a tenure of 12 months or less) a 25% discount on their services would improve Regork's retention rate. This is discussed further in later sections.
# Helper packages
library(tidyverse) # for data wrangling & plotting
library(readr) # for importing csv file
library(summarytools)
library(grid)
library(gridExtra)
library(gtable)
# Modeling packages
library(tidymodels)
# Model interpretability packages
library(vip) # for variable importance
library(pdp) # for variable relationships
# import the data and encode the response as a factor
customer_retention <- read_csv('data/customer_retention.csv') %>%
  mutate(Status = as.factor(Status))
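The summarytools package loaded above can give a quick overview of variable types and missingness before any cleaning. A minimal sketch, not part of the original analysis (output omitted):
# one-line data overview: counts, distributions, and NA rates per column
dfSummary(customer_retention)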
# drop rows with missing values and recode Yes/No columns as 0/1 indicators
cus_rt <- customer_retention %>%
  na.omit() %>%
  mutate(Female = if_else(Gender == 'Female', 1, 0),
         Partner = if_else(Partner == 'Yes', 1, 0),
         PhoneService = if_else(PhoneService == 'Yes', 1, 0),
         HasInternet = if_else(InternetService == 'No', 0, 1),
         MultipleLines = if_else(MultipleLines == 'Yes', 1, 0),
         Dependents = if_else(Dependents == 'Yes', 1, 0),
         OnlineSecurity = if_else(OnlineSecurity == 'Yes', 1, 0),
         OnlineBackup = if_else(OnlineBackup == 'Yes', 1, 0),
         DeviceProtection = if_else(DeviceProtection == 'Yes', 1, 0),
         TechSupport = if_else(TechSupport == 'Yes', 1, 0),
         StreamingTV = if_else(StreamingTV == 'Yes', 1, 0),
         StreamingMovies = if_else(StreamingMovies == 'Yes', 1, 0),
         PaperlessBilling = if_else(PaperlessBilling == 'Yes', 1, 0)) %>%
  select(-Gender)
# distribution of tenure for current vs. departed customers
cus_rt %>%
  ggplot(aes(Tenure)) +
  geom_histogram() +
  facet_wrap(~ Status)
Overall proportion of Status:
prop.table(table(customer_retention$Status))
##
## Current Left
## 0.7348193 0.2651807
Proportion of Status based on Gender:
prop.table(table(customer_retention$Status, customer_retention$Gender))
##
## Female Male
## Current 0.3619088 0.3729104
## Left 0.1334476 0.1317331
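The table above shows joint proportions, so the gender comparison is easier to read after normalizing within each column. A small sketch of our own (not in the original output):
# churn rate within each gender (columns sum to 1)
prop.table(table(customer_retention$Status, customer_retention$Gender), margin = 2)
From the joint proportions above, this works out to a churn rate of roughly 27% for female customers and 26% for male customers, so gender has little bearing on retention.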
Relationship Between MonthlyCharges and Status:
cus_rt_binned <- cus_rt %>%
  mutate(bin_MonthlyCharges = cut(MonthlyCharges, breaks = seq(15, 130, by = 15)))
title_perc <- textGrob("Status")
# calculate status percentage based on Monthly Charges
mth_chgs_prop <- table(cus_rt_binned$bin_MonthlyCharges, cus_rt_binned$Status) %>%
  prop.table(margin = 1)
mth_chgs_perc <- tableGrob(round(mth_chgs_prop * 100, digits = 2)) %>%
  gtable_add_rows(heights = grobHeight(title_perc) + unit(5, 'mm'), pos = 0) %>%
  gtable_add_grob(title_perc, 1, 2, 1, clip = "off")
# create histogram that shows proportion with status of current vs. left
mth_chgs_plot <- cus_rt_binned %>%
  ggplot(aes(x = bin_MonthlyCharges, fill = Status)) +
  geom_bar() +
  theme(legend.position = "top") +
  geom_text(aes(label = after_stat(count)), stat = "count",
            colour = "white", position = position_stack(vjust = 0.5))
# display table and histogram
grid.arrange(mth_chgs_perc, mth_chgs_plot,
ncol=2, widths=c(6,12), top = "Monthly Charges")
set.seed(123)
# 70/30 train/test split, stratified by Status
split <- rsample::initial_split(cus_rt, prop = 0.7, strata = "Status")
cus_rt_train <- rsample::training(split)
cus_rt_test <- rsample::testing(split)
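As a quick check that the stratified split preserved the roughly 73/27 class balance, the proportions in each partition can be compared. A sketch of our own:
# class balance should be nearly identical in both partitions
cus_rt_train %>% count(Status) %>% mutate(prop = n / sum(n))
cus_rt_test %>% count(Status) %>% mutate(prop = n / sum(n))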
# fit a default MARS model with up to third-degree interactions
mars_fit <- mars(mode = "classification", prod_degree = 3) %>%
  fit(Status ~ ., data = cus_rt_train)
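This default fit is never evaluated directly; a quick sanity check against the test set might look like the following sketch (roc_auc() treats the first factor level, Current, as the event):
# test-set AUC for the untuned MARS fit
predict(mars_fit, cus_rt_test, type = "prob") %>%
  bind_cols(cus_rt_test %>% select(Status)) %>%
  roc_auc(Status, .pred_Current)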
# create MARS model object
mars_mod <- mars(
  mode = "classification",
  num_terms = tune(),
  prod_degree = tune()
)
# create k-fold cross validation object
set.seed(123)
folds <- vfold_cv(cus_rt_train, v = 5)
# create our model recipe
model_recipe <- recipe(Status ~ ., data = cus_rt_train)
# create a hyperparameter tuning grid
hyper_grid <- grid_regular(
  num_terms(range = c(10, 50)),
  prod_degree(),
  levels = 10
)
# train our model across the hyperparameter grid
MARS_results <- tune_grid(mars_mod, model_recipe, resamples = folds, grid = hyper_grid)
# get best results
show_best(MARS_results, metric = "roc_auc", n=10)
MARS_roc <- show_best(MARS_results, metric = "roc_auc", n = 1)
MARS_roc
autoplot(MARS_results)
MARS_best_hyperparameters <- select_best(MARS_results, metric = "roc_auc")
MARS_final_wf <- workflow() %>%
  add_model(mars_mod) %>%
  add_recipe(model_recipe) %>%
  finalize_workflow(MARS_best_hyperparameters)
# plot top 20 influential variables
MARS_final_wf %>%
  fit(data = cus_rt_train) %>%
  extract_fit_parsnip() %>%
  vip(num_features = 20)
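The pdp package was loaded for examining variable relationships but never used. A manual partial-dependence sketch for Tenure (the object names below are ours) visualizes how the predicted probability of leaving changes with tenure:
# average the predicted probability of leaving over a grid of tenure values,
# holding all other predictors at their observed values
mars_final_fit <- MARS_final_wf %>% fit(data = cus_rt_train)
pd_tenure <- map_dfr(seq(0, 72, by = 6), function(t) {
  probs <- predict(mars_final_fit, mutate(cus_rt_train, Tenure = t), type = "prob")
  tibble(Tenure = t, avg_prob_left = mean(probs$.pred_Left))
})
ggplot(pd_tenure, aes(Tenure, avg_prob_left)) +
  geom_line() +
  labs(y = "Average predicted probability of leaving")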
# create decision tree model object
dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")
# create the hyperparameter grid
hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n()
)
# create the model recipe
dt_recipe <- recipe(Status ~ ., data = cus_rt_train)
# 5-fold cross validation object
set.seed(123)
kfold <- vfold_cv(cus_rt_train, v = 5)
# train our model across the hyperparameter grid
dt_results <- tune_grid(dt_mod, dt_recipe, resamples = kfold, grid = hyper_grid)
# get best results
show_best(dt_results, metric = "roc_auc", n = 10)
DT_roc <- show_best(dt_results, metric = "roc_auc", n = 1)
DT_best_hyperparameters <- select_best(dt_results, metric = "roc_auc")
DT_final_wf <- workflow() %>%
  add_model(dt_mod) %>%
  add_recipe(dt_recipe) %>%
  finalize_workflow(DT_best_hyperparameters)
# plot top 20 influential variables
DT_final_wf %>%
  fit(data = cus_rt_train) %>%
  extract_fit_parsnip() %>%
  vip(num_features = 20)
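For a decision tree, the fitted splits themselves are often more informative than importance scores. A sketch, assuming the rpart.plot package is installed (it is not loaded above):
# extract the underlying rpart object and draw the tree
DT_final_wf %>%
  fit(data = cus_rt_train) %>%
  extract_fit_engine() %>%
  rpart.plot::rpart.plot(roundint = FALSE)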
## Fitting a random forest model
# create random forest model object with tuning option
rf_mod <- rand_forest(
  mode = "classification",
  trees = 1000,
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", importance = "permutation")
# create resampling procedure
set.seed(123)
kfold <- vfold_cv(cus_rt_train, v = 5)
# create the hyperparameter grid
hyper_grid <- grid_regular(
  mtry(range = c(2, 20)),  # upper bound capped at the predictor count; ranger errors if mtry exceeds it
  min_n(range = c(1, 20)),
  levels = 2
)
# train our model across the hyperparameter grid
set.seed(123)
results_RF <- tune_grid(rf_mod, model_recipe, resamples = kfold, grid = hyper_grid)
# model results
show_best(results_RF, metric = "roc_auc", n=10)
RF_roc <- show_best(results_RF, metric = "roc_auc", n = 1)
RF_best_hyperparameters <- select_best(results_RF, metric = "roc_auc")
RF_final_wf <- workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(model_recipe) %>%
  finalize_workflow(RF_best_hyperparameters)
# plot top 20 influential variables
RF_final_wf %>%
  fit(data = cus_rt_train) %>%
  extract_fit_parsnip() %>%
  vip(num_features = 20)
# compare cross-validated AUC across the three models
MARS_roc
DT_roc
RF_roc
Based on the cross-validated area under the ROC curve (AUC), the MARS model performs best.
mars_mod_conf <- MARS_final_wf %>%
  fit(data = cus_rt_train)
mars_mod_conf %>%
  predict(cus_rt_test) %>%
  bind_cols(cus_rt_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1383 242
## Left 157 315
This model underpredicts churn more than it overpredicts it: 242 customers actually left but were predicted to stay, versus only 157 predicted to leave who actually stayed.
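yardstick can summarize the same confusion matrix into named metrics. A sketch (sensitivity here is with respect to the Current class, the first factor level):
# accuracy, kappa, sensitivity, specificity, etc. from the confusion matrix
mars_mod_conf %>%
  predict(cus_rt_test) %>%
  bind_cols(cus_rt_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class) %>%
  summary()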
dt_mod_conf <- DT_final_wf %>%
  fit(data = cus_rt_train)
dt_mod_conf %>%
  predict(cus_rt_test) %>%
  bind_cols(cus_rt_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1337 261
## Left 203 296
The decision tree likewise underpredicts churn (261 missed leavers versus 203 false alarms).
rf_mod_conf <- RF_final_wf %>%
  fit(data = cus_rt_train)
rf_mod_conf %>%
  predict(cus_rt_test) %>%
  bind_cols(cus_rt_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1401 281
## Left 139 276
The random forest underpredicts churn most heavily of the three (281 missed leavers versus 139 false alarms).
Looking across these confusion matrices, the MARS model again comes out on top: it has the most correct predictions (1383 + 315 = 1698 of the 2097 test observations, roughly 81% accuracy).
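The cross-validated AUC comparison above can also be corroborated on the held-out test set using the three fitted workflows. A sketch (the helper function name is ours):
# test-set ROC AUC for each final model
test_auc <- function(fitted_wf) {
  predict(fitted_wf, cus_rt_test, type = "prob") %>%
    bind_cols(cus_rt_test %>% select(Status)) %>%
    roc_auc(Status, .pred_Current)
}
bind_rows(MARS = test_auc(mars_mod_conf),
          DT = test_auc(dt_mod_conf),
          RF = test_auc(rf_mod_conf),
          .id = "model")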
According to our model, Tenure, TotalCharges, and MonthlyCharges are the most influential features.
Here are a few recommendations we, Group 22, have after analyzing the customer retention data. We saw quite a bit of variability in our model during the first 10 months of tenure: most of the customers who left did so within this window, and consequently the model is least accurate there. As tenure increases, the model grows steadily more accurate, plateauing at around 20 months. The most common trend in our data is that customers leave Regork either at the beginning of their tenure or once their price point becomes too high.
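The accuracy-by-tenure claim can be checked directly by binning the test-set predictions. A sketch of our own (the bin boundaries are arbitrary):
# per-bin accuracy of the MARS model across tenure ranges
mars_mod_conf %>%
  predict(cus_rt_test) %>%
  bind_cols(cus_rt_test) %>%
  mutate(tenure_bin = cut(Tenure, breaks = seq(0, 80, by = 10), include.lowest = TRUE)) %>%
  group_by(tenure_bin) %>%
  summarise(accuracy = mean(.pred_class == Status), n = n())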
Here we identify the test-set customers our model predicts will leave:
cus_left <- mars_mod_conf %>%
  predict(cus_rt_test) %>%
  bind_cols(cus_rt_test) %>%
  filter(.pred_class == "Left")
cus_left
Here we estimate the monthly revenue that will be lost if no action is taken:
cus_left %>%
  summarize(lost_rev = sum(MonthlyCharges))
If no action is taken, we predict Regork could lose $37,228.65 per month in revenue from these test-set customers alone.
Below are the figures behind our recommendation: the monthly cost of a 25% discount for predicted leavers with a tenure of 12 months or less, and the monthly revenue those customers represent.
cus_left %>%
  filter(Tenure <= 12) %>%
  mutate(new_cus_discount = 0.25 * MonthlyCharges) %>%
  summarise(disc_lost_rev = sum(new_cus_discount))
cus_left %>%
  filter(Tenure <= 12) %>%
  summarise(new_cus_lost_rev = sum(MonthlyCharges))
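Putting the two figures together: under the optimistic assumption that the discount retains every one of these new customers, the net monthly revenue preserved would be the full amount less the discount cost. A sketch (column names are ours):
# net revenue retained if the 25% discount keeps all predicted leavers with Tenure <= 12
cus_left %>%
  filter(Tenure <= 12) %>%
  summarise(full_monthly_rev = sum(MonthlyCharges),
            discount_cost = 0.25 * full_monthly_rev,
            net_retained = full_monthly_rev - discount_cost)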
Overall, we think there are a few options to encourage customer retention. To us, however, the smartest decision would be to target new customers by offering them a discount for their first 12 months, which should help stem the large loss of customers early in their tenure.