The purpose of this report is to evaluate customer retention for Regork. Since the company has entered the telecommunications market, it has been a major focus to retain existing customers, rather then to attract new ones. The reason for this being is that it cost much more to acquire a new customer. We will solve this problem by analyzing the data and creating models to predict what customers will leave in the future.
As can be seen through the analysis, one of the biggest factors on customer retention is tenure with the company. Because of this, we believe that offering new customers, those with tenure of 12 months or less, a discount on their services would be beneficial to Regork’s customer retention rate.
# packages required
library(readr)
library(summarytools)
library(grid)
library(gridExtra)
library(gtable)
library(tidyverse)
library(tidymodels)
library(vip)
library(pdp)
library(prettydoc)customer_retention <- read_csv('data/customer_retention.csv') %>%
mutate(Status = as.factor(Status))cus_rt <- customer_retention %>%
na.omit(cus_rt) %>%
mutate(Female = if_else(Gender == 'Female', 1, 0),
Partner = if_else(Partner == 'Yes', 1, 0),
PhoneService = if_else(PhoneService == 'Yes', 1, 0),
HasInternet = if_else(InternetService == 'No', 0, 1),
MultipleLines = if_else(MultipleLines == 'Yes', 1, 0),
Dependents = if_else(Dependents == 'Yes', 1, 0),
OnlineSecurity = if_else(OnlineSecurity == 'Yes', 1, 0),
OnlineBackup = if_else(OnlineBackup == 'Yes', 1, 0),
DeviceProtection = if_else(DeviceProtection == 'Yes', 1, 0),
TechSupport = if_else(TechSupport == 'Yes', 1, 0),
StreamingTV = if_else(StreamingTV == 'Yes', 1, 0),
StreamingMovies = if_else(StreamingMovies == 'Yes', 1, 0),
PaperlessBilling = if_else(PaperlessBilling == 'Yes', 1, 0)) %>%
select(-Gender)cus_rt %>%
ggplot(aes(Tenure)) +
geom_bar(fill = "#b3e379") +
facet_wrap(~Status)prop.table(table(customer_retention$Status, customer_retention$Gender))
##
## Female Male
## Current 0.3619088 0.3729104
## Left 0.1334476 0.1317331cus_rt_gender <- cus_rt %>%
group_by(Female) %>%
select(Female, Status)
ggplot(cus_rt_gender, aes(x = Status, fill = factor(Female))) +
geom_bar() +
labs(title = "Status by Gender",
subtitle = "blue equates to male, pink equates to female",
x = "Status",
y = "Count") +
scale_fill_manual(name = "Gender", values = c("1" = "pink", "0" = "lightblue")) +
theme_minimal()prop.table(table(customer_retention$Status, customer_retention$PaymentMethod))
##
## Bank transfer (automatic) Credit card (automatic) Electronic check
## Current 0.18245464 0.18331190 0.18331190
## Left 0.03671953 0.03271896 0.15245035
##
## Mailed check
## Current 0.18574082
## Left 0.04329190paymentmethod <- cus_rt %>%
group_by(PaymentMethod) %>%
select(Status, PaymentMethod)
ggplot(paymentmethod, aes(x = Status, fill = PaymentMethod)) +
geom_bar(position = "dodge") +
labs(title = "Customer Status by Payment Method",
x = "Customer Status",
y = "Count") +
theme_minimal()Relationship Between Internet Service and Status
internet_status <- customer_retention %>%
group_by(Status) %>%
select(Status, InternetService)
ggplot(internet_status, aes(x = Status, fill = InternetService)) +
geom_bar() +
labs(title = "Customer Status by Internet Service",
x = "Customer Status",
y = "Count") +
theme_minimal()set.seed(123)
split <- rsample::initial_split(cus_rt, prop = 0.7, strata = "Status")
cus_rt_train <- rsample::training(split)
cus_rt_test <- rsample::testing(split)mars_fit <- mars(mode = "classification", prod_degree = 1) %>%
fit(Status ~ ., cus_rt_train)# MARS model
mars_mod <- mars(
mode = "classification",
num_terms = tune(), #<<
prod_degree = tune() #<<
)
# k-fold cross validation
set.seed(123)
folds <- vfold_cv(cus_rt_train, v = 5)
# model recipe
mars_recipe <- recipe(Status ~ ., data = cus_rt_train)
# hyper parameter tuning grid
mars_grid <- grid_regular( #<<
num_terms(range = c(10,50)), #<<
prod_degree(), #<<
levels = 10 #<<
) #<<
# train our model across the hyper parameter grid
mars_results <- tune_grid(mars_mod, mars_recipe, resamples = folds, grid = mars_grid) #<<
# best results
show_best(mars_results, metric = "roc_auc", n=10)
mars_roc <- show_best(mars_results, metric = "roc_auc", n=1) %>%
arrange(desc(mean))
mars_roc
autoplot(mars_results)mars_best <- select_best(mars_results)
mars_final_wf <- workflow() %>%
add_model(mars_mod) %>%
add_recipe(mars_recipe) %>%
finalize_workflow(mars_best)
# Most influential variables
mars_final_wf %>%
fit(data = cus_rt_train) %>%
extract_fit_parsnip() %>%
vip(13)
#Decision tree model
dt_mod <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
# Hyperparameter grid
mars_grid <- grid_regular(
cost_complexity(), #<<
tree_depth(), #<<
min_n() #<<
)
dt_recipe <- recipe(Status ~ ., data = cus_rt_train)
# Cross validation object
set.seed(123)
kfold <- vfold_cv(cus_rt_train, v = 5)
dt_results <- tune_grid(dt_mod, dt_recipe, resamples = kfold, grid = mars_grid)
# Best results
show_best(dt_results, metric = "roc_auc", n = 10)DT_roc <- show_best(dt_results, metric = "roc_auc", n = 1)Decision_tree_best_hyperparameters <- select_best(dt_results)
Decision_tree_final_wf <- workflow() %>%
add_model(dt_mod) %>%
add_recipe(dt_recipe) %>%
finalize_workflow(Decision_tree_best_hyperparameters)
# Top influential variables
Decision_tree_final_wf %>%
fit(data = cus_rt_train) %>%
extract_fit_parsnip() %>%
vip(20)## Random forest model
rf_mod <- rand_forest(
mode = "classification",
trees = 1000,
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "permutation")
# Resampling procedure
set.seed(123)
kfold <- vfold_cv(cus_rt_train, v = 5)
# Hyperparameter grid
mars_grid <- grid_regular(
mtry(range = c(2, 80)), #<<
min_n(range = c(1, 20)), #<<
levels = 2 #<<
)
# Hyper parameter grid
set.seed(123)
results_Random_forest <- tune_grid(rf_mod, mars_recipe, resamples = kfold, grid = mars_grid)
# results
show_best(results_Random_forest, metric = "roc_auc", n=10)RF_roc <- show_best(results_Random_forest, metric = "roc_auc", n=1)Random_forest_best_hyperparameters <- select_best(results_Random_forest)
Random_forest_final_wf <- workflow() %>%
add_model(rf_mod) %>%
add_recipe(mars_recipe) %>%
finalize_workflow(Random_forest_best_hyperparameters)
# top influential variables
Random_forest_final_wf %>%
fit(data = cus_rt_train) %>%
extract_fit_parsnip() %>%
vip(19)mars_rocDT_rocRF_rocBased on the area under the curve (AUC), the MARS model would be considered optimal.
mars_mod_conf <- mars_final_wf %>%
fit(data = cus_rt_train)
mars_mod_conf %>%
predict(cus_rt_test) %>%
bind_cols(cus_rt_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1383 242
## Left 157 315This model tends to underestimate customers leaving more than it does overestimate it.
dt_mod_conf <- Decision_tree_final_wf %>%
fit(data = cus_rt_train)
dt_mod_conf %>%
predict(cus_rt_test) %>%
bind_cols(cus_rt_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1337 261
## Left 203 296This model also underestimates customers leaving.
rf_mod_conf <- Random_forest_final_wf %>%
fit(data = cus_rt_train)
rf_mod_conf %>%
predict(cus_rt_test) %>%
bind_cols(cus_rt_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1397 283
## Left 143 274This model also underestimates customers leaving.
When looking at the confusion matrices, it can be seen that the MRAS model performs the best because it has the largest number of correct predictions. (1383+315=1698).
According to our model, the three most influential factors were, tenure, monthly charges, and total charges.
We have a few recommendations for Regork after analyzing the customer retention data. One insight we have found is that there was the most variability in the model within the first 10 months. Most of the customers who left, left within the first 10 months, making our model less accurate during this time period. As the length of tenure increases, our model gets more accurate, and is most accurate around 20 months. The main trend that we have found from the data is that customers tend to leave Regork either at the start of their tenure or once the price begins to become too high.
Prediction of what customers will leave.
cus_left <- mars_mod_conf %>%
predict(cus_rt_test) %>%
bind_cols(cus_rt_test) %>%
filter(.pred_class == "Left")
cus_leftcus_left %>%
filter(Tenure <= 12) %>%
mutate(new_cus_discount = 0.30 * MonthlyCharges) %>%
summarise(disc_lost_rev = sum(new_cus_discount))
cus_left %>%
filter(Tenure <= 12) %>%
summarise(new_cus_lost_rev = sum(MonthlyCharges))Here, we can see that if Regork were to offer customers a discount of 30% with a tenure time of 12 months or less they would lose $7,747 in revenue.
Overall, we think that the best option for Regork would be to offer
new customer (tenure < 12 months), a discount for their first year of
service. We believe that this will greatly help in the goal to retain
more customers.
# {-}