Introduction

The purpose of this report is to evaluate customer retention for Regork. Since entering the telecommunications market, the company has focused on retaining existing customers rather than attracting new ones, because acquiring a new customer costs considerably more than keeping an existing one. We address this problem by analyzing the customer data and building models that predict which customers are likely to leave.

As the analysis shows, one of the biggest factors affecting customer retention is tenure with the company. Because of this, we believe that offering new customers (those with a tenure of 12 months or less) a discount on their services would improve Regork's customer retention rate.


Packages Required

# packages required
library(readr)        # reading the CSV data
library(summarytools) # summary statistics
library(grid)         # plot layout helpers
library(gridExtra)    # arranging multiple plots
library(gtable)       # plot table manipulation
library(tidyverse)    # data wrangling and visualization
library(tidymodels)   # modeling framework (rsample, recipes, tune, yardstick)
library(vip)          # variable importance plots
library(pdp)          # partial dependence plots
library(prettydoc)    # HTML report theme

customer_retention <- read_csv('data/customer_retention.csv') %>% 
  mutate(Status = as.factor(Status))

Response Variable

  • Status: Whether the customer is Current or has Left.

Predictor Variables

  • Gender: Whether the customer is male or female (Male, Female)
  • SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
  • Partner: Whether the customer has a partner or not (Yes, No)
  • Dependents: Whether the customer has dependents or not (Yes, No)
  • Tenure: Number of months the customer has stayed with the company
  • PhoneService: Whether the customer has a phone service or not (Yes, No)
  • MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
  • InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
  • OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
  • OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
  • DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
  • TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
  • StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
  • StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
  • Contract: The contract term of the customer (Month-to-month, One year, Two year)
  • PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
  • PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
  • MonthlyCharges: The amount charged to the customer each month
  • TotalCharges: The total amount charged to the customer over their tenure
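
Before any modeling, it is worth confirming that these columns were read in with the expected types; a minimal sketch using only functions from the packages loaded above:

glimpse(customer_retention)        # confirm column types match the descriptions above
colSums(is.na(customer_retention)) # spot columns with missing values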

Exploratory Data Analysis

cus_rt <- customer_retention %>% 
  drop_na() %>% 
  mutate(Female = if_else(Gender == 'Female', 1, 0),
         Partner = if_else(Partner == 'Yes', 1, 0),
         PhoneService = if_else(PhoneService == 'Yes', 1, 0),
         HasInternet = if_else(InternetService == 'No', 0, 1),
         MultipleLines = if_else(MultipleLines == 'Yes', 1, 0),
         Dependents = if_else(Dependents == 'Yes', 1, 0),
         OnlineSecurity = if_else(OnlineSecurity == 'Yes', 1, 0),
         OnlineBackup = if_else(OnlineBackup == 'Yes', 1, 0),
         DeviceProtection = if_else(DeviceProtection == 'Yes', 1, 0),
         TechSupport = if_else(TechSupport == 'Yes', 1, 0),
         StreamingTV = if_else(StreamingTV == 'Yes', 1, 0),
         StreamingMovies = if_else(StreamingMovies == 'Yes', 1, 0),
         PaperlessBilling = if_else(PaperlessBilling == 'Yes', 1, 0)) %>% 
  select(-Gender)
cus_rt %>% 
  ggplot(aes(Tenure)) +
  geom_bar(fill = "#b3e379") +
  facet_wrap(~Status) +
  labs(title = "Tenure Distribution by Customer Status",
       x = "Tenure (months)",
       y = "Count")

prop.table(table(customer_retention$Status, customer_retention$Gender))
##          
##              Female      Male
##   Current 0.3619088 0.3729104
##   Left    0.1334476 0.1317331
cus_rt_gender <- cus_rt %>% 
  select(Female, Status)

ggplot(cus_rt_gender, aes(x = Status, fill = factor(Female))) +
  geom_bar() +
  labs(title = "Status by Gender",
       x = "Status",
       y = "Count") +
  scale_fill_manual(name = "Gender",
                    values = c("1" = "pink", "0" = "lightblue"),
                    labels = c("1" = "Female", "0" = "Male")) +
  theme_minimal()

prop.table(table(customer_retention$Status, customer_retention$PaymentMethod))
##          
##           Bank transfer (automatic) Credit card (automatic) Electronic check
##   Current                0.18245464              0.18331190       0.18331190
##   Left                   0.03671953              0.03271896       0.15245035
##          
##           Mailed check
##   Current   0.18574082
##   Left      0.04329190
paymentmethod <- cus_rt %>% 
  select(Status, PaymentMethod)

ggplot(paymentmethod, aes(x = Status, fill = PaymentMethod)) +
  geom_bar(position = "dodge") +
  labs(title = "Customer Status by Payment Method",
       x = "Customer Status",
       y = "Count") +
  theme_minimal()

Relationship Between Internet Service and Status

internet_status <- customer_retention %>% 
  select(Status, InternetService)

ggplot(internet_status, aes(x = Status, fill = InternetService)) +
  geom_bar() +
  labs(title = "Customer Status by Internet Service",
       x = "Customer Status",
       y = "Count") +
  theme_minimal()


Machine Learning

set.seed(123)
split  <- rsample::initial_split(cus_rt, prop = 0.7, strata = "Status")
cus_rt_train  <- rsample::training(split)
cus_rt_test   <- rsample::testing(split)
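
Because the split is stratified on Status, the class balance should be roughly equal in the training and test sets; a quick check (a sketch, not part of the original pipeline):

cus_rt_train %>% count(Status) %>% mutate(prop = n / sum(n))
cus_rt_test  %>% count(Status) %>% mutate(prop = n / sum(n))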

Models

MARS

# initial un-tuned MARS fit for reference
mars_fit <- mars(mode = "classification", prod_degree = 1) %>%
  fit(Status ~ ., cus_rt_train)

# MARS model with tunable hyperparameters
mars_mod <- mars(
  mode = "classification", 
  num_terms = tune(),   # number of retained model terms
  prod_degree = tune()  # highest degree of interactions
  )

# k-fold cross validation 
set.seed(123)
folds <- vfold_cv(cus_rt_train, v = 5)

# model recipe
mars_recipe <- recipe(Status ~ ., data = cus_rt_train)

# hyperparameter tuning grid
mars_grid <- grid_regular(
  num_terms(range = c(10, 50)),
  prod_degree(),
  levels = 10
 )

# train the model across the hyperparameter grid
mars_results <- tune_grid(mars_mod, mars_recipe, resamples = folds, grid = mars_grid)

# best results
show_best(mars_results, metric = "roc_auc", n = 10)

mars_roc <- show_best(mars_results, metric = "roc_auc", n = 1)
mars_roc

autoplot(mars_results)

mars_best <- select_best(mars_results, metric = "roc_auc")

mars_final_wf <- workflow() %>%
  add_model(mars_mod) %>%
  add_recipe(mars_recipe) %>% 
  finalize_workflow(mars_best)

# Most influential variables
mars_final_wf %>%
   fit(data = cus_rt_train) %>% 
   extract_fit_parsnip() %>%
   vip(num_features = 13)
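
The pdp package loaded at the top can complement the importance plot by showing how the predicted probability of leaving changes with a single variable such as Tenure. This is a minimal sketch, not part of the original analysis; pred_left is a hypothetical wrapper that adapts the tidymodels predict() interface to pdp::partial():

mars_fitted <- mars_final_wf %>% fit(data = cus_rt_train)

# hypothetical wrapper: average predicted probability of "Left"
pred_left <- function(object, newdata) {
  mean(predict(object, new_data = newdata, type = "prob")$.pred_Left)
}

pdp::partial(
  mars_fitted,
  pred.var = "Tenure",
  pred.fun = pred_left,
  train = cus_rt_train,
  plot = TRUE,
  plot.engine = "ggplot2"
)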


Decision Trees


# Decision tree model
dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),  # pruning cost-complexity penalty
  tree_depth = tune(),       # maximum tree depth
  min_n = tune()             # minimum observations per node
 ) %>% 
  set_engine("rpart")
 
# Hyperparameter grid
dt_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n()
 )


dt_recipe <- recipe(Status ~ ., data = cus_rt_train)

# Cross validation object
set.seed(123)
kfold <- vfold_cv(cus_rt_train, v = 5)


dt_results <- tune_grid(dt_mod, dt_recipe, resamples = kfold, grid = dt_grid)

# Best results
show_best(dt_results, metric = "roc_auc", n = 10)
DT_roc <- show_best(dt_results, metric = "roc_auc", n = 1)
Decision_tree_best_hyperparameters <- select_best(dt_results, metric = "roc_auc")

Decision_tree_final_wf <- workflow() %>%
  add_model(dt_mod) %>%
  add_recipe(dt_recipe) %>% 
  finalize_workflow(Decision_tree_best_hyperparameters)

# Top influential variables
Decision_tree_final_wf %>%
   fit(data = cus_rt_train) %>% 
   extract_fit_parsnip() %>%
   vip(num_features = 20)


Random Forests

# Random forest model
rf_mod <- rand_forest(
  mode = "classification", 
  trees = 1000,
  mtry = tune(),   # predictors sampled at each split
  min_n = tune()   # minimum observations per node
  ) %>% 
  set_engine("ranger", importance = "permutation") 

# Resampling procedure
set.seed(123)
kfold <- vfold_cv(cus_rt_train, v = 5)

# Hyperparameter grid
rf_grid <- grid_regular(
   mtry(range = c(2, 80)),
   min_n(range = c(1, 20)),
   levels = 2
   )

# train the model across the hyperparameter grid (reusing the Status ~ . recipe from above)
set.seed(123)
results_Random_forest <- tune_grid(rf_mod, mars_recipe, resamples = kfold, grid = rf_grid)

# results
show_best(results_Random_forest, metric = "roc_auc", n=10)
RF_roc <- show_best(results_Random_forest, metric = "roc_auc", n=1)
Random_forest_best_hyperparameters <- select_best(results_Random_forest, metric = "roc_auc")

Random_forest_final_wf <- workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(mars_recipe) %>% 
  finalize_workflow(Random_forest_best_hyperparameters)

# top influential variables
Random_forest_final_wf %>%
   fit(data = cus_rt_train) %>% 
   extract_fit_parsnip() %>%
   vip(num_features = 19)


Optimal Model Selection

AUC

mars_roc
DT_roc
RF_roc

Based on the cross-validated area under the ROC curve (AUC), the MARS model performs best and would be considered optimal.
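
For a cleaner side-by-side view, the three one-row tibbles can be stacked; a small sketch (bind_rows() fills the models' differing hyperparameter columns with NA):

bind_rows(MARS = mars_roc, DecisionTree = DT_roc, RandomForest = RF_roc, .id = "model") %>% 
  select(model, .metric, mean, std_err)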

Confusion Matrix

MARS
mars_mod_conf <- mars_final_wf %>% 
  fit(data = cus_rt_train)

mars_mod_conf %>%
   predict(cus_rt_test) %>%
   bind_cols(cus_rt_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1383  242
##    Left        157  315

This model misses more customers who actually leave (242 false negatives) than it incorrectly flags as leaving (157 false positives); that is, it tends to underestimate churn.
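
To quantify this imbalance, yardstick can report sensitivity and specificity for the "Left" class; a brief sketch (event_level = "second" assumes "Left" is the second factor level of Status):

churn_metrics <- metric_set(sens, spec)
mars_mod_conf %>%
   predict(cus_rt_test) %>%
   bind_cols(cus_rt_test %>% select(Status)) %>%
   churn_metrics(truth = Status, estimate = .pred_class, event_level = "second")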


Decision Trees
dt_mod_conf <- Decision_tree_final_wf %>% 
  fit(data = cus_rt_train)

dt_mod_conf %>%
   predict(cus_rt_test) %>%
   bind_cols(cus_rt_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1337  261
##    Left        203  296

This model also underestimates churn (261 false negatives against 203 false positives).


Random Forests
rf_mod_conf <- Random_forest_final_wf %>% 
  fit(data = cus_rt_train)

rf_mod_conf %>%
   predict(cus_rt_test) %>%
   bind_cols(cus_rt_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1397  283
##    Left        143  274

This model also underestimates churn, and does so most severely of the three (283 false negatives).

Looking at the confusion matrices, the MARS model performs best because it has the largest number of correct test-set predictions (1383 + 315 = 1698).
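
As a sanity check, the implied accuracy can be computed directly with yardstick, mirroring the prediction pipeline above:

mars_mod_conf %>%
   predict(cus_rt_test) %>%
   bind_cols(cus_rt_test %>% select(Status)) %>%
   accuracy(truth = Status, estimate = .pred_class)
## 1698 correct out of 2097 test observations, roughly 81%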


Business Analysis And Summary

According to our model, the three most influential factors were tenure, monthly charges, and total charges.

We have a few recommendations for Regork after analyzing the customer retention data. One insight is that churn is concentrated early in the customer relationship: most of the customers who left did so within their first 10 months, which is also where our model is least accurate. As tenure increases, the model's accuracy improves, peaking around 20 months. The main trend in the data is that customers tend to leave Regork either at the start of their tenure or once the price becomes too high, as illustrated below.
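
To visualize the early-tenure churn pattern, churn rates can be bucketed by tenure; a minimal sketch (the six-month buckets are an illustrative choice, not from the original analysis):

cus_rt %>% 
  mutate(TenureBucket = cut(Tenure, breaks = seq(0, 72, by = 6), include.lowest = TRUE)) %>% 
  group_by(TenureBucket) %>% 
  summarise(ChurnRate = mean(Status == "Left")) %>% 
  ggplot(aes(TenureBucket, ChurnRate)) +
  geom_col(fill = "#b3e379") +
  labs(title = "Churn Rate by Tenure",
       x = "Tenure (months, 6-month buckets)",
       y = "Share of customers who left")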


Predictions

Prediction of which customers will leave:

cus_left <- mars_mod_conf %>%
  predict(cus_rt_test) %>%
  bind_cols(cus_rt_test) %>% 
  filter(.pred_class == "Left")
cus_left


Recommendations

  • We believe that Regork should look into offering reduced prices at the beginning of each customer's tenure. This would help reduce the large number of customers who leave in the months immediately after signing up, as the calculations below show.
cus_left %>% 
  filter(Tenure <= 12) %>% 
  mutate(new_cus_discount = 0.30 * MonthlyCharges) %>% 
  summarise(disc_lost_rev = sum(new_cus_discount))

cus_left %>% 
  filter(Tenure <= 12) %>% 
  summarise(new_cus_lost_rev = sum(MonthlyCharges))

Here, we can see that if Regork were to offer a 30% discount to customers with a tenure of 12 months or less, it would forgo $7,747 in monthly revenue.

  • However, this is much less than the $25,826.55 in monthly revenue we predict Regork will lose from new customers leaving; a rough net comparison follows below.
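
A back-of-the-envelope view of the monthly net impact, under the optimistic assumption that the discount retains every at-risk new customer:

revenue_at_risk <- 25826.55  # predicted monthly revenue lost to new-customer churn
discount_cost   <- 7747      # monthly revenue forgone by the 30% discount
revenue_at_risk - discount_cost
## roughly $18,080 per month retained, under these assumptions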

Conclusion

Overall, we think the best option for Regork is to offer new customers (tenure of 12 months or less) a discount for their first year of service. We believe this will greatly help Regork retain more customers.