Introduction

Some Text

Prerequisites

Libraries

We will use a variety of libraries to support our initial exploratory analysis and to build and evaluate our machine learning models.

library(tidyverse)    #Data cleaning, manipulation, visualization
library(tidymodels)   #Creating machine learning models
library(vip)          #Constructing variable importance plots
library(earth)        #Creating MARS models
library(baguette)     #Creating bagged tree models
library(pdp)          #Constructing partial dependency plots
library(ranger)       #Creating Random Forest models

Importing Data

We will be analyzing a data set containing customer retention information. The information is provided as a CSV flat file, which we first need to import in order to conduct our analysis. The data set contains information on 7,000 customers.

#Grab file path to data file
path <- here::here("customer_retention.csv")

#Import data file
retention <- readr::read_csv(path)

Prepping & Splitting Data

For our purposes, we will omit missing (NA) values, as they may bias the results of our machine learning models or reduce their accuracy.

#Recode response variable as a factor
retention <- mutate(retention, Status = as.factor(Status))

#Remove null values
retention <- na.omit(retention)

In order to build our machine learning models later, we must split the data into a set to train our models and a set to test their performance. We will be using a 70-30 train-test split in order to optimize the generalizability of our models.

#For reproducibility
set.seed(123)

#70-30 train-test split
split <- initial_split(retention, prop = 0.7, strata = "Status")
train <- training(split)
test <- testing(split)

Exploratory Data Analysis (EDA)

Before beginning to build our machine learning models, it is important that we understand some of the underlying trends and relationships in our data.

#payment method by account status
retention %>%
  group_by(PaymentMethod, Status) %>%
  summarize(count = n()) %>%
  ggplot(aes(x = PaymentMethod, y = count, fill = Status)) +
    geom_bar(stat = "identity", position = "dodge") +
    labs(title = "Customer Payment Method by Account Status",
         x = "Payment Method",
         y = "Number of Customers") +
    scale_fill_manual(values = c("dodgerblue","orange2")) +
    theme_minimal()

Some text

#proportion of total customers by payment method
proportions <- retention %>%
  group_by(PaymentMethod) %>%
  summarize(count = n()) %>%
  mutate(proportion = count / sum(count))

ggplot(proportions, aes(x = PaymentMethod, y = proportion)) +
  geom_bar(stat = "identity")

Some Text

#customers that left by tenure
retention %>%
  filter(Status == "Left") %>%
    group_by(Tenure) %>%
    summarise(count = n()) %>%
    ggplot(aes(x = Tenure, y = count)) +
    geom_line() +
    geom_point()

Some text

#customers that left by contract type
retention %>%
  filter(Status == "Left") %>%
    ggplot(aes(x = Contract)) +
    geom_bar()

Some Text

Machine Learning

The problem that we are looking to solve is a classification problem. To assess the performance of our models we will be looking at the area under the ROC curve (AUC).
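
As a brief illustration of what this metric captures (a toy sketch, not part of our analysis), yardstick's roc_auc() function, loaded with tidymodels, takes the true class and the predicted probability of the event class and measures how well those probabilities rank the two classes. The real values in the sections below come from collect_metrics().

#toy example: 2 current customers and 2 that left, with predicted probabilities
toy <- tibble(
  Status = factor(c("Current", "Current", "Left", "Left")),
  .pred_Current = c(0.9, 0.4, 0.6, 0.2)
)

#3 of the 4 Current/Left pairs are ranked correctly, so the AUC here is 0.75
roc_auc(toy, truth = Status, .pred_Current)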

Before building any models, we create a recipe that normalizes our numeric predictors, putting them on a common, comparable scale, and dummy encodes our nominal (categorical) predictors so that they take on a numeric form.

Recipe
recipe <- recipe(Status ~ ., data = train) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_dummy(all_nominal_predictors())
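
To sanity check what these steps produce, we can prep the recipe on the training data and bake it. This is an optional sketch; the workflows below apply the recipe automatically.

#optional check: materialize the recipe to see the normalized, dummy-encoded data
recipe %>%
  prep(training = train) %>%
  bake(new_data = NULL) %>%
  glimpse()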

Logistic Regression

Some Text

set.seed(123)

lr_kfold <- vfold_cv(train, v = 5, strata = Status)

lr_results <- logistic_reg() %>%
  fit_resamples(Status ~ ., lr_kfold)

#mean AUC
collect_metrics(lr_results) %>% filter(.metric == "roc_auc")
## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.845     5 0.00521 Preprocessor1_Model1
#AUC across all folds
collect_metrics(lr_results, summarize = FALSE) %>% filter(.metric == "roc_auc")
## # A tibble: 5 × 5
##   id    .metric .estimator .estimate .config             
##   <chr> <chr>   <chr>          <dbl> <chr>               
## 1 Fold1 roc_auc binary         0.836 Preprocessor1_Model1
## 2 Fold2 roc_auc binary         0.839 Preprocessor1_Model1
## 3 Fold3 roc_auc binary         0.865 Preprocessor1_Model1
## 4 Fold4 roc_auc binary         0.839 Preprocessor1_Model1
## 5 Fold5 roc_auc binary         0.845 Preprocessor1_Model1

Based on the mean AUC from our 5-fold resampling procedure (roughly 0.845), our logistic regression model does a good job of discriminating between customers who leave and customers who stay: it ranks a randomly chosen departing customer as riskier than a randomly chosen current customer about 85% of the time.

Looking at each of the 5 folds individually, the best fold achieved an AUC of 0.865 and the worst 0.836.

lr_final <- logistic_reg() %>%
  fit(Status ~ ., data = train)

lr_final %>%
  predict(test) %>%
  bind_cols(test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332

As we can see from the confusion matrix, our logistic regression model produces more false negatives (customers predicted to stay who actually left) than false positives (customers predicted to leave who actually stayed). Since we are attempting to predict customer retention, false positives are likely the preferable error: flagging a customer as at risk lets Regork take action to prevent a departure before it occurs.
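
To put rough numbers on that trade-off, we can summarize the same confusion matrix with yardstick. This is a quick sketch; note that sens and spec here treat "Current", the first factor level, as the event.

#class-level metrics derived from the confusion matrix above
lr_final %>%
  predict(test) %>%
  bind_cols(test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class) %>%
  summary() %>%
  filter(.metric %in% c("sens", "spec", "precision", "recall"))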

vip(lr_final$fit, num_features = 10)

Tenure, contract length, and total charges are among the most influential variables affecting customer retention according to our logistic regression model.

Hyperparameter Tuning

While our out-of-the-box logistic regression model performed fairly well, we thought we might be able to increase the AUC by performing some hyperparameter tuning.

lr_hyper <- logistic_reg(mixture = tune(), penalty = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

lr_grid <- grid_regular(mixture(), penalty(), levels = 10)

lr_wf <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(lr_hyper)

lr_tuning_results <- lr_wf %>%
  tune_grid(resamples = lr_kfold, grid = lr_grid)

lr_tuning_results %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean)) %>%
  print(n = 5)
## # A tibble: 100 × 8
##         penalty mixture .metric .estimator  mean     n std_err .config          
##           <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>            
## 1 0.0000000001    0.889 roc_auc binary     0.845     5 0.00523 Preprocessor1_Mo…
## 2 0.00000000129   0.889 roc_auc binary     0.845     5 0.00523 Preprocessor1_Mo…
## 3 0.0000000167    0.889 roc_auc binary     0.845     5 0.00523 Preprocessor1_Mo…
## 4 0.000000215     0.889 roc_auc binary     0.845     5 0.00523 Preprocessor1_Mo…
## 5 0.00000278      0.889 roc_auc binary     0.845     5 0.00523 Preprocessor1_Mo…
## # ℹ 95 more rows
#autoplot(lr_tuning_results)

Based on these results, hyperparameter tuning has at best a marginal effect on our AUC: the best tuned configuration's mean AUC (0.845) is essentially the same as the untuned model's.

lr_best_hyper <- select_best(lr_tuning_results, metric = "roc_auc")

lr_final_wf <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(lr_hyper) %>%
  finalize_workflow(lr_best_hyper)

lr_hyper_final <- lr_final_wf %>%
  fit(data = train)

lr_hyper_final %>%
  predict(test) %>%
  bind_cols(test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1360  225
##    Left        180  332

Tuning our model affected our confusion matrix only slightly as well.
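
To complement the confusion matrix, we can also check the tuned model's AUC on the held-out test set. This is a quick sketch using class probabilities from the same prediction pipeline.

#test set AUC for the tuned logistic regression workflow
lr_hyper_final %>%
  predict(test, type = "prob") %>%
  bind_cols(test %>% select(Status)) %>%
  roc_auc(truth = Status, .pred_Current)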

lr_hyper_final %>%
  extract_fit_parsnip() %>%
  vip(10)

Some Text

MARS

Some Text

# create MARS model object
mars <- mars(mode = "classification", num_terms = tune(), prod_degree = tune())

set.seed(123)
mars_kfold <- vfold_cv(train, v = 5)

# create a hyper parameter tuning grid
mars_grid <- grid_regular(
 num_terms(range = c(1, 100)), 
 prod_degree(),
 levels = 25
 )

# train our model across the hyper parameter grid
mars_results <- tune_grid(mars, recipe, resamples = mars_kfold, grid = mars_grid)

# get best results
show_best(mars_results, metric = "roc_auc")
## # A tibble: 5 × 8
##   num_terms prod_degree .metric .estimator  mean     n std_err .config          
##       <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>            
## 1        17           1 roc_auc binary     0.847     5 0.00946 Preprocessor1_Mo…
## 2        21           1 roc_auc binary     0.847     5 0.00930 Preprocessor1_Mo…
## 3        25           1 roc_auc binary     0.847     5 0.00930 Preprocessor1_Mo…
## 4        29           1 roc_auc binary     0.847     5 0.00930 Preprocessor1_Mo…
## 5        34           1 roc_auc binary     0.847     5 0.00930 Preprocessor1_Mo…

Some Text

mars_best_hyperparameters <- select_best(mars_results, metric = "roc_auc")

mars_final_wf <- workflow() %>%
  add_model(mars) %>%
  add_recipe(recipe) %>%
  finalize_workflow(mars_best_hyperparameters)

mars_final_fit <- mars_final_wf %>%
  fit(data = train)

mars_final_fit %>%
  predict(test) %>%
  bind_cols(test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1382  242
##    Left        158  315

Some Text

mars_final_wf %>%
   fit(data = train) %>% 
   extract_fit_parsnip() %>%
   vip(10, type = "rss")

Some Text

Decision Tree

Some Text

dt <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

# Step 3: fit model workflow
dt_fit <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(dt) %>%
  fit(data = train)

# create resampling procedure
set.seed(123)
dt_kfold <- vfold_cv(train, v = 5)

# train model
dt_results <- fit_resamples(dt, recipe, dt_kfold)

# model results
collect_metrics(dt_results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.787     5 0.00560 Preprocessor1_Model1
## 2 roc_auc  binary     0.703     5 0.00571 Preprocessor1_Model1

Our AUC is considerably lower than it was with the MARS and logistic regression models.

rpart.plot::rpart.plot(dt_fit$fit$fit$fit)

This tree has only three levels of splits, so it is quite possible that our model's performance could be improved by tuning.

Hyperparameter Tuning

Some Text (about the hyperparameters and their meaning)

#hyperparameter tuning
dt_hyper <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
  ) %>%
  set_engine("rpart")

# create the hyperparameter grid
dt_hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
  )

# train our model across the hyper parameter grid
set.seed(123)
dt_hyper_results <- tune_grid(dt_hyper, 
                             recipe, 
                             resamples = dt_kfold, 
                             grid = dt_hyper_grid)

# get best results
show_best(dt_hyper_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001          8    30 roc_auc binary     0.820     5 0.00898
## 2    0.0000000178          8    30 roc_auc binary     0.820     5 0.00898
## 3    0.00000316            8    30 roc_auc binary     0.820     5 0.00898
## 4    0.0000000001          8    40 roc_auc binary     0.819     5 0.00936
## 5    0.0000000178          8    40 roc_auc binary     0.819     5 0.00936
## # ℹ 1 more variable: .config <chr>

Tuning our decision tree substantially increased its AUC, from roughly 0.70 without tuning to about 0.82 with hyperparameter tuning.

# get best hyperparameter values
dt_best_model <- select_best(dt_hyper_results, metric = 'roc_auc')

# put together final workflow
dt_final_wf <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(dt_hyper) %>%
  finalize_workflow(dt_best_model)

# fit final workflow across entire training data
dt_final_fit <- dt_final_wf %>%
  fit(data = train)

rpart.plot::rpart.plot(dt_final_fit$fit$fit$fit)

# confusion matrix
dt_final_fit %>%
   predict(test) %>%
   bind_cols(test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1316  243
##    Left        224  314

Some Text

# plot feature importance
dt_final_fit %>%
  extract_fit_parsnip() %>%
  vip(10)

Some Text

Bagging

Some Text

# create resampling procedure
set.seed(123)
bag_kfold <- vfold_cv(train, v = 5)

# create bagged CART model object with 5 bagged trees
bag_mod <- bag_tree() %>%
  set_engine("rpart", times = 5) %>%
  set_mode("classification")

# train model
bag_results <- fit_resamples(bag_mod, recipe, bag_kfold)

# model results
collect_metrics(bag_results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.762     5 0.00879 Preprocessor1_Model1
## 2 roc_auc  binary     0.769     5 0.00884 Preprocessor1_Model1

Our out-of-the-box AUC for the bagged CART model is higher than that of the untuned decision tree, but lower than that of our MARS and logistic regression models. With hyperparameter tuning, we expect the AUC to rise.

Hyperparameter Tuning

Some Text (about the meaning of our hyperparameters)

bag_hyper <- bag_tree() %>%
  set_engine("rpart", times = tune()) %>%
  set_mode("classification")

# create the hyperparameter grid
bag_hyper_grid <- expand.grid(times = c(5, 25, 50, 100, 200, 300))

# train our model across the hyper parameter grid
set.seed(123)
bag_hyper_results <- tune_grid(bag_hyper, recipe, resamples = bag_kfold, grid = bag_hyper_grid)

# model results
show_best(bag_hyper_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 7
##   times .metric .estimator  mean     n std_err .config             
##   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1   200 roc_auc binary     0.819     5 0.0106  Preprocessor1_Model5
## 2   300 roc_auc binary     0.819     5 0.0108  Preprocessor1_Model6
## 3   100 roc_auc binary     0.816     5 0.0105  Preprocessor1_Model4
## 4    50 roc_auc binary     0.811     5 0.0108  Preprocessor1_Model3
## 5    25 roc_auc binary     0.807     5 0.00897 Preprocessor1_Model2

Some text about the AUC

# identify best model
bag_best_hyperparameters <- bag_hyper_results %>%
   select_best(metric = "roc_auc")

# finalize workflow object
bag_final_wf <- workflow() %>%
   add_recipe(recipe) %>%
   add_model(bag_hyper) %>%
   finalize_workflow(bag_best_hyperparameters)

# final fit on training data
bag_final_fit <- bag_final_wf %>%
   fit(data = train)

# confusion matrix
bag_final_fit %>%
   predict(test) %>%
   bind_cols(test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1340  266
##    Left        200  291

This model is still more prone to false negatives.

bag_vip <- bag_final_fit %>%
   extract_fit_parsnip() %>%
   .[['fit']] %>%
   var_imp() %>%
   slice(1:10)

ggplot(bag_vip, aes(value, reorder(term, value))) +
   geom_col() +
   ylab(NULL) +
   xlab("Feature importance")

The most important features for our CART bagged model are consistent with those we found to be most important in our decision tree model.

Random Forest

Some Text

# create resampling procedure
set.seed(123)
rf_kfold <- vfold_cv(train, v = 5)

# create random forest object
rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger")

# train model
rf_results <- fit_resamples(rf_mod, recipe, rf_kfold)

# model results
collect_metrics(rf_results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.801     5 0.00612 Preprocessor1_Model1
## 2 roc_auc  binary     0.842     5 0.0104  Preprocessor1_Model1

Our random forest model’s performance is on par with our best models even without tuning.

Hyperparameter Tuning

Some text (about the hyperparameters)

# create random forest model object with tuning option
rf_hyper <- rand_forest(
  mode = "classification",
  trees = tune(),
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity")

# create the hyperparameter grid
rf_hyper_grid <- grid_regular(
  trees(range = c(50, 800)),
  mtry(range = c(2, 30)),
  min_n(range = c(1, 20)),
  levels = 5
)

# train our model across the hyper parameter grid
set.seed(123)
rf_hyper_results <- tune_grid(rf_hyper, recipe, resamples = rf_kfold, grid = rf_hyper_grid)

# model results
show_best(rf_hyper_results, metric = "roc_auc")
## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     9   612    20 roc_auc binary     0.841     5  0.0107 Preprocessor1_Model1…
## 2     9   425    20 roc_auc binary     0.840     5  0.0106 Preprocessor1_Model1…
## 3     9   237    20 roc_auc binary     0.840     5  0.0107 Preprocessor1_Model1…
## 4     2   237     1 roc_auc binary     0.840     5  0.0112 Preprocessor1_Model0…
## 5     9   800    20 roc_auc binary     0.840     5  0.0107 Preprocessor1_Model1…

Some Text about the AUC

# get optimal hyperparameters
rf_best_hyperparameters <- select_best(rf_hyper_results, metric = "roc_auc")

# create final workflow object
final_rf_wf <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(rf_hyper) %>%
  finalize_workflow(rf_best_hyperparameters)

# fit final workflow object
rf_final_fit <- final_rf_wf %>%
  fit(data = train)

# confusion matrix
rf_final_fit %>%
   predict(test) %>%
   bind_cols(test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1376  261
##    Left        164  296

Some Text

# plot feature importance
rf_final_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

Some Text

Conclusions

Mention what the optimal model is
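
One way to ground that choice is to compare the final fitted workflows on the held-out test set. This is a sketch, assuming the final fit objects created in the sections above.

#sketch: test set AUC for each final fitted workflow
final_fits <- list(
  logistic_regression = lr_hyper_final,
  mars                = mars_final_fit,
  decision_tree       = dt_final_fit,
  bagged_trees        = bag_final_fit,
  random_forest       = rf_final_fit
)

purrr::imap_dfr(final_fits, function(fit, nm) {
  fit %>%
    predict(test, type = "prob") %>%
    bind_cols(test %>% select(Status)) %>%
    roc_auc(truth = Status, .pred_Current) %>%
    mutate(model = nm)
}) %>%
  arrange(desc(.estimate))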

Some Text

In terms of relative importance, how would you rate the predictors in your model? As a business manager, which factors would you focus on (for example, investing in incentives or promotions) to decrease the chances of customers leaving? Collect all the customers from the test dataset that you predict are going to leave: what is the predicted loss in revenue per month if no action is taken? Propose an incentive scheme to your manager to retain these customers, and use your model to justify the proposal. You can do this by performing a cost-benefit analysis, comparing the cost of the incentive plan to the benefit of retaining the customers.
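
As a starting point for those questions, here is a hedged sketch. It assumes the data includes a MonthlyCharges column (the actual column name may differ) and uses the random forest fit as the working model; the incentive cost and retention rate are purely illustrative.

#sketch: customers in the test set we predict will leave
#(MonthlyCharges is an assumed column name -- adjust to the actual data)
predicted_leavers <- rf_final_fit %>%
  predict(test) %>%
  bind_cols(test) %>%
  filter(.pred_class == "Left")

#predicted monthly revenue at risk if no action is taken
predicted_leavers %>%
  summarize(customers = n(), monthly_revenue_at_risk = sum(MonthlyCharges))

#simple cost-benefit frame with hypothetical numbers: the incentive is worth
#offering if the revenue retained exceeds what we spend on the incentive
incentive_cost <- 20    #hypothetical cost per targeted customer per month
retention_rate <- 0.30  #hypothetical share of targeted customers retained
predicted_leavers %>%
  summarize(
    incentive_spend  = incentive_cost * n(),
    retained_revenue = retention_rate * sum(MonthlyCharges),
    net_benefit      = retained_revenue - incentive_spend
  )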