1. Introduction

Summary of report and problem statement:

When entering any new industry, understanding customer behavior is one of the most important ingredients of success. In the following analysis we examine customer retention data, found within the “customer_retention.csv” file, and build a model that performs well at predicting customer retention.

This model will be useful for many future analyses spanning different functional areas of your company. The marketing team might use it to plan promotions for existing customers; the finance team could use it to forecast customer retention trends as they affect company risk, revenue, and bookings.

As you explore the analysis, you will find the packages required for the report, information on the data source, and the data preparation process. You will also find an extensive exploratory analysis, completed prior to modeling, of the trends and relationships between the predictor variables and the response variable. Lastly, you will notice the bulk of the analysis centers on our machine learning models: we assessed four classification methods, namely logistic regression, random forest, bagging, and decision trees. Our conclusions and reasoning are highlighted throughout the summary tab.

2. Packages Required

The associated packages used in our analysis, including a short description:

  • tidyverse: A collection of packages designed for data science
  • vip: A package for constructing variable importance plots
  • tidymodels: Allows you to perform discrete parts of the ML workflow with discrete packages
  • pdp: A package for constructing partial dependence plots and individual conditional expectation curves
  • kernlab: A package for kernel-based machine learning methods
  • baguette: Package includes tree and rule-based models for bagging
library(tidyverse)
library(vip)
library(tidymodels)
library(pdp)
library(kernlab)
library(baguette)
library(dplyr)

3. Data Preparation & Exploratory Data Analysis

A. Data Preparation

As part of our data preparation, we converted our response variable, Status, to a factor, since this analysis centers on a classification model. We also removed the rows containing missing values, of which there were originally 11.

#Importing data sets needed for analysis
retention <- read.csv("customer_retention.csv")

#Data preparation
retention <- retention %>% 
  dplyr::mutate(Status = as.factor(Status))

retention <- drop_na(retention)
retention %>% is.na() %>% sum()   # confirm no missing values remain

Below, we split the original dataset into training and test sets. The training set accounts for 70% of the data and the test set for the remaining 30%. Holding out a test set helps ensure that our model generalizes (is not overfitting) and gives us an honest assessment of its performance.

#Splitting dataset
set.seed(123)
retention_split <- initial_split(retention, prop = .7, strata = "Status")
retention_train <- training(retention_split)
retention_test  <- testing(retention_split)

dim(retention_test)
dim(retention_train)

The following illustrates that our original retention data set has an imbalanced response (Current: ~73%, Left: ~27%). Enforcing stratified sampling ensures that the training and test sets have approximately equal response distributions.

#sample distribution
retention_train %>% 
  mutate(id = 'retention_train') %>% 
  bind_rows(retention_test %>% mutate(id = 'retention_test')) %>%
  ggplot(aes(Status, fill = id)) +
  geom_bar(position = "dodge") + labs(title = "Retention Sample Distribution")

# original response distribution
table(retention$Status) %>% prop.table()

# response distribution for training data
table(retention_train$Status) %>% prop.table()

# response distribution for test data
table(retention_test$Status) %>% prop.table()

B. Exploratory Data Analysis

The following output shows the two levels of our categorical outcome for this classification problem: Current and Left, indicating whether the customer is still with us or left in the past.

#Exploratory data analysis

levels(retention$Status)

Next is an analysis of the full retention data set showing the relationship between different demographics and phone service use. Gender does not appear to play a major role: its distribution is fairly similar between those who use the phone service and those who do not. Senior-citizen status and having dependents, however, show stronger relationships with phone service use.

#Phone service and demographic analysis
retention %>% 
  ggplot(aes(PhoneService, fill=Gender))+
  geom_bar() +
  labs(title = "Gender Count and Phone Service Use")

retention %>% 
  mutate(Senior_Citizen = case_when(
    SeniorCitizen == 1 ~ "Senior Citizen",
    SeniorCitizen == 0 ~ "Not a Senior Citizen")) %>% 
  ggplot(aes(PhoneService, fill=Senior_Citizen))+
  geom_bar()+
  labs(title = "Senior Citizen Count and Phone Service Use")

retention %>% 
  ggplot(aes(PhoneService, fill=Partner))+
  geom_bar()+
  labs(title = "Partner Count and Phone Service Use")

retention %>% 
  ggplot(aes(PhoneService, fill=Dependents))+
  geom_bar()+
  labs(title = "Dependents Count and Phone Service Use")

In addition to understanding how these demographics relate to phone service use, we also wanted to understand how they relate to internet service. Here, the demographics do not appear to have a major impact on internet service use.

#Internet service and demographic analysis
retention %>% 
  ggplot(aes(x=InternetService, fill=Gender))+
  geom_bar() +
  labs(title = "Gender Count and Internet Service Use")

retention %>% 
  mutate(Senior_Citizen = case_when(
    SeniorCitizen == 1 ~ "Senior Citizen",
    SeniorCitizen == 0 ~ "Not a Senior Citizen")) %>% 
  ggplot(aes(InternetService, fill=Senior_Citizen))+
  geom_bar()+
  labs(title = "Senior Citizen Count and Internet Service Use")

retention %>% 
  ggplot(aes(InternetService, fill=Partner))+
  geom_bar()+
  labs(title = "Partner Count and Internet Service Use")

retention %>% 
  ggplot(aes(InternetService, fill=Dependents))+
  geom_bar()+
  labs(title = "Dependents Count and Internet Service Use")

4. Analysis

Model Recipe

The recipe below defines our preprocessing: all numeric predictors are normalized and all nominal predictors are dummy encoded before modeling.

retention_recipe <- recipe(Status ~ ., data = retention_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

Proportion Table

table(retention$Status) %>% prop.table()
## 
##   Current      Left 
## 0.7344018 0.2655982

Logistic Regression

lr_mod <- logistic_reg()

set.seed(123)
kfolds <- vfold_cv(retention_train, v = 5, strata = Status)

log_results <- lr_mod %>% 
  fit_resamples(Status ~ ., kfolds)

collect_metrics(log_results) %>% filter(.metric == "roc_auc")
## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.845     5 0.00521 Preprocessor1_Model1

How will this model behave? Is this a good model?

final_fit <- lr_mod %>%
  fit(Status ~ ., data = retention_train)

final_fit %>%
   predict(retention_test) %>%
   bind_cols(retention_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332
final_fit %>% 
   predict(retention_test, type = "prob") %>%
   mutate(truth = retention_test$Status) %>%
   roc_curve(truth, .pred_Current) %>%
   autoplot()

This model is far better at identifying current customers than customers who have left. When it errs, it more often predicts Current for a customer who actually left (225 cases) than Left for a customer who is actually current (178 cases). In other words, the model is biased toward predicting that a customer is current when the customer has in fact left.
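To quantify this asymmetry, class-specific rates can be computed with yardstick (a sketch; the lr_preds helper is ours, and event_level = "second" treats Left, the second factor level, as the event of interest). From the confusion matrix above, the model catches 332 of 557 actual leavers (sensitivity ≈ 0.60) while correctly keeping 1362 of 1540 current customers as Current (specificity ≈ 0.88).

# class predictions on the test set
lr_preds <- final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status))

# sensitivity for the Left class: share of actual leavers the model catches
sens(lr_preds, truth = Status, estimate = .pred_class, event_level = "second")

# specificity for the Left class: share of current customers kept as Current
spec(lr_preds, truth = Status, estimate = .pred_class, event_level = "second")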

Decision Tree

dt_mod <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

dt_fit <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(dt_mod) %>%
  fit(data = retention_train)

dt_results <- fit_resamples(dt_mod, retention_recipe, kfolds)

collect_metrics(dt_results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.788     5 0.00399 Preprocessor1_Model1
## 2 roc_auc  binary     0.710     5 0.00529 Preprocessor1_Model1
# plot the fitted tree (extract the underlying rpart object from the workflow)
rpart.plot::rpart.plot(extract_fit_engine(dt_fit))

TUNING

dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
 ) %>%
  set_engine("rpart")

# create the hyperparameter grid
dt_hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

# train our model across the hyper parameter grid
set.seed(123)
dt_results <- tune_grid(dt_mod, retention_recipe, resamples = kfolds, grid = dt_hyper_grid)

# get best results
show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estima…¹  mean     n std_err .config
##             <dbl>      <int> <int> <chr>   <chr>     <dbl> <int>   <dbl> <chr>  
## 1    0.0000000001          8    30 roc_auc binary    0.816     5 0.00552 Prepro…
## 2    0.0000000178          8    30 roc_auc binary    0.816     5 0.00552 Prepro…
## 3    0.00000316            8    30 roc_auc binary    0.816     5 0.00552 Prepro…
## 4    0.0000000001          8    40 roc_auc binary    0.815     5 0.00420 Prepro…
## 5    0.0000000178          8    40 roc_auc binary    0.815     5 0.00420 Prepro…
## # … with abbreviated variable name ¹​.estimator
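Rather than scanning the grid table, the tuning results can also be inspected visually with tidymodels' built-in plot method (a quick sketch):

# plot CV AUC across the cost complexity, tree depth, and min_n grid
autoplot(dt_results, metric = "roc_auc")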

How will this model behave?

# get best hyperparameter values
dt_best_model <- select_best(dt_results, metric = 'roc_auc')

# put together final workflow
dt_final_wf <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(dt_mod) %>%
  finalize_workflow(dt_best_model)

# fit final workflow across entire training data
dt_final_fit <- dt_final_wf %>%
  fit(data = retention_train)

# determination of model quality
dt_final_fit %>%
   predict(retention_test) %>%
   bind_cols(retention_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1316  243
##    Left        224  314
dt_final_fit %>% 
   predict(retention_test, type = "prob") %>%
   mutate(truth = retention_test$Status) %>%
   roc_curve(truth, .pred_Current) %>%
   autoplot()

This model, too, is far better at identifying current customers than customers who have left. When it errs, it predicts Current for a customer who actually left (243 cases) slightly more often than Left for a customer who is actually current (224 cases), so it is slightly biased toward predicting that a customer is current when the customer has in fact left.

Bagging

# create bagged CART model object; start with 5 bagged trees
bag_mod <- bag_tree() %>%
  set_engine("rpart", times = 5) %>%
  set_mode("classification")

# train model
bag_results <- fit_resamples(bag_mod, retention_recipe, kfolds)

# model results
collect_metrics(bag_results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.765     5 0.00205 Preprocessor1_Model1
## 2 roc_auc  binary     0.769     5 0.00443 Preprocessor1_Model1

TUNING

# create bagged CART model object with
# tuning option set for number of bagged trees
bag_mod <- bag_tree() %>%
  set_engine("rpart", times = tune()) %>%
  set_mode("classification")

# create the hyperparameter grid
bag_hyper_grid <- expand.grid(times = c(5, 25, 50, 100, 200, 300))

# train our model across the hyper parameter grid
set.seed(123)
bag_results <- tune_grid(bag_mod, retention_recipe, resamples = kfolds, grid = bag_hyper_grid)

# model results
show_best(bag_results, metric = "roc_auc")
## # A tibble: 5 × 7
##   times .metric .estimator  mean     n std_err .config             
##   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1   200 roc_auc binary     0.822     5 0.00352 Preprocessor1_Model5
## 2   300 roc_auc binary     0.821     5 0.00432 Preprocessor1_Model6
## 3   100 roc_auc binary     0.821     5 0.00462 Preprocessor1_Model4
## 4    50 roc_auc binary     0.815     5 0.00361 Preprocessor1_Model3
## 5    25 roc_auc binary     0.809     5 0.00457 Preprocessor1_Model2

How will this model behave?

# get best hyperparameter values
bag_best_model <- select_best(bag_results, metric = 'roc_auc')

# put together final workflow
bag_final_wf <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(bag_mod) %>%
  finalize_workflow(bag_best_model)

# fit final workflow across entire training data
bag_final_fit <- bag_final_wf %>%
  fit(data = retention_train)

# determination of model quality
bag_final_fit %>%
   predict(retention_test) %>%
   bind_cols(retention_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1342  263
##    Left        198  294
bag_final_fit %>% 
   predict(retention_test, type = "prob") %>%
   mutate(truth = retention_test$Status) %>%
   roc_curve(truth, .pred_Current) %>%
   autoplot()

This model is likewise far better at identifying current customers than customers who have left. When it errs, it more often predicts Current for a customer who actually left (263 cases) than Left for a customer who is actually current (198 cases), so it is biased toward predicting that a customer is current when the customer has in fact left.

Random Forest

# create random forest model object and
# use the ranger package for the engine
rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger")

# train model
rf_results <- fit_resamples(rf_mod, retention_recipe, kfolds)

# model results
collect_metrics(rf_results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.799     5 0.00356 Preprocessor1_Model1
## 2 roc_auc  binary     0.843     5 0.00437 Preprocessor1_Model1

TUNING

# create random forest model object with tuning option
rf_mod <- rand_forest(
  mode = "classification",
  trees = tune(),
  mtry = tune(),
  min_n = tune()
) %>%
set_engine("ranger", importance = "impurity")

# create the hyperparameter grid; mtry is capped at the ~30 predictor
# columns produced by the recipe's dummy encoding, since ranger errors
# if mtry exceeds the number of available predictors
rf_hyper_grid <- grid_regular(
  trees(range = c(50, 800)),
  mtry(range = c(2, 30)),
  min_n(range = c(1, 20)),
  levels = 5
)

# train our model across the hyper parameter grid
set.seed(123)
rf_results <- tune_grid(rf_mod, retention_recipe, resamples = kfolds, grid = rf_hyper_grid)

# model results
show_best(rf_results, metric = "roc_auc")

How will this model behave?

# get optimal hyperparameters
rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")

# create final workflow object
final_rf_wf <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(rf_mod) %>%
  finalize_workflow(rf_best_hyperparameters)

# fit final workflow object
rf_final_fit <- final_rf_wf %>%
  fit(data = retention_train)

# determination of model quality
rf_final_fit %>%
   predict(retention_test) %>%
   bind_cols(retention_test %>% select(Status)) %>%
   conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1442  343
##    Left         98  214
rf_final_fit %>% 
   predict(retention_test, type = "prob") %>%
   mutate(truth = retention_test$Status) %>%
   roc_curve(truth, .pred_Current) %>%
   autoplot()

This model is the most lopsided of the four. When it errs, it predicts Current for a customer who actually left far more often (343 cases) than it predicts Left for a customer who is actually current (98 cases). It is therefore strongly biased toward predicting that a customer is current when the customer has in fact left.
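Before settling on a winner, it helps to line up the cross-validated AUC of all four candidates in one table (a sketch, assuming the log_results, dt_results, bag_results, and rf_results objects from the sections above are still in scope):

# best CV AUC for each candidate model
bind_rows(
  collect_metrics(log_results) %>%
    filter(.metric == "roc_auc") %>%
    mutate(model = "logistic regression"),
  show_best(dt_results, metric = "roc_auc", n = 1) %>% mutate(model = "decision tree"),
  show_best(bag_results, metric = "roc_auc", n = 1) %>% mutate(model = "bagging"),
  show_best(rf_results, metric = "roc_auc", n = 1) %>% mutate(model = "random forest")
) %>%
  select(model, mean, std_err) %>%
  arrange(desc(mean))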

Optimal Model: Logistic Regression

collect_metrics(log_results, summarize = FALSE) %>% filter(.metric == "roc_auc")
## # A tibble: 5 × 5
##   id    .metric .estimator .estimate .config             
##   <chr> <chr>   <chr>          <dbl> <chr>               
## 1 Fold1 roc_auc binary         0.836 Preprocessor1_Model1
## 2 Fold2 roc_auc binary         0.839 Preprocessor1_Model1
## 3 Fold3 roc_auc binary         0.865 Preprocessor1_Model1
## 4 Fold4 roc_auc binary         0.839 Preprocessor1_Model1
## 5 Fold5 roc_auc binary         0.845 Preprocessor1_Model1
final_fit <- lr_mod %>%
  fit(Status ~ ., data = retention_train)

tidy(final_fit)
## # A tibble: 31 × 5
##    term                          estimate std.error statistic   p.value
##    <chr>                            <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                     1.02     0.969      1.05    2.95e- 1
##  2 GenderMale                     -0.0461   0.0779    -0.592   5.54e- 1
##  3 SeniorCitizen                   0.258    0.101      2.55    1.08e- 2
##  4 PartnerYes                     -0.141    0.0925    -1.52    1.29e- 1
##  5 DependentsYes                  -0.0475   0.108     -0.441   6.59e- 1
##  6 Tenure                         -0.0646   0.00762   -8.47    2.35e-17
##  7 PhoneServiceYes                -0.0417   0.774     -0.0539  9.57e- 1
##  8 MultipleLinesNo phone service  NA       NA         NA      NA       
##  9 MultipleLinesYes                0.442    0.211      2.10    3.56e- 2
## 10 InternetServiceFiber optic      1.48     0.950      1.56    1.18e- 1
## # … with 21 more rows
vip::vip(final_fit)

Which predictor variables appear to be most influential in customer behavior?

The Tenure predictor variable appears to be the most influential on customer behavior, followed by having a two-year contract: the length of a customer's tenure and a two-year contract contribute most to the customer's status (current or left).

Why are those specific predictor variables the most influential?

Those predictor variables are the most influential because they receive the highest importance scores from the vip function. For a logistic regression, vip bases importance on the magnitude of each term's z-statistic, so these predictors' estimated effects are large relative to their standard errors.
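The same ranking can be pulled as a table with vip's vi() helper (a sketch):

# importance scores behind the vip plot, largest first
vip::vi(final_fit) %>% head(10)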

For the optimal model selected, use the test set to compute the generalization error so that Regork Telecom leadership understands what to expect on new data

final_fit %>% 
  predict(retention_test, type = "prob") %>%
  mutate(truth = retention_test$Status) %>%
  roc_auc(truth, .pred_Current)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.845

How does this generalization error compare to the cross validation error seen in earlier results?

The test-set AUC (0.845) is essentially identical to the mean cross-validation AUC (0.845) seen earlier, so the generalization error matches what cross-validation led us to expect.

As a person responsible for making business decisions, what else are you learning from the observations in this section?

Our model performed well under cross-validation on the training data and just as well on the held-out test set. This suggests the model is not overfitting and should achieve a similar success rate on future data.

5. Business Analysis & Conclusion

In terms of relative importance, how would you rate the predictors in your model? As a business manager, which factors would you focus on (for example, you could invest in offering some incentives or promotions) to decrease the chances of customers leaving?

In our optimal model, Total Charges and Tenure are the most influential variables, followed closely by Monthly Charges and fiber-optic Internet Service. Beyond those, importance drops off sharply, with several predictors close to zero.
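That drop-off is easier to see if the importance plot is extended past the default ten features (a sketch using vip's num_features argument):

# widen the importance plot to make the steep decline visible
vip::vip(final_fit, num_features = 20)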

With Total Charges being the most influential variable, it is important as a business manager to focus on what this implies. We believe offering promotions or incentives would be highly beneficial for Regork's customer retention: customers leave because of the price point and the total charges that come with it, so incentive programs give them more reason to stay. This would need to be a gradual rollout, so it does not create a sudden hit to finances, but over time it should benefit the company as a whole by keeping customers coming back.

Collect all the customers from the test dataset that you predict are going to leave.
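A sketch of that collection step, assuming final_fit is the fitted logistic regression from the analysis section:

# attach class predictions to the test set and keep the predicted leavers
predicted_leavers <- final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test) %>%
  filter(.pred_class == "Left")

# 510 customers, matching the Left row of the confusion matrix above
nrow(predicted_leavers)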

When we break down the demographics of the customers predicted to leave because of charges and tenure, the majority are male, not senior citizens, single, and without dependents. We believe this relates to income levels and independence in spending: senior citizens have retirement savings and no child-related payment obligations, giving them more flexibility in how they spend; single customers have only one income in their household; and younger customers without dependents of their own may have less freedom in spending because they still rely on their parents.

What is the predicted loss in revenue per month if no action is taken?

If no action is taken on this issue, Regork will face a real loss in revenue. The loss we predict based on our model is $480,925.80. It is important to note that this is only a prediction based on the demographic groups most likely to leave Regork; it does not guarantee that every customer in those groups will leave. The figure was obtained by filtering to those customers and summing their total charges. We do not believe all of these customers will leave the company, but they are the most likely to go.
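One way to approximate that figure (a sketch, assuming the charge columns follow the dataset's CamelCase naming, TotalCharges and MonthlyCharges, and reusing the predicted_leavers object from above; the report sums total charges, while summing monthly charges would give a strictly per-month view):

# revenue attributed to customers the model expects to lose
predicted_leavers %>%
  summarize(
    total_charges_at_risk   = sum(TotalCharges),
    monthly_charges_at_risk = sum(MonthlyCharges)
  )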

Propose an incentive scheme to your manager to retain these customers.

The incentive scheme we propose consists of a few options. We could send customers a gift or a discount on their birthday or on the anniversary of their joining; recognition on an important date rarely feels redundant. A rewards program, with frequent reminders of a customer's status and tips on how to move up a tier, would give customers an incentive to return while keeping Regork fresh in their minds. Lastly, we could ask for feedback in exchange for a reward, which benefits both parties: the company learns what is and is not working, and the customer gets something in return. Even if the feedback reveals that a customer will not stay, it builds a positive relationship and may lead that customer to recommend us to someone else.

This all ties directly back to the model: the biggest issue we see is total charges, and each of these incentive schemes provides some form of reward (discounts) that helps bring total charges down for our customers.

Conclusion/Implications

To conclude this report, we believe the best way to retain customers is to offer discounts and incentive programs. This will be a slower implementation, since it will take careful observation and conversation with our Finance Department to understand how large a discount we can offer and at what cost. Even so, we believe it is the most effective way to keep customers from leaving Regork.

Our insights show how large a factor cost is in whether or not a customer stays. While not many customers are leaving Regork, the number is large enough to be alarming and worth extra time and money to address. The upfront cost of these benefits and incentives will be worth it in the long run: they will keep existing customers with us and build positive relationships that, in turn, bring in new customers.

Limitations

A helpful follow-up would be a deeper look at demographics. We dove into a few, but a clearer picture would emerge if we could look at age more specifically, or even at race, occupation, and specific income levels. This information would paint a more detailed picture of specifically WHO is leaving Regork and WHY.