Regork Grocery Chain: Customer Churn and Machine Learning

Introduction

Entering a new industry comes with many barriers to entry and business problems to solve. Figuring out how to retain customers effectively is one of the biggest, if not the biggest, of them all. Using the retention data provided, I first performed an exploratory analysis to find trends between certain demographics and products and the tenure length of customers. I then compared three machine learning models to find the one that best predicts customer churn: logistic regression, decision trees, and random forests.

Data Preparation and Exploratory Analysis

Before the analysis can begin, it is necessary to load the following packages into R:

library(tidyverse)
library(vip)
library(tidymodels)
library(pdp)
library(kernlab)
library(baguette)
library(dplyr)
library(rpart.plot)

Next, we load the data provided, convert the response variable into a factor, and remove the NA values:

#Importing data 
retention <- read.csv("customer_retention.csv")

#factor conversion
retention <- retention %>% 
  dplyr::mutate(Status = as.factor(Status))

#Drop NA values
retention <- drop_na(retention)
retention %>% is.na() %>% sum()
## [1] 0
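
If we wanted to see which columns contained the missing values before dropping them, a per-column count would work (a quick sketch; this would need to run before the drop_na() step above):

# count missing values in each column of the raw data
read.csv("customer_retention.csv") %>%
  is.na() %>%
  colSums()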

To begin the exploratory data analysis, I first wanted to visualize the distribution of current customers and those who have already left. A proportion table and a simple bar graph will suffice.

# original response distribution
table(retention$Status) %>% 
  prop.table()
## 
##   Current      Left 
## 0.7344018 0.2655982
retention %>%
  ggplot(aes(Status)) +
  geom_bar() +
  labs(title = "Distribution of Current and Left Customers")

The main purpose of this report is to effectively predict customer churn. Therefore, the most useful variable to explore in this analysis is tenure length; specifically, which demographics and products have an effect on tenure length.

First, I want to determine whether household size has any effect on tenure length. For example, do larger households usually stick with one service for longer? Do smaller households switch between service providers more often? To visualize this, I created a facet grid with length of tenure in months on the x-axis and a count of customers on the y-axis, faceting the dependents variable by rows and the partner variable by columns to produce the graph below.

ggplot(retention, aes(Tenure)) +
  geom_bar(fill = "blue") +
  facet_grid(Dependents~Partner) +
  ggtitle("Tenure difference with Partners and Dependents") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(y = "Count of Customers", x = "Length of Tenure (months)")

As you can see in the top-left panel, where the customer has neither a partner nor dependents, there is a large number of customers with a very short tenure. We can infer that perhaps these individuals are trying to find a provider that best suits their needs and can afford to switch between providers to find a match. We can also see in the two panels on the right, where the customer has a partner (with or without dependents), that there are more customers with a longer tenure length. This could indicate that the larger the household, the more likely it is to stick with a single provider for longer.
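
To check this inference numerically, we could compare median tenure across the four household groups (a quick sketch using the same variables):

# median tenure by household composition
retention %>%
  group_by(Partner, Dependents) %>%
  summarise(median_tenure = median(Tenure), customers = n(), .groups = "drop")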

Next, I want to see if there are any trends between the length of tenure and whether a customer has phone service, internet service, or both. To do this, I created another facet grid with length of tenure on the x-axis and a count of customers on the y-axis, with phone service faceted by rows and internet service by columns. The resulting graph is shown below.

ggplot(retention, aes(Tenure)) +
  geom_bar(fill = "blue") +
  facet_grid(PhoneService~InternetService) +
  ggtitle("Tenure Difference In Phone and Internet Service") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(y = "Count of Customers", x = "Length of Tenure (months)")

As you can see, if a customer does not have phone service, the only internet service available is DSL. You can also see the variation in tenure length across the internet service products where the customer also has phone service. In all three panels, the count of customers with a tenure length of zero is much higher than the count at the maximum tenure length. It would be important for management to reach these customers and determine a way to extend the average tenure length across all internet service products. A further explanation and possible solutions will be discussed in the business analysis and conclusion section.

When Covid began, the world economy seemingly switched gears completely. Most, if not all, workers were required to work from home. You needed a mask to go to the movie theater, and some people chose to just stay in and watch a movie from the comfort of their home. Because of this, TV and movie streaming have become increasingly in demand and sometimes preferable to regular cable TV. For Regork to fully capitalize on its streaming services, it's important to visualize the tenure length among customers who do or do not have these products.

ggplot(retention, aes(Tenure)) +
  geom_bar(fill = "blue") +
  facet_grid(StreamingTV~StreamingMovies) +
  ggtitle("Tenure Difference In TV and Movie Streaming") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(y = "Count of Customers", x = "Length of Tenure (months)")

This graph facets TV streaming by rows and movie streaming by columns. We can see that a large number of customers who have internet service but neither streaming service leave quickly. We can also see that a large number of customers with both streaming services have the maximum tenure length. This could indicate that the customers with both streaming services are satisfied with their product. It would be important for management to find a way to reach the customers who have neither streaming service and introduce them to the product so that we might extend their tenure length.

Machine Learning

Before the machine learning analysis begins, there are a few steps to complete. First, we create a 70-30 split in the retention data to produce training and testing sets. We train all three machine learning algorithms on the training data to avoid overfitting, so that when the models are introduced to the testing data they make predictions on unseen observations. We also define a recipe that standardizes all numeric predictors to have zero mean and unit variance and re-codes all categorical variables using dummy encoding. Finally, we create a k-fold cross-validation resampling scheme so that each algorithm is trained and evaluated on different folds of the same training data, which gives a more reliable estimate of predictive performance.

#split Retention
set.seed(123)
Retention_split <- initial_split(retention, prop = .7, strata = Status)
Retention_train <- training(Retention_split)
Retention_test <- testing(Retention_split)
#Retention recipe
Retention_recipe <- recipe(Status ~ ., data = Retention_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
#k-fold cross validation
set.seed(123)
kfolds <- vfold_cv(Retention_train, v = 5, strata = Status)
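
To double-check that the recipe behaves as described, we could prep it and inspect the transformed training data (a quick sketch; the exact dummy-column names depend on the encoding):

# preview the standardized and dummy-encoded training data
Retention_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()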

Logistic Regression

The first machine learning algorithm used is logistic regression. I chose this model because of its strong performance in estimating class probabilities. We start by defining the model object and fitting it across the k-fold resamples. The metric we are trying to maximize is the Area Under the ROC Curve (AUC); below we can see that the AUC of this logistic regression is 0.8447877.

log_mod <- logistic_reg()

log_results <- log_mod %>% 
  fit_resamples(Status ~ ., kfolds)

collect_metrics(log_results) %>% filter(.metric == "roc_auc")
## # A tibble: 1 x 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.845     5 0.00521 Preprocessor1_Model1

Now we fit the final model and compute the confusion matrix to assess the accuracy of the model. According to the confusion matrix, this model is better at identifying current customers than customers who have left.

final_fit <- log_mod %>%
  fit(Status ~ ., data = Retention_train)
final_fit %>%
  predict(Retention_test) %>%
  bind_cols(Retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332
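
To quantify that imbalance, we could compute accuracy, sensitivity, and specificity from the same test predictions (a sketch; log_preds is just a helper name, and with yardstick's default event level, sensitivity refers to the "Current" class):

# class-level performance on the test set
log_preds <- final_fit %>%
  predict(Retention_test) %>%
  bind_cols(Retention_test %>% select(Status))

log_preds %>% accuracy(truth = Status, estimate = .pred_class)
log_preds %>% sens(truth = Status, estimate = .pred_class)  # "Current" correctly identified
log_preds %>% spec(truth = Status, estimate = .pred_class)  # "Left" correctly identified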

The ROC curve for the logistic regression is visualized below.

final_fit %>% 
  predict(Retention_train, type = "prob") %>%
  mutate(truth = Retention_train$Status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

Decision Tree Model

The next machine learning model is a decision tree. Here, the cost complexity, tree depth, and minimum node size are tuned over a regular hyperparameter grid.

dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")

hyper_grid <- grid_regular(
  cost_complexity(), 
  tree_depth(),
  min_n()
)
dt_results <- tune_grid(dt_mod, Retention_recipe, resamples = kfolds, grid = hyper_grid)

show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 x 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1     0.000000238          7    26 roc_auc binary     0.816     5 0.00569
## 2     0.000180             9    34 roc_auc binary     0.810     5 0.00569
## 3     0.000000596         12    32 roc_auc binary     0.809     5 0.00527
## 4     0.0000115            6    39 roc_auc binary     0.809     5 0.00560
## 5     0.000351            14    16 roc_auc binary     0.793     5 0.00323
## # ... with 1 more variable: .config <chr>
dt_best_model <- select_best(dt_results, metric = "roc_auc")
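
To see how AUC varies across the tuned hyperparameters, the tuning results could also be plotted directly (a quick sketch):

# visualize AUC across the hyperparameter grid
autoplot(dt_results, metric = "roc_auc")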

We can see the best AUC value is 0.8156932, which is not as good as the logistic regression model.

dt_wf <- workflow() %>%
  add_recipe(Retention_recipe) %>%
  add_model(dt_mod) %>%
  finalize_workflow(dt_best_model)

# fit final workflow across entire training data
dt_final_fit <- dt_wf %>%
  fit(data = Retention_train)
dt_final_fit %>%
  predict(Retention_test) %>%
  bind_cols(Retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1314  218
##    Left        226  339

The confusion matrix above indicates that this model, too, is better at identifying current customers than customers who have left.

The ROC curve for this model is visualized below.

dt_final_fit %>% 
  predict(Retention_train, type = "prob") %>%
  mutate(truth = Retention_train$Status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

rpart.plot::rpart.plot(dt_final_fit$fit$fit$fit)

Random Forest Model

The final model is a random forest model.

#Random Forest
# create random forest model object with tuning option
rf_mod <- rand_forest(
  mode = "classification",
  trees = tune(),
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity")
# create the hyperparameter grid
rf_hyper_grid <- grid_regular(
  trees(range = c(50, 800)),
  mtry(range = c(2, 50)),
  min_n(range = c(1, 30)),
  levels = 5
)
# train our model across the hyper parameter grid
set.seed(123)
rf_results <- tune_grid(rf_mod, Retention_recipe, resamples = kfolds, grid = rf_hyper_grid)
# model results
show_best(rf_results, metric = "roc_auc")
## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1    14   237    30 roc_auc binary     0.842     5 0.00405 Preprocessor1_Model1~
## 2    14   425    30 roc_auc binary     0.841     5 0.00459 Preprocessor1_Model1~
## 3    14   612    30 roc_auc binary     0.841     5 0.00411 Preprocessor1_Model1~
## 4    14   800    30 roc_auc binary     0.841     5 0.00411 Preprocessor1_Model1~
## 5     2   425    30 roc_auc binary     0.840     5 0.00474 Preprocessor1_Model1~

The best AUC value for this model is 0.8415413, which is still slightly less than the logistic regression.

# get optimal hyper parameters
rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")
# create final workflow object
final_rf_wf <- workflow() %>%
  add_recipe(Retention_recipe) %>%
  add_model(rf_mod) %>%
  finalize_workflow(rf_best_hyperparameters)
# fit final workflow object
rf_final_fit <- final_rf_wf %>%
  fit(data = Retention_train)
rf_final_fit %>%
  predict(Retention_test) %>%
  bind_cols(Retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1379  263
##    Left        161  294

Once again, the confusion matrix above indicates that the model is better at identifying current customers than customers who have left.

rf_final_fit %>% 
  predict(Retention_train, type = "prob") %>%
  mutate(truth = Retention_train$Status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

Optimal Model

The optimal model for predicting customer churn is the logistic regression model, which has the highest Area Under the ROC Curve (AUC) of the three models at 0.8447877.
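
For a side-by-side comparison, the cross-validated AUC of each model can be gathered into a single table (a sketch built from the objects defined above):

# cross-validated AUC for all three models
bind_rows(
  collect_metrics(log_results) %>% filter(.metric == "roc_auc") %>% mutate(model = "logistic regression"),
  show_best(dt_results, metric = "roc_auc", n = 1) %>% mutate(model = "decision tree"),
  show_best(rf_results, metric = "roc_auc", n = 1) %>% mutate(model = "random forest")
) %>%
  select(model, mean, std_err)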

The most important factors are visualized below:

tidy(final_fit)
## # A tibble: 31 x 5
##    term                          estimate std.error statistic   p.value
##    <chr>                            <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                     1.02     0.969      1.05    2.95e- 1
##  2 GenderMale                     -0.0461   0.0779    -0.592   5.54e- 1
##  3 SeniorCitizen                   0.258    0.101      2.55    1.08e- 2
##  4 PartnerYes                     -0.141    0.0925    -1.52    1.29e- 1
##  5 DependentsYes                  -0.0475   0.108     -0.441   6.59e- 1
##  6 Tenure                         -0.0646   0.00762   -8.47    2.35e-17
##  7 PhoneServiceYes                -0.0417   0.774     -0.0539  9.57e- 1
##  8 MultipleLinesNo phone service  NA       NA         NA      NA       
##  9 MultipleLinesYes                0.442    0.211      2.10    3.56e- 2
## 10 InternetServiceFiber optic      1.48     0.950      1.56    1.18e- 1
## # ... with 21 more rows
final_fit %>%
  vip(num_features = 20)

According to the graph, the most influential factors in the logistic regression model are tenure, followed by the one- and two-year contract terms.
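
To see the direction of those effects, we could pull the corresponding coefficients out of the fitted model (a sketch; the exact term names come from R's factor encoding):

# coefficients for the most influential terms
tidy(final_fit) %>%
  filter(grepl("Tenure|Contract", term))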

Now we compute the AUC on the test data to see how the model holds up against unseen data. The test AUC is essentially identical to the cross-validated AUC, indicating the model generalizes well.

final_fit %>% 
  predict(Retention_test, type = "prob") %>%
  mutate(truth = Retention_test$Status) %>%
  roc_auc(truth, .pred_Current)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.845

Business Analysis and Conclusion

In the optimal model, tenure and the one- and two-year contracts have the highest importance. This finding complements the exploratory analysis well, since tenure length was its subject.

In the first graph, it is evident that a large number of customers without a partner or dependents have the minimum tenure length. The second graph shows a large number of customers with phone service at the minimum tenure length. Finally, the third graph shows that many customers without TV and movie streaming have the minimum tenure length, while many customers who do have those products reach the maximum tenure length.

The exploratory analysis and the logistic regression analysis both point to a need to maximize customer tenure length, and management can achieve this by catering to specific demographics: for example, offering deals and promotions targeted at single customers with no dependents, and steering new customers toward the TV and movie streaming products, since customers who have those products tend to have a longer tenure. Management may also offer incentives for selecting one- or two-year contracts instead of month-to-month, so that customers do not feel the need to switch providers at the end of the contract.

The limitations of this report include the limited exploratory analysis performed on tenure length. It would be helpful to find trends among the other variables that were deemed important in the logistic regression model to fully understand which factors influence customer churn.