Final Project

Introduction

The goal of this analysis is to determine the main predictors of churn and identify tangible ways to reduce churn in the business.

1. Data Preparation & Exploratory Data Analysis

# Load Packages

library(tidyverse)
library(tidymodels)
library(baguette)
library(pdp)
library(vip)
library(kernlab)
library(rpart.plot)

# Import Data
regork <- read.csv("customer_retention.csv")

# Clear NA values
regork <- na.omit(regork)
regork$Status <- factor(regork$Status)

# Compare Statuses

ggplot(regork, aes(Status)) +
  geom_bar() +
  ggtitle("Status at a Glance") +
  labs(y= "Count of Status", x= "Status")

This graph aims to give a better understanding of how prevalent this churn is.

# Evaluate Payment Method
ggplot(regork, aes(PaymentMethod)) +
  geom_bar() +
  facet_wrap(~Status) +
  coord_flip() +
  ggtitle("Payment Method and Status") +
  labs(y="Count of Status", x="Payment Method")

Although the Current Status graph is not too helpful, the graph for those who left clearly shows Electronic check as the dominant payment method. This could suggest providing incentives to pay automatically or via mailed check.

# Length of Contract Analysis
ggplot(regork, aes(Contract)) +
  geom_bar() +
  facet_wrap(~Status) +
  ggtitle("Length of Contract and Status") +
  labs(x= "Length of Contract", y="Count of Status")

This graph shows that people are more likely to churn the more frequent their payments. This would provide backing for a discount on a year or greater subscription plans.

# Run summary statistics of Charge data
summary(regork$MonthlyCharge)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.25   35.54   70.35   64.79   89.90  118.75

summary(regork$TotalCharges)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.8   401.9  1397.5  2283.1  3796.9  8684.8

# Look at the relationship between Monthly Charge and Status 

ggplot(regork, aes(MonthlyCharges)) +
  geom_bar() +
  facet_wrap(~Status) +
  ggtitle("Monthly Charge and Status Connection") +
  labs(y= "Number of Customers", x= "Monthly Charge ($)")

We can tell that most of our current customers have a charge of under $30. This is shown through the major spike in the current graph as well as the first quartile for the Monthly Charge data.We can see that when we reach a monthly charge of around $60 we see a rise in the amount of clients who leave.

2. Machine Learning

Logistic Regression

# Split the data
set.seed(962)
split <- initial_split(regork, prop = 0.7, strata = Status)
train <- training(split)
test <-  testing(split)

# Kfolds
set.seed(962)
kfolds <-  vfold_cv(train, v=5, strata = Status)

# fit the regression
logistic_reg() %>% 
  fit_resamples(Status ~., kfolds) %>% 
  collect_metrics()

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.805     5 0.00528 Preprocessor1_Model1
## 2 brier_class binary     0.136     5 0.00346 Preprocessor1_Model1
## 3 roc_auc     binary     0.843     5 0.00868 Preprocessor1_Model1

We get a roc_auc of 0.8431 from the logistic regression which is pretty strong, but likely is not the strongest model we can perform.

confusion_matrix <-  logistic_reg() %>% 
  fit(Status ~., data= train)

confusion_matrix %>% 
  predict(test) %>% 
  bind_cols(test %>% 
  select(Status)) %>% 
  conf_mat(truth= Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1356  246
##    Left        184  311

Our model is more likely to choose a false negative than a false positive. This means we should be aware that there will likely be a little more churn than what is predicted by the model.

Decision Tree

# Create Model
dt_mod <- decision_tree(mode = "classification") %>% 
  set_engine("rpart")

# Create recipe
dt_recipe <- recipe(Status ~., data= train)

# Fit workflow
dt_fit <-  workflow() %>% 
  add_recipe(dt_recipe) %>% 
  add_model(dt_mod) %>% 
  fit(data= train)

# fit results
dt_results <- fit_resamples(dt_mod, dt_recipe, kfolds)

# collect results
collect_metrics(dt_results)

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.785     5 0.00270 Preprocessor1_Model1
## 2 brier_class binary     0.148     5 0.00278 Preprocessor1_Model1
## 3 roc_auc     binary     0.796     5 0.00943 Preprocessor1_Model1

The decision tree without tuning performs the worst of our models.

dt_mod <-  decision_tree(
  mode= "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>% 
  set_engine('rpart')

# grid
dt_hyper_grid <-  grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

# train model
set.seed(964)
dt_results <-  tune_grid(dt_mod, dt_recipe, resamples = kfolds, grid =  dt_hyper_grid)

# show best
show_best(dt_results, metric= 'roc_auc', n=5)

## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001          8    40 roc_auc binary     0.813     5 0.00678
## 2    0.0000000178          8    40 roc_auc binary     0.813     5 0.00678
## 3    0.00000316            8    40 roc_auc binary     0.813     5 0.00678
## 4    0.000562              8    40 roc_auc binary     0.812     5 0.00610
## 5    0.0000000001          8    30 roc_auc binary     0.810     5 0.00792
## # ℹ 1 more variable: .config <chr>

The tuned decision tree is a decent improvement but still not better than the logistic model

#  get best
dt_best_model <-  select_best(dt_results, metric = 'roc_auc')

# final wf
dt_final_wf <-  workflow() %>% 
  add_recipe(dt_recipe) %>% 
  add_model(dt_mod) %>% 
  finalize_workflow(dt_best_model)

# fit final
dt_final_fit <-  dt_final_wf %>% 
  fit(data = train)

# plot feature importance
dt_final_fit %>% 
  extract_fit_parsnip() %>% 
  vip(10)

We can see contract plays the most important role in this model. This further supports the need for incentivizing longer term contracts. ### Random Forest

# Recipe
rf_recipe <- recipe(Status ~., data= train)

# Create model
rf_mod <-  rand_forest(mode= "classification", trees = tune(), mtry = tune(), min_n = tune()) %>% 
  set_engine("ranger", importance = "impurity")

# set ranges to test
rf_grid <- grid_regular(trees(range= c(100,1000)),mtry(range = c(1, 50)), min_n(range= c(1,20)), levels = 5)

# collect the tested grid data
set.seed(964)
results <- tune_grid(rf_mod, rf_recipe, resamples = kfolds, grid= rf_grid)

# show the outcomes of the rf model
show_best(results)

## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     1   550    20 roc_auc binary     0.839     5 0.00698 Preprocessor1_Model1…
## 2     1   550    15 roc_auc binary     0.838     5 0.00677 Preprocessor1_Model0…
## 3     1   325     5 roc_auc binary     0.838     5 0.00691 Preprocessor1_Model0…
## 4     1   775    20 roc_auc binary     0.838     5 0.00692 Preprocessor1_Model1…
## 5     1  1000    20 roc_auc binary     0.838     5 0.00670 Preprocessor1_Model1…

The roc_auc is actually very comparable to the logistic regression but slightly lower.

# define the outcome with the best roc_auc

rf_best <- select_best(results, metric = "roc_auc")

# finalize the workflow

final_wf <- workflow() %>% 
  add_recipe(rf_recipe) %>%
  add_model(rf_mod) %>% 
  finalize_workflow(rf_best)

# finalize the fit

final_fit <-  final_wf %>% 
  fit(data=train)

# show the most impactful predictors

final_fit %>% 
  extract_fit_parsnip %>% 
  vip(num_features = 10)

3. Business Analysis & Conclusion

I can see lots of potential for adaptation in this business, however my analysis will focus on just three actionable points: Incentives for long term contracts, instituting a soft maximum monthly charge, and adding reminders of automatic payment methods to consumers.

Contracts were determined to be the most important predictor of churn in our decision tree model. Because of this and our findings in the exploratory analysis, we know that monthly contracts are a primary cause of churn for regork. This makes sense due to the psychological effect of many small payments providing frequent reminders of the money being spent as opposed to long term payments of only once a year. Because of this I would recommend having greater discounts on long term contracts to incentivize the consumer to stay for longer.

Instituting a soft maximum monthly charge could be helpfulas well due to the importance of monthly charges being the fourth most important predictor of churn according to our decision tree model. Through our exploratory data analysis we saw that there is a rise in the churn once the monthly charge approaches $75. Because of this, I would recommend using $75 as a soft cap for pricing to minimize churn. There are plenty of customers who stay with monthly charges above $75, which is why I wonly recommend a soft maximum.

Payment method, although less important in predicting churn, stands out because of the difference in the customer status between electronic check payments and all other payment types. To reduce the use of electronic checks, reminders and occasional discounts could be sent to those who use that payment method to switch to automated payments.