Introduction

The data and models presented in this analysis are intended to explain the status of our current and future customers. One of the company's highest costs is acquiring new customers, so it is far more efficient to retain the ones we already have. To that end, we utilize data on each customer, ranging from their type of contract and tenure to their type of phone service and more. If we can correctly predict a customer's status, we can target them directly, improve the factors that may drive them to leave, and reinforce the ones that help them stay. This increased retention will add to our revenue as a company and decrease our costs of finding new customers.

We started our analysis by exploring the data and the relationship between certain predictor variables and our main variable, customer status. We were able to find preliminary trends which gave us a better understanding of how to move forward with our machine learning later on. Once we found those trends, we created three machine learning models: a decision tree, a logistic regression, and a multivariate adaptive regression splines (MARS) model. For each model we computed the area under the ROC curve (AUC), which tells us how well the model predicts the data. We then picked the best model and ran the test data through it to get a final measure of the usability and accuracy of our model in understanding customer status.

Overall, this model will aid in understanding how and why clients make the decision to stay or leave. We now know the most important variables that affect this decision (tenure, total charges, monthly charges, payment method, online security) and can market and demonstrate how our company addresses these areas better than our competition. This new information and model will increase our total revenue and retention rates and benefit our business as a whole.

Data Preparation

In order to complete all necessary exploratory data analysis and machine learning with the data being used, these were the required packages:

library(tidymodels)   # modeling framework (recipes, parsnip, tune, yardstick, etc.)
library(tidyverse)    # data manipulation and plotting
library(baguette)     # bagged tree models
library(vip)          # variable importance plots
library(pdp)          # partial dependence plots
library(here)         # project-relative file paths
library(kernlab)      # support vector machine engines
library(ggplot2)      # plotting (also loaded with tidyverse)
library(ranger)       # random forest engine
library(earth)        # MARS engine
library(rsample)      # resampling (also loaded with tidymodels)
library(rpart.plot)   # decision tree plots

setwd("C:/Users/ericw/Documents/UC 2023 Spring/Data Mining")

Exploratory Data Analysis

In order to begin our data analysis, it was first necessary to import the customer retention dataset being used. We then changed the Status variable (our response variable) to a factor so that we could run our classification models, and omitted all rows with missing values.

retention <- read.csv("customer_retention.csv")

retention <- retention %>% 
  mutate(Status = as.factor(Status))

retention <- na.omit(retention)

We first began our exploratory data analysis by comparing the current customers to those who had left by gender.

# Breakdown of Retention Status by Gender
ggplot(retention, aes(Status)) +
  geom_bar(fill = "purple") +
  facet_wrap(~Gender) +
  ggtitle("Status vs. Gender ") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(y = "Count of Customers", x = "Status")

This plot shows that there is little to no difference in retention status by gender.
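As a quick numeric check of that observation, the churn rate within each gender group can be computed directly; this is a minimal sketch using only the columns already plotted, not part of the modeling pipeline.

# Churn rate by gender (quick numeric check)
retention %>% 
  group_by(Gender) %>% 
  summarize(churn_rate = mean(Status == "Left"))

Given how little signal gender showed, we decided to look at areas other than demographics, where we hoped we might find more significant trends. Before doing so, we decided to plot the overall difference in status.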

# Breakdown of Current Status
ggplot(retention, aes(Status)) +
  geom_bar(fill = "lightblue") +
  ggtitle("Status Breakdown") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(y = "Count of Customers", x = "Status")

As expected, the plot above shows results similar to our initial plot, which broke down status by gender. Given this information, we then chose to analyze customer retention status by how many months customers have been with the company.

# Customer Retention Status by Month
ggplot(retention, aes(x = Status, fill = Status)) +
  geom_bar() +
  facet_wrap(~Tenure) +
  ggtitle("Status vs. Tenure (By Month) ") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(y = "Customer Count", x = "Status")

From this plot, we see that customers generally have a high degree of turnover within the first year, followed by a slightly lessened degree of turnover in the second year. After that, the number of customers leaving becomes very low, showing a certain degree of customer loyalty for customers who make it past the 1-2 year mark.
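To put numbers on this pattern, the share of customers who left can be summarized by year of tenure; the sketch below assumes, as in the plot above, that Tenure is recorded in months.

# Churn rate by year of tenure (Tenure assumed to be in months)
retention %>% 
  mutate(TenureYears = ceiling(Tenure / 12)) %>% 
  group_by(TenureYears) %>% 
  summarize(customers = n(), churn_rate = mean(Status == "Left"))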

# Customer Retention Status vs. Length of Contract Plot
retention %>% 
  select(Contract, Status) %>% 
  group_by(Status) %>% 
  ggplot(aes(x = Status, y = Contract, fill = Contract)) + 
  geom_col() +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  ggtitle("Customer Retention Status vs. Length of Contract")

This plot shows that customers who sign one or two year contracts are much more likely to remain customers than those who sign month-to-month contracts. Month-to-month contracts seem to dominate the “Left” column, while there are very few one or two year contract customers who have left.

# Customer Retention Status vs. Payment Method Plot
retention %>% 
  select(PaymentMethod, Status) %>% 
  group_by(Status) %>% 
  ggplot(aes(x = Status, y = PaymentMethod, fill = PaymentMethod)) + 
  geom_col() +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  ggtitle("Customer Retention Status vs. Payment Method")

Through the above plot, we can see that customers who are paying through automatic payment are much less likely to leave than those who pay using an electronic check or a mailed check.

Machine Learning

Logistic Regression

# 70/30 train/test split, stratified on Status
set.seed(123)
retention_split <- initial_split(retention, prop = 0.7, strata = "Status")
retention_train <- training(retention_split)
retention_test  <- testing(retention_split)

# Logistic Regression Model
# 5-fold cross-validation folds, stratified on Status
retention_kfolds <- vfold_cv(retention_train, v = 5, strata = Status)

# Cross-validated metrics (accuracy and ROC AUC) for the logistic regression model
log_retention <- logistic_reg() %>% 
  fit_resamples(Status ~ ., retention_kfolds) %>%
  collect_metrics()

# Fit the logistic regression model on the full training set
log_fit <- logistic_reg() %>% 
  fit(Status ~ ., data = retention_train)

# Confusion matrix on the test set
log_fit %>% 
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332

As seen in the AUC value of 0.845, this logistic regression model does a fairly decent job of predicting whether a customer is going to stay with the company. From our confusion matrix, we can see a slight bias towards misclassifying customers who leave as staying (225 cases) over the reverse error (178 cases); even so, the model is rather accurate overall.
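The code above reports only the confusion matrix; the AUC itself comes from the resampled metrics stored earlier in log_retention. For reference, the same statistic could also be computed on the test set with a sketch like the following, mirroring the roc_auc() call we use for the MARS model later.

# Test-set ROC AUC for the logistic regression model (sketch)
log_fit %>% 
  predict(retention_test, type = "prob") %>%
  mutate(truth = retention_test$Status) %>%
  roc_auc(truth, .pred_Current)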

Decision Tree Model

# Decision Tree Model

# Step 1: specify the decision tree model
dt_mod <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

# Step 2: create model recipe
model_recipe <- recipe(Status ~ ., data = retention_train)

# Step 3: fit model workflow
dt_fit <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(dt_mod) %>%
  fit(data = retention_train)

# Plot the fitted tree (extract the underlying rpart object from the workflow)
rpart.plot::rpart.plot(extract_fit_engine(dt_fit))

# Step 4: K-fold
set.seed(123)
kfold <- vfold_cv(retention_train, v = 5)

# Train model
dt_results <- fit_resamples(dt_mod, model_recipe, kfold)

# Model results
collect_metrics(dt_results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.785     5 0.00552 Preprocessor1_Model1
## 2 roc_auc  binary     0.803     5 0.00777 Preprocessor1_Model1
# Step 5: tune the decision tree hyperparameters
dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")

# Create the hyperparameter grid
dt_hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

# Train our model across the hyper parameter grid
set.seed(123)
dt_results <- tune_grid(dt_mod, model_recipe, resamples = kfold, grid = dt_hyper_grid)

# Get best results
show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estima…¹  mean     n std_err .config
##             <dbl>      <int> <int> <chr>   <chr>     <dbl> <int>   <dbl> <chr>  
## 1    0.0000000001         15    40 roc_auc binary    0.814     5 0.00780 Prepro…
## 2    0.0000000178         15    40 roc_auc binary    0.814     5 0.00780 Prepro…
## 3    0.00000316           15    40 roc_auc binary    0.814     5 0.00780 Prepro…
## 4    0.0000000001         11    40 roc_auc binary    0.814     5 0.00827 Prepro…
## 5    0.0000000178         11    40 roc_auc binary    0.814     5 0.00827 Prepro…
## # … with abbreviated variable name ¹​.estimator
# Step 6: finalize and fit the best model

# Get best hyperparameter values
dt_best_model <- select_best(dt_results, metric = 'roc_auc')

# Put together final workflow
dt_final_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(dt_mod) %>%
  finalize_workflow(dt_best_model)

# Fit final workflow across entire training data
dt_final_fit <- dt_final_wf %>%
  fit(data = retention_train)

# Plot feature importance
dt_final_fit %>%
  extract_fit_parsnip() %>%
  vip(20)

# The best model achieved a mean AUC of 0.814 with a tree depth of 15.
# Tenure, Contract, and Total Charges were the most relevant predictors.

dt_final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1346  275
##    Left        194  282

As seen from the AUC value of 0.814, the above decision tree model, while still a good predictor of customer status, performs slightly worse than our logistic regression model. Our model also shows that the most important predictor variables for this model were Tenure, Contract, and Total Charges.
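As with the logistic regression model, the 0.814 figure is the cross-validated value from show_best(); if we also wanted the test-set AUC for the tuned tree, a sketch would be:

# Test-set ROC AUC for the tuned decision tree (sketch)
dt_final_fit %>%
  predict(retention_test, type = "prob") %>%
  mutate(truth = retention_test$Status) %>%
  roc_auc(truth, .pred_Current)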

MARS Model

# MARS Model
# Model recipe
mars_recipe <- recipe(Status ~ ., data = retention_train)

# MARS model specification with tunable number of terms and interaction degree
mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification")

# Hyperparameter grid
mars_grid <- grid_regular(num_terms(range = c(1, 30)), prod_degree(), levels = 50)

mars_wf <- workflow() %>%
  add_recipe(mars_recipe) %>%
  add_model(mars_mod)

tuning_results <- mars_wf %>%
  tune_grid(resamples = retention_kfolds, grid = mars_grid)

tuning_results %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))
## # A tibble: 60 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        18           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  2        19           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  3        20           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  4        21           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  5        22           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  6        23           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  7        24           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  8        25           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
##  9        26           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
## 10        27           1 roc_auc binary     0.851     5 0.00566 Preprocessor1_M…
## # … with 50 more rows
autoplot(tuning_results)

best_mars <- select_best(tuning_results, metric = "roc_auc")

final_mars_wf <- workflow() %>%
  add_recipe(mars_recipe) %>%
  add_model(mars_mod) %>%
  finalize_workflow(best_mars)

final_mars_wf %>% 
  fit(data = retention_train) %>%
  extract_fit_parsnip() %>%
  vip(type = "rss")

final_mars_fit <- final_mars_wf %>%
  fit(data = retention_train)

final_mars_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1383  242
##    Left        157  315
final_mars_fit %>% 
  predict(retention_test, type = "prob") %>%
  mutate(truth = retention_test$Status) %>%
  roc_auc(truth, .pred_Current)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.847

With a cross-validated AUC of 0.851, the MARS model was the best of the three models that we created. We found that the most important variables in this model were Tenure, Total Charges, Monthly Charges, Electronic Check Payment, and Online Security. The confusion matrix reinforced the predictive quality of our model, although it still misclassified more customers who left as staying (242) than the reverse (157). When we used the test data as an input for the MARS model, we found it had an AUC value of 0.847, which still shows a very good ability to predict retention among our customers.
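As a visual complement to the single AUC number, the full ROC curve for the final MARS fit on the test data can be drawn with yardstick; a minimal sketch:

# ROC curve for the final MARS model on the test data (sketch)
final_mars_fit %>%
  predict(retention_test, type = "prob") %>%
  mutate(truth = retention_test$Status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()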

Business Analysis

Most Important Predictor Variables

We were able to run a MARS model with a cross-validated AUC of 0.851 (0.847 on the held-out test data). The top 5 factors from this model that can predict whether a customer will stay or leave are the following:

  1. Tenure- The longer a customer remains with Regork, the less likely they are to leave. This means that it should be an absolute top priority to keep customers with the company during the first year or two in order to ensure that they remain loyal consumers in the long run.
  2. Total Charges- The cost a customer is paying for any product is obviously going to be first and foremost on their mind when considering whether or not to purchase it. It is important for Regork to keep prices high enough to maintain a good profit margin, while also ensuring prices are fair for consumers and in line with competitors.
  3. Monthly Charges- The rationale for monthly charges is very similar to that of total charges, so it makes sense that the two would be closely linked in terms of importance to the model. Again, it is essential for Regork to ensure pricing is competitive in the industry to maintain their customer base.
  4. Payment Method- Our model identified electronic check payment as an important predictor, and our exploratory analysis showed that customers who pay by electronic or mailed check are more likely to leave than those using automatic payments. Regork could encourage customers who currently pay by check to switch to automatic payments in order to boost retention.
  5. Online Security- It is important that customers feel their information (both demographic and payment information) is secure with Regork. If they feel security is not a priority for Regork, they are more likely to leave and take their business elsewhere. A continual emphasis on data privacy and security will help improve customer retention for Regork going forward. (The observed churn rates behind points 4 and 5 are sketched just after this list.)
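The churn rates behind points 4 and 5 can be checked directly from the data; the sketch below assumes OnlineSecurity is the column name for the online security feature (it does not appear in the code above), while PaymentMethod is used exactly as in our exploratory plots.

# Observed churn rate by payment method (sketch)
retention %>%
  group_by(PaymentMethod) %>%
  summarize(churn_rate = mean(Status == "Left"))

# Observed churn rate by online security (OnlineSecurity is an assumed column name)
retention %>%
  group_by(OnlineSecurity) %>%
  summarize(churn_rate = mean(Status == "Left"))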

Proposed Incentive Scheme

Because Tenure was the most important predictor variable, we believe that Regork would benefit greatly from trying to move some customers who are currently on month-to-month contracts onto one or two year contracts in the future. In order to do so, we think it would be beneficial to provide some sort of a discount for customers who agree to longer-term contracts. For instance, perhaps Regork could provide one month off of a one year contract for new customers, as well as two months off of a two year contract. This would increase the likelihood that customers would agree to long-term contracts, and therefore increase the chances that they would remain customers in the long run.
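To gauge the scale of this opportunity, the number of customers on each contract type and their observed churn rates can be tallied directly; a minimal sketch (the specific discount amounts would still need a proper revenue analysis):

# Customer count and churn rate by contract type (sketch)
retention %>%
  group_by(Contract) %>%
  summarize(customers = n(), churn_rate = mean(Status == "Left"))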

Limitations of Analysis

Although we have confidence in our analysis and the models we have built, it is important to acknowledge some limitations. While our final model had a strong predictive score, a broader dataset covering a longer period of time would provide more insight into Regork's customers. We also must acknowledge that past customer data is not necessarily indicative of how customers will act in the future. Despite this, we believe that the model we have built is a strong one, and that the recommendations we have provided will help inform future decision making within the company.