In this report for Regork, I analyze customer retention data and look for the predictive model that fits it best. I begin by conducting exploratory analysis to find trends in the data, which helps guide the search for the best predictor model. Next, I execute a thorough and robust machine learning process, testing three different algorithms to find the best match for Regork’s data: a Logistic Regression model, a Multivariate Adaptive Regression Splines (MARS) model, and finally, a Decision Tree model.
Packages required
library(tidymodels)
library(tidyverse)
library(earth)
library(baguette)
library(yardstick)
library(vip)
library(pdp)
library(dplyr)
library(ggplot2)
library(kernlab)
library(rpart)
library(rpart.plot)
library(DT)
Import data
library(readr)
customer_retention <- read_csv("customer_retention.csv")
Remove Missing Values / Mutate Response Variable
Mutating the response variable to be a factor is essential when performing a classification model. Removing missing values ensures our data is ready to use and will not cause any warnings or issues when running the code.
customer_retention <- mutate(customer_retention, Status = factor(Status))
customer_retention <- drop_na(customer_retention)
customer_retention %>%
is.na() %>%
sum()
## [1] 0
Preview the data
datatable(head(customer_retention, 500), options = list(pageLength = 10))
Customer Status
customer_retention %>%
group_by(Status) %>%
ggplot(aes(x = Status, fill = Status)) +
geom_bar() +
ggtitle("Current Status of Customers")
This graph is a basic layout of the number of customers who are current versus those who have left. It provides a baseline visual that will be useful to compare against the other charts.
Customers Who Left
customer_retention %>%
filter(Status == "Left") %>%
group_by(Tenure) %>%
summarize(avg_MonthlyCharges = mean(MonthlyCharges)) %>%
ggplot(aes(x = Tenure, y = avg_MonthlyCharges, fill = Tenure)) +
geom_col() +
ylab("Average Monthly Charges") +
ggtitle("Months Stayed Vs. Average Price Paid per Month")
Current Customers
customer_retention %>%
filter(Status == "Current") %>%
group_by(Tenure) %>%
summarize(avg_MonthlyCharges = mean(MonthlyCharges)) %>%
ggplot(aes(x = Tenure, y = avg_MonthlyCharges, fill = Tenure)) +
geom_col() +
ylab("Average Monthly Charges") +
ggtitle("Current Months Stayed Vs. Average Price Paid per Month")
The first thing I want to do is see if there is a particular reason why customers left. I graphed customers who have left by how long they stayed and how much they were paying per month, alongside the same graph for current customers. The longer people stayed, the more they paid. This could be because they were adding more services such as online security and device protection, which suggests that price alone was not the reason people were leaving.
You can also see that current customers are paying a lower starting rate than customers who have left were paying when they first started.
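To make this comparison directly, both groups can also be drawn in one chart. A minimal sketch (faceting the same summarized averages by Status) is shown below.
# Same summary as above, but with both statuses side by side in one faceted chart
customer_retention %>%
group_by(Status, Tenure) %>%
summarize(avg_MonthlyCharges = mean(MonthlyCharges), .groups = "drop") %>%
ggplot(aes(x = Tenure, y = avg_MonthlyCharges, fill = Status)) +
geom_col() +
facet_wrap(~ Status) +
ylab("Average Monthly Charges") +
ggtitle("Average Monthly Charges by Tenure and Status")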
customer_retention %>%
group_by(Status) %>%
summarize(avg_Tenure = mean(Tenure)) %>%
ggplot(aes(x = avg_Tenure, y = Status, fill = Status)) +
geom_col() +
xlab("Average Tenure") +
ggtitle("Average Tenure Vs. Status")
This graph provides a basic look at the average length of time people stayed, broken down by status. It tells us that customers who left were not with Regork for long.
I am now going to compare the services held by Senior Citizens who have left with the services held by Senior Citizens who are still current customers.
Senior Citizens Who Left
customer_retention %>%
filter(Status == "Left") %>%
filter(SeniorCitizen == 1) %>%
ggplot(aes(x = OnlineSecurity, fill = OnlineSecurity)) +
geom_bar() +
xlab("Had Online Security") +
ggtitle("Senior Citizens Who Left That Had Online Security")
customer_retention %>%
filter(Status == "Left") %>%
filter(SeniorCitizen == 1) %>%
ggplot(aes(x = DeviceProtection, fill = DeviceProtection)) +
geom_bar() +
xlab("Had Device Protection") +
ggtitle("Senior Citizens Who Left That Had Device Protection")
customer_retention %>%
filter(Status == "Left") %>%
filter(SeniorCitizen == 1) %>%
ggplot(aes(x = TechSupport, fill = TechSupport)) +
geom_bar() +
xlab("Had Tech Support") +
ggtitle("Senior Citizens Who Left That Had Tech Support")
Senior Citizens Still Current
customer_retention %>%
filter(Status == "Current") %>%
filter(SeniorCitizen == 1) %>%
ggplot(aes(x = OnlineSecurity, fill = OnlineSecurity)) +
geom_bar() +
xlab("Have Online Security") +
ggtitle("Senior Citizens That Have Online Security")
customer_retention %>%
filter(Status == "Current") %>%
filter(SeniorCitizen == 1) %>%
ggplot(aes(x = DeviceProtection, fill = DeviceProtection)) +
geom_bar() +
xlab("Have Device Protection") +
ggtitle("Senior Citizens That Have Device Protection")
customer_retention %>%
filter(Status == "Current") %>%
filter(SeniorCitizen == 1) %>%
ggplot(aes(x = TechSupport, fill = TechSupport)) +
geom_bar() +
xlab("Have Tech Support") +
ggtitle("Senior Citizens That Have Tech Support")
As you can see, the Senior Citizens who are still current customers are much more likely to have protection services than the ones who have left, especially device protection. Senior Citizens often have a difficult time with technology, and even though these services may be more expensive, it is to their benefit to receive them if Regork wants them to stay. Retention could be improved by offering Senior Citizens certain services at a discounted rate.
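A quick numerical check of this claim is sketched below; it assumes the service columns are coded "Yes" when the customer has the service, matching how they are used in the bar charts above.
# Share of senior citizens with each protection service, split by Status
# (assumes "Yes" marks having the service)
customer_retention %>%
filter(SeniorCitizen == 1) %>%
group_by(Status) %>%
summarize(
pct_OnlineSecurity = mean(OnlineSecurity == "Yes"),
pct_DeviceProtection = mean(DeviceProtection == "Yes"),
pct_TechSupport = mean(TechSupport == "Yes")
)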
Logistic Regression
set.seed(123)
logistic_split <- initial_split(customer_retention, prop = .7, strata = Status)
logistic_train <- training(logistic_split)
logistic_test <- testing(logistic_split)
set.seed(123)
logistic_kfolds <- vfold_cv(logistic_train, v = 5, strata = Status)
log_reg <- logistic_reg() %>%
fit_resamples(Status ~ ., logistic_kfolds) %>%
collect_metrics()
print(log_reg)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.799 5 0.00401 Preprocessor1_Model1
## 2 brier_class binary 0.136 5 0.00188 Preprocessor1_Model1
## 3 roc_auc binary 0.845 5 0.00521 Preprocessor1_Model1
log_fit <- logistic_reg() %>%
fit(Status ~ ., data = logistic_train)
log_fit %>%
predict(logistic_test) %>%
bind_cols(truth = logistic_test %>%
select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1362 225
## Left 178 332
vip(log_fit)
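As a cross-check on the resampled estimate, the fitted model could also be scored on the test set. A minimal sketch is below; it assumes yardstick's default of treating the first factor level (Current) as the event class.
# Test-set ROC AUC for the logistic regression fit
log_fit %>%
predict(logistic_test, type = "prob") %>%
bind_cols(logistic_test %>% select(Status)) %>%
roc_auc(truth = Status, .pred_Current)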
The Logistic Regression model has a mean AUC of 0.845, which is a good value but can be improved. I will now try the MARS model to see if the AUC can be raised further.
MARS Model
set.seed(123)
mars_split <- initial_split(customer_retention, prop = .7, strata = Status)
mars_train <- training(mars_split)
mars_test <- testing(mars_split)
mars_recipe <- recipe(Status ~ ., data = mars_train)
set.seed(123)
mars_kfolds <- vfold_cv(mars_train, v = 5, strata = "Status")
mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
set_mode("classification")
mars_grid <- grid_regular(num_terms(range = c(1,100)), prod_degree(), levels = 50)
mars_wf <- workflow() %>%
add_recipe(mars_recipe) %>%
add_model(mars_mod)
tuning_results <- mars_wf %>%
tune_grid(resamples = mars_kfolds, grid = mars_grid)
tuning_results %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean))
## # A tibble: 100 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 19 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 2 21 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 3 23 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 4 25 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 5 27 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 6 29 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 7 31 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 8 33 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 9 35 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 10 37 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## # ℹ 90 more rows
best_mars <- select_best(tuning_results, metric = "roc_auc")
autoplot(tuning_results)
final_mars_wf <- workflow() %>%
add_recipe(mars_recipe) %>%
add_model(mars_mod) %>%
finalize_workflow(best_mars)
final_mars_wf %>%
fit(data = mars_train) %>%
extract_fit_parsnip() %>%
vip()
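To see how the finalized workflow performs on held-out data, it could also be run through last_fit(), which fits on the training split and scores on the test split. This is a sketch only; its metrics are not shown here.
# Fit the finalized MARS workflow on mars_train and evaluate on mars_test
final_mars_fit <- final_mars_wf %>%
last_fit(mars_split)
collect_metrics(final_mars_fit)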
The MARS model has a mean AUC of 0.849. This value is the highest of the three models, indicating it provides the best fit.
Decision Tree Model
set.seed(123)
dt_split <- initial_split(customer_retention, prop = .7, strata = Status)
dt_train <- training(dt_split)
dt_test <- testing(dt_split)
dt_mod <- decision_tree(mode = 'classification') %>%
set_engine("rpart")
model_recipe <- recipe(Status ~ ., data = dt_train)
dt_fit <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
fit(data = dt_train)
rpart.plot::rpart.plot(dt_fit$fit$fit$fit)
set.seed(123)
dt_kfold <- vfold_cv(dt_train, v = 5)
dt_results <- fit_resamples(dt_mod, model_recipe, dt_kfold)
collect_metrics(dt_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.785 5 0.00552 Preprocessor1_Model1
## 2 brier_class binary 0.147 5 0.00263 Preprocessor1_Model1
## 3 roc_auc binary 0.803 5 0.00777 Preprocessor1_Model1
dt_mod <- decision_tree(mode = "classification", cost_complexity = tune(), tree_depth = tune(), min_n = tune()) %>%
set_engine("rpart")
dt_hyper_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 5)
set.seed(123)
dt_results <- tune_grid(dt_mod, model_recipe, resamples = dt_kfold, grid = dt_hyper_grid)
show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 15 40 roc_auc binary 0.814 5 0.00780
## 2 0.0000000178 15 40 roc_auc binary 0.814 5 0.00780
## 3 0.00000316 15 40 roc_auc binary 0.814 5 0.00780
## 4 0.0000000001 11 40 roc_auc binary 0.814 5 0.00827
## 5 0.0000000178 11 40 roc_auc binary 0.814 5 0.00827
## # ℹ 1 more variable: .config <chr>
dt_best_model <- select_best(dt_results, metric = 'roc_auc')
dt_final_wf <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
finalize_workflow(dt_best_model)
dt_final_fit <- dt_final_wf %>%
fit(data = dt_train)
dt_final_fit %>%
extract_fit_parsnip() %>%
vip(5)
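As with the other models, the finalized tree could also be scored on its test split for a comparable AUC; a minimal sketch (again assuming Current is the event level) is shown below.
# Test-set ROC AUC for the final decision tree
dt_final_fit %>%
predict(dt_test, type = "prob") %>%
bind_cols(dt_test %>% select(Status)) %>%
roc_auc(truth = Status, .pred_Current)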
The Decision Tree Model has an AUC value of 0.814. This is a good value, but it is the lowest of the three, as the Logistic Regression model and the MARS model both have higher values.
Predicting customers who will leave with the Decision Tree Model
prediction <- predict(dt_final_fit, dt_test, type = "prob") %>%
pull(.pred_Left)
threshold <- 0.5
dt_test <- dt_test %>%
mutate(Status_pred = ifelse(prediction > threshold, "Leave", "Stay"))
dt_prediction <- dt_test %>%
filter(Status_pred == "Leave") %>%
filter(Status == "Current")
print(dt_prediction)
## # A tibble: 194 × 21
## Gender SeniorCitizen Partner Dependents Tenure PhoneService MultipleLines
## <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 Male 1 No No 1 Yes No
## 2 Female 0 Yes No 47 Yes Yes
## 3 Female 0 Yes Yes 11 No No phone service
## 4 Male 0 Yes Yes 3 Yes No
## 5 Female 0 No No 13 Yes Yes
## 6 Female 0 Yes Yes 15 Yes Yes
## 7 Female 0 No No 8 Yes No
## 8 Male 0 No No 14 No No phone service
## 9 Male 0 No No 4 Yes No
## 10 Female 1 No No 1 Yes Yes
## # ℹ 184 more rows
## # ℹ 14 more variables: InternetService <chr>, OnlineSecurity <chr>,
## # OnlineBackup <chr>, DeviceProtection <chr>, TechSupport <chr>,
## # StreamingTV <chr>, StreamingMovies <chr>, Contract <chr>,
## # PaperlessBilling <chr>, PaymentMethod <chr>, MonthlyCharges <dbl>,
## # TotalCharges <dbl>, Status <fct>, Status_pred <chr>
Using the final tuned Decision Tree model, 194 current customers are predicted to leave if no action is taken. By taking into account the patterns shown by this model and the others, there are ways Regork can improve its churn rate.
In all of the models, Tenure, Contract, and Total Charges are the most influential variables when it comes to a customer's Status. This was also hinted at before I ran the models by some of my exploratory analysis.
Tenure is the most influential variable in two of the three models. Tenure, meaning how long the customer has stayed, has a large impact on the Status of the customer, which is whether they are a current customer or not. This was shown in my exploratory analysis: the fourth graph shows the trend of tenure in relation to status. The average tenure is higher for current customers and shorter for customers who have left. This means that the longer a customer stays, the more likely they are to stay with Regork.
Though Tenure was the most influential variable in two of the three models, Total Charges was the most influential in the MARS model, which was the best predictor of the three with a mean AUC of 0.849. Because it was the most influential variable in the best predictor model and near the top in the other two, I believe it is the most influential of the predictor variables.
With this data, I would focus on cost and tenure when it comes to keeping customers. Keeping up to date with customers and doing as much as possible to keep them happy are both important in getting a customer to stay. Tenure could also be improved by trying to sell more yearly contracts rather than month-to-month contracts, because the longer people stay, the more likely they are to keep staying. By also making sure pricing stays affordable, the churn rate could improve. This could include promotions or special deals for people who choose to sign a year-long contract instead of paying month to month.
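One way to sanity-check the contract recommendation against the data is sketched below: compare the share of customers who left within each contract type.
# Churn rate by contract type (share of customers with Status == "Left")
customer_retention %>%
group_by(Contract) %>%
summarize(churn_rate = mean(Status == "Left"), n = n())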
In conclusion, the MARS model was the most accurate model, with an AUC of 0.849. While this is the best of the three, there is always room for improvement by adding more data to increase accuracy. Still, it shows areas where Regork can improve to increase customer retention. With the top predictor variables being Total Charges and Tenure, a combination of steps can be taken to give customers more incentive to stay with Regork.