Regork Final Project

Introduction

We are back to working with Regork as they enter the telecommunications market. Our task is to find out what factors are used to retain customers in order to maximize our profit. We started by cleaning the data and loading necessary packages and getting rid of missing value in the data. We then created three machine learning models in the form of decision trees, bagging decision trees, and random forests. We measure area under the curve to determine the best model and base our analysis after. Using the best model we are going to find the most important factors in retaining customer and run an analysis on the model.

Data Preperation

These are the libraries necessary to run our models in order to come up with the best possible solution to the problem.

suppressMessages(library(tidyverse, quietly = TRUE))
suppressMessages(library(ggplot2, quietly = TRUE))
suppressMessages(library(readr, quietly = TRUE))
suppressMessages(library(tidymodels, quietly = TRUE))
suppressMessages(library(baguette, quietly = TRUE))
suppressMessages(library(vip, quietly = TRUE))
suppressMessages(library(pdp, quietly = TRUE))
suppressMessages(library(ranger, quietly = TRUE))


customer_retention <- suppressMessages(read_csv("~/Desktop/customer_retention.csv"))

customer_retention <- na.omit(customer_retention)

cust_left <- customer_retention %>%
  filter(Status == "Left")

EDA

In our Exploratory Data Analysis, we created three visualizations to try to understand the data that we are working with and to see if we can see any early factors in why customers are leaving.

Contract Type

contract_counts <- cust_left %>%
  group_by(Contract) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))

contract_counts %>%
  ggplot(aes(x = Contract, y = Count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme_minimal() +
  labs(title = "Contract Types of Users Who Left",
       x = "Contract Type",
       y = "Number of Users")

The Contract Type charts shows that the majority of customers that leave are on month-to-month contracts compared to limited people leaving for one and two year contracts.

Tenure

tenure_counts <- cust_left %>%
  group_by(Tenure) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))

tenure_counts %>%
  ggplot(aes(x = Tenure, y = Count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme_minimal() +
  labs(title = "Tenure of Users Who Left",
       x = "Months",
       y = "Number of Users")

The tenure chart shows that the shorter time that the customers are there, the more likely they are to leave as the number of users leaving decreases and the number of months increase.

Online Security

os_counts <- cust_left %>%
  group_by(OnlineSecurity) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))

os_counts %>%
  ggplot(aes(x = OnlineSecurity, y = Count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme_minimal() +
  labs(title = "Did the users who left have online secruity?",
       x = "Online Security",
       y = "Number of Users")

Our final chart is based on the number of users who left based on whether or not they have online security. As you can see in the chart, customers who don’t have online security left a lot more often then customers who do have online security.

Machine Learning

Decision Trees

set.seed(123)
split <- initial_split(customer_retention, prop = 0.7, strata = "Status")
cust_train <- training(split)
cust_test <- testing(split)

# Step 1: create decision tree model object
dt_mod <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

# Step 2: create model recipe
model_recipe <- recipe(Status ~ ., data = cust_train)

# Step 3: fit model workflow
dt_fit <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(dt_mod) %>%
  fit(data = cust_train)

# create resampling procedure
set.seed(123)
kfold <- vfold_cv(cust_train, v = 5)

# train model
dt_results <- fit_resamples(dt_mod, model_recipe, kfold)

# model results
collect_metrics(dt_results)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.785     5 0.00552 Preprocessor1_Model1
## 2 roc_auc  binary     0.803     5 0.00777 Preprocessor1_Model1

dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")

# create the hyperparameter grid
dt_hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

# train our model across the hyper parameter grid
set.seed(123)
dt_results <- tune_grid(dt_mod, model_recipe, resamples = kfold, grid = dt_hyper_grid)

# get best results
show_best(dt_results, metric = "roc_auc", n = 5)

## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001         15    40 roc_auc binary     0.814     5 0.00780
## 2    0.0000000178         15    40 roc_auc binary     0.814     5 0.00780
## 3    0.00000316           15    40 roc_auc binary     0.814     5 0.00780
## 4    0.0000000001         11    40 roc_auc binary     0.814     5 0.00827
## 5    0.0000000178         11    40 roc_auc binary     0.814     5 0.00827
## # ℹ 1 more variable: .config <chr>

Using our decision trees model, we created multiple models with different variables in order to find the best decision tree. Our best one had an area under the curve of .814, which is good, but not our optimal model.

Bagging

bag_mod <- bag_tree() %>%
  set_engine("rpart", times = 5) %>%
  set_mode("classification")

# train model
bag_results <- fit_resamples(bag_mod, model_recipe, kfold)

# model results
collect_metrics(bag_results)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.768     5 0.00708 Preprocessor1_Model1
## 2 roc_auc  binary     0.773     5 0.00987 Preprocessor1_Model1

# create bagged CART model object with
# tuning option set for number of bagged trees
bag_mod <- bag_tree() %>%
  set_engine("rpart", times = tune()) %>%
  set_mode("classification")

# create the hyperparameter grid
bag_hyper_grid <- expand.grid(times = c(5, 25, 50, 100, 200, 300))

# train our model across the hyper parameter grid
set.seed(123)
bag_results <- tune_grid(bag_mod, model_recipe, resamples = kfold, grid = bag_hyper_grid)

# model results
show_best(bag_results, metric = "roc_auc")

## # A tibble: 5 × 7
##   times .metric .estimator  mean     n std_err .config             
##   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1   200 roc_auc binary     0.821     5 0.0108  Preprocessor1_Model5
## 2   300 roc_auc binary     0.820     5 0.00981 Preprocessor1_Model6
## 3   100 roc_auc binary     0.819     5 0.00991 Preprocessor1_Model4
## 4    50 roc_auc binary     0.816     5 0.00940 Preprocessor1_Model3
## 5    25 roc_auc binary     0.809     5 0.00971 Preprocessor1_Model2

Our bagging model also creates a lot of different models in which we can narrow down to the best possible one. You can see here that our model created an under the curve metric of .821 which is better than our decision trees model but still not the optimal model for our analysis

Random Forest

# Create random forest model object without tuning
rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger")

# Train model
rf_results <- fit_resamples(rf_mod, model_recipe, kfold)

# View model results
collect_metrics(rf_results)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.798     5 0.00626 Preprocessor1_Model1
## 2 roc_auc  binary     0.837     5 0.0103  Preprocessor1_Model1

# Create random forest model object with tuning options
rf_mod_tune <- rand_forest(
  mode = "classification",
  trees = tune(),
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity")

# Define the hyperparameter grid
hyper_grid <- grid_regular(
  mtry(range = c(4, 8)),
  min_n(range = c(5, 20)),
  trees(range = c(50, 100))  # Adjust the range for 'trees'
)

# Perform hyperparameter tuning with cross-validation
rf_results_tune <- tune_grid(
  object = rf_mod_tune,
  resamples = kfold,
  grid = hyper_grid,
  metrics = metric_set(roc_auc),
  control = control_grid(save_pred = TRUE),
  preprocessor = model_recipe
)

# Show the best hyperparameters
show_best(rf_results_tune, metric = "roc_auc")

## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     4    75    20 roc_auc binary     0.838     5 0.00982 Preprocessor1_Model16
## 2     4   100    20 roc_auc binary     0.838     5 0.0102  Preprocessor1_Model25
## 3     4   100    12 roc_auc binary     0.838     5 0.0105  Preprocessor1_Model22
## 4     4    50    20 roc_auc binary     0.838     5 0.00980 Preprocessor1_Model07
## 5     6   100    20 roc_auc binary     0.836     5 0.0110  Preprocessor1_Model26

# Get optimal hyperparameters
rf_best_hyperparameters <- select_best(rf_results_tune, metric = "roc_auc")

# Create final workflow object
final_rf_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(rf_mod_tune) %>%
  finalize_workflow(rf_best_hyperparameters)

# Fit final workflow object
rf_final_fit <- final_rf_wf %>%
  fit(data = cust_train)

# Plot feature importance
rf_final_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 20)

We now get to our best model which was using random forest. We used tuning to get our best under the curve metric at .838.

Using the above chart, we can determine the most influential factors to whether customers leave. The five factors that have the most influence on the customers leaving are tenure, total charges, monthly charges, contract, and online security according to our model.

Confusion Matrix

rf_final_fit %>%
  predict(cust_test) %>%
  bind_cols(cust_test %>% select(Status)) %>%
  mutate(truth = as.factor(cust_test$Status)) %>%
  conf_mat(truth, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1380  268
##    Left        160  289

Here we create a confusion matrix to identify false positives and false negatives found in the test set.

Generalization Error

rf_final_fit %>%
  predict(cust_test, type = "prob") %>%
  mutate(truth = as.factor(cust_test$Status)) %>%
  roc_auc(truth, .pred_Current)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.840

Business Analysis

In terms of relative importance how would you rate the predictors in your model. As a business manager, which factors would you focus on (for example you could invest in offering some incentives or promotions) to decrease the chances of customers leaving?

Our top five influential factors had a lot more importance than the rest. As stated before, the top five influential factors in whether customers stayed or left were tenure, total charges, monthly charges, contract, and online security according to our model. Based on this, we would focus on these five factors with an emphasis on tenure as that was the most influential factor. We would create our incentives and promotions based on these factors in order to decrease our predicted loss for the future.

Collect all the customers from the test dataset that you predict are going to leave.

All of the customers that are model had leaving were on month-to-month contracts which is what we will focus on with an incentive program. 352 of the 455 customers that we had leaving had a tenure of less than a year which we evaluated as hand in hand with the contract length. The customers that left also had approximately a $12 dollar increase of monthly charges compared to the customers that stayed.

What is the predicted loss in revenue per month if no action is taken?

Adding the monthly charges and total charges of the customers our training set predicts to leave, the predicted loss in revenue per month would be $34,693 if no action is taken.

Propose an incentive scheme to your manager to retain these customers. Use your model to justify your proposal. You can do this by performing a cost benefit analysis (comparing the cost of the incentive plan to the benefit of retaining the customers).

We came up with a major incentive scheme in order to retain customers based on two of our most influential factor of tenure and contract. We came up with a program where if you sign a one or two year contract, you recieve added benefits compared to just the month-to-month sign ups. We would offer reward points and monthly deals so that customers are continuously getting rewarded and will look forward to extra rewards just for signing up for a longer contract. Our model has all of the month-to-month customers leaving and this would decrease the number of month-to-month customers. Regork could give $20 a month in rewards to each customer which would ultimately save the company approximately $56 a month per customer.

Conclusion

Limitations:

A few limitations with our report was that we could have dug deeper into the demographics of the customers we predicted to leave, instead of focusing on the most important factors. More demographic information, such as the customers income level and ages instead of just senior citizen or not, in the dataset could also help create a better preforming model.

Summary:

To summarize we created a decision tree, bagging, and random forest model to predict whether Regork will retain certain customers. The random forest model was our best model with an auc of 0.838 and our generalization error from the test set was very slightly higher, at 0.840. Our suggestion to the CEO of Regork would be to impletement an incentive plan for customers to sign up for a one or two year contract. This would help retain customers by lowering their charges and extending their tenure. Thank you for taking time to our look at our report we hope you are able to apply our findings to further improve your telecommunications services!