final_project

Customer Retention Analysis for Regork

Introduction

Explanation and Importance

When assisting Regork with their new telecommunications market, I will be building multiple machine learning models to predict retention of customers. I will first begin with visualizing a few trends that are present in the customer_retention dataset. I will then build a logistic regression model, multivariate adaptive regression spline (MARS), and random forest model. These trends will help cross-functional teams assess how to retain customers, specifically within Marketing department. With this data, Regork can highlight the areas with highest impact related to retention. Finally, with the conclusive models, Regork can assess what trends might need more attention going forward.

Exploratory Analysis

Data Preparation

The following packages will be used throughout this report

suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(tidymodels)))
suppressWarnings(suppressMessages(library(tidyverse)))
suppressWarnings(suppressMessages(library(baguette)))
suppressWarnings(suppressMessages(library(vip)))
suppressWarnings(suppressMessages(library(pdp)))
suppressWarnings(suppressMessages(library(here)))
suppressWarnings(suppressMessages(library(kernlab)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(ranger)))
suppressWarnings(suppressMessages(library(earth)))
suppressWarnings(suppressMessages(library(knitr)))

suppressMessages(suppressWarnings(setwd("C:/Users/samev/OneDrive/BANA_4080/data")))

suppressWarnings(suppressMessages(retention<- readr::read_csv("customer_retention.csv")))

suppressWarnings(suppressMessages(retention<- mutate(retention, Status = factor(Status))))

suppressWarnings(suppressMessages((retention<- na.omit(retention))))

## # A tibble: 6,988 × 20
##    Gender SeniorCitizen Partner Dependents Tenure PhoneService MultipleLines   
##    <chr>          <dbl> <chr>   <chr>       <dbl> <chr>        <chr>           
##  1 Female             0 Yes     No              1 No           No phone service
##  2 Male               0 No      No             34 Yes          No              
##  3 Male               0 No      No              2 Yes          No              
##  4 Male               0 No      No             45 No           No phone service
##  5 Female             0 No      No              2 Yes          No              
##  6 Female             0 No      No              8 Yes          Yes             
##  7 Male               0 No      Yes            22 Yes          Yes             
##  8 Female             0 No      No             10 No           No phone service
##  9 Female             0 Yes     No             28 Yes          Yes             
## 10 Male               0 No      Yes            62 Yes          No              
## # ℹ 6,978 more rows
## # ℹ 13 more variables: InternetService <chr>, OnlineSecurity <chr>,
## #   OnlineBackup <chr>, DeviceProtection <chr>, TechSupport <chr>,
## #   StreamingTV <chr>, StreamingMovies <chr>, Contract <chr>,
## #   PaperlessBilling <chr>, PaymentMethod <chr>, MonthlyCharges <dbl>,
## #   TotalCharges <dbl>, Status <fct>

ggplot(retention, aes(Tenure)) +
  geom_bar(fill = "lightblue") +
  facet_wrap(~InternetService) +
  ggtitle("Tenure and Internet Service") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(y = "Number of Customers", x = "Length of Tenure (months)")

This plot displays the relationship between Tenure and Internet Service. There seems to be some relations between each graph, as it clearly shows a similar distribution of tenure for DSL, Fiber Optic, and NO internet service.

ggplot(retention, aes(x= MonthlyCharges)) +
  geom_bar(fill = "lightgreen") +
  facet_wrap(~Contract)+
  ggtitle("Contract Type and Monthly Charges") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(x = "Monthly Charges", y = "Number of Customers")

This plot displays the trends in Contract Type and Monthly Charges. This graph shows that more customers have a month-to-month contract and the mean of the distribution also shows that customers are paying around 80 dollars per month. This is a critical factor in the prediction of whether customers will stay with their telecommunications service.

retention%>%
ggplot(aes(x=Tenure, fill=Status))+
  geom_bar() +
   labs(x = "Tenure", y = "Number of Customers")+
  ggtitle("Tenure and Status of Customer")

This plot is crucial for Regork executives to examine. As shown in the plot, we do not see much of a relationship between tenure and whether a customer left or is current. We can see that the majority of the customers that did leave, left early on in their tenure with the internet service.

Machine Learning

The underlying outcome variables that we look to evaluate are the roc_auc and accuracy. The lower the roc_auc, the better the model that will be built.

To Split the Data

set.seed(123)
retention_split <- initial_split(retention, prop = .7, strata = "Status")
retention_train <- training(retention_split)
retention_test  <- testing(retention_split)

Logistic Regression

The first machine learning model used is the logistic regression model

# Use same kfold for whole report
set.seed(123)
kfold<- vfold_cv(retention_train, v = 5)

log_results<- logistic_reg()%>%
  fit_resamples(Status ~.,kfold)

collect_metrics(log_results)

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.797     5 0.00705 Preprocessor1_Model1
## 2 brier_class binary     0.137     5 0.00374 Preprocessor1_Model1
## 3 roc_auc     binary     0.844     5 0.00974 Preprocessor1_Model1

final_fit_log<- logistic_reg()%>%
  fit(Status ~., data = retention_train)

Logistic Regression shows a relatively good roc_auc.

Confusion Matrix

final_fit_log %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332

Multivariate Adaptive Regression Sampling

Multivariate Adaptive Regression Sampling, or MARS is the next machine learning model that will be tested. This model provides another non-linearity scope which often proves to be more accurate than simple logistic regression.

retention_recipe <- recipe(Status ~ ., data = retention_train) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors())

mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
set_mode('classification')

mars_grid <- grid_regular(  
  num_terms(range = c(1, 50)), 
  prod_degree(), 
  levels = 30)

retention_wf <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(mars_mod) 

tuning_results <- retention_wf %>%
tune_grid(resamples = kfold, grid = mars_grid)

tuning_results %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean))

## # A tibble: 60 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        17           1 roc_auc binary     0.848     5  0.0107 Preprocessor1_M…
##  2        19           1 roc_auc binary     0.848     5  0.0107 Preprocessor1_M…
##  3        21           1 roc_auc binary     0.848     5  0.0107 Preprocessor1_M…
##  4        22           1 roc_auc binary     0.848     5  0.0107 Preprocessor1_M…
##  5        24           1 roc_auc binary     0.848     5  0.0107 Preprocessor1_M…
##  6        26           1 roc_auc binary     0.848     5  0.0107 Preprocessor1_M…
##  7        28           1 roc_auc binary     0.848     5  0.0107 Preprocessor1_M…
##  8        29           1 roc_auc binary     0.848     5  0.0107 Preprocessor1_M…
##  9        31           1 roc_auc binary     0.848     5  0.0107 Preprocessor1_M…
## 10        33           1 roc_auc binary     0.848     5  0.0107 Preprocessor1_M…
## # ℹ 50 more rows

autoplot(tuning_results)

When we plot the results of MARS, we see the accuracy and roc_auc prove to be significant.

Random Forest

The final machine learning model that I will be testing is the random forest.

suppressMessages(suppressWarnings(retention_recipe<-recipe(Status~.,retention_train)))

suppressMessages(suppressWarnings(retention_mod <- rand_forest(mode = "classification") %>%
set_engine("ranger")))

suppressMessages(suppressWarnings(retention_mod <- rand_forest(mode = "classification", trees = tune(), mtry = tune(), min_n = tune()) %>%
set_engine("ranger", importance = "impurity")))

suppressMessages(suppressWarnings(retention_hyper_grid <- grid_regular(trees(range = c(1, 800)), mtry(range = c(1, 50)), min_n(range = c(1, 20)), levels = 5)))

set.seed(123)
suppressMessages(suppressWarnings(results <- tune_grid(retention_mod, retention_recipe, resamples = kfold, grid = retention_hyper_grid)))

suppressMessages(suppressWarnings(show_best(results, metric = "roc_auc")))

## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     1   800     1 roc_auc binary     0.840     5  0.0110 Preprocessor1_Model0…
## 2     1   800    15 roc_auc binary     0.840     5  0.0108 Preprocessor1_Model0…
## 3     1   400     1 roc_auc binary     0.840     5  0.0106 Preprocessor1_Model0…
## 4     1   400    20 roc_auc binary     0.839     5  0.0108 Preprocessor1_Model1…
## 5     1   600    10 roc_auc binary     0.839     5  0.0109 Preprocessor1_Model0…

With the use of tuning, the model’s performance can be maximized. The results show that random forest modeling for this situation provides optimal results, with a low mean roc_auc.

# get optimal hyperparameters
retention_best_hyperparameters <- select_best(results, metric = "roc_auc")

# create final workflow object
final_retention_wf <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(retention_mod) %>%
finalize_workflow(retention_best_hyperparameters)

# fit final workflow object
retention_final_fit <- final_retention_wf %>%
fit(data = retention_train)

# plot feature importance
retention_final_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 20)

## [1] 5

Finally, I have created a chart that shows which variables are more important (VIP) in the random forest model.

Confusion Matrix (random_forest)

retention_final_fit%>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1480  375
##    Left         60  182

To show the validity of the model compared to the actual data, a simple confusion matrix was created to list false positive or false negative results from the random forest.

train_predictions <- retention_final_fit %>%
  predict(retention_train) %>%
  bind_cols(retention_train %>% select(Status)) 

train_accuracy <- train_predictions %>%
  metrics(truth = Status, estimate = .pred_class) %>%
  filter(.metric == "accuracy") %>%
  pull(.estimate)

test_predictions <- retention_final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) 

test_accuracy <- test_predictions %>%
  metrics(truth = Status, estimate = .pred_class) %>%
  filter(.metric == "accuracy") %>%
  pull(.estimate)
generalization_error <- train_accuracy - test_accuracy

generalization_error

## [1] 0.01688512

The next thing I did was create a generalization error equation which shows a low generalization error in this model.The generalization error is similar to the validation error which means this model was able to accurately predict the data.

Business Decision: From the observations in this section, I have learned that each machine learning model is not too far off from one another. Also, the low generalization number means that the random forest model can be accurate at predicting the the actual or future data sets.

Analysis and Summary

TOP 5 FACTORS Contract: Customers that have larger contracts are less likely to leave the service. It would be wise for Regork to advertise larger contracts to customers to decrease the chances of customers leaving.

Tenure: Tenure is the second most important predictor. It is apparent that retaining customers for longer periods of time leads to better retaining of customers.

Total Charges: Coming in at the 3rd most important predictor, customers tend to value the pricing of their plans when deciding who to choose.

Monthly Charges: Related to the 3rd most influential predictor. Monthly charges is similar to total charges as customers tend to put priority on their monthly bill.

Online Security: Coming in at 5th, online security. Customers who have online security also tend to have better retention with these service providers.

Predicted Lost Monthly Revenue

retention_monthly_income<- retention%>%
  filter(Status == 'Left')%>%
  select(MonthlyCharges)%>%
  summarize(total_income = sum(MonthlyCharges))
tibble(retention_monthly_income)

## # A tibble: 1 × 1
##   total_income
##          <dbl>
## 1      138222.

When we look at lost customers from our data set, we see that potential lost monthly income is $138,222. The reason why these customers left would need more potential research but if there is no action taken, Regork would suffer this extreme loss per month.

Incentive Shceme: One incentive scheme that I propose is implementing a loyalty program. Customers can climb tiers with longevity and with each level up, customers can reap the rewards.

For Example:

Bronze: Customers with 3-6 months of tenure receive a discount on their next bill.

Silver: Customers with 6-12 months of tenure are eligible to receive a 10% discount on their monthly bill.

Gold: Customers with 12-18 months of tenure are elegible for free streaming services along with their discount.

Platinum: Customers with 18+ months of tenure receive a free upgrade to higher internet speed and a boost of 15% discount on their monthly bill.

Summary of Report: With this report, Regork officials should consider getting ahead of the game with the incentive program mentioned above. Customer tenure and contract length are two of the highest predictors of retaining customers. The loyalty program will cut the lost profits from customers who were not satisfied with their current provider. The loyalty program is not only an incentive for customers to stay longer, it also decreases the prices for customers who stay longer, which is another major factor of retention. Regork executives can expect higher retention rates if they choose to implement this plan.

Limitations: Throughout this process, there were several limitations in this data set. One of the biggest limitations we might see is the recognition of irrelevant patterns due to an over-fit or under-fit model. It would also be helpful to have more information on our customer base which may explain demographic variable predictors. With this information, we can use factors like age and income range to determine what our ideal customer base is and eliminate any outliers. The last limitation is that the random forest model could be better with more trees.

final_project_evrard

Sam Evrard

2024-12-03