Introduction

The telecommunications market, like many others, depends on companies maintaining a consistent customer base, whether by attracting new customers or by retaining current ones. Because consistently attracting new customers is more expensive than keeping existing ones, Regork needs to focus on retaining its existing customer base at the lowest possible cost. A model that predicts which customers are on their way out lets the company understand why customers leave and apply the best possible solutions to retain them.

This analysis first uncovers underlying trends in the retention dataset through visualization. Then three classification models - logistic regression, a decision tree, and a multivariate adaptive regression splines (MARS) model - are tested to find the one that best predicts whether a customer will remain with the company or leave. The optimal model is chosen by the area under the ROC curve (AUC).

The information uncovered through this analysis can help determine which customers are most likely to leave the company. Every customer who leaves costs the company revenue, so we need to be proactive in order to maximize profit. To help retain these customers, I propose implementing an incentive plan for new customers. This allows the company to prove its worth to customers and reward them for staying during the period when the relationship is newest. Customers will be retained longer, which will also increase long-term profits.

Data Preparation

You will need to install and load the following packages to replicate the results for this report:

# tidying data and creating data visualizations
library(tidyverse)
# collection of packages for modeling and machine learning
library(tidymodels)
# constructing variable importance plots
library(vip)
# kernel-based machine learning methods for classification
library(kernlab)
# efficient model functions for bagging
library(baguette)
# constructing partial dependence plots from machine learning models
library(pdp)
# building regression models using techniques from Friedman's papers
library(earth)

You will also need to store this dataset in the environment:

retention <- read.csv('data/customer_retention.csv')

The response variable Status needs to be converted to a factor because this analysis relies on classification models. Missing values also need to be dropped for the modeling functions to run properly.

# convert the response variable to a factor for classification
retention <- mutate(retention, Status = factor(Status))

# drop rows containing missing values
retention <- na.omit(retention)
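
To see how much data the na.omit() call removes, a quick optional check (run before the call above) counts incomplete rows and missing values per column:

# number of rows with at least one missing value
sum(!complete.cases(retention))
# missing values per column
colSums(is.na(retention))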

Exploratory Analysis

This analysis will begin with a look at some trends regarding a customer’s tenure and charges.

ggplot(retention, aes(x = Tenure, y = MonthlyCharges)) +
  scale_y_continuous(labels = scales::dollar) +
  geom_point(alpha = 0.5, color = 'darkgray') +
  geom_smooth(aes(color = 'All Customers')) +
  geom_smooth(data = filter(retention, InternetService != 'No'),
              aes(color = 'Internet Customers')) +
  geom_smooth(data = filter(retention, PhoneService == 'Yes'),
              aes(color = 'Phone Customers')) +
  scale_color_manual(
    name = 'Customer Type',
    values = c('All Customers' = 'black',
               'Internet Customers' = 'blue2',
               'Phone Customers' = 'green3')
  ) +
  labs(
    title = 'Monthly Charges vs. Customer Tenure',
    x = 'Tenure (Months)',
    y = 'Monthly Charges'
  )

An analysis of monthly charges shows that the longer a customer remains with the company, the higher their monthly charge tends to be. Customers with phone service pay slightly more per month than average, while customers with internet service pay substantially more than average.
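
To put rough numbers behind these curves, the average monthly charge for each group can be computed directly (a quick sketch using the same filters as the plot above):

# mean monthly charge overall and for internet and phone customers
retention %>%
  summarize(
    all_customers = mean(MonthlyCharges),
    internet_customers = mean(MonthlyCharges[InternetService != 'No']),
    phone_customers = mean(MonthlyCharges[PhoneService == 'Yes'])
  )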

ggplot(retention, aes(x = Tenure, y = TotalCharges)) +
  scale_y_continuous(labels = scales::dollar) +
  geom_point(alpha = 0.5, color = 'darkgray') +
  geom_smooth(aes(color = 'All Customers')) +
  geom_smooth(data = filter(retention, InternetService != 'No'),
              aes(color = 'Internet Customers')) +
  geom_smooth(data = filter(retention, PhoneService == 'Yes'),
              aes(color = 'Phone Customers')) +
  scale_color_manual(
    name = 'Customer Type',
    values = c('All Customers' = 'black',
               'Internet Customers' = 'blue2',
               'Phone Customers' = 'green3')
  ) +
  labs(
    title = 'Total Charges vs. Customer Tenure',
    x = 'Tenure (Months)',
    y = 'Total Charges'
  )

Moving on to total charges: understandably, the longer a customer is with the company, the higher their cumulative total charge. The same relative trends hold for phone and internet customers compared to the average.

retention %>%
  ggplot(aes(Status, fill = Status)) +
  geom_bar(show.legend = FALSE) +
  scale_fill_manual(
    values = c('Current' = 'green3',
               'Left' = 'red2')
  ) +
  geom_text(aes(label = scales::percent(after_stat(count)/sum(after_stat(count)))), 
            stat = 'count', vjust = -0.5) +
  labs(title = 'Status Distribution')

Looking into the response variable Status reveals an imbalanced response: 73% of customers in the dataset are still with the company, while the other 27% have left.
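
The exact proportions behind the bar chart can be verified with a quick count:

# class balance of the response variable
retention %>%
  count(Status) %>%
  mutate(proportion = n / sum(n))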

Analyzing relationships between Status and other predictor variables allows us to get a deeper understanding of Status, so we’ll take a look at a previously analyzed predictor variable - Tenure.

ggplot(retention, aes(x = Status, y = Tenure, fill = Status)) +
  geom_boxplot(show.legend = FALSE) +
  scale_fill_manual(
    values = c('Current' = 'green3',
               'Left' = 'red2')
  ) +
  labs(
    title = 'Customer Status by Tenure',
    x = 'Customer Status',
    y = 'Tenure (Months)'
  )

A deeper dive into Status reveals that customers who have left the company have a much shorter median tenure than those who have remained. This makes newer customers a priority target group for Regork when predicting who might leave in the future.
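
The median tenures underlying the boxplot can be computed directly:

# median tenure for customers who stayed vs. those who left
retention %>%
  group_by(Status) %>%
  summarize(median_tenure = median(Tenure))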

Machine Learning

Logistic Regression

# 70/30 train/test split, stratified by the response
set.seed(123)
log_split <- initial_split(retention, prop = 0.7, strata = Status)
log_train <- training(log_split)
log_test <- testing(log_split)

# 5-fold cross-validation on the training set
set.seed(123)
log_kfolds <- vfold_cv(log_train, v = 5, strata = Status)

# fit a logistic regression across the resamples
log_results <- logistic_reg() %>%
  fit_resamples(Status ~ ., log_kfolds)

collect_metrics(log_results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.799     5 0.00401 Preprocessor1_Model1
## 2 brier_class binary     0.136     5 0.00188 Preprocessor1_Model1
## 3 roc_auc     binary     0.845     5 0.00521 Preprocessor1_Model1

This algorithm achieves acceptable accuracy and AUC, but it is not the optimal model. Other algorithms will be tested to find the best possible AUC.

# fit the logistic model on the full training set
log_cm <- logistic_reg() %>%
  fit(Status ~ ., data = log_train)

# confusion matrix from the model's test-set predictions
log_cm %>%
  predict(log_test) %>%
  bind_cols(log_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332

When this model errs, it produces more false negatives than false positives: it more often predicts that a customer is still current when they have in fact left than the reverse. In other words, the model is biased toward missing customers who churn.
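
To quantify this imbalance, yardstick can expand the confusion matrix into per-class metrics such as sensitivity and specificity (a quick sketch reusing the objects above; note that yardstick treats the first factor level, Current, as the event by default):

# expand the confusion matrix into accuracy, sensitivity,
# specificity, and other per-class metrics
log_cm %>%
  predict(log_test) %>%
  bind_cols(log_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class) %>%
  summary()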

Decision Trees

# 70/30 train/test split, stratified by the response
set.seed(123)
dt_split <- initial_split(retention, prop = 0.7, strata = Status)
dt_train <- training(dt_split)
dt_test <- testing(dt_split)

# default (untuned) decision tree, used below for the tree plot
dt_mod <- decision_tree(mode = 'classification') %>%
  set_engine('rpart')

dt_recipe <- recipe(Status ~ ., data = dt_train)

dt_fit <- workflow() %>%
  add_recipe(dt_recipe) %>%
  add_model(dt_mod) %>%
  fit(data = dt_train)

# 5-fold cross-validation for hyperparameter tuning
set.seed(123)
dt_kfolds <- vfold_cv(dt_train, v = 5)

# tunable decision tree specification
dt_mod <- decision_tree(
  mode = 'classification',
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine('rpart')

# regular grid over the three hyperparameters
dt_hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

set.seed(123)
dt_results <- tune_grid(dt_mod, dt_recipe, resamples = dt_kfolds, grid = dt_hyper_grid)

show_best(dt_results, metric = 'roc_auc', n = 5)
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001         15    40 roc_auc binary     0.814     5 0.00780
## 2    0.0000000178         15    40 roc_auc binary     0.814     5 0.00780
## 3    0.00000316           15    40 roc_auc binary     0.814     5 0.00780
## 4    0.0000000001         11    40 roc_auc binary     0.814     5 0.00827
## 5    0.0000000178         11    40 roc_auc binary     0.814     5 0.00827
## # ℹ 1 more variable: .config <chr>
# visualize the default (untuned) tree fit earlier
rpart.plot::rpart.plot(extract_fit_engine(dt_fit))

The best AUC from this model is lower than the AUC from the logistic regression algorithm, so this is definitely not the optimal model. Another algorithm will have to be tested in search of a better AUC.
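
The tuning results across the hyperparameter grid can also be visualized to confirm this, using tune's built-in plotting:

# plot performance across the cost complexity, tree depth, and
# minimum node size grid
autoplot(dt_results)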

# finalize the workflow with the best hyperparameters
dt_best <- select_best(dt_results, metric = 'roc_auc')

dt_final_wf <- workflow() %>% 
  add_recipe(dt_recipe) %>% 
  add_model(dt_mod) %>%
  finalize_workflow(dt_best)

# fit on the training set
dt_cm <- dt_final_wf %>%
  fit(data = dt_train)

# confusion matrix from the tuned model's test-set predictions
dt_cm %>%
  predict(dt_test) %>%
  bind_cols(dt_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1346  275
##    Left        194  282

Like the logistic regression algorithm, when this model errs it produces more false negatives than false positives, meaning it is biased toward predicting that a customer is still current when they have in fact left.

MARS Model (Optimal)

# 70/30 train/test split, stratified by the response
set.seed(123)
mars_split <- initial_split(retention, prop = 0.7, strata = Status)
mars_train <- training(mars_split)
mars_test <- testing(mars_split)

# transform and normalize numeric predictors before fitting
mars_recipe <- recipe(Status ~ ., data = mars_train) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

# 5-fold cross-validation for hyperparameter tuning
set.seed(123)
mars_kfolds <- vfold_cv(mars_train, v = 5, strata = Status)

# tunable MARS specification: number of retained terms and
# interaction degree
mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode('classification')

# grid over 1-30 retained terms and interaction degrees 1-2
mars_grid <- grid_regular(num_terms(range = c(1, 30)), prod_degree(), levels = 50)

mars_wf <- workflow() %>%
  add_recipe(mars_recipe) %>%
  add_model(mars_mod)

mars_results <- mars_wf %>%
  tune_grid(resamples = mars_kfolds, grid = mars_grid)

mars_results %>%
  collect_metrics() %>%
  filter(.metric == 'roc_auc') %>%
  arrange(desc(mean))
## # A tibble: 60 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        15           1 roc_auc binary     0.850     5 0.00509 Preprocessor1_M…
##  2        16           1 roc_auc binary     0.849     5 0.00502 Preprocessor1_M…
##  3        18           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  4        19           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  5        20           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  6        21           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  7        22           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  8        23           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  9        24           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
## 10        25           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
## # ℹ 50 more rows
# finalize the workflow with the best number of terms and degree
mars_best <- select_best(mars_results, metric = 'roc_auc')

mars_final_wf <- workflow() %>% 
  add_recipe(mars_recipe) %>% 
  add_model(mars_mod) %>%
  finalize_workflow(mars_best)

# fit the finalized workflow on the full training set
mars_final_fit <- mars_final_wf %>%
  fit(data = mars_train)

# ROC curve for the finalized model on the training data
mars_final_fit %>%
  predict(mars_train, type = 'prob') %>%
  mutate(truth = mars_train$Status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

The cross-validated AUC for this model is the best of the three models tested, which makes the MARS model the optimal choice for predicting customer retention. Further analysis of the features within this model lets us dig into what influences retention most and estimate the generalization error.
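
For reference, a small summary tibble collects the cross-validated AUCs reported above (values transcribed from each model's resampling output):

# cross-validated AUC by model, transcribed from the results above
tibble(
  model = c('Logistic Regression', 'Decision Tree', 'MARS'),
  cv_auc = c(0.845, 0.814, 0.850)
)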

# confusion matrix from the finalized model's test-set predictions
mars_final_fit %>%
  predict(mars_test) %>%
  bind_cols(mars_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1382  245
##    Left        158  312

Like both other algorithms, when this model errs it produces more false negatives than false positives, meaning it is biased toward predicting that a customer is still current when they have in fact left.

# variable importance for the ten most influential predictors
mars_final_fit %>%
  extract_fit_parsnip() %>%
  vip(10)

The most influential predictor variables are Tenure, Total Charges, and Payment Method - Electronic Check. In other words, how long a customer has stayed with the company, how much they have been charged in total, and whether they pay by electronic check are the most important features when determining whether a customer might leave. These will be important to keep in mind when deciding which customers to target with retention strategies.
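
Since the pdp package is loaded above but not yet used, a partial dependence sketch can show how the predicted probability of leaving changes with Tenure. This assumes the fitted workflow mars_final_fit from above; pdp_pred is a helper defined here for illustration, and the exact call may need adjusting to your version of pdp:

# helper returning the average predicted probability of 'Left'
# across the data for each grid value of the predictor
pdp_pred <- function(object, newdata) {
  mean(predict(object, new_data = newdata, type = 'prob')$.pred_Left)
}

# partial dependence of the predicted leave probability on Tenure
mars_final_fit %>%
  pdp::partial(
    pred.var = 'Tenure',
    pred.fun = pdp_pred,
    train = mars_train,
    grid.resolution = 20,
    type = 'regression'
  ) %>%
  autoplot() +
  labs(y = 'Average Predicted Probability of Leaving')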

# AUC of the finalized model on the held-out test set
mars_final_fit %>%
  predict(mars_test, type = 'prob') %>%
  mutate(truth = mars_test$Status) %>%
  roc_auc(truth, .pred_Current)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.848

The final test-set AUC (0.848) is nearly identical to the cross-validated AUC (0.850), so the generalization error is only marginally higher than the cross-validation error.

Learning from Observations

As a person responsible for making business decisions, seeing that the generalization error closely matches the cross-validation error is a good sign that the MARS model is well tuned for predicting customer retention. This gives me increased confidence that applying the model to future customer data would produce similarly reliable results.

Business Analysis/Conclusion

Most Influential Predictors & Managerial Focus

In the optimal model, the top predictor variable is Tenure, followed by Total Charges and Payment Method - Electronic Check.

To decrease the chances of customers leaving, I would focus most on Tenure because it is the most influential variable. Customers who stay longer with the company are satisfied with their services, so they are less likely to leave; those who have been with the company for a shorter period therefore deserve more attention. These newer customers are much more likely to run into critical issues such as initial dissatisfaction with a service, a mismatch between expectations and actual offerings, or failed onboarding, so the early months are an ideal time to provide incentives that encourage them to remain with the company. A sketch of how the fitted model could flag such customers follows.
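
The sketch below scores current, short-tenure customers and ranks them by predicted probability of leaving (the three-month cutoff is an assumption for illustration; because these customers include rows seen in training, the probabilities will be somewhat optimistic):

# rank current short-tenure customers by predicted leave probability
retention %>%
  filter(Status == 'Current', Tenure <= 3) %>%
  bind_cols(predict(mars_final_fit, new_data = ., type = 'prob')) %>%
  arrange(desc(.pred_Left)) %>%
  select(Tenure, MonthlyCharges, .pred_Left) %>%
  slice_head(n = 10)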

Predicted Customer Losses & Financial Implications

Looking at the customers who previously left, the majority had low tenures of 1-3 months. Because these customers had not yet established loyalty with the company, they were more likely to leave and find another service provider. An important demographic here is non-senior citizens, as these customers are still active members of the workforce and tend to have less disposable income.

The model predicts that Regork would lose $4,070.25 in revenue per month if no action is taken to help these customers remain with the company. However, this is only a predictive model, and not every short-tenure, non-senior customer is going to leave Regork. One way such a revenue-at-risk figure could be estimated is sketched below.
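
The sketch sums the monthly charges of test-set customers the model predicts will leave; it is illustrative only and may not reproduce the exact figure quoted above:

# total monthly charges of test-set customers predicted to leave
mars_final_fit %>%
  predict(mars_test) %>%
  bind_cols(mars_test) %>%
  filter(.pred_class == 'Left') %>%
  summarize(monthly_revenue_at_risk = sum(MonthlyCharges))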

Incentive Scheme

To retain these newer customers, I propose an incentive plan in which the company sends out benefits on a regular schedule tied to how long a customer has been with the company. In Month 1, provide a discount on the customer's first bill to make a good first impression and demonstrate the company's value quickly. Month 2 reinforces engagement and prevents early disengagement through an "exclusive gift": a free add-on service for the next month, such as a trial of a streaming platform. Month 3 builds on the first two months and begins transitioning these new customers into established ones by rewarding them with a credit on their next bill as a loyalty bonus. Once they are established customers, prove that switching providers is not worth the hassle by offering a long-term perk such as discounted internet or phone service for a few months.

It might seem like a lot of effort to continually provide incentives, but if the company proves its worth to the customer early on, it will earn that cost back as the customer stays longer. The model showed that Tenure is the most influential variable for retention, and the trend analysis showed that the longer customers stay with the company, the more they end up spending. Providing incentives will encourage customers to stay longer and spend more over time, so the company will profit.

Proposed Solution & Consumer Implications

The best way for Regork to retain customers is to implement an incentive plan targeted toward new customers. This allows the company to prove its worth to customers and reward them for staying during the period when the relationship is newest. Customers benefit from these programs because the incentives ease the strain on their budgets through discounts on service bills or add-ons. When customers see that they are saving money, they recognize the value the company provides and will continue purchasing services for a longer period of time.

Analysis Limitations

One limitation of this analysis is that the tuning grid for the MARS model considered at most 30 terms. A wider range of terms could potentially improve the AUC by capturing more complex relationships within the retention data. Another limitation is the limited demographic data: incorporating additional predictors such as age or income level would provide more context about the customer base and help predict retention more accurately.