Customer Retention Analysis

Synopsis

We have been tasked with helping regork better understand their customer retention in order to assist them with launching their telecommunications business. To do this, we began exploring the data, to identify trends that we thought could be useful. Next, we built Machine learning models to better understand which of our variables have the greatest impact on the status of a customer. By the end of this presentation, we will confidently be able to identify the most influential variables on customer retention.

Data Preperation

To produce our results, these packages must be downloaded/ran

library(tidymodels)
library(tidyverse)
library(baguette)
library(vip)
library(here)
library(kernlab)
library(ggplot2)
library(ranger)
library(earth)
library(readr)
library(dplyr)

Below is a short description of all the packages

tidymodels: Simplifies modeling and machine learning in R with a focus on tidy data principles
tidyverse: Collection of R packages for data manipulation and visualization, promoting consistency and coherence in data analysis
baguette: Facilitates building and evaluating tree-based models like random forests and gradient boosting machines
vip: Computes and visualizes variable importance measures for predictive models
here: Manages file paths in R projects relative to the project’s root directory
kernlab: Implements kernel-based machine learning methods
ggplot2: Creates customizable graphics and visualizations in R
ranger: Quick implementation of random forests for classification and regression tasks
earth: Implements MARS algorithm for building regression models with piecewise linear functions
readr: Reads data files into R data frames
dplyr: Provides a grammar of data manipulation for data frames

retention <- read_csv("customer_retention.csv")
retention <- retention %>% 
  dplyr::mutate(Status = as.factor(Status))
retention <- drop_na(retention)

## [1] 0

Additionally we mutated Status to be a factor because the analysis is based around performing a classification model. We also droppped null values which totaled 11.

Exploratory Analysis

Customer Status by Contract

# Customer Status by Contract
ggplot(retention, aes(x = Contract, fill = Status)) +
  geom_bar(position = "fill") +
  labs(title = "Customer Status by Contract Type",
       x = "Contract Type",
       y = "Proportion of Customers (%)") +
  scale_y_continuous(labels = percent_format()) + 
  scale_fill_manual(values = c("Current" = "orange", "Left" = "purple"))

Customer Payment Method and Length of Contract

# Customer Payment Method and Length of Contract 
retention$SeniorCitizen <- factor(retention$SeniorCitizen, levels = c(0, 1), labels = c("Not Senior Citizen", "Senior Citizen"))
ggplot(retention, aes(x = PaymentMethod, fill = SeniorCitizen)) + 
  geom_bar(stat = "count", color = "black") + 
  scale_fill_manual(values = c("Not Senior Citizen" = "purple", "Senior Citizen" = "orange")) +
  facet_wrap(Contract ~ SeniorCitizen) +
  coord_flip() +
  ggtitle("Customer Payment Method and Length of Contract") +
  labs(y = "Count of Contract Type", x = "Customer Payment Method")

Creating a Summary by Tenure

# Creating a summary by tenure
tenure_summary <- retention %>%
  group_by(Tenure) %>%
  summarise(Total = n(),
            Left = sum(Status == "Left"),
            Current = sum(Status == "Current")) %>%
  mutate(LeftRate = Left / Total,
         CurrentRate = Current / Total)
ggplot(tenure_summary, aes(x = Tenure)) +
  geom_smooth(aes(y = CurrentRate, color = "Retention Rate"), size = 1.5) +
  geom_smooth(aes(y = LeftRate, color = "Leave Rate"), size = 1.5) +
  labs(title = "Retention and Leave Rates by Tenure",
       x = "Tenure in months",
       y = "Rate",
       color = "Rate Type") +
  scale_color_manual(values = c("Retention Rate" = "orange", "Leave Rate" = "purple"))

Machine Learning

Logistic Regression

We began our Analysis by creating a logistic regression model. This is a good starting model as it is easy to set up and is known to perform well.

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.797     5 0.00705 Preprocessor1_Model1
## 2 roc_auc  binary     0.844     5 0.00974 Preprocessor1_Model1

As you can see above, the logistic model returns a very nice Area under the curve. Area under the curve is essentially a model of how well our model performs at predicting future trends.

final_fit <- logistic_reg() %>%
  fit(Status ~ ., data = retention_train)

Looking at our confusion matrix, you can see how often our model correctly predicts whether a value is positive or negative. Overall, this model predicts with about 80 percent accuracy whether or not a variable will correctly influence status. This is a very strong number.

final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332

vip(final_fit)

This chart shows which variables are most important according to the logistic model. According to this machine learning set, the three most important were Tenure, ContractYearOne and ContractYearTwo.

Mars Model

Although our last model was very good, we wanted to create a Mars model to get a better understanding of our data’s influence on status.

## # A tibble: 50 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        15           1 roc_auc binary     0.850     5 0.00509 Preprocessor1_M…
##  2        16           1 roc_auc binary     0.849     5 0.00502 Preprocessor1_M…
##  3        19           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  4        20           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  5        21           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  6        22           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  7        23           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  8        25           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  9        26           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
## 10        27           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
## # ℹ 40 more rows

When looking at our results, you can see that we have an area under the curve of .850. This is greater than the number we got for our logistic model, making this a more accurate measure of our data. Moving forward, we selected our best hyperparameters to see which features were most influential according to this model.

best_hyperparameters <- select_best(tuning_results, metric = "roc_auc")

final_wf <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(mars_mod) %>%
  finalize_workflow(best_hyperparameters)

final_fit <- final_wf %>%
fit(data = retention_train)

final_fit %>%
  extract_fit_parsnip() %>%
  vip()

Looking at this model, our most important model the most important features are Tenure, Total Charges, and PaymentMethodElectronic Check.

final_fit %>%
  predict(retention_test) %>%
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1382  245
##    Left        158  312

Lastly, we created a confusion matrix for this model to demonstrate the accuracy of it. As you can see, it is slightly more accurate than the previous model at about 81 percent.

Decision Tree

We used rpart.plot() to plot our tree and given that it is a small tree, it is easy to visualize.

Our nodes broke off by contract, internet service, length of tenure, and tech support respectively.

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.785     5 0.00552 Preprocessor1_Model1
## 2 roc_auc  binary     0.803     5 0.00777 Preprocessor1_Model1

This gave us an AUC of .803 which is slightly better than our logistic model but worse than the MARS model.

## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001         15    40 roc_auc binary     0.814     5 0.00780
## 2    0.0000000178         15    40 roc_auc binary     0.814     5 0.00780
## 3    0.00000316           15    40 roc_auc binary     0.814     5 0.00780
## 4    0.0000000001         11    40 roc_auc binary     0.814     5 0.00827
## 5    0.0000000178         11    40 roc_auc binary     0.814     5 0.00827
## # ℹ 1 more variable: .config <chr>

Our best prediction had an AUC of .814

Based on the graph, Tenure, Contract, and Monthly/Total Charges were our most important features.

Business Analysis

Looking through our models, the most influential variables were Tenure and Contract are our most important variables. Total Charges and Monthly Charges were also influential.

Looking at Tenure, it is important that we target customers who have been with the company for a long time for our new telecommunications service. Some possible strategies to do this could be offering promotions and different perks for customers who have been with Regork for a certain amount of time. This will get our most loyal customers involved, which will ultimately lead to a successful telecommunications sector.

Looking at our next most influential variable, Contract Length is very important to understand who will be a good target for our new business sector. When looking back at our machine learning models, it was very clear to see that our customers with longer term contracts have a greater retention rate than customers with short term contracts. We would suggest having targeted ads and promotions sent out to our customers with long term contracts.

We also noticed that the Total Charges and Monthly Charges were influential variables. In any business, you make a large chunk of your money from your biggest spenders. This is no different with Regork. Understanding that customers who are willing to spend a lot have a lot of faith in Regork is imperative to our success. Because of this trust, we believe we could target this group of customers with ads and promotions as well.

Lastly, in our Mars model, which was most accurate according to all metrics, Using an Electronic Check as your Payment Method was the third most influential metric in predicting status. This could be because of the increase in core customers trusting and having access to this technology. This is another key demographic we could focus our efforts on.

Looking at our report, there are always certain limitations with any data analysis. To begin, our decision tree only has five nodes in it. This is not as in-depth as some decision trees, leaving more information out there to be collected. Next, Our models still have a high rate of inaccuracy in them as we saw with the confusion matrix. To get these models more accurate, there would have to be a large time and money commitment to get these numbers as precise as possible.