The data and models you will see later on in this analysis are in the hopes of explaining the status of our current and future customers. One of the highest costs to the company is getting new customers, and it is a lot more efficient if we are able to retain the ones we currently have. Utilizing the data on each customer which ranges from their type of contract, tenure, their type of phone service, and more. As mentioned, this is important because if we can correctly predict what our consumers will be in terms of their status we can target them directly and improve the factors that may make them leave while harping on the ones that help them stay. This increased retention will add to our revenue as a company and will decrease our costs of finding new customers.
We started our analysis by analyzing the data and the correlation of certain predictor variables to our main variable, customer status. We were able to find preliminary trends which gave us a better understanding of how to move forward in our machine learning later on. Once we found those trends we created three machine learning models: decision trees, logistic regression, and a Multivariate adaptive regression splines (MARS) model. These models gave us outputs for the Area-Under-Curve which tells us how well our models are predicting the data. We were able to pick out the best model from there and ran the test data on it to get a final prediction on the usability and accuracy of our created model in terms of how to understand the status of the customer.
Overall, this model will aid in understanding how and why the clients make the decision to stay or leave. We now know the most important variables that affect this decision (tenure, total charges, monthly charges, payment method, online security) and can market and show how our company can hit those better than our competition. This new information and model will increase our total revenue and retention rates and benefit our business as a whole.
In order to complete all necessary exploratory data analysis and machine learning with the data being used, these were the required packages:
library(tidymodels)
library(tidyverse)
library(baguette)
library(vip)
library(pdp)
library(here)
library(kernlab)
library(ggplot2)
library(ranger)
library(earth)
library(rsample)
library(rpart.plot)
setwd("C:/Users/ericw/Documents/UC 2023 Spring/Data Mining")
In order to begin our data analysis, it was first necessary to import the customer retention dataset being used. We then omitted all null values and changed the Status variable (our response variable) to be a factor so that we could our classification models.
retention <- read.csv("customer_retention.csv")
retention <- retention %>%
mutate(Status = as.factor(Status))
retention <- na.omit(retention)
We first began our exploratory data analysis by comparing the current customers to those who had left by gender.
# Breakdown of Retention Status by Gender
ggplot(retention, aes(Status)) +
geom_bar(fill = "purple") +
facet_wrap(~Gender) +
ggtitle("Status vs. Gender ") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(y = "Count of Customers", x = "Status")
This plot shows that there is little to no difference in retention status by gender. Therefore, we decided to look at areas other than demographics, where we hoped we might find more significant trends. Before doing so, we decided to plot the overall difference in status.
# Breakdown of Current Status
ggplot(retention, aes(Status)) +
geom_bar(fill = "lightblue") +
ggtitle("Status Breakdown") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(y = "Count of Customers", x = "Status")
As expected, we find similar results in our plot above as we did with our initial plot which broke down status by gender. Given this information, we then chose to analyze customer retention status by how many months they have been with the company.
# Customer Retention Status by Month
ggplot(retention, aes(x = Status, fill = Status)) +
geom_bar() +
facet_wrap(~Tenure) +
ggtitle("Status vs. Tenure (By Month) ") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(y = "Customer Count", x = "Status")
From this plot, we see that customers generally have a high degree of turnover within the first year, followed by a slightly lessened degree of turnover in the second year. After that, the number of customers leaving becomes very low, showing a certain degree of customer loyalty for customers who make it past the 1-2 year mark.
# Customer Retention Status vs. Length of Contract Plot
retention %>%
select(Contract, Status) %>%
group_by(Status) %>%
ggplot(aes(x = Status, y = Contract, fill = Contract)) +
geom_col() +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
ggtitle("Customer Retention Status vs. Length of Contract")
This plot shows that customers who sign one or two year contracts are much more likely to remain customers than those who sign month-to-month contracts. Month-to-month contracts seem to dominate the “Left” column, while there are very few one or two year contract customers who have left.
# Customer Retention Status vs. Payment Method Plot
retention %>%
select(PaymentMethod, Status) %>%
group_by(Status) %>%
ggplot(aes(x = Status, y = PaymentMethod, fill = PaymentMethod)) +
geom_col() +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
ggtitle("Customer Retention Status vs. Payment Method")
Through the above plot, we can see that customers who are paying through automatic payment are much less likely to leave than those who pay using an electronic check or a mailed check.
set.seed(123)
retention_split <- initial_split(retention, prop = 0.7, strata = "Status")
retention_train <- training(retention_split)
retention_test <- testing(retention_split)
# Logistic Regression Model
retention_kfolds <- vfold_cv(retention_train, v = 5, strata = Status)
log_retention <- logistic_reg() %>%
fit_resamples(Status ~ ., retention_kfolds) %>%
collect_metrics()
log_fit <- logistic_reg() %>%
fit(Status ~ ., data = retention_train)
log_fit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1362 225
## Left 178 332
As we seen in the AUC value of 0.845, this logistic regression model does a fairly decent job of predicting whether a customer is going to stay with the company or not. From our decision matrix, we can see that there is a slight bias towards false positives (picking a customer will stay when in reality they leave) over false negatives, however, this again shows the model is rather accurate overall.
# Decision Tree Model
# Step 1
dt_mod <- decision_tree(mode = "classification") %>%
set_engine("rpart")
# Step 2: create model recipe
model_recipe <- recipe(Status ~ ., data = retention_train)
# Step 3: fit model workflow
dt_fit <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
fit(data = retention_train)
rpart.plot::rpart.plot(dt_fit$fit$fit$fit)
# Step 4: K-fold
set.seed(123)
kfold <- vfold_cv(retention_train, v = 5)
# Train model
dt_results <- fit_resamples(dt_mod, model_recipe, kfold)
# Model results
collect_metrics(dt_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.785 5 0.00552 Preprocessor1_Model1
## 2 roc_auc binary 0.803 5 0.00777 Preprocessor1_Model1
# Step 5
dt_mod <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
# Create the hyperparameter grid
dt_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
# Train our model across the hyper parameter grid
set.seed(123)
dt_results <- tune_grid(dt_mod, model_recipe, resamples = kfold, grid = dt_hyper_grid)
# Get best results
show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estima…¹ mean n std_err .config
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0000000001 15 40 roc_auc binary 0.814 5 0.00780 Prepro…
## 2 0.0000000178 15 40 roc_auc binary 0.814 5 0.00780 Prepro…
## 3 0.00000316 15 40 roc_auc binary 0.814 5 0.00780 Prepro…
## 4 0.0000000001 11 40 roc_auc binary 0.814 5 0.00827 Prepro…
## 5 0.0000000178 11 40 roc_auc binary 0.814 5 0.00827 Prepro…
## # … with abbreviated variable name ¹.estimator
#Step 6
# Get best hyperparameter values
dt_best_model <- select_best(dt_results, metric = 'roc_auc')
# Put together final workflow
dt_final_wf <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
finalize_workflow(dt_best_model)
# Fit final workflow across entire training data
dt_final_fit <- dt_final_wf %>%
fit(data = retention_train)
# Plot feature importance
dt_final_fit %>%
extract_fit_parsnip() %>%
vip(20)
# The best model was able to have a mean AUC of .814 with 15 trees deep.
# Tenure, contract, and total Charges were the most relevant.
dt_final_fit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1346 275
## Left 194 282
As seen from the AUC value of 0.814, the above Decision Tree model, while still a good predictor of customer status, performs slightly worse than our Logistic Regression model. Our model also shows us that the most important response variables for this model were Tenure, Contract, and Total Charges.
# MARS Model
mars_recipe <- recipe(Status ~ ., data = retention_train)
mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
set_mode("classification")
mars_grid <- grid_regular(num_terms(range = c(1, 30)), prod_degree(), levels = 50)
mars_wf <- workflow() %>%
add_recipe(mars_recipe) %>%
add_model(mars_mod)
tuning_results <- mars_wf %>%
tune_grid(resamples = retention_kfolds, grid = mars_grid)
tuning_results %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean))
## # A tibble: 60 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 18 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 2 19 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 3 20 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 4 21 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 5 22 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 6 23 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 7 24 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 8 25 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 9 26 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## 10 27 1 roc_auc binary 0.851 5 0.00566 Preprocessor1_M…
## # … with 50 more rows
autoplot(tuning_results)
best_mars <- select_best(tuning_results, metric = "roc_auc")
final_mars_wf <- workflow() %>%
add_recipe(mars_recipe) %>%
add_model(mars_mod) %>%
finalize_workflow(best_mars)
final_mars_wf %>%
fit(data = retention_train) %>%
extract_fit_parsnip() %>%
vip(type = "rss")
final_mars_fit <- final_mars_wf %>%
fit(data = retention_train)
final_mars_fit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1383 242
## Left 157 315
final_mars_fit %>%
predict(retention_test, type = "prob") %>%
mutate(truth = retention_test$Status) %>%
roc_auc(truth, .pred_Current)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.847
With an AUC of 0.851, the MARS model was the best of the three models that we created. We found that the most important variables in this model were Tenure, Total Charges, Monthly Charges, Electronic Check Payment, and Online Security. The confusion matrix reinforced the predictive quality of our model, although there was still a slight preference toward false positives compared to false negatives in the matrix. When we used the test data as an input for the MARS model, we found it had an AUC value of 0.847, which still shows a very good ability to predict retention among our customers.
We were able to run a MARS model with an AUC or accuracy of 84.8%. The top 5 factors from this model that can predict whether the customer will stay or leave are the following:
Because Tenure was the most important predictor variable, we believe that Regork would benefit greatly from trying to move some customers who are currently on month-to-month contracts onto one or two year contracts in the future. In order to do so, we think it would be beneficial to provide some sort of a discount for customers who agree to longer-term contracts. For instance, perhaps Regork could provide one month off of a one year contract for new customers, as well as two months off of a two year contract. This would increase the likelihood that customers would agree to long-term contracts, and therefore increase the chances that they would remain customers in the long run.
Although we have confidence in our analysis and the models that we have built, it is important to acknowledge that there are some limitations to the analysis that we have provided. Although our final model had a strong predictive score, a broader dataset over a longer period of time would provide more insights into the Regork customers. We also must acknowledge that past customer data is not necessarily going to be indicative of how customers will act in the future. Despite this, we believe that the model that we have built is a strong one, and that the recommendations that we have provided will be helpful in informing future decision making within the company.