Retaining Customers for Regork
Class: Data Mining (4080-001)
Author: Roan Zappanti
knitr::opts_chunk$set(
fig.width = 6, fig.height = 6,
warning = FALSE, error = FALSE, message = FALSE,
include = TRUE, echo = TRUE, strip.white = TRUE, highlight = TRUE,
results = 'hold'
)
library(tidyverse)
library(tidymodels)
library(RColorBrewer)
library(vip)
library(kernlab)
library(baguette)
library(pdp)
library(gridExtra)
path <- here::here("Data","customer_retention.csv")
data_tibble <- read_csv(path)
data <- as.data.frame(data_tibble)
The Problem: Retention, retention, retention! It’s all everyone talks about these days, and for good reason. Regork, working in telecommunications, has noticed an uptick in customer churn. In order to increase retention (and therefore revenue), this analysis looks at the types of customers likely to stay with the company, important features to those customers, and losses based on predicted customers to churn. While this serves as a start to Regork’s machine learning models for retention, it sets a great precedent for continual and marginal improvement and company retention, revenue, and profit.
Addressing Retention: In order to properly understand retention, this analysis:
TLDR: After employing the decision tree model to data provided from Regork, the following action plan is recommended based on predicted lost revenue of customer churn during the next business period:
glimpse(data)
Rows: 6,999
Columns: 20
$ Gender <chr> "Female", "Male", "Male", "Male", "Female", "Female", "Male",…
$ SeniorCitizen <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes", "No",…
$ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes",…
$ Tenure <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 25, 69, 5…
$ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "…
$ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "No", "Ye…
$ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber optic", "Fi…
$ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No", "Ye…
$ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "No", "Yes…
$ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No…
$ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "No",…
$ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "No"…
$ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No",…
$ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One year", "…
$ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "…
$ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "Bank tra…
$ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.75, 104.8…
$ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949.40, 301…
$ Status <chr> "Current", "Current", "Left", "Current", "Left", "Left", "Cur…
#Any empty rows or columns? Nope
data <- data[rowSums(is.na(data)) != ncol(data), ]
data <- data[ , colSums(is.na(data)) != nrow(data)]
#It looks like Total Charges is the only column with NA values.
data[!complete.cases(data),]
#Let's replace those with zeroes
data <- replace(data,is.na(data),0)
#Lets also turn Status into a factor, since it will be used in modeling
data$Status <- as.factor(data$Status)
#Finally, let's replace "No internet service" and "No phone service" with "No"
data[data == "No internet service"] <- "No"
data[data == "No phone service"] <- "No"
What is the baseline percent of status, what are the counts of status?
data %>% select(Status) %>%
mutate(Current_Num = ifelse(Status == "Current",1,0),
Left_Num = ifelse(Status == "Left",1,0)) %>%
summarize(Current_ct = sum(Current_Num),
Left_ct = sum(Left_Num),
Churn_Percent = sum(Left_Num)/sum(Current_Num))
Notice below: Relationships exist between multiple numeric variables and status. It seems that churn rates are higher during earlier periods of tenure. Additionally, a large spike of current customers appears at a lower monthly charge. Total charges (correlated to monthly charge and tenure) also sees higher retention in lower total charges.
# Looking at numeric columns
data %>% select_if(is.numeric) %>%
summary()
SeniorCitizen Tenure MonthlyCharges TotalCharges
Min. :0.0000 Min. : 0.00 Min. : 18.25 Min. : 0.0
1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.: 35.48 1st Qu.: 399.4
Median :0.0000 Median :29.00 Median : 70.35 Median :1394.5
Mean :0.1619 Mean :32.38 Mean : 64.75 Mean :2279.5
3rd Qu.:0.0000 3rd Qu.:55.00 3rd Qu.: 89.85 3rd Qu.:3790.4
Max. :1.0000 Max. :72.00 Max. :118.75 Max. :8684.8
numeric_var_names <- data %>% select_if(is.numeric) %>% colnames()
numeric_var_names[5] <- "Status"
#Viewing relationships between numeric variables and status
data %>%
select(all_of(numeric_var_names)) %>%
select(-data$SeniorCitizen) %>%
GGally::ggpairs(progress = FALSE)
#Lets check the unique values in each character column
data %>% select_if(is.character) %>%
sapply(function(x) unique(x))
$Gender
[1] "Female" "Male"
$Partner
[1] "Yes" "No"
$Dependents
[1] "No" "Yes"
$PhoneService
[1] "No" "Yes"
$MultipleLines
[1] "No" "Yes"
$InternetService
[1] "DSL" "Fiber optic" "No"
$OnlineSecurity
[1] "No" "Yes"
$OnlineBackup
[1] "Yes" "No"
$DeviceProtection
[1] "No" "Yes"
$TechSupport
[1] "No" "Yes"
$StreamingTV
[1] "No" "Yes"
$StreamingMovies
[1] "No" "Yes"
$Contract
[1] "Month-to-month" "One year" "Two year"
$PaperlessBilling
[1] "Yes" "No"
$PaymentMethod
[1] "Electronic check" "Mailed check" "Bank transfer (automatic)"
[4] "Credit card (automatic)"
#Split qualitative variable names into service and demographic lists for later use against status and each other.
Demographic_vars <- c("Gender","Partner","Dependents","SeniorCitizen")
Service_Vars <- colnames(data %>% select_if(is.character) %>% select(-any_of(Demographic_vars)))
# create train/test split data for models
set.seed(123) # for reproducibility
split <- initial_split(data, prop = 0.7)
data_train <- training(split)
data_test <- testing(split)
kfold <- vfold_cv(data_train, v = 5)
# train model via cross validation
lr_mod <- logistic_reg()
results <- fit_resamples(lr_mod, Status ~ ., kfold)
# Check out the cross-validated ROC-AUCs and their average using the training data
collect_metrics(results, summarize = FALSE) %>% filter(.metric == "roc_auc")
collect_metrics(results) %>% filter(.metric == "roc_auc")
NA
# retrain the model across the entire training data
final_fit_lr <- logistic_reg() %>%
fit(Status ~ ., data = data_train)
# Creating tests against test data.
final_fit_lr %>%
predict(data_train) %>%
bind_cols(data_train %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
Truth
Prediction Current Left
Current 3191 589
Left 373 746
log_regression_auc_plot <-final_fit_lr %>%
predict(data_train, type = "prob") %>%
mutate(truth = data_train$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
log_regression_auc_plot
log_regression_conf_matrix <- final_fit_lr %>%
predict(data_train) %>%
mutate(truth = data_train$Status) %>%
conf_mat(truth, .pred_class)
log_regression_conf_matrix
Truth
Prediction Current Left
Current 3191 589
Left 373 746
log_regression_auc <- final_fit_lr %>%
predict(data_train, type = "prob") %>%
mutate(truth = data_train$Status) %>%
roc_auc(truth, .pred_Current)
log_regression_auc
NA
Testing against the test data for the first time, the final logistic regression predicts that individuals will stay with the company with 84% accuracy. Additionally, Tenure and contract type serve as crucial metrics for predicting the churn of a customer based on the logistic regression model.
log_regression_vip <- vip(final_fit_lr$fit, num_features = 20)
log_regression_vip
This is similar to above, but with regularization variables that need to be tuned with a recipe and normalized. This is better than basic logistic regression because it: - No longer assumes a linear relationship - Keeps multicolinnearity in mind, which has already been identified in the data
set.seed(123)
# Step 1: create ridge model object
reg_mod <- logistic_reg(penalty = tune(), mixture = tune()) %>%
set_engine("glmnet") %>%
set_mode("classification")
# Step 2: Create a tuning grid
reg_grid <- grid_regular(mixture(), penalty(), levels = 5)
# Step 3: create model & preprocessing recipe
model_recipe <- recipe(Status ~ ., data = data_train) %>%
step_normalize(all_numeric_predictors()) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
# Step: Resampling of data
kfold2 <- vfold_cv(data_train, v = 5, strata = "Status")
# Step 4: create a tuned workflow object to combine the recipe & model
tuning_results1 <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(reg_mod) %>%
tune_grid(resamples = kfold2, grid = reg_grid)
tuning_results1 %>%
show_best(metric = "roc_auc")
NA
# Based on the results, re-tune with new parameters
# Create a tuning grid
reg_grid2 <- grid_regular(mixture(range = c(0,0.5)), penalty(range = c(0,0.003)), levels = 5)
# Create model & preprocessing recipe
model_recipe2 <- recipe(Status ~ ., data = data_train) %>%
step_normalize(all_numeric_predictors()) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
# Step 4: create a tuned workflow object to combine the recipe & model
tuning_results2 <- workflow() %>%
add_recipe(model_recipe2) %>%
add_model(reg_mod) %>%
tune_grid(resamples = kfold2, grid = reg_grid2)
tuning_results2 %>%
show_best(metric = "roc_auc")
tuning_results2 %>%
select_best(metric = "roc_auc")
best_hyperparameters_reg <- select_best(tuning_results2, metric = "roc_auc")
# Step 5: Create a finalized workflow
final_wf_reg <- workflow() %>%
add_recipe(model_recipe2) %>%
add_model(reg_mod) %>%
finalize_workflow(best_hyperparameters_reg)
# Step 6: Fit final model on all training data
final_fit_reg <- final_wf_reg %>%
fit(data = data_train)
# Step 7: Assess top 20 most influential features
reg_regression_vip <- final_fit_reg %>%
extract_fit_parsnip() %>%
vip(num_features = 20, geom = "point")
reg_regression_vip
# Step 8: Test the final, fitted workflow against test data and output the result.
reg_regression_auc_plot <- final_fit_reg %>%
predict(data_train, type = "prob") %>%
mutate(truth = data_train$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
reg_regression_auc_plot
reg_regression_conf_matrix <- final_fit_reg %>%
predict(data_train) %>%
bind_cols(data_train %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
reg_regression_conf_matrix
Truth
Prediction Current Left
Current 3564 1335
Left 0 0
reg_regression_auc <- final_fit_reg %>%
predict(data_train, type = "prob") %>%
mutate(truth = data_train$Status) %>%
roc_auc(truth, .pred_Current)
reg_regression_auc
# Create a modified decision tree
dt_mod <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
# Create model recipe
dt_recipe <- recipe(Status ~ ., data = data_train)
# create the hyperparameter grid
dt_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
# train model across the hyper parameter grid
dt_results <- tune_grid(dt_mod, dt_recipe, resamples = kfold, grid = dt_hyper_grid)
# get best results
show_best(dt_results, metric = "roc_auc", n = 5)
best_dt_params <- select_best(dt_results, metric = "roc_auc")
best_dt_params
# put together final workflow
dt_final_wf <- workflow() %>%
add_recipe(dt_recipe) %>%
add_model(dt_mod) %>%
finalize_workflow(best_dt_params)
# fit final workflow across entire training data
dt_final_fit <- dt_final_wf %>%
fit(data = data_train)
# Test decision tree across test data and output
decision_tree_auc <- dt_final_fit %>%
predict(data_train, type = "prob") %>%
mutate(truth = data_train$Status) %>%
roc_auc(truth, .pred_Current)
decision_tree_auc
decision_tree_conf_matrix <- dt_final_fit %>%
predict(data_train) %>%
mutate(truth = data_train$Status) %>%
conf_mat(truth = truth, estimate = .pred_class)
decision_tree_conf_matrix
Truth
Prediction Current Left
Current 3182 502
Left 382 833
decision_tree_auc_plot <- dt_final_fit %>%
predict(data_train, type = "prob") %>%
mutate(truth = data_train$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
decision_tree_auc_plot
# plot feature importance
decision_tree_vip <- dt_final_fit %>%
extract_fit_parsnip() %>%
vip(20)
decision_tree_vip
Note: All graphs in the following analysis are presented in the following model order
Taking a look at the confusion matrices, we can see that the regularized regression model chose to predict EACH customer to stay, with nobody leaving. After re-testing multiple times, this really is the outcome of the model. Based on this information, we can conclude that the Regularized Regression model will not serve our interests at Regork.
log_regression_conf_matrix
Truth
Prediction Current Left
Current 3191 589
Left 373 746
reg_regression_conf_matrix
Truth
Prediction Current Left
Current 3564 1335
Left 0 0
decision_tree_conf_matrix
Truth
Prediction Current Left
Current 3182 502
Left 382 833
This is where the models get interesting. Notice that, between the two remaining models, the decision tree outperforms the basic logistic regression by multiple points. We realize that the decision tree serves as our chosen model, for it captures the complex relationships (which exist as per EDA in beginning of this paper) affecting Status.
log_regression_auc
reg_regression_auc
decision_tree_auc
Finally, a quick visualization of the curves complementing each AUC metric.
grid.arrange(log_regression_auc_plot, reg_regression_auc_plot, decision_tree_auc_plot, ncol = 3)
Since we are using the decision tree model, let’s look at the important variables identified through that model and compare it to the basic Logistic Regression VIP:
grid.arrange(decision_tree_vip, log_regression_vip, ncol = 2)
Our two best models both note that Tenure and Contract are extremely
important features to predict customer churn. Additionally, the logistic
regression model notes that a two-year contract has a larger effect on
customer retention than a one-year contract. From this, we learn that
longer tenure and two-year contracts increase incentives to stay with
Regork.
Now honing into the decision tree VIP only:
decision_tree_vip
After contract and tenure, money serves as the main factor affecting customer retention. This makes sense, as telecommunications serves as a basic commodity in the market, making price an extremely important factor in choosing a service company.
But how does this model perform against unseen data? Take a look at the comparison between training AUC and testing AUC below (in order).
dt_final_fit %>%
predict(data_train, type = "prob") %>%
mutate(truth = data_train$Status) %>%
roc_auc(truth, .pred_Current)
dt_final_fit %>%
predict(data_test, type = "prob") %>%
mutate(truth = data_test$Status) %>%
roc_auc(truth, .pred_Current)
Survey says: A slightly lower accuracy, but still acceptable for Regork’s purposes.
Based on the decision tree’s analysis, the top three factors to focus on are:
Before creating an action plan to retain customers, let’s figure out the potential losses related customers who are predicted to leave.
test_prediction <- as.data.frame(predict(dt_final_fit, data_test))
churn_customers <- data_test %>%
mutate(prediction = test_prediction$.pred_class) %>%
filter(Status == "Current", prediction == "Left")
head(churn_customers %>% select(Status, prediction))
Notice we’re only looking at customers still with the company who are predicted to leave. Below shows the predicted worth of these customers’ monthly revenue.
churn_customers %>%
summarize(potential_losses = sum(MonthlyCharges),
avg_loss_per_customer = sum(MonthlyCharges)/n())
NA
Oof. It looks like Regork stands to lose around
$16,800 per month due to customer churn. That’s an
estimated $201,600 per year… Just on the test data!
Finally, it seems that each customer brings in around $75 per month.
We know that customers bring in an average of $75 per month, and Regork requires a 20% top-line profit to fulfill all expenses (and other programs) and still make a bottom line profit. This leaves us with ~$60 to spend per customer on a new promotional program.
Our decision tree model indicated that Tenure, Contract length, and Total Spend amount serve as important retention components for a customer. In order to keep retain customers, consider the following based on a budget of $201,600 per year:
After testing Logistic Regression, Regularized Logistic Regression, and Decision Tree Models, the Decision Tree Model provided the best accuracy with an AUC of 0.85 on training data and 0.83 on test data.
Based on the results from the decision tree, Regork stands to lose $16,800 per month, $201,600 per year, and $75 per customer from current customers predicted to churn. Additionally, the model noted that Tenure, Contract Type, and Total Cost serve as the three most important features in determining customer retention.
In order to reduce customer churn and later losses, the following action plan is recommended: