Some Text
We will use a variety of libraries to support our initial exploratory analysis and to build our machine learning models.
library(tidyverse) #Data cleaning, manipulation, visualization
library(tidymodels) #Creating machine learning models
library(vip) #Constructing variable importance plots
library(earth) #Creating MARS models
library(baguette) #Creating bagged tree models
library(pdp) #Constructing partial dependency plots
library(ranger) #Creating Random Forest models
We will be analyzing a data set containing customer retention information. The data is provided as a CSV flat file, which we first need to import in order to conduct our analysis. The data set contains information on roughly 7,000 customers.
#Grab file path to data file
path <- here::here("customer_retention.csv")
#Import data file
retention <- readr::read_csv(path)
For our purposes, we will omit null values, as they may bias the results of our machine learning models or reduce their accuracy.
#Recode response variable as a factor
retention <- mutate(retention, Status = as.factor(Status))
#Remove null values
retention <- na.omit(retention)
In order to build our machine learning models later, we must split the data into a set used to train each model and a set used to test its performance. We will be using a 70-30 train-test split in order to optimize the generalizability of our models.
#For reproducibility
set.seed(123)
#70-30 train-test split
split <- initial_split(retention, prop = 0.7, strata = "Status")
train <- training(split)
test <- testing(split)
Before beginning to build our machine learning models, it is important that we understand some of the underlying trends and relationships in our data.
#payment by account status
retention %>%
group_by(PaymentMethod, Status) %>%
summarize(count = n()) %>%
ggplot(aes(x = PaymentMethod, y = count, fill = Status)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Customer Payment Method by Account Status",
x = "Payment Method",
y = "Number of Customers") +
scale_fill_manual(values = c("dodgerblue","orange2")) +
theme_minimal()
Some text
#proportion of total customers by payment method
proportions <- retention %>%
group_by(PaymentMethod) %>%
summarize(count = n()) %>%
mutate(proportion = count / sum(count))
ggplot(proportions, aes(x = PaymentMethod, y = proportion)) +
geom_bar(stat = "identity")
Some Text
#customers that left by tenure
retention %>%
filter(Status == "Left") %>%
group_by(Tenure) %>%
summarise(count = n()) %>%
ggplot(aes(x = Tenure, y = count)) +
geom_line() +
geom_point()
Some text
#customers that left by contract type
retention %>%
filter(Status == "Left") %>%
ggplot(aes(x = Contract)) +
geom_bar()
Some Text
The problem we are looking to solve is a classification problem. To assess the performance of our models, we will be looking at the area under the ROC curve (AUC).
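As a quick illustration of how this metric is computed with yardstick (a sketch only; fitted_model below is a placeholder for any of the models we fit later), we would predict class probabilities on the test set and pass them to roc_auc():
# Sketch: fitted_model is a placeholder for any fitted model or workflow built below
fitted_model %>%
  predict(test, type = "prob") %>%
  bind_cols(test %>% select(Status)) %>%
  roc_auc(truth = Status, .pred_Current) # "Current" is the first factor level, yardstick's default event class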
Before building any models, we create a recipe that normalizes our numeric predictors, giving all variables a common, comparable unit of measure, and dummy encodes our nominal (categorical) predictors so that they take on a numeric form.
recipe <- recipe(Status ~ ., data = train) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
We begin with a logistic regression model, using 5-fold cross-validation on the training data to estimate its out-of-the-box performance.
set.seed(123)
lr_kfold <- vfold_cv(train, v = 5, strata = Status)
lr_results <- logistic_reg() %>%
fit_resamples(Status ~ ., lr_kfold)
#mean AUC
collect_metrics(lr_results) %>% filter(.metric == "roc_auc")
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.845 5 0.00521 Preprocessor1_Model1
#AUC across all folds
collect_metrics(lr_results, summarize = FALSE) %>% filter(.metric == "roc_auc")
## # A tibble: 5 × 5
## id .metric .estimator .estimate .config
## <chr> <chr> <chr> <dbl> <chr>
## 1 Fold1 roc_auc binary 0.836 Preprocessor1_Model1
## 2 Fold2 roc_auc binary 0.839 Preprocessor1_Model1
## 3 Fold3 roc_auc binary 0.865 Preprocessor1_Model1
## 4 Fold4 roc_auc binary 0.839 Preprocessor1_Model1
## 5 Fold5 roc_auc binary 0.845 Preprocessor1_Model1
Based on the mean AUC from our 5-fold resampling procedure (0.845), our logistic regression model discriminates well between customers who stay and customers who leave.
Looking at the individual folds, the AUC ranges from 0.836 in the worst fold to 0.865 in the best.
lr_final <- logistic_reg() %>%
fit(Status ~ ., data = train)
lr_final %>%
predict(test) %>%
bind_cols(test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1362 225
## Left 178 332
As we can see from our confusion matrix, our logistic regression model is more prone to false negatives (customers who left but were predicted to stay) than to false positives. Because we are trying to predict customer retention, false negatives are the costlier error: a customer we fail to flag cannot be targeted with retention efforts, whereas a false positive only means reaching out to a customer who would have stayed anyway, so erring toward false positives lets Regork act before a departure ever occurs.
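To put numbers on this trade-off, a short sketch using yardstick's class metrics on the same test-set predictions (treating "Left" as the event of interest) could look like this:
# Sketch: sensitivity and specificity with "Left" treated as the event of interest
lr_preds <- lr_final %>%
  predict(test) %>%
  bind_cols(test %>% select(Status))
sens(lr_preds, truth = Status, estimate = .pred_class, event_level = "second")
spec(lr_preds, truth = Status, estimate = .pred_class, event_level = "second")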
vip(lr_final$fit, num_features = 10)
Tenure, contract length, and total charges are among the most influential variables affecting customer retention according to our logistic regression model.
While our out-of-the-box logistic regression model performed fairly well, we thought we might be able to increase the AUC by performing some hyperparameter tuning.
lr_hyper <- logistic_reg(mixture = tune(), penalty = tune()) %>%
set_engine("glmnet") %>%
set_mode("classification")
lr_grid <- grid_regular(mixture(), penalty(), levels = 10)
lr_wf <- workflow() %>%
add_recipe(recipe) %>%
add_model(lr_hyper)
lr_tuning_results <- lr_wf %>%
tune_grid(resamples = lr_kfold, grid = lr_grid)
lr_tuning_results %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean)) %>%
print(n = 5)
## # A tibble: 100 × 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0000000001 0.889 roc_auc binary 0.845 5 0.00523 Preprocessor1_Mo…
## 2 0.00000000129 0.889 roc_auc binary 0.845 5 0.00523 Preprocessor1_Mo…
## 3 0.0000000167 0.889 roc_auc binary 0.845 5 0.00523 Preprocessor1_Mo…
## 4 0.000000215 0.889 roc_auc binary 0.845 5 0.00523 Preprocessor1_Mo…
## 5 0.00000278 0.889 roc_auc binary 0.845 5 0.00523 Preprocessor1_Mo…
## # ℹ 95 more rows
#autoplot(lr_tuning_results)
Our hyperparameter tuning has increased our AUC; however, the effect is only marginal.
lr_best_hyper <- select_best(lr_tuning_results, metric = "roc_auc")
lr_final_wf <- workflow() %>%
add_recipe(recipe) %>%
add_model(lr_hyper) %>%
finalize_workflow(lr_best_hyper)
lr_hyper_final <- lr_final_wf %>%
fit(data = train)
lr_hyper_final %>%
predict(test) %>%
bind_cols(test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1360 225
## Left 180 332
Tuning our model affected our confusion matrix only slightly as well.
lr_hyper_final %>%
extract_fit_parsnip() %>%
vip(10)
Next, we build a multivariate adaptive regression splines (MARS) model, tuning the number of retained terms and the degree of interactions with 5-fold cross-validation.
# create MARS model object
mars <- mars(mode = "classification", num_terms = tune(), prod_degree = tune())
set.seed(123)
mars_kfold <- vfold_cv(train, v = 5)
# create a hyper parameter tuning grid
mars_grid <- grid_regular(
num_terms(range = c(1, 100)),
prod_degree(),
levels = 25
)
# train our model across the hyper parameter grid
mars_results <- tune_grid(mars, recipe, resamples = mars_kfold, grid = mars_grid)
# get best results
show_best(mars_results, metric = "roc_auc")
## # A tibble: 5 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 17 1 roc_auc binary 0.847 5 0.00946 Preprocessor1_Mo…
## 2 21 1 roc_auc binary 0.847 5 0.00930 Preprocessor1_Mo…
## 3 25 1 roc_auc binary 0.847 5 0.00930 Preprocessor1_Mo…
## 4 29 1 roc_auc binary 0.847 5 0.00930 Preprocessor1_Mo…
## 5 34 1 roc_auc binary 0.847 5 0.00930 Preprocessor1_Mo…
Our best MARS configurations use roughly 17 to 34 terms with no interaction effects (prod_degree = 1) and achieve a mean AUC of about 0.847, slightly better than our tuned logistic regression model.
mars_best_hyperparameters <- select_best(mars_results, metric = "roc_auc")
mars_final_wf <- workflow() %>%
add_model(mars) %>%
add_recipe(recipe) %>%
finalize_workflow(mars_best_hyperparameters)
mars_final_fit <- mars_final_wf %>%
fit(data = train)
mars_final_fit %>%
predict(test) %>%
bind_cols(test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1382 242
## Left 158 315
Like our logistic regression model, the MARS model produces more false negatives (242) than false positives (158) on the test set.
mars_final_wf %>%
fit(data = train) %>%
extract_fit_parsnip() %>%
vip(10, type = "rss")
Some Text
We now fit a single decision tree, which also gives us an easily interpretable view of the splits that drive customer attrition.
dt <- decision_tree(mode = "classification") %>%
set_engine("rpart")
# fit model workflow
dt_fit <- workflow() %>%
add_recipe(recipe) %>%
add_model(dt) %>%
fit(data = train)
# create resampling procedure
set.seed(123)
dt_kfold <- vfold_cv(train, v = 5)
# train model
dt_results <- fit_resamples(dt, recipe, dt_kfold)
# model results
collect_metrics(dt_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.787 5 0.00560 Preprocessor1_Model1
## 2 roc_auc binary 0.703 5 0.00571 Preprocessor1_Model1
Our AUC is considerably lower than it is using a MARS or logistic regression model.
rpart.plot::rpart.plot(dt_fit$fit$fit$fit)
There are only 3 levels to this tree, so it is quite possible that our model's performance could be improved by tuning.
We will tune three hyperparameters for the decision tree: cost_complexity, the pruning penalty charged for each additional split; tree_depth, the maximum depth of the tree; and min_n, the minimum number of observations a node must contain before it can be split further.
#hyperparamater tuning
dt_hyper <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
# create the hyperparameter grid
dt_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
# train our model across the hyper parameter grid
set.seed(123)
dt_hyper_results <- tune_grid(dt_hyper,
recipe,
resamples = dt_kfold,
grid = dt_hyper_grid)
# get best results
show_best(dt_hyper_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 8 30 roc_auc binary 0.820 5 0.00898
## 2 0.0000000178 8 30 roc_auc binary 0.820 5 0.00898
## 3 0.00000316 8 30 roc_auc binary 0.820 5 0.00898
## 4 0.0000000001 8 40 roc_auc binary 0.819 5 0.00936
## 5 0.0000000178 8 40 roc_auc binary 0.819 5 0.00936
## # ℹ 1 more variable: .config <chr>
Tuning our decision tree increased our mean AUC considerably, from about 0.70 without tuning to 0.82 with hyperparameter tuning.
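To see which hyperparameters drive this improvement, we could visualize the grid results with tune's autoplot() method (a quick sketch rather than part of the original analysis):
# Sketch: plot mean AUC across the cost complexity, tree depth, and min_n grid
autoplot(dt_hyper_results, metric = "roc_auc")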
# get best hyperparameter values
dt_best_model <- select_best(dt_hyper_results, metric = 'roc_auc')
# put together final workflow
dt_final_wf <- workflow() %>%
add_recipe(recipe) %>%
add_model(dt_hyper) %>%
finalize_workflow(dt_best_model)
# fit final workflow across entire training data
dt_final_fit <- dt_final_wf %>%
fit(data = train)
rpart.plot::rpart.plot(dt_final_fit$fit$fit$fit)
# confusion matrix
dt_final_fit %>%
predict(test) %>%
bind_cols(test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1316 243
## Left 224 314
The tuned tree's test-set errors are more balanced than our earlier models', though it still produces slightly more false negatives (243) than false positives (224).
# plot feature importance
dt_final_fit %>%
extract_fit_parsnip() %>%
vip(10)
Some Text
Next, we fit a bagged CART model, which fits many decision trees to bootstrap resamples of the training data and averages their predictions to reduce the variance of a single tree.
# create resampling procedure
set.seed(123)
bag_kfold <- vfold_cv(train, v = 5)
# create bagged CART model object with 5 bagged trees
bag_mod <- bag_tree() %>%
set_engine("rpart", times = 5) %>%
set_mode("classification")
# train model
bag_results <- fit_resamples(bag_mod, recipe, bag_kfold)
# model results
collect_metrics(bag_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.762 5 0.00879 Preprocessor1_Model1
## 2 roc_auc binary 0.769 5 0.00884 Preprocessor1_Model1
Our out-of-the-box AUC for the bagged CART model is higher than it was for our decision tree without tuning, but lower than that of our MARS and logistic regression models. With hyperparameter tuning, we expect to see the AUC rise.
The main hyperparameter we tune for the bagged tree model is times, the number of bootstrap resamples (and therefore the number of trees) that make up the ensemble.
bag_hyper <- bag_tree() %>%
set_engine("rpart", times = tune()) %>%
set_mode("classification")
# create the hyperparameter grid
bag_hyper_grid <- expand.grid(times = c(5, 25, 50, 100, 200, 300))
# train our model across the hyper parameter grid
set.seed(123)
bag_hyper_results <- tune_grid(bag_hyper, recipe, resamples = bag_kfold, grid = bag_hyper_grid)
# model results
show_best(bag_hyper_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 7
## times .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 200 roc_auc binary 0.819 5 0.0106 Preprocessor1_Model5
## 2 300 roc_auc binary 0.819 5 0.0108 Preprocessor1_Model6
## 3 100 roc_auc binary 0.816 5 0.0105 Preprocessor1_Model4
## 4 50 roc_auc binary 0.811 5 0.0108 Preprocessor1_Model3
## 5 25 roc_auc binary 0.807 5 0.00897 Preprocessor1_Model2
Increasing the number of bagged trees raised the mean AUC from 0.769 out of the box to roughly 0.82 with 200 or more bootstrap resamples, which is still slightly below our MARS and logistic regression models.
# identify best model
bag_best_hyperparameters <- bag_hyper_results %>%
select_best(metric = "roc_auc")
# finalize workflow object
bag_final_wf <- workflow() %>%
add_recipe(recipe) %>%
add_model(bag_hyper) %>%
finalize_workflow(bag_best_hyperparameters)
# final fit on training data
bag_final_fit <- bag_final_wf %>%
fit(data = train)
# confusion matrix
bag_final_fit %>%
predict(test) %>%
bind_cols(test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1340 266
## Left 200 291
This model is still more prone to false negatives.
bag_vip <- bag_final_fit %>%
extract_fit_parsnip() %>%
.[['fit']] %>%
var_imp() %>%
slice(1:10)
ggplot(bag_vip, aes(value, reorder(term, value))) +
geom_col() +
ylab(NULL) +
xlab("Feature importance")
The most important features for our CART bagged model are consistent with those we found to be most important in our decision tree model.
Finally, we fit a random forest, which extends bagging by also sampling a random subset of predictors at each split, further decorrelating the trees in the ensemble.
# create resampling procedure
set.seed(123)
rf_kfold <- vfold_cv(train, v = 5)
# create random forest object
rf_mod <- rand_forest(mode = "classification") %>%
set_engine("ranger")
# train model
rf_results <- fit_resamples(rf_mod, recipe, rf_kfold)
# model results
collect_metrics(rf_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.801 5 0.00612 Preprocessor1_Model1
## 2 roc_auc binary 0.842 5 0.0104 Preprocessor1_Model1
Our random forest model’s performance is on par with our best models even without tuning.
For the random forest we tune three hyperparameters: trees, the number of trees in the forest; mtry, the number of predictors randomly sampled as split candidates at each node; and min_n, the minimum number of observations required in a node for it to be split.
# create random forest model object with tuning option
rf_hyper <- rand_forest(
mode = "classification",
trees = tune(),
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "impurity")
# create the hyperparameter grid
rf_hyper_grid <- grid_regular(
trees(range = c(50, 800)),
mtry(range = c(2, 30)),
min_n(range = c(1, 20)),
levels = 5
)
# train our model across the hyper parameter grid
set.seed(123)
rf_hyper_results <- tune_grid(rf_hyper, recipe, resamples = rf_kfold, grid = rf_hyper_grid)
# model results
show_best(rf_hyper_results, metric = "roc_auc")
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 9 612 20 roc_auc binary 0.841 5 0.0107 Preprocessor1_Model1…
## 2 9 425 20 roc_auc binary 0.840 5 0.0106 Preprocessor1_Model1…
## 3 9 237 20 roc_auc binary 0.840 5 0.0107 Preprocessor1_Model1…
## 4 2 237 1 roc_auc binary 0.840 5 0.0112 Preprocessor1_Model0…
## 5 9 800 20 roc_auc binary 0.840 5 0.0107 Preprocessor1_Model1…
Tuning did not meaningfully improve the random forest: the best tuned configuration reaches a mean AUC of 0.841, essentially the same as the 0.842 we obtained out of the box.
# get optimal hyperparameters
rf_best_hyperparameters <- select_best(rf_hyper_results, metric = "roc_auc")
# create final workflow object
final_rf_wf <- workflow() %>%
add_recipe(recipe) %>%
add_model(rf_hyper) %>%
finalize_workflow(rf_best_hyperparameters)
# fit final workflow object
rf_final_fit <- final_rf_wf %>%
fit(data = train)
# confusion matrix
rf_final_fit %>%
predict(test) %>%
bind_cols(test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1376 261
## Left 164 296
Like our other models, the random forest produces more false negatives (261) than false positives (164) on the test set.
# plot feature importance
rf_final_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 10)
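The pdp package we loaded at the start has not been used yet; as a hedged sketch (using pdp's model-agnostic pred.fun interface, and assuming rf_final_fit and train are still in memory), we could draw a partial dependence plot for Tenure, one of the influential predictors identified earlier:
# Sketch: partial dependence of the predicted probability of leaving on Tenure,
# using pdp's model-agnostic interface with the fitted random forest workflow
pd_fun <- function(object, newdata) {
  # average predicted probability of "Left" across the training rows
  mean(predict(object, new_data = newdata, type = "prob")$.pred_Left)
}
pdp::partial(rf_final_fit, pred.var = "Tenure", pred.fun = pd_fun,
             train = train, plot = TRUE)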
Some Text
Based on mean cross-validated AUC, our MARS model is the best performer (about 0.847), narrowly ahead of the tuned logistic regression (0.845) and the random forest (0.842), with the tuned decision tree and bagged tree models trailing at roughly 0.82.
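As a sanity check on this ranking (a sketch only; it assumes all of the final fitted workflows from the sections above are still in memory), we could compare each model's AUC on the held-out test set:
# Sketch: compare test-set AUC across the final fitted models
final_fits <- list(
  logistic = lr_hyper_final,
  mars = mars_final_fit,
  decision_tree = dt_final_fit,
  bagged_tree = bag_final_fit,
  random_forest = rf_final_fit
)
purrr::map_dfr(final_fits, ~ .x %>%
                 predict(test, type = "prob") %>%
                 bind_cols(test %>% select(Status)) %>%
                 roc_auc(truth = Status, .pred_Current),
               .id = "model") %>%
  arrange(desc(.estimate))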
Some Text
In terms of relative importance how would you rate the predictors in your model. As a business manager, which factors would you focus on (for example you could invest in offering some incentives or promotions) to decrease the chances of customers leaving? Collect all the customers from the test dataset that you predict are going to leave. What is the predicted loss in revenue per month if no action is taken? Propose an incentive scheme to your manager to retain these customers. Use your model to justify your proposal. You can do this by performing a cost benefit analysis (comparing the cost of the incentive plan to the benefit of retaining the customers).
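To start on the revenue question above, a rough sketch might look like the following; it assumes the data contains a monthly charge column, here called MonthlyCharges (a hypothetical name to be replaced with the actual column), and uses the MARS fit as our chosen model:
# Sketch: customers in the test set that our chosen model predicts will leave
# Assumption: MonthlyCharges is a hypothetical column name for the customer's
# monthly bill; substitute the actual column name from the data set
predicted_leavers <- mars_final_fit %>%
  predict(test) %>%
  bind_cols(test) %>%
  filter(.pred_class == "Left")
# Estimated monthly revenue at risk if no retention action is taken
sum(predicted_leavers$MonthlyCharges)
# This figure can then be weighed against the cost of an incentive program
# (e.g., a discount offered to the flagged customers) in a cost-benefit analysis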