1 Background

This data is from an old Kaggle competition - https://www.kaggle.com/datasets/blastchar/telco-customer-churn

The data set includes information about:

  • Customers who left within the last month – the column is called Churn
  • Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
  • Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
  • Demographic info about customers – gender, age range, and if they have partners and dependents

I’ve recently started using tidymodels instead of caret, so I wanted to run through an example.

2 Libraries, Data, and Quick Review

library(tidymodels) 
#broom, dials, dplyr, ggplot2, infer, parsnip, purrr, recipes, rsample, tibble, tune, workflows, yardstick

tidymodels_prefer()
library(knitr)
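
For reference, a minimal sketch of the read step - the file name is assumed to be the standard Kaggle download, and the na argument is an assumption to coerce the blank TotalCharges entries to NA:

library(readr)

# assumed local path to the Kaggle CSV; blank/space entries in TotalCharges become NA
data <- read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv",
                 na = c("", "NA", " "),
                 show_col_types = FALSE)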

Initial Data Review

  • Many of the columns have character values, but they should really be factors
  • customerID isn’t needed for modeling purposes, but it’s worth including in the data frame for now in case research needs to be done after modeling is complete
  • SeniorCitizen is numeric, but it should be a character/factor
  • TotalCharges appears to be roughly the product of MonthlyCharges and tenure (see the quick check after the summary below)
summary(data)
##   customerID           gender          SeniorCitizen      Partner         
##  Length:7043        Length:7043        Min.   :0.0000   Length:7043       
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.1621                     
##                                        3rd Qu.:0.0000                     
##                                        Max.   :1.0000                     
##                                                                           
##   Dependents            tenure      PhoneService       MultipleLines     
##  Length:7043        Min.   : 0.00   Length:7043        Length:7043       
##  Class :character   1st Qu.: 9.00   Class :character   Class :character  
##  Mode  :character   Median :29.00   Mode  :character   Mode  :character  
##                     Mean   :32.37                                        
##                     3rd Qu.:55.00                                        
##                     Max.   :72.00                                        
##                                                                          
##  InternetService    OnlineSecurity     OnlineBackup       DeviceProtection  
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  TechSupport        StreamingTV        StreamingMovies      Contract        
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling   PaymentMethod      MonthlyCharges    TotalCharges   
##  Length:7043        Length:7043        Min.   : 18.25   Min.   :  18.8  
##  Class :character   Class :character   1st Qu.: 35.50   1st Qu.: 401.4  
##  Mode  :character   Mode  :character   Median : 70.35   Median :1397.5  
##                                        Mean   : 64.76   Mean   :2283.3  
##                                        3rd Qu.: 89.85   3rd Qu.:3794.7  
##                                        Max.   :118.75   Max.   :8684.8  
##                                                         NA's   :11      
##     Churn          
##  Length:7043       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
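
A quick check of that last bullet - it’s only an approximation, since monthly charges can change over a customer’s tenure:

# compare TotalCharges to tenure * MonthlyCharges
data |>
  mutate(approx_total = tenure * MonthlyCharges) |>
  summarize(correlation = cor(TotalCharges, approx_total, use = "complete.obs"),
            median_abs_diff = median(abs(TotalCharges - approx_total), na.rm = TRUE))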

Update the data

df_churn <- data |>
  
  # update SeniorCitizen to a character value
  mutate(SeniorCitizen = ifelse(SeniorCitizen == 0, "No", "Yes")) |>
  
  # replace NAs in TotalCharges with 0.  When filtering the NAs, all these customers had a tenure of 0, meaning they haven't paid anything...yet
  mutate(TotalCharges = ifelse(is.na(TotalCharges), 0, TotalCharges)) |>
  
  # create a Package variable for Phone Only, Internet Only, or Both
  # note: the raw data codes internet service as 'Fiber optic', not 'Fiber Optic'
  mutate(Package = case_when(
    PhoneService == 'Yes' & InternetService == 'No' ~ 'PhoneOnly',
    PhoneService == 'No' & InternetService %in% c('DSL', 'Fiber optic') ~ 'InternetOnly',
    TRUE ~ 'Both'
  )) |>
  
  # Put Churn at the front
  select(Churn, everything())

3 EDA and Correlation

This code uses a recipe to build indicator columns for each of the factor variables. Rather than the usual dummy-variable encoding, setting one_hot = TRUE creates an indicator column for every level. This is better here because I want to see how each level compares to the Churn variable.

If I had used one_hot = FALSE, then the columns for contract type would NOT show me month-to-month, one year, and two year - they would just show 2 of the 3 options.
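
A tiny sketch of the difference using a toy tibble - column names follow step_dummy’s default naming:

toy <- tibble(Contract = factor(c("Month-to-month", "One year", "Two year")))

# one_hot = TRUE keeps an indicator column for every level
recipe(~ Contract, data = toy) |>
  step_dummy(Contract, one_hot = TRUE) |>
  prep() |>
  bake(new_data = NULL) |>
  names()
# "Contract_Month.to.month" "Contract_One.year" "Contract_Two.year"

# one_hot = FALSE drops the first level as the baseline
recipe(~ Contract, data = toy) |>
  step_dummy(Contract) |>
  prep() |>
  bake(new_data = NULL) |>
  names()
# "Contract_One.year" "Contract_Two.year"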

corr_obj <- df_churn |> select(-customerID) |>
  recipe(~ .) |>
  step_dummy(all_nominal(), one_hot = TRUE )

corr_prep <- prep(corr_obj)

corr_df <- bake(corr_prep, new_data = df_churn)

Review the correlation between Churn and all of the variables

# Creates a matrix by expanding factor variables to a set of 'dummy variables'
churn_corr <- as.data.frame(model.matrix(~., data = corr_df)) |> select(-1)

m <- cor(churn_corr)

m_churn <- m['Churn_Yes', ]

m_churn_df <- data.frame(variable = names(m_churn), correlation = m_churn)

m_churn_df |> 
  filter(!variable %in% c('Churn_Yes', 'Churn_No')) |>
  ggplot(aes(x = reorder(variable, -correlation), y = correlation)) +
  geom_col(fill = 'steelblue') +
  geom_text(aes(label = sprintf("%.2f", correlation)), size = 2, vjust = -0.5) +

  labs(title = 'Correlation with Churn', y = 'Correlation',x = 'Variable') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
  scale_y_continuous(limits = c(-0.4, 0.4), breaks = seq(-0.4, 0.4, by = 0.1))

From this bar plot, we can see that month-to-month contracts, no online security, no tech support, and fiber optic internet have the strongest positive correlations with churn, while tenure and two-year contracts have the strongest negative correlations - meaning they are associated with not churning.

# month to month is the most common contract
df_churn %>%
    ggplot(aes(x = Contract, fill = Contract)) +
    geom_bar() +
    theme(legend.position="none") +
    labs(x ="") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

# having no online security is the most common option
df_churn %>%
    ggplot(aes(x = OnlineSecurity, fill = OnlineSecurity)) +
    geom_bar() +
    theme(legend.position="none") +
    labs(x ="") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

# month to month and no online security is the most common
df_churn %>%
    ggplot(aes(x = Contract, fill = Contract)) +
    geom_bar() +
    facet_wrap(vars(OnlineSecurity)) +
    theme(legend.position="none") +
    labs(x ="") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

# much of the churn appears to come from month to month contracts with No Online Security
df_churn %>%    
  ggplot(aes(x = Contract, fill = Contract)) +
  geom_bar() +
  facet_wrap(vars(Churn, OnlineSecurity)) +
  theme(legend.position="none") +
  labs(x ="") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

# There are a lot of shorter tenures
df_churn %>%
    ggplot(aes(x = tenure)) +
    geom_bar(color = "blue") +
    theme(legend.position="none") +
    labs(x ="") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

# Month-to-month contracts have shorter tenures, two year contracts have longer tenures
df_churn %>%
    ggplot(aes(x = tenure)) +
    geom_bar(color = "blue") +
    facet_wrap(vars(Contract)) +
    theme(legend.position="none") +
    labs(x ="") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

# much of the churn appears to come from month to month clients with a shorter tenure
df_churn %>%
    ggplot(aes(x = tenure)) +
    geom_bar(color = "blue") +
    facet_wrap(vars(Churn, Contract)) +
    theme(legend.position="none") +
    labs(x ="") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

4 Model Setup

There are a few different models I want to look at for predicting Churn:

  • Penalized Logistic Regression
  • Random Forest
  • XGBoost
  • SVM (polynomial kernel)
  • K-Nearest Neighbors

I’ll split the data, stratified on the Churn variable, and use 10-fold cross-validation for the training and tuning processes.

# create a model data frame and change character values to factors (keeping customerID as character)
model_df <- df_churn |>
  mutate(across(where(is.character), as.factor)) |>
  mutate(customerID = as.character(customerID))

4.1 Splitting and Cross-Validation

set.seed(21425)
churn_split <- initial_split(model_df, strata = Churn)
churn_train <- training(churn_split)
churn_test <- testing(churn_split)

set.seed(1502)
churn_folds <- 
   vfold_cv(churn_train, strata = Churn)

4.2 Recipes

The models that use distance functions need normalized predictors, so I’ll create two different recipes.

model_recipe_no_norm <- 
  recipe(Churn ~ ., churn_train) |>
  
  # keep customerID as an ID role in case we have to explore a specific customer
  update_role(customerID, new_role = "ID") |>
  
  # create dummy variables
  step_dummy(all_nominal_predictors()) |>
  
  # remove variables with zero variance, if any
  step_zv(all_predictors())

model_recipe_norm <- model_recipe_no_norm |>
  
  # Normalizes the data
  step_normalize(all_numeric_predictors())

4.3 Building Model Specifications

I’ve created model specifications below. These will be used in a workflow for each model. For all except glmnet, I need to set the mode to classification - logistic_reg() only supports classification, so its mode is already set.

  • Penalized Logistic Regression
  • Random Forest
  • XGBoost
  • SVM (polynomial kernel)
  • K-Nearest Neighbors
glm_spec <-
  logistic_reg(penalty = tune(),
               mixture = tune() # range from 0 to 1, with 1 being a pure Lasso model
               ) |>
  set_engine("glmnet")

rf_spec <- 
   rand_forest(mtry = tune(), 
               min_n = tune(), 
               trees = 1000 # trees of 1000 should be sufficient
               ) |> 
   set_engine("ranger") |> 
   set_mode("classification")

xgb_spec <- 
   boost_tree(trees = 1000, # trees of 1000 should be sufficient
              tree_depth = tune(), 
              min_n = tune(),
              learn_rate = tune(), 
              loss_reduction = tune(), 
              mtry = tune(),
              sample_size = tune()
              ) |>
   set_engine("xgboost") |> 
   set_mode("classification")

svm_spec <- 
  svm_poly(cost = tune(), degree = tune()) |>
  set_engine("kernlab") |>
  set_mode("classification")

knn_spec <- 
   nearest_neighbor(neighbors = tune(), 
                    dist_power = tune(), 
                    weight_func = tune()) |> 
   set_engine("kknn") |> 
   set_mode("classification")

4.4 Metrics Function

I want to create a function that displays the ROC curve, metrics, and confusion matrices for a final model.

model_metrics <- function(final_model) {
  
  #ROC Curve
  model_roc <- final_model |>
    collect_predictions() |>
    roc_curve(Churn, .pred_No) |>
    autoplot()
  print(model_roc)
  
  #Metrics
  metrics <- final_model |>
    collect_metrics() 
  print(kable(metrics))

  #Predictions
  predictions <- final_model |>
    collect_predictions()
  
  # Create a confusion matrix of counts
  predictions |>
    conf_mat(truth = Churn, estimate = .pred_class) |> 
    autoplot(type = "heatmap") |>
    print()

  # Confusion matrix as proportions within each truth level
  props <- predictions |>
    group_by(Churn, .pred_class) |>
    summarize(total = n(), .groups = "drop_last") |>
    mutate(prop = total / sum(total)) 
    
  props |>
    ggplot(aes(x = .pred_class, y = Churn, fill = prop)) +
    geom_tile() +
    geom_text(aes(label = sprintf("%.2f", prop))) +
    scale_fill_gradient(low = "white", high = "steelblue") +
    labs(x = "Prediction", y = "Truth", fill = "Proportion")

}

5 Modeling

5.1 Penalized Logistic Regression

Create a workflow

glmnet_wf <- 
  workflow() |> 
  add_model(glm_spec) |>
  add_recipe(model_recipe_norm)

Create a grid

glmnet_grid <- grid_regular(penalty(),
                          mixture(),
                          levels = 5) 

Tune the hyperparameters

set.seed(345)

glmnet_tune <- 
  glmnet_wf %>% 
  tune_grid(
    resamples = churn_folds,
    control = control_grid(save_pred = TRUE),
    grid = glmnet_grid
    )

saveRDS(glmnet_tune, file = "Saved_Model_Results/glmnet_initial_tune.rds")

Results of tuning

options(scipen = 4)
glmnet_tune |>
  collect_metrics() |>
  filter(.metric == "roc_auc") |>
  arrange(-mean,penalty, mixture) |>
  head(10) |>
  kable()
penalty mixture .metric .estimator mean n std_err .config
0.00000 1.00 roc_auc binary 0.8411495 10 0.0051941 Preprocessor1_Model21
0.00000 1.00 roc_auc binary 0.8411495 10 0.0051941 Preprocessor1_Model22
0.00001 1.00 roc_auc binary 0.8411495 10 0.0051941 Preprocessor1_Model23
0.00000 0.25 roc_auc binary 0.8411255 10 0.0051760 Preprocessor1_Model06
0.00000 0.25 roc_auc binary 0.8411255 10 0.0051760 Preprocessor1_Model07
0.00001 0.25 roc_auc binary 0.8411255 10 0.0051760 Preprocessor1_Model08
0.00000 0.50 roc_auc binary 0.8411199 10 0.0051826 Preprocessor1_Model11
0.00000 0.50 roc_auc binary 0.8411199 10 0.0051826 Preprocessor1_Model12
0.00001 0.50 roc_auc binary 0.8411199 10 0.0051826 Preprocessor1_Model13
0.00000 0.75 roc_auc binary 0.8411145 10 0.0051871 Preprocessor1_Model16

Plot of tuning results

glmnet_tune %>%
  collect_metrics() %>%
  mutate(mixture = factor(mixture)) %>%
  ggplot(aes(penalty, mean, color = mixture)) +
  geom_line(size = 1.5, alpha = 0.6) +
  geom_point(size = 2) +
  facet_wrap(~ .metric, scales = "free", nrow = 2) +
  scale_x_log10(labels = scales::label_number()) +
  scale_color_viridis_d(option = "plasma", begin = .9, end = 0)

Lower penalties seem to do well; mixture isn’t quite as clear - the top combinations all have roc_aucs of around 0.83 to 0.84. Since smaller penalties appear to do better, I can re-tune with a narrower range of values for penalty.

# rebuild the grid
glmnet_grid <- grid_regular(penalty(range = c(-10, -5)), # range is -10 to -5 on the log10 scale
                          mixture(),
                          levels = 5)

# Modeling
set.seed(345)

glmnet_retune <- 
  glmnet_wf %>% 
  tune_grid(
    resamples = churn_folds,
    control = control_grid(save_pred = TRUE),
    grid = glmnet_grid
    )

saveRDS(glmnet_retune, file = "Saved_Model_Results/glmnet_retune.rds")

options(scipen = 9)
# Collect metrics
glmnet_retune |>
  collect_metrics() |>
  filter(.metric == "roc_auc") |>
  arrange(-mean,penalty, mixture) |>
  head(10) |>
  kable()
penalty mixture .metric .estimator mean n std_err .config
0.0000000 1.00 roc_auc binary 0.8411495 10 0.0051941 Preprocessor1_Model21
0.0000000 1.00 roc_auc binary 0.8411495 10 0.0051941 Preprocessor1_Model22
0.0000000 1.00 roc_auc binary 0.8411495 10 0.0051941 Preprocessor1_Model23
0.0000006 1.00 roc_auc binary 0.8411495 10 0.0051941 Preprocessor1_Model24
0.0000100 1.00 roc_auc binary 0.8411495 10 0.0051941 Preprocessor1_Model25
0.0000000 0.25 roc_auc binary 0.8411255 10 0.0051760 Preprocessor1_Model06
0.0000000 0.25 roc_auc binary 0.8411255 10 0.0051760 Preprocessor1_Model07
0.0000000 0.25 roc_auc binary 0.8411255 10 0.0051760 Preprocessor1_Model08
0.0000006 0.25 roc_auc binary 0.8411255 10 0.0051760 Preprocessor1_Model09
0.0000100 0.25 roc_auc binary 0.8411255 10 0.0051760 Preprocessor1_Model10

Not much of a change using smaller penalties, but a mixture of 1 (pure lasso) appears to be the best.

Select the best parameters

options(scipen = 9)
best_glmnet <- glmnet_retune %>%
  select_best("roc_auc")
best_glmnet |> kable()
penalty mixture .config
0 1 Preprocessor1_Model21

Finalized glmnet model

Create a workflow using the best penalty and mixture - a low penalty and a pure lasso regularization.

glmnet_wf <- 
  glmnet_wf %>% 
  finalize_workflow(best_glmnet)

Final Model

  • last_fit() emulates the end of the process: after determining the best model, it fits the finalized workflow on the entire training set and then evaluates it on the test set.
glmnet_final <- 
  glmnet_wf %>%
  last_fit(churn_split) 

saveRDS(glmnet_final, file = "Saved_Model_Results/glmnet_final.rds")

Metrics

model_metrics(glmnet_final)

## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.807 Preprocessor1_Model1
## 2 roc_auc  binary         0.858 Preprocessor1_Model1

5.2 Random Forest

Create a workflow

rf_wf <- 
  workflow() |> 
  add_model(rf_spec) |>
  add_recipe(model_recipe_no_norm)

Tuning the Hyperparameters

  • This will be tuned a bit differently than glmnet. RF models tend to perform well without tuning, but it doesn’t hurt to explore a bit. Instead of building a regular grid up front, I’ll pass grid = 30 to tune_grid() so it chooses 30 candidate parameter combinations automatically.
doParallel::registerDoParallel()

set.seed(12542)
rf_tune <- tune_grid(
  rf_wf,
  resamples = churn_folds,
  grid = 30
)
## i Creating pre-processing data to finalize unknown parameter: mtry
saveRDS(rf_tune, file = "Saved_Model_Results/rf_initial_tune.rds")

Metrics

rf_tune |>
  collect_metrics() |>
  filter(.metric == "roc_auc") |>
  arrange(-mean) |>
  head(10) |>
  kable()
mtry min_n .metric .estimator mean n std_err .config
5 28 roc_auc binary 0.8419630 10 0.0049751 Preprocessor1_Model26
3 17 roc_auc binary 0.8410658 10 0.0047453 Preprocessor1_Model05
7 30 roc_auc binary 0.8406773 10 0.0050237 Preprocessor1_Model02
9 32 roc_auc binary 0.8395159 10 0.0051026 Preprocessor1_Model18
8 23 roc_auc binary 0.8387182 10 0.0049896 Preprocessor1_Model27
12 33 roc_auc binary 0.8379905 10 0.0051707 Preprocessor1_Model12
10 22 roc_auc binary 0.8369434 10 0.0050522 Preprocessor1_Model28
14 35 roc_auc binary 0.8368982 10 0.0052511 Preprocessor1_Model14
6 7 roc_auc binary 0.8363543 10 0.0047926 Preprocessor1_Model22
2 29 roc_auc binary 0.8354133 10 0.0045600 Preprocessor1_Model21

Plots

rf_tune %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  select(mean, min_n, mtry) %>%
  pivot_longer(min_n:mtry,
    values_to = "value",
    names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC")

autoplot(rf_tune)

Generally it looks like a min_n between 23 and 33 and an mtry between 3 and 9 perform best.

Now that we have a range of values to use, grid_regular can be utilized.

rf_grid <- grid_regular(
  mtry(range = c(3, 9)),
  min_n(range = c(23, 33)),
  levels = 7
)

Re-tuning and Plotting

set.seed(456)
rf_retune <- tune_grid(
  rf_wf,
  resamples = churn_folds,
  grid = rf_grid
)

saveRDS(rf_retune, file = "Saved_Model_Results/rf_retune.rds")

rf_retune %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  mutate(min_n = factor(min_n)) %>%
  ggplot(aes(mtry, mean, color = min_n)) +
  geom_line(alpha = 0.5, size = 1.5) +
  geom_point() +
  labs(y = "AUC") +
  scale_x_continuous(breaks = seq(0, 12, by = 1))

An mtry of 4 and a min_n of 33 perform the best.

Best Model

best_rf <- rf_retune %>% select_best("roc_auc")
best_rf
## # A tibble: 1 × 3
##    mtry min_n .config              
##   <int> <int> <chr>                
## 1     4    33 Preprocessor1_Model44

Finalized Model

rf_wf <-
  rf_wf |>
  finalize_workflow(best_rf)

rf_final <- 
  rf_wf %>%
  last_fit(churn_split)

saveRDS(rf_final, file = "Saved_Model_Results/rf_final.rds")
model_metrics(rf_final)

## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.804 Preprocessor1_Model1
## 2 roc_auc  binary         0.854 Preprocessor1_Model1

Variable Importance Plot

library(vip)

rf_vip <- 
  rf_spec %>%
  finalize_model(select_best(rf_retune, "roc_auc")) %>%
  set_engine("ranger", importance = "permutation")  # the engine needs an importance mode so vip can compute variable importance
  
workflow() |>
  add_recipe(model_recipe_no_norm) |>
  add_model(rf_vip) |>
  fit(churn_train) |>
  extract_fit_parsnip() |>
  vip(geom = "point")

Tenure and TotalCharges are important, but they’re definitely correlated - TotalCharges is roughly the product of tenure and MonthlyCharges. Fiber optic internet is also important, as is MonthlyCharges - which by itself is a good indicator.

5.3 XGBoost

Arguments for the xgboost model

  • mtry - same as rf; the number of predictors that are randomly sampled at each split
  • trees - same as rf; the number of trees in the ensemble
  • min_n - same as rf; the minimum number of data points in a node required for the node to be split further
  • tree_depth - the maximum depth of the tree, aka the number of splits
  • learn_rate - the rate at which the boosting algorithm adapts from iteration to iteration - the shrinkage parameter
  • loss_reduction - the reduction in the loss function required to split further
  • sample_size - the proportion of the data that is exposed to the fitting routine in each iteration

5.3.0.1 Create a grid

This method is a little different than the others since we have 6 hyperparameters to tune. Instead of making a regular grid and trying every combination (or racing), we create a space-filling latin hypercube grid that spreads the candidate points evenly across the 6-dimensional parameter space.

xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(), # sample_size needs to be a proportion for xgboost
  mtry = finalize(mtry(), churn_train), # mtry() has an unknown upper bound; finalize() sets it from the predictors in churn_train
  learn_rate(),
  size = 40
)

Alternatively, a racing method could be used.

Racing methods:

  • these are efficient approaches to grid search. Initially, the function evaluates all tuning parameter combinations on a small initial set of resamples
  • the performance stats from these resamples are analyzed to determine which tuning parameters are NOT statistically different from the current best setting. Parameters that are statistically worse are excluded from further resampling
  • the next resample is used with the remaining parameter combinations, and the statistical analysis is updated. More candidate parameters may be excluded with each new resample that is processed
  • the function determines statistical significance using a repeated-measures ANOVA model, where the performance statistic - in this case accuracy - is the outcome and the resample is the random effect

We take the 10 resamples from cross-validation. The function tries every set of hyperparameters on an initial set of resamples, then uses ANOVA to decide which ones are statistically much worse and throws those away. It keeps going until it figures out which ones are best.

We don’t evaluate every hyperparameter combination on all 10 resamples - it’s likely that some combinations are thrown away after running through the first few resamples.

# # uses an ANOVA model to see if some hyperparameters perform differently between models
# set.seed(1235)
# xgb_tune <- 
#   tune_race_anova(
#   xgb_workflow,
#   resamples = churn_folds,
#   grid = 25,
#   control = control_race(verbose_elim = TRUE, save_pred = TRUE)
# )
# saveRDS(xgb_tune, file = "Saved_Model_Results/xgb_mod1.rds")
# 
# xgb_tune$.metrics[[1]] |> filter(.metric == 'roc_auc') |> arrange(-.estimate)

# plot_race(xgb_tune) +
#   scale_x_continuous(breaks = seq(0, 10, by = 1), labels = seq(0, 10, by = 1))

Create a workflow

xgb_wf <- 
  workflow() |> 
  add_model(xgb_spec) |>
  add_recipe(model_recipe_no_norm)

Tune the model

doParallel::registerDoParallel()

set.seed(43624)
xgb_tune <- tune_grid(
  xgb_wf,
  resamples = churn_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)

saveRDS(xgb_tune, file = "Saved_Model_Results/xgb_initial_tune.rds")

Plots

xgb_tune %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  select(mean, mtry:sample_size) %>%
  pivot_longer( mtry:sample_size,
    values_to = "value",
    names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x")

autoplot(xgb_tune)

This plot shows the results for each parameter. There’s not really a clear indication of a good range for each argument - maybe a higher tree_depth. For example, it’s hard to tell whether a higher or lower mtry will generally give a higher mean roc_auc.

We can look at the parameters for the best models based on roc_auc

show_best(xgb_tune, "roc_auc") |> kable()
mtry min_n tree_depth learn_rate loss_reduction sample_size .metric .estimator mean n std_err .config
16 37 5 0.0042413 0.0000002 0.7407727 roc_auc binary 0.8459393 10 0.0055821 Preprocessor1_Model21
5 18 10 0.0060422 1.0659622 0.7772719 roc_auc binary 0.8454827 10 0.0052444 Preprocessor1_Model16
21 34 9 0.0077551 0.0000214 0.6841453 roc_auc binary 0.8452450 10 0.0055234 Preprocessor1_Model32
19 32 12 0.0132575 0.0000021 0.5042190 roc_auc binary 0.8444505 10 0.0053582 Preprocessor1_Model26
6 16 7 0.0012132 0.5612216 0.9770812 roc_auc binary 0.8436718 10 0.0052316 Preprocessor1_Model35
best_xgb <- select_best(xgb_tune, "roc_auc")

…And then we can finalize the workflow with the best parameters and evaluate it on the test data

xgb_final <- 
  xgb_wf |>
  finalize_workflow(best_xgb) |>
  last_fit(churn_split)

saveRDS(xgb_final, file = "Saved_Model_Results/xgb_final.rds")
model_metrics(xgb_final)

## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.811 Preprocessor1_Model1
## 2 roc_auc  binary         0.856 Preprocessor1_Model1

The best roc_auc was 0.856, which is slightly worse than the penalized logistic regression. If we look at the confusion matrix, the model does better at predicting no churn (84%) than churn (69%).

xgb_vip <- 
  xgb_spec %>%
  finalize_model(select_best(xgb_tune, 'roc_auc'))
  
workflow() |>
  add_recipe(model_recipe_no_norm) |>
  add_model(xgb_vip) |>
  fit(churn_train) |>
  extract_fit_parsnip() |>
  vip(geom = "point")

Tenure, fiber optic internet, and two-year contracts are the most important. We can also extract importance directly from the final fitted workflow:

extract_workflow(xgb_final) |>
  extract_fit_parsnip() |>
  vip(num_features = 10)

5.4 SVM

Model information

  • The fundamental idea behind SVMs is to find a hyperplane that best separates the data into different classes.
  • It aims to find the hyperplane with the maximum margin - the distance between the hyperplane and the nearest data point from each class. A larger margin indicates better generalization to new data.
  • The data points closest to the hyperplane that influence its position and orientation are called support vectors. These are the critical points that define the margin
  • A kernel function is used to capture complex or non-linear relationships; here the spec is svm_poly(), a polynomial kernel, with the degree tuned - a degree of 1 is equivalent to a linear SVM
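
A quick sketch of the polynomial kernel itself - kernlab’s form is (scale * x·y + offset)^degree, so a degree of 1 reduces to a (shifted) linear kernel:

# polynomial kernel value between two points
poly_kernel <- function(x, y, degree, scale = 1, offset = 1) {
  (scale * sum(x * y) + offset)^degree
}

poly_kernel(c(1, 2), c(3, 4), degree = 1) # 12 - linear in the dot product (11 + 1)
poly_kernel(c(1, 2), c(3, 4), degree = 2) # 144 - (11 + 1)^2
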
svm_wf <- 
  workflow() |>
  add_model(svm_spec) |>
  add_recipe(model_recipe_norm)
svm_grid <- grid_regular(cost(),
                          degree(range = c(1,3)),
                          levels = c(cost = 5, degree = 3))
head(svm_grid, 5) |> kable()
cost degree
0.0009766 1
0.0131390 1
0.1767767 1
2.3784142 1
32.0000000 1
doParallel::registerDoParallel()
set.seed(345)

svm_tune <- 
  svm_wf %>% 
  tune_grid(
    resamples = churn_folds,
    control = control_grid(save_pred = TRUE),
    grid = svm_grid
    )
saveRDS(svm_tune, file = "Saved_Model_Results/svm_initial_tune.rds")
autoplot(svm_tune)

collect_metrics(svm_tune) |> filter(.metric == "roc_auc") |> arrange(-mean) |> head(10) |> kable()
cost degree .metric .estimator mean n std_err .config
0.0009766 1 roc_auc binary 0.8305303 10 0.0058982 Preprocessor1_Model01
0.0131390 1 roc_auc binary 0.8300172 10 0.0060825 Preprocessor1_Model02
0.1767767 1 roc_auc binary 0.8284232 10 0.0060003 Preprocessor1_Model03
2.3784142 1 roc_auc binary 0.8281469 10 0.0060189 Preprocessor1_Model04
32.0000000 1 roc_auc binary 0.8281065 10 0.0059997 Preprocessor1_Model05
0.0009766 2 roc_auc binary 0.8100025 10 0.0074573 Preprocessor1_Model06
0.0131390 2 roc_auc binary 0.8028656 10 0.0080810 Preprocessor1_Model07
0.1767767 2 roc_auc binary 0.8015521 10 0.0081695 Preprocessor1_Model08
2.3784142 2 roc_auc binary 0.8009954 10 0.0082861 Preprocessor1_Model09
32.0000000 2 roc_auc binary 0.7881594 10 0.0091714 Preprocessor1_Model10
show_best(svm_tune, "roc_auc") |> kable()
cost degree .metric .estimator mean n std_err .config
0.0009766 1 roc_auc binary 0.8305303 10 0.0058982 Preprocessor1_Model01
0.0131390 1 roc_auc binary 0.8300172 10 0.0060825 Preprocessor1_Model02
0.1767767 1 roc_auc binary 0.8284232 10 0.0060003 Preprocessor1_Model03
2.3784142 1 roc_auc binary 0.8281469 10 0.0060189 Preprocessor1_Model04
32.0000000 1 roc_auc binary 0.8281065 10 0.0059997 Preprocessor1_Model05
best_svm <- select_best(svm_tune, "roc_auc")
svm_final <- 
  svm_wf |>
  finalize_workflow(best_svm) |>
  last_fit(churn_split)

saveRDS(svm_final, file = "Saved_Model_Results/svm_final.rds")
model_metrics(svm_final)

## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.801 Preprocessor1_Model1
## 2 roc_auc  binary         0.846 Preprocessor1_Model1

5.5 K-Nearest Neighbor

Here are the arguments that will be tuned:

  • neighbors: an integer for the number of neighbors to consider - default is 5
  • weight_func: indicates the kernel function - “rectangular”, “triangular”, “epanechnikov”, “biweight”, “triweight”, “cos”, “inv”, “gaussian”, “rank”, or “optimal”
  • dist_power: a single number used in calculating the Minkowski distance. This parameter influences the sensitivity of the distance metric - 1 is the Manhattan distance, 2 is Euclidean (see the quick illustration below)
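
A quick sketch of what dist_power controls, using two toy points in base R:

# Minkowski distance between two points for a given power p
minkowski <- function(a, b, p) sum(abs(a - b)^p)^(1 / p)

x <- c(1, 2)
y <- c(4, 6)

minkowski(x, y, 1) # 7 - Manhattan distance (3 + 4)
minkowski(x, y, 2) # 5 - Euclidean distance (sqrt(9 + 16))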

knn_wf <- 
  workflow() |>
  add_model(knn_spec) |>
  add_recipe(model_recipe_norm)
knn_grid <- grid_regular(neighbors(),
                         dist_power(range = c(1,3)),
                         weight_func(),
                         levels = c(neighbors = 3, dist_power = 3,weight_func = 10))
doParallel::registerDoParallel()
set.seed(4575)

knn_tune <- 
  knn_wf %>% 
  tune_grid(
    resamples = churn_folds,
    control = control_grid(save_pred = TRUE),
    grid = knn_grid
    )
saveRDS(knn_tune, file = "Saved_Model_Results/knn_initial_tune.rds")

Rectangular performs the best across the 10 distance weighting functions, and more neighbors perform better across the board. Dist power doesn’t appear to matter much.

autoplot(knn_tune)

Create a new grid and retune

knn_grid <- grid_regular(neighbors(range = c(10, 60)),
                         dist_power(range = c(1, 3)),
                         weight_func(values = 'rectangular'),
                         levels = c(neighbors = 8, dist_power = 3, weight_func = 1))


knn_retune <- 
  knn_wf %>% 
  tune_grid(
    resamples = churn_folds,
    control = control_grid(save_pred = TRUE),
    grid = knn_grid
    )
saveRDS(knn_retune, file = "Saved_Model_Results/knn_retuned.rds")

autoplot(knn_retune)

A Minkowski distance of 1 (Manhattan distance) performs best. 60 neighbors has the best roc_auc while 31 neighbors has the best accuracy, so there’s a bit of a tradeoff.

collect_metrics(knn_retune) |> filter(.metric == "roc_auc") |> arrange(-mean) |> head(10) |>kable()
neighbors weight_func dist_power .metric .estimator mean n std_err .config
60 rectangular 1 roc_auc binary 0.8293796 10 0.0045258 Preprocessor1_Model08
38 rectangular 1 roc_auc binary 0.8290584 10 0.0054406 Preprocessor1_Model05
45 rectangular 1 roc_auc binary 0.8289684 10 0.0050726 Preprocessor1_Model06
52 rectangular 1 roc_auc binary 0.8285914 10 0.0044145 Preprocessor1_Model07
31 rectangular 1 roc_auc binary 0.8276035 10 0.0049272 Preprocessor1_Model04
60 rectangular 2 roc_auc binary 0.8263033 10 0.0048223 Preprocessor1_Model16
52 rectangular 2 roc_auc binary 0.8249508 10 0.0046678 Preprocessor1_Model15
38 rectangular 2 roc_auc binary 0.8239603 10 0.0047623 Preprocessor1_Model13
24 rectangular 1 roc_auc binary 0.8238411 10 0.0048074 Preprocessor1_Model03
31 rectangular 2 roc_auc binary 0.8237956 10 0.0042344 Preprocessor1_Model12
show_best(knn_retune, "roc_auc") |> kable()
neighbors weight_func dist_power .metric .estimator mean n std_err .config
60 rectangular 1 roc_auc binary 0.8293796 10 0.0045258 Preprocessor1_Model08
38 rectangular 1 roc_auc binary 0.8290584 10 0.0054406 Preprocessor1_Model05
45 rectangular 1 roc_auc binary 0.8289684 10 0.0050726 Preprocessor1_Model06
52 rectangular 1 roc_auc binary 0.8285914 10 0.0044145 Preprocessor1_Model07
31 rectangular 1 roc_auc binary 0.8276035 10 0.0049272 Preprocessor1_Model04
best_knn <- select_best(knn_retune, "roc_auc")
knn_final <- 
  knn_wf |>
  finalize_workflow(best_knn) |>
  last_fit(churn_split)

saveRDS(knn_final, file = "Saved_Model_Results/knn_final.rds")
model_metrics(knn_final)

## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.796 Preprocessor1_Model1
## 2 roc_auc  binary         0.841 Preprocessor1_Model1

6 Bringing it all together

model_results <- rbind(
  collect_metrics(glmnet_final) |> select(.metric, .estimate) |> mutate(model = 'glmnet'),
  collect_metrics(rf_final) |> select(.metric, .estimate) |> mutate(model = 'RF'),
  collect_metrics(xgb_final) |> select(.metric, .estimate) |> mutate(model = 'xgb'),
  collect_metrics(svm_final) |> select(.metric, .estimate) |> mutate(model = 'svm'),
  collect_metrics(knn_final) |> select(.metric, .estimate) |> mutate(model = 'knn')
)

model_results |> arrange(desc(.metric),desc(.estimate)) |> kable(align = 'c', caption = "All Model Results")
All Model Results
.metric .estimate model
roc_auc 0.8580794 glmnet
roc_auc 0.8564933 xgb
roc_auc 0.8537547 RF
roc_auc 0.8455214 svm
roc_auc 0.8410522 knn
accuracy 0.8110102 xgb
accuracy 0.8070375 glmnet
accuracy 0.8036322 RF
accuracy 0.8013621 svm
accuracy 0.7956867 knn

The lasso model produced the best area under the ROC curve, just slightly ahead of XGBoost, while XGBoost produced the best accuracy.