We will learn how to tune the hyperparameters of an XGBoost model through a Customer Churn Prediction exercise.

Explore Data

library(tidyverse)

customer <- read_csv("https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv")

# Convert all character columns to factors
customer <- customer %>%
  mutate_if(is.character, factor)

Plot a summary of all variables with DataExplorer:

library(DataExplorer)
plot_intro(customer)

Summary of the customer data, excluding customerID:
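The table below (note its Name field, customer[, -1]) is consistent with skimr output, so presumably it was produced with a call along these lines (an assumption; the original call is not shown):

library(skimr)
skim(customer[, -1])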

Data summary
Name customer[, -1]
Number of rows 7043
Number of columns 20
_______________________
Column type frequency:
factor 16
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts pct
gender 0 1 FALSE 2 Mal: 3555, Fem: 3488 Mal: 0.50, Fem: 0.50
Partner 0 1 FALSE 2 No: 3641, Yes: 3402 No: 0.52, Yes: 0.48
Dependents 0 1 FALSE 2 No: 4933, Yes: 2110 No: 0.70, Yes: 0.30
PhoneService 0 1 FALSE 2 Yes: 6361, No: 682 Yes: 0.90, No: 0.10
MultipleLines 0 1 FALSE 3 No: 3390, Yes: 2971, No : 682 No: 0.48, Yes: 0.42, No : 0.10
InternetService 0 1 FALSE 3 Fib: 3096, DSL: 2421, No: 1526 Fib: 0.44, DSL: 0.34, No: 0.22
OnlineSecurity 0 1 FALSE 3 No: 3498, Yes: 2019, No : 1526 No: 0.50, Yes: 0.29, No : 0.22
OnlineBackup 0 1 FALSE 3 No: 3088, Yes: 2429, No : 1526 No: 0.44, Yes: 0.34, No : 0.22
DeviceProtection 0 1 FALSE 3 No: 3095, Yes: 2422, No : 1526 No: 0.44, Yes: 0.34, No : 0.22
TechSupport 0 1 FALSE 3 No: 3473, Yes: 2044, No : 1526 No: 0.49, Yes: 0.29, No : 0.22
StreamingTV 0 1 FALSE 3 No: 2810, Yes: 2707, No : 1526 No: 0.40, Yes: 0.38, No : 0.22
StreamingMovies 0 1 FALSE 3 No: 2785, Yes: 2732, No : 1526 No: 0.40, Yes: 0.39, No : 0.22
Contract 0 1 FALSE 3 Mon: 3875, Two: 1695, One: 1473 Mon: 0.55, Two: 0.24, One: 0.21
PaperlessBilling 0 1 FALSE 2 Yes: 4171, No: 2872 Yes: 0.59, No: 0.41
PaymentMethod 0 1 FALSE 4 Ele: 2365, Mai: 1612, Ban: 1544, Cre: 1522 Ele: 0.34, Mai: 0.23, Ban: 0.22, Cre: 0.22
Churn 0 1 FALSE 2 No: 5174, Yes: 1869 No: 0.73, Yes: 0.27

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SeniorCitizen 0 1 0.16 0.37 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
tenure 0 1 32.37 24.56 0.00 9.00 29.00 55.00 72.00 ▇▃▃▃▆
MonthlyCharges 0 1 64.76 30.09 18.25 35.50 70.35 89.85 118.75 ▇▅▆▇▅
TotalCharges 11 1 2283.30 2266.77 18.80 401.45 1397.47 3794.74 8684.80 ▇▂▂▂▁

Quick EDA

A quick exploration of the categorical and numeric variables.

Categorical variables

plot_bar(customer[,-1], by = "Churn")

  • Gender, PhoneService, StreamingTV, and StreamingMovies show similar patterns across their categories, so they might not contribute much to classifying Churn status.

Numeric variables

library(ggstatsplot)
ggcorrmat(customer,
          title = "Relationship between numeric variables",
          subtitle = "Dataset: Telco Customer",
          ggcorrplot.args = list(outline.color = "black", hc.order = TRUE))

  • TotalCharges has a strong correlation with tenure and MonthlyCharges; a quick numeric check follows below.
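As a numeric cross-check of the correlation plot, a small sketch with base cor(), using pairwise complete observations since TotalCharges has 11 NAs:

# Correlation matrix of the numeric variables
customer %>%
  select(tenure, MonthlyCharges, TotalCharges) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(2)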

Feature engineering

Wrangling data

  • There are 11 NAs in TotalCharges, which we can omit.
  • SeniorCitizen is coded as numeric, but it is really a factor, so we should convert it.
  • We can eliminate TotalCharges due to multicollinearity.
  • We can also remove customerID, as it is not useful for prediction.

customer <- customer %>%
  na.omit() %>%
  mutate_at(vars(SeniorCitizen), factor) %>%
  select(-customerID, -TotalCharges)

Data splitting and resamples

Split the data into training and testing sets, then create resamples from the training set using cross-validation with vfold_cv():

library(tidymodels)
set.seed(123)
cust_split <- customer %>%
  initial_split(prop = 0.8, strata = Churn)

train <- training(cust_split)
test <- testing(cust_split)

# Cross validation folds from training dataset
set.seed(234)
folds <- vfold_cv(train, strata = Churn)

Pre-processing

Let’s create a base recipe():

cust_rec <- recipe(Churn ~ ., data = train) %>%
  step_corr(all_numeric(), threshold = 0.7, method = "spearman") %>% # drop highly correlated predictors
  step_zv(all_numeric()) %>%        # filter zero-variance predictors
  step_normalize(all_numeric()) %>% # center and scale numeric predictors
  step_dummy(all_nominal_predictors())
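
To sanity-check the preprocessing, we can prep() the recipe on the training data and bake() it; a quick sketch (bake(new_data = NULL) returns the processed training set):

cust_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()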

Model without tuning

Let’s start by building the model without any hyperparameter tuning; we will simply use the default values.

Model

Let’s create a base XGBoost model specification and workflow:

xgb_spec0 <- boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_wf0 <- workflow() %>%
  add_recipe(cust_rec) %>%
  add_model(xgb_spec0)
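
To see exactly which values the xgboost engine will receive, we can translate the specification (translate() is parsnip's standard way to inspect the underlying engine call):

xgb_spec0 %>% translate()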

Fit model

Fit the model with resamples:

# XGBoost
xgb_fit0 <- fit_resamples(xgb_wf0,
                          resamples = folds,
                          metrics = metric_set(accuracy, roc_auc, sens, spec),
                          control = control_resamples(save_pred = TRUE))

Performance metrics

xgb_fit0 %>% 
  collect_metrics()
## # A tibble: 4 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.797    10 0.00528 Preprocessor1_Model1
## 2 roc_auc  binary     0.842    10 0.00498 Preprocessor1_Model1
## 3 sens     binary     0.895    10 0.00679 Preprocessor1_Model1
## 4 spec     binary     0.526    10 0.00949 Preprocessor1_Model1

Final fit

Final fit to the whole training set, evaluated on the test set:

xgb_final <- last_fit(
  xgb_wf0,
  split = cust_split,
  metrics = metric_set(accuracy, roc_auc, sens, spec)
)

Performance metrics on the unseen testing data:

xgb_final %>% 
  collect_metrics()
## # A tibble: 4 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.802 Preprocessor1_Model1
## 2 sens     binary         0.894 Preprocessor1_Model1
## 3 spec     binary         0.548 Preprocessor1_Model1
## 4 roc_auc  binary         0.841 Preprocessor1_Model1

Confusion matrix

xgb_final %>%
  collect_predictions() %>%
  conf_mat(Churn,.pred_class)
##           Truth
## Prediction  No Yes
##        No  924 169
##        Yes 109 205
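
As a quick sanity check against the metrics above (with No as the event level): sensitivity = 924 / (924 + 109) ≈ 0.894 and specificity = 205 / (169 + 205) ≈ 0.548, matching the sens and spec estimates reported on the test set.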

Model with tuning

Next, we build the model with hyperparameter tuning:

Specify model

Create an XGBoost model specification with the hyperparameters we plan to tune:

# Set up a model specification
xgb_spec <- boost_tree(
  trees = 500,
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),             ## first three: model complexity
  sample_size = tune(), mtry = tune(), ## randomness
  learn_rate = tune()                  ## step size
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
xgb_spec
## Boosted Tree Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 500
##   min_n = tune()
##   tree_depth = tune()
##   learn_rate = tune()
##   loss_reduction = tune()
##   sample_size = tune()
## 
## Computational engine: xgboost

Workflow

# Pass a formula and the model specification to the workflow.
# Note: unlike the untuned model, this workflow uses a bare formula rather
# than the recipe; the formula preprocessor will create the dummy variables
# for the factor predictors itself.
xgb_wf <- workflow() %>%
  add_formula(Churn ~ .) %>%
  add_model(xgb_spec)
xgb_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: boost_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## Churn ~ .
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 500
##   min_n = tune()
##   tree_depth = tune()
##   learn_rate = tune()
##   loss_reduction = tune()
##   sample_size = tune()
## 
## Computational engine: xgboost

Tuning

Grid search parameters

Set up the search grid with grid_latin_hypercube(), a space-filling design that spreads candidate values across the parameter space:

xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), train),
  learn_rate(),
  size = 20
)
xgb_grid
## # A tibble: 20 × 6
##    tree_depth min_n loss_reduction sample_size  mtry learn_rate
##         <int> <int>          <dbl>       <dbl> <int>      <dbl>
##  1          8    32       8.88e- 7       0.937     9   4.81e- 8
##  2          6     5       4.53e-10       0.257     8   4.00e- 4
##  3          4    36       5.16e- 3       0.139     3   2.20e-10
##  4         14     2       5.16e- 9       0.334    11   1.31e- 9
##  5         11    18       4.93e+ 0       0.693     6   1.04e- 3
##  6         12    40       1.79e- 4       0.455    13   2.25e- 2
##  7         10    13       1.25e- 5       0.898    12   1.67e- 7
##  8         14    19       1.18e+ 1       0.677     3   2.21e- 6
##  9          2    37       1.87e- 3       0.555    12   9.63e- 8
## 10          5    17       2.16e- 1       0.784    14   3.08e- 3
## 11          4    11       9.24e- 8       0.762    10   5.48e- 3
## 12          6    27       1.29e- 1       0.207    16   4.84e-10
## 13         10    22       1.21e- 2       0.635     4   6.43e- 9
## 14          7    34       2.49e- 6       0.398     5   2.32e- 5
## 15         12    28       1.31e+ 0       0.547     2   2.69e- 5
## 16          9    30       4.87e- 8       0.983    16   5.27e- 6
## 17          3    14       3.45e- 4       0.162    15   5.67e- 7
## 18          2     7       3.46e- 5       0.323    18   4.11e- 9
## 19          9    24       2.09e-10       0.490    17   1.56e- 4
## 20         13     9       1.08e- 8       0.863     6   6.10e- 2

Grid search tuning

We use tune_grid() with the workflow, the resamples from the training set, and the grid to tune the parameters:

# Register a parallel backend so the folds run in parallel
doParallel::registerDoParallel()

set.seed(234)
xgb_res <-tune_grid(
  xgb_wf,
  resamples = folds,
  grid = xgb_grid,
  control = control_grid(save_pred  = TRUE)
)
xgb_res
## # Tuning results
## # 10-fold cross-validation using stratification 
## # A tibble: 10 × 5
##    splits             id     .metrics           .notes           .predictions
##    <list>             <chr>  <list>             <list>           <list>      
##  1 <split [5062/563]> Fold01 <tibble [40 × 10]> <tibble [0 × 3]> <tibble>    
##  2 <split [5062/563]> Fold02 <tibble [40 × 10]> <tibble [0 × 3]> <tibble>    
##  3 <split [5062/563]> Fold03 <tibble [40 × 10]> <tibble [0 × 3]> <tibble>    
##  4 <split [5062/563]> Fold04 <tibble [40 × 10]> <tibble [0 × 3]> <tibble>    
##  5 <split [5062/563]> Fold05 <tibble [40 × 10]> <tibble [0 × 3]> <tibble>    
##  6 <split [5063/562]> Fold06 <tibble [40 × 10]> <tibble [0 × 3]> <tibble>    
##  7 <split [5063/562]> Fold07 <tibble [40 × 10]> <tibble [0 × 3]> <tibble>    
##  8 <split [5063/562]> Fold08 <tibble [40 × 10]> <tibble [0 × 3]> <tibble>    
##  9 <split [5063/562]> Fold09 <tibble [40 × 10]> <tibble [0 × 3]> <tibble>    
## 10 <split [5063/562]> Fold10 <tibble [40 × 10]> <tibble [0 × 3]> <tibble>

Explore Results

Plot the mean roc_auc against each tuned parameter:

xgb_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  select(mean, mtry:sample_size) %>%
  pivot_longer(mtry:sample_size,
               names_to = "parameter",
               values_to = "value") %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x")

Alternatively, we can list the top hyperparameter combinations with show_best(), or pick the single best with select_best():

show_best(xgb_res, "roc_auc")
## # A tibble: 5 × 12
##    mtry min_n tree_depth learn_rate loss_r…¹ sampl…² .metric .esti…³  mean     n
##   <int> <int>      <int>      <dbl>    <dbl>   <dbl> <chr>   <chr>   <dbl> <int>
## 1    13    40         12    2.25e-2 1.79e- 4   0.455 roc_auc binary  0.848    10
## 2    14    17          5    3.08e-3 2.16e- 1   0.784 roc_auc binary  0.848    10
## 3    10    11          4    5.48e-3 9.24e- 8   0.762 roc_auc binary  0.847    10
## 4    12    13         10    1.67e-7 1.25e- 5   0.898 roc_auc binary  0.846    10
## 5     8     5          6    4.00e-4 4.53e-10   0.257 roc_auc binary  0.844    10
## # … with 2 more variables: std_err <dbl>, .config <chr>, and abbreviated
## #   variable names ¹​loss_reduction, ²​sample_size, ³​.estimator
best_auc <- select_best(xgb_res, "roc_auc")
best_auc
## # A tibble: 1 × 7
##    mtry min_n tree_depth learn_rate loss_reduction sample_size .config          
##   <int> <int>      <int>      <dbl>          <dbl>       <dbl> <chr>            
## 1    13    40         12     0.0225       0.000179       0.455 Preprocessor1_Mo…

Then we use the values from best_auc to finalize the model:

final_xgb <- finalize_workflow(xgb_wf, best_auc)
final_xgb
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: boost_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## Churn ~ .
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (classification)
## 
## Main Arguments:
##   mtry = 13
##   trees = 500
##   min_n = 40
##   tree_depth = 12
##   learn_rate = 0.0224765893635175
##   loss_reduction = 0.000179217810199367
##   sample_size = 0.454792870174861
## 
## Computational engine: xgboost

Check variable importance:

library(vip)

# Fit the finalized workflow on the training data, extract the underlying
# parsnip fit, and plot variable importance
final_xgb %>%
  fit(data = train) %>%
  extract_fit_parsnip() %>%
  vip(geom = "point")

Final model

Now, with the last_fit() function, we fit our final model to the whole training set and evaluate it on the unseen testing data. The performance metrics are shown below:

final_rs <- last_fit(final_xgb, cust_split,
                     metrics = metric_set(accuracy, roc_auc, sens, spec))
final_rs %>%
  collect_metrics()
## # A tibble: 4 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.810 Preprocessor1_Model1
## 2 sens     binary         0.910 Preprocessor1_Model1
## 3 spec     binary         0.535 Preprocessor1_Model1
## 4 roc_auc  binary         0.850 Preprocessor1_Model1

Our results show an improvement over the untuned model in both accuracy (0.802 → 0.810) and roc_auc (0.841 → 0.850) after hyperparameter tuning; the side-by-side comparison below makes this concrete.
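
To make the comparison concrete, we can stack the two test-set metric tables; a small sketch:

# Side-by-side test-set metrics for the untuned and tuned models
bind_rows(
  collect_metrics(xgb_final) %>% mutate(model = "untuned"),
  collect_metrics(final_rs) %>% mutate(model = "tuned")
) %>%
  select(model, .metric, .estimate)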

Confusion matrix

final_rs %>%
  collect_predictions() %>%
  conf_mat(Churn, .pred_class)
##           Truth
## Prediction  No Yes
##        No  940 174
##        Yes  93 200

ROC Curve

final_rs %>%
  collect_predictions() %>%
  # "Yes" is the second level of the Churn factor, so set event_level = "second"
  roc_curve(Churn, .pred_Yes, event_level = "second") %>%
  autoplot()
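
Finally, to score new customers, last_fit() stores the workflow fitted on the full training set; we can extract it with extract_workflow() and call predict() on fresh data. A minimal sketch (the five-row slice of test merely stands in for genuinely new data):

# Workflow fitted on the full training set
final_fitted <- extract_workflow(final_rs)

# Predicted churn probabilities for five held-out customers
predict(final_fitted, new_data = head(test, 5), type = "prob")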