Tidymodels Workflow and Hyperparameter Tuning

Kate C

2022-02-04

Load Packages and Dataset

  • tidymodels - the package suite that is the study objective of this lesson

  • data - continue with the telecom data used in the last lesson, with training and test sets prepared via the initial_split() sampling function.

library(tidymodels)
## Registered S3 method overwritten by 'tune':
##   method                   from   
##   required_pkgs.model_spec parsnip
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.4 ──
## ✓ broom        0.7.10     ✓ recipes      0.1.17
## ✓ dials        0.0.10     ✓ rsample      0.1.1 
## ✓ dplyr        1.0.7      ✓ tibble       3.1.6 
## ✓ ggplot2      3.3.5      ✓ tidyr        1.1.4 
## ✓ infer        1.0.0      ✓ tune         0.1.6 
## ✓ modeldata    0.1.1      ✓ workflows    0.2.4 
## ✓ parsnip      0.1.7      ✓ workflowsets 0.1.0 
## ✓ purrr        0.3.4      ✓ yardstick    0.0.9
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x purrr::discard() masks scales::discard()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x recipes::step()  masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
telecom_df <- readRDS("~/Documents/R programming/Datacamp/DC_Tidymodels/Data/telecom_df.rds")
telecom_split <- initial_split(telecom_df, prop = 0.75, strata = canceled_service)
telecom_training <- telecom_split %>% 
                    training()
telecom_test <- telecom_split %>% 
                testing()

Key Points

Exploring recipe objects

  • first step in feature engineering is to specify a recipe object with the recipe() function

  • and add data preprocessing steps with one or more step_*() functions

  • storing all of this information in a single recipe object makes it easier to manage complex feature engineering pipelines and to transform new data sources - the latter is especially important because it lets us transform the test data in exactly the same way as the training data

  • create a recipe - the outcome is canceled_service, all other columns are predictors, and avg_call_mins is log transformed (base 10).

telecom_rec <- recipe(canceled_service ~ ., 
                      data = telecom_df) %>% 
  step_log(avg_call_mins, base = 10)

telecom_rec$var_info %>% 
  group_by(type) %>% 
  count()
## # A tibble: 2 × 2
## # Groups:   type [2]
##   type        n
##   <chr>   <int>
## 1 nominal     4
## 2 numeric     5
summary(telecom_rec)
## # A tibble: 9 × 4
##   variable            type    role      source  
##   <chr>               <chr>   <chr>     <chr>   
## 1 cellular_service    nominal predictor original
## 2 avg_data_gb         numeric predictor original
## 3 avg_call_mins       numeric predictor original
## 4 avg_intl_mins       numeric predictor original
## 5 internet_service    nominal predictor original
## 6 contract            nominal predictor original
## 7 months_with_company numeric predictor original
## 8 monthly_charges     numeric predictor original
## 9 canceled_service    nominal outcome   original

here we can see there are four nominal variables (three predictors plus the outcome canceled_service) and five numeric predictors.

Creating recipe objects

  • on training dataset

  • log transform two predictor variables - this reduces the range of these variables and can make their distributions more symmetric, which may improve the accuracy of the logistic regression model. from a statistical perspective, a log transform may also stabilize the variance in a way that makes inference more legitimate.

telecom_log_rec <- recipe(canceled_service ~ ., data = telecom_training) %>% 
  step_log(avg_call_mins, avg_intl_mins, base = 10)

telecom_log_rec %>% 
  summary()
## # A tibble: 9 × 4
##   variable            type    role      source  
##   <chr>               <chr>   <chr>     <chr>   
## 1 cellular_service    nominal predictor original
## 2 avg_data_gb         numeric predictor original
## 3 avg_call_mins       numeric predictor original
## 4 avg_intl_mins       numeric predictor original
## 5 internet_service    nominal predictor original
## 6 contract            nominal predictor original
## 7 months_with_company numeric predictor original
## 8 monthly_charges     numeric predictor original
## 9 canceled_service    nominal outcome   original

Training a recipe object - with steps

Above, we created a recipe with instructions to apply a log transformation to two predictor variables.

The next step in the feature engineering process is to train the recipe object using the training data.

Then we can apply the trained recipe to both the training and test datasets to prepare them for model fitting and model evaluation.

  • train telecom_log_rec object using the telecom_training dataset
telecom_log_rec_prep <- telecom_log_rec %>% 
  prep(training = telecom_training)

telecom_log_rec_prep
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          8
## 
## Training data contained 731 data points and no missing data.
## 
## Operations:
## 
## Log transformation on avg_call_mins, avg_intl_mins [trained]
  • apply to training data with bake. note that the avg_call_mins and avg_intl_mins have been log transformed.
telecom_log_rec_prep %>% 
  bake(new_data = NULL)
## # A tibble: 731 × 9
##    cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
##    <fct>                  <dbl>         <dbl>         <dbl> <fct>           
##  1 multiple_lines          8.05          2.52          2.09 digital         
##  2 single_line             9.3           2.51          2.06 fiber_optic     
##  3 multiple_lines          9.4           2.49          2.17 fiber_optic     
##  4 multiple_lines          9.96          2.53          2.13 fiber_optic     
##  5 single_line             9.37          2.58          1.94 fiber_optic     
##  6 multiple_lines         10.6           2.45          2.17 fiber_optic     
##  7 multiple_lines          5.17          2.53          2.08 digital         
##  8 multiple_lines          7.86          2.58          2.21 digital         
##  9 multiple_lines         11.0           2.59          1.89 fiber_optic     
## 10 multiple_lines         11.8           2.54          2.10 fiber_optic     
## # … with 721 more rows, and 4 more variables: contract <fct>,
## #   months_with_company <dbl>, monthly_charges <dbl>, canceled_service <fct>
  • apply to test data. note that the avg_call_mins and avg_intl_mins have been log transformed.
telecom_log_rec_prep %>% 
  bake(new_data = telecom_test)
## # A tibble: 244 × 9
##    cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
##    <fct>                  <dbl>         <dbl>         <dbl> <fct>           
##  1 single_line             9.04          2.53          1.94 fiber_optic     
##  2 single_line            10.3           2.42          1.74 fiber_optic     
##  3 multiple_lines          8.01          2.72          1.99 fiber_optic     
##  4 multiple_lines         10.2           2.60          2.06 fiber_optic     
##  5 single_line             6.69          2.55          1.96 digital         
##  6 multiple_lines          4.11          2.57          1.81 digital         
##  7 single_line             8.67          1.97          2.12 fiber_optic     
##  8 multiple_lines          9.24          2.59          2.15 fiber_optic     
##  9 single_line             8.02          2.38          1.98 fiber_optic     
## 10 multiple_lines          6.79          2.71          2.15 digital         
## # … with 234 more rows, and 4 more variables: contract <fct>,
## #   months_with_company <dbl>, monthly_charges <dbl>, canceled_service <fct>

Steps transforming numeric variables

  • Discovering correlated predictors and remove them

  • specify a recipe object (named with cor to indicate that it removes highly correlated predictors, here defined as a correlation greater than 0.8)

telecom_cor_rec <- recipe(canceled_service ~ ., data = telecom_training) %>% 
  step_corr(all_numeric(), threshold = 0.8)
  • train the recipe on the training dataset using the prep() function
telecom_cor_rec_prep <- telecom_cor_rec %>% 
  prep(training = telecom_training)
  • use the trained recipe to obtain the transformed training dataset, i.e. the one with highly correlated predictors removed, via the bake() function. note that monthly_charges has been filtered out by the correlation step
telecom_cor_rec_prep %>% 
  bake(new_data = NULL)
## # A tibble: 731 × 8
##    cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
##    <fct>                  <dbl>         <dbl>         <dbl> <fct>           
##  1 multiple_lines          8.05           328           122 digital         
##  2 single_line             9.3            326           114 fiber_optic     
##  3 multiple_lines          9.4            312           147 fiber_optic     
##  4 multiple_lines          9.96           340           136 fiber_optic     
##  5 single_line             9.37           382            87 fiber_optic     
##  6 multiple_lines         10.6            281           147 fiber_optic     
##  7 multiple_lines          5.17           341           119 digital         
##  8 multiple_lines          7.86           378           164 digital         
##  9 multiple_lines         11.0            390            78 fiber_optic     
## 10 multiple_lines         11.8            343           126 fiber_optic     
## # … with 721 more rows, and 3 more variables: contract <fct>,
## #   months_with_company <dbl>, canceled_service <fct>
  • similarly apply the trained recipe to test data
telecom_cor_rec_prep %>% 
  bake(new_data = telecom_test)
## # A tibble: 244 × 8
##    cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
##    <fct>                  <dbl>         <dbl>         <dbl> <fct>           
##  1 single_line             9.04           336            88 fiber_optic     
##  2 single_line            10.3            262            55 fiber_optic     
##  3 multiple_lines          8.01           525            97 fiber_optic     
##  4 multiple_lines         10.2            402           116 fiber_optic     
##  5 single_line             6.69           352            91 digital         
##  6 multiple_lines          4.11           371            64 digital         
##  7 single_line             8.67            93           131 fiber_optic     
##  8 multiple_lines          9.24           392           142 fiber_optic     
##  9 single_line             8.02           240            95 fiber_optic     
## 10 multiple_lines          6.79           514           141 digital         
## # … with 234 more rows, and 3 more variables: contract <fct>,
## #   months_with_company <dbl>, canceled_service <fct>

More on feature engineering

  • a key feature of the recipes package is that we can include multiple preprocessing steps in a single recipe object.

  • these steps will be carried out in the order in which they are entered with the step_*() functions

  • below we include two steps to preprocess the data - remove highly correlated predictors and normalize all numeric predictors

  • specify a recipe object

telecom_norm_rec <- recipe(canceled_service ~ .,
                          data = telecom_training) %>% 
  # Remove correlated variables
  step_corr(all_numeric(), threshold = 0.8) %>% 
  # Normalize numeric predictors
  step_normalize(all_numeric())
  • train the recipe
telecom_norm_rec_prep <- telecom_norm_rec %>% 
  prep(training = telecom_training)
  • apply to test data
telecom_norm_rec_prep %>% 
  bake(new_data = telecom_test)
## # A tibble: 244 × 8
##    cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
##    <fct>                  <dbl>         <dbl>         <dbl> <fct>           
##  1 single_line            0.417       -0.203         -0.648 fiber_optic     
##  2 single_line            1.09        -1.20          -1.71  fiber_optic     
##  3 multiple_lines        -0.126        2.34          -0.360 fiber_optic     
##  4 multiple_lines         1.02         0.685          0.248 fiber_optic     
##  5 single_line           -0.822        0.0127        -0.552 digital         
##  6 multiple_lines        -2.18         0.268         -1.42  digital         
##  7 single_line            0.222       -3.47           0.729 fiber_optic     
##  8 multiple_lines         0.522        0.551          1.08  fiber_optic     
##  9 single_line           -0.121       -1.49          -0.424 fiber_optic     
## 10 multiple_lines        -0.769        2.19           1.05  digital         
## # … with 234 more rows, and 3 more variables: contract <fct>,
## #   months_with_company <dbl>, canceled_service <fct>

Nominal Predictors

Key points

  • process nominal predictors

  • nominal data must be transformed to numeric data for modeling

  • one-hot encoding - maps categorical values to a sequence of 0/1 indicator variables (see the sketch below).
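
As a preview, step_dummy() (covered below) performs the closely related dummy encoding by default, and can produce a full one-hot encoding via its one_hot argument. A minimal sketch on the telecom training data:

# one_hot = TRUE keeps an indicator column for every factor level,
# instead of dropping the first level as dummy encoding does
recipe(canceled_service ~ ., data = telecom_training) %>% 
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>% 
  prep(training = telecom_training) %>% 
  bake(new_data = NULL)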

More on encoding qualitative data in a numeric format

reference from book Tidy Modeling with R

  • one of the most common feature engineering tasks is to transform nominal or qualitative data (factors or character columns) so that it can be encoded or represented numerically.

  • sometimes we can alter the factor levels of a qualitative column in helpful ways prior to such a transformation.

  • step_unknown() - change missing values to a dedicated factor level

  • step_novel() - if anticipate that a new factor level may be encountered in future data, this can allot a new level for this purpose.

  • step_other() - can be used to analyze the frequencies of the factor levels in the training set and convert infrequently occurring values to a catch-all level of "other", with a threshold that can be specified; e.g. the bottom 1% of levels can be lumped into a new level called "other" in the modeling process (see the sketch after this list).

  • there are a few strategies for converting a factor predictor to a numeric format. the most common method is to create "dummy" or indicator variables, i.e. "dummy variable encoding".

    • the nominal variable column will be replaced with numeric columns whose values are either zero or one

    • these binary variables represent specific factor level values

    • in R, the convention is to exclude a column for the first factor level - if you know the values of the remaining columns, you can determine the excluded one, since the categories are mutually exclusive.

    • note that step_dummy() eliminates the original data column and replaces it with one or more columns with different names
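
A minimal sketch chaining these three level-management steps on the telecom training data (the 1% threshold is illustrative, and the telecom factors may not actually contain missing or rare levels):

telecom_levels_rec <- recipe(canceled_service ~ ., data = telecom_training) %>% 
  step_unknown(all_nominal_predictors()) %>%   # missing values -> dedicated "unknown" level
  step_novel(all_nominal_predictors()) %>%     # reserve a level for values unseen in training
  step_other(all_nominal_predictors(), threshold = 0.01)   # lump rare levels into "other"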

Continue with Key points

  • therefore for the dummy variable encoding -

    • exclude one value from original set of data values.

    • n distinct values produce (n-1) indicator variables.

    • this is the preferred method for modeling and the default in the recipes package

hence, for example, a three-level factor such as contract is mapped to a sequence of two indicator variables (contract_one_year and contract_two_year), as seen in the bake() output later in this section.

Steps for dummy variables encoding

  • creates dummy variables from nominal predictor variables

  • specify a recipe with model formula and data

  • pass it to the step_dummy() function and select the character/factor columns to process

  • pass the results to the prep function where the recipe is trained on the training data

  • finally pass into bake function where we apply the transform on test data

  • can select columns by type, e.g. all_nominal(), and use -all_outcomes() to exclude the outcome variable

  • note that it is possible to fit models without one-hot encoding, but support for this is not consistent across all engines, so it is recommended to use a recipe to standardize the process.

Follow-up practice

  • order of step_*() functions - step functions within a recipe are carried out in sequential order.

  • it is important to keep the above in mind to avoid unexpected results in the feature engineering pipeline

  • in general it is best to normalize variables before creating dummy variables

showcase - apply the same two steps in both orders and compare the results
  • first normalize, then create dummy variables - note the result contains dummy variables coded 0 and 1
telecom_recipe_1 <- 
  recipe(canceled_service ~ avg_data_gb + contract, data = telecom_training) %>% 
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes())

# train and apply telecom_recipe_1 on the test data
telecom_recipe_1 %>% 
  prep(training = telecom_training) %>% 
  bake(new_data = telecom_test)
## # A tibble: 244 × 4
##    avg_data_gb canceled_service contract_one_year contract_two_year
##          <dbl> <fct>                        <dbl>             <dbl>
##  1       0.417 yes                              0                 0
##  2       1.09  no                               1                 0
##  3      -0.126 yes                              0                 0
##  4       1.02  no                               0                 0
##  5      -0.822 no                               0                 0
##  6      -2.18  no                               0                 1
##  7       0.222 no                               0                 1
##  8       0.522 no                               0                 0
##  9      -0.121 no                               0                 0
## 10      -0.769 yes                              1                 0
## # … with 234 more rows
  • first create dummy variables, then normalize. note that because the indicator columns are normalized after they are created, they end up as fractional values instead of the 0 and 1 shown in the case above.
telecom_recipe_2 <- 
  recipe(canceled_service ~ avg_data_gb + contract, data = telecom_training)  %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_normalize(all_numeric(), -all_outcomes())

# Train and apply telecom_recipe_2 on the test data
telecom_recipe_2 %>% 
  prep(training = telecom_training) %>% 
  bake(new_data = telecom_test)
## # A tibble: 244 × 4
##    avg_data_gb canceled_service contract_one_year contract_two_year
##          <dbl> <fct>                        <dbl>             <dbl>
##  1       0.417 yes                         -0.510            -0.469
##  2       1.09  no                           1.96             -0.469
##  3      -0.126 yes                         -0.510            -0.469
##  4       1.02  no                          -0.510            -0.469
##  5      -0.822 no                          -0.510            -0.469
##  6      -2.18  no                          -0.510             2.13 
##  7       0.222 no                          -0.510             2.13 
##  8       0.522 no                          -0.510            -0.469
##  9      -0.121 no                          -0.510            -0.469
## 10      -0.769 yes                          1.96             -0.469
## # … with 234 more rows

also note that since we only specified two predictor variables (avg_data_gb and contract) in the model formula, the remaining columns are ignored by the recipe objects when transforming new data sources. essentially this is like specifying predictor variables in a regression model - if they are not specified, they are not used for training and modeling in the following steps.

Complete feature engineering pipeline

Write a code chunk to complete the following tasks

  • create a recipe that predicts canceled_service using all predictor variables in the training data

  • remove correlated predictor variables using a 0.8 threshold

  • normalize all numeric predictors

  • create dummy variables for all nominal predictors

  • train the recipe on the training data and apply it to the test data

telecom_recipe <- recipe(canceled_service ~ ., data = telecom_training) %>% 
  step_corr(all_numeric(), threshold = 0.8) %>% 
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes())

telecom_recipe %>% 
  prep(training = telecom_training) %>% 
  bake(new_data = telecom_test)
## # A tibble: 244 × 9
##    avg_data_gb avg_call_mins avg_intl_mins months_with_company canceled_service
##          <dbl>         <dbl>         <dbl>               <dbl> <fct>           
##  1       0.417       -0.203         -0.648              -0.981 yes             
##  2       1.09        -1.20          -1.71                0.643 no              
##  3      -0.126        2.34          -0.360              -0.615 yes             
##  4       1.02         0.685          0.248              -0.697 no              
##  5      -0.822        0.0127        -0.552              -1.14  no              
##  6      -2.18         0.268         -1.42                1.54  no              
##  7       0.222       -3.47           0.729               0.847 no              
##  8       0.522        0.551          1.08                1.42  no              
##  9      -0.121       -1.49          -0.424              -1.02  no              
## 10      -0.769        2.19           1.05               -0.900 yes             
## # … with 234 more rows, and 4 more variables:
## #   cellular_service_single_line <dbl>, internet_service_digital <dbl>,
## #   contract_one_year <dbl>, contract_two_year <dbl>

Complete Modeling Flow

Now that we have covered feature engineering (for both numeric and nominal variables) and the recipe, prep, and bake process, it is time to incorporate feature engineering into the modeling flow itself.

prior to that, let's re-create the logistic regression model with parsnip (note that the engine is glm and the mode is classification)

logistic_model <- logistic_reg() %>% 
  set_engine('glm') %>% 
  set_mode('classification')
  • step 1 - create a recipe and include necessary steps (correlation, log transform, normalize, and create dummy variables)
telecom_recipe <- recipe(canceled_service ~ ., data = telecom_training) %>% 
  step_corr(all_numeric(), threshold = 0.8) %>% 
  step_log(all_numeric(), base = 10) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes())
  • step 2 - train the telecom_recipe object using the training data, and use the trained recipe to transform the training data
telecom_recipe_prep <- telecom_recipe %>% 
  prep(training = telecom_training)

telecom_training_prep <- telecom_recipe_prep %>% 
  bake(new_data = NULL)
  • step 3 - apply trained recipe object to the test dataset and view the results
telecom_test_prep <- telecom_recipe_prep %>% 
  bake(new_data = telecom_test)

telecom_test_prep
## # A tibble: 244 × 9
##    avg_data_gb avg_call_mins avg_intl_mins months_with_company canceled_service
##          <dbl>         <dbl>         <dbl>               <dbl> <fct>           
##  1    0.482          -0.0802        -0.495             -0.534  yes             
##  2    1.01           -1.14          -1.95               0.697  no              
##  3    0.000391        1.83          -0.194             -0.0430 yes             
##  4    0.958           0.687          0.360             -0.128  no              
##  5   -0.716           0.119         -0.392             -0.925  no              
##  6   -2.65            0.343         -1.48               0.976  no              
##  7    0.315          -5.57           0.736              0.770  no              
##  8    0.569           0.579          0.986              0.944  no              
##  9    0.00535        -1.52          -0.258             -0.615  no              
## 10   -0.657           1.74           0.964             -0.395  yes             
## # … with 234 more rows, and 4 more variables:
## #   cellular_service_single_line <dbl>, internet_service_digital <dbl>,
## #   contract_one_year <dbl>, contract_two_year <dbl>
  • step 4 - train logistic model on processed training data
logistic_fit <- logistic_model %>% 
  fit(canceled_service ~ ., data = telecom_training_prep)
  • step 5 - obtain class predictions and estimated probabilities on the test dataset
class_preds <- predict(logistic_fit, new_data = telecom_test_prep, type = 'class')
prob_preds <- predict(logistic_fit, new_data = telecom_test_prep, type = 'prob')
  • step 6 - combine the predictions with the actual outcome values from the test set
telecom_results <- telecom_test_prep %>% 
  select(canceled_service) %>% 
  cbind(class_preds, prob_preds)

head(telecom_results,5)
##   canceled_service .pred_class .pred_yes  .pred_no
## 1              yes         yes 0.5623575 0.4376425
## 2               no          no 0.2229379 0.7770621
## 3              yes         yes 0.8250824 0.1749176
## 4               no         yes 0.6356467 0.3643533
## 5               no          no 0.3911968 0.6088032

by now we have created a tibble of model results on the test dataset with the actual outcome variable values, predicted outcome values, and estimated probabilities of the positive and negative classes. the next step is to evaluate these results.

  • step 7 - evaluate the results and plot ROC curve.
# Create a confusion matrix
telecom_results %>% 
  conf_mat(truth = canceled_service, estimate = .pred_class)
##           Truth
## Prediction yes  no
##        yes  52  14
##        no   30 148
# Calculate sensitivity
telecom_results %>% 
  sens(truth = canceled_service, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 sens    binary         0.634
# Calculate specificity
telecom_results %>% 
  spec(truth = canceled_service, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 spec    binary         0.914
# Plot ROC curve - note that roc_curve() takes the truth column and the estimated probability of the positive class
telecom_results %>% 
  roc_curve(truth = canceled_service, .pred_yes) %>% 
  autoplot()
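
the curve above can be summarized numerically with roc_auc(), which takes the same truth and estimated-probability columns as roc_curve():

# Calculate the area under the ROC curve
telecom_results %>% 
  roc_auc(truth = canceled_service, .pred_yes)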

Machine learning workflows and hyperparameter tuning

Using decision tree models as an example

  • the workflows package is designed to streamline the modeling process - it combines a parsnip model (which defines and fits the model) and a recipe object into a single workflow object

  • a workflow is an object as well - using a workflow encourages good methodology since it is a single point of entry to the estimation components of a data analysis; it also enables the user to better organize projects.

  • it is important to focus on the broader modeling process, instead of only fitting the specific model used to estimate parameters.

  • the broader process includes any preprocessing steps, the model fit itself, and potential post-processing activities - hence the name workflows

Practice - Creating Workflow

  • load the data - loans_df - contains financial information on consumer loans at a bank. the outcome variable in this data is loan_default.

  • Objective - create a decision tree model object and specify a feature engineering pipeline for the loan data

loans_df <- readRDS("~/Documents/R programming/Datacamp/DC_Tidymodels/Data/loan_df.rds")
  • step 1 - exploring the dataset (create training and testing datasets; check for correlated predictors)
loans_split <- initial_split(loans_df, strata = loan_default)
loans_training <- loans_split %>% training()
loans_test <- loans_split %>% testing()

# Check for correlated predictors
loans_training %>% 
  # Select numeric columns
  select_if(is.numeric) %>%
  # Calculate correlation matrix
  cor()
##                loan_amount interest_rate installment annual_income
## loan_amount    1.000000000   0.007833244  0.93928366     0.3324928
## interest_rate  0.007833244   1.000000000  0.04018282    -0.0927163
## installment    0.939283660   0.040182816  1.00000000     0.2847142
## annual_income  0.332492764  -0.092716299  0.28471419     1.0000000
## debt_to_income 0.144739018   0.171768123  0.19004895    -0.1864699
##                debt_to_income
## loan_amount         0.1447390
## interest_rate       0.1717681
## installment         0.1900490
## annual_income      -0.1864699
## debt_to_income      1.0000000
  • step 2 - specifying a decision tree model and recipe. note from the correlation matrix above that loan_amount and installment are highly correlated (0.94), so the 0.85 correlation filter below will remove one of them
dt_model <- decision_tree() %>% 
  # Specify the engine
  set_engine('rpart') %>% 
  # Specify the mode
  set_mode('classification')

# Build feature engineering pipeline
loans_recipe <- recipe(loan_default ~ .,
                        data = loans_training) %>% 
  # Correlation filter
  step_corr(all_numeric(), threshold = 0.85) %>% 
  # Normalize numeric predictors
  step_normalize(all_numeric()) %>% 
  # Create dummy variables
  step_dummy(all_nominal_predictors())

loans_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          7
## 
## Operations:
## 
## Correlation filter on all_numeric()
## Centering and scaling for all_numeric()
## Dummy variables from all_nominal_predictors()
  • step 3 - creating workflows. here we combine the decision tree model and the feature engineering recipe into a single workflow object, then perform model fitting and evaluation with last_fit(). the metrics below show a test-set area under the ROC curve of 0.782.
#create a workflow
loans_dt_wkfl <- workflow() %>% 
  add_model(dt_model) %>% 
  add_recipe(loans_recipe)
#train the workflow
loans_dt_wkfl_fit <- loans_dt_wkfl %>% 
  last_fit(split = loans_split)
#calculate performance metrics on test data
loans_dt_wkfl_fit %>% 
  collect_metrics()
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.763 Preprocessor1_Model1
## 2 roc_auc  binary         0.782 Preprocessor1_Model1
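
Note that last_fit() both trains the workflow on the training portion of the split and evaluates it on the test portion. If only the fitted workflow itself is needed, it can also be trained directly with fit() - a sketch:

# fit the workflow on the training data; the result is a trained
# workflow that can be used with predict()
loans_dt_wkfl_trained <- loans_dt_wkfl %>% 
  fit(data = loans_training)

predict(loans_dt_wkfl_trained, new_data = loans_test)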

Estimating performance with cross validation

  • Cross-validation is a well-established resampling method - the most common variant is V-fold cross-validation.

  • The data are randomly partitioned into V sets of roughly equal size (called the "folds")

  • Resampling methods are empirical simulation systems that emulate the process of using some data for modeling and different data for evaluation.

  • Most resampling methods are iterative, meaning that this process is repeated multiple times.

  • Note that resampling is only conducted on the training set. For each iteration of resampling, the data are partitioned into two subsamples.

    • the model is fit with the analysis set

    • the model is evaluated with the assessment set.

  • in tidymodels, we use vfold_cv() for cross validation - use training data to provide multiple estimates of model performance.

Measuring performance with cross validation on Decision Tree

  • Objective - perform cross validation with decision tree model to explore its performance

  • create a cross validation object with 5 folds using training data, making sure to stratify by the outcome variable. set.seed to make sure results are reproducible.

set.seed(290)
loans_folds <- vfold_cv(loans_training, v = 5, strata = loan_default)
loans_folds
## #  5-fold cross-validation using stratification 
## # A tibble: 5 × 2
##   splits            id   
##   <list>            <chr>
## 1 <split [521/132]> Fold1
## 2 <split [522/131]> Fold2
## 3 <split [523/130]> Fold3
## 4 <split [523/130]> Fold4
## 5 <split [523/130]> Fold5
  • create a custom metric function that includes the area under the ROC curve, sensitivity, and specificity
loans_metrics <- metric_set(roc_auc, sens, spec)
  • use the decision tree workflow to perform cross-validation with the folds and the custom metric set
loans_dt_rs <- loans_dt_wkfl %>% 
  fit_resamples(resamples = loans_folds, metrics = loans_metrics)
  • view performance metrics
loans_dt_rs %>% collect_metrics()
## # A tibble: 3 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.795     5 0.0183  Preprocessor1_Model1
## 2 sens    binary     0.643     5 0.0239  Preprocessor1_Model1
## 3 spec    binary     0.913     5 0.00685 Preprocessor1_Model1

Measuring performance with cross validation on Logistic Regression

  • Objective - perform cross validation on the loans_training data using logistic regression and compare the results to those of the decision tree model

  • create a logistic regression model

logistic_model <- logistic_reg() %>% 
  set_engine('glm') %>% 
  set_mode('classification')
  • create workflow
loans_logistic_wkfl <- workflow() %>% 
  add_model(logistic_model) %>% 
  add_recipe(loans_recipe)
  • fit to v-folds
loans_logistic_rs <- loans_logistic_wkfl %>% 
  fit_resamples(resamples = loans_folds, metrics = loans_metrics)
  • view performance metrics
loans_logistic_rs %>% collect_metrics()
## # A tibble: 3 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.860     5  0.0154 Preprocessor1_Model1
## 2 sens    binary     0.675     5  0.0321 Preprocessor1_Model1
## 3 spec    binary     0.860     5  0.0107 Preprocessor1_Model1

Comparing model performance profiles

  • note that the collect_metrics() function returns a tibble of cross-validation results - this makes it easy to calculate summary statistics with the dplyr package
library(dplyr)
  • collect detailed cross-validation results for the decision tree model and calculate key summary statistics of the metric values by metric type
dt_rs_results <- loans_dt_rs %>% 
  collect_metrics(summarize = FALSE)

dt_rs_results %>% 
  group_by(.metric) %>% 
  summarize(min = min(.estimate), 
            median = median(.estimate), 
            max = max(.estimate))
## # A tibble: 3 × 4
##   .metric   min median   max
##   <chr>   <dbl>  <dbl> <dbl>
## 1 roc_auc 0.741  0.803 0.850
## 2 sens    0.569  0.64  0.706
## 3 spec    0.888  0.914 0.925
  • collect detailed cross-validation results for the logistic model and calculate key summary statistics of the metric values by metric type
logistic_rs_results <- loans_logistic_rs %>% 
  collect_metrics(summarize = FALSE)

logistic_rs_results %>% 
  group_by(.metric) %>% 
  summarize(min = min(.estimate), 
            median = median(.estimate), 
            max = max(.estimate))
## # A tibble: 3 × 4
##   .metric   min median   max
##   <chr>   <dbl>  <dbl> <dbl>
## 1 roc_auc 0.822  0.857 0.913
## 2 sens    0.569  0.7   0.76 
## 3 spec    0.838  0.852 0.9
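
Since both summaries are plain tibbles, they can also be stacked for a side-by-side comparison - a sketch using dplyr:

# label each model's cross validation summary and combine them
bind_rows(
  dt = collect_metrics(loans_dt_rs),
  logistic = collect_metrics(loans_logistic_rs),
  .id = "model"
)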

Hyperparameter tuning

  • Hyperparameter tuning (a hyperparameter is also called a tuning parameter) is another method of optimizing model performance

  • Definition 1 - model parameters whose values are set prior to model training and that control model complexity (e.g. the number of nearest neighbors);

  • Definition 2 - an unknown structural or other kind of value that has significant impact on the model but cannot be directly estimated from the data

  • default setting parameters might not be the best values for all datasets

  • To signal to tidymodels functions which arguments should be optimized, parameters are marked for tuning by assigning them a value of tune(), so that downstream functions know these parameters need to be optimized

  • Grid search - most common method for tuning hyperparameters

    • generate a grid of unique combinations of hyperparameter values

    • for each combination, use cross validation to estimate model performance, and then choose the best performing combination

  • the parameters() function identifies the hyperparameters that have been marked for tuning

  • random grid - generates random combinations of hyperparameter values - tends to provide a greater chance of finding optimal hyperparameters

Practice

Setting model hyperparameters

  • first step - create a parsnip decision tree and set all three of its hyperparameters for tuning.

  • here we use the rpart engine

  • checking decision_tree()'s documentation, there are three tunable arguments (besides mode and engine): cost_complexity, tree_depth, and min_n

    • cost_complexity: penalizes large number of terminal nodes

    • tree_depth: longest path from root to terminal node

    • min_n: minimum data points required in a node for further splitting

dt_tune_model <- decision_tree(
                 cost_complexity = tune(), 
                 tree_depth = tune(), 
                 min_n = tune()
) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")
  • second step - create a tuning workflow by updating the loans_dt_wkfl object with the new decision tree model. when we print the new workflow object, the decision tree hyperparameters appear under the Main Arguments section
loans_tune_wkfl <- loans_dt_wkfl %>% 
  update_model(dt_tune_model)
loans_tune_wkfl
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_corr()
## • step_normalize()
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = tune()
##   tree_depth = tune()
##   min_n = tune()
## 
## Computational engine: rpart

Random grid search

  • grid search is the most common method of hyperparameter tuning.

  • goal is to find the optimal combination of values (of hyperparameters) for maximizing model performance

  • a random grid is a non-regular grid - the grid_random() function generates independent uniform random values across the parameter ranges.

  • first step - Create a random grid of 5 hyperparameter value combinations using the hyperparameters of the dt_tune_model. Use set.seed to ensure reproducibility.

set.seed(123)
dt_grid <- grid_random(parameters(dt_tune_model), size = 5)
dt_grid
## # A tibble: 5 × 3
##   cost_complexity tree_depth min_n
##             <dbl>      <int> <int>
## 1    0.0000000387         10    15
## 2    0.00124               2    26
## 3    0.000000480           6    27
## 4    0.00885              11    28
## 5    0.0291                5     6
  • second step - use the loans_tune_wkfl object to perform hyperparameter tuning over the tuning grid (set up above) with the cross-validation folds and custom metrics (set up in previous steps). note that the results of tune_grid() only provide the substrate to choose appropriate tuning parameters - the function itself does not fit a final model.
dt_tuning <- loans_tune_wkfl %>% 
  tune_grid(
    resamples = loans_folds,
    grid = dt_grid,
    metrics = loans_metrics
  )
  • view the results - 15 rows, because there are five random hyperparameter combinations and three performance metrics. each row shows the average of the five cross-validation estimates of a metric for a given hyperparameter combination.
dt_tuning %>% 
  collect_metrics()
## # A tibble: 15 × 9
##    cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##              <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
##  1    0.0000000387         10    15 roc_auc binary     0.833     5 0.0226 
##  2    0.0000000387         10    15 sens    binary     0.662     5 0.0356 
##  3    0.0000000387         10    15 spec    binary     0.858     5 0.0144 
##  4    0.00124               2    26 roc_auc binary     0.779     5 0.0116 
##  5    0.00124               2    26 sens    binary     0.631     5 0.0214 
##  6    0.00124               2    26 spec    binary     0.915     5 0.00465
##  7    0.000000480           6    27 roc_auc binary     0.864     5 0.0113 
##  8    0.000000480           6    27 sens    binary     0.678     5 0.00865
##  9    0.000000480           6    27 spec    binary     0.878     5 0.0199 
## 10    0.00885              11    28 roc_auc binary     0.811     5 0.0234 
## 11    0.00885              11    28 sens    binary     0.631     5 0.0239 
## 12    0.00885              11    28 spec    binary     0.898     5 0.0134 
## 13    0.0291                5     6 roc_auc binary     0.779     5 0.0118 
## 14    0.0291                5     6 sens    binary     0.643     5 0.0231 
## 15    0.0291                5     6 spec    binary     0.915     5 0.00465
## # … with 1 more variable: .config <chr>

Explore tuning results

  • extract detailed tuning results from dt_tuning object
dt_tuning_results <- dt_tuning %>% 
  collect_metrics(summarize = FALSE)
dt_tuning_results
## # A tibble: 75 × 8
##    id    cost_complexity tree_depth min_n .metric .estimator .estimate .config  
##    <chr>           <dbl>      <int> <int> <chr>   <chr>          <dbl> <chr>    
##  1 Fold1    0.0000000387         10    15 sens    binary         0.627 Preproce…
##  2 Fold1    0.0000000387         10    15 spec    binary         0.840 Preproce…
##  3 Fold1    0.0000000387         10    15 roc_auc binary         0.762 Preproce…
##  4 Fold2    0.0000000387         10    15 sens    binary         0.784 Preproce…
##  5 Fold2    0.0000000387         10    15 spec    binary         0.812 Preproce…
##  6 Fold2    0.0000000387         10    15 roc_auc binary         0.871 Preproce…
##  7 Fold3    0.0000000387         10    15 sens    binary         0.7   Preproce…
##  8 Fold3    0.0000000387         10    15 spec    binary         0.888 Preproce…
##  9 Fold3    0.0000000387         10    15 roc_auc binary         0.890 Preproce…
## 10 Fold4    0.0000000387         10    15 sens    binary         0.6   Preproce…
## # … with 65 more rows
  • summarize the min, median, and max area under the ROC curve for each fold in the detailed tuning results, using dplyr
dt_tuning_results %>% 
  filter(.metric == "roc_auc") %>% 
  group_by(id) %>% 
  summarize(min_roc_auc = min(.estimate),
            median_roc_auc = median(.estimate),
            max_roc_auc = max(.estimate))
## # A tibble: 5 × 4
##   id    min_roc_auc median_roc_auc max_roc_auc
##   <chr>       <dbl>          <dbl>       <dbl>
## 1 Fold1       0.741          0.741       0.856
## 2 Fold2       0.774          0.850       0.871
## 3 Fold3       0.773          0.773       0.901
## 4 Fold4       0.800          0.832       0.832
## 5 Fold5       0.806          0.811       0.862

Select the best model

  • the next step is to select the best performing combination of hyperparameters from the tuning results

  • select_best() selects the single best performing hyperparameter combination

  • show_best() displays the top n combinations - here, the 5 best performing hyperparameter combinations from the tuning results, based on the area under the ROC curve.

dt_tuning %>% 
  show_best(metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.000000480           6    27 roc_auc binary     0.864     5  0.0113
## 2    0.0000000387         10    15 roc_auc binary     0.833     5  0.0226
## 3    0.00885              11    28 roc_auc binary     0.811     5  0.0234
## 4    0.0291                5     6 roc_auc binary     0.779     5  0.0118
## 5    0.00124               2    26 roc_auc binary     0.779     5  0.0116
## # … with 1 more variable: .config <chr>
  • Select the best hyperparameter combination from your tuning results based on the area under the ROC curve.
best_dt_model <- dt_tuning %>% 
  # Choose the best model based on roc_auc
  select_best(metric = 'roc_auc')

best_dt_model
## # A tibble: 1 × 4
##   cost_complexity tree_depth min_n .config             
##             <dbl>      <int> <int> <chr>               
## 1     0.000000480          6    27 Preprocessor1_Model3
  • finalize the workflow with the best hyperparameter combination
final_loan_wkfl <- loans_tune_wkfl %>% 
  finalize_workflow(best_dt_model)

final_loan_wkfl
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_corr()
## • step_normalize()
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = 4.79504068364913e-07
##   tree_depth = 6
##   min_n = 27
## 
## Computational engine: rpart
  • train finalized decision tree workflow
loans_final_fit <- final_loan_wkfl %>% 
  last_fit(split = loans_split)
  • view performance metrics
loans_final_fit %>% 
  collect_metrics()
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.763 Preprocessor1_Model1
## 2 roc_auc  binary         0.796 Preprocessor1_Model1
  • use the trained workflow to create an ROC curve. the area under this curve is 0.796, matching the roc_auc value from collect_metrics() above.
loans_final_fit %>% 
  collect_predictions() %>% 
  roc_curve(truth = loan_default, .pred_yes) %>% 
  autoplot()
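
Finally, the trained workflow stored in the last_fit() results can be pulled out for predicting on new data - a sketch, assuming a tune version that provides extract_workflow() (older versions expose the same object through the .workflow column of the results):

# extract the trained workflow and generate class predictions
final_model <- loans_final_fit %>% 
  extract_workflow()

predict(final_model, new_data = loans_test)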