Load Packages and Dataset
tidymodels - study objective
data - continue with the telecom data used in the last lesson, with training and test datasets prepared using the sampling functions below.
library(tidymodels)
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.4 ──
## ✓ broom 0.7.10 ✓ recipes 0.1.17
## ✓ dials 0.0.10 ✓ rsample 0.1.1
## ✓ dplyr 1.0.7 ✓ tibble 3.1.6
## ✓ ggplot2 3.3.5 ✓ tidyr 1.1.4
## ✓ infer 1.0.0 ✓ tune 0.1.6
## ✓ modeldata 0.1.1 ✓ workflows 0.2.4
## ✓ parsnip 0.1.7 ✓ workflowsets 0.1.0
## ✓ purrr 0.3.4 ✓ yardstick 0.0.9
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
telecom_df <- readRDS("~/Documents/R programming/Datacamp/DC_Tidymodels/Data/telecom_df.rds")

telecom_split <- initial_split(telecom_df, prop = 0.75, strata = canceled_service)

telecom_training <- telecom_split %>%
  training()

telecom_test <- telecom_split %>%
  testing()
Key Points
Exploring recipe objects
the first step in feature engineering is to specify a recipe object with the recipe() function
and add data preprocessing steps with one or more step_*() functions
storing all of this information in a single recipe object makes it easier to manage complex feature engineering pipelines and to transform new data sources - the latter is important because it lets us transform the test data in exactly the same way as the training data
create a recipe - the outcome is canceled_service, all other columns are predictors, and avg_call_mins is transformed to log base 10.
telecom_rec <- recipe(canceled_service ~ .,
                      data = telecom_df) %>%
  step_log(avg_call_mins, base = 10)

telecom_rec$var_info %>%
  group_by(type) %>%
  count()
## # A tibble: 2 × 2
## # Groups: type [2]
## type n
## <chr> <int>
## 1 nominal 4
## 2 numeric 5
summary(telecom_rec)
## # A tibble: 9 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 cellular_service nominal predictor original
## 2 avg_data_gb numeric predictor original
## 3 avg_call_mins numeric predictor original
## 4 avg_intl_mins numeric predictor original
## 5 internet_service nominal predictor original
## 6 contract nominal predictor original
## 7 months_with_company numeric predictor original
## 8 monthly_charges numeric predictor original
## 9 canceled_service nominal outcome original
here we can see there are four nominal variables (one of which is the outcome) and five numeric variables.
Creating recipe objects
on the training dataset
log transform two predictor variables - this reduces the range of these variables and can make their distributions more symmetric, which may improve the accuracy of the logistic regression model. from a statistical perspective, a log transform may also stabilize the variance in a way that makes inference more legitimate.
telecom_log_rec <- recipe(canceled_service ~ ., data = telecom_training) %>%
  step_log(avg_call_mins, avg_intl_mins, base = 10)

telecom_log_rec %>%
  summary()
## # A tibble: 9 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 cellular_service nominal predictor original
## 2 avg_data_gb numeric predictor original
## 3 avg_call_mins numeric predictor original
## 4 avg_intl_mins numeric predictor original
## 5 internet_service nominal predictor original
## 6 contract nominal predictor original
## 7 months_with_company numeric predictor original
## 8 monthly_charges numeric predictor original
## 9 canceled_service nominal outcome original
Training a recipe object - with steps
above, we created a recipe with instructions to apply a log transformation to two predictor variables.
the next step in the feature engineering process is to train the recipe object using the training data.
then we will be able to apply the trained recipe to both the training and test datasets in order to prepare them for model fitting and model evaluation.
- train the telecom_log_rec object using the telecom_training dataset
telecom_log_rec_prep <- telecom_log_rec %>%
  prep(training = telecom_training)

telecom_log_rec_prep
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 8
##
## Training data contained 731 data points and no missing data.
##
## Operations:
##
## Log transformation on avg_call_mins, avg_intl_mins [trained]
- apply the trained recipe to the training data with bake(); note that avg_call_mins and avg_intl_mins have been log transformed.
telecom_log_rec_prep %>%
  bake(new_data = NULL)
## # A tibble: 731 × 9
## cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
## <fct> <dbl> <dbl> <dbl> <fct>
## 1 multiple_lines 8.05 2.52 2.09 digital
## 2 single_line 9.3 2.51 2.06 fiber_optic
## 3 multiple_lines 9.4 2.49 2.17 fiber_optic
## 4 multiple_lines 9.96 2.53 2.13 fiber_optic
## 5 single_line 9.37 2.58 1.94 fiber_optic
## 6 multiple_lines 10.6 2.45 2.17 fiber_optic
## 7 multiple_lines 5.17 2.53 2.08 digital
## 8 multiple_lines 7.86 2.58 2.21 digital
## 9 multiple_lines 11.0 2.59 1.89 fiber_optic
## 10 multiple_lines 11.8 2.54 2.10 fiber_optic
## # … with 721 more rows, and 4 more variables: contract <fct>,
## # months_with_company <dbl>, monthly_charges <dbl>, canceled_service <fct>
- apply the trained recipe to the test data; again, avg_call_mins and avg_intl_mins have been log transformed.
telecom_log_rec_prep %>%
  bake(new_data = telecom_test)
## # A tibble: 244 × 9
## cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
## <fct> <dbl> <dbl> <dbl> <fct>
## 1 single_line 9.04 2.53 1.94 fiber_optic
## 2 single_line 10.3 2.42 1.74 fiber_optic
## 3 multiple_lines 8.01 2.72 1.99 fiber_optic
## 4 multiple_lines 10.2 2.60 2.06 fiber_optic
## 5 single_line 6.69 2.55 1.96 digital
## 6 multiple_lines 4.11 2.57 1.81 digital
## 7 single_line 8.67 1.97 2.12 fiber_optic
## 8 multiple_lines 9.24 2.59 2.15 fiber_optic
## 9 single_line 8.02 2.38 1.98 fiber_optic
## 10 multiple_lines 6.79 2.71 2.15 digital
## # … with 234 more rows, and 4 more variables: contract <fct>,
## # months_with_company <dbl>, monthly_charges <dbl>, canceled_service <fct>
Steps transforming numeric variables
Discovering correlated predictors and removing them
specify a recipe object (named with cor to signal that it removes highly correlated predictors, here defined as a correlation greater than 0.8)
telecom_cor_rec <- recipe(canceled_service ~ ., data = telecom_training) %>%
  step_corr(all_numeric(), threshold = 0.8)
- train the recipe on the training dataset using the prep() function
telecom_cor_rec_prep <- telecom_cor_rec %>%
  prep(training = telecom_training)
- use the trained recipe to obtain the transformed training dataset (i.e. with highly correlated predictors removed); here we use the bake() function
telecom_cor_rec_prep %>%
  bake(new_data = NULL)
## # A tibble: 731 × 8
## cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
## <fct> <dbl> <dbl> <dbl> <fct>
## 1 multiple_lines 8.05 328 122 digital
## 2 single_line 9.3 326 114 fiber_optic
## 3 multiple_lines 9.4 312 147 fiber_optic
## 4 multiple_lines 9.96 340 136 fiber_optic
## 5 single_line 9.37 382 87 fiber_optic
## 6 multiple_lines 10.6 281 147 fiber_optic
## 7 multiple_lines 5.17 341 119 digital
## 8 multiple_lines 7.86 378 164 digital
## 9 multiple_lines 11.0 390 78 fiber_optic
## 10 multiple_lines 11.8 343 126 fiber_optic
## # … with 721 more rows, and 3 more variables: contract <fct>,
## # months_with_company <dbl>, canceled_service <fct>
- similarly, apply the trained recipe to the test data

telecom_cor_rec_prep %>%
  bake(new_data = telecom_test)
## # A tibble: 244 × 8
## cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
## <fct> <dbl> <dbl> <dbl> <fct>
## 1 single_line 9.04 336 88 fiber_optic
## 2 single_line 10.3 262 55 fiber_optic
## 3 multiple_lines 8.01 525 97 fiber_optic
## 4 multiple_lines 10.2 402 116 fiber_optic
## 5 single_line 6.69 352 91 digital
## 6 multiple_lines 4.11 371 64 digital
## 7 single_line 8.67 93 131 fiber_optic
## 8 multiple_lines 9.24 392 142 fiber_optic
## 9 single_line 8.02 240 95 fiber_optic
## 10 multiple_lines 6.79 514 141 digital
## # … with 234 more rows, and 3 more variables: contract <fct>,
## # months_with_company <dbl>, canceled_service <fct>
More on feature engineering
a key feature of the recipes package is that we can include multiple preprocessing steps in a single recipe object.
these steps will be carried out in the order in which they are entered with the step_*() functions
below we include two steps to preprocess the data - remove highly correlated predictors and normalize all numeric predictors
specify a recipe object
telecom_norm_rec <- recipe(canceled_service ~ .,
                           data = telecom_training) %>%
  # Remove correlated variables
  step_corr(all_numeric(), threshold = 0.8) %>%
  # Normalize numeric predictors
  step_normalize(all_numeric())
- train the recipe
telecom_norm_rec_prep <- telecom_norm_rec %>%
  prep(training = telecom_training)
- apply to test data
telecom_norm_rec_prep %>%
  bake(new_data = telecom_test)
## # A tibble: 244 × 8
## cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
## <fct> <dbl> <dbl> <dbl> <fct>
## 1 single_line 0.417 -0.203 -0.648 fiber_optic
## 2 single_line 1.09 -1.20 -1.71 fiber_optic
## 3 multiple_lines -0.126 2.34 -0.360 fiber_optic
## 4 multiple_lines 1.02 0.685 0.248 fiber_optic
## 5 single_line -0.822 0.0127 -0.552 digital
## 6 multiple_lines -2.18 0.268 -1.42 digital
## 7 single_line 0.222 -3.47 0.729 fiber_optic
## 8 multiple_lines 0.522 0.551 1.08 fiber_optic
## 9 single_line -0.121 -1.49 -0.424 fiber_optic
## 10 multiple_lines -0.769 2.19 1.05 digital
## # … with 234 more rows, and 3 more variables: contract <fct>,
## # months_with_company <dbl>, canceled_service <fct>
Nominal Predictors
Key points
process nominal predictors
nominal data must be transformed to numeric data for modeling
one-hot encoding - maps each categorical value to a sequence of [0/1] indicator variables, one per level.
More on encoding qualitative data in a numeric format
reference from book Tidy Modeling with R
one of the most common feature engineering tasks is to transform nominal or qualitative data (factors or character strings) so that it can be encoded or represented numerically.
sometimes we can alter the factor levels of a qualitative column in helpful ways prior to such a transformation.
step_unknown() - changes missing values to a dedicated factor level
step_novel() - if we anticipate that a new factor level may be encountered in future data, this allots a new level for that purpose.
step_other() - analyzes the frequencies of the factor levels in the training set and converts infrequently occurring values to a catch-all level of "other", with a threshold that can be specified; e.g. levels in the bottom 1% of the data can be lumped into a new level called "other" during modeling. a sketch of these three steps follows.
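- none of these three steps appear in the lesson code, so here is a minimal sketch of how they might be chained; the column choice (contract) and the 1% threshold are illustrative assumptions, not from the lesson:

telecom_nominal_rec <- recipe(canceled_service ~ ., data = telecom_training) %>%
  # replace missing values in contract with a dedicated "unknown" level
  step_unknown(contract, new_level = "unknown") %>%
  # reserve a "new" level for factor values first seen in future data
  step_novel(contract, new_level = "new") %>%
  # lump levels below 1% frequency into a catch-all "other" level
  step_other(contract, threshold = 0.01, other = "other")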
there are a few strategies for converting a factor predictor to a numeric format. the most common method is to create "dummy" or indicator variables, i.e. "dummy variable encoding".
the nominal variable column is replaced with numeric columns whose values are either zero or one
these binary variables represent specific factor level values
in R, the convention is to exclude the column for the first factor level - if you know the values of the remaining columns, you can determine the excluded one, since the categories are mutually exclusive.
note that step_dummy() eliminates the original data column and replaces it with one or more columns with different names
Continue with Key points
therefore, for dummy variable encoding -
one value is excluded from the original set of data values.
n distinct values produce (n - 1) indicator variables.
this is the preferred method for modeling - and the default in the recipes package
for example, a three-level variable such as contract is mapped to a sequence of two indicator variables, as sketched below.
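- a minimal sketch contrasting the default (n - 1) dummy encoding with full one-hot encoding via step_dummy(one_hot = TRUE); the toy tibble and its levels are illustrative assumptions:

toy <- tibble::tibble(contract = factor(c("month_to_month", "one_year", "two_year")))

# default dummy encoding: 3 levels -> 2 indicator columns
# (month_to_month is the excluded reference level)
recipe(~ contract, data = toy) %>%
  step_dummy(contract) %>%
  prep() %>%
  bake(new_data = NULL)

# one-hot encoding: 3 levels -> 3 indicator columns
recipe(~ contract, data = toy) %>%
  step_dummy(contract, one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL)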
Steps for dummy variables encoding
step_dummy() creates dummy variables from nominal predictor variables
specify a recipe with the model formula and data
pass it to the step_dummy() function and select the character/factor columns for processing
pass the result to the prep() function, where the recipe is trained on the training data
finally pass it into the bake() function, where we apply the transformation to the test data
we can select columns by type, e.g. all_nominal(), and use -all_outcomes() to exclude the outcome variable
note that it is possible to fit models without dummy/one-hot encoding, but this is not consistent across all engines, so it is recommended to use a recipe to standardize the process.
Follow-up practice
order of step_*() functions - step functions within a recipe are carried out in sequential order.
it is important to keep this in mind to avoid unexpected results in the feature engineering pipeline
in general it is best to normalize variables before creating dummy variables
showcase - two orderings of the same steps
- first normalize, then create dummy variables - note the resulting dummy columns contain 0s and 1s
telecom_recipe_1 <-
  recipe(canceled_service ~ avg_data_gb + contract, data = telecom_training) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes())
# train and apply telecom_recipe_1 on the test data
telecom_recipe_1 %>%
  prep(training = telecom_training) %>%
  bake(new_data = telecom_test)
## # A tibble: 244 × 4
## avg_data_gb canceled_service contract_one_year contract_two_year
## <dbl> <fct> <dbl> <dbl>
## 1 0.417 yes 0 0
## 2 1.09 no 1 0
## 3 -0.126 yes 0 0
## 4 1.02 no 0 0
## 5 -0.822 no 0 0
## 6 -2.18 no 0 1
## 7 0.222 no 0 1
## 8 0.522 no 0 0
## 9 -0.121 no 0 0
## 10 -0.769 yes 1 0
## # … with 234 more rows
- first create dummy variables, then normalize. note that because normalization happens after the dummy variables are created, the dummy columns end up as fractional values instead of the 0s and 1s shown above.
telecom_recipe_2 <-
  recipe(canceled_service ~ avg_data_gb + contract, data = telecom_training) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes())
# Train and apply telecom_recipe_2 on the test data
telecom_recipe_2 %>%
  prep(training = telecom_training) %>%
  bake(new_data = telecom_test)
## # A tibble: 244 × 4
## avg_data_gb canceled_service contract_one_year contract_two_year
## <dbl> <fct> <dbl> <dbl>
## 1 0.417 yes -0.510 -0.469
## 2 1.09 no 1.96 -0.469
## 3 -0.126 yes -0.510 -0.469
## 4 1.02 no -0.510 -0.469
## 5 -0.822 no -0.510 -0.469
## 6 -2.18 no -0.510 2.13
## 7 0.222 no -0.510 2.13
## 8 0.522 no -0.510 -0.469
## 9 -0.121 no -0.510 -0.469
## 10 -0.769 yes 1.96 -0.469
## # … with 234 more rows
also note that since we only specified two predictor variables (avg_data_gb and contract) in the model formula, the remaining columns are ignored by the recipe object when transforming new data sources. this is analogous to specifying predictor variables in a regression model - variables left out of the formula are not used in the subsequent training and modeling steps.
Complete feature engineering pipeline
Write a code chunk to complete the following tasks
create a recipe that predicts canceled_service using all predictor variables in the training data
remove correlated predictor variables using a 0.8 threshold
normalize all numeric predictors
create dummy variables for all nominal predictors
train the recipe on the training data and apply it to the test data
telecom_recipe <- recipe(canceled_service ~ ., data = telecom_training) %>%
  step_corr(all_numeric(), threshold = 0.8) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes())

telecom_recipe %>%
  prep(training = telecom_training) %>%
  bake(new_data = telecom_test)
## # A tibble: 244 × 9
## avg_data_gb avg_call_mins avg_intl_mins months_with_company canceled_service
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 0.417 -0.203 -0.648 -0.981 yes
## 2 1.09 -1.20 -1.71 0.643 no
## 3 -0.126 2.34 -0.360 -0.615 yes
## 4 1.02 0.685 0.248 -0.697 no
## 5 -0.822 0.0127 -0.552 -1.14 no
## 6 -2.18 0.268 -1.42 1.54 no
## 7 0.222 -3.47 0.729 0.847 no
## 8 0.522 0.551 1.08 1.42 no
## 9 -0.121 -1.49 -0.424 -1.02 no
## 10 -0.769 2.19 1.05 -0.900 yes
## # … with 234 more rows, and 4 more variables:
## # cellular_service_single_line <dbl>, internet_service_digital <dbl>,
## # contract_one_year <dbl>, contract_two_year <dbl>
Complete Modeling Flow
Now that we have covered feature engineering (for both numeric and nominal variables) and the recipe, prep, and bake process, it is time to incorporate feature engineering into the modeling flow itself.
prior to that, let's re-create the logistic model with parsnip (note that the engine is glm and the mode is classification)
logistic_model <- logistic_reg() %>%
  set_engine('glm') %>%
  set_mode('classification')
- step 1 - create a recipe and include necessary steps (correlation, log transform, normalize, and create dummy variables)
telecom_recipe <- recipe(canceled_service ~ ., data = telecom_training) %>%
  step_corr(all_numeric(), threshold = 0.8) %>%
  step_log(all_numeric(), base = 10) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes())
- step 2 - train the telecom_recipe object using the training data, and use the trained recipe to transform the training data
telecom_recipe_prep <- telecom_recipe %>%
  prep(training = telecom_training)

telecom_training_prep <- telecom_recipe_prep %>%
  bake(new_data = NULL)
- step 3 - apply trained recipe object to the test dataset and view the results
telecom_test_prep <- telecom_recipe_prep %>%
  bake(new_data = telecom_test)
telecom_test_prep
## # A tibble: 244 × 9
## avg_data_gb avg_call_mins avg_intl_mins months_with_company canceled_service
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 0.482 -0.0802 -0.495 -0.534 yes
## 2 1.01 -1.14 -1.95 0.697 no
## 3 0.000391 1.83 -0.194 -0.0430 yes
## 4 0.958 0.687 0.360 -0.128 no
## 5 -0.716 0.119 -0.392 -0.925 no
## 6 -2.65 0.343 -1.48 0.976 no
## 7 0.315 -5.57 0.736 0.770 no
## 8 0.569 0.579 0.986 0.944 no
## 9 0.00535 -1.52 -0.258 -0.615 no
## 10 -0.657 1.74 0.964 -0.395 yes
## # … with 234 more rows, and 4 more variables:
## # cellular_service_single_line <dbl>, internet_service_digital <dbl>,
## # contract_one_year <dbl>, contract_two_year <dbl>
- step 4 - train logistic model on processed training data
logistic_fit <- logistic_model %>%
  fit(canceled_service ~ ., data = telecom_training_prep)
- step 5 - obtain class predictions and estimated probabilities on the test dataset

class_preds <- predict(logistic_fit, new_data = telecom_test_prep, type = 'class')
prob_preds <- predict(logistic_fit, new_data = telecom_test_prep, type = 'prob')
- step 6 - combine test set results with actual results
telecom_results <- telecom_test_prep %>%
  select(canceled_service) %>%
  cbind(class_preds, prob_preds)

head(telecom_results, 5)
## canceled_service .pred_class .pred_yes .pred_no
## 1 yes yes 0.5623575 0.4376425
## 2 no no 0.2229379 0.7770621
## 3 yes yes 0.8250824 0.1749176
## 4 no yes 0.6356467 0.3643533
## 5 no no 0.3911968 0.6088032
by now we have created a tibble of model results on the test dataset, containing the actual outcome values, predicted outcome values, and estimated probabilities of the positive and negative classes. the next step is to evaluate these results.
- step 7 - evaluate the results and plot ROC curve.
# Create a confusion matrix
telecom_results %>%
  conf_mat(truth = canceled_service, estimate = .pred_class)
## Truth
## Prediction yes no
## yes 52 14
## no 30 148
# Calculate sensitivity
telecom_results %>%
  sens(truth = canceled_service, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 sens binary 0.634
# Calculate specificity
telecom_results %>%
  spec(truth = canceled_service, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 spec binary 0.914
# Plot ROC curve - note that roc_curve() takes the truth column plus the
# estimated probability of the positive class
telecom_results %>%
  roc_curve(truth = canceled_service, .pred_yes) %>%
  autoplot()
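- the plotted curve itself is not reproduced in these notes; as a supplementary check (not part of the lesson code), the numeric area under the curve can be computed with yardstick's roc_auc():

telecom_results %>%
  roc_auc(truth = canceled_service, .pred_yes)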
Machine learning workflows and hyperparameter tuning
Decision tree models as an example
the workflows package is designed to streamline the modeling process - it combines a parsnip model (which defines and fits the model) and a recipe object into a single workflow object
a workflow is an object as well - using a workflow encourages good methodology, since it is a single point of entry to the estimation components of a data analysis, and it enables the user to better organize projects.
it is important to focus on the broader modeling process, instead of only fitting the specific model used to estimate parameters.
the broader process includes any preprocessing steps, the model fit itself, and potential post-processing activities - hence the name workflows. a minimal sketch of the pattern follows.
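- the sketch below reuses the telecom objects defined earlier purely as an illustration of the pattern (it is not part of the lesson code):

telecom_wkfl <- workflow() %>%
  add_model(logistic_model) %>%
  add_recipe(telecom_recipe)

# fitting the workflow preps the recipe and fits the model in one step
telecom_wkfl_fit <- telecom_wkfl %>%
  fit(data = telecom_training)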
Practice - Creating Workflow
load the data - loans_df contains financial information on consumer loans at a bank. The outcome variable in this data is loan_default.
Objective - create a decision tree model object and specify a feature engineering pipeline for the loan data
loans_df <- readRDS("~/Documents/R programming/Datacamp/DC_Tidymodels/Data/loan_df.rds")
- step 1 - exploring the dataset (create training and testing datasets; check for correlated predictors)
loans_split <- initial_split(loans_df, strata = loan_default)
loans_training <- loans_split %>% training()
loans_test <- loans_split %>% testing()
# Check for correlated predictors
loans_training %>%
  # Select numeric columns
  select_if(is.numeric) %>%
  # Calculate correlation matrix
  cor()
## loan_amount interest_rate installment annual_income
## loan_amount 1.000000000 0.007833244 0.93928366 0.3324928
## interest_rate 0.007833244 1.000000000 0.04018282 -0.0927163
## installment 0.939283660 0.040182816 1.00000000 0.2847142
## annual_income 0.332492764 -0.092716299 0.28471419 1.0000000
## debt_to_income 0.144739018 0.171768123 0.19004895 -0.1864699
## debt_to_income
## loan_amount 0.1447390
## interest_rate 0.1717681
## installment 0.1900490
## annual_income -0.1864699
## debt_to_income 1.0000000
- step 2 - specifying a decision tree model and recipe
dt_model <- decision_tree() %>%
  # Specify the engine
  set_engine('rpart') %>%
  # Specify the mode
  set_mode('classification')
# Build feature engineering pipeline
loans_recipe <- recipe(loan_default ~ .,
                       data = loans_training) %>%
  # Correlation filter
  step_corr(all_numeric(), threshold = 0.85) %>%
  # Normalize numeric predictors
  step_normalize(all_numeric()) %>%
  # Create dummy variables
  step_dummy(all_nominal_predictors())
loans_recipe
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 7
##
## Operations:
##
## Correlation filter on all_numeric()
## Centering and scaling for all_numeric()
## Dummy variables from all_nominal_predictors()
- step 3 - creating workflows. Here we combine the decision tree model and the feature engineering recipe into a single workflow object and perform model fitting and evaluation. The output below shows an area under the ROC curve of 0.782.
# create a workflow
loans_dt_wkfl <- workflow() %>%
  add_model(dt_model) %>%
  add_recipe(loans_recipe)

# train the workflow
loans_dt_wkfl_fit <- loans_dt_wkfl %>%
  last_fit(split = loans_split)

# calculate performance metrics on test data
loans_dt_wkfl_fit %>%
  collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.763 Preprocessor1_Model1
## 2 roc_auc binary 0.782 Preprocessor1_Model1
Estimating performance with cross validation
Cross-validation is a well-established resampling method - the most common variant is V-fold cross-validation.
The data are randomly partitioned into V sets of roughly equal size (called the "folds")
Resampling methods are empirical simulation systems that emulate the process of using some data for modeling and different data for evaluation.
Most resampling methods are iterative, meaning that this process is repeated multiple times.
Note that resampling is only conducted on the training set. For each iteration of resampling, the data are partitioned into two subsamples.
the model is fit with the analysis set
the model is evaluated with the assessment set.
in tidymodels, we use vfold_cv() for cross validation - it uses the training data to provide multiple estimates of model performance.
Measuring performance with cross validation on Decision Tree
Objective - perform cross validation with decision tree model to explore its performance
create a cross validation object with 5 folds using the training data, making sure to stratify by the outcome variable. use set.seed() to make the results reproducible.
set.seed(290)
loans_folds <- vfold_cv(loans_training, v = 5, strata = loan_default)

loans_folds
## # 5-fold cross-validation using stratification
## # A tibble: 5 × 2
## splits id
## <list> <chr>
## 1 <split [521/132]> Fold1
## 2 <split [522/131]> Fold2
## 3 <split [523/130]> Fold3
## 4 <split [523/130]> Fold4
## 5 <split [523/130]> Fold5
- create a custom metric function that includes the area under the ROC curve, sensitivity, and specificity
loans_metrics <- metric_set(roc_auc, sens, spec)
- use the decision tree workflow to perform cross validation with the folds object and the custom metric function
loans_dt_rs <- loans_dt_wkfl %>%
  fit_resamples(resamples = loans_folds, metrics = loans_metrics)
- view performance metrics
loans_dt_rs %>% collect_metrics()
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.795 5 0.0183 Preprocessor1_Model1
## 2 sens binary 0.643 5 0.0239 Preprocessor1_Model1
## 3 spec binary 0.913 5 0.00685 Preprocessor1_Model1
Measuring performance with cross validation on Logistic Regression
Objective - perform cross validation on the loans_training data using logistic regression and compare the results with those of the decision tree model
create a logistic regression model
logistic_model <- logistic_reg() %>%
  set_engine('glm') %>%
  set_mode('classification')
- create workflow
loans_logistic_wkfl <- workflow() %>%
  add_model(logistic_model) %>%
  add_recipe(loans_recipe)
- fit to v-folds
loans_logistic_rs <- loans_logistic_wkfl %>%
  fit_resamples(resamples = loans_folds, metrics = loans_metrics)
- view performance metrics
loans_logistic_rs %>% collect_metrics()
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.860 5 0.0154 Preprocessor1_Model1
## 2 sens binary 0.675 5 0.0321 Preprocessor1_Model1
## 3 spec binary 0.860 5 0.0107 Preprocessor1_Model1
Comparing model performance profiles
- note that the collect_metrics() function returns a tibble of cross validation results - this makes it easy to calculate summary statistics with the dplyr package
library(dplyr)
- collect detailed cross validation results for the decision tree model and calculate summary statistics of the metric values by metric type
dt_rs_results <- loans_dt_rs %>%
  collect_metrics(summarize = FALSE)

dt_rs_results %>%
  group_by(.metric) %>%
  summarize(min = min(.estimate),
            median = median(.estimate),
            max = max(.estimate))
## # A tibble: 3 × 4
## .metric min median max
## <chr> <dbl> <dbl> <dbl>
## 1 roc_auc 0.741 0.803 0.850
## 2 sens 0.569 0.64 0.706
## 3 spec 0.888 0.914 0.925
- collect detailed cross validation results for the logistic model and calculate summary statistics of the metric values by metric type
logistic_rs_results <- loans_logistic_rs %>%
  collect_metrics(summarize = FALSE)

logistic_rs_results %>%
  group_by(.metric) %>%
  summarize(min = min(.estimate),
            median = median(.estimate),
            max = max(.estimate))
## # A tibble: 3 × 4
## .metric min median max
## <chr> <dbl> <dbl> <dbl>
## 1 roc_auc 0.822 0.857 0.913
## 2 sens 0.569 0.7 0.76
## 3 spec 0.838 0.852 0.9
Hyperparameter tuning
Hyperparameter (also called tuning parameter) tuning is another method of optimizing model performance
Definition 1 - model parameters whose values are set prior to model training and control model complexity (e.g. the number of nearest neighbors);
Definition 2 - an unknown structural or other kind of value that has a significant impact on the model but cannot be directly estimated from the data
the default parameter settings might not be the best values for all datasets
to signal to tidymodels functions which arguments should be optimized, parameters are marked for tuning by assigning them a value of tune() - other functions then know that these marked parameters need to be optimized
Grid search - most common method for tuning hyperparameters
generate a grid of unique combinations of hyperparameter values
for each combination, use cross validation to estimate model performance, and then choose the best performing combination
the parameters() function identifies the hyperparameters that have been marked for tuning
random grid - generates random combinations - tends to provide a greater chance of finding optimal hyperparameters
Practice
Setting model hyperparameters
first step - create a parsnip decision tree and mark all three of its hyperparameters for tuning.
here we use the rpart engine
per decision_tree()'s documentation, there are three arguments (besides mode and engine): cost_complexity, tree_depth, and min_n
cost_complexity: penalizes large number of terminal nodes
tree_depth: longest path from root to terminal node
min_n: minimum data points required in a node for further splitting
dt_tune_model <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart") %>%
  set_mode("classification")
- second step - create a tuning workflow by updating the loans_dt_wkfl object with the new decision tree model. When we print the new workflow object, the decision tree hyperparameters now appear under the main arguments section
loans_tune_wkfl <- loans_dt_wkfl %>%
  update_model(dt_tune_model)
loans_tune_wkfl
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## • step_corr()
## • step_normalize()
## • step_dummy()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = tune()
## tree_depth = tune()
## min_n = tune()
##
## Computational engine: rpart
Random grid search
grid search is the most common method of hyperparameter tuning.
the goal is to find the combination of hyperparameter values that maximizes model performance
a random grid is a non-regular grid - the grid_random() function generates independent uniform random values across the parameter ranges (a regular-grid contrast is sketched after the output below)
first step - create a random grid of 5 hyperparameter value combinations using the hyperparameters of dt_tune_model. use set.seed() to ensure reproducibility.
set.seed(123)
dt_grid <- grid_random(parameters(dt_tune_model), size = 5)

dt_grid
## # A tibble: 5 × 3
## cost_complexity tree_depth min_n
## <dbl> <int> <int>
## 1 0.0000000387 10 15
## 2 0.00124 2 26
## 3 0.000000480 6 27
## 4 0.00885 11 28
## 5 0.0291 5 6
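- for contrast (not part of the lesson code), dials can also build a regular grid that crosses a fixed number of levels per hyperparameter; a minimal sketch using the same parameters() API as above:

# a regular grid: every combination of `levels` values per hyperparameter
grid_regular(parameters(dt_tune_model), levels = 2)
# 2 levels for each of the 3 hyperparameters -> 8 candidate combinations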
- second step - use the loans_tune_wkfl object to perform hyperparameter tuning over the tuning grid set up above, using the cross validation folds and custom metrics set up in previous steps. Note that tune_grid() only provides the substrate for choosing appropriate tuning parameters; the function itself does not fit a final model.
dt_tuning <- loans_tune_wkfl %>%
  tune_grid(
    resamples = loans_folds,
    grid = dt_grid,
    metrics = loans_metrics
  )
- view the results - 15 rows, because there are five random hyperparameter combinations and three performance metrics. Each row shows the average of the 5 cross validation estimates for that metric and hyperparameter combination.
dt_tuning %>%
  collect_metrics()
## # A tibble: 15 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000387 10 15 roc_auc binary 0.833 5 0.0226
## 2 0.0000000387 10 15 sens binary 0.662 5 0.0356
## 3 0.0000000387 10 15 spec binary 0.858 5 0.0144
## 4 0.00124 2 26 roc_auc binary 0.779 5 0.0116
## 5 0.00124 2 26 sens binary 0.631 5 0.0214
## 6 0.00124 2 26 spec binary 0.915 5 0.00465
## 7 0.000000480 6 27 roc_auc binary 0.864 5 0.0113
## 8 0.000000480 6 27 sens binary 0.678 5 0.00865
## 9 0.000000480 6 27 spec binary 0.878 5 0.0199
## 10 0.00885 11 28 roc_auc binary 0.811 5 0.0234
## 11 0.00885 11 28 sens binary 0.631 5 0.0239
## 12 0.00885 11 28 spec binary 0.898 5 0.0134
## 13 0.0291 5 6 roc_auc binary 0.779 5 0.0118
## 14 0.0291 5 6 sens binary 0.643 5 0.0231
## 15 0.0291 5 6 spec binary 0.915 5 0.00465
## # … with 1 more variable: .config <chr>
Explore tuning results
- extract detailed tuning results from dt_tuning object
dt_tuning_results <- dt_tuning %>%
  collect_metrics(summarize = FALSE)

dt_tuning_results
## # A tibble: 75 × 8
## id cost_complexity tree_depth min_n .metric .estimator .estimate .config
## <chr> <dbl> <int> <int> <chr> <chr> <dbl> <chr>
## 1 Fold1 0.0000000387 10 15 sens binary 0.627 Preproce…
## 2 Fold1 0.0000000387 10 15 spec binary 0.840 Preproce…
## 3 Fold1 0.0000000387 10 15 roc_auc binary 0.762 Preproce…
## 4 Fold2 0.0000000387 10 15 sens binary 0.784 Preproce…
## 5 Fold2 0.0000000387 10 15 spec binary 0.812 Preproce…
## 6 Fold2 0.0000000387 10 15 roc_auc binary 0.871 Preproce…
## 7 Fold3 0.0000000387 10 15 sens binary 0.7 Preproce…
## 8 Fold3 0.0000000387 10 15 spec binary 0.888 Preproce…
## 9 Fold3 0.0000000387 10 15 roc_auc binary 0.890 Preproce…
## 10 Fold4 0.0000000387 10 15 sens binary 0.6 Preproce…
## # … with 65 more rows
- compute the min, median, and max area under the ROC curve for each fold from the detailed tuning results; here we use dplyr to extract the results
dt_tuning_results %>%
  filter(.metric == "roc_auc") %>%
  group_by(id) %>%
  summarize(min_roc_auc = min(.estimate),
            median_roc_auc = median(.estimate),
            max_roc_auc = max(.estimate))
## # A tibble: 5 × 4
## id min_roc_auc median_roc_auc max_roc_auc
## <chr> <dbl> <dbl> <dbl>
## 1 Fold1 0.741 0.741 0.856
## 2 Fold2 0.774 0.850 0.871
## 3 Fold3 0.773 0.773 0.901
## 4 Fold4 0.800 0.832 0.832
## 5 Fold5 0.806 0.811 0.862
Select the best model
the next step is to select the best performing combination of hyperparameters from the previous step
select_best() selects the best performing hyperparameters
display the 5 best performing hyperparameter combinations from the tuning results, based on the area under the ROC curve.
dt_tuning %>%
  show_best(metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.000000480 6 27 roc_auc binary 0.864 5 0.0113
## 2 0.0000000387 10 15 roc_auc binary 0.833 5 0.0226
## 3 0.00885 11 28 roc_auc binary 0.811 5 0.0234
## 4 0.0291 5 6 roc_auc binary 0.779 5 0.0118
## 5 0.00124 2 26 roc_auc binary 0.779 5 0.0116
## # … with 1 more variable: .config <chr>
- Select the best hyperparameter combination from your tuning results based on the area under the ROC curve.
best_dt_model <- dt_tuning %>%
  # Choose the best model based on roc_auc
  select_best(metric = 'roc_auc')
best_dt_model
## # A tibble: 1 × 4
## cost_complexity tree_depth min_n .config
## <dbl> <int> <int> <chr>
## 1 0.000000480 6 27 Preprocessor1_Model3
- finalize the workflow with the best hyperparameter combination
final_loan_wkfl <- loans_tune_wkfl %>%
  finalize_workflow(best_dt_model)
final_loan_wkfl
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## • step_corr()
## • step_normalize()
## • step_dummy()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = 4.79504068364913e-07
## tree_depth = 6
## min_n = 27
##
## Computational engine: rpart
- train finalized decision tree workflow
loans_final_fit <- final_loan_wkfl %>%
  last_fit(split = loans_split)
- view performance metrics
loans_final_fit %>%
  collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.763 Preprocessor1_Model1
## 2 roc_auc binary 0.796 Preprocessor1_Model1
- use the trained workflow to create an ROC curve, visualizing the test-set area under the ROC curve of 0.796 reported above.
loans_final_fit %>%
  collect_predictions() %>%
  roc_curve(truth = loan_default, .pred_yes) %>%
  autoplot()