Ch. 1 - Introduction to hyperparameters

Parameters vs hyperparameters

[Video]

Model parameters vs. hyperparameters

# Fit a linear model on the breast_cancer_data.
linear_model <- lm(concavity_mean ~ symmetry_mean, data = breast_cancer_data)

# Look at the summary of the linear_model.
summary(linear_model)
## 
## Call:
## lm(formula = concavity_mean ~ symmetry_mean, data = breast_cancer_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.201877 -0.039201 -0.008432  0.030655  0.226150 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.15311    0.04086  -3.747 0.000303 ***
## symmetry_mean  1.33366    0.21257   6.274 9.57e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06412 on 98 degrees of freedom
## Multiple R-squared:  0.2866, Adjusted R-squared:  0.2793 
## F-statistic: 39.36 on 1 and 98 DF,  p-value: 9.575e-09
# Extract the coefficients.
coefficients(linear_model)
##   (Intercept) symmetry_mean 
##    -0.1531055     1.3336568

Hyperparameters in linear models

Which of the following is a hyperparameter in the linear model from your last exercise? The breast_cancer_data has again been loaded and the linear model has been built just as before:

# linear_model <- lm(concavity_mean ~ symmetry_mean,
#                    data = breast_cancer_data)

Note that hyperparameters are set before training and appear as arguments in a function's help page, while model parameters are estimated from the data and are part of the function's output.

help(lm)
  • [*] Weights
  • Coefficients
  • Residuals
  • Intercept
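
A minimal sketch of this distinction, using the built-in mtcars data instead of breast_cancer_data (an assumption, so the example runs without the course data). Settings you choose before fitting appear in the function's argument list, while coefficients exist only on the fitted object.

```r
# Settings chosen before fitting are arguments of lm().
lm_arguments <- names(formals(lm))
print(lm_arguments)

# Model parameters are estimated from the data during fitting.
fit <- lm(mpg ~ wt, data = mtcars)
fitted_parameters <- coef(fit)
print(fitted_parameters)

# 'weights' is an argument of lm(); the coefficient names are not.
"weights" %in% lm_arguments
names(fitted_parameters) %in% lm_arguments
```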

What are the coefficients?

library(ggplot2)

# Plot linear relationship.
ggplot(data = breast_cancer_data, 
        aes(x = symmetry_mean, y = concavity_mean)) +
  geom_point(color = "grey") +
  geom_abline(slope = linear_model$coefficients[2], 
      intercept = linear_model$coefficients[1])
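
The two coefficients are the intercept and slope of the plotted line. As a hedged illustration on the built-in mtcars data (an assumption, standing in for breast_cancer_data), the fitted values can be rebuilt by hand from the coefficients:

```r
fit <- lm(mpg ~ wt, data = mtcars)
intercept <- unname(coef(fit)[1])
slope     <- unname(coef(fit)[2])

# Rebuild the fitted values by hand: y_hat = intercept + slope * x.
manual_fit <- intercept + slope * mtcars$wt

# The manual values agree with what predict() returns.
all.equal(manual_fit, unname(predict(fit)))
```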

Recap of machine learning basics

[Video]

Machine Learning with caret - splitting data

# Load caret and set seed
library(caret)
## Loading required package: lattice
set.seed(42)

# Create caret partition index
index <- createDataPartition(breast_cancer_data$diagnosis, p = 0.70, list = FALSE)

# Subset 'breast_cancer_data' with index
bc_train_data <- breast_cancer_data[index, ]
bc_test_data  <- breast_cancer_data[-index, ]
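
A hedged sketch of what createDataPartition() does, using the built-in iris data (an assumption, substituting for breast_cancer_data). Rows selected by the index form the training set, all remaining rows form the test set, and the class proportions are approximately preserved in both.

```r
library(caret)
set.seed(42)

# Stratified sampling on the outcome: roughly 70% of each class is selected.
idx <- createDataPartition(iris$Species, p = 0.70, list = FALSE)

train_rows <- iris[idx, ]   # rows in the index
test_rows  <- iris[-idx, ]  # all remaining rows

# Class proportions in both subsets should be close to the original ones.
prop.table(table(train_rows$Species))
prop.table(table(test_rows$Species))
```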

Train a machine learning model with caret

Set up cross-validation:

library(caret)
library(tictoc)

# Repeated CV.
fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5)
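
In trainControl(), number is the number of folds and repeats is how often the whole k-fold procedure is repeated. A small sketch of the resulting workload per candidate hyperparameter setting (the variable names are illustrative):

```r
library(caret)

# 3-fold cross-validation, repeated 5 times.
ctrl <- trainControl(method = "repeatedcv", number = 3, repeats = 5)

# Each candidate hyperparameter combination is evaluated on this many resamples.
resamples_per_candidate <- ctrl$number * ctrl$repeats
resamples_per_candidate
```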

Train a Random Forest model:

tic()
set.seed(42)
rf_model <- train(diagnosis ~ ., data = bc_train_data, method = "rf", trControl = fitControl, verbose = FALSE)
toc()
## 0.855 sec elapsed
rf_model
## Random Forest 
## 
## 30 samples
## 10 predictors
##  2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 5 times) 
## Summary of sample sizes: 20, 20, 20, 20, 20, 20, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa
##    2    0.88      0.76 
##    6    0.88      0.76 
##   10    0.88      0.76 
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
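
The accuracy reported above is a resampling estimate. The following sketch shows one way to read the tuning result and then check the model on held-out rows. It assumes the built-in iris data and method = "knn" (chosen here only because it needs no extra modeling package), not the course's data or Random Forest model.

```r
library(caret)
set.seed(42)

idx <- createDataPartition(iris$Species, p = 0.70, list = FALSE)
train_rows <- iris[idx, ]
test_rows  <- iris[-idx, ]

ctrl <- trainControl(method = "repeatedcv", number = 3, repeats = 5)
knn_model <- train(Species ~ ., data = train_rows,
                   method = "knn", trControl = ctrl)

# Hyperparameter value chosen by resampling, and the full results table.
knn_model$bestTune
knn_model$results[, c("k", "Accuracy", "Kappa")]

# Accuracy on rows the model never saw during training or tuning.
test_pred <- predict(knn_model, newdata = test_rows)
test_accuracy <- mean(test_pred == test_rows$Species)
test_accuracy
```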

Machine learning with caret

# Create partition index
index <- createDataPartition(breast_cancer_data$diagnosis, p = 0.7, list = FALSE)

# Subset `breast_cancer_data` with index
bc_train_data <- breast_cancer_data[index, ]
bc_test_data  <- breast_cancer_data[-index, ]

# Define repeated cross-validation: 5 folds, repeated 3 times
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# # Run the train() function
# gbm_model <- caret::train(diagnosis ~ ., 
#                    data = bc_train_data, 
#                    method = "gbm", 
#                    trControl = fitControl,
#                    verbose = FALSE)
# 
# # Look at the model
# gbm_model

Resampling schemes

In the previous exercise, you defined a repeated cross-validation resampling scheme (5 folds, repeated 3 times) with the following code:

# fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

Which of the following is NOT a valid resampling method in caret? Note: the caret package has already been loaded for you.

help(trainControl)
  • boot
  • [*] adaboost
  • cv
  • LGOCV

Hyperparameter tuning in caret

[Video]

Hyperparameters are specific to model algorithms

# modelLookup(model)
# https://topepo.github.io/caret/available-models.html
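
As a short illustration, modelLookup() can be called with a method name to list the hyperparameters caret can tune for that algorithm. The two methods below are the ones used in these notes:

```r
library(caret)

# Tunable hyperparameters for a polynomial-kernel SVM.
svm_params <- modelLookup("svmPoly")
svm_params[, c("parameter", "label")]

# Tunable hyperparameters for a Random Forest.
rf_params <- modelLookup("rf")
rf_params[, c("parameter", "label")]
```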

Hyperparameters in Support Vector Machines (SVM)

# library(caret)
# library(tictoc)

# fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5)
# tic()
# set.seed(42)
# svm_model <- train(diagnosis ~ ., data = bc_train_data, method = "svmPoly", trControl = fitControl, verbose = FALSE)
# toc()

Defining hyperparameters for automatic tuning

tuneLength

# tic()
# set.seed(42)
# svm_model_2 <- train(diagnosis ~ ., data = bc_train_data, method = "svmPoly", trControl = fitControl, verbose = FALSE, tuneLength = 5)
# toc()
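
A hedged sketch of the effect of tuneLength, which controls how many candidate values caret generates per hyperparameter. It assumes the built-in iris data and method = "knn" (a single-hyperparameter model, so the effect is easy to see) rather than the SVM above.

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 3)

set.seed(42)
knn_default <- train(Species ~ ., data = iris,
                     method = "knn", trControl = ctrl)

set.seed(42)
knn_longer <- train(Species ~ ., data = iris,
                    method = "knn", trControl = ctrl, tuneLength = 5)

# One row per candidate value of k that was evaluated.
nrow(knn_default$results)
nrow(knn_longer$results)
knn_longer$results$k
```

For models with several hyperparameters, the number of combinations grows much faster than tuneLength itself, which is why training time rises quickly.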

Manual hyperparameter tuning in caret

tuneGrid + expand.grid

# library(caret)
# library(tictoc)

# hyperparams <- expand.grid(degree = 4, scale = 1, C = 1)

# tic()
# set.seed(42)
# svm_model_3 <- train(diagnosis ~ ., data = bc_train_data, method = "svmPoly", trControl = fitControl, tuneGrid = hyperparams, verbose = FALSE)
# toc()
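
expand.grid() builds the Cartesian product of the values it is given, and train() evaluates one model per row. The sketch below contrasts the single combination used above with a small grid. The candidate values in the second grid are illustrative assumptions, not recommendations.

```r
# A single combination, as in the exercise: one row, so nothing is compared.
single <- expand.grid(degree = 4, scale = 1, C = 1)
nrow(single)

# Several values per hyperparameter: every combination becomes one row.
grid <- expand.grid(degree = 1:3, scale = c(0.01, 0.1), C = c(0.5, 1))
nrow(grid)  # 3 * 2 * 2 combinations
head(grid)
```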

Hyperparameters in Stochastic Gradient Boosting

In the previous lesson, you built a Stochastic Gradient Boosting model in caret. A model similar to the one from before has been preloaded as gbm_model. To optimize this model, you want to tune its hyperparameters. Which of the following is NOT a hyperparameter of the gbm method?

Note: The library caret has also been preloaded.

  • n.trees
  • n.minobsinnode
  • [*] na.action
  • interaction.depth
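
One way to check the options programmatically is to compare them against what caret lists as tunable for the method. A minimal sketch (the candidates vector simply restates the four options above):

```r
library(caret)

candidates <- c("n.trees", "n.minobsinnode", "na.action", "interaction.depth")
gbm_params <- as.character(modelLookup("gbm")$parameter)
gbm_params

# Candidates that caret does not list as tunable for method = "gbm".
setdiff(candidates, gbm_params)
```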

Changing the number of hyperparameters to tune

# # Set seed.
# set.seed(42)
# # Start timer.
# tic()
# # Train model.
# gbm_model <- train(diagnosis ~ ., 
#                    data = bc_train_data, 
#                    method = "gbm", 
#                    trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
#                    verbose = FALSE,
#                    tuneLength = 4)
# # Stop timer.
# toc()

Tune hyperparameters manually

# # Define hyperparameter grid.
# hyperparams <- expand.grid(n.trees = 200,
#                            interaction.depth = 1,
#                            shrinkage = 0.1,
#                            n.minobsinnode = 10)
# 
# # Apply hyperparameter grid to train().
# set.seed(42)
# gbm_model <- train(diagnosis ~ .,
#                    data = bc_train_data,
#                    method = "gbm",
#                    trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
#                    verbose = FALSE,
#                    tuneGrid = hyperparams)

Ch. 2 - Hyperparameter tuning with caret

Hyperparameter tuning in caret

Finding hyperparameters

Cartesian grid search in caret

Plot hyperparameter model output

Grid search with range of hyperparameters

Random search with caret

Adaptive Resampling

Advantages of Adaptive Resampling

Adaptive Resampling with caret


Ch. 3 - Hyperparameter tuning with mlr

Machine learning with mlr

Machine Learning with mlr

Modeling with mlr

Grid and Random Search with mlr

Random search with mlr

Perform hyperparameter tuning with mlr

Evaluating hyperparameters with mlr

Why evaluate tuning?

Evaluating hyperparameter tuning results

Advanced tuning with mlr

Define advanced tuning controls

Define aggregated measures

Setting hyperparameters


Ch. 4 - Hyperparameter tuning with h2o

Machine learning with h2o

[Video]

What is H2O

library(h2o)
## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## The following objects are masked from 'package:base':
## 
##     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc
h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         18 minutes 56 seconds 
##     H2O cluster timezone:       America/Denver 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.28.0.2 
##     H2O cluster version age:    11 days  
##     H2O cluster name:           H2O_started_from_R_michael_clm836 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   2.00 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.6.2 (2019-12-12)

Preparing the data for modeling with H2O

Data as H2O frame

# seeds_data_hf <- as.h2o(seeds_data)

Define features and target variable

# y <- "seed_type"
# x <- setdiff(colnames(seeds_data_hf), y)

For classification, the target should be a factor

# seeds_data_hf[, y] <- as.factor(seeds_data_hf[, y])

Training, validation and test sets

# sframe <- h2o.splitFrame(data = seeds_data_hf,
#                          ratios = c(0.7, 0.15),
#                          seed = 42)
# train <- sframe[[1]]
# valid <- sframe[[2]]
# test <- sframe[[3]]

# summary(train$seed_type, exact_quantiles = TRUE)

# summary(test$seed_type, exact_quantiles = TRUE)
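
A hedged sketch of the three-way split, assuming a local H2O installation and using the built-in iris data in place of seeds_data. Passing two ratios to h2o.splitFrame() yields three frames, with the third holding the remainder.

```r
library(h2o)
h2o.init()

# iris stands in for seeds_data (an assumption for this sketch).
iris_hf <- as.h2o(iris)

# Two ratios produce three frames: 70% / 15% / the remaining ~15%.
parts <- h2o.splitFrame(data = iris_hf, ratios = c(0.7, 0.15), seed = 42)
train <- parts[[1]]
valid <- parts[[2]]
test  <- parts[[3]]

# The split is probabilistic, so the sizes are approximate, not exact.
c(train = nrow(train), valid = nrow(valid), test = nrow(test))
```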

Model training with H2O

  • Gradient Boosted models with h2o.gbm() & h2o.xgboost()
  • Generalized linear models with h2o.glm()
  • Random Forest models with h2o.randomForest()
  • Neural Networks with h2o.deeplearning()

Evaluate model performance with H2O

Model performance

# perf <- h2o.performance(gbm_model, test)
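
The steps in this section can be sketched end to end as follows. This assumes a local H2O installation and uses the built-in iris data instead of seeds_data, so the frame and column names differ from the course exercise.

```r
library(h2o)
h2o.init()

# iris stands in for seeds_data; Species is already a factor,
# so H2O treats this as a classification problem.
iris_hf <- as.h2o(iris)
y <- "Species"
x <- setdiff(colnames(iris_hf), y)

parts <- h2o.splitFrame(data = iris_hf, ratios = c(0.7, 0.15), seed = 42)
train <- parts[[1]]
valid <- parts[[2]]
test  <- parts[[3]]

# Train a gradient boosted model with default hyperparameters.
gbm_model <- h2o.gbm(x = x, y = y,
                     training_frame = train,
                     validation_frame = valid,
                     seed = 42)

# Evaluate on the held-out test frame.
perf <- h2o.performance(gbm_model, test)
h2o.confusionMatrix(perf)
h2o.logloss(perf)
```

Keeping the test frame out of both training and validation means the reported metrics are not influenced by any tuning decisions.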

Prepare data for modelling with h2o

# Initialise h2o cluster
h2o.init()
# Convert data to h2o frame
seeds_train_data_hf <- as.h2o(seeds_train_data)
# Identify target and features
y <- "seed_type"
x <- setdiff(colnames(seeds_train_data_hf), y)

# Split data into train & validation sets
sframe <- h2o.splitFrame(seeds_train_data_hf, seed = 42)
train <- sframe[[1]]
valid <- sframe[[2]]

# Calculate ratio of the target variable in the training set
summary(train$seed_type, exact_quantiles = TRUE)
##  seed_type  
##  Min.   :1  
##  1st Qu.:1  
##  Median :2  
##  Mean   :2  
##  3rd Qu.:3  
##  Max.   :3

Modeling with h2o

Grid and random search with h2o

Grid search with h2o

Random search with h2o

Stopping criteria

Automatic machine learning with h2o

AutoML in h2o

Scoring the leaderboard

Extract h2o models and evaluate performance

Wrap-up


About Michael Mallari

Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.

Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.
