library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.3.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ recipes 1.0.10
## ✔ dials 1.2.1 ✔ rsample 1.2.0
## ✔ dplyr 1.1.4 ✔ tibble 3.2.1
## ✔ ggplot2 3.5.0 ✔ tidyr 1.3.1
## ✔ infer 1.0.6 ✔ tune 1.1.2
## ✔ modeldata 1.3.0 ✔ workflows 1.1.4
## ✔ parsnip 1.2.0 ✔ workflowsets 1.0.1
## ✔ purrr 1.0.2 ✔ yardstick 1.3.0
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'rsample' was built under R version 4.3.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ lubridate 1.9.3 ✔ stringr 1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(vip)
## Warning: package 'vip' was built under R version 4.3.3
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
library(recipes)
Part 1: Tuning Our Regularized Regression Model
For this part of the lab, we'll use the Boston housing data set.
boston <- read_csv("boston.csv")
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
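Before tuning anything, a quick peek at the data confirms what we just read in; cmedv (the median home value) is the response we will model. A minimal check:
# 506 rows, 16 numeric columns, with cmedv as the response
glimpse(boston)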
Recall that in last week’s lab we manually adjusted the
hyperparameters to find a model that performs best. This time, let’s use
the hyperparameter tuning strategies we learned about to find the
optimal hyperparameter values. Using the code chunk below, fill in the
blanks to:
1. Split the data into a training set and test set using a 70-30%
split. Be sure to include set.seed(123) so that your train and test
sets match mine.
2. Create a recipe that will model cmedv as a function of all
predictor variables. Apply the following feature engineering steps in
this order:
a. Normalize all numeric predictor variables using a Yeo-Johnson
transformation
b. Standardize all numeric predictor variables
3. Create a 5-fold cross validation resampling object.
4. Create a regularized regression model object that:
a. Contains tuning placeholders for the mixture and penalty
arguments
b. Sets the engine to use the glmnet package.
5. Create our hyperparameter search grid that:
a. Searches across default values for mixture
b. Searches across values ranging from -10 to 5 (on the log10 scale) for penalty
c. Will search across 10 values for each of these hyperparameters
(levels)
6. Creates a workflow object that combines our recipe object with
our model object.
7. Performs a hyperparameter search.
8. Assesses the results.
What hyperparameter values produce the best results? What is the
mean cross-validated RMSE for this optimal model?
# Step 1. split our data
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
boston_train <- training(split)
boston_test <- testing(split)
# Step 2. create our feature engineering recipe
boston_recipe <- recipe(cmedv ~ ., data = boston_train) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())
# Step 3. create resampling object
set.seed(123)
kfolds <- vfold_cv(boston_train, v = 5, strata = cmedv)
# Step 4. create our model object
reg_mod <- linear_reg(mixture = tune(), penalty = tune()) %>%
  set_engine("glmnet")
# Step 5. create our hyperparameter search grid
reg_grid <- grid_regular(penalty(range = c(-10, 5)), mixture(), levels = 10)
reg_grid
## # A tibble: 100 × 2
## penalty mixture
## <dbl> <dbl>
## 1 1 e-10 0
## 2 4.64e- 9 0
## 3 2.15e- 7 0
## 4 1 e- 5 0
## 5 4.64e- 4 0
## 6 2.15e- 2 0
## 7 1 e+ 0 0
## 8 4.64e+ 1 0
## 9 2.15e+ 3 0
## 10 1 e+ 5 0
## # ℹ 90 more rows
# Step 6. create our workflow object
boston_wf <- workflow() %>%
  add_recipe(boston_recipe) %>%
  add_model(reg_mod)
# Step 7. perform hyperparameter search
tuning_results <- boston_wf %>%
  tune_grid(resamples = kfolds, grid = reg_grid)
## Warning: package 'glmnet' was built under R version 4.3.3
## Warning: package 'Matrix' was built under R version 4.3.3
## → A | warning: A correlation computation is required, but `estimate` is constant and has 0
## standard deviation, resulting in a divide by 0 error. `NA` will be returned.
## There were issues with some computations   A: x1
## Warning: More than one set of outcomes were used when tuning. This should never
## happen. Review how the outcome is specified in your model.
# Step 8. assess results
tuning_results %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  arrange(mean)
## # A tibble: 100 × 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0215 1 rmse standard 4.24 1 NA Preprocessor1_M…
## 2 0.0215 0.889 rmse standard 4.24 1 NA Preprocessor1_M…
## 3 0.0215 0.778 rmse standard 4.24 1 NA Preprocessor1_M…
## 4 0.0215 0.667 rmse standard 4.24 1 NA Preprocessor1_M…
## 5 0.0215 0.556 rmse standard 4.25 1 NA Preprocessor1_M…
## 6 0.0215 0.444 rmse standard 4.25 1 NA Preprocessor1_M…
## 7 0.0215 0.333 rmse standard 4.26 1 NA Preprocessor1_M…
## 8 0.0000000001 0.667 rmse standard 4.26 1 NA Preprocessor1_M…
## 9 0.00000000464 0.667 rmse standard 4.26 1 NA Preprocessor1_M…
## 10 0.000000215 0.667 rmse standard 4.26 1 NA Preprocessor1_M…
## # ℹ 90 more rows
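From the table, the best-performing combination is a penalty of roughly 0.0215 with a mixture of 1 (a pure lasso penalty), giving a mean cross-validated RMSE of about 4.24. To report only the top row rather than the full table, show_best() is a handy shortcut (a small sketch using the tuning_results object above):
# report only the best penalty/mixture combination by mean RMSE
show_best(tuning_results, metric = "rmse", n = 1)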
Assess the plot of our hyperparameter search results below. What
do the results tell you? Does the amount of regularization (the size of the
penalty) have a larger influence on the RMSE, or does the type of penalty
applied (the mixture) have a larger influence?
autoplot(tuning_results)

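One rough way to back up what the plot shows is to average the cross-validated RMSE over each hyperparameter separately and compare how much it moves; a small sketch using the collected metrics (the variable names here are just for illustration):
rmse_results <- collect_metrics(tuning_results) %>%
  filter(.metric == "rmse")
# how much does the average RMSE move as the penalty changes?
rmse_results %>%
  group_by(penalty) %>%
  summarize(avg_rmse = mean(mean, na.rm = TRUE)) %>%
  summarize(rmse_spread_across_penalty = diff(range(avg_rmse)))
# and how much does it move as the mixture changes?
rmse_results %>%
  group_by(mixture) %>%
  summarize(avg_rmse = mean(mean, na.rm = TRUE)) %>%
  summarize(rmse_spread_across_mixture = diff(range(avg_rmse)))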
Now, fill in the blanks below to:
1. finalize our workflow object with the optimal hyperparameter
values,
2. fit our final workflow object across the full training set data,
and
3. plot the top 10 most influential features.
We see that lstat is the most influential predictor variable,
followed by dis and rm.
# Step 1. finalize our workflow object with the optimal hyperparameter values
best_hyperparameters <- select_best(tuning_results, metric = "rmse")
final_wf <- workflow() %>%
  add_recipe(boston_recipe) %>%
  add_model(reg_mod) %>%
  finalize_workflow(best_hyperparameters)
# Step 2. fit our final workflow object across the full training set data
final_fit <- final_wf %>%
  fit(data = boston_train)
# Step 3. plot the top 10 most influential features
final_fit %>%
  extract_fit_parsnip() %>%
  vip()

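To see the actual coefficients behind that importance ranking, you can pull them out of the fitted glmnet model; a small sketch (tidy() on the parsnip fit is expected to return the coefficients at the tuned penalty):
# glmnet coefficients at the tuned penalty, largest absolute effects first
final_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(desc(abs(estimate)))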
Part 2: Tuning a Regularized Classification Model
For this part of the lab we’ll use the kernlab::spam data set. This
data set was collected at Hewlett-Packard Labs, and it classifies 4601
e-mails as spam or non-spam. The response variable is type. In addition
to this response variable there are 57 predictor variables indicating
the frequency of certain words and characters in the e-mail. Your
objective is to tune a regularized logistic regression model to find the
hyperparameter values that maximize the AUC model metric.
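Before tuning, it is worth a quick check of the class balance in the response, since we will stratify our splits and resamples on type; a minimal sketch (assumes kernlab is installed):
# count spam vs. non-spam e-mails
data(spam, package = "kernlab")
dplyr::count(spam, type)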
Using the code chunk below, fill in the blanks to:
1. Split the data into a training set and test set using a 70-30%
split. Be sure to include set.seed(123) so that your train and test
sets match mine.
2. Create a recipe that will model type as a function of all
predictor variables. Apply the following feature engineering steps in
this order:
a. Normalize all numeric predictor variables using a Yeo-Johnson
transformation
b. Standardize all numeric predictor variables
3. Create a 5-fold cross validation resampling object.
4. Create a regularized logistic regression model object that:
a. Contains tuning placeholders for the mixture and penalty
arguments
b. Sets the engine to use the glmnet package.
c. Sets the mode to be a classification model.
5. Create our hyperparameter search grid that:
a. Searches across default values for mixture
b. Searches across values ranging from -10 to 5 (on the log10 scale) for penalty
c. Will search across 10 values for each of these hyperparameters
(levels)
6. Creates a workflow object that combines our recipe object with
our model object.
7. Performs a hyperparameter search.
8. Assesses the results.
What hyperparameter values produce the best results? What is the
mean cross-validated AUC for this optimal model?
# install.packages("kernlab") if you don't have this package installed
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
## The following object is masked from 'package:scales':
##
## alpha
data(spam)
# Step 1: create train and test splits
set.seed(123) # for reproducibility
split <- initial_split(spam, prop = 0.7, strata = "type")
spam_train <- training(split)
spam_test <- testing(split)
# Step 2: create preprocessing recipe
spam_recipe <- recipe(type ~ ., data = spam_train) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())
# Step 3: create resampling object
set.seed(123)
kfolds <- vfold_cv(spam_train, v = 5, strata = type)
# Step 4: create regularized logistic regression model object
logit_mod <- logistic_reg(mixture = tune(), penalty = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")
# Step 5. create our hyperparameter search grid
logit_grid <- grid_regular(mixture(), penalty(range = c(-10, 5)), levels = 10)
# Step 6: create workflow object to combine the recipe & model
spam_wf <- workflow() %>%
  add_recipe(spam_recipe) %>%
  add_model(logit_mod)
# Step 7. perform hyperparameter search
tuning_results <- spam_wf %>%
  tune_grid(resamples = kfolds, grid = logit_grid)
# Step 8. assess results
tuning_results %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))
## # A tibble: 100 × 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.000464 1 roc_auc binary 0.979 5 0.00352 Preprocessor1_Mo…
## 2 0.000464 0.889 roc_auc binary 0.979 5 0.00349 Preprocessor1_Mo…
## 3 0.000464 0.778 roc_auc binary 0.979 5 0.00347 Preprocessor1_Mo…
## 4 0.000464 0.667 roc_auc binary 0.979 5 0.00344 Preprocessor1_Mo…
## 5 0.000464 0.222 roc_auc binary 0.979 5 0.00329 Preprocessor1_Mo…
## 6 0.000464 0.333 roc_auc binary 0.979 5 0.00333 Preprocessor1_Mo…
## 7 0.000464 0.556 roc_auc binary 0.979 5 0.00342 Preprocessor1_Mo…
## 8 0.000464 0.111 roc_auc binary 0.979 5 0.00328 Preprocessor1_Mo…
## 9 0.000464 0.444 roc_auc binary 0.979 5 0.00339 Preprocessor1_Mo…
## 10 0.0000000001 0.111 roc_auc binary 0.979 5 0.00335 Preprocessor1_Mo…
## # ℹ 90 more rows
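The top rows show a best mean cross-validated AUC of roughly 0.979, reached at a penalty of about 0.000464 with a mixture of 1. To pull out just those hyperparameter values, select_best() is a convenient shortcut (a small sketch):
# extract the single best penalty/mixture combination by AUC
select_best(tuning_results, metric = "roc_auc")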
autoplot(tuning_results)

# Step 1. finalize our workflow object with the optimal hyperparameter values
best_hyperparameters <- select_best(tuning_results, metric = "roc_auc")
final_wf <- workflow() %>%
  add_recipe(spam_recipe) %>%
  add_model(logit_mod) %>%
  finalize_workflow(best_hyperparameters)
# Step 2. fit our final workflow object across the full training set data
final_fit <- final_wf %>%
  fit(data = spam_train)
# Step 3. plot the top 10 most influential features
final_fit %>%
  extract_fit_parsnip() %>%
  vip()

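Since spam_test was set aside in Step 1 but never used, a natural follow-up is to estimate out-of-sample performance. A small sketch with last_fit(), which refits the finalized workflow on the training portion of split and evaluates it on the held-out test portion:
# fit on the training set and score the held-out test set in one step
final_res <- last_fit(final_wf, split)
collect_metrics(final_res)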
Part 3: Tuning a MARS Classification Model
For this part of the lab we’ll continue using the kernlab::spam data
set. Your objective is to tune a MARS classification model to find the
hyperparameter values that maximize the AUC model metric.
Using the code chunk below, fill in the blanks to:
1. Split the data into a training set and test set using a 70-30%
split. Be sure to include set.seed(123) so that your train and test
sets match mine.
2. Create a recipe that will model type as a function of all
predictor variables. Apply the following feature engineering steps in
this order:
a. Normalize all numeric predictor variables using a Yeo-Johnson
transformation
b. Standardize all numeric predictor variables
3. Create a 5-fold cross validation resampling object.
4. Create a MARS model object that:
a. Contains tuning placeholders for the num_terms and prod_degree
arguments
b. Sets the mode to be a classification model.
5. Create our hyperparameter search grid that:
a. Searches across values ranging from 1 to 30 for num_terms
b. Searches across default values for prod_degree
c. Will search across 25 values for each of these hyperparameters
(levels)
6. Creates a workflow object that combines our recipe object with
our model object.
7. Performs a hyperparameter search.
8. Assesses the results.
What hyperparameter values produce the best results? What is the
mean cross-validated AUC for this optimal model?
# Step 1: create train and test splits
set.seed(123) # for reproducibility
split <- initial_split(spam, prop = 0.7, strata = type)
spam_train <- training(split)
spam_test <- testing(split)
# Step 2: create preprocessing recipe
spam_recipe <- recipe(type ~ ., data = spam_train) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())
# Step 3: create resampling object
set.seed(123)
kfolds <- vfold_cv(spam_train, v = 5, strata = type)
# Step 4: create MARS model object
mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification")
# Step 5. create our hyperparameter search grid
mars_grid <- grid_regular(num_terms(range = c(1, 30)), prod_degree(), levels = 25)
# Step 6: create workflow object to combine the recipe & model
spam_wf <- workflow() %>%
  add_recipe(spam_recipe) %>%
  add_model(mars_mod)
# Step 7. perform hyperparameter search
tuning_results <- spam_wf %>%
  tune_grid(resamples = kfolds, grid = mars_grid)
## Warning: package 'earth' was built under R version 4.3.3
## Warning: package 'plotmo' was built under R version 4.3.3
## → A | warning: glm.fit: algorithm did not converge, glm.fit: fitted probabilities numerically 0 or 1 occurred, the glm algorithm did not converge for response "spam"
## There were issues with some computations   A: x2
# Step 8. assess results
tuning_results %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))
## # A tibble: 50 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 25 2 roc_auc binary 0.975 5 0.00339 Preprocessor1_M…
## 2 26 2 roc_auc binary 0.975 5 0.00339 Preprocessor1_M…
## 3 27 2 roc_auc binary 0.975 5 0.00339 Preprocessor1_M…
## 4 28 2 roc_auc binary 0.975 5 0.00339 Preprocessor1_M…
## 5 30 2 roc_auc binary 0.975 5 0.00339 Preprocessor1_M…
## 6 22 2 roc_auc binary 0.975 5 0.00338 Preprocessor1_M…
## 7 20 2 roc_auc binary 0.975 5 0.00333 Preprocessor1_M…
## 8 21 2 roc_auc binary 0.975 5 0.00327 Preprocessor1_M…
## 9 23 2 roc_auc binary 0.975 5 0.00335 Preprocessor1_M…
## 10 16 2 roc_auc binary 0.975 5 0.00335 Preprocessor1_M…
## # ℹ 40 more rows
Assess the plot of our hyperparameter search results below. What
do the results tell you? Does the number of model terms have a larger
influence on the AUC, or does the degree of interaction have a larger
influence?
autoplot(tuning_results)

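One way to put numbers behind your answer is to look at the best mean AUC reached at each interaction degree; a small sketch using the collected metrics:
# best mean AUC achieved for each interaction degree searched
collect_metrics(tuning_results) %>%
  filter(.metric == "roc_auc") %>%
  group_by(prod_degree) %>%
  slice_max(mean, n = 1, with_ties = FALSE)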
Now, fill in the blanks below to:
1. finalize our workflow object with the optimal hyperparameter
values,
2. fit our final workflow object across the full training set data,
and
3. plot the top 10 most influential features
# Step 1. finalize our workflow object with the optimal hyperparameter values
best_hyperparameters <- select_best(tuning_results, metric = "roc_auc")
final_wf <- workflow() %>%
  add_recipe(spam_recipe) %>%
  add_model(mars_mod) %>%
  finalize_workflow(best_hyperparameters)
# Step 2. fit our final workflow object across the full training set data
final_fit <- final_wf %>%
  fit(data = spam_train)
# Step 3. plot the top 10 most influential features
final_fit %>%
  extract_fit_parsnip() %>%
  vip()

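If you also want to see which hinge functions the final MARS model kept, rather than just an importance ranking, you can inspect the underlying earth fit directly; a small sketch (summary() here is the earth package's method for the fitted model):
# show the selected MARS terms and their coefficients
final_fit %>%
  extract_fit_engine() %>%
  summary()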