In this episode of SLICED, contestants are challenged to use a variety of features to predict whether a batter’s hit results in a home run.
The evaluation metric is log loss.
I’ll be learning how to use racing methods by coding along with Julia Silge!
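Log loss penalizes confident predictions that turn out to be wrong much more heavily than hesitant ones. As a quick aside of my own (with made-up probabilities, not anything from the competition data), here is what the metric actually computes:

# Toy example: manual log loss vs the yardstick helper (illustrative values only)
truth <- factor(c("HR", "no", "no", "HR"), levels = c("HR", "no"))
prob_hr <- c(0.9, 0.2, 0.05, 0.4)  # predicted probability of "HR"
y <- as.integer(truth == "HR")
mean(-(y * log(prob_hr) + (1 - y) * log(1 - prob_hr)))
# Same value via yardstick (event_level defaults to the first factor level, "HR");
# both give ~0.324 for these toy values
yardstick::mn_log_loss_vec(truth, prob_hr)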
suppressWarnings(if (!require(pacman)) install.packages("pacman"))
## Loading required package: pacman
pacman::p_load("tidyverse", "tidymodels", "here", "scales", "glmnet", "stacks", "janitor", "finetune", "vip")
doParallel::registerDoParallel()

Against my better judgement, I’m moving straight to modelling!
In this section, we budget our data, allocating specific subsets for training, testing, and cross-validation.
set.seed(2056)
# Load data
train_raw <- read_csv("train.csv", show_col_types = FALSE)
holdout <- read_csv("test.csv", show_col_types = FALSE)
# Convert 0s and 1s (outcome) into a factor and split data
bb_split <- train_raw %>%
mutate(is_home_run = if_else(as.logical(is_home_run), "HR", "no"),
is_home_run = factor(is_home_run)) %>%
initial_split(strata = is_home_run)
bb_train <- training(bb_split)
bb_test <- testing(bb_split)
# Training folds
set.seed(2056)
bb_folds <- bb_train %>%
vfold_cv(v = 10, strata = is_home_run)
eval_metrics <- metric_set(mn_log_loss)
theme_set(theme_light())

Feature engineering encompasses activities that reformat predictor values to make them easier for a model to use effectively.
bb_rec <- recipe(is_home_run ~ launch_angle + launch_speed + plate_x + plate_z +
                   inning + balls + strikes + game_date + bb_type + bearing +
                   pitch_mph + is_batter_lefty + is_pitcher_lefty,
                 data = bb_train) %>%
# Extract week of year from date
step_date(game_date, features = c("week"), keep_original_cols = FALSE) %>%
# Assign missing factor to "unknown"
step_unknown(all_nominal_predictors()) %>%
# Convert nominal features to numeric
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
# Impute missing numeric values using the median
step_impute_median(all_numeric_predictors(), - launch_angle, - launch_speed) %>%
step_impute_linear(launch_angle, launch_speed, impute_with = imp_vars(plate_x, plate_z, pitch_mph)) %>%
step_nzv(all_predictors())
prep(bb_rec)
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 13
##
## Training data contained 34683 data points and 15303 incomplete rows.
##
## Operations:
##
## Date features from game_date [trained]
## Unknown factor level assignment for bb_type, bearing [trained]
## Dummy variables from bb_type, bearing [trained]
## Median Imputation for plate_x, plate_z, inning, balls, strikes, pitch... [trained]
## Linear regression imputation for launch_angle, launch_speed [trained]
## Sparse, unbalanced variable filter removed bb_type_unknown, bearing_unknown [trained]
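Before specifying a model, it can be helpful to peek at what the recipe actually produces. This is a quick check of my own (not part of the original walkthrough): baking the prepped recipe on the training data shows the columns the model will receive.

# Inspect the preprocessed training data the model will see
bb_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()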
# Model specification
xgb_spec <-
boost_tree(
trees = tune(),
min_n = tune(),
mtry = tune(),
learn_rate = 0.01
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# Workflow
xgb_wf <- workflow() %>%
add_recipe(bb_rec) %>%
add_model(xgb_spec)

In racing methods, the tuning process evaluates all models on an initial subset of resamples. Based on their current performance metrics, the process then eliminates tuning parameter combinations that are unlikely to give the best results, using a repeated measures ANOVA model. This is unlike grid search, where all models need to be fit across all resamples before any tuning parameters can be evaluated.
doParallel::registerDoParallel()
library(finetune)
set.seed(2056)
# Grid search via racing
xgb_rs <- tune_race_anova(
object = xgb_wf,
resamples = bb_folds,
# Try out 15 different combinations of parameters
# i.e 15 different models
grid = 15,
metrics = metric_set(mn_log_loss),
control = control_race(verbose_elim = TRUE)
)
## i Creating pre-processing data to finalize unknown parameter: mtry
## i Racing will minimize the mn_log_loss metric.
## i Resamples are analyzed in a random order.
## i Fold10: 12 eliminated; 3 candidates remain.
## i Fold05: All but one parameter combination were eliminated.
xgb_rs$.metrics
## [[1]]
## # A tibble: 15 x 7
## mtry trees min_n .metric .estimator .estimate .config
## <int> <int> <int> <chr> <chr> <dbl> <chr>
## 1 1 1965 13 mn_log_loss binary 0.113 Preprocessor1_Model01
## 2 3 1793 37 mn_log_loss binary 0.109 Preprocessor1_Model02
## 3 4 1527 34 mn_log_loss binary 0.108 Preprocessor1_Model03
## 4 5 165 32 mn_log_loss binary 0.195 Preprocessor1_Model04
## 5 6 1687 26 mn_log_loss binary 0.107 Preprocessor1_Model05
## 6 7 407 16 mn_log_loss binary 0.118 Preprocessor1_Model06
## 7 8 589 12 mn_log_loss binary 0.111 Preprocessor1_Model07
## 8 10 836 3 mn_log_loss binary 0.109 Preprocessor1_Model08
## 9 10 1322 20 mn_log_loss binary 0.106 Preprocessor1_Model09
## 10 12 1456 9 mn_log_loss binary 0.105 Preprocessor1_Model10
## 11 12 750 19 mn_log_loss binary 0.109 Preprocessor1_Model11
## 12 14 1080 24 mn_log_loss binary 0.107 Preprocessor1_Model12
## 13 16 53 7 mn_log_loss binary 0.395 Preprocessor1_Model13
## 14 16 357 29 mn_log_loss binary 0.122 Preprocessor1_Model14
## 15 17 967 37 mn_log_loss binary 0.109 Preprocessor1_Model15
##
## [[2]]
## # A tibble: 15 x 7
## mtry trees min_n .metric .estimator .estimate .config
## <int> <int> <int> <chr> <chr> <dbl> <chr>
## 1 1 1965 13 mn_log_loss binary 0.111 Preprocessor1_Model01
## 2 3 1793 37 mn_log_loss binary 0.104 Preprocessor1_Model02
## 3 4 1527 34 mn_log_loss binary 0.104 Preprocessor1_Model03
## 4 5 165 32 mn_log_loss binary 0.194 Preprocessor1_Model04
## 5 6 1687 26 mn_log_loss binary 0.102 Preprocessor1_Model05
## 6 7 407 16 mn_log_loss binary 0.114 Preprocessor1_Model06
## 7 8 589 12 mn_log_loss binary 0.106 Preprocessor1_Model07
## 8 10 836 3 mn_log_loss binary 0.104 Preprocessor1_Model08
## 9 10 1322 20 mn_log_loss binary 0.102 Preprocessor1_Model09
## 10 12 1456 9 mn_log_loss binary 0.101 Preprocessor1_Model10
## 11 12 750 19 mn_log_loss binary 0.105 Preprocessor1_Model11
## 12 14 1080 24 mn_log_loss binary 0.103 Preprocessor1_Model12
## 13 16 53 7 mn_log_loss binary 0.394 Preprocessor1_Model13
## 14 16 357 29 mn_log_loss binary 0.117 Preprocessor1_Model14
## 15 17 967 37 mn_log_loss binary 0.104 Preprocessor1_Model15
##
## [[3]]
## # A tibble: 15 x 7
## mtry trees min_n .metric .estimator .estimate .config
## <int> <int> <int> <chr> <chr> <dbl> <chr>
## 1 1 1965 13 mn_log_loss binary 0.113 Preprocessor1_Model01
## 2 3 1793 37 mn_log_loss binary 0.109 Preprocessor1_Model02
## 3 4 1527 34 mn_log_loss binary 0.109 Preprocessor1_Model03
## 4 5 165 32 mn_log_loss binary 0.196 Preprocessor1_Model04
## 5 6 1687 26 mn_log_loss binary 0.107 Preprocessor1_Model05
## 6 7 407 16 mn_log_loss binary 0.119 Preprocessor1_Model06
## 7 8 589 12 mn_log_loss binary 0.113 Preprocessor1_Model07
## 8 10 836 3 mn_log_loss binary 0.110 Preprocessor1_Model08
## 9 10 1322 20 mn_log_loss binary 0.107 Preprocessor1_Model09
## 10 12 1456 9 mn_log_loss binary 0.106 Preprocessor1_Model10
## 11 12 750 19 mn_log_loss binary 0.111 Preprocessor1_Model11
## 12 14 1080 24 mn_log_loss binary 0.109 Preprocessor1_Model12
## 13 16 53 7 mn_log_loss binary 0.395 Preprocessor1_Model13
## 14 16 357 29 mn_log_loss binary 0.123 Preprocessor1_Model14
## 15 17 967 37 mn_log_loss binary 0.110 Preprocessor1_Model15
##
## [[4]]
## # A tibble: 3 x 7
## mtry trees min_n .metric .estimator .estimate .config
## <int> <int> <int> <chr> <chr> <dbl> <chr>
## 1 6 1687 26 mn_log_loss binary 0.0927 Preprocessor1_Model05
## 2 10 1322 20 mn_log_loss binary 0.0924 Preprocessor1_Model09
## 3 12 1456 9 mn_log_loss binary 0.0928 Preprocessor1_Model10
##
## [[5]]
## # A tibble: 1 x 7
## mtry trees min_n .metric .estimator .estimate .config
## <int> <int> <int> <chr> <chr> <dbl> <chr>
## 1 12 1456 9 mn_log_loss binary 0.0937 Preprocessor1_Model10
##
## [[6]]
## # A tibble: 1 x 7
## mtry trees min_n .metric .estimator .estimate .config
## <int> <int> <int> <chr> <chr> <dbl> <chr>
## 1 12 1456 9 mn_log_loss binary 0.105 Preprocessor1_Model10
##
## [[7]]
## # A tibble: 1 x 7
## mtry trees min_n .metric .estimator .estimate .config
## <int> <int> <int> <chr> <chr> <dbl> <chr>
## 1 12 1456 9 mn_log_loss binary 0.0955 Preprocessor1_Model10
##
## [[8]]
## # A tibble: 1 x 7
## mtry trees min_n .metric .estimator .estimate .config
## <int> <int> <int> <chr> <chr> <dbl> <chr>
## 1 12 1456 9 mn_log_loss binary 0.0884 Preprocessor1_Model10
##
## [[9]]
## # A tibble: 1 x 7
## mtry trees min_n .metric .estimator .estimate .config
## <int> <int> <int> <chr> <chr> <dbl> <chr>
## 1 12 1456 9 mn_log_loss binary 0.0983 Preprocessor1_Model10
##
## [[10]]
## # A tibble: 1 x 7
## mtry trees min_n .metric .estimator .estimate .config
## <int> <int> <int> <chr> <chr> <dbl> <chr>
## 1 12 1456 9 mn_log_loss binary 0.0977 Preprocessor1_Model10
The metrics above show the incremental elimination of tuning parameter combinations that are unlikely to improve performance: all 15 candidates are evaluated on the first three resamples, 12 are then eliminated leaving 3, and after one more resample only a single combination survives the race.

A note on the grid argument: it takes an integer or a data frame. When an integer is used, the function creates a space-filling design with that number of candidate parameter combinations. Space-filling designs generally find a configuration of points that covers the parameter space with the smallest chance of overlapping or redundant values.
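If more control is wanted, an equivalent space-filling grid could be built explicitly with dials and passed to the grid argument. A minimal sketch follows; the size and parameter ranges are illustrative, not the ones tune generated internally.

# Build a Latin hypercube (space-filling) grid of candidate parameters
library(dials)
set.seed(2056)
xgb_grid <- grid_latin_hypercube(
  trees(range = c(50L, 2000L)),
  min_n(),
  finalize(mtry(), bb_train),  # mtry's upper bound depends on the data
  size = 15
)
xgb_grid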
# See the race visually
plot_race(xgb_rs)

# Best model
show_best(xgb_rs)

Next, we update the XGBoost workflow with the best tuning parameters and fit it one final time, training on the training set and evaluating on the test set.
# Finalize workflow
xgb_last <- xgb_wf %>%
finalize_workflow(select_best(xgb_rs, "mn_log_loss")) %>%
last_fit(bb_split)
# Collect predictions
xgb_last %>%
collect_predictions() %>%
mn_log_loss(is_home_run, .pred_HR)

library(vip)
# Extract the fitted XGBoost workflow
extract_workflow(xgb_last) %>%
extract_fit_parsnip() %>%
vip(geom = "point", num_features = 15)
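Finally, to produce a submission, the fitted workflow can score the holdout set. A sketch of that last step (not shown above; the competition's id column name is not reproduced here, so the first column of the holdout is simply kept as a placeholder):

# Score the unseen holdout data with the workflow fitted by last_fit()
holdout_preds <- extract_workflow(xgb_last) %>%
  predict(new_data = holdout, type = "prob") %>%
  bind_cols(holdout %>% select(1))  # assumed: first column is the row identifier
# write_csv(holdout_preds, "submission.csv")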