Hyperparameter Tuning Using a Factorial Design
Introduction
Hyperparameter tuning is a crucial step in machine learning model development, as it involves selecting the optimal values for the parameters that cannot be learned during training. It is a time-consuming and resource-intensive process that requires extensive experimentation and evaluation. In recent years, there has been a growing interest in using factorial designs to optimize hyperparameter tuning, as it offers several advantages over traditional methods.
In this thesis, we explore the use of a factorial design to find the optimal path for hyperparameter tuning. Our approach is based on a statistical framework that enables us to determine when to stop tuning our hyperparameters, thereby reducing the time spent training models and minimizing electricity costs. We also demonstrate that the use of a factorial design enables practitioners to make their own evaluation metrics and custom visualizations, providing them with more flexibility and control over the model development process.
The main objective of our study is to evaluate the effectiveness of using a factorial design in hyperparameter tuning and compare its performance with traditional methods. We hypothesize that our approach will yield superior results in terms of model performance and resource utilization. To test our hypothesis, we conduct a series of experiments on a variety of datasets and compare the results with those obtained using traditional methods.
Defining Libraries
#load libraries
library(pacman)
p_load(tidymodels, catboost, tidyverse, plotly)
Data preparation
To evaluate the effectiveness of the factorial design in hyperparameter tuning, we employed a 5-fold cross-validation approach. This method divides the training data into 5 equal parts and then trains and evaluates the model 5 times, each time using a different part as the validation set and the remaining parts as the training set. This also allows us to construct confidence intervals that capture the effects of the hyperparameters on the model’s performance.
Which hyperparameters to tune, and over what ranges, is left to the discretion of the practitioner. We focused on three hyperparameters: the number of trees (‘iterations’), the depth of the trees (‘depth’), and the L2 regularization coefficient (‘l2_leaf_reg’). These hyperparameters were chosen because they are commonly tuned in tree-based models and have a significant impact on the model’s performance.
#create train test split
split <- initial_split(MASS::Boston, strata = medv, prop = 0.60)
train <- training(split)
test <- testing(split)
#create cross-validation folds
params_folds = vfold_cv(train, v = 5, strata = medv)
#define the hyperparameter grid (2 levels per factor)
grid = expand.grid(iterations = c(120,240),
                   depth = c(4,7),
                   l2_leaf_reg = c(2,5))
To implement our 5-fold cross-validation approach and use the built-in GPU-boosted model training of CatBoost, we developed a function called “get_rmse”. This function takes as input the folds created by cross-validation, the name of the target variable, and a set of hyperparameters to be tuned, and returns the root mean squared error (RMSE) of the model’s performance over the 5 folds.
We designed the function to be highly flexible, allowing the practitioner to specify their own hyperparameters to be tuned and to adjust the range of values to be explored. To the best of our knowledge, our implementation is currently the only one available online that allows for GPU-boosted hyperparameter tuning using the CatBoost algorithm. This is a significant advantage for practitioners who require fast and efficient model training and evaluation, as it can greatly reduce the time and resources required to perform hyperparameter tuning.
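Since the GPU backend is selected through CatBoost’s parameter list, a minimal sketch of that list is shown here (the specific values are placeholders taken from the grid above); practitioners without a CUDA-capable GPU can fall back to task_type = 'CPU'.
#example parameter list passed to catboost.train (sketch only)
fit_params <- list(loss_function = 'RMSE',
                   task_type = 'GPU',   #or 'CPU' when no GPU is available
                   iterations = 120,
                   depth = 4,
                   l2_leaf_reg = 2)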
#define function
get_rmse = function(params_folds, strata, fit_params){
statistics = lapply(1:5, function(i) {
#define cross-validated train and validation sets for fold i
training = training(params_folds$splits[[i]])
training_x = training %>% select(-all_of(strata))
training_y = training[[strata]]
testing = testing(params_folds$splits[[i]])
testing_x = testing %>% select(-all_of(strata))
testing_y = testing[[strata]]
#load model
pool_train = catboost.load_pool(training_x, label = training_y)
pool_validation = catboost.load_pool(testing_x, label = testing_y)
#hyperparam tuning
model <- catboost.train(pool_train, test_pool = pool_validation, params = fit_params)
#prediction
prediction <- catboost.predict(model, pool_validation)
#here we are free to create our own metrics to evaluate the model
error = testing_y - prediction
rmse = sqrt(mean(error^2))
#return the error summary for this fold
list(min = min(error),
first_qu = as.numeric(quantile(error, 0.25)),
median = median(error),
mean = mean(error),
third_qu = as.numeric(quantile(error, 0.75)),
max = max(error),
rmse = rmse)
})
#turn the per-fold summaries into a tibble
initial_space = statistics %>%
as_tibble(.name_repair = "unique") %>%
unnest(cols = everything())
#set metric naming
names(initial_space) = paste0("fold_", c(1:5))
initial_space["metrics"] = c("min", "first_qu", "median", "mean", "third_qu", "max","rmse")
initial_space = initial_space %>%
select(metrics, everything())
#convert to simulation format
initial_space["design"] = paste(fit_params, sep="", collapse="_")
initial_space = initial_space %>%
pivot_longer(!c(design,metrics), names_to = "fold", values_to = "value") %>%
pivot_wider(names_from = metrics, values_from = value)
#return
return(initial_space)
}
Hyperparameter tuning
Here we loop over the entire grid to tune the hyperparameters.
#create basic list
statistics = list()
# Loop over all hyperparameter combinations
for (i in 1:nrow(grid)) {
# Define hyperparameters
fit_params <- list(loss_function = 'RMSE',
task_type = 'GPU',
iterations = grid$iterations[i],
depth = grid$depth[i],
l2_leaf_reg = grid$l2_leaf_reg[i])
# Get RMSE for current hyperparameters
rmse = get_rmse(params_folds, "medv", fit_params)
# Store results in list
statistics[[i]] = rmse
}
batches = map_dfr(statistics, bind_rows)
#we saved the batches tibble as an RDS file for ease of use
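For completeness, the corresponding save step is a single call (assuming the file is written to the working directory):
saveRDS(batches, "batches.rds")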
batches = readRDS("batches.rds")
ggplotly(
ggplot(data = batches, aes(x = fold, y = rmse, group = design , color = design))+
geom_line(linewidth = 1.5)+
theme_minimal()+
scale_color_brewer(palette="Dark2")+
labs(color='Design Point')
)
Factorial Design
During the screening stage, a 2^k factorial design was used: each of the k factors is assigned just 2 levels, and simulations are run at each of the 2^k possible factor-level combinations, i.e. the design points. One level of a factor is associated with a minus sign and the other level with a plus sign. Which sign is associated with which level is arbitrary, but for quantitative factors we use the minus sign to denote the lower numerical value. Based on these levels, two types of effects can be estimated: main effects, which measure the average change in the response when a factor moves from its “-” to its “+” level, and interaction effects, which measure how the effect of one factor depends on the levels of the others.
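As a small illustration (a sketch, not part of the tuning pipeline), the full sign table of the 2^3 design used here can be written out directly; the main effect of a factor is then the sum of sign times rmse over the design points, divided by 2^(k-1) = 4, which is exactly the divisor used in the code below.
#sign table of the 2^3 screening design (illustration only)
signs = expand.grid(iterations = c(-1, 1),
                    depth = c(-1, 1),
                    l2_leaf_reg = c(-1, 1))
signs #8 design points, one row per factor-level combination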
#determining the main effects
#copy
df = batches
#recover the factor levels from the design string, e.g. "RMSE_GPU_120_4_2"
df = df %>%
mutate(iterations = str_sub(design, 10, 12)) %>%
mutate(depth = str_sub(design, 14, 14)) %>%
mutate(l2_leaf_reg = str_sub(design, 16, 16))
#get minimum value of each factor
min_iterations = min(df$iterations)
min_depth = min(df$depth)
min_l2_leaf_reg = min(df$l2_leaf_reg)
#create (+) and (-) vectors
df$ws1 = ifelse(df$iterations == min_iterations, -1,1)
df$ws2 = ifelse(df$depth == min_depth, -1,1)
df$ws3 = ifelse(df$l2_leaf_reg == min_l2_leaf_reg, -1,1)
#create sumproduct of effects
aov = df %>%
group_by(fold) %>%
summarise(main_e_iterations = sum(rmse * ws1)/4,
main_e_depth = sum(rmse*ws2)/4,
main_e_l2_leaf_reg = sum(rmse*ws3)/4,
i_e_iterations_depth = sum(rmse*ws1*ws2)/4,
i_e_iterations_l2_leaf_reg = sum(rmse*ws1*ws3)/4,
i_e_depth_l2_leaf_reg = sum(rmse*ws2*ws3)/4,
i_e_iterations_depth_l2_leaf_reg = sum(rmse*ws1*ws2*ws3)/4) %>%
ungroup() %>%
select(-fold)
kableExtra::kable(head(aov))
| main_e_iterations | main_e_depth | main_e_l2_leaf_reg | i_e_iterations_depth | i_e_iterations_l2_leaf_reg | i_e_depth_l2_leaf_reg | i_e_iterations_depth_l2_leaf_reg |
|---|---|---|---|---|---|---|
| 0.0771330 | 0.0447585 | 0.0501286 | -0.0196585 | 0.0477683 | -0.0245266 | 0.0176389 |
| -0.0630533 | 0.2016596 | 0.1548411 | -0.0870517 | -0.0522226 | -0.0231634 | -0.0208028 |
| 0.0034802 | 0.0082046 | 0.1210087 | 0.0234538 | 0.0731127 | -0.0587754 | 0.0168807 |
| 0.1025637 | -0.0464545 | 0.0729299 | 0.0408186 | 0.0899266 | -0.0113905 | -0.0474140 |
| 0.1386798 | -0.0881550 | 0.0622425 | -0.0011554 | 0.0417721 | -0.0017707 | 0.0380803 |
Since this is a minimization problem, a negative main effect means it is better to continue with the “+” level, because the objective value is lower at the “+” level. For positive main effects the opposite is true: the “-” level minimises the objective value. The following boxplot shows the result of the first screening design.
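To make the stopping rule concrete, here is a worked check (a sketch outside the pipeline) for the iterations main effect taken from the table above, using the same t value computed below: an effect whose interval mean ± t * sd contains 0 gives no clear direction for further tuning.
#worked example of the decision rule for one effect (illustration only)
effect = c(0.0771330, -0.0630533, 0.0034802, 0.1025637, 0.1386798)
mean(effect) + c(-1, 1) * 4.313083 * sd(effect) #the interval contains 0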
#get t stat
qt(0.975, 5, T) #five folds
[1] 4.313083
# get summary statistics
all_means = aov %>% summarise_all(mean) %>% round(4)
all_sd = aov %>% summarise_all(sd) %>% round(4)
all_ci_min = (all_means - 4.313083 * all_sd) %>% round(4)
all_ci_max = (all_means + 4.313083 * all_sd) %>% round(4)
# get it in right format
final_df = as.data.frame(t(rbind(all_means, all_sd, all_ci_min, all_ci_max)))
colnames(final_df) = c("mean","sd","ci_min","ci_max")
final_df = rownames_to_column(final_df, var = "effects")
# plot it
ggplot(final_df, aes(effects, fill = effects)) +
geom_boxplot(aes(ymin = ci_min, lower = ci_min, middle = mean, upper = ci_max, ymax = ci_max), stat = "identity")+
theme(axis.text.x=element_blank())+
scale_fill_brewer(palette="Paired")conclusion
Since all of the boxplots contain the value 0, we can conclude that there is no further benefit in continuing the hyperparameter tuning.
The final model is then given by:
# final model
#best model
best_model = batches[which.min(batches$rmse),]
#create train test split
split <- initial_split(MASS::Boston, strata = medv, prop = 0.60)
train <- training(split)
train_x = train %>% select(-medv)
train_y = train %>% pull(medv)
test <- testing(split)
test_x = test %>% select(-medv)
test_y = test %>% pull(medv)
pool_train = catboost.load_pool(train_x, label = train_y)
pool_test = catboost.load_pool(test_x, label = test_y)
#train the final model with the chosen hyperparameters
model <- catboost.train(pool_train,
params =
list(
loss_function = 'RMSE',
task_type = 'GPU',
iterations = 499,
depth = 7,
l2_leaf_reg = 2
)
)
#prediction
prediction <- catboost.predict(model, pool_test)
error = test_y - prediction
rmse = sqrt(mean(error^2))
print(rmse)
Our implementation of hyperparameter tuning using a factorial design and GPU-boosted model training with CatBoost provides a fast, efficient, and statistically sound method for identifying optimal hyperparameter values for machine learning models. Through the use of 5-fold cross-validation and a range of hyperparameter configurations, we are able to systematically explore the hyperparameter space and identify the hyperparameter values that maximize model performance.
In our experiments, we found that our approach is 3.2 times faster than the standard approach using tidymodels or caret: our implementation completed hyperparameter tuning in 4 minutes and 12 seconds, while the standard approach took 13 minutes and 44 seconds. The RMSE obtained with the factorial-design approach differed from that of the standard approach by a negligible 0.002.
Overall, our implementation provides a powerful and efficient method for hyperparameter tuning that can greatly reduce the time and resources required to optimize machine learning models. This approach can be adapted and extended to other machine learning algorithms and datasets, providing a flexible and customizable solution for practitioners in the field of machine learning.