1 How Accurate Do I Have To Be?

You are at a strategy meeting to solve a business problem. The crux of the situation is that the business owners, board members, stakeholders, customers, process owners, special interest groups, and concerned parents all need to know exactly how to produce the product, output, or service that best solves the problem.

This is an intensely complex system with plenty of “magic bullet” solutions in the ditch behind you.

You, as the data guru, are faced with this situation.

You have at your disposal reality bending algorithms and mind-reading social media troll farms.

Wow!

  • Where do you start?
  • When do you stop?

How accurate do you have to be?

2 Your Data Describes A Non-Linear System

It is easiest to define a non-linear function as one where the relationship between the input and the output does not follow a straight line. A function can be non-linear in a huge number of ways; the figure above shows what such relationships can look like: multiple minima separated by peaks and valleys. Our problem is that industrial processes, transactional business processes, and biological and chemical systems are non-linear, owing to the large number of non-linear factors and to the fact that these factors also interact with each other to produce the final result. Our primary goal in machine learning is to fit a function to data from these kinds of systems, and much of machine learning is focused on algorithms designed to determine the parameters of these functions.

If the conversations with the subject matter experts include phrases like, “Oh, it’s not that simple. It depends on…”, then higher order terms in regression models or hidden layers in neural networks are required. The conflict is that adding more higher order terms or hidden layers adds complexity to your model, requiring more data and more computing expense, and increases the likelihood of a non-optimal, non-robust solution. This article is a good reference.
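To make that trade-off concrete, here is a minimal sketch on hypothetical toy data (nothing here comes from the project data): as the polynomial degree rises, the fit to the training sample keeps improving while the error on held-out points eventually stalls or worsens.

set.seed(1)
x <- runif(300, 0, 10)
y <- sin(x) + 0.1 * x + rnorm(300, sd = 0.3)   # a made-up non-linear system
toy <- data.frame(x = x, y = y)
train_idx <- sample(seq_along(x), 200)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# More higher-order terms means more parameters, more flexibility, and more risk
for (degree in c(1, 3, 9, 15)) {
  fit  <- lm(y ~ poly(x, degree), data = toy[train_idx, ])
  pred <- predict(fit, newdata = toy[-train_idx, ])
  cat(sprintf("degree %2d   train RMSE %.3f   holdout RMSE %.3f\n",
              degree, rmse(toy$y[train_idx], fitted(fit)), rmse(toy$y[-train_idx], pred)))
}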

Parameters: Configurations and values that your machine learning algorithm learns from your data during training. These parameters can interact to create complex, non-linear systems.

The algorithms themselves also have a large number of unknown hyperparameters. These are not determined directly from the data, and choosing them well can dramatically increase model performance and improve training times.

Hyperparameters: Express high-level concepts, such as statistical assumptions (e.g., regularization); are fixed before training or are hard to learn from the data (e.g., neural network architecture); and influence the objective, test-time performance, and computational cost.

…but this only opens up the problem of yet another complex system.
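Before moving on, a tiny illustration of the distinction on made-up data, using ridge regression from the MASS package (loaded later in this script): the penalty lambda must be fixed before training (a hyperparameter), while the coefficients returned are the parameters learned from the data.

set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)
y <- as.vector(X %*% c(2, -1, 0.5) + rnorm(200))
toy <- data.frame(y = y, X)

lambda <- 1                                    # hyperparameter: chosen before training
fit <- MASS::lm.ridge(y ~ ., data = toy, lambda = lambda)
coef(fit)                                      # parameters: estimated during training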

3 Hyperparameters Also Describe A Non-Linear System

The best combination of hyperparameters for the AdaBoost algorithm changes depending on the dataset and the problem definition: there is no “magic combination” of hyperparameters suitable for all problems.

4 Searching For The Optimal Hyperparameters

The quest for the best combination of hyperparameters can be a daunting task because there is no clear methodology for determining them from the application or the data.

Hyperparameter search procedures fall into three groups:

  • The ground truth can only be determined by an exhaustive search of all combinations of hyperparameters using a grid search. While a grid search is comprehensive and complete, it is usually impractical and can take a nearly unbounded amount of time.

  • A random search is superior to a systematic search, can cover a similar domain, and can incorporate empirically determined sampling probability distributions (a small comparison of grid and random sampling follows at the end of this section).

  • Adaptive searches (there are a large number) begin with a sparse random or grid search, but incorporate early results into subsequent sampling. In other words, if the sampling has identified regions with very poor performance, those regions should be avoided while concentrating on regions with better performance. This is a bet-on-a-winner strategy.

Fool me once, shame on thee; fool me twice, shame on me

The adaptive selection includes a number of different algorithms that incorporate new information to change and focus the continued search.
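A minimal sketch of the first two strategies on a purely hypothetical loss surface (toy_loss below is invented for illustration): with the same budget of 25 evaluations, the random search tries 25 distinct values along each log-scaled dimension, while the 5 x 5 grid tries only 5.

set.seed(1)

# Invented loss surface over two hyperparameters (illustration only)
toy_loss <- function(eta, lambda) (log10(eta) + 1.5)^2 + 0.1 * (log10(lambda) + 0.5)^2

# Grid search: 5 x 5 = 25 fixed combinations on a log scale
grid <- expand.grid(eta    = 10^seq(-3, 1, length.out = 5),
                    lambda = 10^seq(-3, 1, length.out = 5))
grid$loss <- toy_loss(grid$eta, grid$lambda)

# Random search: 25 log-uniform draws from the same domain
rand <- data.frame(eta = 10^runif(25, -3, 1), lambda = 10^runif(25, -3, 1))
rand$loss <- toy_loss(rand$eta, rand$lambda)

c(best_grid = min(grid$loss), best_random = min(rand$loss))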

5 A Complex, Non-Linear System Vignette

For this vignette we are going to use data from a complex, non-linear chemical process. The research paper, Modeling of strength of high performance concrete using artificial neural networks by I-Cheng Yeh published in Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998) is our choice.

From the paper, we know that the strength of High Performance Concrete follows a very non-linear relationship with the predictor variables and is poorly modeled by normal regression methods. In the 1970s, the practical limit of the compressive strength of ready-mixed concrete was unlikely to be more than about 11,000 psi. Modern development of high-performance concrete has resulted in pours with a compressive strength of more than 19,000 psi. This has allowed the construction of lighter structures and taller buildings.

The American Concrete Institute defines high-strength concrete as concrete with a compressive strength greater than 6,000 psi. This strength is achieved by managing the proportions of cement, water, aggregates, and admixtures. Strength is a balance of the chemical processes within the components that produce the bond between the cement paste and the aggregate; these involve time-dependent chemistry within the cement paste and the surface characteristics of the aggregate.

Some information from our domain experts:

  • There are at least six different testing procedures accepted by the ASTM while standards for testing differ in Europe.

  • Since the different chemical processes have different time dependencies, concrete strength standards are defined for curing times of 2, 7, 28 and 90 days.

  • When the water-to-cement ratio exceeds approximately 0.4, the water beyond the stoichiometry of the chemistry will form water-filled pockets, thus weakening the concrete.

  • Sulphate content (gypsum) from fly ash and slag retards the hydration of the aluminate phase. Too little can cause a flash set; too much can cause false setting.

  • Particle size and milling temperature of clinker sulphate can change the strength.

  • The amount of added water might be changed to create the desired “slump” or workability.

  • A regression model does not fit the data well.

5.1 Our Data

A strength model based on ANN is more accurate than a model based on regression analysis. — I-Cheng Yeh

The dataset contains 1,030 examples and the following features:

  • Input Variable: Cement (\(kg/m^3\) mixture)
  • Input Variable: Blast Furnace Slag (\(kg/m^3\) mixture)
  • Input Variable: Fly Ash (\(kg/m^3\) mixture)
  • Input Variable: Water (\(kg/m^3\) mixture)
  • Input Variable: Superplasticizer (\(kg/m^3\) mixture)
  • Input Variable: Coarse Aggregate (\(kg/m^3\) mixture)
  • Input Variable: Fine Aggregate (\(kg/m^3\) mixture)
  • Input Variable: Age (days)
  • Output Variable: Concrete compressive strength (\(MPa\))

This vignette is based on the RPubs article by Jean Dos Santos. Of the following approaches, it covers only two (a Sobol-sequence quasi-random search and parallel Bayesian optimization):

  • heuristics and manual search (gut feelings)
    • select hyperparameter settings based on experience with similar problems
  • grid search
    • define a range and step size for each hyperparameter to define an n-dimensional grid
    • the grid may be sub-sampled, e.g., with Latin squares
  • adaptive random search
    • sets of hyperparameter values are generated that depend on the values of the objective observed so far during the search
  • genetic algorithms
    • first choose a random population of hyperparameter sets, calculate the loss function for each set, then create offspring of your better solutions with some mutation, calculate the loss function for this new generation, and repeat the whole process until it converges (a minimal sketch follows this list)
  • differential evolution
    • a variation of genetic algorithms in which a population of candidate solutions evolves
  • particle swarm optimization
    • another method based on an initial array of random points with local improvement
  • Bayesian methods
    • an initial array of random points is evaluated with a loss function, and the results are used to update a surrogate model that guides where to sample next
  • AutoML in H2O
    • this can include automatic selection of hyperparameters at the level of model architecture
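As promised above, a minimal sketch of that genetic-algorithm loop on an invented toy objective (two hyperparameters searched on a log10 scale); it is only meant to show the evaluate / select / crossover / mutate cycle, not a production tuner.

set.seed(1)

# Invented objective: loss as a function of log10(eta) and log10(lambda)
toy_loss <- function(p) (p[1] + 1.5)^2 + 0.1 * (p[2] + 0.5)^2

pop <- matrix(runif(20 * 2, min = -3, max = 1), ncol = 2)    # random initial population
for (generation in 1:25) {
  fitness <- apply(pop, 1, toy_loss)                         # evaluate every candidate
  parents <- pop[order(fitness)[1:10], , drop = FALSE]       # keep the better half
  # Offspring: average two random parents (crossover), then add Gaussian mutation
  offspring <- (parents[sample(10, 10, replace = TRUE), ] +
                parents[sample(10, 10, replace = TRUE), ]) / 2 +
               matrix(rnorm(10 * 2, sd = 0.2), ncol = 2)
  pop <- rbind(parents, offspring)
}
pop[which.min(apply(pop, 1, toy_loss)), ]                    # best log10(eta), log10(lambda)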

5.2 Load Packages

Package management is an integral part of an analytic and modeling project, and I use the pacman package to handle installation and loading. There are a lot of tools associated with this project, and I have arranged them in a logical sequence. In the manner of reproducible research, I come back to this section to add packages as I proceed through the project. In the spirit of The Lean Startup, this script is always a minimum viable product.

#rm(list = ls())

pacman::p_load(
  # text formatting
  knitr, #A General-Purpose Package for Dynamic Report Generation in R

  # reading and manipulating data
  data.table, # very fast reading, writing and manipulation of large datasets
  tidyverse, # The tidyverse collection of data manipulation and graphics packages (Hadley Wickham)
  tidylog, # reporting of tidyr functions
  
  # statistical manipulation
  randtoolbox, # Sobol sequences
  MASS, # "Modern Applied Statistics with S" (4th edition, 2002)
  
  # model definition and management              
  caret,  # Classification and Regression Training (Max Kuhn)
  
  # EDA and visualization
  plotly, # Create Interactive Web Graphics via 'plotly.js'
  GGally, # Extension to 'ggplot2'
  gridExtra, # Miscellaneous Functions for "Grid" Graphics
  ggthemes, # Extra Themes, Scales and Geoms for 'ggplot2'
  funModeling, # EDA and rapid assessment of input data
  DataExplorer, # EDA and visualization similar to funModeling
  
  # ML and optimization routines
  xgboost, # Extreme Gradient Boosting
  GauPro, #Gaussian process model
 
  # Bayesian optimization
  ParBayesianOptimization, # Bayesian optimization with init points and parallel processing
  
  # parallel processing and performance
  tictoc, # Functions for timing R scripts, as well as implementations of Stack and List structures
  parallel, # Support for parallel computation, including random-number generation
  doParallel, # Foreach Parallel Adaptor for the 'parallel' Package
   
  install = TRUE)
library("tidylog", warn.conflicts = FALSE)

5.3 Import the Data

The dataset can be downloaded from the UCI Machine Learning Repository; it was reformatted from its Excel workbook into a csv file in the data directory, since Excel creates far too many problems with automatic date, time, and number conversions for me to trust it.
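For reference, the raw UCI file is an Excel workbook, so the import can also be done entirely in R. A sketch under two assumptions: the readxl package is installed, and the UCI URL below is still current. Note that the raw file lists the response last and is unscaled, unlike the pre-scaled csv read in below.

# Hypothetical direct import, bypassing Excel (URL and readxl availability assumed)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls"
download.file(url, destfile = "./data/Concrete_Data.xls", mode = "wb")
concrete_raw <- readxl::read_excel("./data/Concrete_Data.xls")
names(concrete_raw) <- c("Cement", "Slag", "Ash", "Water", "Superplasticizer",
                         "Coarse_Aggregate", "Fine_Aggregate", "Age", "Strength")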

concrete_data_scaled <- fread(file = "./data/concrete_data_scaled.csv")

dplyr::glimpse(concrete_data_scaled)
## Observations: 1,030
## Variables: 9
## $ Strength         <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 47.03, 43....
## $ Cement           <dbl> 2.4767117, 2.4767117, 0.4911867, 0.4911867, -...
## $ Slag             <dbl> -0.85647182, -0.85647182, 0.79514022, 0.79514...
## $ Ash              <dbl> -0.8467326, -0.8467326, -0.8467326, -0.846732...
## $ Water            <dbl> -0.9163193, -0.9163193, 2.1744049, 2.1744049,...
## $ Superplasticizer <dbl> -0.6201471, -0.6201471, -1.0386383, -1.038638...
## $ Coarse_Aggregate <dbl> 0.86273513, 1.05565137, -0.52626175, -0.52626...
## $ Fine_Aggregate   <dbl> -1.2170788, -1.2170788, -2.2398290, -2.239829...
## $ Age              <dbl> 0.1400634, 0.1400634, 2.0421303, 2.2951627, 2...

5.3.1 Generate Train and Test Set

We start by splitting the data into training and test sets. Since there is no time dependence between consecutive observations, we can use random sampling to construct the train/test split and the folds within the training set.
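The split itself is not shown in this excerpt; a sketch of how the train.data and test.data objects used below could be created, assuming an 80/20 random partition (the original proportion is not stated here):

# Hypothetical 80/20 split producing the train.data / test.data objects used below
set.seed(1)
train_idx  <- caret::createDataPartition(concrete_data_scaled$Strength, p = 0.8, list = FALSE)[, 1]
train.data <- concrete_data_scaled[train_idx, ]
test.data  <- concrete_data_scaled[-train_idx, ]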

An XGBoost model will allow the determination of interactions between variables. The model, however, requires the determination of hyperparameters, defined as any parameter not determined from the data. These are user-defined assumptions such as regularization, scaling, transformation, feature engineering, neural-net architecture, number of epochs, and learning rate (eta). Individual algorithms have a set of hyperparameters commonly used for tuning. For XGBoost, we will be tuning:

  • nrounds - number of boosting iterations
  • eta - step size shrinkage
  • alpha - L1 regularization
  • lambda - L2 regularization

6 XGBoost for Regression

XGBoost stands for eXtreme Gradient Boosting. The name xgboost, though, actually refers to the engineering goal to push the limit of computations resources for boosted tree algorithms. Which is the reason why many people use xgboost. - Tianqi Chen

We will use the xgbLinear model. Although it is called a regression and contains the word “linear”, it is really a regularized linear model with elastic net regularization (L1 + L2 + L2 bias), fitted by parallel coordinate descent optimization; as a boosted ensemble learner it produces a non-linear transfer function.
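As a quick check of which hyperparameters caret actually exposes for this method (the same call appears, commented out, further down in the script):

# List the tuning parameters caret knows for xgbLinear (nrounds, lambda, alpha, eta)
caret::modelLookup(model = "xgbLinear")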

Training Parameters (repeated_cv and adaptive_cv)

# Training Parameters
CV_folds <- 5 # number of folds
CV_repeats <- 3 # number of repeats
minimum_resampling <- 5 # minimum number of resamples

# Training Settings
set.seed(1)

# trainControl object for standard repeated cross-validation
train_control <- caret::trainControl(
  method = "repeatedcv", 
  number = CV_folds, 
  repeats = CV_repeats, 
  verboseIter = FALSE, 
  returnData = FALSE
) 

# trainControl object for adaptive cross-validation with grid search
adapt_control_grid <- caret::trainControl(
  method = "adaptive_cv", 
  number = CV_folds, 
  repeats = CV_repeats, 
  adaptive = list(
    min = minimum_resampling, 
    alpha = 0.05, # CI used to exclude parameter settings
    method = "gls", # generalized least squares
    complete = TRUE
  ), 
  search = "grid", # execute grid search
  verboseIter = FALSE, returnData = FALSE) 

In general, a random search of some kind is usually more efficient than a [grid search](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf). A Gaussian process analysis of the function from hyperparameters to validation performance shows that, for most datasets, only a few hyperparameters really matter.

6.1 SS - Sobol Sequence

Sobol sequences are designed to produce a quasi-random (low-discrepancy) sample: they tend to spread the points apart from one another more evenly than a uniformly random sample. We set the sample up with the same logarithmically defined sample space we used in the grid search.
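A quick visual check of that low-discrepancy property, putting plain Sobol points from randtoolbox (loaded above) next to uniform random draws; this is only a side illustration and is not part of the tuning run.

set.seed(1)
sob  <- randtoolbox::sobol(n = 200, dim = 2)   # low-discrepancy points on the unit square
unif <- matrix(runif(400), ncol = 2)           # plain uniform random points

op <- par(mfrow = c(1, 2))
plot(sob,  pch = 19, cex = 0.6, main = "Sobol sequence (200 points)", xlab = "", ylab = "")
plot(unif, pch = 19, cex = 0.6, main = "Uniform random (200 points)", xlab = "", ylab = "")
par(op)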

6.1.1 SS - Create the Grid

set.seed(1)
sobolGrid <- data.frame(runif.sobol(n = 500, dim = 4, scrambling = 3, seed = 1, init = TRUE))
names(sobolGrid) <- c("nrounds", "eta", "alpha", "lambda")
SSBoost_grid <- data.table(nrounds = round(sobolGrid$nrounds*400) + 100, # number of boosting iterations
                           eta = 10^((sobolGrid$eta * 4) - 3),  # learning rate; a low value reduces overfitting
                           alpha = 10^((sobolGrid$alpha * 4) - 3), # L1 Regularization (Lasso Regression)
                           lambda = 10^((sobolGrid$lambda * 4) - 3)) # L2 Regularization (Ridge Regression)
summary(SSBoost_grid)
##     nrounds           eta               alpha              lambda        
##  Min.   :101.0   Min.   :0.001014   Min.   :0.001006   Min.   :0.001014  
##  1st Qu.:200.8   1st Qu.:0.010303   1st Qu.:0.010036   1st Qu.:0.010002  
##  Median :300.0   Median :0.102077   Median :0.100222   Median :0.099857  
##  Mean   :300.0   Mean   :1.098101   Mean   :1.087214   Mean   :1.075555  
##  3rd Qu.:399.0   3rd Qu.:1.017107   3rd Qu.:1.001213   3rd Qu.:0.976019  
##  Max.   :499.0   Max.   :9.957664   Max.   :9.980181   Max.   :9.833644
nrow(SSBoost_grid)
## [1] 500

6.1.2 SS - Run the Optimization

SS_T0 <- Sys.time()

tic()
cluster <- makeCluster(detectCores() - 1) # number of cores, convention to leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing

# Train model with adaptive cross-validation over the Sobol grid
SS_XGBoost_Linear_model <- caret::train(
  Strength ~., 
  data = train.data,
  method = "xgbLinear",
  trControl = adapt_control_grid,
  verbose = TRUE, 
  silent = 1,
  # tuneLength = 20
  tuneGrid = SSBoost_grid
)

stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing
SS_T1 <- Sys.time()
SS_T1 - SS_T0
## Time difference of 17.13722 mins
toc()
## 1028.25 sec elapsed
saveRDS(object = SS_XGBoost_Linear_model, 
        file = "./Models/SS_XGBoost_Linear_model.rds")
saveRDS(object = SS_XGBoost_Linear_model$finalModel, 
        file = paste0("Models/SS_XGBoost_Linear_model_",
                      class(SS_XGBoost_Linear_model$finalModel)[1],".rds"))

6.1.3 SS - Plot the Results

data_points <- data.table(SS_XGBoost_Linear_model$results$nrounds)
data_points$eta <- SS_XGBoost_Linear_model$results$eta
data_points$alpha <- SS_XGBoost_Linear_model$results$alpha
data_points$lambda <- SS_XGBoost_Linear_model$results$lambda
data_points$RMSE <- SS_XGBoost_Linear_model$results$RMSE
colnames(data_points) <- c("nrounds", "eta", "alpha", "lambda", "RMSE")

data_points %>%
  pivot_longer(cols = c("nrounds", "eta", "alpha", "lambda", "RMSE")) %>%
  ggplot(aes(y = value, x = seq_along(value))) +
  geom_line(col = "blue") +
  facet_wrap(~name, scales = "free_y", ncol = 3) +
    scale_x_continuous(name = "Optimization Phase") +
  labs(title = "Sobel Grid Search") +
    theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  scale_y_continuous(name = "")

6.1.4 SS - Test the Model with Optimized Settings

# Grid of optimal hyperparameter values

SS_XGBoost_Linear_grid <- expand.grid(
  nrounds = SS_XGBoost_Linear_model$bestTune$nrounds,  # number of boosting iterations
  eta = SS_XGBoost_Linear_model$bestTune$eta, # learning rate; a low value is more robust to overfitting
  alpha = SS_XGBoost_Linear_model$bestTune$alpha, # L1 Regularization (Lasso Regression)
  lambda = SS_XGBoost_Linear_model$bestTune$lambda # L2 Regularization (Ridge Regression)
)

#SS_XGBoost_Linear_grid
tic()
cluster <- makeCluster(detectCores() - 1) # number of cores, convention to leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing

set.seed(1)
# Train model with optimal values
SS_XGBoost_Linear_model_fin <- caret::train(
  Strength ~., 
  data = train.data, 
  method = "xgbLinear",
  trControl = train_control,
  verbose = F, metric = "RMSE", maximize = FALSE,
  silent = 1,
  # tuneLength = 1
  tuneGrid = SS_XGBoost_Linear_grid
)

stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing

toc()
## 17.61 sec elapsed
SS_XGBoost_Linear_model_fin
## eXtreme Gradient Boosting 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 662, 661, 660, 660, 661, 661, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   4.439561  0.9288858  2.933334
## 
## Tuning parameter 'nrounds' was held constant at a value of 424
## 
## Tuning parameter 'alpha' was held constant at a value of
##  0.02694503
## Tuning parameter 'eta' was held constant at a value of 7.216885
saveRDS(object = SS_XGBoost_Linear_model_fin, file = "Models/SS_XGBoost_Linear_model_fin.rds")

6.2 BOP - Bayesian Optimization with Parallel Processing

Since this is an optimization problem, we can use optimization techniques to assess and refine the best combinations of hyperparameters. Here is the reference.
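Briefly, the package fits a Gaussian-process surrogate (here with a Matern 5/2 kernel, see the kern setting below) to the hyperparameter/score pairs evaluated so far and proposes the next candidate by maximizing an acquisition function; with acq = "ei" that acquisition is the expected improvement. In one common form, with \(\mu(x)\) and \(\sigma(x)\) the posterior mean and standard deviation of the surrogate at a candidate \(x\), and \(f(x^{+})\) the best score observed so far,

\[
\mathrm{EI}(x) = \big(\mu(x) - f(x^{+})\big)\,\Phi(z) + \sigma(x)\,\phi(z), \qquad z = \frac{\mu(x) - f(x^{+})}{\sigma(x)},
\]

where \(\Phi\) and \(\phi\) are the standard normal CDF and density. Because the function below returns the negative RMSE as its Score, maximizing the score minimizes the cross-validated error.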

6.2.1 BOP - Create the Function

xgb_cv_fun <- function(x1, x2, x3, x4) {
  BOP_XGBoost_Linear_grid <- expand.grid(
    nrounds = x1,  # number of boosting iterations
    eta = 10^x2, # learning rate; a low value is more robust to overfitting
    alpha = 10^x3, # L1 Regularization (Lasso Regression)
    lambda = 10^x4 # L2 Regularization (Ridge Regression)
  )
  
  #BOP_XGBoost_Linear_grid
  set.seed(1)
  # Train model with optimal values
  BOP_XGBoost_Linear_model_fin <- caret::train(
    Strength ~., 
    data = train.data, 
    method = "xgbLinear",
    trControl = train_control,
    verbose = F, metric = "RMSE", maximize = FALSE,
    silent = 1,
    # tuneLength = 1
    tuneGrid = BOP_XGBoost_Linear_grid
  )
  #  return(list(Score = -min(xgb_cv$evaluation_log$test_rmse_mean),
  #              nrounds = xgb_cv$best_iteration))
  return(list(Score = -BOP_XGBoost_Linear_model_fin$results$RMSE,
         nrounds = BOP_XGBoost_Linear_model_fin$results$nrounds))
}

# test of the function at the best solution from the Sobol search
# should return about -4.439561
  x1 = SS_XGBoost_Linear_model$bestTune$nrounds  # number of boosting iterations
  x2 = log10(SS_XGBoost_Linear_model$bestTune$eta)  # learning rate (log10 scale)
  x3 = log10(SS_XGBoost_Linear_model$bestTune$alpha) # L1 Regularization (Lasso), log10 scale
  x4 = log10(SS_XGBoost_Linear_model$bestTune$lambda) # L2 Regularization (Ridge), log10 scale
xgb_cv_fun(x1, x2, x3, x4)
## $Score
## [1] -4.439561
## 
## $nrounds
## [1] 424

6.2.2 BOP - Create the Search Grid

bounds = list(x1 = c(100L,900L), #nrounds
              x2 = c(-3, 1), #eta
              x3 = c(-3, 1), #alpha
              x4 = c(-8, 1)) #lambda
kern <- "Matern52"
acq <- "ei"
exp <- c("train.data", "train_control")
pac <- c("xgboost")

6.2.3 BOP - Run the Optimization

#rm(ScoreResult)

BOP_T0 <- Sys.time()
cluster <- makeCluster(detectCores() - 1) # number of cores, leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing

set.seed(0)

ScoreResult <- ParBayesianOptimization::BayesianOptimization(
  FUN = xgb_cv_fun,
  bounds = bounds,
  initPoints = 20,
  bulkNew = 1,
  nIters = 30,
  kern = kern,
  acq = acq,
  export = exp,
  packages = pac,
#  initGrid = data.frame(x1 = c(427, 402, 868), 
#                        x2 = c(0.15960830, -2.8137824, -0.2139339), 
#                        x3 = c(-2.1781916, 0.1079436, -2.3054470), 
#                        x4 = c(-1.45261072, -7.6470485, -0.02187309)),
  parallel = FALSE
)

stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing
BOP_T1 <- Sys.time()
BOP_T1- BOP_T0
## Time difference of 6.053262 mins

6.2.4 BOP - Plot the Results

ScoreResult$ScoreDT
##     Iteration  x1          x2          x3         x4   gpUtility
##  1:         0 428  0.09449758 -1.85097797 -6.2811358 0.000000000
##  2:         0 747 -1.22749786  0.65465105 -5.3586542 0.000000000
##  3:         0 147 -0.96691698 -2.46457878 -0.3338607 0.000000000
##  4:         0 123  0.92408559 -0.13041051 -4.3009890 0.000000000
##  5:         0 529  0.65846628  0.58940410 -1.0848621 0.000000000
##  6:         0 255 -2.70877165 -2.13227525 -4.7124366 0.000000000
##  7:         0 368  0.40912784  0.80634264 -7.2224763 0.000000000
##  8:         0 199 -0.76446765 -0.32928284 -5.8851221 0.000000000
##  9:         0 413 -1.78841359 -2.32257528 -7.5904726 0.000000000
## 10:         0 301 -0.41111795 -0.92861057 -1.7694788 0.000000000
## 11:         0 580  0.26854869 -1.60800628  0.9003451 0.000000000
## 12:         0 285 -2.49640989  0.07674764 -1.5196205 0.000000000
## 13:         0 877 -0.27495093 -1.29073248 -6.8404379 0.000000000
## 14:         0 581 -2.15254703 -2.61493530 -0.7656382 0.000000000
## 15:         0 730 -1.09677522 -0.41662809  0.4928387 0.000000000
## 16:         0 668 -0.03869163 -1.15145340 -3.0696714 0.000000000
## 17:         0 660 -1.53034705 -2.85658183 -5.0725110 0.000000000
## 18:         0 856 -2.22826251  0.26978921 -2.1840351 0.000000000
## 19:         0 488 -2.99311225 -0.68914333 -3.8894689 0.000000000
## 20:         0 809 -1.80580057 -1.45140552 -2.9425683 0.000000000
## 21:         1 839 -2.39108184  0.47461610  0.9438252 0.091117344
## 22:         2 647  0.87468230  1.00000000  1.0000000 0.022057745
## 23:         3 682 -3.00000000 -0.12954081  1.0000000 0.024086698
## 24:         4 695 -3.00000000 -0.46591537  1.0000000 0.005663027
## 25:         5 890 -3.00000000 -0.03959250  1.0000000 0.023826304
## 26:         6 173 -3.00000000 -0.04868371  1.0000000 0.021186875
## 27:         7 793 -3.00000000 -0.06079731  1.0000000 0.018365905
## 28:         8 815 -3.00000000 -0.07124315  1.0000000 0.015619127
## 29:         9 819 -3.00000000 -0.07866706  1.0000000 0.013522510
## 30:        10 819 -3.00000000 -0.07879422  1.0000000 0.012161934
##     Iteration  x1          x2          x3         x4   gpUtility
##     acqOptimum Elapsed     Score nrounds
##  1:      FALSE   12.20 -4.954380     428
##  2:      FALSE    6.71 -4.819648     747
##  3:      FALSE    5.28 -4.818427     147
##  4:      FALSE    4.29 -4.838600     123
##  5:      FALSE    5.43 -4.894194     529
##  6:      FALSE    8.54 -4.958475     255
##  7:      FALSE    4.32 -4.824695     368
##  8:      FALSE    5.20 -4.920022     199
##  9:      FALSE   10.19 -4.960505     413
## 10:      FALSE    8.01 -4.961380     301
## 11:      FALSE   18.74 -4.570489     580
## 12:      FALSE    4.83 -4.903276     285
## 13:      FALSE   11.26 -4.952903     877
## 14:      FALSE   11.85 -4.888112     581
## 15:      FALSE   10.42 -4.525337     730
## 16:      FALSE    9.73 -4.930622     668
## 17:      FALSE   11.73 -4.938374     660
## 18:      FALSE    7.22 -4.842665     856
## 19:      FALSE    7.53 -4.973670     488
## 20:      FALSE   11.63 -4.927271     809
## 21:       TRUE    8.45 -4.456585     839
## 22:       TRUE    6.42 -4.564195     647
## 23:       TRUE   12.21 -4.411621     682
## 24:       TRUE   15.02 -4.446245     695
## 25:       TRUE   12.11 -4.447262     890
## 26:       TRUE    5.95 -4.447645     173
## 27:       TRUE   12.09 -4.442750     793
## 28:       TRUE   11.86 -4.439011     815
## 29:       TRUE   12.57 -4.448807     819
## 30:       TRUE   12.58 -4.448483     819
##     acqOptimum Elapsed     Score nrounds
ScoreResult$BestPars
##     Iteration  x1        x2         x3        x4     Score nrounds
##  1:         0 730 -1.096775 -0.4166281 0.4928387 -4.525337     730
##  2:         1 839 -2.391082  0.4746161 0.9438252 -4.456585     839
##  3:         2 839 -2.391082  0.4746161 0.9438252 -4.456585     839
##  4:         3 682 -3.000000 -0.1295408 1.0000000 -4.411621     682
##  5:         4 682 -3.000000 -0.1295408 1.0000000 -4.411621     682
##  6:         5 682 -3.000000 -0.1295408 1.0000000 -4.411621     682
##  7:         6 682 -3.000000 -0.1295408 1.0000000 -4.411621     682
##  8:         7 682 -3.000000 -0.1295408 1.0000000 -4.411621     682
##  9:         8 682 -3.000000 -0.1295408 1.0000000 -4.411621     682
## 10:         9 682 -3.000000 -0.1295408 1.0000000 -4.411621     682
## 11:        10 682 -3.000000 -0.1295408 1.0000000 -4.411621     682
##     elapsedSecs
##  1:    178 secs
##  2:    195 secs
##  3:    206 secs
##  4:    223 secs
##  5:    243 secs
##  6:    262 secs
##  7:    278 secs
##  8:    299 secs
##  9:    320 secs
## 10:    341 secs
## 11:    361 secs

6.2.5 BOP - Test the Model with Optimized Settings

# test grid with known values
#BOP_XGBoost_Linear_grid <- expand.grid(
#  nrounds = 427,  # number of boosting iterations
#  eta = 10^(0.15960830), # learning rate; a low value is more robust to overfitting
#  alpha = 10^(-2.1787916), # L1 Regularization (Lasso Regression)
#  lambda = 10^(-1.45261072) # L2 Regularization (Ridge Regression)
#)

bestLine <- nrow(ScoreResult$BestPars)

BOP_XGBoost_Linear_grid <- expand.grid(
  nrounds = round(ScoreResult$BestPars$x1[[bestLine]]),  # number of boosting iterations
  eta = 10^ScoreResult$BestPars$x2[[bestLine]], # learning rate, low value means model is more robust to overfitting
  alpha = 10^ScoreResult$BestPars$x3[[bestLine]], # L1 Regularization (Lasso Regression)
  lambda = 10^ScoreResult$BestPars$x4[[bestLine]] # L2 Regularization (Ridge Regression)
)

#xgb_cv_fun(ScoreResult$BestPars$x1[[1]],
#           ScoreResult$BestPars$x2[[1]],
#           ScoreResult$BestPars$x3[[1]],
#           ScoreResult$BestPars$x4[[1]])
tic()
cluster <- makeCluster(detectCores() - 1) # number of cores, convention to leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing

set.seed(1)
# Train model with optimal values
#modelLookup(model = "xgbLinear")
BOP_XGBoost_Linear_model_fin <- caret::train(
  Strength ~., 
  data = train.data, 
  method = "xgbLinear",
  trControl = train_control,
  verbose = F, metric = "RMSE", maximize = FALSE,
  silent = 1,
  tuneGrid = BOP_XGBoost_Linear_grid
)

stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing
toc()
## 17.22 sec elapsed
BOP_XGBoost_Linear_model_fin
## eXtreme Gradient Boosting 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 662, 661, 660, 660, 661, 661, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   4.411621  0.9298282  2.920207
## 
## Tuning parameter 'nrounds' was held constant at a value of 682
## 
## Tuning parameter 'alpha' was held constant at a value of 0.7420945
## 
## Tuning parameter 'eta' was held constant at a value of 0.001
#saveRDS(object = BOP_XGBoost_Linear_model_fin, file = paste0("Models/BOP_XGBoost_Linear_model_fin", stringr::str_remove_all(Sys.time(), pattern = ":"),".rds"))

7 Performance On Test Set

#Custom functions to plot observed, predicted and residual values for each method
library(ggplot2, quietly = T, verbose = F)

# Function to plot observed vs predicted values
predicted_observed_plot <- function(predicted_val, observed_val, residual_val, model_name = "", R_squared, ...) {
  
  plot <- ggplot(mapping = aes(x = predicted_val, y = observed_val, col = abs(residual_val))) +
  geom_point(alpha = 0.9, size = 2) +
  geom_abline(intercept = 0, slope = 1) +
    # facet_wrap(~) +
    labs(title = paste0(model_name, "\nPredicted vs Observed: Test Set"),
         subtitle = paste0("R-squared: ", R_squared),
         x = "Predicted",
         y = "Observed",
         col = "Absolute Deviation") +
  theme_bw() +
  theme(aspect.ratio = 0.9, panel.grid.minor.x = element_blank(), legend.title = element_text(size = 10, face="bold"), legend.text = element_text(size = 9), plot.title = element_text(size=12, face="bold"), axis.title=element_text(size=10, face="bold"), axis.text.x = element_text(angle = 0), legend.position = "none") +
  # scale_x_continuous(expand = c(0,0)) +
  # scale_y_continuous(expand = c(0,0)) + 
  coord_equal() + scale_color_viridis_c(direction = -1)

  return (plot)
}

# Function to plot residuals
residuals_plot <- function(predicted_val, residual_val, model_name = "", MAE, RMSE, ...) {

  plot <- ggplot(mapping = aes(x = predicted_val, y = residual_val, col = abs(residual_val))) +
  geom_point(alpha = 0.9, size = 2) +
  geom_abline(intercept = 0, slope = 0) +
    # facet_wrap(~) +
    labs(
       title = paste0(model_name, "\nResiduals: Test Set"),
       subtitle = paste0("RMSE: ", RMSE, ", MAE: ", round(MAE, 3)),
       x = "Predicted",
       y = "Residual",
       col = "Absolute Deviation"
       ) +
  theme_bw() +
  theme(aspect.ratio = 0.9, panel.grid.minor.x = element_blank(), legend.title = element_text(size = 10, face="bold"), legend.text = element_text(size = 9), plot.title = element_text(size=12, face="bold"), axis.title=element_text(size=10, face="bold"), axis.text.x = element_text(angle = 0), legend.position = "none") +
  # scale_x_continuous(expand = c(0,0)) +
  # scale_y_continuous(expand = c(0,0)) +
  coord_equal() + scale_color_viridis_c(direction = -1)

  return (plot)
}

7.1 SS - Residual Plots

### Sobol Sequence (SS)

# Make predictions on test set
test.data$XGBoost_Linear_SS <- predict(SS_XGBoost_Linear_model_fin, test.data)
# Calculate Residuals on test set
test.data$XGBoost_Linear_SS_residual <- test.data$Strength - test.data$XGBoost_Linear_SS

# Calculate test set R-squared, RMSE, MAE
R_squared <- round(cor(test.data$XGBoost_Linear_SS, test.data$Strength), 4)
RMSE <- signif(RMSE(pred = test.data$XGBoost_Linear_SS, obs = test.data$Strength, na.rm = T), 6)
MAE <- signif(MAE(pred = test.data$XGBoost_Linear_SS, obs = test.data$Strength), 6)

SS_test.data_Statistics <- c(RMSE, MAE, R_squared)

# Plot predicted vs observed values and residuals
XGBoost_Linear_SS_pred_obs <- predicted_observed_plot(predicted_val = test.data$XGBoost_Linear_SS, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_SS_residual, R_squared = R_squared, model_name = "Sobol Sequence")
XGBoost_Linear_SS_residuals <- residuals_plot(predicted_val = test.data$XGBoost_Linear_SS, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_SS_residual, MAE = MAE, RMSE = RMSE, model_name = "Sobol Sequence")
gridExtra::grid.arrange(XGBoost_Linear_SS_pred_obs, XGBoost_Linear_SS_residuals, 
                             ncol = 2)

7.2 BOP - Residual Plots

# Make predictions on test set
test.data$XGBoost_Linear_BOP <- predict(BOP_XGBoost_Linear_model_fin, test.data)
# Calculate Residuals on test set
test.data$XGBoost_Linear_BOP_residual <- test.data$Strength - test.data$XGBoost_Linear_BOP

# Calculate test set R-squared, RMSE, MAE
R_squared <- round(cor(test.data$XGBoost_Linear_BOP, test.data$Strength), 4)
RMSE <- signif(RMSE(pred = test.data$XGBoost_Linear_BOP, obs = test.data$Strength, na.rm = T), 6)
MAE <- signif(MAE(pred = test.data$XGBoost_Linear_BOP, obs = test.data$Strength), 6)

BOP_test.data_Statistics <- c(RMSE, MAE, R_squared)

# Plot predicted vs observed values and residuals
XGBoost_Linear_BOP_pred_obs <- predicted_observed_plot(predicted_val = test.data$XGBoost_Linear_BOP, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_BOP_residual, R_squared = R_squared, model_name = "Parallel Bayesian Optimization")
XGBoost_Linear_BOP_residuals <- residuals_plot(predicted_val = test.data$XGBoost_Linear_BOP, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_BOP_residual, MAE = MAE, RMSE = RMSE, model_name = "Parallel Bayesian Optimization")
gridExtra::grid.arrange(XGBoost_Linear_BOP_pred_obs, XGBoost_Linear_BOP_residuals, 
                             ncol = 2)

8 Summary Table - Training Data

# Create summary table
Summary_Table_Training <- bind_rows(
  SS_XGBoost_Linear_model_fin$results %>% arrange(RMSE) %>% .[1,] %>% select(RMSE, MAE, Rsquared, nrounds, eta, lambda, alpha) %>% round(5),
  BOP_XGBoost_Linear_model_fin$results %>% arrange(RMSE) %>% .[1,] %>% select(RMSE, MAE, Rsquared, nrounds, eta, lambda, alpha) %>% round(5))


Summary_Table_Training <- Summary_Table_Training %>% 
  add_column(Method = c("Sobol Sequence", "Parallel Bayesian Optimization"), .before = 1) %>% 
  add_column(`Processing Time` = round(c(SS_T1-SS_T0, BOP_T1- BOP_T0),0))

# Print table
Summary_Table_Training %>% 
  kable(align = "c", caption = "Training Set Statistics and Hyperparameter Values of XGBoost Models.") %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = T, position = "center") %>%
  kableExtra::footnote(general = paste0("Note:\nSummary statistics obtained using ", CV_folds, "-fold cross validation repeated ", CV_repeats, " times.","\nGrid and Random Search  performed with adaptive resampling."), general_title = "\n ")
## Warning in kableExtra::kable_styling(., bootstrap_options = c("striped", :
## Please specify format in kable. kableExtra can customize either HTML or
## LaTeX outputs. See https://haozhu233.github.io/kableExtra/ for details.
## Warning in kableExtra::footnote(., general = paste0("Note:\nSummary
## statistics obtained using ", : Please specify format in kable.
## kableExtra can customize either HTML or LaTeX outputs. See https://
## haozhu233.github.io/kableExtra/ for details.
Training Set Statistics and Hyperparameter Values of XGBoost Models.

| Method | RMSE | MAE | Rsquared | nrounds | eta | lambda | alpha | Processing Time |
|--------|------|-----|----------|---------|-----|--------|-------|-----------------|
| Sobol Sequence | 4.43956 | 2.93333 | 0.92889 | 424 | 7.21688 | 3.89845 | 0.02695 | 17 mins |
| Parallel Bayesian Optimization | 4.41162 | 2.92021 | 0.92983 | 682 | 0.00100 | 10.00000 | 0.74209 | 6 mins |

9 Summary Table - Test Data

# Create summary table
Summary_Table_Test <- rbind(
  SS_test.data_Statistics,
  BOP_test.data_Statistics,
  deparse.level = 0 
  ) %>% data.frame()

colnames(Summary_Table_Test) <- c("RMSE", "MAE", "R-squared")
#Summary_Table_Test <- Summary_Table_Test %>% add_column(Method = c("Grid Search", "Random Search", "Genetic Algorithm", "Sobol Sequence", "Parallel Bayesian Optimization"), .before = 1)

Summary_Table_Test <- Summary_Table_Test %>% add_column(Method = c("Sobol Sequence", "Parallel Bayesian Optimization"), .before = 1)


# Print table
Summary_Table_Test %>% 
  kable(align = "c", caption = "Test Set Statistics of XGBoost Models.") %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = T, position = "center") %>%
  kableExtra::footnote(general = paste0(""), general_title = "\n ")
Test Set Statistics of XGBoost Models.

| Method | RMSE | MAE | R-squared |
|--------|------|-----|-----------|
| Sobol Sequence | 4.61469 | 2.69963 | 0.9636 |
| Parallel Bayesian Optimization | 4.41466 | 2.60606 | 0.9668 |

10 Summary

The Bayesian optimization approach, primed with an initial set of evaluated points, is superior to the Sobol-sequence quasi-random search when the search budget matters: it reached a slightly lower RMSE on both the training resamples and the test set in roughly a third of the wall-clock time.