1 How Accurate Do I Have To Be?

You are at strategy meeting to solve a business problem. The crux of the situation is the business owners, board members, stakeholders, customers, process owners, special interest groups, and concerned parents, need to know how exactly to produce the product, output, or service that is the best solution.

This is an intensely complex system with plenty of “magic bullet” solutions in the ditch behind you.

You, as the data guru are faced with this situation.

You have at your disposal reality bending algorithms and mind-reading social media troll farms.

Wow!

How accurate do you have to be?

2 Describe a Complex System

What will the output be?

What will the output be?

Industrial processes, biological and chemical systems are very complex owing to the large number of factors and they interact with each other to produce the final result. As complex as the systems are, the predictive models are equally complex. Not only do they have a large number of unknown parameters, but the algorithms themselves have a large number of unknown hyperparameters. These hyperparameters can dramatically increase model performance and improve training times. Tuning can be a complex and expensive process.

Tuning hyperparameters of a deep neural network, probe drilling for oil at given geographic coordinates or evaluating the effectiveness of a drug candidate taken from a chemical search space then it is important to minimize the number of samples drawn from the black box function

Parameters Configurations and values your machine learning algorithm learns from your data during training. These parameters can interact to create complex, non-linear systems.

Hyperparameters Express high level concepts, such as statistical assumptions( eg. regularization), are fixed before training or are hard to learn from data (eg. neural network architecture), and influence the objective, test time performance, and computational cost.

For this vignette we are going to use data from a very complex, non-linear system. The research paper, Modeling of strength of high performance concrete using artificial neural networks by I-Cheng Yeh published in Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998) is our choice.

From the paper, we know that the strength of High Performance Concrete follows a very non-linear relationship with the predictor variables and is poorly modeled by normal regression methods.

A strength model based on ANN is more accurate than a model based on regression analysis

The dataset contains 1030, examples and the following features:

This vignette is based on the RPubs article by Jean Dos Santos. It covers:

3 Load Packages

Package management is an integral part of an analytic and modeling project and I use the pacman package to handle installation and loading. There are a lot of tools associated with this project and I have arranged them in a logical sequence. In the manner of reproducible research, I go back and add to this section to add packages as I proceed through the project. In the spirit of The Lean Startup, this script is always a minimally viable product.

#rm(list = ls())

pacman::p_load(
  # text formatting
  knitr, #A General-Purpose Package for Dynamic Report Generation in R
  kableExtra, #Construct Complex Table with 'kable' and Pipe Syntax

  # reading and manipulating data
  data.table, # very fast reading, writing and manipulation of large datasets
  tidyverse, # The tidy verse collection of data manipulation and graphics packages (Hadley Wickham)
  tidylog, # reporting of tidyr functions
  
  # statistical manipulation
  MASS, # "Modern Applied Statistics with S" (4th edition, 2002)
  
  # model definition and management              
  caret,  # Classification and Regression Training (Max Kuhn)
  
  # EDA and visualization
  plotly, # Create Interactive Web Graphics via 'plotly.js'
  GGally, # Extension to 'ggplot2'
  gridExtra, # Miscellaneous Functions for "Grid" Graphics
  ggthemes, # Extra Themes, Scales and Geoms for 'ggplot2'
  corrplot, # Visualization of a Correlation Matrix
  qqplotr, # Quantile-Quantile Plot Extensions for 'ggplot2'
  umap, # Uniform Manifold Approximation and Projection
  factoextra, # Extract and Visualize the Results of Multivariate Data Analyses
  ggpubr, # 'ggplot2' Based Publication Ready Plots
  funModeling, # EDA and rapid assessment of input data
  DataExplorer, # EDA and visualization similar to funModeling
  
  # ML and optimization routines
  pls, # Partial Least Squares and Principal Component Regression
  xgboost, # Extreme Gradient Boosting
    pso, # Particle Swarm Optimization
  GA, # Genetic Algorithms
  DEoptim, # Global Optimization by Differential Evolution
  
  # Bayesian optimization suite
  smoof,  #Single and Multi-Objective Optimization Test Functions
  mlr, # Machine Learning in R
  mlrMBO, # Bayesian Optimization and Model-Based Optimization of Expensive Black-Box Functions
  mlrHyperopt, # Hyperparameter optimization with mlr and mlrMBO
  ParamHelpers, # Functions for parameter descriptions and operations in black-box optimization, tuning and machine learning. Parameters can be described (type, constraints, defaults, etc.), combined to parameter sets and can in general be programmed on. A useful OptPath object (archive) to log function evaluations is also provided.

  # Caret based Bayesian optimization
  rBayesianOptimization, # update is expected in early 2019
  
  # parallel processing and performance
  tictoc, # Functions for timing R scripts, as well as implementations of Stack and List structures
  parallel, # Support for parallel computation, including random-number generation
  doParallel, # Foreach Parallel Adaptor for the 'parallel' Package
  
  # tools for spacial data
  fields, # Tools for Spatial Data
  
  # tools for explainable models
  DALEX2, # Machine Learning (ML) models are widely used and have various applications in classification or regression. Models created with boosting, bagging, stacking or similar techniques are often used due to their high performance, but such black-box models usually lack of interpretability.
  # DALEX package contains various explainers that help to understand the link between input variables and model output. The single_variable() explainer extracts conditional response of a model as a function of a single selected variable. It is a wrapper over packages 'pdp' (Greenwell 2017) <doi:10.32614/RJ-2017-016>, 'ALEPlot' (Apley 2018) <arXiv:1612.08468>  and 'factorMerger' (Sitko and Biecek 2017) <arXiv:1709.04412>.
  # The single_prediction() explainer attributes parts of a model prediction to particular variables used in the model. It is a wrapper over 'breakDown' package (Staniak and Biecek 2018) <doi:10.32614/RJ-2018-072>. The variable_dropout() explainer calculates variable importance scores based on variable shuffling (Fisher at al. 2018) <arXiv:1801.01489>. All these explainers can be plotted with generic plot() function and compared across different models.
  modelStudio, #Interactive Studio with explanations for ML Predictive models

  # if not installed, load and install
  install = TRUE)
library("tidylog", warn.conflicts = FALSE)

3.1 … and Functions For Visualizations

Voilin plots are pretty cool and the default palettes often do not have enough information to generate a palette if your variable has more than about 12 ‘bins’ so we do a bit of RColorBrewer palette tuning.

4 Import the Data

The dataset can be downloaded through the UCI Machine learning Repository, and reformatted in Excel as a csv file in the data directory. Excel creates far too many problems with automatic date, time, and number conversions for me to trust it.

Observations: 1,030
Variables: 9
$ Cement           <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 266.0, 380.0, 380.0, 266.0, 475.0, 198.6, 198....
$ Slag             <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 114.0, 95.0, 95.0, 114.0, 0.0, 132.4, 132.4, 47.5,...
$ Ash              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ Water            <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 228, 228, 192, 192, 228, 228, 228, 228, 1...
$ Superplasticizer <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0...
$ Coarse_Aggregate <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932.0, 932.0, 932.0, 978.4, 97...
$ Fine_Aggregate   <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 670.0, 594.0, 594.0, 670.0, 594.0, 825.5, 825....
$ Age              <int> 28, 28, 270, 365, 360, 90, 365, 28, 28, 28, 90, 28, 270, 90, 28, 90, 90, 365, 90,...
$ Strength         <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 47.03, 43.70, 36.45, 45.85, 39.29, 38.07, 28.0...

The data shows the data has 1030 observations with 9 continuous target and predictors.

5 Pre-Processing and Exploratory Data Analysis

5.1 Missing Values, Complete Cases, Near Zero Variances, and Similar Traps

This is an overview of number of zero, missing, and unique values.

[1] "Complete Cases"
   Mode    TRUE 
logical    1030 

All values are continuous variables, and data are complete: imputation is not required. Arrange columns so target is in column 1, just…, because I like it. There are no non-zero variance predictors or target. There is no need to delete non-zero variance predictors.

5.2 Density Kernels

I have a look for strikingly non-normal distributions and outliers. This is a prelude to scaling.

First the response variable,

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.33   23.71   34.45   35.82   46.13   82.60 

Now the predictors

gather: reorganized (Cement, Slag, Ash, Water, Superplasticizer, 㠼㸵) into (Component) [was 1030x9, now 8240x2]
summarise_all: now 8 rows and 5 columns, ungrouped

There is some clustering of predictors, especially age, and there appear be no clear outliers or related problems.

Slag, ash and Superplasticizer show a large proportion of zeros and skew the approximately unimodel distributions. Age shows a very non-normal distribution.

5.4 Take the Logarithm of the Age Predictor for Clarity

The logarithm of Age makes the partial least squares regression(PLS) and principal components regression(PCR) behave much better. Typical values for RMSE have decreased from 11-12 to 7.7-8.2 following transformation of the Age parameter.

5.5 Check for Correlated Predictors and Plot the Correlogram

Historical data can sometimes show clustering that creates problems in modeling. Strong correlations can also be used to identify clusters.

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.65753 -0.25924 -0.09367 -0.06380  0.08243  0.55218 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
                                                

Even when retaining correlations at a significance of 0.05, there are none big enough to create problems with later analysis. There are some indications that water content and superplastizer are important. No correlations between the target variable and any of the predictor variables jump out as large. This will take some effort during modeling.

5.6 Linear Dependencies

This is to check if there are any clear linear dependencies in the data. This is related to the correlation tested above. There are none: the relationship between variables is more complex.

             Length Class  Mode
linearCombos 0      -none- list
remove       0      -none- NULL

Interactions between variables can be investigated using a parallel coordinates plot.

5.7 Z-Scale All Data

Now that the data have been screened for possible problems, we can look at the distributions of each predictor and target to get a feeling for potential clusters and relationships. We will do this using the preprocess function in caret. The scale function will scale all variables to their z values.

5.8 Edited, Screened and Transformed Data Ready For Analysis

gather: reorganized (Cement, Slag, Ash, Water, Superplasticizer, 㠼㸵) into (Component, Value) [was 1030x9, now 8240x3]

The data should look approximately on the same scale with approximately similar variances for regression models. This is not a requirement for XGBoost or LightGBM.

6 Check For Potential Clustering

Use UMAP to look for potential clustering in the predictor variables. This is to investigate whether the predictor data is diverse and extensive. Any clustering may show physical or convenience based restrictions governing the experimental data. There appear to be some clusters with unknown origin. The mid and lower right hand side clusters seem to show a preference for concrete with strong and weak strength.

failed creating initial embedding; using random embedding insteadx

7 Some PCA to Assess Combinations of Variables

There’s a few pretty good reasons to use PCA. We actually capture 66.3% of variance in the dataset by just using three principal components. Since this is purely introductory I’ll skip the math and give you a quick rundown of the workings of PCA:

Importance of components:
                          PC1    PC2    PC3    PC4    PC5     PC6     PC7
Standard deviation     1.4761 1.1862 1.0394 0.9853 0.9203 0.61604 0.36952
Proportion of Variance 0.3113 0.2010 0.1543 0.1387 0.1210 0.05422 0.01951
Cumulative Proportion  0.3113 0.5123 0.6666 0.8053 0.9263 0.98049 1.00000

The scree plot shows there are no “magic” combinations of variables that will solve our problem - keep digging.

The Variables - PCA plot indicates that Age is relatively unimportant in the first two PCA dimensions.

8 Generate Train and Test Set

We start with the split of data into test and train. Since there is no time dependence of consecutive observations, we can used random sampling to construct train/test and folds within the train set.

Look at this blog post

9 Training Parameters(repeated_cv and adaptive_cv)

10 The Response Surface for Cement

I like to get an idea about the non-linearity of the problem before going too much in depth. Complexities will show up when:

Let’s plot strength as a function of the first three PCA components to get an idea of the non-linearity of the strength of concrete. It has a number of peaks showing the non-linearity of the systems. This means that a simple regression will not capture the response surface well. We also have to accept random variation in the experimental setup - the data does not come from a replicated experiment.

11 Baseline Principal Component Regression Model with Train/Test Split

Data:   X dimension: 826 8 
    Y dimension: 826 1
Fit method: svdpc
Number of components considered: 7
TRAINING: % variance explained
          1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
X         26.9591  44.1239    60.79    74.30    86.84    97.36    99.61
.outcome   0.7167   0.8946    28.46    54.25    62.84    75.37    80.30

12 Baseline Partial Least Squares Model with Train/Test Split

Data:   X dimension: 826 8 
    Y dimension: 826 1
Fit method: oscorespls
Number of components considered: 7
TRAINING: % variance explained
          1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
X           15.14    34.75    52.21    58.98    69.71    74.49    83.72
.outcome    74.52    77.66    79.07    80.93    81.27    82.03    82.37

13 XGBoost for Regression

PLS and PCA regressions do not require hyperparameters. In order to get better results from more advanced algorithms, each method needs to be run with the optimum set of hyperparameters. There is no general way to determine these before conducting analysis.

The analysis so far indicates that each variable is pretty important for the final prediction, but the linear models only get just so far with an RMSE of about 7.7-8.2, and Rsquare of 0.773-0.798. XGBoost was chosen for further modeling based on perfomance and speed and its structure allowing a fit to non-linear and complex problems.

XGBoost stands for eXtreme Gradient Boosting. The name xgboost, though, actually refers to the engineering goal to push the limit of computations resources for boosted tree algorithms. Which is the reason why many people use xgboost. - Tianqi Chen

We will used the xgbLinear model, while it is called a regression and contains the word, “linear”, a regression tree is a nonlinear mode: It is really a “regularized linear model” using delta with elastic net regularization (L1 + L2 + L2 bias) and parallel coordinate descent optimization. It is also an ensemble learner that results in a non-linear transfer function.

Trees

Trees

An XGBoost model will allow the determination of interactions between variables. The model, however, will require the determination of hyperparameters, defined as any parameter not determined from the data. This means user defined assumptions such as regularization, scaling, transformation, feature engineering, neural net architecture, number of epochs, and learning rate (eta). Individual algorithms have a set of hyperparameters commonly used for tuning. For XGBoost, we will be changing:

13.2 RS - Random Search Using Random Adaptive Sampling

We are going to see if we can reach similar results to the grid search by choosing the same number of configurations, but chosen randomly.

13.2.2 RS - Summarize the Optimization

13.2.4 RS - Investigate Interactions Between Hyperparameter Values

Of course, we see that the the hyperspace of sampling is larger than the grid search, but the algorithm hones in on the best areas in hyperspace. We are seeing, in general, higher values of lambda, alpha and eta are better and that results don’t change much after about 200 rounds. This took 1102.33 sec to reach this solution by evaluating 672 models.

13.2.5 RS - Test The Model with Optimized RS Settings

6.33 sec elapsed
eXtreme Gradient Boosting 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 662, 661, 660, 660, 661, 661, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  4.686568  0.9214965  3.176164

Tuning parameter 'nrounds' was held constant at a value of 64
Tuning parameter 'lambda' was held constant at
 a value of 0.3400763
Tuning parameter 'alpha' was held constant at a value of 0.0004462638
Tuning
 parameter 'eta' was held constant at a value of 2.104762

13.3 GA - Genetic Algorithm

Since this is an optimization problem, we can use optimization techniques to assess and refine the best combinations of hyperparemeters. This approach has two requirements:

  • a function to evaluate the score for strategies, called the fitness
  • the minimum and maximum values allowable for each hyperparameter
  • the population size

Good article here from the R-bloggers mailing list. There is a vignette here by Luca Scrucca.

Have a look at this article

13.3.2 GA - Run The Optimization and Save the Model

GA | iter = 1 | Mean = -4.832574 | Best = -4.595089
GA | iter = 2 | Mean = -4.738141 | Best = -4.595089
GA | iter = 3 | Mean = -4.706129 | Best = -4.573569
GA | iter = 4 | Mean = -4.612675 | Best = -4.548229
GA | iter = 5 | Mean = -4.651435 | Best = -4.548229
GA | iter = 6 | Mean = -4.710580 | Best = -4.548229
GA | iter = 7 | Mean = -4.658474 | Best = -4.548229
GA | iter = 8 | Mean = -4.623334 | Best = -4.512393
GA | iter = 9 | Mean = -4.649231 | Best = -4.502360
GA | iter = 10 | Mean = -4.666363 | Best = -4.502360
Time difference of 10.13395 mins
eXtreme Gradient Boosting 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 662, 661, 660, 660, 661, 661, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  4.672078  0.9212513  3.080341

Tuning parameter 'nrounds' was held constant at a value of 464
Tuning parameter 'lambda' was held constant
 at a value of 1.195233
Tuning parameter 'alpha' was held constant at a value of 0.3587536
Tuning
 parameter 'eta' was held constant at a value of 0.593926

13.3.3 GA - Summarize the Optimization

-- Genetic Algorithm ------------------- 

GA settings: 
Type                  =  real-valued 
Population size       =  10 
Number of generations =  10 
Elitism               =  0.3 
Crossover probability =  0.8 
Mutation probability  =  0.5 
Search domain = 
        x1 x2 x3 x4
lower   10 -5 -3 -3
upper 1000  3  1  1

GA results: 
Iterations             = 10 
Fitness function value = -4.50236 
Solution = 
          x1         x2         x3         x4
[1,] 464.327 -0.2262676 -0.4452038 0.07745264

13.4 DE - Differential Evolution

The DEoptim package will be used to optimize the hyperparameters of the XGBoost model using differential evolution. In the same manner as the genetic algorithm above, we first define a custom objective function to be minimized: this will be the cross-validated RMSE.

13.4.2 DE - Run The Optimization

Run the model (1344.78 sec elapsed)

Iteration: 1 bestvalit: 4.420695 bestmemit:  272.853577    1.567570   -1.261362    0.967354
Iteration: 2 bestvalit: 4.420695 bestmemit:  272.853577    1.567570   -1.261362    0.967354
Iteration: 3 bestvalit: 4.420695 bestmemit:  272.853577    1.567570   -1.261362    0.967354
Iteration: 4 bestvalit: 4.420695 bestmemit:  272.853577    1.567570   -1.261362    0.967354
Iteration: 5 bestvalit: 4.420695 bestmemit:  272.853577    1.567570   -1.261362    0.967354
Iteration: 6 bestvalit: 4.396566 bestmemit:  551.286888    1.075946   -0.840488    0.776022
Iteration: 7 bestvalit: 4.345567 bestmemit:  459.508884   -0.069906    0.246125    0.560392
Iteration: 8 bestvalit: 4.345567 bestmemit:  459.508884   -0.069906    0.246125    0.560392
Iteration: 9 bestvalit: 4.345567 bestmemit:  459.508884   -0.069906    0.246125    0.560392
Iteration: 10 bestvalit: 4.345567 bestmemit:  459.508884   -0.069906    0.246125    0.560392
Time difference of 55.42633 mins

13.4.4 DE - Test The Model with Optimized DE Settings

11.12 sec elapsed
eXtreme Gradient Boosting 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 662, 661, 660, 660, 661, 661, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  4.482818  0.9273562  2.982487

Tuning parameter 'nrounds' was held constant at a value of 460
Tuning parameter 'lambda' was held constant
 at a value of 3.634057
Tuning parameter 'alpha' was held constant at a value of 1.762483
Tuning
 parameter 'eta' was held constant at a value of 0.851323

13.5 PSO - Particle Swarm Optimization

We will use the pso package to optimize the four hyperparameters. Just like the Genetic Algorithm and Differential Evolution ww will define a function to be optimized.

13.5.2 PSO - Run The Optimization

We can now use the pso::psoptim() function to run the search

S=10, K=3, p=0.271, w0=0.7213, w1=0.7213, c.p=1.193, c.g=1.193
v.max=NA, d=990, vectorize=TRUE, hybrid=off
It 1: fitness=4.47
It 2: fitness=4.47
It 3: fitness=4.422
It 4: fitness=4.422
It 5: fitness=4.422
It 6: fitness=4.422
It 7: fitness=4.422
It 8: fitness=4.422
It 9: fitness=4.422
It 10: fitness=4.422
Maximal number of iterations reached
Time difference of 11.49972 mins

13.5.4 PSO - Test The Model with Optimized PSO Settings

12.21 sec elapsed
eXtreme Gradient Boosting 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 662, 661, 660, 660, 661, 661, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  4.705027  0.9204373  3.129531

Tuning parameter 'nrounds' was held constant at a value of 682
Tuning parameter 'lambda' was held constant
 at a value of 1
Tuning parameter 'alpha' was held constant at a value of 1
Tuning parameter 'eta' was
 held constant at a value of 2.016966

13.6 BO - Bayesian Optimization

Bayesian optimization is an approach to optimizing objective functions that take a long time (minutes or hours) to evaluate. It is best-suited for optimization over continuous domains of less than 20 dimensions, and tolerates stochastic noise in function evaluations. It builds a surrogate for the objective and quantifies the uncertainty in that surrogate using a Bayesian machine learning technique, Gaussian process regression, and then uses an acquisition function defined from this surrogate to decide where to sample.

Optimization by Bayesian methods spans volumes, this will hit only the high points to get the job done. The methods here are very flexible and can be applied to a variety of ML algorithms. The schemes fit into two toolset camps.

The mlr Toolset

The mlr camp has a large suite of related tools for defining, running, tuning and evaluating ML models.

The caret Toolset

The caret camp follows the model definition and logic used thus far, vide supra.

  • The caret tools.

As before, define a function to be optimized. I’m going to use the same definition as the one used for the particle swarm optimization and modify it to return the RMSE and the cross-validated preditions, as required by rBayesianOptimization.

13.6.3 BO - Run the optimization

20 points in hyperparameter space were pre-sampled
elapsed = 8.41  Round = 21  x1 = 273.0000   x2 = -3.3029    x3 = 0.1317 x4 = -1.6704    Value = -4.6980 
elapsed = 8.59  Round = 22  x1 = 378.0000   x2 = 0.2134 x3 = -0.7879    x4 = -0.3965    Value = -4.7937 
elapsed = 9.33  Round = 23  x1 = 577.0000   x2 = -3.9956    x3 = -0.8811    x4 = -1.9679    Value = -4.8322 
elapsed = 9.00  Round = 24  x1 = 909.0000   x2 = -2.8622    x3 = 0.1574 x4 = -1.0858    Value = -4.5641 
elapsed = 7.69  Round = 25  x1 = 210.0000   x2 = -1.9111    x3 = -2.9067    x4 = 0.0652 Value = -4.5660 
elapsed = 11.56 Round = 26  x1 = 899.0000   x2 = -4.8929    x3 = -1.0911    x4 = -2.6630    Value = -4.8647 
elapsed = 10.11 Round = 27  x1 = 945.0000   x2 = -1.9409    x3 = -0.0707    x4 = 0.5013 Value = -4.7460 
elapsed = 8.73  Round = 28  x1 = 664.0000   x2 = 1.9575 x3 = -0.2291    x4 = -1.6437    Value = -4.7064 
elapsed = 13.25 Round = 29  x1 = 633.0000   x2 = -2.2772    x3 = -1.0895    x4 = 0.3578 Value = -4.6366 
elapsed = 3.05  Round = 30  x1 = 71.0000    x2 = -1.1434    x3 = 0.4448 x4 = -1.6133    Value = -4.8764 
elapsed = 8.69  Round = 31  x1 = 214.0000   x2 = -0.2035    x3 = -1.2476    x4 = -1.6649    Value = -4.7625 
elapsed = 6.97  Round = 32  x1 = 185.0000   x2 = -1.0517    x3 = -2.0208    x4 = -1.0946    Value = -4.7394 
elapsed = 22.92 Round = 33  x1 = 690.0000   x2 = -3.5103    x3 = -2.7173    x4 = 0.5688 Value = -4.6244 
elapsed = 13.59 Round = 34  x1 = 390.0000   x2 = 1.6190 x3 = -2.6021    x4 = 0.4574 Value = -4.6105 
elapsed = 12.91 Round = 35  x1 = 772.0000   x2 = 0.3477 x3 = -1.7349    x4 = -1.4400    Value = -4.9973 
elapsed = 11.37 Round = 36  x1 = 503.0000   x2 = 1.3539 x3 = -0.9255    x4 = 0.1093 Value = -4.7277 
elapsed = 12.64 Round = 37  x1 = 720.0000   x2 = -4.1365    x3 = -0.3520    x4 = 0.8425 Value = -4.4594 
elapsed = 12.56 Round = 38  x1 = 992.0000   x2 = 0.7897 x3 = -1.3727    x4 = -1.2614    Value = -4.7657 
elapsed = 5.26  Round = 39  x1 = 386.0000   x2 = -1.7098    x3 = 0.6515 x4 = -0.1499    Value = -4.6417 
elapsed = 12.58 Round = 40  x1 = 780.0000   x2 = 1.5676 x3 = -1.8256    x4 = -1.4000    Value = -4.8411 
elapsed = 12.35 Round = 41  x1 = 935.0000   x2 = 0.1765 x3 = -1.1637    x4 = -1.6986    Value = -4.7265 
elapsed = 14.83 Round = 42  x1 = 909.0000   x2 = -3.3086    x3 = -0.3244    x4 = 0.8438 Value = -4.5552 
elapsed = 22.93 Round = 43  x1 = 773.0000   x2 = -3.6624    x3 = -2.6182    x4 = 0.8686 Value = -4.4360 
elapsed = 8.05  Round = 44  x1 = 626.0000   x2 = -3.8004    x3 = 0.1738 x4 = 0.8758 Value = -4.5831 
elapsed = 15.03 Round = 45  x1 = 681.0000   x2 = 2.5676 x3 = -2.5975    x4 = 0.2275 Value = -4.5831 
elapsed = 19.20 Round = 46  x1 = 864.0000   x2 = -4.1074    x3 = -2.6283    x4 = 0.5991 Value = -4.5171 
elapsed = 11.19 Round = 47  x1 = 797.0000   x2 = -4.3500    x3 = -0.2751    x4 = 0.8402 Value = -4.4480 
elapsed = 21.28 Round = 48  x1 = 860.0000   x2 = -0.6660    x3 = -2.9421    x4 = 0.6450 Value = -4.4497 
elapsed = 9.12  Round = 49  x1 = 699.0000   x2 = -0.3703    x3 = -0.2847    x4 = 0.4009 Value = -4.7358 
elapsed = 12.02 Round = 50  x1 = 980.0000   x2 = 1.9156 x3 = -0.2964    x4 = 0.6458 Value = -4.6951 
elapsed = 12.30 Round = 51  x1 = 773.0000   x2 = -3.7512    x3 = -0.3155    x4 = 1.0000 Value = -4.4232 
elapsed = 20.27 Round = 52  x1 = 752.0000   x2 = -2.6248    x3 = -2.5501    x4 = 0.7032 Value = -4.5145 
elapsed = 7.36  Round = 53  x1 = 854.0000   x2 = -3.2271    x3 = 0.9393 x4 = 0.9625 Value = -4.5646 
elapsed = 20.43 Round = 54  x1 = 694.0000   x2 = -2.1185    x3 = -2.1646    x4 = 0.9736 Value = -4.5319 
elapsed = 24.84 Round = 55  x1 = 872.0000   x2 = -0.1053    x3 = -2.8603    x4 = 0.7957 Value = -4.5019 
elapsed = 23.84 Round = 56  x1 = 843.0000   x2 = -0.9169    x3 = -2.1367    x4 = 0.7053 Value = -4.5409 
elapsed = 25.31 Round = 57  x1 = 823.0000   x2 = -2.1762    x3 = -2.9400    x4 = 1.0000 Value = -4.5146 
elapsed = 8.22  Round = 58  x1 = 775.0000   x2 = 2.5669 x3 = 0.5518 x4 = 0.6411 Value = -4.5335 
elapsed = 7.74  Round = 59  x1 = 701.0000   x2 = -3.2843    x3 = 0.7942 x4 = 1.0000 Value = -4.7025 
elapsed = 25.02 Round = 60  x1 = 960.0000   x2 = -3.4291    x3 = -2.9635    x4 = 0.7953 Value = -4.5574 
elapsed = 27.01 Round = 61  x1 = 868.0000   x2 = 2.5777 x3 = -2.5853    x4 = 1.0000 Value = -4.4351 
elapsed = 27.93 Round = 62  x1 = 858.0000   x2 = -1.1272    x3 = -2.6916    x4 = 1.0000 Value = -4.4897 
elapsed = 11.67 Round = 63  x1 = 686.0000   x2 = 0.7470 x3 = -0.2481    x4 = 1.0000 Value = -4.4904 
elapsed = 23.33 Round = 64  x1 = 775.0000   x2 = -3.9130    x3 = -2.5404    x4 = 1.0000 Value = -4.5202 
elapsed = 20.86 Round = 65  x1 = 735.0000   x2 = 0.2635 x3 = -2.9401    x4 = 0.6138 Value = -4.5449 
elapsed = 19.14 Round = 66  x1 = 700.0000   x2 = -2.8481    x3 = -2.0700    x4 = 0.6463 Value = -4.5600 
elapsed = 14.23 Round = 67  x1 = 781.0000   x2 = -3.6805    x3 = -0.4549    x4 = 0.9725 Value = -4.4834 
elapsed = 18.33 Round = 68  x1 = 884.0000   x2 = -3.9917    x3 = -2.9099    x4 = 0.2773 Value = -4.6843 
elapsed = 25.08 Round = 69  x1 = 793.0000   x2 = -2.9913    x3 = -2.0464    x4 = 1.0000 Value = -4.5901 
elapsed = 28.95 Round = 70  x1 = 970.0000   x2 = 2.4240 x3 = -2.5725    x4 = 0.8547 Value = -4.5453 
elapsed = 13.84 Round = 71  x1 = 655.0000   x2 = -2.5102    x3 = -0.4337    x4 = 1.0000 Value = -4.5129 

 Best Parameters Found: 
Round = 51  x1 = 773.0000   x2 = -3.7512    x3 = -0.3155    x4 = 1.0000 Value = -4.4232 
Time difference of 22.43552 mins

13.6.5 BO - Test The Model with Optimized BO Settings

18.36 sec elapsed
eXtreme Gradient Boosting 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 662, 661, 660, 660, 661, 661, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  4.457225  0.9284931  2.939831

Tuning parameter 'nrounds' was held constant at a value of 773
Tuning parameter 'lambda' was held constant
 at a value of 10
Tuning parameter 'alpha' was held constant at a value of 0.4835916
Tuning parameter
 'eta' was held constant at a value of 0.0001773221

14 Summary of All Training Runs

# Create summary table
Summary_Table_Training <- bind_rows(
  GS_XGBoost_Linear_model_fin$results %>% arrange(RMSE) %>% .[1,] %>% select(RMSE, MAE, Rsquared, nrounds, eta, lambda, alpha) %>% round(5),
  RS_XGBoost_Linear_model_fin$results %>% arrange(RMSE) %>% .[1,] %>% select(RMSE, MAE, Rsquared, nrounds, eta, lambda, alpha) %>% round(5),
  GA_XGBoost_Linear_model_fin$results %>% arrange(RMSE) %>% .[1,] %>% select(RMSE, MAE, Rsquared, nrounds, eta, lambda, alpha) %>% round(5),
  DE_XGBoost_Linear_model_fin$results %>% arrange(RMSE) %>% .[1,] %>% select(RMSE, MAE, Rsquared, nrounds, eta, lambda, alpha) %>% round(5),
  PSO_XGBoost_Linear_model_fin$results %>% arrange(RMSE) %>% .[1,] %>% select(RMSE, MAE, Rsquared, nrounds, eta, lambda, alpha) %>% round(5),
  BO_XGBoost_Linear_model_fin$results %>% arrange(RMSE) %>% .[1,] %>% select(RMSE, MAE, Rsquared, nrounds, eta, lambda, alpha) %>% round(5))

Summary_Table_Training <- Summary_Table_Training %>% 
  add_column(Method = c("Grid Search", "Adaptive Random Search", "Genetic Algorithm", "Differential Evolution", "Particle Swarm Optimization", "Bayesian Optimization"), .before = 1) %>% 
  add_column(`Processing Time` = round(c(GS_T1-GS_T0, RS_T1-RS_T0, GA_T1-GA_T0, DE_T1-DE_T0, PSO_T1-PSO_T0, BO_T1- BO_T0),0))

#Summary_Table_Training <- Summary_Table_Training %>% 
#  add_column(Method = c("Grid Search", "Random Search", "Genetic Algorithm", "Differential Evolution", "Particle #Swarm Optimization"), .before = 1)

# Print table
Summary_Table_Training %>% 
  kable(align = "c", caption = "Training Set Statistics and Hyperparameter Values of XGBoost Models.") %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = T, position = "center") %>%
  kableExtra::footnote(general = paste0("Note:\nSummary statistics obtained using ", CV_folds, "-fold cross validation repeated ", CV_repeats, " times.","\nGrid and Random Search  performed with adaptive resampling."), general_title = "\n ")
Training Set Statistics and Hyperparameter Values of XGBoost Models.
Method RMSE MAE Rsquared nrounds eta lambda alpha Processing Time
Grid Search 4.62847 3.06958 0.92303 150 0.03000 0.90000 1.00000 14 mins
Adaptive Random Search 4.68657 3.17616 0.92150 64 2.10476 0.34008 0.00045 5 mins
Genetic Algorithm 4.67208 3.08034 0.92125 464 0.59393 1.19523 0.35875 10 mins
Differential Evolution 4.48896 2.91747 0.92742 386 269.92122 8.71728 0.47104 55 mins
Particle Swarm Optimization 4.70503 3.12953 0.92044 682 2.01697 1.00000 1.00000 11 mins
Bayesian Optimization 4.45722 2.93983 0.92849 773 0.00018 10.00000 0.48359 22 mins

Note:
Summary statistics obtained using 5-fold cross validation repeated 3 times.
Grid and Random Search performed with adaptive resampling.

15 Performance On Test Set

#Custom functions to plot observed, predicted and residual values for each method
library(ggplot2, quietly = T, verbose = F)

# Function to plot observed vs predicted values
predicted_observed_plot <- function(predicted_val, observed_val, residual_val, model_name = "", R_squared, ...) {
  
  plot <- ggplot(mapping = aes(x = predicted_val, y = observed_val, col = abs(residual_val))) +
  geom_point(alpha = 0.9, size = 2) +
  geom_abline(intercept = 0, slope = 1) +
    # facet_wrap(~) +
    labs(title = paste0(model_name, "\nPredicted vs Observed: Test Set"),
         subtitle = paste0("R-squared: ", R_squared),
         x = "Predicted",
         y = "Observed",
         col = "Absolute Deviation") +
  theme_bw() +
  theme(aspect.ratio = 0.9, panel.grid.minor.x = element_blank(), legend.title = element_text(size = 10, face="bold"), legend.text = element_text(size = 9), plot.title = element_text(size=12, face="bold"), axis.title=element_text(size=10, face="bold"), axis.text.x = element_text(angle = 0), legend.position = "none") +
  # scale_x_continuous(expand = c(0,0)) +
  # scale_y_continuous(expand = c(0,0)) + 
  coord_equal() + scale_color_viridis_c(direction = -1)

  return (plot)
}

# Function to plot residuals
residuals_plot <- function(predicted_val, residual_val, model_name = "", MAE, RMSE, ...) {

  plot <- ggplot(mapping = aes(x = predicted_val, y = residual_val, col = abs(residual_val))) +
  geom_point(alpha = 0.9, size = 2) +
  geom_abline(intercept = 0, slope = 0) +
    # facet_wrap(~) +
    labs(
       title = paste0(model_name, "\nResiduals: Test Set"),
       subtitle = paste0("RMSE: ", RMSE, ", MAE: ", round(MAE, 3)),
       x = "Predicted",
       y = "Residual",
       col = "Absolute Deviation"
       ) +
  theme_bw() +
  theme(aspect.ratio = 0.9, panel.grid.minor.x = element_blank(), legend.title = element_text(size = 10, face="bold"), legend.text = element_text(size = 9), plot.title = element_text(size=12, face="bold"), axis.title=element_text(size=10, face="bold"), axis.text.x = element_text(angle = 0), legend.position = "none") +
  # scale_x_continuous(expand = c(0,0)) +
  # scale_y_continuous(expand = c(0,0)) +
  coord_equal() + scale_color_viridis_c(direction = -1)

  return (plot)
}

15.0.7 Create Summary Table

Test Set Statistics of XGBoost Models.
Method RMSE MAE R-squared
Grid Search 4.96996 2.97559 0.9580
Random Search 5.10291 3.24413 0.9555
Genetic Algorithm 4.77734 2.78170 0.9611
Differential Evolution 4.63936 2.66680 0.9633
Particle Swarm Optimization 4.99273 3.01354 0.9577
Bayesian Optimization 4.50993 2.68133 0.9653

16 Summary

The Bayesian approuch with primed values is superior to all other approaches in terms of accuracy and computing resources.

---
title: "Hyperparameter Tuning Techniques"
subtitle: "Applicable to a variety of Machine Learning Algorithms"
output:
  html_notebook:
    code_folding: hide
    fig_caption: yes
    fig_height: 4.5
    fig_width: 7
    highlight: tango
    number_sections: yes
    theme: cosmo
    toc: yes
  html_document:
    df_print: paged
    toc: yes
author: Notes by Alastair K. Muir
date: "`r Sys.Date()`"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

<style type="text/css">

body{ /* Normal  */
font-size: 18px;
}
td {  /* Table  */
font-size: 20px;
}
h1.title {
  font-size: 34px;
  color: DarkBlue;
}
h1 { /* Header 1 */
  font-size: 28px;
  color: DarkBlue;
}
h2 { /* Header 2 */
  font-size: 24px;
  color: DarkBlue;
}
h3 { /* Header 3 */
  font-size: 20px;
  color: DarkBlue;
}
h4 { /* Header 4 */
  font-size: 18px;
  color: DarkBlue;
}
code.r{ /* Code block */
  font-size: 20px;
}
pre { /* Code block - determines code spacing between lines */
  font-size: 18px;
}
</style>

# How Accurate Do I Have To Be?

You are at strategy meeting to solve a business problem. The crux of the situation is the business owners, board members, stakeholders, customers, process owners, special interest groups, and concerned parents, need to know how *exactly* to produce the product, output, or service that is the best solution.

This is an intensely complex system with plenty of "magic bullet" solutions in the ditch behind you.

You, as the **data guru** are faced with this situation.

You have at your disposal reality bending algorithms and mind-reading social media troll farms.

Wow!

- where do you start?
- **when do you stop?**

>**How accurate do you have to be?**

# Describe a Complex System

![What will the output be?](./figures/function_f19.png)

Industrial processes, biological and chemical systems are very complex owing to the large number of factors and they interact with each other to produce the final result. As complex as the systems are, the predictive models are equally complex. Not only do they have a large number of unknown parameters, but the algorithms themselves have a large number of unknown hyperparameters. These hyperparameters can dramatically increase model performance and improve training times. Tuning can be a complex and expensive process.

Tuning hyperparameters of a deep neural network, probe drilling for oil at given geographic coordinates or evaluating the effectiveness of a drug candidate taken from a chemical search space then it is important to minimize the number of samples drawn from the black box function


> **Parameters**
> Configurations and values your machine learning algorithm learns from your data during training. These parameters can interact to create complex, non-linear systems.

> **Hyperparameters** 
> Express high level concepts, such as statistical assumptions( eg. regularization), are fixed before training or are hard to learn from data (eg. neural network architecture), and influence the objective, test time performance, and computational cost.

For this vignette we are going to use data from a very complex, non-linear system. The research paper, [Modeling of strength of high performance concrete using artificial neural networks](https://www.sciencedirect.com/science/article/pii/S0008884698001653) by I-Cheng Yeh published in Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998) is our choice.

From the paper, we know that the strength of High Performance Concrete follows a very non-linear relationship with the predictor variables and is poorly modeled by normal regression methods.

> A strength model based on ANN is more accurate than a model based on regression analysis

The dataset contains 1030, examples and the following features:

- Input Variable: Cement ($kg/m^3$ mixture)
- Input Variable: Blast Furnace Slag ($kg/m^3$ mixture)
- Input Variable: Fly Ash ($kg/m^3$ mixture)
- Input Variable: Water ($kg/m^3$ mixture)
- Input Variable: Superplasticizer ($kg/m^3$ mixture)
- Input Variable: Coarse Aggregate ($kg/m^3$ mixture)
- Input Variable: Fine Aggregate ($kg/m^3$ mixture)
- Input Variable: Age (days)
- Output Variable: Concrete compressive strength ($MPa$)

This vignette is based on the [RPubs article](http://rpubs.com/jeandsantos88/search_methods_for_hyperparameter_tuning_in_r) by Jean Dos Santos. It covers:

- heuristics and manual search (gut feelings)
   - select hyperparameter settings based on experience with similar problems
- grid search
   - define a range and step size for each hyperparameter to define an n-dimensional grid
   - the grid may be sub-sampled, eg. Latin squares 
- adaptive random search
   - a set of values of hyperparameters are generated that depend on the values of the variable of interest observed during the survey
- genetic algorithms
   - first choose a random population of hyperparameters, calculate the loss function for each set, then create offspring of your better solutions with some mutation, calculate the loss function for this new generation, and repeat the whole process until it converges
- differential evolution
   - a variation of genetic algorithms where a population of algorithms evolves
- particle swarm optimization
   - another method based on an initial array of random points with local improvement
- Bayesian methods
   - an initial array of random points is evaluated with a loss function and results are used to modify weights
- AutoML in H2O
   - this can include automatic selection of hyperparameters at the level of model architecture

# Load Packages

Package management is an integral part of an analytic and modeling project and I use the `pacman` package to handle installation and loading. There are a lot of tools associated with this project and I have arranged them in a logical sequence. In the manner of reproducible research, I go back and add to this section to add packages as I proceed through the project. In the spirit of **The Lean Startup**, this script is always a **minimally viable product**.

```{r, echo = TRUE}
#rm(list = ls())

pacman::p_load(
  # text formatting
  knitr, #A General-Purpose Package for Dynamic Report Generation in R
  kableExtra, #Construct Complex Table with 'kable' and Pipe Syntax

  # reading and manipulating data
  data.table, # very fast reading, writing and manipulation of large datasets
  tidyverse, # The tidy verse collection of data manipulation and graphics packages (Hadley Wickham)
  tidylog, # reporting of tidyr functions
  
  # statistical manipulation
  MASS, # "Modern Applied Statistics with S" (4th edition, 2002)
  
  # model definition and management              
  caret,  # Classification and Regression Training (Max Kuhn)
  
  # EDA and visualization
  plotly, # Create Interactive Web Graphics via 'plotly.js'
  GGally, # Extension to 'ggplot2'
  gridExtra, # Miscellaneous Functions for "Grid" Graphics
  ggthemes, # Extra Themes, Scales and Geoms for 'ggplot2'
  corrplot, # Visualization of a Correlation Matrix
  qqplotr, # Quantile-Quantile Plot Extensions for 'ggplot2'
  umap, # Uniform Manifold Approximation and Projection
  factoextra, # Extract and Visualize the Results of Multivariate Data Analyses
  ggpubr, # 'ggplot2' Based Publication Ready Plots
  funModeling, # EDA and rapid assessment of input data
  DataExplorer, # EDA and visualization similar to funModeling
  
  # ML and optimization routines
  pls, # Partial Least Squares and Principal Component Regression
  xgboost, # Extreme Gradient Boosting
    pso, # Particle Swarm Optimization
  GA, # Genetic Algorithms
  DEoptim, # Global Optimization by Differential Evolution
  
  # Bayesian optimization suite
  smoof,  #Single and Multi-Objective Optimization Test Functions
  mlr, # Machine Learning in R
  mlrMBO, # Bayesian Optimization and Model-Based Optimization of Expensive Black-Box Functions
  mlrHyperopt, # Hyperparameter optimization with mlr and mlrMBO
  ParamHelpers, # Functions for parameter descriptions and operations in black-box optimization, tuning and machine learning. Parameters can be described (type, constraints, defaults, etc.), combined to parameter sets and can in general be programmed on. A useful OptPath object (archive) to log function evaluations is also provided.

  # Caret based Bayesian optimization
  rBayesianOptimization, # update is expected in early 2019
  
  # parallel processing and performance
  tictoc, # Functions for timing R scripts, as well as implementations of Stack and List structures
  parallel, # Support for parallel computation, including random-number generation
  doParallel, # Foreach Parallel Adaptor for the 'parallel' Package
  
  # tools for spacial data
  fields, # Tools for Spatial Data
  
  # tools for explainable models
  DALEX2, # Machine Learning (ML) models are widely used and have various applications in classification or regression. Models created with boosting, bagging, stacking or similar techniques are often used due to their high performance, but such black-box models usually lack of interpretability.
  # DALEX package contains various explainers that help to understand the link between input variables and model output. The single_variable() explainer extracts conditional response of a model as a function of a single selected variable. It is a wrapper over packages 'pdp' (Greenwell 2017) <doi:10.32614/RJ-2017-016>, 'ALEPlot' (Apley 2018) <arXiv:1612.08468>  and 'factorMerger' (Sitko and Biecek 2017) <arXiv:1709.04412>.
  # The single_prediction() explainer attributes parts of a model prediction to particular variables used in the model. It is a wrapper over 'breakDown' package (Staniak and Biecek 2018) <doi:10.32614/RJ-2018-072>. The variable_dropout() explainer calculates variable importance scores based on variable shuffling (Fisher at al. 2018) <arXiv:1801.01489>. All these explainers can be plotted with generic plot() function and compared across different models.
  modelStudio, #Interactive Studio with explanations for ML Predictive models

  # if not installed, load and install
  install = TRUE)
library("tidylog", warn.conflicts = FALSE)
```

## ... and Functions For Visualizations

Voilin plots are pretty cool and the default palettes often do not have enough information to generate a palette if your variable has more than about 12 'bins' so we do a bit of `RColorBrewer` palette tuning.

```{r, echo = FALSE}
getPalette = colorRampPalette(RColorBrewer::brewer.pal(8, "Set2"))(26) 

# for testing of significance in correlation plots
# mat : is a matrix of data
cor.mtest <- function(mat, ...) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat<- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
            tmp <- cor.test(mat[, i], mat[, j], ...)
            p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
        }
    }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}

# somewhat hackish solution to:
# https://twitter.com/EamonCaddigan/status/646759751242620928
# based mostly on copy/pasting from ggplot2 geom_violin source:
# https://github.com/hadley/ggplot2/blob/master/R/geom-violin.r

"%||%" <- function(a, b) {
  if (!is.null(a)) a else b
}

geom_flat_violin <- function(mapping = NULL, data = NULL, stat = "ydensity",
                        position = "dodge", trim = TRUE, scale = "area",
                        show.legend = NA, inherit.aes = TRUE, ...) {
  layer(
    data = data,
    mapping = mapping,
    stat = stat,
    geom = GeomFlatViolin,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      trim = trim,
      scale = scale,
      ...
    )
  )
}

#' @rdname ggplot2-ggproto
#' @format NULL
#' @usage NULL
#' @export
GeomFlatViolin <-
  ggproto("GeomFlatViolin", Geom,
          setup_data = function(data, params) {
            data$width <- data$width %||%
              params$width %||% (resolution(data$x, FALSE) * 0.9)
            
            # ymin, ymax, xmin, and xmax define the bounding rectangle for each group
            data %>%
              group_by(group) %>%
              mutate(ymin = min(y),
                     ymax = max(y),
                     xmin = x,
                     xmax = x + width / 2)
            
          },
          
          draw_group = function(data, panel_scales, coord) {
            # Find the points for the line to go all the way around
            data <- transform(data, xminv = x,
                              xmaxv = x + violinwidth * (xmax - x))
            
            # Make sure it's sorted properly to draw the outline
            newdata <- rbind(plyr::arrange(transform(data, x = xminv), y),
                             plyr::arrange(transform(data, x = xmaxv), -y))
            
            # Close the polygon: set first and last point the same
            # Needed for coord_polar and such
            newdata <- rbind(newdata, newdata[1,])
            
            ggplot2:::ggname("geom_flat_violin", GeomPolygon$draw_panel(newdata, panel_scales, coord))
          },
          
          draw_key = draw_key_polygon,
          
          default_aes = aes(weight = 1, colour = "grey20", fill = "white", size = 0.5,
                            alpha = NA, linetype = "solid"),
          
          required_aes = c("x", "y")
)

raincloud_theme = theme(
  text = element_text(size = 10),
  axis.title.x = element_text(size = 16),
  axis.title.y = element_text(size = 16),
  axis.text = element_text(size = 14),
  axis.text.x = element_text(angle = 45, vjust = 0.5),
  legend.title=element_text(size=16),
  legend.text=element_text(size=16),
  legend.position = "right",
  plot.title = element_text(lineheight=.8, face="bold", size = 16),
  panel.border = element_blank(),
  panel.grid.minor = element_blank(),
  panel.grid.major = element_blank(),
  axis.line.x = element_line(colour = 'black', size=0.5, linetype='solid'),
  axis.line.y = element_line(colour = 'black', size=0.5, linetype='solid')
)

lb <- function(x) mean(x) - sd(x)
ub <- function(x) mean(x) + sd(x)
```

# Import the Data

The dataset can be downloaded through the [UCI Machine learning Repository](https://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength), and reformatted in Excel as a *csv* file in the data directory. Excel creates far too many problems with automatic date, time, and number conversions for me to trust it.

```{r, echo = true}
concrete_data <- fread(file = "./data/Concrete_Data.csv")
colnames(concrete_data) <- c("Cement",
                             "Slag",
                             "Ash",
                             "Water",
                             "Superplasticizer",
                             "Coarse_Aggregate",
                             "Fine_Aggregate",
                             "Age",
                             "Strength")
dplyr::glimpse(concrete_data)
```

The data shows the data has `r nrow(concrete_data)` observations with `r ncol(concrete_data)` continuous target and predictors.

# Pre-Processing and Exploratory Data Analysis

## Missing Values, Complete Cases, Near Zero Variances, and Similar Traps

This is an overview of number of zero, missing, and unique values.

```{r}
funModeling::df_status(concrete_data)
```

```{r, echo=TRUE}
# reorder columns
preds <- setdiff(names(concrete_data), "Strength")
concrete_data_short <- dplyr::select(concrete_data, c("Strength", preds))

print("Complete Cases")
complete.cases(concrete_data_short) %>%
  summary
```

```{r, echo=TRUE}
nzv <- nearZeroVar(concrete_data_short, saveMetrics = TRUE)
nzv

#concrete_data1 <- concrete_data[, -nzv]
```

All values are continuous variables, and data are complete: imputation is not required. Arrange columns so target is in column 1, just..., because I like it. There are no non-zero variance predictors or target. There is no need to delete non-zero variance predictors.

## Density Kernels

I have a look for strikingly non-normal distributions and outliers. This is a prelude to scaling.

First the response variable,

```{r, echo=TRUE}
ggplot(data = concrete_data_short, aes(x = Strength)) + 
  geom_density(alpha = 0.4, fill = "blue", adjust = 0.25) + 
  theme(legend.position = "none")

DataExplorer::plot_qq(concrete_data$Strength)
summary(concrete_data_short$Strength)
```

Now the predictors

```{r, echo=TRUE}
concrete_data_long <- concrete_data_short %>%
  gather(c("Cement",
           "Slag",
           "Ash",
           "Water",
           "Superplasticizer",
           "Coarse_Aggregate",
           "Fine_Aggregate",
           "Age"), 
         key = "Component", value = "Strength")
#glimpse(concrete_data_long)

lb <- function(x) mean(x) - sd(x)
ub <- function(x) mean(x) + sd(x)

sumld <- concrete_data_long %>% 
  group_by(Component) %>% 
  summarise_all(list(mean = mean, median = median, lower = lb, upper = ub))
#sumld

ggplot(data = concrete_data_long, aes(x = Strength)) + 
  geom_density(aes(fill = Component), alpha = 0.4, adjust = 0.25) + 
  facet_wrap(~ Component, scales = "free") + 
  scale_fill_brewer(palette = "Set1") +
  theme(legend.position = "none")
```

There is some clustering of predictors, especially `age`, and there appear be no clear outliers or related problems.

```{r}
concrete_data %>%
  dplyr::select(-Strength) %>%
  DataExplorer::plot_qq()
```

Slag, ash and Superplasticizer show a large proportion of zeros and skew the approximately unimodel distributions. Age shows a very non-normal distribution.

## Examine the Distribution of the `Age` Parameter

```{r}
table(concrete_data_short$Age)

# Use fitdistr from MASS to estimate distribution params
params <- as.list(fitdistr(concrete_data_short$Age, densfun = "lognormal")$estimate)
ggplot(data = concrete_data_short, aes(sample = Age)) +
  stat_qq(distribution  = stats::qlnorm, dparams = params) +
  geom_qq(geom = "point") +
  stat_qq_band(distribution = "lnorm", dparams = params) +
  stat_qq_line(distribution = "lnorm",dparams = params) +
  labs(title = "Q-Q Plot of log(Age)", x = "Theoretical Quantiles", y = "Sample Quantiles")
DataExplorer::plot_qq(log(concrete_data_short$Age))
```


```{r}
p1 <- ggplot(data = concrete_data_short, aes(x = Age)) + 
  geom_density(alpha = 0.4, fill = "blue", adjust = 0.25) + 
  theme(legend.position = "none") +
    labs(title = "Age Distribution", x = "Age", y = "")

p2 <- ggplot(data = concrete_data_short, aes(x = log(Age))) + 
  geom_density(alpha = 0.4, fill = "blue", adjust = 0.25) + 
  theme(legend.position = "none") +
    labs(title = "log(Age) Distribution", x = "log(Age)", y = "")

cowplot::plot_grid(p1, p2, ncol = 2)
rm(p1, p2)
```

## Take the Logarithm of the `Age` Predictor for Clarity

The logarithm of `Age` makes the partial least squares regression(PLS) and principal components regression(PCR) behave much better. Typical values for RMSE have decreased from 11-12 to 7.7-8.2 following transformation of the `Age` parameter. 

```{r}
concrete_data_short$Age = log(concrete_data_short$Age)
```

## Check for Correlated Predictors and Plot the Correlogram

Historical data can sometimes show clustering that creates problems in modeling. Strong correlations can also be used to identify clusters.

```{r, eval=TRUE}
concrete_data_corr <- cor(concrete_data_short, use="all.obs", method = "pearson")

highCorr <- sum(abs(concrete_data_corr[upper.tri(concrete_data_corr)]) > 0.999)
summary(concrete_data_corr[upper.tri(concrete_data_corr)])

highlyCorDescr <- findCorrelation(concrete_data_corr, cutoff = 0.75)
filteredDescr <- concrete_data_corr[,-highlyCorDescr]

descrCor2 <- cor(filteredDescr)
summary(descrCor2[upper.tri(descrCor2)])

p.mat <- cor.mtest(concrete_data_short)

corrplot::corrplot(concrete_data_corr, method = "circle", type = "lower", order = "hclust",
         p.mat = p.mat, sig.level = 0.05)
```

```{r, eval=TRUE, fig.height=6.0, fig.width=6.0}
# reference (https://github.com/ggobi/ggally/issues/139#)
my_custom_smooth <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(color = I("blue"), alpha = 0.1, size = 0.1) + 
    geom_smooth(method = "lm", color = I("black"), ...)
}
concrete_data_short %>%
  ggpairs(upper = list(continuous = wrap(ggally_cor, size = 3)),
          lower = list(continuous = my_custom_smooth), 
          axisLabels = "internal") +
  labs(title = "Correlation Plots")
```

```{r, eval=TRUE, fig.height=2.0, fig.width=8}
concrete_data_short %>%
  ggduo(columnsX = 2:9, 
        columnsY = 1, 
        types = list(continuous = "smooth_lm"),
        mapping = ggplot2::aes(color = -Strength, alpha = 0.3)) +
  theme_bw()
```

Even when retaining correlations at a significance of 0.05, there are none big enough to create problems with later analysis. There are some indications that water content and superplastizer are important. No correlations between the target variable and any of the predictor variables jump out as large. This will take some effort during modeling.

## Linear Dependencies

This is to check if there are any clear linear dependencies in the data. This is related to the correlation tested above. There are none: the relationship between variables is more complex.

```{r, echo=TRUE}
comboInfo <- findLinearCombos(concrete_data_short)
summary(comboInfo)
```

Interactions between variables can be investigated using a parallel coordinates plot.

```{r}
#min(concrete_data$Strength)
#max(concrete_data$Strength)
concrete_data_short %>%
#  mutate(Age = log(Age)) %>%
  plot_ly(width = 700, height = 450) %>%
  add_trace(type = 'parcoords',
            line = list(color = ~Strength,
                        colorscale = list(getPalette),
                        cmin = 1,
                        cmax = 90),
            dimensions = list(
              list(range = c(~min(Strength),~max(Strength)),
                   label = 'Strength', values = ~Strength),
              list(range = c(~min(Cement),~max(Cement)),
                   visible = TRUE,
                   label = 'Cement', values = ~Cement),
              list(range = c(~min(Slag),~max(Slag)),
                   label = 'Slag', values = ~Slag),
              list(range = c(~min(Ash),~max(Ash)),
                   label = 'Ash', values = ~Ash),
              list(range = c(~min(Cement),~max(Cement)),
                   label = 'Water', values = ~Water),
              list(range = c(~min(Superplasticizer),~max(Superplasticizer)),
                   label = 'Superplasticizer', values = ~Superplasticizer),
              list(range = c(~min(Coarse_Aggregate),~max(Coarse_Aggregate)),
                   label = 'Coarse_Aggregate', values = ~Coarse_Aggregate),
              list(range = c(~min(Fine_Aggregate),~max(Fine_Aggregate)),
                   label = 'Fine_Aggregate', values = ~Fine_Aggregate),
              list(range = c(~min(Age),~max(Age)),
                   label = 'Age', values = ~Age)
            )
  )
```

## Z-Scale All Data

Now that the data have been screened for possible problems, we can look at the distributions of each predictor and target to get a feeling for potential clusters and relationships. We will do this using the `preprocess` function in `caret`. The `scale` function will scale all variables to their `z` values.

```{r, echo=TRUE}
preProcValues <- preProcess(concrete_data_short[,-1], method = c("center", "scale"))
concrete_data_transformed <- predict(preProcValues, concrete_data_short[,-1])
concrete_data_scaled <- cbind(concrete_data_short[, 1], concrete_data_transformed)
```

## Edited, Screened and Transformed Data Ready For Analysis

```{r, echo = TRUE}
concrete_data_long_scaled <- concrete_data_scaled %>% 
  gather(c("Cement",
           "Slag",
           "Ash",
           "Water",
           "Superplasticizer",
           "Coarse_Aggregate",
           "Fine_Aggregate",
           "Age"), 
         key = "Component", value = "Value")

#concrete_data_long_scaled %>%
#  group_by(Component) %>%
#  summarise(mean = mean(Strength), sd = sd(Strength))
```

The data should look approximately on the same scale with approximately similar variances for regression models. This is not a requirement for XGBoost or LightGBM.

```{r, echo=TRUE}
ggplot(data = concrete_data_long_scaled, aes(x = Value)) + 
  geom_density(aes(fill = Component), alpha = 0.4, adjust = 0.5) + 
  facet_wrap(~ Component, scales = "fixed") + 
  scale_fill_brewer(palette = "Set1") +
  theme(legend.position = "none") +
  labs(title = "Z Scaled Predictors")
```


```{r, echo=TRUE}
ggplot(data = concrete_data_long_scaled, 
       aes(x = Component, y = Value, fill = Component)) +
  geom_flat_violin(position = position_nudge(x = .2, y = 0), alpha = .8, scale = "width", adjust = 0.25) +
  #  geom_flat_violin(position = position_nudge(x = .2, y = 0), alpha = .8) +
  geom_point(aes(y = Value, color = Component), 
             position = position_jitter(width = .15), size = .5, alpha = 0.8) +
  geom_boxplot(width = .1, outlier.shape = NA, alpha = 0.5) +
  expand_limits(x = 5.25) +
  guides(fill = FALSE) +
  guides(color = FALSE) +
  coord_flip() +
  scale_color_brewer(palette = "Spectral") +
  scale_fill_brewer(palette = "Spectral") +
  theme_bw() +
  raincloud_theme
```

# Check For Potential Clustering

Use UMAP to look for potential clustering in the **predictor variables**. This is to investigate whether the predictor data is diverse and extensive. Any clustering may show physical or convenience based restrictions governing the experimental data. There appear to be some clusters with unknown origin. The mid and lower right hand side clusters seem to show a preference for concrete with strong and weak strength. 

```{r, eval=TRUE, fig.height=6, fig.width=6.5}
custom.config = umap.defaults
custom.config$random_state = 123

concrete_data_umap <- umap(d = concrete_data_scaled[,-1], custom.config)
temp.df <- data.table(x = concrete_data_umap$layout[,1], 
     y = concrete_data_umap$layout[,2],
     Strength = concrete_data_scaled$Strength)
mid <- mean(temp.df$Strength)
ggplot(data = temp.df, aes(x = x, y = y, color = Strength)) +
  geom_point() +
  scale_color_gradient2(midpoint=mid, low="blue", mid="white",
                        high="red", space ="Lab" )
```

# Some PCA to Assess Combinations of Variables

There’s a few pretty good reasons to use PCA. We actually capture 66.3% of variance in the dataset by just using three principal components. Since this is purely introductory I’ll skip the math and give you a quick rundown of the workings of PCA:

- Standardize the data (Center and scale).
- Calculate the Eigenvectors and Eigenvalues from the covariance matrix or correlation matrix (One could also use Singular Vector Decomposition).
- Sort the Eigenvalues in descending order and choose the K largest Eigenvectors (Where K is the desired number of dimensions of the new feature subspace k ≤ d).
- Construct the projection matrix W from the selected K Eigenvectors.
- Transform the original dataset X via W to obtain a K-dimensional feature subspace Y.

```{r, eval = TRUE}
concrete_pca <- prcomp(concrete_data_transformed[,2:8], scale = TRUE)
# scree plot
summary(concrete_pca)

fviz_eig(concrete_pca)

fviz_pca_var(concrete_pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)     # Avoid text overlapping

# 3D plot of the first two PCAs with z as Strength
concrete_coord <- get_pca_ind(concrete_pca)
concrete_data_3D <- data.table(Dim1 = concrete_coord$coord[,1],
                               Dim2 = concrete_coord$coord[,2],
                               Dim3 = concrete_coord$coord[,3])
concrete_data_3D <- cbind(concrete_data_3D, temp.df$Strength)
mid <- mean(temp.df$Strength)
ggplot(concrete_data_3D, aes(x = Dim1, y = Dim2, color = V2)) +
  geom_point() +
    scale_color_gradient2(midpoint=mid, low="blue", mid="white",
                        high="red", space ="Lab" )
```

The scree plot shows there are no "magic" combinations of variables that will solve our problem - keep digging.

The **Variables - PCA** plot indicates that `Age` is relatively unimportant in the first two PCA dimensions.

# Generate Train and Test Set

We start with the split of data into test and train. Since there is no time dependence of consecutive observations, we can used random sampling to construct train/test and folds within the train set.

Look at [this blog post](https://www.displayr.com/using-partial-least-squares-to-conduct-relative-importance-analysis-in-r/)

```{r, echo = FALSE}
#concrete_data_scaled %>%
#  pivot_longer(cols = 1:8) %>%
#  group_by(name) %>%
#  summarise_all(c(mean=mean, sd=sd))

rm(train.data, test.data)
set.seed(42)
sel <- createDataPartition(concrete_data_scaled$Strength, p = 0.8, list = FALSE)
train.data <- concrete_data_scaled[sel,]
test.data <- concrete_data_scaled[-sel,]
```

# Training Parameters(repeated_cv and adaptive_cv)

```{r}
# Training Parameters
CV_folds <- 5 # number of folds
CV_repeats <- 3 # number of repeats
minimum_resampling <- 5 # minimum number of resamples

# Training Settings
set.seed(1)

# trainControl object for standard repeated cross-validation
train_control <- caret::trainControl(
  method = "repeatedcv", 
  number = CV_folds, 
  repeats = CV_repeats, 
  verboseIter = FALSE, 
  returnData = FALSE
) 

# trainControl object for repeated cross-validation with grid search
adapt_control_grid <- caret::trainControl(
  method = "adaptive_cv", 
  number = CV_folds, 
  repeats = CV_repeats, 
  adaptive = list(
    min = minimum_resampling, 
    alpha = 0.05, # CI used to exclude parameter settings
    method = "gls", # generalized least squares
    complete = TRUE
  ), 
  search = "grid", # execute grid search
  verboseIter = FALSE, returnData = FALSE) 

# trainControl object for repeated cross-validation with random search
adapt_control_random <- caret::trainControl(
  method = "adaptive_cv", 
  number = CV_folds, 
  repeats = CV_repeats, 
  adaptive = list(
    min = minimum_resampling, 
    alpha = 0.05, # CI used to exclude parameter settings
    method = "gls", # generalized least squares
    complete = TRUE
  ), 
  search = "random", # execute random search
  verboseIter = FALSE, returnData = FALSE
) 
```

# The Response Surface for Cement

I like to get an idea about the non-linearity of the problem before going too much in depth. Complexities will show up when:

- PCA does not show a distinct **elbow"
- PCA components do not show a large cumulative proportion af variation after the first few components
- The graph of the response surface shows a lot of non-linear behaviour

Let's plot strength as a function of the first three PCA components to get an idea of the non-linearity of the strength of concrete. It has a number of peaks showing the non-linearity of the systems. This means that a simple regression will not capture the response surface well. We also have to accept random variation in the experimental setup - the data does not come from a replicated experiment.

```{r, eval=TRUE}
plot_ly(type = "scatter3d",
        mode = "markers",
        symbol = "circle-dot",
        size = 3,
        x = concrete_data_3D$Dim1,
        y = concrete_data_3D$Dim2, 
        z = concrete_data_3D$Dim3,
        color = concrete_data_3D$V2)

#temp <- distinct(concrete_data_3D, Dim1, Dim2, .keep_all = TRUE)
#ggplot(data = temp, aes(x = Dim1, y = Dim2, z = V2)) +
#  stat_summary_2d(colour = "white", bins = 100, fun = mean) +
#  scale_fill_gradient2(mid = "blue", high ="red", space = "Lab")
```

# Baseline Principal Component Regression Model with Train/Test Split

```{r}
# Build the model on training set
model <- caret::train(
  Strength ~ .,
  data = train.data, 
  method = "pcr",
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
#library(pls)
#cv = RMSEP(model)

# Plot model RMSE vs different values of components
plot(model)
# Print the best tuning parameter ncomp that
# minimize the cross-validation error, RMSE
model$bestTune
summary(model$finalModel)

# Make predictions
predictions <- model %>% predict(test.data)
# Model performance metrics
data.frame(
  RMSE = caret::RMSE(predictions, test.data$Strength),
  Rsquare = caret::R2(predictions, test.data$Strength)
)
```


# Baseline Partial Least Squares Model with Train/Test Split

```{r}
# Build the model on training set
set.seed(42)
model <- caret::train(
  Strength~., data = train.data, method = "pls",
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
  )
# Plot model RMSE vs different values of components
plot(model)
# Print the best tuning parameter ncomp that
# minimize the cross-validation error, RMSE
model$bestTune
summary(model$finalModel)

# Make predictions
predictions <- model %>%
  predict(test.data)
# Model performance metrics
data.frame(
  RMSE = caret::RMSE(predictions, test.data$Strength),
  Rsquare = caret::R2(predictions, test.data$Strength)
)
```

# XGBoost for Regression

PLS and PCA regressions do not require hyperparameters. In order to get better results from more advanced algorithms, each method needs to be run with the optimum set of hyperparameters. There is no general way to determine these before conducting analysis.

The analysis so far indicates that each variable is pretty important for the final prediction, but the linear models only get just so far with an RMSE of about 7.7-8.2, and Rsquare of 0.773-0.798. XGBoost was chosen for further modeling based on perfomance and speed and its structure allowing a fit to non-linear and complex problems. 

>XGBoost stands for eXtreme Gradient Boosting.
>The name xgboost, though, actually refers to the engineering goal to push the limit of computations resources for boosted tree algorithms. Which is the reason why many people use xgboost. - Tianqi Chen

We will used the *xgbLinear* model, while it is called a regression and contains the word, "linear", a regression tree is a nonlinear mode: It is really a “regularized linear model” using delta with elastic net regularization (L1 + L2 + L2 bias) and parallel coordinate descent optimization. It is also an ensemble learner that results in a non-linear transfer function.   

![Trees](./figures/trees.jpg)


An [XGBoost model](http://dmlc.cs.washington.edu/data/pdf/XGBoostArxiv.pdf) will allow the determination of interactions between variables. The model, however, will require the determination of *hyperparameters*, defined as any parameter **not** determined from the data. This means user defined assumptions such as regularization, scaling, transformation, feature engineering, neural net architecture, number of epochs, and learning rate (eta). Individual algorithms have a set of hyperparameters commonly used for tuning. For XGBoost, we will be changing:

## GS - Grid Search
- `nrounds` - number of boosting iterations
- `eta` - step size shrinkage
- `lambda` - L2 regularization
- `alpha` - L1 regularization

### GS - Create The Grid

```{r}
XGBoost_grid <- expand.grid(
  nrounds = c(50, 75, 100, 150, 225, 325), # number of boosting iterations
  eta = c(0.01, 0.03, 0.10, 1.0), # learning rate, low means less overfitting
  lambda = c(0.1, 0.3, 0.5, 0.75, 0.9, 1.0), # L2 regularization (Ridge regression)
  alpha = c(0.1, 0.5, 0.8, 1.0) # L1 regularization (Lasso regression)
)
```

This means $6*4*6*4 = 576$ combinations and takes about 1 seond per combination. Training will be done with adaptive sampling with a minimum resampling number of 5.

### GS - Run the Search and Save the Model 

```{r}
GS_T0 <- Sys.time()
cluster <- makeCluster(detectCores() - 1) # number of cores, leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing

set.seed(1); 

# Train model with grid search
GS_XGBoost_Linear_model <- caret::train(
  Strength ~., 
  data = train.data,
  method = "xgbLinear",
  trControl = adapt_control_grid,
  verbose = FALSE, 
  silent = 1,
  # tuneLength = 20
  tuneGrid = XGBoost_grid
)

stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing
GS_T1 <- Sys.time()
GS_T1 - GS_T0

saveRDS(object = GS_XGBoost_Linear_model, 
        file = "./Models/GS_XGBoost_Linear_model.rds")
saveRDS(object = GS_XGBoost_Linear_model$finalModel, 
        file = paste0("Models/GS_XGBoost_Linear_model_",
                      class(GS_XGBoost_Linear_model$finalModel)[1],".rds"))
```






### GS - Summarize the Search

```{r}
GS_XGBoost_Linear_model
```

### GS - Plot the Search Progress

```{r, fig.height=3, fig.width=8}
rm(data_points)
data_points <- data.table(GS_XGBoost_Linear_model$results$nrounds)
data_points$eta <- GS_XGBoost_Linear_model$results$eta
data_points$alpha <- GS_XGBoost_Linear_model$results$alpha
data_points$lambda <- GS_XGBoost_Linear_model$results$lambda
data_points$RMSE <- GS_XGBoost_Linear_model$results$RMSE
colnames(data_points) <- c("nrounds", "eta", "alpha", "lambda", "RMSE")

data_points %>%
  pivot_longer(cols = c("nrounds", "eta", "alpha", "lambda", "RMSE")) %>%
  ggplot(aes(y = value, x = seq_along(value))) +
  geom_line(col = "blue") +
  facet_wrap(~name, scales = "free_y", ncol = 3) +
    scale_x_continuous(name = "Optimization Phase") +
  labs(title = "Systematic Grid Search") +
    theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  scale_y_continuous(name = "")
```

#### GS - Plot The RMSE Results

```{r, fig.height=6, fig.width=8}
RMSE_plot_XGBoost_Linear <- GS_XGBoost_Linear_model$results %>% 
  dplyr::select(nrounds, lambda, alpha, eta, RMSE) %>% 
  gather(key = Parameter, value = Value, -nrounds, -lambda, -alpha, -eta) %>% 
  filter(Parameter == "RMSE") %>% 
  ggplot(mapping = aes(x = nrounds, y = Value, col = factor(lambda), shape = factor(alpha))) +
    geom_line(size = 1) +
    geom_point(size = 2) +
    facet_wrap(~eta) +
    labs(title = "Average Cross-validated RMSE of XGBoost Linear Model",
       subtitle = "Tiles representative of step size shrinkage (eta)",
       x = "# Boosting Iterations",
       y = "RMSE",
       col = "lambda",
       shape = "alpha"
       ) +
  theme_bw() +
  theme(
        # panel.grid = element_blank(),
        legend.title = element_text(size = 10, face="bold"),
        legend.text = element_text(size = 9),
        plot.title = element_text(size=16),
        axis.title=element_text(size=10, face="bold"),
        axis.text.x = element_text(angle = 0)) +
  scale_x_continuous(breaks = unique(GS_XGBoost_Linear_model$results$nrounds)) +
  scale_color_brewer(type = "qual", palette = "Set1")
RMSE_plot_XGBoost_Linear
```

#### GS - Plot the MAE Results

```{r, fig.height=6, fig.width=8}
MAE_plot_XGBoost_Linear <- GS_XGBoost_Linear_model$results %>% 
  dplyr::select(nrounds, lambda, alpha, eta, MAE) %>% 
  gather(key = Parameter, value = Value, -nrounds, -lambda, -alpha, -eta) %>% 
  filter(Parameter == "MAE") %>% 
  ggplot(mapping = aes(x = nrounds, y = Value, col = factor(lambda), shape = factor(alpha))) +
    geom_line(size = 1) +
    geom_point(size = 2) +
    facet_wrap(~eta) +
    labs(title = "Average Cross-validated MAE of XGBoost Linear Model",
       subtitle = "Tiles representative of step size shrinkage (eta)",
       x = "# Boosting Iterations",
       y = "MAE",
       col = "lambda",
       shape = "alpha"
       ) +
  theme_bw() +
  theme(
        # panel.grid = element_blank(),
        legend.title = element_text(size = 10, face="bold"),
        legend.text = element_text(size = 9),
        plot.title = element_text(size=16),
        axis.title=element_text(size=10, face="bold"),
        axis.text.x = element_text(angle = 0)) +
  scale_x_continuous(breaks = unique(GS_XGBoost_Linear_model$results$nrounds)) +
  scale_color_brewer(type = "qual", palette = "Set1")
MAE_plot_XGBoost_Linear
```

#### GS - Plot the RSquared Results

```{r, fig.height=6, fig.width=8}
Rsquared_plot_XGBoost_Linear <- GS_XGBoost_Linear_model$results %>% 
  dplyr::select(nrounds, lambda, alpha, eta, Rsquared) %>% 
  gather(key = Parameter, value = Value, -nrounds, -lambda, -alpha, -eta) %>% 
  filter(Parameter == "Rsquared") %>% 
  ggplot(mapping = aes(x = nrounds, y = Value, col = factor(lambda), shape = factor(alpha))) +
    geom_line(size = 1) +
    geom_point(size = 2) +
    facet_wrap(~eta) +
    labs(title = "Average Cross-validated R-squared of XGBoost Linear Model",
       subtitle = "Tiles representative of step size shrinkage (eta)",
       x = "# Boosting Iterations",
       y = "R-squared",
       col = "lambda",
       shape = "alpha"
       ) +
  theme_bw() +
  theme(
        # panel.grid = element_blank(),
        legend.title = element_text(size = 10, face="bold"),
        legend.text = element_text(size = 9),
        plot.title = element_text(size=16),
        axis.title=element_text(size=10, face="bold"),
        axis.text.x = element_text(angle = 0)) +
  scale_x_continuous(breaks = unique(GS_XGBoost_Linear_model$results$nrounds)) +
  scale_color_brewer(type = "qual", palette = "Set1") 
Rsquared_plot_XGBoost_Linear
```

### GS - Investigate Interactions Between Hyperparameter Values

```{r}
getPalette = colorRampPalette(RColorBrewer::brewer.pal(8, "Set2"))(10) 
GS_XGBoost_Linear_model$results %>%
  plot_ly(width = 700, height = 500) %>%
  add_trace(type = 'parcoords',
            line = list(color = ~RMSE,
                        colorscale = list(getPalette),
                        cmin = 4.60,
                        cmax = 4.85),
            dimensions = list(
              list(range = c(~min(RMSE),~max(RMSE)),
                   label = 'RMSE', values = ~RMSE),
              list(range = c(~min(nrounds),~max(nrounds)),
                   label = 'nrounds', values = ~nrounds),
              list(range = c(~min(eta),~max(eta)),
                   label = 'eta', values = ~eta),
              list(range = c(~min(lambda),~max(lambda)),
                   label = 'lambda', values = ~lambda),
              list(range = c(~min(alpha),~max(alpha)),
                   label = 'alpha', values = ~alpha)
            )
  )
```

We are seeing, in general, higher values of lambda, alpha and eta are better and that results don't change much after about 200 rounds. This took 1102.33 sec to reach this solution by evaluating 672 models.

### GS - Test The Model with Optimized GS Settings

```{r}
# Grid of optimal hyperparameter values

GS_XGBoost_Linear_grid <- expand.grid(
  nrounds = GS_XGBoost_Linear_model$bestTune$nrounds,  # learning rate, low value means model is more robust to overfitting
  eta = GS_XGBoost_Linear_model$bestTune$eta, # number of boosting iterations
  alpha = GS_XGBoost_Linear_model$bestTune$alpha, # L2 Regularization (Ridge Regression)
  lambda = GS_XGBoost_Linear_model$bestTune$lambda # L1 Regularization (Lasso Regression)
)

tic()
cluster <- makeCluster(detectCores() - 1) # number of cores, convention to leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing

set.seed(1)
# Train model with optimal values
GS_XGBoost_Linear_model_fin <- caret::train(
  Strength ~., 
  data = train.data, 
  method = "xgbLinear",
  trControl = train_control,
  verbose = F, metric = "RMSE", maximize = FALSE,
  silent = 1,
  # tuneLength = 1
  tuneGrid = GS_XGBoost_Linear_grid
)

stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing
toc()

GS_XGBoost_Linear_model_fin

saveRDS(object = GS_XGBoost_Linear_model_fin, file = "Models/GS_XGBoost_Linear_model.rds")
```
















## RS - Random Search Using Random Adaptive Sampling

We are going to see if we can reach similar results to the grid search by choosing the same number of configurations, but chosen randomly.

```{r}
n_combinations <- nrow(XGBoost_grid)
```

### RS - Run the Optimization and Save the Model

```{r}
RS_T0 <- Sys.time()
cluster <- makeCluster(detectCores() - 1) # number of cores, leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing

set.seed(1)
# Train model with grid search
RS_XGBoost_Linear_model <- caret::train(
  Strength ~., 
  data = train.data,
  method = "xgbLinear",
  trControl = adapt_control_random,
  verbose = FALSE, 
  silent = 1,
  tuneLength = n_combinations
#  tuneLength = 20
)
stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing
RS_T1 <- Sys.time()
RS_T1 - RS_T0

#RS_XGBoost_Linear_model <- readRDS(file = "Models/RS_XGBoost_Linear_model.rds")

saveRDS(object = RS_XGBoost_Linear_model, file = "Models/RS_XGBoost_Linear_model.rds")
saveRDS(object = RS_XGBoost_Linear_model$finalModel, file = paste0("Models/RS_XGBoost_Linear_model_", class(RS_XGBoost_Linear_model$finalModel)[1],".rds"))
```

### RS - Summarize the Optimization

```{r}
RS_XGBoost_Linear_model
```

### RS - Grid Search Progress

```{r, fig.height=3, fig.width=8}
rm(data_points)
data_points <- data.table(RS_XGBoost_Linear_model$results$nrounds)
data_points$eta <- RS_XGBoost_Linear_model$results$eta
data_points$alpha <- RS_XGBoost_Linear_model$results$alpha
data_points$lambda <- RS_XGBoost_Linear_model$results$lambda
data_points$RMSE <- RS_XGBoost_Linear_model$results$RMSE
colnames(data_points) <- c("nrounds", "eta", "alpha", "lambda", "RMSE")

data_points %>%
  pivot_longer(cols = c("nrounds", "eta", "alpha", "lambda", "RMSE")) %>%
  ggplot(aes(y = value, x = seq_along(value))) +
  geom_line(col = "blue") +
  facet_wrap(~name, scales = "free_y", ncol = 3) +
    scale_x_continuous(name = "Optimization Phase") +
  labs(title = "Adaptive Random Search") +
    theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  scale_y_continuous(name = "")
```









### RS - Investigate Interactions Between Hyperparameter Values

```{r}
getPalette = colorRampPalette(RColorBrewer::brewer.pal(8, "Set2"))(10) 
RS_XGBoost_Linear_model$results %>%
  plot_ly(width = 700, height = 500) %>%
  add_trace(type = 'parcoords',
            line = list(color = ~RMSE,
                        colorscale = list(getPalette),
                        cmin = 4.60,
                        cmax = 4.85),
            dimensions = list(
              #list(range = c(~min(RMSE),~max(RMSE)),
              list(range = c(~min(RMSE),5.5),
                   label = 'RMSE', values = ~RMSE),
              list(range = c(~min(nrounds),~max(nrounds)),
                   label = 'nrounds', values = ~nrounds),
              list(range = c(~min(eta),~max(eta)),
                   label = 'eta', values = ~eta),
              list(range = c(~min(lambda),~max(lambda)),
                   label = 'lambda', values = ~lambda),
              list(range = c(~min(alpha),~max(alpha)),
                   label = 'alpha', values = ~alpha)
            )
  )
```

Of course, we see that the the hyperspace of sampling is larger than the grid search, but the algorithm hones in on the best areas in hyperspace. We are seeing, in general, higher values of lambda, alpha and eta are better and that results don't change much after about 200 rounds. This took 1102.33 sec to reach this solution by evaluating 672 models.

### RS - Test The Model with Optimized RS Settings

```{r}
RS_XGBoost_Linear_model$bestTune
```

```{r}
# Grid of optimal hyperparameter values

RS_XGBoost_Linear_grid <- expand.grid(
  nrounds = RS_XGBoost_Linear_model$bestTune$nrounds,  # learning rate, low value means model is more robust to overfitting
  eta = RS_XGBoost_Linear_model$bestTune$eta, # number of boosting iterations
  alpha = RS_XGBoost_Linear_model$bestTune$alpha, # L2 Regularization (Ridge Regression)
  lambda = RS_XGBoost_Linear_model$bestTune$lambda # L1 Regularization (Lasso Regression)
)

tic()
cluster <- makeCluster(detectCores() - 1) # number of cores, convention to leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing

set.seed(1)
# Train model with optimal values
RS_XGBoost_Linear_model_fin <- caret::train(
  Strength ~., 
  data = train.data, 
  method = "xgbLinear",
  trControl = train_control,
  verbose = F, metric = "RMSE", maximize = FALSE,
  silent = 1,
  tuneGrid = (RS_XGBoost_Linear_grid)
)

stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing
toc()

RS_XGBoost_Linear_model_fin

saveRDS(object = RS_XGBoost_Linear_model_fin, file = "Models/RS_XGBoost_Linear_model.rds")
```








## GA - Genetic Algorithm

Since this is an optimization problem, we can use optimization techniques to assess and refine the best combinations of hyperparemeters. This approach has two requirements:

- a function to evaluate the score for strategies, called the **fitness**
- the minimum and maximum values allowable for each hyperparameter
- the population size


Good [article here](https://blog.datascienceheroes.com/feature-selection-using-genetic-algorithms-in-r/) from the R-bloggers mailing list. There is a vignette [here](https://cran.r-project.org/web/packages/GA/vignettes/GA.html) by Luca Scrucca.

Have a look at [this article](https://towardsdatascience.com/hyperparameter-tuning-in-xgboost-using-genetic-algorithm-17bd2e581b17)

### GA - Create The Function For Assessing Solutions

```{r}
# Create custom function for assessing solutions
eval_function_XGBoost_Linear <- function(x1, x2, x3, x4, data, train_settings) {
  suppressWarnings(
    XGBoost_Linear_model <- caret::train(
      Strength ~., 
      data = data,
      method = "xgbLinear",
      trControl = train_settings,
      verbose = FALSE, 
      silent = 1,
      tuneGrid = expand.grid(
        nrounds = round(x1), # number of boosting iterations
        eta = 10^x2, # learning rate, low value means model is more robust to overfitting
        alpha = 10^x3, # L1 Regularization (equivalent to Lasso Regression) on weights
        lambda = 10^x4 # L2 Regularization (equivalent to ridge Regression) on weights
      ) 
    )
  )
return(-XGBoost_Linear_model$results$RMSE) # minimize RMSE
}

# Define minimum and maximum values for each input
nrounds_min_max <- c(10,10^3)
eta_min_max <- c(-5,3)
alpha_min_max <- c(-3,1)
lambda_min_max <- c(-3,1)

# Set parameter settings for search algorithm
max_iter <- 10 # maximum number of iterations
pop_size <- 10 # population size
```

### GA - Run The Optimization and Save the Model

```{r}
set.seed(1)
cluster <- makeCluster(detectCores() - 1) # number of cores, leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing
n_cores <- length(cluster)

GA_T0 <- Sys.time()
# Run genetic algorithm
GA_model_XGBoost_Linear <- GA::ga(
  type = "real-valued", 
  fitness = function(x) eval_function_XGBoost_Linear(x[1],x[2],x[3],x[4], 
                                                     data = train.data, 
                                                     train_settings = train_control), 
  lower = c(
    nrounds_min_max[1], eta_min_max[1], alpha_min_max[1], lambda_min_max[1]
  ), # minimum values
  upper = c(
    nrounds_min_max[2], eta_min_max[2], alpha_min_max[2], lambda_min_max[2]
  ), # maximum values
  popSize = pop_size, # population size
  maxiter = max_iter, # number of iterations
  pmutation = 0.5, # probability of mutation
  elitism = 0.3, # percentage of elitism (fraction of best current solutions used on next round)
  # suggestions = starting_point,
  parallel = n_cores, 
  optim = F, 
  keepBest = T,
  seed = 1
)
stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing
GA_T1 <- Sys.time()
GA_T1 - GA_T0

GA_XGBoost_Linear_model
saveRDS(object = GA_XGBoost_Linear_model, file = paste0("Models/GA_XGBoost_Linear_model_", stringr::str_remove_all(Sys.time(), pattern = ":"),".rds"))
saveRDS(object = GA_XGBoost_Linear_model$finalModel, file = paste0("Models/GA_XGBoost_Linear_model_", class(GA_XGBoost_Linear_model$finalModel)[1], "_", stringr::str_remove_all(Sys.time(), pattern = ":"),".rds"))
```

### GA - Summarize the Optimization

```{r}
summary(GA_model_XGBoost_Linear)
```

```{r}
GA::plot.ga(GA_model_XGBoost_Linear, main = "Genetic Algorithm: RMSE values at each iteration")
```

### GA - Test The Model with Optimized GE Settings

```{r}
# Grid of optimal hyperparameter values
GA_XGBoost_Linear_grid <- expand.grid(
  nrounds = round(GA_model_XGBoost_Linear@solution[1]),  # learning rate, low value means model is more robust to overfitting
  eta = 10^GA_model_XGBoost_Linear@solution[2], # number of boosting iterations
  alpha = 10^GA_model_XGBoost_Linear@solution[3], # L2 Regularization (Ridge Regression)
  lambda = 10^GA_model_XGBoost_Linear@solution[4] # L1 Regularization (Lasso Regression)
)

tic()
cluster <- makeCluster(detectCores() - 1) # number of cores, convention to leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing

set.seed(1)
# Train model with optimal values
GA_XGBoost_Linear_model_fin <- caret::train(
  Strength ~., 
  data = train.data, 
  method = "xgbLinear",
  trControl = train_control,
  verbose = F, metric = "RMSE", maximize = FALSE,
  silent = 1,
  # tuneLength = 1
  tuneGrid = GA_XGBoost_Linear_grid
)

stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing
toc()

saveRDS(object = GA_XGBoost_Linear_model_fin, file = paste0("Models/GA_XGBoost_Linear_model_fin_", stringr::str_remove_all(Sys.time(), pattern = ":"),".rds"))
```










## DE - Differential Evolution

The `DEoptim` package will be used to optimize the hyperparameters of the XGBoost model using differential evolution. In the same manner as the genetic algorithm above, we first define a custom objective function to be minimized: this will be the cross-validated RMSE. 


### DE - Create The Function For Assessing Solutions

```{r}
# Create custom function for assessing solutions
eval_function_XGBoost_Linear <- function(x, data, train_settings) {
  x1 <- x[1]
  x2 <- x[2]
  x3 <- x[3]
  x4 <- x[4]
  suppressWarnings(
    XGBoost_Linear_model <- caret::train(
      Strength ~., 
      data = data,
      method = "xgbLinear",
      trControl = train_settings,
      verbose = FALSE, 
      silent = 1,
      tuneGrid = expand.grid(
        nrounds = round(x1), # number of boosting iterations
        eta = 10^x2, # learning rate, low value means model is more robust to overfitting
        alpha = 10^x3, # L1 Regularization (equivalent to Lasso Regression) on weights
        lambda = 10^x4 # L2 Regularization (equivalent to ridge Regression) on weights
      ) 
    )
  )
  return(XGBoost_Linear_model$results$RMSE) # minimize RMSE
}

# Define minimum and maximum values for each input
nrounds_min_max <- c(10,10^3)
eta_min_max <- c(-5,3)
alpha_min_max <- c(-3,1)
lambda_min_max <- c(-3,1)

# Set parameter settings for search algorithm
max_iter <- 10 # maximum number of iterations
pop_size <- 40 # population size
```

### DE - Run The Optimization

Run the model (1344.78 sec elapsed)

```{r}
DE_T0 <- Sys.time()
set.seed(1)
n_cores <- detectCores() - 1
cluster <- makeCluster(detectCores() - 1) # number of cores, convention to leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing

# Run differential evolution algorithm
DE_model_XGBoost_Linear <- DEoptim::DEoptim(
  fn = eval_function_XGBoost_Linear, 
  lower = c(nrounds_min_max[1], eta_min_max[1], alpha_min_max[1], lambda_min_max[1]),
  upper = c(nrounds_min_max[2], eta_min_max[2], alpha_min_max[2], lambda_min_max[2]), 
  control = DEoptim.control(
    NP = pop_size, # population size
    itermax = max_iter, # maximum number of iterations
    CR = 0.5, # probability of crossover
    storepopfreq = 1, # store every population
    parallelType = 1 # run parallel processing
  ),
  data = train.data,
  train_settings = train_control
)
stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing
DE_T1 <- Sys.time()
DE_T1 - DE_T0

saveRDS(object = DE_XGBoost_Linear_model, file = paste0("Models/DE_XGBoost_Linear_model_", stringr::str_remove_all(Sys.time(), pattern = ":"),".rds"))
```

### DE - Summarize The Optimization

```{r}
# Print search results
summary(DE_model_XGBoost_Linear)

DE_solutions <- DE_model_XGBoost_Linear$optim$bestmem

# Plot results
ggplot(mapping = aes(x = 1:length(DE_model_XGBoost_Linear$member$bestvalit), 
                     y = DE_model_XGBoost_Linear$member$bestvalit)) +
    geom_line(col = "grey50") + 
    geom_point(col = "grey50") +
    theme_bw() +
    theme(aspect.ratio = 0.9) +
    labs(x = "Iteration", y = "RMSE", 
         title = "Best RMSE value at each iteration", 
         subtitle = "Results using Differential Evolution") +
    scale_x_continuous(breaks = 1:DE_model_XGBoost_Linear$optim$iter, 
                       minor_breaks = NULL)
```

### DE - Test The Model with Optimized DE Settings

```{r}
# Grid of optimal hyperparameter values
DE_XGBoost_Linear_grid <- expand.grid(
  nrounds = round(DE_solutions[1]),  # learning rate, low value means model is more robust to overfitting
  eta = 10^DE_solutions[2], # number of boosting iterations
  alpha = 10^DE_solutions[3], # L2 Regularization (Ridge Regression)
  lambda = 10^DE_solutions[4] # L1 Regularization (Lasso Regression)
)

tic()
cluster <- makeCluster(detectCores() - 1) # number of cores, convention to leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing

set.seed(1)
# Train model with optimal values
DE_XGBoost_Linear_model <- caret::train(
  Strength ~., 
  data = train.data, 
  method = "xgbLinear",
  trControl = train_control,
  verbose = F, metric = "RMSE", maximize = FALSE,
  silent = 1,
  # tuneLength = 1
  tuneGrid = DE_XGBoost_Linear_grid
)

stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing
toc()

saveRDS(object = DE_XGBoost_Linear_model_fin, file = paste0("Models/DE_XGBoost_Linear_model_fin_", stringr::str_remove_all(Sys.time(), pattern = ":"),".rds"))
```

```{r}
DE_XGBoost_Linear_model
```










## PSO - Particle Swarm Optimization

We will use the `pso` package to optimize the four hyperparameters. Just like the Genetic Algorithm and Differential Evolution ww will define a function to be optimized.


### PSO - Create The Function For Assessing Solutions

```{r}
# Create custom function for assessing solutions
eval_function_XGBoost_Linear <- function(x, data, train_settings) {
  x1 <- x[1]
  x2 <- x[2]
  x3 <- x[3]
  x4 <- x[4]
  
  suppressWarnings(
    # Create dataframe with proportion of each solid component
    XGBoost_Linear_model <- caret::train(
      Strength ~., 
      data = data,
      method = "xgbLinear",
      trControl = train_settings,
      verbose = FALSE, 
      silent = 1,
      tuneGrid = expand.grid(
        nrounds = round(x1), # number of boosting iterations
        eta = 10^x2, # learning rate, low value means model is more robust to overfitting
        alpha = 10^x3, # L1 Regularization (equivalent to Lasso Regression) on weights
        lambda = 10^x4 # L2 Regularization (equivalent to ridge Regression) on weights
      ) 
    )
)
    return(XGBoost_Linear_model$results$RMSE) # minimize RMSE

}

# Define minimum and maximum values for each input
nrounds_min_max <- c(10,10^3)
eta_min_max <- c(-5,3)
alpha_min_max <- c(-3,1)
lambda_min_max <- c(-3,1)
```

### PSO - Run The Optimization

We can now use the `pso::psoptim()` function to run the search

```{r}
# Set parameter settings for search algorithm
max_iter <- 10 # maximum number of iterations
pop_size <- 10 # population size

set.seed(1)
cluster <- makeCluster(detectCores() - 1) # number of cores, convention to leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing
n_cores <- detectCores()-1

PSO_T0 <- Sys.time()
# Run search algorithm
PSO_model_XGBoost_Linear <- pso::psoptim(
  par = rep(NA, 4),
  fn = eval_function_XGBoost_Linear, 
  lower = c(nrounds_min_max[1], eta_min_max[1], alpha_min_max[1], lambda_min_max[1]),
  upper = c(nrounds_min_max[2], eta_min_max[2], alpha_min_max[2], lambda_min_max[2]), 
  control = list(
    trace = 1, #  produce tracing information on the progress of the optimization
    maxit = max_iter, # maximum number of iterations
    REPORT = 1, #  frequency for reports
    trace.stats = T,
    s = pop_size, # Swarm Size,
    maxit.stagnate = round(0.75*max_iter), # maximum number of iterations without improvement
    vectorize = T,
    type = "SPSO2011" # method used
  ),
  data = train.data,
  train_settings = train_control
)

stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing
PSO_T1 <- Sys.time()
PSO_T1 - PSO_T0

saveRDS(object = PSO_XGBoost_Linear_model, file = paste0("Models/PSO_XGBoost_Linear_model_", stringr::str_remove_all(Sys.time(), pattern = ":"),".rds"))
saveRDS(object = PSO_XGBoost_Linear_model$finalModel, file = paste0("Models/PSO_XGBoost_Linear_model_", class(PSO_XGBoost_Linear_model$finalModel)[1], "_", stringr::str_remove_all(Sys.time(), pattern = ":"),".rds"))
```

### PSO - Summarize The Optimization

```{r}
PSO_summary <- data.frame(
  Iteration = PSO_model_XGBoost_Linear$stats$it,
  Mean = PSO_model_XGBoost_Linear$stats$f %>% sapply(FUN = mean),
  Median = PSO_model_XGBoost_Linear$stats$f %>% sapply(FUN = median),
  Best = PSO_model_XGBoost_Linear$stats$error %>% sapply(FUN = min)
)
PSO_summary %>% 
  gather(key = "Parameter", value = "Value", - Iteration) %>% 
  ggplot(mapping = aes(x = Iteration, y = Value, col = Parameter)) +
  geom_line() +
  geom_point() +
  theme_bw() +
  theme(aspect.ratio = 0.9) +
  scale_x_continuous(breaks = PSO_model_XGBoost_Linear$stats$it, minor_breaks = NULL) +
  labs(x = "Iteration", y = "RMSE", title = "RMSE values at each iteration", subtitle = "Results using Particle Swarm Optimization") +
  scale_color_brewer(type = "qual", palette = "Set1")
```

### PSO - Test The Model with Optimized PSO Settings

```{r}
# Grid of optimal hyperparameter values
PSO_XGBoost_Linear_grid <- expand.grid(
  nrounds = round(PSO_model_XGBoost_Linear$par[1]),  # number of boosting iterations
  lambda = PSO_model_XGBoost_Linear$par[4], # L1 Regularization (Lasso Regression)
  alpha = PSO_model_XGBoost_Linear$par[3], # L2 Regularization (Ridge Regression)
  eta = PSO_model_XGBoost_Linear$par[2] # learning rate, low value means model is more robust to overfitting
)

tic()
cluster <- makeCluster(detectCores() - 1) # number of cores, convention to leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing

set.seed(1)
# Train model with optimal values
#modelLookup(model = "xgbLinear")
PSO_XGBoost_Linear_model_fin <- caret::train(
  Strength ~., 
  data = train.data, 
  method = "xgbLinear",
  trControl = train_control,
  verbose = F, metric = "RMSE", maximize = FALSE,
  silent = 1,
  tuneGrid = PSO_XGBoost_Linear_grid
)

stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing
toc()

PSO_XGBoost_Linear_model_fin

saveRDS(object = PSO_XGBoost_Linear_model_fin, file = paste0("Models/PSO_XGBoost_Linear_model_fin", stringr::str_remove_all(Sys.time(), pattern = ":"),".rds"))
```









## BO - Bayesian Optimization

Bayesian optimization is an approach to optimizing objective functions that take a long time (minutes or hours) to evaluate. It is best-suited for optimization over continuous domains of less than 20 dimensions, and tolerates stochastic noise in function evaluations. It builds a surrogate for the objective and quantifies the uncertainty in that surrogate using a Bayesian machine learning technique, Gaussian process regression, and then uses an acquisition function defined from this surrogate to decide where to sample.

Optimization by Bayesian methods [spans volumes](http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf), this will hit only the high points to get the job done. The methods here are very flexible and can be applied to a variety of ML algorithms. The schemes fit into two toolset camps.

**The `mlr` Toolset**

The `mlr` camp has a large suite of related tools for defining, running, tuning and evaluating ML models. 

**The `caret` Toolset**

The `caret` camp follows the model definition and logic used thus far, *vide supra*.

- The `caret` tools.
   
As before, define a function to be optimized. I'm going to use the same definition as the one used for the particle swarm optimization and modify it to return the RMSE and the cross-validated preditions, as required by `rBayesianOptimization`.

### BO - Create The Function For Assessing Solutions

```{r, eval = TRUE}
# Create custom function for assessing solutions
# test the function with
#GA results: 
#Iterations             = 10 
#Fitness function value = -4.462447 
#Solution = 
#           x1       x2         x3       x4
#[1,] 373.7826 1.127141 -0.3498167 0.230558
#x <- list(x1 = 373.7826, x2 = 1.127141, x3 = -0.3498167, x4 = 0.230558)
# elapsed = 19.19	Round = 57	x1 = 635.0000	x2 = 3.0000	x3 = -2.2709	x4 = 1.0000	Value = -4.3298 
#eval_function_XGBoost_Linear(x1 = 635, x2 = 3, x3 = -2.2709, x4 = 1.00, train_settings = train_control)
# returns [1] [1] -4.625419

eval_function_XGBoost_Linear <- function(x1, x2, x3, x4, train_settings) {

  #  suppressWarnings(
  # Create dataframe with proportion of each solid component
  XGBoost_Linear_model <- caret::train(
    Strength ~., 
    data = train.data,
    method = "xgbLinear",
    trControl = train_control,
    verbose = FALSE, 
    silent = 1,
    tuneGrid = expand.grid(
      nrounds = round(x1), # number of boosting iterations
      eta = 10^x2, # learning rate, low value means model is more robust to overfitting
      alpha = 10^x3, # L1 Regularization (equivalent to Lasso Regression) on weights
      lambda = 10^x4 # L2 Regularization (equivalent to ridge Regression) on weights
    )
  )
  return(list(Score = -XGBoost_Linear_model$results$RMSE, Pred = 0))
#    return(list(Score = XGBoost_Linear_model$results$RMSE, Pred = XGBoost_Linear_model$pred))
}
```

### BO - Create An Initialization Grid Based On The Grid Search

We want to include the best 30 points from the grid search and augment those with 20 more random points within the sample space.

```{r}
init_grid_dt <- GS_XGBoost_Linear_model$results %>% 
  arrange(RMSE) %>%
  unique() %>%
  .[1:20,] %>% 
  select(nrounds, eta, lambda, alpha, RMSE) %>% 
  mutate(x1 = nrounds) %>%
  mutate(x2 = log10(eta)) %>%
  mutate(x3 = log10(alpha)) %>%
  mutate(x4 = log10(lambda)) %>%
  mutate(Value = -RMSE) %>%
  .[,6:10]
```

### BO - Run the optimization

```{r, eval = TRUE}
BO_T0 <- Sys.time()
cluster <- makeCluster(detectCores() - 1) # number of cores, leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing

set.seed(1)
# Define minimum and maximum values for each input

#x
#train_control
BO_XGBoost_Linear_model <- BayesianOptimization(eval_function_XGBoost_Linear,
                                bounds = list(x1 = c(10L,1000L),
                                              x2 = c(-5,3),
                                              x3 = c(-3,1),
                                              x4 = c(-3,1)),
                                init_grid_dt = init_grid_dt, init_points = 21, n_iter = 30,
                                acq = "ucb", kappa = 2.576, eps = 0.0,
                                verbose = TRUE)
stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing
BO_T1 <- Sys.time()
BO_T1 - BO_T0

saveRDS(object = BO_XGBoost_Linear_model, file = paste0("Models/BO_XGBoost_Linear_model_", stringr::str_remove_all(Sys.time(), pattern = ":"),".rds"))
saveRDS(object = BO_XGBoost_Linear_model$finalModel, file = paste0("Models/BO_XGBoost_Linear_model_", class(BO_XGBoost_Linear_model$finalModel)[1], "_", stringr::str_remove_all(Sys.time(), pattern = ":"),".rds"))
```

### BO - Plot the Results

```{r, fig.height=3, fig.width=8}
data_points <- BO_XGBoost_Linear_model$History
data_points$Round[1:20] <- "Grid Search Priming" 
data_points$Round[21:41] <- "Acquisition Sampling"
data_points$Round[42:71] <- "Bayesian Optimizing"
data_points$Value <- data_points$Value * -1.0
colnames(data_points) <- c("Optimization Phase", "nrounds", "eta", "alpha", "lambda", "RMSE")

data_points %>%
  pivot_longer(cols = c("nrounds", "eta", "alpha", "lambda", "RMSE")) %>%
  ggplot(aes(y = value, x = seq_along(`Optimization Phase`), col = `Optimization Phase`)) +
  geom_line() +
  facet_wrap(~name, scales = "free_y", ncol = 3) +
    scale_x_continuous(name = "Optimization Phase") +
  labs(title = "Bayesian Optimization With Priming") +
    theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  scale_y_continuous(name = "")
```

### BO - Test The Model with Optimized BO Settings

```{r}
# Grid of optimal hyperparameter values
#BO_XGBoost_Linear_grid
#Round = 54	x1 = 390.0000	x2 = -0.5757	x3 = -1.6649	x4 = 0.9707	Value = -4.4336 
#BO_XGBoost_Linear_grid <- expand.grid(
#  nrounds = 390,  # number of boosting iterations
#  lambda = 10^0.9707, # L1 Regularization (Lasso Regression)
#  alpha = 10^-1.6649, # L2 Regularization (Ridge Regression)
#  eta = 10^-0.5757^10 # learning rate, low value means model is more robust to overfitting
#)


BO_XGBoost_Linear_grid <- expand.grid(
  nrounds = round(BO_XGBoost_Linear_model$Best_Par[1]),  # number of boosting iterations
  eta = 10^BO_XGBoost_Linear_model$Best_Par[2], # learning rate, low value means model is more robust to overfitting
  alpha = 10^BO_XGBoost_Linear_model$Best_Par[3], # L2 Regularization (Ridge Regression)
  lambda = 10^BO_XGBoost_Linear_model$Best_Par[4] # L1 Regularization (Lasso Regression)
)

tic()
cluster <- makeCluster(detectCores() - 1) # number of cores, convention to leave 1 core for OS
registerDoParallel(cluster) # register the parallel processing

set.seed(1)
# Train model with optimal values
#modelLookup(model = "xgbLinear")
BO_XGBoost_Linear_model_fin <- caret::train(
  Strength ~., 
  data = train.data, 
  method = "xgbLinear",
  trControl = train_control,
  verbose = F, metric = "RMSE", maximize = FALSE,
  silent = 1,
  tuneGrid = BO_XGBoost_Linear_grid
)

stopCluster(cluster) # shut down the cluster 
registerDoSEQ() #  force R to return to single threaded processing
toc()

BO_XGBoost_Linear_model_fin

saveRDS(object = BO_XGBoost_Linear_model_fin, file = paste0("Models/BO_XGBoost_Linear_model_fin", stringr::str_remove_all(Sys.time(), pattern = ":"),".rds"))
```






# Summary of All Training Runs

```{r}
# Create summary table
Summary_Table_Training <- bind_rows(
  GS_XGBoost_Linear_model_fin$results %>% arrange(RMSE) %>% .[1,] %>% select(RMSE, MAE, Rsquared, nrounds, eta, lambda, alpha) %>% round(5),
  RS_XGBoost_Linear_model_fin$results %>% arrange(RMSE) %>% .[1,] %>% select(RMSE, MAE, Rsquared, nrounds, eta, lambda, alpha) %>% round(5),
  GA_XGBoost_Linear_model_fin$results %>% arrange(RMSE) %>% .[1,] %>% select(RMSE, MAE, Rsquared, nrounds, eta, lambda, alpha) %>% round(5),
  DE_XGBoost_Linear_model_fin$results %>% arrange(RMSE) %>% .[1,] %>% select(RMSE, MAE, Rsquared, nrounds, eta, lambda, alpha) %>% round(5),
  PSO_XGBoost_Linear_model_fin$results %>% arrange(RMSE) %>% .[1,] %>% select(RMSE, MAE, Rsquared, nrounds, eta, lambda, alpha) %>% round(5),
  BO_XGBoost_Linear_model_fin$results %>% arrange(RMSE) %>% .[1,] %>% select(RMSE, MAE, Rsquared, nrounds, eta, lambda, alpha) %>% round(5))

Summary_Table_Training <- Summary_Table_Training %>% 
  add_column(Method = c("Grid Search", "Adaptive Random Search", "Genetic Algorithm", "Differential Evolution", "Particle Swarm Optimization", "Bayesian Optimization"), .before = 1) %>% 
  add_column(`Processing Time` = round(c(GS_T1-GS_T0, RS_T1-RS_T0, GA_T1-GA_T0, DE_T1-DE_T0, PSO_T1-PSO_T0, BO_T1- BO_T0),0))

#Summary_Table_Training <- Summary_Table_Training %>% 
#  add_column(Method = c("Grid Search", "Random Search", "Genetic Algorithm", "Differential Evolution", "Particle #Swarm Optimization"), .before = 1)

# Print table
Summary_Table_Training %>% 
  kable(align = "c", caption = "Training Set Statistics and Hyperparameter Values of XGBoost Models.") %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = T, position = "center") %>%
  kableExtra::footnote(general = paste0("Note:\nSummary statistics obtained using ", CV_folds, "-fold cross validation repeated ", CV_repeats, " times.","\nGrid and Random Search  performed with adaptive resampling."), general_title = "\n ")
```

# Performance On Test Set

```{r}
#Custom functions to plot observed, predicted and residual values for each method
library(ggplot2, quietly = T, verbose = F)

# Function to plot observed vs predicted values
predicted_observed_plot <- function(predicted_val, observed_val, residual_val, model_name = "", R_squared, ...) {
  
  plot <- ggplot(mapping = aes(x = predicted_val, y = observed_val, col = abs(residual_val))) +
  geom_point(alpha = 0.9, size = 2) +
  geom_abline(intercept = 0, slope = 1) +
    # facet_wrap(~) +
    labs(title = paste0(model_name, "\nPredicted vs Observed: Test Set"),
         subtitle = paste0("R-squared: ", R_squared),
         x = "Predicted",
         y = "Observed",
         col = "Absolute Deviation") +
  theme_bw() +
  theme(aspect.ratio = 0.9, panel.grid.minor.x = element_blank(), legend.title = element_text(size = 10, face="bold"), legend.text = element_text(size = 9), plot.title = element_text(size=12, face="bold"), axis.title=element_text(size=10, face="bold"), axis.text.x = element_text(angle = 0), legend.position = "none") +
  # scale_x_continuous(expand = c(0,0)) +
  # scale_y_continuous(expand = c(0,0)) + 
  coord_equal() + scale_color_viridis_c(direction = -1)

  return (plot)
}

# Function to plot residuals
residuals_plot <- function(predicted_val, residual_val, model_name = "", MAE, RMSE, ...) {

  plot <- ggplot(mapping = aes(x = predicted_val, y = residual_val, col = abs(residual_val))) +
  geom_point(alpha = 0.9, size = 2) +
  geom_abline(intercept = 0, slope = 0) +
    # facet_wrap(~) +
    labs(
       title = paste0(model_name, "\nResiduals: Test Set"),
       subtitle = paste0("RMSE: ", RMSE, ", MAE: ", round(MAE, 3)),
       x = "Predicted",
       y = "Residual",
       col = "Absolute Deviation"
       ) +
  theme_bw() +
  theme(aspect.ratio = 0.9, panel.grid.minor.x = element_blank(), legend.title = element_text(size = 10, face="bold"), legend.text = element_text(size = 9), plot.title = element_text(size=12, face="bold"), axis.title=element_text(size=10, face="bold"), axis.text.x = element_text(angle = 0), legend.position = "none") +
  # scale_x_continuous(expand = c(0,0)) +
  # scale_y_continuous(expand = c(0,0)) +
  coord_equal() + scale_color_viridis_c(direction = -1)

  return (plot)
}
```

### GS - Residual Plots

```{r}
### Grid Search (GS)

# Make predictions on test set
test.data$XGBoost_Linear_GS <- predict(GS_XGBoost_Linear_model_fin, test.data)
# Calculate Residuals on test set
test.data$XGBoost_Linear_GS_residual <- test.data$Strength - test.data$XGBoost_Linear_GS

# Calculate test set R-squared, RMSE, MAE
R_squared <- round(cor(test.data$XGBoost_Linear_GS, test.data$Strength), 4)
RMSE <- signif(RMSE(pred = test.data$XGBoost_Linear_GS, obs = test.data$Strength, na.rm = T), 6)
MAE <- signif(MAE(pred = test.data$XGBoost_Linear_GS, obs = test.data$Strength), 6)

GS_test.data_Statistics <- c(RMSE, MAE, R_squared)

# Plot predicted vs observed values and residuals
XGBoost_Linear_GS_pred_obs <- predicted_observed_plot(predicted_val = test.data$XGBoost_Linear_GS, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_GS_residual, R_squared = R_squared, model_name = "Grid Search")
XGBoost_Linear_GS_residuals <- residuals_plot(predicted_val = test.data$XGBoost_Linear_GS, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_GS_residual, MAE = MAE, RMSE = RMSE, model_name = "Grid Search")
gridExtra::grid.arrange(XGBoost_Linear_GS_pred_obs, XGBoost_Linear_GS_residuals, 
                             ncol = 2)
```

### RS - Residual Plots

```{r}
# Make predictions on test set
test.data$XGBoost_Linear_RS <- predict(RS_XGBoost_Linear_model_fin, test.data)
# Calculate Residuals on test set
test.data$XGBoost_Linear_RS_residual <- test.data$Strength - test.data$XGBoost_Linear_RS

# Calculate test set R-squared, RMSE, MAE
R_squared <- round(cor(test.data$XGBoost_Linear_RS, test.data$Strength), 4)
RMSE <- signif(RMSE(pred = test.data$XGBoost_Linear_RS, obs = test.data$Strength, na.rm = T), 6)
MAE <- signif(MAE(pred = test.data$XGBoost_Linear_RS, obs = test.data$Strength), 6)

RS_test.data_Statistics <- c(RMSE, MAE, R_squared)

# Plot predicted vs observed values and residuals
XGBoost_Linear_RS_pred_obs <- predicted_observed_plot(predicted_val = test.data$XGBoost_Linear_RS, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_RS_residual, R_squared = R_squared, model_name = "Random Seach")
XGBoost_Linear_RS_residuals <- residuals_plot(predicted_val = test.data$XGBoost_Linear_RS, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_RS_residual, MAE = MAE, RMSE = RMSE, model_name = "Random Seach")
gridExtra::grid.arrange(XGBoost_Linear_RS_pred_obs, XGBoost_Linear_RS_residuals, 
                             ncol = 2)
```

### GA - Residual Plots

```{r}
### Genetic Algorithm (GA)

# Make predictions on test set
test.data$XGBoost_Linear_GA <- predict(GA_XGBoost_Linear_model_fin, test.data)
# Calculate Residuals on test set
test.data$XGBoost_Linear_GA_residual <- test.data$Strength - test.data$XGBoost_Linear_GA

# Calculate test set R-squared, RMSE, MAE
R_squared <- round(cor(test.data$XGBoost_Linear_GA, test.data$Strength), 4)
RMSE <- signif(RMSE(pred = test.data$XGBoost_Linear_GA, obs = test.data$Strength, na.rm = T), 6)
MAE <- signif(MAE(pred = test.data$XGBoost_Linear_GA, obs = test.data$Strength), 6)

GA_test.data_Statistics <- c(RMSE, MAE, R_squared)

# Plot predicted vs observed values and residuals
XGBoost_Linear_GA_pred_obs <- predicted_observed_plot(predicted_val = test.data$XGBoost_Linear_GA, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_GA_residual, R_squared = R_squared, model_name = "Genetic Algorithm")
XGBoost_Linear_GA_residuals <- residuals_plot(predicted_val = test.data$XGBoost_Linear_GA, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_GA_residual, MAE = MAE, RMSE = RMSE, model_name = "Genetic Algorithm")
gridExtra::grid.arrange(XGBoost_Linear_GA_pred_obs, XGBoost_Linear_GA_residuals, 
                             ncol = 2)

```

### DE - Residual Plots

```{r}
### Differential evolution (DE)

# Make predictions on test set
test.data$XGBoost_Linear_DE <- predict(DE_XGBoost_Linear_model_fin, test.data)
# Calculate Residuals on test set
test.data$XGBoost_Linear_DE_residual <- test.data$Strength - test.data$XGBoost_Linear_DE

# Calculate test set R-squared, RMSE, MAE
R_squared <- round(cor(test.data$XGBoost_Linear_DE, test.data$Strength), 4)
RMSE <- signif(RMSE(pred = test.data$XGBoost_Linear_DE, obs = test.data$Strength, na.rm = T), 6)
MAE <- signif(MAE(pred = test.data$XGBoost_Linear_DE, obs = test.data$Strength), 6)

DE_test.data_Statistics <- c(RMSE, MAE, R_squared)

# Plot predicted vs observed values and residuals
XGBoost_Linear_DE_pred_obs <- predicted_observed_plot(predicted_val = test.data$XGBoost_Linear_DE, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_DE_residual, R_squared = R_squared, model_name = "Differential Evolution")
XGBoost_Linear_DE_residuals <- residuals_plot(predicted_val = test.data$XGBoost_Linear_DE, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_DE_residual, MAE = MAE, RMSE = RMSE, model_name = "Differential Evolution")
gridExtra::grid.arrange(XGBoost_Linear_DE_pred_obs, XGBoost_Linear_DE_residuals, 
                             ncol = 2)

```

### PSO - Residual Plots

```{r}
### Particle Swarm Optimization (PSO)

# Make predictions on test set
test.data$XGBoost_Linear_PSO <- predict(PSO_XGBoost_Linear_model_fin, test.data)
# Calculate Residuals on test set
test.data$XGBoost_Linear_PSO_residual <- test.data$Strength - test.data$XGBoost_Linear_PSO

# Calculate test set R-squared, RMSE, MAE
R_squared <- round(cor(test.data$XGBoost_Linear_PSO, test.data$Strength), 4)
RMSE <- signif(RMSE(pred = test.data$XGBoost_Linear_PSO, obs = test.data$Strength, na.rm = T), 6)
MAE <- signif(MAE(pred = test.data$XGBoost_Linear_PSO, obs = test.data$Strength), 6)

PSO_test.data_Statistics <- c(RMSE, MAE, R_squared)

# Plot predicted vs observed values and residuals
XGBoost_Linear_PSO_pred_obs <- predicted_observed_plot(predicted_val = test.data$XGBoost_Linear_PSO, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_PSO_residual, R_squared = R_squared, model_name = "Particle Swarm Optimisation")
XGBoost_Linear_PSO_residuals <- residuals_plot(predicted_val = test.data$XGBoost_Linear_PSO, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_PSO_residual, MAE = MAE, RMSE = RMSE, model_name = "Particle Swarm Optimisation")
gridExtra::grid.arrange(XGBoost_Linear_PSO_pred_obs, XGBoost_Linear_PSO_residuals, 
                             ncol = 2)

```

### BO - Residual Plots

```{r}
### Bayesian Optimization)

# Make predictions on test set
test.data$XGBoost_Linear_BO <- predict(BO_XGBoost_Linear_model_fin, test.data)
# Calculate Residuals on test set
test.data$XGBoost_Linear_BO_residual <- test.data$Strength - test.data$XGBoost_Linear_BO

# Calculate test set R-squared, RMSE, MAE
R_squared <- round(cor(test.data$XGBoost_Linear_BO, test.data$Strength), 4)
RMSE <- signif(RMSE(pred = test.data$XGBoost_Linear_BO, obs = test.data$Strength, na.rm = T), 6)
MAE <- signif(MAE(pred = test.data$XGBoost_Linear_BO, obs = test.data$Strength), 6)

BO_test.data_Statistics <- c(RMSE, MAE, R_squared)

# Plot predicted vs observed values and residuals
XGBoost_Linear_BO_pred_obs <- predicted_observed_plot(predicted_val = test.data$XGBoost_Linear_BO, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_BO_residual, R_squared = R_squared, model_name = "Bayesian Optimization")
XGBoost_Linear_BO_residuals <- residuals_plot(predicted_val = test.data$XGBoost_Linear_BO, observed_val = test.data$Strength, residual_val = test.data$XGBoost_Linear_BO_residual, MAE = MAE, RMSE = RMSE, model_name = "Bayesian Optimization")
gridExtra::grid.arrange(XGBoost_Linear_BO_pred_obs, XGBoost_Linear_BO_residuals, 
                             ncol = 2)

```

### Create Summary Table

```{r}
# Create summary table
Summary_Table_Test <- rbind(
  GS_test.data_Statistics,
  RS_test.data_Statistics,
  GA_test.data_Statistics,
  DE_test.data_Statistics,
  PSO_test.data_Statistics, 
  BO_test.data_Statistics,
  deparse.level = 0 
  ) %>% data.frame()

colnames(Summary_Table_Test) <- c("RMSE", "MAE", "R-squared")
Summary_Table_Test <- Summary_Table_Test %>% add_column(Method = c("Grid Search", "Random Search", "Genetic Algorithm", "Differential Evolution", "Particle Swarm Optimization", "Bayesian Optimization"), .before = 1)

# Print table
Summary_Table_Test %>% 
  kable(align = "c", caption = "Test Set Statistics of XGBoost Models.") %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = T, position = "center") %>%
  kableExtra::footnote(general = paste0(""), general_title = "\n ")
```

```{r}
#g <- gridExtra::grid.arrange(XGBoost_Linear_GS_pred_obs, XGBoost_Linear_GS_residuals, 
#                             XGBoost_Linear_RS_pred_obs, XGBoost_Linear_RS_residuals, 
#                             XGBoost_Linear_GA_pred_obs, XGBoost_Linear_GA_residuals, 
#                             XGBoost_Linear_DE_pred_obs, XGBoost_Linear_DE_residuals, 
#                             XGBoost_Linear_PSO_pred_obs, XGBoost_Linear_PSO_residuals, 
#                             XGBoost_Linear_BO_pred_obs, XGBoost_Linear_BO_residuals,
#                             ncol = 2)
```

# Summary

The Bayesian approuch with primed values is superior to all other approaches in terms of accuracy and computing resources.

