load the following libraries:

library(Matrix)
library(lme4)
library(data.table)
library(magrittr)
library(xgboost)
library(foreach)
library(doMC)
library(knitr)
library(caret)

load some custom functions:

source("~/CrossValidation/CV_functions.R")

data from northeast US:

db0us <- data.table(fst::read_fst("~/PM_USA/data/mod0_allyears.fst")) 
pp <- preproc(db0 = db0us, years = 2015:2015, obs_day = 50, obs_stn = 50) # some data pre-processing 
db0 <- pp$db_processed

Single tree

library(rpart)
library(rpart.plot)
tree.1 <- rpart(pm25 ~ air.2m+evap+hpbl+pres.sfc+pr_wtr+uwnd.10m+DevH_P1km+DevM_P1km+ 
                           aothat.aqua+reg+ time_or + 
                           month, data=db0)

rpart.plot::rpart.plot(tree.1)

Trees are very prone to overfitting. To avoid this, we reduce a tree’s complexity by pruning it. This is done with the prune function.

A common alternative is to use the smallest tree that is within 1 standard error of the best tree. The reason for this is that, given the CV estimates of the error, the smallest tree within 1 standard error predicts about as well as the best (lowest CV error) tree, yet it does so with fewer “terms” and is therefore less likely to overfit.

we can find a good value of the complexity parameter (cp) for pruning from a plot of cost-complexity vs tree size for the un-pruned tree.

plotcp(tree.1)

tree.2 <- prune(tree.1,cp = 0.015)
rpart.plot(tree.2)
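
Instead of reading cp off the plot, the 1-SE rule described above can be applied programmatically from the CV results stored in the tree’s cptable. A minimal sketch, assuming tree.1 from the chunk above (cp_1se and tree.1se are just new names used here):

# choose cp by the 1-SE rule from the cross-validation results in the cptable
cptab  <- tree.1$cptable
best   <- which.min(cptab[, "xerror"])                          # lowest CV error
thresh <- cptab[best, "xerror"] + cptab[best, "xstd"]           # 1-SE threshold
cp_1se <- cptab[min(which(cptab[, "xerror"] <= thresh)), "CP"]  # smallest tree within 1 SE

tree.1se <- prune(tree.1, cp = cp_1se)
rpart.plot(tree.1se)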

Multiple-tree models

single trees usually produce noisy (bushy) or weak (stunted) classifiers.

for this reason it is better to use one of the following (a short random-forest sketch follows this list):

  • Bagging (bootstrap aggregating): fit many large trees to bootstrap-resampled versions of the data (“shake the data”), and aggregate them into a single classifier (e.g. in classification, by majority vote).
  • Random Forest: a fancier version of bagging that reduces variance by decorrelating the trees (a different random subset of features is considered in each tree/split).
  • Boosting: fit many large or small trees to reweighted versions of the training data, and aggregate them into a single classifier.
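
As a quick illustration of the bagging / random-forest idea, a hedged sketch with the randomForest package on the same db0 data and predictors as above (it assumes the categorical columns, e.g. reg and month, are stored as factors, and drops incomplete rows; rf.1 is just a new name):

library(randomForest)

# the default mtry (p/3 for regression) already gives a random forest;
# setting mtry equal to the number of predictors would give plain bagging
rf.1 <- randomForest(pm25 ~ air.2m+evap+hpbl+pres.sfc+pr_wtr+uwnd.10m+DevH_P1km+DevM_P1km+
                            aothat.aqua+reg+time_or+month,
                     data = db0, ntree = 200, na.action = na.omit)
rf.1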

Boosting

How it works:

  1. Initialize the outcome
  2. Iterate from 1 to the total number of trees:
    2.1 Update the weights for the targets based on the previous run (higher for the ones mis-classified)
    2.2 Fit the model on a selected subsample of the data
    2.3 Make predictions on the full set of observations
    2.4 Update the output with the current results, taking into account the learning rate
  3. Return the final output.
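
To make the loop concrete, here is a toy sketch of boosting for a squared-error regression: each iteration fits a shallow rpart tree to the current residuals (which play the role of the re-weighted targets) and updates the prediction with a learning rate. This only illustrates the idea, not the exact scheme of any particular package; it assumes db0 and rpart from above, uses an arbitrary subset of predictors, and drops incomplete rows:

eta_lr <- 0.1    # learning rate
ntrees <- 50     # total number of trees

boost_dat <- na.omit(as.data.frame(db0[, .(pm25, air.2m, hpbl, aothat.aqua, month)]))
f_hat <- rep(mean(boost_dat$pm25), nrow(boost_dat))    # 1. initialize the outcome

for (k in 1:ntrees) {                                  # 2. iterate over trees
  boost_dat$res <- boost_dat$pm25 - f_hat              #    residuals = the "re-weighted" targets
  stump <- rpart(res ~ air.2m + hpbl + aothat.aqua + month,
                 data = boost_dat, control = rpart.control(maxdepth = 2))
  f_hat <- f_hat + eta_lr * predict(stump, boost_dat)  #    update, scaled by the learning rate
}

sqrt(mean((boost_dat$pm25 - f_hat)^2))                 # 3. in-sample RMSE of the final output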

Updating weights:

additive training:

for \(i \in 1,...,n\):

\[ \begin{align} \hat{y}^{(0)}_i &= 0 \\ \hat{y}^{(1)}_i &= f_1(x_i) = \hat{y}^{(0)}_i + f_1(x_i) \\ \hat{y}^{(2)}_i &= f_2(x_i) = \hat{y}^{(1)}_i + f_2(x_i) \\ ... \\ \hat{y}^{(t)}_i &= \sum_{k=1}^t f_k(x_i) = \hat{y}^{(t-1)}_i + f_t(x_i) \\ \end{align}\]

at each step we add the tree that optimizes our objective function. here is the objective function definition:

\[ \begin{align} obj^{(t)} &= \sum_{i=1}^n l(y_i,\hat{y}^{(t)}_i)+\sum_{k=1}^t\Omega(f_k) \\ &= \sum_{i=1}^n l(y_i,\hat{y}^{(t-1)}_i + f_t(x_i)) + \sum_{k=1}^t\Omega(f_k) \end{align}\]

where the regularization term is \(\Omega(f)\)

to understand the regularization term, first refine the definition of a tree \(f_t(x)\):

\[ f_t(x) = w_{q(x)}\]

where \(w \in R^T\) and \(q:R^d \to \{1,...,T\}\), with \(d\) the number of features.

Here \(w\) is the vector of scores on leaves, \(q\) is a function assigning each data point to the corresponding leaf, and \(T\) is the number of leaves. In XGBoost, we define the complexity as

\[ \Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2\]
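
Following the standard derivation in the XGBoost documentation (a second-order Taylor expansion of the loss, with \(g_i\) and \(h_i\) the first and second derivatives of \(l\) with respect to \(\hat{y}^{(t-1)}_i\), and \(I_j\) the set of observations assigned to leaf \(j\)), this choice of \(\Omega\) gives a closed-form optimal score for each leaf and a score for the whole tree structure:

\[ \begin{align} obj^{(t)} &\approx \sum_{j=1}^T \left[ G_j w_j + \frac{1}{2}(H_j+\lambda)w_j^2 \right] + \gamma T, \quad G_j = \sum_{i \in I_j} g_i, \; H_j = \sum_{i \in I_j} h_i \\ w_j^* &= -\frac{G_j}{H_j+\lambda}, \qquad obj^* = -\frac{1}{2}\sum_{j=1}^T \frac{G_j^2}{H_j+\lambda} + \gamma T \end{align}\]

Candidate splits are scored by the change in this last expression, and \(\gamma\) acts as the minimum loss reduction required to make a split (see the gamma parameter below).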

boosting parameters:

there are 3 categories of parameters:

  1. Tree-Specific Parameters: affect each individual tree in the model.
  2. Boosting Parameters: affect the boosting operation in the model.
  3. Miscellaneous Parameters: Other parameters for overall functioning.

some of the tree-specific parameters (e.g. max_depth, min_child_weight, gamma, subsample) are described in the xgboost parameters section below.

more on Gradient Boosting:

Trevor Hastie - Gradient Boosting Machine Learning

xgboost

simple use of xgboost

random_split <- CV_split(db = db0, validation = "tenfold", trn_ratio = 0.7)

db_trn <- random_split$db_trn
db_tst <- random_split$db_tst
    
X <- sparse.model.matrix(~ air.2m+evap+hpbl+pres.sfc+pr_wtr+uwnd.10m+DevH_P1km+DevM_P1km+ # LU
                           aothat.aqua+reg+ time_or+month,
                         data = db_trn[,-c("pm25")])

y  <- db_trn$pm25
dtrain <- xgb.DMatrix(data = X, label = y)

X_tst <- sparse.model.matrix(~ air.2m+evap+hpbl+pres.sfc+pr_wtr+uwnd.10m+DevH_P1km+DevM_P1km+ # LU
                               aothat.aqua+reg+ time_or+month,
                             data = db_tst[,-c("pm25")])
y_tst <- db_tst$pm25

# run xgb with default hyperparameters:
m_xgb <- xgb.train(data = dtrain, nrounds = 100, nthread=1)

# predict:
p_xgb <- predict(m_xgb,X_tst)

# evaluate:
sqrt(mean((p_xgb-y_tst)^2))
## [1] 2.384479

plot it:

# library(DiagrammeR)
# xgb.plot.tree(model = m_xgb)

how it works

TODO

read this

Tuning

Building a model using xgboost is easy, but improving the model is difficult. The algorithm has many hyper-parameters, and to improve the model, parameter tuning is a must.

Tuning is an art.

Which set of hyper-parameters to tune? What is the ideal value of these parameters?

It comes with experience…

xgboost parameters

read:

full list is here

XGBoost Parameter Tuning (in Python) tutorial

the most important:

  • Booster Parameters: (some are tree-specific and some affect the boosting operation)
    • eta [default=0.3]
      • Analogous to learning rate in GBM
      • Makes the model more robust by shrinking the weights on each step
      • Typical final values to be used: 0.01-0.2
    • min_child_weight [default=1]
      • Defines the minimum sum of weights of all observations required in a child.
      • Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
      • Values that are too high can lead to under-fitting, hence it should be tuned using CV.
    • max_depth [default=6]
      • The maximum depth of a tree, same as GBM.
      • Used to control over-fitting, as higher depth allows the model to learn relations very specific to a particular sample.
      • Should be tuned using CV.
      • Typical values: 3-10
    • gamma [default=0]
      • A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
      • Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
    • subsample[default=1]
      • Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
      • Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.
      • Typical values: 0.5-1
    • lambda [default=1]
      • L2 regularization term on weights (analogous to Ridge regression)
      • This is used to handle the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce overfitting.
    • alpha [default=0]
      • L1 regularization term on weight (analogous to Lasso regression)
      • Can be used in case of very high dimensionality so that the algorithm runs faster
  • Learning Task Parameters: These parameters are used to define the optimization objective and the metric to be calculated at each step.
    • objective [default=reg:linear]
      • This defines the loss function to be minimized. For regression the default squared-error objective is usually kept; commonly used values for classification are binary:logistic, multi:softmax and multi:softprob.
    • eval_metric
      • The metric to be used for validation data
      • typical values: rmse, mae (mean absolute error), logloss, …
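
In the R package these parameters are passed to xgb.train through the params argument. A minimal sketch reusing dtrain from the earlier chunk (the values below are placeholders, not recommendations, and m_xgb_params is just a new name):

params <- list(objective        = "reg:linear",  # squared-error regression (newer xgboost versions call it "reg:squarederror")
               eta              = 0.1,
               max_depth        = 6,
               min_child_weight = 1,
               gamma            = 0,
               subsample        = 0.8,
               lambda           = 1,
               alpha            = 0,
               eval_metric      = "rmse")

m_xgb_params <- xgb.train(params = params, data = dtrain, nrounds = 100, nthread = 1)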

some notes:

  • watchlist: see the fitting progress every round / every several rounds (see the sketch below)
  • warm start: continue training from an already fitted xgboost model (saves computation time)
  • todo
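
A hedged sketch of the first two notes, assuming dtrain, X_tst and y_tst from the earlier chunk (dtest, m_watch and m_more are new names, and the values are placeholders):

dtest <- xgb.DMatrix(data = X_tst, label = y_tst)

# watchlist: report train/test RMSE every 10 rounds and stop if the test error stops improving
m_watch <- xgb.train(params = list(eta = 0.1, eval_metric = "rmse"),
                     data = dtrain, nrounds = 500, nthread = 1,
                     watchlist = list(train = dtrain, test = dtest),
                     print_every_n = 10, early_stopping_rounds = 20)

# "warm start": continue training an existing model for more rounds via the xgb_model argument
m_more <- xgb.train(params = list(eta = 0.1), data = dtrain,
                    nrounds = 50, nthread = 1, xgb_model = m_watch)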

examples:

Fit manually with grid search (good for tuning one or two parameters):

here we use a custom CV function (CV_split) and search for the best eta (the step-size shrinkage used in each update, i.e. the learning rate)

netas <- 100 
etavec <- seq(0.01,0.99,len = netas)

registerDoMC(min(50,netas))

  eta_tune <- 
    foreach(i = 1:netas, .combine = "cbind") %dopar% { # loop over eta values
        foreach(j = 1:10, .combine = "rbind") %do% { # 10 CV repeats per eta value 
          
           random_split <- CV_split(db = db0, validation = "tenfold", trn_ratio = 0.7)
           db_trn <- random_split$db_trn
           db_tst <- random_split$db_tst
           xgb_data <- xgb_pp(db_trn = db_trn, db_tst = db_tst)  # this function does the preprocessing for xgboost
      
           m_xgb <- xgb.train(data = xgb_data$dtrain,
                              params = list(eta = etavec[i]), nrounds = 100, nthread = 1)
           # note the position of nthread: in some parallelization schemes you may want to
           # specify the number of cores for each model to optimize the computation
           # (it is usually better to keep nthread outside the params argument, as done here)
           
           p_xgb <- predict(m_xgb, xgb_data$X_tst)
           c(sqrt(mean((xgb_data$y_tst-p_xgb)^2)))
           } %>% data.table()
      } %>% as.matrix()

eta_tune_mean <- colMeans(eta_tune)
matplot(y=t(eta_tune),x=etavec,col="grey",type = "l", lwd = 1, lty = 1)
lines(y=eta_tune_mean,x=etavec, col = "blue", lwd = 2)
eta_min <- etavec[which.min(eta_tune_mean)]
min_rmse <- min(eta_tune_mean)
points(min_rmse~eta_min,pch = 16, col = "red", cex = 2)

Tuning with caret

some examples with the mtcars dataset

tunegrid <- expand.grid(nrounds = 100,
                        max_depth = c(4,6,10,20),
                        eta = seq(0.05,0.4,len=8),
                        gamma = 0,
                        colsample_bytree = 1,
                        min_child_weight = 1,
                        subsample = seq(0.05,0.15,len=5))

trcontrol <- trainControl(method = "repeatedcv", # options: "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" and "oob"
                          #p, for leave-group out: the training percentage 
                          number = 10, # ten folds
                          repeats = 5, # repeated 5 times
                            # index = list(1:10,c(4,25,30)), # that's useful for custom CV folds
                            # indexOut = list(11:20,c(2,3,21)),
                          # selectionFunction = # a function to choose the optimal tuning parameters. see ?best)
                          allowParallel = T)

xgb_train = caret::train(hp ~ mpg+wt,
                         data=mtcars,
                         trControl = trcontrol,
                         tuneGrid = tunegrid,
                         method = "xgbTree")

plot(xgb_train)

#see more in ?plot.train

head(xgb_train$results)
##     eta max_depth gamma colsample_bytree min_child_weight subsample
## 1  0.05         4     0                1                1     0.050
## 2  0.05         4     0                1                1     0.075
## 3  0.05         4     0                1                1     0.100
## 4  0.05         4     0                1                1     0.125
## 5  0.05         4     0                1                1     0.150
## 21 0.10         4     0                1                1     0.050
##    nrounds     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      100 47.86046 0.8374547 39.25983 23.70222  0.2190509 16.58520
## 2      100 40.90330 0.8720291 33.54120 21.48166  0.1897728 14.67399
## 3      100 37.67087 0.8639549 29.91711 21.13377  0.2079917 14.42766
## 4      100 37.51402 0.8620519 29.58749 21.59527  0.2034839 15.00643
## 5      100 36.88676 0.8746544 29.47920 21.16609  0.1936070 14.63770
## 21     100 38.94122 0.8698972 32.94139 19.96523  0.1968575 14.35316
hmap <- reshape2::acast(xgb_train$results, eta~subsample,fun.aggregate = mean, value.var = "RMSE") %>% as.matrix()
image(hmap,xlab = "subsample", ylab = "eta")

# another alternative:
plot(xgb_train, metric = "Rsquared", plotType = "level",
     scales = list(x = list(rot = 90)))

custom optimality criterion: useful when you want to specify your own loss (think of weighting the errors by population density, for example)

mtcars$weights <- rnorm(32)

mycustomfunction <- function(data, lev = NULL, model = NULL) { # the standard caret summaryFunction signature; lev (class levels) and model are used for classification
  y <- data$obs # these are observations
  y_hat <- data$pred # the prediction vector
  w <- data$weights
  res <-  c("mymetric" = sum(w*abs(y-y_hat))/sum(w))
  res
}


# note: this grid is defined but not passed to train() below, so caret uses its default grid there;
# also, alpha is not one of the tuning parameters of caret's xgbTree method
tunegrid <- expand.grid(nrounds = 100,
                        max_depth = 6,
                        eta = seq(0.05,0.4,len=8),
                        gamma = 0,
                        colsample_bytree = 1,
                        min_child_weight = 1,
                        subsample = 0.1,
                        alpha = seq(0,0.3,len=4))

trcontrol <- trainControl(method = "cv",
                          number = 10,
                          summaryFunction = mycustomfunction)

xgb_train = caret::train(hp ~ mpg+wt,weights = mtcars$weights,
                         data=mtcars,
                         trControl = trcontrol,
                         method = "xgbTree",
                         metric = "mymetric")

xgb_train$results[which.min(xgb_train$results$mymetric),]
##    eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
## 16 0.3         1     0              0.8                1         1      50
##    mymetric mymetricSD
## 16 50.68178   129.4149

we can also choose the final model with a custom function that selects the “optimal” model from a series of models.

For example, the tolerance function could be used to find a less complex model based on \((x-x_{best})/x_{best} \times 100\), which is the percent difference (tolerance selects the least complex model within some percent tolerance of the best value)

best <- tolerance(xgb_train$results, metric = "mymetric", tol = 5, maximize = F)
xgb_train$results[best,]
##    eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
## 16 0.3         1     0              0.8                1         1      50
##    mymetric mymetricSD
## 16 50.68178   129.4149

NOTE: hyper-parameter search (tuning) can be done with several approaches. here I presented only grid search (going over every parameter combination, calculating the loss and choosing one set), which can take a lot of time since the number of points to search grows exponentially with every additional hyper-parameter. other search options are random search (supported by caret), evolutionary algorithms like CMA-ES (numeric parameters only), surrogate-model-based search (as this guy did), Bayesian optimization and more.
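
For example, random search in caret only needs a different trainControl. A sketch reusing the mtcars setup from above (trcontrol_rand and xgb_rand are new names, and the tuneLength value is arbitrary):

trcontrol_rand <- trainControl(method = "cv", number = 10, search = "random")

xgb_rand <- caret::train(hp ~ mpg+wt, data = mtcars,
                         method = "xgbTree",
                         trControl = trcontrol_rand,
                         tuneLength = 20)  # number of random parameter combinations to try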

More advanced tuning approaches for xgboost and other models are available in the mlr and mlrMBO packages.

a general approach for tuning:

  1. Choose a relatively high learning rate (eta).
    • Generally a learning rate of 0.1 works.
    • Determine the optimum number of trees for this learning rate.
    • XGBoost has a very useful function, xgb.cv, which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required (see the sketch after this list).
  2. Tune tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree) for the chosen learning rate and number of trees.
    • Note that several parameters can be tuned together.
  3. Tune regularization parameters (lambda, alpha) for xgboost which can help reduce model complexity and enhance performance.
  4. Finally, lower the learning rate and re-determine the optimal parameters.
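
A sketch of step 1 with xgb.cv, reusing dtrain from the earlier chunk (the eta, fold number and stopping rule are placeholders; cv_res is just a new name):

cv_res <- xgb.cv(params = list(eta = 0.1, max_depth = 6, eval_metric = "rmse"),
                 data = dtrain, nrounds = 1000, nfold = 5,
                 early_stopping_rounds = 20, verbose = FALSE)

cv_res$best_iteration  # optimum number of trees for this learning rate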

Tuning with mlr

read this tutorial

The main goal of mlr is to provide a unified interface for machine learning tasks such as classification, regression, cluster analysis and survival analysis in R.

mlr offers the following features:

  • Possibility to fit, predict, evaluate and resample models
  • Easy extension mechanism through S3 inheritance
  • Abstract description of learners and tasks by properties
  • Parameter system for learners to encode data types and constraints
  • Many convenience methods and generic building blocks for your machine learning experiments
  • Resampling like bootstrapping, cross-validation and subsampling
  • Different visualizations for e.g. ROC curves and predictions
  • Benchmarking of learners for multiple data sets
  • Easy hyper-parameter tuning using different optimization strategies
  • Variable selection with filters and wrappers
  • Nested resampling of models with tuning and feature selection
  • Cost-sensitive learning, threshold tuning and imbalance correction
  • Wrapper mechanism to extend learner functionality in complex and custom ways
  • Combine different processing steps into a complex data mining chain that can be jointly optimized
  • Extension points to integrate your own stuff
  • Parallelization is built-in
# to do
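
As a placeholder for the above, a heavily hedged sketch of what xgboost tuning could look like in mlr, on the mtcars toy data (the search space, budget and resampling scheme are arbitrary, and task/lrn/ps/ctrl/rdesc/res are just new names):

library(mlr)

task <- makeRegrTask(data = mtcars[, c("hp", "mpg", "wt")], target = "hp")
lrn  <- makeLearner("regr.xgboost", nrounds = 100)

ps    <- makeParamSet(makeNumericParam("eta", lower = 0.01, upper = 0.3),
                      makeIntegerParam("max_depth", lower = 2, upper = 10))
ctrl  <- makeTuneControlRandom(maxit = 20)   # random search with 20 evaluations
rdesc <- makeResampleDesc("CV", iters = 5)   # 5-fold CV for each candidate

res <- tuneParams(lrn, task = task, resampling = rdesc,
                  par.set = ps, control = ctrl, measures = rmse)
res$x  # the selected hyper-parameters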

Tuning with mlrMBO

The main goal of mlrMBO is to optimize expensive black-box functions by model-based optimization (aka Bayesian optimization) and to provide a unified interface for different optimization tasks and algorithmic MBO variants. Supported are, among other things:

  • Efficient global optimization (EGO) of problems with numerical domain and Kriging as surrogate
  • Using arbitrary regression models from mlr as surrogates
  • Built-in parallelization using multi-point proposals
  • Mixed-space optimization with categorical and subordinate parameters, for parameter configuration and tuning
  • Multi-criteria optimization

for tutorials see this and this

# to do
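
Again as a placeholder, a hedged sketch of model-based optimization with mlrMBO on a toy objective (everything here is illustrative; in practice the objective function would return, e.g., the CV error of xgboost for a given set of hyper-parameters):

library(mlrMBO)
library(smoof)

# toy black-box objective over a 2-dimensional numeric domain
obj_fun <- makeSingleObjectiveFunction(
  name    = "toy",
  fn      = function(x) sum(x^2),
  par.set = makeNumericParamSet("x", len = 2, lower = -5, upper = 5)
)

ctrl <- makeMBOControl()
ctrl <- setMBOControlTermination(ctrl, iters = 10)  # 10 MBO iterations after the initial design

run <- mbo(obj_fun, control = ctrl)
run$x  # proposed optimum
run$y  # objective value at the optimum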