load the following libraries:
library(Matrix)
library(lme4)
library(data.table)
library(magrittr)
library(xgboost)
library(foreach)
library(doMC)
library(knitr)
library(caret)
load some custom functions:
source("~/CrossValidation/CV_functions.R")
data from northeast US:
db0us <- data.table(fst::read_fst("~/PM_USA/data/mod0_allyears.fst"))
pp <- preproc(db0 = db0us, years = 2015:2015, obs_day = 50, obs_stn = 50) # some data pre-processing
db0 <- pp$db_processed
library(rpart)
library(rpart.plot)
tree.1 <- rpart(pm25 ~ air.2m + evap + hpbl + pres.sfc + pr_wtr + uwnd.10m +
                  DevH_P1km + DevM_P1km + aothat.aqua + reg + time_or + month,
                data = db0)
rpart.plot::rpart.plot(tree.1)
Trees are very prone to overfitting. To avoid this, we reduce a tree's complexity by pruning it. This is done with the prune function.
One alternative is to use the smallest tree that is within 1 standard error of the best tree. The reasoning: given the CV estimates of the error, the smallest tree within 1 standard error predicts about as well as the best (lowest CV error) tree, yet it does so with fewer "terms", so it is less likely to overfit.
we can find a good value of the complexity parameter (cp) for pruning from a plot of the cross-validated error vs. tree size for the un-pruned tree:
plotcp(tree.1)
tree.2 <- prune(tree.1,cp = 0.015)
rpart.plot(tree.2)
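The 1-SE cp can also be picked programmatically from the cptable instead of being read off the plot. A minimal sketch (the object names cp_1se and tree.1se are mine):
cptab <- tree.1$cptable
best <- which.min(cptab[, "xerror"])                          # row with the lowest CV error
thresh <- cptab[best, "xerror"] + cptab[best, "xstd"]         # best CV error + 1 SE
cp_1se <- cptab[which(cptab[, "xerror"] <= thresh)[1], "CP"]  # smallest tree within 1 SE
tree.1se <- prune(tree.1, cp = cp_1se)
rpart.plot(tree.1se)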
single trees usually produce noisy (bushy) or weak (stunted) predictors.
for this it is better to use an ensemble of many trees, for example boosted trees (xgboost).
How does it work:
Updating weights:
additive training:
for \(i \in 1,...,n\):
\[ \begin{align} \hat{y}^{(0)}_i &= 0 \\ \hat{y}^{(1)}_i &= f_1(x_i) = \hat{y}^{(0)}_i + f_1(x_i) \\ \hat{y}^{(2)}_i &= f_2(x_i) = \hat{y}^{(1)}_i + f_2(x_i) \\ ... \\ \hat{y}^{(t)}_i &= \sum_{k=1}^t f_k(x_i) = \hat{y}^{(t-1)}_i + f_t(x_i) \\ \end{align}\]
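To see the additive training in code, here is a toy hand-rolled boosting loop (my own illustration on mtcars, with a shrinkage factor nu, which xgboost calls eta): each weak learner \(f_t\) is an rpart stump fitted to the residuals of the current prediction, and \(\hat{y}^{(t)} = \hat{y}^{(t-1)} + \nu f_t(x)\).
yhat <- rep(0, nrow(mtcars))                           # yhat^(0) = 0
nu <- 0.3                                              # learning rate (shrinkage)
for (t in 1:50) {
  d <- data.frame(res = mtcars$hp - yhat,              # residuals of the current prediction
                  mpg = mtcars$mpg, wt = mtcars$wt)
  f_t <- rpart(res ~ mpg + wt, data = d,
               control = rpart.control(maxdepth = 1))  # weak learner: a stump
  yhat <- yhat + nu * predict(f_t, d)                  # yhat^(t) = yhat^(t-1) + nu * f_t(x)
}
sqrt(mean((mtcars$hp - yhat)^2))                       # training RMSE after 50 boosting steps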
at each step we add the tree that optimizes our objective function. Here is the definition of the objective:
\[ \begin{align} obj^{(t)} &= \sum_{i=1}^n l(y_i,\hat{y}^{(t)}_i)+\sum_{k=1}^t\Omega(f_k) \\ &= \sum_{i=1}^n l\left(y_i,\hat{y}^{(t-1)}_i + f_t(x_i)\right) + \sum_{k=1}^t\Omega(f_k) \end{align}\]
where \(\Omega(f)\) is the regularization term, which penalizes the complexity of each tree.
to understand the regularization term, first refine the definition of a tree \(f_t(x)\):
\[ f_t(x) = w_{q(x)}\]
where \(x \in R^d\) (\(d\) = number of features), \(w \in R^T\) and \(q:R^d \to \{1,...,T\}\).
Here \(w\) is the vector of scores on leaves, \(q\) is a function assigning each data point to the corresponding leaf, and \(T\) is the number of leaves. In XGBoost, we define the complexity as
\[ \Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2\]
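Plugging this \(\Omega\) into the objective at step \(t\) (the \(\Omega\) of the earlier trees is constant at this step) and taking a second-order Taylor expansion of the loss, with \(g_i\) and \(h_i\) the first and second derivatives of \(l\) at \(\hat{y}^{(t-1)}_i\) and \(I_j\) the set of points in leaf \(j\), the objective separates over leaves; this is the standard XGBoost derivation:
\[ \begin{align} obj^{(t)} &\approx \sum_{j=1}^T\left[G_j w_j + \frac{1}{2}(H_j+\lambda)w_j^2\right] + \gamma T, \quad G_j=\sum_{i\in I_j} g_i,\; H_j=\sum_{i\in I_j} h_i \\ w_j^* &= -\frac{G_j}{H_j+\lambda}, \qquad obj^* = -\frac{1}{2}\sum_{j=1}^T\frac{G_j^2}{H_j+\lambda} + \gamma T \end{align}\]
The last expression scores a tree structure and is what the split-finding algorithm greedily improves, so \(\gamma\) and \(\lambda\) directly control how conservative the splits and the leaf scores are.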
boosting parameters:
there are 3 categories of parameters: general parameters (e.g. which booster to use), booster / tree-specific parameters, and learning-task parameters (objective, evaluation metric).
some of the tree-specific parameters are eta, max_depth, min_child_weight, subsample, colsample_bytree, gamma, lambda and alpha; see the sketch below.
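As a reference, a params list for xgb.train grouped by these categories might look like this (the values are package defaults / illustrative choices, not tuned):
params <- list(
  # general parameters
  booster = "gbtree",              # tree-based booster
  # booster (tree-specific) parameters
  eta = 0.3,                       # learning rate (step-size shrinkage)
  max_depth = 6,                   # maximum depth of each tree
  min_child_weight = 1,            # minimum sum of instance weights in a child
  subsample = 1,                   # fraction of rows sampled per tree
  colsample_bytree = 1,            # fraction of columns sampled per tree
  gamma = 0,                       # minimum loss reduction required to split
  lambda = 1, alpha = 0,           # L2 / L1 regularization on leaf weights
  # learning-task parameters
  objective = "reg:squarederror",  # squared-error regression ("reg:linear" in older xgboost versions)
  eval_metric = "rmse"
)
# m <- xgb.train(params = params, data = dtrain, nrounds = 100)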
more on Gradient Boosting:
simple use of xgboost
random_split <- CV_split(db = db0, validation = "tenfold", trn_ratio = 0.7)
db_trn <- random_split$db_trn
db_tst <- random_split$db_tst
X <- sparse.model.matrix(~ air.2m+evap+hpbl+pres.sfc+pr_wtr+uwnd.10m+DevH_P1km+DevM_P1km+ # LU
aothat.aqua+reg+ time_or+month,
data = db_trn[,-c("pm25")])
y <- db_trn$pm25
dtrain <- xgb.DMatrix(data = X, label = y)
X_tst <- sparse.model.matrix(~ air.2m+evap+hpbl+pres.sfc+pr_wtr+uwnd.10m+DevH_P1km+DevM_P1km+ # LU
aothat.aqua+reg+ time_or+month,
data = db_tst[,-c("pm25")])
y_tst <- db_tst$pm25
# run xgb with default hyperparameters:
m_xgb <- xgb.train(data = dtrain, nrounds = 100, nthread=1)
# predict:
p_xgb <- predict(m_xgb,X_tst)
# evaluate:
sqrt(mean((p_xgb-y_tst)^2))
## [1] 2.384479
plot it:
# library(DiagrammeR)
# xgb.plot.tree(model = m_xgb)
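The fitted booster can also be inspected with the importance helpers in xgboost; the feature names come from the sparse model matrix built above:
imp <- xgb.importance(model = m_xgb)   # gain / cover / frequency per feature
head(imp)
xgb.plot.importance(imp, top_n = 10)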
Building a model with xgboost is easy, but improving it is difficult: the algorithm has many hyper-parameters, so to improve the model, parameter tuning is a must.
Tuning is an art.
Which hyper-parameters should be tuned? What are good values for them?
It comes with experience…
read:
XGBoost Parameter Tuning (in Python) tutorial
the most important:
some notes:
examples:
Fit Manually (that’s good for one or two parameters) with grid search:
here we use custom CV function (CV_split
) and search for best eta
(step size shrinkage used in update (learning rate))
netas <- 100
etavec <- seq(0.01,0.99,len = netas)
registerDoMC(min(50,netas))
eta_tune <-
foreach(i = 1:netas, .combine = "cbind") %dopar% { # loop over eta values
foreach(j = 1:10, .combine = "rbind") %do% { # 10 CV repeats per eta value
random_split <- CV_split(db = db0, validation = "tenfold", trn_ratio = 0.7)
db_trn <- random_split$db_trn
db_tst <- random_split$db_tst
xgb_data <- xgb_pp(db_trn = db_trn, db_tst = db_tst) # this function does the pre-processing for xgboost
m_xgb <- xgb.train(data = xgb_data$dtrain,
params = list(eta = etavec[i]), nrounds = 100, nthread = 1) # note where nthread goes: in some parallelization schemes you may want to specify the number of cores per model to optimize the computation, but it is usually better to keep nthread out of the params argument.
p_xgb <- predict(m_xgb, xgb_data$X_tst)
c(sqrt(mean((xgb_data$y_tst-p_xgb)^2)))
} %>% data.table()
} %>% as.matrix()
eta_tune_mean <- colMeans(eta_tune)
matplot(y=t(eta_tune),x=etavec,col="grey",type = "l", lwd = 1, lty = 1)
lines(y=eta_tune_mean,x=etavec, col = "blue", lwd = 2)
eta_min <- etavec[which.min(eta_tune_mean)]
min_rmse <- min(eta_tune_mean)
points(min_rmse~eta_min,pch = 16, col = "red", cex = 2)
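A built-in alternative to this hand-rolled loop is xgb.cv, which cross-validates a single parameter set and, with early stopping, also suggests nrounds for it. A sketch using the dtrain built earlier (the eta and max_depth values here are just illustrative):
cv <- xgb.cv(params = list(eta = 0.05, max_depth = 6),
             data = dtrain, nrounds = 500, nfold = 10,
             early_stopping_rounds = 20, verbose = FALSE)
cv$best_iteration          # round with the lowest CV RMSE
tail(cv$evaluation_log)    # train / test RMSE per boosting round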
caret
some examples with the mtcars dataset
tunegrid <- expand.grid(nrounds = 100,
max_depth = c(4,6,10,20),
eta = seq(0.05,0.4,len=8),
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = seq(0.05,0.15,len=5))
trcontrol <- trainControl(method = "repeatedcv", # options: "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" and "oob"
#p, for leave-group out: the training percentage
number = 10, # ten folds
repeats = 5, # repeated 5 times
# index = list(1:10,c(4,25,30)), # thats good for custom CV folding
# indexOut = list(11:20,c(2,3,21)),
# selectionFunction =   # a function to choose the optimal tuning parameters (see ?best)
allowParallel = TRUE)
xgb_train = caret::train(hp ~ mpg+wt,
data=mtcars,
trControl = trcontrol,
tuneGrid = tunegrid,
method = "xgbTree")
plot(xgb_train)
#see more in ?plot.train
head(xgb_train$results)
## eta max_depth gamma colsample_bytree min_child_weight subsample
## 1 0.05 4 0 1 1 0.050
## 2 0.05 4 0 1 1 0.075
## 3 0.05 4 0 1 1 0.100
## 4 0.05 4 0 1 1 0.125
## 5 0.05 4 0 1 1 0.150
## 21 0.10 4 0 1 1 0.050
## nrounds RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 100 47.86046 0.8374547 39.25983 23.70222 0.2190509 16.58520
## 2 100 40.90330 0.8720291 33.54120 21.48166 0.1897728 14.67399
## 3 100 37.67087 0.8639549 29.91711 21.13377 0.2079917 14.42766
## 4 100 37.51402 0.8620519 29.58749 21.59527 0.2034839 15.00643
## 5 100 36.88676 0.8746544 29.47920 21.16609 0.1936070 14.63770
## 21 100 38.94122 0.8698972 32.94139 19.96523 0.1968575 14.35316
hmap <- reshape2::acast(xgb_train$results, eta~subsample,fun.aggregate = mean, value.var = "RMSE") %>% as.matrix()
image(hmap, xlab = "eta", ylab = "subsample") # rows of hmap (x axis) are eta, columns (y axis) are subsample
# another alternative:
plot(xgb_train, metric = "Rsquared", plotType = "level",
scales = list(x = list(rot = 90)))
A custom optimality criterion is useful when you want to specify your own loss (think of weighting the errors by population density, for example):
mtcars$weights <- runif(32) # random positive weights, just for illustration (weights should be non-negative)
mycustomfunction <- function(data, lev = NULL, model = NULL) { # caret passes the hold-out data, the outcome levels (lev, used for classification) and the model name
y <- data$obs # these are observations
y_hat <- data$pred # the prediction vector
w <- data$weights
res <- c("mymetric" = sum(w*abs(y-y_hat))/sum(w))
res
}
tunegrid <- expand.grid(nrounds = 100,
max_depth = 6,
eta = seq(0.05,0.4,len=8),
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 0.1,
alpha = seq(0,0.3,len=4))
trcontrol <- trainControl(method = "cv",
number = 10,
summaryFunction = mycustomfunction)
xgb_train = caret::train(hp ~ mpg+wt, weights = mtcars$weights,
data=mtcars,
trControl = trcontrol,
method = "xgbTree",
metric = "mymetric",
maximize = FALSE) # mymetric is an error measure, so the optimal model is the one that minimizes it
xgb_train$results[which.min(xgb_train$results$mymetric),]
## eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
## 16 0.3 1 0 0.8 1 1 50
## mymetric mymetricSD
## 16 50.68178 129.4149
we can also choose the final model with a custom function that selects the "optimal" model from the series of fitted models.
For example, the tolerance function can be used to find a less complex model based on \((x-x_{best})/x_{best} \times 100\), the percent difference from the best value: tolerance selects the least complex model within some percent tolerance of the best value.
best <- tolerance(xgb_train$results, metric = "mymetric", tol = 5, maximize = F)
xgb_train$results[best,]
## eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
## 16 0.3 1 0 0.8 1 1 50
## mymetric mymetricSD
## 16 50.68178 129.4149
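If you want the train object itself to refit with that row rather than caret's default choice, the update method for train objects takes an explicit parameter set. A sketch, assuming it accepts a one-row data frame of the tuning parameters (using the tolerance-selected index best from above):
final_params <- xgb_train$results[best, names(xgb_train$bestTune)]  # the tuning-parameter columns of the chosen row
xgb_tol <- update(xgb_train, param = final_params)                  # refit the final model with these values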
NOTE: hyper-parameter search (tuning) can be approached in several ways. Here I presented only grid search (going over every parameter combination, computing the loss and choosing one set), which can take a long time, as the number of points to search grows exponentially with every additional hyper-parameter. Other options are random search, which is supported by caret (see the sketch below), evolutionary algorithms such as CMA-ES (for numeric parameters only), surrogate-model-based search (as this guy did), Bayesian optimization and more.
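For example, random search with caret only requires search = "random" in trainControl plus a tuneLength; a sketch on the mtcars toy example from above:
trcontrol_rs <- trainControl(method = "cv", number = 10, search = "random")
xgb_rs <- caret::train(hp ~ mpg + wt, data = mtcars,
                       method = "xgbTree",
                       trControl = trcontrol_rs,
                       tuneLength = 20)   # number of random parameter combinations to try
xgb_rs$bestTune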
More advanced tuning approaches for xgboost and other models are available in the mlr and mlrMBO packages.
a general approach for tuning:
mlr
read this tutorial
The main goal of mlr is to provide a unified interface for machine learning tasks such as classification, regression, cluster analysis and survival analysis in R.
mlr offers the following features:
# to do
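In the meantime, a minimal sketch of the mlr tuning workflow for xgboost (my own illustration on the mtcars toy data, assuming the regr.xgboost learner; random search over eta, max_depth and subsample):
library(mlr)
task <- makeRegrTask(data = mtcars[, c("hp", "mpg", "wt")], target = "hp")  # regression task
lrn <- makeLearner("regr.xgboost", nrounds = 100)                           # xgboost learner
ps <- makeParamSet(                                                         # search space
  makeNumericParam("eta", lower = 0.01, upper = 0.3),
  makeIntegerParam("max_depth", lower = 2, upper = 10),
  makeNumericParam("subsample", lower = 0.5, upper = 1)
)
ctrl <- makeTuneControlRandom(maxit = 20)       # random search, 20 evaluations
rdesc <- makeResampleDesc("CV", iters = 5)      # 5-fold CV for each evaluation
res <- tuneParams(lrn, task = task, resampling = rdesc,
                  par.set = ps, control = ctrl, measures = rmse)
res$x  # best hyper-parameters found
res$y  # their cross-validated RMSE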
mlrMBO
The main goal of mlrMBO is to optimize expensive black-box functions by model-based optimization (aka Bayesian optimization) and to provide a unified interface for different optimization tasks and algorithmic MBO variants. Supported are, among other things:
for tutorials see this and this
# to do
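A minimal sketch of model-based optimization with mlrMBO on a toy black-box function (the smoof and ParamHelpers helpers ship with the mlrMBO stack; an expensive CV loss would take the place of the toy quadratic):
library(mlrMBO)
library(smoof)
obj_fun <- makeSingleObjectiveFunction(        # the expensive black-box function to minimize
  name = "toy",
  fn = function(x) sum(x^2),
  par.set = makeNumericParamSet("x", len = 2, lower = -5, upper = 5)
)
ctrl <- makeMBOControl()
ctrl <- setMBOControlTermination(ctrl, iters = 10)  # 10 sequential MBO iterations
res <- mbo(obj_fun, control = ctrl)                 # surrogate defaults to kriging for numeric spaces
res$x  # best point found
res$y  # best objective value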