- What is it?
- XGBoost theory
- How to use?
- Example
- Advanced Features
Vladyslav Kolbasin
Data Scientist at Griddynamics, Lecturer at NTU 'KhPI' dep. KMMM
== eXtreme Gradient Boosting
XGBoost is short for "Extreme Gradient Boosting"
It implements the classical gradient boosting algorithm that goes back to Breiman and Friedman
The best description is "Introduction to Boosted Trees" by Tianqi Chen
In a few words: how can we find a tree that improves the prediction along the gradient?
This slide is from Tianqi Chen's presentation.
One more tree = the mean loss decreases = more data explained. The original data points in tree 1 are replaced by the loss (residual) points for trees 2 & 3 (see the objective sketched below).
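For reference, here is the additive training objective behind these slides, written as in Chen's "Introduction to Boosted Trees": at round t the previous prediction is kept fixed and one new tree f_t is added to reduce the regularized loss.
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)
L^{(t)} = \sum_i l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)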
R code:
install.packages('xgboost')
To get the latest development version, we can install it from GitHub:
devtools::install_github('dmlc/xgboost',subdir='R-package')
sparseData <- xgb.DMatrix(data, label = label, weight = weight, missing = -999)  # -999 marks missing values
xgb.DMatrix.save(sparseData, "dtrain.buffer.dat")   # save to a binary buffer file
sparseData2 <- xgb.DMatrix("dtrain.buffer.dat")     # load it back
XGBoost always converts dense data to sparse.
XGBoost handles only numeric vectors.
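A practical consequence (a sketch on a toy data frame; the column names are illustrative): categorical variables must be encoded as numbers first, e.g. one-hot encoded into a sparse matrix with Matrix::sparse.model.matrix.
library(Matrix)
library(xgboost)
# Toy data frame with a factor column that has to be one-hot encoded
df <- data.frame(x     = rnorm(6),
                 color = factor(c("red", "blue", "red", "green", "blue", "red")),
                 y     = c(1, 0, 1, 0, 0, 1))
# One-hot encode the predictors into a sparse dgCMatrix (drop the intercept)
X <- sparse.model.matrix(y ~ . - 1, data = df)
dtrainToy <- xgb.DMatrix(X, label = df$y)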
XGBoost processes missing values in a very natural and simple way.
All missing values are routed to one of the subnodes.
How it does this: while learning each split, both directions are tried for the missing values and the one with the higher gain is kept.
In the end, every node has a "default direction" for missing values (see the sketch below).
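A minimal sketch with toy data (illustrative variable names): NA cells are declared as missing when building the DMatrix, and at prediction time each NA simply follows the learned default direction.
library(xgboost)
# Toy matrix containing NA values
X <- matrix(c(1, NA, 3,
              4,  5, NA,
             NA,  8, 9,
              2,  2, 2), ncol = 3, byrow = TRUE)
y <- c(1, 0, 1, 0)
dtrainNA <- xgb.DMatrix(X, label = y, missing = NA)   # NA marks missing cells
mdlNA <- xgboost(data = dtrainNA, nrounds = 2,
                 objective = "binary:logistic", verbose = 0)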
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
mdl <- xgboost(data=train$data, label=train$label, nrounds=2,
               objective="binary:logistic", verbose=2)
pred <- predict(mdl, test$data)
tree prunning end, 1 roots, 20 extra nodes, 0 pruned nodes ,max_depth=5
[0] train-error:0.000614
tree prunning end, 1 roots, 18 extra nodes, 0 pruned nodes ,max_depth=5
[1] train-error:0.001228
data can be a matrix, a dgCMatrix, or an xgb.DMatrix.
For a more advanced interface, use the xgb.train function.
xgb.dump(mdl, with.stats = T)
[1] "booster[0]"
[2] "0:[f28<-1e-005] yes=1,no=2,missing=1,gain=4000.53,cover=1628.25"
[3] "1:[f55<-1e-005] yes=3,no=4,missing=3,gain=1158.21,cover=924.5"
[4] "3:leaf=1.71218,cover=812"
[5] "4:leaf=-1.70044,cover=112.5"
[6] "2:[f108<-1e-005] yes=5,no=6,missing=5,gain=198.174,cover=703.75"
[7] "5:leaf=-1.94071,cover=690.5"
[8] "6:leaf=1.85965,cover=13.25"
[9] "booster[1]"
[10] "0:[f59<-1e-005] yes=1,no=2,missing=1,gain=832.545,cover=788.852"
[11] "1:[f28<-1e-005] yes=3,no=4,missing=3,gain=569.725,cover=768.39"
[12] "3:leaf=0.784718,cover=458.937"
[13] "4:leaf=-0.96853,cover=309.453"
[14] "2:leaf=-6.23625,cover=20.4624"
We may save and load trained models:
xgb.save(mdl, "trained.model.dat")
mdl2 <- xgb.load("trained.model.dat")
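A quick sanity check (a sketch, reusing mdl, mdl2 and the agaricus test data from above): the reloaded model should give exactly the same predictions.
# Predictions from the original and the reloaded model should match
all.equal(predict(mdl, test$data), predict(mdl2, test$data))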
Learn more about the parameters:
dtrain <- xgb.DMatrix(dataTrain, label=labelTrain) # 5000 examples
dtest  <- xgb.DMatrix(dataTest,  label=labelTest)  # 1000 examples
param = list(objective="multi:softmax", num_class=10, eval_metric="mlogloss",
eta=0.2, max_depth=5, subsample=1, colsample_bytree=0.5, nthread=4)
mdl <- xgb.train(params=param, data=dtrain, nrounds=150,
watchlist=list(eval=dtest, train=dtrain))
[0] eval-mlogloss:1.762896 train-mlogloss:1.699440
[1] eval-mlogloss:1.472660 train-mlogloss:1.382979
[2] eval-mlogloss:1.265222 train-mlogloss:1.153952
...
[147] eval-mlogloss:0.161354 train-mlogloss:0.002686
[148] eval-mlogloss:0.161404 train-mlogloss:0.002668
[149] eval-mlogloss:0.161563 train-mlogloss:0.002651
pred <- predict(mdl, newdata=dtest)
sum(diag(table(labelTest,pred)))/length(labelTest)
[1] 0.947
Use the caret package.
We can tune the parameters:
# set up the cross-validated hyper-parameter search
xgb_grid_1 = expand.grid( nrounds = c(10,100,200),
eta = c(0.01, 0.001, 0.0001), max_depth = c(2, 4, 6, 8),
gamma = 1, colsample_bytree=1, min_child_weight=1
)
# pack the training control parameters
xgb_trcontrol_1 = trainControl( method = "cv", number = 5,
verboseIter = TRUE, returnData = FALSE, classProbs = TRUE,
summaryFunction = multiClassSummary, allowParallel = TRUE
)
# using CV train the model for each parameter combination in the grid
xgb_train_1 = train(
x = as.matrix(dat), y = y,
trControl = xgb_trcontrol_1,
tuneGrid = xgb_grid_1,
method = "xgbTree"
)
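After training, caret exposes the winning combination and the full resampling results (a usage sketch for the xgb_train_1 object created above):
# Best hyper-parameter combination found by the cross-validated grid search
xgb_train_1$bestTune
# Resampling results for every point of the grid
head(xgb_train_1$results)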
loglossobj <- function(preds, dtrain) {
# Extract the labels from the training data
labels <- getinfo(dtrain, "label")
# Transform the raw margin predictions with the sigmoid
preds <- 1/(1 + exp(-preds))
# 1st and 2nd order gradients of the logistic loss, as grad and hess
grad <- preds - labels
hess <- preds * (1 - preds)
# Return the result as a list
return(list(grad = grad, hess = hess))
}
model <- xgboost(data = train$data, label = train$label,
nrounds = 2, objective = loglossobj, eval_metric = "error")
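A simple way to check the custom objective (a sketch on the same data): it is exactly the logistic loss, so the built-in "binary:logistic" objective should report almost identical error values per round.
# Built-in objective for comparison with the custom loglossobj above
model_builtin <- xgboost(data = train$data, label = train$label,
                         nrounds = 2, objective = "binary:logistic",
                         eval_metric = "error")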
We can train the model iteratively, continuing from the previous round's predictions:
mdl <- xgboost(params=param, data=dtrain, nrounds=1)
ptrain <- predict(mdl, dtrain, outputmargin=TRUE)   # raw (untransformed) predictions
setinfo(dtrain, "base_margin", ptrain)              # use them as the starting point
mdl <- xgboost(params=param, data=dtrain, nrounds=1)  # train one more round on top
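One detail worth knowing (a sketch under the same setup): once base_margin is stored in the DMatrix, predictions on it include that margin, so the second model's raw scores already contain the first model's contribution.
# Raw scores from the second model; the base_margin stored in dtrain
# (the first model's contribution) is added automatically
ptrain2 <- predict(mdl, dtrain, outputmargin = TRUE)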
Importance: how many times each variable appears in all the trees (Frequency), together with its Gain and Cover
xgb.importance(train$data@Dimnames[[2]], model=mdl)
Feature Gain Cover Frequence
1: 406 2.407569e-02 1.088104e-02 4.050633e-03
2: 347 2.293334e-02 1.539037e-02 4.746835e-03
3: 436 2.240535e-02 7.608860e-03 4.873418e-03
4: 438 1.606233e-02 9.233623e-03 2.848101e-03
5: 351 1.456926e-02 9.906179e-03 5.443038e-03
---
478: 500 6.264526e-06 1.113056e-04 1.265823e-04
479: 445 6.190388e-06 1.587092e-05 6.329114e-05
480: 202 4.774393e-06 2.505570e-05 6.329114e-05
481: 285 4.367899e-06 9.859387e-06 6.329114e-05
482: 313 3.036956e-06 8.477709e-06 6.329114e-05
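The importance table can also be visualized (a sketch; imp is an illustrative variable name, and older xgboost versions need the Ckmeans.1d.dp package for the plot):
imp <- xgb.importance(train$data@Dimnames[[2]], model = mdl)
xgb.plot.importance(imp)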