Introduction to XGBoost

Vladyslav Kolbasin
Data Scientist at Griddynamics, Lecturer at NTU 'KhPI' dep. KMMM

Overview

  • What is it?
  • XGBoost theory
  • How to use it?
  • Example
  • Advanced Features

What is it?

XGBoost == eXtreme Gradient Boosting

  • Open-source tool, hosted on GitHub
  • Easy to use
    • Easy to install
    • Interfaces to R/Python/Julia/Java/Scala
    • Good documentation
    • Highly tunable
  • Very fast
  • Gives very accurate models
    • Won many Kaggle competitions

Winning Solutions

  • Marios Michailidis, Mathias Muller and HJ van Veen, 1st place of the Dato Truly Native? competition.
  • Vlad Mironov, Alexander Guschin, 1st place of the CERN LHCb experiment Flavour of Physics competition.
  • Josef Slavicek, 3rd place of the CERN LHCb experiment Flavour of Physics competition.
  • Mario Filho, Josef Feigl, Lucas, Gilberto, 1st place of the Caterpillar Tube Pricing competition.
  • Qingchen Wang, 1st place of the Liberty Mutual Property Inspection.
  • Chenglong Chen, 1st place of the Crowdflower Search Results Relevance.
  • Alexandre Barachant and Rafal Cycon, 1st place of the Grasp-and-Lift EEG Detection.
  • Halla Yang, 2nd place of the Recruit Coupon Purchase Prediction Challenge.
  • Owen Zhang, 1st place of the Avito Context Ad Clicks competition.
  • Keiichi Kuroyanagi, 2nd place of the Airbnb New User Bookings.
  • Marios Michailidis, Mathias Muller and Ning Situ, 1st place Homesite Quote Conversion.

Efficiency

  • The core is written in C++
  • Automatic parallel computation on a single machine (uses OpenMP)
  • Generally around 10 times faster than the classical gbm package
  • Distributed: there are versions for YARN, MPI, etc.

How does it work?

XGBoost is short for "eXtreme Gradient Boosting"
It implements the classical gradient boosting algorithm introduced by Friedman

  • Tree model
    • There is also a linear booster, but it is less commonly used
  • Ensembling
    • Gradient Boosting

The best description is "Introduction to Boosted Trees" by Tianqi Chen

In a few words: how do we find a tree that improves the prediction along the gradient?

Tree-based models

[figure: tree-based model example]

This slide is from Tianqi Chen's presentation

Ensembling

  • Ensemble: usually a simple sum of the trees' outputs

[figures: tree ensemble example]

  • The issue is that we can't use SGD to find f
  • f is a tree, not just a vector of numeric values (see the sketch below)
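
A sketch of the standard ensemble formulation (following Chen's "Introduction to Boosted Trees"; the original slide images are not reproduced here, so this is a reconstruction in standard notation):

    \hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}

where \mathcal{F} is the space of regression trees. Each f_k is a whole tree (its structure plus its leaf scores), so there is no fixed parameter vector to differentiate, which is why plain SGD cannot be used to find f.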

Boosting

  • Train our model iteratively:

[formula: additive model update]

  • Optimize objective at each iteration

[formula: objective at each iteration]

One more tree = the mean loss decreases = more of the data is explained. When fitting trees 2 & 3, the original data points used for tree 1 are replaced by the corresponding loss (residual) points.
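
In standard notation, the iterative (additive) training scheme those formulas describe is (a reconstruction, not the original slide content):

    \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)

    \text{obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)

At each round t the previously learned trees are kept fixed and a single new tree f_t is chosen to reduce the objective as much as possible.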

Simple math

[formulas omitted; see the sketch below]
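
Since the original equation images are unavailable, here is a sketch of the standard derivation they typically cover, following Chen's "Introduction to Boosted Trees" (second-order Taylor expansion of the loss and the resulting optimal leaf weights):

    \text{obj}^{(t)} \approx \sum_{i=1}^{n}\left[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t),
    \qquad g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}), \quad h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})

    \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2

With G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i summed over the instances I_j falling into leaf j, the optimal leaf weight and objective value are

    w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \text{obj}^* = -\tfrac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T

Here T is the number of leaves and w_j are the leaf scores; \gamma and \lambda are the regularization parameters that appear later in the parameter list.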

Greedy Learning of the Tree

[slide content omitted; see the gain formula sketch below]

This slide is from Tianqi Chen's presentation
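
The key quantity on that slide is the split gain used for greedy tree construction; in the same notation as above (a reconstruction):

    \text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma

A split is kept only when this gain is positive, so \gamma acts as a built-in pruning threshold.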

Overview: Boosted Tree Algorithm

[figure: boosted tree algorithm overview]

This slide is from Tianqi Chen's presentation

How to install XGBoost?

R code:

install.packages('xgboost')

To get the latest development version, we can install from GitHub:

devtools::install_github('dmlc/xgboost', subdir='R-package')

  • There are good installation guides for the other languages too

Input data formats

  • The native command-line tool uses the LibSVM text format
  • The R/Python packages work with dense and sparse matrices
  • You can create an xgb.DMatrix from a dense matrix (here -999 marks missing values):
sparseData <- xgb.DMatrix(data, label = label, weight = weight, missing = -999)
  • You can save and load DMatrix objects:
xgb.DMatrix.save(sparseData, "dtrain.buffer.dat")
sparseData2 <- xgb.DMatrix("dtrain.buffer.dat")

Notes on input data

  • XGBoost always converts dense input to its internal sparse format

    • but for repeated training runs it is recommended to do this conversion once, as a preprocessing step
  • XGBoost handles only numeric features

    • so categorical data should be one-hot encoded first (see the sketch below)
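
A minimal sketch of one-hot encoding in R before building the DMatrix (the data frame df and its columns are hypothetical; sparse.model.matrix comes from the Matrix package):

library(Matrix)
library(xgboost)

# hypothetical data frame with one categorical and one numeric feature
df <- data.frame(color = factor(c("red", "green", "blue", "red")),
                 size  = c(1.2, 3.4, 2.2, 0.7),
                 y     = c(1, 0, 0, 1))

# one-hot encode the categorical columns into a sparse matrix
# ("- 1" drops the intercept column)
X <- sparse.model.matrix(y ~ . - 1, data = df)

dtrain <- xgb.DMatrix(X, label = df$y)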

Processing missing values

XGBoost processes missing values in a very natural and simple way

At each split, all missing values are sent to one of the subnodes

How it does this:

  • Guide the points with NA to each subnode in turn
  • Calculate the gain for each choice
  • Choose the direction with the larger gain

As a result, every node has a "default direction" for missing values
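
A minimal sketch (X and y are hypothetical): in the R package NA is the default missing-value marker, and a different sentinel can be declared through the missing argument:

# NAs in the feature matrix are handled natively by the tree learner
dtrain <- xgb.DMatrix(X, label = y)                  # missing = NA is the default
# if missing values are encoded with a sentinel such as -999 instead:
dtrain2 <- xgb.DMatrix(X, label = y, missing = -999)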

How to run

  • A simple interface:
data(agaricus.train, package='xgboost')
data(agaricus.test,  package='xgboost')
train <- agaricus.train
test  <- agaricus.test
mdl <- xgboost(data=train$data, label=train$label, nround=2,
               objective="binary:logistic", verbose=2)
pred <- predict(mdl, test$data)
tree prunning end, 1 roots, 20 extra nodes, 0 pruned nodes ,max_depth=5
[0] train-error:0.000614
tree prunning end, 1 roots, 18 extra nodes, 0 pruned nodes ,max_depth=5
[1] train-error:0.001228
  • data - matrix, dgCMatrix or xgb.DMatrix

  • For a more advanced interface, use the xgb.train function

You can dump the trees you learned

xgb.dump(mdl, with.stats = TRUE)
 [1] "booster[0]"                                                     
 [2] "0:[f28<-1e-005] yes=1,no=2,missing=1,gain=4000.53,cover=1628.25"
 [3] "1:[f55<-1e-005] yes=3,no=4,missing=3,gain=1158.21,cover=924.5"  
 [4] "3:leaf=1.71218,cover=812"                                       
 [5] "4:leaf=-1.70044,cover=112.5"                                    
 [6] "2:[f108<-1e-005] yes=5,no=6,missing=5,gain=198.174,cover=703.75"
 [7] "5:leaf=-1.94071,cover=690.5"                                    
 [8] "6:leaf=1.85965,cover=13.25"                                     
 [9] "booster[1]"                                                     
[10] "0:[f59<-1e-005] yes=1,no=2,missing=1,gain=832.545,cover=788.852"
[11] "1:[f28<-1e-005] yes=3,no=4,missing=3,gain=569.725,cover=768.39" 
[12] "3:leaf=0.784718,cover=458.937"                                  
[13] "4:leaf=-0.96853,cover=309.453"                                  
[14] "2:leaf=-6.23625,cover=20.4624"

Save and load models

We may save and load trained models:

xgb.save(mdl, "trained.model.dat")
mdl2 <- xgb.load("trained.model.dat")

Parameters

  • Step size (eta)
  • Regularization (gamma, lambda, alpha)
  • Number of threads (nthread)
  • Model type (gbtree or gblinear)
  • Objective and evaluation functions

Study more about parameters:

  • The documentation in the repository
  • The documentation for xgb.train function
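
As an illustration, a typical parameter list might look like the sketch below (the values are arbitrary placeholders, not recommendations; dtrain is assumed to be an xgb.DMatrix built earlier):

param <- list(booster     = "gbtree",           # model type: gbtree or gblinear
              objective   = "binary:logistic",
              eval_metric = "auc",
              eta         = 0.1,                # step size
              gamma       = 0,                  # min loss reduction to make a split
              lambda      = 1,                  # L2 regularization
              alpha       = 0,                  # L1 regularization
              nthread     = 4)
mdl <- xgb.train(params = param, data = dtrain, nrounds = 100)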

Objectives

  • "reg:linear" - linear regression
  • "binary:logistic" - logistic regression
  • "multi:softmax" - set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes)
  • "rank:pairwise" - set XGBoost to do ranking task by minimizing the pairwise loss
  • User-specified objective...

Evaluation Metrics

  • "rmse": root mean square error
  • "mae": mean absolute error
  • "logloss": negative log-likelihood
  • "error": Binary classification error rate - #(wrong cases)/#(all cases)
  • "merror": Multiclass classification error rate - #(wrong cases)/#(all cases)
  • "mlogloss": Multiclass logloss
  • "auc": Area under the curve for ranking evaluation
  • "ndcg": Normalized Discounted Cumulative Gain
  • "map": Mean average precision
  • "ndcg@n","map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation.
  • User-specified evaluation metric

MNIST example. Training

dtrain <- xgb.DMatrix(dataTrain, label=labelTrain)   # 5000 examples
dtest  <- xgb.DMatrix(dataTest,  label=labelTest)    # 1000 examples
param <- list(objective="multi:softmax", num_class=10, eval_metric="mlogloss",
              eta=0.2, max_depth=5, subsample=1, colsample_bytree=0.5, nthread=4)
mdl <- xgb.train(params=param, data=dtrain, nrounds=150,
                 watchlist=list(eval=dtest, train=dtrain))
[0] eval-mlogloss:1.762896  train-mlogloss:1.699440
[1] eval-mlogloss:1.472660  train-mlogloss:1.382979
[2] eval-mlogloss:1.265222  train-mlogloss:1.153952
...
[147]   eval-mlogloss:0.161354  train-mlogloss:0.002686
[148]   eval-mlogloss:0.161404  train-mlogloss:0.002668
[149]   eval-mlogloss:0.161563  train-mlogloss:0.002651

MNIST example. Evaluation

pred <- predict(mdl, newdata=dtest)
sum(diag(table(labelTest, pred)))/length(labelTest)   # overall test-set accuracy
[1] 0.947

Tuning parameters

  • Complete Guide to Parameter Tuning in XGBoost

  • Use the caret package

  • Parameters we may tune:

    • nrounds
    • eta
    • max_depth
    • gamma
    • colsample_bytree - subsample ratio of columns when constructing each tree
    • min_child_weight - minimum sum of instance weight(hessian) needed in a child

Tuning parameters. Example

library(caret)

# set up the cross-validated hyper-parameter search
xgb_grid_1 <- expand.grid(
  nrounds          = c(10, 100, 200),
  eta              = c(0.01, 0.001, 0.0001),
  max_depth        = c(2, 4, 6, 8),
  gamma            = 1,
  colsample_bytree = 1,
  min_child_weight = 1
)
# pack the training control parameters
xgb_trcontrol_1 <- trainControl(
  method          = "cv",
  number          = 5,
  verboseIter     = TRUE,
  returnData      = FALSE,
  classProbs      = TRUE,             # y must be a factor with valid level names
  summaryFunction = multiClassSummary,
  allowParallel   = TRUE
)
# using CV, train the model for each parameter combination in the grid
xgb_train_1 <- train(
  x         = as.matrix(dat),
  y         = y,
  trControl = xgb_trcontrol_1,
  tuneGrid  = xgb_grid_1,
  method    = "xgbTree"
)
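
After training, the caret object stores the winning parameter combination and the cross-validation results, which can be inspected with the standard accessors:

xgb_train_1$bestTune        # best hyper-parameter combination found by CV
head(xgb_train_1$results)   # CV performance for every combination in the grid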

Caveats

  • Don't use it on linearly separable problems like this one

[figure: a linearly separable toy problem]

  • But manual feature engineering (attribute munging) can help a lot

Advanced features

  • Customization of objective & eval metrics
  • "Continuous" training
  • Embedded feature importance estimator
  • Command line run
  • Early-stopping

Customization of the objective function

loglossobj <- function(preds, dtrain) {
  # Extract the labels from the training data
  labels <- getinfo(dtrain, "label")
  # preds are raw margins: apply the sigmoid to get probabilities
  preds <- 1/(1 + exp(-preds))
  # 1st and 2nd order gradients of the logloss w.r.t. the margin
  grad <- preds - labels
  hess <- preds * (1 - preds)
  # Return the result as a list
  return(list(grad = grad, hess = hess))
}

model <- xgboost(data = train$data, label = train$label,
                 nrounds = 2, objective = loglossobj, eval_metric = "error")
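
The advanced-features list also mentions custom evaluation metrics. A minimal sketch (the metric name and error definition are illustrative; dtrain is built here from the agaricus data used above), passed to xgb.train through feval:

dtrain <- xgb.DMatrix(train$data, label = train$label)

evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  # fraction of misclassified examples (thresholding the raw margin at 0)
  err <- mean((preds > 0) != labels)
  return(list(metric = "custom-error", value = err))
}

mdl <- xgb.train(params = list(eta = 0.1), data = dtrain, nrounds = 2,
                 watchlist = list(train = dtrain),
                 obj = loglossobj, feval = evalerror, maximize = FALSE)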

"Continuous" training

We can train the model incrementally:

# train for one round
mdl <- xgboost(params=param, data=dtrain, nround=1)
# predict the raw (margin) scores of the current model
ptrain <- predict(mdl, dtrain, outputmargin=TRUE)
# use them as the starting point (base margin) for the next round
setinfo(dtrain, "base_margin", ptrain)
mdl <- xgboost(params=param, data=dtrain, nround=1)
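
Depending on the package version, xgb.train also accepts an xgb_model argument that continues boosting from an existing model, which achieves a similar effect; a sketch:

mdl2 <- xgb.train(params = param, data = dtrain, nrounds = 10,
                  xgb_model = mdl)   # adds 10 more trees on top of mdl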

Feature importance estimator

The simplest importance measure is how often each variable appears across all the trees (the Frequence column); Gain and Cover are reported as well

xgb.importance(train$data@Dimnames[[2]], model=mdl)
     Feature         Gain        Cover    Frequence
  1:     406 2.407569e-02 1.088104e-02 4.050633e-03
  2:     347 2.293334e-02 1.539037e-02 4.746835e-03
  3:     436 2.240535e-02 7.608860e-03 4.873418e-03
  4:     438 1.606233e-02 9.233623e-03 2.848101e-03
  5:     351 1.456926e-02 9.906179e-03 5.443038e-03
 ---                                               
478:     500 6.264526e-06 1.113056e-04 1.265823e-04
479:     445 6.190388e-06 1.587092e-05 6.329114e-05
480:     202 4.774393e-06 2.505570e-05 6.329114e-05
481:     285 4.367899e-06 9.859387e-06 6.329114e-05
482:     313 3.036956e-06 8.477709e-06 6.329114e-05

Feature importance plot

  • Surprisingly, there are only about 6 very important pixels in the sampled MNIST dataset (see the plotting sketch below)

[figure: feature importance plot]
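
A sketch of how such a plot can be produced from the importance table computed above:

importance_matrix <- xgb.importance(train$data@Dimnames[[2]], model = mdl)
xgb.plot.importance(importance_matrix)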

XGBoost Roadmap for 2016

  • Distributed Versions for Python & JVM
  • DataFrame to DMatrix Conversion
  • External Memory

Links & Info