Introduction to XGBoost

Vladyslav Kolbasin
Data Scientist at Griddynamics, Lecturer at NTU 'KhPI' dep. KMMM

Overview

  • What is it?
  • XGBoost theory
  • How to use it?
  • Example
  • Advanced Features

What is it?

XGBoost == eXtreme Gradient Boosting

  • Open-source tool, hosted on GitHub
  • Easy to use
    • Easy to install
    • Interfaces to R/Python/Julia/Java/Scala
    • Good documentation
    • Highly tunable
  • Very fast
  • Gives very accurate models
    • Won many Kaggle competitions

Winning Solutions

  • Marios Michailidis, Mathias Muller and HJ van Veen, 1st place of the Dato Truly Native? competition.
  • Vlad Mironov, Alexander Guschin, 1st place of the CERN LHCb experiment Flavour of Physics competition.
  • Josef Slavicek, 3rd place of the CERN LHCb experiment Flavour of Physics competition.
  • Mario Filho, Josef Feigl, Lucas, Gilberto, 1st place of the Caterpillar Tube Pricing competition.
  • Qingchen Wang, 1st place of the Liberty Mutual Property Inspection.
  • Chenglong Chen, 1st place of the Crowdflower Search Results Relevance.
  • Alexandre Barachant and Rafal Cycon, 1st place of the Grasp-and-Lift EEG Detection.
  • Halla Yang, 2nd place of the Recruit Coupon Purchase Prediction Challenge.
  • Owen Zhang, 1st place of the Avito Context Ad Clicks competition.
  • Keiichi Kuroyanagi, 2nd place of the Airbnb New User Bookings.
  • Marios Michailidis, Mathias Muller and Ning Situ, 1st place Homesite Quote Conversion.

Efficiency

  • The core is written in C++
  • Automatic parallel computation on a single machine (uses OpenMP)
  • Generally around 10 times faster than the classical gbm package
  • Distributed: there are versions for YARN, MPI, etc.

How does it work?

XGBoost is short for "eXtreme Gradient Boosting"
It implements the classical gradient boosting algorithm introduced by Friedman

  • Tree model
    • There is also a linear booster, but it is less commonly used
  • Ensembling
    • Gradient Boosting

The best description is "Introduction to Boosted Trees" by Tianqi Chen

In a few words: how do we find a tree that improves the prediction along the gradient?

Tree-based models

[figure: tree-based model example]

This slide is from Tianqi Chen's presentation

Ensembling

  • Ensemble: usually a simple sum of the trees' outputs

[figures: tree ensemble example]

  • The issue is that we can't use SGD to find f
  • f is a tree, not just a vector of numeric values (see the sketch below)
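
A sketch of the standard ensemble formulation (following Chen's "Introduction to Boosted Trees"; the original slide images are not reproduced here, so this is a reconstruction in standard notation):

    \hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}

where \mathcal{F} is the space of regression trees. Each f_k is a whole tree (its structure plus its leaf scores), so there is no fixed parameter vector to differentiate, which is why plain SGD cannot be used to find f.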

Boosting

  • Train our model iteratively:

[formula: additive model update]

  • Optimize objective at each iteration

[formula: objective at each iteration]

One more tree = the mean loss decreases = more of the data is explained. When fitting trees 2 & 3, the original data points used for tree 1 are replaced by the corresponding loss (residual) points.
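
In standard notation, the iterative (additive) training scheme those formulas describe is (a reconstruction, not the original slide content):

    \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)

    \text{obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)

At each round t the previously learned trees are kept fixed and a single new tree f_t is chosen to reduce the objective as much as possible.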

Simple math

[formulas omitted; see the sketch below]
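
Since the original equation images are unavailable, here is a sketch of the standard derivation they typically cover, following Chen's "Introduction to Boosted Trees" (second-order Taylor expansion of the loss and the resulting optimal leaf weights):

    \text{obj}^{(t)} \approx \sum_{i=1}^{n}\left[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t),
    \qquad g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}), \quad h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})

    \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2

With G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i summed over the instances I_j falling into leaf j, the optimal leaf weight and objective value are

    w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \text{obj}^* = -\tfrac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T

Here T is the number of leaves and w_j are the leaf scores; \gamma and \lambda are the regularization parameters that appear later in the parameter list.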

Greedy Learning of the Tree

[slide content omitted; see the gain formula sketch below]

This slide is from Tianqi Chen's presentation
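
The key quantity on that slide is the split gain used for greedy tree construction; in the same notation as above (a reconstruction):

    \text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma

A split is kept only when this gain is positive, so \gamma acts as a built-in pruning threshold.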

Overview: Boosted Tree Algorithm

[figure: boosted tree algorithm overview]

This slide is from Tianqi Chen's presentation

How to install XGBoost?

R code:

install.packages('xgboost')

To get the latest development version, we can install from GitHub:

devtools::install_github('dmlc/xgboost', subdir='R-package')

  • There are good installation guides for the other languages too

Input data formats

  • The native command-line tool uses the LibSVM text format
  • The R/Python packages work with dense and sparse matrices
  • You can create an xgb.DMatrix from a dense matrix (here -999 marks missing values):
sparseData <- xgb.DMatrix(data, label = label, weight = weight, missing = -999)
  • You can save and load DMatrix objects:
xgb.DMatrix.save(sparseData, "dtrain.buffer.dat")
sparseData2 <- xgb.DMatrix("dtrain.buffer.dat")

Notes on input data

  • XGBoost always converts dense input to its internal sparse format

    • but for repeated training runs it is recommended to do this conversion once, as a preprocessing step
  • XGBoost handles only numeric features

    • so categorical data should be one-hot encoded first (see the sketch below)
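
A minimal sketch of one-hot encoding in R before building the DMatrix (the data frame df and its columns are hypothetical; sparse.model.matrix comes from the Matrix package):

library(Matrix)
library(xgboost)

# hypothetical data frame with one categorical and one numeric feature
df <- data.frame(color = factor(c("red", "green", "blue", "red")),
                 size  = c(1.2, 3.4, 2.2, 0.7),
                 y     = c(1, 0, 0, 1))

# one-hot encode the categorical columns into a sparse matrix
# ("- 1" drops the intercept column)
X <- sparse.model.matrix(y ~ . - 1, data = df)

dtrain <- xgb.DMatrix(X, label = df$y)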

Processing missing values

XGBoost processes missing values in a very natural and simple way

At each split, all missing values are sent to one of the subnodes

How it does this:

  • Guide the points with NA to each subnode in turn
  • Calculate the gain for each choice
  • Choose the direction with the larger gain

As a result, every node has a "default direction" for missing values
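
A minimal sketch (X and y are hypothetical): in the R package NA is the default missing-value marker, and a different sentinel can be declared through the missing argument:

# NAs in the feature matrix are handled natively by the tree learner
dtrain <- xgb.DMatrix(X, label = y)                  # missing = NA is the default
# if missing values are encoded with a sentinel such as -999 instead:
dtrain2 <- xgb.DMatrix(X, label = y, missing = -999)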

How to run

  • A simple interface:
data(agaricus.train, package='xgboost')
data(agaricus.test,  package='xgboost')
train <- agaricus.train
test  <- agaricus.test
mdl <- xgboost(data=train$data, label=train$label, nround=2,
               objective="binary:logistic", verbose=2)
pred <- predict(mdl, test$data)
tree prunning end, 1 roots, 20 extra nodes, 0 pruned nodes ,max_depth=5
[0] train-error:0.000614
tree prunning end, 1 roots, 18 extra nodes, 0 pruned nodes ,max_depth=5
[1] train-error:0.001228
  • data - matrix, dgCMatrix or xgb.DMatrix

  • For a more advanced interface, use the xgb.train function

You can dump the trees you learned

xgb.dump(mdl, with.stats = TRUE)
 [1] "booster[0]"                                                     
 [2] "0:[f28<-1e-005] yes=1,no=2,missing=1,gain=4000.53,cover=1628.25"
 [3] "1:[f55<-1e-005] yes=3,no=4,missing=3,gain=1158.21,cover=924.5"  
 [4] "3:leaf=1.71218,cover=812"                                       
 [5] "4:leaf=-1.70044,cover=112.5"                                    
 [6] "2:[f108<-1e-005] yes=5,no=6,missing=5,gain=198.174,cover=703.75"
 [7] "5:leaf=-1.94071,cover=690.5"                                    
 [8] "6:leaf=1.85965,cover=13.25"                                     
 [9] "booster[1]"                                                     
[10] "0:[f59<-1e-005] yes=1,no=2,missing=1,gain=832.545,cover=788.852"
[11] "1:[f28<-1e-005] yes=3,no=4,missing=3,gain=569.725,cover=768.39" 
[12] "3:leaf=0.784718,cover=458.937"                                  
[13] "4:leaf=-0.96853,cover=309.453"                                  
[14] "2:leaf=-6.23625,cover=20.4624"

Save and load models

We may save and load trained models:

xgb.save(mdl, "trained.model.dat")
mdl2 <- xgb.load("trained.model.dat")

Parameters

  • Step size (eta)
  • Regularization (gamma, lambda, alpha)
  • Number of threads (nthread)
  • Model type (gbtree or gblinear)
  • Objective and evaluation functions

Study more about parameters:

  • The documentation in the repository
  • The documentation for xgb.train function
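
As an illustration, a typical parameter list might look like the sketch below (the values are arbitrary placeholders, not recommendations; dtrain is assumed to be an xgb.DMatrix built earlier):

param <- list(booster     = "gbtree",           # model type: gbtree or gblinear
              objective   = "binary:logistic",
              eval_metric = "auc",
              eta         = 0.1,                # step size
              gamma       = 0,                  # min loss reduction to make a split
              lambda      = 1,                  # L2 regularization
              alpha       = 0,                  # L1 regularization
              nthread     = 4)
mdl <- xgb.train(params = param, data = dtrain, nrounds = 100)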

Objectives

  • "reg:linear" - linear regression
  • "binary:logistic" - logistic regression
  • "multi:softmax" - set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes)
  • "rank:pairwise" - set XGBoost to do ranking task by minimizing the pairwise loss
  • User-specified objective...

Evaluation Metrics

  • "rmse": root mean square error
  • "mae": mean absolute error
  • "logloss": negative log-likelihood
  • "error": Binary classification error rate - #(wrong cases)/#(all cases)
  • "merror": Multiclass classification error rate - #(wrong cases)/#(all cases)
  • "mlogloss": Multiclass logloss
  • "auc": Area under the curve for ranking evaluation
  • "ndcg": Normalized Discounted Cumulative Gain
  • "map": Mean average precision
  • "ndcg@n","map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation.
  • User-specified evaluation metric

MNIST example. Training

dtrain <- xgb.DMatrix(dataTrain, label=labelTrain)   # 5000 examples
dtest  <- xgb.DMatrix(dataTest,  label=labelTest)    # 1000 examples
param <- list(objective="multi:softmax", num_class=10, eval_metric="mlogloss",
              eta=0.2, max_depth=5, subsample=1, colsample_bytree=0.5, nthread=4)
mdl <- xgb.train(params=param, data=dtrain, nrounds=150,
                 watchlist=list(eval=dtest, train=dtrain))
[0] eval-mlogloss:1.762896  train-mlogloss:1.699440
[1] eval-mlogloss:1.472660  train-mlogloss:1.382979
[2] eval-mlogloss:1.265222  train-mlogloss:1.153952
...
[147]   eval-mlogloss:0.161354  train-mlogloss:0.002686
[148]   eval-mlogloss:0.161404  train-mlogloss:0.002668
[149]   eval-mlogloss:0.161563  train-mlogloss:0.002651

MNIST example. Evaluation

pred <- predict(mdl, newdata=dtest)
sum(diag(table(labelTest, pred)))/length(labelTest)   # overall test-set accuracy
[1] 0.947

Tuning parameters

  • Complete Guide to Parameter Tuning in XGBoost

  • Use the caret package

  • Parameters we may tune:

    • nrounds
    • eta
    • max_depth
    • gamma
    • colsample_bytree - subsample ratio of columns when constructing each tree
    • min_child_weight - minimum sum of instance weight(hessian) needed in a child

Tuning parameters. Example

library(caret)

# set up the cross-validated hyper-parameter search
xgb_grid_1 <- expand.grid(
  nrounds          = c(10, 100, 200),
  eta              = c(0.01, 0.001, 0.0001),
  max_depth        = c(2, 4, 6, 8),
  gamma            = 1,
  colsample_bytree = 1,
  min_child_weight = 1
)
# pack the training control parameters
xgb_trcontrol_1 <- trainControl(
  method          = "cv",
  number          = 5,
  verboseIter     = TRUE,
  returnData      = FALSE,
  classProbs      = TRUE,             # y must be a factor with valid level names
  summaryFunction = multiClassSummary,
  allowParallel   = TRUE
)
# using CV, train the model for each parameter combination in the grid
xgb_train_1 <- train(
  x         = as.matrix(dat),
  y         = y,
  trControl = xgb_trcontrol_1,
  tuneGrid  = xgb_grid_1,
  method    = "xgbTree"
)
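
After training, the caret object stores the winning parameter combination and the cross-validation results, which can be inspected with the standard accessors:

xgb_train_1$bestTune        # best hyper-parameter combination found by CV
head(xgb_train_1$results)   # CV performance for every combination in the grid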

Caveats

  • Don't use it on linearly separable problems like this one

[figure: a linearly separable toy problem]

  • But manual feature engineering (attribute munging) can help a lot

Advanced features

  • Customization of objective & eval metrics
  • "Continuous" training
  • Embedded feature importance estimator
  • Command line run
  • Early-stopping

Customization of the objective function

loglossobj <- function(preds, dtrain) {
  # Extract the labels from the training data
  labels <- getinfo(dtrain, "label")
  # preds are raw margins: apply the sigmoid to get probabilities
  preds <- 1/(1 + exp(-preds))
  # 1st and 2nd order gradients of the logloss w.r.t. the margin
  grad <- preds - labels
  hess <- preds * (1 - preds)
  # Return the result as a list
  return(list(grad = grad, hess = hess))
}

model <- xgboost(data = train$data, label = train$label,
                 nrounds = 2, objective = loglossobj, eval_metric = "error")
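
The advanced-features list also mentions custom evaluation metrics. A minimal sketch (the metric name and error definition are illustrative; dtrain is built here from the agaricus data used above), passed to xgb.train through feval:

dtrain <- xgb.DMatrix(train$data, label = train$label)

evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  # fraction of misclassified examples (thresholding the raw margin at 0)
  err <- mean((preds > 0) != labels)
  return(list(metric = "custom-error", value = err))
}

mdl <- xgb.train(params = list(eta = 0.1), data = dtrain, nrounds = 2,
                 watchlist = list(train = dtrain),
                 obj = loglossobj, feval = evalerror, maximize = FALSE)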

"Continuous" training

We can train the model incrementally:

# train for one round
mdl <- xgboost(params=param, data=dtrain, nround=1)
# predict the raw (margin) scores of the current model
ptrain <- predict(mdl, dtrain, outputmargin=TRUE)
# use them as the starting point (base margin) for the next round
setinfo(dtrain, "base_margin", ptrain)
mdl <- xgboost(params=param, data=dtrain, nround=1)
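
Depending on the package version, xgb.train also accepts an xgb_model argument that continues boosting from an existing model, which achieves a similar effect; a sketch:

mdl2 <- xgb.train(params = param, data = dtrain, nrounds = 10,
                  xgb_model = mdl)   # adds 10 more trees on top of mdl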

Feature importance estimator

The simplest importance measure is how often each variable appears across all the trees (the Frequence column); Gain and Cover are reported as well

xgb.importance(train$data@Dimnames[[2]], model=mdl)
     Feature         Gain        Cover    Frequence
  1:     406 2.407569e-02 1.088104e-02 4.050633e-03
  2:     347 2.293334e-02 1.539037e-02 4.746835e-03
  3:     436 2.240535e-02 7.608860e-03 4.873418e-03
  4:     438 1.606233e-02 9.233623e-03 2.848101e-03
  5:     351 1.456926e-02 9.906179e-03 5.443038e-03
 ---                                               
478:     500 6.264526e-06 1.113056e-04 1.265823e-04
479:     445 6.190388e-06 1.587092e-05 6.329114e-05
480:     202 4.774393e-06 2.505570e-05 6.329114e-05
481:     285 4.367899e-06 9.859387e-06 6.329114e-05
482:     313 3.036956e-06 8.477709e-06 6.329114e-05

Feature importance plot

  • Surprisingly, there are only about 6 very important pixels in the sampled MNIST dataset (see the plotting sketch below)

[figure: feature importance plot]
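
A sketch of how such a plot can be produced from the importance table computed above:

importance_matrix <- xgb.importance(train$data@Dimnames[[2]], model = mdl)
xgb.plot.importance(importance_matrix)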

XGBoost Roadmap for 2016

  • Distributed Versions for Python & JVM
  • DataFrame to DMatrix Conversion
  • External Memory

Links & Info