Introduction

This work is an exercise on a Machine Learning technique in R: XGBoost.

??xgboost
## starting httpd help server ... done

Package help usually focuses on how to use the package, especially in advanced sections such as this one.

The majority of the text is transcribed directly from the help pages. Since this is just an exercise, there is no intention of reinventing the wheel.

In the documentation, we cannot see the plots or the tree diagrams.

Here we show all plots and code for a full appreciation of the help content about XGBoost.

Advanced features

Prerequisites

install.packages("DiagrammeR")

Most of the features below have been implemented to help you to improve your model by offering a better understanding of its content.

require(xgboost)
## Loading required package: xgboost
# 2º. Acquire/Generate some data

# 3º. Split the data into train and test subsets

data(agaricus.train, package='xgboost') # used to build the model
data(agaricus.test, package='xgboost') # used to assess the quality of our model

# Why split the dataset in two parts?
# Without dividing the dataset we would test the model on data which the algorithm has already seen.
# And the purpose of using Machine Learning is to be able to predict outcomes for unseen data.

train <- agaricus.train
test <- agaricus.test

Dataset preparation

For the following advanced features, we need to put data in xgb.DMatrix as explained above.

dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)

#print(head(train))

Measure learning progress with xgb.train

Both xgboost (simple) and xgb.train (advanced) functions train models.

One of the special features of xgb.train is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a point at which having too many rounds leads to overfitting. You can see this feature as a cousin of the cross-validation method. The following techniques will help you avoid overfitting or optimize the learning time by stopping it as soon as possible.

One way to measure progress in the learning of a model is to provide XGBoost with a second dataset that is already classified. It can then learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.

In some way it is similar to what we have done above with the average error. The main difference is that there the error was measured after building the model, whereas now we measure errors during its construction.

For the purpose of this example, we use the watchlist parameter. It is a list of xgb.DMatrix objects, each of them tagged with a name.

watchlist <- list(train=dtrain, test=dtest)

bst <- xgb.train(data=dtrain, max_depth=2, eta=1, nthread = 2, nrounds=2, watchlist=watchlist, objective = "binary:logistic")
## [1]  train-error:0.046522    test-error:0.042831 
## [2]  train-error:0.022263    test-error:0.021726

At each round, XGBoost has computed the same average error metric as seen above (we set nrounds to 2, which is why there are two lines). The train-error number relates to the training dataset (the one the algorithm learns from) and the test-error number to the test dataset.

The training and test error metrics are very similar, which makes sense: what we have learned from the training dataset matches the observations from the test dataset.

If you do not get such results with your own dataset, you should think about how you divided it into training and test sets. Maybe there is something to fix. Again, the caret package may help.
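The idea of stopping the learning as soon as the test metric stops improving can be sketched in plain R. This is only an illustration of the logic: the helper best_round_with_patience is hypothetical (it is not part of xgboost), and the error values are made up rather than produced by a model.

```r
# Hypothetical helper (not part of xgboost): return the round with the lowest
# validation error, giving up once `patience` consecutive rounds have passed
# without any improvement.
best_round_with_patience <- function(errors, patience = 2) {
  best_round <- 1
  best_error <- errors[1]
  for (i in seq_along(errors)[-1]) {
    if (errors[i] < best_error) {
      best_error <- errors[i]
      best_round <- i
    } else if (i - best_round >= patience) {
      break  # no improvement for `patience` rounds: stop here
    }
  }
  best_round
}

# Made-up per-round test errors: they improve, then start to degrade.
test_errors <- c(0.043, 0.022, 0.015, 0.016, 0.018, 0.021)
best_round <- best_round_with_patience(test_errors, patience = 2)
print(best_round)
```

xgb.train offers an early_stopping_rounds argument that applies a rule of this kind to the watchlist metrics; its exact behavior depends on the installed version, so check the help page before relying on it.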

For a better understanding of the learning progression, you may want to have some specific metric or even use multiple evaluation metrics.

bst <- xgb.train(data=dtrain, max_depth=2, eta=1, nthread = 2, nrounds=2, watchlist=watchlist, eval_metric = "error", eval_metric = "logloss", objective = "binary:logistic")
## [1]  train-error:0.046522    train-logloss:0.233366  test-error:0.042831 test-logloss:0.226687 
## [2]  train-error:0.022263    train-logloss:0.136656  test-error:0.021726 test-logloss:0.137875

eval_metric allows us to monitor two new metrics for each round, logloss and error.
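To make the two metrics concrete, here is a minimal sketch that computes them by hand in base R. The labels and probabilities are made up for illustration, and the 0.5 threshold for the error metric is the one used elsewhere in this text for binary classification.

```r
# Toy labels and predicted probabilities (made up for illustration).
label <- c(1, 0, 1, 1, 0)
prob  <- c(0.9, 0.2, 0.4, 0.8, 0.6)

# error: share of observations whose thresholded prediction differs from the label.
error <- mean(as.integer(prob > 0.5) != label)

# logloss: average negative log-likelihood of the true labels.
logloss <- -mean(label * log(prob) + (1 - label) * log(1 - prob))

print(c(error = error, logloss = logloss))
```

The error only counts wrong decisions, while logloss also penalizes predictions that are right but not confident, which is why watching both can be informative.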

Linear boosting

Until now, all the learning we have performed was based on boosted trees. XGBoost implements a second algorithm, based on linear boosting. The only difference from the previous command is the booster = "gblinear" parameter (and the removal of the eta parameter).

bst <- xgb.train(data=dtrain, booster = "gblinear", max_depth=2, nthread = 2, nrounds=2, watchlist=watchlist, eval_metric = "error", eval_metric = "logloss", objective = "binary:logistic")
## [1]  train-error:0.016582    train-logloss:0.192610  test-error:0.018001 test-logloss:0.196706 
## [2]  train-error:0.002764    train-logloss:0.083713  test-error:0.003104 test-logloss:0.086552

In this specific case, linear boosting gets slightly better performance metrics than the decision-tree-based algorithm.

In simple cases, this happens because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better at catching a nonlinear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to get an idea of what to use.

Manipulating xgb.DMatrix

Save / Load

As with models, an xgb.DMatrix object (which groups both the dataset and the outcome) can be saved, using the xgb.DMatrix.save function.

xgb.DMatrix.save(dtrain, "dtrain.buffer")
## [1] TRUE
# to load it in, simply call xgb.DMatrix
dtrain2 <- xgb.DMatrix("dtrain.buffer")
## [23:51:11] 6513x126 matrix with 143286 entries loaded from dtrain.buffer
bst <- xgb.train(data=dtrain2, max_depth=2, eta=1, nthread = 2, nrounds=2, watchlist=watchlist, objective = "binary:logistic")
## [1]  train-error:0.046522    test-error:0.042831 
## [2]  train-error:0.022263    test-error:0.021726

Information extraction

Information can be extracted from an xgb.DMatrix using the getinfo function. Below we extract the label data.

label <- getinfo(dtest, "label")
pred <- predict(bst, dtest)
err <- as.numeric(sum(as.integer(pred > 0.5) != label))/length(label)
print(paste("test-error=", err))
## [1] "test-error= 0.0217256362507759"

View feature importance/influence from the learnt model

Feature importance is similar to R gbm package’s relative influence (rel.inf).

importance_matrix <- xgb.importance(model = bst)
print(importance_matrix)
##    Feature       Gain     Cover Frequency
## 1:      28 0.67615471 0.4978746       0.4
## 2:      55 0.17135375 0.1920543       0.2
## 3:      59 0.12317236 0.1638750       0.2
## 4:     108 0.02931918 0.1461960       0.2
xgb.plot.importance(importance_matrix = importance_matrix)
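As a rough check of how to read this table, the Frequency column can be recomputed by hand, assuming it is the share of all splits that use each feature. The sketch below hardcodes the split features reported by xgb.dump for this model (shown in the next section) and the Gain values printed above; nothing here calls xgboost.

```r
# Split features of the two trees, copied from the xgb.dump output of this model
# (booster[0] splits on f28, f55, f108; booster[1] splits on f59, f28).
split_features <- c(28, 55, 108, 59, 28)

# Assumed definition of Frequency: share of all splits that use each feature.
frequency <- table(split_features) / length(split_features)
print(frequency)

# Gain values copied from the importance table above; they appear to be
# normalised so that they sum to 1.
gain <- c(0.67615471, 0.17135375, 0.12317236, 0.02931918)
print(sum(gain))
```

Feature 28 is used in two of the five splits, which matches its Frequency of 0.4 in the table.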

View the trees from a model

You can dump the trees you learned into a text file using xgb.dump.

xgb.dump(bst, with_stats = TRUE)
##  [1] "booster[0]"                                                                    
##  [2] "0:[f28<-9.53674316e-007] yes=1,no=2,missing=1,gain=4000.53101,cover=1628.25"   
##  [3] "1:[f55<-9.53674316e-007] yes=3,no=4,missing=3,gain=1158.21204,cover=924.5"     
##  [4] "3:leaf=1.71217716,cover=812"                                                   
##  [5] "4:leaf=-1.70044053,cover=112.5"                                                
##  [6] "2:[f108<-9.53674316e-007] yes=5,no=6,missing=5,gain=198.173828,cover=703.75"   
##  [7] "5:leaf=-1.94070864,cover=690.5"                                                
##  [8] "6:leaf=1.85964918,cover=13.25"                                                 
##  [9] "booster[1]"                                                                    
## [10] "0:[f59<-9.53674316e-007] yes=1,no=2,missing=1,gain=832.545044,cover=788.852051"
## [11] "1:[f28<-9.53674316e-007] yes=3,no=4,missing=3,gain=569.725098,cover=768.389709"
## [12] "3:leaf=0.78471756,cover=458.936859"                                            
## [13] "4:leaf=-0.968530357,cover=309.45282"                                           
## [14] "2:leaf=-6.23624468,cover=20.462389"
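The leaf values in this dump can be turned into a prediction by hand. With the binary:logistic objective, each leaf holds a contribution on the log-odds scale; summing one leaf per tree and applying the logistic function gives a probability. The sketch below assumes that the initial score adds nothing on the log-odds scale (base_score = 0.5, an assumption about the version used here) and follows a hypothetical observation that ends in leaf 3 of both trees.

```r
# Leaf values copied from the dump above, for a hypothetical observation that
# ends in leaf 3 of booster[0] and leaf 3 of booster[1].
leaf_tree0 <- 1.71217716
leaf_tree1 <- 0.78471756

# Assumption: the initial score contributes 0 on the log-odds scale
# (base_score = 0.5), so the margin is just the sum of the leaves.
margin <- leaf_tree0 + leaf_tree1

# The logistic function turns the margin into a probability.
prob <- 1 / (1 + exp(-margin))
print(prob)
```

The result is well above 0.5, so this observation would be classified as positive by the rule pred > 0.5 used earlier.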

You can plot the trees from your model using xgb.plot.tree.

require(DiagrammeR)
## Loading required package: DiagrammeR
xgb.plot.tree(model = bst)

If you provide a path to the fname parameter of xgb.dump, you can save the dumped trees to your hard drive.

Save and load models

Maybe your dataset is big, and it takes time to train a model on it? Maybe you are not a big fan of wasting time redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.

Fortunately for you, XGBoost implements such functions.

# save model to binary local file
xgb.save(bst, "xgboost.model")
## [1] TRUE

The xgb.save function should return TRUE if everything goes well and crash otherwise.

An interesting test to see how identical our saved model is to the original one would be to compare the two predictions.

# load binary model to R
bst2 <- xgb.load("xgboost.model")
pred2 <- predict(bst2, test$data)

# And now the test
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2-pred))))
## [1] "sum(abs(pred2-pred))= 0"

The result is 0? We are good!

In some very specific cases, such as when you want to drive XGBoost from the caret package, you will want to save the model as an R raw vector. See below how to do it.

# save model to R's raw vector
rawVec <- xgb.save.raw(bst)

# print class
print(class(rawVec))
## [1] "raw"
# load binary model to R
bst3 <- xgb.load(rawVec)
pred3 <- predict(bst3, test$data)

# pred3 should be identical to pred
print(paste("sum(abs(pred3-pred))=", sum(abs(pred3-pred))))
## [1] "sum(abs(pred3-pred))= 0"

Again 0? It seems that XGBoost works pretty well!

References

R Help

Read the docs