We will load the agaricus train and test datasets. Each is a list with two entries: data, a sparse matrix in dgCMatrix format, and label, the vector of outcomes.
# load train and test data
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
data.frame(train = dim(train$data), test = dim(test$data), row.names = c("row", "col"))
##     train test
## row  6513 1611
## col   126  126
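You can check this structure directly; data is a sparse matrix of class dgCMatrix and label is a numeric vector with one entry per row of data.
# check the structure of the loaded objects
class(train$data)
class(train$label)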
# dataset preparation for xgb.train
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)
One of the special features of xgb.train is the ability to follow the progress of learning after each round. Because of the way boosting works, there is a point at which adding more rounds leads to overfitting. You can think of this feature as a cousin of cross-validation. The following techniques will help you avoid overfitting and shorten training time by stopping the learning as early as possible.
One way to measure progress during learning is to provide XGBoost with a second, already-labelled dataset: the model learns on the first dataset and is evaluated on the second one. The chosen metrics are computed on both datasets after each round.
watchlist <- list(train=dtrain, test=dtest)
bst_tree <- xgb.train(data=dtrain, max_depth=2, eta=1, nthread = 2,
                      nrounds=10, watchlist=watchlist, eval_metric = "error",
                      eval_metric = "logloss", objective = "binary:logistic")
## [1] train-error:0.046522 train-logloss:0.233376 test-error:0.042831 test-logloss:0.226686
## [2] train-error:0.022263 train-logloss:0.136658 test-error:0.021726 test-logloss:0.137874
## [3] train-error:0.007063 train-logloss:0.082531 test-error:0.006207 test-logloss:0.080461
## [4] train-error:0.015200 train-logloss:0.056474 test-error:0.018001 test-logloss:0.058329
## [5] train-error:0.007063 train-logloss:0.041513 test-error:0.006207 test-logloss:0.038287
## [6] train-error:0.001228 train-logloss:0.029606 test-error:0.000000 test-logloss:0.026631
## [7] train-error:0.001228 train-logloss:0.019191 test-error:0.000000 test-logloss:0.013875
## [8] train-error:0.001228 train-logloss:0.013320 test-error:0.000000 test-logloss:0.010198
## [9] train-error:0.001228 train-logloss:0.011130 test-error:0.000000 test-logloss:0.008483
## [10] train-error:0.000000 train-logloss:0.006634 test-error:0.000000 test-logloss:0.006920
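If you would rather not choose nrounds by hand, xgb.train can also stop training automatically once the evaluation metric stops improving, via the early_stopping_rounds argument. The following is a minimal sketch; the values 50 and 3 are illustrative, not taken from the run above.
# stop when the metric on the last watchlist dataset has not improved for 3 consecutive rounds
bst_early <- xgb.train(data=dtrain, max_depth=2, eta=1, nthread = 2,
                       nrounds=50, watchlist=watchlist, eval_metric = "error",
                       objective = "binary:logistic",
                       early_stopping_rounds = 3)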
XGBoost implements a second algorithm, based on linear boosting. The only differences from the previous command are the booster = "gblinear" parameter and the removal of the tree-specific parameters eta and max_depth.
bst_linear <- xgb.train(data=dtrain, booster = "gblinear", nthread = 2, nrounds=10,
                        watchlist=watchlist, eval_metric = "error",
                        eval_metric = "logloss", objective = "binary:logistic")
## [1] train-error:0.021495 train-logloss:0.182509 test-error:0.014898 test-logloss:0.183625
## [2] train-error:0.003992 train-logloss:0.068550 test-error:0.006207 test-logloss:0.069772
## [3] train-error:0.000768 train-logloss:0.027487 test-error:0.000000 test-logloss:0.027891
## [4] train-error:0.000614 train-logloss:0.011004 test-error:0.000000 test-logloss:0.010784
## [5] train-error:0.000461 train-logloss:0.004726 test-error:0.000000 test-logloss:0.004550
## [6] train-error:0.000000 train-logloss:0.002013 test-error:0.000000 test-logloss:0.001956
## [7] train-error:0.000000 train-logloss:0.000879 test-error:0.000000 test-logloss:0.000835
## [8] train-error:0.000000 train-logloss:0.000374 test-error:0.000000 test-logloss:0.000337
## [9] train-error:0.000000 train-logloss:0.000171 test-error:0.000000 test-logloss:0.000141
## [10] train-error:0.000000 train-logloss:0.000079 test-error:0.000000 test-logloss:0.000061
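The watchlist already reports the test error at each round, but you can also compute it yourself from the predictions. A short sketch follows; the 0.5 threshold is the usual cut-off for binary:logistic output, not something fixed by the commands above.
# compare the two boosters on the test set
pred_tree   <- predict(bst_tree, dtest)
pred_linear <- predict(bst_linear, dtest)
labels <- getinfo(dtest, "label")
mean(as.numeric(pred_tree > 0.5) != labels)    # hand-computed test error, tree booster
mean(as.numeric(pred_linear > 0.5) != labels)  # hand-computed test error, linear booster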
Feature importance is similar to the R gbm package's relative influence (rel.inf).
importance_matrix <- xgb.importance(model = bst_tree)
print(importance_matrix)
##     Feature        Gain        Cover  Frequency
##  1:      28 0.550461506 0.3350138692 0.15384615
##  2:      55 0.137207123 0.1199509779 0.03846154
##  3:      59 0.098627282 0.1023510750 0.03846154
##  4:     101 0.045169588 0.0650510436 0.07692308
##  5:     108 0.038261491 0.1248887414 0.11538462
##  6:     110 0.030585717 0.0306225959 0.03846154
##  7:      66 0.026812969 0.0286806745 0.03846154
##  8:      26 0.019262425 0.0530136354 0.07692308
##  9:      38 0.019015284 0.0458661400 0.07692308
## 10:      23 0.011682573 0.0290161610 0.07692308
## 11:      35 0.009340369 0.0256254732 0.07692308
## 12:      22 0.007614195 0.0290552797 0.07692308
## 13:      60 0.002291844 0.0003729885 0.03846154
## 14:     111 0.001903777 0.0045319153 0.03846154
## 15:     114 0.001763858 0.0059594293 0.03846154
xgb.plot.importance(importance_matrix = importance_matrix)
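In the table above the Feature column contains numeric indices rather than column names. If that happens with your model, you can map the indices back to the column names of the training matrix. The sketch below assumes the indices are zero-based, which is how XGBoost numbers features internally.
# map zero-based feature indices to the column names of the sparse training matrix
importance_matrix$Name <- colnames(train$data)[as.integer(importance_matrix$Feature) + 1]
head(importance_matrix)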
You can plot the trees from your model using xgb.plot.tree.
xgb.plot.tree(model = bst_tree)
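If the rendered plot is hard to read for a large model, you can also dump the trees as plain text with xgb.dump; setting with_stats = TRUE adds the gain and cover statistics to each split.
# text dump of the boosted trees, including split statistics
xgb.dump(bst_tree, with_stats = TRUE)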