We will load the agaricus train and test datasets. Each is a list with two entries: data, a sparse matrix in dgCMatrix format, and label, the vector of outcomes.
# load train and test data
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
data.frame(train = dim(train$data), test = dim(test$data), row.names = c("row", "col"))
##     train test
## row  6513 1611
## col   126  126
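You can check this structure directly; data is a sparse matrix of class dgCMatrix and label is a numeric vector with one entry per row of data.
# check the structure of the loaded objects
class(train$data)
class(train$label)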
# dataset preparation for xgb.train
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)
One of the special features of xgb.train is the ability to follow the progress of learning after each round. Because of the way boosting works, there is a point at which adding more rounds leads to overfitting. You can think of this feature as a cousin of cross-validation. The following techniques will help you avoid overfitting and shorten training time by stopping the learning as early as possible.
One way to measure progress during learning is to provide XGBoost with a second, already-labelled dataset: the model learns on the first dataset and is evaluated on the second one. The chosen metrics are computed on both datasets after each round.
watchlist <- list(train=dtrain, test=dtest)
bst_tree <- xgb.train(data=dtrain, max_depth=2, eta=1, nthread = 2,
                      nrounds=10, watchlist=watchlist, eval_metric = "error",
                      eval_metric = "logloss", objective = "binary:logistic")
## [1] train-error:0.046522 train-logloss:0.233376 test-error:0.042831 test-logloss:0.226686
## [2] train-error:0.022263 train-logloss:0.136658 test-error:0.021726 test-logloss:0.137874
## [3] train-error:0.007063 train-logloss:0.082531 test-error:0.006207 test-logloss:0.080461
## [4] train-error:0.015200 train-logloss:0.056474 test-error:0.018001 test-logloss:0.058329
## [5] train-error:0.007063 train-logloss:0.041513 test-error:0.006207 test-logloss:0.038287
## [6] train-error:0.001228 train-logloss:0.029606 test-error:0.000000 test-logloss:0.026631
## [7] train-error:0.001228 train-logloss:0.019191 test-error:0.000000 test-logloss:0.013875
## [8] train-error:0.001228 train-logloss:0.013320 test-error:0.000000 test-logloss:0.010198
## [9] train-error:0.001228 train-logloss:0.011130 test-error:0.000000 test-logloss:0.008483
## [10] train-error:0.000000 train-logloss:0.006634 test-error:0.000000 test-logloss:0.006920
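If you would rather not choose nrounds by hand, xgb.train can also stop training automatically once the evaluation metric stops improving, via the early_stopping_rounds argument. The following is a minimal sketch; the values 50 and 3 are illustrative, not taken from the run above.
# stop when the metric on the last watchlist dataset has not improved for 3 consecutive rounds
bst_early <- xgb.train(data=dtrain, max_depth=2, eta=1, nthread = 2,
                       nrounds=50, watchlist=watchlist, eval_metric = "error",
                       objective = "binary:logistic",
                       early_stopping_rounds = 3)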
XGBoost implements a second algorithm, based on linear boosting. The only differences from the previous command are the booster = "gblinear" parameter and the removal of the tree-specific parameters eta and max_depth.
bst_linear <- xgb.train(data=dtrain, booster = "gblinear", nthread = 2, nrounds=10,
                        watchlist=watchlist, eval_metric = "error",
                        eval_metric = "logloss", objective = "binary:logistic")
## [1] train-error:0.021495 train-logloss:0.182509 test-error:0.014898 test-logloss:0.183625
## [2] train-error:0.003992 train-logloss:0.068550 test-error:0.006207 test-logloss:0.069772
## [3] train-error:0.000768 train-logloss:0.027487 test-error:0.000000 test-logloss:0.027891
## [4] train-error:0.000614 train-logloss:0.011004 test-error:0.000000 test-logloss:0.010784
## [5] train-error:0.000461 train-logloss:0.004726 test-error:0.000000 test-logloss:0.004550
## [6] train-error:0.000000 train-logloss:0.002013 test-error:0.000000 test-logloss:0.001956
## [7] train-error:0.000000 train-logloss:0.000879 test-error:0.000000 test-logloss:0.000835
## [8] train-error:0.000000 train-logloss:0.000374 test-error:0.000000 test-logloss:0.000337
## [9] train-error:0.000000 train-logloss:0.000171 test-error:0.000000 test-logloss:0.000141
## [10] train-error:0.000000 train-logloss:0.000079 test-error:0.000000 test-logloss:0.000061
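The watchlist already reports the test error at each round, but you can also compute it yourself from the predictions. A short sketch follows; the 0.5 threshold is the usual cut-off for binary:logistic output, not something fixed by the commands above.
# compare the two boosters on the test set
pred_tree   <- predict(bst_tree, dtest)
pred_linear <- predict(bst_linear, dtest)
labels <- getinfo(dtest, "label")
mean(as.numeric(pred_tree > 0.5) != labels)    # hand-computed test error, tree booster
mean(as.numeric(pred_linear > 0.5) != labels)  # hand-computed test error, linear booster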
Feature importance is similar to the R gbm package's relative influence (rel.inf).
importance_matrix <- xgb.importance(model = bst_tree)
print(importance_matrix)
##     Feature        Gain        Cover  Frequency
##  1:      28 0.550461506 0.3350138692 0.15384615
##  2:      55 0.137207123 0.1199509779 0.03846154
##  3:      59 0.098627282 0.1023510750 0.03846154
##  4:     101 0.045169588 0.0650510436 0.07692308
##  5:     108 0.038261491 0.1248887414 0.11538462
##  6:     110 0.030585717 0.0306225959 0.03846154
##  7:      66 0.026812969 0.0286806745 0.03846154
##  8:      26 0.019262425 0.0530136354 0.07692308
##  9:      38 0.019015284 0.0458661400 0.07692308
## 10:      23 0.011682573 0.0290161610 0.07692308
## 11:      35 0.009340369 0.0256254732 0.07692308
## 12:      22 0.007614195 0.0290552797 0.07692308
## 13:      60 0.002291844 0.0003729885 0.03846154
## 14:     111 0.001903777 0.0045319153 0.03846154
## 15:     114 0.001763858 0.0059594293 0.03846154
xgb.plot.importance(importance_matrix = importance_matrix)
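In the table above the Feature column contains numeric indices rather than column names. If that happens with your model, you can map the indices back to the column names of the training matrix. The sketch below assumes the indices are zero-based, which is how XGBoost numbers features internally.
# map zero-based feature indices to the column names of the sparse training matrix
importance_matrix$Name <- colnames(train$data)[as.integer(importance_matrix$Feature) + 1]
head(importance_matrix)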
You can plot the trees from your model using xgb.plot.tree.
xgb.plot.tree(model = bst_tree)
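If the rendered plot is hard to read for a large model, you can also dump the trees as plain text with xgb.dump; setting with_stats = TRUE adds the gain and cover statistics to each split.
# text dump of the boosted trees, including split statistics
xgb.dump(bst_tree, with_stats = TRUE)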