The purpose of this tutorial is to show the inner workings of the xgboost package, linking its functions and parameters to the paper XGBoost: A Scalable Tree Boosting System (https://arxiv.org/abs/1603.02754).
Xgboost can be used for classification or regression. First I will show how to use it for regression.
Xgboost was built and optimized to be used on extremely large datasets, but to keep the examples simple I will use small datasets as inputs.
library(xgboost)
library(Matrix) # to create the input data for xgboost
library(ggplot2) #for plotting
library(tidypredict) #to extract elements of the model
library(tidymodels)
library(DiagrammeR) # required by xgb.plot.tree to render the trees
data <- data.frame(y = c(-10, 6, 7, -7), x = c(10, 20, 25, 35))
Visualizing our simple dataset
ggplot(data = data, aes(x = x, y= y)) + geom_point()
data_matrix <- sparse.model.matrix(data = data, y ~ . - 1)
model <- xgboost(data = data_matrix, label = data$y, nrounds = 1,
params = list(lambda = 0, max_depth = 1, gamma = 0, eta = 0.3))
## [1] train-rmse:6.695708
As we can see, after one iteration the model's predictions produce a training root mean squared error (RMSE) of 6.69. That is already an improvement over predicting the mean of y for every observation:
sqrt(mean((data$y - mean(data$y))^2))
## [1] 7.582875
We can generate the predicted values and the RMSE of the model with the code below.
data$predictions <- predict(model, newdata = data_matrix)
sqrt(mean((data$predictions - data$y)^2))
## [1] 6.695708
Let’s see the prediction of each observation:
data
## y x predictions
## 1 -10 10 -2.65
## 2 6 20 0.95
## 3 7 25 0.95
## 4 -7 35 0.95
Now I will show how the model arrived at these predictions. To start, the model predicts 0.5 (xgboost's default base_score) for ALL observations and computes the residuals.
data$residuals_model0 <- data$y - 0.5
data
## y x predictions residuals_model0
## 1 -10 10 -2.65 -10.5
## 2 6 20 0.95 5.5
## 3 7 25 0.95 6.5
## 4 -7 35 0.95 -7.5
The residuals at the beginning (model0) are: -10.5, 5.5, 6.5, and -7.5. With these residuals the model builds a decision tree, shown below:
xgb.plot.tree(model=model)
The tree only has 1 level of depth because I specified max_depth = 1; typically trees are deeper than that, but I kept it this way to make the explanation simple. Now let's see the values presented in the visualization.
First the algorithm calculates the similarity score for each node.
\[SimilarityScore = \frac{(\sum_{i} observed_{i} - predicted_{i})^2}{N + \lambda} = \frac{(\sum_{i} residuals_{i})^2}{N + \lambda}\] Where \(N\) is the number of observations and \(\lambda\) is a regularization parameter that we will discuss in more detail later; in this first example we have set lambda = 0.
In this tree we have 3 nodes: the root and its 2 children. The similarity score of the root node is:
(sim_score_root = sum(data$residuals_model0)^2/length(data$residuals_model0))
## [1] 9
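To make this formula easy to reuse later, we can wrap it in a small helper function. This is just a sketch for the tutorial: the name similarity_score is mine, not a function from the xgboost package.
# Similarity score of a node, given the residuals it contains (squared-error objective)
similarity_score <- function(residuals, lambda = 0) {
  sum(residuals)^2 / (length(residuals) + lambda)
}
similarity_score(data$residuals_model0) # root node, should also return 9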
The tree was split using x < 15. The first observation went to the node above and the others went to the node below.
data[,c('x','y','residuals_model0')]
## x y residuals_model0
## 1 10 -10 -10.5
## 2 20 6 5.5
## 3 25 7 6.5
## 4 35 -7 -7.5
So the similarity score of each child node is equal to:
print("Similarity score of the node above")
## [1] "Similarity score of the node above"
(sim_score_node_above = sum(data$residuals_model0[data$x < 15])^2/length(data$residuals_model0[data$x < 15]))
## [1] 110.25
print("Similarity score of the node below")
## [1] "Similarity score of the node below"
(sim_score_node_below = sum(data$residuals_model0[data$x >= 15])^2/length(data$residuals_model0[data$x >= 15]))
## [1] 6.75
With these we can calculate the gain of the split with the following formula.
\[Gain = SimilarityScore_{left} + SimilarityScore_{right} - SimilarityScore_{root}\]
In this case the gain is equal to 108.
(split_gain = sim_score_node_above + sim_score_node_below - sim_score_root)
## [1] 108
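Using the similarity_score helper sketched above, the gain of a split can be wrapped up as well (again just a sketch, with names of my own choosing):
# Gain of splitting a parent node into a left and a right child
split_gain_fn <- function(res_left, res_right, lambda = 0) {
  similarity_score(res_left, lambda) +
    similarity_score(res_right, lambda) -
    similarity_score(c(res_left, res_right), lambda)
}
split_gain_fn(data$residuals_model0[data$x < 15],
              data$residuals_model0[data$x >= 15]) # should also return 108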
The value of a node is similar to the similarity score, but we do not square the sum of the residuals, and the result is multiplied by \(\eta\), the learning rate:
\[NodeValue = \frac{(\sum_{i} observed_{i} - predicted_{i})}{N + \lambda} * \eta = \frac{(\sum_{i} residuals_{i})}{N + \lambda} * \eta\]
This is the value that will go to the prediction of the model.
eta <- 0.3
print("Value node above")
## [1] "Value node above"
(value_node_above = sum(data$residuals_model0[data$x < 15])/length(data$residuals_model0[data$x < 15])) * eta
## [1] -3.15
print("Value node below")
## [1] "Value node below"
(value_node_below = sum(data$residuals_model0[data$x >= 15])/length(data$residuals_model0[data$x >= 15])) * eta
## [1] 0.45
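The same calculation can be wrapped in a helper (a sketch; node_value is not an xgboost function):
# Node value: sum of residuals shrunk by lambda, scaled by the learning rate eta
node_value <- function(residuals, lambda = 0, eta = 0.3) {
  eta * sum(residuals) / (length(residuals) + lambda)
}
node_value(data$residuals_model0[data$x < 15])  # node above: -3.15
node_value(data$residuals_model0[data$x >= 15]) # node below: 0.45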
With this information we get to the final predictions. All the observations in the node above have the prediction value of:
print("Prediction node above")
## [1] "Prediction node above"
0.5 + eta * value_node_above
## [1] -2.65
print("Prediction node below")
## [1] "Prediction node below"
0.5 + eta * value_node_below
## [1] 0.95
Which is equal to the values returned by predict(model):
data
## y x predictions residuals_model0
## 1 -10 10 -2.65 -10.5
## 2 6 20 0.95 5.5
## 3 7 25 0.95 6.5
## 4 -7 35 0.95 -7.5
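To confirm the arithmetic end to end, the sketch below recomputes every prediction in one vectorized step and compares it with the output of predict(); the tolerance accounts for xgboost storing leaf values in single precision.
manual_pred <- 0.5 + eta * ifelse(data$x < 15,
                                  mean(data$residuals_model0[data$x < 15]),
                                  mean(data$residuals_model0[data$x >= 15]))
all.equal(manual_pred, as.numeric(predict(model, newdata = data_matrix)),
          tolerance = 1e-6) # should be TRUE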
We can also ask xgboost for the contribution of each feature to every prediction; the BIAS column is a baseline term, and each row sums to the corresponding prediction.
pred_contrib <- predict(model, predcontrib = TRUE, newdata = data_matrix)
pred_contrib
## x BIAS
## [1,] -2.7 0.04999998
## [2,] 0.9 0.04999998
## [3,] 0.9 0.04999998
## [4,] 0.9 0.04999998
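A quick sketch to verify that the contributions add up to the model predictions:
all.equal(as.numeric(rowSums(pred_contrib)),
          as.numeric(predict(model, newdata = data_matrix)),
          tolerance = 1e-6) # should be TRUE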
The tidypredict package lets us extract the fitted tree as a SQL expression, which makes the leaf values and the base score of 0.5 explicit.
tidypredict_sql(model, con = dbplyr::simulate_dbi())
## <SQL> 0.0 + CASE
## WHEN ((`x` < 15.0 OR ((`x`) IS NULL))) THEN (-3.1500001)
## WHEN (`x` >= 15.0) THEN (0.450000018)
## END + 0.5
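Translating that CASE expression into plain R (a sketch that hard-codes the constants shown in the SQL output) reproduces the same predictions:
sql_style_pred <- 0.0 + ifelse(data$x < 15 | is.na(data$x), -3.1500001, 0.450000018) + 0.5
sql_style_pred # -2.65 0.95 0.95 0.95 (up to floating point)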
gamma
If we relax the parameter max_depth we get a deeper tree.
data <- data.frame(y = c(-10, 6, 7, -7), x = c(10, 20, 25, 35))
data_matrix <- sparse.model.matrix(data = data, y ~ . - 1)
model <- xgboost(data = data_matrix, label = data$y, nrounds = 1,
params = list(lambda = 0, max_depth = 3, gamma = 0, eta = 0.3))
## [1] train-rmse:5.410869
xgb.plot.tree(model=model)
Note that we now have a tree with 3 levels, and the gains of the other splits are 121.5 and 0.5. One way to automatically prune a tree is to use the parameter gamma: if the difference between a split's gain and gamma is negative, the branch is pruned.
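Using the gains reported so far (108 for the first split, then 121.5 and 0.5 for the deeper splits), a small sketch of that rule with gamma = 1 shows that only the weakest split gets pruned; the variable names here are mine, for illustration only.
gains <- c(split_1 = 108, split_2 = 121.5, split_3 = 0.5)
gamma_value <- 1
gains - gamma_value        # only the last difference is negative
(gains - gamma_value) < 0  # TRUE means the branch is pruned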
data <- data.frame(y = c(-10, 6, 7, -7), x = c(10, 20, 25, 35))
data_matrix <- sparse.model.matrix(data = data, y ~ . - 1)
model <- xgboost(data = data_matrix, label = data$y, nrounds = 1,
params = list(lambda = 0, max_depth = 3, gamma = 1, eta = 0.3))
## [1] train-rmse:5.416756
xgb.plot.tree(model=model)
And if we use a gamma of 125 we end up with no tree at all: there isn't a split with a gain greater than the value chosen for gamma.
data <- data.frame(y = c(-10, 6, 7, -7), x = c(10, 20, 25, 35))
data_matrix <- sparse.model.matrix(data = data, y ~ . - 1)
model <- xgboost(data = data_matrix, label = data$y, nrounds = 1,
params = list(lambda = 0, max_depth = 3, gamma = 125, eta = 0.3))
## [1] train-rmse:7.655227
try(xgb.plot.tree(model=model))
## Error in xgb.model.dt.tree(feature_names = feature_names, model = model, :
## Non-tree model detected! This function can only be used with tree models.
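Since the largest gain in this tree is 121.5, a gamma of 125 prunes every split and the booster is reduced to a single leaf. A quick sketch (my own check, not part of the original walkthrough) shows that every observation then gets the same prediction, which should work out to roughly 0.5 + eta * mean(data$y - 0.5) = 0.05, consistent with the reported train-rmse.
preds_no_tree <- predict(model, newdata = data_matrix)
length(unique(preds_no_tree)) # should be 1: every row gets the same value
preds_no_tree[1]              # roughly 0.05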
lambda
Another parameter that helps with regularization is lambda. Remember that this parameter appears in both the similarity score and the value of a node. If lambda is set to 0, the node value is simply the average of the residuals in the node, and its contribution to the prediction is just that average times the learning rate eta. \[NodeValue = \frac{(\sum_{i} observed_{i} - predicted_{i})}{N + \lambda} * \eta = \frac{(\sum_{i} residuals_{i})}{N + \lambda} * \eta\]
When we increase the value of lambda, both the similarity scores and the node values decrease; the decrease is proportionally larger for nodes with fewer observations.
The implication is that the gains are lower, so the tree becomes more susceptible to pruning by the gamma parameter. Note the tree below, generated with a lambda of 1.
data <- data.frame(y = c(-10, 6, 7, -7), x = c(10, 20, 25, 35))
data_matrix <- sparse.model.matrix(data = data, y ~ . - 1)
model <- xgboost(data = data_matrix, label = data$y, nrounds = 1,
params = list(lambda = 1, max_depth = 1, gamma = 0, eta = 0.3))
## [1] train-rmse:7.171294
data$residuals_model0 <- data$y - 0.5
xgb.plot.tree(model=model)
The gain from the split dropped from 108 to 52.99, as we can see from the calculations below and the figure above.
print("using lambda equal 1")
## [1] "using lambda equal 1"
lambda <- 1
print("Similarity score from root node")
## [1] "Similarity score from root node"
(sim_score_root = sum(data$residuals_model0)^2/(length(data$residuals_model0) + lambda))
## [1] 7.2
print("Similarity score of the node above")
## [1] "Similarity score of the node above"
(sim_score_node_above = sum(data$residuals_model0[data$x < 15])^2/(length(data$residuals_model0[data$x < 15]) + lambda))
## [1] 55.125
print("Similarity score of the node below")
## [1] "Similarity score of the node below"
(sim_score_node_below = sum(data$residuals_model0[data$x >= 15])^2/(length(data$residuals_model0[data$x >= 15]) + lambda))
## [1] 5.0625
print("Split gain")
## [1] "Split gain"
(split_gain = sim_score_node_above + sim_score_node_below - sim_score_root)
## [1] 52.9875
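Reusing the similarity_score helper sketched earlier, we can watch the gain of this split shrink as lambda grows; the first two values match the 108 and 52.99 computed above, the rest are shown only approximately.
sapply(c(0, 1, 5, 10), function(l) {
  similarity_score(data$residuals_model0[data$x < 15], l) +
    similarity_score(data$residuals_model0[data$x >= 15], l) -
    similarity_score(data$residuals_model0, l)
}) # approximately 108.00, 52.99, 16.91, 9.01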
Besides that, when lambda is larger the node value is also smaller.
\[NodeValue = \frac{(\sum_{i} observed_{i} - predicted_{i})}{N + \lambda} * \eta = \frac{(\sum_{i} residuals_{i})}{N + \lambda} * \eta\]
This is the value that will go to the prediction of the model.
print("Value node above")
## [1] "Value node above"
(value_node_above = sum(data$residuals_model0[data$x < 15])/(length(data$residuals_model0[data$x < 15]) + lambda)) * eta
## [1] -1.575
print("Value node below")
## [1] "Value node below"
(value_node_below = sum(data$residuals_model0[data$x >= 15])/(length(data$residuals_model0[data$x >= 15]) + lambda)) * eta
## [1] 0.3375
This implies that the updates to the predicted values will be smaller in each iteration.
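As a closing sketch, we can refit the two single-round, max_depth = 1 models from above and compare how far each one moves the predictions away from the initial 0.5; the lambda = 1 model takes the smaller steps.
model_l0 <- xgboost(data = data_matrix, label = data$y, nrounds = 1, verbose = 0,
                    params = list(lambda = 0, max_depth = 1, gamma = 0, eta = 0.3))
model_l1 <- xgboost(data = data_matrix, label = data$y, nrounds = 1, verbose = 0,
                    params = list(lambda = 1, max_depth = 1, gamma = 0, eta = 0.3))
abs(predict(model_l0, newdata = data_matrix) - 0.5) # roughly 3.15 0.45 0.45 0.45
abs(predict(model_l1, newdata = data_matrix) - 0.5) # roughly 1.575 0.3375 0.3375 0.3375 (smaller updates)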