Introduction

XGBoost stands for Extreme Gradient Boosting and can be seen as an improved version of gradient boosting, designed for speed and performance. Like gradient boosted decision trees, XGBoost fits a tree to the residuals and updates the previous prediction, scaling the latest tree's output by a learning rate called eta. However, XGBoost does not grow trees in the same way as gradient boosting.

In short, XGBoost starts with an initial prediction of 0.5 for every observation, regardless of the problem, and the first residuals are calculated from there. Every time a tree is grown and a split is evaluated, the algorithm calculates, for each leaf, the similarity score (also called structure score).

Similarity score = (sum of residuals)^2/(number of residuals + lambda)

Lambda is a regularization parameter intended to reduce the sensitivity to individual observations. The higher the lambda, the lower the similarity score. It is also possible to see that the effect of lambda is stronger when there are few observations in a leaf or branch.

In order to evaluate a split we calculate the gain, which is the similarity score of the left leaf plus the similarity score of the right leaf minus the similarity score of the parent node above them.

The algorithm calculates the gain for all the thresholds of all the variables. The variable and threshold yielding the highest gain are chosen. This process is iterated until we hit one of the constraints (minimum number of samples in a leaf, minimum loss reduction required, maximum depth of the tree, and so on).
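
As a minimal sketch (assuming squared-error loss and a hypothetical vector of residuals), the similarity score and the gain of a single candidate split can be computed in R as follows:

similarity <- function(res, lambda = 1) sum(res)^2 / (length(res) + lambda)  # structure score of a leaf

gain <- function(res_left, res_right, lambda = 1) {
  # gain of a candidate split: left leaf + right leaf - parent node
  similarity(res_left, lambda) + similarity(res_right, lambda) -
    similarity(c(res_left, res_right), lambda)
}

res <- c(-0.3, 0.2, 0.5, -0.4)   # hypothetical residuals after the initial prediction of 0.5
gain(res[1:2], res[3:4], lambda = 1)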

Gradient boosting, instead, would grow a tree by splitting the data at the variable and value yielding the lowest sum of squared residuals.

The tree just grown is then pruned by comparing the gain with the parameter gamma. If gain minus gamma is lower than 0, we remove the branch. The higher the value of gamma, the more pruning takes place and the shallower the tree. Note also that a high value of lambda makes it easier to prune a tree, since the gain decreases as lambda increases.
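
Reusing the helpers from the sketch above, the pruning rule is a simple comparison of the gain with gamma (gamma = 1 is an arbitrary illustrative value):

gamma <- 1
g <- gain(res[1:2], res[3:4], lambda = 1)
keep_branch <- (g - gamma) >= 0   # FALSE means the branch is pruned
keep_branch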

At the end of this procedure we have our first tree fitted on the residuals. We can now calculate the output value for each leaf, which will help us update the prediction.

Output value = (sum of the residuals)/(number of residuals + lambda)

Again, we can see that lambda reduces the influence of a single observation. If lambda were equal to 0, the output value would simply be the average of the residuals falling into that leaf, just as a “regular” regression tree outputs the leaf average.

The prediction is then updated by scaling the new predicted residual (the output of the tree) by a learning rate eta, with 0 < eta < 1.

New Prediction = Previous Prediction + (eta * Output value)
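
Continuing the sketch with hypothetical residuals, the output value of a leaf and the updated prediction look like this (eta = 0.3 is XGBoost's default learning rate):

output_value <- function(res, lambda = 1) sum(res) / (length(res) + lambda)

leaf_res <- c(0.2, 0.5)                         # hypothetical residuals falling into one leaf
eta <- 0.3
new_pred <- 0.5 + eta * output_value(leaf_res)  # previous prediction + eta * output value
new_pred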

Math details

The algorithm builds trees by minimizing the following objective function:

\[ Obj (\Theta)= L(\Theta) + \Omega(\Theta)\] where \(L(\Theta)\) is the loss function and \(\Omega(\Theta)\) is the regularization term.

The two terms have different meanings and goals:

Optimizing the loss function encourages predictive models:

▪ Fitting the training data well at least gets you close to the training data, which is hopefully close to the underlying distribution.

Optimizing the regularization term encourages simple models:

▪ Simpler models tend to have smaller variance in future predictions, making the predictions more stable.

The objective function can be rewritten as follows:

\[Obj = \sum{L(y_i,p_i)} + \gamma T + \frac{1}{2}\lambda O_{val}^2\]

When the regularization parameter is set to zero, the objective falls back to the traditional gradient tree boosting.

For simplicity’s sake we will omit the gamma term which is another “tool” to prune the tree, not to grow it. Pruning takes place after a full tree is grown.

Consider that at any step, the latest prediction has the form \(p_i^{(t-1)} + f_t(x_i)\), where \(f_t(x_i) \propto O_{val}\).

The goal is to find an output value \(O_{val}\) that minimizes the objective: \[Obj = \sum{L(y_i,p_i^{(t-1)} + O_{val})} +\frac{1}{2}\lambda O_{val}^2\] As mentioned before, the output value of the tree is what we get by running the input \(x_i\) through the tree: it is basically the prediction we get from a tree built with the XGBoost algorithm.

Regardless of the type of problem, XGBoost replaces the loss function \(L\) with its second-order Taylor approximation:

\[L(y_i,p_i^{(t-1)} + O_{val}) \approx L(y_i, p_i) + \frac{\partial{L(y_i, p_i)}}{\partial{p_i}}O_{val} + \frac{1}{2}\frac{\partial^2{L(y_i, p_i)}}{\partial{p_i}^2}O^2_{val}\] The coefficients of the second and third terms of the expansion are, respectively, the gradient and the Hessian.

\(g_i = \frac{\partial{L(y_i, p_i)}}{\partial{p_i}}\)

\(h_i = \frac{\partial^2{L(y_i, p_i)}}{\partial{p_i}^2}\)

The function can be rewritten:

\[Obj = \sum[L(y_i,p_i) + g_iO_{val} + \frac{1}{2}h_iO^2_{val}] +\frac{1}{2}\lambda O_{val}^2\]

In order to find the value minimizing the function above, we take the first derivative with respect to \(O_{val}\) and set it to 0.

\[\sum[g_i + h_iO_{val}] + \lambda O_{val} = 0\]

\[\sum g_i + \sum h_i O_{val} + \lambda O_{val} = 0\]

\[O_{val} = \frac{-\sum g_i}{\sum h_i + \lambda}\] If we plug in the values of \(g_i\) and \(h_i\) for the loss function \(L(y_i, p_i) = \frac{1}{2}(y_i-p_i)^2\), we get:

\(g_i = -(y_i - p_i)\) and \(h_i=1\) and the optimal value for \(O_{val}\) becomes:

\[O_{val} = \frac{\sum (y_i - p_i)}{\sum 1 + \lambda}\] This is the value of \(O_{val}\) that minimizes the objective function (loss function plus regularization term). The numerator is the sum of the residuals, while the denominator is the number of residuals in a leaf plus lambda.
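
As a quick sanity check (a sketch using hypothetical residuals), we can minimize the squared-error objective numerically with optimize() and compare the result with the closed-form value:

res <- c(-0.3, 0.2, 0.5)   # hypothetical residuals y_i - p_i
lambda <- 1

obj <- function(o) sum(0.5 * (res - o)^2) + 0.5 * lambda * o^2   # loss + L2 penalty

optimize(obj, interval = c(-10, 10))$minimum   # numerical minimizer
sum(res) / (length(res) + lambda)              # closed-form O_val, should match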

Now we need to consider the structure score, and for that we go back to the original objective function.

\[Obj = \sum[L(y_i,p_i) + g_iO_{val} + \frac{1}{2}h_iO^2_{val}] +\frac{1}{2}\lambda O_{val}^2\] We remove the constant term \(L(y_i,p_i)\) for convenience.

\[\text{New goal: } -\left(\sum[g_iO_{val} + \frac{1}{2}h_iO^2_{val}] +\frac{1}{2}\lambda O_{val}^2\right)\] The sign is flipped so that a better tree structure corresponds to a larger value. Now we could plug \(O_{val}\) into the new goal expression and calculate the structure score, which is a way to measure the quality of a tree structure \(q\). However, it is more convenient to first rewrite the gradient and the Hessian (\(j\) refers to a specific leaf) in a more compact way:

\(G_j = \sum_{i\in I_j} g_i\)

\(H_j = \sum_{i\in I_j} h_i\)

Now we can write the optimal value \(O_{val}\) in the following way:

\(O_{val} = -\frac{G_j}{H_j + \lambda}\)

Consequently, the expression becomes:

\[ \frac{1}{2}\frac{G_j^2}{H_j + \lambda}\] For the whole tree we write:

\[ \sum_{j=1}^T\frac{1}{2}\frac{G_j^2}{H_j + \lambda}\] However, as mentioned in the previous part, the algorithm usually uses the structure score to evaluate a split. Namely (the 1/2 is omitted since it is a common scaling factor):

\[Split : \frac{G_{left}^2}{H_{left} + \lambda} + \frac{G_{right}^2}{H_{right} + \lambda} - \frac{G_{p.node}^2}{H_{p.node} + \lambda} - \gamma\] Where p.node refers to the parent node of the split under evaluation.
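
The same split evaluation can be sketched in R with hypothetical values for the summed gradients and Hessians:

split_gain <- function(G_l, H_l, G_r, H_r, lambda = 1, gamma = 0) {
  # gain of splitting a parent node into a left and a right leaf
  G_l^2 / (H_l + lambda) + G_r^2 / (H_r + lambda) -
    (G_l + G_r)^2 / (H_l + H_r + lambda) - gamma
}

split_gain(G_l = -0.7, H_l = 2, G_r = 0.9, H_r = 3, lambda = 1, gamma = 0.1)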

If we substitute \(G\) and \(H\) with their values we get:

\[\frac{[\sum - (y_i - p_i)]^2}{n.res + \lambda}_{[left]} + \frac{[\sum - (y_i - p_i)]^2}{n.res + \lambda}_{[right]} - \frac{[\sum - (y_i - p_i)]^2}{n.res + \lambda}_{[p.node]} - \gamma\] Where \(n.res\) is the number of residuals in the leaf.

Once we have grown and pruned a tree, we run our observations through it. The output of the tree is then used to update the prediction in the following way:

\[p_i^{(t)}= p_i^{(t-1)} + \eta\, O_{val_i}\]

Extras

AN EXAMPLE OF HOW XGBOOST HANDLES LARGE DATA SETS

In the last two sections we went over how XGBoost grows and fits trees. XGBoost was designed to deal with large data sets and is equipped with optimization tools that cut processing time.

For example, the algorithm is greedy in the sense that, when evaluating a split, it simply chooses the one that yields the highest gain at that point. It does not consider the effect of the split in relation to subsequent ones.

However, the algorithm does not check every possible split for every variable (e.g. 100 observations would mean 99 candidate thresholds for a single variable). Instead, XGBoost divides the data into quantiles and considers these as candidate thresholds for splitting a variable.
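
As a rough illustration of the idea (not XGBoost's actual weighted quantile sketch), candidate thresholds based on quantiles can be obtained like this:

x <- rnorm(100)                                       # hypothetical numeric feature
candidate_thresholds <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))
length(candidate_thresholds)                          # 9 candidates instead of 99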

Carseats data set

First we want to transform our data set into a numeric matrix, since XGBoost handles only numeric vectors. This means that factors need to be encoded as binary values (one-hot encoding).
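
The code chunks in this section assume the following packages are loaded (Carseats is provided by the ISLR package):

library(ISLR)      # Carseats data set
library(Matrix)    # sparse.model.matrix()
library(xgboost)   # xgb.DMatrix() and the XGBoost algorithm
library(caret)     # train() and trainControl() for parameter tuning
library(ggplot2)   # plotting the tuning results
library(magrittr)  # the %>% pipe used below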

Read the data and check the structure

data(Carseats)
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
head(Carseats)
##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes
names(Carseats)
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"

Most of the variables are already numeric; only 3 variables are of class factor. The number of levels is also manageable, so there is no need for further cleaning or level aggregation.

We transform the data frame into a numeric matrix with the command sparse.model.matrix

s.mat<-sparse.model.matrix(Sales~.,data=Carseats)[,-1] # I want a matrix with only numeric values, as XGBoost handles only numeric vectors. The command expands factors into a set of dummy variables.

colnames(s.mat) # new column names after expanding the factors into dummy variables (0,1)
##  [1] "CompPrice"       "Income"          "Advertising"     "Population"     
##  [5] "Price"           "ShelveLocGood"   "ShelveLocMedium" "Age"            
##  [9] "Education"       "UrbanYes"        "USYes"
head(s.mat)
## 6 x 11 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 11 column names 'CompPrice', 'Income', 'Advertising' ... ]]
##                                   
## 1 138  73 11 276 120 . . 42 17 1 1
## 2 111  48 16 260  83 1 . 65 10 1 1
## 3 113  35 10 269  80 . 1 59 12 1 1
## 4 117 100  4 466  97 . 1 55 14 1 1
## 5 141  64  3 340 128 . . 38 13 1 .
## 6 124 113 13 501  72 . . 78 16 . 1

Now I split the data set into a train and a test set. After that I convert the sparse matrices into DMatrix objects, XGBoost's built-in data type.

set.seed(1)


train<-sample(1:nrow(s.mat), nrow(s.mat)*0.7)

train.set<-s.mat[train,]
y.train<-Carseats$Sales[train]
test.set<-s.mat[-train,]
y.test<-Carseats$Sales[-train]


xgb.train<-xgb.DMatrix(data = train.set, label = y.train)
xgb.test<-xgb.DMatrix(data = test.set, label = y.test)

Parameter Tuning

This is a partial list of the parameters that can be tuned in order to optimize XGBoost with caret:

eta: step size of each boosting step. This is similar to the learning rate seen for the standard gradient boosting.

max_depth: maximum depth of the tree. Used to control over-fitting, as higher depth allows the model to grow trees that are very specific to a particular sample.

min_child_weight: minimum sum of instance Hessians required in a leaf. Since we are solving a regression problem with loss \(\frac{1}{2}(y_i-p_i)^2\), the Hessian is 1 for every observation, so this is simply the minimum number of samples in a leaf.

gamma: minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be. The gain of a split is compared against gamma: if the difference is negative, the algorithm discards the split under evaluation.

colsample_bytree: the subsample ratio of columns used when constructing each tree. Subsampling occurs once for every tree constructed. Similar to mtry in randomForest, except that here we specify a ratio rather than the actual number of columns.

subsample: subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing each tree. This helps prevent overfitting. The range is (0,1] and the default value is 1.

lambda: L2 regularization term on the leaf weights (the output values \(O_{val}\) discussed above). Increasing this value makes the model more conservative. The default is 1.
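
As a sketch (the parameter values are illustrative), the same parameters can also be passed directly to xgboost, outside of caret, through a params list:

params <- list(
  eta = 0.025, max_depth = 2, gamma = 0,
  colsample_bytree = 1, min_child_weight = 1, subsample = 1,
  lambda = 1, objective = "reg:squarederror"
)
fit <- xgboost::xgb.train(params = params, data = xgb.train, nrounds = 800)  # xgb.train is the DMatrix created above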

We decide to tune two parameters at a time. This might not be the ideal approach, but it allows us to visualize the effect of each parameter with ggplot and get a feel for the optimal parameter values.

LEARNING RATE AND TREE DEPTH

We start by checking possible values for eta and max_depth.

nrounds <- 1000 # to keep the computing time reasonable

tune_grid <- expand.grid(
  nrounds = seq(from = 200, to = nrounds, by = 50),
  eta = c(0.01, 0.025, 0.05, 0.1, 0.3),
  max_depth = c(2, 3, 4),
  gamma = 0, #c(0, 0.1, 0.5, 0.7, 1.0),
  colsample_bytree = 1, #c(0.3,0.5,1),
  min_child_weight = 1,
  subsample = 1
)



tune_control <- trainControl(
  method = "cv", # cross-validation
  number = 5, # with n folds 
  verboseIter = FALSE, # no training log
  allowParallel = FALSE,
  returnData = FALSE  
)


xgb_tune <- train(
  x = train.set,
  y = y.train,
  trControl = tune_control,
  tuneGrid = tune_grid,
  method = "xgbTree",
  verbose = F
)


ggplot(xgb_tune) + theme_bw()

xgb_tune$bestTune
##    nrounds max_depth   eta gamma colsample_bytree min_child_weight subsample
## 64     800         2 0.025     0                1                1         1
xgb_tune$results$RMSE %>% min()
## [1] 1.354485
xgb_tune$times # time elapsed to complete XGBoost algorithm with different combination of the parameters
## $everything
##    user  system elapsed 
##  139.42  111.36  131.87 
## 
## $final
##    user  system elapsed 
##    1.14    0.91    1.00 
## 
## $prediction
## [1] NA NA NA

GAMMA AND COLUMN SAMPLING

Next, we pick the best values from the previous step and see whether changing gamma and colsample_bytree (the subsample ratio of columns when constructing each tree) has any effect on the model fit.

tune_grid2 <- expand.grid(
  nrounds = seq(from = 200, to = nrounds, by = 100),
  eta = xgb_tune$bestTune$eta,
  max_depth = xgb_tune$bestTune$max_depth,
  gamma = c(0, 0.05, 0.1, 0.5, 0.7, 0.9, 1.0),
  colsample_bytree =  c(0.4, 0.6, 0.8, 1.0),
  min_child_weight = 1,
  subsample = 1
)

xgb_tune2 <- train(
  x = train.set,
  y = y.train,
  trControl = tune_control,
  tuneGrid = tune_grid2,
  method = "xgbTree",
  verbose = F
)

ggplot(xgb_tune2) + theme_bw()
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 36 rows containing missing values (geom_point).

xgb_tune2$results$RMSE %>% min()
## [1] 1.348369
xgb_tune2$bestTune
##     nrounds max_depth   eta gamma colsample_bytree min_child_weight subsample
## 108    1000         2 0.025   0.1                1                1         1
xgb_tune2$times #time elapsed
## $everything
##    user  system elapsed 
##  248.80  198.32  233.86 
## 
## $final
##    user  system elapsed 
##    1.88    1.87    1.92 
## 
## $prediction
## [1] NA NA NA

RE-EVALUATE ETA

Now that we have one possible best configuration, we can re-tune eta while increasing the number of trees grown. As nrounds (the number of trees grown) increases, eta (the learning rate) can decrease in order to reduce the risk of overfitting.

tune_grid3 <- expand.grid(
  nrounds = seq(from = 200, to = 5000, by = 100),
  eta = c(0.005, 0.01, 0.025, 0.05, 0.08, 0.1),
  max_depth = xgb_tune$bestTune$max_depth,
  gamma = xgb_tune2$bestTune$gamma,
  colsample_bytree =  xgb_tune2$bestTune$colsample_bytree,
  min_child_weight = 1,
  subsample = 1
)

xgb_tune3 <- train(
  x = train.set,
  y = y.train,
  trControl = tune_control,
  tuneGrid = tune_grid3,
  method = "xgbTree",
  verbose = F
)

ggplot(xgb_tune3) + theme_bw()

xgb_tune3$results$RMSE %>% min()
## [1] 1.33683
xgb_tune3$bestTune
##     nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
## 151     500         2 0.05   0.1                1                1         1
xgb_tune3$times #time elapsed
## $everything
##    user  system elapsed 
##  247.93  196.35  238.80 
## 
## $final
##    user  system elapsed 
##    0.71    0.52    0.70 
## 
## $prediction
## [1] NA NA NA

The number of possible parameter configurations can be very high. For example, if I have 5 parameters and decide to test 5 values for each, I end up with \(5^5 = 3125\) possible configurations.

In this project we started by presenting the algorithm in an intuitive way, then went deeper and added some mathematical details along with examples of its optimization tools.

After that we applied XGBoost to the Carseats data set and played around with the parameters. The parameter combinations explored are by no means exhaustive; we simply decided to vary 2-3 parameters at a time so we could plot how the RMSE varies.

We ended up with one possible best parameter configuration. Now we calculate the resulting Test MSE.

Test MSE

yhat<-predict(xgb_tune3, newdata = test.set)
testMSE<-mean((yhat-y.test)^2)
testMSE
## [1] 1.423714