Boosting goes back to a question posed by Michael Kearns: can many weak learners be combined into a single strong predictive model? Unlike bagged trees, which average predictions over many fully grown decision trees, boosted trees are built from small trees (often stumps) whose main purpose is to correct the cases that previous trees in the chain predicted wrongly. In regression, each new tree is fit to the current residuals; in classification, each new tree is trained on data points that have been reweighted according to how earlier trees performed on them.
The first concrete boosting algorithm was AdaBoost (short for Adaptive Boosting) by Yoav Freund and Robert Schapire. At each iteration, AdaBoost selects the single best feature and creates a decision split on it. It then finds the data points it got wrong and increases their weights accordingly, so hard-to-classify samples are given increasingly more priority as the algorithm goes on. Another advantage of this approach is that AdaBoost sidesteps dimensionality problems by selecting only one predictor at a time.
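For concreteness, the reweighting step in the standard AdaBoost formulation (for labels \(y_i \in \{-1, +1\}\) and weights normalized to sum to one) can be written as

\[
\epsilon_t = \sum_i w_i \,\mathbb{1}\{h_t(x_i) \neq y_i\}, \qquad
\alpha_t = \tfrac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right), \qquad
w_i \leftarrow \frac{w_i \exp\!\left(-\alpha_t\, y_i\, h_t(x_i)\right)}{Z_t},
\]

where \(h_t\) is the weak learner at iteration \(t\) and \(Z_t\) renormalizes the weights. Misclassified points (\(y_i h_t(x_i) = -1\)) have their weights multiplied by \(e^{\alpha_t} > 1\), which is exactly the "increasing priority" described above.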
Modern successors to AdaBoost are stochastic gradient boosting techniques, which train each tree only on a random subset of the data, an important feature given the size of many modern datasets. The best-known implementation of this technique is currently XGBoost, which is widely used in data science competitions such as those on Kaggle.
The Hitters dataset comes with the ISLR package. It contains performance metrics for baseball players from the 1986 season along with their salaries at the start of the 1987 season. The metrics are stats like at-bats, hits and walks for the 1986 season, career totals of the same, and fielding numbers such as put-outs.
I will use those covariates to predict the log transform of the players’ salaries in 1987.
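As a reference point, here is a minimal sketch of how the data might be loaded and prepared; the split proportion and seed are assumptions, not values taken from this report.

```r
library(ISLR)

# Drop players with missing salaries and model the log of the 1987 salary
hitters <- na.omit(Hitters)
hitters$Salary <- log(hitters$Salary)

# Assumed 70/30 train/test split; the actual split used in the report is not shown here
set.seed(1)
train_idx <- sample(nrow(hitters), size = round(0.7 * nrow(hitters)))
train <- hitters[train_idx, ]
test  <- hitters[-train_idx, ]
```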
First, a gradient boosted model was trained, tuning the shrinkage (learning rate) parameter.
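Since the exact parameter values are not reproduced here, the following is only a sketch of how such a model could be fit with the gbm package; every value shown is an assumption.

```r
library(gbm)

gbm_fit <- gbm(
  Salary ~ ., data = train,
  distribution = "gaussian",   # squared-error loss for regression
  n.trees = 5000,              # assumed upper bound; the learning curve picks the best count
  interaction.depth = 2,       # small trees, as described above
  shrinkage = 0.01,            # learning rate, the parameter tuned in Figure 1
  cv.folds = 5                 # cross-validated error used to draw the learning curve
)
```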
The learning curve for this model can be seen below.
Figure 1. Learning Curve for shrinkage parameter
We can see from Figure 1 that the training error keeps decreasing even after the testing error reaches its minimum. This is to be expected: as more trees are added, the model fits the training data ever more closely, but past a certain point the extra trees mostly fit noise and the model starts to overfit.
Now that we’ve established a good shrinkage parameter, it is useful to plot a learning curve for the number of trees. If we can reduce the number of trees in the model without causing a large increase in testing error, the simpler model should generalize better to new data.
Figure 2. Learning Curve for n.trees parameter. One standard deviation rule shown in red
A common rule when training complex models is to choose the simplest model whose error is within one standard deviation of the best model's. Following that heuristic we could simplify our model to only 403 trees. However, the graph shows that using 479 trees reduces the loss considerably further, so I will choose that number of trees instead.
Our final model therefore uses 479 trees together with the shrinkage value chosen from Figure 1. Its test-set metrics are:
| Metric | Value |
|---|---|
| RMSE | 0.5328230 |
| Rsquared | 0.5976603 |
| MAE | 0.3475788 |
When training complex models like gradient boosted trees, there is no easy way to see how changes in the different covariates affect our predictions. We can, however, find out which variables were most important to the model: the training algorithm keeps track of the total reduction in loss attributable to each variable across all iterations.
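With a gbm fit like the hypothetical `gbm_fit` sketched earlier, the relative influences can be extracted directly:

```r
# Relative influence of each predictor, aggregated over all boosting iterations
var_imp <- summary(gbm_fit, n.trees = 479, plotit = FALSE)
head(var_imp)
```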
Figure 3. Variable Importance for Gradient Boosted Model
For the rest of this assignment I used the caret package by Max Kuhn. The data was first preprocessed by centering and scaling the numeric variables using the means and standard deviations of the training set, and then one-hot encoding the categorical variables. The models were then trained and evaluated on 30 bootstrap resamples.
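A sketch of that setup, reusing the objects from the earlier data-preparation snippet; the exact column handling may differ from what was actually run.

```r
library(caret)

# Center and scale the numeric predictors using training-set statistics,
# leaving the outcome untouched
pre    <- preProcess(train[, setdiff(names(train), "Salary")],
                     method = c("center", "scale"))
scaled <- predict(pre, train)

# One-hot encode the factor variables (League, Division, NewLeague)
dummies <- dummyVars(Salary ~ ., data = scaled)
x_train <- predict(dummies, newdata = scaled)
y_train <- scaled$Salary

# Every model below is evaluated on 30 bootstrap resamples
ctrl <- trainControl(method = "boot", number = 30)
```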
I will show basic summaries and learning curves for each model before comparing them all in the next section.
Basic linear regression is always a good place to start. For each bootstrap resample, features were chosen by stepwise selection on the model's AIC score, as sketched below.
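A hedged sketch of how this could look with caret's wrapper around MASS::stepAIC, reusing the hypothetical `x_train`, `y_train` and `ctrl` objects from above:

```r
# Stepwise AIC selection inside each bootstrap resample
step_fit <- train(
  x = x_train, y = y_train,
  method    = "lmStepAIC",
  trControl = ctrl,
  trace     = FALSE          # silence stepAIC's per-step output
)
```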
| term | estimate | p.value |
|---|---|---|
| (Intercept) | -0.08 | 0.22 |
| AtBat | -0.51 | 0.01 |
| Hits | 0.72 | 0.00 |
| Walks | 0.25 | 0.00 |
| Years | 0.35 | 0.00 |
| CRuns | 0.45 | 0.01 |
| CWalks | -0.30 | 0.07 |
| Division.E | 0.16 | 0.08 |
| PutOuts | 0.14 | 0.00 |
| Metric | Value |
|---|---|
| RMSE | 0.7484820 |
| Rsquared | 0.3542432 |
| MAE | 0.5588668 |
The linear model does not do as well as the boosted model, but that was expected. The main attraction of linear models lies not in predictive power but in ease of explanation and inference.
Elastic net regularization applies a mixture of the LASSO and ridge penalties.
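In the glmnet parameterization used below, with mixing parameter \(\alpha\) and overall penalty \(\lambda\), the fitted coefficients minimize

\[
\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
+ \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right],
\]

so \(\alpha = 0\) corresponds to pure ridge regression and \(\alpha = 1\) to the pure LASSO.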
I trained linear regression models with elastic net regularization over a grid of \(\alpha\) and \(\lambda\) values.
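A sketch of the corresponding caret call, assuming a tuning grid consistent with the \(\alpha\)/\(\lambda\) combinations shown in the table below:

```r
# Grid of mixing (alpha) and penalty (lambda) values matching the table below
enet_grid <- expand.grid(
  alpha  = seq(0, 1, length.out = 8),
  lambda = c(seq(0, 1, length.out = 4), seq(4, 10, by = 3))
)

enet_fit <- train(
  x = x_train, y = y_train,
  method    = "glmnet",
  tuneGrid  = enet_grid,
  trControl = ctrl
)
```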
Figure 4. Learning curve for Elastic Net regression model.
| alpha | lambda | RMSE | Rsquared |
|---|---|---|---|
| 0.0000000 | 0.3333333 | 0.6864489 | 0.5425118 |
| 0.0000000 | 0.6666667 | 0.6869059 | 0.5438826 |
| 0.0000000 | 1.0000000 | 0.6895744 | 0.5440683 |
| 0.1428571 | 0.3333333 | 0.6913749 | 0.5407689 |
| 0.0000000 | 0.0000000 | 0.6921881 | 0.5364441 |
| 0.2857143 | 0.3333333 | 0.7052646 | 0.5323518 |
| 0.1428571 | 0.0000000 | 0.7095344 | 0.5225255 |
| 0.8571429 | 0.0000000 | 0.7096992 | 0.5223151 |
| 0.2857143 | 0.0000000 | 0.7099560 | 0.5220471 |
| 0.7142857 | 0.0000000 | 0.7099564 | 0.5220970 |
| 1.0000000 | 0.0000000 | 0.7100112 | 0.5218291 |
| 0.5714286 | 0.0000000 | 0.7100152 | 0.5219730 |
| 0.4285714 | 0.0000000 | 0.7102029 | 0.5218297 |
| 0.1428571 | 0.6666667 | 0.7113726 | 0.5332227 |
| 0.4285714 | 0.3333333 | 0.7241796 | 0.5205558 |
| 0.0000000 | 4.0000000 | 0.7323358 | 0.5422285 |
| 0.1428571 | 1.0000000 | 0.7375901 | 0.5244927 |
| 0.5714286 | 0.3333333 | 0.7452068 | 0.5080999 |
| 0.2857143 | 0.6666667 | 0.7534942 | 0.5116360 |
| 0.7142857 | 0.3333333 | 0.7671921 | 0.4964745 |
| 0.0000000 | 7.0000000 | 0.7733363 | 0.5410534 |
| 0.8571429 | 0.3333333 | 0.7910274 | 0.4820150 |
| 0.4285714 | 0.6666667 | 0.8019354 | 0.4892019 |
| 0.0000000 | 10.0000000 | 0.8056467 | 0.5403563 |
| 0.2857143 | 1.0000000 | 0.8098917 | 0.4919123 |
| 1.0000000 | 0.3333333 | 0.8148975 | 0.4646329 |
| 0.5714286 | 0.6666667 | 0.8561400 | 0.4583072 |
| 0.4285714 | 1.0000000 | 0.8920165 | 0.4431545 |
| 0.7142857 | 0.6666667 | 0.9118331 | 0.4156143 |
| 0.8571429 | 0.6666667 | 0.9628013 | 0.3750014 |
| 0.5714286 | 1.0000000 | 0.9684443 | 0.3804526 |
| 0.1428571 | 4.0000000 | 0.9844739 | 0.3874520 |
| 1.0000000 | 0.6666667 | 0.9992911 | 0.3441595 |
| 0.7142857 | 1.0000000 | 1.0076199 | 0.3217853 |
| 0.1428571 | 7.0000000 | 1.0089234 | NaN |
| 0.1428571 | 10.0000000 | 1.0089234 | NaN |
| 0.2857143 | 4.0000000 | 1.0089234 | NaN |
| 0.2857143 | 7.0000000 | 1.0089234 | NaN |
| 0.2857143 | 10.0000000 | 1.0089234 | NaN |
| 0.4285714 | 4.0000000 | 1.0089234 | NaN |
| 0.4285714 | 7.0000000 | 1.0089234 | NaN |
| 0.4285714 | 10.0000000 | 1.0089234 | NaN |
| 0.5714286 | 4.0000000 | 1.0089234 | NaN |
| 0.5714286 | 7.0000000 | 1.0089234 | NaN |
| 0.5714286 | 10.0000000 | 1.0089234 | NaN |
| 0.7142857 | 4.0000000 | 1.0089234 | NaN |
| 0.7142857 | 7.0000000 | 1.0089234 | NaN |
| 0.7142857 | 10.0000000 | 1.0089234 | NaN |
| 0.8571429 | 1.0000000 | 1.0089234 | NaN |
| 0.8571429 | 4.0000000 | 1.0089234 | NaN |
| 0.8571429 | 7.0000000 | 1.0089234 | NaN |
| 0.8571429 | 10.0000000 | 1.0089234 | NaN |
| 1.0000000 | 1.0000000 | 1.0089234 | NaN |
| 1.0000000 | 4.0000000 | 1.0089234 | NaN |
| 1.0000000 | 7.0000000 | 1.0089234 | NaN |
| 1.0000000 | 10.0000000 | 1.0089234 | NaN |
| Metric | Value |
|---|---|
| RMSE | 0.7342858 |
| Rsquared | 0.3268801 |
| MAE | 0.5639280 |
Partial least squares performs supervised dimension reduction, constructing components that are correlated with the response, and then fits a linear regression on those components. I expect this method to come in handy on this dataset, since there are probably multiple collinear variables, the obvious example being AtBat and CAtBat.
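A minimal sketch with caret, assuming the same resampling setup as above; `tuneLength = 18` mirrors the 1-18 components reported in the table below.

```r
# PLS regression, tuning the number of latent components
pls_fit <- train(
  x = x_train, y = y_train,
  method     = "pls",
  tuneLength = 18,
  trControl  = ctrl
)
```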
Figure 5. Learning curve for PLS model
| ncomp | RMSE | Rsquared |
|---|---|---|
| 1 | 0.6992334 | 0.5262258 |
| 2 | 0.7141601 | 0.5112001 |
| 4 | 0.7170441 | 0.5091373 |
| 3 | 0.7176150 | 0.5067679 |
| 12 | 0.7228114 | 0.5076451 |
| 11 | 0.7237804 | 0.5060740 |
| 13 | 0.7240241 | 0.5068642 |
| 10 | 0.7251288 | 0.5037276 |
| 8 | 0.7260542 | 0.5016029 |
| 9 | 0.7263278 | 0.5026390 |
| 14 | 0.7271375 | 0.5040421 |
| 5 | 0.7275464 | 0.4968396 |
| 6 | 0.7296176 | 0.4953634 |
| 15 | 0.7302571 | 0.5010987 |
| 16 | 0.7329537 | 0.4986845 |
| 7 | 0.7335040 | 0.4911485 |
| 18 | 0.7355082 | 0.4952366 |
| 17 | 0.7357283 | 0.4951865 |
It is a bit of a surprise that the optimal number of components was only 1, but it makes sense given that many of the variables measure similar attributes of the players.
| Metric | Value |
|---|---|
| RMSE | 0.7478039 |
| Rsquared | 0.3054730 |
| MAE | 0.5872181 |
I also trained basic bagged decision tree models with varying complexity-control parameters.
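A hedged sketch using ipred (the package behind caret's treebag method); the number of bags and the complexity value shown are assumptions.

```r
library(ipred)
library(rpart)

# Bagged regression trees; the rpart complexity parameter (cp) controls
# how far each individual tree is allowed to grow
bag_fit <- bagging(
  Salary ~ ., data = train,
  nbagg   = 25,                                 # number of bootstrapped trees
  coob    = TRUE,                               # out-of-bag error estimate
  control = rpart.control(cp = 0.01, xval = 0)  # assumed complexity setting
)
```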
| RMSE | Rsquared |
|---|---|
| 0.5453458 | 0.7015712 |
Figure 6. One of the simple bagged decision trees.
As soon as we move from linear regression models to ensembles of decision trees, performance improves considerably.
| Metric | Value |
|---|---|
| RMSE | 0.6007889 |
| Rsquared | 0.5583835 |
| MAE | 0.3801646 |
Random forests have been hailed as a close second to gradient boosted trees for most tasks. Here I train models over a grid of parameters using ranger, a relatively new package (callable from caret) that allows fast parallel training of random forest models.
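A sketch of the caret call, with a tuning grid matching the mtry/splitrule/min.node.size combinations in the table below:

```r
# Random forests (and extremely randomized trees) via ranger
rf_grid <- expand.grid(
  mtry          = 2:22,
  splitrule     = c("variance", "extratrees"),
  min.node.size = 10
)

rf_fit <- train(
  x = x_train, y = y_train,
  method     = "ranger",
  tuneGrid   = rf_grid,
  importance = "impurity",   # assumption: only needed if variable importance is wanted
  trControl  = ctrl
)
```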
Figure 7. Learning curve for Random Forest model
| mtry | splitrule | min.node.size | RMSE | Rsquared |
|---|---|---|---|---|
| 18 | extratrees | 10 | 0.5142004 | 0.7366637 |
| 19 | extratrees | 10 | 0.5144329 | 0.7364852 |
| 20 | extratrees | 10 | 0.5145276 | 0.7360559 |
| 14 | extratrees | 10 | 0.5145921 | 0.7360603 |
| 12 | extratrees | 10 | 0.5147796 | 0.7360459 |
| 21 | extratrees | 10 | 0.5148119 | 0.7360844 |
| 22 | extratrees | 10 | 0.5149363 | 0.7359524 |
| 15 | extratrees | 10 | 0.5150696 | 0.7359179 |
| 16 | extratrees | 10 | 0.5151400 | 0.7357276 |
| 17 | extratrees | 10 | 0.5153332 | 0.7355778 |
| 11 | extratrees | 10 | 0.5155171 | 0.7354303 |
| 13 | extratrees | 10 | 0.5158131 | 0.7351703 |
| 10 | extratrees | 10 | 0.5163732 | 0.7348576 |
| 3 | variance | 10 | 0.5175551 | 0.7335869 |
| 9 | extratrees | 10 | 0.5177270 | 0.7338229 |
| 4 | variance | 10 | 0.5178438 | 0.7332150 |
| 8 | extratrees | 10 | 0.5183110 | 0.7334427 |
| 5 | variance | 10 | 0.5188329 | 0.7325326 |
| 7 | extratrees | 10 | 0.5191696 | 0.7328366 |
| 6 | variance | 10 | 0.5191703 | 0.7320587 |
| 2 | variance | 10 | 0.5211176 | 0.7319268 |
| 7 | variance | 10 | 0.5213790 | 0.7305028 |
| 6 | extratrees | 10 | 0.5223729 | 0.7305640 |
| 8 | variance | 10 | 0.5236173 | 0.7285955 |
| 9 | variance | 10 | 0.5249420 | 0.7274079 |
| 10 | variance | 10 | 0.5253124 | 0.7273529 |
| 5 | extratrees | 10 | 0.5259155 | 0.7285022 |
| 11 | variance | 10 | 0.5267719 | 0.7258594 |
| 12 | variance | 10 | 0.5276699 | 0.7253682 |
| 13 | variance | 10 | 0.5286305 | 0.7244939 |
| 14 | variance | 10 | 0.5300395 | 0.7233127 |
| 15 | variance | 10 | 0.5306937 | 0.7227515 |
| 4 | extratrees | 10 | 0.5325198 | 0.7239077 |
| 17 | variance | 10 | 0.5328872 | 0.7208362 |
| 16 | variance | 10 | 0.5329004 | 0.7206066 |
| 18 | variance | 10 | 0.5338285 | 0.7200228 |
| 19 | variance | 10 | 0.5339943 | 0.7199321 |
| 20 | variance | 10 | 0.5355279 | 0.7185848 |
| 21 | variance | 10 | 0.5365520 | 0.7176272 |
| 22 | variance | 10 | 0.5376347 | 0.7166062 |
| 3 | extratrees | 10 | 0.5422478 | 0.7184310 |
| 2 | extratrees | 10 | 0.5682866 | 0.7028899 |
We get the best performance when we split using Extremely Randomized Trees. Extremely Randomized Trees work similarly to random forests, but they differ on a couple of points:

- Instead of searching for the optimal split point for each candidate variable, a split threshold is drawn at random for each one and the best of those random splits is used.
- In the original formulation, each tree is grown on the whole training sample rather than on a bootstrap sample.
| Metric | Value |
|---|---|
| RMSE | 0.4887077 |
| Rsquared | 0.6970456 |
| MAE | 0.2916262 |
Finally, I trained an XGBoost model to compare with the other models. XGBoost has a lot of hyperparameters that require tuning, but I narrowed the search down to the learning rate (eta), the number of boosting rounds (nrounds) and the fraction of columns sampled per tree (colsample_bytree).
Each tree had a maximum depth of 6 and no regularization constant inside the trees (gamma of 0). Each iteration also used only 80% of the training data (subsample of 0.8).
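A sketch of the caret call, with grid values read off the tuning table below:

```r
# XGBoost via caret; only eta, nrounds and colsample_bytree actually vary
xgb_grid <- expand.grid(
  nrounds          = seq(200, 800, by = 100),
  eta              = c(0.005, 0.010),
  max_depth        = 6,
  gamma            = 0,
  colsample_bytree = c(0.5, 0.6, 0.7, 0.8),
  min_child_weight = 1,
  subsample        = 0.8
)

xgb_fit <- train(
  x = x_train, y = y_train,
  method    = "xgbTree",
  tuneGrid  = xgb_grid,
  trControl = ctrl
)
```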
Figure 8. Learning curve for XGBoost model
| eta | max_depth | gamma | colsample_bytree | min_child_weight | subsample | nrounds | RMSE | Rsquared |
|---|---|---|---|---|---|---|---|---|
| 0.005 | 6 | 0 | 0.5 | 1 | 0.8 | 700 | 0.5069693 | 0.7454435 |
| 0.005 | 6 | 0 | 0.5 | 1 | 0.8 | 600 | 0.5073256 | 0.7454181 |
| 0.010 | 6 | 0 | 0.5 | 1 | 0.8 | 300 | 0.5075768 | 0.7452551 |
| 0.005 | 6 | 0 | 0.5 | 1 | 0.8 | 800 | 0.5076553 | 0.7452580 |
| 0.010 | 6 | 0 | 0.5 | 1 | 0.8 | 400 | 0.5080666 | 0.7450891 |
| 0.005 | 6 | 0 | 0.6 | 1 | 0.8 | 600 | 0.5099644 | 0.7426192 |
| 0.010 | 6 | 0 | 0.5 | 1 | 0.8 | 500 | 0.5100948 | 0.7443281 |
| 0.005 | 6 | 0 | 0.6 | 1 | 0.8 | 700 | 0.5101321 | 0.7424102 |
| 0.005 | 6 | 0 | 0.7 | 1 | 0.8 | 600 | 0.5103285 | 0.7421742 |
| 0.010 | 6 | 0 | 0.6 | 1 | 0.8 | 300 | 0.5103689 | 0.7424857 |
| 0.005 | 6 | 0 | 0.7 | 1 | 0.8 | 700 | 0.5105006 | 0.7419582 |
| 0.005 | 6 | 0 | 0.5 | 1 | 0.8 | 500 | 0.5108443 | 0.7450646 |
| 0.010 | 6 | 0 | 0.7 | 1 | 0.8 | 300 | 0.5109533 | 0.7416116 |
| 0.005 | 6 | 0 | 0.6 | 1 | 0.8 | 800 | 0.5112027 | 0.7420391 |
| 0.010 | 6 | 0 | 0.8 | 1 | 0.8 | 300 | 0.5112833 | 0.7416963 |
| 0.005 | 6 | 0 | 0.8 | 1 | 0.8 | 600 | 0.5113299 | 0.7414146 |
| 0.005 | 6 | 0 | 0.8 | 1 | 0.8 | 700 | 0.5113535 | 0.7413192 |
| 0.005 | 6 | 0 | 0.7 | 1 | 0.8 | 800 | 0.5114066 | 0.7417510 |
| 0.010 | 6 | 0 | 0.6 | 1 | 0.8 | 400 | 0.5115077 | 0.7420321 |
| 0.010 | 6 | 0 | 0.5 | 1 | 0.8 | 600 | 0.5117100 | 0.7435051 |
| 0.010 | 6 | 0 | 0.7 | 1 | 0.8 | 400 | 0.5118336 | 0.7413760 |
| 0.005 | 6 | 0 | 0.8 | 1 | 0.8 | 800 | 0.5121735 | 0.7411278 |
| 0.010 | 6 | 0 | 0.8 | 1 | 0.8 | 400 | 0.5122487 | 0.7413626 |
| 0.010 | 6 | 0 | 0.5 | 1 | 0.8 | 700 | 0.5126581 | 0.7429981 |
| 0.005 | 6 | 0 | 0.6 | 1 | 0.8 | 500 | 0.5126721 | 0.7426076 |
| 0.005 | 6 | 0 | 0.7 | 1 | 0.8 | 500 | 0.5130795 | 0.7420998 |
| 0.010 | 6 | 0 | 0.5 | 1 | 0.8 | 800 | 0.5132916 | 0.7425992 |
| 0.010 | 6 | 0 | 0.6 | 1 | 0.8 | 500 | 0.5139359 | 0.7410707 |
| 0.010 | 6 | 0 | 0.7 | 1 | 0.8 | 500 | 0.5140600 | 0.7405412 |
| 0.005 | 6 | 0 | 0.8 | 1 | 0.8 | 500 | 0.5141703 | 0.7412210 |
| 0.010 | 6 | 0 | 0.8 | 1 | 0.8 | 500 | 0.5142871 | 0.7406593 |
| 0.010 | 6 | 0 | 0.6 | 1 | 0.8 | 600 | 0.5156788 | 0.7402006 |
| 0.010 | 6 | 0 | 0.7 | 1 | 0.8 | 600 | 0.5157834 | 0.7396754 |
| 0.010 | 6 | 0 | 0.8 | 1 | 0.8 | 600 | 0.5159319 | 0.7398495 |
| 0.010 | 6 | 0 | 0.6 | 1 | 0.8 | 700 | 0.5167858 | 0.7395796 |
| 0.010 | 6 | 0 | 0.7 | 1 | 0.8 | 700 | 0.5169176 | 0.7390218 |
| 0.010 | 6 | 0 | 0.8 | 1 | 0.8 | 700 | 0.5169871 | 0.7392428 |
| 0.010 | 6 | 0 | 0.6 | 1 | 0.8 | 800 | 0.5175225 | 0.7391191 |
| 0.010 | 6 | 0 | 0.8 | 1 | 0.8 | 800 | 0.5176205 | 0.7388446 |
| 0.010 | 6 | 0 | 0.7 | 1 | 0.8 | 800 | 0.5176778 | 0.7385113 |
| 0.010 | 6 | 0 | 0.5 | 1 | 0.8 | 200 | 0.5222656 | 0.7441449 |
| 0.005 | 6 | 0 | 0.5 | 1 | 0.8 | 400 | 0.5224844 | 0.7442203 |
| 0.010 | 6 | 0 | 0.6 | 1 | 0.8 | 200 | 0.5230201 | 0.7425382 |
| 0.005 | 6 | 0 | 0.6 | 1 | 0.8 | 400 | 0.5231197 | 0.7423730 |
| 0.005 | 6 | 0 | 0.7 | 1 | 0.8 | 400 | 0.5234363 | 0.7420367 |
| 0.010 | 6 | 0 | 0.7 | 1 | 0.8 | 200 | 0.5239433 | 0.7413699 |
| 0.005 | 6 | 0 | 0.8 | 1 | 0.8 | 400 | 0.5244525 | 0.7411256 |
| 0.010 | 6 | 0 | 0.8 | 1 | 0.8 | 200 | 0.5245956 | 0.7408921 |
| 0.005 | 6 | 0 | 0.6 | 1 | 0.8 | 300 | 0.5532924 | 0.7419453 |
| 0.005 | 6 | 0 | 0.7 | 1 | 0.8 | 300 | 0.5535713 | 0.7416965 |
| 0.005 | 6 | 0 | 0.5 | 1 | 0.8 | 300 | 0.5541784 | 0.7427082 |
| 0.005 | 6 | 0 | 0.8 | 1 | 0.8 | 300 | 0.5544383 | 0.7408167 |
| 0.005 | 6 | 0 | 0.6 | 1 | 0.8 | 200 | 0.6293978 | 0.7403623 |
| 0.005 | 6 | 0 | 0.7 | 1 | 0.8 | 200 | 0.6296223 | 0.7400998 |
| 0.005 | 6 | 0 | 0.8 | 1 | 0.8 | 200 | 0.6301154 | 0.7395130 |
| 0.005 | 6 | 0 | 0.5 | 1 | 0.8 | 200 | 0.6313091 | 0.7398240 |
The XGBoost model does very well, but not quite as well as the random forest model.
| Metric | Value |
|---|---|
| RMSE | 0.5140068 |
| Rsquared | 0.6825288 |
| MAE | 0.3255674 |
caret saves each trained model in an object that contains its resampled metrics. We can collect those metrics across models and plot them to compare all the models, as sketched below.
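A sketch of how the comparison can be pulled together; only models fitted through caret::train can be pooled this way, and the object names refer to the hypothetical fits sketched earlier.

```r
# Collect the 30 bootstrap resamples from each caret model
resamps <- resamples(list(
  "Stepwise AIC"  = step_fit,
  "GLMnet"        = enet_fit,
  "PLS"           = pls_fit,
  "Random forest" = rf_fit,
  "XGBoost"       = xgb_fit
))

summary(resamps)                   # per-model RMSE, R-squared and MAE summaries
bwplot(resamps)                    # boxplots as in Figure 9
splom(resamps, metric = "RMSE")    # scatterplot matrix as in Figure 10
```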
Figure 9. Boxplots of model metrics
It is clear that the tree-based models do much better than the regression models. This is to be expected, as tree-based models capture non-linearities and variable interactions out of the box. For linear models to match their predictive power, careful feature engineering and domain knowledge must be applied.
Now we can use scatterplots to compare the tree-based models more explicitly. The points are the resampled RMSE values: points above the dotted line correspond to resamples where the model on the y axis had the higher RMSE, and points below the line to resamples where the model on the x axis had the higher RMSE.
Figure 10. SPLOM of model RMSE
This scatterplot matrix shows that even though the random forest did best in most resamples, the differences between the models are not overly large. I believe that with more time spent on parameter tuning, the boosted models could beat the random forest.
| model | RMSE | Rsquared | MAE |
|---|---|---|---|
| Stepwise AIC | 0.7484820 | 0.3542432 | 0.5588668 |
| PLS | 0.7478039 | 0.3054730 | 0.5872181 |
| GLMnet | 0.7342858 | 0.3268801 | 0.5639280 |
| Bagged trees | 0.6007889 | 0.5583835 | 0.3801646 |
| GBM | 0.5328230 | 0.5976603 | 0.3475788 |
| XGBoost | 0.5140068 | 0.6825288 | 0.3255674 |
| Random forest | 0.4887077 | 0.6970456 | 0.2916262 |
Table 6 above shows the test-set performances. The random forest with extremely randomized trees did much better than the competition. Keep in mind, though, that I handpicked the parameters for the GBM model based on graphs, so I might have missed some optimal parts of the parameter space. The same can be said for the XGBoost model, which has many more parameters than random forest models.
It is interesting that the stepwise AIC model has a better \(R^2\) and MAE (mean absolute error) than the other linear models, even though its RMSE is higher. RMSE punishes large residuals more than MAE, since it squares them instead of taking the absolute value. This might mean that the stepwise AIC model handles abnormal data points worse than the other models while still fitting most other data points better. It is a good lesson in the choice of cost function, since different cost functions can give us different orderings of models.
In this report I summarized the history of gradient boosting and handpicked parameters for such a model. I then trained various other linear and non-linear models, and compared their resampled performance measures. We saw that decision tree ensemble models do a lot better than linear models out of the box, which makes them a good first angle of attack when one is dealing with a complex dataset and has little time for feature engineering. My main lessons learned from this assignment:
- Using caret's built-in objects to evaluate resampled performance measures.

Hope you enjoyed reading this as much as I enjoyed researching and writing it!