Executive Summary

Gradient Boosting

Boosting goes back to a question posed by Michael Kearns: can many weak learners be combined into a single strong predictive model? Unlike bagged trees, which average predictions over many fully grown decision trees, boosted ensembles are built from small trees (often stumps) whose main job is to correct the cases that the previous trees in the chain got wrong. In regression, each tree is fit to the current residuals; in classification, the trees are fit to data points that have been reweighted according to how the ensemble has performed on them so far.

The first concrete boosting algorithm was AdaBoost (short for Adaptive Boosting), by Yoav Freund and Robert Schapire. At each iteration, AdaBoost selects the single best feature and creates a decision split on it. It then identifies the data points it got wrong and increases their weights accordingly, so hard-to-classify samples receive more and more priority as the algorithm goes on. Another advantage of this approach is that AdaBoost sidesteps dimensionality problems by considering only one predictor at a time.

A modern extension of these ideas is stochastic gradient boosting, which trains each tree on only a random subset of the data, an important feature given the size of many modern datasets. The best-known implementation of this technique is XGBoost, which is widely used in data science competitions such as those on Kaggle.

The dataset

The Hitters dataset comes with the ISLR package. It contains performance metrics for baseball players from the 1986 season along with their salaries at the start of the 1987 season. The metrics are statistics such as:

  • AtBat: Number of times at bat in 1986
  • CAtBat: Number of times at bat during career
  • Years: Years in major leagues

I will use those covariates to predict the log transform of the players’ salaries in 1987.

Training a boosted model

Parameter tuning

First, a gradient boosted model was trained with the following parameters (a hedged sketch of such a training run is shown after the list):

  • Number of trees: 1000
  • Number of splits: 1
  • Minimum observations in each node after splitting: 10
  • Shrinkage: 20 values in the interval [0.001, 0.1]
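The snippet below is a minimal sketch of what such a run could look like with the gbm package. It is not the report's actual code: the object names (hitters, hitters_train, hitters_test), the 80/20 split, and the seed are placeholders.

```r
library(ISLR)   # provides the Hitters dataset
library(gbm)

hitters <- na.omit(Hitters)                 # drop players with a missing Salary
set.seed(42)                                # placeholder seed
idx           <- sample(nrow(hitters), round(0.8 * nrow(hitters)))
hitters_train <- hitters[idx, ]
hitters_test  <- hitters[-idx, ]

# one gbm fit per shrinkage value, with everything else held fixed
shrinkage_grid <- seq(0.001, 0.1, length.out = 20)
fits <- lapply(shrinkage_grid, function(s) {
  gbm(
    log(Salary) ~ ., data = hitters_train,
    distribution      = "gaussian",
    n.trees           = 1000,
    interaction.depth = 1,      # a single split per tree (stumps)
    n.minobsinnode    = 10,
    shrinkage         = s
  )
})
```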

The learning curve for this model can be seen below.

Figure 1. Learning Curve for shrinkage parameter

We can see from Figure 1 that the training error keeps decreasing even after the testing error reaches its minimum. This is to be expected: beyond that point the model starts to overfit the training data.

  • The testing error reaches a minimum of RMSE = 0.5000069 with a shrinkage of \(\lambda\) = 0.0843684

Now that we’ve established a good shrinkage parameter it would be useful to plot a learning curve for the number of trees trained. If we can reduce the number of trees in the model without causing a large increase in testing error a simpler model would generalize better to new data.

Figure 2. Learning Curve for n.trees parameter. One standard deviation rule shown in red

A common rule when training complex models is to choose the simplest model whose error is within one standard deviation of the best model's. Following that heuristic, we could simplify our model to only 403 trees. Looking at the graph, however, using 479 trees reduces the loss noticeably, so I will choose that number of trees instead.
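The rule itself is simple to apply. The sketch below assumes a data frame `tree_curve` with one row per candidate number of trees and columns RMSE (resampled mean error) and RMSESD (its standard deviation), which is the layout caret's train() results use; these names are assumptions, not the report's code.

```r
# best model and its one-standard-deviation cutoff
best   <- tree_curve[which.min(tree_curve$RMSE), ]
cutoff <- best$RMSE + best$RMSESD

# simplest (smallest) number of trees whose resampled RMSE is within the cutoff
simplest_n_trees <- min(tree_curve$n.trees[tree_curve$RMSE <= cutoff])
```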

Our final model parameters are therefore:

  • Number of trees: 479
  • Number of splits: 1
  • Minimum observations in each node after splitting: 10
  • Shrinkage: 0.0843684

Test set metrics for GBM model.
RMSE      0.5328230
Rsquared  0.5976603
MAE       0.3475788
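Test-set metrics like these can be computed along the following lines. This is a hedged sketch reusing the hypothetical hitters_train/hitters_test split from the earlier snippet; postResample() from caret returns the RMSE, R-squared and MAE triplet reported above.

```r
library(gbm)
library(caret)   # for postResample()

final_gbm <- gbm(
  log(Salary) ~ ., data = hitters_train,
  distribution      = "gaussian",
  n.trees           = 479,
  interaction.depth = 1,
  n.minobsinnode    = 10,
  shrinkage         = 0.0843684
)

preds <- predict(final_gbm, newdata = hitters_test, n.trees = 479)
postResample(pred = preds, obs = log(hitters_test$Salary))   # RMSE, Rsquared, MAE
```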

Variable importance

When training complex models like gradient boosted trees, there is no easy way to see how changes in the different covariates affect our predictions. We can, however, find out which variables were most important to the model: the training algorithm keeps track of the total reduction in loss attributed to each variable across all iterations.
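With gbm, these totals are exposed through the summary() method as relative influence, which is what Figure 3 plots; `final_gbm` is the hypothetical model object from the sketch above.

```r
library(gbm)

# relative influence: each variable's share of the total reduction in squared error
rel_inf <- summary(final_gbm, n.trees = 479, plotit = FALSE)
head(rel_inf)   # a data frame with columns var and rel.inf
```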

Figure 3. Variable Importance for Gradient Boosted Model

Other models

For the rest of this assignment I used the caret package by Max Kuhn. The data was first preprocessed by centering and scaling the numeric variables with the means and standard deviations of the training set, and then one-hot encoding the categorical variables. The models were then trained and evaluated using 30 bootstrap resamples.
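A hedged sketch of this shared setup is shown below. The exact code in the report may differ: hitters_train and hitters_test are the hypothetical split from the first snippet, and while the report scales the numeric columns before encoding, here preProcess is requested per model for brevity.

```r
library(caret)

# one-hot encode the factors (League, Division, NewLeague); dummyVars drops the response
dmy     <- dummyVars(Salary ~ ., data = hitters_train)
x_train <- as.data.frame(predict(dmy, hitters_train))
x_test  <- as.data.frame(predict(dmy, hitters_test))
y_train <- log(hitters_train$Salary)
y_test  <- log(hitters_test$Salary)

# 30 bootstrap resamples, reused by every model below
ctrl <- trainControl(method = "boot", number = 30)
```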

I will show basic summaries and learning curves for each model before comparing them all in the next section.

Linear Regression Models

Stepwise Feature Selection

Basic linear regression is always a good place to start. For each bootstrap resample, features were chosen by stepwise selection on the model's AIC score.
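One way to set this up is caret's "lmStepAIC" method, which wraps MASS::stepAIC; below is a hedged sketch using the hypothetical x_train/y_train/ctrl objects from the setup snippet.

```r
library(caret)

step_fit <- train(
  x = x_train, y = y_train,
  method     = "lmStepAIC",
  preProcess = c("center", "scale"),   # training-set means and SDs only
  trControl  = ctrl,
  trace      = FALSE                   # silence stepAIC's per-step output
)
```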

Coefficients for stepwise AIC model.
term estimate p.value
(Intercept) -0.08 0.22
AtBat -0.51 0.01
Hits 0.72 0.00
Walks 0.25 0.00
Years 0.35 0.00
CRuns 0.45 0.01
CWalks -0.30 0.07
Division.E 0.16 0.08
PutOuts 0.14 0.00

Test set metrics for stepwise AIC model.
RMSE      0.7484820
Rsquared  0.3542432
MAE       0.5588668

The linear model does not do as well as the boosted model, but that was expected. The main attraction of linear models lies not in predictive power but in ease of explanation and inference.

Generalized Linear Model with Elastic-Net regularization

Elastic net regularization applies a mixture of the LASSO and ridge penalties. In the glmnet parameterization the objective is:

  • \(RSS + \lambda \left[ \frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1 \right]\)
  • \(\alpha = 0\): ridge
  • \(\alpha = 1\): LASSO

I trained linear regression models with elastic net regularization over the following parameter grid (a sketch of the corresponding caret call follows the list):

  • \(\alpha\): 8 values in the interval [0, 1]
  • \(\lambda\): 4 values in [0, 1] and 4 values in [1, 10]
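Below is a hedged sketch of that grid with caret's "glmnet" method, reusing the hypothetical x_train/y_train/ctrl objects from the setup snippet; the lambda values reproduce the 0, 1/3, 2/3, 1, 4, 7, 10 sequence seen in Table 1.

```r
library(caret)

enet_grid <- expand.grid(
  alpha  = seq(0, 1, length.out = 8),
  lambda = unique(c(seq(0, 1, length.out = 4), seq(1, 10, length.out = 4)))
)

enet_fit <- train(
  x = x_train, y = y_train,
  method     = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid   = enet_grid,
  trControl  = ctrl
)
```
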
Figure 4. Learning curve for Elastic Net regression model.

Table 1. Parameters and performance of GLM models.
alpha lambda RMSE Rsquared
0.0000000 0.3333333 0.6864489 0.5425118
0.0000000 0.6666667 0.6869059 0.5438826
0.0000000 1.0000000 0.6895744 0.5440683
0.1428571 0.3333333 0.6913749 0.5407689
0.0000000 0.0000000 0.6921881 0.5364441
0.2857143 0.3333333 0.7052646 0.5323518
0.1428571 0.0000000 0.7095344 0.5225255
0.8571429 0.0000000 0.7096992 0.5223151
0.2857143 0.0000000 0.7099560 0.5220471
0.7142857 0.0000000 0.7099564 0.5220970
1.0000000 0.0000000 0.7100112 0.5218291
0.5714286 0.0000000 0.7100152 0.5219730
0.4285714 0.0000000 0.7102029 0.5218297
0.1428571 0.6666667 0.7113726 0.5332227
0.4285714 0.3333333 0.7241796 0.5205558
0.0000000 4.0000000 0.7323358 0.5422285
0.1428571 1.0000000 0.7375901 0.5244927
0.5714286 0.3333333 0.7452068 0.5080999
0.2857143 0.6666667 0.7534942 0.5116360
0.7142857 0.3333333 0.7671921 0.4964745
0.0000000 7.0000000 0.7733363 0.5410534
0.8571429 0.3333333 0.7910274 0.4820150
0.4285714 0.6666667 0.8019354 0.4892019
0.0000000 10.0000000 0.8056467 0.5403563
0.2857143 1.0000000 0.8098917 0.4919123
1.0000000 0.3333333 0.8148975 0.4646329
0.5714286 0.6666667 0.8561400 0.4583072
0.4285714 1.0000000 0.8920165 0.4431545
0.7142857 0.6666667 0.9118331 0.4156143
0.8571429 0.6666667 0.9628013 0.3750014
0.5714286 1.0000000 0.9684443 0.3804526
0.1428571 4.0000000 0.9844739 0.3874520
1.0000000 0.6666667 0.9992911 0.3441595
0.7142857 1.0000000 1.0076199 0.3217853
0.1428571 7.0000000 1.0089234 NaN
0.1428571 10.0000000 1.0089234 NaN
0.2857143 4.0000000 1.0089234 NaN
0.2857143 7.0000000 1.0089234 NaN
0.2857143 10.0000000 1.0089234 NaN
0.4285714 4.0000000 1.0089234 NaN
0.4285714 7.0000000 1.0089234 NaN
0.4285714 10.0000000 1.0089234 NaN
0.5714286 4.0000000 1.0089234 NaN
0.5714286 7.0000000 1.0089234 NaN
0.5714286 10.0000000 1.0089234 NaN
0.7142857 4.0000000 1.0089234 NaN
0.7142857 7.0000000 1.0089234 NaN
0.7142857 10.0000000 1.0089234 NaN
0.8571429 1.0000000 1.0089234 NaN
0.8571429 4.0000000 1.0089234 NaN
0.8571429 7.0000000 1.0089234 NaN
0.8571429 10.0000000 1.0089234 NaN
1.0000000 1.0000000 1.0089234 NaN
1.0000000 4.0000000 1.0089234 NaN
1.0000000 7.0000000 1.0089234 NaN
1.0000000 10.0000000 1.0089234 NaN
Test set metrics for GLM model.
RMSE      0.7342858
Rsquared  0.3268801
MAE       0.5639280

Partial Least Squares Regression

Partial Least Squares reduces the predictors to a small set of components via supervised dimension reduction and fits a linear regression on those components. I expect this method to come in handy on this dataset, since there are probably several collinear variables; AtBat and CAtBat are an obvious example.
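A hedged sketch of the corresponding caret call (method "pls"), again reusing the hypothetical setup objects; Table 2 covers ncomp from 1 to 18, which tuneLength = 18 reproduces.

```r
library(caret)

pls_fit <- train(
  x = x_train, y = y_train,
  method     = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 18,        # ncomp = 1, ..., 18
  trControl  = ctrl
)
```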

Figure 5. Learning curve for PLS model

Table 2. Parameters and metrics of PLS model.
ncomp RMSE Rsquared
1 0.6992334 0.5262258
2 0.7141601 0.5112001
4 0.7170441 0.5091373
3 0.7176150 0.5067679
12 0.7228114 0.5076451
11 0.7237804 0.5060740
13 0.7240241 0.5068642
10 0.7251288 0.5037276
8 0.7260542 0.5016029
9 0.7263278 0.5026390
14 0.7271375 0.5040421
5 0.7275464 0.4968396
6 0.7296176 0.4953634
15 0.7302571 0.5010987
16 0.7329537 0.4986845
7 0.7335040 0.4911485
18 0.7355082 0.4952366
17 0.7357283 0.4951865

It is a bit of a surprise that the optimal number of components was only 1, but it makes sense given that many of the variables capture similar attributes of the players.

Test set metrics for PLS model.
RMSE      0.7478039
Rsquared  0.3054730
MAE       0.5872181

Tree-based models

Bagged trees

I also trained basic bagged decision tree models, varying the complexity control parameters.
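Below is a minimal sketch using caret's "treebag" method (bagged CART trees via ipred/rpart); the report's complexity-control settings are not listed, so none are shown here, and the object names follow the earlier setup snippet.

```r
library(caret)

bag_fit <- train(
  x = x_train, y = y_train,
  method    = "treebag",
  trControl = ctrl
)
```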

Table 3. Bootstrap metrics of simple bagged tree model.
RMSE Rsquared
0.5453458 0.7015712
Figure 6. One of the simple bagged decision trees.

As soon as we move from linear regression models to ensembles of decision trees, performance improves markedly.

Test set metrics for bagged tree model.
RMSE      0.6007889
Rsquared  0.5583835
MAE       0.3801646

Random Forest

Random forests have been hailed as a close second to gradient boosted trees for most tasks. Here I train models over a grid of parameters. ranger, a relatively new package (callable from caret), allows fast parallel training of random forest models.
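A hedged sketch of the grid behind Table 4 (mtry from 2 to 22, both split rules, minimum node size 10), fit through caret's "ranger" method with the hypothetical setup objects; 22 is the number of predictor columns after one-hot encoding.

```r
library(caret)
library(ranger)

rf_grid <- expand.grid(
  mtry          = 2:22,
  splitrule     = c("variance", "extratrees"),
  min.node.size = 10
)

rf_fit <- train(
  x = x_train, y = y_train,
  method    = "ranger",
  tuneGrid  = rf_grid,
  trControl = ctrl
)
```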

Figure 7. Learning curve for Random Forest model

Table 4. Metrics of random forest model.
mtry splitrule min.node.size RMSE Rsquared
18 extratrees 10 0.5142004 0.7366637
19 extratrees 10 0.5144329 0.7364852
20 extratrees 10 0.5145276 0.7360559
14 extratrees 10 0.5145921 0.7360603
12 extratrees 10 0.5147796 0.7360459
21 extratrees 10 0.5148119 0.7360844
22 extratrees 10 0.5149363 0.7359524
15 extratrees 10 0.5150696 0.7359179
16 extratrees 10 0.5151400 0.7357276
17 extratrees 10 0.5153332 0.7355778
11 extratrees 10 0.5155171 0.7354303
13 extratrees 10 0.5158131 0.7351703
10 extratrees 10 0.5163732 0.7348576
3 variance 10 0.5175551 0.7335869
9 extratrees 10 0.5177270 0.7338229
4 variance 10 0.5178438 0.7332150
8 extratrees 10 0.5183110 0.7334427
5 variance 10 0.5188329 0.7325326
7 extratrees 10 0.5191696 0.7328366
6 variance 10 0.5191703 0.7320587
2 variance 10 0.5211176 0.7319268
7 variance 10 0.5213790 0.7305028
6 extratrees 10 0.5223729 0.7305640
8 variance 10 0.5236173 0.7285955
9 variance 10 0.5249420 0.7274079
10 variance 10 0.5253124 0.7273529
5 extratrees 10 0.5259155 0.7285022
11 variance 10 0.5267719 0.7258594
12 variance 10 0.5276699 0.7253682
13 variance 10 0.5286305 0.7244939
14 variance 10 0.5300395 0.7233127
15 variance 10 0.5306937 0.7227515
4 extratrees 10 0.5325198 0.7239077
17 variance 10 0.5328872 0.7208362
16 variance 10 0.5329004 0.7206066
18 variance 10 0.5338285 0.7200228
19 variance 10 0.5339943 0.7199321
20 variance 10 0.5355279 0.7185848
21 variance 10 0.5365520 0.7176272
22 variance 10 0.5376347 0.7166062
3 extratrees 10 0.5422478 0.7184310
2 extratrees 10 0.5682866 0.7028899

We get the best performance when splitting with Extremely Randomized Trees (Extra-Trees). Extra-Trees work similarly to random forests but differ on two levels (a hedged ranger call illustrating these settings follows the list):

  • They do not perform bootstrapping; every tree sees the full training set.
  • They still choose a random subset of predictors at each node, but instead of searching for the optimal cut point they draw random split points on those predictors and keep the best of them.
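To make those two ingredients explicit, here is a hedged standalone ranger call with Extra-Trees settings; the caret-fitted model above may use different sampling defaults, and mtry = 18 simply matches the best row of Table 4.

```r
library(ranger)

et_fit <- ranger(
  log(Salary) ~ ., data = hitters_train,
  num.trees       = 500,
  mtry            = 18,
  min.node.size   = 10,
  splitrule       = "extratrees",   # random candidate split points
  replace         = FALSE,          # no bootstrap resampling ...
  sample.fraction = 1               # ... every tree sees the full training set
)
```
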
Test set metrics for random forest model.
RMSE      0.4887077
Rsquared  0.6970456
MAE       0.2916262

Extreme Gradient Boosting with XGBoost

Finally, I trained an XGBoost model to compare with the other models. XGBoost has many hyperparameters that require tuning, but I narrowed the search down to:

  • eta: learning rate = 0.005 or 0.01
  • nrounds: number of boosting iterations = 200, 300, 400, 500, 600, 700, 800
  • colsample_bytree: fraction of predictors sampled per tree = 0.5, 0.6, 0.7 or 0.8

Each tree had a maximum depth of 6 and no additional regularization inside the trees (gamma = 0). Each boosting iteration also used only 80% of the training data (subsample = 0.8). A hedged sketch of this search space is shown below.
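This sketch uses caret's "xgbTree" method, whose seven tuning parameters match the columns of Table 5; the data and control objects are the hypothetical ones from the setup snippet.

```r
library(caret)
library(xgboost)

xgb_grid <- expand.grid(
  nrounds          = seq(200, 800, by = 100),
  eta              = c(0.005, 0.01),
  max_depth        = 6,
  gamma            = 0,                       # no extra regularization inside the trees
  colsample_bytree = c(0.5, 0.6, 0.7, 0.8),
  min_child_weight = 1,
  subsample        = 0.8                      # 80% of the rows per boosting iteration
)

xgb_fit <- train(
  x = x_train, y = y_train,
  method    = "xgbTree",
  tuneGrid  = xgb_grid,
  trControl = ctrl
)
```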

Figure 8. Learning curve for XGBoost model

Table 5. Parameters and performance of XGBoost models.
eta max_depth gamma colsample_bytree min_child_weight subsample nrounds RMSE Rsquared
0.005 6 0 0.5 1 0.8 700 0.5069693 0.7454435
0.005 6 0 0.5 1 0.8 600 0.5073256 0.7454181
0.010 6 0 0.5 1 0.8 300 0.5075768 0.7452551
0.005 6 0 0.5 1 0.8 800 0.5076553 0.7452580
0.010 6 0 0.5 1 0.8 400 0.5080666 0.7450891
0.005 6 0 0.6 1 0.8 600 0.5099644 0.7426192
0.010 6 0 0.5 1 0.8 500 0.5100948 0.7443281
0.005 6 0 0.6 1 0.8 700 0.5101321 0.7424102
0.005 6 0 0.7 1 0.8 600 0.5103285 0.7421742
0.010 6 0 0.6 1 0.8 300 0.5103689 0.7424857
0.005 6 0 0.7 1 0.8 700 0.5105006 0.7419582
0.005 6 0 0.5 1 0.8 500 0.5108443 0.7450646
0.010 6 0 0.7 1 0.8 300 0.5109533 0.7416116
0.005 6 0 0.6 1 0.8 800 0.5112027 0.7420391
0.010 6 0 0.8 1 0.8 300 0.5112833 0.7416963
0.005 6 0 0.8 1 0.8 600 0.5113299 0.7414146
0.005 6 0 0.8 1 0.8 700 0.5113535 0.7413192
0.005 6 0 0.7 1 0.8 800 0.5114066 0.7417510
0.010 6 0 0.6 1 0.8 400 0.5115077 0.7420321
0.010 6 0 0.5 1 0.8 600 0.5117100 0.7435051
0.010 6 0 0.7 1 0.8 400 0.5118336 0.7413760
0.005 6 0 0.8 1 0.8 800 0.5121735 0.7411278
0.010 6 0 0.8 1 0.8 400 0.5122487 0.7413626
0.010 6 0 0.5 1 0.8 700 0.5126581 0.7429981
0.005 6 0 0.6 1 0.8 500 0.5126721 0.7426076
0.005 6 0 0.7 1 0.8 500 0.5130795 0.7420998
0.010 6 0 0.5 1 0.8 800 0.5132916 0.7425992
0.010 6 0 0.6 1 0.8 500 0.5139359 0.7410707
0.010 6 0 0.7 1 0.8 500 0.5140600 0.7405412
0.005 6 0 0.8 1 0.8 500 0.5141703 0.7412210
0.010 6 0 0.8 1 0.8 500 0.5142871 0.7406593
0.010 6 0 0.6 1 0.8 600 0.5156788 0.7402006
0.010 6 0 0.7 1 0.8 600 0.5157834 0.7396754
0.010 6 0 0.8 1 0.8 600 0.5159319 0.7398495
0.010 6 0 0.6 1 0.8 700 0.5167858 0.7395796
0.010 6 0 0.7 1 0.8 700 0.5169176 0.7390218
0.010 6 0 0.8 1 0.8 700 0.5169871 0.7392428
0.010 6 0 0.6 1 0.8 800 0.5175225 0.7391191
0.010 6 0 0.8 1 0.8 800 0.5176205 0.7388446
0.010 6 0 0.7 1 0.8 800 0.5176778 0.7385113
0.010 6 0 0.5 1 0.8 200 0.5222656 0.7441449
0.005 6 0 0.5 1 0.8 400 0.5224844 0.7442203
0.010 6 0 0.6 1 0.8 200 0.5230201 0.7425382
0.005 6 0 0.6 1 0.8 400 0.5231197 0.7423730
0.005 6 0 0.7 1 0.8 400 0.5234363 0.7420367
0.010 6 0 0.7 1 0.8 200 0.5239433 0.7413699
0.005 6 0 0.8 1 0.8 400 0.5244525 0.7411256
0.010 6 0 0.8 1 0.8 200 0.5245956 0.7408921
0.005 6 0 0.6 1 0.8 300 0.5532924 0.7419453
0.005 6 0 0.7 1 0.8 300 0.5535713 0.7416965
0.005 6 0 0.5 1 0.8 300 0.5541784 0.7427082
0.005 6 0 0.8 1 0.8 300 0.5544383 0.7408167
0.005 6 0 0.6 1 0.8 200 0.6293978 0.7403623
0.005 6 0 0.7 1 0.8 200 0.6296223 0.7400998
0.005 6 0 0.8 1 0.8 200 0.6301154 0.7395130
0.005 6 0 0.5 1 0.8 200 0.6313091 0.7398240

The XGBoost model does very well, but not quite as well as the random forest model.

Test set metrics for XGBoost model.
RMSE      0.5140068
Rsquared  0.6825288
MAE       0.3255674

Comparing models

caret stores the resampled performance metrics for every trained model. We can collect those metrics with the resamples() function and plot them to compare all the models.
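A hedged sketch of that extraction, assuming the hypothetical model object names used in the earlier snippets:

```r
library(caret)

resamps <- resamples(list(
  StepAIC = step_fit,
  GLMnet  = enet_fit,
  PLS     = pls_fit,
  Bagged  = bag_fit,
  RF      = rf_fit,
  XGBoost = xgb_fit
))

summary(resamps)                    # resampled RMSE, R-squared and MAE per model
bwplot(resamps, metric = "RMSE")    # boxplots, as in Figure 9
splom(resamps, metric = "RMSE")     # scatterplot matrix, as in Figure 10
```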

Comparison of resamples

Boxplots

Figure 9. Boxplots of model metrics

It’s clear that the tree-based models do much better than the regression models. This is to be expected as tree-based models work better out of the box most of the time. For linear models to have predictive power equal to tree-based ones, careful feature engineering and domain knowledge must be applied.

Scatterplot

Now we can use scatterplots to compare the tree-based models more explicitly. The points are the resampled RMSE values. Points above the dotted line correspond to the model on the y axis having a higher RMSE and points below the line correspond to the model on the x axis having a higher RMSE.

Figure 10. SPLOM of model RMSE

This scatterplot shows that even though the random forest did best in most resamples, the differences between the models are not large. I believe that with more time spent on parameter tuning, the boosted models could beat the random forest.

Test set performances

Table 6. Test set performance measures.
model RMSE Rsquared MAE
Stepwise AIC 0.7484820 0.3542432 0.5588668
PLS 0.7478039 0.3054730 0.5872181
GLMnet 0.7342858 0.3268801 0.5639280
Bagged trees 0.6007889 0.5583835 0.3801646
GBM 0.5328230 0.5976603 0.3475788
XGBoost 0.5140068 0.6825288 0.3255674
Random forest 0.4887077 0.6970456 0.2916262

Table 6 above shows the test set performance. The random forest with extremely randomized trees did considerably better than the competition. Keep in mind, though, that I hand-picked the parameters for the GBM model based on the graphs, so I may have missed optimal regions of the tuning space. The same can be said for the XGBoost model, which has many more parameters than a random forest.

It is interesting that the stepwise AIC model has a better \(R^2\) and MAE (mean absolute error) than the other linear models, even though its RMSE is higher. RMSE punishes large residuals more than MAE, since it squares them instead of taking the absolute value. This might mean that the stepwise AIC model fits unusual data points worse than the other models, but fits most other data points better. This is a good lesson in the choice of cost function, since different cost functions can give different orderings of models.
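A tiny numeric illustration of that point (made-up residuals, not from the models above): a single large residual inflates RMSE much more than MAE, so two error profiles can be ranked differently by the two metrics.

```r
res_a <- c(0.2, 0.2, 0.2, 0.2, 2.0)   # mostly small errors, one outlier
res_b <- c(0.6, 0.6, 0.6, 0.6, 0.6)   # uniformly moderate errors

rmse <- function(e) sqrt(mean(e^2))
mae  <- function(e) mean(abs(e))

rmse(res_a); mae(res_a)   # RMSE ~ 0.91, MAE = 0.56  -> better MAE, worse RMSE
rmse(res_b); mae(res_b)   # RMSE = 0.60, MAE = 0.60
```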

Summary

In this report I summarized the history of gradient boosting and handpicked parameters for such a model. I then trained various other linear and non-linear models, and compared their resampled performance measures. We saw that decision tree ensemble models do a lot better than linear models out of the box, which makes them a good first angle of attack when one is dealing with a complex dataset and has little time for feature engineering. My main lessons learned from this assignment:

  • Applying the one-standard-deviation rule to train a simpler model without sacrificing much performance.
  • Building intuition for parameter tuning in gradient boosting, particularly XGBoost.
  • Learning about the Extremely Randomized Trees algorithm.
  • Using caret's built-in objects to evaluate resampled performance measures.
  • Seeing a concrete example of the difference between the MAE and RMSE loss metrics.

Hope you enjoyed reading this as much as I enjoyed researching and writing it!