Executive Summary

Gradient Boosting

Boosting goes back to a question posed by Michael Kearns: can many weak learners be combined into a single strong predictive model? Unlike bagged trees, which average predictions over many fully grown decision trees, boosted ensembles are built from small trees (often stumps) whose main job is to correct the cases that the previous trees in the chain got wrong. In regression, each tree is fit to the current residuals; in classification, the trees are fit to data points that have been reweighted according to how the ensemble has performed on them so far.

The first concrete boosting algorithm was AdaBoost (short for Adaptive Boosting), by Yoav Freund and Robert Schapire. At each iteration, AdaBoost selects the single best feature and creates a decision split on it. It then identifies the data points it got wrong and increases their weights accordingly, so hard-to-classify samples receive more and more priority as the algorithm goes on. Another advantage of this approach is that AdaBoost sidesteps dimensionality problems by considering only one predictor at a time.

A modern extension of these ideas is stochastic gradient boosting, which trains each tree on only a random subset of the data, an important feature given the size of many modern datasets. The best-known implementation of this technique is XGBoost, which is widely used in data science competitions such as those on Kaggle.

The dataset

The Hitters dataset comes with the ISLR package. It contains performance metrics for baseball players from the 1986 season along with their salaries at the start of the 1987 season. The metrics are statistics such as:

  • AtBat: Number of times at bat in 1986
  • CAtBat: Number of times at bat during career
  • Years: Years in major leagues

I will use those covariates to predict the log transform of the players’ salaries in 1987.

Training a boosted model

Parameter tuning

First, a gradient boosted model was trained with the following parameters (a hedged sketch of such a training run is shown after the list):

  • Number of trees: 1000
  • Number of splits: 1
  • Minimum observations in each node after splitting: 10
  • Shrinkage: 20 values in the interval [0.001, 0.1]
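The snippet below is a minimal sketch of what such a run could look like with the gbm package. It is not the report's actual code: the object names (hitters, hitters_train, hitters_test), the 80/20 split, and the seed are placeholders.

```r
library(ISLR)   # provides the Hitters dataset
library(gbm)

hitters <- na.omit(Hitters)                 # drop players with a missing Salary
set.seed(42)                                # placeholder seed
idx           <- sample(nrow(hitters), round(0.8 * nrow(hitters)))
hitters_train <- hitters[idx, ]
hitters_test  <- hitters[-idx, ]

# one gbm fit per shrinkage value, with everything else held fixed
shrinkage_grid <- seq(0.001, 0.1, length.out = 20)
fits <- lapply(shrinkage_grid, function(s) {
  gbm(
    log(Salary) ~ ., data = hitters_train,
    distribution      = "gaussian",
    n.trees           = 1000,
    interaction.depth = 1,      # a single split per tree (stumps)
    n.minobsinnode    = 10,
    shrinkage         = s
  )
})
```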

The learning curve for this model can be seen below.

Figure 1. Learning Curve for shrinkage parameter

We can see from Figure 1 that the training error keeps decreasing even after the testing error reaches its minimum. This is to be expected: beyond that point the model starts to overfit the training data.

  • The testing error reaches a minimum of RMSE = 0.5000069 with a shrinkage of \(\lambda\) = 0.0843684

Now that we’ve established a good shrinkage parameter it would be useful to plot a learning curve for the number of trees trained. If we can reduce the number of trees in the model without causing a large increase in testing error a simpler model would generalize better to new data.

Figure 2. Learning Curve for n.trees parameter. One standard deviation rule shown in red

A common rule when training complex models is to choose the simplest model whose error is within one standard deviation of the best model's. Following that heuristic, we could simplify our model to only 403 trees. Looking at the graph, however, using 479 trees reduces the loss noticeably, so I will choose that number of trees instead.
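The rule itself is simple to apply. The sketch below assumes a data frame `tree_curve` with one row per candidate number of trees and columns RMSE (resampled mean error) and RMSESD (its standard deviation), which is the layout caret's train() results use; these names are assumptions, not the report's code.

```r
# best model and its one-standard-deviation cutoff
best   <- tree_curve[which.min(tree_curve$RMSE), ]
cutoff <- best$RMSE + best$RMSESD

# simplest (smallest) number of trees whose resampled RMSE is within the cutoff
simplest_n_trees <- min(tree_curve$n.trees[tree_curve$RMSE <= cutoff])
```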

Our final model parameters are therefore:

  • Number of trees: 479
  • Number of splits: 1
  • Minimum observations in each node after splitting: 10
  • Shrinkage: 0.0843684

Test set metrics for GBM model.
RMSE      0.5328230
Rsquared  0.5976603
MAE       0.3475788
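Test-set metrics like these can be computed along the following lines. This is a hedged sketch reusing the hypothetical hitters_train/hitters_test split from the earlier snippet; postResample() from caret returns the RMSE, R-squared and MAE triplet reported above.

```r
library(gbm)
library(caret)   # for postResample()

final_gbm <- gbm(
  log(Salary) ~ ., data = hitters_train,
  distribution      = "gaussian",
  n.trees           = 479,
  interaction.depth = 1,
  n.minobsinnode    = 10,
  shrinkage         = 0.0843684
)

preds <- predict(final_gbm, newdata = hitters_test, n.trees = 479)
postResample(pred = preds, obs = log(hitters_test$Salary))   # RMSE, Rsquared, MAE
```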

Variable importance

When training complex models like gradient boosted trees, there is no easy way to see how changes in the different covariates affect our predictions. We can, however, find out which variables were most important to the model: the training algorithm keeps track of the total reduction in loss attributed to each variable across all iterations.
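With gbm, these totals are exposed through the summary() method as relative influence, which is what Figure 3 plots; `final_gbm` is the hypothetical model object from the sketch above.

```r
library(gbm)

# relative influence: each variable's share of the total reduction in squared error
rel_inf <- summary(final_gbm, n.trees = 479, plotit = FALSE)
head(rel_inf)   # a data frame with columns var and rel.inf
```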

Figure 3. Variable Importance for Gradient Boosted Model

Other models

For the rest of this assignment I used the caret package by Max Kuhn. The data was first preprocessed by centering and scaling the numeric variables with the means and standard deviations of the training set, and then one-hot encoding the categorical variables. The models were then trained and evaluated using 30 bootstrap resamples.
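A hedged sketch of this shared setup is shown below. The exact code in the report may differ: hitters_train and hitters_test are the hypothetical split from the first snippet, and while the report scales the numeric columns before encoding, here preProcess is requested per model for brevity.

```r
library(caret)

# one-hot encode the factors (League, Division, NewLeague); dummyVars drops the response
dmy     <- dummyVars(Salary ~ ., data = hitters_train)
x_train <- as.data.frame(predict(dmy, hitters_train))
x_test  <- as.data.frame(predict(dmy, hitters_test))
y_train <- log(hitters_train$Salary)
y_test  <- log(hitters_test$Salary)

# 30 bootstrap resamples, reused by every model below
ctrl <- trainControl(method = "boot", number = 30)
```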

I will show basic summaries and learning curves for each model before comparing them all in the next section.

Linear Regression Models

Stepwise Feature Selection

Basic linear regression is always a good place to start. For each bootstrap resample, features were chosen by stepwise selection on the model's AIC score.
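One way to set this up is caret's "lmStepAIC" method, which wraps MASS::stepAIC; below is a hedged sketch using the hypothetical x_train/y_train/ctrl objects from the setup snippet.

```r
library(caret)

step_fit <- train(
  x = x_train, y = y_train,
  method     = "lmStepAIC",
  preProcess = c("center", "scale"),   # training-set means and SDs only
  trControl  = ctrl,
  trace      = FALSE                   # silence stepAIC's per-step output
)
```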

Coefficients for stepwise AIC model.
term estimate p.value
(Intercept) -0.08 0.22
AtBat -0.51 0.01
Hits 0.72 0.00
Walks 0.25 0.00
Years 0.35 0.00
CRuns 0.45 0.01
CWalks -0.30 0.07
Division.E 0.16 0.08
PutOuts 0.14 0.00

Test set metrics for stepwise AIC model.
RMSE      0.7484820
Rsquared  0.3542432
MAE       0.5588668

The linear model does not do as well as the boosted model, but that was expected. The main attraction of linear models lies not in predictive power but in ease of explanation and inference.

Generalized Linear Model with Elastic-Net regularization

Elastic net regularization applies a mixture of the LASSO and ridge penalties. In the glmnet parameterization the objective is:

  • \(RSS + \lambda \left[ \frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1 \right]\)
  • \(\alpha = 0\): ridge
  • \(\alpha = 1\): LASSO

I trained linear regression models with elastic net regularization over the following parameter grid (a sketch of the corresponding caret call follows the list):

  • \(\alpha\): 8 values in the interval [0, 1]
  • \(\lambda\): 4 values in [0, 1] and 4 values in [1, 10]
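Below is a hedged sketch of that grid with caret's "glmnet" method, reusing the hypothetical x_train/y_train/ctrl objects from the setup snippet; the lambda values reproduce the 0, 1/3, 2/3, 1, 4, 7, 10 sequence seen in Table 1.

```r
library(caret)

enet_grid <- expand.grid(
  alpha  = seq(0, 1, length.out = 8),
  lambda = unique(c(seq(0, 1, length.out = 4), seq(1, 10, length.out = 4)))
)

enet_fit <- train(
  x = x_train, y = y_train,
  method     = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid   = enet_grid,
  trControl  = ctrl
)
```
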
Figure 4. Learning curve for Elastic Net regression model.

Table 1. Parameters and performance of GLM models.
alpha lambda RMSE Rsquared
0.0000000 0.3333333 0.6864489 0.5425118
0.0000000 0.6666667 0.6869059 0.5438826
0.0000000 1.0000000 0.6895744 0.5440683
0.1428571 0.3333333 0.6913749 0.5407689
0.0000000 0.0000000 0.6921881 0.5364441
0.2857143 0.3333333 0.7052646 0.5323518
0.1428571 0.0000000 0.7095344 0.5225255
0.8571429 0.0000000 0.7096992 0.5223151
0.2857143 0.0000000 0.7099560 0.5220471
0.7142857 0.0000000 0.7099564 0.5220970
1.0000000 0.0000000 0.7100112 0.5218291
0.5714286 0.0000000 0.7100152 0.5219730
0.4285714 0.0000000 0.7102029 0.5218297
0.1428571 0.6666667 0.7113726 0.5332227
0.4285714 0.3333333 0.7241796 0.5205558
0.0000000 4.0000000 0.7323358 0.5422285
0.1428571 1.0000000 0.7375901 0.5244927
0.5714286 0.3333333 0.7452068 0.5080999
0.2857143 0.6666667 0.7534942 0.5116360
0.7142857 0.3333333 0.7671921 0.4964745
0.0000000 7.0000000 0.7733363 0.5410534
0.8571429 0.3333333 0.7910274 0.4820150
0.4285714 0.6666667 0.8019354 0.4892019
0.0000000 10.0000000 0.8056467 0.5403563
0.2857143 1.0000000 0.8098917 0.4919123
1.0000000 0.3333333 0.8148975 0.4646329
0.5714286 0.6666667 0.8561400 0.4583072
0.4285714 1.0000000 0.8920165 0.4431545
0.7142857 0.6666667 0.9118331 0.4156143
0.8571429 0.6666667 0.9628013 0.3750014
0.5714286 1.0000000 0.9684443 0.3804526
0.1428571 4.0000000 0.9844739 0.3874520
1.0000000 0.6666667 0.9992911 0.3441595
0.7142857 1.0000000 1.0076199 0.3217853
0.1428571 7.0000000 1.0089234 NaN
0.1428571 10.0000000 1.0089234 NaN
0.2857143 4.0000000 1.0089234 NaN
0.2857143 7.0000000 1.0089234 NaN
0.2857143 10.0000000 1.0089234 NaN
0.4285714 4.0000000 1.0089234 NaN
0.4285714 7.0000000 1.0089234 NaN
0.4285714 10.0000000 1.0089234 NaN
0.5714286 4.0000000 1.0089234 NaN
0.5714286 7.0000000 1.0089234 NaN
0.5714286 10.0000000 1.0089234 NaN
0.7142857 4.0000000 1.0089234 NaN
0.7142857 7.0000000 1.0089234 NaN
0.7142857 10.0000000 1.0089234 NaN
0.8571429 1.0000000 1.0089234 NaN
0.8571429 4.0000000 1.0089234 NaN
0.8571429 7.0000000 1.0089234 NaN
0.8571429 10.0000000 1.0089234 NaN
1.0000000 1.0000000 1.0089234 NaN
1.0000000 4.0000000 1.0089234 NaN
1.0000000 7.0000000 1.0089234 NaN
1.0000000 10.0000000 1.0089234 NaN
Test set metrics for GLM model.
RMSE      0.7342858
Rsquared  0.3268801
MAE       0.5639280

Partial Least Squares Regression

Partial Least Squares reduces the predictors to a small set of components via supervised dimension reduction and fits a linear regression on those components. I expect this method to come in handy on this dataset, since there are probably several collinear variables; AtBat and CAtBat are an obvious example.
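A hedged sketch of the corresponding caret call (method "pls"), again reusing the hypothetical setup objects; Table 2 covers ncomp from 1 to 18, which tuneLength = 18 reproduces.

```r
library(caret)

pls_fit <- train(
  x = x_train, y = y_train,
  method     = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 18,        # ncomp = 1, ..., 18
  trControl  = ctrl
)
```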

Figure 5. Learning curve for PLS model

Table 2. Parameters and metrics of PLS model.
ncomp RMSE Rsquared
1 0.6992334 0.5262258
2 0.7141601 0.5112001
4 0.7170441 0.5091373
3 0.7176150 0.5067679
12 0.7228114 0.5076451
11 0.7237804 0.5060740
13 0.7240241 0.5068642
10 0.7251288 0.5037276
8 0.7260542 0.5016029
9 0.7263278 0.5026390
14 0.7271375 0.5040421
5 0.7275464 0.4968396
6 0.7296176 0.4953634
15 0.7302571 0.5010987
16 0.7329537 0.4986845
7 0.7335040 0.4911485
18 0.7355082 0.4952366
17 0.7357283 0.4951865

It is a bit of a surprise that the optimal number of components was only 1, but it makes sense given that many of the variables capture similar attributes of the players.

Test set metrics for PLS model.
RMSE      0.7478039
Rsquared  0.3054730
MAE       0.5872181

Tree-based models

Bagged trees

I also trained basic bagged decision tree models, varying the complexity control parameters.
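Below is a minimal sketch using caret's "treebag" method (bagged CART trees via ipred/rpart); the report's complexity-control settings are not listed, so none are shown here, and the object names follow the earlier setup snippet.

```r
library(caret)

bag_fit <- train(
  x = x_train, y = y_train,
  method    = "treebag",
  trControl = ctrl
)
```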

Table 3. Bootstrap metrics of simple bagged tree model.
RMSE Rsquared
0.5453458 0.7015712
Figure 6. One of the simple bagged decision trees.

As soon as we move from linear regression models to ensembles of decision trees, performance improves markedly.

Test set metrics for bagged tree model.
RMSE      0.6007889
Rsquared  0.5583835
MAE       0.3801646

Random Forest

Random forests have been hailed as a close second to gradient boosted trees for most tasks. Here I train models over a grid of parameters. ranger, a relatively new package (callable from caret), allows fast parallel training of random forest models.
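A hedged sketch of the grid behind Table 4 (mtry from 2 to 22, both split rules, minimum node size 10), fit through caret's "ranger" method with the hypothetical setup objects; 22 is the number of predictor columns after one-hot encoding.

```r
library(caret)
library(ranger)

rf_grid <- expand.grid(
  mtry          = 2:22,
  splitrule     = c("variance", "extratrees"),
  min.node.size = 10
)

rf_fit <- train(
  x = x_train, y = y_train,
  method    = "ranger",
  tuneGrid  = rf_grid,
  trControl = ctrl
)
```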

Figure 7. Learning curve for Random Forest model

Table 4. Metrics of random forest model.
mtry splitrule min.node.size RMSE Rsquared
18 extratrees 10 0.5142004 0.7366637
19 extratrees 10 0.5144329 0.7364852
20 extratrees 10 0.5145276 0.7360559
14 extratrees 10 0.5145921 0.7360603
12 extratrees 10 0.5147796 0.7360459
21 extratrees 10 0.5148119 0.7360844
22 extratrees 10 0.5149363 0.7359524
15 extratrees 10 0.5150696 0.7359179
16 extratrees 10 0.5151400 0.7357276
17 extratrees 10 0.5153332 0.7355778
11 extratrees 10 0.5155171 0.7354303
13 extratrees 10 0.5158131 0.7351703
10 extratrees 10 0.5163732 0.7348576
3 variance 10 0.5175551 0.7335869
9 extratrees 10 0.5177270 0.7338229
4 variance 10 0.5178438 0.7332150
8 extratrees 10 0.5183110 0.7334427
5 variance 10 0.5188329 0.7325326
7 extratrees 10 0.5191696 0.7328366
6 variance 10 0.5191703 0.7320587
2 variance 10 0.5211176 0.7319268
7 variance 10 0.5213790 0.7305028
6 extratrees 10 0.5223729 0.7305640
8 variance 10 0.5236173 0.7285955
9 variance 10 0.5249420 0.7274079
10 variance 10 0.5253124 0.7273529
5 extratrees 10 0.5259155 0.7285022
11 variance 10 0.5267719 0.7258594
12 variance 10 0.5276699 0.7253682
13 variance 10 0.5286305 0.7244939
14 variance 10 0.5300395 0.7233127
15 variance 10 0.5306937 0.7227515
4 extratrees 10 0.5325198 0.7239077
17 variance 10 0.5328872 0.7208362
16 variance 10 0.5329004 0.7206066
18 variance 10 0.5338285 0.7200228
19 variance 10 0.5339943 0.7199321
20 variance 10 0.5355279 0.7185848
21 variance 10 0.5365520 0.7176272
22 variance 10 0.5376347 0.7166062
3 extratrees 10 0.5422478 0.7184310
2 extratrees 10 0.5682866 0.7028899

We get the best performance when splitting with Extremely Randomized Trees (Extra-Trees). Extra-Trees work similarly to random forests but differ on two levels (a hedged ranger call illustrating these settings follows the list):

  • They do not perform bootstrapping; every tree sees the full training set.
  • They still choose a random subset of predictors at each node, but instead of searching for the optimal cut point they draw random split points on those predictors and keep the best of them.
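To make those two ingredients explicit, here is a hedged standalone ranger call with Extra-Trees settings; the caret-fitted model above may use different sampling defaults, and mtry = 18 simply matches the best row of Table 4.

```r
library(ranger)

et_fit <- ranger(
  log(Salary) ~ ., data = hitters_train,
  num.trees       = 500,
  mtry            = 18,
  min.node.size   = 10,
  splitrule       = "extratrees",   # random candidate split points
  replace         = FALSE,          # no bootstrap resampling ...
  sample.fraction = 1               # ... every tree sees the full training set
)
```
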
Test set metrics for random forest model.
RMSE      0.4887077
Rsquared  0.6970456
MAE       0.2916262

Extreme Gradient Boosting with XGBoost

Finally, I trained an XGBoost model to compare with the other models. XGBoost has many hyperparameters that require tuning, but I narrowed the search down to:

  • eta: learning rate = 0.005 or 0.01
  • nrounds: number of boosting iterations = 200, 300, 400, 500, 600, 700, 800
  • colsample_bytree: fraction of predictors sampled per tree = 0.5, 0.6, 0.7 or 0.8

Each tree had a maximum depth of 6 and no additional regularization inside the trees (gamma = 0). Each boosting iteration also used only 80% of the training data (subsample = 0.8). A hedged sketch of this search space is shown below.
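This sketch uses caret's "xgbTree" method, whose seven tuning parameters match the columns of Table 5; the data and control objects are the hypothetical ones from the setup snippet.

```r
library(caret)
library(xgboost)

xgb_grid <- expand.grid(
  nrounds          = seq(200, 800, by = 100),
  eta              = c(0.005, 0.01),
  max_depth        = 6,
  gamma            = 0,                       # no extra regularization inside the trees
  colsample_bytree = c(0.5, 0.6, 0.7, 0.8),
  min_child_weight = 1,
  subsample        = 0.8                      # 80% of the rows per boosting iteration
)

xgb_fit <- train(
  x = x_train, y = y_train,
  method    = "xgbTree",
  tuneGrid  = xgb_grid,
  trControl = ctrl
)
```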

Figure 8. Learning curve for XGBoost model

Table 5. Parameters and performance of XGBoost models.
eta max_depth gamma colsample_bytree min_child_weight subsample nrounds RMSE Rsquared
0.005 6 0 0.5 1 0.8 700 0.5069693 0.7454435
0.005 6 0 0.5 1 0.8 600 0.5073256 0.7454181
0.010 6 0 0.5 1 0.8 300 0.5075768 0.7452551
0.005 6 0 0.5 1 0.8 800 0.5076553 0.7452580
0.010 6 0 0.5 1 0.8 400 0.5080666 0.7450891
0.005 6 0 0.6 1 0.8 600 0.5099644 0.7426192
0.010 6 0 0.5 1 0.8 500 0.5100948 0.7443281
0.005 6 0 0.6 1 0.8 700 0.5101321 0.7424102
0.005 6 0 0.7 1 0.8 600 0.5103285 0.7421742
0.010 6 0 0.6 1 0.8 300 0.5103689 0.7424857
0.005 6 0 0.7 1 0.8 700 0.5105006 0.7419582
0.005 6 0 0.5 1 0.8 500 0.5108443 0.7450646
0.010 6 0 0.7 1 0.8 300 0.5109533 0.7416116
0.005 6 0 0.6 1 0.8 800 0.5112027 0.7420391
0.010 6 0 0.8 1 0.8 300 0.5112833 0.7416963
0.005 6 0 0.8 1 0.8 600 0.5113299 0.7414146
0.005 6 0 0.8 1 0.8 700 0.5113535 0.7413192
0.005 6 0 0.7 1 0.8 800 0.5114066 0.7417510
0.010 6 0 0.6 1 0.8 400 0.5115077 0.7420321
0.010 6 0 0.5 1 0.8 600 0.5117100 0.7435051
0.010 6 0 0.7 1 0.8 400 0.5118336 0.7413760
0.005 6 0 0.8 1 0.8 800 0.5121735 0.7411278
0.010 6 0 0.8 1 0.8 400 0.5122487 0.7413626
0.010 6 0 0.5 1 0.8 700 0.5126581 0.7429981
0.005 6 0 0.6 1 0.8 500 0.5126721 0.7426076
0.005 6 0 0.7 1 0.8 500 0.5130795 0.7420998
0.010 6 0 0.5 1 0.8 800 0.5132916 0.7425992
0.010 6 0 0.6 1 0.8 500 0.5139359 0.7410707
0.010 6 0 0.7 1 0.8 500 0.5140600 0.7405412
0.005 6 0 0.8 1 0.8 500 0.5141703 0.7412210
0.010 6 0 0.8 1 0.8 500 0.5142871 0.7406593
0.010 6 0 0.6 1 0.8 600 0.5156788 0.7402006
0.010 6 0 0.7 1 0.8 600 0.5157834 0.7396754
0.010 6 0 0.8 1 0.8 600 0.5159319 0.7398495
0.010 6 0 0.6 1 0.8 700 0.5167858 0.7395796
0.010 6 0 0.7 1 0.8 700 0.5169176 0.7390218
0.010 6 0 0.8 1 0.8 700 0.5169871 0.7392428
0.010 6 0 0.6 1 0.8 800 0.5175225 0.7391191
0.010 6 0 0.8 1 0.8 800 0.5176205 0.7388446
0.010 6 0 0.7 1 0.8 800 0.5176778 0.7385113
0.010 6 0 0.5 1 0.8 200 0.5222656 0.7441449
0.005 6 0 0.5 1 0.8 400 0.5224844 0.7442203
0.010 6 0 0.6 1 0.8 200 0.5230201 0.7425382
0.005 6 0 0.6 1 0.8 400 0.5231197 0.7423730
0.005 6 0 0.7 1 0.8 400 0.5234363 0.7420367
0.010 6 0 0.7 1 0.8 200 0.5239433 0.7413699
0.005 6 0 0.8 1 0.8 400 0.5244525 0.7411256
0.010 6 0 0.8 1 0.8 200 0.5245956 0.7408921
0.005 6 0 0.6 1 0.8 300 0.5532924 0.7419453
0.005 6 0 0.7 1 0.8 300 0.5535713 0.7416965
0.005 6 0 0.5 1 0.8 300 0.5541784 0.7427082
0.005 6 0 0.8 1 0.8 300 0.5544383 0.7408167
0.005 6 0 0.6 1 0.8 200 0.6293978 0.7403623
0.005 6 0 0.7 1 0.8 200 0.6296223 0.7400998
0.005 6 0 0.8 1 0.8 200 0.6301154 0.7395130
0.005 6 0 0.5 1 0.8 200 0.6313091 0.7398240

The XGBoost model does very well, but not quite as well as the random forest model.

Test set metrics for XGBoost model.
RMSE      0.5140068
Rsquared  0.6825288
MAE       0.3255674

Comparing models

caret stores the resampled performance metrics for every trained model. We can collect those metrics with the resamples() function and plot them to compare all the models.
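A hedged sketch of that extraction, assuming the hypothetical model object names used in the earlier snippets:

```r
library(caret)

resamps <- resamples(list(
  StepAIC = step_fit,
  GLMnet  = enet_fit,
  PLS     = pls_fit,
  Bagged  = bag_fit,
  RF      = rf_fit,
  XGBoost = xgb_fit
))

summary(resamps)                    # resampled RMSE, R-squared and MAE per model
bwplot(resamps, metric = "RMSE")    # boxplots, as in Figure 9
splom(resamps, metric = "RMSE")     # scatterplot matrix, as in Figure 10
```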

Comparison of resamples

Boxplots

Figure 9. Boxplots of model metrics

It’s clear that the tree-based models do much better than the regression models. This is to be expected as tree-based models work better out of the box most of the time. For linear models to have predictive power equal to tree-based ones, careful feature engineering and domain knowledge must be applied.

Scatterplot

Now we can use scatterplots to compare the tree-based models more explicitly. The points are the resampled RMSE values. Points above the dotted line correspond to the model on the y axis having a higher RMSE and points below the line correspond to the model on the x axis having a higher RMSE.

Figure 10. SPLOM of model RMSE

This scatterplot shows that even though the random forest did best in most resamples, the differences between the models are not large. I believe that with more time spent on parameter tuning, the boosted models could beat the random forest.

Test set performances

Table 6. Test set performance measures.
model RMSE Rsquared MAE
Stepwise AIC 0.7484820 0.3542432 0.5588668
PLS 0.7478039 0.3054730 0.5872181
GLMnet 0.7342858 0.3268801 0.5639280
Bagged trees 0.6007889 0.5583835 0.3801646
GBM 0.5328230 0.5976603 0.3475788
XGBoost 0.5140068 0.6825288 0.3255674
Random forest 0.4887077 0.6970456 0.2916262

Table 6 above shows the test set performance. The random forest with extremely randomized trees did considerably better than the competition. Keep in mind, though, that I hand-picked the parameters for the GBM model based on the graphs, so I may have missed optimal regions of the tuning space. The same can be said for the XGBoost model, which has many more parameters than a random forest.

It is interesting that the stepwise AIC model has a better \(R^2\) and MAE (mean absolute error) than the other linear models, even though its RMSE is higher. RMSE punishes large residuals more than MAE, since it squares them instead of taking the absolute value. This might mean that the stepwise AIC model fits unusual data points worse than the other models, but fits most other data points better. This is a good lesson in the choice of cost function, since different cost functions can give different orderings of models.
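A tiny numeric illustration of that point (made-up residuals, not from the models above): a single large residual inflates RMSE much more than MAE, so two error profiles can be ranked differently by the two metrics.

```r
res_a <- c(0.2, 0.2, 0.2, 0.2, 2.0)   # mostly small errors, one outlier
res_b <- c(0.6, 0.6, 0.6, 0.6, 0.6)   # uniformly moderate errors

rmse <- function(e) sqrt(mean(e^2))
mae  <- function(e) mean(abs(e))

rmse(res_a); mae(res_a)   # RMSE ~ 0.91, MAE = 0.56  -> better MAE, worse RMSE
rmse(res_b); mae(res_b)   # RMSE = 0.60, MAE = 0.60
```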

Summary

In this report I summarized the history of gradient boosting and handpicked parameters for such a model. I then trained various other linear and non-linear models, and compared their resampled performance measures. We saw that decision tree ensemble models do a lot better than linear models out of the box, which makes them a good first angle of attack when one is dealing with a complex dataset and has little time for feature engineering. My main lessons learned from this assignment:

  • Applying the one-standard-deviation rule to train a simpler model without sacrificing much performance.
  • Building intuition for parameter tuning in gradient boosting, particularly XGBoost.
  • Learning about the Extremely Randomized Trees algorithm.
  • Using caret's built-in objects to evaluate resampled performance measures.
  • Seeing a concrete example of the difference between the MAE and RMSE loss metrics.

Hope you enjoyed reading this as much as I enjoyed researching and writing it!