Shrinkage Methods
To find the best model we can use subset selection (the model contains a subset of the predictors) or shrinkage methods (the model contains all \(p\) predictors).
Ridge Regression Basics
Least squares fits a model by minimizing the sum of squared residuals.
\[RSS=\sum_{i=1}^{n}\left(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\right)^2\]
Ridge regression is similar, but it includes another term: \(\lambda\geq0\) is a tuning parameter, and the second term below, \(\lambda\sum_{j=1}^{p}\beta_j^2\), is called the shrinkage penalty.
\[\sum_{i=1}^{n}\left(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\right)^2+\lambda\sum_{j=1}^{p}\beta_j^2\] \[= RSS+\lambda\sum_{j=1}^{p}\beta_j^2\] Minimizing this quantity encourages \(\beta_1,\dotsc,\beta_p\) to be close to zero, so the penalty shrinks the coefficients. The tuning parameter \(\lambda\) controls its impact: \(\lambda=0\) produces the least squares estimates, while \(\lambda\to\infty\) drives the coefficients toward zero.
Ridge regression will produce a different set of coefficient estimates, \(\hat\beta_{\lambda}^R\), for each value of \(\lambda\).
Importance of Scaled Data
Least squares coefficients are scale equivariant: multiplying \(X_j\) (the column of \(x\) values for the \(j^{th}\) predictor) by a constant \(c\) scales \(\hat\beta_j\) (the coefficient estimate for the \(j^{th}\) predictor) by \(1/c\). Therefore, \(X_j\hat\beta_j\) remains the same.
With ridge regression, the coefficient estimates can change substantially when a predictor is multiplied by a constant (for example, measuring dollars in thousands vs. dollars in millions). \(X_j\hat\beta_{j,\lambda}^R\) depends on \(\lambda\) and on the scaling of the \(j^{th}\) predictor. To account for this, the predictors must be standardized:
\[\tilde{x}_{ij}=\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}}\]
The denominator is the estimated standard deviation of the \(j^{th}\) predictor.
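As a brief illustration (a sketch only, assuming x is a numeric matrix of predictors such as the one created later in the ridge regression example), this standardization could be carried out manually in R; note that glmnet() applies it automatically by default.
> #x is assumed to be a numeric predictor matrix
> #population standard deviation (1/n) of each column
> pred.sd=apply(x,2,function(col) sqrt(mean((col-mean(col))^2)))
> #divide each column by its standard deviation
> x.std=scale(x,center=FALSE,scale=pred.sd)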
Ridge Regression vs. Least Squares
Model selection often involves a bias-variance trade-off (for more information on the bias-variance trade-off, see my report on Resampling Methods). As \(\lambda\) increases, the flexibility of the ridge regression fit decreases, which reduces variance at the cost of some additional bias; when the least squares estimates have high variance, ridge regression can therefore achieve a lower test error.
The Lasso
Ridge regression does have some disadvantages: the penalty shrinks all of the coefficients towards zero, but it never sets any of them exactly to zero, so the final model always includes all \(p\) predictors, which can hurt interpretability when \(p\) is large.
The lasso helps overcome this problem. It is similar to ridge regression, but the penalty term uses \(|\beta_j|\) instead of \(\beta_j^2\).
\[\sum_{i=1}^{n}\left(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\right)^2+\lambda\sum_{j=1}^{p}|\beta_j|\] \[= RSS+\lambda\sum_{j=1}^{p}|\beta_j|\]
Like ridge regression it shrinks the coefficients towards zero. However, the lasso allows some of the coefficients to be exactly zero.
The Lasso vs. Ridge Regression
Selecting the Tuning Parameter
We can select a tuning parameter by using cross validation.
Ridge Regression Example
We can use the Hitters dataset to predict a baseball player’s Salary using ridge regression and the lasso. It is important to first remove missing values.
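The output below was presumably produced by commands along these lines (assuming the Hitters data comes from the ISLR package):
> library(ISLR) #assumed source of the Hitters data
> #variable names
> names(Hitters)
> #dimensions of the dataset
> dim(Hitters)
> #number of missing Salary values
> sum(is.na(Hitters$Salary))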
[1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
[7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
[13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
[19] "Salary" "NewLeague"
[1] 322 20
[1] 59
We can see that there are 59 missing values for Salary. They can be removed with the na.omit function.
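A sketch of that step, matching the output below:
> #drop rows with missing values
> Hitters=na.omit(Hitters)
> #check the dimensions and missing count again
> dim(Hitters)
> sum(is.na(Hitters$Salary))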
[1] 263 20
[1] 0
Next, we must load the glmnet package. The glmnet() function is used with an x matrix as well as a y vector. The model.matrix() function can create x by producing a matrix of the 19 predictors; it also transforms any qualitative variables into dummy variables, since glmnet() can only take numerical inputs.
> #Create matrix of 19 predictors
> x=model.matrix(Salary~.,Hitters)[,-1]
> #Create vector of y values
> y=Hitters$Salary
Once we have the values for x and y we can create a grid of lambda values. Using the seq() function I have created a grid of 100 values from \(\lambda=10^{10}\) to \(\lambda=10^{-2}\). The glmnet() function can then be used to perform ridge regression on the selected \(\lambda\) values (if no \(\lambda\) values are specified, it will automatically select a range). By default the variables are automatically standardized so that they are on the same scale; this can be changed with standardize=FALSE.
> library(glmnet) #load package
> #create grid of 100 values
> grid=10^seq(10,-2,length=100)
> #alpha = 0 for ridge regression
> #alpha = 1 for lasso
> ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
For each value of \(\lambda\) there is a vector of coefficients stored in a matrix. In this case it has 20 rows (one for each predictor plus the intercept) and 100 columns (one for each value of \(\lambda\)). They can be accessed with coef().
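The dimensions reported below were likely obtained with a call such as:
> #rows = 20 coefficients, columns = 100 lambda values
> dim(coef(ridge.mod))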
[1] 20 100
The coefficients should be smaller when a large value of \(\lambda\) is used. The \(50^{th}\space\lambda\) value is \(11498\), and the \(\ell_2\) norm of the coefficient estimates, \(\sqrt{\sum_{j=1}^{p}\hat\beta_j^2}\), is \(6.36\).
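A sketch of the commands that likely produced the output below:
> #50th lambda value
> ridge.mod$lambda[50]
> #coefficient estimates at that lambda
> coef(ridge.mod)[,50]
> #l2 norm of the coefficients (intercept excluded)
> sqrt(sum(coef(ridge.mod)[-1,50]^2))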
[1] 11497.57
(Intercept) AtBat Hits HmRun Runs
407.356050200 0.036957182 0.138180344 0.524629976 0.230701523
RBI Walks Years CAtBat CHits
0.239841459 0.289618741 1.107702929 0.003131815 0.011653637
CHmRun CRuns CRBI CWalks LeagueN
0.087545670 0.023379882 0.024138320 0.025015421 0.085028114
DivisionW PutOuts Assists Errors NewLeagueN
-6.215440973 0.016482577 0.002612988 -0.020502690 0.301433531
[1] 6.360612
The \(60^{th}\space\lambda\) value is \(705\), and the \(\ell_2\) norm of the coefficient estimates is \(57.1\).
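A similar sketch for the \(60^{th}\) value:
> #60th lambda value
> ridge.mod$lambda[60]
> #coefficient estimates at that lambda
> coef(ridge.mod)[,60]
> #l2 norm of the coefficients (intercept excluded)
> sqrt(sum(coef(ridge.mod)[-1,60]^2))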
[1] 705.4802
(Intercept) AtBat Hits HmRun Runs RBI
54.32519950 0.11211115 0.65622409 1.17980910 0.93769713 0.84718546
Walks Years CAtBat CHits CHmRun CRuns
1.31987948 2.59640425 0.01083413 0.04674557 0.33777318 0.09355528
CRBI CWalks LeagueN DivisionW PutOuts Assists
0.09780402 0.07189612 13.68370191 -54.65877750 0.11852289 0.01606037
Errors NewLeagueN
-0.70358655 8.61181213
[1] 57.11001
To generate the coefficients for a specific value of \(\lambda\), in this case \(50\), the predict() function can be used.
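The call was most likely of this form:
> #coefficients for lambda = 50
> predict(ridge.mod,s=50,type="coefficients")[1:20,]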
(Intercept) AtBat Hits HmRun Runs
4.876610e+01 -3.580999e-01 1.969359e+00 -1.278248e+00 1.145892e+00
RBI Walks Years CAtBat CHits
8.038292e-01 2.716186e+00 -6.218319e+00 5.447837e-03 1.064895e-01
CHmRun CRuns CRBI CWalks LeagueN
6.244860e-01 2.214985e-01 2.186914e-01 -1.500245e-01 4.592589e+01
DivisionW PutOuts Assists Errors NewLeagueN
-1.182011e+02 2.502322e-01 1.215665e-01 -3.278600e+00 -9.496680e+00
In order to estimate the test error we need to split the data into test and training sets.
> # Make replicable
> set.seed(1)
> #random sample of half
> train=sample(1:nrow(x),nrow(x)/2)
> #negative indices exclude the training rows
> test=(-train)
> #response values for the test set
> y.test=y[test]
We can fit a model on the training data and use the predict() function with a selected \(\lambda\) value to calculate the MSE for the test set. With \(\lambda=4\) the \(MSE=101037\).
> #ridge regression on training data
> ridge.mod=glmnet(x[train,],y[train],alpha=0,
+ lambda=grid,thresh = 1e-12)
> #test set predictions for lambda = 4
> ridge.pred=predict(ridge.mod,s=4,newx=x[test,])
> #calculate MSE
> mean((ridge.pred-y.test)^2)
[1] 101036.8
We can use the mean \(y\) value for the training set as our prediction to calculate the MSE for an intercept-only model. The same result can be achieved by using a very large \(\lambda\) value, which shrinks the coefficients to near zero.
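The first value below was presumably computed by predicting the training-set mean for every test observation:
> #MSE when predicting the training mean for each test observation
> mean((mean(y[train])-y.test)^2)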
[1] 193253.1
> #test set predictions for lambda = 1e10
> ridge.pred=predict(ridge.mod,s=1e10,
+ newx=x[test,])
> #calculate MSE
> mean((ridge.pred-y.test)^2)
[1] 193253.1
A \(\lambda\) value of \(0\) will yield the least squares results.
> #Set lambda=0 for least squares
> ridge.pred=predict(ridge.mod,s=0, newx=x[test,],
+ exact=TRUE,x=x[train,],
+ y=y[train])
> #MSE for least squares
> mean((ridge.pred -y.test)^2)
[1] 114783.1
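For comparison, the unpenalized fit shown below was presumably obtained directly with lm(); its coefficients are essentially identical to those returned by predict() with s=0 and exact=TRUE.
> #ordinary least squares on the training data
> lm(y~x,subset=train)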
Call:
lm(formula = y ~ x, subset = train)
Coefficients:
(Intercept) xAtBat xHits xHmRun xRuns xRBI
299.42849 -2.54027 8.36682 11.64512 -9.09923 2.44105
xWalks xYears xCAtBat xCHits xCHmRun xCRuns
9.23440 -22.93673 -0.18154 -0.11598 -1.33888 3.32838
xCRBI xCWalks xLeagueN xDivisionW xPutOuts xAssists
0.07536 -1.07841 59.76065 -98.86233 0.34087 0.34165
xErrors xNewLeagueN
-0.64207 -0.67442
> #Compare to lambda=0
> predict(ridge.mod,s=0,exact=TRUE,
+ type="coefficients",
+ x=x[train,],y=y[train])[1:20,]
(Intercept) AtBat Hits HmRun Runs RBI
299.42883596 -2.54014665 8.36611719 11.64400720 -9.09877719 2.44152119
Walks Years CAtBat CHits CHmRun CRuns
9.23403909 -22.93584442 -0.18160843 -0.11561496 -1.33836534 3.32817777
CRBI CWalks LeagueN DivisionW PutOuts Assists
0.07511771 -1.07828647 59.76529059 -98.85996590 0.34086400 0.34165605
Errors NewLeagueN
-0.64205839 -0.67606314
We can see that ridge regression with \(\lambda=4\) outperformed both the intercept-only model and the least squares model.
Rather than selecting our own \(\lambda\) value, we can use cross-validation to find the best result. In this case it was a \(\lambda\) of \(212\).
> #Make replicable
> set.seed(1)
> #10 fold cross-validation
> cv.out=cv.glmnet(x[train,], y[train], alpha=0,
+ nfolds = 10)
> #plot results
> plot(cv.out)
> #lambda with the minimum cross-validation error
> bestlam=cv.out$lambda.min
> bestlam
[1] 211.7416
The best MSE was \(96016\), which outperformed least squares and the intercept-only model.
> #test set predictions for lambda = 212
> ridge.pred=predict(ridge.mod,s=bestlam,
+ newx=x[test,])
> #calculate MSE
> mean((ridge.pred-y.test)^2)
[1] 96015.51
We can then fit the model on the full dataset and use the optimal \(\lambda\) value to generate the coefficients.
> #generate full model
> out=glmnet(x,y,alpha=0)
> #coefficients for best model
> predict(out,type = "coefficients",
+ s=bestlam)[1:20,]
(Intercept) AtBat Hits HmRun Runs RBI
9.88487157 0.03143991 1.00882875 0.13927624 1.11320781 0.87318990
Walks Years CAtBat CHits CHmRun CRuns
1.80410229 0.13074383 0.01113978 0.06489843 0.45158546 0.12900049
CRBI CWalks LeagueN DivisionW PutOuts Assists
0.13737712 0.02908572 27.18227527 -91.63411282 0.19149252 0.04254536
Errors NewLeagueN
-1.81244470 7.21208394
Lasso Example
We can perform the same analysis for the lasso by changing the function input to alpha=1. We can see from the plot that some coefficients will be exactly zero, depending on the \(\lambda\) value.
> #generate lasso model
> lasso.mod=glmnet(x[train,],y[train],alpha=1,
+ lambda=grid)
> #plot coefficients
> plot(lasso.mod)
Cross-validation can be used to find the best value of \(\lambda\) and the minimum \(MSE\). In this case the best \(MSE\) was \(100743\). It is slightly higher than the best \(MSE\) for ridge regression, which was \(96016\), but it outperformed least squares.
> #make replicable
> set.seed(1)
> # 10 fold CV
> cv.out=cv.glmnet(x[train,],y[train],alpha=1,
+ nfolds=10)
> #lambda for min MSE
> bestlam=cv.out$lambda.min
> #test predictions with best lambda
> lasso.pred=predict(lasso.mod,s=bestlam,
+ newx=x[test,])
> #MSE
> mean((lasso.pred-y.test)^2)
[1] 100743.4
However, the lasso model is much sparser than the ridge regression model. It contains only \(7\) variables, while ridge regression retains all \(19\).
> #full model
> out=glmnet(x,y,alpha=1,lambda=grid)
> #predictions with best lambda
> lasso.coef=predict(out,type="coefficients",
+ s=bestlam)[1:20,]
> #non-zero coefficients
> lasso.coef[lasso.coef!=0]
(Intercept) Hits Walks CRuns CRBI LeagueN
18.5394844 1.8735390 2.2178444 0.2071252 0.4130132 3.2666677
DivisionW PutOuts
-103.4845458 0.2204284