Shrinkage Methods
To find the best model we can use subset selection (the model contains a subset of the predictors) or shrinkage methods (the model contains all \(p\) predictors).
Ridge Regression Basics
Least squares fits a model by minimizing the sum of squared residuals.
\[RSS=\sum_{i=1}^{n}\left(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\right)^2\]
Ridge regression is similar, but it includes another term: \(\lambda\geq0\) is a tuning parameter, and the second term below, \(\lambda\sum_{j=1}^{p}\beta_j^2\), is called the shrinkage penalty.
\[\sum_{i=1}^{n}\left(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\right)^2+\lambda\sum_{j=1}^{p}\beta_j^2\] \[= RSS+\lambda\sum_{j=1}^{p}\beta_j^2\] Minimizing this quantity encourages \(\beta_1,\dotsc,\beta_p\) to be close to zero, so the penalty shrinks the coefficients. The tuning parameter \(\lambda\) controls its impact: \(\lambda=0\) produces the least squares estimates, while \(\lambda\to\infty\) drives the coefficients toward zero.
Ridge regression will produce a different set of coefficient estimates, \(\hat\beta_{\lambda}^R\), for each value of \(\lambda\).
Importance of Scaled Data
Least squares coefficients are scale equivariant: multiplying \(X_j\) (the column of \(x\) values for the \(j^{th}\) predictor) by a constant \(c\) scales \(\hat\beta_j\) (the coefficient estimate for the \(j^{th}\) predictor) by \(1/c\). Therefore, \(X_j\hat\beta_j\) remains the same.
With ridge regression, the coefficient estimates can change substantially when a predictor is multiplied by a constant (for example, measuring dollars in thousands vs. dollars in millions). \(X_j\hat\beta_{j,\lambda}^R\) depends on \(\lambda\) and on the scaling of the \(j^{th}\) predictor. To account for this, the predictors must be standardized:
\[\tilde{x}_{ij}=\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}}\]
The denominator is the estimated standard deviation of the \(j^{th}\) predictor.
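As a brief illustration (a sketch only, assuming x is a numeric matrix of predictors such as the one created later in the ridge regression example), this standardization could be carried out manually in R; note that glmnet() applies it automatically by default.
> #x is assumed to be a numeric predictor matrix
> #population standard deviation (1/n) of each column
> pred.sd=apply(x,2,function(col) sqrt(mean((col-mean(col))^2)))
> #divide each column by its standard deviation
> x.std=scale(x,center=FALSE,scale=pred.sd)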
Ridge Regression vs. Least Squares
Model selection often involves a bias-variance trade-off (for more information on the bias-variance trade-off, see my report on Resampling Methods). As \(\lambda\) increases, the flexibility of the ridge regression fit decreases, which reduces variance at the cost of some additional bias; when the least squares estimates have high variance, ridge regression can therefore achieve a lower test error.
The Lasso
Ridge regression does have some disadvantages: the penalty shrinks all of the coefficients towards zero, but it never sets any of them exactly to zero, so the final model always includes all \(p\) predictors, which can hurt interpretability when \(p\) is large.
The lasso helps overcome this problem. It is similar to ridge regression, but the penalty term uses \(|\beta_j|\) instead of \(\beta_j^2\).
\[\sum_{i=1}^{n}\left(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\right)^2+\lambda\sum_{j=1}^{p}|\beta_j|\] \[= RSS+\lambda\sum_{j=1}^{p}|\beta_j|\]
Like ridge regression it shrinks the coefficients towards zero. However, the lasso allows some of the coefficients to be exactly zero.
The Lasso vs. Ridge Regression
Selecting the Tuning Parameter
We can select a tuning parameter by using cross validation.
Ridge Regression Example
We can use the Hitters dataset to predict a baseball player’s Salary using ridge regression and the lasso. It is important to first remove missing values.
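The output below was presumably produced by commands along these lines (assuming the Hitters data comes from the ISLR package):
> library(ISLR) #assumed source of the Hitters data
> #variable names
> names(Hitters)
> #dimensions of the dataset
> dim(Hitters)
> #number of missing Salary values
> sum(is.na(Hitters$Salary))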
[1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
[7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
[13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
[19] "Salary" "NewLeague"
[1] 322 20
[1] 59
We can see that there are 59 missing values for Salary. They can be removed with the na.omit function.
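A sketch of that step, matching the output below:
> #drop rows with missing values
> Hitters=na.omit(Hitters)
> #check the dimensions and missing count again
> dim(Hitters)
> sum(is.na(Hitters$Salary))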
[1] 263 20
[1] 0
Next, we must load the glmnet package. The glmnet() function is used with an x matrix as well as a y vector. The model.matrix() function can create x by producing a matrix of the 19 predictors; it also transforms any qualitative variables into dummy variables, since glmnet() can only take numerical inputs.
> #Create matrix of 19 predictors
> x=model.matrix(Salary~.,Hitters)[,-1]
> #Create vector of y values
> y=Hitters$Salary
Once we have the values for x and y we can create a grid of lambda values. Using the seq() function I have created a grid of 100 values from \(\lambda=10^{10}\) to \(\lambda=10^{-2}\). The glmnet() function can then be used to perform ridge regression on the selected \(\lambda\) values (if no \(\lambda\) values are specified, it will automatically select a range). By default the variables are automatically standardized so that they are on the same scale; this can be changed with standardize=FALSE.
> library(glmnet) #load package
> #create grid of 100 values
> grid=10^seq(10,-2,length=100)
> #alpha = 0 for ridge regression
> #alpha = 1 for lasso
> ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
For each value of \(\lambda\) there is a vector of coefficients stored in a matrix. In this case it has 20 rows (one for each predictor plus the intercept) and 100 columns (one for each value of \(\lambda\)). They can be accessed with coef().
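The dimensions reported below were likely obtained with a call such as:
> #rows = 20 coefficients, columns = 100 lambda values
> dim(coef(ridge.mod))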
[1] 20 100
The coefficients should be smaller when a large value of \(\lambda\) is used. The \(50^{th}\space\lambda\) value is \(11498\), and the \(\ell_2\) norm of the coefficient estimates, \(\sqrt{\sum_{j=1}^{p}\hat\beta_j^2}\), is \(6.36\).
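A sketch of the commands that likely produced the output below:
> #50th lambda value
> ridge.mod$lambda[50]
> #coefficient estimates at that lambda
> coef(ridge.mod)[,50]
> #l2 norm of the coefficients (intercept excluded)
> sqrt(sum(coef(ridge.mod)[-1,50]^2))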
[1] 11497.57
(Intercept) AtBat Hits HmRun Runs
407.356050200 0.036957182 0.138180344 0.524629976 0.230701523
RBI Walks Years CAtBat CHits
0.239841459 0.289618741 1.107702929 0.003131815 0.011653637
CHmRun CRuns CRBI CWalks LeagueN
0.087545670 0.023379882 0.024138320 0.025015421 0.085028114
DivisionW PutOuts Assists Errors NewLeagueN
-6.215440973 0.016482577 0.002612988 -0.020502690 0.301433531
[1] 6.360612
The \(60^{th}\space\lambda\) value is \(705\), and the \(\ell_2\) norm of the coefficient estimates is \(57.1\).
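A similar sketch for the \(60^{th}\) value:
> #60th lambda value
> ridge.mod$lambda[60]
> #coefficient estimates at that lambda
> coef(ridge.mod)[,60]
> #l2 norm of the coefficients (intercept excluded)
> sqrt(sum(coef(ridge.mod)[-1,60]^2))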
[1] 705.4802
(Intercept) AtBat Hits HmRun Runs RBI
54.32519950 0.11211115 0.65622409 1.17980910 0.93769713 0.84718546
Walks Years CAtBat CHits CHmRun CRuns
1.31987948 2.59640425 0.01083413 0.04674557 0.33777318 0.09355528
CRBI CWalks LeagueN DivisionW PutOuts Assists
0.09780402 0.07189612 13.68370191 -54.65877750 0.11852289 0.01606037
Errors NewLeagueN
-0.70358655 8.61181213
[1] 57.11001
To generate the coefficients for a specific value of \(\lambda\), in this case \(50\), the predict() function can be used.
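The call was most likely of this form:
> #coefficients for lambda = 50
> predict(ridge.mod,s=50,type="coefficients")[1:20,]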
(Intercept) AtBat Hits HmRun Runs
4.876610e+01 -3.580999e-01 1.969359e+00 -1.278248e+00 1.145892e+00
RBI Walks Years CAtBat CHits
8.038292e-01 2.716186e+00 -6.218319e+00 5.447837e-03 1.064895e-01
CHmRun CRuns CRBI CWalks LeagueN
6.244860e-01 2.214985e-01 2.186914e-01 -1.500245e-01 4.592589e+01
DivisionW PutOuts Assists Errors NewLeagueN
-1.182011e+02 2.502322e-01 1.215665e-01 -3.278600e+00 -9.496680e+00
In order to estimate the test error we need to split the data into test and training sets.
> # Make replicable
> set.seed(1)
> #random sample of half
> train=sample(1:nrow(x),nrow(x)/2)
> #negative indices exclude the training rows
> test=(-train)
> #response values for the test set
> y.test=y[test]
We can fit a model on the training data and use the predict() function with a selected \(\lambda\) value to calculate the MSE for the test set. With \(\lambda=4\) the \(MSE=101037\).
> #ridge regression on training data
> ridge.mod=glmnet(x[train,],y[train],alpha=0,
+ lambda=grid,thresh = 1e-12)
> #test set predictions for lambda = 4
> ridge.pred=predict(ridge.mod,s=4,newx=x[test,])
> #calculate MSE
> mean((ridge.pred-y.test)^2)
[1] 101036.8
We can use the mean \(y\) value for the training set as our prediction to calculate the MSE for an intercept-only model. The same result can be achieved by using a very large \(\lambda\) value, which shrinks the coefficients to near zero.
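The first value below was presumably computed by predicting the training-set mean for every test observation:
> #MSE when predicting the training mean for each test observation
> mean((mean(y[train])-y.test)^2)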
[1] 193253.1
> #test set predictions for lambda = 1e10
> ridge.pred=predict(ridge.mod,s=1e10,
+ newx=x[test,])
> #calculate MSE
> mean((ridge.pred-y.test)^2)
[1] 193253.1
A \(\lambda\) value of \(0\) will yield the least squares results.
> #Set lambda=0 for least squares
> ridge.pred=predict(ridge.mod,s=0, newx=x[test,],
+ exact=TRUE,x=x[train,],
+ y=y[train])
> #MSE for least squares
> mean((ridge.pred -y.test)^2)
[1] 114783.1
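For comparison, the unpenalized fit shown below was presumably obtained directly with lm(); its coefficients are essentially identical to those returned by predict() with s=0 and exact=TRUE.
> #ordinary least squares on the training data
> lm(y~x,subset=train)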
Call:
lm(formula = y ~ x, subset = train)
Coefficients:
(Intercept) xAtBat xHits xHmRun xRuns xRBI
299.42849 -2.54027 8.36682 11.64512 -9.09923 2.44105
xWalks xYears xCAtBat xCHits xCHmRun xCRuns
9.23440 -22.93673 -0.18154 -0.11598 -1.33888 3.32838
xCRBI xCWalks xLeagueN xDivisionW xPutOuts xAssists
0.07536 -1.07841 59.76065 -98.86233 0.34087 0.34165
xErrors xNewLeagueN
-0.64207 -0.67442
> #Compare to lambda=0
> predict(ridge.mod,s=0,exact=TRUE,
+ type="coefficients",
+ x=x[train,],y=y[train])[1:20,]
(Intercept) AtBat Hits HmRun Runs RBI
299.42883596 -2.54014665 8.36611719 11.64400720 -9.09877719 2.44152119
Walks Years CAtBat CHits CHmRun CRuns
9.23403909 -22.93584442 -0.18160843 -0.11561496 -1.33836534 3.32817777
CRBI CWalks LeagueN DivisionW PutOuts Assists
0.07511771 -1.07828647 59.76529059 -98.85996590 0.34086400 0.34165605
Errors NewLeagueN
-0.64205839 -0.67606314
We can see that ridge regression with \(\lambda=4\) outperformed both the intercept-only model and the least squares model.
Rather than selecting our own \(\lambda\) value, we can use cross-validation to find the best result. In this case it was a \(\lambda\) of \(212\).
> #Make replicable
> set.seed(1)
> #10 fold cross-validation
> cv.out=cv.glmnet(x[train,], y[train], alpha=0,
+ nfolds = 10)
> #plot results
> plot(cv.out)
> #lambda with the minimum cross-validation error
> bestlam=cv.out$lambda.min
> bestlam
[1] 211.7416
The best MSE was \(96016\), which outperformed least squares and the intercept-only model.
> #test set predictions for lambda = 212
> ridge.pred=predict(ridge.mod,s=bestlam,
+ newx=x[test,])
> #calculate MSE
> mean((ridge.pred-y.test)^2)
[1] 96015.51
We can then fit the model on the full dataset and use the optimal \(\lambda\) value to generate the coefficients.
> #generate full model
> out=glmnet(x,y,alpha=0)
> #coefficients for best model
> predict(out,type = "coefficients",
+ s=bestlam)[1:20,]
(Intercept) AtBat Hits HmRun Runs RBI
9.88487157 0.03143991 1.00882875 0.13927624 1.11320781 0.87318990
Walks Years CAtBat CHits CHmRun CRuns
1.80410229 0.13074383 0.01113978 0.06489843 0.45158546 0.12900049
CRBI CWalks LeagueN DivisionW PutOuts Assists
0.13737712 0.02908572 27.18227527 -91.63411282 0.19149252 0.04254536
Errors NewLeagueN
-1.81244470 7.21208394
Lasso Example
We can perform the same analysis for the lasso by changing the function input to alpha=1. We can see from the plot that some coefficients will be exactly zero, depending on the \(\lambda\) value.
> #generate lasso model
> lasso.mod=glmnet(x[train,],y[train],alpha=1,
+ lambda=grid)
> #plot coefficients
> plot(lasso.mod)
Cross-validation can be used to find the best value of \(\lambda\) and the minimum \(MSE\). In this case the best \(MSE\) was \(100743\). It is slightly higher than the best \(MSE\) for ridge regression, which was \(96016\), but it outperformed least squares.
> #make replicable
> set.seed(1)
> # 10 fold CV
> cv.out=cv.glmnet(x[train,],y[train],alpha=1,
+ nfolds=10)
> #lambda for min MSE
> bestlam=cv.out$lambda.min
> #test predictions with best lambda
> lasso.pred=predict(lasso.mod,s=bestlam,
+ newx=x[test,])
> #MSE
> mean((lasso.pred-y.test)^2)
[1] 100743.4
However, the lasso model is much sparser than the ridge regression model. It contains only \(7\) variables, while ridge regression retains all \(19\).
> #full model
> out=glmnet(x,y,alpha=1,lambda=grid)
> #predictions with best lambda
> lasso.coef=predict(out,type="coefficients",
+ s=bestlam)[1:20,]
> #non-zero coefficients
> lasso.coef[lasso.coef!=0]
(Intercept) Hits Walks CRuns CRBI LeagueN
18.5394844 1.8735390 2.2178444 0.2071252 0.4130132 3.2666677
DivisionW PutOuts
-103.4845458 0.2204284