Lasso and Ridge Regression

An implementation in R markdown

Paul Jozefek

2020-06-23


Shrinkage Methods


To find the best model we can use subset selection (where the model contains a subset of the predictors) or shrinkage methods (where the model contains all \(p\) predictors).

Ridge Regression Basics


Least squares fits a model by minimizing the sum of squared residuals.

\[RSS=\sum_{i=1}^{n}\left(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\right)^2\]

Ridge regression is similar, but it includes a second term: \(\lambda\geq0\) is a tuning parameter, and the added term, \(\lambda\sum_{j=1}^{p}\beta_j^2\), is called a shrinkage penalty.

\[\sum_{i=1}^{n}\left(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\right)^2+\lambda\sum_{j=1}^{p}\beta_j^2\] \[= RSS+\lambda\sum_{j=1}^{p}\beta_j^2\]

Minimizing this criterion pushes \(\beta_1,\dotsc,\beta_p\) toward zero, so it shrinks the coefficient estimates. The tuning parameter \(\lambda\) controls the impact: \(\lambda=0\) produces the least squares estimates, while \(\lambda\to\infty\) produces coefficients near zero.

Ridge regression will produce a different set of coefficient estimates, \(\hat\beta_{\lambda}^R\), for each value of \(\lambda\).
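To make this criterion concrete, here is a minimal sketch (with hypothetical x, y, beta0, beta, and lambda arguments) that simply evaluates the ridge objective for a given coefficient vector; the glmnet package used in the examples below performs the actual minimization.

> #Illustrative helper: evaluate RSS + lambda*sum(beta^2)
> #for a given coefficient vector (glmnet does the minimizing)
> ridge.objective=function(x,y,beta0,beta,lambda){
+   rss=sum((y-beta0-x%*%beta)^2)
+   rss+lambda*sum(beta^2)
+ }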

Importance of Scaled Data


Least squares coefficients are scale equivariant. Multiplying \(X_j\) (the column of \(x\) values for the \(j^{th}\) predictor) by a constant \(c\) scales \(\hat\beta_j\) (the coefficient estimate for the \(j^{th}\) predictor) by \(1/c\). Therefore, \(X_j\hat\beta_j\) remains the same.
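As a quick illustration with simulated data (not the Hitters data used later), rescaling a predictor in lm() simply rescales its coefficient by the inverse factor:

> #simulated predictor and response (illustrative only)
> set.seed(1)
> x1=rnorm(100)
> y1=2*x1+rnorm(100)
> #coefficient on the original scale (roughly 2)
> coef(lm(y1~x1))["x1"]
> #predictor multiplied by 1000: coefficient is roughly 2/1000
> coef(lm(y1~I(1000*x1)))[2]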

With ridge regression, the coefficient estimates can change substantially when a predictor is multiplied by a constant (for example, dollars in thousands vs. dollars in millions). \(X_j\hat\beta_{j,\lambda}^R\) will depend on \(\lambda\) and on the scaling of the \(j^{th}\) predictor. To account for this, the predictors must be standardized:

\[\tilde{x}_{ij}=\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}}\]

The denominator is the estimated standard deviation of the \(j^{th}\) predictor.
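The sketch below (a hypothetical helper, not needed for the examples) applies this standardization by dividing each column by its \(1/n\) standard deviation; glmnet performs the same standardization internally by default.

> #divide each column by its 1/n standard deviation
> #(illustrative; glmnet standardizes internally by default)
> standardize.cols=function(X){
+   apply(X,2,function(col) col/sqrt(mean((col-mean(col))^2)))
+ }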

Ridge Regression vs. Least Squares


Model selection often involves a bias-variance trade-off (for more information on the bias-variance trade-off, see my report on Resampling Methods). As \(\lambda\) increases, the flexibility of the ridge regression fit decreases, which lowers variance but increases bias; when the trade-off is favorable, ridge regression can outperform least squares.

The Lasso


Ridge regression does have a disadvantage: the penalty shrinks all of the coefficients toward zero but never sets any of them exactly to zero, so the final model always includes all \(p\) predictors, which can hurt interpretability.

The lasso helps overcome this problem. It is similar to ridge regression, but the penalty term is slightly different: \(|\beta_j|\) instead of \(\beta_j^2\).

\[\sum_{i=1}^{n}\left(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\right)^2+\lambda\sum_{j=1}^{p}|\beta_j|\] \[= RSS+\lambda\sum_{j=1}^{p}|\beta_j|\]

Like ridge regression, it shrinks the coefficients toward zero. However, the lasso allows some of the coefficients to be exactly zero, so it performs variable selection and yields sparser, more interpretable models.
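To see why the lasso can produce exact zeros while ridge regression only shrinks, consider the textbook special case of an orthonormal design: the lasso estimate is the least squares estimate soft-thresholded at \(\lambda/2\), while the ridge estimate is the least squares estimate divided by \(1+\lambda\). The sketch below assumes that special case and uses made-up least squares estimates.

> #hypothetical least squares estimates
> beta.ls=c(-2,-0.5,0.3,1.5)
> #lasso (orthonormal case): soft-threshold at lambda/2
> soft.threshold=function(b,lambda) sign(b)*pmax(abs(b)-lambda/2,0)
> soft.threshold(beta.ls,lambda=1)
[1] -1.5  0.0  0.0  1.0
> #ridge (orthonormal case): proportional shrinkage, never exactly zero
> ridge.shrink=function(b,lambda) b/(1+lambda)
> ridge.shrink(beta.ls,lambda=1)
[1] -1.00 -0.25  0.15  0.75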

The Lasso vs. Ridge Regression


Selecting the Tuning Parameter


We can select the tuning parameter \(\lambda\) using cross-validation.
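Conceptually, \(K\)-fold cross-validation fits the model on \(K-1\) folds for each candidate \(\lambda\), computes the MSE on the held-out fold, and picks the \(\lambda\) with the lowest average MSE. Below is a minimal "by hand" sketch for ridge regression (with hypothetical x, y, and lambdas arguments); in the examples that follow we simply let cv.glmnet() do this for us.

> library(glmnet) #needed for the sketch below
> #illustrative K-fold CV over a grid of lambda values
> cv.lambda=function(x,y,lambdas,K=10){
+   folds=sample(rep(1:K,length.out=nrow(x)))
+   cv.mse=sapply(lambdas,function(l){
+     mean(sapply(1:K,function(k){
+       fit=glmnet(x[folds!=k,],y[folds!=k],alpha=0,lambda=l)
+       pred=predict(fit,s=l,newx=x[folds==k,])
+       mean((pred-y[folds==k])^2)
+     }))
+   })
+   #lambda with the lowest average validation MSE
+   lambdas[which.min(cv.mse)]
+ }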

Ridge Regression Example


We can use the Hitters dataset to predict a baseball player’s Salary using ridge regression and the lasso. It is important to first remove missing values.

> library(ISLR) #load package
> names(Hitters) #column names
 [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
 [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
[13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
[19] "Salary"    "NewLeague"
> dim(Hitters) #Rows and Columns
[1] 322  20
> #Number of NA values for salary
> sum(is.na(Hitters$Salary)) 
[1] 59

We can see that there are 59 missing values for Salary. They can be removed with the na.omit function.

> Hitters=na.omit(Hitters) #Remove NAs
> dim(Hitters) #Rows and Columns
[1] 263  20
> sum(is.na(Hitters)) #Number of NA values
[1] 0

Next, we must load the glmnet package. The glmnet() function takes an x matrix and a y vector. The model.matrix() function can create x by producing a matrix of the 19 predictors; it also transforms any qualitative variables into dummy variables, which matters because glmnet() can only take numerical inputs.

> #Create matrix of 19 predictors
> x=model.matrix(Salary~.,Hitters)[,-1]
> #Create vector of y values
> y=Hitters$Salary

Once we have the values for x and y, we can create a grid of \(\lambda\) values. Using the seq() function, I have created a grid of 100 values from \(\lambda=10^{10}\) to \(\lambda=10^{-2}\). The glmnet() function can then be used to perform ridge regression over the selected \(\lambda\) values (if no \(\lambda\) values are specified, it will automatically select a range). By default the variables are standardized so that they are on the same scale; this can be changed with standardize=FALSE.

> library(glmnet) #load package
> #create grid of 100 values
> grid=10^seq(10,-2,length=100)
> #alpha = 0 for ridge regression
> #alpha = 1 for lasso
> ridge.mod=glmnet(x,y,alpha=0,lambda=grid)

For each value of \(\lambda\) there is a vector of coefficient estimates, stored in a matrix. In this case it has 20 rows (one for each predictor plus the intercept) and 100 columns (one for each value of \(\lambda\)). The matrix can be accessed with coef().

> #coefficient matrix size
> dim(coef(ridge.mod))
[1]  20 100

The coefficient estimates should be smaller when a large value of \(\lambda\) is used. The \(50^{th}\) \(\lambda\) value is \(11498\) and the \(\ell_2\) norm of the coefficients, \(\sqrt{\sum_{j=1}^{p}\hat\beta_j^2}\), is \(6.36\).

> #50th lambda value
> ridge.mod$lambda[50]
[1] 11497.57
> #coefficients for 50th lambda
> coef(ridge.mod)[,50]
  (Intercept)         AtBat          Hits         HmRun          Runs 
407.356050200   0.036957182   0.138180344   0.524629976   0.230701523 
          RBI         Walks         Years        CAtBat         CHits 
  0.239841459   0.289618741   1.107702929   0.003131815   0.011653637 
       CHmRun         CRuns          CRBI        CWalks       LeagueN 
  0.087545670   0.023379882   0.024138320   0.025015421   0.085028114 
    DivisionW       PutOuts       Assists        Errors    NewLeagueN 
 -6.215440973   0.016482577   0.002612988  -0.020502690   0.301433531 
> #l2 norm of the coefficients
> sqrt(sum(coef(ridge.mod)[-1,50]^2))
[1] 6.360612

The \(60^{th}\) \(\lambda\) value is \(705\) and the \(\ell_2\) norm of the coefficients is \(57.1\); as expected, the smaller \(\lambda\) shrinks the coefficients less.

> #60th lambda value
> ridge.mod$lambda[60]
[1] 705.4802
> #coefficients for 60th lambda
> coef(ridge.mod)[,60]
 (Intercept)        AtBat         Hits        HmRun         Runs          RBI 
 54.32519950   0.11211115   0.65622409   1.17980910   0.93769713   0.84718546 
       Walks        Years       CAtBat        CHits       CHmRun        CRuns 
  1.31987948   2.59640425   0.01083413   0.04674557   0.33777318   0.09355528 
        CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists 
  0.09780402   0.07189612  13.68370191 -54.65877750   0.11852289   0.01606037 
      Errors   NewLeagueN 
 -0.70358655   8.61181213 
> #l2 norm of the coefficients
> sqrt(sum(coef(ridge.mod)[-1,60]^2))
[1] 57.11001

To generate the coefficients for a specific value of \(\lambda\), in this case \(\lambda=50\), the predict() function can be used.

> # Coefficients for lambda 50
> predict(ridge.mod,s=50,
+         type="coefficients")[1:20,]
  (Intercept)         AtBat          Hits         HmRun          Runs 
 4.876610e+01 -3.580999e-01  1.969359e+00 -1.278248e+00  1.145892e+00 
          RBI         Walks         Years        CAtBat         CHits 
 8.038292e-01  2.716186e+00 -6.218319e+00  5.447837e-03  1.064895e-01 
       CHmRun         CRuns          CRBI        CWalks       LeagueN 
 6.244860e-01  2.214985e-01  2.186914e-01 -1.500245e-01  4.592589e+01 
    DivisionW       PutOuts       Assists        Errors    NewLeagueN 
-1.182011e+02  2.502322e-01  1.215665e-01 -3.278600e+00 -9.496680e+00 

In order to estimate the test error we need to split the data into test and training sets.

> # Make replicable
> set.seed(1)
> #random sample of half
> train=sample(1:nrow(x),nrow(x)/2)
> #negative indices exclude the training rows
> test=(-train)
> #y values for the test set
> y.test=y[test]

We can fit a model on the training data and then use the predict() function with a selected \(\lambda\) value to calculate the MSE on the test set. With \(\lambda=4\), the test \(MSE=101037\).

> #ridge regression on training data
> ridge.mod=glmnet(x[train,],y[train],alpha=0,
+                  lambda=grid,thresh = 1e-12)
> #test set predictions for lambda = 4
> ridge.pred=predict(ridge.mod,s=4,newx=x[test,])
> #calculate MSE
> mean((ridge.pred-y.test)^2)
[1] 101036.8

We can use the mean \(y\) value for the training set as our prediction to calculate the test MSE for an intercept-only model. The same result can be achieved by using a very large \(\lambda\) value, which shrinks the coefficients to near zero.

> #Use mean for prediction
> mean((mean(y[train])-y.test)^2)
[1] 193253.1
> #test set predictions for lambda = 1e10
> ridge.pred=predict(ridge.mod,s=1e10,
+                    newx=x[test,])
> #calculate MSE
> mean((ridge.pred-y.test)^2)
[1] 193253.1

A \(\lambda\) value of \(0\) will yield the least squares results.

> #Set lambda=0 for least squares
> ridge.pred=predict(ridge.mod,s=0, newx=x[test,],
+                    exact=TRUE,x=x[train,],
+                    y=y[train])
> #MSE for least squares
> mean((ridge.pred -y.test)^2)
[1] 114783.1
> #Linear regression
> lm(y~x, subset=train)

Call:
lm(formula = y ~ x, subset = train)

Coefficients:
(Intercept)       xAtBat        xHits       xHmRun        xRuns         xRBI  
  299.42849     -2.54027      8.36682     11.64512     -9.09923      2.44105  
     xWalks       xYears      xCAtBat       xCHits      xCHmRun       xCRuns  
    9.23440    -22.93673     -0.18154     -0.11598     -1.33888      3.32838  
      xCRBI      xCWalks     xLeagueN   xDivisionW     xPutOuts     xAssists  
    0.07536     -1.07841     59.76065    -98.86233      0.34087      0.34165  
    xErrors  xNewLeagueN  
   -0.64207     -0.67442  
> #Compare to lambda=0
> predict(ridge.mod,s=0,exact=TRUE,
+         type="coefficients",
+         x=x[train,],y=y[train])[1:20,]
 (Intercept)        AtBat         Hits        HmRun         Runs          RBI 
299.42883596  -2.54014665   8.36611719  11.64400720  -9.09877719   2.44152119 
       Walks        Years       CAtBat        CHits       CHmRun        CRuns 
  9.23403909 -22.93584442  -0.18160843  -0.11561496  -1.33836534   3.32817777 
        CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists 
  0.07511771  -1.07828647  59.76529059 -98.85996590   0.34086400   0.34165605 
      Errors   NewLeagueN 
 -0.64205839  -0.67606314 

We can see that ridge regression with \(\lambda=4\) outperformed both the intercept-only model and the least squares model.

Rather than selecting our own \(\lambda\) value, we can use cross-validation to find the best one. In this case the best \(\lambda\) was \(212\).

> #Make replicable
> set.seed(1)
> #10 fold cross-validation
> cv.out=cv.glmnet(x[train,], y[train], alpha=0,
+                  nfolds = 10)
> #plot results
> plot(cv.out)

> #lambda for min MSE
> bestlam=cv.out$lambda.min
> bestlam
[1] 211.7416

The resulting test MSE was \(96016\), which outperformed both least squares and the intercept-only model.

> #test set predictions for lambda = 212
> ridge.pred=predict(ridge.mod,s=bestlam,
+                    newx=x[test,])
> #calculate MSE
> mean((ridge.pred-y.test)^2)
[1] 96015.51

We can then fit the model on the full dataset and use the optimal \(\lambda\) value to generate the coefficients.

> #generate full model
> out=glmnet(x,y,alpha=0)
> #coefficients for best model
> predict(out,type = "coefficients",
+         s=bestlam)[1:20,]
 (Intercept)        AtBat         Hits        HmRun         Runs          RBI 
  9.88487157   0.03143991   1.00882875   0.13927624   1.11320781   0.87318990 
       Walks        Years       CAtBat        CHits       CHmRun        CRuns 
  1.80410229   0.13074383   0.01113978   0.06489843   0.45158546   0.12900049 
        CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists 
  0.13737712   0.02908572  27.18227527 -91.63411282   0.19149252   0.04254536 
      Errors   NewLeagueN 
 -1.81244470   7.21208394 

Lasso Example


We can perform the same steps for the lasso by changing the function argument to alpha=1. We can see from the coefficient plot that some coefficients are exactly zero, depending on the value of \(\lambda\).

> #generate lasso model
> lasso.mod=glmnet(x[train,],y[train],alpha=1,
+                  lambda=grid)
> #plot coefficients
> plot(lasso.mod)

Cross-validation can be used to find the best value of \(\lambda\) and the corresponding test \(MSE\). In this case the best \(MSE\) was \(100743\). It is slightly higher than the best \(MSE\) for ridge regression, which was \(96016\), but it still outperformed least squares.

> #make replicable
> set.seed(1)
> # 10 fold CV
> cv.out=cv.glmnet(x[train,],y[train],alpha=1,
+                  nfolds=10)
> #lambda for min MSE
> bestlam=cv.out$lambda.min
> #test predictions with best lambda
> lasso.pred=predict(lasso.mod,s=bestlam,
+                    newx=x[test,])
> #MSE
> mean((lasso.pred-y.test)^2)
[1] 100743.4

However, the lasso model is much sparser than the ridge regression model. It contains only \(7\) variables, while ridge regression retains all \(19\).

> #full model
> out=glmnet(x,y,alpha=1,lambda=grid)
> #predictions with best lambda
> lasso.coef=predict(out,type="coefficients",
+                    s=bestlam)[1:20,]
> #non-zero coefficients
> lasso.coef[lasso.coef!=0]
 (Intercept)         Hits        Walks        CRuns         CRBI      LeagueN 
  18.5394844    1.8735390    2.2178444    0.2071252    0.4130132    3.2666677 
   DivisionW      PutOuts 
-103.4845458    0.2204284