Ridge regression and LASSO using the Hitters dataset

Data set

We will use the data set Hitters, which is bundled in the ISLR2 package.

library(ISLR2)
str(Hitters)
'data.frame':   322 obs. of  20 variables:
 $ AtBat    : int  293 315 479 496 321 594 185 298 323 401 ...
 $ Hits     : int  66 81 130 141 87 169 37 73 81 92 ...
 $ HmRun    : int  1 7 18 20 10 4 1 0 6 17 ...
 $ Runs     : int  30 24 66 65 39 74 23 24 26 49 ...
 $ RBI      : int  29 38 72 78 42 51 8 24 32 66 ...
 $ Walks    : int  14 39 76 37 30 35 21 7 8 65 ...
 $ Years    : int  1 14 3 11 2 11 2 3 2 13 ...
 $ CAtBat   : int  293 3449 1624 5628 396 4408 214 509 341 5206 ...
 $ CHits    : int  66 835 457 1575 101 1133 42 108 86 1332 ...
 $ CHmRun   : int  1 69 63 225 12 19 1 0 6 253 ...
 $ CRuns    : int  30 321 224 828 48 501 30 41 32 784 ...
 $ CRBI     : int  29 414 266 838 46 336 9 37 34 890 ...
 $ CWalks   : int  14 375 263 354 33 194 24 12 8 866 ...
 $ League   : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
 $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
 $ PutOuts  : int  446 632 880 200 805 282 76 121 143 0 ...
 $ Assists  : int  33 43 82 11 40 421 127 283 290 0 ...
 $ Errors   : int  20 10 14 3 4 25 7 9 19 0 ...
 $ Salary   : num  NA 475 480 500 91.5 750 70 100 75 1100 ...
 $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
  • As you can see, Hitters is a data frame with 322 observations (baseball players) on 20 variables: Salary, AtBat, Hits, Runs, …
  • Run ?Hitters to learn more about the data set.

Goal: predict a baseball player’s salary.

mean(Hitters$Salary)
[1] NA
  • The output is NA (Not Available) because some salary values are missing.
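
The missing values are in the Salary variable. According to the ISLR documentation, Salary is missing for 59 players; the quick check below (not run here) counts them:

sum(is.na(Hitters$Salary))  # number of players with a missing Salary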

Let’s remove those rows that have missing data in any variable:

Hitters <- na.omit(Hitters)
mean(Hitters$Salary)
[1] 535.9259
  • The average annual salary is about 535.9 thousand dollars (Salary is recorded in thousands of dollars).

We also load dplyr, which we will use later to manipulate the test data:

library(dplyr)

Training dataset and Test dataset

We will compare the performance of (un-regularized) linear regression, ridge regression, and LASSO models.

Throughout the comparison, we will use the same training dataset and test dataset:

set.seed(1)
train <- sample(1:nrow(Hitters), nrow(Hitters) / 2)  # indices of the training observations
test <- (-train)  # negative indices select the remaining (test) observations

We will use the model.matrix function to create the argument x for glmnet later. One reason to use model.matrix (instead of, say, as.matrix) is that it automatically expands factors into a set of dummy variables. Run ?model.matrix to learn more about this function.

x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary
y.test <- y[test]
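
For example, the two-level factors League, Division and NewLeague appear in x as the dummy columns LeagueN, DivisionW and NewLeagueN, and the [, -1] drops the intercept column that model.matrix adds. You can confirm this by inspecting x (output not shown):

colnames(x)  # 19 predictor columns, with factors expanded into dummy variables
dim(x)       # 263 players by 19 predictors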

Linear Regression

lin.mod <- lm(Salary ~ ., Hitters, subset = train)  # fit on the training observations only
new_data <- Hitters[test, ] |>
    mutate(Salary = NULL)  # drop the response from the test data
lm.pred <- predict(lin.mod, newdata = new_data)
mse_lm <- mean((lm.pred - y.test)^2)  # test mean squared error
mse_lm
[1] 168593.3

Ridge regression

We will use the glmnet package to perform ridge regression and the lasso. Recall that these two methods add an \(\ell^2\) penalty and an \(\ell^1\) penalty, respectively, on the coefficients.
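
Concretely, for a tuning parameter \(\lambda \ge 0\), ridge regression chooses the coefficients to minimize

\[
\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,
\]

while the lasso replaces the penalty term with \(\lambda \sum_{j=1}^{p} |\beta_j|\). In both cases the intercept \(\beta_0\) is not penalized.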

library(glmnet)

We will use the glmnet function, which can fit both ridge regression and lasso models.

The function glmnet has slightly different syntax from the lm/glm functions that we have encountered:

  • We must pass in an x matrix as well as a y vector
  • We do not use the y ~ x syntax

The syntax of the glmnet function is as follows:

glmnet(x, y, 
    alpha = alpha,  # 0 = ridge penalty, 1 = lasso penalty
    lambda = lambda)
  • The argument alpha = 1 gives the lasso penalty, and alpha = 0 gives the ridge penalty.

For different values of \(\lambda\), the estimated coefficients differ. Specifically, as \(\lambda\) increases, the coefficient estimates should get smaller (in terms of their \(\ell^2\) norm). Below we check whether that is true for \(\lambda = 1\) and \(\lambda = 10000\):

coef1 = coef(glmnet(x, y, alpha = 0, lambda = 1))
coef2 = coef(glmnet(x, y, alpha = 0, lambda = 10000))
c(sum(coef1^2), sum(coef2^2)) 
[1]  45021.75 154289.97
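
At first glance the output seems to contradict the claim, since the sum for \(\lambda = 10000\) is larger. The reason is that coef() also returns the intercept, which glmnet does not penalize: as the other coefficients shrink toward zero, the intercept moves toward the sample mean of Salary (about 536), and its square dominates the sum. Dropping the intercept before comparing (output not shown) should give a much smaller sum for \(\lambda = 10000\) than for \(\lambda = 1\):

b1 <- as.vector(coef1)[-1]  # drop the unpenalized intercept (the first entry)
b2 <- as.vector(coef2)[-1]
c(sum(b1^2), sum(b2^2))     # squared l2 norms of the penalized coefficients only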

We use cross-validation to determine the optimal \(\lambda\). We can do this using the cross-validation function cv.glmnet. By default, the function performs ten-fold cross-validation.

set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 0)
plot(cv.out)
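
Here we kept the default of ten folds; the number of folds can be changed with the nfolds argument of cv.glmnet. For example, five-fold cross-validation (not run here) would be:

cv.glmnet(x[train, ], y[train], alpha = 0, nfolds = 5)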

The “best” lambda by cross-validation is stored in cv.out$lambda.min:

cv.out$lambda.min
[1] 326.0828
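
cv.out also stores lambda.1se, the largest \(\lambda\) whose cross-validation error is within one standard error of the minimum; it is a common, more heavily regularized alternative to lambda.min (value not shown here):

cv.out$lambda.1se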

So, we will use cv.out$lambda.min as the tuning parameter. The test error is:

ridge.mod.cv <- glmnet(
    x, y, 
    alpha = 0, # alpha = 0 gives the ridge penalty.
    lambda = 326) # approximately cv.out$lambda.min
ridge.pred <- predict(
    ridge.mod.cv, 
    s = cv.out$lambda.min,
    newx = x[test, ])
mse_ridge = mean((ridge.pred - y.test)^2)
mse_ridge
[1] 119150.1
  • By using Ridge regression, the test error decreases from 168593 to 119150.

LASSO

set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1)
plot(cv.out)

The “best” lambda by cross-validation is:

cv.out$lambda.min
[1] 9.286955

The test error is:

lasso.mod <- glmnet(
    x, y, 
    alpha = 1, # alpha = 1 gives the lasso penalty.
    lambda = cv.out$lambda.min)
lasso.pred <- predict(
    lasso.mod, 
    s = cv.out$lambda.min,
    newx = x[test, ])
mse_lasso = mean((lasso.pred - y.test)^2)
c(mse_lm, mse_ridge, mse_lasso)
[1] 168593.3 119150.1 112193.9

The prediction accuracy of LASSO is similar to that of Ridge. However, the lasso has a substantial advantage over ridge regression in that the resulting coefficient estimates are sparse.

lasso.mod$beta
19 x 1 sparse Matrix of class "dgCMatrix"
                      s0
AtBat         .         
Hits          2.02965136
HmRun         .         
Runs          .         
RBI           .         
Walks         2.24850782
Years         .         
CAtBat        .         
CHits         .         
CHmRun        0.04994886
CRuns         0.22212444
CRBI          0.40183027
CWalks        .         
LeagueN      20.83775664
DivisionW  -116.39019204
PutOuts       0.23768309
Assists       .         
Errors       -0.93567863
NewLeagueN    .         

Here we see that 10 of the 19 coefficient estimates are exactly zero. So the lasso model with \(\lambda\) chosen by cross-validation contains only nine variables:

lasso_coef <- lasso.mod$beta  # coefficient estimates (intercept not included)
lasso_coef[lasso_coef != 0] |> length()  # number of nonzero estimates
[1] 9
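
By contrast, ridge regression does not set coefficient estimates exactly to zero, so the ridge fit from above should report all 19 of them as nonzero (output not shown):

sum(coef(ridge.mod.cv)[-1, ] != 0)  # count of nonzero ridge coefficients, intercept excluded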