This tutorial assumes that you understand the basics of linear regression, as well as some basics of the theory behind machine learning (e.g., the test-train split, cross-validation, and the bias-variance tradeoff). If these topics aren’t familiar to you, you may wish to review them first.
In this tutorial, we will be discussing ridge regression, lasso regression, and elastic net regression, each of which is a form of “regularization.” In essence, regularization is any technique where we add information to the model to reduce overfitting, or to otherwise introduce bias for the sake of reduced variance. With these regularization techniques, we deliberately add a correction to our algorithm that may worsen the fit on our training data, on the theory that the resulting model will describe the real world better than an overfit model would.
Recall that in typical linear regression, we find the equation that minimizes the sum of all of our squared residuals. This equation, therefore, is matched as well as possible to the specific data we use to train the model. In the three regularization methods discussed here, we instead minimize the sum of the squared residuals plus some penalty based on the size of the coefficients. Ridge, lasso, and elastic net regression vary only in how that penalty is calculated. We’ll discuss how these penalties are calculated and the ramifications of these choices below, but for now, let’s set up our R environment and the data we will be analyzing.
We will be using only one library, glmnet. If you do not have this package installed, first run install.packages("glmnet").
library(glmnet)
We’ll now simulate data for our analysis. Why simulate data rather than work with real data? If we know the actual structure of our data, we know what a good model should reveal. Thus, we can see whether the models we will train actually describe our data well.
We’ll start by creating a matrix of predictor variables. We’ll construct a matrix with n = 1000 observations (you can think about these as 1000 participants) and p = 2000 predictors.
set.seed(123)
n <- 1000
p <- 2000
pred <- matrix(rnorm(n*p), nrow = n, ncol = p)
An attentive reader may now be wondering how we can have a model with more predictors than observations. Certainly, this is not something we can do in standard regression. But with regularization methods, we bring additional information into the model in the form of the penalties we introduce, which allows us to have more predictors than observations. In theory, these methods should bring the coefficients of predictors that have little effect on the model to zero, or close to it. This, along with mitigating the risk of overfitting, is one of the great benefits of these methods.
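As a quick illustration of why this fails in ordinary regression, here is a tiny, made-up toy dataset with more predictors than observations; the numbers below are arbitrary and purely illustrative, and lm() simply cannot estimate more coefficients than it has observations.
toy <- data.frame(x1 = c(1, 2, 3), x2 = c(2, 1, 4), x3 = c(5, 3, 2),
                  x4 = c(1, 0, 2), y = c(3, 1, 4))   # 3 observations, 4 predictors
coef(lm(y ~ ., data = toy))   # some coefficients come back NA: too many predictors, too few observations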
We now need to simulate our dependent variable. For the sake of this example, our DV will be tied to our first 25 predictor variables. However, we will assign varying weights so that the outcome variable is predicted by the predictors to differing extents. The other 1,975 predictors will have no true effect on our DV.
dv <- (rowSums(pred[, 1:5]) + 0.8 * rowSums(pred[, 6:10]) +
       0.6 * rowSums(pred[, 11:15]) + 0.4 * rowSums(pred[, 16:20]) +
       0.2 * rowSums(pred[, 21:25]) + rnorm(n))
Let me explain in a bit more detail what the code above does. We created an outcome variable “dv” that is calculated, first, from the sum of predictors 1 through 5. We then add the sum of predictors 6 through 10, but weighted at only 80% of the first five; then the sum of predictors 11 through 15, weighted at 60%; and so forth. Finally, some random noise is added to each observation.
NOTE: it is extremely important that all predictor variables are on the same scale. These regularization methods penalize the size of coefficients by design, and coefficient size depends on the scale of the predictors, so predictors on different scales are enormously problematic. Based on how these data were simulated, we know that our predictors are already on the same scale, but as good practice, we will center and scale them anyway.
pred <- scale(pred)
At this stage, we would normally also perform the other typical pre-processing steps, such as checking for missing data. Since our data were simulated in a way that avoids these issues, we can skip them here, but as with all analyses, such steps are extremely important in a real workflow.
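For instance, a quick check for missing values (which our simulated data will pass trivially) might look like this:
sum(is.na(pred))   # number of missing predictor values; should be 0 here
sum(is.na(dv))     # number of missing outcome values; should be 0 here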
Let’s now split our data (both the predictors and dv) into train and test sets. We’ll go with a 70-30 split.
train_rows <- sample(1:n, 0.7 * n, replace = FALSE)
pred.train <- pred[train_rows,]
dv.train <- dv[train_rows]
pred.test <- pred[-train_rows,]
dv.test <- dv[-train_rows]
We’ll start by modeling a ridge regression. But first, it’s very important to understand what this method actually does. I previously mentioned that the methods we’ll be discussing seek to create an equation that minimizes the sum of the squared residuals, plus some penalty. All that varies between the methods is how the penalty is computed. In ridge regression the penalty is equal to the sum of each coefficient squared, all multiplied by a value “lambda.” That may still be confusing, so let’s look at an example.
Let’s assume we have a relatively simple situation where we are aiming to predict a variable Y from three predictors, X1, X2, and X3. We are aiming to find an equation Y = Int + a*X1 + b*X2 + c*X3, where “Int” is our intercept and “a”, “b”, and “c” are our coefficients. In a typical linear regression, we are simply aiming to find the values of Int, a, b, and c where the sum of squared residuals (SSR) is as small as possible. In Ridge Regression, we introduce an additional penalty λ(a^2 + b^2 + c^2). So rather than just minimizing SSR, we are aiming to find the Int, a, b, and c that minimize SSR + λ(a^2 + b^2 + c^2).
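To make that concrete, here is a minimal sketch of the quantity ridge regression minimizes. The tiny dataset and the lambda value below are made up purely for illustration (they have nothing to do with our simulated data), and the ridge_objective function is just a hypothetical helper, not something we will use later.
toy_x <- cbind(X1 = c(1, 2, 3, 4), X2 = c(2, 1, 0, 3), X3 = c(0, 1, 1, 2))   # 4 toy observations
toy_y <- c(3, 2, 4, 6)   # toy outcome
lambda <- 0.5            # arbitrary penalty strength

ridge_objective <- function(coefs) {
  int   <- coefs[1]                                  # intercept (not penalized)
  betas <- coefs[-1]                                 # a, b, c
  ssr   <- sum((toy_y - (int + toy_x %*% betas))^2)  # sum of squared residuals
  ssr + lambda * sum(betas^2)                        # SSR plus the ridge penalty
}

ridge_objective(c(0, 1, 0.5, -0.2))   # objective value for one candidate (Int, a, b, c)
Ridge regression searches for the Int, a, b, and c that make this quantity as small as possible for a given lambda.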
Hopefully that is now relatively straight-forward, but one big question remains: How do we estimate λ? Well… we don’t. At least not the way we do Int, a, b, and c. Instead, we need to provide the value for λ (or let the software provide decent values for λ). While Int, a, b, and c are model-estimated parameters, λ is instead what we call a hyperparameter. It is a parameter that is used to control the training of the model, rather than one that is output in the model-training process. So how do we choose λ? Typically, we use cross-validation to find the lambda that generates the best-fitting model. Fortunately, the glmnet package can do this automatically.
Now that we understand what ridge regression is doing, let’s generate the model. In the code below, we input our predictors as a matrix into x and our outcome variable as a vector into y. We specify type.measure="mse" to tell the function to evaluate each candidate lambda by its cross-validated mean squared error (minimizing mean squared error is equivalent to minimizing the sum of squared residuals). alpha=0 specifies that this is a ridge regression (more on that argument later). family="gaussian" indicates that we are fitting a Gaussian (normal) model rather than, say, a binomial one (which we would use for logistic regression). Finally, nlambda = 200 tells the function to try 200 different lambda values (the default is 100).
ridge <- cv.glmnet(x=pred.train, y=dv.train, type.measure="mse",
                   alpha=0, family="gaussian", nlambda=200)
We can plot this object to see how each tested lambda value performed:
plot(ridge)
Here the x-axis is the log of λ, while the y-axis is the mean squared error (as a reminder, a lower mean squared error indicates better fit). Two particular values are marked with dotted vertical lines. The first is the lambda value that resulted in the lowest mean squared error (called lambda.min by the package). The second is the largest value of λ whose cross-validated error is within one standard error of that minimum (called lambda.1se). We are often more interested in the latter value than the former, as one goal of these methods is to prevent over-fitting; we are therefore often willing to sacrifice some goodness of fit for the sake of greater regularization.
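Both of these values are stored in the fitted cv.glmnet object, so we can inspect them directly:
ridge$lambda.min                             # lambda with the lowest cross-validated MSE
ridge$lambda.1se                             # largest lambda within one standard error of that minimum
log(c(ridge$lambda.min, ridge$lambda.1se))   # on the log scale used by the plot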
With this in mind, let’s choose the model generated with lambda equal to lambda.1se as the model we want to go with. We can now test this model on our test dataset by computing model-predicted values for the test dataset and calculating the mean squared error.
ridge.predicted <- predict(ridge, s = ridge$lambda.1se, newx = pred.test)
mean((dv.test - ridge.predicted)^2)
## [1] 11.5338
Lasso regression is very similar to ridge regression, with only a slight difference in how the penalty is computed. Rather than λ times the sum of squared coefficients, lasso regression’s penalty is λ times the sum of the absolute values of the coefficients. That’s the entire difference. So, using our earlier example, while ridge regression minimizes SSR + λ(a^2 + b^2 + c^2), lasso minimizes SSR + λ(|a| + |b| + |c|).
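To see the difference numerically, here is a tiny illustration with made-up coefficient values and an arbitrary lambda (none of these numbers come from our models):
coefs  <- c(a = 1.2, b = -0.4, c = 0.05)   # hypothetical coefficients
lambda <- 0.5                              # arbitrary penalty strength
lambda * sum(coefs^2)                      # ridge penalty
lambda * sum(abs(coefs))                   # lasso penalty
Notice that the small coefficient c contributes almost nothing to the ridge penalty once squared, but still contributes its full absolute value to the lasso penalty; this is part of the intuition for why lasso is willing to push small coefficients all the way to zero.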
From this subtle difference in algorithm, however, come some interestingly different properties. Ridge regression tends to punish large coefficients, but it can never actually bring a coefficient to zero; coefficients merely approach zero asymptotically. Lasso regression, on the other hand, can bring coefficients all the way to zero. This can prove very useful, especially if we expect some of our predictors to have no true relationship with our outcome variable. And recall that in our simulated dataset, many of our predictors have no true relationship with the outcome. With this in mind, we may expect lasso to be a better technique for our data. Let’s see if that’s true.
We can set up our model very similarly to our ridge function. We simply need to change the alpha argument to 1.
lasso <- cv.glmnet(x=pred.train, y=dv.train, type.measure="mse",
                   alpha=1, family="gaussian", nlambda=200)
Let’s plot this:
plot(lasso)
And again, we’ll work off the model where λ equals lambda.1se:
lasso.predicted <- predict(lasso, s = lasso$lambda.1se, newx = pred.test)
mean((dv.test - lasso.predicted)^2)
## [1] 1.346554
We see that the lasso regression fits our data much better than the ridge regression does. In this example, this makes perfect sense. As a great many of our predictors have no true relationship with our outcome variable, it is encouraging to see that the model that can move their coefficients to 0 fits the data well.
On that note, we can actually look at the coefficients to see whether any did move to zero. We’ll look at the first 50 coefficients. The first coefficient is the intercept, so we should expect to see positive coefficients of declining magnitude for rows 2-26 (predictors 1 through 25), and null coefficients for the rest:
predict(lasso, type="coef")[1:50,]
## (Intercept) V1 V2 V3 V4 V5
## 0.01854225 0.91384543 0.96932962 0.90212955 0.88949946 0.91534438
## V6 V7 V8 V9 V10 V11
## 0.64525902 0.64690199 0.78218162 0.70563614 0.67867589 0.50756434
## V12 V13 V14 V15 V16 V17
## 0.48380497 0.51905024 0.44261591 0.48508763 0.38262521 0.27904441
## V18 V19 V20 V21 V22 V23
## 0.23697499 0.29552190 0.28291419 0.17610851 0.22649908 0.06632408
## V24 V25 V26 V27 V28 V29
## 0.05844338 0.03063795 0.00000000 0.00000000 0.00000000 0.00000000
## V30 V31 V32 V33 V34 V35
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## V36 V37 V38 V39 V40 V41
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## V42 V43 V44 V45 V46 V47
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## V48 V49
## 0.00000000 0.00000000
And indeed that’s what we find! It appears lasso regression performed quite well here. (Of course, in a real-world example you will not know the exact structure of the data a priori, but it is useful to sanity-check this model in a simulated example where we do know the structure up front).
So hopefully by now you’ve seen that ridge and lasso regression, as fancy as they sound, are not so complicated. And the good news is that elastic net regression is not complicated either… it’s simply a combination of lasso and ridge regression! Both the lasso penalty and the ridge penalty are included in the model, weighted by an additional pre-specified hyperparameter. In glmnet, this weighting hyperparameter is called “alpha”.
So let’s review briefly with our example where we have one outcome variable and three predictors. Recall we are trying to find an equation Y = Int + a*X1 + b*X2 + c*X3:
Simple linear regression aims to find Int, a, b, and c such that the SSR is minimized.
Ridge regression aims to find Int, a, b, and c such that SSR + λ(a^2 + b^2 + c^2) is minimized.
Lasso regression aims to find Int, a, b, and c such that SSR + λ(|a| + |b| + |c|) is minimized.
Our new method, elastic net regression, aims to find Int, a, b, and c such that SSR + (1-α)*λ(a^2 + b^2 + c^2) + α*λ(|a| + |b| + |c|) is minimized.
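Here is the same kind of toy illustration for this combined penalty, again with made-up coefficients, lambda, and alpha (glmnet’s exact internal scaling differs slightly, but the idea is the same):
coefs  <- c(a = 1.2, b = -0.4, c = 0.05)    # hypothetical coefficients
lambda <- 0.5                               # arbitrary penalty strength
alpha  <- 0.5                               # equal mix of ridge and lasso
(1 - alpha) * lambda * sum(coefs^2) +       # ridge component
      alpha * lambda * sum(abs(coefs))      # lasso component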
The biggest challenge of elastic net regression over ridge and lasso regression is simply that we have an additional hyperparameter to tune, α. We can address this complication by simply running our cv.glmnet function with a wide range of alphas. We can do this by running a for loop.
models <- list()
for (i in 0:20) {
  name <- paste0("alpha", i/20)
  models[[name]] <-
    cv.glmnet(pred.train, dv.train, type.measure="mse", alpha=i/20,
              family="gaussian")
}
We have now run cv.glmnet 21 times, each with a different alpha value, at increments of 0.05. We want to explore these models more closely to see which produces the best model fit. We can compute the mean squared error for each model where lambda is equal to lambda.1se. We’ll store these MSEs in a data frame called “results”.
results <- data.frame()
for (i in 0:20) {
  name <- paste0("alpha", i/20)
  ## Use each model to predict the outcome for the test dataset
  predicted <- predict(models[[name]],
                       s=models[[name]]$lambda.1se, newx=pred.test)
  ## Calculate the mean squared error
  mse <- mean((dv.test - predicted)^2)
  ## Store the results
  temp <- data.frame(alpha=i/20, mse=mse, name=name)
  results <- rbind(results, temp)
}
Now we can look at results.
print(results)
## alpha mse name
## 1 0.00 11.664661 alpha0
## 2 0.05 3.306753 alpha0.05
## 3 0.10 2.088946 alpha0.1
## 4 0.15 1.764543 alpha0.15
## 5 0.20 1.628326 alpha0.2
## 6 0.25 1.570717 alpha0.25
## 7 0.30 1.541584 alpha0.3
## 8 0.35 1.511170 alpha0.35
## 9 0.40 1.472381 alpha0.4
## 10 0.45 1.443804 alpha0.45
## 11 0.50 1.422139 alpha0.5
## 12 0.55 1.427541 alpha0.55
## 13 0.60 1.437616 alpha0.6
## 14 0.65 1.400156 alpha0.65
## 15 0.70 1.389806 alpha0.7
## 16 0.75 1.333016 alpha0.75
## 17 0.80 1.420232 alpha0.8
## 18 0.85 1.347585 alpha0.85
## 19 0.90 1.381859 alpha0.9
## 20 0.95 1.355341 alpha0.95
## 21 1.00 1.350625 alpha1
plot(results$alpha, results$mse)
From this, we can see that the lowest test-set MSE occurs at alpha = 0.75, with several nearby alphas (including 0.85) performing almost identically. The differences in fit are relatively small for alphas of roughly 0.1 and above, so any model in that range should yield reasonably good predictions.
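If you would rather pull this out of the data frame programmatically than read it off the table, you can ask for the row with the lowest MSE:
results[which.min(results$mse), ]   # the row (and alpha value) with the lowest test-set MSE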
Lastly, let’s take a look at the coefficients from one of these well-performing models; we’ll use the alpha = 0.85 fit, whose test MSE is nearly as low as the best. As a reminder, based on how our data were generated, 25 of our predictors should have positive coefficients, and the rest should be zero or very small. Let’s look at the first 50 coefficients to see how the model performed.
predict(models[["alpha0.85"]], type = "coef")[1:50,]
## (Intercept) V1 V2 V3 V4 V5
## 0.01857022 0.91127213 0.96851275 0.90122846 0.88750970 0.91339748
## V6 V7 V8 V9 V10 V11
## 0.64376065 0.64537929 0.78093551 0.70672393 0.67775357 0.50736870
## V12 V13 V14 V15 V16 V17
## 0.48351162 0.51973063 0.44297111 0.48493961 0.38232497 0.27898063
## V18 V19 V20 V21 V22 V23
## 0.23821332 0.29568119 0.28390519 0.17803300 0.22827262 0.06938443
## V24 V25 V26 V27 V28 V29
## 0.06216011 0.03185826 0.00000000 0.00000000 0.00000000 0.00000000
## V30 V31 V32 V33 V34 V35
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## V36 V37 V38 V39 V40 V41
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## V42 V43 V44 V45 V46 V47
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## V48 V49
## 0.00000000 0.00000000
Indeed, that’s what we found. It looks like our elastic net yielded a good model!
Ridge regression, lasso regression, and elastic net regression are all simply modifications to basic regression that introduce a penalty, based on the size of coefficients, to reduce the risk of overfitting. The differences between these methods are quite minor, at least superficially. Ridge regression’s penalty is based on the square of the coefficients. Lasso regression’s is based on the absolute value of the coefficients. And elastic net’s penalty is a combination of ridge and lasso regression’s penalties. Each type of model can be run quite simply using the glmnet package in R.
This tutorial was inspired by the amazing YouTube series StatQuest by Josh Starmer.