Lasso regression is a type of linear regression that adds a regularization penalty to the loss function during training. The penalty is proportional to the sum of the absolute values of the coefficients (the L1 norm), which encourages the model not only to fit the data but also to keep the weights small. This makes Lasso particularly useful for feature selection, because it can drive the coefficients of less important features exactly to zero, effectively removing them from the model.
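
Concretely, for n observations and p predictors, the Lasso fit minimizes an objective of the following form (written here in the parameterization used by glmnet for alpha = 1):

$$
\hat{\beta} = \underset{\beta_0,\,\beta}{\arg\min}\ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
$$

The tuning parameter lambda controls the trade-off: lambda = 0 recovers ordinary least squares, while larger values shrink more coefficients exactly to zero.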

Here’s a step-by-step example of how to implement Lasso regression in R, including how to choose the optimal lambda value and, through it, which input variables the model keeps:

Step 1: Load the Necessary Libraries

We need the glmnet package, which provides functions for fitting generalized linear models via penalized maximum likelihood. Install and load it with:

# install.packages("glmnet")
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8

Step 2: Prepare Data

For this example, let’s create some synthetic data. Assume we’re predicting a response y based on predictors x1, x2, …, x100, but only the first 10 of these are actually informative.

set.seed(42)  # for reproducibility
n <- 100  # number of samples
p <- 100  # number of variables
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 10), rep(0, p-10))  # only first 10 are informative
y <- X %*% beta + rnorm(n)

Step 3: Fit Lasso Model

Use cross-validation to select the lambda that minimizes the mean cross-validated error.

cv.lasso <- cv.glmnet(X, y, alpha=1)  # alpha=1 for Lasso
plot(cv.lasso)

This fits the Lasso model across a range of lambda values and performs 10-fold cross-validation (the default) to assess performance at each one. The plot shows the mean cross-validated squared error as a function of log(lambda).

Step 4: Determine the Best Lambda

We can extract the lambda that gives the minimum mean cross-validated error as follows:

best.lambda <- cv.lasso$lambda.min
print(best.lambda)
## [1] 0.0324792
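
Note that cv.glmnet also stores lambda.1se, the largest lambda whose cross-validated error is within one standard error of the minimum. It is a common, more conservative choice that trades a little accuracy for a sparser model; a minimal sketch:

conservative.lambda <- cv.lasso$lambda.1se  # largest lambda within 1 SE of the minimum CV error
print(conservative.lambda)

The rest of this example uses lambda.min, but substituting lambda.1se typically selects fewer variables.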

Step 5: Re-fit the Lasso Model with the Selected Lambda

Now, use the selected lambda to fit the final model.

lasso.model <- glmnet(X, y, alpha=1, lambda=best.lambda)
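
With the final model fitted, predictions can be generated via predict(); here is a quick sketch on the training matrix (in practice you would evaluate on held-out data):

preds <- predict(lasso.model, newx = X)  # newx must be a matrix, as in glmnet()
mean((y - preds)^2)  # in-sample mean squared error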

Step 6: Investigate the Coefficients

To see which variables were selected (non-zero coefficients), we can do:

coef(lasso.model)
## 101 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept)  0.261556932
## V1           0.878224411
## V2           1.163501203
## V3           0.863586153
## V4           0.761606635
## V5           1.141607402
## V6           0.976442884
## V7           0.846018321
## V8           1.199882133
## V9           0.995539757
## V10          1.170741260
## V11          0.030781462
## V12          .          
## V13          .          
## V14          .          
## V15         -0.030427855
## V16         -0.118873704
## V17          .          
## V18          .          
## V19          .          
## V20          .          
## V21         -0.149071528
## V22         -0.042924447
## V23          .          
## V24          .          
## V25          .          
## V26         -0.005810407
## V27         -0.080204412
## V28          .          
## V29          .          
## V30          .          
## V31          0.085085985
## V32          .          
## V33         -0.171795608
## V34          .          
## V35          0.043480946
## V36         -0.151323362
## V37          .          
## V38          0.038903816
## V39          0.018480756
## V40          .          
## V41          .          
## V42          .          
## V43          .          
## V44          0.038648370
## V45          0.042544526
## V46          0.122027134
## V47          .          
## V48          .          
## V49          0.044273502
## V50          .          
## V51         -0.099218688
## V52          0.105502422
## V53          0.054445439
## V54          0.057229452
## V55          .          
## V56         -0.081061285
## V57          .          
## V58          0.018283616
## V59         -0.054336189
## V60          0.016794284
## V61         -0.219440660
## V62          .          
## V63          .          
## V64          .          
## V65          .          
## V66          0.077581043
## V67         -0.134037701
## V68          0.154062557
## V69          .          
## V70          .          
## V71          .          
## V72          .          
## V73         -0.057622789
## V74          0.019749255
## V75          .          
## V76         -0.037391945
## V77         -0.089953400
## V78          0.186404534
## V79          0.129575177
## V80         -0.164696368
## V81          .          
## V82          .          
## V83          .          
## V84          .          
## V85         -0.113563536
## V86         -0.022514404
## V87          0.084850661
## V88          0.086598338
## V89          0.147645539
## V90          0.228979638
## V91          .          
## V92          .          
## V93          .          
## V94          .          
## V95          0.076066563
## V96          .          
## V97          .          
## V98         -0.075137531
## V99          .          
## V100         .
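
Scanning a 101-row listing by eye is tedious, so the selected variables can also be extracted programmatically. A minimal sketch (coef() returns a sparse dgCMatrix, so we convert it to a dense matrix first):

coefs <- as.matrix(coef(lasso.model))         # dense copy of the coefficient vector
selected <- rownames(coefs)[coefs[, 1] != 0]  # variables with non-zero coefficients
selected <- setdiff(selected, "(Intercept)")  # drop the intercept
length(selected)                              # number of predictors retained

In the output above, the true signal variables V1 through V10 receive the largest coefficients, while a number of noise variables enter with much smaller ones; this is typical when selecting with lambda.min, and choosing lambda.1se would typically prune more of them.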