LASSO vs. Ridge

LASSO and Ridge are regression methods that perform \(variable\) \(selection\) and \(regularization\) to enhance the prediction accuracy and interpretability of a statistical model. In short, they do two things:

Variable selection: identify the important variables in the data that explain major variation in the outcome variable (\(Y\))
Regularization: add a penalty term that shrinks the coefficients of variables (or their alternative specifications, for example, \(X^2\), \(X*Z\), \(log(X)\)) that contribute little toward minimizing the sum of squared errors (SSE); LASSO can shrink some coefficients exactly to zero, thereby reducing the number of features in the model, which is equivalent to \(model\) \(specification\)

LASSO and Ridge are similar to the ordinary least squares (OLS) model, our workhorse linear regression model (\(\texttt{lm()}\)). Both add a penalty term to the usual OLS loss function and minimize objectives of the following forms:

LASSO

\[\begin{equation*} \mathop{\arg\min}_{\beta} \underbrace{\sum_{i=1}^{N}(Y_{i} - X_{i}\beta)^2}_{\mbox{SSE}} + \underbrace{\lambda\sum^{p}_{j=1}|\beta_j|}_{\mbox{penalty}}, \end{equation*}\]

Ridge

\[\begin{equation*} \mathop{\arg\min}_{\beta} \underbrace{\sum_{i=1}^{N}(Y_{i} - X_{i}\beta)^2}_{\mbox{SSE}} + \underbrace{\lambda\sum^{p}_{j=1}{\beta_j}^2}_{\mbox{penalty}}, \end{equation*}\]

for \(i\) in \(N\) observations and \(j\) in \(p\) features.

The only difference between LASSO and Ridge lies in the penalty term: the former uses the “absolute value” of the estimated coefficients, \(|\beta|\), which is also called L1 regularization, whereas the latter uses the “squared magnitude” of the estimated coefficients, \(\beta^2\), which is called L2 regularization.

Variables (and their alternative functional-form specifications) that contribute little toward minimizing the SSE have their \(\beta\)s relegated to the “penalty” part of both models, where they are shrunk according to \(\lambda\); if the SSE barely improves when such variables are included, they are not important “features” of the model.

The \(\lambda\) terms in both LASSO and Ridge regressions \(\textbf{regulate}\) the intensity of the penalty (now you know what the term “regularization” is supposed to mean). The higher (lower) the \(\lambda\), the stronger (weaker) the penalty, so fewer (more) variables are retained during the process. Usually,

\[\begin{equation*} \begin{cases} \mbox{if $\lambda$ $<$ 1 $\rightarrow$ weaker penalty}\\ \mbox{if $\lambda$ $>$ 1 $\rightarrow$ stronger penalty} \end{cases} \end{equation*}\]

Say, for example, \(\lambda\) \(=\) 0.8 imposes a stronger penalty than \(\lambda\) \(=\) 0.2, pushing more estimated coefficients toward zero (and, in LASSO, exactly to zero). When \(\lambda\) \(=\) 0, no penalty takes place and both regressions reduce to the plain-vanilla OLS model.
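To see this concretely, here is a minimal sketch with simulated data (the data and variable names are made up purely for illustration; the \(\texttt{glmnet()}\) function is introduced below): the harsher penalty leaves fewer non-zero coefficients in a LASSO fit.

library(glmnet)

set.seed(123)
X <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)   # 10 candidate predictors
Y <- 2 * X[, 1] - 1 * X[, 2] + rnorm(100)             # only the first two matter

fit <- glmnet(X, Y, alpha = 1)                        # LASSO path over a grid of lambda values

sum(coef(fit, s = 0.2)[-1] != 0)   # milder penalty: more coefficients survive
sum(coef(fit, s = 0.8)[-1] != 0)   # harsher penalty: fewer (possibly none) survive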

It has been shown that the L2 penalty shrinks the coefficients of correlated predictors towards each other, while the L1 penalty tends to pick one of them and discard the rest.
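As a quick illustration of this behavior (again a sketch with simulated data, not part of the examples below), generate two nearly identical predictors and compare how the two penalties allocate their coefficients:

library(glmnet)

set.seed(123)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.01)          # x2 is almost a copy of x1
Y  <- x1 + x2 + rnorm(200)
X  <- cbind(x1, x2)

coef(glmnet(X, Y, alpha = 1), s = 0.1)    # L1 (LASSO): typically keeps one and zeroes out the other
coef(glmnet(X, Y, alpha = 0), s = 0.1)    # L2 (Ridge): splits the effect roughly evenly between the two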

Elastic Net

The elastic-net regularization proposed by Zou and Hastie (2005) takes the middle ground, trying to leverage the strengths of both LASSO and Ridge regression through the \(\alpha\) parameter:

\[\begin{equation*} \mathop{\arg\min}_{\beta} \underbrace{\sum_{i=1}^{N}(Y_{i} - X_{i}\beta)^2}_{\mbox{SSE}} + \lambda\Big[\alpha\underbrace{\sum^{p}_{j=1}|\beta_j|}_{\mbox{L1}} + (1 - \alpha)\underbrace{\sum^{p}_{j=1}{\beta_j}^2}_{\mbox{L2}}\Big], \end{equation*}\]

thus bridging the gap between LASSO (\(\alpha\) \(=\) 1, the default in \(\texttt{glmnet()}\)) and Ridge (\(\alpha\) \(=\) 0). As \(\alpha\) approaches 1 the model behaves like LASSO, and as \(\alpha\) approaches 0 it tilts toward Ridge regression. The choice of \(\alpha\) depends on the trade-off between the strengths and weaknesses of LASSO and Ridge that you are willing to accept for the analytical task at hand.

In \(\texttt{R}\), all the above regressions can be implemented via the \(\texttt{glmnet()}\) function of the \(\texttt{glmnet}\) package or the \(\texttt{train()}\) function inside the \(\texttt{caret}\) package.
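For readers who prefer the \(\texttt{caret}\) workflow, here is a hedged sketch (the tuning grid is illustrative, not a recommendation) of how \(\texttt{train()}\) tunes \(\alpha\) and \(\lambda\) jointly, using the \(\texttt{mtcars}\) data that appears in the next example:

library(caret)

ctrl <- trainControl(method = "cv", number = 10)               # 10-fold cross-validation
enet.caret <- train(mpg ~ ., data = mtcars, method = "glmnet",
                    trControl = ctrl,
                    tuneGrid = expand.grid(alpha  = seq(0, 1, by = 0.25),
                                           lambda = seq(0.01, 1, length.out = 20)))
enet.caret$bestTune   # the (alpha, lambda) pair with the best cross-validated performance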

Example

Let’s work through some examples. First, we load the required packages.
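The chunk that loads the packages is not echoed; judging from the startup messages below, it presumably looks something like this:

library(glmnet)      # LASSO / Ridge / elastic net (loads Matrix)
library(plotmo)      # plot_glmnet() for coefficient-path plots (loads Formula, plotrix)
library(doParallel)  # parallel backend for foreach (loads foreach, iterators, parallel)
library(pROC)        # roc() for ROC / AUC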

## Loading required package: Matrix
## Loaded glmnet 4.1-8
## Warning: package 'plotmo' was built under R version 4.4.3
## Loading required package: Formula
## Loading required package: plotrix
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

We then load the data and implement LASSO and Ridge regressions.

# Load the data
data(mtcars)
names(mtcars) # as usual, check out what's inside the loaded dataframe
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
# Subset on obs w/o missing values
df <- mtcars[complete.cases(mtcars),]

# Data preparation: convert data into matrix format to be fed into glmnet()
X = model.matrix(mpg ~ .-1, data=df) 
Y = df$mpg   # our dependent variable

# Estimate a LASSO model by setting alpha to 1 (which is the default in glmnet())
mtcars.lasso = glmnet(X, Y, alpha = 1)

# The second argument, s (which means "lambda" in glmnet()), picks the lambda from a cross-validated
# glmnet() whose error is within 1 standard error of the error at lambda.min (the lambda value that
# returns the minimum mean cross-validated error)
coef(mtcars.lasso, s = cv.glmnet(X, Y)$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) 33.471391885
## cyl         -0.833427049
## disp         .          
## hp          -0.005886408
## drat         .          
## wt          -2.287815524
## qsec         .          
## vs           .          
## am           .          
## gear         .          
## carb         .
# Only cyl, hp, and wt are important variables at the optimized value of lambda
# Plot it out 
plot(mtcars.lasso, label = TRUE)

# Generate prettier plots using plot_glmnet() from the "plotmo" package
par(mfrow=c(1,2))  # Set up 1 by 2 plotting environment
plot_glmnet(mtcars.lasso, xvar = "lambda", label=5, xlab = expression(paste("log(", lambda, ")")), ylab = expression(beta))   # Select top-5 variables to display on the plot

plot_glmnet(mtcars.lasso, label = 5, xlab = expression(paste("log(", lambda, ")")), ylab = expression(beta))

# Essentially, the graph shows that as the magnitude of the penalty (lambda) increases, the estimated coefficients (beta) of most variables "shrink" in magnitude, so the number of variables that still have non-zero beta coefficients (see the top x-axis) decreases.

# Estimate a Ridge model by setting alpha to 0
mtcars.ridge = glmnet(X, Y, alpha = 0)

# We will do the same for the ridge regression
# It turns out that most variables are retained: ridge shrinks coefficients but rarely sets them exactly to zero
coef(mtcars.ridge, s = cv.glmnet(X, Y)$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) 21.230633081
## cyl         -0.349003729
## disp        -0.004725405
## hp          -0.012060197
## drat         1.039200233
## wt          -1.397529002
## qsec         0.182273348
## vs           0.684926701
## am           1.786060888
## gear         0.560564984
## carb        -0.607373895
plot(mtcars.ridge, label = TRUE)

Let’s demonstrate with a more sophisticated example that integrates the following skills: train-test split, receiver operating characteristic (ROC) curves, parallel processing, cross-validation (CV), for loops, and the elastic net.

# Load the credit count data from our shared Dropbox folder
my_data <- read.csv("https://www.dropbox.com/s/yv6rac6kv8pfebl/credit_count.csv?dl=1")

# How many observations do we have?
dim(my_data)
## [1] 13444    14
# Subset data by cardholder (CARDHLDR==1)
names(my_data)[1] <- "CARDHLDR"  # Rename the first column
df <- my_data[my_data$CARDHLDR == 1, ]  # Subset


# You can subset df to get a new dataframe without NA values (not implemented here)
# df <- df[complete.cases(df),]

## Data preparation
# Set random seed for reproducibility 
set.seed(123)
n <- nrow(df)
sample <- sample(seq(n), size = n * 0.5, replace = FALSE) # 50-50 split

# Train-test split
training <- df[sample, -1]  # Assign indexed observations to training data & remove the first column
testing <- df[-sample, -1] # Assign those NOT in "sample" to testing data & remove the first column

# Generate X, Y variables for the training and testing data
mdlY <- as.factor(as.matrix(training["DEFAULT"]))
mdlX <- as.matrix(training[setdiff(colnames(df), c("CARDHLDR", "DEFAULT"))])  # Assign those not in c("CARDHLDR", "DEFAULT") to X variables in mdlX. We only used "CARDHLDR" to select desired rows but we don't need it for estimation.
newY <- as.factor(as.matrix(testing["DEFAULT"]))
newX <- as.matrix(testing[setdiff(colnames(df), c("CARDHLDR", "DEFAULT"))])

LASSO regression

# Estimate LASSO regression by setting alpha = 1 (why?)
# Estimate the LASSO with cross-validated glmnet function
# Note that because the dependent variable "DEFAULT" is binary, we need to set the distribution family to "binomial"
lasso.cv <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", alpha = 1) # Use cv.glmnet() to get the optimized lambda value and then feed it to the lambda argument inside the regular glmnet() function
lasso.fit <- glmnet(mdlX, mdlY, family = "binomial", lambda = lasso.cv$lambda.1se, alpha = 1)

# Check out estimated coefficients (only "INCOME" and "LOGSPEND" are retained)
coef(lasso.fit, s = lasso.cv$lambda.1se)
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -1.8117877271
## AGE          .           
## ACADMOS      .           
## ADEPCNT      .           
## MAJORDRG     .           
## MINORDRG     .           
## OWNRENT      .           
## INCOME      -0.0001201947
## SELFEMPL     .           
## INCPER       .           
## EXP_INC      .           
## SPENDING     .           
## LOGSPEND    -0.0170202047
# newY contains the ground truth (actual values), so we use the trained model to make predictions on the testing data (ŷ) and then compare the predicted values against the actual ones.
# Receiver operating characteristic (ROC): the area under the curve (AUC) is about 0.64
roc(newY, as.numeric(predict(lasso.fit, newX, type = "response"))) # Use the trained model to predict on the testing data and assess prediction accuracy against actual Y values in the testing data
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = newY, predictor = as.numeric(predict(lasso.fit,     newX, type = "response")))
## 
## Data: as.numeric(predict(lasso.fit, newX, type = "response")) in 4781 controls (newY 0) < 469 cases (newY 1).
## Area under the curve: 0.6421

Ridge regression

# Estimate Ridge regression by setting alpha = 0 (why?)
# Use cv.glmnet() to get the optimized lambda value and then feed it to the lambda argument inside the regular glmnet() function

ridge.cv <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", alpha = 0)
ridge.fit <- glmnet(mdlX, mdlY, family = "binomial", lambda = ridge.cv$lambda.1se, alpha = 0)

# Check out estimated coefficients (all variables are retained)
coef(ridge.fit, s = ridge.cv$lambda.1se)
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -1.507211e+00
## AGE         -4.439026e-03
## ACADMOS     -7.118618e-05
## ADEPCNT      6.301495e-03
## MAJORDRG     6.754877e-02
## MINORDRG     5.979355e-02
## OWNRENT     -8.237297e-02
## INCOME      -6.891628e-05
## SELFEMPL    -3.066778e-02
## INCPER      -4.859370e-06
## EXP_INC      1.823505e-01
## SPENDING    -7.527302e-05
## LOGSPEND    -5.290339e-02
# Receiver operating characteristic (ROC): the AUC is about 0.66, slightly higher than what LASSO achieved
roc(newY, as.numeric(predict(ridge.fit, newX, type = "response")))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = newY, predictor = as.numeric(predict(ridge.fit,     newX, type = "response")))
## 
## Data: as.numeric(predict(ridge.fit, newX, type = "response")) in 4781 controls (newY 0) < 469 cases (newY 1).
## Area under the curve: 0.6598

Finally, the elastic net: let’s see if it indeed gets the best of both worlds. We need to do something a little different here.

# Estimate an elastic-net regression by varying alpha values (why?)
# Estimate the plain version (enet.fit) and 10-fold cross-validated version (enet.cv)
# Apply parallel processing to increase computing speed as we need to loop over alpha
# We first need to register the number of CPU cores to be used in the computation
# Detect how many CPU cores are in your current machine but leave 1 core spare (you DO NOT want to
# use up all your CPUs). makeCluster() comes from the "parallel" package loaded at the very
# beginning of this document.
n.core <- makeCluster(detectCores(logical = TRUE) - 1)
registerDoParallel(n.core)     # Register the cluster: tell R to send parallel work to these worker processes


# Vary alpha value from 0.2 to 0.8 by an increment of 0.05
alpha <- seq(0.2, 0.8, 0.05)  

# Write a foreach loop to execute the code inside the {} across the values of alpha; %dopar% is the operator (enabled by the registered doParallel backend) that distributes the work across the CPU cores

search <- foreach(i = alpha, .combine = rbind) %dopar% {
  cv <- glmnet::cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", parallel = TRUE, alpha = i)  # cv.glmnet() needs the glmnet:: prefix so the parallel workers can find it
  data.frame(cvm = cv$cvm[cv$lambda == cv$lambda.1se], lambda.1se = cv$lambda.1se, alpha = i) # Tell R the resulting dataframe should contain these columns
}

# See what's inside the estimated object
# It contains the mean cross-validated error (cvm; here, the binomial deviance since type.measure = "deviance"), the optimized lambda, and the corresponding alpha value (higher alpha -> closer to LASSO, lower alpha -> closer to Ridge)
names(search)
## [1] "cvm"        "lambda.1se" "alpha"
enet.cv <- search[search$cvm == min(search$cvm), ] # Find the minimum values of cvm to locate the row that contains optimized alpha, then feed them to the relevant argument inside glmnet()
enet.fit <- glmnet(mdlX, mdlY, family = "binomial", lambda = enet.cv$lambda.1se, alpha = enet.cv$alpha)
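# Housekeeping (not shown in the original chunk): once the cross-validated search over alpha
# is finished, release the worker processes registered earlier
stopCluster(n.core)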

# Check out estimated coefficients: AGE, MAJORDRG, MINORDRG, OWNRENT, INCOME, INCPER, and LOGSPEND remain important features
coef(enet.fit)
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -1.190112e+00
## AGE         -3.868212e-03
## ACADMOS      .           
## ADEPCNT      .           
## MAJORDRG     4.189241e-02
## MINORDRG     6.066569e-02
## OWNRENT     -4.313824e-02
## INCOME      -1.473616e-04
## SELFEMPL     .           
## INCPER      -6.734920e-06
## EXP_INC      .           
## SPENDING     .           
## LOGSPEND    -7.953523e-02
# Receiver operating characteristic (ROC): the AUC is about 0.66, not bad but not noticeably better than what we got using Ridge regression
roc(newY, as.numeric(predict(enet.fit, newX, type = "response")))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = newY, predictor = as.numeric(predict(enet.fit,     newX, type = "response")))
## 
## Data: as.numeric(predict(enet.fit, newX, type = "response")) in 4781 controls (newY 0) < 469 cases (newY 1).
## Area under the curve: 0.6569