library(tidyr)
library(dplyr)
library(broom)
library(MASS)   # provides lm.ridge()
data(meatspec, package="faraway")   # near-infrared absorbance spectra of meat samples; response: fat content

Ridge Regression

In contexts where the available information is redundant, even an established and simple method such as ordinary least squares (OLS) regression stops working well. Let’s try to understand why. For OLS there is a closed-form expression that returns the model parameters:

\[ \hat{\boldsymbol{\beta}}_{OLS}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}, \]

where \(\mathbf{X}\) is an \(n \times p\) matrix, with \(n\) the sample size and \(p\) the number of parameters, and \(\mathbf{y}\) is the \(n \times 1\) column vector containing the observed values of the response variable.
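
As a quick illustration (on simulated data, not part of the original analysis), the closed-form expression can be computed directly and compared with the output of lm():

# Minimal sketch: closed-form OLS on simulated data, compared with lm()
set.seed(1)
n <- 50
X <- cbind(1, matrix(rnorm(n * 3), n, 3))        # design matrix with an intercept column
y <- drop(X %*% c(2, 1, -1, 0.5) + rnorm(n))     # simulated response
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y     # (X'X)^{-1} X'y
cbind(closed_form = drop(beta_hat), lm = coef(lm(y ~ X - 1)))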

The OLS formula requires inverting the matrix \(\mathbf{X}^T\mathbf{X}\), and this inverse does not always exist when working with redundant data. Some software, including R, can still handle such problems: instead of inverting the matrix explicitly, lm() uses the QR decomposition. The problem persists nonetheless, because the estimates obtained in this way are unreliable and have very high variability.
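
One quick way to see the severity of the problem for these data (a check added here, not part of the original text) is the condition number of \(\mathbf{X}^T\mathbf{X}\): an enormous value signals that the matrix is nearly singular.

# Condition number of X'X for the meatspec design matrix
Xspec <- model.matrix(fat ~ ., meatspec)
kappa(crossprod(Xspec))   # a huge value indicates near-singularity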

To see this in practice, let’s fit a linear model that includes all the covariates in the dataset. Our previous exploratory analysis (output not shown) revealed a strong correlation between the columns of the data matrix.

# Fit of the OLS model with all covariates
modlm <- lm(fat ~ ., meatspec)
my_summary <- summary(modlm)
my_df <- broom::tidy(my_summary)
my_df

The standard errors are far too large for the estimates to be considered reliable. It therefore becomes necessary to stabilize the matrix inversion and, consequently, improve the precision of the estimates.
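
For instance, sorting the coefficient table by standard error (a quick check added here) makes the instability evident:

# Coefficients with the largest standard errors from the OLS fit
my_df %>% arrange(desc(std.error)) %>% head()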

In this context, it is appropriate to introduce Ridge regression, which assumes that the covariates \(X_1,\dots, X_p\) have been centered using their mean and scaled by their standard deviation, and that the same procedure has been applied to the response variable \(Y\). The estimated coefficient vector \(\hat{\boldsymbol{\beta}}_{Ridge}\) has the following form:

\[ \hat{\boldsymbol{\beta}}_{Ridge}=(\mathbf{X}^T\mathbf{X}+\lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} \]

where \(\lambda \ge 0\) is a tuning parameter that cannot be estimated with the usual least-squares machinery and must instead be chosen by the analyst, while \(\mathbf{I}\) is the \(p \times p\) identity matrix (with \(1\) on the diagonal and \(0\) elsewhere).

There are various possible interpretations of Ridge regression; the most intuitive is that it adds a positive constant \(\lambda\) to the diagonal elements of the matrix \(\mathbf{X}^T\mathbf{X}\) before inverting it. This operation solves the non-invertibility problem that arises when working with redundant information. An extensive discussion of the rationale for adding a regularization constant can be found in Hoerl and Kennard (1970).
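
As a sketch of the formula above, the Ridge coefficients can be computed directly on standardized data; the value \(\lambda = 0.01\) is arbitrary, and lm.ridge uses its own internal scaling, so the numbers will not match it exactly.

# Ridge formula applied directly to standardized meatspec data (illustrative sketch)
Xs <- scale(as.matrix(meatspec[, names(meatspec) != "fat"]))
ys <- scale(meatspec$fat)
lambda <- 0.01                                   # arbitrary value, for illustration only
beta_ridge <- solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, ys))
head(beta_ridge)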

# Fit of ridge regression 
rgmod <- lm.ridge(fat ~ ., meatspec, lambda = seq(0, 5e-8, len=21))
matplot(rgmod$lambda, 
        coef(rgmod), type="l", 
        xlab=expression(lambda) ,
        ylab=expression(hat(beta)),
        col=4)

Unfortunately, there is no direct estimation method for \(\lambda\), so the analyst must be patient and try several candidate values. A commonly used graphical tool is the ridge trace plot, which shows the estimated coefficients as a function of \(\lambda\) (on the x-axis). One way to select the value of \(\lambda\) is generalized cross-validation (GCV).

The lm.ridge function in the MASS package computes the generalized cross-validation (GCV) score for every value of \(\lambda\) in the grid, so the value that minimizes it can be identified directly.
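
It can also be instructive to look at the whole GCV curve rather than just its minimum (a plot added here for illustration):

# GCV score as a function of lambda
plot(rgmod$lambda, rgmod$GCV, type = "l",
     xlab = expression(lambda), ylab = "GCV")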

which.min(rgmod$GCV)
## 3.25e-08 
##       14

This returns the value of \(\lambda\) for which the GCV is minimal (here \(3.25 \times 10^{-8}\)) together with its position in the grid (14).
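
Rather than reading it off the printed output, the selected value can also be extracted programmatically:

# The lambda value that minimizes the GCV
lambda_gcv <- rgmod$lambda[which.min(rgmod$GCV)]
lambda_gcv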

{matplot(rgmod$lambda, 
         coef(rgmod), 
         type="l", 
         xlab=expression(lambda) ,
         ylab=expression(hat(beta)),
         col=4)
  abline(v=3.25e-08)}   # lambda selected by GCV

But what are the estimates associated with the GCV-selected value \(\lambda = 3.25 \times 10^{-8}\)?

my_coef <- coef(rgmod)[which.min(rgmod$GCV),]
# recent versions of broom no longer tidy plain numeric vectors, so use tibble::enframe()
my_coef_dt <- tibble::enframe(my_coef, name = "names", value = "x")
print(my_coef_dt)
## # A tibble: 101 × 2
##    names         x
##    <chr>     <dbl>
##  1 ""         7.14
##  2 "V1"    9693.  
##  3 "V2"  -12590.  
##  4 "V3"    2907.  
##  5 "V4"    5368.  
##  6 "V5"   -9038.  
##  7 "V6"     263.  
##  8 "V7"   -5984.  
##  9 "V8"    6133.  
## 10 "V9"      54.1 
## # … with 91 more rows

The estimates obtained in this way are biased, but they have lower variability than the OLS estimates.
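
As a rough check (added here; note that in-sample RMSE always favours OLS, and the real gain from Ridge shows up in out-of-sample prediction, as in the train/test comparison of Faraway, 2015), we can compare the fit of OLS with that of the GCV-selected Ridge coefficients:

# In-sample RMSE of the OLS and ridge predictions (illustrative sketch)
rmse <- function(x, y) sqrt(mean((x - y)^2))
Xmat <- as.matrix(meatspec[, names(meatspec) != "fat"])
rmse(fitted(modlm), meatspec$fat)                      # OLS
rmse(drop(cbind(1, Xmat) %*% my_coef), meatspec$fat)   # ridge at the GCV-selected lambda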

Lasso Regression

Lasso regression is superficially similar to Ridge, but there are some conceptual differences that are worth discussing. Here we want to choose \(\hat{\boldsymbol{\beta}}\) so that it minimizes \[ (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}) \quad \text{subject to} \quad \sum_{j=1}^p|\beta_j|\le t \]

where \(\mathbf{y}\) is the response vector, \(\mathbf{X}\) is the covariate matrix, \(\boldsymbol{\beta}\) is the regression coefficient vector, and \(t\) is a pre-specified bound on the sum of the absolute values of the coefficients.

Lasso regression differs from Ridge in its penalty term, introduced by Tibshirani (1996). Under the restriction \(\sum_{j=1}^p|\beta_j|\le t\) there is no closed-form solution for the parameter vector. As \(t\) increases, more variables enter the model and their coefficients grow. For sufficiently large \(t\), the restriction becomes inactive and the least squares solution is recovered.
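
Equivalently, the constrained problem can be written in penalized (Lagrangian) form, which makes the comparison with Ridge explicit: every bound \(t\) corresponds to some penalty \(\lambda \ge 0\) in

\[ \hat{\boldsymbol{\beta}}_{Lasso}=\arg\min_{\boldsymbol{\beta}}\,(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})+\lambda \sum_{j=1}^p|\beta_j|. \]

The only difference from the Ridge criterion is that the penalty involves \(\sum_{j=1}^p|\beta_j|\) rather than \(\sum_{j=1}^p\beta_j^2\).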

This method is particularly useful when we expect the effects to be sparse, i.e., when only a few predictors explain the response while the others have no appreciable effect. The key feature of Lasso regression is its ability to eliminate predictors that are not relevant: when \(\hat{\beta}_j\) is exactly zero, the corresponding predictor \(x_j\) is removed from the model entirely. Ridge regression, by contrast, never eliminates variables; it only shrinks the estimated coefficients \(\hat{\beta}_j\) toward zero.

It is interesting to note that “Lasso” is in fact an acronym, standing for Least Absolute Shrinkage and Selection Operator, but the name also evokes the rope used by cowboys to catch cattle, in analogy with how the statistician “catches” a subset of variables for the model.

The dataset in this example is built from the state data included in R’s datasets package, which contain information on various aspects of the United States, such as population, per capita income, illiteracy, life expectancy, and others.

In particular, the data frame we build below from the state.x77 matrix contains 50 observations, one for each US state, and eight variables: Population, Income (per capita income), Illiteracy, Life.Exp (life expectancy), Murder (murder rate per 100,000 population), HS.Grad (percentage of high-school graduates), Frost (mean number of days with minimum temperature below freezing), and Area (land area in square miles).

library(lars)
data(state)
statedata <- data.frame(state.x77, row.names = state.abb)
# Fit of the Lasso regression: column 4 (Life.Exp) is the response, so it is dropped from the design matrix
lmod <- lars(as.matrix(statedata[, -4]), statedata$Life.Exp)
plot(lmod)

The graph is very similar to what we have already seen in the case of Ridge regression. However, there are some notable differences. In particular, it can be observed that some of the coefficients, which depend on \(t\), are forced to take the value zero. This means that the corresponding variables are excluded from the model, making Lasso regression a very powerful variable selection method.

Note that the horizontal axis is the \(L_1\) norm of the coefficient vector scaled by that of the full least squares solution, so it runs from 0 to 1. For small values of \(t\), only predictor 4 (Murder, the murder rate) is different from zero. As \(t\) increases, a second predictor comes into play (HS.Grad, the high-school graduation rate). So, as \(t\) increases, the size of the estimated coefficients also increases.
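
To see the predictors entering one at a time, we can inspect the coefficients at a few fractions of the full least squares \(L_1\) norm (the fractions below are chosen here only for illustration):

# Lasso coefficients at a few fractions of the full least squares L1 norm
round(predict(lmod, s = c(0.1, 0.25, 0.5, 1),
              type = "coefficients", mode = "fraction")$coefficients, 4)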

But how do we choose \(t\)? The selection of the parameter \(t\) (remember that it cannot be estimated with conventional methods) is usually done by cross-validation (CV). The lars package implements 10-fold CV: the data are randomly divided into 10 groups; each group in turn is held out, the model is fitted on the remaining nine groups and used to predict the held-out group, and the prediction errors are then averaged over the 10 folds.
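
Before calling cv.lars, here is a minimal sketch of the mechanics it automates; the fold assignment and the grid of fractions below are chosen here only for illustration, and cv.lars handles all of this internally.

# Hand-rolled 10-fold CV for the lasso (illustrative sketch of what cv.lars does)
k <- 10
X_st <- as.matrix(statedata[, -4])
y_st <- statedata$Life.Exp
folds  <- sample(rep(1:k, length.out = nrow(X_st)))    # random fold assignment
s_grid <- seq(0, 1, length.out = 100)                  # fractions of the L1 norm
cv_err <- matrix(NA, k, length(s_grid))
for (i in 1:k) {
  fit  <- lars(X_st[folds != i, ], y_st[folds != i])
  pred <- predict(fit, X_st[folds == i, ], s = s_grid,
                  type = "fit", mode = "fraction")$fit
  cv_err[i, ] <- colMeans((y_st[folds == i] - pred)^2) # MSE on the held-out fold
}
s_grid[which.min(colMeans(cv_err))]                    # fraction with the lowest CV error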

set.seed(123)

cvlmod <- cv.lars(as.matrix(statedata[, -4]), statedata$Life.Exp)

The value suggested by the CV is \(t_{CV}\approx 0.65\) (expressed as a fraction of the full least squares \(L_1\) norm). Referring to the previous graph, this corresponds to a model containing only four predictors. We can now extract the coefficients.

cvlmod$index[which.min(cvlmod$cv)]
## [1] 0.6464646
my_coef_lasso <- predict(lmod, s = cvlmod$index[which.min(cvlmod$cv)],
                         type = "coef", mode = "fraction")$coef
# recent versions of broom no longer tidy plain numeric vectors, so use tibble::enframe()
my_coef_lasso_dt <- tibble::enframe(my_coef_lasso, name = "names", value = "x")
print(my_coef_lasso_dt)
## # A tibble: 7 × 2
##   names               x
##   <chr>           <dbl>
## 1 Population  0.0000235
## 2 Income      0        
## 3 Illiteracy  0        
## 4 Murder     -0.240    
## 5 HS.Grad     0.0353   
## 6 Frost      -0.00169  
## 7 Area        0

In conclusion, Lasso and Ridge are two regularization techniques that are similar but have different objectives. Lasso is more suitable when it is presumed that only a small subset of variables is relevant in explaining the \(Y_i\), while Ridge is more appropriate when it is assumed that all variables are relevant, but some are more relevant than others. Furthermore, Lasso is particularly useful when working with high-dimensional datasets, as it helps to select only the most important variables and reduce the risk of overfitting. On the other hand, Ridge may be more suitable when variables are strongly correlated with each other. In summary, the choice between Lasso and Ridge depends on the specific characteristics of the dataset and the research questions of the analyst.

References

  • Faraway, J. J. (2015). Linear Models with R, Second Edition. Chapman & Hall/CRC Texts in Statistical Science.
  • Hoerl, A. E. and Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1), 55–67.
  • Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.