Why is having more predictors (p) than observations (n) an issue?
Theory
In linear regression, when the number of independent variables (predictors) exceeds the number of observations (data points), the model becomes overparameterized, leading to an inability to estimate the coefficients (betas) and their standard errors correctly.
There are multiple ways to understand why this is the case -
Mathematical Impossibility — No Unique Solution: you need more observations than predictors to estimate the coefficients uniquely, and here there are not enough observations.
\[
\hat{\beta} = (X'X)^{-1} X' y
\]
If the number of predictors p is greater than the number of observations n, the matrix \(X'X\) is singular (not invertible). This is because X does not contain enough information to uniquely determine the coefficients. Imagine trying to solve for 11 variables with only 10 equations — there is no unique solution, as the system is underdetermined.
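As a quick illustration (a minimal sketch with simulated data, not taken from the example later in this post), you can check in R that \(X'X\) loses rank and cannot be inverted once \(p>n\):

set.seed(1)
n_toy <- 10; p_toy <- 12                        # more predictors than observations
X_toy <- matrix(rnorm(n_toy * p_toy), nrow = n_toy)   # 10 x 12 design matrix

qr(crossprod(X_toy))$rank                       # rank is at most n_toy = 10, not p_toy = 12
try(solve(crossprod(X_toy)))                    # inverting X'X fails: the matrix is singular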
Degrees of Freedom — Lack of Flexibility:
Degrees of freedom in regression are calculated as \(n - (p + 1)\), where n is the number of observations, and p is the number of predictors.
When \(p \geq n\), the degrees of freedom become zero or negative. Degrees of freedom represent the amount of independent information available to estimate the coefficients. Without enough degrees of freedom, you can’t estimate the variance or standard errors accurately, leading to undefined standard errors.
In simpler terms, you need enough data points to account for each predictor’s effect. If there aren’t enough data points, the model lacks sufficient information to separate out the effects of each predictor accurately.
Perfect Multicollinearity — Infinite Solutions:
Perfect multicollinearity occurs when one or more predictors are exact linear combinations of others. If you have more predictors than observations, some predictors can be expressed as combinations of others, leading to perfect multicollinearity.
This means there are infinite solutions for the coefficients \(\beta\), and the regression algorithm cannot determine which is correct. For example, \(\beta_1\) and \(\beta_2\) might both be adjusted in different ways to still perfectly fit the data, creating instability in the model.
Empirical
We need the glmnet package.
# Clear the workspace
rm(list = ls())      # Clear environment - remove all objects from your workspace
invisible(gc())      # Clear unused memory
cat("\f")            # Clear the console
graphics.off()       # Clear all graphs

# Prepare needed libraries
packages <- c("glmnet",     # used for regression
              "caret",      # used for modeling
              "xgboost",    # used for building XGBoost model
              "ISLR",
              "dplyr",      # used for data manipulation and joining
              "tidyselect",
              "stargazer",  # presentation of data
              "data.table", # used for reading and manipulation of data
              "ggplot2",    # used for plotting
              "cowplot",    # used for combining multiple plots
              "e1071",      # used for skewness
              "psych",
              "car")

suppressPackageStartupMessages({
  for (i in 1:length(packages)) {
    if (!packages[i] %in% rownames(installed.packages())) {
      install.packages(packages[i],
                       repos = "http://cran.rstudio.com/",
                       dependencies = TRUE)
    }
    library(packages[i], character.only = TRUE)
  }
})
rm(packages)
set.seed(7)
Generate fake/simulated data
You could also use a real dataset where \(p>n\); here we simulate one instead.
# Clear the workspace by removing all objects from the environment
remove(list = ls())

# Load the help page for distribution functions
?distribution
# Generate response variable 'y' from a Student's t-distribution with 10 observations and 7 degrees of freedom
y <- rt(n = 10, df = 7)

# Generate predictor variables from different distributions:

# Uniform distributions for x1, x2, and x3:
x1 <- runif(n = 10, min = 1, max = 7)   # 10 random values between 1 and 7
x2 <- runif(n = 10, min = 2, max = 8)
x3 <- runif(n = 10, min = 3, max = 9)

# Binomial distributions for x4, x5, and x6:
x4 <- rbinom(n = 10, size = 14, prob = 0.7)   # 10 random values, each the count of successes in 14 trials with probability 0.7
x5 <- rbinom(n = 10, size = 15, prob = 0.8)
x6 <- rbinom(n = 10, size = 16, prob = 0.9)

# Poisson distributions for x7 to x11:
x7  <- rpois(n = 10, lambda = 5)   # 10 random values with an average rate of 5
x8  <- rpois(n = 10, lambda = 6)
x9  <- rpois(n = 10, lambda = 7)
x10 <- rpois(n = 10, lambda = 8)
x11 <- rpois(n = 10, lambda = 9)
Once you have the data, let's run some regressions -
Case I: \(p<n\) (reg1 has \(p=8+1\) and \(n=10\)). We will get all \(\beta\) and all \(se(\beta)\).
Case II: \(p=n\) (reg2 has \(p=9+1\) and \(n=10\)). We will get all \(\beta\), BUT \(se(\beta)\) is an issue.
Case III: \(p>n\) (reg3 has \(p=10+1\) and \(n=10\)). We will NOT get all \(\beta\), AND \(se(\beta)\) is an issue.
reg1 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8)
reg2 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9)
reg3 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10)
# reg4 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11)

stargazer(reg1, reg2, reg3,
          # reg4,
          type = "text")
As you can see, reg3 is not able to estimate all betas.
In fact, standard errors are an issue from reg2 onwards.
Removing the intercept term
This changes p in each of the three regressions, but the story stays the same: you cannot estimate all coefficients (\(\beta\)) when \(p>n\), and you cannot estimate standard errors (\(se\)) when \(p \geq n\).
reg1 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 - 1)
reg2 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 - 1)
reg3 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 - 1)
reg4 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 - 1)

stargazer(reg1, reg2, reg3, reg4, type = "text")
Lasso, Ridge, and Ordinary Least Squares (OLS) are all regression techniques used for modeling relationships between variables. Here’s a brief overview and comparison of these three methods:
Ordinary Least Squares (OLS):
OLS is the classic linear regression method that aims to minimize the sum of squared differences between observed and predicted values.
It estimates coefficients for predictor variables to fit a linear model that best represents the relationship with the response variable.
OLS doesn’t inherently address issues like multicollinearity or feature selection, potentially leading to overfitting when dealing with high-dimensional data.
Lasso Regression:
Lasso (Least Absolute Shrinkage and Selection Operator) extends OLS by adding a penalty term based on the absolute values of the coefficients.
Lasso regression seeks to minimize the following:
\[
RSS + \lambda \sum_{j=1}^{p} |\beta_j|
\]
It encourages some coefficients to become exactly zero, leading to feature selection. This makes Lasso suitable when you suspect that only a subset of predictors is relevant.
Lasso is effective for high-dimensional data, as it can automatically exclude less important variables and prevent overfitting.
However, Lasso’s feature selection can be aggressive and might lead to biased coefficient estimates in certain cases.
Use Case: Lasso is useful when you have many features, and you suspect that some may be irrelevant. It automatically selects important features by zeroing out less important coefficients.
Ridge Regression:
Ridge regression, like Lasso, extends OLS by adding a penalty term, but this term is based on the squared values of coefficients.
Ridge regression seeks to minimize the following:
\[
RSS + \lambda \sum_{j=1}^{p} \beta_j^2
\]
Ridge helps address multicollinearity by shrinking coefficients, reducing the impact of correlated predictors.
Unlike Lasso, Ridge rarely forces coefficients to be exactly zero. It’s suitable when you want to retain most predictors but mitigate multicollinearity effects.
Ridge can provide more stable and less biased coefficient estimates compared to Lasso, especially when many predictors are correlated.
Use Case: Ridge is useful when all features are potentially relevant, and you want to reduce the magnitude of coefficients to make the model more stable and generalizable.
Comparison of OLS, Lasso and Ridge
OLS is a basic linear regression, while Lasso and Ridge add regularization to prevent overfitting and handle multicollinearity. Lasso is suitable for feature selection, while Ridge is useful for maintaining stability and addressing multicollinearity. The choice between them depends on your data characteristics and modeling goals.
Feature Selection: OLS doesn’t perform feature selection. Lasso is well-suited for feature selection by setting some coefficients to zero, while Ridge retains most predictors.
Coefficient Values: Lasso can lead to sparse solutions with many coefficients being exactly zero, whereas Ridge generally shrinks coefficients towards zero without making them zero.
Multicollinearity: Ridge is particularly effective in handling multicollinearity by reducing the impact of correlated predictors, whereas Lasso might handle multicollinearity by excluding correlated variables entirely.
Model Complexity: OLS can lead to overfitting in high-dimensional settings, while Lasso and Ridge provide regularization to mitigate this issue.
Model Selection: Lasso can provide a simpler and more interpretable model by selecting a subset of predictors. Ridge retains more predictors and maintains predictive stability.
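To make the two penalties concrete, here is a small sketch (illustrative only) of the objective functions written out in R, following the formulas above; note that glmnet uses its own internal scaling of the loss and its own optimizer:

# Penalized objective functions for a given coefficient vector b (illustrative sketch)
rss <- function(b, X, y) sum((y - X %*% b)^2)

lasso_objective <- function(b, X, y, lambda) rss(b, X, y) + lambda * sum(abs(b))   # RSS + L1 penalty
ridge_objective <- function(b, X, y, lambda) rss(b, X, y) + lambda * sum(b^2)      # RSS + L2 penalty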
Data
Independent Variables
We will generate 20 columns and 500 rows of data (drawn from a Poisson distribution in the code below, with a normal alternative commented out), and then standardize the data (standardization is different from normalization).
Standardization: Scaling to Zero Mean and Unit Variance
Standardization, also known as z-score normalization, scales our data to have a mean of 0 and a standard deviation of 1. By centering the data around zero and squeezing it to a consistent range, we achieve balanced and comparable features. Standardization is particularly useful for algorithms that rely on distance measures, such as k-nearest neighbors and support vector machines.
Normalization: Mapping to a Common Range
Normalization, on the other hand, scales the data to a range between 0 and 1. By mapping the features to a common interval, normalization helps to emphasize the relative importance of different features. This technique is beneficial for algorithms that rely on weight-sensitive models, like neural networks and clustering algorithms.
The decision to standardize or normalize depends on the characteristics of the dataset and the requirements of the chosen machine learning algorithm. Here are some guidelines to consider:
Standardization: When the features have different units or different scales, standardization ensures they are all on a comparable level, making it easier for the algorithm to learn patterns effectively.
Normalization: When the scale of the features is not critical, and you want to ensure that all features contribute equally to the model, normalization is a suitable choice.
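A small sketch of the two transformations on a single numeric vector (hypothetical values, shown only to contrast the two):

x <- c(10, 20, 30, 40, 100)

# Standardization (z-scores): mean 0, standard deviation 1
x_std <- (x - mean(x)) / sd(x)                # equivalently: as.vector(scale(x))

# Normalization (min-max): values mapped to the [0, 1] range
x_norm <- (x - min(x)) / (max(x) - min(x))

round(x_std, 2); round(x_norm, 2)

With that distinction in mind, we standardize (rather than normalize) the simulated data below.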
N <- 500   # number of observations
p <- 20    # number of variables

#--------------------------------------------
# X variable
#--------------------------------------------
?rpois
X = matrix(data = rpois(n = N * p, lambda = 7),   # rnorm(n = N*p) would give normal draws instead
           ncol = p)

# before standardization
colMeans(x = X)   # mean
#--------------------------------------------
# Y variable
#--------------------------------------------
beta <- c(0.15, -0.33, 0.25, -0.25, 0.05,   # initial 5 specific coefficients
          rep(0, times = (p/2) - 5),        # fills the next 5 coefficients with zeros
          rep(0, times = (p/2) - 5),        # fills the next 5 coefficients with zeros
          -0.25, 0.12, -0.125, .33, .342)   # last 5 specific coefficients
beta
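The later code refers to y, scaled_X, scaled_y and df_scaledX, whose construction is not shown above. Here is a minimal sketch of that step, assuming the response is built as \(X\beta\) plus Gaussian noise and that standardization is done with scale():

# Assumed data-generating process: y = X * beta + noise
y <- as.vector(X %*% beta + rnorm(n = N))

# Assumed standardization step: each column gets mean 0 and sd 1
scaled_X   <- scale(X)
scaled_y   <- scale(y)
df_scaledX <- as.data.frame(scaled_X)

summary(scaled_y)   # after scaling, the mean is (numerically) zero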
V1
Min. :-2.60654
1st Qu.:-0.70959
Median :-0.05153
Mean : 0.00000
3rd Qu.: 0.61834
Max. : 3.24174
What happens when you run lasso/ridge without standardization?
Running Lasso (Least Absolute Shrinkage and Selection Operator) or Ridge regression without standardization can lead to various issues and challenges, similar to those discussed earlier. Both Lasso and Ridge are regularization techniques used to prevent overfitting in linear regression models. When applied without standardization, the following problems may arise:
Variable Scale Impact: Like in the case of Lasso, Ridge regression is sensitive to the scale of predictor variables. Without standardization, variables with different scales can have unequal contributions to the regularization term, potentially biasing the coefficient estimates towards variables with larger scales.
Unfair Feature Selection: In Lasso, without standardization, variables with larger scales can be more likely to have their coefficients reduced to zero, leading to unfair feature selection. Similarly, in Ridge regression, variables with larger scales might dominate the penalty term, influencing the shrinkage of coefficients.
Bias in Coefficient Estimates: Ridge regression introduces a penalty term that shrinks coefficients towards zero. Without standardization, variables with larger scales can have larger initial coefficient values, and Ridge regression might disproportionately shrink coefficients for these variables, leading to biased estimates.
Interpretability: Without standardization, interpreting coefficients in Ridge and Lasso models becomes more challenging. The coefficients’ magnitudes do not directly indicate their impact on the response variable, making it harder to understand the relationships between predictors and the response.
Model Performance: The predictive performance of Ridge and Lasso models can be negatively affected by the issues mentioned above. Unstandardized variables might lead to suboptimal model fit and reduced generalization to new data.
To mitigate these issues and ensure the proper application of Ridge and Lasso, it’s advisable to standardize your predictor variables before running these regularization techniques. Standardization helps achieve fair treatment of variables with different scales and improves the interpretability and performance of your model.
Choosing the Tuning Parameter \(\lambda\) -
In the context of regularization techniques like Lasso and Ridge regression, the term “optimal lambda” refers to the ideal value of the regularization parameter (often denoted as \(\lambda\) ) that results in the best model performance. The goal of selecting an optimal lambda is to strike a balance between fitting the model well to the training data and preventing overfitting.
Here’s why selecting an optimal lambda is important:
Preventing Overfitting: Regularization techniques like Lasso and Ridge add a penalty term to the regression coefficients. This penalty discourages large coefficient values, which in turn reduces the model’s tendency to overfit the training data. An optimal lambda helps find the right level of regularization to prevent overfitting, resulting in better generalization to new, unseen data.
Bias-Variance Trade-off: Regularization introduces a bias in the coefficient estimates, but it reduces the variance of the estimates. An optimal lambda finds the balance between bias and variance, improving the model’s overall predictive accuracy.
Feature Selection: Lasso, in particular, has the ability to set some coefficients exactly to zero, effectively performing feature selection. An optimal lambda helps identify which predictors are most important for the model, leading to a more interpretable and potentially simpler model.
Improved Model Performance: Selecting the right lambda can lead to improved model performance on both the training and validation data. It helps ensure that the model captures the underlying patterns in the data without fitting noise.
Regularization Strength: The value of lambda determines the strength of the regularization. Too small a lambda might result in insufficient regularization, while too large a lambda might lead to excessive shrinkage of coefficients. An optimal lambda helps find the appropriate level of regularization for the given data.
To find the optimal lambda, you typically use techniques like cross-validation. Cross-validation involves splitting your data into training and validation sets multiple times, training the model on different subsets of the data, and evaluating model performance for different lambda values. The lambda that results in the best performance (e.g., lowest validation error) is then selected as the optimal lambda.
In summary, selecting an optimal lambda is crucial for achieving a well-performing and generalizable model by effectively balancing the trade-off between bias and variance. It helps control model complexity, prevents overfitting, and improves the model’s ability to make accurate predictions on new data.
(Optimal) Lambda
We standardized our variables above.
Now let's find the best \(\lambda\) value for lasso regression, and then run the optimal lasso model - it should neither keep more variables nor drop more variables than required.
The cv.glmnet function from the glmnet package in R is used for cross-validated Lasso or Ridge regression. This function finds the optimal value of the regularization parameter \(\lambda\) that minimizes the mean squared error (MSE) in a predictive model.
?cv.glmnet
cv_model <- cv.glmnet(x = scaled_X, y = scaled_y, alpha = 1)

# find optimal lambda value that minimizes test MSE
best_lambda <- cv_model$lambda.min
best_lambda
[1] 0.007241287
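You can also inspect the cross-validation curve directly; cv.glmnet additionally reports lambda.1se, the largest \(\lambda\) whose error is within one standard error of the minimum, which some prefer for a more parsimonious model:

plot(cv_model)        # cross-validated MSE as a function of log(lambda), with error bars
cv_model$lambda.1se   # largest lambda within one standard error of the minimum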
Let's apply lasso/ridge. In the context of glmnet, the alpha parameter controls the type of regularization applied in the model:
alpha = 1 corresponds to Lasso regression (L1 regularization).
alpha = 0 corresponds to Ridge regression (L2 regularization).
Values between 0 and 1 apply Elastic Net regularization, which is a combination of both Lasso and Ridge.
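For instance, a value in between (here alpha = 0.5, an arbitrary choice purely for illustration) gives an elastic-net fit:

# Elastic net: a 50/50 mix of the L1 and L2 penalties (alpha chosen arbitrarily here)
en.eq <- glmnet(x = scaled_X, y = scaled_y,
                lambda = best_lambda, alpha = 0.5,
                family = "gaussian", intercept = FALSE)

Below, we stick to the two pure penalties and compare them with plain OLS.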
# standard linear regression without intercept (-1)
li.eq <- lm(y ~ X - 1)

?glmnet

# Lasso
la.eq <- glmnet(x = scaled_X,
                y = scaled_y,
                lambda = best_lambda,
                family = "gaussian",   # specify the error family if you know its distribution
                intercept = FALSE,     # since we are comparing with OLS without an intercept
                alpha = 1)
summary(la.eq)   # this will not print out the coefficients
# Ridge
ri.eq <- glmnet(x = scaled_X,
                y = scaled_y,
                lambda = best_lambda,
                family = "gaussian",   # specify the error family if you know its distribution
                intercept = FALSE,
                alpha = 0)

#--------------------------------------------
# Results (lambda = optimal)
#--------------------------------------------
df_comp_optimal <- data.frame(beta   = beta,
                              Linear = li.eq$coefficients,
                              Lasso  = la.eq$beta[, 1],
                              Ridge  = ri.eq$beta[, 1])
df_comp_optimal
In glmnet, the family parameter specifies the type of model to fit, and while it’s not strictly required, it’s recommended to specify it for clarity and accuracy. By default, glmnet assumes "gaussian" if family is not provided, which is suitable for linear regression.
Common Options for family in glmnet
family = "gaussian": For linear regression (default).
family = "binomial": For logistic regression in binary classification tasks.
family = "multinomial": For multinomial logistic regression in multi-class classification tasks.
family = "poisson": For count data models.
family = "cox": For survival analysis with Cox proportional hazards.
In Lasso/Ridge regression using the glmnet package in R, standard errors for the coefficient estimates are not provided directly because the method focuses on shrinkage and doesn't compute standard errors as in ordinary least squares (OLS) regression. However, there are a few ways to approximate or obtain standard errors for Ridge/Lasso coefficients, such as bootstrapping and/or analytical approximations. It might be simplest to just run an OLS with the variables suggested by lasso/ridge and look at those standard errors.
# Extract non-zero coefficients from the Lasso regression results
# (convert the coefficient matrix to a regular matrix format for easier manipulation)
W <- as.matrix(coef(la.eq))

# Identify predictors with non-zero coefficients, excluding the intercept
keep_X <- rownames(W)[W != 0 & rownames(W) != "(Intercept)"]

# Subset the original data to include only predictors with non-zero coefficients
X_O <- as.matrix(df_scaledX[, keep_X])

# Combine target variable `y` with the selected predictors into a data frame
df <- cbind.data.frame(y, X_O)

# Fit a linear model using only the selected predictors, without an intercept
la.eq_O <- summary(lm(y ~ X_O - 1, data = df))
la.eq_O
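Alternatively, the bootstrap idea mentioned above can be sketched as follows (an illustrative approximation, not a substitute for proper post-selection inference): refit the lasso at the chosen \(\lambda\) on resampled rows and look at the spread of each coefficient.

# Bootstrap sketch: refit the lasso on resampled rows and collect the coefficients
B <- 200
boot_betas <- replicate(B, {
  idx <- sample(nrow(scaled_X), replace = TRUE)
  fit <- glmnet(x = scaled_X[idx, ], y = scaled_y[idx],
                lambda = best_lambda, alpha = 1,
                family = "gaussian", intercept = FALSE)
  as.vector(fit$beta)
})
boot_se <- apply(boot_betas, 1, sd)   # rough standard error for each coefficient
boot_se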
(High) \(\lambda\) = .2
A HIGH \(\lambda\) SHOULD PENALISE COEFFICIENTS MORE, AND WE SHOULD FIND MANY COEFFICIENTS TO BE ZERO IN LASSO REGRESSION.
best_lambda
[1] 0.007241287
#--------------------------------------------
# Model: high lambda
#--------------------------------------------
high_lambda <- .2

# standard linear regression without intercept (-1)
li.eq <- lm(y ~ X - 1)

# Lasso
la.eq <- glmnet(x = scaled_X, y = scaled_y,
                lambda = high_lambda,
                family = "gaussian",
                intercept = F, alpha = 1)

# Ridge
ri.eq <- glmnet(x = scaled_X, y = scaled_y,
                lambda = high_lambda,
                family = "gaussian",
                intercept = F, alpha = 0)

#--------------------------------------------
# Results (lambda = 0.2)
#--------------------------------------------
df_comp_high <- data.frame(beta   = beta,
                           Linear = li.eq$coefficients,
                           Lasso  = la.eq$beta[, 1],
                           Ridge  = ri.eq$beta[, 1])
df_comp_high
Yes, empirically we find support for the theory that a HIGH \(\lambda\) PENALISES COEFFICIENTS MORE, AND WE FIND THAT MANY MORE COEFFICIENTS (THAN 8) ARE ZERO IN LASSO REGRESSION.
Lasso (L1 Regularization)
Lasso applies an L1 penalty, which adds the absolute values of the coefficients to the loss function.
This L1 penalty has the effect of pushing some coefficients exactly to zero as λ increases. This is because the L1 norm creates a “sharp corner” at zero, making it easier for Lasso to eliminate variables completely by setting their coefficients to zero.
As a result, a high λ in Lasso leads to sparse solutions, where many coefficients are zero. This is why Lasso is effective for feature selection, as it “selects” important features by keeping their coefficients non-zero while setting others to zero.
Ridge (L2 Regularization)
Ridge regression uses an L2 penalty, which adds the square of the coefficients to the loss function.
Unlike Lasso, the L2 norm does not produce sparse solutions. Instead, it shrinks all coefficients closer to zero without setting any of them exactly to zero. The penalty term discourages large coefficients, but due to the smooth nature of the L2 norm, it affects all coefficients uniformly rather than pushing any to exactly zero.
Therefore, even with a high λ, Ridge will generally retain all predictors, albeit with smaller coefficient magnitudes. Ridge is less effective for feature selection but helps stabilize the model by reducing the impact of multicollinearity.
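This difference is easy to see from the coefficient paths: fit each model over a grid of \(\lambda\) values and plot how the coefficients shrink (a quick sketch, assuming the standardized data defined earlier):

# Coefficient paths over a grid of lambda values
lasso_path <- glmnet(x = scaled_X, y = scaled_y, alpha = 1, intercept = FALSE)
ridge_path <- glmnet(x = scaled_X, y = scaled_y, alpha = 0, intercept = FALSE)

par(mfrow = c(1, 2))
plot(lasso_path, xvar = "lambda", main = "Lasso: paths hit exactly zero")
plot(ridge_path, xvar = "lambda", main = "Ridge: paths shrink but stay non-zero")
par(mfrow = c(1, 1))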
(Low) \(\lambda\) = .005
A LOW \(\lambda\) SHOULD PENALISE COEFFICIENTS LESS, AND WE SHOULD FIND THAT MANY COEFFICIENTS ARE NOT ZERO (at least more than 8).
best_lambda
[1] 0.007241287
#--------------------------------------------
# Model: low lambda
#--------------------------------------------
lambda_low <- 0.005

# standard linear regression without intercept (-1)
li.eq <- lm(y ~ X - 1)

# Lasso
la.eq <- glmnet(x = X, y = y,
                lambda = lambda_low,
                family = "gaussian",
                intercept = F, alpha = 1)

# Ridge
ri.eq <- glmnet(x = X, y = y,
                lambda = lambda_low,
                family = "gaussian",
                intercept = F, alpha = 0)

#--------------------------------------------
# Results (lambda = 0.005)
#--------------------------------------------
df_comp_low <- data.frame(beta   = beta,
                          Linear = li.eq$coefficients,
                          Lasso  = la.eq$beta[, 1],
                          Ridge  = ri.eq$beta[, 1])
df_comp_low
Insufficient Information: When you have more predictors than observations, you don’t have enough data to uniquely determine each predictor’s effect on y. In other words, some columns of X can be expressed as linear combinations of others.
Perfect Multicollinearity: With more predictors than observations, you’re guaranteed to have multicollinearity because there are infinite ways to combine the predictors to fit the observations perfectly. Mathematically, it’s impossible to distinguish between multiple solutions that fit the data equally well.
In multicollinear models, predictors contain overlapping information.
As a result:
The model struggles to distinguish the individual effects of each predictor on the target variable.
Small changes in the data can lead to large fluctuations in the estimated coefficients.
This instability in coefficients results in predictions that may vary widely, especially if the model is used on new or slightly different data.
Ridge regression (L2 regularization) helps enhance predictive stability by:
Shrinking coefficients: Ridge regression penalizes large coefficients, forcing them to be closer to zero. This reduces the variance in coefficient estimates for highly correlated variables, stabilizing predictions.
Controlling sensitivity: By reducing the impact of correlated predictors, Ridge helps the model focus on the combined predictive power of the multicollinear variables rather than their individual, unstable contributions.
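A small simulation sketch of this stability point (illustrative only, with two highly correlated predictors and arbitrary settings): refit OLS and ridge on many bootstrap resamples and compare how much the first coefficient moves.

# Stability sketch: two highly correlated predictors, OLS vs ridge over resamples
set.seed(42)
n_sim <- 100
z1    <- rnorm(n_sim)
z2    <- z1 + rnorm(n_sim, sd = 0.05)   # z2 is almost a copy of z1
y_sim <- 1 + 2 * z1 + rnorm(n_sim)
X_sim <- cbind(z1, z2)

coef_ols <- replicate(200, {
  idx <- sample(n_sim, replace = TRUE)
  coef(lm(y_sim[idx] ~ X_sim[idx, ]))[2]   # OLS coefficient on z1
})
coef_ridge <- replicate(200, {
  idx <- sample(n_sim, replace = TRUE)
  as.numeric(glmnet(X_sim[idx, ], y_sim[idx], alpha = 0, lambda = 0.5)$beta[1, 1])
})

c(sd_OLS = sd(coef_ols), sd_Ridge = sd(coef_ridge))   # ridge estimates vary much less across resamples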