Lasso, Ridge, and Ordinary Least Squares (OLS) are all regression techniques used for modeling relationships between variables. Here’s a brief overview and comparison of these three methods:
Ordinary Least Squares (OLS):
OLS is the classic linear regression method that aims to minimize the sum of squared differences between observed and predicted values.
It estimates coefficients for predictor variables to fit a linear model that best represents the relationship with the response variable.
OLS doesn’t inherently address issues like multicollinearity or feature selection, potentially leading to overfitting when dealing with high-dimensional data.
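For reference, writing the response as \(y_i\), the \(p\) predictors as \(x_{ij}\), and the coefficients as \(\beta_j\), the OLS objective over \(n\) observations is
\[
\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}
\]
(the intercept is omitted here, matching the no-intercept fits used later in this post).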
Lasso Regression:
Lasso (Least Absolute Shrinkage and Selection Operator) extends OLS by adding a penalty term based on the absolute values of the coefficients.
Lasso regression seeks to minimize the following penalized objective:
\[
\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2} + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert .
\]
It encourages some coefficients to become exactly zero, leading to feature selection. This makes Lasso suitable when you suspect that only a subset of predictors is relevant.
Lasso is effective for high-dimensional data, as it can automatically exclude less important variables and prevent overfitting.
However, Lasso’s feature selection can be aggressive and might lead to biased coefficient estimates in certain cases.
Ridge Regression:
Ridge regression, like Lasso, extends OLS by adding a penalty term, but this term is based on the squared values of coefficients.
Ridge regression seeks to minimize the following penalized objective:
\[
\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2} + \lambda \sum_{j=1}^{p} \beta_j^{2} .
\]
Ridge helps address multicollinearity by shrinking coefficients, reducing the impact of correlated predictors.
Unlike Lasso, Ridge rarely forces coefficients to be exactly zero. It’s suitable when you want to retain most predictors but mitigate multicollinearity effects.
Ridge can provide more stable and less biased coefficient estimates compared to Lasso, especially when many predictors are correlated.
Feature Selection: OLS doesn’t perform feature selection. Lasso is well-suited for feature selection by setting some coefficients to zero, while Ridge retains most predictors.
Coefficient Values: Lasso can lead to sparse solutions with many coefficients being exactly zero, whereas Ridge generally shrinks coefficients towards zero without making them zero.
Multicollinearity: Ridge is particularly effective in handling multicollinearity by reducing the impact of correlated predictors, whereas Lasso might handle multicollinearity by excluding correlated variables entirely.
Model Complexity: OLS can lead to overfitting in high-dimensional settings, while Lasso and Ridge provide regularization to mitigate this issue.
Model Selection: Lasso can provide a simpler and more interpretable model by selecting a subset of predictors. Ridge retains more predictors and maintains stability.
In summary, OLS is a basic linear regression, while Lasso and Ridge add regularization to prevent overfitting and handle multicollinearity. Lasso is suitable for feature selection, while Ridge is useful for maintaining stability and addressing multicollinearity. The choice between them depends on your data characteristics and modeling goals.
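In R these three estimators map onto the functions used later in this post: lm() fits OLS, while in glmnet() the alpha argument is the elastic-net mixing parameter, with alpha = 0 giving Ridge and alpha = 1 giving Lasso. A minimal sketch on hypothetical toy data (x and y here are illustrative names, not the variables used below):
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5) # toy predictor matrix
y <- rnorm(100) # toy response
ols_fit <- lm(y ~ x) # ordinary least squares
lasso_fit <- glmnet(x = x, y = y, alpha = 1) # lasso: L1 penalty
ridge_fit <- glmnet(x = x, y = y, alpha = 0) # ridge: L2 penalty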
Each of the three methods therefore minimizes a different (penalized) sum of squared residuals (SSR); the simulation below illustrates how this changes the estimated coefficients.
# Clear the workspace
rm(list = ls()) # Clear environment - remove all files from your workspace
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 528427 28.3 1175155 62.8 NA 669271 35.8
## Vcells 975319 7.5 8388608 64.0 16384 1840171 14.1
cat("\f") # Clear the console
graphics.off() # clear all graphs
# Prepare needed libraries
packages <- c("glmnet", # used for regression
"caret", # used for modeling
"xgboost", # used for building XGBoost model
"ISLR", # data sets from 'An Introduction to Statistical Learning'
"dplyr", # used for data manipulation and joining
"tidyselect", # helpers for selecting variables
"stargazer", # presentation of data
"data.table", # used for reading and manipulation of data
"ggplot2", # used for plotting
"cowplot", # used for combining multiple plots
"e1071", # used for skewness
"psych" # used for describe()
)
for (i in 1:length(packages)) {
if (!packages[i] %in% rownames(installed.packages())) {
install.packages(packages[i]
, repos = "http://cran.rstudio.com/"
, dependencies = TRUE
)
}
library(packages[i], character.only = TRUE)
}
## Loading required package: Matrix
## Loaded glmnet 4.1-8
## Loading required package: ggplot2
## Loading required package: lattice
##
## The downloaded binary packages are in
## /var/folders/0h/kk_vktsd17q_p7867shj39zw0000gn/T//RtmpeiJaHR/downloaded_packages
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:xgboost':
##
## slice
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## also installing the dependencies 'formatR', 'lambda.r', 'futile.options', 'futile.logger', 'Cairo', 'gridGraphics', 'maps', 'PASWR', 'vdiffr', 'VennDiagram'
##
## The downloaded binary packages are in
## /var/folders/0h/kk_vktsd17q_p7867shj39zw0000gn/T//RtmpeiJaHR/downloaded_packages
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
rm(packages)
set.seed(7)
We will generate 20 columns and 500 rows of data from a normal distribution, and then standardize the data (different from normalization).
Standardization, also known as z-score normalization, scales our data to have a mean of 0 and a standard deviation of 1. By centering each feature around zero and rescaling it to unit variance, we make features with different units directly comparable. Standardization is particularly useful for algorithms that rely on distance measures, such as k-nearest neighbors and support vector machines.
Normalization, on the other hand, rescales the data to a range between 0 and 1. By mapping all features to a common interval, normalization prevents any single feature from dominating simply because of its scale. This technique is often used with models that are sensitive to input magnitudes, such as neural networks and clustering algorithms.
Standardization: When the features have different units or different scales, standardization ensures they are all on a comparable level, making it easier for the algorithm to learn patterns effectively.
Normalization: When the scale of the features is not critical, and you want to ensure that all features contribute equally to the model, normalization is a suitable choice.
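As a small illustration of the difference, here is a minimal sketch using a hypothetical numeric vector v: scale() performs standardization, while min-max normalization is usually written by hand in base R.
v <- c(2, 4, 6, 8, 10) # hypothetical example values
v_std <- scale(v) # standardization: mean 0, standard deviation 1
v_norm <- (v - min(v)) / (max(v) - min(v)) # normalization: rescaled to the [0, 1] interval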
N = 500 # number of observations
p = 20 # number of variables
#----------------------------------------
# X variable
#----------------------------------------
X = matrix(data = rnorm(n = N*p),
ncol = p
)
# before standardization
colMeans(x = X) # mean
## [1] 0.04500705 -0.03891039 -0.01788887 0.05515532 -0.02264947 0.01850257
## [7] 0.03769958 -0.01459774 -0.04957677 0.02517987 -0.01930050 0.02059231
## [13] -0.01839379 -0.01900492 0.05483110 0.06450308 0.02783687 -0.02600104
## [19] -0.06928138 -0.01514300
apply(X = X,
MARGIN = 2,
FUN = sd
) # standard deviation
## [1] 0.9981451 0.9654139 1.0183661 1.0266920 0.9848989 0.9892028 0.9722865
## [8] 1.0173771 0.9758915 1.0295796 1.0611173 0.9753777 1.0089758 1.0370566
## [15] 1.0093298 1.0454662 1.0325656 0.9664161 1.0155624 0.9920651
# scale : mean = 0, std=1
?scale
X = scale(x = X)
# after standardization
colMeans(x = X) # mean ~ 0
## [1] 5.839773e-17 8.881784e-19 1.321165e-17 3.552714e-17 1.709743e-17
## [6] 9.592327e-17 -4.496403e-17 2.398082e-17 4.130030e-17 -3.452794e-17
## [11] -2.176037e-17 6.106227e-18 -6.439294e-17 -1.043610e-17 -2.231548e-17
## [16] 1.509903e-17 -2.386980e-18 9.414691e-17 6.394885e-17 -8.881784e-18
apply(X = X,
MARGIN = 2,
FUN = sd
) # standard deviation = 1
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
df_X <- as.data.frame(X)
describe(df_X)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## V1 1 500 0 1 -0.02 0.00 1.02 -3.02 2.71 5.73 0.02 -0.21 0.04
## V2 2 500 0 1 -0.02 0.00 0.98 -2.68 3.11 5.79 0.08 -0.04 0.04
## V3 3 500 0 1 -0.02 -0.02 0.99 -2.61 3.10 5.71 0.18 -0.16 0.04
## V4 4 500 0 1 0.06 -0.01 0.95 -3.17 3.35 6.52 0.11 0.13 0.04
## V5 5 500 0 1 0.00 0.00 0.97 -2.94 2.85 5.79 -0.04 0.20 0.04
## V6 6 500 0 1 0.00 -0.01 0.91 -3.57 2.94 6.51 -0.04 0.22 0.04
## V7 7 500 0 1 0.03 0.02 0.99 -2.89 2.89 5.78 -0.18 -0.17 0.04
## V8 8 500 0 1 -0.01 0.00 1.05 -3.23 2.74 5.97 -0.01 -0.14 0.04
## V9 9 500 0 1 0.02 -0.02 0.97 -2.64 3.34 5.98 0.18 0.11 0.04
## V10 10 500 0 1 0.02 -0.01 0.91 -2.76 3.33 6.09 0.08 0.12 0.04
## V11 11 500 0 1 -0.01 0.01 1.09 -3.13 2.85 5.98 -0.05 -0.32 0.04
## V12 12 500 0 1 0.07 -0.01 1.06 -3.58 2.75 6.33 -0.01 -0.07 0.04
## V13 13 500 0 1 -0.02 0.00 0.93 -3.43 3.78 7.21 0.07 0.43 0.04
## V14 14 500 0 1 -0.01 0.00 1.01 -3.49 3.00 6.49 -0.02 0.05 0.04
## V15 15 500 0 1 0.02 0.01 0.96 -2.85 2.81 5.66 -0.05 -0.07 0.04
## V16 16 500 0 1 -0.03 -0.02 1.00 -3.09 3.05 6.14 0.13 0.08 0.04
## V17 17 500 0 1 -0.02 0.02 1.04 -3.83 2.80 6.63 -0.17 0.09 0.04
## V18 18 500 0 1 -0.01 0.00 0.93 -2.56 2.88 5.44 0.00 -0.23 0.04
## V19 19 500 0 1 0.02 0.02 1.02 -3.31 2.72 6.03 -0.20 -0.08 0.04
## V20 20 500 0 1 0.00 -0.01 1.04 -2.93 2.89 5.82 0.06 -0.27 0.04
#----------------------------------------
# Y variable
#----------------------------------------
beta = c(0.15, -0.33, 0.25, -0.25, 0.05,
rep(x = 0, times = p/2 - 5), # generates 5 zero coefficients
-0.25, 0.12, -0.125,
rep(x = 0, times = p/2 - 3) # generates 7 zero coefficients
)
# Y variable, standardized Y
y = X %*% beta + rnorm(n = N, sd = 0.5)
summary(y) # mean = 0.03133 (close to, but not exactly, 0)
## V1
## Min. :-1.93271
## 1st Qu.:-0.46681
## Median : 0.01224
## Mean : 0.03133
## 3rd Qu.: 0.49753
## Max. : 2.03125
y = scale(y)
summary(y) # mean=0
## V1
## Min. :-2.67528
## 1st Qu.:-0.67853
## Median :-0.02599
## Mean : 0.00000
## 3rd Qu.: 0.63502
## Max. : 2.72416
Running Lasso (Least Absolute Shrinkage and Selection Operator) or Ridge regression without standardization can lead to various issues and challenges, similar to those discussed earlier. Both Lasso and Ridge are regularization techniques used to prevent overfitting in linear regression models. When applied without standardization, the following problems may arise:
Variable Scale Impact: Like in the case of Lasso, Ridge regression is sensitive to the scale of predictor variables. Without standardization, variables with different scales can have unequal contributions to the regularization term, potentially biasing the coefficient estimates towards variables with larger scales.
Unfair Feature Selection: In Lasso, without standardization, variables with larger scales can be more likely to have their coefficients reduced to zero, leading to unfair feature selection. Similarly, in Ridge regression, variables with larger scales might dominate the penalty term, influencing the shrinkage of coefficients.
Bias in Coefficient Estimates: Ridge regression introduces a penalty term that shrinks coefficients towards zero. Without standardization, variables with larger scales can have larger initial coefficient values, and Ridge regression might disproportionately shrink coefficients for these variables, leading to biased estimates.
Interpretability: Without standardization, interpreting coefficients in Ridge and Lasso models becomes more challenging. The coefficients’ magnitudes do not directly indicate their impact on the response variable, making it harder to understand the relationships between predictors and the response.
Model Performance: The predictive performance of Ridge and Lasso models can be negatively affected by the issues mentioned above. Unstandardized variables might lead to suboptimal model fit and reduced generalization to new data.
To mitigate these issues and ensure the proper application of Ridge and Lasso, it’s advisable to standardize your predictor variables before running these regularization techniques. Standardization helps achieve fair treatment of variables with different scales and improves the interpretability and performance of your model.
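Note that glmnet() also has a standardize argument (TRUE by default), which standardizes the predictors internally before fitting and then reports the coefficients on the original scale of x. Because X and y have already been standardized by hand above, the two choices give essentially the same fit here; a minimal sketch of the two calls (object names are illustrative):
la.default <- glmnet(x = X, y = y, alpha = 1) # internal standardization (standardize = TRUE)
la.manual <- glmnet(x = X, y = y, alpha = 1, standardize = FALSE) # rely only on the manual scaling above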
In the context of regularization techniques like Lasso and Ridge regression, the term “optimal lambda” refers to the ideal value of the regularization parameter (often denoted as \(\lambda\)) that results in the best model performance. The goal of selecting an optimal lambda is to strike a balance between fitting the model well to the training data and preventing overfitting.
Preventing Overfitting: Regularization techniques like Lasso and Ridge add a penalty term to the regression coefficients. This penalty discourages large coefficient values, which in turn reduces the model’s tendency to overfit the training data. An optimal lambda helps find the right level of regularization to prevent overfitting, resulting in better generalization to new, unseen data.
Bias-Variance Trade-off: Regularization introduces a bias in the coefficient estimates, but it reduces the variance of the estimates. An optimal lambda finds the balance between bias and variance, improving the model’s overall predictive accuracy.
Feature Selection: Lasso, in particular, has the ability to set some coefficients exactly to zero, effectively performing feature selection. An optimal lambda helps identify which predictors are most important for the model, leading to a more interpretable and potentially simpler model.
Improved Model Performance: Selecting the right lambda can lead to improved model performance on both the training and validation data. It helps ensure that the model captures the underlying patterns in the data without fitting noise.
Regularization Strength: The value of lambda determines the strength of the regularization. Too small a lambda might result in insufficient regularization, while too large a lambda might lead to excessive shrinkage of coefficients. An optimal lambda helps find the appropriate level of regularization for the given data.
To find the optimal lambda, you typically use techniques like cross-validation. Cross-validation involves splitting your data into training and validation sets multiple times, training the model on different subsets of the data, and evaluating model performance for different lambda values. The lambda that results in the best performance (e.g., lowest validation error) is then selected as the optimal lambda.
In summary, selecting an optimal lambda is crucial for achieving a well-performing and generalizable model by effectively balancing the trade-off between bias and variance. It helps control model complexity, prevents overfitting, and improves the model’s ability to make accurate predictions on new data.
We standardized our variables above. Now let's find the best \(\lambda\) value for lasso regression and then run the optimal lasso model; it should neither keep more nor drop more variables than required.
?cv.glmnet
cv_model <- cv.glmnet(x = X,
y = y,
alpha = 1
)
#find optimal lambda value that minimizes test MSE
best_lambda <- cv_model$lambda.min
best_lambda
## [1] 0.01703865
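cv.glmnet() also stores the full cross-validation curve and a more conservative candidate, lambda.1se (the largest \(\lambda\) whose cross-validated error is within one standard error of the minimum); both can be inspected as a quick check.
plot(cv_model) # cross-validated MSE as a function of log(lambda)
cv_model$lambda.1se # more heavily regularized alternative to lambda.min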
#----------------------------------------
# Model (optimal lambda)
#----------------------------------------
lambda <- best_lambda
# standard linear regression without intercept(-1)
li.eq <- lm(y ~ X-1)
# lasso
la.eq <- glmnet(x = X,
y = y,
lambda = lambda,
family = "gaussian",
intercept = F,
alpha = 1
)
summary(la.eq) # this will not print out the coefficients
## Length Class Mode
## a0 1 -none- numeric
## beta 20 dgCMatrix S4
## df 1 -none- numeric
## dim 2 -none- numeric
## lambda 1 -none- numeric
## dev.ratio 1 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 7 -none- call
## nobs 1 -none- numeric
# STORING COEFFICIENTS CHOSEN BY LASSO FOR MULTIVARIATE LINEAR REGRESSION
?coef()
W <- as.matrix(coef(object = la.eq)) # coef() is a generic function that extracts model coefficients from objects returned by modeling functions; coefficients() is an alias for it
W
## s0
## (Intercept) 0.000000000
## V1 0.136592869
## V2 -0.418994519
## V3 0.301016245
## V4 -0.301625699
## V5 0.041825538
## V6 0.000000000
## V7 0.000000000
## V8 0.000000000
## V9 0.000000000
## V10 0.003259267
## V11 -0.264180292
## V12 0.171676077
## V13 -0.212543546
## V14 0.000000000
## V15 0.028580905
## V16 0.000000000
## V17 0.000000000
## V18 0.018956272
## V19 -0.019557887
## V20 0.000000000
keep_X <- rownames(W)[W!=0] # non-zero coefficients
keep_X <- keep_X[!keep_X == "(Intercept)"]
X_O <- df_X[,keep_X] # X <- X %>% select(all_of(keep_X))
X_O <- as.matrix(X_O)
df<-cbind.data.frame(y,X_O)
la.eq_O <- summary(lm(y~X_O,
data = df)
)
la.eq_O
##
## Call:
## lm(formula = y ~ X_O, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.80500 -0.44588 -0.05958 0.46867 1.82519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.913e-17 2.980e-02 0.000 1.0000
## X_OV1 1.572e-01 3.044e-02 5.163 3.54e-07 ***
## X_OV2 -4.403e-01 3.022e-02 -14.569 < 2e-16 ***
## X_OV3 3.138e-01 3.054e-02 10.275 < 2e-16 ***
## X_OV4 -3.228e-01 3.026e-02 -10.668 < 2e-16 ***
## X_OV5 5.531e-02 3.028e-02 1.827 0.0683 .
## X_OV10 2.011e-02 2.999e-02 0.671 0.5028
## X_OV11 -2.829e-01 3.025e-02 -9.354 < 2e-16 ***
## X_OV12 1.880e-01 3.010e-02 6.245 9.21e-10 ***
## X_OV13 -2.342e-01 3.018e-02 -7.759 5.08e-14 ***
## X_OV15 5.027e-02 3.039e-02 1.654 0.0988 .
## X_OV18 3.487e-02 3.034e-02 1.149 0.2511
## X_OV19 -3.744e-02 3.004e-02 -1.246 0.2133
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6663 on 487 degrees of freedom
## Multiple R-squared: 0.5668, Adjusted R-squared: 0.5561
## F-statistic: 53.09 on 12 and 487 DF, p-value: < 2.2e-16
# Ridge
ri.eq <- glmnet(x = X,
y = y,
lambda = lambda,
family = "gaussian",
intercept = F,
alpha = 0
)
#----------------------------------------
# Results (lambda = optimal)
#----------------------------------------
df.comp <- data.frame(
beta = beta,
Linear = li.eq$coefficients,
Lasso = la.eq$beta[,1],
Ridge = ri.eq$beta[,1]
)
df.comp
## beta Linear Lasso Ridge
## X1 0.150 0.1566710600 0.136592869 0.136592869
## X2 -0.330 -0.4402130694 -0.418994519 -0.418994519
## X3 0.250 0.3162461821 0.301016245 0.301016245
## X4 -0.250 -0.3236849125 -0.301625699 -0.301625699
## X5 0.050 0.0550973981 0.041825538 0.041825538
## X6 0.000 -0.0030195665 0.000000000 0.000000000
## X7 0.000 0.0051977077 0.000000000 0.000000000
## X8 0.000 -0.0054583898 0.000000000 0.000000000
## X9 0.000 -0.0008904859 0.000000000 0.000000000
## X10 0.000 0.0212436192 0.003259267 0.003259267
## X11 -0.250 -0.2834743489 -0.264180292 -0.264180292
## X12 0.120 0.1890030461 0.171676077 0.171676077
## X13 -0.125 -0.2332500827 -0.212543546 -0.212543546
## X14 0.000 0.0151522767 0.000000000 0.000000000
## X15 0.000 0.0516039352 0.028580905 0.028580905
## X16 0.000 -0.0125642827 0.000000000 0.000000000
## X17 0.000 -0.0099729228 0.000000000 0.000000000
## X18 0.000 0.0338450690 0.018956272 0.018956272
## X19 0.000 -0.0367888278 -0.019557887 -0.019557887
## X20 0.000 0.0190892658 0.000000000 0.000000000
Eight variables are dropped by the lasso regression at the optimal \(\lambda\).
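As a quick check, this count can be read directly off the coefficient matrix W computed above:
sum(W[rownames(W) != "(Intercept)", ] == 0) # number of lasso coefficients shrunk exactly to zero (8)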
#----------------------------------------
# Model High (lambda = 0.1)
#----------------------------------------
lambda <- 0.1
# standard linear regression without intercept(-1)
li.eq <- lm(y ~ X-1)
# lasso
la.eq <- glmnet(x = X,
y = y,
lambda = lambda,
family = "gaussian",
intercept = F,
alpha=1
)
# STORING COEFFICIENTS CHOSEN BY LASSO FOR MULTIVARIATE LINEAR REGRESSION
W <- as.matrix(coef(la.eq))
W
## s0
## (Intercept) 0.00000000
## V1 0.04482387
## V2 -0.32337856
## V3 0.23536128
## V4 -0.20667337
## V5 0.00000000
## V6 0.00000000
## V7 0.00000000
## V8 0.00000000
## V9 0.00000000
## V10 0.00000000
## V11 -0.17788891
## V12 0.08788599
## V13 -0.11004909
## V14 0.00000000
## V15 0.00000000
## V16 0.00000000
## V17 0.00000000
## V18 0.00000000
## V19 0.00000000
## V20 0.00000000
keep_X <- rownames(W)[W!=0] # non-zero coefficients
keep_X <- keep_X[!keep_X == "(Intercept)"]
X_H <- df_X[,keep_X] # X <- X %>% select(all_of(keep_X))
X_H <- as.matrix(X_H)
la.eq_H <- summary(lm(y~X_H))
# Ridge
ri.eq <- glmnet(x = X, y = y, lambda = lambda,
family="gaussian",
intercept = F, alpha=0)
#----------------------------------------
# Results (lambda = 0.1)
#----------------------------------------
df.comp <- data.frame(
beta = beta,
Linear = li.eq$coefficients,
Lasso = la.eq$beta[,1],
Ridge = ri.eq$beta[,1]
)
df.comp
## beta Linear Lasso Ridge
## X1 0.150 0.1566710600 0.04482387 0.1405302583
## X2 -0.330 -0.4402130694 -0.32337856 -0.3949679206
## X3 0.250 0.3162461821 0.23536128 0.2879825888
## X4 -0.250 -0.3236849125 -0.20667337 -0.2891646013
## X5 0.050 0.0550973981 0.00000000 0.0535982192
## X6 0.000 -0.0030195665 0.00000000 0.0007083008
## X7 0.000 0.0051977077 0.00000000 0.0037154896
## X8 0.000 -0.0054583898 0.00000000 -0.0068741562
## X9 0.000 -0.0008904859 0.00000000 0.0107655152
## X10 0.000 0.0212436192 0.00000000 0.0198926148
## X11 -0.250 -0.2834743489 -0.17788891 -0.2581351300
## X12 0.120 0.1890030461 0.08788599 0.1713124502
## X13 -0.125 -0.2332500827 -0.11004909 -0.2051300527
## X14 0.000 0.0151522767 0.00000000 0.0159466876
## X15 0.000 0.0516039352 0.00000000 0.0409112026
## X16 0.000 -0.0125642827 0.00000000 -0.0094756029
## X17 0.000 -0.0099729228 0.00000000 -0.0063321392
## X18 0.000 0.0338450690 0.00000000 0.0284737200
## X19 0.000 -0.0367888278 0.00000000 -0.0310696451
## X20 0.000 0.0190892658 0.00000000 0.0147697568
Yes, empirically we find support for the theory that a high \(\lambda\) penalises the coefficients more heavily: many more coefficients (13, versus 8 at the optimal \(\lambda\)) are set to zero in the lasso regression.
#----------------------------------------
# Model Low (lambda = 0.01)
#----------------------------------------
lambda <- 0.01
# standard linear regression without intercept(-1)
li.eq <- lm(y ~ X-1)
# lasso
la.eq <- glmnet(x = X, y = y, lambda = lambda,
family = "gaussian",
intercept = F, alpha=1)
# STORING COEFFICIENTS CHOSEN BY LASSO FOR MULTIVARIATE LINEAR REGRESSION
W <- as.matrix(coef(la.eq))
W
## s0
## (Intercept) 0.000000000
## V1 0.144921927
## V2 -0.427707722
## V3 0.306846347
## V4 -0.310381539
## V5 0.047414397
## V6 0.000000000
## V7 0.000000000
## V8 0.000000000
## V9 0.000000000
## V10 0.010864350
## V11 -0.272127426
## V12 0.178832322
## V13 -0.221147773
## V14 0.004684801
## V15 0.037903623
## V16 -0.002668058
## V17 0.000000000
## V18 0.025204152
## V19 -0.026651162
## V20 0.007403530
keep_X <- rownames(W)[W!=0] # non-zero coefficients
keep_X <- keep_X[!keep_X == "(Intercept)"]
X_L <- df_X[,keep_X] # X <- X %>% select(all_of(keep_X))
X_L <- as.matrix(X_L)
la.eq_L <- summary(lm(y~X_L))
# Ridge
ri.eq <- glmnet(x = X, y = y, lambda = lambda,
family="gaussian",
intercept = F, alpha=0)
#----------------------------------------
# Results (lambda = 0.01)
#----------------------------------------
df.comp <- data.frame(
beta = beta,
Linear = li.eq$coefficients,
Lasso = la.eq$beta[,1],
Ridge = ri.eq$beta[,1]
)
df.comp
## beta Linear Lasso Ridge
## X1 0.150 0.1566710600 0.144921927 0.1548801868
## X2 -0.330 -0.4402130694 -0.427707722 -0.4352057341
## X3 0.250 0.3162461821 0.306846347 0.3131679819
## X4 -0.250 -0.3236849125 -0.310381539 -0.3198457579
## X5 0.050 0.0550973981 0.047414397 0.0549577070
## X6 0.000 -0.0030195665 0.000000000 -0.0025620970
## X7 0.000 0.0051977077 0.000000000 0.0050199233
## X8 0.000 -0.0054583898 0.000000000 -0.0056435637
## X9 0.000 -0.0008904859 0.000000000 0.0005165105
## X10 0.000 0.0212436192 0.010864350 0.0211110885
## X11 -0.250 -0.2834743489 -0.272127426 -0.2807031636
## X12 0.120 0.1890030461 0.178832322 0.1870659626
## X13 -0.125 -0.2332500827 -0.221147773 -0.2300890977
## X14 0.000 0.0151522767 0.004684801 0.0152642104
## X15 0.000 0.0516039352 0.037903623 0.0503643459
## X16 0.000 -0.0125642827 -0.002668058 -0.0122166814
## X17 0.000 -0.0099729228 0.000000000 -0.0095452779
## X18 0.000 0.0338450690 0.025204152 0.0332329682
## X19 0.000 -0.0367888278 -0.026651162 -0.0361341629
## X20 0.000 0.0190892658 0.007403530 0.0185976370
Yes, empirically we find support for the theory that a low \(\lambda\) penalises the coefficients less, so the lasso keeps more variables than the optimal model: only 5 coefficients are set to zero, versus 8 at the optimal \(\lambda\).
The stargazer package has to be adapted a little to work with lasso.