Why is having more predictors (p) than observations (n) an issue?
Theory
In linear regression, when the number of independent variables (predictors) exceeds the number of observations (data points), the model becomes overparameterized, leading to an inability to estimate the coefficients (betas) and their standard errors correctly.
There are multiple ways to understand why this is the case -
Mathematical Impossibility — No Unique Solution: you need more observations than predictors to estimate the coefficients uniquely, and here there are not enough observations.
\[
\hat{\beta} = (X'X)^{-1} X' y
\]
If the number of predictors p is greater than the number of observations n, the matrix \(X'X\) is singular (not invertible). This is because X does not contain enough information to uniquely determine the coefficients. Imagine trying to solve for 11 variables with only 10 equations — there is no unique solution, as the system is underdetermined.
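As a quick illustration (a minimal sketch with simulated data, not taken from the example later in this post), you can check in R that \(X'X\) loses rank and cannot be inverted once \(p>n\):

set.seed(1)
n_toy <- 10; p_toy <- 12                        # more predictors than observations
X_toy <- matrix(rnorm(n_toy * p_toy), nrow = n_toy)   # 10 x 12 design matrix

qr(crossprod(X_toy))$rank                       # rank is at most n_toy = 10, not p_toy = 12
try(solve(crossprod(X_toy)))                    # inverting X'X fails: the matrix is singular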
Degrees of Freedom — Lack of Flexibility:
Degrees of freedom in regression are calculated as \(n - (p + 1)\), where n is the number of observations, and p is the number of predictors.
When \(p \geq n\), the degrees of freedom become zero or negative. Degrees of freedom represent the amount of independent information available to estimate the coefficients. Without enough degrees of freedom, you can’t estimate the variance or standard errors accurately, leading to undefined standard errors.
In simpler terms, you need enough data points to account for each predictor’s effect. If there aren’t enough data points, the model lacks sufficient information to separate out the effects of each predictor accurately.
Perfect Multicollinearity — Infinite Solutions:
Perfect multicollinearity occurs when one or more predictors are exact linear combinations of others. If you have more predictors than observations, some predictors can be expressed as combinations of others, leading to perfect multicollinearity.
This means there are infinite solutions for the coefficients \(\beta\), and the regression algorithm cannot determine which is correct. For example, \(\beta_1\) and \(\beta_2\) might both be adjusted in different ways to still perfectly fit the data, creating instability in the model.
Empirical
We need the glmnet package.
# Clear the workspace
rm(list = ls())      # Clear environment - remove all objects from your workspace
invisible(gc())      # Clear unused memory
cat("\f")            # Clear the console
graphics.off()       # Clear all graphs

# Prepare needed libraries
packages <- c("glmnet",     # used for regression
              "caret",      # used for modeling
              "xgboost",    # used for building XGBoost model
              "ISLR",
              "dplyr",      # used for data manipulation and joining
              "tidyselect",
              "stargazer",  # presentation of data
              "data.table", # used for reading and manipulation of data
              "ggplot2",    # used for plotting
              "cowplot",    # used for combining multiple plots
              "e1071",      # used for skewness
              "psych",
              "car")

suppressPackageStartupMessages({
  for (i in 1:length(packages)) {
    if (!packages[i] %in% rownames(installed.packages())) {
      install.packages(packages[i],
                       repos = "http://cran.rstudio.com/",
                       dependencies = TRUE)
    }
    library(packages[i], character.only = TRUE)
  }
})
rm(packages)
set.seed(7)
Generate fake/simulated data
You could also use a real dataset where \(p>n\); here we simulate one instead.
# Clear the workspace by removing all objects from the environment
remove(list = ls())

# Load the help page for distribution functions
?distribution
# Generate response variable 'y' from a Student's t-distribution with 10 observations and 7 degrees of freedom
y <- rt(n = 10, df = 7)

# Generate predictor variables from different distributions:

# Uniform distributions for x1, x2, and x3:
x1 <- runif(n = 10, min = 1, max = 7)   # 10 random values between 1 and 7
x2 <- runif(n = 10, min = 2, max = 8)
x3 <- runif(n = 10, min = 3, max = 9)

# Binomial distributions for x4, x5, and x6:
x4 <- rbinom(n = 10, size = 14, prob = 0.7)   # 10 random values, each the count of successes in 14 trials with probability 0.7
x5 <- rbinom(n = 10, size = 15, prob = 0.8)
x6 <- rbinom(n = 10, size = 16, prob = 0.9)

# Poisson distributions for x7 to x11:
x7  <- rpois(n = 10, lambda = 5)   # 10 random values with an average rate of 5
x8  <- rpois(n = 10, lambda = 6)
x9  <- rpois(n = 10, lambda = 7)
x10 <- rpois(n = 10, lambda = 8)
x11 <- rpois(n = 10, lambda = 9)
Once you have the data, let's run some regressions -
Case I: \(p<n\) (reg1 has \(p=8+1\) and \(n=10\)). We will get all \(\beta\) and all \(se(\beta)\).
Case II: \(p=n\) (reg2 has \(p=9+1\) and \(n=10\)). We will get all \(\beta\), BUT \(se(\beta)\) is an issue.
Case III: \(p>n\) (reg3 has \(p=10+1\) and \(n=10\)). We will NOT get all \(\beta\), AND \(se(\beta)\) is an issue.
reg1 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8)
reg2 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9)
reg3 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10)
# reg4 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11)

stargazer(reg1, reg2, reg3,
          # reg4,
          type = "text")
As you can see, reg3 is not able to estimate all betas.
In fact, standard errors are an issue from reg2 onwards.
Removing the intercept term
This changes p in each of the three regressions, but the story stays the same: you cannot estimate all coefficients (\(\beta\)) when \(p>n\), and you cannot estimate standard errors (\(se\)) when \(p \geq n\).
reg1 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 - 1)
reg2 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 - 1)
reg3 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 - 1)
reg4 <- lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 - 1)

stargazer(reg1, reg2, reg3, reg4, type = "text")
Lasso, Ridge, and Ordinary Least Squares (OLS) are all regression techniques used for modeling relationships between variables. Here’s a brief overview and comparison of these three methods:
Ordinary Least Squares (OLS):
OLS is the classic linear regression method that aims to minimize the sum of squared differences between observed and predicted values.
It estimates coefficients for predictor variables to fit a linear model that best represents the relationship with the response variable.
OLS doesn’t inherently address issues like multicollinearity or feature selection, potentially leading to overfitting when dealing with high-dimensional data.
Lasso Regression:
Lasso (Least Absolute Shrinkage and Selection Operator) extends OLS by adding a penalty term based on the absolute values of the coefficients.
Lasso regression seeks to minimize the following:
\[
RSS + \lambda \sum_{j=1}^{p} |\beta_j|
\]
It encourages some coefficients to become exactly zero, leading to feature selection. This makes Lasso suitable when you suspect that only a subset of predictors is relevant.
Lasso is effective for high-dimensional data, as it can automatically exclude less important variables and prevent overfitting.
However, Lasso’s feature selection can be aggressive and might lead to biased coefficient estimates in certain cases.
Use Case: Lasso is useful when you have many features, and you suspect that some may be irrelevant. It automatically selects important features by zeroing out less important coefficients.
Ridge Regression:
Ridge regression, like Lasso, extends OLS by adding a penalty term, but this term is based on the squared values of coefficients.
Ridge regression seeks to minimize the following:
\[
RSS + \lambda \sum_{j=1}^{p} \beta_j^2
\]
Ridge helps address multicollinearity by shrinking coefficients, reducing the impact of correlated predictors.
Unlike Lasso, Ridge rarely forces coefficients to be exactly zero. It’s suitable when you want to retain most predictors but mitigate multicollinearity effects.
Ridge can provide more stable and less biased coefficient estimates compared to Lasso, especially when many predictors are correlated.
Use Case: Ridge is useful when all features are potentially relevant, and you want to reduce the magnitude of coefficients to make the model more stable and generalizable.
Comparison of OLS, Lasso and Ridge
OLS is a basic linear regression, while Lasso and Ridge add regularization to prevent overfitting and handle multicollinearity. Lasso is suitable for feature selection, while Ridge is useful for maintaining stability and addressing multicollinearity. The choice between them depends on your data characteristics and modeling goals.
Feature Selection: OLS doesn’t perform feature selection. Lasso is well-suited for feature selection by setting some coefficients to zero, while Ridge retains most predictors.
Coefficient Values: Lasso can lead to sparse solutions with many coefficients being exactly zero, whereas Ridge generally shrinks coefficients towards zero without making them zero.
Multicollinearity: Ridge is particularly effective in handling multicollinearity by reducing the impact of correlated predictors, whereas Lasso might handle multicollinearity by excluding correlated variables entirely.
Model Complexity: OLS can lead to overfitting in high-dimensional settings, while Lasso and Ridge provide regularization to mitigate this issue.
Model Selection: Lasso can provide a simpler and more interpretable model by selecting a subset of predictors. Ridge retains more predictors and maintains predictive stability.
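To make the two penalties concrete, here is a small sketch (illustrative only) of the objective functions written out in R, following the formulas above; note that glmnet uses its own internal scaling of the loss and its own optimizer:

# Penalized objective functions for a given coefficient vector b (illustrative sketch)
rss <- function(b, X, y) sum((y - X %*% b)^2)

lasso_objective <- function(b, X, y, lambda) rss(b, X, y) + lambda * sum(abs(b))   # RSS + L1 penalty
ridge_objective <- function(b, X, y, lambda) rss(b, X, y) + lambda * sum(b^2)      # RSS + L2 penalty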
Data
Independent Variables
We will generate 20 columns and 500 rows of data (drawn from a Poisson distribution in the code below, with a normal alternative commented out), and then standardize the data (standardization is different from normalization).
Standardization: Scaling to Zero Mean and Unit Variance
Standardization, also known as z-score normalization, scales our data to have a mean of 0 and a standard deviation of 1. By centering the data around zero and squeezing it to a consistent range, we achieve balanced and comparable features. Standardization is particularly useful for algorithms that rely on distance measures, such as k-nearest neighbors and support vector machines.
Normalization: Mapping to a Common Range
Normalization, on the other hand, scales the data to a range between 0 and 1. By mapping the features to a common interval, normalization helps to emphasize the relative importance of different features. This technique is beneficial for algorithms that rely on weight-sensitive models, like neural networks and clustering algorithms.
The decision to standardize or normalize depends on the characteristics of the dataset and the requirements of the chosen machine learning algorithm. Here are some guidelines to consider:
Standardization: When the features have different units or different scales, standardization ensures they are all on a comparable level, making it easier for the algorithm to learn patterns effectively.
Normalization: When the scale of the features is not critical, and you want to ensure that all features contribute equally to the model, normalization is a suitable choice.
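A small sketch of the two transformations on a single numeric vector (hypothetical values, shown only to contrast the two):

x <- c(10, 20, 30, 40, 100)

# Standardization (z-scores): mean 0, standard deviation 1
x_std <- (x - mean(x)) / sd(x)                # equivalently: as.vector(scale(x))

# Normalization (min-max): values mapped to the [0, 1] range
x_norm <- (x - min(x)) / (max(x) - min(x))

round(x_std, 2); round(x_norm, 2)

With that distinction in mind, we standardize (rather than normalize) the simulated data below.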
N <- 500   # number of observations
p <- 20    # number of variables

#--------------------------------------------
# X variable
#--------------------------------------------
?rpois
X = matrix(data = rpois(n = N * p, lambda = 7),   # rnorm(n = N*p) would give normal draws instead
           ncol = p)

# before standardization
colMeans(x = X)   # mean
#--------------------------------------------
# Y variable
#--------------------------------------------
beta <- c(0.15, -0.33, 0.25, -0.25, 0.05,   # initial 5 specific coefficients
          rep(0, times = (p/2) - 5),        # fills the next 5 coefficients with zeros
          rep(0, times = (p/2) - 5),        # fills the next 5 coefficients with zeros
          -0.25, 0.12, -0.125, .33, .342)   # last 5 specific coefficients
beta
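The later code refers to y, scaled_X, scaled_y and df_scaledX, whose construction is not shown above. Here is a minimal sketch of that step, assuming the response is built as \(X\beta\) plus Gaussian noise and that standardization is done with scale():

# Assumed data-generating process: y = X * beta + noise
y <- as.vector(X %*% beta + rnorm(n = N))

# Assumed standardization step: each column gets mean 0 and sd 1
scaled_X   <- scale(X)
scaled_y   <- scale(y)
df_scaledX <- as.data.frame(scaled_X)

summary(scaled_y)   # after scaling, the mean is (numerically) zero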
V1
Min. :-2.60654
1st Qu.:-0.70959
Median :-0.05153
Mean : 0.00000
3rd Qu.: 0.61834
Max. : 3.24174
What happens when you run lasso/ridge without standardization?
Running Lasso (Least Absolute Shrinkage and Selection Operator) or Ridge regression without standardization can lead to various issues and challenges, similar to those discussed earlier. Both Lasso and Ridge are regularization techniques used to prevent overfitting in linear regression models. When applied without standardization, the following problems may arise:
Variable Scale Impact: Like in the case of Lasso, Ridge regression is sensitive to the scale of predictor variables. Without standardization, variables with different scales can have unequal contributions to the regularization term, potentially biasing the coefficient estimates towards variables with larger scales.
Unfair Feature Selection: In Lasso, without standardization, variables with larger scales can be more likely to have their coefficients reduced to zero, leading to unfair feature selection. Similarly, in Ridge regression, variables with larger scales might dominate the penalty term, influencing the shrinkage of coefficients.
Bias in Coefficient Estimates: Ridge regression introduces a penalty term that shrinks coefficients towards zero. Without standardization, variables with larger scales can have larger initial coefficient values, and Ridge regression might disproportionately shrink coefficients for these variables, leading to biased estimates.
Interpretability: Without standardization, interpreting coefficients in Ridge and Lasso models becomes more challenging. The coefficients’ magnitudes do not directly indicate their impact on the response variable, making it harder to understand the relationships between predictors and the response.
Model Performance: The predictive performance of Ridge and Lasso models can be negatively affected by the issues mentioned above. Unstandardized variables might lead to suboptimal model fit and reduced generalization to new data.
To mitigate these issues and ensure the proper application of Ridge and Lasso, it’s advisable to standardize your predictor variables before running these regularization techniques. Standardization helps achieve fair treatment of variables with different scales and improves the interpretability and performance of your model.
Choosing the Tuning Parameter \(\lambda\) -
In the context of regularization techniques like Lasso and Ridge regression, the term “optimal lambda” refers to the ideal value of the regularization parameter (often denoted as \(\lambda\) ) that results in the best model performance. The goal of selecting an optimal lambda is to strike a balance between fitting the model well to the training data and preventing overfitting.
Here’s why selecting an optimal lambda is important:
Preventing Overfitting: Regularization techniques like Lasso and Ridge add a penalty term to the regression coefficients. This penalty discourages large coefficient values, which in turn reduces the model’s tendency to overfit the training data. An optimal lambda helps find the right level of regularization to prevent overfitting, resulting in better generalization to new, unseen data.
Bias-Variance Trade-off: Regularization introduces a bias in the coefficient estimates, but it reduces the variance of the estimates. An optimal lambda finds the balance between bias and variance, improving the model’s overall predictive accuracy.
Feature Selection: Lasso, in particular, has the ability to set some coefficients exactly to zero, effectively performing feature selection. An optimal lambda helps identify which predictors are most important for the model, leading to a more interpretable and potentially simpler model.
Improved Model Performance: Selecting the right lambda can lead to improved model performance on both the training and validation data. It helps ensure that the model captures the underlying patterns in the data without fitting noise.
Regularization Strength: The value of lambda determines the strength of the regularization. Too small a lambda might result in insufficient regularization, while too large a lambda might lead to excessive shrinkage of coefficients. An optimal lambda helps find the appropriate level of regularization for the given data.
To find the optimal lambda, you typically use techniques like cross-validation. Cross-validation involves splitting your data into training and validation sets multiple times, training the model on different subsets of the data, and evaluating model performance for different lambda values. The lambda that results in the best performance (e.g., lowest validation error) is then selected as the optimal lambda.
In summary, selecting an optimal lambda is crucial for achieving a well-performing and generalizable model by effectively balancing the trade-off between bias and variance. It helps control model complexity, prevents overfitting, and improves the model’s ability to make accurate predictions on new data.
(Optimal) Lambda
We standardized our variables above.
Now let's find the best \(\lambda\) value for lasso regression, and then run the optimal lasso model - it should neither keep more variables nor drop more variables than required.
The cv.glmnet function from the glmnet package in R is used for cross-validated Lasso or Ridge regression. This function finds the optimal value of the regularization parameter \(\lambda\) that minimizes the mean squared error (MSE) in a predictive model.
?cv.glmnet
cv_model <- cv.glmnet(x = scaled_X, y = scaled_y, alpha = 1)

# find optimal lambda value that minimizes test MSE
best_lambda <- cv_model$lambda.min
best_lambda
[1] 0.007241287
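You can also inspect the cross-validation curve directly; cv.glmnet additionally reports lambda.1se, the largest \(\lambda\) whose error is within one standard error of the minimum, which some prefer for a more parsimonious model:

plot(cv_model)        # cross-validated MSE as a function of log(lambda), with error bars
cv_model$lambda.1se   # largest lambda within one standard error of the minimum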
Let's apply lasso/ridge. In the context of glmnet, the alpha parameter controls the type of regularization applied in the model:
alpha = 1 corresponds to Lasso regression (L1 regularization).
alpha = 0 corresponds to Ridge regression (L2 regularization).
Values between 0 and 1 apply Elastic Net regularization, which is a combination of both Lasso and Ridge.
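For instance, a value in between (here alpha = 0.5, an arbitrary choice purely for illustration) gives an elastic-net fit:

# Elastic net: a 50/50 mix of the L1 and L2 penalties (alpha chosen arbitrarily here)
en.eq <- glmnet(x = scaled_X, y = scaled_y,
                lambda = best_lambda, alpha = 0.5,
                family = "gaussian", intercept = FALSE)

Below, we stick to the two pure penalties and compare them with plain OLS.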
# standard linear regression without intercept (-1)
li.eq <- lm(y ~ X - 1)

?glmnet

# Lasso
la.eq <- glmnet(x = scaled_X,
                y = scaled_y,
                lambda = best_lambda,
                family = "gaussian",   # specify the error family if you know its distribution
                intercept = FALSE,     # since we are comparing with OLS without an intercept
                alpha = 1)
summary(la.eq)   # this will not print out the coefficients
# Ridge
ri.eq <- glmnet(x = scaled_X,
                y = scaled_y,
                lambda = best_lambda,
                family = "gaussian",   # specify the error family if you know its distribution
                intercept = FALSE,
                alpha = 0)

#--------------------------------------------
# Results (lambda = optimal)
#--------------------------------------------
df_comp_optimal <- data.frame(beta   = beta,
                              Linear = li.eq$coefficients,
                              Lasso  = la.eq$beta[, 1],
                              Ridge  = ri.eq$beta[, 1])
df_comp_optimal
In glmnet, the family parameter specifies the type of model to fit, and while it’s not strictly required, it’s recommended to specify it for clarity and accuracy. By default, glmnet assumes "gaussian" if family is not provided, which is suitable for linear regression.
Common Options for family in glmnet
family = "gaussian": For linear regression (default).
family = "binomial": For logistic regression in binary classification tasks.
family = "multinomial": For multinomial logistic regression in multi-class classification tasks.
family = "poisson": For count data models.
family = "cox": For survival analysis with Cox proportional hazards.
In Lasso/Ridge regression using the glmnet package in R, standard errors for the coefficient estimates are not provided directly because the method focuses on shrinkage and doesn't compute standard errors as in ordinary least squares (OLS) regression. However, there are a few ways to approximate or obtain standard errors for Ridge/Lasso coefficients, such as bootstrapping and/or analytical approximations. It might be simplest to just run an OLS with the variables suggested by lasso/ridge and look at those standard errors.
# Extract non-zero coefficients from the Lasso regression results
# (convert the coefficient matrix to a regular matrix format for easier manipulation)
W <- as.matrix(coef(la.eq))

# Identify predictors with non-zero coefficients, excluding the intercept
keep_X <- rownames(W)[W != 0 & rownames(W) != "(Intercept)"]

# Subset the original data to include only predictors with non-zero coefficients
X_O <- as.matrix(df_scaledX[, keep_X])

# Combine target variable `y` with the selected predictors into a data frame
df <- cbind.data.frame(y, X_O)

# Fit a linear model using only the selected predictors, without an intercept
la.eq_O <- summary(lm(y ~ X_O - 1, data = df))
la.eq_O
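Alternatively, the bootstrap idea mentioned above can be sketched as follows (an illustrative approximation, not a substitute for proper post-selection inference): refit the lasso at the chosen \(\lambda\) on resampled rows and look at the spread of each coefficient.

# Bootstrap sketch: refit the lasso on resampled rows and collect the coefficients
B <- 200
boot_betas <- replicate(B, {
  idx <- sample(nrow(scaled_X), replace = TRUE)
  fit <- glmnet(x = scaled_X[idx, ], y = scaled_y[idx],
                lambda = best_lambda, alpha = 1,
                family = "gaussian", intercept = FALSE)
  as.vector(fit$beta)
})
boot_se <- apply(boot_betas, 1, sd)   # rough standard error for each coefficient
boot_se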
(High) \(\lambda\) = .2
A HIGH \(\lambda\) SHOULD PENALISE COEFFICIENTS MORE, AND WE SHOULD FIND MANY COEFFICIENTS TO BE ZERO IN LASSO REGRESSION.
best_lambda
[1] 0.007241287
#--------------------------------------------
# Model: high lambda
#--------------------------------------------
high_lambda <- .2

# standard linear regression without intercept (-1)
li.eq <- lm(y ~ X - 1)

# Lasso
la.eq <- glmnet(x = scaled_X, y = scaled_y,
                lambda = high_lambda,
                family = "gaussian",
                intercept = F, alpha = 1)

# Ridge
ri.eq <- glmnet(x = scaled_X, y = scaled_y,
                lambda = high_lambda,
                family = "gaussian",
                intercept = F, alpha = 0)

#--------------------------------------------
# Results (lambda = 0.2)
#--------------------------------------------
df_comp_high <- data.frame(beta   = beta,
                           Linear = li.eq$coefficients,
                           Lasso  = la.eq$beta[, 1],
                           Ridge  = ri.eq$beta[, 1])
df_comp_high
Yes, empirically we find support for the theory that a HIGH \(\lambda\) PENALISES COEFFICIENTS MORE, AND WE FIND THAT MANY MORE COEFFICIENTS (THAN 8) ARE ZERO IN LASSO REGRESSION.
Lasso (L1 Regularization)
Lasso applies an L1 penalty, which adds the absolute values of the coefficients to the loss function.
This L1 penalty has the effect of pushing some coefficients exactly to zero as λ increases. This is because the L1 norm creates a “sharp corner” at zero, making it easier for Lasso to eliminate variables completely by setting their coefficients to zero.
As a result, a high λ in Lasso leads to sparse solutions, where many coefficients are zero. This is why Lasso is effective for feature selection, as it “selects” important features by keeping their coefficients non-zero while setting others to zero.
Ridge (L2 Regularization)
Ridge regression uses an L2 penalty, which adds the square of the coefficients to the loss function.
Unlike Lasso, the L2 norm does not produce sparse solutions. Instead, it shrinks all coefficients closer to zero without setting any of them exactly to zero. The penalty term discourages large coefficients, but due to the smooth nature of the L2 norm, it affects all coefficients uniformly rather than pushing any to exactly zero.
Therefore, even with a high λ, Ridge will generally retain all predictors, albeit with smaller coefficient magnitudes. Ridge is less effective for feature selection but helps stabilize the model by reducing the impact of multicollinearity.
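This difference is easy to see from the coefficient paths: fit each model over a grid of \(\lambda\) values and plot how the coefficients shrink (a quick sketch, assuming the standardized data defined earlier):

# Coefficient paths over a grid of lambda values
lasso_path <- glmnet(x = scaled_X, y = scaled_y, alpha = 1, intercept = FALSE)
ridge_path <- glmnet(x = scaled_X, y = scaled_y, alpha = 0, intercept = FALSE)

par(mfrow = c(1, 2))
plot(lasso_path, xvar = "lambda", main = "Lasso: paths hit exactly zero")
plot(ridge_path, xvar = "lambda", main = "Ridge: paths shrink but stay non-zero")
par(mfrow = c(1, 1))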
(Low) \(\lambda\) = .005
A LOW \(\lambda\) SHOULD PENALISE COEFFICIENTS LESS, AND WE SHOULD FIND THAT MANY COEFFICIENTS ARE NOT ZERO (at least more than 8).
best_lambda
[1] 0.007241287
#--------------------------------------------
# Model: low lambda
#--------------------------------------------
lambda_low <- 0.005

# standard linear regression without intercept (-1)
li.eq <- lm(y ~ X - 1)

# Lasso
la.eq <- glmnet(x = X, y = y,
                lambda = lambda_low,
                family = "gaussian",
                intercept = F, alpha = 1)

# Ridge
ri.eq <- glmnet(x = X, y = y,
                lambda = lambda_low,
                family = "gaussian",
                intercept = F, alpha = 0)

#--------------------------------------------
# Results (lambda = 0.005)
#--------------------------------------------
df_comp_low <- data.frame(beta   = beta,
                          Linear = li.eq$coefficients,
                          Lasso  = la.eq$beta[, 1],
                          Ridge  = ri.eq$beta[, 1])
df_comp_low
Insufficient Information: When you have more predictors than observations, you don’t have enough data to uniquely determine each predictor’s effect on y. In other words, some columns of X can be expressed as linear combinations of others.
Perfect Multicollinearity: With more predictors than observations, you’re guaranteed to have multicollinearity because there are infinite ways to combine the predictors to fit the observations perfectly. Mathematically, it’s impossible to distinguish between multiple solutions that fit the data equally well.
In multicollinear models, predictors contain overlapping information.
As a result:
The model struggles to distinguish the individual effects of each predictor on the target variable.
Small changes in the data can lead to large fluctuations in the estimated coefficients.
This instability in coefficients results in predictions that may vary widely, especially if the model is used on new or slightly different data.
Ridge regression (L2 regularization) helps enhance predictive stability by:
Shrinking coefficients: Ridge regression penalizes large coefficients, forcing them to be closer to zero. This reduces the variance in coefficient estimates for highly correlated variables, stabilizing predictions.
Controlling sensitivity: By reducing the impact of correlated predictors, Ridge helps the model focus on the combined predictive power of the multicollinear variables rather than their individual, unstable contributions.
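A small simulation sketch of this stability point (illustrative only, with two highly correlated predictors and arbitrary settings): refit OLS and ridge on many bootstrap resamples and compare how much the first coefficient moves.

# Stability sketch: two highly correlated predictors, OLS vs ridge over resamples
set.seed(42)
n_sim <- 100
z1    <- rnorm(n_sim)
z2    <- z1 + rnorm(n_sim, sd = 0.05)   # z2 is almost a copy of z1
y_sim <- 1 + 2 * z1 + rnorm(n_sim)
X_sim <- cbind(z1, z2)

coef_ols <- replicate(200, {
  idx <- sample(n_sim, replace = TRUE)
  coef(lm(y_sim[idx] ~ X_sim[idx, ]))[2]   # OLS coefficient on z1
})
coef_ridge <- replicate(200, {
  idx <- sample(n_sim, replace = TRUE)
  as.numeric(glmnet(X_sim[idx, ], y_sim[idx], alpha = 0, lambda = 0.5)$beta[1, 1])
})

c(sd_OLS = sd(coef_ols), sd_Ridge = sd(coef_ridge))   # ridge estimates vary much less across resamples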