Run an OLS where the number of predictors (p) is much larger than the number of observations (n).
I will be using the California school test performance data set (CASchools) and dropping most of the observations.
library(AER)
Warning: package 'AER' was built under R version 4.3.3
Loading required package: car
Loading required package: carData
Loading required package: lmtest
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: survival
data("CASchools")
Adjusting the data set
set.seed(123)
# Setting the # of rows used
OLS_dt <- CASchools[1:8, ]
# Count the columns / predictors ---> -5 to subtract out non-numeric
num_variables <- ncol(OLS_dt) - 5
# Count the columns / observations
num_rows <- nrow(OLS_dt)
# Print
num_variables
[1] 9
num_rows
[1] 8
Since we have 9 numeric predictors, we keep only the first 8 observations so that the number of predictors exceeds the number of observations.
Running the Model
Model1 <- lm(math ~ read + english + income + expenditure + lunch + calworks + teachers + students, data = OLS_dt)
summary(Model1)
Call:
lm(formula = math ~ read + english + income + expenditure + lunch +
calworks + teachers + students, data = OLS_dt)
Residuals:
ALL 8 residuals are 0: no residual degrees of freedom!
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 133.9530 NaN NaN NaN
read 0.8513 NaN NaN NaN
english 0.1658 NaN NaN NaN
income 1.7708 NaN NaN NaN
expenditure -0.0110 NaN NaN NaN
lunch -0.1404 NaN NaN NaN
calworks 0.8565 NaN NaN NaN
teachers -0.2596 NaN NaN NaN
students NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 7 and 0 DF, p-value: NA
Results from the Model
When a regression model has more parameters (predictors plus the intercept) than observations, the slope coefficients cannot be estimated uniquely. As shown above, the matrix of predictors becomes singular (it has no inverse) because the columns are perfectly multicollinear, so OLS breaks down; the short rank check below illustrates this directly.
Standard Errors –> The standard errors for the coefficients are undefined and show up as “NaN”. With perfect multicollinearity, the model cannot determine how each predictor uniquely contributes to explaining the dependent variable.
No Residual Degrees of Freedom –> “ALL 8 residuals are 0: no residual degrees of freedom” means the model reproduces all 8 observations exactly, a sign of overfitting caused by the large number of predictors relative to the small number of observations.
No Estimates –> Some coefficients, such as students, cannot be estimated at all; lm drops them and reports NA because of the perfect multicollinearity.
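A quick way to see this (a minimal sketch, assuming the same OLS_dt built above; X_design is just an illustrative name) is to check the rank of the design matrix: with an intercept plus 8 predictors but only 8 rows, it cannot reach full column rank.
X_design <- model.matrix(math ~ read + english + income + expenditure + lunch + calworks + teachers + students, data = OLS_dt)
ncol(X_design)      # 9 columns: intercept + 8 predictors
qr(X_design)$rank   # at most 8, since there are only 8 observations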
Part 2)
Part A
On a dataset of your choice where p >> n, apply the lasso or ridge method after scaling all the X variables. Which method do you prefer, and why? Do they give similar coefficients?
Creating the two models
Using the same data set, we find the optimal lambda via cross-validation and fit the Ridge and Lasso regressions.
Setup
library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-8
# Predictors (X) and response (y)
X <- scale(as.matrix(OLS_dt[, c("read", "english", "income", "expenditure", "lunch", "calworks", "teachers", "students")]))
y <- OLS_dt$math
# Lasso --> cross-validation to find optimal lambda
cv_lasso <- cv.glmnet(x = X, y = y, alpha = 1)
Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
fold
# Ridge --> cross-validation to find optimal lambda
cv_ridge <- cv.glmnet(x = X, y = y, alpha = 0)
Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
fold
Optimal Lambda
# Optimal lambda for Lasso
best_lambda_lasso <- cv_lasso$lambda.min
cat("Optimal lambda for Lasso:", best_lambda_lasso, "\n")
Optimal lambda for Lasso: 1.978527
# Optimal lambda for Ridge
best_lambda_ridge <- cv_ridge$lambda.min
cat("Optimal lambda for Ridge:", best_lambda_ridge, "\n")
Optimal lambda for Ridge: 193.3041
Ridge and Lasso regressions
# Ridge regression
ridge_model <- glmnet(X, y, alpha = 0)
# Lasso regression
lasso_model <- glmnet(X, y, alpha = 1)
# Comparing coefficients at a small fixed lambda (s = 0.01)
ridge_coefs <- coef(ridge_model, s = 0.01)
lasso_coefs <- coef(lasso_model, s = 0.01)
cat("Ridge Coefficients:\n")
print(ridge_coefs)
cat("Lasso Coefficients:\n")
print(lasso_coefs)
Ridge Coefficients:
9 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 639.1375122
read 15.7862053
english 0.5727419
income 1.7921303
expenditure -1.1982939
lunch -12.7607006
calworks 4.8329731
teachers 0.8623691
students 1.1288586
Lasso Coefficients:
9 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 639.137512
read 21.945995
english 3.320658
income 3.126077
expenditure -5.196942
lunch -11.279382
calworks 10.304263
teachers -3.297968
students .
Results from the Models
Overall, the two sets of coefficients differ somewhat, which is expected because the two penalties behave differently.
Ridge Coefficients:
Ridge regression applies an L2 penalty (the sum of squared coefficients), which shrinks the magnitude of all coefficients but does not set any of them exactly to zero.
This is consistent with the output above, where every predictor has a non-zero coefficient. Ridge is effective when we want to shrink coefficients but keep all predictors in the model.
Lasso Coefficients:
Lasso regression uses an L1 penalty (the sum of absolute coefficients), which can set some coefficients exactly to zero.
Above, the coefficient for students is shown as “.”, indicating that it was set to zero by the model. This makes Lasso helpful for reducing model complexity by excluding less relevant predictors, which improves interpretability.
I prefer Lasso, as it identifies variables that can be excluded from the model, which helps with its complexity and interpretability.
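As a supplementary check (a minimal sketch, assuming the cv_lasso and cv_ridge objects fitted above; coef_compare is just an illustrative name), the coefficients can also be pulled at each model’s cross-validated lambda.min instead of the fixed s = 0.01 used above, which makes the shrinkage-versus-selection contrast easy to read side by side.
# Coefficients at each model's cross-validated lambda.min, side by side
coef_compare <- cbind(
  as.matrix(coef(cv_ridge, s = "lambda.min")),
  as.matrix(coef(cv_lasso, s = "lambda.min"))
)
colnames(coef_compare) <- c("ridge", "lasso")
coef_compare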
Part B
Now, take any independent variable that has not been eliminated, and change its units (transform it). In lots of instances, this may make sense (e.g., if the variable is income measured in dollars, you may want to measure it in thousands of dollars instead). Rerun the lasso/ridge now (without scaling the variables) - does this rescaled variable that was not previously eliminated get eliminated as a result of the change in scale? Explain why this could happen.
Adjusting the scale of the models
Setup
# Scale income from dollars to thousands of dollars
OLS_dt$income <- OLS_dt$income / 1000
# Define predictors (X) and response (y)
X <- as.matrix(OLS_dt[, c("read", "english", "income", "expenditure", "lunch", "calworks", "teachers", "students")])
y <- OLS_dt$math
# Lasso --> cross-validation to find optimal lambda (without additional scaling)
cv_lasso <- cv.glmnet(x = X, y = y, alpha = 1)
Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
fold
# Ridge --> cross-validation to find optimal lambda (without additional scaling)
cv_ridge <- cv.glmnet(x = X, y = y, alpha = 0)
Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
fold
Optimal Lambda
# Optimal lambda for Lasso
best_lambda_lasso <- cv_lasso$lambda.min
# Optimal lambda for Ridge
best_lambda_ridge <- cv_ridge$lambda.min
Ridge and Lasso regressions
# Fit Lasso model at its optimal lambda
lasso_model <- glmnet(X, y, alpha = 1, lambda = best_lambda_lasso)
# Fit Ridge model at its optimal lambda
ridge_model <- glmnet(X, y, alpha = 0, lambda = best_lambda_ridge)
# Comparing coefficients
ridge_coefs <- coef(ridge_model)
lasso_coefs <- coef(lasso_model)
cat("Lasso Coefficients (After Rescaling Income):\n")
print(lasso_coefs)
cat("Ridge Coefficients (After Rescaling Income):\n")
print(ridge_coefs)
Lasso Coefficients (After Rescaling Income):
9 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 1.914422e+02
read 7.157088e-01
english .
income .
expenditure .
lunch -1.502634e-01
calworks .
teachers 6.529610e-02
students 2.905849e-04
Ridge Coefficients (After Rescaling Income):
9 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 5.756398e+02
read 9.147522e-02
english -6.787410e-02
income 3.967424e+02
expenditure 1.343486e-03
lunch -8.255706e-02
calworks 6.925893e-03
teachers 1.145313e-02
students 4.847282e-04
Results from the Models
Without the scaling step, the Lasso penalty is applied to the raw coefficients, so the size of each coefficient depends on the units of its predictor. After rescaling income and dropping the scaling step, four predictors, including income, were shrunk to zero: variables whose units give them small coefficients are penalized more heavily and can appear less relevant than they actually are.
Ridge regression, by contrast, still keeps every predictor, since its L2 penalty shrinks coefficients but never sets them exactly to zero, although the magnitudes of its coefficients also change with the units of the predictors.
This shows the importance of standardizing variables before applying Lasso, as it helps ensure consistent feature selection.
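A minimal sketch of that fix, assuming the X and y from the Part B setup above (X_std and cv_lasso_std are just illustrative names): standardize the predictors by hand before cross-validating the Lasso, so the penalty treats every variable on a comparable scale regardless of its original units.
# Standardize each predictor column (mean 0, sd 1)
X_std <- scale(X)
# Re-run cross-validated Lasso on the standardized predictors
cv_lasso_std <- cv.glmnet(x = X_std, y = y, alpha = 1)
# Coefficients at the cross-validated lambda
coef(cv_lasso_std, s = "lambda.min")
Re-tuning lambda on the standardized data, rather than reusing best_lambda_lasso, keeps the penalty on a scale comparable to Part A.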