ECNM Discussion 8

Author

Bryan Calderon

Part 1

Run an OLS where the number of predictors (p) is much larger than the number of observations (n).

I will be using the California school test performance data set (CASchools) and dropping all but the first eight observations.

library(AER)
Warning: package 'AER' was built under R version 4.3.3
Loading required package: car
Loading required package: carData
Loading required package: lmtest
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: survival
data("CASchools")

Adjusting the data set

set.seed(123)
# setting the # of rows used
OLS_dt <- CASchools[1:8, ]

# Count the predictor columns ---> subtract 5 (the 4 non-numeric columns and the response, math)
num_variables <- ncol(OLS_dt) - 5

# Count the rows / observations
num_rows <- nrow(OLS_dt)

# Print
num_variables
[1] 9
num_rows
[1] 8

Since we have 9 numeric predictors, we keep only the first 8 observations so that p > n.

Running the Model

Model1 <- lm(math ~ read + english + income + expenditure + lunch + calworks + teachers + students, 
             data = OLS_dt)

summary(Model1)

Call:
lm(formula = math ~ read + english + income + expenditure + lunch + 
    calworks + teachers + students, data = OLS_dt)

Residuals:
ALL 8 residuals are 0: no residual degrees of freedom!

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 133.9530        NaN     NaN      NaN
read          0.8513        NaN     NaN      NaN
english       0.1658        NaN     NaN      NaN
income        1.7708        NaN     NaN      NaN
expenditure  -0.0110        NaN     NaN      NaN
lunch        -0.1404        NaN     NaN      NaN
calworks      0.8565        NaN     NaN      NaN
teachers     -0.2596        NaN     NaN      NaN
students          NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 7 and 0 DF,  p-value: NA

Results from the Model

When you have more predictors than observations in a regression model, estimating the slope parameters (coefficients) becomes problematic. As shown above, the matrix of predictors becomes singular (it has no inverse) because there is perfect multicollinearity; a quick rank check, sketched after the bullet points, makes this concrete.

  • Standard Errors –> The standard errors for the coefficients are not defined and therefore show up as “NaN”. With no residual degrees of freedom there is no estimate of the error variance, and with perfectly collinear predictors the model cannot determine how each predictor uniquely contributes to explaining the dependent variable.

  • No Residual Degrees of Freedom –> The message “ALL 8 residuals are 0: no residual degrees of freedom” is a sign of overfitting: with as many parameters as observations, the model fits the data exactly and leaves nothing with which to estimate the error variance.

  • No estimates –> The coefficient for students could not be estimated at all (reported as NA) because of the perfect multicollinearity.
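
As a quick way to see the singularity, one can compare the rank of the design matrix with the number of parameters to be estimated. This is a minimal sketch using the model fit above (X_ols is just an illustrative name):

# Design matrix implied by Model1: intercept + 8 predictors = 9 columns
X_ols <- model.matrix(Model1)

# With only 8 rows the matrix cannot reach rank 9, so X'X is singular
# and OLS has no unique solution
qr(X_ols)$rank   # rank of the design matrix
ncol(X_ols)      # number of parameters to estimate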

Part 2

Part A

On a dataset of your choice where p >> n, apply the lasso or ridge method after scaling all the X variables. Which method do you prefer, and why? Do they give similar coefficients?

Creating the two models

Using the same data set, we find the optimal lambda via cross-validation and fit the Ridge and Lasso regressions.

Setup

library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-8
# Predictors (X) and Response (y)
X <- scale(as.matrix(OLS_dt[, c("read", "english", "income", "expenditure", 
                                "lunch", "calworks", "teachers", "students")]))
y <- OLS_dt$math


# Lasso --> Cross-validation to find optimal lambda
cv_lasso <- cv.glmnet(x = X, 
                      y = y, 
                      alpha = 1)
Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
fold
# Ridge --> Cross-validation to find optimal lambda
cv_ridge <- cv.glmnet(x = X, 
                      y = y, 
                      alpha = 0)
Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
fold

Optimal Lambda

# Optimal lambda for Lasso
best_lambda_lasso <- cv_lasso$lambda.min

cat("Optimal lambda for Lasso:", 
    best_lambda_lasso, "\n")
Optimal lambda for Lasso: 1.978527 
# Optimal lambda for Ridge
best_lambda_ridge <- cv_ridge$lambda.min  

cat("Optimal lambda for Ridge:", best_lambda_ridge, "\n")
Optimal lambda for Ridge: 193.3041 

Ridge and Lasso regressions

# Ridge regression
ridge_model <- glmnet(X, 
                      y, 
                      alpha = 0)

# Lasso regression
lasso_model <- glmnet(X, 
                      y, 
                      alpha = 1)

# Comparing coefficients, extracted here at a small fixed penalty (s = 0.01)
# rather than at the cross-validated lambda.min values found above
ridge_coefs <- coef(ridge_model, 
                    s = 0.01)
lasso_coefs <- coef(lasso_model, 
                    s = 0.01)
Ridge Coefficients:
9 x 1 sparse Matrix of class "dgCMatrix"
                     s1
(Intercept) 639.1375122
read         15.7862053
english       0.5727419
income        1.7921303
expenditure  -1.1982939
lunch       -12.7607006
calworks      4.8329731
teachers      0.8623691
students      1.1288586

Lasso Coefficients:
9 x 1 sparse Matrix of class "dgCMatrix"
                    s1
(Intercept) 639.137512
read         21.945995
english       3.320658
income        3.126077
expenditure  -5.196942
lunch       -11.279382
calworks     10.304263
teachers     -3.297968
students      .       
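
The coefficients above are extracted at a fixed small penalty (s = 0.01) rather than at the cross-validated optima computed earlier. As a small alternative sketch, the lambda.min values could be pulled directly from the cv.glmnet fits:

# Coefficients at the cross-validated lambda.min for each model
coef(cv_lasso, s = "lambda.min")
coef(cv_ridge, s = "lambda.min")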

Results from the Models

Overall, the coefficients differ because the two methods apply different penalties and therefore serve different purposes.

  1. Ridge Coefficients:

    • Ridge regression applies an L2 penalty (on the sum of squared coefficients), which reduces the magnitude of all coefficients but does not set any of them to zero.

    • This is consistent with the output above, where every predictor has a non-zero coefficient. Ridge is effective when we want to shrink coefficients but keep all predictors in the model.

  2. Lasso Coefficients:

    • Lasso regression uses an L1 penalty (on the sum of absolute coefficients), which can set some coefficients exactly to zero.

    • Above, the coefficient for students appears as “.”, indicating that it was set to zero by the model. This makes Lasso very helpful for reducing model complexity by excluding less relevant predictors, which improves interpretability; the two objective functions are written out below.
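
For reference, the two objectives differ only in the penalty term, where lambda is the tuning parameter chosen by cross-validation (standard notation, written here in LaTeX):

\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert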

I prefer Lasso, as I find it very useful for identifying variables that can be excluded from the model, which helps with its complexity and interpretability.

Part B

Now, take any independent variable that has not been eliminated, and change its units (transform it). In lots of instances this may make sense (e.g., if the variable is income measured in dollars, you may want to measure it in thousands of dollars instead). Rerun the lasso/ridge now (without scaling the variables). Does this rescaled variable, which was not previously eliminated, get eliminated as a result of the change in scale? Explain why this could happen.

Adjusting the scale of the models

Setup

# Rescale income into units 1,000 times larger (e.g., dollars to thousands of dollars)
OLS_dt$income <- OLS_dt$income / 1000

# Define predictors (X) and response (y)
X <- as.matrix(OLS_dt[, c("read", "english", "income", "expenditure", 
                          "lunch", "calworks", "teachers", "students")])
y <- OLS_dt$math

# Lasso --> Cross-validation to find optimal lambda (without additional scaling)
cv_lasso <- cv.glmnet(x = X, 
                      y = y, 
                      alpha = 1)
Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
fold
# Ridge --> Cross-validation to find optimal lambda (without additional scaling)
cv_ridge <- cv.glmnet(x = X, 
                      y = y, 
                      alpha = 0)
Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
fold

Optimal Lambda

# Optimal lambda for Lasso
best_lambda_lasso <- cv_lasso$lambda.min  

# Optimal lambda for Ridge
best_lambda_ridge <- cv_ridge$lambda.min

Ridge and Lasso regressions

# Fit Lasso model
lasso_model <- glmnet(X, 
                      y, 
                      alpha = 1, 
                      lambda = best_lambda_lasso)

# Fit Ridge model
ridge_model <- glmnet(X, 
                      y, 
                      alpha = 0, 
                      lambda = best_lambda_ridge)

# Comparing coefficients
ridge_coefs <- coef(ridge_model)

lasso_coefs <- coef(lasso_model)

Lasso Coefficients (After Rescaling Income):
9 x 1 sparse Matrix of class "dgCMatrix"
                       s0
(Intercept)  1.914422e+02
read         7.157088e-01
english      .           
income       .           
expenditure  .           
lunch       -1.502634e-01
calworks     .           
teachers     6.529610e-02
students     2.905849e-04

Ridge Coefficients (After Rescaling Income):
9 x 1 sparse Matrix of class "dgCMatrix"
                       s0
(Intercept)  5.756398e+02
read         9.147522e-02
english     -6.787410e-02
income       3.967424e+02
expenditure  1.343486e-03
lunch       -8.255706e-02
calworks     6.925893e-03
teachers     1.145313e-02
students     4.847282e-04

Results from the Models

Changing the units of a predictor (and dropping the pre-scaling step) led to four predictors being set to zero in the Lasso, because the L1 penalty is applied to the raw coefficients: changing a variable's units changes the size of its coefficient and therefore how heavily that variable is effectively penalized. For this reason, rescaled variables can end up being pushed to zero.

Ridge regression, however, still retains every predictor: its coefficient values change with the units, but it only shrinks coefficients and never sets any of them exactly to zero, so nothing is eliminated.

This shows the importance of standardizing variables before applying Lasso, as it helps ensure feature selection that is consistent regardless of the units the predictors happen to be measured in.
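
A side note specific to glmnet: by default it standardizes the predictors internally (standardize = TRUE) before applying the penalty and reports coefficients on the original scale. To make the fit fully sensitive to the raw units, that option can be turned off. A minimal sketch, assuming the same X, y, and best_lambda_lasso defined above (lasso_raw is just an illustrative name):

# Refit the Lasso on the raw predictors with glmnet's internal
# standardization turned off, so the L1 penalty acts on the raw units
lasso_raw <- glmnet(X, 
                    y, 
                    alpha = 1, 
                    standardize = FALSE)

# Coefficients at the previously cross-validated lambda
coef(lasso_raw, s = best_lambda_lasso)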