STA 6923 Fall 2024 Homework 4

HW 4

1. Exercise 6.6.2, Indicate validity of statements i-iv for each scenario.

Statement 1: More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Statement 2: More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Statement 3: Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Statement 4: Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

a) The lasso, relative to least squares, is: 3

The lasso is a less flexible method compared to least squares, as it includes an L1 penalty. This reduces the likelihood of overfitting and results in lower variance. However, due to the bias-variance tradeoff, there is a relative increase in bias.

b) The ridge regression, relative to least squares, is: 3

Similar to the lasso, ridge regression is also less flexible than least squares, as it includes an L2 penalty. This reduces the likelihood of overfitting and results in lower variance. However, due to the bias-variance tradeoff, there is a relative increase in bias.

c) Non-linear methods, relative to least squares, are: 2

Non-linear methods are more flexible than least squares, so they can better fit complex patterns in data regardless of noise, contributing to their lower bias. However, due to the bias-variance tradeoff, there is a relative increase in variance as slight changes in training data can cause significant changes in the model.

2. Exercise 6.6.8, In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.

a) Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector ϵ of length n = 100.

set.seed(1234)
X <- rnorm(100)
epsilon <- rnorm(100)

b) Generate a response vector Y of length n = 100 according to the model Y = β0 + β1X + β2X2 + β3X3 + ϵ, where β0, β1, β2, and β3 are constants of your choice.

beta0 <- 1
beta1 <- -2
beta2 <- 3
beta3 <- -4
Y <- beta0 + beta1*X + beta2*X^2 + beta3*X^3 + epsilon

The simulated data is y = 1 - 2x + 3x^2 - 4x^3.

c) Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors X, X2,…,X10. What is the best model obtained according to Cp, BIC, and adjusted R2? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y.

# Load the leaps package
library(leaps)

# Create a data frame with predictors X, X^2, ..., X^10 and response Y
data_df <- data.frame(y = Y, x = X, x2 = X^2, x3 = X^3, x4 = X^4, x5 = X^5, 
                     x6 = X^6, x7 = X^7, x8 = X^8, x9 = X^9, x10 = X^10)

# Perform best subset selection
subsets <- regsubsets(y ~ ., data = data_df, nvmax = 10) 

# Get the summary of the results
subsets_summary <- summary(subsets)

# Find the best model based on Cp, BIC, and adjusted R-squared
best_cp <- which.min(subsets_summary$cp)
best_bic <- which.min(subsets_summary$bic)
best_adjr2 <- which.max(subsets_summary$adjr2)

# Print the best models
cat("Best model based on Cp:", best_cp)

## Best model based on Cp: 3

cat("Best model based on BIC:", best_bic)

## Best model based on BIC: 3

cat("Best model based on adjusted R-squared:", best_adjr2)

## Best model based on adjusted R-squared: 3

# Plot Cp, BIC, and adjusted R-squared
par(mfrow = c(2, 2))
plot(subsets_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(best_cp, subsets_summary$cp[best_cp], col = "red", pch = 20)
plot(subsets_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(best_bic, subsets_summary$bic[best_bic], col = "red", pch = 20)
plot(subsets_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R-squared", type = "l")
points(best_adjr2, subsets_summary$adjr2[best_adjr2], col = "red", pch = 20)

# Get the coefficients of the best model 
coef(subsets, best_cp)

## (Intercept)           x          x2          x3 
##    1.132470   -2.087414    2.893627   -3.967695

coef(subsets, best_bic)

## (Intercept)           x          x2          x3 
##    1.132470   -2.087414    2.893627   -3.967695

coef(subsets, best_adjr2)

## (Intercept)           x          x2          x3 
##    1.132470   -2.087414    2.893627   -3.967695

The best model according to all 3 criterion is y = 1.13 - 2.09x + 2.89x^2 - 3.97x^3. This is very similar to the original.

d) Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

df.lm <- lm(y ~ ., data = data_df)
df.f <- ols_step_forward_p(df.lm, penter = 0.05)
df.b <- ols_step_backward_p(df.lm, prem = 0.1)
summary(df.f$model)

## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.95953 -0.66151  0.03895  0.53538  2.91598 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.13247    0.13610   8.321 5.92e-13 ***
## x3          -3.96769    0.05661 -70.088  < 2e-16 ***
## x2           2.89363    0.08222  35.193  < 2e-16 ***
## x           -2.08741    0.18789 -11.110  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.038 on 96 degrees of freedom
## Multiple R-squared:  0.9952, Adjusted R-squared:  0.995 
## F-statistic:  6588 on 3 and 96 DF,  p-value: < 2.2e-16

summary(df.b$model)

## 
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")), 
##     data = l)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.84330 -0.51507 -0.03746  0.46811  2.80231 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.937870   0.203768   4.603 1.32e-05 ***
## x           -2.029522   0.202791 -10.008  < 2e-16 ***
## x2           3.920355   0.694328   5.646 1.77e-07 ***
## x3          -3.971703   0.065209 -60.908  < 2e-16 ***
## x4          -0.765037   0.518945  -1.474    0.144    
## x6           0.145173   0.105581   1.375    0.172    
## x10         -0.001227   0.001058  -1.159    0.249    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.042 on 93 degrees of freedom
## Multiple R-squared:  0.9953, Adjusted R-squared:  0.995 
## F-statistic:  3271 on 6 and 93 DF,  p-value: < 2.2e-16

In both subset selection methods and sequential forward model selection, the methods select the right predictors and coefficients. However, backward selection chose 5 predictors with incorrect coeff of ~4 instead of 3 for x2. At the very least, x4, x6, and x10 are not considered significant according to the model results.

e) Now fit a lasso model to the simulated data, again using X, X2, …,X10 as predictors. Use cross-validation to select the optimal value of λ. Create plots of the cross-validation error as a function of λ. Report the resulting coefficient estimates, and discuss the results obtained.

library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-8

# Create the predictor matrix and response vector
x_matrix <- as.matrix(data_df[, -1]) # Exclude the response variable (y)
y_vector <- data_df$y

# Perform cross-validation to find the optimal lambda
cv_lasso <- cv.glmnet(x_matrix, y_vector, alpha = 1) # alpha = 1 for lasso, default nfolds=10

# Plot the cross-validation results
plot(cv_lasso)

# 2 possible cutoffs of lambda
c(log(cv_lasso$lambda.min), log(cv_lasso$lambda.1se))

## [1] -2.845343 -1.635904

# Get the optimal lambda value, we opt for 1se
best_lambda <- cv_lasso$lambda.1se

# Fit the lasso model with the optimal lambda
lasso_model <- glmnet(x_matrix, y_vector, alpha = 1, lambda = best_lambda)

# Report the coefficient estimates
coef(lasso_model)

## 11 x 1 sparse Matrix of class "dgCMatrix"
##                    s0
## (Intercept)  1.348018
## x           -2.018755
## x2           2.685827
## x3          -3.902978
## x4           .       
## x5           .       
## x6           .       
## x7           .       
## x8           .       
## x9           .       
## x10          .

As in class, we choose the lambda that gives us the most regularized model such that error is within one standard error of the lambda.min. The resulting model is y = 3.36 + 6.75x^7 + 0.0003x^9. The coefficient for x^9 is small, nonzero number. Overall, this lasso model is the poorest fit to the original so far.

f) Now generate a response vector Y according to the model Y = β0 + β7X7 + ϵ, and perform best subset selection and the lasso. Discuss the results obtained.

beta7 <- 7
Y <- beta0 + beta7*X^7 + epsilon
data_df <- data.frame(y = Y, x = X, x2 = X^2, x3 = X^3, x4 = X^4, x5 = X^5, 
                     x6 = X^6, x7 = X^7, x8 = X^8, x9 = X^9, x10 = X^10)
subsets <- regsubsets(y ~ ., data = data_df, nvmax = 10) 
subsets_summary <- summary(subsets)
best_cp <- which.min(subsets_summary$cp)
best_bic <- which.min(subsets_summary$bic)
best_adjr2 <- which.max(subsets_summary$adjr2)
cat("Best model based on Cp:", best_cp)

## Best model based on Cp: 1

cat("Best model based on BIC:", best_bic)

## Best model based on BIC: 1

cat("Best model based on adjusted R-squared:", best_adjr2)

## Best model based on adjusted R-squared: 2

coef(subsets, best_cp)

## (Intercept)          x7 
##    1.042105    6.999908

coef(subsets, best_bic)

## (Intercept)          x7 
##    1.042105    6.999908

coef(subsets, best_adjr2)

## (Intercept)          x2          x7 
##   1.1471879  -0.1074417   7.0004274

df.lm <- lm(y ~ ., data = data_df)
df.f <- ols_step_forward_p(df.lm, penter = 0.05)
df.b <- ols_step_backward_p(df.lm, prem = 0.1)
summary(df.f$model)

## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.99595 -0.65283  0.01794  0.56276  2.95288 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  1.147188   0.132249    8.674  9.7e-14 ***
## x7           7.000427   0.001113 6289.428  < 2e-16 ***
## x2          -0.107442   0.083717   -1.283    0.202    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.034 on 97 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.279e+07 on 2 and 97 DF,  p-value: < 2.2e-16

summary(df.b$model)

## 
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")), 
##     data = l)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.85434 -0.50383 -0.04033  0.46323  2.81306 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  0.942629   0.202373    4.658 1.05e-05 ***
## x2           0.907952   0.677624    1.340    0.184    
## x4          -0.758738   0.514381   -1.475    0.144    
## x6           0.145042   0.106775    1.358    0.178    
## x7           7.000710   0.001549 4518.276  < 2e-16 ***
## x10         -0.001249   0.001114   -1.121    0.265    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.037 on 94 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 9.059e+06 on 5 and 94 DF,  p-value: < 2.2e-16

x_matrix <- as.matrix(data_df[, -1])
y_vector <- data_df$y
cv_lasso <- cv.glmnet(x_matrix, y_vector, alpha = 1)
plot(cv_lasso)

c(log(cv_lasso$lambda.min), log(cv_lasso$lambda.1se))

## [1] 3.013022 3.199090

best_lambda <- cv_lasso$lambda.1se
lasso_model <- glmnet(x_matrix, y_vector, alpha = 1, lambda = best_lambda)
coef(lasso_model)

## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) 3.3608113059
## x           .           
## x2          .           
## x3          .           
## x4          .           
## x5          .           
## x6          .           
## x7          6.7502552058
## x8          .           
## x9          0.0003334034
## x10         .

Overall, Cp and BIC were the best criterion selection methods giving the correct predictors and coefficients. Adjr2 included X2 as a predictor, which makes sense given that it often overestimates number of predictors. Stepwise selection methods like forward and backward performed poorly due to forward including x2 and backward including x2,x4,x6,x10. However, in both models, only the x7 term was considered significant and had the correct coefficient. Lastly, lasso gave 3.36 + 6.75x^7 + 0.0003x^9, which is the same model given in part e) but a much better fit for this data. This is due to the L1 penalty forcing some of the coefficient estimates to be exactly equal to zero, slightly increasing bias over OLS, but decreasing variance for improved performance.

3. Exercise 6.6.11, We will now try to predict per capita crime rate in the Boston data set.

a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(ISLR2)
data(Boston)
summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

mod.sel <- regsubsets(Boston$crim ~ ., data = Boston, nvmax = 12)
reg.summary <- summary(mod.sel)
names(reg.summary)

## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

reg.summary$outmat

##           zn  indus chas nox rm  age dis rad tax ptratio lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*" 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 5  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*" 
## 8  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*" 
## 10  ( 1 ) "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*" 
## 11  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"

cat(which.max(reg.summary$rsq), reg.summary$rsq[which.max(reg.summary$rsq)]) #12

## 12 0.4493378

cat(which.min(reg.summary$rss), reg.summary$rss[which.min(reg.summary$rss)]) #12

## 12 20574.51

cat(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)]) #9

## 9 0.4382783

cat(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)]) #7

## 7 6.748016

cat(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)]) #2

## 2 -257.6477

lm.Boston <- lm(Boston$crim ~ ., data = Boston)
summary(lm.Boston) # zn, dis, rad, medv

## 
## Call:
## lm(formula = Boston$crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

Boston.f <- ols_step_forward_p(lm.Boston, penter = 0.05) #9
Boston.b <- ols_step_backward_p(lm.Boston, prem = 0.1) #9
Boston.s <- ols_step_both_p(lm.Boston, pent = 0.05, prem = 0.1) #4
summary(Boston.f$model)

## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.547 -2.294 -0.263  1.108 73.970 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.18247    6.96989   1.891 0.059160 .  
## rad           0.56097    0.04890  11.471  < 2e-16 ***
## lstat         0.14041    0.07168   1.959 0.050696 .  
## medv         -0.22013    0.05820  -3.782 0.000175 ***
## ptratio      -0.30424    0.18514  -1.643 0.100949    
## rm            0.63511    0.59541   1.067 0.286641    
## nox         -10.46823    5.07652  -2.062 0.039720 *  
## dis          -1.00637    0.27056  -3.720 0.000222 ***
## zn            0.04306    0.01807   2.383 0.017543 *  
## indus        -0.08821    0.07489  -1.178 0.239465    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.447 on 496 degrees of freedom
## Multiple R-squared:  0.4483, Adjusted R-squared:  0.4383 
## F-statistic: 44.78 on 9 and 496 DF,  p-value: < 2.2e-16

summary(Boston.b$model)

## 
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")), 
##     data = l)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.625 -2.201 -0.349  1.128 73.893 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.285117   7.031364   2.032 0.042724 *  
## zn            0.047681   0.018491   2.579 0.010206 *  
## nox         -11.431518   4.870169  -2.347 0.019305 *  
## rm            0.675191   0.592967   1.139 0.255392    
## dis          -0.970703   0.265259  -3.659 0.000280 ***
## rad           0.628695   0.083038   7.571 1.82e-13 ***
## tax          -0.005150   0.004632  -1.112 0.266789    
## ptratio      -0.318088   0.183795  -1.731 0.084134 .  
## lstat         0.132263   0.071654   1.846 0.065508 .  
## medv         -0.227946   0.058967  -3.866 0.000126 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.448 on 496 degrees of freedom
## Multiple R-squared:  0.4481, Adjusted R-squared:  0.4381 
## F-statistic: 44.75 on 9 and 496 DF,  p-value: < 2.2e-16

summary(Boston.s$model)

## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.942 -1.896 -0.346  0.974 75.878 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.93612    3.88687   1.270   0.2047    
## rad          0.54758    0.04104  13.343   <2e-16 ***
## lstat        0.14350    0.06442   2.228   0.0263 *  
## medv        -0.11983    0.05066  -2.365   0.0184 *  
## ptratio     -0.30708    0.16706  -1.838   0.0666 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.529 on 501 degrees of freedom
## Multiple R-squared:  0.4283, Adjusted R-squared:  0.4238 
## F-statistic: 93.85 on 4 and 501 DF,  p-value: < 2.2e-16

x.Boston <- model.matrix(crim ~ ., Boston)[,-1]
y.Boston <- Boston[,"crim"]
cv_lasso.Boston <- cv.glmnet(x.Boston,y.Boston,alpha = 1)
best_lambda.Boston <- cv_lasso.Boston$lambda.1se
lasso_model.Boston <- glmnet(x.Boston, y.Boston, alpha=1, lambda=best_lambda.Boston)
coef(lasso_model.Boston) #1 (lambda.1se) vs 10 (lambda.min)

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                    s0
## (Intercept) 1.4186414
## zn          .        
## indus       .        
## chas        .        
## nox         .        
## rm          .        
## age         .        
## dis         .        
## rad         0.2298449
## tax         .        
## ptratio     .        
## lstat       .        
## medv        .

cv_ridge.Boston <- cv.glmnet(x.Boston,y.Boston,alpha = 0)
best_lambda.Boston <- cv_ridge.Boston$lambda.1se
ridge_model.Boston <- glmnet(x.Boston, y.Boston, alpha=0, lambda=best_lambda.Boston)
coef(ridge_model.Boston) #12

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept)  1.830241387
## zn          -0.002323439
## indus        0.018788484
## chas        -0.087676364
## nox          1.175797658
## rm          -0.094724473
## age          0.003960747
## dis         -0.058498956
## rad          0.025819211
## tax          0.001207665
## ptratio      0.043787293
## lstat        0.021432608
## medv        -0.014151535

cv_enet.Boston <- cv.glmnet(x.Boston,y.Boston,alpha = 0.5)
best_lambda.Boston <- cv_enet.Boston$lambda.1se
enet_model.Boston <- glmnet(x.Boston, y.Boston, alpha=0.5, lambda=best_lambda.Boston)
coef(enet_model.Boston) #2 (lambda.1se) vs 10 (lambda.min)

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
## (Intercept) 0.872722755
## zn          .          
## indus       .          
## chas        .          
## nox         .          
## rm          .          
## age         .          
## dis         .          
## rad         0.154480237
## tax         0.003100174
## ptratio     .          
## lstat       .          
## medv        .

library(pls)

## Warning: package 'pls' was built under R version 4.4.2

## 
## Attaching package: 'pls'

## The following object is masked from 'package:stats':
## 
##     loadings

pcr.Boston <- pcr(crim ~ . , data = Boston, scale=T, validation = "CV")
summary(pcr.Boston)

## Data:    X dimension: 506 12 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.295    7.289    6.899    6.878    6.823    6.812
## adjCV         8.61    7.291    7.285    6.893    6.873    6.817    6.806
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       6.682    6.703    6.683     6.677     6.638      6.56
## adjCV    6.671    6.696    6.675     6.669     6.628      6.55
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       49.93    63.64    72.94    80.21    86.83    90.26    92.79    94.99
## crim    29.39    29.55    37.39    37.85    38.85    39.23    41.73    41.82
##       9 comps  10 comps  11 comps  12 comps
## X       96.78     98.33     99.48    100.00
## crim    42.12     42.43     43.58     44.93

validationplot(pcr.Boston, val.type = "MSEP")

press.Boston <- sqrt(pcr.Boston$validation[["PRESS"]])
which.min(press.Boston)

## [1] 12

press.Boston.eff <- press.Boston - press.Boston[which.min(press.Boston)]
print(press.Boston.eff) # smallest CV error occurs when M=12 components used

##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## crim 16.53362 16.40192 7.633271 7.166196 5.908498 5.676738 2.757023 3.223763
##       9 comps 10 comps 11 comps 12 comps
## crim 2.771947 2.627039 1.750864        0

pls.Boston <- plsr(crim ~ . , data = Boston, scale=T, validation = "CV")
summary(pls.Boston)

## Data:    X dimension: 506 12 
##  Y dimension: 506 1
## Fit method: kernelpls
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.070    6.665    6.592    6.582    6.571    6.554
## adjCV         8.61    7.068    6.661    6.586    6.574    6.562    6.544
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       6.547    6.537    6.536     6.537     6.537     6.537
## adjCV    6.538    6.528    6.528     6.528     6.528     6.528
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       49.49    58.85    64.82    74.54    80.88    83.57    87.80    92.13
## crim    33.00    41.20    43.32    44.04    44.40    44.78    44.87    44.91
##       9 comps  10 comps  11 comps  12 comps
## X       94.56     96.22     98.11    100.00
## crim    44.93     44.93     44.93     44.93

validationplot(pls.Boston, val.type = "MSEP")

which.min(pls.Boston$validation[["PRESS"]]) # smallest when n.comp=9 components used

## [1] 9

I tried criterion selection methods and sequential selection methods for OLS, the lasso, ridge regression, elastic net, and PCR. Rsq and Rss selected all 12 predictors while AdjR2 selected 9, Cp 7, BIC 2, forward/backward 9, stepwise regression 4. The downside with these models is that we cannot conclude that all variables included are important or all excluded variables are unimportant. The selected models for the subset selection method may all equally good, and it is difficult to conclude which is optimal. Moreover, they are more flexible methods, such that they may overfit. In this chapter, we were introduced shrinkage methods (lasso regression, ridge regression, elastic net) and dimensionality reduction methods (PCA regression, PLS regression). Ridge uses all predictors; however, lasso interestingly chooses 10 predictors at lambda.min and only 1 predictor at lambda.1se. Enet is similar to lasso in that lambbda.min chooses 10 predictors and lambda.1se chooses only 2. The PCA model performs best with all 12 principal components while the PLS does best with 9 components.

b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:pls':
## 
##     R2

set.seed(1234)

# Splitting the dataset
tr.ind <- sample(seq_len(nrow(x.Boston)), floor(0.7 * nrow(x.Boston)), replace = FALSE)
x.tr <- x.Boston[tr.ind, ]
x.te <- x.Boston[-tr.ind, ]
y.tr <- y.Boston[tr.ind]
y.te <- y.Boston[-tr.ind]

# Full OLS model
lm.Boston <- lm(crim ~ ., data = Boston[tr.ind,])
pred.lm <- predict(lm.Boston, newdata = Boston[-tr.ind,])
mse.lm <- mean((y.te - pred.lm)^2)

# Subset selection models
mod.sel <- regsubsets(crim ~ ., data = Boston[tr.ind,], nvmax = 12)
reg.summary <- summary(mod.sel)

# Adjusted R^2
best.adjr2 <- which.max(reg.summary$adjr2)
selected.vars <- names(coef(mod.sel, best.adjr2))[-1]  # Exclude intercept
formula.adjr2 <- as.formula(paste("crim ~", paste(selected.vars, collapse = " + ")))
lm.adjr2 <- lm(formula.adjr2, data = Boston[tr.ind,])
pred.adjr2 <- predict(lm.adjr2, newdata = Boston[-tr.ind,])
mse.adjr2 <- mean((y.te - pred.adjr2)^2)

# Cp
best.cp <- which.max(reg.summary$cp)
selected.vars <- names(coef(mod.sel, best.cp))[-1]  
formula.cp <- as.formula(paste("crim ~", paste(selected.vars, collapse = " + ")))
lm.cp <- lm(formula.cp, data = Boston[tr.ind,])
pred.cp <- predict(lm.cp, newdata = Boston[-tr.ind,])
mse.cp <- mean((y.te - pred.cp)^2)

# BIC
best.bic <- which.max(reg.summary$bic)
selected.vars <- names(coef(mod.sel, best.bic))[-1]  
formula.bic <- as.formula(paste("crim ~", paste(selected.vars, collapse = " + ")))
lm.bic <- lm(formula.bic, data = Boston[tr.ind,])
pred.bic <- predict(lm.bic, newdata = Boston[-tr.ind,])
mse.bic <- mean((y.te - pred.bic)^2)

# Forward Selection
Boston.f <- ols_step_forward_p(lm.Boston, penter = 0.05)
pred.forward <- predict(Boston.f$model, newdata = Boston[-tr.ind,])
mse.forward <- mean((y.te - pred.forward)^2)

# Backward Selection
Boston.b <- ols_step_backward_p(lm.Boston, prem = 0.1)
pred.backward <- predict(Boston.b$model, newdata = Boston[-tr.ind,])
mse.backward <- mean((y.te - pred.backward)^2)

# Stepwise (Both Directions)
Boston.s <- ols_step_both_p(lm.Boston, pent = 0.05, prem = 0.1)
pred.stepwise <- predict(Boston.s$model, newdata = Boston[-tr.ind,])
mse.stepwise <- mean((y.te - pred.stepwise)^2)

# Lasso regression
lasso.cv <- cv.glmnet(x.tr, y.tr, alpha = 1)
lasso.fit <- glmnet(x.tr, y.tr, alpha = 1)
coef(lasso.fit, s = lasso.cv$lambda.1se)

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                    s1
## (Intercept) 1.5160139
## zn          .        
## indus       .        
## chas        .        
## nox         .        
## rm          .        
## age         .        
## dis         .        
## rad         0.2467164
## tax         .        
## ptratio     .        
## lstat       .        
## medv        .

lasso.predict <- predict(lasso.fit, s = lasso.cv$lambda.1se, newx = x.te)
mse.lasso <- mean((y.te - lasso.predict)^2)

# Ridge regression
ridge.cv <- cv.glmnet(x.tr, y.tr, alpha = 0)
ridge.fit <- glmnet(x.tr, y.tr, alpha = 0)
coef(ridge.fit, s = ridge.cv$lambda.1se)

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  1.415322414
## zn          -0.002894736
## indus        0.024681808
## chas        -0.142582896
## nox          1.589732481
## rm          -0.133972959
## age          0.005513600
## dis         -0.081479027
## rad          0.037730343
## tax          0.001699419
## ptratio      0.061789713
## lstat        0.031522083
## medv        -0.021769896

ridge.predict <- predict(ridge.fit, s = ridge.cv$lambda.1se, newx = x.te)
mse.ridge <- mean((y.te - ridge.predict)^2)

# Elastic Net regression
enet.cv <- cv.glmnet(x.tr, y.tr, alpha = 0.5) 
enet.fit <- glmnet(x.tr, y.tr, alpha = 0.5)
coef(enet.fit, s = enet.cv$lambda.1se)

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept) 1.002704193
## zn          .          
## indus       .          
## chas        .          
## nox         .          
## rm          .          
## age         .          
## dis         .          
## rad         0.169922764
## tax         0.002874765
## ptratio     .          
## lstat       0.005833654
## medv        .

enet.predict <- predict(enet.fit, s = enet.cv$lambda.1se, newx = x.te)
mse.enet <- mean((y.te - enet.predict)^2)

# PCA regression
pcr.Boston <- pcr(crim ~ . , data = Boston[tr.ind,], scale=T, validation = "CV")
press.Boston <- sqrt(pcr.Boston$validation[["PRESS"]])
best_ncomp_pcr <- which.min(press.Boston)
pcr.pred <- predict(pcr.Boston, within(Boston,rm("crim"))[-tr.ind,], ncomp = best_ncomp_pcr)
mse.pcr <- mean((y.te - pcr.pred)^2)

# PLS regression
pls.Boston <- plsr(crim ~ . , data = Boston[tr.ind,], scale=T, validation = "CV")
best_ncomp_pls <- which.min(pls.Boston$validation[["PRESS"]]) 
pls.pred <- predict(pls.Boston, within(Boston,rm("crim"))[-tr.ind,], ncomp = best_ncomp_pls)
mse.pls <- mean((y.te - pls.pred)^2)

# Combine results into a data frame
mse_comparison <- data.frame(
  Model = c("Full OLS", "Best Subset (Adj R^2)", "Forward Selection", "Backward Elimination", "Stepwise Regression", "Lasso", "Ridge", "Elastic Net", "PCA Regression", "PLS Regression"),
  MSE = c(mse.lm, mse.adjr2, mse.forward, mse.backward, mse.stepwise, mse.lasso, mse.ridge, mse.enet, mse.pcr, mse.pls)
)

print(mse_comparison)

##                    Model      MSE
## 1               Full OLS 24.80042
## 2  Best Subset (Adj R^2) 24.85268
## 3      Forward Selection 24.85268
## 4   Backward Elimination 24.85268
## 5    Stepwise Regression 24.67800
## 6                  Lasso 29.52554
## 7                  Ridge 33.75263
## 8            Elastic Net 30.38215
## 9         PCA Regression 24.80042
## 10        PLS Regression 24.73771

I chose a 70/30 train test split. I use OLS with the full set of predictors as my baseline MSE (24.8) to compare my other models with. PCA regression performs the same while PLS regression improves upon that with (24.73). The best model is actually stepwise regression (24.68). It seems as though Lasso (29.53), Ridge (33.75), and Elastic Net (30.38) performed the worst. This can be due to the increase in bias being greater than the decrease in variance for this Boston dataset.

c) Does your chosen model involve all of the features in the data set? Why or why not?

For this seed(1234), the best model of stepwise regression did not include all of the predictors. Subset selection methods in general seem to perform best. The next best performing models are the dimensionality reduction models. Even though they do not necessarily use all the components, they still involve all of the features in the data set since each PC is a combination of all 12 features. Finally, the poorest performing models are the shrinkage models, with Ridge, the only one that uses all the predictors, performing the worst.

The best explanation is that there are irrelevant or redundant predictors in the Boston dataset. While PCR and PLS did not remove such predictors, they project the original features into a new set of uncorrelated, orthogonal components, which helps deal with multicollinearity. Finally, Ridge does not eliminate predictors, only shrinking their coefficients, and thus being unable to deal with multicollinearity.