8. In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.

a). Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector ϵ of length n = 100.

set.seed(1)
n <- 100
X <- rnorm(n)
epsilon <- rnorm(n)

b). Generate a response vector Y of length n = 100 according to the model

\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon \]

where β0, β1, β2, and β3 are constants of your choice.

beta_0 <- 1
beta_1 <- 2
beta_2 <- 3
beta_3 <- 4

Y <- beta_0 + beta_1*X + beta_2*X^2 + beta_3*X^3 + epsilon

c). Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors X, X^2, …, X^10. What is the best model obtained according to Cp, BIC, and adjusted R²? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y.

library(leaps)
data <- data.frame(Y = Y, X = X)
for (i in 2:10) {
  data[[paste0("X", i)]] <- X^i
}

best_fit <- regsubsets(Y ~ ., data = data, nvmax = 10)

best_summary <- summary(best_fit)

par(mfrow = c(1, 3))
plot(best_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "b")
which.min(best_summary$cp)
## [1] 4
points(which.min(best_summary$cp), best_summary$cp[which.min(best_summary$cp)], col = "red", cex = 2, pch = 20)

plot(best_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "b")
which.min(best_summary$bic)
## [1] 3
points(which.min(best_summary$bic), best_summary$bic[which.min(best_summary$bic)], col = "red", cex = 2, pch = 20)

plot(best_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", type = "b")
which.max(best_summary$adjr2)
## [1] 4
points(which.max(best_summary$adjr2), best_summary$adjr2[which.max(best_summary$adjr2)], col = "red", cex = 2, pch = 20)

best_summary$cp       
##  [1] 1123.2892318  109.3256041    2.1859433    0.6067483    2.1782005
##  [6]    3.9955812    5.7869063    7.1694092    9.1535580   11.0000000
best_summary$bic     
##  [1] -262.7744 -437.2907 -509.6393 -508.9084 -504.7773 -500.3748 -496.0018
##  [8] -492.0868 -487.4994 -483.0666
best_summary$adjr2 
##  [1] 0.9334429 0.9887867 0.9947516 0.9948979 0.9948680 0.9948233 0.9947792
##  [8] 0.9947581 0.9947008 0.9946505
which.min(best_summary$cp)
## [1] 4
which.min(best_summary$bic)
## [1] 3
which.max(best_summary$adjr2)
## [1] 4

Interpretation:

Best model by Cp: 4 predictors. Cp measures model fit with a penalty for the number of predictors, and a smaller value indicates a better model; the minimum occurs at the 4-variable model.

Best model by BIC: 3 predictors. BIC penalizes model complexity more heavily than Cp, so it prefers the simpler 3-variable model.

Best model by adjusted R²: 4 predictors. Adjusted R² corrects R² for the number of predictors, and a higher value is better; it is maximized at 4 predictors, indicating a good balance of fit and complexity.

Based on best subset selection:

The model with 4 predictors was selected as best by Cp and adjusted R², while the model with 3 predictors was selected by BIC. Either a 3- or 4-variable model is therefore reasonable, depending on how the trade-off between accuracy and simplicity is weighed. Since the data were simulated in part (b) from a cubic in X (terms X, X2, and X3), the 3-variable model matches the true model, and BIC identified it correctly.
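The exercise also asks for the coefficients of the best model. As a quick sketch (using the best_fit object fitted above), they can be read directly off the regsubsets fit:

# Coefficient estimates of the two candidate models identified above
coef(best_fit, 3)   # 3-variable model chosen by BIC
coef(best_fit, 4)   # 4-variable model chosen by Cp and adjusted R^2

The 3-variable estimates should be close to the true values β0 = 1, β1 = 2, β2 = 3, and β3 = 4 used in part (b).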

d). Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?

forward_fit <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "forward")
backward_fit <- regsubsets(Y ~ ., data = data, nvmax = 10, method = "backward")

summary(forward_fit)$which[which.min(summary(forward_fit)$bic), ]
## (Intercept)           X          X2          X3          X4          X5 
##        TRUE        TRUE        TRUE        TRUE       FALSE       FALSE 
##          X6          X7          X8          X9         X10 
##       FALSE       FALSE       FALSE       FALSE       FALSE
summary(backward_fit)$which[which.min(summary(backward_fit)$bic), ]
## (Intercept)           X          X2          X3          X4          X5 
##        TRUE        TRUE        TRUE        TRUE       FALSE       FALSE 
##          X6          X7          X8          X9         X10 
##       FALSE       FALSE       FALSE       FALSE       FALSE

In part (d), I performed both forward and backward stepwise selection to identify the best-fitting model using the same set of predictors as in part (c). Judged by BIC, both methods selected the same 3-variable model containing X, X2, and X3.

This result aligns with the model selected by BIC in part (c), which also favored a 3-variable model. In contrast, Cp and Adjusted R² selected a slightly more complex model with 4 predictors, likely including X4.

Since the data were simulated from the cubic model in part (b), the 3-variable model is the true underlying model. The models chosen by BIC, forward stepwise selection, and backward stepwise selection therefore recover the correct structure, while Cp and adjusted R² overfit slightly by including one unnecessary predictor.
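To confirm this, the coefficient estimates of the BIC-selected 3-variable models from each search can be extracted and compared with the true coefficients from part (b); a quick sketch using the fitted objects above:

# Coefficients of the 3-variable models chosen by forward and backward selection
coef(forward_fit, 3)
coef(backward_fit, 3)

Both should return essentially the same estimates for the intercept, X, X2, and X3, close to the true values 1, 2, 3, and 4.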

e). Now fit a lasso model to the simulated data, again using X, X^2, …, X^10 as predictors. Use cross-validation to select the optimal value of λ. Create plots of the cross-validation error as a function of λ. Report the resulting coefficient estimates, and discuss the results obtained.

library(glmnet)
x_matrix <- model.matrix(Y ~ ., data = data)[, -1]

lasso_fit <- cv.glmnet(x_matrix, Y, alpha = 1)

plot(lasso_fit)

lasso_fit$lambda.min
## [1] 0.05794679
coef(lasso_fit, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept) 1.168794337
## X           2.164793590
## X2          2.639485133
## X3          3.800683773
## X4          0.041512567
## X5          0.014068421
## X6          .          
## X7          0.004039751
## X8          .          
## X9          .          
## X10         .

In part (e), I fit a lasso regression model using the predictors X, X^2, …, X^10 and selected the tuning parameter λ by cross-validation. The optimal value was approximately 0.058. At this value, the lasso retained X, X2, and X3 with relatively large coefficients, closely matching the true model used to simulate the data. The higher-order terms X6, X8, X9, and X10 were shrunk exactly to zero and thus excluded from the model, while X4, X5, and X7 received small but non-zero coefficients, most likely reflecting random noise and the correlation among the polynomial terms. Overall, the lasso identified the important variables while regularizing or eliminating the irrelevant ones, supporting its usefulness as a model selection tool in this setting.
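The small non-zero coefficients on X4, X5, and X7 can often be removed by choosing λ with the one-standard-error rule rather than the CV minimum. A brief sketch, reusing the lasso_fit object above:

# Larger lambda from the one-standard-error rule; usually gives a sparser model
lasso_fit$lambda.1se
coef(lasso_fit, s = "lambda.1se")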

f). Now generate a response vector Y according to the model

\[ Y = \beta_0 + \beta_7 X^7 + \epsilon \]

and perform best subset selection and the lasso. Discuss the results obtained.

Y_new <- 1 + 7 * X^7 + rnorm(n)
data_new <- data
data_new$Y <- Y_new

best_fit_new <- regsubsets(Y ~ ., data = data_new, nvmax = 10)
summary(best_fit_new)$which[which.min(summary(best_fit_new)$bic), ]
## (Intercept)           X          X2          X3          X4          X5 
##        TRUE       FALSE       FALSE       FALSE       FALSE       FALSE 
##          X6          X7          X8          X9         X10 
##       FALSE        TRUE       FALSE       FALSE       FALSE
x_matrix_new <- model.matrix(Y ~ ., data = data_new)[, -1]
lasso_new <- cv.glmnet(x_matrix_new, Y_new, alpha = 1)
plot(lasso_new)

coef(lasso_new, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                   s1
## (Intercept) 1.897244
## X           .       
## X2          .       
## X3          .       
## X4          .       
## X5          .       
## X6          .       
## X7          6.795724
## X8          .       
## X9          .       
## X10         .

Best Subset Selection Output:

Selected variables: the intercept and X7, the only true predictor in the model. All other predictors (X, X2, X3, and so on) were excluded, which is exactly what we want.

Lasso Output: At the value of λ selected by cross-validation, the coefficient for X7 is approximately 6.80, and all other coefficients were shrunk to exactly zero.

Both best subset selection and lasso correctly identified X7 as the only relevant predictor in the model. This demonstrates that:

Best subset selection can recover the true model when the signal is strong and well isolated, as it is for X7. The lasso shrank every irrelevant coefficient to exactly zero and retained only the true signal (X7), demonstrating strong sparsity recovery. Both methods accurately recover the structure of this sparse model, which also highlights the strength of the lasso when irrelevant variables greatly outnumber the true predictors.
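As an additional sanity check (my own sketch, not part of the original answer), the new response can be refit by ordinary least squares on X7 alone and compared with the lasso estimate, which is shrunk slightly toward zero:

# OLS on the single true predictor; the estimate should be close to the true beta_7 = 7,
# whereas the lasso coefficient above (about 6.80) is shrunk toward zero by the penalty
summary(lm(Y ~ X7, data = data_new))$coefficients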

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR2)
library(tidyverse)
data("College")

(a) Split the data set into a training set and a test set.

set.seed(1)
train_indices <- sample(1:nrow(College), nrow(College) * 0.7)
train <- College[train_indices, ]
test <- College[-train_indices, ]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

lm.fit <- lm(Apps ~ ., data = train)
lm.pred <- predict(lm.fit, newdata = test)
lm.mse <- mean((lm.pred - test$Apps)^2)
lm.mse
## [1] 1261630

The test error is 1,261,630.
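To put this error on the scale of the response, the square root of the test MSE gives a typical prediction error in numbers of applications (a quick check using the lm.mse object above):

sqrt(lm.mse)   # root-mean-squared test error, in applications

This works out to roughly 1,120 applications per college.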

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)

x_train <- model.matrix(Apps ~ ., data = train)[, -1]
x_test <- model.matrix(Apps ~ ., data = test)[, -1]
y_train <- train$Apps
y_test <- test$Apps

set.seed(1)
ridge.cv <- cv.glmnet(x_train, y_train, alpha = 0)
ridge.pred <- predict(ridge.cv, s = ridge.cv$lambda.min, newx = x_test)
ridge.mse <- mean((ridge.pred - y_test)^2)
ridge.mse
## [1] 1121034

The test error is 1,121,034.

(d) Fit a lasso model on the training set, with λ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

set.seed(1)
lasso.cv <- cv.glmnet(x_train, y_train, alpha = 1)
lasso.pred <- predict(lasso.cv, s = lasso.cv$lambda.min, newx = x_test)
lasso.mse <- mean((lasso.pred - y_test)^2)
lasso.mse
## [1] 1233246
lasso.coef <- predict(lasso.cv, s = lasso.cv$lambda.min, type = "coefficients")
sum(lasso.coef != 0)
## [1] 15

The test error is 1,233,246 and the number of non-zero coefficient estimates (including the intercept) is 15.
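Which predictors the lasso retained can be listed directly from the coefficient vector computed above; a small sketch:

# Names of the predictors with non-zero lasso coefficients (intercept excluded)
nonzero <- names(which(lasso.coef[, 1] != 0))
setdiff(nonzero, "(Intercept)")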

(e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
set.seed(1)
pcr.fit <- pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 543 17 
##  Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3895     3813     2129     2139     1798     1715     1709
## adjCV         3895     3814     2125     2136     1785     1703     1703
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1711     1656     1620      1615      1621      1621      1618
## adjCV     1705     1645     1614      1608      1615      1615      1612
##        14 comps  15 comps  16 comps  17 comps
## CV         1618      1546      1177      1138
## adjCV      1612      1529      1168      1130
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.051    57.00    64.42    70.27    75.65    80.65    84.26    87.61
## Apps    5.788    71.69    71.70    80.97    82.60    82.60    82.69    84.06
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.58     92.84     94.93     96.74     97.82     98.72     99.39
## Apps    84.55     84.82     84.86     84.86     85.01     85.05     89.81
##       16 comps  17 comps
## X        99.85    100.00
## Apps     93.03     93.32
validationplot(pcr.fit, val.type = "MSEP")

pcr.pred <- predict(pcr.fit, test, ncomp = 17)

pcr.mse <- mean((pcr.pred - test$Apps)^2)
pcr.mse
## [1] 1261630

Based on 10-fold cross-validation, the optimal number of components was 17, the maximum available. With M = 17, PCR uses all of the principal components and is therefore equivalent to ordinary least squares on the full predictor set, which is why the test MSE of 1,261,630 is identical to the linear regression error from part (b). In other words, PCR provided no dimension-reduction benefit here: essentially all of the components were needed to retain the predictive information in the original variables.
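To see how much accuracy is actually sacrificed by using fewer components, the test MSE can be computed for every value of M; a sketch reusing the pcr.fit object and test set from above:

# Test MSE of the PCR fit as a function of the number of components
pcr.test.mse <- sapply(1:17, function(m)
  mean((predict(pcr.fit, test, ncomp = m) - test$Apps)^2))
plot(1:17, pcr.test.mse, type = "b",
     xlab = "Number of components", ylab = "Test MSE")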

(f) Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(1)
pls.fit <- plsr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data:    X dimension: 543 17 
##  Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3895     1952     1743     1540     1451     1258     1171
## adjCV         3895     1947     1737     1532     1430     1231     1161
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1153     1147     1144      1141      1142      1141      1139
## adjCV     1145     1139     1136      1133      1134      1133      1131
##        14 comps  15 comps  16 comps  17 comps
## CV         1138      1138      1138      1138
## adjCV      1130      1130      1130      1130
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.68    47.43    62.46    64.88    67.34    72.68    77.20    80.92
## Apps    76.62    82.39    86.93    90.76    92.82    93.05    93.13    93.20
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.69     85.16     87.35     90.73     92.49     95.10     97.09
## Apps    93.26     93.28     93.30     93.31     93.32     93.32     93.32
##       16 comps  17 comps
## X        98.40    100.00
## Apps     93.32     93.32
validationplot(pls.fit, val.type = "MSEP")

pls.pred <- predict(pls.fit, test, ncomp = 15)
pls.mse <- mean((pls.pred - test$Apps)^2)
pls.mse
## [1] 1261576

In part (f), I fit a partial least squares (PLS) regression model on the training data. The 10-fold cross-validation error flattens out well before the full 17 components, and M = 15 attains essentially the minimum adjusted RMSEP, so I used 15 components. The resulting test MSE was 1,261,576, very slightly below the errors from linear regression and PCR. The PLS fit explains over 93% of the variance in the response with fewer components than PCR required, making it the more parsimonious of the two dimension-reduction approaches. This indicates that PLS captures the predictive structure of the data efficiently.
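The pls package also offers a more formal way to pick M from the cross-validation results. If the installed version of pls provides selectNcomp() (version 2.5 or later), a one-standard-error rule can be applied as a sketch:

# One-standard-error rule applied to the cross-validated PLS fit
selectNcomp(pls.fit, method = "onesigma", plot = TRUE)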

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Across the five modeling approaches—linear regression, ridge regression, lasso, principal components regression (PCR), and partial least squares (PLS)—performance in predicting the number of applications received was broadly similar. The test mean squared errors range from about 1.12 million for ridge regression to about 1.26 million for least squares, PCR, and PLS, with the lasso in between at roughly 1.23 million.

The least squares linear model had a test MSE of 1,261,630, which serves as the baseline. Ridge regression achieved the lowest test error (1,121,034, roughly 11% below the baseline), and the lasso was also modestly better than least squares (1,233,246) while setting a few coefficients exactly to zero, giving it the added advantage of a sparser model.

For the dimension reduction methods, PCR required all 17 components, making it equivalent to the full linear model and yielding an identical test MSE of 1,261,630. PLS reached essentially the same accuracy with 15 components (test MSE 1,261,576), so it extracted the predictive signal slightly more efficiently.

Overall, every method predicts the number of applications with similar accuracy; a rough test-set R² calculation is sketched below to put that accuracy on an interpretable scale. The small spread in test errors suggests that the relationship between the predictors and the response is essentially linear and is captured comparably well by all of these techniques. No method dramatically outperformed the others, although ridge regression gave the lowest test error here, while the lasso and PLS offered additional advantages in model simplicity and dimensionality reduction, respectively.
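A minimal sketch of that calculation, reusing the prediction objects created in parts (b)–(f):

# Test-set R^2: proportion of the variance in Apps explained on the held-out data
test.r2 <- function(pred) {
  1 - mean((pred - test$Apps)^2) / mean((test$Apps - mean(test$Apps))^2)
}
c(ols   = test.r2(lm.pred),
  ridge = test.r2(ridge.pred),
  lasso = test.r2(lasso.pred),
  pcr   = test.r2(pcr.pred),
  pls   = test.r2(pls.pred))

Since the test MSEs (about 1.1–1.3 million) are small relative to the variability of Apps (an intercept-only model has RMSEP near 3,900 in the validation tables above), the resulting R² values should all be on the order of 0.9.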