1. In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.
  a. Use the rnorm() function to generate a predictor \(X\) of length \(n = 100\), as well as a noise vector \(\epsilon\) of length \(n = 100\).
  b. Generate a response vector \(Y\) of length \(n = 100\) according to the model \(Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon\), where \(\beta_0\), \(\beta_1\), \(\beta_2\), and \(\beta_3\) are constants of your choice.
  c. Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors \(X, X^2, \dots, X^{10}\). What is the best model obtained according to Cp, BIC, and adjusted \(R^2\)? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both \(X\) and \(Y\).
  d. Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?
  e. Now fit a lasso model to the simulated data, again using \(X, X^2, \dots, X^{10}\) as predictors. Use cross-validation to select the optimal value of \(\lambda\). Create plots of the cross-validation error as a function of \(\lambda\). Report the resulting coefficient estimates, and discuss the results obtained.
  f. Now generate a response vector \(Y\) according to the model \(Y = \beta_0 + \beta_7X^7 + \epsilon\), and perform best subset selection and the lasso. Discuss the results obtained.

Question 8

In this exercise, we will generate simulated data and perform best subset selection.

a. Use the rnorm() function to generate a predictor \(X\) of length \(n = 100\), as well as a noise vector \(\epsilon\) of length \(n = 100\).

set.seed(42)
x <- rnorm(100)
ep <- rnorm(100)

We use rnorm(100) to generate the predictor \(X\) and noise vector \(\epsilon\).

b. Generate a response vector \(Y\) of length \(n = 100\) according to the model:

\[Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon,\] where \(\beta_0 = 2\), \(\beta_1 = 3\), \(\beta_2 = -2\), and \(\beta_3 = 0.5\).

y <- 2 + 3 * x - 2 * x^2 + 0.5 * x^3 + ep

Here, the response vector \(Y\) is generated based on the specified polynomial model with random noise.

c. Use the regsubsets() function to perform best subset selection and choose the best model containing the predictors \(X, X^2, ..., X^{10}\).

library(leaps)  # provides regsubsets()

dat <- data.frame(x, y)
regfit_full <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat)
reg_summary <- summary(regfit_full)

# Plotting Cp, BIC, and Adjusted R-squared
par(mfrow = c(1, 3))
plot(reg_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(reg_summary$cp), reg_summary$cp[which.min(reg_summary$cp)], col = "red", cex = 2, pch = 20)

plot(reg_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(reg_summary$bic), reg_summary$bic[which.min(reg_summary$bic)], col = "red", cex = 2, pch = 20)

plot(reg_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", type = "l")
points(which.max(reg_summary$adjr2), reg_summary$adjr2[which.max(reg_summary$adjr2)], col = "red", cex = 2, pch = 20)

In this step, we perform best subset selection and plot Cp, BIC, and adjusted \(R^2\) against the number of variables; the red dots mark the model size chosen by each criterion.
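The exercise also asks for the coefficients of the chosen model. A minimal sketch, reusing the reg_summary object above (the exact size selected depends on the criterion and the random seed):

# Model sizes chosen by each criterion
which.min(reg_summary$cp)
which.min(reg_summary$bic)
which.max(reg_summary$adjr2)

# Coefficients of, e.g., the BIC-selected model
coef(regfit_full, which.min(reg_summary$bic))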

d. Repeat using forward and backward stepwise selection. Compare the results.

# Forward stepwise selection
regfit_fwd <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat, method = "forward")
summary(regfit_fwd)
## Subset selection object
## Call: regsubsets.formula(y ~ poly(x, 10, raw = TRUE), data = dat, method = "forward")
## 10 Variables  (and intercept)
##                           Forced in Forced out
## poly(x, 10, raw = TRUE)1      FALSE      FALSE
## poly(x, 10, raw = TRUE)2      FALSE      FALSE
## poly(x, 10, raw = TRUE)3      FALSE      FALSE
## poly(x, 10, raw = TRUE)4      FALSE      FALSE
## poly(x, 10, raw = TRUE)5      FALSE      FALSE
## poly(x, 10, raw = TRUE)6      FALSE      FALSE
## poly(x, 10, raw = TRUE)7      FALSE      FALSE
## poly(x, 10, raw = TRUE)8      FALSE      FALSE
## poly(x, 10, raw = TRUE)9      FALSE      FALSE
## poly(x, 10, raw = TRUE)10     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          poly(x, 10, raw = TRUE)1 poly(x, 10, raw = TRUE)2
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      "*"                     
## 3  ( 1 ) "*"                      "*"                     
## 4  ( 1 ) "*"                      "*"                     
## 5  ( 1 ) "*"                      "*"                     
## 6  ( 1 ) "*"                      "*"                     
## 7  ( 1 ) "*"                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(x, 10, raw = TRUE)3 poly(x, 10, raw = TRUE)4
## 1  ( 1 ) "*"                      " "                     
## 2  ( 1 ) "*"                      " "                     
## 3  ( 1 ) "*"                      " "                     
## 4  ( 1 ) "*"                      " "                     
## 5  ( 1 ) "*"                      " "                     
## 6  ( 1 ) "*"                      " "                     
## 7  ( 1 ) "*"                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(x, 10, raw = TRUE)5 poly(x, 10, raw = TRUE)6
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      " "                     
## 3  ( 1 ) " "                      " "                     
## 4  ( 1 ) "*"                      " "                     
## 5  ( 1 ) "*"                      " "                     
## 6  ( 1 ) "*"                      " "                     
## 7  ( 1 ) "*"                      " "                     
## 8  ( 1 ) "*"                      " "                     
##          poly(x, 10, raw = TRUE)7 poly(x, 10, raw = TRUE)8
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      " "                     
## 3  ( 1 ) " "                      " "                     
## 4  ( 1 ) " "                      " "                     
## 5  ( 1 ) " "                      " "                     
## 6  ( 1 ) "*"                      " "                     
## 7  ( 1 ) "*"                      " "                     
## 8  ( 1 ) "*"                      " "                     
##          poly(x, 10, raw = TRUE)9 poly(x, 10, raw = TRUE)10
## 1  ( 1 ) " "                      " "                      
## 2  ( 1 ) " "                      " "                      
## 3  ( 1 ) " "                      " "                      
## 4  ( 1 ) " "                      " "                      
## 5  ( 1 ) "*"                      " "                      
## 6  ( 1 ) "*"                      " "                      
## 7  ( 1 ) "*"                      " "                      
## 8  ( 1 ) "*"                      "*"
# Backward stepwise selection
regfit_bwd <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat, method = "backward")
summary(regfit_bwd)
## Subset selection object
## Call: regsubsets.formula(y ~ poly(x, 10, raw = TRUE), data = dat, method = "backward")
## 10 Variables  (and intercept)
##                           Forced in Forced out
## poly(x, 10, raw = TRUE)1      FALSE      FALSE
## poly(x, 10, raw = TRUE)2      FALSE      FALSE
## poly(x, 10, raw = TRUE)3      FALSE      FALSE
## poly(x, 10, raw = TRUE)4      FALSE      FALSE
## poly(x, 10, raw = TRUE)5      FALSE      FALSE
## poly(x, 10, raw = TRUE)6      FALSE      FALSE
## poly(x, 10, raw = TRUE)7      FALSE      FALSE
## poly(x, 10, raw = TRUE)8      FALSE      FALSE
## poly(x, 10, raw = TRUE)9      FALSE      FALSE
## poly(x, 10, raw = TRUE)10     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: backward
##          poly(x, 10, raw = TRUE)1 poly(x, 10, raw = TRUE)2
## 1  ( 1 ) "*"                      " "                     
## 2  ( 1 ) "*"                      "*"                     
## 3  ( 1 ) "*"                      "*"                     
## 4  ( 1 ) "*"                      "*"                     
## 5  ( 1 ) "*"                      "*"                     
## 6  ( 1 ) "*"                      "*"                     
## 7  ( 1 ) "*"                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(x, 10, raw = TRUE)3 poly(x, 10, raw = TRUE)4
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      " "                     
## 3  ( 1 ) " "                      " "                     
## 4  ( 1 ) " "                      " "                     
## 5  ( 1 ) " "                      " "                     
## 6  ( 1 ) " "                      " "                     
## 7  ( 1 ) " "                      " "                     
## 8  ( 1 ) " "                      " "                     
##          poly(x, 10, raw = TRUE)5 poly(x, 10, raw = TRUE)6
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      " "                     
## 3  ( 1 ) "*"                      " "                     
## 4  ( 1 ) "*"                      " "                     
## 5  ( 1 ) "*"                      " "                     
## 6  ( 1 ) "*"                      "*"                     
## 7  ( 1 ) "*"                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(x, 10, raw = TRUE)7 poly(x, 10, raw = TRUE)8
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      " "                     
## 3  ( 1 ) " "                      " "                     
## 4  ( 1 ) "*"                      " "                     
## 5  ( 1 ) "*"                      " "                     
## 6  ( 1 ) "*"                      " "                     
## 7  ( 1 ) "*"                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(x, 10, raw = TRUE)9 poly(x, 10, raw = TRUE)10
## 1  ( 1 ) " "                      " "                      
## 2  ( 1 ) " "                      " "                      
## 3  ( 1 ) " "                      " "                      
## 4  ( 1 ) " "                      " "                      
## 5  ( 1 ) "*"                      " "                      
## 6  ( 1 ) "*"                      " "                      
## 7  ( 1 ) "*"                      " "                      
## 8  ( 1 ) "*"                      "*"

Here, we apply both forward and backward stepwise selection. Forward selection's three-variable model contains exactly the \(X\), \(X^2\), and \(X^3\) terms used to generate the data, whereas backward selection never includes \(X^3\), picking up \(X^5\) and other higher-order terms instead; such disagreements are common because raw polynomial terms are highly correlated. The selections can be compared with the best-subset results from (c) directly, as sketched below.
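To make the comparison with (c) concrete, one can place the criterion-selected coefficients from the three fits side by side. A sketch, using BIC as the criterion:

# Coefficients of the BIC-selected model under each search strategy
coef(regfit_full, which.min(summary(regfit_full)$bic))
coef(regfit_fwd, which.min(summary(regfit_fwd)$bic))
coef(regfit_bwd, which.min(summary(regfit_bwd)$bic))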

e. Fit a lasso model to the simulated data using \(X, X^2, ..., X^{10}\) as predictors. Use cross-validation to select the optimal value of \(\lambda\).

library(glmnet)  # provides glmnet() and cv.glmnet()

X_matrix <- model.matrix(y ~ poly(x, 10, raw = TRUE), data = dat)[, -1]
lasso_cv <- cv.glmnet(X_matrix, dat$y, alpha = 1)
best_lambda <- lasso_cv$lambda.min

# Plot cross-validation error as a function of lambda
plot(lasso_cv)

# Fitting the final lasso model
lasso_fit <- glmnet(X_matrix, dat$y, alpha = 1, lambda = best_lambda)
predict(lasso_fit, type = "coefficients", s = best_lambda)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                   s1
## (Intercept)                1.8457308
## poly(x, 10, raw = TRUE)1   2.9092918
## poly(x, 10, raw = TRUE)2  -1.9287428
## poly(x, 10, raw = TRUE)3   0.5161012
## poly(x, 10, raw = TRUE)4   .        
## poly(x, 10, raw = TRUE)5   .        
## poly(x, 10, raw = TRUE)6   .        
## poly(x, 10, raw = TRUE)7   .        
## poly(x, 10, raw = TRUE)8   .        
## poly(x, 10, raw = TRUE)9   .        
## poly(x, 10, raw = TRUE)10  .

We perform lasso regression, using cross-validation to choose the optimal \(\lambda\). The lasso sets the coefficients on \(X^4\) through \(X^{10}\) exactly to zero and retains \(X\), \(X^2\), and \(X^3\), with estimates close to the true values \(\beta_1 = 3\), \(\beta_2 = -2\), and \(\beta_3 = 0.5\).
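As a quick check, the retained estimates can be set beside the true coefficients from (b). A minimal sketch (the est object is introduced here only for illustration):

# Compare the true coefficients from (b) with the lasso estimates
est <- as.matrix(predict(lasso_fit, type = "coefficients", s = best_lambda))
rbind(true = c(2, 3, -2, 0.5), lasso = est[1:4, 1])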

f. Generate a response vector \(Y\) according to the model:

\[Y = \beta_0 + \beta_7X^7 + \epsilon,\] and perform best subset selection and lasso.

# New response vector based on X^7
y_new <- 2 + 0.2 * x^7 + ep
dat$y <- y_new

# Best subset selection for the new model
regfit_new <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat)
summary(regfit_new)
## Subset selection object
## Call: regsubsets.formula(y ~ poly(x, 10, raw = TRUE), data = dat)
## 10 Variables  (and intercept)
##                           Forced in Forced out
## poly(x, 10, raw = TRUE)1      FALSE      FALSE
## poly(x, 10, raw = TRUE)2      FALSE      FALSE
## poly(x, 10, raw = TRUE)3      FALSE      FALSE
## poly(x, 10, raw = TRUE)4      FALSE      FALSE
## poly(x, 10, raw = TRUE)5      FALSE      FALSE
## poly(x, 10, raw = TRUE)6      FALSE      FALSE
## poly(x, 10, raw = TRUE)7      FALSE      FALSE
## poly(x, 10, raw = TRUE)8      FALSE      FALSE
## poly(x, 10, raw = TRUE)9      FALSE      FALSE
## poly(x, 10, raw = TRUE)10     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          poly(x, 10, raw = TRUE)1 poly(x, 10, raw = TRUE)2
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      " "                     
## 3  ( 1 ) " "                      " "                     
## 4  ( 1 ) " "                      " "                     
## 5  ( 1 ) " "                      " "                     
## 6  ( 1 ) " "                      "*"                     
## 7  ( 1 ) " "                      " "                     
## 8  ( 1 ) " "                      "*"                     
##          poly(x, 10, raw = TRUE)3 poly(x, 10, raw = TRUE)4
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) "*"                      " "                     
## 3  ( 1 ) "*"                      " "                     
## 4  ( 1 ) "*"                      "*"                     
## 5  ( 1 ) "*"                      " "                     
## 6  ( 1 ) "*"                      " "                     
## 7  ( 1 ) "*"                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(x, 10, raw = TRUE)5 poly(x, 10, raw = TRUE)6
## 1  ( 1 ) " "                      " "                     
## 2  ( 1 ) " "                      " "                     
## 3  ( 1 ) "*"                      " "                     
## 4  ( 1 ) "*"                      " "                     
## 5  ( 1 ) "*"                      "*"                     
## 6  ( 1 ) "*"                      "*"                     
## 7  ( 1 ) "*"                      "*"                     
## 8  ( 1 ) "*"                      "*"                     
##          poly(x, 10, raw = TRUE)7 poly(x, 10, raw = TRUE)8
## 1  ( 1 ) "*"                      " "                     
## 2  ( 1 ) "*"                      " "                     
## 3  ( 1 ) " "                      " "                     
## 4  ( 1 ) " "                      " "                     
## 5  ( 1 ) " "                      " "                     
## 6  ( 1 ) " "                      "*"                     
## 7  ( 1 ) " "                      "*"                     
## 8  ( 1 ) " "                      "*"                     
##          poly(x, 10, raw = TRUE)9 poly(x, 10, raw = TRUE)10
## 1  ( 1 ) " "                      " "                      
## 2  ( 1 ) " "                      " "                      
## 3  ( 1 ) "*"                      " "                      
## 4  ( 1 ) "*"                      " "                      
## 5  ( 1 ) "*"                      "*"                      
## 6  ( 1 ) "*"                      " "                      
## 7  ( 1 ) "*"                      "*"                      
## 8  ( 1 ) "*"                      "*"
# Lasso for the new model
X_matrix_new <- model.matrix(y ~ poly(x, 10, raw = TRUE), data = dat)[, -1]
lasso_cv_new <- cv.glmnet(X_matrix_new, dat$y, alpha = 1)
best_lambda_new <- lasso_cv_new$lambda.min
plot(lasso_cv_new)

# Fitting the lasso model
lasso_fit_new <- glmnet(X_matrix_new, dat$y, alpha = 1, lambda = best_lambda_new)
predict(lasso_fit_new, type = "coefficients", s = best_lambda_new)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                     s1
## (Intercept)               1.7236801684
## poly(x, 10, raw = TRUE)1  .           
## poly(x, 10, raw = TRUE)2  .           
## poly(x, 10, raw = TRUE)3  .           
## poly(x, 10, raw = TRUE)4  .           
## poly(x, 10, raw = TRUE)5  0.0455429629
## poly(x, 10, raw = TRUE)6  .           
## poly(x, 10, raw = TRUE)7  0.1820763243
## poly(x, 10, raw = TRUE)8  .           
## poly(x, 10, raw = TRUE)9  0.0008828036
## poly(x, 10, raw = TRUE)10 .

Here, we simulate data in which \(X^7\) is the only informative predictor. Best subset selection identifies \(X^7\) alone as the best one-variable model, while the lasso concentrates most of the weight on \(X^7\) (coefficient about 0.18, versus the true 0.2) but also keeps small coefficients on \(X^5\) and \(X^9\). The coefficients of the chosen best-subset model can be extracted as sketched below.
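For completeness, the coefficients of the best-subset model chosen by BIC for the new data can be extracted in the same way as before. A sketch:

# Coefficients of the BIC-selected best-subset model for the X^7 data
reg_summary_new <- summary(regfit_new)
coef(regfit_new, which.min(reg_summary_new$bic))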


1. In this exercise, we will predict the number of applications received using the other variables in the College data set.
  a. Split the data set into a training set and a test set.
  b. Fit a linear model using least squares on the training set, and report the test error obtained.
  c. Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
  d. Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
  e. Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
  f. Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
  g. Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Question 9

In this exercise, we aim to predict the number of applications received by various colleges in the College dataset using different modeling techniques, including linear regression, ridge regression, lasso, principal component regression (PCR), and partial least squares (PLS).

Let’s walk through each part of the problem step by step, along with some additional insights to better understand the results.

a. Splitting the Data into Training and Test Sets

We start by splitting the College data into a training set (roughly two thirds of the observations) and a test set. A random seed is set so the results are reproducible.

library(ISLR2)  # College data set (also available in the older ISLR package)

set.seed(42)
train <- sample(nrow(College), nrow(College) * 2 / 3)
test <- setdiff(seq_len(nrow(College)), train)
mse <- list()  # To store test mean squared errors for comparison

b. Fitting a Linear Model using Least Squares

Next, we fit a linear model using ordinary least squares (OLS) on the training set and compute the test mean squared error (MSE) to evaluate its predictive performance.

fit <- lm(Apps ~ ., data = College[train, ])
(mse$lm <- mean((predict(fit, College[test, ]) - College$Apps[test])^2))
## [1] 1695269

The test MSE gives us an indication of how well the linear model predicts the number of college applications. Linear regression serves as a baseline model for comparison.
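To express this error on the scale of the response, one can take the square root of the test MSE (not required by the exercise, just a convenience):

# Test RMSE, i.e. the typical prediction error in number of applications
sqrt(mse$lm)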

c. Fitting a Ridge Regression Model

We fit a ridge regression model, where λ (the regularization parameter) is chosen through cross-validation. Ridge regression shrinks the coefficients and is particularly useful when multicollinearity is present.

mm <- model.matrix(Apps ~ ., data = College[train, ])
fit2 <- cv.glmnet(mm, College$Apps[train], alpha = 0)
p <- predict(fit2, model.matrix(Apps ~ ., data = College[test, ]), s = fit2$lambda.min)
(mse$ridge <- mean((p - College$Apps[test])^2))
## [1] 2804369

By cross-validating over λ, we select the regularization strength that minimizes the estimated test error, which helps reduce overfitting. Note, however, that on this split the ridge test MSE is noticeably higher than the least-squares baseline; ridge regression tends to pay off more in high-dimensional or strongly collinear settings.
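If desired, the ridge cross-validation curve and the selected λ can be inspected directly, mirroring the lasso plot in Question 8. A sketch:

# Cross-validation error as a function of log(lambda) for the ridge fit
plot(fit2)
fit2$lambda.min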

d. Fitting a Lasso Model

Next, we fit a lasso model, another form of regularized regression, but unlike ridge, it can perform feature selection by forcing some coefficients to zero. Again, λ is chosen through cross-validation.

mm <- model.matrix(Apps ~ ., data = College[train, ])
fit3 <- cv.glmnet(mm, College$Apps[train], alpha = 1)
p <- predict(fit3, model.matrix(Apps ~ ., data = College[test, ]), s = fit3$lambda.min)
(mse$lasso <- mean((p - College$Apps[test])^2))
## [1] 1822322

Lasso also offers insights into which predictors are more influential by zeroing out less important ones. We additionally check how many non-zero coefficients remain in the model.

sum(coef(fit3, s = fit3$lambda.min) != 0)
## [1] 15

This tells us how many coefficients (including the intercept) remain non-zero after regularization, which matters for model interpretability; the specific predictors retained can be listed as sketched below.
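A sketch for listing the retained terms (the b object is introduced here only for illustration):

# Names of the coefficients the lasso keeps non-zero at lambda.min
b <- as.matrix(coef(fit3, s = fit3$lambda.min))
rownames(b)[b != 0]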

e. Fitting a PCR Model

Principal component regression (PCR) uses principal components as predictors. The number of components M is chosen through cross-validation. PCR is particularly helpful when there is multicollinearity or the need for dimensionality reduction.

library(pls)  # provides pcr(), plsr() and validationplot()

fit4 <- pcr(Apps ~ ., data = College[train, ], scale = TRUE, validation = "CV")
validationplot(fit4, val.type = "MSEP")  # Plot to select M

p <- predict(fit4, College[test, ], ncomp = 17)
(mse$pcr <- mean((p - College$Apps[test])^2))
## [1] 1695269

The plot of MSEP (mean squared error of prediction) helps us determine the number of principal components M that minimizes the cross-validation error. Here all 17 components are used, in which case PCR reproduces the full least-squares fit exactly, which is why the test MSE matches the linear model from (b). More generally, PCR's performance depends on how well the selected components capture the underlying data structure.
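The CV-minimizing M can also be read off numerically rather than from the plot. A sketch, assuming pls::MSEP() includes the intercept-only (0-component) model as its first entry, which is its default behaviour:

# Number of components minimizing the CV estimate of MSEP
pcr_msep <- MSEP(fit4, estimate = "CV")
which.min(pcr_msep$val[1, 1, ]) - 1  # subtract 1 for the 0-component model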

f. Fitting a PLS Model

Partial least squares (PLS) regression is similar to PCR but finds components that not only account for variance in the predictors but also for the response variable. Like PCR, the number of components M is selected through cross-validation.

fit5 <- plsr(Apps ~ ., data = College[train, ], scale = TRUE, validation = "CV")
validationplot(fit5, val.type = "MSEP")

p <- predict(fit5, College[test, ], ncomp = 12)
(mse$pls <- mean((p - College$Apps[test])^2))
## [1] 1696902

PLS constructs its components using the response as well as the predictors, so it often reaches a good fit with fewer components than PCR; here 12 components give a test MSE essentially equal to that of the full least-squares fit.
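The same MSEP-based check as for PCR applies to the PLS fit:

# Number of PLS components minimizing the CV estimate of MSEP
pls_msep <- MSEP(fit5, estimate = "CV")
which.min(pls_msep$val[1, 1, ]) - 1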

g. Comparing the Results

We now compare the test MSE from all five models using a bar plot for better visualization.

barplot(unlist(mse), ylab = "Test MSE", horiz = TRUE)

The test errors from least squares, PCR, and PLS are essentially identical (PCR with all 17 components reproduces the least-squares fit), the lasso does slightly worse, and ridge regression performs noticeably worse on this particular split. Overall, the number of applications can be predicted reasonably well, and apart from ridge regression there is little difference among the five approaches; the regularized and dimension-reduction methods would be expected to pay off more when there are many predictors relative to the number of observations or strong multicollinearity.
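For a numeric summary alongside the bar plot, the stored test MSEs can simply be sorted:

# Test MSEs of the five methods, from smallest to largest
sort(unlist(mse))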