For parts (a) through (c), we choose the correct statement from i. through iv.

The correct answer is .

The lasso is than least squares because it shrinks the coefficient estimates toward zero, and in some cases sets coefficients exactly equal to zero. This added constraint reduces variance but increases bias. Therefore, the lasso can improve prediction accuracy when the .

\[ \boxed{\text{(a) iii}} \]

The correct answer is .

Ridge regression is also than least squares because it imposes a penalty on the size of the coefficients. This shrinks the coefficients toward zero, which reduces variance and increases bias. Therefore, ridge regression improves prediction accuracy when the .

\[ \boxed{\text{(b) iii}} \]

The correct answer is .

Non-linear methods are generally than least squares. Greater flexibility tends to reduce bias, since the model can adapt more closely to the true relationship, but it tends to increase variance. Therefore, a non-linear method improves prediction accuracy when the .

\[ \boxed{\text{(c) ii}} \]

We estimate the regression coefficients by minimizing the residual sum of squares subject to

\[ \sum_{j=1}^{p} |\beta_j| \le s. \]

This is the lasso formulation. The parameter \(s\) controls how restrictive the constraint is. When \(s=0\), all slope coefficients must be zero, so the model is extremely inflexible. As \(s\) increases, the constraint loosens, the model becomes more flexible, and eventually approaches least squares.

The correct answer is .

When \(s\) increases, the feasible set for the optimization problem gets larger . Since the old solution is still available and new solutions are allowed, the minimum possible training RSS cannot increase. It will either stay the same or decrease, and in general it decreases steadily.

\[ \boxed{\text{(a) iv}} \]

The correct answer is .

When \(s\) is very small, the model is too restricted, so it has high bias and tends to underfit. Increasing \(s\) first improves the fit and lowers test RSS. But if \(s\) becomes too large, the model becomes too flexible, variance increases, and test RSS eventually rises.

\[ \boxed{\text{(b) ii}} \]

The correct answer is .

A more flexible model is more sensitive to the training data, so its variance increases as \(s\) increases.

\[ \boxed{\text{(c) iii}} \]

The correct answer is .

When \(s\) is small, the model is heavily constrained and may miss important structure, so bias is large. As \(s\) increases, the model becomes more flexible and can better approximate the true relationship, so squared bias decreases.

\[ \boxed{\text{(d) iv}} \]

The correct answer is .

The irreducible error is the noise in the data generating process . It does not depend on the model or on the tuning parameter \(s\).

\[ \boxed{\text{(e) v}} \]

We now explore equations (6.12) and (6.13) in the case \(p=1\).

For part (a), with \(p=1\), the ridge criterion is

\[ (y_1-\beta_1)^2+\lambda \beta_1^2. \]

For part (b), with \(p=1\), the lasso criterion is

\[ (y_1-\beta_1)^2+\lambda |\beta_1|. \]

We choose \(y_1=2\) and \(\lambda=1\).

For ridge regression , differentiating the objective with respect to \(\beta_1\) gives

\[ -2(y_1-\beta_1)+2\lambda \beta_1=0. \]

So

\[ -2y_1+2\beta_1+2\lambda \beta_1=0 \]

and hence

\[ \beta_1(1+\lambda)=y_1. \]

Therefore the minimizer is

\[ \hat{\beta}_1=\frac{y_1}{1+\lambda}. \]

With \(y_1=2\) and \(\lambda=1\),

\[ \hat{\beta}_1=\frac{2}{2}=1. \]

y1 <- 2
lambda <- 1
beta_grid <- seq(-2, 4, length.out = 1000)
ridge_obj <- (y1 - beta_grid)^2 + lambda * beta_grid^2
ridge_hat <- y1 / (1 + lambda)

plot(beta_grid, ridge_obj, type = "l", lwd = 2,
     xlab = expression(beta[1]),
     ylab = "Objective Function",
     main = "Chapter 6, Problem 6(a): Ridge Criterion")
abline(v = ridge_hat, lty = 2, lwd = 2)
points(ridge_hat, min(ridge_obj), pch = 19)

ridge_hat
## [1] 1

The plot shows that the minimum occurs at \(\beta_1 = 1\), which confirms equation (6.14).

For the lasso, the minimizer is obtained by soft-thresholding:

\[ \hat{\beta}_1= \begin{cases} y_1-\lambda/2, & y_1>\lambda/2,\\[6pt] y_1+\lambda/2, & y_1<-\lambda/2,\\[6pt] 0, & |y_1|\le \lambda/2. \end{cases} \]

Since \(y_1=2\) and \(\lambda=1\), we have \(\lambda/2=0.5\), so

\[ \hat{\beta}_1=2-0.5=1.5. \]

lasso_obj <- (y1 - beta_grid)^2 + lambda * abs(beta_grid)
lasso_hat <- sign(y1) * pmax(abs(y1) - lambda/2, 0)

plot(beta_grid, lasso_obj, type = "l", lwd = 2,
     xlab = expression(beta[1]),
     ylab = "Objective Function",
     main = "Chapter 6, Problem 6(b): Lasso Criterion")
abline(v = lasso_hat, lty = 2, lwd = 2)
points(lasso_hat, min(lasso_obj), pch = 19)

lasso_hat
## [1] 1.5

The plot shows that the minimum occurs at \(\beta_1 = 1.5\), which confirms equation (6.15).

Now we fit a lasso model to simulated data generated from

\[ Y=\beta_0+\beta_1X+\beta_2X^2+\beta_3X^3+\epsilon. \]

We choose

\[ \beta_0=1,\qquad \beta_1=2,\qquad \beta_2=3,\qquad \beta_3=4. \]

set.seed(474)
n <- 100
x <- rnorm(n)
eps <- rnorm(n)
beta0 <- 1
beta1 <- 2
beta2 <- 3
beta3 <- 4
y <- beta0 + beta1 * x + beta2 * x^2 + beta3 * x^3 + eps

dat <- data.frame(
  y = y,
  x1 = x,
  x2 = x^2,
  x3 = x^3,
  x4 = x^4,
  x5 = x^5,
  x6 = x^6,
  x7 = x^7,
  x8 = x^8,
  x9 = x^9,
  x10 = x^10
)

xmat <- model.matrix(y ~ ., data = dat)[, -1]
cv_lasso_8e <- cv.glmnet(xmat, y, alpha = 1)
best_lambda_8e <- cv_lasso_8e$lambda.min
coef_8e <- as.matrix(coef(cv_lasso_8e, s = "lambda.min"))
coef_8e_df <- data.frame(term = rownames(coef_8e), coefficient = as.numeric(coef_8e))
coef_8e_df
plot(cv_lasso_8e)

best_lambda_8e
## [1] 0.08682629

The optimal value chosen by cross-validation is

\[ \lambda_{\min} = 0.08683. \]

The resulting coefficient estimates are shown in the table above.

From these results, the lasso should place most of its weight on the lower-order polynomial terms, especially \(X\), \(X^2\), and \(X^3\), since those are the variables that generated the response. Any nonzero coefficients on higher-order powers are due to random noise and correlation among the polynomial predictors. Because the lasso performs shrinkage and variable selection at the same time, the fitted model is usually sparse and tends to recover the main structure of the true model.

Now we generate the response according to

\[ Y=\beta_0+\beta_7X^7+\epsilon. \]

We choose

\[ \beta_0=1,\qquad \beta_7=5. \]

Then we perform best subset selection and the lasso using \(X, X^2,\dots, X^{10}\) as predictors.

set.seed(474)
x_f <- rnorm(n)
eps_f <- rnorm(n)
y_f <- 1 + 5 * x_f^7 + eps_f

dat_f <- data.frame(
  y = y_f,
  x1 = x_f,
  x2 = x_f^2,
  x3 = x_f^3,
  x4 = x_f^4,
  x5 = x_f^5,
  x6 = x_f^6,
  x7 = x_f^7,
  x8 = x_f^8,
  x9 = x_f^9,
  x10 = x_f^10
)

regfit_f <- regsubsets(y ~ ., data = dat_f, nvmax = 10)
regsum_f <- summary(regfit_f)

cp_idx_f <- which.min(regsum_f$cp)
bic_idx_f <- which.min(regsum_f$bic)
adjr2_idx_f <- which.max(regsum_f$adjr2)

coef_cp_f <- coef(regfit_f, cp_idx_f)
coef_bic_f <- coef(regfit_f, bic_idx_f)
coef_adjr2_f <- coef(regfit_f, adjr2_idx_f)

cp_idx_f
## [1] 5
bic_idx_f
## [1] 1
adjr2_idx_f
## [1] 8
coef_cp_f
## (Intercept)          x4          x6          x7          x8         x10 
##   0.6101432   1.9405376  -1.6100880   4.9972412   0.4370872  -0.0384794
coef_bic_f
## (Intercept)          x7 
##   0.8908059   4.9990732
coef_adjr2_f
## (Intercept)          x1          x3          x4          x5          x6 
##  0.67687736  1.02639213 -3.80098804  0.92036629  3.42694361 -0.50338668 
##          x7          x8          x9 
##  3.92331956  0.06770238  0.10804886
par(mfrow = c(1,3))
plot(regsum_f$cp, xlab = "Model Size", ylab = "Cp", type = "b")
points(cp_idx_f, regsum_f$cp[cp_idx_f], col = 2, pch = 19, cex = 1.5)
plot(regsum_f$bic, xlab = "Model Size", ylab = "BIC", type = "b")
points(bic_idx_f, regsum_f$bic[bic_idx_f], col = 2, pch = 19, cex = 1.5)
plot(regsum_f$adjr2, xlab = "Model Size", ylab = "Adjusted R^2", type = "b")
points(adjr2_idx_f, regsum_f$adjr2[adjr2_idx_f], col = 2, pch = 19, cex = 1.5)

par(mfrow = c(1,1))
xmat_f <- model.matrix(y ~ ., data = dat_f)[, -1]
cv_lasso_8f <- cv.glmnet(xmat_f, y_f, alpha = 1)
best_lambda_8f <- cv_lasso_8f$lambda.min
coef_8f <- as.matrix(coef(cv_lasso_8f, s = "lambda.min"))
coef_8f_df <- data.frame(term = rownames(coef_8f), coefficient = as.numeric(coef_8f))
coef_8f_df
plot(cv_lasso_8f)

best_lambda_8f
## [1] 6.997927

For best subset selection, the preferred model sizes are:

\[ \text{Cp chooses size } 5, \qquad \text{BIC chooses size } 1, \qquad \text{Adjusted } R^2 \text{ chooses size } 8. \]

Because the true model only contains the seventh power, a good procedure should identify \(X^7\) as an important predictor . In practice, however, polynomial terms are highly correlated with one another, so both best subset selection and the lasso may sometimes include nearby powers such as \(X^6\) or \(X^8\) as well. Even so , both methods should reveal that the higher-order structure is concentrated around the seventh-degree term.

We will predict the number of applications received using the other variables in the data set.

data(College)
set.seed(474)
train_idx <- sample(1:nrow(College), nrow(College) / 2)
College_train <- College[train_idx, ]
College_test <- College[-train_idx, ]
y_train <- College_train$Apps
y_test <- College_test$Apps

We used a random split with half of the observations in the training set and half in the test set.

Training sample size:

\[ n_{\text{train}} = 388. \]

Test sample size:

\[ n_{\text{test}} = 389. \]

lm_fit <- lm(Apps ~ ., data = College_train)
lm_pred <- predict(lm_fit, newdata = College_test)
lm_mse <- mean((y_test - lm_pred)^2)
lm_mse
## [1] 1401891

The test MSE for least squares is

\[ 1401891.1849. \]

x_train <- model.matrix(Apps ~ ., data = College_train)[, -1]
x_test <- model.matrix(Apps ~ ., data = College_test)[, -1]

set.seed(474)
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_pred <- predict(cv_ridge, s = "lambda.min", newx = x_test)
ridge_mse <- mean((y_test - ridge_pred)^2)
ridge_lambda <- cv_ridge$lambda.min
ridge_mse
## [1] 1288651
ridge_lambda
## [1] 396.5246

The cross-validated choice of \(\lambda\) is

\[ 396.5, \]

and the test MSE for ridge regression is

\[ 1288650.7921. \]

set.seed(474)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_pred <- predict(cv_lasso, s = "lambda.min", newx = x_test)
lasso_mse <- mean((y_test - lasso_pred)^2)
lasso_coef <- coef(cv_lasso, s = "lambda.min")
lasso_nonzero <- sum(lasso_coef[-1, 1] != 0)
lasso_mse
## [1] 1395498
lasso_nonzero
## [1] 17

The test MSE for the lasso is

\[ 1395497.6933, \]

and the number of non-zero coefficient estimates is

\[ 17. \]

set.seed(474)
pcr_fit <- pcr(Apps ~ ., data = College_train, scale = TRUE, validation = "CV")
pcr_cv <- RMSEP(pcr_fit, estimate = "CV")
pcr_msep_vals <- drop(pcr_cv$val[1, 1, -1])
pcr_M <- which.min(pcr_msep_vals)

pcr_pred <- predict(pcr_fit, newdata = College_test, ncomp = pcr_M)
pcr_mse <- mean((y_test - pcr_pred)^2)
pcr_M
## 17 comps 
##       17
pcr_mse
## [1] 1401891

The number of components selected by cross-validation is

\[ M = 17, \]

and the test MSE for PCR is

\[ 1401891.1849. \]

set.seed(474)
pls_fit <- plsr(Apps ~ ., data = College_train, scale = TRUE, validation = "CV")
pls_cv <- RMSEP(pls_fit, estimate = "CV")
pls_msep_vals <- drop(pls_cv$val[1, 1, -1])
pls_M <- which.min(pls_msep_vals)

pls_pred <- predict(pls_fit, newdata = College_test, ncomp = pls_M)
pls_mse <- mean((y_test - pls_pred)^2)
pls_M
## 13 comps 
##       13
pls_mse
## [1] 1397891

The number of components selected by cross-validation is

\[ M = 13, \]

and the test MSE for PLS is

\[ 1397890.8486. \]

The five test errors are:

\[ \begin{aligned} \text{Least Squares} &:\ 1401891.1849\\ \text{Ridge} &:\ 1288650.7921\\ \text{Lasso} &:\ 1395497.6933\\ \text{PCR} &:\ 1401891.1849\\ \text{PLS} &:\ 1397890.8486 \end{aligned} \]

These results show how accurately we can predict the number of applications using the available predictors. The best method is the one with the smallest test MSE. In many runs of this problem, the test errors are fairly similar, which suggests that several of these methods perform comparably well on the data. If one method has a clearly smaller test MSE , then that method provides the best predictive performance for this split of the data.

We now predict per capita crime rate in the data set.

data("Boston", package = "ISLR2")
set.seed(474)
train_idx_b <- sample(1:nrow(Boston), nrow(Boston) / 2)
Boston_train <- Boston[train_idx_b, ]
Boston_test <- Boston[-train_idx_b, ]
y_train_b <- Boston_train$crim
y_test_b <- Boston_test$crim
x_train_b <- model.matrix(crim ~ ., data = Boston_train)[, -1]
x_test_b <- model.matrix(crim ~ ., data = Boston_test)[, -1]
regfit_b <- regsubsets(crim ~ ., data = Boston_train, nvmax = 13)

best_subset_pred <- function(object, newdata, id){
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  vars <- names(coefi)
  as.vector(mat[, vars, drop = FALSE] %*% coefi)
}

regsum_b <- summary(regfit_b)
n_models_b <- nrow(regsum_b$which)

test_mse_best_subset <- rep(NA, n_models_b)

for(i in seq_len(n_models_b)){
  pred_i <- best_subset_pred(regfit_b, Boston_test, i)
  test_mse_best_subset[i] <- mean((y_test_b - pred_i)^2)
}

best_size_b <- which.min(test_mse_best_subset)
best_subset_mse <- min(test_mse_best_subset)
best_subset_coef <- coef(regfit_b, id = best_size_b)

best_size_b
## [1] 8
best_subset_mse
## [1] 46.67471
best_subset_coef
##  (Intercept)           zn        indus          nox          dis          rad 
##  16.59376525   0.05050324  -0.07237439 -11.46631697  -1.01816462   0.56943294 
##      ptratio        lstat         medv 
##  -0.28442559   0.13351813  -0.19555533
plot(seq_along(test_mse_best_subset), test_mse_best_subset, type = "b",
     xlab = "Model Size", ylab = "Test MSE",
     main = "Boston Crime: Best Subset Selection")
points(best_size_b, best_subset_mse, col = 2, pch = 19, cex = 1.5)

set.seed(474)
cv_ridge_b <- cv.glmnet(x_train_b, y_train_b, alpha = 0)
ridge_pred_b <- predict(cv_ridge_b, s = "lambda.min", newx = x_test_b)
ridge_mse_b <- mean((y_test_b - ridge_pred_b)^2)
ridge_lambda_b <- cv_ridge_b$lambda.min
ridge_mse_b
## [1] 47.69083
ridge_lambda_b
## [1] 0.5540709
set.seed(474)
cv_lasso_b <- cv.glmnet(x_train_b, y_train_b, alpha = 1)
lasso_pred_b <- predict(cv_lasso_b, s = "lambda.min", newx = x_test_b)
lasso_mse_b <- mean((y_test_b - lasso_pred_b)^2)
lasso_coef_b <- coef(cv_lasso_b, s = "lambda.min")
lasso_nonzero_b <- sum(lasso_coef_b[-1, 1] != 0)
lasso_mse_b
## [1] 47.05996
lasso_nonzero_b
## [1] 11
lasso_coef_b
## 13 x 1 sparse Matrix of class "dgCMatrix"
##               lambda.min
## (Intercept) 14.652626564
## zn           0.044447328
## indus       -0.048478145
## chas        -0.475255004
## nox         -8.671004022
## rm          -0.401110562
## age          0.001953031
## dis         -0.827696667
## rad          0.552904253
## tax          .          
## ptratio     -0.215873218
## lstat        0.119538635
## medv        -0.152124089
set.seed(474)
pcr_fit_b <- pcr(crim ~ ., data = Boston_train, scale = TRUE, validation = "CV")
pcr_cv_b <- RMSEP(pcr_fit_b, estimate = "CV")
pcr_msep_b <- drop(pcr_cv_b$val[1, 1, -1])
pcr_M_b <- which.min(pcr_msep_b)
pcr_pred_b <- predict(pcr_fit_b, newdata = Boston_test, ncomp = pcr_M_b)
pcr_mse_b <- mean((y_test_b - pcr_pred_b)^2)
pcr_M_b
## 12 comps 
##       12
pcr_mse_b
## [1] 47.09263

The test MSE values are

\[ \begin{aligned} \text{Best Subset} &:\ 46.6747\\ \text{Ridge} &:\ 47.6908\\ \text{Lasso} &:\ 47.06\\ \text{PCR} &:\ 47.0926 \end{aligned} \]

A reasonable choice is the method with the smallest validation-set test MSE. For this split of the data, that method is the one that performs best out of the four considered above. This is the most appropriate choice because the comparison is based on out-of-sample prediction error rather than training error.

\subsection*{(c)}

If the chosen model is best subset selection, then it usually does include all features, because the method selects only the variables that most improve test-set prediction. If the chosen model is the lasso, then it also usually excludes some features because the \(\ell_1\) penalty can set coefficients exactly to zero. Ridge regression keeps all variables, but shrinks them toward zero. PCR replaces the original variables with a smaller number of components, so it does not use the original features in the same direct way.

For this split, the best subset model size is

\[ 8, \]

and the lasso retains

\[ 11 \]

non-zero predictors. Therefore, depending on which method performs best, the final chosen model may or may not involve all of the original features. In many cases, a smaller model performs just as well or better because removing weak predictors can reduce variance.

Below is one example of a partition of a two-dimensional predictor space that could result from recursive binary splitting. It has six regions.

This partition can be described as follows:

\[ \begin{aligned} R_1 &= \{X_1 < t_1,\ X_2 \ge t_2\},\\ R_2 &= \{X_1 < t_1,\ t_3 \le X_2 < t_2\},\\ R_3 &= \{X_1 < t_1,\ X_2 < t_3\},\\ R_4 &= \{X_1 \ge t_1,\ X_2 < t_4\},\\ R_5 &= \{t_1 \le X_1 < t_5,\ X_2 \ge t_4\},\\ R_6 &= \{X_1 \ge t_5,\ X_2 \ge t_4\}. \end{aligned} \]

This is a valid example because every split is axis parallel and every region is obtained through recursive binary splitting.

We must plot the Gini index, classification error, and entropy as functions of \(\hat{p}_{m1}\) in a two-class setting.

In the two-class case, if \(\hat{p}_{m1}=p\), then \(\hat{p}_{m2}=1-p\).

Therefore,

\[ \text{Gini}(p)=2p(1-p), \]

\[ \text{Classification Error}(p)=1-\max(p,1-p), \]

and

\[ \text{Entropy}(p)=-p\log(p)-(1-p)\log(1-p). \]

p <- seq(0.001, 0.999, length.out = 1000)
gini <- 2 * p * (1 - p)
class_error <- 1 - pmax(p, 1 - p)
entropy <- -p * log(p) - (1 - p) * log(1 - p)

plot(p, gini, type = "l", lwd = 2, ylim = range(c(gini, class_error, entropy)),
     xlab = expression(hat(p)[m1]), ylab = "Value",
     main = "Gini, Classification Error, and Entropy")
lines(p, class_error, lwd = 2, lty = 2)
lines(p, entropy, lwd = 2, lty = 3)
legend("top", legend = c("Gini", "Classification Error", "Entropy"),
       lty = c(1,2,3), lwd = 2, bty = "n")

All three measures are smallest near \(0\) and \(1\), where the node is very pure, and largest near \(0.5\), where the classes are most mixed. Entropy and the Gini index are smoother and more sensitive to changes in node purity than classification error.

This question is based on Figure 8.14.

From the partition, the first split is at \(X_1=1\) because the entire predictor space is divided into a left portion and a right portion. The right-hand region is terminal with mean \(5\).

Inside the left portion, there is a split at \(X_2=1\). The upper region is terminal with mean \(15\). The lower-left region is then split at \(X_1=0\), creating a left terminal node with mean \(3\) and a right subregion. That right subregion is split at \(X_2=0\), creating terminal nodes with means \(10\) and \(0\).

One correct tree is:

The tree in the right-hand panel can be read as follows:

So the corresponding partition is:

Equivalently, the five regions are

\[ \begin{aligned} R_1 &= \{X_2 < 1,\ X_1 < 1\}, && \hat{y}=-1.80,\\ R_2 &= \{X_2 < 1,\ X_1 \ge 1\}, && \hat{y}=0.63,\\ R_3 &= \{1 \le X_2 < 2,\ X_1 < 0\}, && \hat{y}=-1.06,\\ R_4 &= \{1 \le X_2 < 2,\ X_1 \ge 0\}, && \hat{y}=0.21,\\ R_5 &= \{X_2 \ge 2\}, && \hat{y}=2.49. \end{aligned} \]