NOTE: There are no official solutions for these questions. These are my solutions and could be incorrect. If you spot any mistakes/inconsistencies, please contact me on Liam95morgan@gmail.com, or via LinkedIn.

Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

library(tidyverse)
library(ISLR)
library(leaps) # regsubsets() - best subset selection
library(glmnet) # ridge & lasso
library(caret)
library(pls) # PCR
library(MASS) # Boston data

# MASS/dplyr conflict:
select <- dplyr::select

# caret::train with RSS & MSE added:
custom_regression_metrics <- function (data, lev = NULL, model = NULL) {
  c(RMSE = sqrt(mean((data$obs-data$pred)^2)),
    Rsquared = summary(lm(pred ~ obs, data))$r.squared,
    MAE = mean(abs(data$obs-data$pred)), 
    MSE = mean((data$obs-data$pred)^2),
    RSS = sum((data$obs-data$pred)^2))
}

theme_set(theme_light())




1. Best Subset, Forward Stepwise & Backward Stepwise Selection

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain \(p + 1\) models, containing \(0, 1, 2, \dots, p\) predictors. Explain your answers:


(a) Training RSS

Q: Which of the three models with \(k\) predictors has the smallest training RSS?

A:

The model chosen by best subset selection will have the smallest training RSS, although it is also possible that forward & backward stepwise selection arrive at the same model (and therefore the same training RSS).

This is because best subset selection fits all \(2^p\) possible models using the \(p\) predictors: whichever \(k\)-predictor subset forward/backward stepwise selection arrives at is also among the models considered by best subset selection, so the best subset model with \(k\) predictors must have a training RSS at least as small.


(b) Test RSS

Q: Which of the three models with \(k\) predictors has the smallest test RSS?

A:

We don’t know, although best subset selection is more likely to perform better. The selection process for all three algorithms will usually be based on a combination of minimizing the training RSS (to choose between models of the same size), followed by using cross-validated prediction error or some sort of penalized statistic (e.g. AIC, BIC, Adjusted \(R^2\)) to select the final model from the \(p + 1\) candidate models. Notice that this is all based on training performance, so there are no guarantees with test set performance.

Having said that, it is entirely possible for forward/backward stepwise selection to out-perform best subset selection on the test data due to chance, but it is unlikely because best subset considers far more models. In the case of \(p = 20\), best subset will select from \(2^p = 2^{20} = 1,048,576\) models, whereas the stepwise methods will select from \(1 + \sum_{k=0}^{p-1}(p - k) = 1 + \frac{p(p + 1)}{2} = 1 + \frac{20(20 + 1)}{2} = 211\) models.
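These counts are easy to sanity-check in R (a trivial arithmetic sketch; p is just the number of predictors):

p <- 20
2^p                   # 1,048,576 models considered by best subset selection
1 + p * (p + 1) / 2   # 211 models considered by forward or backward stepwise selection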


(c) True or False

True or False:


i.

Q: The predictors in the \(k\)-variable model identified by forward stepwise are a subset of the predictors in the \((k+1)\)-variable model identified by forward stepwise selection

A:

TRUE. The \((k+1)\)-variable model will be identical to the \(k\)-variable model, but with one additional predictor.


ii.

Q: The predictors in the \(k\)-variable model identified by backward stepwise are a subset of the predictors in the \((k+1)\)-variable model identified by backward stepwise selection

A:

TRUE. The \(k\)-variable model will be identical to the \((k+1)\)-variable model, but with one predictor removed.


iii.

Q: The predictors in the \(k\)-variable model identified by backward stepwise are a subset of the predictors in the \((k+1)\)-variable model identified by forward stepwise selection

A:

FALSE. Forwards and backwards stepwise selection have different starting points (the null model and the full model) and will take different selection paths. The statement could hold true for specific examples, but it is not generally true.


iv.

Q: The predictors in the \(k\)-variable model identified by forward stepwise are a subset of the predictors in the \((k+1)\)-variable model identified by backward stepwise selection

A:

FALSE (see part iii. as the same applies here).


v.

Q: The predictors in the \(k\)-variable model identified by best subset are a subset of the predictors in the \((k+1)\)-variable model identified by best subset selection

A:

FALSE. There is no guarantee that the best variable subset of size \((k+1)\) will simply be the best variable subset of size \(k\) with one additional predictor. If this were the case, we could simply do forward selection and reduce the number of models we test significantly.




2. The Lasso, Ridge Regression, Least Squares & Non-Linear Methods

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.


(a) The Lasso vs Least Squares

Q: The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

A:

Part iii. - less flexible (for \(\lambda > 0\)), giving increased prediction accuracy if the increase in bias is outweighed by the decrease in variance.

This is because the lasso selects the \(\hat{\beta}\) that minimizes \(RSS + \lambda \sum_{j = 1}^p |\beta_j|\), rather than just the \(RSS\) as in least squares.

Since the shrinkage penalty \(\lambda \sum_{j = 1}^p |\beta_j|\) is small when \(\beta_1, \beta_2, \dots, \beta_p\) are close to zero, it tends to shrink the estimates towards zero (for a given \(\lambda > 0\), the optimal lasso \(\hat{\beta}\) will be closer to zero than the least squares \(\hat{\beta}\)). For larger \(\lambda\), the shrinkage term's importance is higher relative to the \(RSS\), so the shrinkage increases.

This shrinkage is what reduces the variance of the predictions, at the cost of a small increase in bias. This trade-off is usually worth it.


(b) Ridge Regression vs Least Squares

Q: Repeat (a) for ridge regression relative to least squares.

A:

Part iii. for the same reasons as part (a). The only real difference here is in the ridge objective function \(RSS + \lambda \sum_{j = 1}^p \beta_j^2\), where the shrinkage term is slightly different to that of the lasso.

This just means that ridge regression won’t shrink coefficients of less-useful variables to exactly zero (the lasso can do this), but the rest of the argument (shrinkage reducing the variance, at the cost of an increase in bias) still applies.
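To make this concrete, here is a minimal sketch (simulated data and an arbitrary \(\lambda\), so the exact numbers are purely illustrative) contrasting the two penalties with glmnet - the lasso fit will typically set some coefficients to exactly zero, while the ridge fit only shrinks them towards zero:

set.seed(100)
x_sim <- matrix(rnorm(100 * 10), ncol = 10) # 10 predictors, only 2 truly non-zero
y_sim <- x_sim[, 1] - 2 * x_sim[, 2] + rnorm(100)

coef(glmnet(x_sim, y_sim, alpha = 1, lambda = 0.5)) # lasso: expect several exact zeros
coef(glmnet(x_sim, y_sim, alpha = 0, lambda = 0.5)) # ridge: all shrunk, none exactly zero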


(c) Non-Linear Methods vs Least Squares

Q: Repeat (a) for non-linear methods relative to least squares.

A:

Part ii. - more flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

If we have the relationship \(Y = f(X) + \epsilon\), a non-linear method is more flexible because it makes fewer assumptions about the functional form of \(f\) (it does not assume linearity). In the case where \(f\) is better approximated by non-linear relationships between the predictors and the response, this leads to a decrease in bias that outweighs any increase in variance, and hence a higher prediction accuracy.




3. The Lasso (as \(s\) increases)

Suppose we estimate the regression coefficients in a linear regression model by minimizing:

\[\sum_{i = 1}^{n} \left( y_i - \beta_0 - \sum_{j = 1}^{p} \beta_j x_{ij} \right)^2 \quad \text{subject to} \,\, \sum_{j = 1}^{p} |\beta_j| \le s\] for a particular value of \(s\). For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.


(a) Training RSS

Q: As we increase \(s\) from 0, the training RSS will:

i. Increase initially, and then eventually start decreasing in an inverted U shape.

ii. Decrease initially, and then eventually start increasing in a U shape.

iii. Steadily increase.

iv. Steadily decrease.

v. Remain constant.

A:

Part iv. - steadily decrease.

Minimizing the RSS (subject to the constraint that \(\sum_{j = 1}^{p} |\beta_j| \le s\)) is just another formulation for how the lasso parameters are selected.

Once \(s\) is sufficiently large, the least squares solution will satisfy the constraint. In this situation, the \(\beta\) that minimizes \(RSS = \sum_{i = 1}^{n} \left( y_i - \beta_0 - \sum_{j = 1}^{p} \beta_j x_{ij} \right)^2\) and also satisfies this constraint will always be the least squares solution. Up until that point, the training RSS will monotonically decrease.
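As a small illustration (a sketch on arbitrary simulated data), we can trace the training RSS along a lasso path fit with glmnet; the lambda sequence is stored in decreasing order, so moving along it corresponds to \(s\) increasing, and the training RSS should only ever fall:

set.seed(100)
x_sim <- matrix(rnorm(100 * 5), ncol = 5)
y_sim <- as.numeric(x_sim %*% c(2, -1, 0, 0, 1) + rnorm(100))

fit <- glmnet(x_sim, y_sim, alpha = 1)
train_rss <- colSums((y_sim - predict(fit, newx = x_sim))^2) # RSS at each lambda on the path

head(round(train_rss, 1)) # largest lambdas (smallest s): highest training RSS
tail(round(train_rss, 1)) # smallest lambdas (largest s): lowest training RSS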


(b) Test RSS

Q: Repeat (a) for test RSS.

A:

Part ii. - decrease initially, and then eventually start increasing in a U shape.

When \(s\) = 0, the only \(\hat{\beta}\) that will satisfy \(\sum_{j = 1}^{p} |\beta_j| \le s\) will be a vector of zeros, so here we will simply have the null model (\(\hat{y} = \bar{y}\)). As \(s\) increases and this constraint loosens, the flexibility of the model will increase. The test RSS will therefore decrease, up to the point where it will begin to overfit (at which point, it will start increasing again).


(c) Variance

Q: Repeat (a) for variance.

A:

Part iii. - steadily increase.

This is because the constraint region growing (\(s\) increasing from zero) corresponds to \(\lambda\) decreasing (a reduction in shrinkage), so model flexibility increases and the variance increases with it. Once \(s\) is large enough that the least squares \(\hat{\beta}\) falls within the constraint region, the variance will no longer increase, because the \(\hat{\beta}\) chosen will always be the least squares estimate.


(d) Squared Bias

Q: Repeat (a) for (squared) bias.

A:

Part iv. - steadily decrease.

Same reasoning as part (c) - increasing the flexibility will decrease the bias. Again, this will stop reducing if the least squares solution falls within the constraint region.


(e) Irreducible Error

Q: Repeat (a) for the irreducible error.

A:

Part v. - remain constant.

The irreducible error is the error introduced by inherent uncertainty/noise in the system being approximated. It remains constant regardless of model flexibility: there may be unmeasured variables not in \(X\) that would be required to explain it, or unmeasurable variation in \(y\) that cannot be predicted from the variables in \(X\), no matter how well-specified the model is. In other words, it is completely independent of \(s\).




4. Ridge Regression (as \(\lambda\) increases)

Suppose we estimate the regression coefficients in a linear regression model by minimizing:

\[\sum_{i = 1}^{n} \left( y_i - \beta_0 - \sum_{j = 1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j = 1}^{p} \beta_j^2\] for a particular value of \(\lambda\). For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.


(a) Training RSS

Q: As we increase \(\lambda\) from 0, the training RSS will:

i. Increase initially, and then eventually start decreasing in an inverted U shape.

ii. Decrease initially, and then eventually start increasing in a U shape.

iii. Steadily increase.

iv. Steadily decrease.

v. Remain constant.

A:

Part iii. - steadily increase.

When \(\lambda\) = 0, the ridge regression \(\hat{\beta}\) will be the same as the least squares estimate for \(\beta\) (since the shrinkage term is removed), which will already minimize the training RSS. As \(\lambda\) increases, this training RSS can only increase, and will do so as the shrinkage increases.


(b) Test RSS

Q: Repeat (a) for test RSS.

A:

Part ii. - decrease initially, and then eventually start increasing in a U shape.

As \(\lambda\) (shrinkage) increases, the hope is that the reduction in variance will outweigh the cost of shrinking the \(\hat{\beta}\)’s towards zero. This will generally mean that the test RSS decreases, up to the point where increased shrinkage simply results in the model underfitting (where the increased bias outweighs the decreased variance), at which point the test RSS will start increasing again.


(c) Variance

Q: Repeat (a) for variance.

A:

Part iv. - steadily decrease.

Increasing \(\lambda\) decreases the flexibility because the \(\beta\)’s are shrunk towards zero, reducing the variance. We can continue increasing the shrinkage arbitrarily, reducing the variance until it is arbitrarily close to zero (as the \(\hat{\beta}\)’s approach zero, the model approaches the null model and our predictions approach zero variance).


(d) Squared Bias

Q: Repeat (a) for (squared) bias.

A:

Part iii. - steadily increase.

Same reasoning as part (c) - increasing \(\lambda\) (decreasing the flexibility) will increase the bias as the \(\beta\)’s are shrunk towards zero.


(e) Irreducible Error

Q: Repeat (a) for the irreducible error.

A:

Part v. - remain constant.

Same reasoning as in question 3 - the irreducible error is model-independent.




5. Ridge vs Lasso - Correlated Predictors

It is well-known that ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso may give quite different coefficient values to correlated variables. We will now explore this property in a very simple setting.

Suppose that \(n = 2\), \(p = 2\), \(x_{11} = x_{12}\), \(x_{21} = x_{22}\). Furthermore, suppose that \(y_1 + y_2 = 0\) and \(x_{11} + x_{21} = 0\) and \(x_{12} + x_{22} = 0\), so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: \(\hat{\beta_0} = 0\).


(a) Ridge - The Optimization Problem

Q: Write out the ridge regression optimization problem in this setting.

A:

We have \(X = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{bmatrix} = \begin{bmatrix} x_{11} & x_{11} \\ x_{22} & x_{22} \end{bmatrix}\), so the \(p = 2\) predictors (\(x_1\) & \(x_2\)) are perfectly correlated.

We know that the ridge coefficient estimates \(\hat{\beta}_{\lambda}^R\) are the values that minimize:

\[\sum_{i = 1}^{n} \left( y_i - \beta_0 - \sum_{j = 1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j = 1}^{p} \beta_j^2\]

Plugging in the specific example, this means for each value of \(\lambda\), ridge optimization will select \(\hat{\beta}_{\lambda}^R = \begin{pmatrix} \hat{\beta_1} \\ \hat{\beta_2} \end{pmatrix}\) that minimizes the quantity:

\[\sum_{i = 1}^{2} \left( y_i - \beta_0 - \sum_{j = 1}^{2} \beta_j x_{ij} \right)^2 + \lambda \sum_{j = 1}^{2} \beta_j^2 \\ = ( y_1 - \beta_0 - \beta_1 x_{11} - \beta_2 x_{12})^2 + ( y_2 - \beta_0 - \beta_1 x_{21} - \beta_2 x_{22})^2 + \lambda (\beta_1^2 + \beta_2^2)\]


(b) Ridge - Argue Equal Coefficient Estimates

Q: Argue that in this setting, the ridge coefficient estimates satisfy \(\hat{\beta_1} =\hat{\beta_2}\).

A:

Let \(f(\hat{\beta_1}, \hat{\beta_2}) = ( y_1 - \hat{\beta_0} - \hat{\beta_1} x_{11} - \hat{\beta_2} x_{12})^2 + ( y_2 - \hat{\beta_0} - \hat{\beta_1} x_{21} - \hat{\beta_2} x_{22})^2 + \lambda \left( \hat{\beta_1}^2 + \hat{\beta_2}^2 \right)\).

We therefore have:

\[\begin{align*} f(\hat{\beta_1}, \hat{\beta_2}) & = ( y_1 - \hat{\beta_0} - \hat{\beta_1} x_{11} - \hat{\beta_2} x_{12})^2 + ( y_2 - \hat{\beta_0} - \hat{\beta_1} x_{21} - \hat{\beta_2} x_{22})^2 + \lambda \left( \hat{\beta_1}^2 + \hat{\beta_2}^2 \right) \\ & = ( y_1 - \hat{\beta_1} x_{11} - \hat{\beta_2} x_{12})^2 + ( y_2 - \hat{\beta_1} x_{21} - \hat{\beta_2} x_{22})^2 + \lambda \left( \hat{\beta_1}^2 + \hat{\beta_2}^2 \right) && \text{(since } \hat{\beta_0} = 0 \text{)} \\ & = ( y_1 - \hat{\beta_1} x_{11} - \hat{\beta_2} x_{11})^2 + ( -y_1 + \hat{\beta_1} x_{11} + \hat{\beta_2} x_{11})^2 + \lambda \left( \hat{\beta_1}^2 + \hat{\beta_2}^2 \right) && \text{(since } x_{11} = x_{12} = - x_{21} = -x_{22}, \,\,\, y_2 = -y_1 \text{)} \\ & = ( y_1 - \hat{\beta_1} x_{11} - \hat{\beta_2} x_{11})^2 + (-1)^2(y_1 - \hat{\beta_1} x_{11} - \hat{\beta_2} x_{11})^2 + \lambda \left( \hat{\beta_1}^2 + \hat{\beta_2}^2 \right) \\ & = 2( y_1 - \hat{\beta_1} x_{11} - \hat{\beta_2} x_{11})^2 + \lambda \left( \hat{\beta_1}^2 + \hat{\beta_2}^2 \right) \\ & = 2(y_1^2 - 2 y_1 x_{11} \hat{\beta_1} - 2 y_1 x_{11} \hat{\beta_2} + 2 x_{11}^2 \hat{\beta_1} \hat{\beta_2} + x_{11}^2 \hat{\beta_1}^2 + x_{11}^2 \hat{\beta_2}^2) + \lambda \left( \hat{\beta_1}^2 + \hat{\beta_2}^2 \right) \\ & = 2y_1^2 - 4 y_1 x_{11} \hat{\beta_1} - 4 y_1 x_{11} \hat{\beta_2} + 4 x_{11}^2 \hat{\beta_1} \hat{\beta_2} + 2 x_{11}^2 \hat{\beta_1}^2 + 2 x_{11}^2 \hat{\beta_2}^2 + \lambda \hat{\beta_1}^2 + \lambda \hat{\beta_2}^2 \\ \end{align*}\]

To find the \(\hat{\beta_1}\) and \(\hat{\beta_2}\) that minimize the above function, we partially differentiate w.r.t \(\hat{\beta_1}\) & \(\hat{\beta_2}\) and set these equal to zero:

\[\frac{\partial f(\hat{\beta_1}, \hat{\beta_2})}{\partial \hat{\beta_1}} = -4 y_1 x_{11} + 4 x_{11}^2 \hat{\beta_2} + 4 x_{11}^2 \hat{\beta_1} + 2 \lambda \hat{\beta_1} = 0 \\ \begin{align*} & \implies \hat{\beta_1}(\lambda + 2x_{11}^2) = 2 y_1 x_{11} - 2 x_{11}^2 \hat{\beta_2} \\ & \implies \hat{\beta_1} = \frac{2 y_1 x_{11} - 2 x_{11}^2 \hat{\beta_2}}{\lambda + 2x_{11}^2} \end{align*}\]

\[\frac{\partial f(\hat{\beta_1}, \hat{\beta_2})}{\partial \hat{\beta_2}} = -4 y_1 x_{11} + 4 x_{11}^2 \hat{\beta_1} + 4 x_{11}^2 \hat{\beta_2} + 2 \lambda \hat{\beta_2} = 0 \\ \begin{align*} & \implies \hat{\beta_2}(\lambda + 2x_{11}^2) = 2 y_1 x_{11} - 2 x_{11}^2 \hat{\beta_1} \\ & \implies \hat{\beta_2} = \frac{2 y_1 x_{11} - 2 x_{11}^2 \hat{\beta_1}}{\lambda + 2x_{11}^2} \end{align*}\]

Since \(y_1\), \(x_{11}\) and \(\lambda\) are just numbers, we can simplify the notation and sub in \(C = \frac{2 y_1 x_{11}}{\lambda + 2x_{11}^2}\) and \(K = \frac{- 2 x_{11}^2}{\lambda + 2x_{11}^2}\) to get the simultaneous equations:

\[\hat{\beta_1} = C + K\hat{\beta_2} \,\,\, (1)\\ \hat{\beta_2} = C + K\hat{\beta_1} \,\,\, (2)\]

Subbing \((2)\) into \((1)\):

\[\hat{\beta_1} = C + K(C + K\hat{\beta_1}) \\ \hat{\beta_1}(1 - K^2) = C(1 + K) \\ \hat{\beta_1} = \frac{C(1 + K)}{(1 - K^2)}\]

Similarly, subbing \((1)\) into \((2)\):

\[\hat{\beta_2} = C + K(C + K\hat{\beta_2}) \\ \hat{\beta_2}(1 - K^2) = C(1 + K) \\ \hat{\beta_2} = \frac{C(1 + K)}{(1 - K^2)}\]

We therefore have \(\hat{\beta_1} = \hat{\beta_2}\) as required.
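A quick numerical check of this result (a sketch only - the values of \(y_1\), \(x_{11}\) and \(\lambda\) below are arbitrary) is to minimize the simplified objective from the working above directly with optim(); both coordinates of the minimizer should come out (essentially) equal:

y1 <- 2; x11 <- 1; lambda <- 0.5

ridge_objective <- function(b) {
  2 * (y1 - b[1] * x11 - b[2] * x11)^2 + lambda * (b[1]^2 + b[2]^2)
}

optim(par = c(0, 0), fn = ridge_objective)$par # expect two (near-)identical values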


(c) Lasso - The Optimization Problem

Q: Write out the lasso optimization problem in this setting.

A:

Similar to in ridge regression, we have that the lasso coefficient estimates \(\hat{\beta}_{\lambda}^L\) are the values that minimize:

\[\sum_{i = 1}^{n} \left( y_i - \beta_0 - \sum_{j = 1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j = 1}^{p} |\beta_j|\]

In this example, for a given value of \(\lambda\), lasso optimization will select \(\hat{\beta}_{\lambda}^L = \begin{pmatrix} \hat{\beta_1} \\ \hat{\beta_2} \end{pmatrix}\) to minimize:

\[\sum_{i = 1}^{2} \left( y_i - \beta_0 - \sum_{j = 1}^{2} \beta_j x_{ij} \right)^2 + \lambda \sum_{j = 1}^{2} |\beta_j| \\ = ( y_1 - \beta_0 - \beta_1 x_{11} - \beta_2 x_{12})^2 + ( y_2 - \beta_0 - \beta_1 x_{21} - \beta_2 x_{22})^2 + \lambda (|\beta_1| + |\beta_2|) \\ = 2( y_1 - \beta_1 x_{11} - \beta_2 x_{11})^2 + \lambda (|\beta_1| + |\beta_2|) \quad \text{(same working as part (b))}\]


(d) Lasso - Argue Non-Unique Coefficient Estimates

Q: Argue that in this setting, the lasso coefficients \(\hat{\beta_1}\) and \(\hat{\beta_2}\) are not unique - in other words, there are many possible solutions to the optimization problem in (c). Describe these solutions.

A:

I use the alternative formulation for the lasso selection approach here: lasso will select the coefficients \(\hat{\beta}_{\lambda}^L\) that minimize \(\sum_{i = 1}^{n} \left( y_i - \beta_0 - \sum_{j = 1}^{p} \beta_j x_{ij} \right)^2\) subject to the constraint \(\sum_{j = 1}^{p} |\beta_j| \le s\).

Applied to this question specifically, the lasso coefficients must minimize:

\[2 \left( y_1 - (\hat{\beta_1} + \hat{\beta_2})x_{11} \right)^2 \quad \text{subject to} \,\, |\hat{\beta_1}| + |\hat{\beta_2}| \le s\]

Since this RSS is non-negative with a minimum of zero, any \((\hat{\beta_1}, \hat{\beta_2})\) satisfying \(\hat{\beta_1} + \hat{\beta_2} = \frac{y_1}{x_{11}}\) will have an RSS of zero.

However, we also have the lasso constraint to consider, as the solutions to the optimization problem will be where the contours of \(2 \left( y_1 - (\hat{\beta_1} + \hat{\beta_2})x_{11} \right)^2\) touch the lasso diamond.

As \((\hat{\beta_1}, \hat{\beta_2})\) vary along the line \(\hat{\beta_2} = -\hat{\beta_1} + \frac{y_1}{x_{11}}\), the contour will touch the lasso diamond (described by \(|\hat{\beta_1}| + |\hat{\beta_2}| \le s\)) at many points instead of one.

These solutions are highlighted in orange in the image above - the edges \(\hat{\beta_1} + \hat{\beta_2} = s\) and \(\hat{\beta_1} + \hat{\beta_2} = -s\) (the purple line is drawn closer to the top-right edge purely for visualization; its exact position is arbitrary).

This means that there are infinite solutions \(\hat{\beta}_{\lambda}^L\), described by the set below:

\[\hat{\beta}_{\lambda}^L \in \left\{ (\hat{\beta_1}, \hat{\beta_2}): \, \, \hat{\beta_1} + \hat{\beta_2} = s, \, \, \hat{\beta_1} \in [0, s], \, \, \hat{\beta_2} \in [0, s]\right\} \bigcup \left\{ (\hat{\beta_1}, \hat{\beta_2}): \, \, \hat{\beta_1} + \hat{\beta_2} = -s, \, \, \hat{\beta_1} \in [-s, 0], \, \, \hat{\beta_2} \in [-s, 0]\right\}\]
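A quick numerical illustration of this flatness (a sketch with arbitrary \(y_1\), \(x_{11}\) and \(\lambda\), using the penalized form of the objective from (c)): for non-negative \(\hat{\beta_1}, \hat{\beta_2}\) the objective depends on the coefficients only through their sum, so every pair on a given line \(\hat{\beta_1} + \hat{\beta_2} = c\) attains exactly the same value - which is why an entire edge of the lasso diamond can be optimal.

y1 <- 2; x11 <- 1; lambda <- 0.5

lasso_objective <- function(b) {
  2 * (y1 - (b[1] + b[2]) * x11)^2 + lambda * (abs(b[1]) + abs(b[2]))
}

# several non-negative pairs on the line beta_1 + beta_2 = 1.5 - all return the same value
sapply(list(c(1.5, 0), c(1, 0.5), c(0.75, 0.75), c(0, 1.5)), lasso_objective)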




6. Ridge & Lasso - Special Case

We will now explore (6.12) and (6.13) further.


(a) Ridge Regression

Q: Consider (6.12) with \(p\) = 1. For some choice of \(y_1\) and \(\lambda \gt 0\), plot (6.12) as a function of \(\beta_1\). Your plot should confirm that (6.12) is solved by (6.14).

A:

Conditions:

Firstly, the question is referring to a special case for ridge regression and the lasso. In this case, we have \(n = p\), and \(X\) is the \(n \times n\) identity matrix:

\[X = I_n = \begin{bmatrix} 1 & 0 & 0 & \dots & 0 \\ 0 & 1 & 0 & \dots & 0 \\ 0 & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}\]

There is also the assumption that we are performing regression without an intercept.

Since \(X \beta = \beta\) in this case, the least squares problem reduces to finding the \(\beta_1, \dots, \beta_p\) that minimize \(\sum_{j=1}^p (y_j - \beta_j)^2\), which decouples across \(j\) and is solved by \(\hat{\beta}_j = y_j\).


Ridge Regression:

In this setting, ridge regression amounts to finding \(\beta_1, \dots, \beta_p\) such that the following is minimized:

\[\sum_{j=1}^p (y_j - \beta_j)^2 + \lambda \sum_{j=1}^p \beta_j^2 \quad (6.12)\]

The book says that the ridge regression estimates take the form:

\[\hat{\beta}_j^R = \frac{y_j}{1 + \lambda} \quad (6.14)\]
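For completeness, \((6.14)\) follows by differentiating the \(j\)-th term of \((6.12)\) and setting the derivative to zero:

\[\frac{d}{d\beta_j} \left[ (y_j - \beta_j)^2 + \lambda \beta_j^2 \right] = -2(y_j - \beta_j) + 2 \lambda \beta_j = 0 \implies \hat{\beta}_j^R = \frac{y_j}{1 + \lambda}\]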


Answer:

For \(p = 1\), \((6.12)\) gives a quadratic in terms of \(\beta_1\):

\[(y_1 - \beta_1)^2 + \lambda \beta_1^2 \\ = y_1^2 - 2 y_1 \beta_1 + \beta_1^2 + \lambda \beta_1^2\]

For \(\lambda = 2\), \(y_1 = 3\), we get:

\[f(\beta_1) = 3 \beta_1^2 - 6 \beta_1 + 9\]

We should see that the minimum occurs at \(\hat{\beta_1} = \frac{3}{1+2} = 1\), with a minimum of \(f(1) = 3 - 6 + 9 = 6\):

beta_1 <- seq(-2, 3, 0.1)
f_beta_1 <- 3*beta_1^2 - 6*beta_1 + 9

data.frame(beta_1, 
           f_beta_1) %>%
  ggplot(aes(x = beta_1, y = f_beta_1)) + 
  geom_point() + 
  geom_line() + 
  geom_vline(xintercept = 1, col = "deepskyblue3") + 
  geom_hline(yintercept = 6, col = "deepskyblue3") + 
  labs(x = "beta_1", 
       y = "f(beta_1)")

The graph confirms this is the value of \(\beta_1\) that minimizes the ridge optimization function.
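As an additional cross-check (a small sketch using base R's optimize()), numerically minimizing the same function should agree with \((6.14)\):

# expect $minimum close to y_1 / (1 + lambda) = 1 and $objective close to 6
optimize(f = function(b) (3 - b)^2 + 2 * b^2, interval = c(-10, 10))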


(b) The Lasso

Q: Consider (6.13) with \(p\) = 1. For some choice of \(y_1\) and \(\lambda \gt 0\), plot (6.13) as a function of \(\beta_1\). Your plot should confirm that (6.13) is solved by (6.15).

A:

Conditions:

The conditions are the same as part (a) - \(X\) is an \(n \times n\) identity matrix, and we are performing regression without an intercept.


The Lasso:

The Lasso amounts to finding \(\beta_1, \dots, \beta_p\) such that the following is minimized:

\[\sum_{j=1}^p (y_j - \beta_j)^2 + \lambda \sum_{j=1}^p |\beta_j| \quad (6.13)\]

The book says that these lasso coefficient estimates take the form:

\[\hat{\beta}_j^L = \begin{cases} y_j - \lambda/2 & \text{if } y_j > \lambda/2 \\ y_j + \lambda/2 & \text{if } y_j < -\lambda/2 \\ 0 & \text{if } |y_j| \le \lambda/2 \end{cases} \quad (6.15)\]
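Expression \((6.15)\) is just soft-thresholding of \(y_j\) at \(\lambda/2\), which can be written as a one-line helper (a sketch - soft_threshold() is my own illustrative function, not from any package):

soft_threshold <- function(y, lambda) sign(y) * pmax(abs(y) - lambda / 2, 0)

soft_threshold(5, 4) # 3, matching the worked example below (y_1 = 5, lambda = 4)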


Answer:

For \(p = 1\), \((6.13)\) gives:

\[y_1^2 - 2 y_1 \beta_1 + \beta_1^2 + \lambda |\beta_1|\]

For \(\lambda = 4\), \(y_1 = 5\), we get:

\[g(\beta_1) = \beta_1^2 - 10 \beta_1 + 4|\beta_1| + 25\]

Since \(y_1 > \lambda/2\), we should see that the minimum of this function occurs at \(y_1 - \lambda / 2 = 3\), with a minimum of \(g(3) = 9 - 30 + 12 + 25 = 16\):

beta_1 <- seq(-2, 8, 0.1)
g_beta_1 <- beta_1^2 - 10*beta_1 + 4*abs(beta_1) + 25

data.frame(beta_1, 
           g_beta_1) %>%
  ggplot(aes(x = beta_1, y = g_beta_1)) + 
  geom_point() + 
  geom_line() + 
  geom_vline(xintercept = 3, col = "mediumseagreen") + 
  geom_hline(yintercept = 16, col = "mediumseagreen") + 
  labs(x = "beta_1", 
       y = "g(beta_1)")

The graph confirms that this value of \(\beta_1\) minimizes the Lasso optimization function.




7. Ridge & Lasso - Bayesian Connection

This question is marked as being particularly hard by the authors and I couldn’t do it! Here is the only solution I could find online.

If anyone with a good handle on the Bayesian statistics & probability required would be willing to work through this question with me and explain the steps I would be very grateful and give you a big shout-out :)

Please contact me on Liam95morgan@gmail.com, or via LinkedIn.




8. APPLIED: Generated Data (Best Subset, Forwards, Backwards & Lasso Selection)

In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.


(a) Generate \(X\) & \(\epsilon\)

Q: Use the rnorm() function to generate a predictor \(X\) of length \(n\) = 100, as well as a noise vector \(\epsilon\) of length \(n\) = 100.

A:

I create X and error, leaving the default arguments for rnorm() of mean = 0 and sd = 1:

set.seed(123)
X <- rnorm(100)
error <- rnorm(100)


(b) Generate \(Y\)

Q: Generate a response vector \(Y\) of length \(n\) = 100 according to the model \[Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon\] where \(\beta_0\), \(\beta_1\), \(\beta_2\) & \(\beta_3\) are constants of your choice.

A:

I chose the following parameters:

  • \(\beta_0 = 10\)
  • \(\beta_1 = -5\)
  • \(\beta_2 = 6\)
  • \(\beta_3 = 3\)
Y <- 10 - 5*X + 6*X^2 + 3*X^3 + error


(c) Best Subset Selection

Q: Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors \(X, X^2, \dots, X^{10}\). What is the best model obtained according to \(C_p\), \(BIC\), and adjusted \(R^2\)? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both \(X\) and \(Y\).

A:

X_matrix <- poly(X, 10, raw = T, simple = T)

model <- regsubsets(y = Y, 
                    x = X_matrix, 
                    nvmax = 10, 
                    method = "exhaustive")

model_summary <- summary(model)

Mallows \(C_p\):

data.frame(cp = model_summary$cp, subset_size = 1:10) %>%
  mutate(min_cp = as.numeric(min(cp) == cp)) %>%
  ggplot(aes(x = subset_size, y = cp)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_cp))) + 
  scale_x_continuous(breaks = seq(1, 10), minor_breaks = NULL) +
  scale_y_continuous(labels = scales::comma_format()) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Best Subset Selection - Mallows Cp", 
       x = "Subset Size", 
       y = "Cp")

\(BIC\):

data.frame(bic = model_summary$bic, subset_size = 1:10) %>%
  mutate(min_bic = as.numeric(min(bic) == bic)) %>%
  ggplot(aes(x = subset_size, y = bic)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_bic))) + 
  scale_x_continuous(breaks = seq(1, 10), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Best Subset Selection - BIC", 
       x = "Subset Size", 
       y = "BIC")

Adjusted \(R^2\):

data.frame(adj_r2 = model_summary$adjr2, subset_size = 1:10) %>%
  mutate(max_adj_r2 = as.numeric(max(adj_r2) == adj_r2)) %>%
  ggplot(aes(x = subset_size, y = adj_r2)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(max_adj_r2))) + 
  scale_x_continuous(breaks = seq(1, 10), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Best Subset Selection - Adjusted R-Squared", 
       x = "Subset Size", 
       y = "Adjusted R-Squared")


When using Mallows \(C_p\) & \(BIC\) as the selection criterion, the 3-variable model is chosen.

Looking at the third row of the matrix below, this should result in a model with the first, second & third-order terms of \(X\):

model_summary$outmat
##           1   2   3   4   5   6   7   8   9   10 
## 1  ( 1 )  " " "*" " " " " " " " " " " " " " " " "
## 2  ( 1 )  " " "*" " " " " "*" " " " " " " " " " "
## 3  ( 1 )  "*" "*" "*" " " " " " " " " " " " " " "
## 4  ( 1 )  "*" "*" "*" " " " " "*" " " " " " " " "
## 5  ( 1 )  "*" "*" "*" " " " " " " "*" " " "*" " "
## 6  ( 1 )  "*" "*" "*" "*" " " "*" " " "*" " " " "
## 7  ( 1 )  "*" "*" "*" "*" " " "*" " " "*" " " "*"
## 8  ( 1 )  "*" "*" "*" "*" "*" "*" " " "*" " " "*"
## 9  ( 1 )  "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
## 10  ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"

The coefficients are given below.

coef(model, id = 3)
## (Intercept)           1           2           3 
##    9.970394   -5.079554    5.908457    3.020436

Using Adjusted \(R^2\), the best-subset algorithm selects the 7-variable model, with the following coefficients:

coef(model, id = 7)
## (Intercept)           1           2           3           4           6 
##   9.7931750  -5.1092516   8.2758848   3.0530514  -4.0243410   2.2138377 
##           8          10 
##  -0.4870719   0.0373538


(d) Other Selection Algorithms

Q: Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?

A:

Forward Stepwise Selection:

model <- regsubsets(y = Y, 
                    x = X_matrix, 
                    nvmax = 10, 
                    method = "forward")

model_summary <- summary(model)

Mallows \(C_p\):

which.min(model_summary$cp)
## [1] 4

\(BIC\):

which.min(model_summary$bic)
## [1] 4

Adjusted \(R^2\):

which.max(model_summary$adjr2)
## [1] 5


When using Mallows \(C_p\) & \(BIC\) as the selection criterion in forward selection, the 4-variable model was chosen, with the following coefficients:

coef(model, id = 4)
##  (Intercept)            1            2            3            5 
##  9.970679609 -5.118569011  5.905942599  3.065426465 -0.008184758

Using Adjusted \(R^2\), forward selection chose the 5-variable model, with the following coefficients:

coef(model, id = 5)
##  (Intercept)            1            2            3            5            9 
##  9.997687418 -4.829782509  5.862875876  2.505181775  0.197180909 -0.004256077


Backwards Stepwise Selection:

model <- regsubsets(y = Y, 
                    x = X_matrix, 
                    nvmax = 10, 
                    method = "backward")

model_summary <- summary(model)

Mallows \(C_p\):

which.min(model_summary$cp)
## [1] 3

\(BIC\):

which.min(model_summary$bic)
## [1] 3

Adjusted \(R^2\):

which.max(model_summary$adjr2)
## [1] 7


Again, Mallows \(C_p\) & \(BIC\) chose the same model (this time, the 3-variable model), with the following coefficients:

coef(model, id = 3)
## (Intercept)           1           2           3 
##    9.970394   -5.079554    5.908457    3.020436

However, Adjusted \(R^2\) chose the 7-variable model:

coef(model, id = 7)
## (Intercept)           1           2           3           4           6 
##   9.7931750  -5.1092516   8.2758848   3.0530514  -4.0243410   2.2138377 
##           8          10 
##  -0.4870719   0.0373538


Different selection methods and metrics result in different subsets of the predictors being selected.

As with real data, the fact that the functional form of the selected model does not exactly match the functional form of the generating function (a cubic) is not necessarily a bad thing for prediction - the generating function might still be well-approximated, and the prediction error can still be low.

We could compare the cross-validated MSE of each of these selected models quite easily, although I don’t here.
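For reference, a rough sketch of how that comparison could be set up with caret (not run here) - it re-uses the custom_regression_metrics helper defined at the top and the X1-X10 column names that data.frame() assigns to the polynomial matrix, with formulas reflecting the 3-variable and 4-variable models selected above:

df_poly <- data.frame(Y, X_matrix) # columns become X1, ..., X10

cv_ctrl <- trainControl(method = "cv", number = 5, summaryFunction = custom_regression_metrics)

model_3var <- train(Y ~ X1 + X2 + X3, data = df_poly, method = "lm", trControl = cv_ctrl)
model_4var <- train(Y ~ X1 + X2 + X3 + X5, data = df_poly, method = "lm", trControl = cv_ctrl)

model_3var$results$MSE # cross-validated MSE of the 3-variable model
model_4var$results$MSE # cross-validated MSE of the 4-variable model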


(e) Lasso Selection

Q: Now fit a lasso model to the simulated data, again using \(X, X^2, \dots, X^{10}\) as predictors. Use cross-validation to select the optimal value of \(\lambda\). Create plots of the cross-validation error as a function of \(\lambda\). Report the resulting coefficient estimates, and discuss the results obtained.

A:

I use 5-fold cross-validation for a grid of 100 values of \(\lambda\), selecting the value of \(\lambda\) corresponding to the smallest out-of-sample MSE.

Note that the standardize argument to cv.glmnet is TRUE by default, specifying that the predictors will be centered and scaled before any models are fit.

# NOTE: different results to 'RNGkind()' on rstudio vs rmarkdown
# R 3.6.0 changed sampling method:
# https://stackoverflow.com/questions/56268011/sample-function-gives-different-result-in-console-and-in-knitted-document-when-s
# set 'sample.kind' to reconcile the two and get the same results

set.seed(1, sample.kind = "Rounding")
model_lasso <- cv.glmnet(y = Y, 
                         x = X_matrix, 
                         alpha = 1, 
                         lambda = 10^seq(1,-2, length = 100), 
                         standardize = TRUE, 
                         nfolds = 5)

Here is a graph showing the selection. Both the x and y axes are on a \(\log_{10}\) scale, as both \(\lambda\) and the cross-validation MSE span several orders of magnitude.

data.frame(lambda = model_lasso$lambda, 
           cv_mse = model_lasso$cvm, 
           nonzero_coeff = model_lasso$nzero) %>%
  ggplot(aes(x = lambda, y = cv_mse, col = nonzero_coeff)) + 
  geom_point() + 
  geom_line() + 
  geom_vline(xintercept = model_lasso$lambda.min, col = "deepskyblue3") +
  geom_hline(yintercept = min(model_lasso$cvm), col = "deepskyblue3") +
  scale_x_continuous(trans = 'log10') + 
  scale_y_continuous(trans = 'log10') + 
  theme(legend.position = "bottom") + 
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "Lambda", 
       y = "Cross-Validation MSE", 
       col = "Non-Zero Coefficients:", 
       title = "Lasso - Lambda Selection (Using 5-Fold Cross-Validation)")

The best value of \(\lambda\) is 0.0351.

I re-fit this model on the full data and extract the coefficients (which are returned on the original scale, despite the fact that the \(X\) variables were standardized).

model_lasso_best <- glmnet(y = Y, x = X_matrix, alpha = 1)

predict(model_lasso_best, s = model_lasso$lambda.min, type = "coefficients")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 10.080844150
## 1           -4.494879835
## 2            5.645676674
## 3            2.551562536
## 4            0.058437508
## 5            0.070769078
## 6            0.001028764
## 7            .          
## 8            .          
## 9            .          
## 10           .

We can see that the coefficients for \(X^7\) - \(X^{10}\) have been shrunk to exactly zero, the coefficients for \(X^4\) - \(X^6\) have been shrunk close to zero, and the coefficients for \(X\) - \(X^3\) are noticeably larger, which makes sense as the generating function was an order 3 polynomial. The estimates (\(\hat{\beta_0}\), \(\hat{\beta_1}\), \(\hat{\beta_2}\), \(\hat{\beta_3}\)) are all reasonably close to the true coefficients in the generating function.


(f) New Example - Best Subset vs Lasso

Q: Now generate a response vector \(Y\) according to the model \(Y = \beta_0 + \beta_7 X^7 + \epsilon\), and perform best subset selection and the lasso. Discuss the results obtained.

A:

I kept \(\beta_0 = 10\) and chose \(\beta_7 = 5\) to generate the new response vector.

Y <- 10 + 5*X^7 + error


Best Subset Selection:

model <- regsubsets(y = Y,
                    x = X_matrix,
                    nvmax = 10,
                    method = "exhaustive")

model_summary <- summary(model)

Mallows \(C_p\):

which.min(model_summary$cp)
## [1] 1

\(BIC\):

which.min(model_summary$bic)
## [1] 1

Adjusted \(R^2\):

which.max(model_summary$adjr2)
## [1] 6

Once again, Mallows \(C_p\) & \(BIC\) select the same model - the one-variable model. The matrix below shows that this single-variable model contains \(X^7\) as the single predictor (same as the generating function).

model_summary$outmat
##           1   2   3   4   5   6   7   8   9   10 
## 1  ( 1 )  " " " " " " " " " " " " "*" " " " " " "
## 2  ( 1 )  " " "*" " " " " " " " " "*" " " " " " "
## 3  ( 1 )  " " "*" " " " " " " "*" "*" " " " " " "
## 4  ( 1 )  " " " " " " "*" " " "*" "*" "*" " " " "
## 5  ( 1 )  " " "*" " " "*" " " "*" "*" "*" " " " "
## 6  ( 1 )  " " "*" " " "*" " " "*" "*" "*" " " "*"
## 7  ( 1 )  "*" "*" " " "*" " " "*" "*" "*" " " "*"
## 8  ( 1 )  " " "*" "*" "*" " " "*" "*" "*" "*" "*"
## 9  ( 1 )  "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
## 10  ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"

Extracting the coefficients below, we see that \(\hat{\beta_0} \approx 10\) and \(\hat{\beta_7} \approx 5\), as we would expect:

coef(model, id = 1)
## (Intercept)           7 
##    9.893251    4.999704


Lasso Selection:

set.seed(2)

model_lasso <- cv.glmnet(y = Y, 
                         x = X_matrix, 
                         alpha = 1, 
                         lambda = 10^seq(1,-2, length = 100), 
                         standardize = TRUE, 
                         nfolds = 5)

data.frame(lambda = model_lasso$lambda, 
           cv_mse = model_lasso$cvm, 
           nonzero_coeff = model_lasso$nzero) %>%
  ggplot(aes(x = lambda, y = cv_mse, col = nonzero_coeff)) + 
  geom_point() + 
  geom_line() + 
  geom_vline(xintercept = model_lasso$lambda.min, col = "deepskyblue3") +
  geom_hline(yintercept = min(model_lasso$cvm), col = "deepskyblue3") +
  scale_x_continuous(trans = 'log10') + 
    scale_y_continuous(trans = 'log10') + 
  theme(legend.position = "bottom") + 
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "Lambda", 
       y = "Cross-Validation MSE", 
       col = "Non-Zero Coefficients:", 
       title = "Lasso - Lambda Selection (Using 5-Fold Cross-Validation)")

The best value of \(\lambda\) is 0.0163.

As before, I re-fit this model on the full data and extract the coefficients.

model_lasso_best <- glmnet(y = Y, x = X_matrix, alpha = 1)

predict(model_lasso_best, s = model_lasso$lambda.min, type = "coefficients")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                    1
## (Intercept) 10.28679
## 1            .      
## 2            .      
## 3            .      
## 4            .      
## 5            .      
## 6            .      
## 7            4.85396
## 8            .      
## 9            .      
## 10           .

Lasso has also picked the 1-variable model with \(X^7\) as the single predictor, and the coefficient estimates are again close to the true coefficients.




9. APPLIED: The College Dataset (Least Squares, Ridge, Lasso, PCR & PLS)

In this exercise, we will predict the number of applications received using the other variables in the College data set.


(a) train/test Split

Q: Split the data set into a training set and a test set.

A:

I randomly sample 70% of the observations for train:

set.seed(3)
train_index <- sample(1:nrow(College), round(nrow(College) * 0.7))

train <- College[train_index, ]
nrow(train) / nrow(College)
## [1] 0.7001287

The remaining observations are allocated to the test dataset:

test <- College[-train_index, ]
nrow(test) / nrow(College)
## [1] 0.2998713


(b) OLS Regression

Q: Fit a linear model using least squares on the training set, and report the test error obtained.

A:

I fit the model and provide the model summary:

model_linear <- lm(Apps ~ ., data = train)
summary(model_linear)
## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3149.1  -366.6   -42.3   296.6  5657.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -625.31793  418.96146  -1.493 0.136156    
## PrivateYes  -531.00812  149.13545  -3.561 0.000404 ***
## Accept         1.31888    0.05618  23.476  < 2e-16 ***
## Enroll        -0.43546    0.19874  -2.191 0.028885 *  
## Top10perc     50.93958    5.74902   8.861  < 2e-16 ***
## Top25perc    -12.40222    4.52484  -2.741 0.006335 ** 
## F.Undergrad    0.07675    0.03237   2.371 0.018120 *  
## P.Undergrad    0.03967    0.03172   1.251 0.211625    
## Outstate      -0.04319    0.01936  -2.231 0.026096 *  
## Room.Board     0.21327    0.04889   4.362 1.55e-05 ***
## Books          0.14681    0.23792   0.617 0.537454    
## Personal       0.02492    0.07000   0.356 0.721965    
## PhD           -8.50811    4.95017  -1.719 0.086248 .  
## Terminal      -1.80323    5.56078  -0.324 0.745858    
## S.F.Ratio     11.20908   13.00766   0.862 0.389229    
## perc.alumni   -6.99057    4.34316  -1.610 0.108094    
## Expend         0.04041    0.01221   3.310 0.000998 ***
## Grad.Rate      5.98245    3.01090   1.987 0.047448 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 899.6 on 526 degrees of freedom
## Multiple R-squared:  0.9278, Adjusted R-squared:  0.9255 
## F-statistic: 397.9 on 17 and 526 DF,  p-value: < 2.2e-16

Using MSE as the test error metric:

ols_pred <- predict(model_linear, test)
(ols_mse <- mean((ols_pred - test$Apps)^2))
## [1] 2019409


(c) Ridge Regression

Q: Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.

A:

Since there is a categorical variable in the College dataset (Private), I first create train_mat and test_mat, which are train & test matrices with Private dummy-coded and Apps removed.

I specify that we shouldn’t drop one of the Private levels as a reference level; the top comment here gives a nice justification for why this is.

The justification makes sense to me for factors with >2 levels, and in the 2-level case regularization will deal with the perfect correlation of the one-hot-encoding of binary factors (here, Private.No & Private.Yes).

train_mat <- dummyVars(Apps ~ ., data = train, fullRank = F) %>%
  predict(newdata = train) %>%
  as.matrix()

test_mat <- dummyVars(Apps ~ ., data = test, fullRank = F) %>%
  predict(newdata = test) %>%
  as.matrix()

I test varying values of \(\lambda\) (from 0.01 to 100) using 5-fold cross-validation:

set.seed(3)

model_ridge <- cv.glmnet(y = train$Apps, 
                         x = train_mat, 
                         alpha = 0, 
                         lambda = 10^seq(2,-2, length = 100), 
                         standardize = TRUE, 
                         nfolds = 5)

data.frame(lambda = model_ridge$lambda, 
           cv_mse = model_ridge$cvm) %>%
  ggplot(aes(x = lambda, y = cv_mse)) + 
  geom_point() + 
  geom_line() + 
  geom_vline(xintercept = model_ridge$lambda.min, col = "deepskyblue3") +
  geom_hline(yintercept = min(model_ridge$cvm), col = "deepskyblue3") +
  scale_x_continuous(trans = 'log10', breaks = c(0.01, 0.1, 1, 10, 100), labels = c(0.01, 0.1, 1, 10, 100)) + 
  scale_y_continuous(labels = scales::comma_format()) + 
  theme(legend.position = "bottom") + 
  labs(x = "Lambda", 
       y = "Cross-Validation MSE", 
       col = "Non-Zero Coefficients:", 
       title = "Ridge Regression - Lambda Selection (Using 5-Fold Cross-Validation)")

The selected value of \(\lambda\) is 43.2876. I re-fit the model and produce predictions for this value of lambda on the test data. This gives the resulting test MSE:

model_ridge_best <- glmnet(y = train$Apps,
                           x = train_mat,
                           alpha = 0, 
                           lambda = 10^seq(2,-2, length = 100))

ridge_pred <- predict(model_ridge_best, s = model_ridge$lambda.min, newx = test_mat)
(ridge_mse <- mean((ridge_pred - test$Apps)^2))
## [1] 2297697


(d) The Lasso

Q: Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

A:

I test varying values of \(\lambda\) (from 0.01 to 100) using 5-fold cross-validation.

set.seed(4)

model_lasso <- cv.glmnet(y = train$Apps, 
                         x = train_mat, 
                         alpha = 1, 
                         lambda = 10^seq(2, -2, length = 100), 
                         standardize = TRUE, 
                         nfolds = 5, 
                         thresh = 1e-12)

data.frame(lambda = model_lasso$lambda, 
           cv_mse = model_lasso$cvm, 
           nonzero_coeff = model_lasso$nzero) %>%
  ggplot(aes(x = lambda, y = cv_mse, col = nonzero_coeff)) + 
  geom_point() + 
  geom_line() + 
  geom_vline(xintercept = model_lasso$lambda.min, col = "deepskyblue3") +
  geom_hline(yintercept = min(model_lasso$cvm), col = "deepskyblue3") +
  scale_x_continuous(trans = 'log10', breaks = c(0.01, 0.1, 1, 10, 100), labels = c(0.01, 0.1, 1, 10, 100)) + 
  scale_y_continuous(labels = scales::comma_format()) + 
  theme(legend.position = "bottom") + 
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "Lambda", 
       y = "Cross-Validation MSE", 
       col = "Non-Zero Coefficients:", 
       title = "Lasso - Lambda Selection (Using 5-Fold Cross-Validation)")

The selected value of \(\lambda\) is 0.3126.

Fitting the full model, and evaluating the test MSE:

model_lasso_best <- glmnet(y = train$Apps,
                           x = train_mat,
                           alpha = 1, 
                           lambda = 10^seq(2,-5, length = 100))

lasso_pred <- predict(model_lasso_best, s = model_lasso$lambda.min, newx = test_mat)
(lasso_mse <- mean((lasso_pred - test$Apps)^2))
## [1] 2025346

The coefficients:

lasso_coef <- predict(model_lasso_best, type = "coefficients", s = model_lasso$lambda.min)

round(lasso_coef, 3)
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept) -1162.486
## Private.No    532.620
## Private.Yes     .    
## Accept          1.314
## Enroll         -0.402
## Top10perc      50.654
## Top25perc     -12.190
## F.Undergrad     0.073
## P.Undergrad     0.039
## Outstate       -0.043
## Room.Board      0.214
## Books           0.145
## Personal        0.024
## PhD            -8.501
## Terminal       -1.760
## S.F.Ratio      11.042
## perc.alumni    -7.051
## Expend          0.040
## Grad.Rate       5.934

We can see that 18 of the 19 coefficients (including the intercept) are non-zero - only the Private.Yes coefficient has been shrunk to exactly zero.


(e) Principal Components Regression

Q: Fit a PCR model on the training set, with \(M\) chosen by cross-validation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.

A:

First, I select \(M\) using cross-validation:

set.seed(5)

model_pcr <- pcr(Apps ~ .,
                 data = train, 
                 scale = T, 
                 validation = "CV")

model_pcr_mse <- MSEP(model_pcr, estimate = "CV")$val %>%
  reshape2::melt() %>%
  mutate(M = 0:(nrow(.)-1)) %>%
  select(M, value) %>%
  rename(CV_MSE = value)

model_pcr_mse
##     M     CV_MSE
## 1   0 10884668.5
## 2   1 10907595.4
## 3   2  2569406.1
## 4   3  2641560.1
## 5   4  2567752.3
## 6   5  1464503.0
## 7   6  1400044.1
## 8   7  1351492.1
## 9   8  1303898.4
## 10  9  1230513.5
## 11 10  1220249.5
## 12 11  1226540.0
## 13 12  1238549.3
## 14 13  1240215.1
## 15 14  1242592.0
## 16 15  1263342.5
## 17 16   950107.6
## 18 17   893799.2
model_pcr_mse %>%
  mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
  ggplot(aes(x = M, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_y_continuous(labels = scales::comma_format()) + 
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(x = "M", 
       y = "Cross-Validation MSE", 
       col = "Non-Zero Coefficients:", 
       title = "PCR - M Selection (Using 10-Fold Cross-Validation)")

Cross-validation selected \(M\) = 17 (so we can note that \(M = p = 17\)) - the dimensionality hasn’t been reduced at all.

Evaluating the test MSE:

pcr_pred <- predict(model_pcr, test, ncomp = 17)
(pcr_mse <- mean((pcr_pred - test$Apps)^2))
## [1] 2019409

I noticed here that the test MSE is identical to that obtained in part (b) from performing OLS. This is as expected from p.230 of the ISLR book, where it says:

If \(M = p\), and all the \(Z_m\) are linearly independent, then (6.18) poses no constraints. In this case, no dimension reduction occurs, and so fitting (6.17) is equivalent to performing least squares on the original \(p\) predictors.

Here, \((6.17)\) describes exactly what we are doing above - fitting an OLS model with the linear combinations \(Z_1, Z_2, \dots, Z_M\) as predictors instead of \(X_1, X_2, \dots, X_p\):

\[y_i = \theta_0 + \sum_{m = 1}^M \theta_m z_{im} + \epsilon_i, \quad i = 1, \dots, n \quad (6.17)\]
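As a quick sanity check of that equivalence (a one-line sketch; both prediction vectors already exist from parts (b) and (e)), the \(M = 17\) PCR predictions should match the OLS predictions up to numerical precision:

max(abs(as.numeric(pcr_pred) - as.numeric(ols_pred))) # should be essentially zero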


(f) Partial Least Squares Regression

Q: Fit a PLS model on the training set, with \(M\) chosen by crossvalidation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.

A:

Selecting \(M\) by cross-validation:

set.seed(6)

model_pls <- plsr(Apps ~ .,
                  data = train, 
                  scale = T, 
                  validation = "CV")

model_pls_mse <- MSEP(model_pls, estimate = "CV")$val %>%
  reshape2::melt() %>%
  mutate(M = 0:(nrow(.)-1)) %>%
  select(M, value) %>%
  rename(CV_MSE = value)

model_pls_mse
##     M     CV_MSE
## 1   0 10884668.5
## 2   1  2085322.2
## 3   2  1198711.4
## 4   3  1169960.5
## 5   4  1055530.1
## 6   5   951850.4
## 7   6   898388.2
## 8   7   889700.3
## 9   8   889619.9
## 10  9   884242.9
## 11 10   885543.9
## 12 11   882526.5
## 13 12   880744.0
## 14 13   880911.3
## 15 14   881088.5
## 16 15   881559.9
## 17 16   881677.0
## 18 17   881688.1
model_pls_mse %>%
  mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
  ggplot(aes(x = M, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_y_continuous(labels = scales::comma_format()) + 
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(x = "M", 
       y = "Cross-Validation MSE", 
       title = "PLS - M Selection (Using 10-Fold Cross-Validation)")

Cross-validation selected \(M\) = 12 as the number of partial least squares directions that minimizes the out-of-sample MSE.

Evaluating the test MSE:

pls_pred <- predict(model_pls, test, ncomp = 12)
(pls_mse <- mean((pls_pred - test$Apps)^2))
## [1] 2019361


(g) Performance Comparison

Q: Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

A:

I have stored the test MSEs and calculated the test \(R^2\) using the most general definition, so I put them into a data frame for comparison:

# R2 = 1 - SS_res / SS_tot

SS_tot <- sum((test$Apps - mean(test$Apps))^2)

data.frame(method = c("OLS", "Ridge", "Lasso", "PCR", "PLS"), 
           test_MSE = c(ols_mse, ridge_mse, lasso_mse, pcr_mse, pls_mse), 
           test_R2 = c(1 - sum((test$Apps - ols_pred)^2) / SS_tot,
                       1 - sum((test$Apps - ridge_pred)^2) / SS_tot, 
                       1 - sum((test$Apps - lasso_pred)^2) / SS_tot, 
                       1 - sum((test$Apps - pcr_pred)^2) / SS_tot, 
                       1 - sum((test$Apps - pls_pred)^2) / SS_tot)) %>%
  arrange(test_MSE)
##   method test_MSE   test_R2
## 1    PLS  2019361 0.9168871
## 2    PCR  2019409 0.9168851
## 3    OLS  2019409 0.9168851
## 4  Lasso  2025346 0.9166408
## 5  Ridge  2297697 0.9054313

There is really a minimal performance difference among the 5 methods - for instance, the performance boost in going from OLS to PLS is a reduction in test MSE of just 0.0024%. The only model with a potentially meaningful performance difference was ridge regression (which performed the worst).

Given the size of the dataset, I have no doubt that the selected models and test performance would shift around if the random seeds were changed. Each of the models performed well, with test \(R^2\) values above 0.9.




10. APPLIED: Generated Data (Best Subset Selection)

We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.


(a) Generate Data

Q: Generate a data set with \(p\) = 20 features, \(n\) = 1,000 observations, and an associated quantitative response vector generated according to the model

\[Y = X \beta + \epsilon\] where \(\beta\) has some elements that are exactly equal to zero.

A:

I create the design matrix \(X\) by generating a vector of length \(n \times p = 1,000 \times 20 = 20,000\) (sampled from a normal distribution with mean = 0 and sd = 1), and filling in \(X\) column-by-column.

For \(\beta\), I generate a vector of length \(p = 20\) with elements drawn from \(\mathcal{N}(1, 2^2)\) (4 of which are subsequently overwritten to exactly zero), and a noise vector \(\epsilon\) of length \(n = 1,000\) with elements drawn from \(\mathcal{N}(0, 3^2)\).

\(Y\) is then generated by calculating \(Y = X \beta + \epsilon\).

set.seed(1)
p <- 20
n <- 1000

X <- matrix(data = rnorm(n * p, mean = 0, sd = 1), nrow = n, ncol = p)

B <- rnorm(p, mean = 1, sd = 2)
B[2] <- 0
B[9] <- 0
B[14] <- 0
B[17] <- 0

eps <- rnorm(n, mean = 0, sd = 3) # noise vector of length n

Y = X %*% B + eps

I put \(Y\) and \(X\) into a dataframe (df) and view it:

df <- cbind(Y, as.data.frame(X))
names(df) <- c("Y", paste0("X", 1:20))
glimpse(df)
## Rows: 1,000
## Columns: 21
## $ Y   <dbl> -7.5791363, 9.2157068, 8.2360422, 3.7628364, -6.4013460, 5.5314...
## $ X1  <dbl> -0.62645381, 0.18364332, -0.83562861, 1.59528080, 0.32950777, -...
## $ X2  <dbl> 1.13496509, 1.11193185, -0.87077763, 0.21073159, 0.06939565, -1...
## $ X3  <dbl> -0.88614959, -1.92225490, 1.61970074, 0.51926990, -0.05584993, ...
## $ X4  <dbl> 0.73911492, 0.38660873, 1.29639717, -0.80355836, -1.60262567, 0...
## $ X5  <dbl> -1.13463018, 0.76455710, 0.57071014, -1.35169393, -2.02988547, ...
## $ X6  <dbl> -1.5163733, 0.6291412, -1.6781940, 1.1797811, 1.1176545, -1.237...
## $ X7  <dbl> -0.61882708, -1.10942196, -2.17033523, -0.03130307, -0.26039848...
## $ X8  <dbl> -1.32541772, 0.95197972, 0.86000439, 1.06079031, -0.35058396, -...
## $ X9  <dbl> 0.26370340, -0.82945185, -1.46163477, 1.68399018, -1.54432429, ...
## $ X10 <dbl> -1.21712008, -0.94622927, 0.09140980, 0.70135127, 0.67342236, 1...
## $ X11 <dbl> -0.80433160, -1.05652565, -1.03539578, -1.18556035, -0.50043951...
## $ X12 <dbl> -1.411521883, 1.083869657, 1.170222351, 0.294754540, -0.5544276...
## $ X13 <dbl> -0.93910663, 1.39366493, 1.62581486, 0.40900106, -0.09255856, 0...
## $ X14 <dbl> 0.2264537, -0.8185942, -0.8471526, -1.9843326, -0.8127788, 1.46...
## $ X15 <dbl> 0.5232667, 0.9935537, 0.2737370, -0.6949193, -0.7180502, -0.101...
## $ X16 <dbl> -0.21390898, -0.10672328, -0.46458931, -0.68427247, -0.79080075...
## $ X17 <dbl> 0.85763410, -1.62539515, -0.23427831, -1.03265445, -1.14114122,...
## $ X18 <dbl> 1.0496171412, 0.2903237344, 1.2421262227, -0.6850857039, -0.667...
## $ X19 <dbl> 0.95140989, 0.45709866, -0.35869346, -1.04586136, 0.30753453, 1...
## $ X20 <dbl> -2.07771241, -0.45446091, -0.16555991, 0.89765209, -0.02948916,...


(b) train/test Split

Q: Split your data set into a training set containing 100 observations and a test set containing 900 observations.

A:

It’s interesting that the training set is so small here - this would make more sense to me if we were dealing with a dataset of millions of rows, but a training set of 100 observations seems really small to capture the relationship between \(Y\) and 20 variables!

train contains 10% of the total observations:

set.seed(2)
train_index <- sample(1:1000, 100)

train <- df[train_index, ]
nrow(train) / nrow(df)
## [1] 0.1

test contains the remaining 90%:

test <- df[-train_index, ]
nrow(test) / nrow(df)
## [1] 0.9


(c) train MSE Selection

Q: Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size.

A:

Unsurprisingly, the model with the lowest train MSE is the one with all 20 predictors. This is guaranteed, as adding an additional predictor can never increase the training RSS.

model <- regsubsets(Y ~ ., 
                    data = train, 
                    nvmax = 20, 
                    method = "exhaustive")

model_summary <- summary(model)

data.frame(MSE = model_summary$rss / nrow(train), subset_size = 1:20) %>%
  mutate(min_MSE = as.numeric(min(MSE) == MSE)) %>%
  ggplot(aes(x = subset_size, y = MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_MSE))) + 
  scale_x_continuous(breaks = seq(1, 20, 2)) +
  scale_y_continuous(labels = scales::comma_format()) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Best Subset Selection - Train MSE", 
       x = "Subset Size", 
       y = "MSE")


(d) test MSE Selection

Q: Plot the test set MSE associated with the best model of each size.

A:

X_test <- cbind("(Intercept)" = 1, as.matrix(test[ ,-1])) # drop Y, add an intercept column
MSE_test <- c()

# for each subset size, rebuild test-set predictions from the stored coefficients & compute the test MSE:
for (i in 1:ncol(model_summary$outmat)) {
  model_coef <- coef(model, id = i)
  pred <- X_test[, colnames(X_test) %in% names(model_coef)] %*% model_coef
  MSE_test[i] <- mean((test$Y - pred)^2)
}
  
data.frame(MSE = MSE_test, subset_size = 1:20) %>%
  mutate(min_MSE = as.numeric(min(MSE) == MSE)) %>%
  ggplot(aes(x = subset_size, y = MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_MSE))) + 
  scale_x_continuous(breaks = seq(1, 20, 2)) +
  scale_y_continuous(labels = scales::comma_format()) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Best Subset Selection - Test MSE", 
       x = "Subset Size", 
       y = "MSE")


(e) Model Minimizing the test MSE

Q: For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test set MSE is minimized for an intermediate model size.

A:

The test MSE is minimized for the 14-variable model.

The test MSE graph paints quite a common picture - there is a lot of added benefit as the number of variables in the subset initially increases, then after a certain point larger subsets make very little difference (they actually begin to harm the model, as the increased variance from additional predictors outweighs the decrease in bias).

In practice, for very large datasets I would probably choose the 14-variable model, and for very small datasets such as this I would almost certainly choose the 9-variable model.
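
For completeness, the minimizing subset size can also be read directly from the MSE_test vector computed in (d):

which.min(MSE_test) # 14, matching the plot above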


(f) Model Minimizing the test MSE - Coefficient Comparison

Q: How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.

A:

I’ve put the estimated parameters (for the selected 14-variable model) alongside the actual parameters (used to generate \(Y\)) in the dataframe params, which can be viewed below:

params <- data.frame(parameter = colnames(model_summary$which), actual = c(0, B)) %>%
  left_join(data.frame(parameter = names(coef(model, id = 14)), estimated = coef(model, id = 14)), by = "parameter") %>%
  mutate(estimated = case_when(is.na(estimated) ~ 0, TRUE ~ estimated))

params
##      parameter     actual  estimated
## 1  (Intercept)  0.0000000 -0.2263411
## 2           X1  1.4706970  0.5664113
## 3           X2  0.0000000  0.0000000
## 4           X3 -0.2843738  0.0000000
## 5           X4 -2.8696170 -2.8392890
## 6           X5  3.0773913  2.7177520
## 7           X6  0.4328998  0.0000000
## 8           X7 -1.8194582 -1.8207587
## 9           X8  2.4463609  2.8901977
## 10          X9  0.0000000  0.0000000
## 11         X10  2.4609806  3.3123983
## 12         X11  2.7583068  2.5165307
## 13         X12  2.1091128  1.8863706
## 14         X13  0.4308379  0.7166513
## 15         X14  0.0000000  0.0000000
## 16         X15 -0.4309777 -0.8653210
## 17         X16  0.4589441  0.6424018
## 18         X17  0.0000000  0.0000000
## 19         X18  4.3396136  4.0862468
## 20         X19  2.7845185  3.3514497
## 21         X20 -1.0309778 -0.8718763

Note that if a model doesn’t include certain predictors (e.g. above, this 14-variable model doesn’t include X2), this is mathematically the same as saying that the coefficient estimate for X2 is zero (\(\hat{\beta_2} = 0\)).
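
In other words, the selected fit can be written in the full 20-predictor form, with the coefficients of the excluded predictors fixed at zero:

\[\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{20} \hat{\beta}_j X_j, \qquad \hat{\beta}_j = 0 \text{ for each of the 6 excluded predictors.}\]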

I create a plot of these below:

params_plot <- params %>%
  arrange(actual)

params_plot$parameter <- factor(params_plot$parameter, ordered = TRUE, levels = params_plot$parameter)

params_plot %>%
  gather("key", "value", -parameter) %>%
  ggplot(aes(x = value, y = parameter, col = factor(key))) + 
  geom_point() + 
  geom_vline(xintercept = 0, col = "grey55") + 
  scale_x_continuous(breaks = seq(-6, 6)) +
  scale_color_discrete(labels = c("Actual", "Estimated")) +
  labs(x = "Value", 
       y = "Parameter", 
       col = "", 
       title = "Best Subset Selection - Selected 14-Variable Model", 
       subtitle = "Actual vs estimated coefficients")

It looks like the model chosen by best subset selection estimates the true parameters pretty accurately, despite the fact that the true model has 16 non-zero coefficients and a 14-variable model was chosen.

To get some perspective on the accuracy, the train \(R^2\):

selected_model <- lm(Y ~ ., data = train[ ,c("Y", names(coef(model, id = 14)[-1]))])

1 - sum((train$Y - predict(selected_model))^2) / sum((train$Y - mean(train$Y))^2)
## [1] 0.9171097

… and the test \(R^2\):

1 - sum((test$Y - predict(selected_model, newdata = test))^2) / sum((test$Y - mean(test$Y))^2)
## [1] 0.8572715


(g) Actual vs Estimated Parameter ‘Distance’

Q: Create a plot displaying \(\sqrt{\sum_{j=1}^p (\beta_j - \hat{\beta_j^r})^2}\) for a range of values of \(r\), where \(\hat{\beta_j^r}\) is the \(j\)th coefficient estimate for the best model containing \(r\) coefficients. Comment on what you observe. How does this compare to the test MSE plot from (d)?

A:

I think the goal of this question is to produce, for each model size, a ‘parameter distance’ measurement (for lack of a better term) of how close the estimated parameters are to the true values used to generate the data.

The graph in the previous question is really useful for visualising what we are being asked to do here, using the best model containing \(r = 14\) predictors as an example.

Looking at that graph for \(r = 14\), we would calculate \(\sqrt{\sum_{j=1}^p (\beta_j - \hat{\beta_j^{14}})^2}\), i.e. sum the squared horizontal distances between the actual and estimated parameters (excluding the intercept), then take the square root. We repeat this for \(r = 1\) to \(r = 20\) and visualize.

param_dist <- c()

for (j in 1:20) {
  params <- data.frame(parameter = colnames(model_summary$which)[-1], actual = B) %>%
    left_join(data.frame(parameter = names(coef(model, id = j)), estimated = coef(model, id = j)), by = "parameter") %>%
    mutate(estimated = case_when(is.na(estimated) ~ 0, TRUE ~ estimated))
  
  param_dist[j] <- sqrt(sum((params$actual - params$estimated)^2))
}


data.frame(dist = param_dist, subset_size = 1:20) %>%
  mutate(min_dist = as.numeric(min(dist) == dist)) %>%
  ggplot(aes(x = subset_size, y = dist)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_dist))) + 
  scale_x_continuous(breaks = seq(1, 20, 2)) +
  scale_y_continuous(labels = scales::comma_format()) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Best Subset Selection - Actual vs Estimated Parameter Distance", 
       x = "Subset Size", 
       y = "Parameter Distance")

What we observe is that the graph of parameter distances looks almost identical to the graph of the test MSE for different subset sizes, and that the model that minimizes this distance is also the 14-variable model.

To phrase this slightly differently: selecting a model based on the test MSE (an approach we can apply to any dataset) gave us the 14-variable model, and selecting a model based on how close the estimated parameters are to the true parameters (a luxury we only have when we know the true parameters) also led us to the 14-variable model.

The similarity of the two graphs simply reflects that a model whose parameter estimates are closer to the true parameters will also generally have a lower test MSE (which makes sense).
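
One quick way to quantify this, using the param_dist and MSE_test vectors already computed, is their correlation:

cor(param_dist, MSE_test) # should be strongly positive, given how similar the two curves look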




11. APPLIED: The Boston Dataset (Selecting from Various Approaches)

We will now try to predict per capita crime rate in the Boston data set.


(a) Various Approaches

Q: Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

A:

The Boston dataset is pretty small, so I’m going to use 10-repeated 10-fold cross-validation MSE to approximate the test error (instead of partitioning the data into train & test).

I use caret::train() where possible here as I prefer the consistent syntax across model types, which is particularly useful when I want to try many different models in quick succession.

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10 , summaryFunction = custom_regression_metrics)


Best Subset Selection:

# best models per subset size:
model <- regsubsets(crim ~ ., 
                    data = Boston, 
                    nvmax = ncol(Boston) - 1, 
                    method = "exhaustive")




# cross-validating to compare MSE:

CV_MSE <- c()

set.seed(10101)

for (i in 1:(ncol(Boston)-1)) {
  Boston_temp <- Boston[ ,c("crim", names(coef(model, id = i)[-1]))]
  model_temp <- train(crim ~ ., 
                      data = Boston_temp, 
                      method = "lm", 
                      trControl = ctrl)
  CV_MSE[i] <- model_temp$results$MSE
}

data.frame(CV_MSE = CV_MSE, subset_size = 1:(ncol(Boston)-1)) %>%
  mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
  ggplot(aes(x = subset_size, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Boston Dataset - Best Subset Selection", 
       subtitle = "Selecting parameter subset size with cross-validation",
       x = "Subset Size", 
       y = "CV MSE")

Minimum CV MSE:

min(CV_MSE)
## [1] 42.58941


Lasso Regression:

set.seed(159)

model_lasso <- train(crim ~ ., 
                     data = Boston, 
                     method = "glmnet", 
                     preProcess = c("center", "scale"), 
                     metric = "MSE",
                     maximize = F,
                     trControl = ctrl, 
                     tuneGrid = expand.grid(alpha = 1,
                                            lambda = seq(0, 1, length = 100)))



model_lasso$results %>%
  rename(CV_MSE = MSE) %>%
  mutate(min_CV_MSE = as.numeric(lambda == model_lasso$bestTune$lambda)) %>%
  ggplot(aes(x = lambda, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Boston Dataset - Lasso Regression", 
       subtitle = "Selecting shrinkage parameter with cross-validation",
       y = "CV MSE")

Minimum CV MSE:

min(model_lasso$results$MSE)
## [1] 42.98778


Ridge Regression:

set.seed(159)

model_ridge <- train(crim ~ ., 
                     data = Boston, 
                     method = "glmnet", 
                     preProcess = c("center", "scale"), 
                     metric = "MSE",
                     maximize = F,
                     trControl = ctrl, 
                     tuneGrid = expand.grid(alpha = 0,
                                            lambda = seq(0, 1, length = 100)))


model_ridge$results %>%
  rename(CV_MSE = MSE) %>%
  mutate(min_CV_MSE = as.numeric(lambda == model_ridge$bestTune$lambda)) %>%
  ggplot(aes(x = lambda, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Boston Dataset - Ridge Regression", 
       subtitle = "Selecting shrinkage parameter with cross-validation",
       y = "CV MSE")

Minimum CV MSE:

min(model_ridge$results$MSE)
## [1] 43.09366


Principal Components Regression:

set.seed(339)

model_pcr <- train(crim ~ ., 
                   data = Boston, 
                   method = "pcr", 
                   preProcess = c("center", "scale"), 
                   metric = "MSE", 
                   maximize = F,
                   trControl = ctrl, 
                   tuneGrid = expand.grid(ncomp = 1:(ncol(Boston)-1)))


model_pcr$results %>%
  rename(CV_MSE = MSE) %>%
  mutate(min_CV_MSE = as.numeric(ncomp == model_pcr$bestTune$ncomp)) %>%
  ggplot(aes(x = ncomp, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Boston Dataset - Principal Components Regression", 
       subtitle = "Selecting number of principal components with cross-validation",
       x = "Principal Components", 
       y = "CV MSE")

Minimum CV MSE:

min(model_pcr$results$MSE)
## [1] 43.06695


Partial Least Squares:

set.seed(840)

model_pls <- train(crim ~ ., 
                   data = Boston, 
                   method = "pls", 
                   preProcess = c("center", "scale"), 
                   metric = "MSE", 
                   maximize = F,
                   trControl = ctrl, 
                   tuneGrid = expand.grid(ncomp = 1:(ncol(Boston)-1)))


model_pls$results %>%
  rename(CV_MSE = MSE) %>%
  mutate(min_CV_MSE = as.numeric(ncomp == model_pls$bestTune$ncomp)) %>%
  ggplot(aes(x = ncomp, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Boston Dataset - Partial Least Squares", 
       subtitle = "Selecting number of principal components with cross-validation",
       x = "Principal Components", 
       y = "CV MSE")

Minimum CV MSE:

min(model_pls$results$MSE)
## [1] 43.18333
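
Before choosing between these approaches, a quick side-by-side of the minimum CV MSE values (pulled from the fitted objects above) makes the comparison easier:

data.frame(method = c("Best subset", "Lasso", "Ridge", "PCR", "PLS"),
           CV_MSE = c(min(CV_MSE),
                      min(model_lasso$results$MSE),
                      min(model_ridge$results$MSE),
                      min(model_pcr$results$MSE),
                      min(model_pls$results$MSE))) %>%
  arrange(CV_MSE)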


(b) Proposed Model

Q: Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

A:

The best performing model in terms of cross-validation MSE is the one selected using best subset selection, with CV MSE = 42.59. However, I would propose that in reality greater performance would likely be obtained through non-linear methods.

To demonstrate this, I make some small changes to the dataset - the changes that looked most promising based on my chapter 3 solutions:

  • I add quadratic & cubic terms for medv (the most obvious case of a non-linear predictor)
  • I create rad_cat - a binary variable indicating whether rad is > 20
  • I remove rad & tax (their useful information is captured by rad_cat)

Boston$medv_sq <- Boston$medv^2
Boston$medv_cub <- Boston$medv^3

Boston$rad_cat <- ifelse(Boston$rad > 20, 1, 0)
Boston$rad <- NULL
Boston$tax <- NULL

glimpse(Boston)
## Rows: 506
## Columns: 15
## $ crim     <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08...
## $ zn       <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12....
## $ indus    <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87...
## $ chas     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ nox      <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0....
## $ rm       <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5....
## $ age      <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85....
## $ dis      <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5....
## $ ptratio  <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2...
## $ black    <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 39...
## $ lstat    <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 1...
## $ medv     <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9...
## $ medv_sq  <dbl> 576.00, 466.56, 1204.09, 1115.56, 1310.44, 823.69, 524.41,...
## $ medv_cub <dbl> 13824.000, 10077.696, 41781.923, 37259.704, 47437.928, 236...
## $ rad_cat  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

As before, I perform best subset selection (obtain the best-fitting model of each subset size, then select between these subsets using repeated cross-validation):

# best models per subset size:
model <- regsubsets(crim ~ ., 
                    data = Boston, 
                    nvmax = ncol(Boston) - 1, 
                    method = "exhaustive")


# cross-validating to compare MSE:
CV_MSE <- c()

set.seed(20202)

for (i in 1:(ncol(Boston)-1)) {
  Boston_temp <- Boston[ ,c("crim", names(coef(model, id = i)[-1]))]
  model_temp <- train(crim ~ ., 
                      data = Boston_temp, 
                      method = "lm", 
                      trControl = ctrl)
  CV_MSE[i] <- model_temp$results$MSE
}

data.frame(CV_MSE = CV_MSE, subset_size = 1:(ncol(Boston)-1)) %>%
  mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
  ggplot(aes(x = subset_size, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Modified Boston Dataset - Best Subset Selection", 
       subtitle = "Selecting parameter subset size with cross-validation",
       x = "Subset Size", 
       y = "CV MSE")

In terms of predictive power, the improvement over the other linear methods fitted to the raw data is substantial: the minimum CV MSE drops to around 35.2.

One could argue that this model is potentially overfit, since these transformations were found by an exploration of the full Boston dataset, instead of just a training sample.

However, I suspect that these transformations would still have been found had I only worked with a training sample of the data, and either way it is evidence that non-linear methods capable of capturing these relationships automatically could be more promising.


(c) Selected Model

Q: Does your chosen model involve all of the features in the data set? Why or why not?

A:

The selected model uses just 7 variables:

coef(model, id = 7)
##   (Intercept)           nox           dis       ptratio          medv 
##  52.291186700 -11.960713097  -0.468287570  -0.328043396  -3.885786605 
##       medv_sq      medv_cub       rad_cat 
##   0.123927558  -0.001246049   9.187000141

As for why: adding further features will always reduce the training RSS, but here it increases the cross-validation MSE, so there is no evidence that the extra features would improve out-of-sample predictive performance.

If we wanted to select a more parsimonious model, we could choose the 4-variable model, which contains only medv, its polynomial terms and rad_cat:

coef(model, id = 4)
##  (Intercept)         medv      medv_sq     medv_cub      rad_cat 
## 38.178959648 -4.035693216  0.132937552 -0.001358685  8.101325067