1. We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, …, p predictors. Explain your answers (part c only).

c). True or False:

i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.

True - Forward stepwise selection starts with no predictors and adds one predictor at a time, choosing the one that improves model fit the most. As a result, the (k+1)-variable model always contains all k predictors from the k-variable model, plus one additional variable. This behavior is built into how forward selection works.
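
As a quick illustration (not part of the original answer), a minimal forward stepwise sketch in Python on synthetic data shows the nesting directly: the selected set only ever grows by one predictor per step. The data, the greedy RSS criterion, and all variable names here are illustrative assumptions.

```python
import numpy as np

# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

def rss(cols):
    """Training RSS of a least-squares fit of y on the given columns (plus intercept)."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

selected = []
for k in range(p):
    # Greedily add the single predictor that lowers training RSS the most
    best_j = min((j for j in range(p) if j not in selected),
                 key=lambda j: rss(selected + [j]))
    selected = selected + [best_j]
    print(f"{k + 1}-variable model: {sorted(selected)}")
# Each printed set contains the previous one, so the k-variable model is
# always nested inside the (k+1)-variable model.
```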

ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)-variable model identified by backward stepwise selection.

True - Backward stepwise selection starts with all p predictors and removes one predictor at a time, dropping the one whose removal hurts the fit the least. The k-variable model is therefore obtained from the (k+1)-variable model by deleting exactly one predictor, so its predictors are always a subset of those in the (k+1)-variable model. This nesting is built into how the algorithm works, just as with forward selection (see the sketch below).
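
A mirror-image sketch for backward stepwise (again purely illustrative, with synthetic data and the same RSS criterion as above) shows the same nesting from the other direction: each step deletes exactly one predictor from the current set.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

def rss(cols):
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

selected = list(range(p))                     # start from the full model
while len(selected) > 1:
    # Drop the predictor whose removal increases training RSS the least
    worst = min(selected, key=lambda j: rss([c for c in selected if c != j]))
    selected = [c for c in selected if c != worst]
    print(f"{len(selected)}-variable model: {sorted(selected)}")
# Each printed set is contained in the previous (larger) one, so the
# k-variable model is always nested inside the (k+1)-variable model.
```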

iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)-variable model identified by forward stepwise selection.

False - Forward and backward stepwise selection follow different search paths: one grows the model from the null model, the other prunes the full model. There is no guarantee that the k-variable model found by backward selection is a subset of the (k+1)-variable model found by forward selection.

iv. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.

False - Again, since forward and backward stepwise selection are distinct methods, the predictors chosen in their respective models may differ completely. There is no nesting guarantee between forward and backward models.

v. The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k + 1)-variable model identified by best subset selection.

False - Best subset selection examines every possible model of each size and picks the best model of each size separately. The best k-variable model is therefore not guaranteed to be nested in the best (k+1)-variable model, because the optimal combination of k+1 predictors can be entirely different from the optimal combination of k predictors (see the sketch below).
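
The non-nesting of best subset can be demonstrated with a small constructed example (an illustrative sketch, not taken from the exercise): here x3 is a noisy proxy for x1 + x2, so the best single predictor is x3, while the best pair is {x1, x2}.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + x2                              # response determined by x1 and x2
x3 = y + rng.normal(scale=0.1, size=n)   # x3 is a noisy copy of y
X = np.column_stack([x1, x2, x3])

def rss(cols):
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

for k in (1, 2):
    best = min(combinations(range(3), k), key=rss)
    print(f"best {k}-variable model:", sorted(best))
# Typically prints [2] for k=1 but [0, 1] for k=2: the best
# 1-variable model is NOT a subset of the best 2-variable model.
```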

2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

a). The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Answer: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Justification: Flexibility: The lasso places an L1 penalty (equivalently, a budget constraint) on the regression coefficients, shrinking some of them to exactly zero and thereby performing variable selection. This makes the lasso less flexible than least squares, which fits all p predictors with no constraint.

Bias-Variance Trade-off: By reducing flexibility, the lasso increases bias (since some coefficients are shrunk toward zero), but reduces variance, especially when many predictors are irrelevant or highly correlated. If the reduction in variance outweighs the increase in bias, the overall test MSE decreases, improving prediction accuracy.
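
A brief numerical illustration of this point (a sketch using scikit-learn on synthetic data; the data-generating setup and the chosen penalty strength are assumptions): least squares estimates every coefficient, while the lasso's L1 penalty drives many of them exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
# Only the first two predictors matter; the rest are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)  # alpha controls the L1 penalty strength

print("OLS coefficients:  ", np.round(ols.coef_, 2))    # all nonzero
print("Lasso coefficients:", np.round(lasso.coef_, 2))  # many exactly 0
```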

b). Repeat (a) for ridge regression relative to least squares.

Answer: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Justification: Flexibility: Ridge regression places an L2 penalty on the regression coefficients, shrinking them toward zero but (unlike the lasso) never setting them exactly to zero. The shrinkage reduces the model's effective complexity, so ridge is less flexible than least squares.

Bias-Variance Trade-off: Like the lasso, ridge regression increases bias but decreases variance. If the variance reduction is greater than the bias increase, the test MSE improves, and prediction accuracy is better than least squares.
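
For comparison (again only a sketch on the same kind of synthetic data), ridge shrinks every coefficient but leaves all of them nonzero:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)

ridge = Ridge(alpha=10.0).fit(X, y)  # alpha controls the L2 penalty strength
print("Ridge coefficients:", np.round(ridge.coef_, 2))  # shrunken toward 0, none exactly 0
```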

c). Repeat (a) for non-linear methods relative to least squares.

Answer: ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Justification: Flexibility: Non-linear methods (e.g., splines, polynomial regression, GAMs, trees, etc.) are typically more flexible than linear regression because they can capture complex relationships between predictors and the response.

Bias-Variance Trade-off: This added flexibility reduces bias (better fit to training data) but increases variance (more sensitive to training data fluctuations). These methods will improve test prediction accuracy if the reduction in bias outweighs the increase in variance.
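
One simple way to see this trade-off (an illustrative sketch using polynomial regression as a stand-in for "non-linear methods"; the data and degrees are assumptions): training error keeps falling as the polynomial degree grows, while test error eventually turns back up.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)   # truly non-linear signal
x_train, y_train = x[:100, None], y[:100]
x_test, y_test = x[100:, None], y[100:]

for degree in (1, 3, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(degree,
          round(mean_squared_error(y_train, model.predict(x_train)), 3),
          round(mean_squared_error(y_test, model.predict(x_test)), 3))
# Degree 1 underfits (high bias); very high degrees fit the training data
# better but their test error can rise again (high variance).
```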

3. Suppose we estimate the regression coefficients in a linear regression model by minimizing the expression below for a particular value of s. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.

Equation:

\[ \min_{\beta_0, \beta_1, \dots, \beta_p} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \leq s \]
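
For reference, this budget form is the standard constrained formulation of the lasso; it corresponds to the penalized (Lagrangian) form below, where a larger budget s plays the same role as a smaller penalty λ:

\[ \min_{\beta_0, \beta_1, \dots, \beta_p} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p |\beta_j| \]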

a). As we increase s from 0, the training RSS will:

i. Increase initially, and then eventually start decreasing in an inverted U shape.

ii. Decrease initially, and then eventually start increasing in a U shape.

iii. Steadily increase.

iv. Steadily decrease.

v. Remain constant.

Answer: iv. Steadily decrease.

Justification: At s=0, the constraint forces all βj=0, so the model predicts only the intercept β0, resulting in a high training RSS. As s increases, the constraint is relaxed, allowing the model to include more predictors and larger coefficients, thereby reducing the training RSS. When s is large enough, the solution becomes the ordinary least squares fit, which attains the minimum possible training RSS. So the training RSS decreases monotonically (or at least never increases) as s increases.
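
A quick numerical check (a sketch on synthetic data; scikit-learn parameterizes the lasso by the penalty strength alpha, and decreasing alpha plays the role of increasing the budget s, per the equivalence noted after the equation above):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Decreasing alpha (the penalty) corresponds to increasing the budget s
for alpha in (10.0, 1.0, 0.1, 0.01, 0.001):
    fit = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    train_rss = np.sum((y - fit.predict(X)) ** 2)
    print(f"alpha={alpha:<6} training RSS = {train_rss:.2f}")
# The training RSS only goes down as the constraint is relaxed.
```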

b). Repeat (a) for test RSS.

Answer: ii. Decrease initially, and then eventually start increasing in a U shape.

Justification: At s=0, the model is overly simple (just the intercept), so test RSS is high due to underfitting. As we increase s, the model becomes more flexible -> better fit -> test RSS decreases. However, beyond a point, adding too much flexibility leads to overfitting -> test RSS increases. This creates a U-shaped curve for test error.
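
The same kind of simulation with a held-out test set (again only a sketch on synthetic data, so the exact numbers will vary) typically reproduces the U shape:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 60, 30
X = rng.normal(size=(2 * n, p))
beta = np.zeros(p)
beta[:3] = (3.0, -2.0, 1.5)                  # only 3 predictors truly matter
y = X @ beta + rng.normal(size=2 * n)
X_tr, y_tr, X_te, y_te = X[:n], y[:n], X[n:], y[n:]

for alpha in (5.0, 1.0, 0.3, 0.1, 0.01, 0.001):   # decreasing penalty ~ increasing s
    fit = Lasso(alpha=alpha, max_iter=50000).fit(X_tr, y_tr)
    test_rss = np.sum((y_te - fit.predict(X_te)) ** 2)
    print(f"alpha={alpha:<7} test RSS = {test_rss:.1f}")
# Test RSS usually falls at first and then creeps back up once the
# model is flexible enough to overfit the training noise.
```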

c). Repeat (a) for variance.

Answer: iii. Steadily increase.

Justification: Small s -> highly constrained, simple model -> low variance.

As s increases, more coefficients can grow or enter the model -> higher model flexibility -> higher variance.

d). Repeat (a) for (squared) bias.

Answer: iv. Steadily decrease.

Justification: At s=0, the bias is very high (only the intercept is used). As s increases, the model becomes more flexible and captures more of the true signal -> lower bias. So the squared bias decreases as s increases.
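
A repeated-sampling sketch (illustrative only; it refits the lasso on many fresh synthetic training sets and looks at the predictions at one fixed test point) makes the trends in parts (c) and (d) concrete: as the penalty is relaxed, the squared bias of the fitted value falls while its variance rises.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p = 10
beta_true = np.concatenate([[3.0, -2.0, 1.5], np.zeros(p - 3)])
x0 = rng.normal(size=p)                 # a fixed test point
f0 = x0 @ beta_true                     # true (noiseless) value at x0

def predictions_at_x0(alpha, reps=300, n=50, sigma=1.0):
    """Refit the lasso on fresh training sets and predict at x0."""
    preds = []
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ beta_true + rng.normal(scale=sigma, size=n)
        fit = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
        preds.append(fit.predict(x0[None, :])[0])
    return np.array(preds)

for alpha in (2.0, 0.5, 0.05):          # decreasing penalty ~ increasing s
    preds = predictions_at_x0(alpha)
    print(f"alpha={alpha}: squared bias={(preds.mean() - f0) ** 2:.3f}, "
          f"variance={preds.var():.3f}")
# Relaxing the constraint typically shrinks the squared bias and
# inflates the variance of the fitted values; the noise level sigma
# is fixed by construction and never enters either term.
```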

e). Repeat (a) for the irreducible error.

Answer: v. Remain constant.

Justification: The irreducible error is caused by random noise. It is independent of the model or method used. So changing s has no effect on irreducible error.