We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, …, p predictors. Explain your answers:
True or False
The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
True
Justification: Forward stepwise selection starts with no predictors and adds one at a time, choosing the variable that most improves the fit; variables already in the model are never removed. The (k+1)-variable model is therefore the k-variable model plus one additional predictor, so the nesting holds.
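As a rough sketch (not part of the exercise), the nesting can be seen by running a greedy forward search on made-up data; the synthetic dataset and the `rss` helper below are illustrative assumptions, with training RSS as the selection criterion.

```python
# Illustrative sketch of forward stepwise selection on synthetic data.
# The greedy criterion here is training RSS; any reasonable criterion
# gives the same nesting property.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(size=100)

def rss(features):
    """Training RSS of a least squares fit on the given columns of X."""
    if not features:
        return np.sum((y - y.mean()) ** 2)           # intercept-only model
    fit = LinearRegression().fit(X[:, features], y)
    return np.sum((y - fit.predict(X[:, features])) ** 2)

selected = []                                        # 0-variable model
for k in range(X.shape[1]):
    remaining = [j for j in range(X.shape[1]) if j not in selected]
    # add the single predictor that lowers training RSS the most
    best = min(remaining, key=lambda j: rss(selected + [j]))
    selected = selected + [best]
    print(f"{k + 1}-variable model: {sorted(selected)}")
# Variables are never removed, so each printed set contains the previous one.
```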
The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
True
Justification: Backward stepwise selection starts with all p predictors and removes one at a time, dropping the variable whose removal hurts the fit the least; variables are never added back. The k-variable model is obtained by deleting a single predictor from the (k+1)-variable model, so it is always a subset of it.
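A matching sketch for the backward direction, under the same illustrative assumptions (synthetic data, training RSS as the criterion):

```python
# Illustrative sketch of backward stepwise selection on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(size=100)

def rss(features):
    """Training RSS of a least squares fit on the given columns of X."""
    if not features:
        return np.sum((y - y.mean()) ** 2)           # intercept-only model
    fit = LinearRegression().fit(X[:, features], y)
    return np.sum((y - fit.predict(X[:, features])) ** 2)

selected = list(range(X.shape[1]))                   # start from the full model
print(f"{len(selected)}-variable model: {selected}")
while selected:
    # drop the predictor whose removal increases training RSS the least
    drop = min(selected, key=lambda j: rss([f for f in selected if f != j]))
    selected = [f for f in selected if f != drop]
    print(f"{len(selected)}-variable model: {sorted(selected)}")
# Variables are never added back, so each printed set is contained in the
# one printed just before it.
```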
The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
False
Justification: This compares two different algorithms, backward and forward stepwise, which build their models independently. There is no guarantee that the variables they choose coincide at any step, so the backward k-variable model need not be contained in the forward (k+1)-variable model.
The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
False
Justification: Again, the two procedures run independently, so the forward k-variable model need not be a subset of the backward (k+1)-variable model.
The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k + 1)-variable model identified by best subset selection.
False
Justification: Best subset selection searches over all possible combinations of predictors separately for each model size (k = 0 to p) and keeps the best model of each size. Because each size is optimized independently, the best k-variable model need not be contained in the best (k+1)-variable model.
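To make the non-nesting concrete, here is a small constructed example (my own, not from the exercise) in which x3 is nearly x1 + x2; on data like this, best subset typically picks {x3} as the best single predictor but {x1, x2} as the best pair.

```python
# Constructed illustration: best subset models of different sizes need not nest.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
x3 = x1 + x2 + 0.15 * rng.normal(size=n)      # x3 is almost x1 + x2
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + 0.3 * rng.normal(size=n)

def rss(features):
    fit = LinearRegression().fit(X[:, features], y)
    return np.sum((y - fit.predict(X[:, features])) ** 2)

for k in (1, 2, 3):
    best = min(combinations(range(3), k), key=lambda s: rss(list(s)))
    print(f"best {k}-variable model: columns {best}")
# Typically prints (2,) for k = 1 but (0, 1) for k = 2, so the best
# 1-variable model is not a subset of the best 2-variable model.
```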
# Question 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
The lasso, relative to least squares, is:
Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Justification: The lasso adds an L1 penalty to the least squares objective, shrinking coefficients and setting some exactly to zero. This makes it less flexible than least squares: bias goes up but variance goes down, and prediction accuracy improves when the variance reduction outweighs the added bias.
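A quick illustration (my own synthetic example, not from the text) of that sparsity, using scikit-learn's Lasso; alpha=0.2 is an arbitrary choice of penalty strength.

```python
# The L1 penalty sets some coefficients exactly to zero, unlike plain OLS.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=100)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)          # alpha sets the penalty strength

print("OLS coefficients:  ", np.round(ols.coef_, 2))   # all nonzero
print("Lasso coefficients:", np.round(lasso.coef_, 2)) # usually several exact zeros
```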
Repeat (a) for ridge regression relative to least squares.
Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Justification: Ridge regression adds an L2 penalty to the objective, shrinking coefficients toward zero without setting them exactly to zero. Like the lasso, it is less flexible than least squares and trades a small increase in bias for a decrease in variance.
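The same kind of sketch for ridge (again a made-up dataset and an arbitrary alpha) shows shrinkage without exact zeros:

```python
# The L2 penalty shrinks coefficients toward zero but does not zero them out.
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=50.0).fit(X, y)

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))  # shrunk, but none exactly zero
```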
Repeat (a) for non-linear methods relative to least squares.
More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Justification: Non-linear methods (e.g., splines, GAMs, decision trees) are more flexible than linear regression: they reduce bias but increase variance, so they predict better when the reduction in bias outweighs the added variance.
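As a rough check of that claim (a sketch on data I made up, not from the exercise), a flexible method such as a depth-limited decision tree can beat least squares on held-out data when the true signal is non-linear:

```python
# When the true relationship is non-linear, a more flexible fit reduces bias
# enough to outperform least squares on a held-out test set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=400).reshape(-1, 1)
y = np.sin(2 * x[:, 0]) + 0.2 * rng.normal(size=400)   # non-linear signal

X_tr, X_te, y_tr, y_te = train_test_split(x, y, random_state=0)
for model in (LinearRegression(),
              DecisionTreeRegressor(max_depth=4, random_state=0)):
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"{type(model).__name__}: test MSE = {mse:.3f}")
```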
Suppose we estimate the regression coefficients in a linear regression model by minimizing
\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \leq s \]
for a particular value of s. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.
As we increase s from 0, the training RSS will:
Steadily decrease.
Justification: As s increases from 0, the constraint region grows and the coefficients can move closer to their unconstrained least squares values, so the model fits the training data at least as well at each step and the training RSS steadily decreases.
Repeat (a) for test RSS.
Decrease initially, and then eventually start increasing in a U shape.
Justification: At first the extra flexibility captures true signal and test RSS decreases; beyond some point the model starts fitting noise, variance dominates, and test RSS rises again, giving a U shape.
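A sketch of parts (a) and (b) together (synthetic data, arbitrary alpha grid): scikit-learn's Lasso solves the penalized form of this problem, and decreasing alpha toward 0 plays the role of increasing the budget s.

```python
# Sweep the penalty from strong to weak (i.e. small s to large s) and track
# training and test RSS.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
beta = np.r_[np.array([3.0, -2.0, 1.5]), np.zeros(47)]   # only 3 true signals
y = X @ beta + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
for alpha in (1.0, 0.3, 0.1, 0.03, 0.01, 0.003):          # smaller alpha ~ larger s
    fit = Lasso(alpha=alpha, max_iter=50_000).fit(X_tr, y_tr)
    train_rss = np.sum((y_tr - fit.predict(X_tr)) ** 2)
    test_rss = np.sum((y_te - fit.predict(X_te)) ** 2)
    print(f"alpha={alpha:<5}  train RSS={train_rss:7.1f}  test RSS={test_rss:7.1f}")
# Training RSS should fall monotonically as the constraint loosens; test RSS
# typically falls at first and then creeps back up once the model overfits.
```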
Repeat (a) for variance.
Steadily increase.
Justification: As s increases, the coefficients are less constrained and the model becomes more flexible, so its variance steadily increases.
Repeat (a) for (squared) bias.
Steadily decrease.
Justification: Greater flexibility lets the model track the true relationship more closely, so the squared bias steadily decreases.
Repeat (a) for the irreducible error.
Remain constant.
Justification: The irreducible error is the variance of the noise term; it does not depend on the model or on s, so it remains constant.