True - Forward stepwise selection starts with no predictors and adds one predictor at a time, choosing the one that improves model fit the most. As a result, the (k+1)-variable model always contains all k predictors from the k-variable model, plus one additional variable. This behavior is built into how forward selection works.
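To make the nesting concrete, here is a minimal sketch of forward stepwise selection on hypothetical synthetic data (NumPy assumed; the data-generating setup and the rss helper are illustrative choices, not part of the exercise). Each printed model contains the previous model's predictors plus exactly one more.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n)

def rss(cols):
    # RSS of the least squares fit of y on the chosen columns plus an intercept
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return resid @ resid

selected = []                      # start with no predictors
remaining = set(range(p))
while remaining:
    # add the single predictor that lowers the training RSS the most
    best = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(best)
    remaining.remove(best)
    print(f"{len(selected)}-variable model: {sorted(selected)}")
```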
True - Backward stepwise selection starts with all p predictors and removes one at a time, at each step dropping the predictor whose removal harms the fit the least. The k-variable model is therefore obtained by deleting exactly one predictor from the (k+1)-variable model, so its predictors are necessarily a subset of those in the (k+1)-variable model. As with forward selection, this nesting is built into the algorithm.
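A companion sketch for backward stepwise selection under the same illustrative setup (NumPy assumed, made-up data): each smaller model is produced by dropping exactly one predictor from the current model, which is precisely why the nesting holds.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n)

def rss(cols):
    # RSS of the least squares fit of y on the chosen columns plus an intercept
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return resid @ resid

selected = list(range(p))          # start with all predictors
while selected:
    print(f"{len(selected)}-variable model: {sorted(selected)}")
    # drop the predictor whose removal increases the training RSS the least
    worst = min(selected, key=lambda j: rss([k for k in selected if k != j]))
    selected.remove(worst)
```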
False - Forward and backward stepwise selection explore different sequences of models: one grows from the null model while the other shrinks from the full model. There is no guarantee that the k-variable model identified by backward stepwise is a subset of the (k+1)-variable model identified by forward stepwise.
False - Again, since forward and backward stepwise selection are distinct methods, the predictors chosen in their respective models may differ completely. There is no nesting guarantee between forward and backward models.
False - Best subset selection searches over all possible models of each size and picks the best model of each size separately. The best k-variable model is therefore not guaranteed to be a subset of the best (k+1)-variable model: the optimal combination of k+1 variables can be entirely different from the optimal combination of k variables.
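The non-nesting is easy to see with a small best subset search on made-up data in which one predictor (column 3) is roughly the sum of two others; the construction, seed, and sample size are illustrative assumptions. With this setup the best one-variable model usually uses column 3 alone, while the best two-variable model usually switches to columns 0 and 1, so the two models are not nested.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p = 500, 4
X = rng.normal(size=(n, p))
X[:, 3] = X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=n)   # column 3 ~ column 0 + column 1
y = X[:, 0] + X[:, 1] + rng.normal(size=n)               # the truth uses columns 0 and 1

def rss(cols):
    # RSS of the least squares fit of y on the chosen columns plus an intercept
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return resid @ resid

for k in range(1, p + 1):
    best = min(combinations(range(p), k), key=rss)       # exhaustive search at size k
    print(f"best {k}-variable model: columns {best}")
```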
Justification: Flexibility: The lasso places an L1 penalty (equivalently, a budget constraint) on the regression coefficients, shrinking them and setting some exactly to zero, which performs variable selection. This makes the lasso less flexible than least squares, which fits all p predictors without any constraint.
Bias-Variance Trade-off: By reducing flexibility, the lasso increases bias (since some coefficients are shrunk toward zero), but reduces variance, especially when many predictors are irrelevant or highly correlated. If the reduction in variance outweighs the increase in bias, the overall test MSE decreases, improving prediction accuracy.
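A minimal sketch of this trade-off, assuming scikit-learn is available and using a hypothetical sparse setup (50 predictors, only 5 of which matter); the numbers are illustrative, not part of the exercise. With many irrelevant predictors, the lasso's variance reduction typically outweighs its extra bias, so its test MSE comes out below that of unpenalized least squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                              # only 5 of the 50 predictors matter
y = X @ beta + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)
lasso = LassoCV(cv=5).fit(X_tr, y_tr)       # penalty chosen by cross-validation

print("OLS   test MSE:", mean_squared_error(y_te, ols.predict(X_te)))
print("Lasso test MSE:", mean_squared_error(y_te, lasso.predict(X_te)))
```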
Justification: Flexibility: Ridge regression places an L2 penalty on the regression coefficients, shrinking them toward zero but, unlike the lasso, never setting them exactly to zero. The shrinkage reduces model complexity, so ridge regression is less flexible than least squares.
Bias-Variance Trade-off: Like the lasso, ridge regression increases bias but decreases variance. If the variance reduction is greater than the bias increase, the test MSE improves, and prediction accuracy is better than least squares.
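A companion sketch for ridge regression, again with scikit-learn assumed and a made-up design in which all predictors are strongly correlated: shrinking the coefficients stabilizes the fit, and ridge typically beats least squares on test MSE in this situation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p = 100, 30
Z = rng.normal(size=(n, 1))
X = Z + 0.1 * rng.normal(size=(n, p))       # all columns are noisy copies of Z
y = 0.1 * X.sum(axis=1) + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

ols = LinearRegression().fit(X_tr, y_tr)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X_tr, y_tr)

print("OLS   test MSE:", mean_squared_error(y_te, ols.predict(X_te)))
print("Ridge test MSE:", mean_squared_error(y_te, ridge.predict(X_te)))
```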
Justification: Flexibility: Non-linear methods (e.g., polynomial regression, splines, GAMs, or trees) are typically more flexible than linear regression because they can capture complex relationships between the predictors and the response.
Bias-Variance Trade-off: This added flexibility reduces bias (better fit to training data) but increases variance (more sensitive to training data fluctuations). These methods will improve test prediction accuracy if the reduction in bias outweighs the increase in variance.
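As an illustration (scikit-learn assumed, made-up curved relationship): a degree-5 polynomial removes most of the bias that a straight line suffers on sine-shaped data, and with a few hundred observations its extra variance is small, so the more flexible fit wins on test MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=300).reshape(-1, 1)
y = np.sin(x).ravel() + 0.3 * rng.normal(size=300)   # curved truth with noise

x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=2)

linear = LinearRegression().fit(x_tr, y_tr)
poly = make_pipeline(PolynomialFeatures(degree=5), LinearRegression()).fit(x_tr, y_tr)

print("Linear   test MSE:", mean_squared_error(y_te, linear.predict(x_te)))
print("Degree-5 test MSE:", mean_squared_error(y_te, poly.predict(x_te)))
```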
\[ \min_{\beta_0, \beta_1, \dots, \beta_p} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \leq s \]
Justification: At s=0, the constraint forces all βj=0, so the model predicts only the intercept β0, resulting in a high training RSS. As s increases, the constraint is relaxed: the model can include more predictors and let their coefficients grow, so the training RSS can only stay the same or fall at each step. Once s is large enough that the constraint no longer binds, the lasso solution coincides with ordinary least squares, which attains the minimum possible training RSS. Hence the training RSS steadily decreases (is non-increasing) as s increases; the lasso-path sketch after the next point traces this numerically.
Justification: At s=0, the model is overly simple (just the intercept), so test RSS is high due to underfitting. As we increase s, the model becomes more flexible -> better fit -> test RSS decreases. However, beyond a point, adding too much flexibility leads to overfitting -> test RSS increases. This creates a U-shaped curve for test error.
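The sketch below (synthetic data; scikit-learn's Lasso assumed) traces both quantities along the lasso path. Note that scikit-learn parameterizes the lasso with a penalty weight alpha rather than a budget s, and a smaller alpha plays the role of a larger s, so the loop runs from the most constrained fit toward an essentially unconstrained one.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 200, 30
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = 1.5                                       # sparse truth
y = X @ beta + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

for alpha in [3.0, 1.0, 0.3, 0.1, 0.03, 0.01, 0.001]:   # smaller alpha ~ larger budget s
    fit = Lasso(alpha=alpha, max_iter=50000).fit(X_tr, y_tr)
    train_rss = np.sum((y_tr - fit.predict(X_tr)) ** 2)
    test_rss = np.sum((y_te - fit.predict(X_te)) ** 2)
    print(f"alpha={alpha:<6}  train RSS={train_rss:8.1f}  test RSS={test_rss:8.1f}")
# Training RSS should fall steadily as alpha shrinks (s grows), while test RSS
# typically falls at first and then creeps back up once the fit starts to overfit.
```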
Justification: Small s -> highly constrained, simple model -> low variance.
As s increases, more coefficients can grow or enter the model -> higher model flexibility -> higher variance. So variance increases steadily with s (see the simulation sketch after the next point).
Justification: At s=0, bias is very high (only the intercept is used). As s increases, the model becomes more complex and captures more of the true signal in the data -> lower bias. So squared bias decreases steadily as s increases.
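Both trends can be checked with a small simulation (hypothetical setup, scikit-learn assumed): many training sets are drawn from one fixed model, the lasso is refit on each, and the variance and squared bias of its prediction at a single fixed test point are estimated over a range of penalties, with a smaller alpha again standing in for a larger budget s.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
p = 20
beta = np.zeros(p)
beta[:3] = 1.5
x0 = rng.normal(size=p)                      # one fixed test point
true_f0 = x0 @ beta                          # its true mean response

for alpha in [1.0, 0.3, 0.1, 0.03, 0.01]:    # alpha down  <=>  s up
    preds = []
    for rep in range(200):                   # refit on many independent training sets
        X = rng.normal(size=(100, p))
        y = X @ beta + rng.normal(size=100)
        fit = Lasso(alpha=alpha, max_iter=20000).fit(X, y)
        preds.append(fit.predict(x0.reshape(1, -1))[0])
    preds = np.array(preds)
    variance = preds.var()
    bias_sq = (preds.mean() - true_f0) ** 2
    print(f"alpha={alpha:<5}  variance={variance:.3f}  squared bias={bias_sq:.3f}")
# Variance should rise and squared bias should shrink as alpha decreases (s increases).
```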
Justification: The irreducible error is caused by random noise. It is independent of the model or method used. So changing s has no effect on irreducible error.