1 (c) True or False: i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection. A: True. The (k+1)-variable model is the k-variable model with one additional predictor; the k selected predictors are carried forward (a numerical check follows this list).
ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection. A: TRUE. The k-variable model is identical to the (k+1)-variable model, but with one predictor removed.
iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection. A: FALSE. Forward and backward stepwise selection have different starting points (the null model and the full model) and will generally take different selection paths. The statement may hold for specific examples, but it is not true in general.
iv. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection. A: FALSE. Same reasoning as above.
v. The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k+1)-variable model identified by best subset selection. A: FALSE. There is no guarantee that the best subset of size (k+1) is simply the best subset of size k with one additional predictor. If that were the case, we could simply do forward selection and greatly reduce the number of models we need to test.
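The nesting claims in i. and ii. are easy to check numerically. Below is a minimal sketch of forward stepwise selection on synthetic data (a hand-rolled greedy selector; the data, function names, and sizes are my own illustrative choices, not part of the exercise), verifying that the k-variable model is always contained in the (k+1)-variable model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 3] + 0.5 * X[:, 5] + rng.normal(size=n)

def forward_stepwise(X, y, k):
    """Greedy forward selection: repeatedly add the predictor that most reduces RSS."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        rss = {}
        for j in remaining:
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            resid = y - fit.predict(X[:, cols])
            rss[j] = np.sum(resid ** 2)
        best = min(rss, key=rss.get)
        selected.append(best)
        remaining.remove(best)
    return selected

for k in range(1, p):
    model_k = set(forward_stepwise(X, y, k))
    model_k1 = set(forward_stepwise(X, y, k + 1))
    # The k-variable model is nested inside the (k+1)-variable model.
    assert model_k <= model_k1
```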
(b) Repeat (a) for ridge regression relative to least squares. A: iii - less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Similar to the lasso above; the only real difference is in the ridge objective function, where the shrinkage penalty differs from that of the lasso. This means ridge regression won't shrink the coefficients of less-useful variables to exactly zero (the lasso can do this), but the rest of the argument (shrinking reduces the variance at the cost of an increase in bias) still applies.
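As an illustration of the zero-versus-small distinction, here is a brief sketch (synthetic data and arbitrary penalty values of my choosing) comparing lasso and ridge fits in scikit-learn: the lasso sets the coefficients of the noise predictors exactly to zero, while ridge only shrinks them toward zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.normal(size=(n, p))
# Only the first three predictors matter; the rest are noise.
beta = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta + rng.normal(size=n)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("lasso coefs:", np.round(lasso.coef_, 2))  # several are exactly 0
print("ridge coefs:", np.round(ridge.coef_, 2))  # small but non-zero
```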
(c) Repeat (a) for non-linear methods relative to least squares. A: ii - more flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Non-linear methods are more flexible than least squares, allowing them to capture complex relationships by making fewer assumptions about the form of f(X). This flexibility reduces bias but increases variance. If the drop in bias outweighs the rise in variance, prediction accuracy improves.
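A small illustration of this trade-off, under an assumed sinusoidal f(X) and a degree-5 polynomial standing in for the "non-linear method" (both choices are mine, purely for demonstration): the more flexible fit should achieve a lower test MSE because its reduction in bias outweighs its extra variance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=300).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=300)  # truly non-linear f(X)

X_tr, X_te, y_tr, y_te = train_test_split(x, y, random_state=0)

linear = LinearRegression().fit(X_tr, y_tr)
poly = make_pipeline(PolynomialFeatures(degree=5), LinearRegression()).fit(X_tr, y_tr)

print("least squares test MSE:      ", mean_squared_error(y_te, linear.predict(X_te)))
print("degree-5 polynomial test MSE:", mean_squared_error(y_te, poly.predict(X_te)))
```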
(a) As s increases from 0, the training RSS will: A: iv. steadily decrease. Minimizing the RSS subject to the constraint Σ|βj| ≤ s is just another formulation of how the lasso parameters are selected. Once s is sufficiently large, the least squares solution will satisfy the constraint; from that point on, the β that minimizes the RSS while satisfying the constraint will always be the least squares solution. Up until that point, the training RSS will monotonically decrease. (b) Repeat (a) for test RSS. A: ii. decrease initially, and then eventually start increasing in a U shape.
When s = 0, β will be a vector of zeros, so we simply have the null model. As s increases and the constraint loosens, the flexibility of the model increases. The test RSS will therefore decrease, up to the point where the model begins to overfit (at which point the test RSS starts increasing again).
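Both behaviours can be traced out on held-out data. The sketch below (synthetic sparse data and an arbitrary alpha grid of my choosing; in scikit-learn's Lasso, a smaller alpha corresponds to a larger budget s) should show training RSS falling monotonically while test RSS follows a rough U shape.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 120, 30
X = rng.normal(size=(n, p))
beta = np.concatenate([rng.normal(size=5), np.zeros(p - 5)])  # sparse true coefficients
y = X @ beta + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Larger alpha = heavier shrinkage = smaller budget s; sweep from tight to loose.
for alpha in [2.0, 1.0, 0.5, 0.2, 0.1, 0.05, 0.01, 0.001]:
    fit = Lasso(alpha=alpha, max_iter=10000).fit(X_tr, y_tr)
    rss_train = np.sum((y_tr - fit.predict(X_tr)) ** 2)
    rss_test = np.sum((y_te - fit.predict(X_te)) ** 2)
    print(f"alpha={alpha:<6} train RSS={rss_train:8.1f} test RSS={rss_test:8.1f}")
```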
(c) Repeat (a) for variance. A: iii. steadily increase. The constraint region increasing in size (s increasing from zero) corresponds to λ decreasing (a reduction in shrinkage), so model flexibility is increasing and the variance therefore increases. Once s is sufficiently large that the least squares β falls within the constraint region, the variance stops increasing, because the β chosen will always be the least squares estimate.
(d) Repeat (a) for (squared) bias. A: iv. steadily decrease. Same reasoning as above: increasing the flexibility will decrease the bias. Again, the bias stops decreasing once the least squares solution falls within the constraint region.
(e) Repeat (a) for the irreducible error. A: v. remain constant. The irreducible error is the error introduced by inherent uncertainty in the system being approximated. It remains constant regardless of model flexibility: there may be unmeasured variables not in X that would be required to explain it, or unmeasurable variation in y that cannot be predicted with the variables in X. It is therefore completely independent of s.
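The variance and bias claims can also be checked by simulation. The rough sketch below (my own synthetic setup; the lasso is refit on many fresh training samples for each alpha, and the variance and squared bias of its prediction at a fixed test point are estimated) should show variance rising and squared bias falling as alpha decreases, i.e. as s increases.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
p = 10
beta = np.concatenate([np.array([2.0, -1.5, 1.0]), np.zeros(p - 3)])
x0 = rng.normal(size=p)   # fixed test point
f_x0 = x0 @ beta          # true mean response at x0

def simulate(alpha, reps=300, n=100):
    """Estimate variance and squared bias of the lasso prediction at x0."""
    preds = np.empty(reps)
    for r in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ beta + rng.normal(size=n)
        fit = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
        preds[r] = fit.predict(x0.reshape(1, -1))[0]
    return preds.var(), (preds.mean() - f_x0) ** 2

for alpha in [1.0, 0.3, 0.1, 0.03, 0.01]:  # decreasing alpha ~ increasing s
    var, bias2 = simulate(alpha)
    print(f"alpha={alpha:<5} variance={var:.4f} squared bias={bias2:.4f}")
```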