1. (c) True or False:

i. The predictors in the k-variable model identified by forward stepwise selection are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection. A: True. Forward stepwise builds the (k+1)-variable model by adding one predictor to the k-variable model, so all k of its predictors are carried forward.

ii. The predictors in the k-variable model identified by backward stepwise selection are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection. A: True. Backward stepwise obtains the k-variable model by removing one predictor from the (k+1)-variable model, so its predictors are a subset of the (k+1)-variable model's.

iii. The predictors in the k-variable model identified by backward stepwise selection are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection. A: False. Forward and backward stepwise selection have different starting points (the null model and the full model) and can take different selection paths. The statement may hold in specific examples, but it is not generally true.

iv. The predictors in the k-variable model identified by forward stepwise selection are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection. A: False, for the same reason as in iii.

v. The predictors in the k-variable model identified by best subset selection are a subset of the predictors in the (k+1)-variable model identified by best subset selection. A: False. There is no guarantee that the best subset of size (k+1) is simply the best subset of size k with one additional predictor. If that were guaranteed, forward stepwise selection would always recover the best subset and we could greatly reduce the number of models evaluated. A short simulation contrasting the two procedures is sketched below.
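
Here is a minimal simulation sketch (my own setup in Python with numpy and scikit-learn, not part of the original exercise) contrasting the two procedures: the forward stepwise path is nested by construction, while the size-k and size-(k+1) best-subset winners need not be nested. Whether non-nesting actually appears depends on the particular random data drawn; the correlated predictor is only there to make it more likely.

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
# A correlated predictor makes non-nested best subsets more likely.
X[:, 3] = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(size=n)

def rss(features):
    """Training RSS of a least squares fit on the given predictor indices."""
    feats = list(features)
    fit = LinearRegression().fit(X[:, feats], y)
    return np.sum((y - fit.predict(X[:, feats])) ** 2)

# Forward stepwise: each model adds one predictor to the previous one,
# so the k-variable model is nested in the (k+1)-variable model by construction.
selected, forward_path = [], []
for _ in range(p):
    remaining = [j for j in range(p) if j not in selected]
    best_j = min(remaining, key=lambda j: rss(selected + [j]))
    selected = selected + [best_j]
    forward_path.append(set(selected))

# Best subset: the size-(k+1) winner need not contain the size-k winner.
best_subsets = [set(min(itertools.combinations(range(p), k), key=rss))
                for k in range(1, p + 1)]

for k in range(p - 1):
    print(f"k={k + 1}: forward nested: {forward_path[k] <= forward_path[k + 1]}, "
          f"best subset nested: {best_subsets[k] <= best_subsets[k + 1]}")
```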

2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

A: iii. The lasso is less flexible than least squares (for λ > 0), and it improves prediction accuracy when the increase in bias is outweighed by the decrease in variance. The lasso chooses its coefficients by minimizing RSS + λ Σⱼ |βⱼ|, shrinking the less important coefficients all the way to zero as λ increases. This reduces model flexibility, introducing some bias but substantially lowering variance. When the drop in variance outweighs the increase in bias, prediction accuracy improves, and this trade-off is usually worth it.
(b) Repeat (a) for ridge regression relative to least squares. A: iii. Less flexible, for the same reason as the lasso. The only real difference is the penalty in the ridge objective, RSS + λ Σⱼ βⱼ², whose shrinkage term differs slightly from the lasso's. As a result, ridge regression shrinks the coefficients of less useful variables toward zero but never exactly to zero (the lasso can do this), yet the rest of the argument (shrinking reduces variance at the cost of an increase in bias) still applies. A quick numerical illustration of this contrast appears after part (c).

(c) Repeat (a) for non-linear methods relative to least squares. A: ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Non-linear methods are more flexible than least squares: they make fewer assumptions about the form of f(X) and can capture more complex relationships. This flexibility reduces bias but increases variance, so prediction accuracy improves when the drop in bias outweighs the rise in variance.
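
As a quick check on the lasso-versus-ridge contrast in (a) and (b), here is a small scikit-learn sketch (my own simulated data and penalty values, not from the text): as the penalty grows, the lasso drives some coefficients exactly to zero, while ridge only shrinks them toward zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, p = 200, 8
X = rng.normal(size=(n, p))
beta = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])  # only 3 predictors matter
y = X @ beta + rng.normal(size=n)

for alpha in [0.01, 0.1, 1.0]:  # larger alpha = heavier shrinkage
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    # Ridge's alpha is on a different scale than Lasso's in scikit-learn;
    # multiplying by n is only a rough way to make the penalties comparable.
    ridge = Ridge(alpha=alpha * n).fit(X, y)
    print(f"alpha={alpha}: lasso coefficients exactly zero: {np.sum(lasso.coef_ == 0)}, "
          f"ridge coefficients exactly zero: {np.sum(ridge.coef_ == 0)}")
```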

3. Suppose we estimate the regression coefficients in a linear regression model by minimizing the RSS, Σᵢ (yᵢ − β₀ − Σⱼ βⱼxᵢⱼ)², subject to Σⱼ |βⱼ| ≤ s, for a particular value of s. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.
(a) As we increase s from 0, the training RSS will:

  i. Increase initially, and then eventually start decreasing in an inverted U shape.
  ii. Decrease initially, and then eventually start increasing in a U shape.
  iii. Steadily increase.
  iv. Steadily decrease.
  v. Remain constant.

A: iv. Steadily decrease. Minimizing the RSS subject to the constraint Σⱼ |βⱼ| ≤ s is just another formulation of how the lasso coefficients are chosen. As s increases, the constraint region grows, so the minimizing β can only achieve an equal or lower training RSS; the training RSS therefore decreases monotonically. Once s is sufficiently large that the least squares solution satisfies the constraint, the β that minimizes the RSS subject to the constraint is simply the least squares solution, and the training RSS stops decreasing.

(b) Repeat (a) for test RSS. A: ii. Decrease initially, and then eventually start increasing in a U shape.

When s = 0, β is constrained to be a vector of zeros, so we simply have the null model. As s increases and the constraint loosens, the flexibility of the model increases, so the test RSS initially decreases, up to the point where the model begins to overfit, after which it starts increasing again.
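
A rough simulation of (a) and (b) (my own setup, not part of the exercise; decreasing the scikit-learn penalty alpha stands in for increasing s): training RSS falls monotonically as the constraint loosens, while test RSS typically falls and then rises in a U shape.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 60, 30
X_train = rng.normal(size=(n, p))
X_test = rng.normal(size=(n, p))
beta = np.concatenate([rng.normal(size=5), np.zeros(p - 5)])  # 5 true signals
y_train = X_train @ beta + rng.normal(size=n)
y_test = X_test @ beta + rng.normal(size=n)

# Decreasing alpha plays the role of increasing s (a looser constraint).
for alpha in [2.0, 1.0, 0.5, 0.1, 0.01, 0.001]:
    fit = Lasso(alpha=alpha, max_iter=100_000).fit(X_train, y_train)
    train_rss = np.sum((y_train - fit.predict(X_train)) ** 2)
    test_rss = np.sum((y_test - fit.predict(X_test)) ** 2)
    print(f"alpha={alpha:<6}  train RSS={train_rss:9.1f}  test RSS={test_rss:9.1f}")
```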

(c) Repeat (a) for variance. A: iii. Steadily increase.

This is because the constraint region growing in size (s increasing from zero) corresponds to λ decreasing (a reduction in shrinkage), so model flexibility increases and with it the variance. If s becomes large enough that the least squares solution falls within the constraint region, the variance stops increasing, because the β chosen is then always the least squares estimate. A small simulation covering both this part and part (d) follows part (d).

(d) Repeat (a) for (squared) bias. A: iv. Steadily decrease.

Same reasoning as above: increasing the flexibility decreases the bias. Again, the bias stops decreasing once the least squares solution falls within the constraint region.
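
A Monte Carlo sketch for (c) and (d), under assumptions of my own (synthetic data, a fixed test point, and scikit-learn's alpha decreasing as a stand-in for s increasing): refitting the lasso on many fresh training sets shows the variance of the prediction rising and its squared bias falling as the constraint loosens.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
p = 10
beta = np.concatenate([np.array([2.0, -1.5, 1.0]), np.zeros(p - 3)])
x0 = rng.normal(size=p)   # a fixed test point
true_f0 = x0 @ beta       # its true mean response

for alpha in [1.0, 0.3, 0.1, 0.03, 0.01]:  # decreasing alpha ~ increasing s
    preds = []
    for _ in range(500):  # 500 independent training sets
        X = rng.normal(size=(50, p))
        y = X @ beta + rng.normal(size=50)
        fit = Lasso(alpha=alpha, max_iter=50_000).fit(X, y)
        preds.append(fit.predict(x0.reshape(1, -1))[0])
    preds = np.array(preds)
    print(f"alpha={alpha:<5} variance={preds.var():.3f}  "
          f"squared bias={(preds.mean() - true_f0) ** 2:.3f}")
```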

(e) Repeat (a) for the irreducible error. A: v. Remain constant.

The irreducible error is the error introduced by inherent randomness in the system being modelled, and it remains constant regardless of model flexibility. It may arise from unmeasured variables not in X that would be needed to explain y, or from unmeasurable variation in y that cannot be predicted from the variables in X. In short, it is completely independent of s.