The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k + 1)-variable model identified by best subset selection.
True. In forward stepwise selection, each step adds to the current model the single predictor that most improves the fit (e.g., gives the largest reduction in RSS). The (k+1)-variable model is therefore obtained from the k-variable model by adding exactly one predictor, so the predictors in the k-variable model identified by forward stepwise are indeed a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
True. Backward stepwise selection starts with all predictors and removes one predictor at each step, namely the one whose removal harms the fit the least. The k-variable model is therefore obtained from the (k+1)-variable model by deleting exactly one predictor, so the predictors in the k-variable model identified by backward stepwise are indeed a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
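A minimal sketch of this nesting property on simulated data, using scikit-learn's SequentialFeatureSelector as a stand-in for stepwise selection (an assumption: it scores candidate predictors by cross-validation rather than raw RSS, but the greedy forward/backward structure, and hence the nesting, is the same):

```python
# Sketch (not from the text): check that stepwise models of increasing size are nested.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=n)

for direction in ("forward", "backward"):
    previous = set()
    for k in range(1, p):
        sfs = SequentialFeatureSelector(
            LinearRegression(), n_features_to_select=k, direction=direction
        ).fit(X, y)
        chosen = set(np.flatnonzero(sfs.get_support()))
        # Each k-variable model should contain the (k-1)-variable model
        # produced by the same greedy procedure.
        print(direction, k, sorted(chosen), "contains previous:", previous <= chosen)
        previous = chosen
```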
False. While in both forward and backward stepwise selection, models are built incrementally, the predictors chosen at each step may differ. Forward stepwise adds predictors one at a time based on certain criteria, while backward stepwise removes predictors one at a time. Hence, the predictors in the k-variable model identified by backward stepwise may not necessarily be a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
False. Forward stepwise selection and backward stepwise selection are two distinct procedures. The predictors chosen at each step in these two approaches may differ. Hence, the predictors in the k-variable model identified by forward stepwise are not necessarily a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
False. Best subset selection searches over all models of each size separately: the best k-variable model and the best (k+1)-variable model are chosen independently of one another. Because nothing ties the two searches together, the best k-variable model need not be contained in the best (k+1)-variable model. Thus, the predictors in the k-variable model identified by best subset selection are not necessarily a subset of the predictors in the (k+1)-variable model identified by best subset selection.
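To see how the nesting can fail, here is a sketch of a hypothetical construction (not from the text) in which x3 is a noisy copy of x1 + x2 and the response depends on x1 + x2. With this setup the best single predictor is typically x3, while the best pair is {x1, x2}, so the best 1-variable model is not contained in the best 2-variable model:

```python
# Sketch (not from the text): best subset selection need not produce nested models.
from itertools import combinations

import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.3, size=n)   # noisy proxy for x1 + x2
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + rng.normal(scale=1.0, size=n)

def rss(cols):
    """RSS of an OLS fit (with intercept) using the given column indices."""
    A = np.column_stack([np.ones(n), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return resid @ resid

for k in (1, 2):
    best = min(combinations(range(3), k), key=rss)
    print(f"best {k}-variable model uses columns {best}, RSS = {rss(best):.1f}")
```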
To understand this, let’s consider the properties of the Lasso regularization method. The Lasso adds a penalty on the absolute values of the coefficients to the least squares objective, which shrinks the coefficient estimates and sets some of them exactly to zero. This leads to feature selection and hence reduced model complexity.
The Lasso is indeed less flexible than ordinary least squares, because it constrains the coefficient estimates. Therefore, statement (iii) is correct: relative to least squares, the Lasso gives improved prediction accuracy when the increase in bias caused by its reduced flexibility is smaller than the resulting decrease in variance.
Similar to the Lasso, ridge regression also imposes a penalty on the coefficient estimates. However, unlike the Lasso, ridge regression does not enforce sparsity; instead, it shrinks the coefficients toward zero without eliminating them entirely.
Ridge regression is also less flexible than ordinary least squares, as it too constrains the coefficient estimates. Therefore, statement (iii) is also correct for ridge regression: it gives improved prediction accuracy when the increase in bias due to the reduced flexibility is smaller than the decrease in variance.
Non-linear methods, such as polynomial regression or kernel methods, are generally more flexible than ordinary least squares, as they can capture complex relationships between predictors and the response variable.
Therefore, statement (ii) is correct for non-linear methods relative to least squares. When the increase in variance due to increased flexibility is less than the decrease in bias, the overall prediction accuracy is improved.
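As a rough illustration of these trade-offs, the following sketch compares test error for ordinary least squares, ridge, and the lasso on simulated data where only a few of many predictors truly matter. The data-generating setup and the use of scikit-learn's LinearRegression, RidgeCV, and LassoCV are illustrative assumptions, not part of the exercise; in settings like this the shrinkage methods typically accept a little bias in exchange for a larger drop in variance:

```python
# Sketch (not from the text): OLS vs. ridge vs. lasso on sparse simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, p = 150, 50                       # modest n relative to p
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3, -2, 1.5, 1, -1]       # only 5 predictors are truly relevant
y = X @ beta + rng.normal(scale=2.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "OLS": LinearRegression(),
    "ridge": RidgeCV(alphas=np.logspace(-3, 3, 50)),
    "lasso": LassoCV(cv=5, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"{name:>6}: test MSE = {mse:.2f}")
```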
Let’s analyze each statement:
When s=0, all coefficients are forced to be zero, which means the model is too simple and does not capture the underlying relationship between the predictors and the response variable well. Therefore, the training RSS will be high.
As s increases, the constraint region grows, so the lasso can fit the training data at least as well as it could for any smaller s. The minimized training RSS therefore decreases monotonically, eventually flattening out at the ordinary least squares RSS once the constraint no longer binds. Unlike the test error, the training RSS never turns back upward, because added flexibility can only improve the in-sample fit.
So, the correct answer is (iv) Steadily decrease.
The test RSS will initially decrease as s increases, because the model becomes better able to capture the true relationship between the predictors and the response. Beyond a certain point, however, the extra flexibility mainly fits noise in the training data (overfitting), and the test RSS starts to rise again.
So, the correct answer is also (ii) Decrease initially, and then eventually start increasing in a U shape.
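The following sketch traces training and test RSS along a lasso path on simulated data (all settings are illustrative assumptions). Note that scikit-learn's Lasso uses the penalty form with weight alpha, which moves inversely to the budget s: reading the loop from large alpha to small alpha corresponds to increasing s from 0.

```python
# Sketch (not from the text): training and test RSS along a lasso path.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.normal(size=(n, p))
beta = np.concatenate([[4.0, -3.0, 2.0], np.zeros(p - 3)])
y = X @ beta + rng.normal(scale=3.0, size=n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for alpha in [10, 3, 1, 0.3, 0.1, 0.03, 0.01]:   # decreasing alpha ~ increasing s
    fit = Lasso(alpha=alpha, max_iter=50_000).fit(X_tr, y_tr)
    train_rss = np.sum((y_tr - fit.predict(X_tr)) ** 2)
    test_rss = np.sum((y_te - fit.predict(X_te)) ** 2)
    print(f"alpha={alpha:<5}  train RSS={train_rss:9.1f}  test RSS={test_rss:9.1f}")
# Expect the training RSS to fall monotonically as alpha shrinks (s grows), while the
# test RSS typically falls and then rises again in a U shape.
```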
As s increases, the model becomes more flexible, which generally leads to higher variance because the model becomes more sensitive to the fluctuations in the training data. Therefore, the variance will increase as s increases.
So, the correct answer is (iii) Steadily increase.
As s increases, the coefficients are shrunk less and less, so the model can better approximate the true relationship between the predictors and the response. This steadily reduces the (squared) bias.
So, the correct answer is (iv) Steadily decrease.
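A small Monte-Carlo sketch (illustrative assumptions throughout) can make the variance and bias behaviour concrete: refit the lasso on many independent training sets and look at the spread and the systematic error of the prediction at one fixed test point, for several penalty levels. As before, smaller alpha in scikit-learn's Lasso corresponds to a larger budget s.

```python
# Sketch (not from the text): Monte-Carlo estimates of the variance and squared bias
# of lasso predictions at one fixed test point, across penalty levels.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p = 20
beta = np.concatenate([[3.0, -2.0, 1.0], np.zeros(p - 3)])
x0 = rng.normal(size=p)              # fixed test point
f0 = x0 @ beta                       # true mean response at x0
alphas = [3, 1, 0.3, 0.1, 0.03]

preds = {a: [] for a in alphas}
for _ in range(300):                 # many independent training sets
    X = rng.normal(size=(100, p))
    y = X @ beta + rng.normal(scale=2.0, size=100)
    for a in alphas:
        fit = Lasso(alpha=a, max_iter=50_000).fit(X, y)
        preds[a].append(fit.predict(x0.reshape(1, -1))[0])

for a in alphas:                     # decreasing alpha ~ increasing s
    ph = np.array(preds[a])
    print(f"alpha={a:<4}  variance={ph.var():.3f}  squared bias={(ph.mean() - f0) ** 2:.3f}")
```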
The irreducible error represents the inherent noise in the data that cannot be reduced by any model. It is unrelated to the choice of s in this context. Therefore, the irreducible error will remain constant regardless of the value of s.
So, the correct answer is (v) Remain constant.