Prediction accuracy and model interpretability are the two central concerns of this chapter. When the number of observations is much larger than the number of variables (\(n \gg p\)), least squares estimates tend to have low variance. When \(n\) is not much larger than \(p\), there can be a lot of variability in the least squares fit, leading to overfitting and poor predictions. Including irrelevant variables in a model also adds complexity and reduces interpretability. Feature selection can be used to automatically exclude irrelevant variables from a multiple regression model.
Best subset selection fits a separate least squares regression for each possible combination of the \(p\) predictors. For each \(k = 1,...,p\) it fits all \(\binom{p}{k}\) models that contain exactly \(k\) predictors and keeps the one with the smallest RSS (or, equivalently, the largest \(R^2\)). The single best model among the resulting candidates is then chosen using cross-validated prediction error, AIC, BIC, or adjusted \(R^2\). Best subset selection has computational limitations, since \(2^p\) models must be considered in total, and hence becomes computationally infeasible for values of \(p\) greater than roughly 40.
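As a rough sketch, best subset selection can be run with regsubsets() from the leaps package; the Boston housing data used later in this chapter is assumed here purely as an example.

library(leaps)  # regsubsets() implements best subset selection
library(MASS)   # Boston housing data, assumed here only as an example

# exhaustively search the 2^13 subsets of the 13 predictors of medv;
# for each subset size the model with the smallest RSS is retained
best_fit <- regsubsets(medv ~ ., data = Boston, nvmax = 13)
summary(best_fit)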
Best subset selection can also be used with logistic regression; in that setting the deviance, a measure that plays the role of the residual sum of squares for a broader class of models, is used instead. The deviance is negative two times the maximized log-likelihood; the smaller the deviance, the better the fit of the model.
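As a small illustration of deviance, the sketch below fits a logistic regression with glm() on a made-up binary outcome (whether medv exceeds its median in the Boston data) and checks that the reported deviance equals \(-2\) times the maximized log-likelihood.

library(MASS)  # Boston data, assumed here only as an example

# hypothetical binary response: is the median home value above the overall median?
bos <- Boston
bos$high_medv <- as.numeric(bos$medv > median(bos$medv))

logit_fit <- glm(high_medv ~ lstat + rm, data = bos, family = binomial)

# for ungrouped binary data the saturated log-likelihood is 0, so the
# residual deviance equals -2 times the maximized log-likelihood
deviance(logit_fit)
-2 * as.numeric(logLik(logit_fit))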
Forward stepwise selection begins with a model that contains no predictors and adds predictors one at a time until the model contains all of them. Specifically, the predictor that yields the greatest additional improvement to the fit is added at each step. This is a computationally efficient alternative to best subset selection: rather than searching over all \(2^p\) models, at step \(k\) it considers only the \(p−k\) models that add a single predictor to the current \(k\)-predictor model, and chooses the one with the smallest residual sum of squares (or largest \(R^2\)). Forward stepwise selection can be applied in high-dimensional settings where \(n < p\). One issue is that it is not guaranteed to find the best possible model: if the best 1-variable model uses a variable that does not appear in the best 2-variable model, forward stepwise selection can never reach that best 2-variable model.
Backward stepwise selection starts with the full least squares model containing all \(p\) predictors and iteratively removes the least useful predictor. At each step it considers the \(k\) candidate models that contain all but one of the predictors in the current model \(M_k\), and keeps the one with the smallest residual sum of squares (or largest \(R^2\)). Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best possible model. It also requires that \(n > p\), so that the full model with all \(p\) predictors can be fit.
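Both stepwise procedures are available through the method argument of regsubsets(); a minimal sketch, again assuming the Boston data as an example:

library(leaps)
library(MASS)  # Boston data, assumed here only as an example

# forward stepwise: start from the null model and add one predictor at a time
fwd_fit <- regsubsets(medv ~ ., data = Boston, nvmax = 13, method = "forward")

# backward stepwise: start from the full model and drop one predictor at a time
# (requires n > p so that the full least squares model can be fit)
bwd_fit <- regsubsets(medv ~ ., data = Boston, nvmax = 13, method = "backward")

# the variables chosen at a given size can differ between the two procedures
coef(fwd_fit, 5)
coef(bwd_fit, 5)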
Two common approaches for estimating test error are: (1) indirectly, by adjusting the training error to account for the bias due to overfitting, using measures such as \(C_p\), AIC, BIC, and adjusted \(R^2\); and (2) directly, using a validation set or cross-validation.
For a model containing \(d\) predictors fitted with least squares, the \(C_p\) estimate of test mean squared error is calculated as \(C_p=\frac{1}{n}\left(\mathrm{RSS}+2d\hat{\sigma}^{2}\right)\), where \(\hat{\sigma}^{2}\) is an estimate of the variance of the error term \(\epsilon\).
BIC is given by \(BIC=\frac{1}{n}\left(\mathrm{RSS}+\log(n)\,d\hat{\sigma}^{2}\right)\). BIC tends to take on smaller values when test MSE is low, so smaller values of BIC are preferable. Since \(\log(n) > 2\) for \(n > 7\), the BIC statistic penalizes models with more variables more heavily than \(C_p\).
Adjusted \(R^2\) also aims to penalize models that include unnecessary variables. It is defined as \(\text{Adjusted } R^{2}=1-\frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}\), where TSS is the total sum of squares. Once all of the correct variables have been added, adding further noise variables decreases the residual sum of squares only slightly while \(d\) increases, so \(\mathrm{RSS}/(n-d-1)\) tends to increase and adjusted \(R^2\) to decrease; unlike \(C_p\) and BIC, a larger adjusted \(R^2\) indicates a better model.
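For a regsubsets() fit, these criteria are all returned by summary(), so the preferred model size can be read off directly; a sketch continuing the hypothetical Boston example above:

library(leaps)
library(MASS)  # Boston data, assumed here only as an example

best_fit <- regsubsets(medv ~ ., data = Boston, nvmax = 13)
best_summary <- summary(best_fit)

# Cp and BIC are minimized, and adjusted R^2 maximized, by the preferred model size
which.min(best_summary$cp)
which.min(best_summary$bic)
which.max(best_summary$adjr2)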
These methods fit a model containing all \(p\) predictors, as opposed to a subset, and then “shrink” the coefficient estimates towards 0. This reduces the variance of the coefficient estimates, at the cost of a small increase in bias, and can improve the overall model fit.
Ridge regression is performed by choosing \(\beta\) values that minimize \[\sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p} \beta_{j} x_{i j}\right)^{2}+\lambda\sum_{j=1}^{p} \beta_{j}^{2}=\mathrm{RSS}+\lambda \sum_{j=1}^{p} \beta_{j}^{2}\] which is the RSS minimized in OLS plus an extra piece known as the “shrinkage penalty.” This penalty is small when the estimates of \(\beta\) are close to zero, so it has the effect of shrinking the estimates towards zero. The “tuning parameter” \(\lambda\) controls the strength of the penalty: when \(\lambda = 0\), the penalty has no effect and the resulting coefficients are the same as those from OLS, while as \(\lambda \rightarrow \infty\) the coefficients approach 0.
Because of the sum of squared coefficients in the above formula, the predictors need to be standardized so that they are all on the same scale. They can be standardized using the following formula, which gives every variable a standard deviation of 1. \[\tilde{x}_{i j}=\frac{x_{i j}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(x_{i j}-\bar{x}_{j}\right)^{2}}}\]
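In practice glmnet() standardizes the predictors by default (standardize = TRUE) and reports coefficients on the original scale, so this rarely needs to be done by hand. For completeness, a manual sketch of the formula above, assuming a numeric predictor matrix x, might look like:

# divide each column by its standard deviation computed with the 1/n convention
# used in the formula above (sd() in R uses 1/(n - 1), hence the rescaling)
standardize <- function(x) {
  n <- nrow(x)
  scale(x, center = FALSE, scale = apply(x, 2, sd) * sqrt((n - 1) / n))
}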
While ridge regression shrinks coefficients towards 0, it cannot set any of them exactly equal to 0. The lasso can, by changing the shrinkage penalty slightly, which makes it a useful model for variable selection. \[\lambda\sum_{j=1}^{p} \beta_{j}^{2} \rightarrow \lambda\sum_{j=1}^{p} |\beta_{j}|\] This change allows coefficients to be set exactly to 0 when \(\lambda\) is sufficiently large. This is the main advantage of the lasso: it produces sparser models that use only a subset of the predictors, which makes them easier to interpret.
Both methods produce different coefficients for different values of \(\lambda\), so the choice of \(\lambda\) is very important. It is best chosen using cross-validation, selecting the value that minimizes the cross-validated MSE. The advantage of these methods over OLS is that they substantially reduce the variance of the estimates while only slightly increasing bias (as \(\lambda\) increases), so they are particularly useful when the OLS estimates have high variance. The lasso will perform better when a small number of predictors have substantial effects, whereas ridge regression will perform better when the response is a function of many predictors with roughly equal effects.
Dimension reduction methods are a class of techniques that transform the predictors and then fit a least squares model using the transformed variables instead of the original predictors. Let \(Z_1,Z_2,...,Z_M\) represent \(M < p\) linear combinations of the original \(p\) predictors. Formally,\[Z_{m}=\sum_{j=1}^{p} \phi_{jm} X_{j}\]
for some constants \(\phi_{1m},\phi_{2m},...,\phi_{pm}\), where \(m=1,...,M\). It is then possible to use least squares to fit the linear regression model:\[y_{i}=\theta_{0}+\sum_{m=1}^{M} \theta_{m} Z_{i m}+\epsilon_{i}\]
where \(i=1,...,n\) and the regression coefficients are \(\theta_0,\theta_1,...,\theta_M\). The term dimension reduction refers to the fact that this approach reduces the problem of estimating \(p+1\) coefficients to the simpler problem of estimating \(M+1\) coefficients. This can significantly reduce the variance of the fitted coefficients when \(p\) is large relative to \(n\).
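To make the construction concrete, the sketch below (again assuming the Boston data as an example) builds \(M = 3\) principal-component scores \(Z_1, Z_2, Z_3\) with prcomp() and regresses the response on them with least squares, which is essentially what principal components regression does.

library(MASS)  # Boston data, assumed here only as an example

x <- scale(as.matrix(Boston[, setdiff(names(Boston), "medv")]))
y <- Boston$medv

# Z = X %*% Phi: take the first M = 3 linear combinations (principal component scores)
pc <- prcomp(x)
Z <- pc$x[, 1:3]

# fit the response on the M transformed variables instead of the p original predictors
theta_fit <- lm(y ~ Z)
coef(theta_fit)  # theta_0, theta_1, ..., theta_M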
The premise behind this approach is that a small number of principal components can often suffice to explain most of the variability in the data, as well as the relationship between the predictors and the response. This relies on the assumption that the directions in which \(X_1,...,X_p\) show the most variation are the directions that are associated with the response \(Y\). Though not always true, this assumption is often a good enough approximation to give good results. When it does hold, fitting a model to \(Z_1,...,Z_M\) will likely give better results than fitting a model to \(X_1,...,X_p\), since most of the information in the data that relates to the response is captured by \(Z_1,...,Z_M\), and estimating only \(M \ll p\) coefficients mitigates overfitting.
Unlike principal component regression, partial least squares is a supervised method: the value of the response is used to guide the dimension reduction. PLS identifies a new set of features \(Z_1,...,Z_M\) that are linear combinations of the original predictors and then uses these \(M\) new features to fit a linear model by least squares. It makes use of the response \(Y\) to identify new features that not only approximate the original predictors well but are also related to the response. As with principal component regression, the number of partial least squares directions \(M\) is generally selected using cross-validation.
The Boston housing dataset has 506 observations and 14 variables.
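The lab code below assumes the glmnet and pls packages are loaded and that the data is available as a data frame named boston; a minimal setup, assuming the classic Boston data from MASS (whose column names are already lowercase), would be:

library(glmnet)  # ridge and lasso via glmnet() / cv.glmnet()
library(pls)     # principal components and partial least squares regression
library(MASS)    # assumed source of the Boston housing data

# lowercase copy of the data frame, matching the name used in the code below
boston <- Boston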
# model.matrix() automatically converts any qualitative variables into dummy variables
x = model.matrix(medv~., boston)[,-1]
y = boston$medv
set.seed(123)
train = sample(1:nrow(x), nrow(x)/2)
test = (-train)
y_test = y[test]
# alpha = 0 fits a Ridge model, and alpha = 1 fits a Lasso model
ridge_boston <- cv.glmnet(x[train,], y[train], alpha = 0)
ridge_pred_boston <- predict(ridge_boston, s = ridge_boston$lambda.min, newx = x[test,])
mean((ridge_pred_boston - y_test)^2)
## [1] 20.93114
ridge <- glmnet(x, y, alpha = 0)
predict(ridge, type = "coefficients", s = ridge_boston$lambda.min)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 27.267998220
## crim -0.085906235
## zn 0.031534937
## indus -0.041548821
## chas 2.909284605
## nox -11.361701629
## rm 4.017335580
## age -0.004078128
## dis -1.079159649
## rad 0.142772657
## tax -0.005370291
## ptratio -0.844550487
## black 0.009027138
## lstat -0.465964921
lasso_boston <- cv.glmnet(x[train,], y[train], alpha = 1)
lasso_pred_boston <- predict(lasso_boston, s = lasso_boston$lambda.min, newx = x[test,])
mean((lasso_pred_boston - y_test)^2)
## [1] 22.1527
lasso <- glmnet(x, y, alpha = 1)
predict(lasso, type = "coefficients", s = lasso_boston$lambda.min)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 35.05925375
## crim -0.10164838
## zn 0.04289605
## indus .
## chas 2.69720998
## nox -16.66244869
## rm 3.84559731
## age .
## dis -1.42820736
## rad 0.26808255
## tax -0.01046781
## ptratio -0.93547527
## black 0.00911473
## lstat -0.52253105
plot(lasso_boston)
plot(lasso_boston$glmnet.fit,"lambda",label=TRUE)
set.seed(123)
# CV defaults to 10-fold cross validation
pcr_boston = pcr(medv~., data = boston, scale = TRUE, validation = "CV")
summary(pcr_boston)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.206 7.300 6.890 5.609 5.594 5.165 5.141
## adjCV 9.206 7.298 6.883 5.606 5.640 5.153 5.132
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 5.166 5.138 5.149 5.164 5.164 4.949 4.879
## adjCV 5.159 5.130 5.141 5.155 5.153 4.939 4.868
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 47.13 58.15 67.71 74.31 80.73 85.79 89.91
## medv 37.42 45.59 63.59 64.78 69.70 70.05 70.05
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 92.95 95.08 96.78 98.21 99.51 100.00
## medv 70.56 70.57 70.89 71.30 73.21 74.06
validationplot(pcr_boston, val.type = "MSEP")
# Now splitting into training and test data
pcr_boston <- pcr(medv~., data = boston, subset = train, scale = TRUE, validation = "CV")
validationplot(pcr_boston, val.type = "MSEP")
pcr_pred_boston <- predict(pcr_boston, x[test,], ncomp = 13)
mean((pcr_pred_boston - y_test)^2)
## [1] 22.3659
pcr_boston <- pcr(y~x, scale = TRUE, ncomp = 13)
summary(pcr_boston)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.13 58.15 67.71 74.31 80.73 85.79 89.91 92.95
## y 37.42 45.59 63.59 64.78 69.70 70.05 70.05 70.56
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.08 96.78 98.21 99.51 100.00
## y 70.57 70.89 71.30 73.21 74.06
set.seed(123)
# CV defaults to 10-fold cross validation
pls_boston = plsr(medv~., data = boston, scale = TRUE, validation = "CV")
summary(pls_boston)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: kernelpls
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.206 6.554 5.111 5.003 4.955 4.923 4.892
## adjCV 9.206 6.553 5.104 4.995 4.945 4.910 4.881
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 4.881 4.875 4.876 4.878 4.879 4.879 4.879
## adjCV 4.871 4.865 4.866 4.867 4.868 4.868 4.868
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 45.86 56.78 63.91 69.70 75.47 78.72 81.95
## medv 49.93 70.64 72.30 73.29 73.77 73.92 74.01
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 84.62 90.16 92.40 96.22 98.16 100.00
## medv 74.06 74.06 74.06 74.06 74.06 74.06
validationplot(pls_boston, val.type = "MSEP")
# Now splitting into training and test data
pls_boston <- plsr(medv~., data = boston, subset = train, scale = TRUE, validation = "CV")
validationplot(pls_boston, val.type = "MSEP")
pls_pred_boston <- predict(pls_boston, x[test,], ncomp = 9)
mean((pls_pred_boston - y_test)^2)
## [1] 22.34382
pls_boston <- plsr(y~x, scale = TRUE, ncomp = 9)
summary(pls_boston)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: kernelpls
## Number of components considered: 9
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 45.86 56.78 63.91 69.70 75.47 78.72 81.95 84.62
## y 49.93 70.64 72.30 73.29 73.77 73.92 74.01 74.06
## 9 comps
## X 90.16
## y 74.06