Heavily borrowed from:
Textbook: An Introduction to Statistical Learning
Textbook: The Elements of Statistical Learning
UCLA Example link
require(knitr)
## Loading required package: knitr
In this lesson we consider some alternative fitting approaches for linear models, besides the usual ordinary least squares. These alternatives can sometimes provide better prediction accuracy and model interpretability.
Prediction Accuracy: Given that the true relationship between Y and X is approximately linear, the ordinary least squares estimates will have low bias. OLS also behaves well when n >> p, i.e. when the number of observations is much larger than the number of parameters. Yet if n is not much larger than p, there can be a lot of variability in the fit, resulting in overfitting and/or poor predictions. If p > n, then there is no longer a unique least squares estimate, and the method cannot be used at all. By constraining and shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias, which often leads to dramatic improvements in accuracy.
Model Interpretability: Often in multiple regression, many variables are not associated with the response. Irrelevant variables lead to unnecessary complexity in the resulting model. By removing them (setting their coefficients to zero) we obtain a more easily interpretable model. However, OLS is very unlikely to yield coefficient estimates that are exactly zero. Here we explore some approaches for automatically excluding features using this idea.
Subset Selection: This approach identifies a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of features.
Shrinkage: This approach fits a model involving all p predictors; however, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage, also known as regularization, has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero, so some shrinkage methods also perform variable selection.
Dimension Reduction: This approach involves projecting the p predictors into an M-dimensional subspace, where M < p. This is attained by computing M different linear combinations, or projections, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares.
Though we discuss the application of these techniques to linear models, they also apply to other methods like classification.
Most traditional regression techniques are intended for low-dimensional settings in which n >> p. This is in part because, through most of the field's history, the bulk of problems requiring statistics have been low-dimensional. These days p can be very large, while n is often limited due to cost, availability, or other considerations.
In general a dataset is high-dimensional if p > n or if p is slightly smaller than n. Classical approaches such as least squares linear regression are not appropriate here.
So what exactly goes wrong in high dimensional settings?
We will discuss the problems in the context of linear regression, though the ideas hold true for all classical regression techniques (linear OLS, logistic regression, LDA, etc.).
When p is larger than, or almost as large as, n, the least squares approach will yield a set of coefficient estimates that fit the data perfectly, regardless of whether there is truly a relationship present. The problem is that a perfect fit almost always means overfitting: least squares regression is simply too flexible for this setting.
In this scenario, we can actually attain perfect fits to data even when the features are completely unrelated to the response.
Unfortunately, the model fit statistics R², Cp, AIC, and BIC are also not effective in a high-dimensional setting, because estimating the error variance σ² is problematic there.
However, it turns out that many of the methods we discuss in this notebook for fitting less flexible least squares models (forward stepwise selection, ridge regression, the lasso, and principal components regression) are particularly useful for performing regression in high-dimensional settings. Essentially, these approaches avoid overfitting by using a less flexible fitting approach than ordinary least squares.
The general rule for additional dimensions in data is that additional features are only useful if they are truly associated with the response. Otherwise, adding noise features leads to increased test set error by increasing the chance of overfitting.
We must always be cautious about the way we report model results, especially in high-dimensional settings. In this setting the multicollinearity problem is extreme, since any variable in the model can be rewritten as a linear combination of all the other variables in the model. Essentially, we can never know exactly which variables (if any) truly are predictive of the outcome, and we can never identify the best coefficients. In general we should be careful not to overstate the results obtained; we should make it clear that what we found is simply one of many possible models for predicting the response, and that it must be validated on independent data sets.
It is also important to be careful when reporting errors and measures of model fit in the high-dimensional setting. We have seen that we can obtain useless models that have zero residuals when p > n, thus we should never use SSE, p-values, R², or other traditional measures of fit on the training data. It is imperative to report error and prediction results using a test set or cross-validation.
Best Subset Selection
Here we fit a separate OLS regression for each possible combination of the p predictors and then look at the resulting model fits. The problem with this method is that the best model is hidden among \(2^p\) possibilities. The algorithm is broken up into two stages. (1) For each k = 1, ..., p, fit all models that contain exactly k predictors and keep the best (lowest RSS) model of each size. (2) Select a single model from these candidates using cross-validated prediction error. More specific prediction error methods like AIC and BIC are discussed below. It is important to use testing or validation error, not training error, to assess model fit, because RSS (residual sum of squares) monotonically decreases and R² monotonically increases as more variables are added. The best approach is to cross-validate and choose the model with the highest R² and lowest RSS on the testing error estimates.
This works on other types of model selection, such as logistic regression, except that the score that we select upon changes. For logistic regression we would utilize deviance instead of RSS and R².
Next we discuss methods that are more computationally efficient.
Stepwise Selection
Besides computational issues, the best subset procedure also can suffer from statistical problems when p is large, since we have a greater chance of overfitting.
Forward Stepwise Selection considers a much smaller set of models. It begins with a model containing no predictors, then adds predictors one at a time until all of the predictors are in the model. At each step, the variable added is the one that gives the greatest additional improvement to the fit, and we stop adding when no variable improves the fit as measured by cross-validated prediction error. A best subset search for p = 20 would have to fit 1,048,576 models, whereas forward stepwise only requires fitting 211 potential models (as verified below). However, this method is not guaranteed to find the best possible model. Forward stepwise regression can even be applied in the high-dimensional setting where p > n.
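As a quick sanity check on those counts: best subset must consider every subset of the p predictors, while forward stepwise fits the null model plus p candidate models in the first step, p − 1 in the second, and so on, i.e. \(1 + p(p+1)/2\) models in total.
2^20              # all possible subsets of p = 20 predictors
## [1] 1048576
1 + 20 * 21 / 2   # models fit by forward stepwise selection
## [1] 211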
Backward Stepwise Selection begins with all p predictors in the model, then iteratively removes the least useful predictor, one at a time. It requires that n > p.
Hybrid Methods follow the forward stepwise approach; however, after adding each new variable, the method may also remove any variables that no longer contribute to the model fit.
Each of the three above-mentioned algorithms requires us to manually decide which model performs best. As mentioned before, the models with the most predictors will usually have the smallest RSS and largest R² on training error. To select the model with the best test error, we need to estimate the test error. There are two ways to estimate it:
Indirectly estimate the test error by making an adjustment to the training error to account for the overfitting bias.
Directly estimate the test error, using either a validation set, or cross-validation approach.
There are four common approaches to adjusting the training error in order to estimate the test error: Cp, AIC, BIC, and adjusted R².
Essentially, the Cp statistic adds a penalty of \(2 d \hat{\sigma}^2\) to the training RSS, since the training error tends to underestimate the test error. Here d is the number of predictors and \(\hat{\sigma}^2\) is an estimate of the variance of the error associated with each response measurement.
AIC criterion is defined for a large class of models fit by maximum likelihood (ML). In the case of a Gaussian model, ML and OLS are equivalent. Thus for OLS models, AIC and Cp are proportional to each other and only differ in that AIC has an additive constant term.
BIC is derived from a Bayesian point of view, but looks similar to AIC and Cp. For an OLS model with d predictors, the BIC replaces the \(2 d \hat{\sigma}^2\) of Cp with \(\log(n)\, d\, \hat{\sigma}^2\), where n is the number of observations. Since log n > 2 for n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and results in smaller models.
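Written out, using the formulation from An Introduction to Statistical Learning (with d predictors and \(\hat{\sigma}^2\) an estimate of the error variance):
\(C_p = \frac{1}{n}\left(RSS + 2 d \hat{\sigma}^2\right)\)
\(AIC = \frac{1}{n \hat{\sigma}^2}\left(RSS + 2 d \hat{\sigma}^2\right)\)
\(BIC = \frac{1}{n}\left(RSS + \log(n)\, d\, \hat{\sigma}^2\right)\)
For each of these statistics, a small value indicates a low estimated test error.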
The adjusted R² adds a penalty term for additional variables being added to the model.
\(\text{Adjusted } R^2 = 1 - \frac{RSS/(n - d - 1)}{TSS/(n - 1)}\)
Unlike the earlier statistics, a large value of adjusted R² indicates a model with small test error. In theory, the model with the largest adjusted R² will have only correct variables and no noise variables: unlike R², adjusted R² is penalized for the inclusion of unnecessary variables in the model. Despite its popularity, and even though it is quite intuitive, adjusted R² is not as well motivated in statistical theory as AIC, BIC, and Cp.
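As a quick numeric check of the formula (a minimal sketch; it assumes the ISLR package and uses an arbitrary three-predictor model on the Hitters data introduced below):
library(ISLR)
fit <- lm(Salary ~ AtBat + Hits + Walks, data = na.omit(Hitters))
n   <- nobs(fit)
d   <- length(coef(fit)) - 1               # number of predictors
rss <- sum(residuals(fit)^2)
y   <- na.omit(Hitters)$Salary
tss <- sum((y - mean(y))^2)
c(manual  = 1 - (rss / (n - d - 1)) / (tss / (n - 1)),
  builtin = summary(fit)$adj.r.squared)    # the two values should match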
All of these statistics have rigorous theoretical justifications, but they still all rely (to some degree) on asymptotic arguments, such as large n.
Here we apply best subset selection to the Hitters dataset from the ISLR package. We want to predict a baseball player’s Salary based on various statistics from the previous year.
library(ISLR)
attach(Hitters)
names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI"
## [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun"
## [11] "CRuns" "CRBI" "CWalks" "League" "Division"
## [16] "PutOuts" "Assists" "Errors" "Salary" "NewLeague"
dim(Hitters)
## [1] 322 20
str(Hitters)
## 'data.frame': 322 obs. of 20 variables:
## $ AtBat : int 293 315 479 496 321 594 185 298 323 401 ...
## $ Hits : int 66 81 130 141 87 169 37 73 81 92 ...
## $ HmRun : int 1 7 18 20 10 4 1 0 6 17 ...
## $ Runs : int 30 24 66 65 39 74 23 24 26 49 ...
## $ RBI : int 29 38 72 78 42 51 8 24 32 66 ...
## $ Walks : int 14 39 76 37 30 35 21 7 8 65 ...
## $ Years : int 1 14 3 11 2 11 2 3 2 13 ...
## $ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
## $ CHits : int 66 835 457 1575 101 1133 42 108 86 1332 ...
## $ CHmRun : int 1 69 63 225 12 19 1 0 6 253 ...
## $ CRuns : int 30 321 224 828 48 501 30 41 32 784 ...
## $ CRBI : int 29 414 266 838 46 336 9 37 34 890 ...
## $ CWalks : int 14 375 263 354 33 194 24 12 8 866 ...
## $ League : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
## $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
## $ PutOuts : int 446 632 880 200 805 282 76 121 143 0 ...
## $ Assists : int 33 43 82 11 40 421 127 283 290 0 ...
## $ Errors : int 20 10 14 3 4 25 7 9 19 0 ...
## $ Salary : num NA 475 480 500 91.5 750 70 100 75 1100 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
# Check for NA data
sum(is.na(Hitters$Salary))/length(Hitters[ ,1])*100
## [1] 18.32298
It turns out that about 18% of the Salary values are missing. For the purpose of this lesson we will simply omit the rows with missing data.
Hitters <- na.omit(Hitters)
dim(Hitters)
## [1] 263 20
The regsubsets() function of the leaps library will perform our best subset selection, where best is quantified using RSS.
library(leaps)
regfit <- regsubsets(Salary ~ ., Hitters)
summary(regfit)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " " " "*" "*" " " " " " "
## 8 ( 1 ) " " "*" " " "*" "*" " " " " " "
An asterisk indicates that a variable is included in the corresponding model. For example, the best two-variable model contains only Hits and CRBI. By default regsubsets() only reports up to the best eight-variable model. We can adjust this with the nvmax parameter.
regfit <- regsubsets(Salary ~ ., data=Hitters, nvmax = 19)
summary(regfit)$rsq
## [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
## [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
In this 19-variable model, the R² increases monotonically as more variables are included. This always happens on the training data, since adding a variable can never increase the RSS.
We can use the built-in plot function to plot R², adjusted R², Cp, and BIC. Each row in these graphs represents a model; the shaded rectangles in the columns indicate the variables included in the given model. The darkness of the shading simply represents the ordering of the given statistic’s values.
par(mfrow=c(2,2))
plot(regfit, scale = 'r2')
plot(regfit, scale = 'adjr2')
plot(regfit, scale = 'Cp')
plot(regfit, scale = 'bic')
Note: recall, the measures of fit shown above are (besides R²) all estimates of test error.
We can also use regsubsets() to perform stepwise selection, by setting the method parameter to either "backward" or "forward".
regfit.fwd <- regsubsets(Salary ~.,data=Hitters,nvmax=19,
method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
regfit.bwd <- regsubsets(Salary ~.,data=Hitters,nvmax=19,
method ="backward")
summary(regfit.bwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: backward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " " " " " "*" " " " " " "
## 5 ( 1 ) " " " " " " " " "*" " " " " " "
## 6 ( 1 ) " " " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
We can see here that the 1- to 6-variable models are identical for best subset and forward stepwise selection. However, the best seven-variable models are different for all three techniques.
Note: to select the best value of nvmax, we should use cross-validation.
Resampling methods have become an integral tool in modern statistics. Resampling is based on repeatedly drawing samples from a training set of observations and refitting a model on each sample, in order to obtain additional insight into that model. For example, to examine the robustness and variability of a linear regression fit, we can fit a linear regression to each new sample and examine the differences in the results. The goal of any resampling measure is to better estimate how a statistical model will perform on out-of-sample, ‘real-life’ data.
In general, cross-validation techniques give more direct estimates of the test error and make fewer assumptions about the underlying model. Further, they can be used with a wider selection of model types.
Test error rate is the average error from using a statistical method to predict the response on a new observation.
Training error rate is simply the average error from a statistical method that uses predictions based on the data that was used to fit the model in the first place.
We always prefer to use the test error rate to measure model performance, since the training error rate can dramatically over-estimate real-world performance. Sometimes, simply splitting the data into testing and training datasets is not recommended due to the small size of the sample.
Cross-validation estimates the test error rate by holding out a subset of the training observations from the fitting process, and then applying the statistical method to those held out observations. This process can be repeated several times to enable more data to partake in the training of the model, one subset at a time.
Prior to explaining cross-validation, we describe the more basic validation set approach and its drawbacks. In this method, we randomly divide the available data into two parts, a training set and a validation set. The model is fit on the training set, then the fitted model is used to predict the responses for the observations in the validation set. Typically, the performance of the models is measured using MSE.
To attain a more robust measure of fit, we can repeatedly randomly sample the data, compute the MSE each time, and compare the results. The validation approach is simple and easy to implement, but it has two potential problems: (1) the validation estimate of the test error can be highly variable, depending on exactly which observations fall into the training and validation sets; and (2) since only a subset of the observations is used to fit the model, the validation error tends to over-estimate the test error for a model fit on the entire data set.
K-fold cross-validation randomly divides the set of observations into k groups, or folds, of approximately equal size. Each fold serves once as the validation set, with the model trained on the remaining folds, so the validation sets do not overlap.
For example, a 5-fold CV would resample the training and validation sets in a non-overlapping fashion 5 times, such that every observation is eventually held out exactly once. The test error is then estimated by averaging the five resulting MSE estimates.
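For instance, here is a minimal base R sketch of 5-fold CV (the fold assignment and the small lm formula are arbitrary choices for illustration; it assumes the cleaned Hitters data from above):
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(Hitters)))   # random fold labels
cv.mse <- numeric(k)
for (j in 1:k) {
  fit  <- lm(Salary ~ AtBat + Hits + Walks, data = Hitters[folds != j, ])
  pred <- predict(fit, newdata = Hitters[folds == j, ])
  cv.mse[j] <- mean((pred - Hitters$Salary[folds == j])^2)  # held-out MSE
}
mean(cv.mse)   # 5-fold CV estimate of the test MSE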
The difference between 5-fold, 10-fold, or other choices of k in k-fold CV comes down to the bias-variance trade-off.
Cross-validation can also be used in a qualitative (classification) setting. With qualitative data, we quantify test error using the number of misclassified observations, rather than the MSE. The estimated test error rate from k-fold CV is much more robust than the training error rate when deciding, say, what order of polynomial to keep, or what value of K we should use in a KNN classifier. This is because the training error rate typically keeps decaying as the method becomes more flexible, which eventually means over-fitting the model.
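For example, a minimal sketch of using 10-fold CV to choose K for a KNN classifier with caret (the iris data and the grid of K values here are arbitrary stand-ins for illustration):
library(caret)
set.seed(1)
knn.fit <- train(Species ~ ., data = iris,
                 method = "knn",
                 trControl = trainControl(method = "cv", number = 10),
                 tuneGrid = data.frame(k = seq(1, 21, by = 2)))
knn.fit$bestTune   # the K with the best cross-validated accuracy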
The subset selection methods described above used least squares fits on subsets of the predictors to choose the best model and estimate test error. Here, we discuss an alternative where we fit a model containing all p predictors, using a technique that constrains or regularizes the coefficient estimates, or equivalently, shrinks the coefficient estimates towards zero. Shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the coefficient estimates towards zero are ridge regression and the lasso.
Ridge regression is similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. Like OLS, ridge regression seeks coefficient estimates that reduce the RSS; however, it adds a shrinkage penalty that grows with the magnitude of the coefficients. This penalty has the effect of shrinking the coefficient estimates towards zero. A tuning parameter, λ, controls the impact of the shrinking; λ = 0 behaves exactly like OLS regression. Of course, selecting a good value for λ is critical, and it should be chosen using cross-validation techniques. A requirement of ridge regression is that the predictors X have been centered to have a mean of 0, so the data must be standardized beforehand.
Note that the shrinkage does not apply to the intercept.
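Concretely, in the standard formulation, ridge regression chooses the coefficient estimates minimizing
\(\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 = RSS + \lambda \sum_{j=1}^{p}\beta_j^2\)
The penalty sums over \(\beta_1, \dots, \beta_p\) only, which is why the intercept is not shrunk.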
Why is ridge regression better than least squares?
The advantage is apparent in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases. This leads to decreased variance, but increased bias. Plain OLS regression sits at one fixed point of this trade-off: low bias, but potentially high variance. However, the lowest test MSE tends to occur where the drop in variance balances the rise in bias. Thus, by properly tuning λ and trading a small amount of bias for a larger reduction in variance, we can find a lower test MSE.
Ridge regression works best in situations where the least squares estimates have high variance. Ridge regression is also much more computationally efficient than any subset method, since it is possible to solve for all values of λ essentially simultaneously.
Ridge regression has at least one disadvantage: it includes all p predictors in the final model. The penalty term will set many of the coefficients close to zero, but never exactly to zero. This isn’t generally a problem for prediction accuracy, but it can make the model more difficult to interpret. The lasso overcomes this disadvantage and is capable of forcing some of the coefficients to zero, granted that λ is large enough. Thus, the lasso also performs variable selection.
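The lasso’s objective differs only in the penalty, an \(\ell_1\) norm in place of ridge’s \(\ell_2\) norm:
\(RSS + \lambda \sum_{j=1}^{p}|\beta_j|\)
It is this \(\ell_1\) penalty that can force some coefficient estimates to be exactly zero for sufficiently large λ.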
There is no dominant algorithm here; in general it is best to test all three techniques introduced so far and choose the one that best suits the data, using cross-validated test error estimates.
Instead of using adjusted R², Cp, and BIC to estimate test error rates, we can use cross-validation approaches. In order for this to work, we must use only the training observations to perform all aspects of model fitting and variable selection. The test errors are then computed by applying the fitted model to the test or validation data. We can split the data into training and testing sets using the caret package.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1234)
split <- createDataPartition(y = Hitters$Salary, p = 0.5, list = FALSE)
train <- Hitters[split,]
test <- Hitters[-split,]
set.seed(123) # for reproducing these results
ridge <- train(Salary ~., data = train,
method='ridge',
lambda = 4, # note: the resampling output below shows caret still tuned over its default lambda grid
preProcess=c('scale', 'center'))
## Loading required package: elasticnet
## Loading required package: lars
## Loaded lars 1.2
ridge
## Ridge Regression
##
## 133 samples
## 19 predictors
##
## Pre-processing: scaled (19), centered (19)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 133, 133, 133, 133, 133, 133, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared RMSE SD Rsquared SD
## 0e+00 369.5398 0.3878097 48.75211 0.1235791
## 1e-04 367.5711 0.3914902 48.77035 0.1223455
## 1e-01 343.9462 0.4463620 45.99850 0.1007578
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
ridge.pred <- predict(ridge, test)
# Caution: the parentheses here compute (mean residual)^2, not the mean squared
# residual; the test RMSE would be sqrt(mean((ridge.pred - test$Salary)^2)).
mean(ridge.pred - test$Salary)^2
## [1] 492.5604
Next we use k-fold cross-validation on the training data to select the best lambda (even though caret used bootstrap resampling by default in the example above).
set.seed(123)
fitControl <- trainControl(method = "cv",
number = 10)
# Define a sequence of lambda values to test (left commented out in the call below,
# so caret's default three-value grid is used)
lambdaGrid <- expand.grid(lambda = 10^seq(10, -2, length=100))
ridge <- train(Salary~., data = train,
method='ridge',
trControl = fitControl,
# tuneGrid = lambdaGrid, # uncomment to search the custom grid (note the trailing comma)
preProcess=c('center', 'scale')
)
ridge
## Ridge Regression
##
## 133 samples
## 19 predictors
##
## Pre-processing: centered (19), scaled (19)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 118, 118, 121, 120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared RMSE SD Rsquared SD
## 0e+00 310.9253 0.4955675 70.65403 0.1990873
## 1e-04 310.4766 0.4966985 70.59140 0.1984634
## 1e-01 303.2478 0.5321637 75.70436 0.1646462
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
# Compute coeff
predict(ridge$finalModel, type='coef', mode='norm')$coefficients[19,]
## AtBat Hits HmRun Runs RBI Walks
## 0.000000 93.203820 -27.122046 -20.841875 10.588744 110.872392
## Years CAtBat CHits CHmRun CRuns CRBI
## -16.092720 28.095612 83.070969 41.076537 72.119073 67.348985
## CWalks LeagueN DivisionW PutOuts Assists Errors
## -42.508339 21.016185 -25.150519 46.020431 -13.878715 1.286104
## NewLeagueN
## 10.441192
ridge.pred <- predict(ridge, test)
# As above, sqrt(mean(...)^2) is |mean residual|, not the test RMSE
sqrt(mean(ridge.pred - test$Salary)^2)
## [1] 22.1937
You’ll notice that the regression coefficients don’t really seem to have shifted towards zero, but that is because the data were standardized first, so the coefficients are reported on the standardized scale.
We should now check to see if this is actually any better than a regular lm() model.
lmfit <- train(Salary ~., data = train,
method='lm',
trControl = fitControl,
preProc=c('scale', 'center'))
lmfit
## Linear Regression
##
## 133 samples
## 19 predictors
##
## Pre-processing: scaled (19), centered (19)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 120, 119, 118, 121, 120, ...
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 316.4114 0.4691751 99.22171 0.1931053
##
##
coef(lmfit$finalModel)
## (Intercept) AtBat Hits HmRun Runs RBI
## 518.911805 -242.937109 430.118098 34.425754 -210.737113 -60.878038
## Walks Years CAtBat CHits CHmRun CRuns
## 226.088602 -45.963985 133.208255 -293.157175 -8.227962 495.389105
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 181.778197 -278.938054 14.916906 -30.533021 35.987237 -19.843898
## Errors NewLeagueN
## 15.845189 10.722860
lmfit.pred <- predict(lmfit, test)
# Again |mean residual|, not the test RMSE
sqrt(mean(lmfit.pred - test$Salary)^2)
## [1] 11.71337
Judging by the cross-validated results above, the ridge model actually attains a slightly lower RMSE (root mean squared error, 303.2 vs. 316.4) and a higher R² than the plain linear regression; the test-set quantities just printed are |mean residual| values, so they should not be read as RMSEs. We can also see that the ridge regression has indeed shrunk the coefficients, some of them extremely close to zero, and this is an advantage of the ridge model.
lasso <- train(Salary ~., train,
method='lasso',
preProc=c('scale','center'),
trControl=fitControl)
lasso
## The lasso
##
## 133 samples
## 19 predictors
##
## Pre-processing: scaled (19), centered (19)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 121, 119, 119, 120, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared RMSE SD Rsquared SD
## 0.1 294.4562 0.5859093 100.8605 0.1645902
## 0.5 301.5807 0.5813785 106.0907 0.2036618
## 0.9 313.6748 0.5432861 111.6562 0.2372415
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.1.
# Get coef
predict.enet(lasso$finalModel, type='coefficients', s=lasso$bestTune$fraction, mode='fraction')
## $s
## [1] 0.1
##
## $fraction
## 0
## 0.1
##
## $mode
## [1] "fraction"
##
## $coefficients
## AtBat Hits HmRun Runs RBI Walks
## 0.00000 35.66515 0.00000 0.00000 0.00000 81.03587
## Years CAtBat CHits CHmRun CRuns CRBI
## 0.00000 0.00000 95.06893 0.00000 14.84315 50.35656
## CWalks LeagueN DivisionW PutOuts Assists Errors
## 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
## NewLeagueN
## 0.00000
lasso.pred <- predict(lasso, test)
# As before, |mean residual| rather than the test RMSE
sqrt(mean(lasso.pred - test$Salary)^2)
## [1] 28.08927
Here in the lasso we see that many of the coefficients have been forced to exactly zero. This presents a simplicity advantage over the ridge and linear regression models, and on cross-validated RMSE (294.5 vs. 303.2) the lasso actually edges out the ridge regression.
So far, the methods we have discussed have controlled for variance by either using a subset of the original variables, or by shrinking their coefficients toward zero. Now we explore a class of models that transform the predictors and then fit a least squares model using the transformed variables. Dimension reduction reduces the problem of estimating p + 1 coefficients to the simpler problem of M + 1 coefficients, where M < p. Two approaches for this task are principal component regression and partial least squares.
One can describe PCA as an approach for deriving a low-dimensional set of features from a large set of variables. The first principal component direction of the data is the direction along which the observations vary the most. In other words, the first PC is a line that fits as closely as possible to the data. One can fit up to p distinct principal components. The second PC is a linear combination of the variables that is uncorrelated with the first PC, and has the largest variance subject to this constraint. It turns out that the second PC must be perpendicular to the first PC direction. The idea is that the principal components capture (maximize) the most variance in the data using linear combinations of the data in subsequently orthogonal directions. In this way we can also combine the effects of correlated variables to extract more information from the available data, whereas in regular least squares we would have to discard one of the correlated variables.
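As an illustration, here is a minimal sketch of computing principal components of the Hitters predictors with base R’s prcomp() (using the cleaned data from above; the number of components shown is an arbitrary choice):
x  <- model.matrix(Salary ~ ., data = Hitters)[, -1]   # predictor matrix, intercept dropped
pc <- prcomp(x, scale. = TRUE)                          # standardized PCA
summary(pc)$importance[, 1:5]   # std dev / proportion / cumulative variance of the first 5 PCs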
In regression, we construct M principal components and then use these components as predictors in a linear regression fit by least squares. The idea is that we fit a model with a small number of variables (the principal components) that explain most of the variability in the data, as well as its relationship with the response. In general, we have the potential to fit better models than ordinary least squares, since we can reduce the effect of overfitting. PCR will tend to do best in cases where the first few principal components are sufficient to capture most of the variation in the predictors as well as the relationship with the response.
Note that PCR is not a feature selection method, since each component is a linear combination of all p original features. Thus PCR is more closely related to ridge regression than to the lasso.
The PCR method described above involves identifying linear combinations of X that best represent the predictors. These combinations (directions) are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions. That is, the response Y does not supervise the identification of the principal components, so there is no guarantee that the directions that best explain the predictors are also the best for predicting the response (even though that is often assumed). Partial least squares (PLS) is a supervised alternative to PCR. Like PCR, PLS is a dimension reduction method: it first identifies a new, smaller set of features that are linear combinations of the original features, then fits a linear model via least squares to the new M features. Yet, unlike PCR, PLS makes use of the response variable in order to identify the new features.
PLS does this by placing higher weights on the variables that are most strongly related to the response; see the sketch below. To attain subsequent directions, the method first adjusts each of the variables for the first component, by regressing each variable on the first component and taking the residuals. These residuals can be interpreted as the remaining information that has not been explained by the first PLS direction. We then compute the second component in exactly the same way as the first. This can be iterated M times to identify multiple PLS components.
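A minimal sketch of the first PLS direction, following the description above (it assumes the train split from earlier; each weight is the simple-regression slope of the centered response on a standardized predictor):
X <- scale(model.matrix(Salary ~ ., data = train)[, -1])  # standardized predictors
y <- train$Salary - mean(train$Salary)                    # centered response
phi1 <- drop(crossprod(X, y)) / colSums(X^2)   # slope of y on each X_j
z1 <- X %*% phi1                               # scores of the first PLS component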
In practice PLS often performs no better than ridge regression or PCR. This is because even though PLS can reduce bias, it also has the potential to increase variance, so the overall benefit is not really distinct.
We will show PCR using both the pls package and the caret package.
library(pls)
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:caret':
##
## R2
##
## The following object is masked from 'package:stats':
##
## loadings
set.seed(123)
# validation = "CV" defaults to 10-fold cross-validation
pcr.fit <- pcr(Salary ~., data=train, scale=TRUE, validation="CV", ncomp=19)
summary(pcr.fit)
## Data: X dimension: 133 19
## Y dimension: 133 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 421.7 309.2 312.4 305.8 304.0 305.1 309.2
## adjCV 421.7 308.9 311.9 305.3 303.4 304.6 308.0
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 313.8 319.8 320.6 320.8 318.9 320.0 320.7
## adjCV 312.3 318.1 318.7 319.6 316.7 317.7 318.8
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 326.1 327.6 318.2 317.0 330.1 329.9
## adjCV 323.3 324.7 315.0 313.8 326.2 325.9
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 39.93 59.19 71.46 78.51 83.79 88.40 92.04
## Salary 46.53 47.58 50.14 51.56 51.81 52.94 53.37
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 94.82 96.19 97.14 98.00 98.71 99.11 99.47
## Salary 53.40 53.66 53.84 55.57 56.19 56.36 58.07
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 99.74 99.89 99.96 99.99 100.00
## Salary 58.28 61.32 62.26 62.39 62.71
validationplot(pcr.fit, val.type='MSEP')
The summary reports the cross-validated scores as RMSEP, along with the % variance explained on the training data. Plotting the cross-validated MSEP against the number of components, we see that the lowest value occurs at the fourth component. This suggests a large improvement over the full least squares approach, because we are able to explain much of the variance using only 4 components rather than 19.
Let’s see how this performs on the test dataset.
pcr.pred <- predict(pcr.fit, test, ncomp=4)
sqrt(mean((pcr.pred - test$Salary)^2))
## [1] 389.6798
Note that this value, unlike the test-set numbers reported for the ridge, lasso, and linear models above, is a properly computed test RMSE (the square root of the mean squared residual), so the figures are not directly comparable.
Using the caret package
pcr.fit <- train(Salary ~., data=train,
preProc = c('center', 'scale'),
method='pcr',
trControl=fitControl)
pcr.fit
## Principal Component Analysis
##
## 133 samples
## 19 predictors
##
## Pre-processing: centered (19), scaled (19)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 119, 120, 121, 118, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared RMSE SD Rsquared SD
## 1 290.7372 0.5554968 107.4741 0.2151344
## 2 290.6449 0.5522785 110.7581 0.2211907
## 3 286.7008 0.5711053 111.5541 0.2204429
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
The caret package, using 10-fold CV, chooses the best model with 3 components.
pcr.pred <- predict(pcr.fit, test)
# |mean residual| again, not the test RMSE
sqrt(mean(pcr.pred - test$Salary)^2)
## [1] 29.37353
The results are comparable to lasso regression. However, PCR results are not easily interpretable.
pls.fit <- plsr(Salary~., data=train, scale=TRUE, validation="CV")
summary(pls.fit)
## Data: X dimension: 133 19
## Y dimension: 133 1
## Fit method: kernelpls
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 421.7 309.3 310.6 317.7 324.4 329.0 324.2
## adjCV 421.7 308.9 309.6 315.9 321.9 325.6 321.0
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 323.8 325.0 326.3 325.7 325.3 322.9 324.8
## adjCV 320.7 321.5 322.7 322.1 321.9 319.5 321.2
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 326.3 328.6 329.4 335.5 335.6 336.4
## adjCV 322.6 324.8 325.6 331.2 331.2 332.0
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 39.63 52.82 65.27 75.24 79.47 83.61 86.50
## Salary 49.69 53.44 54.96 56.29 58.55 59.70 60.54
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 88.81 92.47 94.49 96.57 97.56 98.20 98.50
## Salary 61.43 61.69 61.97 62.12 62.31 62.38 62.49
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 99.25 99.56 99.73 99.98 100.00
## Salary 62.52 62.55 62.63 62.67 62.71
validationplot(pls.fit, val.type='MSEP')
Here the best M (number of components) is 1. Now we evaluate the corresponding test error.
pls.pred <- predict(pls.fit, test, ncomp=1)
# |mean residual| once more, rather than the test RMSE
sqrt(mean(pls.pred - test$Salary)^2)
## [1] 23.0655
Here we see a slightly better result than PCR on the same metric. This is probably because the PLS component directions are estimated using both the predictors and the response.