Data Sources:

Heavily borrowed from:

Textbook: Introduction to statistical learning
Textbook: Elements of statistical learning
UCLA Example link
Wikipedia

require(knitr)

## Loading required package: knitr

Overview and Definitions

In this lesson we consider some alternative fitting approaches for linear models, besides the usual ordinary least squares. These alternatives can sometimes provide better prediction accuracy and model interpretability.

Prediction Accuracy: Given that the true relationship between Y and X is approx. linear, the ordinary least squares estimates will have low bias. OLS also behaves well went n >> p. Yet if n is not much larger than p, then there can be a lot of variability in the fit, resulting in overfitting and/or poor predictions. If p > n, then there is no longer a unique least squares estimate, and the method cannot be used at all.

This problem is another facet of the curse of dimensionality. When p starts to become large, observations x start to become closer to the boundaries between classes than the nearby observations, which presents major problems for predicting. Further, with many p, the training samples are often sparsely populated, making it difficult to identify trends and predict. In-fact with increasing p increases the first nearest neighbor distance almost linearly. Generally, you could expect a NN distance close to zero for one-dimension, and average about 0.8 once we reach 10-dimensions.

By constraining and shrinking the estimated coefficients, we can often substantially reduce the variance as the cost of a negligible increase in bias, which often leads to dramatic improvements in accuracy.

Model Interpretability: Often in multiple regression, many variables are not associated with the response. Irrelevant variables leads to unnecessary complexity in the resulting model. By removing them (setting coefficient = 0) we obtain a more easily interpretable model. However, using OLS makes it very unlikely that the coefficients will be exactly zero. Here we explore some approach for automatically excluding features using this idea.
- Subset Selection: This approach identifies a subset of the p predictors that we believe to be related to the response. We then fit a model using the least squares of the subset features.
- Shrinkage. This approach fits a model involving all p predictors, however, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage, AKA regularization has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. Thus this method also performs variable selection.
- Dimension Reduction: This approach involves projecting the p predictors into an M-dimensional subspace, where M < p. This is attained by computing M different linear combinations, or projections, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares.

Though we discuss the application of these techniques to linear models, they also apply to other methods like classification.

Methods in Detail

Subset Selection

Best Subset Selection

Here we fit a separate OLS regression for each possible combination of the p predictors and then look at the resulting model fits. The problem with this method is the best model is hidden within 2^p possibilities. The algorithm is broken up into two stages. (1) Fit all models that contain k predictors, where k is the max length of the models. (2) Select a single model using cross-validated prediction error. More specific prediction error methods like AIC and BIC are discussed below. It is important to use testing or validation error, and not training error to assess model fit because RSS and R^2 monotonically increase with more variables. The best approach is to cross validate and choose the model with the highest R^2 and lowest RSS on testing error estimates.

This works on other types of model selection, such as logistic regression, except that the score that we select upon changes. For logistic regression we would utilize deviance instead of RSS and R^2.

Next we discuss methods are more computationally efficient.

Stepwise Selection

Besides computational issues, the best subset procedure also can suffer from statistical problems when p is large, since we have a greater chance of overfitting.

Forward Stepwise Selection considers a much smaller subset of p predictors. It begins with a model containing no predictors, then adds predictors to the model, one at a time until all of the predictors are in the model. The order of the variables being added is the variable, which gives the greatest addition improvement to the fit, until no more variables improve model fit using cross-validated prediction error. A best subset model for p = 20 would have to fit 1 048 576 models, where as forward step wise only requires fitting 211 potential models. However, this method is not guaranteed to find the model. Forward stepwise regression can even be applied in the high-dimensional setting where p > n.
Backward Stepwise Selection begins will all p predictors in the model, then iteratively removes the least useful predictor one at a time. Requires that n > p.
Hybrid Methods follows the forward stepwise approach, however, after adding each new variable, the method may also remove any variables that do not contribute to the model fit.

Choosing the Best Model

Each of the three above-mentioned algorithms requires us to manually decide which model performs best. As mentioned before, the models with the most predictors will usually have the smallest RSS and largest R^2 when using training error. To select the model with the best test error, we need to estimate the test error. There are two ways to compute test error.

Indirectly estimate the test error by making and _adjustment) to the training error to account for the over fitting bias.
Directly estimate the test error, using either a validation set, or cross-validation approach.

C_p, AIC, BIC, and Adjusted R^2

These four measures are the four common approaches to adjust the training error to estimate test error.

Essentially the C_p statistic adds a penalty of 2 d σ^2 to the training RSS given that the training error trends to underestimate the test error, where d is the number of predictors and σ^2 is an estimate of the variance of the error associated with each response measurement.

AIC criterion is defined for a large class of models fit by maximum likelihood (ML). In the case of a Gaussian model, ML and OLS are equivalent. Thus for OLS models, AIC and C_p are proportional to each other and only differ in that AIC has an additive constant term.

BIC is derived from a Bayesian point of view, but looks similar to AIC and C_p. For an OLS model with d predictors, the BIC replaces the 2 d σ^2 from C_p with \(log (n) d σ^2\), where n is the number of observations. Since log n > 2 for a n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and results in smaller models.

The adjusted R^2 adds a penalty term for additional variables being added to the model.

\[ Adjusted R^2 = 1 − RSS/(n − d − 1) / TSS / (n-1)\]

Unlike the earlier statistics, a large value of Adjusted R^2 indicates a model with small test error. In theory, the model with the largest adjusted R^2 will have only correct variables and no noise variables. This is because unlike R^2, Adjusted R^2 is penalized for inclusion of unnecessary variables in the model. Despite its popularity, and even though it is quite intuitive, the adjusted R2 is not as well motivated in statistical theory as AIC, BIC, and C_p.

All these statistics have rigorous theoretical justifications, but they still all rely (to some degree) on certain arguments, like large n.

Validation and Cross-Validation

These approaches discussed in detail here.

In general, cross validation techniques are more direct estimates of test error, and makes fewer assumptions about the underlying model. Further, it can be used in a wider selection of model types.

Note: it is common for many models to have similar test errors. In this situation it is often better to pick the simplest model.

Shrinkage Methods

The subset selection methods described above used least squares fitting that contained a subset of the predictors to choose the best model, and estimate test error. Here, we discuss an alternative where we fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. The shrinking of the coefficient estimates has the effect of significantly reducing their variance. The two best-known techniques for shrinking the coefficient estimates towards zero are the ridge regression and the lasso.

Ridge Regression

Ridge regression is similar to least squares except that the coefficients are estimated by minimizing a slightly different quantity. Ridge regression, like OLS, seeks coefficient estimates that reduce RSS, however they also have a shrinkage penalty when the coefficients come closer to zero. This penalty has the effect of shrinking the coefficient estimates towards zero. A parameter, λ, controls the impact of the shrinking. λ = 0 will behave exactly like OLS regression. Of course, selection a good value for λ is critical, and should be chosen using cross validation techniques. A requirement of ridge regression is that the predictors X have been centered to have a mean = 0, thus the data must be standardized before hand. Without going into the math, it is useful to know that ridge regression shrinks the features with the smallest column space variance. Like in prinicipal component analysis, ridge regression projects the data into d directional space and then shrinks the coefficients of the low-variance components more than the high variance components, which are equivalent to the largest and smallest principal components.

Note that the shrinkage does not apply to the intercept.

Why is ridge regression better than least squares?

The advantage is apparent in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases. This leads to decrease variance, with a smaller increase in bias. Regular OLS regression is fixed with high variance, but no bias. However, the lowest test MSE tends to occur at the intercept between variance and bias. Thus, by properly tuning λ and acquiring less variance at the cost of a small amount of bias, we can find a lower potential MSE.

Ridge regression works best in situations for least squares estimates have high variance. Ridge regression is also much more computationally efficient that any subset method, since it is possible to simultaneously solve for all values of λ.

The Lasso

Ridge regression had at least one disadvantage; it includes all p predictors in the final model. The penalty term will set many of them close to zero, but never exactly to zero. This isn’t generally a problem for prediction accuracy, but it can make the model more difficult to interpret the results. Lasso overcomes this disadvantage and is capable of forcing some of the coefficients to zero granted that s is small enough. Since s = 1 results in regular OLS regression, as s approaches 0 the coefficients shrink towards zero. Thus, Lasso regression also performs variable selection.

There is no dominant algorithm present here, in general it is best to test all three techniques introduced so far and chose the one that best suits the data using cross-validated test error estimates.

Elastic Net

The elastic net is another penalty that incorperates the variable selection of the lasso and the shrinkage of correlated predictors like righe regression.

Dimension Reduction Methods

So far, the methods we have discussed have controlled for variance by either using a subset of the original variables, or by shrinking their coefficients toward zero. Now we explore a class of models that transform the predictors and then fit a least squares model using the transformed variables. Dimension reduction reduces the problem of estimating p + 1 coefficients to the simpler problem of M + 1 coefficients, where M < p. Two approaches for this task are principal component regression and partial least squares.

Principal Components Regression (PCA)

One can describe PCA as an approach for deriving a low-dimensional set of features from a large set of variables. The first principal component direction of the data is along which the observations vary the most. In other words, the first PC is a line that fits as close as possible to the data. One can fit p distinct principal components. The second PC is a linear combination of the variables that is uncorrelated with the first PC, and has the largest variance subject to this constraint. It turns out that the 2 PC must be perpendicular to the first PC direction. The idea is that the principal components capture (maximize) the most variance in the data using linear combinations of the data in subsequently orthogonal directions. In this way we can also combine the effects of correlated variables to get more information out of the available data, whereas in regular least squares we would have to discard one of the correlated variables.

In regression, we construct M principal components and then use these components as predictors in a linear regression using least squares. The idea being that we fit a model with a small number of variables (principals) that explain most of the variability in the data, and the most relationship with the response. In general, we have potential to fit better models than ordinary least squares since we can reduce the effect of over fitting. In general, PCR will tend to be better in cases where the first few principal components are sufficient to capture most of the variation in the predictors as well as the relationship with the response.

Note that PCR is not a feature selection method. This is because it is a linear combination of all p original features. Thus PCR is more related to ridge regression than lasso.

Partial Least Squares

The PCR method that we described above involves identifying linear combinations of X that best represent the predictors. These combinations (directions) are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions. That is, the response Y does not supervise the identification of the principal components, thus there is no guarantee that the directions that best explain the predictors also are the best for predicting the response (even though that is often assumed). Partial least squares (PLS) are a supervised alternative to PCR. Like PCR, PLS is a dimension reduction method, which first identifies a new smaller set of features that are linear combinations of the original features, then fits a linear model via least squares to the new M features. Yet, unlike PCR, PLS makes use of the response variable in order to identify the new features.

PLS does this by placing higher weights on the variables that are most strongly related to the response. To attain subsequent direction, the method first adjusts each of the variables for the first component by regressing each variable on the first component and taking the residuals. The residuals can be interpreted as the remaining information that has no been explained by the first PLS direction. We then compute the second component in exactly the same way as the first component. This can be iterated M times to identify multiple PLS components.

In practice PLS performs no better than ridge regression or PCR. This is because even though PLS can reduce bias, it also has potential to increase the variance, so the overall benefit is not really distinct.

Considerations for High Dimensions

Most traditional regression techniques are intended for low-dimensional settings in which n >> p. This in part because through most of the fields history, the bulk of the problems requiring statistics have been low dimensional. These days p can be very large, and n is often limited due to cost, availability or other considerations.

In general a dataset is high-dimensional if p > n or if p is slightly smaller than n. Classical approaches such as least squares linear regression are not appropriate here.

So what exactly goes wrong in high dimensional settings?

We will discuss the problems under the context of linear regression, though the ideas hold true for all classical regression techniques (linear OLS, logistic regression, LDA, ect).

When p is larger, or almost larger than n, the least squares approach will yield a set of coefficient estimates that result in a perfect fit to the data, regardless of whether there is truly a relationship present. The problem is that a perfect fit will almost always lead to over fitting the data. The problem is simply that the least squares regression is _too flexible__ and hence over fits the data.

In this scenario, we can actually attain perfect fits to data even when the features are completely unrelated to the response.

Unfortunately, the model fitting parameters R^2, C_p, AIC, and BIC approaches are also not effective in a high dimensional setting, even with cross validation, because estimating variance can be problematic.

However, it turns out that many of the methods we have discussed in this notebook for fitting less flexible least squares models (forward stepwise, ridge, lasso, and PCA) are particularly useful for performing regression in high dimensional settings. Essentially, these approaches avoid over fitting by using a less flexible fitting approach than ordinary least squares.

The general rule for additional dimensions in data is that additional features are only useful if they are truly associated with the response. Otherwise, the addition of noise features will lead to increased test set error and reduce the chance of over fitting.

Interpreting Results in High Dimensions

We must always be cautious about the way we report the obtained model results, especially in high dimensional settings. In this setting, the multicollinearity problem is extreme, since any variable in the model can be rewritten as a linear combination of all the other variables in the model. Essentially, we can never know exactly which variables (if any) truly are predictive of the outcome, and we can never identify the best coefficients. In general we should be careful not to overstate the results obtained. We can make it clear the results found were simply one of many possible models for predicting the response, and that it must be validated on independent data sets.

It is also important to be careful when reporting errors and measure of model fit in the high dim. setting. We have seen that we can obtain useless models that have zero residuals when p > n, thus we should never use SSE, p-values, R^2, or other traditional measures of fit on the training data. Thus it is imperative to report error and prediction results using a test set, or cross validation errors.

Comparison of Selection and Shrinking Methods.

Code Examples

Subset Selection Methods

Best Subset Selection

Here we apply best subset selection to the Hitters dataset from the ISLR package. We want to predict a baseball player’s Salary based on various statistics from the previous year.

library(ISLR)
attach(Hitters)
names(Hitters)

##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"      
##  [6] "Walks"     "Years"     "CAtBat"    "CHits"     "CHmRun"   
## [11] "CRuns"     "CRBI"      "CWalks"    "League"    "Division" 
## [16] "PutOuts"   "Assists"   "Errors"    "Salary"    "NewLeague"

dim(Hitters)

## [1] 322  20

str(Hitters)

## 'data.frame':    322 obs. of  20 variables:
##  $ AtBat    : int  293 315 479 496 321 594 185 298 323 401 ...
##  $ Hits     : int  66 81 130 141 87 169 37 73 81 92 ...
##  $ HmRun    : int  1 7 18 20 10 4 1 0 6 17 ...
##  $ Runs     : int  30 24 66 65 39 74 23 24 26 49 ...
##  $ RBI      : int  29 38 72 78 42 51 8 24 32 66 ...
##  $ Walks    : int  14 39 76 37 30 35 21 7 8 65 ...
##  $ Years    : int  1 14 3 11 2 11 2 3 2 13 ...
##  $ CAtBat   : int  293 3449 1624 5628 396 4408 214 509 341 5206 ...
##  $ CHits    : int  66 835 457 1575 101 1133 42 108 86 1332 ...
##  $ CHmRun   : int  1 69 63 225 12 19 1 0 6 253 ...
##  $ CRuns    : int  30 321 224 828 48 501 30 41 32 784 ...
##  $ CRBI     : int  29 414 266 838 46 336 9 37 34 890 ...
##  $ CWalks   : int  14 375 263 354 33 194 24 12 8 866 ...
##  $ League   : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
##  $ PutOuts  : int  446 632 880 200 805 282 76 121 143 0 ...
##  $ Assists  : int  33 43 82 11 40 421 127 283 290 0 ...
##  $ Errors   : int  20 10 14 3 4 25 7 9 19 0 ...
##  $ Salary   : num  NA 475 480 500 91.5 750 70 100 75 1100 ...
##  $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...

# Check for NA data
sum(is.na(Hitters$Salary))/length(Hitters[,1])*100

## [1] 18.32

Turns out that about 18% of the data is missing. For the purpose of this lesson we will just omit missing data.

Hitters <- na.omit(Hitters)
dim(Hitters)

## [1] 263  20

The regsubsets() function of the leaps library will perform out best subset selection, where best is quantified using RSS.

library(leaps)

regfit <- regsubsets(Salary ~ ., Hitters)
summary(regfit)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "  
## 8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"  
##          CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 ) "*"  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 ) " "  " "    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 ) " "  "*"    " "     "*"       "*"     " "     " "    " "

The asterisk indicates that a variables is included in the corresponding model. For example, the best two-variable model contains only the Hits and CRBI. By default regsubsets() only reports up to the best eight-variable model. We can adjust this with the nvmax parameter.

regfit <- regsubsets(Salary ~ ., data=Hitters, nvmax = 19)
summary(regfit)$rsq

##  [1] 0.3215 0.4252 0.4514 0.4754 0.4908 0.5087 0.5141 0.5286 0.5346 0.5405
## [11] 0.5426 0.5436 0.5445 0.5452 0.5455 0.5458 0.5460 0.5461 0.5461

In this 19 variable model, the R^2 increases monotonically as more vaiables are included.

We can use the built in plot functions to plot the RSS, adj. R^2, C_p, AIC and BIC.

par(mfrow=c(2,2))
plot(regfit, scale = 'r2')
plot(regfit, scale = 'adjr2')
plot(regfit, scale = 'Cp')
plot(regfit, scale = 'bic')

plot of chunk unnamed-chunk-5

Note: recall, the measures of fit shown above are (besides R^2) all estimates of test error.

Forward and Backwards Stepwise Selection

We can also use regsubsets() here by specifying the paramter method with either backwards or forwards.

regfit.fwd <- regsubsets(Salary ~.,data=Hitters,nvmax=19, 
                      method = "forward")
summary(regfit.fwd)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
##           AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"  
## 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"  
## 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"  
##           CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 )  "*"  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 9  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 10  ( 1 ) "*"  "*"    " "     "*"       "*"     "*"     " "    " "       
## 11  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 12  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 13  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 14  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 15  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 16  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 17  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 18  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 19  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"

regfit.bwd <- regsubsets(Salary ~.,data=Hitters,nvmax=19,
                      method ="backward")
summary(regfit.bwd)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: backward
##           AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    "*"  
## 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"  
## 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"  
## 4  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"  
## 5  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"  
## 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"  
## 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"  
##           CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 )  " "  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 )  " "  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 )  " "  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 )  " "  " "    " "     " "       "*"     " "     " "    " "       
## 5  ( 1 )  " "  " "    " "     " "       "*"     " "     " "    " "       
## 6  ( 1 )  " "  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 )  " "  "*"    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 9  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 10  ( 1 ) "*"  "*"    " "     "*"       "*"     "*"     " "    " "       
## 11  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 12  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 13  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 14  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 15  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 16  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 17  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 18  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 19  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"

We can see here that 1 - 6 variable models are identical for best subset and forward selection. However, the best 7 + variable models are different for all three techniques.

Note: To select the best value of nvmax, we should cross use cross validation.

Ridge Regression and Lasso

Forewarning: ridge and lasso regression are not well explained using the caret package, since it handles a lot of the action automatically.

Start Cross-Validation Methods

We will be applying cross-validation methods within the Regularization methods as well, rather than isolating them to a single section.

Validation Set

Instead of using adj. R^2 Cp and BIC to estimate test error rates, we can use cross-validation approaches. In order for this to work we must only use the training observations to perform all aspects of model fitting and variable selection. The test errors are then computed by applying the training model to the test or validation data. We can split the data into training and testing sets using the caret package.

library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

split <- createDataPartition(y=Hitters$Salary, p = 0.5, list = FALSE)
train <- Hitters[split,]
test <- Hitters[-split,]

set.seed(825) # for reproducing these results

ridge <- train(Salary ~., data = train,
               method='ridge',
               lambda = 4,
               preProcess=c('scale', 'center'))

## Loading required package: elasticnet
## Loading required package: lars
## Loaded lars 1.2

ridge

## Ridge Regression 
## 
## 133 samples
##  19 predictors
## 
## Pre-processing: scaled, centered 
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 133, 133, 133, 133, 133, 133, ... 
## 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE  Rsquared  RMSE SD  Rsquared SD
##   0       400   0.4       40       0.09       
##   1e-04   400   0.4       40       0.09       
##   0.1     300   0.5       40       0.09       
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was lambda = 0.1.

ridge.pred <- predict(ridge, test)

mean(ridge.pred - test$Salary)^2

## [1] 30.1

k-folds

Use k-folds to select the best lambda. (Even though caret uses bootstrap in the example above by default).

For cross-validation, we will split the data into testing and training data

set.seed(825)
fitControl <- trainControl(method = "cv",
                            number = 10)
# Set seq of lambda to test
lambdaGrid <- expand.grid(lambda = 10^seq(10, -2, length=100))
                          
ridge <- train(Salary~., data = train,
              method='ridge',
              trControl = fitControl,
#                tuneGrid = lambdaGrid
              preProcess=c('center', 'scale')
            )

ridge

## Ridge Regression 
## 
## 133 samples
##  19 predictors
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 120, 120, 119, 120, 120, 119, ... 
## 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE  Rsquared  RMSE SD  Rsquared SD
##   0       300   0.6       70       0.1        
##   1e-04   300   0.6       70       0.1        
##   0.1     300   0.6       70       0.1        
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was lambda = 1e-04.

# Compute coeff
predict(ridge$finalModel, type='coef', mode='norm')$coefficients[19,]

##      AtBat       Hits      HmRun       Runs        RBI      Walks 
##   -157.221    313.860    -18.996      0.000    -70.392    171.242 
##      Years     CAtBat      CHits     CHmRun      CRuns       CRBI 
##    -27.543      0.000      0.000     51.811    202.537    187.933 
##     CWalks    LeagueN  DivisionW    PutOuts    Assists     Errors 
##   -224.951     12.839    -38.595     -9.128     13.288    -18.620 
## NewLeagueN 
##     22.326

ridge.pred <- predict(ridge, test)
sqrt(mean(ridge.pred - test$Salary)^2)

## [1] 17.53

So the average error in salary is ~ 33 thousand. You’ll notice that the regression coefficients dont really seem like they have shifted towards zero, but that is because we are standardizing the data first.

We should now check to see if this is actually any better than a regular lm() model.

lmfit <- train(Salary ~., data = train,
               method='lm',
               trControl = fitControl,
               preProc=c('scale', 'center'))
lmfit

## Linear Regression 
## 
## 133 samples
##  19 predictors
## 
## Pre-processing: scaled, centered 
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 120, 120, 121, 119, 119, 119, ... 
## 
## Resampling results
## 
##   RMSE  Rsquared  RMSE SD  Rsquared SD
##   300   0.5       70       0.2        
## 
##

coef(lmfit$finalModel)

## (Intercept)       AtBat        Hits       HmRun        Runs         RBI 
##     535.958    -327.835     591.667      73.964    -169.699    -162.024 
##       Walks       Years      CAtBat       CHits      CHmRun       CRuns 
##     234.093     -60.557     125.017    -529.709     -45.888     680.654 
##        CRBI      CWalks     LeagueN   DivisionW     PutOuts     Assists 
##     393.276    -399.506      19.118     -46.679      -4.898      41.271 
##      Errors  NewLeagueN 
##     -22.672      22.469

lmfit.pred <- predict(lmfit, test)
sqrt(mean(lmfit.pred - test$Salary)^2)

## [1] 17.62

As we can see this ridge regression fit certainly has lower RMSE and higher R^2. We can also see that the ridge regression has indeed shrunk the coefficients, some of them extremely close to zero.

The Lasso

lasso <- train(Salary ~., train,
               method='lasso',
               preProc=c('scale','center'),
              
               trControl=fitControl)
lasso

## The lasso 
## 
## 133 samples
##  19 predictors
## 
## Pre-processing: scaled, centered 
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 120, 121, 120, 120, 120, 119, ... 
## 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE  Rsquared  RMSE SD  Rsquared SD
##   0.1       300   0.6       70       0.2        
##   0.5       300   0.6       60       0.2        
##   0.9       300   0.6       70       0.2        
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was fraction = 0.5.

# Get coef
predict.enet(lasso$finalModel, type='coefficients', s=lasso$bestTune$fraction, mode='fraction')

## $s
## [1] 0.5
## 
## $fraction
##   0 
## 0.5 
## 
## $mode
## [1] "fraction"
## 
## $coefficients
##      AtBat       Hits      HmRun       Runs        RBI      Walks 
##   -227.113    406.285      0.000    -48.612    -93.740    197.472 
##      Years     CAtBat      CHits     CHmRun      CRuns       CRBI 
##    -47.952      0.000      0.000     82.291    274.745    166.617 
##     CWalks    LeagueN  DivisionW    PutOuts    Assists     Errors 
##   -287.549     18.059    -41.697     -7.001     30.768    -26.407 
## NewLeagueN 
##     19.190

lasso.pred <- predict(lasso, test)
sqrt(mean(lasso.pred - test$Salary)^2)

## [1] 14.35

Here in the lasso we see that many of the coefficients have been forced to zero. This presents a simplicity advantage over ridge and linear regression models, even though the RMSE is a bit higher than the ridge regression.

PCR and PLS

Principal Components Regression

We will show PCR using both the pls package, as well as the caret package.

library(pls)

## 
## Attaching package: 'pls'
## 
## The following object is masked from 'package:caret':
## 
##     R2
## 
## The following object is masked from 'package:stats':
## 
##     loadings

set.seed(2)

#defaults to 10 folds cross validation
pcr.fit <- pcr(Salary ~., data=train, scale=TRUE, validation="CV", ncomp=19)
summary(pcr.fit)

## Data:    X dimension: 133 19 
##  Y dimension: 133 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           451.5    336.9    323.9    328.5    328.4    329.9    337.1
## adjCV        451.5    336.3    323.6    327.8    327.5    328.8    335.7
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       335.2    333.7    338.5     334.3     337.8     340.4     346.7
## adjCV    332.5    331.7    336.4     332.0     335.5     337.6     343.4
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        345.1     345.7     329.4     337.3     343.5     338.7
## adjCV     341.2     341.6     325.7     332.7     338.4     333.9
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         36.55    60.81    71.75    80.59    85.72    89.76    92.74
## Salary    45.62    50.01    51.19    51.98    53.23    53.36    55.63
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         95.37    96.49     97.45     98.09     98.73     99.21     99.52
## Salary    56.48    56.73     58.57     58.92     59.34     59.44     62.01
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.77     99.90     99.97     99.99    100.00
## Salary     62.65     65.29     66.48     66.77     67.37

validationplot(pcr.fit, val.type='MSEP')

plot of chunk unnamed-chunk-12

This algorithm reports the CV scores as RMSE, and R^2 of the training data. However, we can see from plotting the MSE against the number of components that we achieve the lowest MSE at about 3 components. This suggests a large improvement over a least squares approach because we are able to explain much of the variance using only 3 components, rather than 19.

Let’s see how this performs on the test dataset.

pcr.pred <- predict(pcr.fit, test, ncomp=3)
sqrt(mean((pcr.pred - test$Salary)^2))

## [1] 374.8

This is comparable, but a bit lower than the RMSE of the ridge/ lasso/linear regression.

Using the caret package

pcr.fit <- train(Salary ~., data=train,
                 preProc = c('center', 'scale'),
                 method='pcr',
                 trControl=fitControl)
pcr.fit

## Principal Component Analysis 
## 
## 133 samples
##  19 predictors
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 121, 120, 118, 119, 120, 120, ... 
## 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE  Rsquared  RMSE SD  Rsquared SD
##   1      300   0.5       100      0.2        
##   2      300   0.5       100      0.2        
##   3      300   0.6       100      0.2        
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was ncomp = 3.

The caret package, using bootstrapping and 10 fold cv choses the best model @ 2 components

pcr.pred <- predict(pcr.fit, test)
sqrt(mean(pcr.pred - test$Salary)^2)

## [1] 21.86

The results are comparable to lasso regression. However, PCR results are not easily interpretable.

Partial Least Squares

pls.fit <- plsr(Salary~., data=train, scale=TRUE, validation="CV")
summary(pls.fit)

## Data:    X dimension: 133 19 
##  Y dimension: 133 1
## Fit method: kernelpls
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           451.5    328.9    328.4    332.6    329.2    325.4    323.4
## adjCV        451.5    328.2    327.4    330.6    326.9    323.0    320.9
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       318.7    318.7    316.3     317.6     316.5     317.0     319.2
## adjCV    316.2    315.5    313.5     314.9     313.6     313.9     315.9
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        323.0     323.8     325.4     324.5     323.6     321.4
## adjCV     319.3     320.1     321.4     320.5     319.9     317.8
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         35.94    55.11    67.37    74.29    79.65    85.17    89.17
## Salary    51.56    54.90    57.72    59.78    61.50    62.94    63.96
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         90.55    93.49     95.82     97.05     97.67     98.45     98.67
## Salary    65.34    65.75     66.03     66.44     66.69     66.77     66.94
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.02     99.26     99.42     99.98    100.00
## Salary     67.02     67.11     67.24     67.26     67.37

validationplot(pls.fit, val.type='MSEP')

plot of chunk unnamed-chunk-16

Here the best M (number of components) is 2. Now we evaluate the corresponding test error.

pls.pred <- predict(pls.fit, test, ncomp=2)
sqrt(mean(pls.pred - test$Salary)^2)

## [1] 14.34

Here we see a mild improvment in RMSE compared to PCR. This is probably due to the fact that the component directions are estimated based on the predictors and and response.

Linear Model Selection & Regularization

Ryan Kelly

July 4, 2014

Data Sources:

Overview and Definitions

Methods in Detail

Subset Selection

Choosing the Best Model

C_p, AIC, BIC, and Adjusted R^2

Validation and Cross-Validation

Shrinkage Methods

Ridge Regression

The Lasso

Elastic Net

Dimension Reduction Methods

Principal Components Regression (PCA)

Partial Least Squares

Considerations for High Dimensions

Interpreting Results in High Dimensions

Comparison of Selection and Shrinking Methods.

Code Examples

Subset Selection Methods

Best Subset Selection

Forward and Backwards Stepwise Selection

Ridge Regression and Lasso

Start Cross-Validation Methods

Validation Set

k-folds

The Lasso

PCR and PLS

Principal Components Regression

Partial Least Squares

Linear Model Selection & Regularization

Ryan Kelly

July 4, 2014

Data Sources:

Overview and Definitions

Methods in Detail

Subset Selection

Choosing the Best Model

Cp, AIC, BIC, and Adjusted R^2

Validation and Cross-Validation

Shrinkage Methods

Ridge Regression

The Lasso

Elastic Net

Dimension Reduction Methods

Principal Components Regression (PCA)

Partial Least Squares

Considerations for High Dimensions

Interpreting Results in High Dimensions

Comparison of Selection and Shrinking Methods.

Code Examples

Subset Selection Methods

Best Subset Selection

Forward and Backwards Stepwise Selection

Ridge Regression and Lasso

Start Cross-Validation Methods

Validation Set

k-folds

The Lasso

PCR and PLS

Principal Components Regression

Partial Least Squares

C_p, AIC, BIC, and Adjusted R^2