Heavily borrowed from:
Textbook: An Introduction to Statistical Learning
Textbook: The Elements of Statistical Learning
UCLA Example link
require(knitr)
## Loading required package: knitr
In this lesson we consider some alternative fitting approaches for linear models, besides the usual ordinary least squares. These alternatives can sometimes provide better prediction accuracy and model interpretability.
Prediction Accuracy: Given that the true relationship between Y and X is approximately linear, the ordinary least squares estimates will have low bias. OLS also behaves well when n >> p, i.e. when the number of observations is much larger than the number of parameters. Yet if n is not much larger than p, there can be a lot of variability in the fit, resulting in overfitting and/or poor predictions. If p > n, then there is no longer a unique least squares estimate, and the method cannot be used at all. By constraining and shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias, which often leads to dramatic improvements in accuracy.
Model Interpretability: Often in multiple regression, many variables are not associated with the response. Irrelevant variables lead to unnecessary complexity in the resulting model. By removing them (setting their coefficients to zero) we obtain a more easily interpretable model. However, OLS is very unlikely to yield coefficient estimates that are exactly zero. Here we explore some approaches for automatically excluding features using this idea.
Subset Selection: This approach identifies a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of features.
Shrinkage: This approach fits a model involving all p predictors; however, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage, also known as regularization, has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero, so some shrinkage methods also perform variable selection.
Dimension Reduction: This approach involves projecting the p predictors into an M-dimensional subspace, where M < p. This is attained by computing M different linear combinations, or projections, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares.
Though we discuss the application of these techniques to linear models, they also apply to other methods like classification.
Most traditional regression techniques are intended for low-dimensional settings in which n >> p. This is in part because, through most of the field's history, the bulk of problems requiring statistics have been low-dimensional. These days p can be very large, while n is often limited due to cost, availability, or other considerations.
In general a dataset is high-dimensional if p > n or if p is slightly smaller than n. Classical approaches such as least squares linear regression are not appropriate here.
So what exactly goes wrong in high dimensional settings?
We will discuss the problems in the context of linear regression, though the ideas hold true for all classical regression techniques (linear OLS, logistic regression, LDA, etc.).
When p is larger than, or almost as large as, n, the least squares approach will yield a set of coefficient estimates that fit the data perfectly, regardless of whether there is truly a relationship present. The problem is that a perfect fit almost always means overfitting: least squares regression is simply too flexible for this setting.
In this scenario, we can actually attain perfect fits to data even when the features are completely unrelated to the response.
Unfortunately, the model fit statistics R², Cp, AIC, and BIC are also not effective in a high-dimensional setting, because estimating the error variance σ² is problematic there.
However, it turns out that many of the methods we discuss in this notebook for fitting less flexible least squares models (forward stepwise selection, ridge regression, the lasso, and principal components regression) are particularly useful for performing regression in high-dimensional settings. Essentially, these approaches avoid overfitting by using a less flexible fitting approach than ordinary least squares.
The general rule for additional dimensions in data is that additional features are only useful if they are truly associated with the response. Otherwise, adding noise features leads to increased test set error by increasing the chance of overfitting.
We must always be cautious about the way we report model results, especially in high-dimensional settings. In this setting the multicollinearity problem is extreme, since any variable in the model can be rewritten as a linear combination of all the other variables in the model. Essentially, we can never know exactly which variables (if any) truly are predictive of the outcome, and we can never identify the best coefficients. In general we should be careful not to overstate the results obtained; we should make it clear that what we found is simply one of many possible models for predicting the response, and that it must be validated on independent data sets.
It is also important to be careful when reporting errors and measures of model fit in the high-dimensional setting. We have seen that we can obtain useless models that have zero residuals when p > n, thus we should never use SSE, p-values, R², or other traditional measures of fit on the training data. It is imperative to report error and prediction results using a test set or cross-validation.
Best Subset Selection
Here we fit a separate OLS regression for each possible combination of the p predictors and then look at the resulting model fits. The problem with this method is that the best model is hidden among \(2^p\) possibilities. The algorithm is broken up into two stages. (1) For each k = 1, ..., p, fit all models that contain exactly k predictors and keep the best (lowest RSS) model of each size. (2) Select a single model from these candidates using cross-validated prediction error. More specific prediction error methods like AIC and BIC are discussed below. It is important to use testing or validation error, not training error, to assess model fit, because RSS (residual sum of squares) monotonically decreases and R² monotonically increases as more variables are added. The best approach is to cross-validate and choose the model with the highest R² and lowest RSS on the testing error estimates.
This works on other types of model selection, such as logistic regression, except that the score that we select upon changes. For logistic regression we would utilize deviance instead of RSS and R².
Next we discuss methods that are more computationally efficient.
Stepwise Selection
Besides computational issues, the best subset procedure also can suffer from statistical problems when p is large, since we have a greater chance of overfitting.
Forward Stepwise Selection considers a much smaller set of models. It begins with a model containing no predictors, then adds predictors one at a time until all of the predictors are in the model. At each step, the variable added is the one that gives the greatest additional improvement to the fit, and we stop adding when no variable improves the fit as measured by cross-validated prediction error. A best subset search for p = 20 would have to fit 1,048,576 models, whereas forward stepwise only requires fitting 211 potential models (as verified below). However, this method is not guaranteed to find the best possible model. Forward stepwise regression can even be applied in the high-dimensional setting where p > n.
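As a quick sanity check on those counts: best subset must consider every subset of the p predictors, while forward stepwise fits the null model plus p candidate models in the first step, p − 1 in the second, and so on, i.e. \(1 + p(p+1)/2\) models in total.
2^20              # all possible subsets of p = 20 predictors
## [1] 1048576
1 + 20 * 21 / 2   # models fit by forward stepwise selection
## [1] 211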
Backward Stepwise Selection begins with all p predictors in the model, then iteratively removes the least useful predictor, one at a time. It requires that n > p.
Hybrid Methods follow the forward stepwise approach; however, after adding each new variable, the method may also remove any variables that no longer contribute to the model fit.
Each of the three above-mentioned algorithms requires us to manually decide which model performs best. As mentioned before, the models with the most predictors will usually have the smallest RSS and largest R² on training error. To select the model with the best test error, we need to estimate the test error. There are two ways to estimate it:
Indirectly estimate the test error by making an adjustment to the training error to account for the overfitting bias.
Directly estimate the test error, using either a validation set, or cross-validation approach.
There are four common approaches to adjusting the training error in order to estimate the test error: Cp, AIC, BIC, and adjusted R².
Essentially, the Cp statistic adds a penalty of \(2 d \hat{\sigma}^2\) to the training RSS, since the training error tends to underestimate the test error. Here d is the number of predictors and \(\hat{\sigma}^2\) is an estimate of the variance of the error associated with each response measurement.
AIC criterion is defined for a large class of models fit by maximum likelihood (ML). In the case of a Gaussian model, ML and OLS are equivalent. Thus for OLS models, AIC and Cp are proportional to each other and only differ in that AIC has an additive constant term.
BIC is derived from a Bayesian point of view, but looks similar to AIC and Cp. For an OLS model with d predictors, the BIC replaces the \(2 d \hat{\sigma}^2\) of Cp with \(\log(n)\, d\, \hat{\sigma}^2\), where n is the number of observations. Since log n > 2 for n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and results in smaller models.
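Written out, using the formulation from An Introduction to Statistical Learning (with d predictors and \(\hat{\sigma}^2\) an estimate of the error variance):
\(C_p = \frac{1}{n}\left(RSS + 2 d \hat{\sigma}^2\right)\)
\(AIC = \frac{1}{n \hat{\sigma}^2}\left(RSS + 2 d \hat{\sigma}^2\right)\)
\(BIC = \frac{1}{n}\left(RSS + \log(n)\, d\, \hat{\sigma}^2\right)\)
For each of these statistics, a small value indicates a low estimated test error.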
The adjusted R² adds a penalty term for additional variables being added to the model.
\(\text{Adjusted } R^2 = 1 - \frac{RSS/(n - d - 1)}{TSS/(n - 1)}\)
Unlike the earlier statistics, a large value of adjusted R² indicates a model with small test error. In theory, the model with the largest adjusted R² will have only correct variables and no noise variables: unlike R², adjusted R² is penalized for the inclusion of unnecessary variables in the model. Despite its popularity, and even though it is quite intuitive, adjusted R² is not as well motivated in statistical theory as AIC, BIC, and Cp.
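As a quick numeric check of the formula (a minimal sketch; it assumes the ISLR package and uses an arbitrary three-predictor model on the Hitters data introduced below):
library(ISLR)
fit <- lm(Salary ~ AtBat + Hits + Walks, data = na.omit(Hitters))
n   <- nobs(fit)
d   <- length(coef(fit)) - 1               # number of predictors
rss <- sum(residuals(fit)^2)
y   <- na.omit(Hitters)$Salary
tss <- sum((y - mean(y))^2)
c(manual  = 1 - (rss / (n - d - 1)) / (tss / (n - 1)),
  builtin = summary(fit)$adj.r.squared)    # the two values should match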
All of these statistics have rigorous theoretical justifications, but they still all rely (to some degree) on asymptotic arguments, such as large n.
Here we apply best subset selection to the Hitters dataset from the ISLR package. We want to predict a baseball player’s Salary based on various statistics from the previous year.
library(ISLR)
attach(Hitters)
names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI"
## [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun"
## [11] "CRuns" "CRBI" "CWalks" "League" "Division"
## [16] "PutOuts" "Assists" "Errors" "Salary" "NewLeague"
dim(Hitters)
## [1] 322 20
str(Hitters)
## 'data.frame': 322 obs. of 20 variables:
## $ AtBat : int 293 315 479 496 321 594 185 298 323 401 ...
## $ Hits : int 66 81 130 141 87 169 37 73 81 92 ...
## $ HmRun : int 1 7 18 20 10 4 1 0 6 17 ...
## $ Runs : int 30 24 66 65 39 74 23 24 26 49 ...
## $ RBI : int 29 38 72 78 42 51 8 24 32 66 ...
## $ Walks : int 14 39 76 37 30 35 21 7 8 65 ...
## $ Years : int 1 14 3 11 2 11 2 3 2 13 ...
## $ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
## $ CHits : int 66 835 457 1575 101 1133 42 108 86 1332 ...
## $ CHmRun : int 1 69 63 225 12 19 1 0 6 253 ...
## $ CRuns : int 30 321 224 828 48 501 30 41 32 784 ...
## $ CRBI : int 29 414 266 838 46 336 9 37 34 890 ...
## $ CWalks : int 14 375 263 354 33 194 24 12 8 866 ...
## $ League : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
## $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
## $ PutOuts : int 446 632 880 200 805 282 76 121 143 0 ...
## $ Assists : int 33 43 82 11 40 421 127 283 290 0 ...
## $ Errors : int 20 10 14 3 4 25 7 9 19 0 ...
## $ Salary : num NA 475 480 500 91.5 750 70 100 75 1100 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
# Check for NA data
sum(is.na(Hitters$Salary))/length(Hitters[ ,1])*100
## [1] 18.32298
It turns out that about 18% of the Salary values are missing. For the purpose of this lesson we will simply omit the rows with missing data.
Hitters <- na.omit(Hitters)
dim(Hitters)
## [1] 263 20
The regsubsets() function of the leaps library will perform our best subset selection, where best is quantified using RSS.
library(leaps)
regfit <- regsubsets(Salary ~ ., Hitters)
summary(regfit)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " " " "*" "*" " " " " " "
## 8 ( 1 ) " " "*" " " "*" "*" " " " " " "
An asterisk indicates that a variable is included in the corresponding model. For example, the best two-variable model contains only Hits and CRBI. By default regsubsets() only reports up to the best eight-variable model. We can adjust this with the nvmax parameter.
regfit <- regsubsets(Salary ~ ., data=Hitters, nvmax = 19)
summary(regfit)$rsq
## [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
## [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
In this 19-variable model, the R² increases monotonically as more variables are included. This always happens on the training data, since adding a variable can never increase the RSS.
We can use the built-in plot function to plot R², adjusted R², Cp, and BIC. Each row in these graphs represents a model; the shaded rectangles in the columns indicate the variables included in the given model. The darkness of the shading simply represents the ordering of the given statistic’s values.
par(mfrow=c(2,2))
plot(regfit, scale = 'r2')
plot(regfit, scale = 'adjr2')
plot(regfit, scale = 'Cp')
plot(regfit, scale = 'bic')
Note: recall, the measures of fit shown above are (besides R²) all estimates of test error.
We can also use regsubsets() to perform stepwise selection, by setting the method parameter to either "backward" or "forward".
regfit.fwd <- regsubsets(Salary ~.,data=Hitters,nvmax=19,
method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
regfit.bwd <- regsubsets(Salary ~.,data=Hitters,nvmax=19,
method ="backward")
summary(regfit.bwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: backward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " " " " " "*" " " " " " "
## 5 ( 1 ) " " " " " " " " "*" " " " " " "
## 6 ( 1 ) " " " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
We can see here that the 1- to 6-variable models are identical for best subset and forward stepwise selection. However, the best seven-variable models are different for all three techniques.
Note: to select the best value of nvmax, we should use cross-validation.
Resampling methods have become an integral tool in modern statistics. Resampling is based on repeatedly drawing samples from a training set of observations and refitting a model on each sample, in order to obtain additional insight into that model. For example, to examine the robustness and variability of a linear regression fit, we can fit a linear regression to each new sample and examine the differences in the results. The goal of any resampling measure is to better estimate how a statistical model will perform on out-of-sample, ‘real-life’ data.
In general, cross-validation techniques give more direct estimates of the test error and make fewer assumptions about the underlying model. Further, they can be used with a wider selection of model types.
Test error rate is the average error from using a statistical method to predict the response on a new observation.
Training error rate is simply the average error from a statistical method that uses predictions based on the data that was used to fit the model in the first place.
We always prefer to use the test error rate to measure model performance, since the training error rate can dramatically over-estimate real-world performance. Sometimes, simply splitting the data into testing and training datasets is not recommended due to the small size of the sample.
Cross-validation estimates the test error rate by holding out a subset of the training observations from the fitting process, and then applying the statistical method to those held out observations. This process can be repeated several times to enable more data to partake in the training of the model, one subset at a time.
Prior to explaining cross-validation, we describe the more basic validation set approach and its drawbacks. In this method, we randomly divide the available data into two parts, a training set and a validation set. The model is fit on the training set, then the fitted model is used to predict the responses for the observations in the validation set. Typically, the performance of the models is measured using MSE.
To attain a more robust measure of fit, we can repeatedly randomly sample the data, compute the MSE each time, and compare the results. The validation approach is simple and easy to implement, but it has two potential problems: (1) the validation estimate of the test error can be highly variable, depending on exactly which observations fall into the training and validation sets; and (2) since only a subset of the observations is used to fit the model, the validation error tends to over-estimate the test error for a model fit on the entire data set.
K-fold cross-validation randomly divides the set of observations into k groups, or folds, of approximately equal size. Each fold serves once as the validation set, with the model trained on the remaining folds, so the validation sets do not overlap.
For example, a 5-fold CV would resample the training and validation sets in a non-overlapping fashion 5 times, such that every observation is eventually held out exactly once. The test error is then estimated by averaging the five resulting MSE estimates.
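For instance, here is a minimal base R sketch of 5-fold CV (the fold assignment and the small lm formula are arbitrary choices for illustration; it assumes the cleaned Hitters data from above):
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(Hitters)))   # random fold labels
cv.mse <- numeric(k)
for (j in 1:k) {
  fit  <- lm(Salary ~ AtBat + Hits + Walks, data = Hitters[folds != j, ])
  pred <- predict(fit, newdata = Hitters[folds == j, ])
  cv.mse[j] <- mean((pred - Hitters$Salary[folds == j])^2)  # held-out MSE
}
mean(cv.mse)   # 5-fold CV estimate of the test MSE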
The difference between 5-fold, 10-fold, or other choices of k in k-fold CV comes down to the bias-variance trade-off.
Cross-validation can also be used in a qualitative (classification) setting. With qualitative data, we quantify test error using the number of misclassified observations, rather than the MSE. The estimated test error rate from k-fold CV is much more robust than the training error rate when deciding, say, what order of polynomial to keep, or what value of K we should use in a KNN classifier. This is because the training error rate typically keeps decaying as the method becomes more flexible, which eventually means over-fitting the model.
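For example, a minimal sketch of using 10-fold CV to choose K for a KNN classifier with caret (the iris data and the grid of K values here are arbitrary stand-ins for illustration):
library(caret)
set.seed(1)
knn.fit <- train(Species ~ ., data = iris,
                 method = "knn",
                 trControl = trainControl(method = "cv", number = 10),
                 tuneGrid = data.frame(k = seq(1, 21, by = 2)))
knn.fit$bestTune   # the K with the best cross-validated accuracy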
The subset selection methods described above used least squares fits on subsets of the predictors to choose the best model and estimate test error. Here, we discuss an alternative where we fit a model containing all p predictors, using a technique that constrains or regularizes the coefficient estimates, or equivalently, shrinks the coefficient estimates towards zero. Shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the coefficient estimates towards zero are ridge regression and the lasso.
Ridge regression is similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. Like OLS, ridge regression seeks coefficient estimates that reduce the RSS; however, it adds a shrinkage penalty that grows with the magnitude of the coefficients. This penalty has the effect of shrinking the coefficient estimates towards zero. A tuning parameter, λ, controls the impact of the shrinking; λ = 0 behaves exactly like OLS regression. Of course, selecting a good value for λ is critical, and it should be chosen using cross-validation techniques. A requirement of ridge regression is that the predictors X have been centered to have a mean of 0, so the data must be standardized beforehand.
Note that the shrinkage does not apply to the intercept.
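Concretely, in the standard formulation, ridge regression chooses the coefficient estimates minimizing
\(\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 = RSS + \lambda \sum_{j=1}^{p}\beta_j^2\)
The penalty sums over \(\beta_1, \dots, \beta_p\) only, which is why the intercept is not shrunk.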
Why is ridge regression better than least squares?
The advantage is apparent in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases. This leads to decreased variance, but increased bias. Plain OLS regression sits at one fixed point of this trade-off: low bias, but potentially high variance. However, the lowest test MSE tends to occur where the drop in variance balances the rise in bias. Thus, by properly tuning λ and trading a small amount of bias for a larger reduction in variance, we can find a lower test MSE.
Ridge regression works best in situations where the least squares estimates have high variance. Ridge regression is also much more computationally efficient than any subset method, since it is possible to solve for all values of λ essentially simultaneously.
Ridge regression has at least one disadvantage: it includes all p predictors in the final model. The penalty term will set many of the coefficients close to zero, but never exactly to zero. This isn’t generally a problem for prediction accuracy, but it can make the model more difficult to interpret. The lasso overcomes this disadvantage and is capable of forcing some of the coefficients to zero, granted that λ is large enough. Thus, the lasso also performs variable selection.
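The lasso’s objective differs only in the penalty, an \(\ell_1\) norm in place of ridge’s \(\ell_2\) norm:
\(RSS + \lambda \sum_{j=1}^{p}|\beta_j|\)
It is this \(\ell_1\) penalty that can force some coefficient estimates to be exactly zero for sufficiently large λ.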
There is no dominant algorithm here; in general it is best to test all three techniques introduced so far and choose the one that best suits the data, using cross-validated test error estimates.
Instead of using adjusted R², Cp, and BIC to estimate test error rates, we can use cross-validation approaches. In order for this to work, we must use only the training observations to perform all aspects of model fitting and variable selection. The test errors are then computed by applying the fitted model to the test or validation data. We can split the data into training and testing sets using the caret package.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1234)
split <- createDataPartition(y = Hitters$Salary, p = 0.5, list = FALSE)
train <- Hitters[split,]
test <- Hitters[-split,]
set.seed(123) # for reproducing these results
ridge <- train(Salary ~., data = train,
method='ridge',
lambda = 4, # note: the resampling output below shows caret still tuned over its default lambda grid
preProcess=c('scale', 'center'))
## Loading required package: elasticnet
## Loading required package: lars
## Loaded lars 1.2
ridge
## Ridge Regression
##
## 133 samples
## 19 predictors
##
## Pre-processing: scaled (19), centered (19)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 133, 133, 133, 133, 133, 133, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared RMSE SD Rsquared SD
## 0e+00 369.5398 0.3878097 48.75211 0.1235791
## 1e-04 367.5711 0.3914902 48.77035 0.1223455
## 1e-01 343.9462 0.4463620 45.99850 0.1007578
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
ridge.pred <- predict(ridge, test)
# Caution: the parentheses here compute (mean residual)^2, not the mean squared
# residual; the test RMSE would be sqrt(mean((ridge.pred - test$Salary)^2)).
mean(ridge.pred - test$Salary)^2
## [1] 492.5604
Next we use k-fold cross-validation on the training data to select the best lambda (even though caret used bootstrap resampling by default in the example above).
set.seed(123)
fitControl <- trainControl(method = "cv",
number = 10)
# Define a sequence of lambda values to test (left commented out in the call below,
# so caret's default three-value grid is used)
lambdaGrid <- expand.grid(lambda = 10^seq(10, -2, length=100))
ridge <- train(Salary~., data = train,
method='ridge',
trControl = fitControl,
# tuneGrid = lambdaGrid, # uncomment to search the custom grid (note the trailing comma)
preProcess=c('center', 'scale')
)
ridge
## Ridge Regression
##
## 133 samples
## 19 predictors
##
## Pre-processing: centered (19), scaled (19)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 118, 118, 121, 120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared RMSE SD Rsquared SD
## 0e+00 310.9253 0.4955675 70.65403 0.1990873
## 1e-04 310.4766 0.4966985 70.59140 0.1984634
## 1e-01 303.2478 0.5321637 75.70436 0.1646462
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
# Compute coeff
predict(ridge$finalModel, type='coef', mode='norm')$coefficients[19,]
## AtBat Hits HmRun Runs RBI Walks
## 0.000000 93.203820 -27.122046 -20.841875 10.588744 110.872392
## Years CAtBat CHits CHmRun CRuns CRBI
## -16.092720 28.095612 83.070969 41.076537 72.119073 67.348985
## CWalks LeagueN DivisionW PutOuts Assists Errors
## -42.508339 21.016185 -25.150519 46.020431 -13.878715 1.286104
## NewLeagueN
## 10.441192
ridge.pred <- predict(ridge, test)
# As above, sqrt(mean(...)^2) is |mean residual|, not the test RMSE
sqrt(mean(ridge.pred - test$Salary)^2)
## [1] 22.1937
You’ll notice that the regression coefficients don’t really seem to have shifted towards zero, but that is because the data were standardized first, so the coefficients are reported on the standardized scale.
We should now check to see if this is actually any better than a regular lm() model.
lmfit <- train(Salary ~., data = train,
method='lm',
trControl = fitControl,
preProc=c('scale', 'center'))
lmfit
## Linear Regression
##
## 133 samples
## 19 predictors
##
## Pre-processing: scaled (19), centered (19)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 120, 119, 118, 121, 120, ...
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 316.4114 0.4691751 99.22171 0.1931053
##
##
coef(lmfit$finalModel)
## (Intercept) AtBat Hits HmRun Runs RBI
## 518.911805 -242.937109 430.118098 34.425754 -210.737113 -60.878038
## Walks Years CAtBat CHits CHmRun CRuns
## 226.088602 -45.963985 133.208255 -293.157175 -8.227962 495.389105
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 181.778197 -278.938054 14.916906 -30.533021 35.987237 -19.843898
## Errors NewLeagueN
## 15.845189 10.722860
lmfit.pred <- predict(lmfit, test)
# Again |mean residual|, not the test RMSE
sqrt(mean(lmfit.pred - test$Salary)^2)
## [1] 11.71337
Judging by the cross-validated results above, the ridge model actually attains a slightly lower RMSE (root mean squared error, 303.2 vs. 316.4) and a higher R² than the plain linear regression; the test-set quantities just printed are |mean residual| values, so they should not be read as RMSEs. We can also see that the ridge regression has indeed shrunk the coefficients, some of them extremely close to zero, and this is an advantage of the ridge model.
lasso <- train(Salary ~., train,
method='lasso',
preProc=c('scale','center'),
trControl=fitControl)
lasso
## The lasso
##
## 133 samples
## 19 predictors
##
## Pre-processing: scaled (19), centered (19)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 121, 119, 119, 120, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared RMSE SD Rsquared SD
## 0.1 294.4562 0.5859093 100.8605 0.1645902
## 0.5 301.5807 0.5813785 106.0907 0.2036618
## 0.9 313.6748 0.5432861 111.6562 0.2372415
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.1.
# Get coef
predict.enet(lasso$finalModel, type='coefficients', s=lasso$bestTune$fraction, mode='fraction')
## $s
## [1] 0.1
##
## $fraction
## 0
## 0.1
##
## $mode
## [1] "fraction"
##
## $coefficients
## AtBat Hits HmRun Runs RBI Walks
## 0.00000 35.66515 0.00000 0.00000 0.00000 81.03587
## Years CAtBat CHits CHmRun CRuns CRBI
## 0.00000 0.00000 95.06893 0.00000 14.84315 50.35656
## CWalks LeagueN DivisionW PutOuts Assists Errors
## 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
## NewLeagueN
## 0.00000
lasso.pred <- predict(lasso, test)
# As before, |mean residual| rather than the test RMSE
sqrt(mean(lasso.pred - test$Salary)^2)
## [1] 28.08927
Here in the lasso we see that many of the coefficients have been forced to exactly zero. This presents a simplicity advantage over the ridge and linear regression models, and on cross-validated RMSE (294.5 vs. 303.2) the lasso actually edges out the ridge regression.
So far, the methods we have discussed have controlled for variance by either using a subset of the original variables, or by shrinking their coefficients toward zero. Now we explore a class of models that transform the predictors and then fit a least squares model using the transformed variables. Dimension reduction reduces the problem of estimating p + 1 coefficients to the simpler problem of M + 1 coefficients, where M < p. Two approaches for this task are principal component regression and partial least squares.
One can describe PCA as an approach for deriving a low-dimensional set of features from a large set of variables. The first principal component direction of the data is the direction along which the observations vary the most. In other words, the first PC is a line that fits as closely as possible to the data. One can fit up to p distinct principal components. The second PC is a linear combination of the variables that is uncorrelated with the first PC, and has the largest variance subject to this constraint. It turns out that the second PC must be perpendicular to the first PC direction. The idea is that the principal components capture (maximize) the most variance in the data using linear combinations of the data in subsequently orthogonal directions. In this way we can also combine the effects of correlated variables to extract more information from the available data, whereas in regular least squares we would have to discard one of the correlated variables.
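As an illustration, here is a minimal sketch of computing principal components of the Hitters predictors with base R’s prcomp() (using the cleaned data from above; the number of components shown is an arbitrary choice):
x  <- model.matrix(Salary ~ ., data = Hitters)[, -1]   # predictor matrix, intercept dropped
pc <- prcomp(x, scale. = TRUE)                          # standardized PCA
summary(pc)$importance[, 1:5]   # std dev / proportion / cumulative variance of the first 5 PCs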
In regression, we construct M principal components and then use these components as predictors in a linear regression fit by least squares. The idea is that we fit a model with a small number of variables (the principal components) that explain most of the variability in the data, as well as its relationship with the response. In general, we have the potential to fit better models than ordinary least squares, since we can reduce the effect of overfitting. PCR will tend to do best in cases where the first few principal components are sufficient to capture most of the variation in the predictors as well as the relationship with the response.
Note that PCR is not a feature selection method, since each component is a linear combination of all p original features. Thus PCR is more closely related to ridge regression than to the lasso.
The PCR method described above involves identifying linear combinations of X that best represent the predictors. These combinations (directions) are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions. That is, the response Y does not supervise the identification of the principal components, so there is no guarantee that the directions that best explain the predictors are also the best for predicting the response (even though that is often assumed). Partial least squares (PLS) is a supervised alternative to PCR. Like PCR, PLS is a dimension reduction method: it first identifies a new, smaller set of features that are linear combinations of the original features, then fits a linear model via least squares to the new M features. Yet, unlike PCR, PLS makes use of the response variable in order to identify the new features.
PLS does this by placing higher weights on the variables that are most strongly related to the response; see the sketch below. To attain subsequent directions, the method first adjusts each of the variables for the first component, by regressing each variable on the first component and taking the residuals. These residuals can be interpreted as the remaining information that has not been explained by the first PLS direction. We then compute the second component in exactly the same way as the first. This can be iterated M times to identify multiple PLS components.
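A minimal sketch of the first PLS direction, following the description above (it assumes the train split from earlier; each weight is the simple-regression slope of the centered response on a standardized predictor):
X <- scale(model.matrix(Salary ~ ., data = train)[, -1])  # standardized predictors
y <- train$Salary - mean(train$Salary)                    # centered response
phi1 <- drop(crossprod(X, y)) / colSums(X^2)   # slope of y on each X_j
z1 <- X %*% phi1                               # scores of the first PLS component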
In practice PLS often performs no better than ridge regression or PCR. This is because even though PLS can reduce bias, it also has the potential to increase variance, so the overall benefit is not really distinct.
We will show PCR using both the pls package and the caret package.
library(pls)
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:caret':
##
## R2
##
## The following object is masked from 'package:stats':
##
## loadings
set.seed(123)
# validation = "CV" defaults to 10-fold cross-validation
pcr.fit <- pcr(Salary ~., data=train, scale=TRUE, validation="CV", ncomp=19)
summary(pcr.fit)
## Data: X dimension: 133 19
## Y dimension: 133 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 421.7 309.2 312.4 305.8 304.0 305.1 309.2
## adjCV 421.7 308.9 311.9 305.3 303.4 304.6 308.0
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 313.8 319.8 320.6 320.8 318.9 320.0 320.7
## adjCV 312.3 318.1 318.7 319.6 316.7 317.7 318.8
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 326.1 327.6 318.2 317.0 330.1 329.9
## adjCV 323.3 324.7 315.0 313.8 326.2 325.9
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 39.93 59.19 71.46 78.51 83.79 88.40 92.04
## Salary 46.53 47.58 50.14 51.56 51.81 52.94 53.37
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 94.82 96.19 97.14 98.00 98.71 99.11 99.47
## Salary 53.40 53.66 53.84 55.57 56.19 56.36 58.07
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 99.74 99.89 99.96 99.99 100.00
## Salary 58.28 61.32 62.26 62.39 62.71
validationplot(pcr.fit, val.type='MSEP')
The summary reports the cross-validated scores as RMSEP, along with the % variance explained on the training data. Plotting the cross-validated MSEP against the number of components, we see that the lowest value occurs at the fourth component. This suggests a large improvement over the full least squares approach, because we are able to explain much of the variance using only 4 components rather than 19.
Let’s see how this performs on the test dataset.
pcr.pred <- predict(pcr.fit, test, ncomp=4)
sqrt(mean((pcr.pred - test$Salary)^2))
## [1] 389.6798
Note that this value, unlike the test-set numbers reported for the ridge, lasso, and linear models above, is a properly computed test RMSE (the square root of the mean squared residual), so the figures are not directly comparable.
Using the caret package
pcr.fit <- train(Salary ~., data=train,
preProc = c('center', 'scale'),
method='pcr',
trControl=fitControl)
pcr.fit
## Principal Component Analysis
##
## 133 samples
## 19 predictors
##
## Pre-processing: centered (19), scaled (19)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 119, 120, 121, 118, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared RMSE SD Rsquared SD
## 1 290.7372 0.5554968 107.4741 0.2151344
## 2 290.6449 0.5522785 110.7581 0.2211907
## 3 286.7008 0.5711053 111.5541 0.2204429
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
The caret package, using 10-fold CV, chooses the best model with 3 components.
pcr.pred <- predict(pcr.fit, test)
# |mean residual| again, not the test RMSE
sqrt(mean(pcr.pred - test$Salary)^2)
## [1] 29.37353
The results are comparable to lasso regression. However, PCR results are not easily interpretable.
pls.fit <- plsr(Salary~., data=train, scale=TRUE, validation="CV")
summary(pls.fit)
## Data: X dimension: 133 19
## Y dimension: 133 1
## Fit method: kernelpls
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 421.7 309.3 310.6 317.7 324.4 329.0 324.2
## adjCV 421.7 308.9 309.6 315.9 321.9 325.6 321.0
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 323.8 325.0 326.3 325.7 325.3 322.9 324.8
## adjCV 320.7 321.5 322.7 322.1 321.9 319.5 321.2
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 326.3 328.6 329.4 335.5 335.6 336.4
## adjCV 322.6 324.8 325.6 331.2 331.2 332.0
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 39.63 52.82 65.27 75.24 79.47 83.61 86.50
## Salary 49.69 53.44 54.96 56.29 58.55 59.70 60.54
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 88.81 92.47 94.49 96.57 97.56 98.20 98.50
## Salary 61.43 61.69 61.97 62.12 62.31 62.38 62.49
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 99.25 99.56 99.73 99.98 100.00
## Salary 62.52 62.55 62.63 62.67 62.71
validationplot(pls.fit, val.type='MSEP')
Here the best M (number of components) is 1. Now we evaluate the corresponding test error.
pls.pred <- predict(pls.fit, test, ncomp=1)
# |mean residual| once more, rather than the test RMSE
sqrt(mean(pls.pred - test$Salary)^2)
## [1] 23.0655
Here we see a slightly better result than PCR on the same metric. This is probably because the PLS component directions are estimated using both the predictors and the response.