Visit my website for more like this!
Heavily borrowed from:
Textbook: Introduction to statistical learning
Textbook: Elements of statistical learning
UCLA Example link
require(knitr)
## Loading required package: knitr
In this lesson we consider some alternative fitting approaches for linear models, besides the usual ordinary least squares. These alternatives can sometimes provide better prediction accuracy and model interpretability.
This problem is another facet of the curse of dimensionality. When p starts to become large, observations x start to become closer to the boundaries between classes than the nearby observations, which presents major problems for predicting. Further, with many p, the training samples are often sparsely populated, making it difficult to identify trends and predict. In-fact with increasing p increases the first nearest neighbor distance almost linearly. Generally, you could expect a NN distance close to zero for one-dimension, and average about 0.8 once we reach 10-dimensions.
By constraining and shrinking the estimated coefficients, we can often substantially reduce the variance as the cost of a negligible increase in bias, which often leads to dramatic improvements in accuracy.
Model Interpretability: Often in multiple regression, many variables are not associated with the response. Irrelevant variables leads to unnecessary complexity in the resulting model. By removing them (setting coefficient = 0) we obtain a more easily interpretable model. However, using OLS makes it very unlikely that the coefficients will be exactly zero. Here we explore some approach for automatically excluding features using this idea.
Subset Selection: This approach identifies a subset of the p predictors that we believe to be related to the response. We then fit a model using the least squares of the subset features.
Shrinkage. This approach fits a model involving all p predictors, however, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage, AKA regularization has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. Thus this method also performs variable selection.
Dimension Reduction: This approach involves projecting the p predictors into an M-dimensional subspace, where M < p. This is attained by computing M different linear combinations, or projections, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares.
Though we discuss the application of these techniques to linear models, they also apply to other methods like classification.
Best Subset Selection
Here we fit a separate OLS regression for each possible combination of the p predictors and then look at the resulting model fits. The problem with this method is the best model is hidden within 2^p possibilities. The algorithm is broken up into two stages. (1) Fit all models that contain k predictors, where k is the max length of the models. (2) Select a single model using cross-validated prediction error. More specific prediction error methods like AIC and BIC are discussed below. It is important to use testing or validation error, and not training error to assess model fit because RSS and R^2 monotonically increase with more variables. The best approach is to cross validate and choose the model with the highest R^2 and lowest RSS on testing error estimates.
This works on other types of model selection, such as logistic regression, except that the score that we select upon changes. For logistic regression we would utilize deviance instead of RSS and R^2.
Next we discuss methods are more computationally efficient.
Stepwise Selection
Besides computational issues, the best subset procedure also can suffer from statistical problems when p is large, since we have a greater chance of overfitting.
Forward Stepwise Selection considers a much smaller subset of p predictors. It begins with a model containing no predictors, then adds predictors to the model, one at a time until all of the predictors are in the model. The order of the variables being added is the variable, which gives the greatest addition improvement to the fit, until no more variables improve model fit using cross-validated prediction error. A best subset model for p = 20 would have to fit 1 048 576 models, where as forward step wise only requires fitting 211 potential models. However, this method is not guaranteed to find the model. Forward stepwise regression can even be applied in the high-dimensional setting where p > n.
Backward Stepwise Selection begins will all p predictors in the model, then iteratively removes the least useful predictor one at a time. Requires that n > p.
Hybrid Methods follows the forward stepwise approach, however, after adding each new variable, the method may also remove any variables that do not contribute to the model fit.
Each of the three above-mentioned algorithms requires us to manually decide which model performs best. As mentioned before, the models with the most predictors will usually have the smallest RSS and largest R^2 when using training error. To select the model with the best test error, we need to estimate the test error. There are two ways to compute test error.
Indirectly estimate the test error by making and _adjustment) to the training error to account for the over fitting bias.
Directly estimate the test error, using either a validation set, or cross-validation approach.
These four measures are the four common approaches to adjust the training error to estimate test error.
Essentially the Cp statistic adds a penalty of 2 d σ^2 to the training RSS given that the training error trends to underestimate the test error, where d is the number of predictors and σ^2 is an estimate of the variance of the error associated with each response measurement.
AIC criterion is defined for a large class of models fit by maximum likelihood (ML). In the case of a Gaussian model, ML and OLS are equivalent. Thus for OLS models, AIC and Cp are proportional to each other and only differ in that AIC has an additive constant term.
BIC is derived from a Bayesian point of view, but looks similar to AIC and Cp. For an OLS model with d predictors, the BIC replaces the 2 d σ^2 from Cp with \(log (n) d σ^2\), where n is the number of observations. Since log n > 2 for a n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and results in smaller models.
The adjusted R^2 adds a penalty term for additional variables being added to the model.
\[ Adjusted R^2 = 1 − RSS/(n − d − 1) / TSS / (n-1)\]
Unlike the earlier statistics, a large value of Adjusted R^2 indicates a model with small test error. In theory, the model with the largest adjusted R^2 will have only correct variables and no noise variables. This is because unlike R^2, Adjusted R^2 is penalized for inclusion of unnecessary variables in the model. Despite its popularity, and even though it is quite intuitive, the adjusted R2 is not as well motivated in statistical theory as AIC, BIC, and Cp.
All these statistics have rigorous theoretical justifications, but they still all rely (to some degree) on certain arguments, like large n.
These approaches discussed in detail here.
In general, cross validation techniques are more direct estimates of test error, and makes fewer assumptions about the underlying model. Further, it can be used in a wider selection of model types.
Note: it is common for many models to have similar test errors. In this situation it is often better to pick the simplest model.
The subset selection methods described above used least squares fitting that contained a subset of the predictors to choose the best model, and estimate test error. Here, we discuss an alternative where we fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. The shrinking of the coefficient estimates has the effect of significantly reducing their variance. The two best-known techniques for shrinking the coefficient estimates towards zero are the ridge regression and the lasso.
Ridge regression is similar to least squares except that the coefficients are estimated by minimizing a slightly different quantity. Ridge regression, like OLS, seeks coefficient estimates that reduce RSS, however they also have a shrinkage penalty when the coefficients come closer to zero. This penalty has the effect of shrinking the coefficient estimates towards zero. A parameter, λ, controls the impact of the shrinking. λ = 0 will behave exactly like OLS regression. Of course, selection a good value for λ is critical, and should be chosen using cross validation techniques. A requirement of ridge regression is that the predictors X have been centered to have a mean = 0
, thus the data must be standardized before hand. Without going into the math, it is useful to know that ridge regression shrinks the features with the smallest column space variance. Like in prinicipal component analysis, ridge regression projects the data into d directional space and then shrinks the coefficients of the low-variance components more than the high variance components, which are equivalent to the largest and smallest principal components.
Note that the shrinkage does not apply to the intercept.
Why is ridge regression better than least squares?
The advantage is apparent in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases. This leads to decrease variance, with a smaller increase in bias. Regular OLS regression is fixed with high variance, but no bias. However, the lowest test MSE tends to occur at the intercept between variance and bias. Thus, by properly tuning λ and acquiring less variance at the cost of a small amount of bias, we can find a lower potential MSE.
Ridge regression works best in situations for least squares estimates have high variance. Ridge regression is also much more computationally efficient that any subset method, since it is possible to simultaneously solve for all values of λ.
Ridge regression had at least one disadvantage; it includes all p predictors in the final model. The penalty term will set many of them close to zero, but never exactly to zero. This isn’t generally a problem for prediction accuracy, but it can make the model more difficult to interpret the results. Lasso overcomes this disadvantage and is capable of forcing some of the coefficients to zero granted that s is small enough. Since s = 1 results in regular OLS regression, as s approaches 0 the coefficients shrink towards zero. Thus, Lasso regression also performs variable selection.
There is no dominant algorithm present here, in general it is best to test all three techniques introduced so far and chose the one that best suits the data using cross-validated test error estimates.
The elastic net is another penalty that incorperates the variable selection of the lasso and the shrinkage of correlated predictors like righe regression.
So far, the methods we have discussed have controlled for variance by either using a subset of the original variables, or by shrinking their coefficients toward zero. Now we explore a class of models that transform the predictors and then fit a least squares model using the transformed variables. Dimension reduction reduces the problem of estimating p + 1 coefficients to the simpler problem of M + 1 coefficients, where M < p. Two approaches for this task are principal component regression and partial least squares.
One can describe PCA as an approach for deriving a low-dimensional set of features from a large set of variables. The first principal component direction of the data is along which the observations vary the most. In other words, the first PC is a line that fits as close as possible to the data. One can fit p distinct principal components. The second PC is a linear combination of the variables that is uncorrelated with the first PC, and has the largest variance subject to this constraint. It turns out that the 2 PC must be perpendicular to the first PC direction. The idea is that the principal components capture (maximize) the most variance in the data using linear combinations of the data in subsequently orthogonal directions. In this way we can also combine the effects of correlated variables to get more information out of the available data, whereas in regular least squares we would have to discard one of the correlated variables.
In regression, we construct M principal components and then use these components as predictors in a linear regression using least squares. The idea being that we fit a model with a small number of variables (principals) that explain most of the variability in the data, and the most relationship with the response. In general, we have potential to fit better models than ordinary least squares since we can reduce the effect of over fitting. In general, PCR will tend to be better in cases where the first few principal components are sufficient to capture most of the variation in the predictors as well as the relationship with the response.
Note that PCR is not a feature selection method. This is because it is a linear combination of all p original features. Thus PCR is more related to ridge regression than lasso.
The PCR method that we described above involves identifying linear combinations of X that best represent the predictors. These combinations (directions) are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions. That is, the response Y does not supervise the identification of the principal components, thus there is no guarantee that the directions that best explain the predictors also are the best for predicting the response (even though that is often assumed). Partial least squares (PLS) are a supervised alternative to PCR. Like PCR, PLS is a dimension reduction method, which first identifies a new smaller set of features that are linear combinations of the original features, then fits a linear model via least squares to the new M features. Yet, unlike PCR, PLS makes use of the response variable in order to identify the new features.
PLS does this by placing higher weights on the variables that are most strongly related to the response. To attain subsequent direction, the method first adjusts each of the variables for the first component by regressing each variable on the first component and taking the residuals. The residuals can be interpreted as the remaining information that has no been explained by the first PLS direction. We then compute the second component in exactly the same way as the first component. This can be iterated M times to identify multiple PLS components.
In practice PLS performs no better than ridge regression or PCR. This is because even though PLS can reduce bias, it also has potential to increase the variance, so the overall benefit is not really distinct.
Most traditional regression techniques are intended for low-dimensional settings in which n >> p. This in part because through most of the fields history, the bulk of the problems requiring statistics have been low dimensional. These days p can be very large, and n is often limited due to cost, availability or other considerations.
In general a dataset is high-dimensional if p > n or if p is slightly smaller than n. Classical approaches such as least squares linear regression are not appropriate here.
So what exactly goes wrong in high dimensional settings?
We will discuss the problems under the context of linear regression, though the ideas hold true for all classical regression techniques (linear OLS, logistic regression, LDA, ect).
When p is larger, or almost larger than n, the least squares approach will yield a set of coefficient estimates that result in a perfect fit to the data, regardless of whether there is truly a relationship present. The problem is that a perfect fit will almost always lead to over fitting the data. The problem is simply that the least squares regression is _too flexible__ and hence over fits the data.
In this scenario, we can actually attain perfect fits to data even when the features are completely unrelated to the response.
Unfortunately, the model fitting parameters R^2, Cp, AIC, and BIC approaches are also not effective in a high dimensional setting, even with cross validation, because estimating variance can be problematic.
However, it turns out that many of the methods we have discussed in this notebook for fitting less flexible least squares models (forward stepwise, ridge, lasso, and PCA) are particularly useful for performing regression in high dimensional settings. Essentially, these approaches avoid over fitting by using a less flexible fitting approach than ordinary least squares.
The general rule for additional dimensions in data is that additional features are only useful if they are truly associated with the response. Otherwise, the addition of noise features will lead to increased test set error and reduce the chance of over fitting.
We must always be cautious about the way we report the obtained model results, especially in high dimensional settings. In this setting, the multicollinearity problem is extreme, since any variable in the model can be rewritten as a linear combination of all the other variables in the model. Essentially, we can never know exactly which variables (if any) truly are predictive of the outcome, and we can never identify the best coefficients. In general we should be careful not to overstate the results obtained. We can make it clear the results found were simply one of many possible models for predicting the response, and that it must be validated on independent data sets.
It is also important to be careful when reporting errors and measure of model fit in the high dim. setting. We have seen that we can obtain useless models that have zero residuals when p > n, thus we should never use SSE, p-values, R^2, or other traditional measures of fit on the training data. Thus it is imperative to report error and prediction results using a test set, or cross validation errors.
Here we apply best subset selection to the Hitters
dataset from the ISLR
package. We want to predict a baseball player’s Salary
based on various statistics from the previous year.
library(ISLR)
attach(Hitters)
names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI"
## [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun"
## [11] "CRuns" "CRBI" "CWalks" "League" "Division"
## [16] "PutOuts" "Assists" "Errors" "Salary" "NewLeague"
dim(Hitters)
## [1] 322 20
str(Hitters)
## 'data.frame': 322 obs. of 20 variables:
## $ AtBat : int 293 315 479 496 321 594 185 298 323 401 ...
## $ Hits : int 66 81 130 141 87 169 37 73 81 92 ...
## $ HmRun : int 1 7 18 20 10 4 1 0 6 17 ...
## $ Runs : int 30 24 66 65 39 74 23 24 26 49 ...
## $ RBI : int 29 38 72 78 42 51 8 24 32 66 ...
## $ Walks : int 14 39 76 37 30 35 21 7 8 65 ...
## $ Years : int 1 14 3 11 2 11 2 3 2 13 ...
## $ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
## $ CHits : int 66 835 457 1575 101 1133 42 108 86 1332 ...
## $ CHmRun : int 1 69 63 225 12 19 1 0 6 253 ...
## $ CRuns : int 30 321 224 828 48 501 30 41 32 784 ...
## $ CRBI : int 29 414 266 838 46 336 9 37 34 890 ...
## $ CWalks : int 14 375 263 354 33 194 24 12 8 866 ...
## $ League : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
## $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
## $ PutOuts : int 446 632 880 200 805 282 76 121 143 0 ...
## $ Assists : int 33 43 82 11 40 421 127 283 290 0 ...
## $ Errors : int 20 10 14 3 4 25 7 9 19 0 ...
## $ Salary : num NA 475 480 500 91.5 750 70 100 75 1100 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
# Check for NA data
sum(is.na(Hitters$Salary))/length(Hitters[,1])*100
## [1] 18.32
Turns out that about 18% of the data is missing. For the purpose of this lesson we will just omit missing data.
Hitters <- na.omit(Hitters)
dim(Hitters)
## [1] 263 20
The regsubsets()
function of the leaps
library will perform out best subset selection, where best is quantified using RSS.
library(leaps)
regfit <- regsubsets(Salary ~ ., Hitters)
summary(regfit)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " " " "*" "*" " " " " " "
## 8 ( 1 ) " " "*" " " "*" "*" " " " " " "
The asterisk indicates that a variables is included in the corresponding model. For example, the best two-variable model contains only the Hits and CRBI. By default regsubsets()
only reports up to the best eight-variable model. We can adjust this with the nvmax
parameter.
regfit <- regsubsets(Salary ~ ., data=Hitters, nvmax = 19)
summary(regfit)$rsq
## [1] 0.3215 0.4252 0.4514 0.4754 0.4908 0.5087 0.5141 0.5286 0.5346 0.5405
## [11] 0.5426 0.5436 0.5445 0.5452 0.5455 0.5458 0.5460 0.5461 0.5461
In this 19 variable model, the R^2 increases monotonically as more vaiables are included.
We can use the built in plot functions to plot the RSS, adj. R^2, Cp, AIC and BIC.
par(mfrow=c(2,2))
plot(regfit, scale = 'r2')
plot(regfit, scale = 'adjr2')
plot(regfit, scale = 'Cp')
plot(regfit, scale = 'bic')
Note: recall, the measures of fit shown above are (besides R^2) all estimates of test error.
We can also use regsubsets()
here by specifying the paramter method
with either backwards
or forwards
.
regfit.fwd <- regsubsets(Salary ~.,data=Hitters,nvmax=19,
method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
regfit.bwd <- regsubsets(Salary ~.,data=Hitters,nvmax=19,
method ="backward")
summary(regfit.bwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: backward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " " " " " "*" " " " " " "
## 5 ( 1 ) " " " " " " " " "*" " " " " " "
## 6 ( 1 ) " " " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
We can see here that 1 - 6 variable models are identical for best subset and forward selection. However, the best 7 + variable models are different for all three techniques.
Note: To select the best value of nvmax, we should cross use cross validation.
Forewarning: ridge and lasso regression are not well explained using the caret package, since it handles a lot of the action automatically.
We will be applying cross-validation methods within the Regularization methods as well, rather than isolating them to a single section.
Instead of using adj. R^2 Cp and BIC to estimate test error rates, we can use cross-validation approaches. In order for this to work we must only use the training observations to perform all aspects of model fitting and variable selection. The test errors are then computed by applying the training model to the test or validation data. We can split the data into training
and testing
sets using the caret
package.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
split <- createDataPartition(y=Hitters$Salary, p = 0.5, list = FALSE)
train <- Hitters[split,]
test <- Hitters[-split,]
set.seed(825) # for reproducing these results
ridge <- train(Salary ~., data = train,
method='ridge',
lambda = 4,
preProcess=c('scale', 'center'))
## Loading required package: elasticnet
## Loading required package: lars
## Loaded lars 1.2
ridge
## Ridge Regression
##
## 133 samples
## 19 predictors
##
## Pre-processing: scaled, centered
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 133, 133, 133, 133, 133, 133, ...
##
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared RMSE SD Rsquared SD
## 0 400 0.4 40 0.09
## 1e-04 400 0.4 40 0.09
## 0.1 300 0.5 40 0.09
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
ridge.pred <- predict(ridge, test)
mean(ridge.pred - test$Salary)^2
## [1] 30.1
Use k-folds to select the best lambda. (Even though caret
uses bootstrap in the example above by default).
For cross-validation, we will split the data into testing and training data
set.seed(825)
fitControl <- trainControl(method = "cv",
number = 10)
# Set seq of lambda to test
lambdaGrid <- expand.grid(lambda = 10^seq(10, -2, length=100))
ridge <- train(Salary~., data = train,
method='ridge',
trControl = fitControl,
# tuneGrid = lambdaGrid
preProcess=c('center', 'scale')
)
ridge
## Ridge Regression
##
## 133 samples
## 19 predictors
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 120, 120, 119, 120, 120, 119, ...
##
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared RMSE SD Rsquared SD
## 0 300 0.6 70 0.1
## 1e-04 300 0.6 70 0.1
## 0.1 300 0.6 70 0.1
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 1e-04.
# Compute coeff
predict(ridge$finalModel, type='coef', mode='norm')$coefficients[19,]
## AtBat Hits HmRun Runs RBI Walks
## -157.221 313.860 -18.996 0.000 -70.392 171.242
## Years CAtBat CHits CHmRun CRuns CRBI
## -27.543 0.000 0.000 51.811 202.537 187.933
## CWalks LeagueN DivisionW PutOuts Assists Errors
## -224.951 12.839 -38.595 -9.128 13.288 -18.620
## NewLeagueN
## 22.326
ridge.pred <- predict(ridge, test)
sqrt(mean(ridge.pred - test$Salary)^2)
## [1] 17.53
So the average error in salary is ~ 33 thousand. You’ll notice that the regression coefficients dont really seem like they have shifted towards zero, but that is because we are standardizing the data first.
We should now check to see if this is actually any better than a regular lm()
model.
lmfit <- train(Salary ~., data = train,
method='lm',
trControl = fitControl,
preProc=c('scale', 'center'))
lmfit
## Linear Regression
##
## 133 samples
## 19 predictors
##
## Pre-processing: scaled, centered
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 120, 120, 121, 119, 119, 119, ...
##
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 300 0.5 70 0.2
##
##
coef(lmfit$finalModel)
## (Intercept) AtBat Hits HmRun Runs RBI
## 535.958 -327.835 591.667 73.964 -169.699 -162.024
## Walks Years CAtBat CHits CHmRun CRuns
## 234.093 -60.557 125.017 -529.709 -45.888 680.654
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 393.276 -399.506 19.118 -46.679 -4.898 41.271
## Errors NewLeagueN
## -22.672 22.469
lmfit.pred <- predict(lmfit, test)
sqrt(mean(lmfit.pred - test$Salary)^2)
## [1] 17.62
As we can see this ridge regression fit certainly has lower RMSE and higher R^2. We can also see that the ridge regression has indeed shrunk the coefficients, some of them extremely close to zero.
lasso <- train(Salary ~., train,
method='lasso',
preProc=c('scale','center'),
trControl=fitControl)
lasso
## The lasso
##
## 133 samples
## 19 predictors
##
## Pre-processing: scaled, centered
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 120, 121, 120, 120, 120, 119, ...
##
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared RMSE SD Rsquared SD
## 0.1 300 0.6 70 0.2
## 0.5 300 0.6 60 0.2
## 0.9 300 0.6 70 0.2
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.5.
# Get coef
predict.enet(lasso$finalModel, type='coefficients', s=lasso$bestTune$fraction, mode='fraction')
## $s
## [1] 0.5
##
## $fraction
## 0
## 0.5
##
## $mode
## [1] "fraction"
##
## $coefficients
## AtBat Hits HmRun Runs RBI Walks
## -227.113 406.285 0.000 -48.612 -93.740 197.472
## Years CAtBat CHits CHmRun CRuns CRBI
## -47.952 0.000 0.000 82.291 274.745 166.617
## CWalks LeagueN DivisionW PutOuts Assists Errors
## -287.549 18.059 -41.697 -7.001 30.768 -26.407
## NewLeagueN
## 19.190
lasso.pred <- predict(lasso, test)
sqrt(mean(lasso.pred - test$Salary)^2)
## [1] 14.35
Here in the lasso we see that many of the coefficients have been forced to zero. This presents a simplicity advantage over ridge and linear regression models, even though the RMSE is a bit higher than the ridge regression.
We will show PCR using both the pls
package, as well as the caret
package.
library(pls)
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:caret':
##
## R2
##
## The following object is masked from 'package:stats':
##
## loadings
set.seed(2)
#defaults to 10 folds cross validation
pcr.fit <- pcr(Salary ~., data=train, scale=TRUE, validation="CV", ncomp=19)
summary(pcr.fit)
## Data: X dimension: 133 19
## Y dimension: 133 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 451.5 336.9 323.9 328.5 328.4 329.9 337.1
## adjCV 451.5 336.3 323.6 327.8 327.5 328.8 335.7
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 335.2 333.7 338.5 334.3 337.8 340.4 346.7
## adjCV 332.5 331.7 336.4 332.0 335.5 337.6 343.4
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 345.1 345.7 329.4 337.3 343.5 338.7
## adjCV 341.2 341.6 325.7 332.7 338.4 333.9
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 36.55 60.81 71.75 80.59 85.72 89.76 92.74
## Salary 45.62 50.01 51.19 51.98 53.23 53.36 55.63
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 95.37 96.49 97.45 98.09 98.73 99.21 99.52
## Salary 56.48 56.73 58.57 58.92 59.34 59.44 62.01
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 99.77 99.90 99.97 99.99 100.00
## Salary 62.65 65.29 66.48 66.77 67.37
validationplot(pcr.fit, val.type='MSEP')
This algorithm reports the CV scores as RMSE, and R^2 of the training data. However, we can see from plotting the MSE against the number of components that we achieve the lowest MSE at about 3 components. This suggests a large improvement over a least squares approach because we are able to explain much of the variance using only 3 components, rather than 19.
Let’s see how this performs on the test dataset.
pcr.pred <- predict(pcr.fit, test, ncomp=3)
sqrt(mean((pcr.pred - test$Salary)^2))
## [1] 374.8
This is comparable, but a bit lower than the RMSE of the ridge/ lasso/linear regression.
Using the caret package
pcr.fit <- train(Salary ~., data=train,
preProc = c('center', 'scale'),
method='pcr',
trControl=fitControl)
pcr.fit
## Principal Component Analysis
##
## 133 samples
## 19 predictors
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 121, 120, 118, 119, 120, 120, ...
##
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared RMSE SD Rsquared SD
## 1 300 0.5 100 0.2
## 2 300 0.5 100 0.2
## 3 300 0.6 100 0.2
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
The caret package, using bootstrapping and 10 fold cv choses the best model @ 2 components
pcr.pred <- predict(pcr.fit, test)
sqrt(mean(pcr.pred - test$Salary)^2)
## [1] 21.86
The results are comparable to lasso regression. However, PCR results are not easily interpretable.
pls.fit <- plsr(Salary~., data=train, scale=TRUE, validation="CV")
summary(pls.fit)
## Data: X dimension: 133 19
## Y dimension: 133 1
## Fit method: kernelpls
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 451.5 328.9 328.4 332.6 329.2 325.4 323.4
## adjCV 451.5 328.2 327.4 330.6 326.9 323.0 320.9
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 318.7 318.7 316.3 317.6 316.5 317.0 319.2
## adjCV 316.2 315.5 313.5 314.9 313.6 313.9 315.9
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 323.0 323.8 325.4 324.5 323.6 321.4
## adjCV 319.3 320.1 321.4 320.5 319.9 317.8
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 35.94 55.11 67.37 74.29 79.65 85.17 89.17
## Salary 51.56 54.90 57.72 59.78 61.50 62.94 63.96
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 90.55 93.49 95.82 97.05 97.67 98.45 98.67
## Salary 65.34 65.75 66.03 66.44 66.69 66.77 66.94
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 99.02 99.26 99.42 99.98 100.00
## Salary 67.02 67.11 67.24 67.26 67.37
validationplot(pls.fit, val.type='MSEP')
Here the best M (number of components) is 2. Now we evaluate the corresponding test error.
pls.pred <- predict(pls.fit, test, ncomp=2)
sqrt(mean(pls.pred - test$Salary)^2)
## [1] 14.34
Here we see a mild improvment in RMSE compared to PCR. This is probably due to the fact that the component directions are estimated based on the predictors and and response.