This method evaluates all possible combinations of \(p\) predictors. Since each predictor can either be included or excluded, there are \(2^p\) potential models. The process has several steps:
The null model only includes the intercept, \(\beta_0\), and represents the sample mean. We can denote this as \(\mathcal{M}_0\).
For \(k=1,2,...,p:\)
Select a single best model from among \(\mathcal{M}_0,...,\mathcal{M}_p\) using either \(C_p\), AIC, BIC, or adjusted \(R^2\).
For example, if there were 4 predictors you would find and record the best model using every combination of 1 predictor, using every combination of 2 predictors, using every combination of 3 predictors, and using every combination of all 4. You would then select the best model among those 4 recorded values.
Unfortunatley the process can suffer from computational limitations. With 10 predictors you must evaluate \(2^{10}\) or over \(1,000\) possible models and for 20 predictors you would need to evaluate over \(1,000,000\).
We can use the Hitters dataset to predict a baseball player’s Salary using a linear regression.
[1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
[7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
[13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
[19] "Salary" "NewLeague"
[1] 322 20
[1] 59
We can see that there are 59 missing values for Salary
[1] 263 20
[1] 0
They can be removed with the na.omit function.
> library(leaps) #Contains regsubsets function
> regfit.full=regsubsets(Salary~.,Hitters) #full model
> summary(regfit.full) #display 8 variablesSubset selection object
Call: regsubsets.formula(Salary ~ ., Hitters)
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " "*" " " " " " "
4 ( 1 ) " " " " "*" "*" " " " " " "
5 ( 1 ) " " " " "*" "*" " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) " " " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
Next, we can use the regsubsets function to perform best subsets selection (best according to RSS). In the output an asterisk signifies that the variable is included in the best model. For example, the best two-variable model includes Hits and CRBI.
> regfit.full=regsubsets(Salary~.,Hitters, nvmax=19) #full model
> reg.summary=summary(regfit.full) #display 19 variables
> # Output not displayedBy default it only displays up to 8 variables but more can be added by changing nvmax
[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
[1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
[8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
[15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
It is also useful to examine \(R^2, RSS, adjusted \space R^2, C_p, and \space BIC\) by bumber of variables. We can see that \(R^2\) increases with the number of variables.
> library(gridExtra) #to make plot grid
> library(tidyverse) #has ggplot2
>
> RSSval=as.data.frame(reg.summary$rss) #create data frame
> names(RSSval)[1]="RSS" #name column
> ggplot(data=RSSval)+
+ geom_line(mapping=aes(as.numeric(row.names(RSSval)),y=RSS),
+ color="blue", size=1)+
+ theme_classic()+
+ ggtitle("RSS by Number of Variables") +
+ theme(plot.title = element_text(hjust = 0.5)) +
+ xlab("Number of Variables") + ylab("RSS")->p1 #plot RSS
>
> ADJval=as.data.frame(reg.summary$adjr2) #create data frame
> names(ADJval)[1]="adjr2" #name column
> which.max(ADJval$adjr2) #find max value[1] 11
> #plot values
> ggplot(data=ADJval)+
+ geom_line(mapping=aes(as.numeric(row.names(ADJval)),y=adjr2),
+ col="blue", size=1)+
+ geom_point(mapping=aes(x=11, y=ADJval[11,1]),col="red",
+ shape=19, size=4,)+
+ theme_classic()+
+ ggtitle("AdjRsq by Number of Variables") +
+ theme(plot.title = element_text(hjust = 0.5)) +
+ xlab("Number of Variables") + ylab("Adjusted R Squared")->p2
>
> CPval=as.data.frame(reg.summary$cp) #create data frame
> names(CPval)[1]="cp" #name column
> which.min(CPval$cp) #find min[1] 10
> #plot values
> ggplot(data=CPval)+
+ geom_line(mapping=aes(as.numeric(row.names(CPval)),y=cp),
+ col="blue", size=1)+
+ geom_point(mapping=aes(x=10, y=CPval[10,1]),col="red",
+ shape=19, size=4,)+
+ theme_classic()+
+ ggtitle("CP by Number of Variables") +
+ theme(plot.title = element_text(hjust = 0.5)) +
+ xlab("Number of Variables") + ylab("CP")->p3
>
> BICval=as.data.frame(reg.summary$bic) #create data frame
> names(BICval)[1]="bic" #name column
> which.min(BICval$bic) #find min[1] 6
> #plot values
> ggplot(data=BICval)+
+ geom_line(mapping=aes(as.numeric(row.names(BICval)),y=bic),
+ col="blue", size=1)+
+ geom_point(mapping=aes(x=6, y=BICval[6,1]),col="red",
+ shape=19, size=4,)+
+ theme_classic()+
+ ggtitle("BIC by Number of Variables") +
+ theme(plot.title = element_text(hjust = 0.5)) +
+ xlab("Number of Variables") + ylab("BIC")->p4
>
> grid.arrange(p1, p2, p3, p4, ncol = 2, nrow=2)The four plots show \(RSS, adjusted \space R^2, C_p, and \space BIC\) by the number of variables. I used the which.max function to locate and plot the largest \(adjusted \space R^2\) value and the which.min function to plot the lowest \(C_p \space and \space BIC\) values (as a red dot). The best model had 11 variables when using \(adjusted \space R^2\), had 10 variables when using \(C_p\), and had 6 variables when using \(BIC\).
The regsubsets function also has a built in plot command. It contains a black square for each variable selected. For example, there are several models that have a \(BIC\) close to \(-150\). However, the best model contains only 6 variables.
(Intercept) AtBat Hits Walks CRBI DivisionW
91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
PutOuts
0.2643076
Stepwise selection considers a much smaller number of models and is more computationally efficient then the best subset method.
With forward stepwise selection you start with a model that contains no predictors and then add one at a time until the best model is found.
The null model is denoted as \(\mathcal{M}_0\).
For \(k=0,...,p-1\)
Select a single best model from among \(\mathcal{M}_0,...,\mathcal{M}_p\) using either \(C_p\), AIC, BIC, or adjusted \(R^2\).
Forward selection evaluates \(\sum_{k=0}^{p-1}(p-k)=1+p(p+1)/2\) models, which is substantially fewer than \(2^p\) models. If \(p=20\) then forward selection would fit \(211\) models, which compares to \(1,048,576\) with best subset selection.
While it is a faster method, it is not guaranteed to select the best model. Once a predictor is selected it remains in the model and additional predictors are added one at a time if they offer an improvement. However, the best one-variable model might contain predictor \(X_1\) while the best two-variable model might contain \(X_2\) and \(X_3\). With forward selection the two-variable model must contain \(X_1\) and another variable.
> #Use regsubsets and sepcify forward
> regfit.fwd=regsubsets(Salary~.,data=Hitters,
+ nvmax=19, method="forward")
> summary(regfit.fwd)Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 19
Selection Algorithm: forward
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*"
9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " "*" " " " " " "
4 ( 1 ) " " " " "*" "*" " " " " " "
5 ( 1 ) " " " " "*" "*" " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) "*" " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
9 ( 1 ) "*" " " "*" "*" " " " " " "
10 ( 1 ) "*" " " "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" "*" "*" " " " "
12 ( 1 ) "*" "*" "*" "*" "*" " " " "
13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
Using forward stepwise selection the best one-variable model contains only CRBI and the best two-variable model also includes Hits.
With backward stepwise selection you start with the full model and then remove one at a time until the best model is found.
The full model is denoted as \(\mathcal{M}_p\).
For \(k=p,p-1,...,1\)
Select a single best model from among \(\mathcal{M}_0,...,\mathcal{M}_p\) using either \(C_p\), AIC, BIC, or adjusted \(R^2\).
This method also evaluates \(1+p(p+1)/2\) models and is also not guaranteed to yield the best model. It requires that \(n\) is larger than \(p\), whereas forward selection can be used when \(n<p\).
> #Specify Backward Selection
> regfit.bwd=regsubsets(Salary~.,data=Hitters,
+ nvmax=19, method="backward")
> summary(regfit.bwd)Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 19
Selection Algorithm: backward
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " "
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " "
4 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " "*" " "
5 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*"
9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " "*" " " " " " "
4 ( 1 ) " " " " " " "*" " " " " " "
5 ( 1 ) " " " " " " "*" " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) "*" " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
9 ( 1 ) "*" " " "*" "*" " " " " " "
10 ( 1 ) "*" " " "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" "*" "*" " " " "
12 ( 1 ) "*" "*" "*" "*" "*" " " " "
13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
Some of the models differ for backward selection.
(Intercept) Hits Walks CAtBat CHits CHmRun
79.4509472 1.2833513 3.2274264 -0.3752350 1.4957073 1.4420538
DivisionW PutOuts
-129.9866432 0.2366813
(Intercept) AtBat Hits Walks CRBI CWalks
109.7873062 -1.9588851 7.4498772 4.9131401 0.8537622 -0.3053070
DivisionW PutOuts
-127.1223928 0.2533404
(Intercept) AtBat Hits Walks CRuns CWalks
105.6487488 -1.9762838 6.7574914 6.0558691 1.1293095 -0.7163346
DivisionW PutOuts
-116.1692169 0.3028847
Here we can see that best subset, forward, and backward selection produces different results for the model with 7 varaibles.
Sequential replacement is a combination of forward and backward selection. It starts with the null model and adds one variable at a time, but it will also remove a variable if it no longer provides and improvement.
> #Specify seqrep for combination of forward and backwards
> regfit.seq=regsubsets(Salary~.,data=Hitters,
+ nvmax=19, method="seqrep")
> summary(regfit.seq)Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "seqrep")
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 19
Selection Algorithm: 'sequential replacement'
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
5 ( 1 ) "*" "*" "*" "*" "*" " " " " " " " " " " " " " "
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " "*" " " " " " "
4 ( 1 ) " " " " "*" "*" " " " " " "
5 ( 1 ) " " " " " " " " " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) "*" " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
9 ( 1 ) "*" " " "*" "*" " " " " " "
10 ( 1 ) "*" " " "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" "*" "*" " " " "
12 ( 1 ) "*" "*" "*" "*" "*" " " " "
13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
Sequential replacement can be specified using seqrep.
> library(MASS)
> library(caret)
>
> # Fit the full model
> full.model <- lm(Salary ~., data = Hitters)
> # Stepwise regression model
> step.model <- stepAIC(full.model, direction = "both",
+ trace = FALSE)
> summary(step.model)
Call:
lm(formula = Salary ~ AtBat + Hits + Walks + CAtBat + CRuns +
CRBI + CWalks + Division + PutOuts + Assists, data = Hitters)
Residuals:
Min 1Q Median 3Q Max
-939.11 -176.87 -34.08 130.90 1910.55
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 162.53544 66.90784 2.429 0.015830 *
AtBat -2.16865 0.53630 -4.044 7.00e-05 ***
Hits 6.91802 1.64665 4.201 3.69e-05 ***
Walks 5.77322 1.58483 3.643 0.000327 ***
CAtBat -0.13008 0.05550 -2.344 0.019858 *
CRuns 1.40825 0.39040 3.607 0.000373 ***
CRBI 0.77431 0.20961 3.694 0.000271 ***
CWalks -0.83083 0.26359 -3.152 0.001818 **
DivisionW -112.38006 39.21438 -2.866 0.004511 **
PutOuts 0.29737 0.07444 3.995 8.50e-05 ***
Assists 0.28317 0.15766 1.796 0.073673 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 311.8 on 252 degrees of freedom
Multiple R-squared: 0.5405, Adjusted R-squared: 0.5223
F-statistic: 29.64 on 10 and 252 DF, p-value: < 2.2e-16
The stepAIC function will find the best model according to AIC. It differs from regsubsets which finds the best model according to RSS. Specifying both will use a combination of both forwards and backwards selection.
> # Set seed for reproducibility
> set.seed(123)
> # Set up repeated k-fold cross-validation
> train.control <- trainControl(method = "cv", number = 10)
> # Train the model
> step.model <- train(Salary ~., data = Hitters,
+ method = "leapSeq",
+ tuneGrid = data.frame(nvmax = 1:19),
+ trControl = train.control
+ )
>
> #Model metrics
> step.model$results nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 371.3887 0.3725884 263.4471 85.50842 0.1751559 46.56110
2 2 347.0982 0.4454148 233.9166 76.43313 0.1967189 39.05602
3 3 356.1870 0.4166351 243.4634 70.71259 0.2180077 39.21220
4 4 342.1703 0.4466726 239.5107 70.12556 0.2101984 39.98074
5 5 351.6811 0.4141493 251.5542 79.28301 0.1771184 46.24903
6 6 353.8881 0.3971377 252.7744 78.96201 0.2230353 41.31359
7 7 341.5259 0.4565160 238.2607 71.74847 0.1942697 42.61554
8 8 336.3383 0.4637186 234.2683 81.92114 0.1932574 43.54620
9 9 332.7480 0.4818312 235.5175 71.18994 0.1971069 41.09389
10 10 328.5512 0.4910098 234.5948 73.40151 0.1981401 42.91242
11 11 327.8415 0.4933839 234.4990 70.98498 0.1983722 41.75501
12 12 335.1188 0.4811998 237.1109 77.38550 0.2116452 47.00768
13 13 342.7259 0.4716044 240.2832 77.18950 0.2236727 41.75227
14 14 341.6455 0.4644272 239.2281 76.96833 0.2032048 43.27559
15 15 337.2516 0.4754851 238.8219 73.05104 0.2060950 42.33556
16 16 337.0595 0.4791323 236.8887 74.54225 0.2033363 42.89640
17 17 336.4725 0.4793681 238.2540 74.45945 0.2017970 43.15108
18 18 337.9055 0.4783858 238.6822 74.09984 0.2029365 42.70828
19 19 338.6068 0.4772604 239.2730 75.20275 0.2044082 42.80425
nvmax
11 11
Subset selection object
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 11
Selection Algorithm: 'sequential replacement'
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
5 ( 1 ) "*" "*" "*" "*" "*" " " " " " " " " " " " " " "
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " "*" " " " " " "
4 ( 1 ) " " " " "*" "*" " " " " " "
5 ( 1 ) " " " " " " " " " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) "*" " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
9 ( 1 ) "*" " " "*" "*" " " " " " "
10 ( 1 ) "*" " " "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" "*" "*" " " " "
(Intercept) AtBat Hits Walks CAtBat CRuns
135.7512195 -2.1277482 6.9236994 5.6202755 -0.1389914 1.4553310
CRBI CWalks LeagueN DivisionW PutOuts Assists
0.7852528 -0.8228559 43.1116152 -111.1460252 0.2894087 0.2688277
We can also use stepwise selection along with cross validation. This method will find the best model from 1 to 19 predictors (I used sequential replacement in this case) and will then use 10 fold cross validation to estimate RMSE. The best model will have the lowest RMSE (model 11).
The MASS, caret, and leaps packages made it relatively simple to perform cross validation with forward, backward, and sequential replacement, but it is more complicated with best subsets. The follwing formula can be used.
> #Predict formula for regsubsets
> predict.regsubsets=function(object,newdata,id){
+ form=as.formula(object$call[[2]])
+ mat=model.matrix(form,newdata)
+ coefi=coef(object, id=id)
+ xvars=names(coefi)
+ mat[,xvars]%*%coefi
+ }There is no predict method for regsubsets, but the above formula can be used. It will be demonstrated below.
> #Set 10 folds and create a matrix of NAs
> k=10
> set.seed(1)
> folds=sample(1:k,nrow(Hitters),replace=TRUE)
> cv.errors=matrix(NA,k,19,dimnames = list(NULL, paste(1:19)))k sepcifies the number of folds and a matrix is created to store the results
> #loop that performs cross validation
> for(j in 1:k){
+ best.fit=regsubsets(Salary~.,data=Hitters[folds!=j,],
+ nvmax=19)
+ for(i in 1:19){
+ pred=predict.regsubsets(best.fit,Hitters[folds==j,],id=i)
+ cv.errors[j,i]=mean((Hitters$Salary[folds==j]-pred)^2)
+ }
+ }This gives us a 10x19 matrix. The \((i,j)^{th}\) element corresponds to the test MSE for the \(i^{th}\) cross-validation fold for the best \(j\) variable model.
1 2 3 4 5 6 7 8
149821.1 130922.0 139127.0 131028.8 131050.2 119538.6 124286.1 113580.0
9 10 11 12 13 14 15 16
115556.5 112216.7 113251.2 115755.9 117820.8 119481.2 120121.6 120074.3
17 18 19
120084.8 120085.8 120403.5
We can now take an average of the columns.
We can see that cross-validation selects an 11 variable model
> #best subset selection on full data
> reg.best=regsubsets(Salary~.,data=Hitters, nvmax=19)
> #view model 11
> coef(reg.best,11) (Intercept) AtBat Hits Walks CAtBat CRuns
135.7512195 -2.1277482 6.9236994 5.6202755 -0.1389914 1.4553310
CRBI CWalks LeagueN DivisionW PutOuts Assists
0.7852528 -0.8228559 43.1116152 -111.1460252 0.2894087 0.2688277
We can use \(RSS\) and \(R^2\) to evaluate the training error, but we are mostly interested in the test error. We can therefore use methods that adjust the training error for model size.
Cp
For a least squares model containing \(d\) predictors, the \(C_p\) estimate of the test MSE is:
\[C_p=\frac{1}{n}(RSS+2d\hat{\sigma}^2)\]
\(\hat{\sigma}^2\) is an estimate for the variance of \(\epsilon\), the error associated with each response.
\(2d\hat{\sigma}^2\) is a penalty term that adjusts for the fact that the training error underestimates the test error. It increases with the number of predictors.
The best model is the one with the lowest \(C_p\) value
AIC
Similarly, AIC can be calculated as:
\[AIC=\frac{1}{n\hat{\sigma}^2}(RSS+2d\hat{\sigma}^2)\]
BIC
BIC looks similar:
\[BIC=\frac{1}{n\hat{\sigma}^2}(RSS+log(n)d\hat{\sigma}^2)\]
Instead of \(2d\hat{\sigma}^2\) BIC uses an error term of \(log(n)d\hat{\sigma}^2\).
As with \(C_p\) and \(AIC\), we also select the lowest value for \(BIC\)
Since \(log(n) >2\) for any \(n > 7\), BIC places a heavier penalty on models with many variables.
Adjusted R Squared
The un-adjusted \(R^2\) is calculated as \(1-RSS/TSS\), where \(TSS = \sum{(y_i-\bar{y})^2}\). \(R^2\) always increases as more variables are added to the model, since \(RSS\) will decline.
For a least squares model with \(d\) variables, \(adj\space R^2\) is calculated as:
\[1-\frac{RSS/(n-d-1)}{TSS/(n-1)}\]
In this case, a larger value is preferred.
Unlike with \(R^2\), \(adj \space R^2\) penalizes unnecessary variables.