Best Subset Selection

This method evaluates all possible combinations of \(p\) predictors. Since each predictor can either be included or excluded, there are \(2^p\) potential models. The process has several steps:

The null model only includes the intercept, \(\beta_0\), and represents the sample mean. We can denote this as \(\mathcal{M}_0\).
For \(k=1,2,...,p:\)
- Fit all \(\binom{p}{k}\) models
- Label the best among these \(\binom{p}{k}\) models \(\mathcal{M}_k\). Choose the model with the smallest RSS (Residual Sum of Squares) or largest \(R^2\) as the best.
Select a single best model from among \(\mathcal{M}_0,...,\mathcal{M}_p\) using either \(C_p\), AIC, BIC, or adjusted \(R^2\).

For example, if there were 4 predictors you would find and record the best model using every combination of 1 predictor, using every combination of 2 predictors, using every combination of 3 predictors, and using every combination of all 4. You would then select the best model among those 4 recorded values.

Unfortunatley the process can suffer from computational limitations. With 10 predictors you must evaluate \(2^{10}\) or over \(1,000\) possible models and for 20 predictors you would need to evaluate over \(1,000,000\).

Best Subset Example

We can use the Hitters dataset to predict a baseball player’s Salary using a linear regression.

> library(ISLR) #load package
> names(Hitters) #column names

 [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
 [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
[13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
[19] "Salary"    "NewLeague"

> dim(Hitters) #Rows and Columns

[1] 322  20

> sum(is.na(Hitters$Salary)) #Number of NA values for salary

[1] 59

We can see that there are 59 missing values for Salary

> Hitters=na.omit(Hitters) #Remove NAs
> dim(Hitters) #Rows and Columns

[1] 263  20

> sum(is.na(Hitters)) #Number of NA values

[1] 0

They can be removed with the na.omit function.

> library(leaps) #Contains regsubsets function
> regfit.full=regsubsets(Salary~.,Hitters) #full model
> summary(regfit.full) #display 8 variables

Subset selection object
Call: regsubsets.formula(Salary ~ ., Hitters)
19 Variables  (and intercept)
           Forced in Forced out
AtBat          FALSE      FALSE
Hits           FALSE      FALSE
HmRun          FALSE      FALSE
Runs           FALSE      FALSE
RBI            FALSE      FALSE
Walks          FALSE      FALSE
Years          FALSE      FALSE
CAtBat         FALSE      FALSE
CHits          FALSE      FALSE
CHmRun         FALSE      FALSE
CRuns          FALSE      FALSE
CRBI           FALSE      FALSE
CWalks         FALSE      FALSE
LeagueN        FALSE      FALSE
DivisionW      FALSE      FALSE
PutOuts        FALSE      FALSE
Assists        FALSE      FALSE
Errors         FALSE      FALSE
NewLeagueN     FALSE      FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
         AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "   " " 
8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " " 
         CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
2  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
3  ( 1 ) " "    " "     " "       "*"     " "     " "    " "       
4  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
5  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
6  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
7  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
8  ( 1 ) "*"    " "     "*"       "*"     " "     " "    " "

Next, we can use the regsubsets function to perform best subsets selection (best according to RSS). In the output an asterisk signifies that the variable is included in the best model. For example, the best two-variable model includes Hits and CRBI.

> regfit.full=regsubsets(Salary~.,Hitters, nvmax=19) #full model
> reg.summary=summary(regfit.full) #display 19 variables
> # Output not displayed

By default it only displays up to 8 variables but more can be added by changing nvmax

> names(reg.summary) #column names

[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

> reg.summary$rsq # r squared values

 [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
 [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
[15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159

It is also useful to examine \(R^2, RSS, adjusted \space R^2, C_p, and \space BIC\) by bumber of variables. We can see that \(R^2\) increases with the number of variables.

> library(gridExtra) #to make plot grid
> library(tidyverse) #has ggplot2
> 
> RSSval=as.data.frame(reg.summary$rss) #create data frame
> names(RSSval)[1]="RSS" #name column
> ggplot(data=RSSval)+
+   geom_line(mapping=aes(as.numeric(row.names(RSSval)),y=RSS), 
+             color="blue", size=1)+
+   theme_classic()+
+   ggtitle("RSS by Number of Variables") + 
+   theme(plot.title = element_text(hjust = 0.5)) + 
+   xlab("Number of Variables") + ylab("RSS")->p1 #plot RSS
> 
> ADJval=as.data.frame(reg.summary$adjr2) #create data frame
> names(ADJval)[1]="adjr2" #name column
> which.max(ADJval$adjr2) #find max value

[1] 11

> #plot values
> ggplot(data=ADJval)+
+   geom_line(mapping=aes(as.numeric(row.names(ADJval)),y=adjr2), 
+             col="blue", size=1)+
+   geom_point(mapping=aes(x=11, y=ADJval[11,1]),col="red",
+              shape=19, size=4,)+
+   theme_classic()+
+   ggtitle("AdjRsq by Number of Variables") + 
+   theme(plot.title = element_text(hjust = 0.5)) + 
+   xlab("Number of Variables") + ylab("Adjusted R Squared")->p2  
> 
> CPval=as.data.frame(reg.summary$cp) #create data frame
> names(CPval)[1]="cp" #name column
> which.min(CPval$cp) #find min

[1] 10

> #plot values
> ggplot(data=CPval)+
+   geom_line(mapping=aes(as.numeric(row.names(CPval)),y=cp), 
+             col="blue", size=1)+
+   geom_point(mapping=aes(x=10, y=CPval[10,1]),col="red",
+              shape=19, size=4,)+
+   theme_classic()+
+   ggtitle("CP by Number of Variables") + 
+   theme(plot.title = element_text(hjust = 0.5)) + 
+   xlab("Number of Variables") + ylab("CP")->p3 
> 
> BICval=as.data.frame(reg.summary$bic) #create data frame
> names(BICval)[1]="bic" #name column
> which.min(BICval$bic) #find min

[1] 6

> #plot values
> ggplot(data=BICval)+
+   geom_line(mapping=aes(as.numeric(row.names(BICval)),y=bic), 
+             col="blue", size=1)+
+   geom_point(mapping=aes(x=6, y=BICval[6,1]),col="red",
+              shape=19, size=4,)+
+   theme_classic()+
+   ggtitle("BIC by Number of Variables") + 
+   theme(plot.title = element_text(hjust = 0.5)) + 
+   xlab("Number of Variables") + ylab("BIC")->p4  
> 
> grid.arrange(p1, p2, p3, p4, ncol = 2, nrow=2)

The four plots show \(RSS, adjusted \space R^2, C_p, and \space BIC\) by the number of variables. I used the which.max function to locate and plot the largest \(adjusted \space R^2\) value and the which.min function to plot the lowest \(C_p \space and \space BIC\) values (as a red dot). The best model had 11 variables when using \(adjusted \space R^2\), had 10 variables when using \(C_p\), and had 6 variables when using \(BIC\).

> plot(regfit.full,scale="r2")

> plot(regfit.full,scale="adjr2")

> plot(regfit.full,scale="Cp")

> plot(regfit.full,scale="bic")

The regsubsets function also has a built in plot command. It contains a black square for each variable selected. For example, there are several models that have a \(BIC\) close to \(-150\). However, the best model contains only 6 variables.

> coef(regfit.full,6)

 (Intercept)        AtBat         Hits        Walks         CRBI    DivisionW 
  91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 -122.9515338 
     PutOuts 
   0.2643076

Forward Stepwise Selection

Stepwise selection considers a much smaller number of models and is more computationally efficient then the best subset method.

With forward stepwise selection you start with a model that contains no predictors and then add one at a time until the best model is found.

The null model is denoted as \(\mathcal{M}_0\).
For \(k=0,...,p-1\)
- Evaluate all \(p-k\) models (if one additional predictor was added) and determine which would improve upon model \(\mathcal{M}_k\)
- Choose the best (lowest RSS or highest \(R^2\)) among the \(p-k\) models and label it as \(\mathcal{M}_{k+1}\)
Select a single best model from among \(\mathcal{M}_0,...,\mathcal{M}_p\) using either \(C_p\), AIC, BIC, or adjusted \(R^2\).

Forward selection evaluates \(\sum_{k=0}^{p-1}(p-k)=1+p(p+1)/2\) models, which is substantially fewer than \(2^p\) models. If \(p=20\) then forward selection would fit \(211\) models, which compares to \(1,048,576\) with best subset selection.

While it is a faster method, it is not guaranteed to select the best model. Once a predictor is selected it remains in the model and additional predictors are added one at a time if they offer an improvement. However, the best one-variable model might contain predictor \(X_1\) while the best two-variable model might contain \(X_2\) and \(X_3\). With forward selection the two-variable model must contain \(X_1\) and another variable.

Forward Stepwise Example

> #Use regsubsets and sepcify forward
> regfit.fwd=regsubsets(Salary~.,data=Hitters,
+                       nvmax=19, method="forward")
> summary(regfit.fwd)

Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
19 Variables  (and intercept)
           Forced in Forced out
AtBat          FALSE      FALSE
Hits           FALSE      FALSE
HmRun          FALSE      FALSE
Runs           FALSE      FALSE
RBI            FALSE      FALSE
Walks          FALSE      FALSE
Years          FALSE      FALSE
CAtBat         FALSE      FALSE
CHits          FALSE      FALSE
CHmRun         FALSE      FALSE
CRuns          FALSE      FALSE
CRBI           FALSE      FALSE
CWalks         FALSE      FALSE
LeagueN        FALSE      FALSE
DivisionW      FALSE      FALSE
PutOuts        FALSE      FALSE
Assists        FALSE      FALSE
Errors         FALSE      FALSE
NewLeagueN     FALSE      FALSE
1 subsets of each size up to 19
Selection Algorithm: forward
          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   "*" 
9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"   "*" 
16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*" 
17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*" 
18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"   "*" 
19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"   "*" 
          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
2  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
3  ( 1 )  " "    " "     " "       "*"     " "     " "    " "       
4  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
5  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
6  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
7  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
8  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
9  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
10  ( 1 ) "*"    " "     "*"       "*"     "*"     " "    " "       
11  ( 1 ) "*"    "*"     "*"       "*"     "*"     " "    " "       
12  ( 1 ) "*"    "*"     "*"       "*"     "*"     " "    " "       
13  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
14  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
15  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
16  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
17  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"       
18  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"       
19  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"

Using forward stepwise selection the best one-variable model contains only CRBI and the best two-variable model also includes Hits.

Backward Stepwise Selection

With backward stepwise selection you start with the full model and then remove one at a time until the best model is found.

The full model is denoted as \(\mathcal{M}_p\).
For \(k=p,p-1,...,1\)
- Evaluate all \(k\) models (if one additional predictor was removed) and determine which would improve upon model \(\mathcal{M}_k\)
- Choose the best (lowest RSS or highest \(R^2\)) among the \(k\) models and label it as \(\mathcal{M}_{k-1}\)
Select a single best model from among \(\mathcal{M}_0,...,\mathcal{M}_p\) using either \(C_p\), AIC, BIC, or adjusted \(R^2\).

This method also evaluates \(1+p(p+1)/2\) models and is also not guaranteed to yield the best model. It requires that \(n\) is larger than \(p\), whereas forward selection can be used when \(n<p\).

Backward Stepwise Selection Example

> #Specify Backward Selection
> regfit.bwd=regsubsets(Salary~.,data=Hitters,
+                       nvmax=19, method="backward")
> summary(regfit.bwd)

Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
19 Variables  (and intercept)
           Forced in Forced out
AtBat          FALSE      FALSE
Hits           FALSE      FALSE
HmRun          FALSE      FALSE
Runs           FALSE      FALSE
RBI            FALSE      FALSE
Walks          FALSE      FALSE
Years          FALSE      FALSE
CAtBat         FALSE      FALSE
CHits          FALSE      FALSE
CHmRun         FALSE      FALSE
CRuns          FALSE      FALSE
CRBI           FALSE      FALSE
CWalks         FALSE      FALSE
LeagueN        FALSE      FALSE
DivisionW      FALSE      FALSE
PutOuts        FALSE      FALSE
Assists        FALSE      FALSE
Errors         FALSE      FALSE
NewLeagueN     FALSE      FALSE
1 subsets of each size up to 19
Selection Algorithm: backward
          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    "*"   " " 
2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"   " " 
3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"   " " 
4  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"   " " 
5  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   " " 
6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   " " 
7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   " " 
8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   "*" 
9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"   "*" 
16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*" 
17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*" 
18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"   "*" 
19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"   "*" 
          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
2  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
3  ( 1 )  " "    " "     " "       "*"     " "     " "    " "       
4  ( 1 )  " "    " "     " "       "*"     " "     " "    " "       
5  ( 1 )  " "    " "     " "       "*"     " "     " "    " "       
6  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
7  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
8  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
9  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
10  ( 1 ) "*"    " "     "*"       "*"     "*"     " "    " "       
11  ( 1 ) "*"    "*"     "*"       "*"     "*"     " "    " "       
12  ( 1 ) "*"    "*"     "*"       "*"     "*"     " "    " "       
13  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
14  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
15  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
16  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
17  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"       
18  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"       
19  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"

Some of the models differ for backward selection.

> #Compare the three methods with 7 variables
> coef(regfit.full,7)

 (Intercept)         Hits        Walks       CAtBat        CHits       CHmRun 
  79.4509472    1.2833513    3.2274264   -0.3752350    1.4957073    1.4420538 
   DivisionW      PutOuts 
-129.9866432    0.2366813

> coef(regfit.fwd,7)

 (Intercept)        AtBat         Hits        Walks         CRBI       CWalks 
 109.7873062   -1.9588851    7.4498772    4.9131401    0.8537622   -0.3053070 
   DivisionW      PutOuts 
-127.1223928    0.2533404

> coef(regfit.bwd,7)

 (Intercept)        AtBat         Hits        Walks        CRuns       CWalks 
 105.6487488   -1.9762838    6.7574914    6.0558691    1.1293095   -0.7163346 
   DivisionW      PutOuts 
-116.1692169    0.3028847

Here we can see that best subset, forward, and backward selection produces different results for the model with 7 varaibles.

Hybrid Approaches

Sequential replacement is a combination of forward and backward selection. It starts with the null model and adds one variable at a time, but it will also remove a variable if it no longer provides and improvement.

Hybrid Examples

> #Specify seqrep for combination of forward and backwards
> regfit.seq=regsubsets(Salary~.,data=Hitters,
+                       nvmax=19, method="seqrep")
> summary(regfit.seq)

Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "seqrep")
19 Variables  (and intercept)
           Forced in Forced out
AtBat          FALSE      FALSE
Hits           FALSE      FALSE
HmRun          FALSE      FALSE
Runs           FALSE      FALSE
RBI            FALSE      FALSE
Walks          FALSE      FALSE
Years          FALSE      FALSE
CAtBat         FALSE      FALSE
CHits          FALSE      FALSE
CHmRun         FALSE      FALSE
CRuns          FALSE      FALSE
CRBI           FALSE      FALSE
CWalks         FALSE      FALSE
LeagueN        FALSE      FALSE
DivisionW      FALSE      FALSE
PutOuts        FALSE      FALSE
Assists        FALSE      FALSE
Errors         FALSE      FALSE
NewLeagueN     FALSE      FALSE
1 subsets of each size up to 19
Selection Algorithm: 'sequential replacement'
          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
5  ( 1 )  "*"   "*"  "*"   "*"  "*" " "   " "   " "    " "   " "    " "   " " 
6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   " " 
8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " " 
9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"   "*" 
16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*" 
17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*" 
18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"   "*" 
19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"   "*" 
          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
2  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
3  ( 1 )  " "    " "     " "       "*"     " "     " "    " "       
4  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
5  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
6  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
7  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
8  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
9  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
10  ( 1 ) "*"    " "     "*"       "*"     "*"     " "    " "       
11  ( 1 ) "*"    "*"     "*"       "*"     "*"     " "    " "       
12  ( 1 ) "*"    "*"     "*"       "*"     "*"     " "    " "       
13  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
14  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
15  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
16  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
17  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"       
18  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"       
19  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"

Sequential replacement can be specified using seqrep.

> library(MASS)
> library(caret)
> 
> # Fit the full model 
> full.model <- lm(Salary ~., data = Hitters)
> # Stepwise regression model
> step.model <- stepAIC(full.model, direction = "both", 
+                       trace = FALSE)
> summary(step.model)


Call:
lm(formula = Salary ~ AtBat + Hits + Walks + CAtBat + CRuns + 
    CRBI + CWalks + Division + PutOuts + Assists, data = Hitters)

Residuals:
    Min      1Q  Median      3Q     Max 
-939.11 -176.87  -34.08  130.90 1910.55 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  162.53544   66.90784   2.429 0.015830 *  
AtBat         -2.16865    0.53630  -4.044 7.00e-05 ***
Hits           6.91802    1.64665   4.201 3.69e-05 ***
Walks          5.77322    1.58483   3.643 0.000327 ***
CAtBat        -0.13008    0.05550  -2.344 0.019858 *  
CRuns          1.40825    0.39040   3.607 0.000373 ***
CRBI           0.77431    0.20961   3.694 0.000271 ***
CWalks        -0.83083    0.26359  -3.152 0.001818 ** 
DivisionW   -112.38006   39.21438  -2.866 0.004511 ** 
PutOuts        0.29737    0.07444   3.995 8.50e-05 ***
Assists        0.28317    0.15766   1.796 0.073673 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 311.8 on 252 degrees of freedom
Multiple R-squared:  0.5405,    Adjusted R-squared:  0.5223 
F-statistic: 29.64 on 10 and 252 DF,  p-value: < 2.2e-16

The stepAIC function will find the best model according to AIC. It differs from regsubsets which finds the best model according to RSS. Specifying both will use a combination of both forwards and backwards selection.

> # Set seed for reproducibility
> set.seed(123)
> # Set up repeated k-fold cross-validation
> train.control <- trainControl(method = "cv", number = 10)
> # Train the model
> step.model <- train(Salary ~., data = Hitters,
+                     method = "leapSeq", 
+                     tuneGrid = data.frame(nvmax = 1:19),
+                     trControl = train.control
+ )
> 
> #Model metrics
> step.model$results

   nvmax     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
1      1 371.3887 0.3725884 263.4471 85.50842  0.1751559 46.56110
2      2 347.0982 0.4454148 233.9166 76.43313  0.1967189 39.05602
3      3 356.1870 0.4166351 243.4634 70.71259  0.2180077 39.21220
4      4 342.1703 0.4466726 239.5107 70.12556  0.2101984 39.98074
5      5 351.6811 0.4141493 251.5542 79.28301  0.1771184 46.24903
6      6 353.8881 0.3971377 252.7744 78.96201  0.2230353 41.31359
7      7 341.5259 0.4565160 238.2607 71.74847  0.1942697 42.61554
8      8 336.3383 0.4637186 234.2683 81.92114  0.1932574 43.54620
9      9 332.7480 0.4818312 235.5175 71.18994  0.1971069 41.09389
10    10 328.5512 0.4910098 234.5948 73.40151  0.1981401 42.91242
11    11 327.8415 0.4933839 234.4990 70.98498  0.1983722 41.75501
12    12 335.1188 0.4811998 237.1109 77.38550  0.2116452 47.00768
13    13 342.7259 0.4716044 240.2832 77.18950  0.2236727 41.75227
14    14 341.6455 0.4644272 239.2281 76.96833  0.2032048 43.27559
15    15 337.2516 0.4754851 238.8219 73.05104  0.2060950 42.33556
16    16 337.0595 0.4791323 236.8887 74.54225  0.2033363 42.89640
17    17 336.4725 0.4793681 238.2540 74.45945  0.2017970 43.15108
18    18 337.9055 0.4783858 238.6822 74.09984  0.2029365 42.70828
19    19 338.6068 0.4772604 239.2730 75.20275  0.2044082 42.80425

> #Display number of variables for best model
> step.model$bestTune

   nvmax
11    11

> #Display up to the best model
> summary(step.model$finalModel)

Subset selection object
19 Variables  (and intercept)
           Forced in Forced out
AtBat          FALSE      FALSE
Hits           FALSE      FALSE
HmRun          FALSE      FALSE
Runs           FALSE      FALSE
RBI            FALSE      FALSE
Walks          FALSE      FALSE
Years          FALSE      FALSE
CAtBat         FALSE      FALSE
CHits          FALSE      FALSE
CHmRun         FALSE      FALSE
CRuns          FALSE      FALSE
CRBI           FALSE      FALSE
CWalks         FALSE      FALSE
LeagueN        FALSE      FALSE
DivisionW      FALSE      FALSE
PutOuts        FALSE      FALSE
Assists        FALSE      FALSE
Errors         FALSE      FALSE
NewLeagueN     FALSE      FALSE
1 subsets of each size up to 11
Selection Algorithm: 'sequential replacement'
          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
5  ( 1 )  "*"   "*"  "*"   "*"  "*" " "   " "   " "    " "   " "    " "   " " 
6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   " " 
8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " " 
9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
2  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
3  ( 1 )  " "    " "     " "       "*"     " "     " "    " "       
4  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
5  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
6  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
7  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
8  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
9  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
10  ( 1 ) "*"    " "     "*"       "*"     "*"     " "    " "       
11  ( 1 ) "*"    "*"     "*"       "*"     "*"     " "    " "

> #Show model coefficients
> coef(step.model$finalModel, 11)

 (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns 
 135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914    1.4553310 
        CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists 
   0.7852528   -0.8228559   43.1116152 -111.1460252    0.2894087    0.2688277

We can also use stepwise selection along with cross validation. This method will find the best model from 1 to 19 predictors (I used sequential replacement in this case) and will then use 10 fold cross validation to estimate RMSE. The best model will have the lowest RMSE (model 11).

Cross Validation with Best Subsets

The MASS, caret, and leaps packages made it relatively simple to perform cross validation with forward, backward, and sequential replacement, but it is more complicated with best subsets. The follwing formula can be used.

> #Predict formula for regsubsets
> predict.regsubsets=function(object,newdata,id){
+   form=as.formula(object$call[[2]]) 
+   mat=model.matrix(form,newdata)
+   coefi=coef(object, id=id)
+   xvars=names(coefi)
+   mat[,xvars]%*%coefi
+ }

There is no predict method for regsubsets, but the above formula can be used. It will be demonstrated below.

> #Set 10 folds and create a matrix of NAs
> k=10
> set.seed(1)
> folds=sample(1:k,nrow(Hitters),replace=TRUE)
> cv.errors=matrix(NA,k,19,dimnames = list(NULL, paste(1:19)))

k sepcifies the number of folds and a matrix is created to store the results

> #loop that performs cross validation
> for(j in 1:k){
+   best.fit=regsubsets(Salary~.,data=Hitters[folds!=j,],
+                       nvmax=19)
+   for(i in 1:19){
+     pred=predict.regsubsets(best.fit,Hitters[folds==j,],id=i)
+     cv.errors[j,i]=mean((Hitters$Salary[folds==j]-pred)^2)
+   }
+ }

This gives us a 10x19 matrix. The \((i,j)^{th}\) element corresponds to the test MSE for the \(i^{th}\) cross-validation fold for the best \(j\) variable model.

> #average over columns
> mean.cv.errors=apply(cv.errors,2,mean)
> mean.cv.errors

       1        2        3        4        5        6        7        8 
149821.1 130922.0 139127.0 131028.8 131050.2 119538.6 124286.1 113580.0 
       9       10       11       12       13       14       15       16 
115556.5 112216.7 113251.2 115755.9 117820.8 119481.2 120121.6 120074.3 
      17       18       19 
120084.8 120085.8 120403.5

We can now take an average of the columns.

> #plot the column averages
> par(mfrow=c(1,1))
> plot(mean.cv.errors,type='b')

We can see that cross-validation selects an 11 variable model

> #best subset selection on full data
> reg.best=regsubsets(Salary~.,data=Hitters, nvmax=19)
> #view model 11
> coef(reg.best,11)

 (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns 
 135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914    1.4553310 
        CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists 
   0.7852528   -0.8228559   43.1116152 -111.1460252    0.2894087    0.2688277

Cp, AIC, BIC, and Adjusted R Squared

We can use \(RSS\) and \(R^2\) to evaluate the training error, but we are mostly interested in the test error. We can therefore use methods that adjust the training error for model size.

For a least squares model containing \(d\) predictors, the \(C_p\) estimate of the test MSE is:

\[C_p=\frac{1}{n}(RSS+2d\hat{\sigma}^2)\]

\(\hat{\sigma}^2\) is an estimate for the variance of \(\epsilon\), the error associated with each response.
\(2d\hat{\sigma}^2\) is a penalty term that adjusts for the fact that the training error underestimates the test error. It increases with the number of predictors.
The best model is the one with the lowest \(C_p\) value

AIC

Similarly, AIC can be calculated as:

\[AIC=\frac{1}{n\hat{\sigma}^2}(RSS+2d\hat{\sigma}^2)\]

For the least squares model \(AIC\) and \(C_p\) are proportional to each other. (a lower value is also better)

BIC

BIC looks similar:

\[BIC=\frac{1}{n\hat{\sigma}^2}(RSS+log(n)d\hat{\sigma}^2)\]

Instead of \(2d\hat{\sigma}^2\) BIC uses an error term of \(log(n)d\hat{\sigma}^2\).
As with \(C_p\) and \(AIC\), we also select the lowest value for \(BIC\)
Since \(log(n) >2\) for any \(n > 7\), BIC places a heavier penalty on models with many variables.

Adjusted R Squared

The un-adjusted \(R^2\) is calculated as \(1-RSS/TSS\), where \(TSS = \sum{(y_i-\bar{y})^2}\). \(R^2\) always increases as more variables are added to the model, since \(RSS\) will decline.

For a least squares model with \(d\) variables, \(adj\space R^2\) is calculated as:

\[1-\frac{RSS/(n-d-1)}{TSS/(n-1)}\]

In this case, a larger value is preferred.
Unlike with \(R^2\), \(adj \space R^2\) penalizes unnecessary variables.

Linear Model Selection

R Markdown

Paul Jozefek

2020-06-06