Model Selection for MLB Salaries

In this lesson we will be using the Hitters dataset from the ISLR package. These data reflect the salaries of baseball platers and various player metrics for the 1886 and 1987 seasons.

library(ISLR)
## Warning: package 'ISLR' was built under R version 3.6.2
data("Hitters")
names(Hitters)
##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
##  [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
## [13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
## [19] "Salary"    "NewLeague"
dim(Hitters)
## [1] 322  20

First we want to check for any missing data in your response, which is salary. Then we will remove the NAs. This process is called listwise deletion.

# How many data points are misisng for Salary?
sum(is.na(Hitters$Salary))
## [1] 59
# We are going to have to remove the NAs
Hitters<-na.omit(Hitters)
dim(Hitters)
## [1] 263  20
sum(is.na(Hitters$Salary))
## [1] 0

Part I: Classic Variable Section Approaches

  • Forward Selection
  • Backward Selection
  • Mixed Selection
  • Best Subset Selection

In order to demonstrate how the algorithms work, we will first only consider a small subset of predictors.

Candidate Predictors:

  • CRBI: Career runs batted in
  • Hits: Number of hits in 1986
  • Runs: Number of runs in 1986

Forward Selection Algorithm

  1. Start with nothing
  2. Pick variable with lowest sum of squares residual (or p-value)
  3. Add variables until hit stopping rule

Let’s do this with our three variables.

STEP 1: Use simple linear regression to select one variable to enter the model.

# Forward Selection
mod1<-lm(Salary~CRBI, data=Hitters)
summary(mod1)
## 
## Call:
## lm(formula = Salary ~ CRBI, data = Hitters)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1099.27  -203.45   -97.43   146.37  1847.22 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 274.58039   32.85537   8.357 3.85e-15 ***
## CRBI          0.79095    0.07113  11.120  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 372.3 on 261 degrees of freedom
## Multiple R-squared:  0.3215, Adjusted R-squared:  0.3189 
## F-statistic: 123.6 on 1 and 261 DF,  p-value: < 2.2e-16
mod2<-lm(Salary~Hits, data=Hitters)
summary(mod2)
## 
## Call:
## lm(formula = Salary ~ Hits, data = Hitters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -893.99 -245.63  -59.08  181.12 2059.90 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  63.0488    64.9822   0.970    0.333    
## Hits          4.3854     0.5561   7.886 8.53e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 406.2 on 261 degrees of freedom
## Multiple R-squared:  0.1924, Adjusted R-squared:  0.1893 
## F-statistic: 62.19 on 1 and 261 DF,  p-value: 8.531e-14
mod3<-lm(Salary~Runs, data=Hitters)
summary(mod3)
## 
## Call:
## lm(formula = Salary ~ Runs, data = Hitters)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -714.8 -251.2  -55.9  193.6 1997.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 129.9292    59.9240   2.168    0.031 *  
## Runs          7.4161     0.9923   7.474 1.18e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 410.2 on 261 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1731 
## F-statistic: 55.86 on 1 and 261 DF,  p-value: 1.18e-12
# First step: keep CRBI (smallest p-val)

STEP 2: Look for a second variable to add to the model.

mod12<-lm(Salary~CRBI+Hits, data=Hitters)
summary(mod12)
## 
## Call:
## lm(formula = Salary ~ CRBI + Hits, data = Hitters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -942.47 -183.72  -37.85   96.55 2167.16 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -47.95590   55.98245  -0.857    0.392    
## CRBI          0.68990    0.06723  10.262  < 2e-16 ***
## Hits          3.30084    0.48177   6.851 5.28e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 343.3 on 260 degrees of freedom
## Multiple R-squared:  0.4252, Adjusted R-squared:  0.4208 
## F-statistic: 96.17 on 2 and 260 DF,  p-value: < 2.2e-16
mod13<-lm(Salary~CRBI+Runs, data=Hitters)
summary(mod13)
## 
## Call:
## lm(formula = Salary ~ CRBI + Runs, data = Hitters)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1037.58  -194.51   -41.45   111.69  2125.83 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.40744   52.04504  -0.065    0.948    
## CRBI         0.70114    0.06737  10.408  < 2e-16 ***
## Runs         5.61989    0.85295   6.589 2.45e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 345.3 on 260 degrees of freedom
## Multiple R-squared:  0.4185, Adjusted R-squared:  0.4141 
## F-statistic: 93.57 on 2 and 260 DF,  p-value: < 2.2e-16
# Second step: keep Hits

STEP 3: Look for a third variable, but ….OH NO! ITs not significant! So we will stop and only use the best model with two variables.

mod123<-lm(Salary~CRBI+Hits+Runs, data=Hitters)
summary(mod123)
## 
## Call:
## lm(formula = Salary ~ CRBI + Hits + Runs, data = Hitters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -969.16 -186.72  -41.07  100.55 2166.47 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -46.24811   56.01270  -0.826   0.4098    
## CRBI          0.68948    0.06724  10.255   <2e-16 ***
## Hits          2.28191    1.14187   1.998   0.0467 *  
## Runs          1.97827    2.00996   0.984   0.3259    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 343.3 on 259 degrees of freedom
## Multiple R-squared:  0.4274, Adjusted R-squared:  0.4207 
## F-statistic: 64.43 on 3 and 259 DF,  p-value: < 2.2e-16
# STOP! Because adding Hits is not significant

Backward Selection Algorithm

  1. Start with saturated model
  2. Remove largest p-value
  3. Remove variables until hit stopping rule

STEP 1: Start with a full model using all the variables. Take out any variables that are not significant.

# Backward
# Start with all vars and take away
mod123<-lm(Salary~CRBI+Hits+Runs, data=Hitters)
summary(mod123)
## 
## Call:
## lm(formula = Salary ~ CRBI + Hits + Runs, data = Hitters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -969.16 -186.72  -41.07  100.55 2166.47 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -46.24811   56.01270  -0.826   0.4098    
## CRBI          0.68948    0.06724  10.255   <2e-16 ***
## Hits          2.28191    1.14187   1.998   0.0467 *  
## Runs          1.97827    2.00996   0.984   0.3259    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 343.3 on 259 degrees of freedom
## Multiple R-squared:  0.4274, Adjusted R-squared:  0.4207 
## F-statistic: 64.43 on 3 and 259 DF,  p-value: < 2.2e-16
# First step: Remove Runs

STEP 2: Continue taking out variables until the only variables that remain are significant. We can stop with a two variable model.

mod12<-lm(Salary~CRBI+Hits, data=Hitters)
summary(mod12)
## 
## Call:
## lm(formula = Salary ~ CRBI + Hits, data = Hitters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -942.47 -183.72  -37.85   96.55 2167.16 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -47.95590   55.98245  -0.857    0.392    
## CRBI          0.68990    0.06723  10.262  < 2e-16 ***
## Hits          3.30084    0.48177   6.851 5.28e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 343.3 on 260 degrees of freedom
## Multiple R-squared:  0.4252, Adjusted R-squared:  0.4208 
## F-statistic: 96.17 on 2 and 260 DF,  p-value: < 2.2e-16
# STOP! Now everything is significant 

Mixed Selection

  1. Start with no variables
  2. Add variables with best fit
  3. Continue to add variables, but if p-values become larger as new variables are added remove that variable from the model
  4. Continue going back and forth until variable have sufficiently low p-value

Ok, now for the shortcut!… We can use an R package to automate this

The leaps package in R can used for variable selection. For some reason, there can be problems with installing this package, but if you specify the repository it will work! (See the example in my code)

#install.packages("leaps", repo = 'https://mac.R-project.org')
library(leaps)

This package will default to telling you the best models with one through eight variables included in the model. However, if you want to know all of the different possibilities up to the full model with all the predictors, you can do that too using the nvmax argument.

You must specify what type of selection you would like to do using the method argument.

# Forward 
regfit.fwd<-regsubsets(Salary~., data=Hitters,
                       method="forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, method = "forward")
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
## 7  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
## 8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   "*" 
##          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 ) " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 ) "*"    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 ) "*"    " "     "*"       "*"     " "     " "    " "
# Backward 
regfit.bwd<-regsubsets(Salary~., data=Hitters, 
                       method="backward")
summary(regfit.bwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, method = "backward")
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: backward
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    "*"   " " 
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"   " " 
## 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"   " " 
## 4  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"   " " 
## 5  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   " " 
## 6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   " " 
## 7  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   " " 
## 8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"   "*" 
##          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 ) " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 ) " "    " "     " "       "*"     " "     " "    " "       
## 5  ( 1 ) " "    " "     " "       "*"     " "     " "    " "       
## 6  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 ) "*"    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 ) "*"    " "     "*"       "*"     " "     " "    " "

Best Subset Selection Algorithm

  1. Fit all the models for the specified number of predictors
  2. Select the model that is best (p-val, R-squared adj, Cp, BIC, AIC,…)
# Best Subset 
regfit.full<-regsubsets(Salary~., data=Hitters)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters)
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
## 7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "   " " 
## 8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " " 
##          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 ) " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 ) "*"    " "     "*"       "*"     " "     " "    " "
regfit.full<-regsubsets(Salary~., data=Hitters, nvmax=19)
reg.summary<-summary(regfit.full)
reg.summary
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19)
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
##           AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
## 7  ( 1 )  " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "   " " 
## 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " " 
## 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"   "*" 
## 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*" 
## 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*" 
## 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"   "*" 
## 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"   "*" 
##           CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 )  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
## 9  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
## 10  ( 1 ) "*"    " "     "*"       "*"     "*"     " "    " "       
## 11  ( 1 ) "*"    "*"     "*"       "*"     "*"     " "    " "       
## 12  ( 1 ) "*"    "*"     "*"       "*"     "*"     " "    " "       
## 13  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 14  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 15  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 16  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 17  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 18  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 19  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"

Part II: Model Comparison Metrics

Now lets look at the variables that are contained in the regsubset object. We can learn a lot about different model comparison metrics.

  • R-squared
  • R-squared Adjusted
  • BIC
  • Mallows Cp

Remember, the goal of these model comparison metrics is to find the “best” model that balances the model fit and simplicity/interpretability.

names(reg.summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

R-squared

\[ R^2=\frac{SS(Reg)}{SS(Tot)}=1-\frac{SS(Res)}{SS(Tot)}\]

Although the R-squared metric is useful in simple linear regression, it has some drawbacks for multiple linear regression. R-squared monotonically increases as we add more and more variables into the model, which is not good because it does not balance for the increased complexity. We want to maximize this value!

# R-squared (monotonically increasing)
reg.summary$rsq
##  [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
##  [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
rsq_df<-data.frame(rsq=reg.summary$rsq,
                   size=1:19)

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.2
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'readr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## Warning: package 'forcats' was built under R version 3.6.2
ggplot(rsq_df, aes(x=size, y=rsq))+
  geom_point()+
  geom_line()+
  ggtitle("R-squared")

R-squared Adjusted

This is similar to R-squared but gives some penalty to the number of predictors, p. We want to maximize this value!

\[ R^2_{adj}=1-\frac{SS(Res)/(n-p-1)}{SS(Tot)/(n-1)}\]

adjrsq_df<-data.frame(rsq=c(reg.summary$rsq,reg.summary$adjr2),
                      type=c(rep("rsq",19),rep("rsq_adj",19)),
                   size=rep(1:19,2))

ggplot(adjrsq_df, aes(x=size, y=rsq, color=type))+
  geom_point()+
  geom_line()+
  ggtitle("R-squared vs Adj R-squared")

Mallows Cp and BIC

Unlike the R-squared values, we want to minimize these metrics:

\[C_p=\frac{1}{n}(SS(Res)+2p\hat{\sigma}^2)\]

\[BIC=\frac{1}{n}(SS(Res)+log(n)p\hat{\sigma}^2)\] Note that BIC gives a harsher penalty and thereby picks smaller models.

# Lets look at the other metrics
metric_df<-data.frame(metric=c(reg.summary$rsq,
                               reg.summary$adjr2,
                               reg.summary$cp, 
                               reg.summary$bic),
                      type=c(rep("rsq",19),
                             rep("rsq_adj",19),
                             rep("cp",19),
                             rep("bic",19)),
                      size=rep(1:19,4))

# Add vertical lines to show the best model
which.min(reg.summary$bic)
## [1] 6
which.min(reg.summary$cp)
## [1] 10
which.max(reg.summary$rsq)
## [1] 19
which.max(reg.summary$adjr2)
## [1] 11
vline.dat <- data.frame(type=unique(metric_df$type), vl=c(6, 10,19, 11 ))

ggplot(metric_df, aes(x=size, y=metric, color=type))+
  geom_point()+
  geom_line()+
  ggtitle("Model Comparison Metrics")+
  geom_vline(aes(xintercept=vl), data=vline.dat, lty=2) +
  facet_grid(type~., scales = "free_y")