Model Selection: Best Subset Regression

I followed the Week 7 tutorial video on best subset regression and applied the steps shown there to the baseball data set we worked on for homework 1.

library(ISLR)
library(regclass)

Read baseball stats from homework 1

team <- read.csv(file="moneyball-training-data.csv", header=TRUE, sep=",")
nrow(team)
## [1] 2276

There are 2,276 rows.

The summary below shows that several variables have missing data.

summary(team)
##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286

I’m dropping TEAM_BATTING_HBP and TEAM_BASERUN_CS because these columns have too many missing values (2,085 and 772 NA’s, respectively). I’m also dropping the INDEX column, since it is not going to be used in the model.
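
A quick way to confirm the per-column missing counts before dropping anything (a small check using base R; output omitted here):

colSums(is.na(team))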

team <- subset(team, select = -c(INDEX, TEAM_BATTING_HBP, TEAM_BASERUN_CS) )

Omitting rows that are not complete cases. There are 1,835 complete cases.

team <- na.omit(team)
nrow(team)
## [1] 1835

Summary of complete cases.

summary(team)
##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   : 38.00   Min.   :1137   Min.   :130.0   Min.   : 11.00  
##  1st Qu.: 72.00   1st Qu.:1386   1st Qu.:216.0   1st Qu.: 32.00  
##  Median : 82.00   Median :1448   Median :245.0   Median : 42.00  
##  Mean   : 80.99   Mean   :1457   Mean   :247.8   Mean   : 47.84  
##  3rd Qu.: 90.00   3rd Qu.:1524   3rd Qu.:277.0   3rd Qu.: 60.00  
##  Max.   :120.00   Max.   :1876   Max.   :392.0   Max.   :138.00  
##  TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  4.0   Min.   :273.0   Min.   : 324.0   Min.   : 18.0  
##  1st Qu.: 77.0   1st Qu.:473.0   1st Qu.: 602.5   1st Qu.: 62.0  
##  Median :119.0   Median :525.0   Median : 816.0   Median : 91.0  
##  Mean   :116.5   Mean   :530.3   Mean   : 786.8   Mean   :100.8  
##  3rd Qu.:156.0   3rd Qu.:586.0   3rd Qu.: 956.5   3rd Qu.:131.0  
##  Max.   :264.0   Max.   :878.0   Max.   :1399.0   Max.   :367.0  
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
##  Min.   :1137    Min.   :  4.0    Min.   : 325     Min.   : 345.0  
##  1st Qu.:1409    1st Qu.: 80.0    1st Qu.: 488     1st Qu.: 641.0  
##  Median :1493    Median :123.0    Median : 542     Median : 829.0  
##  Mean   :1525    Mean   :120.7    Mean   : 554     Mean   : 818.0  
##  3rd Qu.:1602    3rd Qu.:160.0    3rd Qu.: 609     3rd Qu.: 972.5  
##  Max.   :2394    Max.   :343.0    Max.   :1090     Max.   :1781.0  
##  TEAM_FIELDING_E TEAM_FIELDING_DP
##  Min.   : 65.0   Min.   : 72.0   
##  1st Qu.:122.0   1st Qu.:136.0   
##  Median :143.0   Median :151.0   
##  Mean   :159.9   Mean   :150.4   
##  3rd Qu.:181.5   3rd Qu.:165.0   
##  Max.   :515.0   Max.   :228.0

Best Subset Regression

Best subset regression searches through all possible regression models of every subset size and identifies the best model of each size, producing a sequence of best models, one per size. With 13 candidate predictors, that is 2^13 = 8,192 candidate models. The leaps library will help us do this.

library(leaps)

Call the regsubsets function of the leaps library to perform the best subset regression. By default, this function only considers models with up to 8 variables. We have a total of 13 explanatory variables, so specify nvmax = 13 to include all of them.

ncol(team)
## [1] 14

Below is the best-fitting subset of each size.

regfit.full = leaps::regsubsets(TARGET_WINS~., data=team, nvmax=13)
(reg.summary = summary(regfit.full))
## Subset selection object
## Call: regsubsets.formula(TARGET_WINS ~ ., data = team, nvmax = 13)
## 13 Variables  (and intercept)
##                  Forced in Forced out
## TEAM_BATTING_H       FALSE      FALSE
## TEAM_BATTING_2B      FALSE      FALSE
## TEAM_BATTING_3B      FALSE      FALSE
## TEAM_BATTING_HR      FALSE      FALSE
## TEAM_BATTING_BB      FALSE      FALSE
## TEAM_BATTING_SO      FALSE      FALSE
## TEAM_BASERUN_SB      FALSE      FALSE
## TEAM_PITCHING_H      FALSE      FALSE
## TEAM_PITCHING_HR     FALSE      FALSE
## TEAM_PITCHING_BB     FALSE      FALSE
## TEAM_PITCHING_SO     FALSE      FALSE
## TEAM_FIELDING_E      FALSE      FALSE
## TEAM_FIELDING_DP     FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 1  ( 1 )  "*"            " "             " "             " "            
## 2  ( 1 )  "*"            " "             " "             " "            
## 3  ( 1 )  "*"            " "             " "             " "            
## 4  ( 1 )  " "            " "             " "             "*"            
## 5  ( 1 )  " "            " "             "*"             "*"            
## 6  ( 1 )  " "            " "             "*"             "*"            
## 7  ( 1 )  " "            " "             "*"             "*"            
## 8  ( 1 )  " "            " "             "*"             "*"            
## 9  ( 1 )  "*"            "*"             "*"             "*"            
## 10  ( 1 ) " "            "*"             "*"             "*"            
## 11  ( 1 ) "*"            "*"             "*"             "*"            
## 12  ( 1 ) "*"            "*"             "*"             "*"            
## 13  ( 1 ) "*"            "*"             "*"             "*"            
##           TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
## 1  ( 1 )  " "             " "             " "             " "            
## 2  ( 1 )  "*"             " "             " "             " "            
## 3  ( 1 )  "*"             " "             "*"             " "            
## 4  ( 1 )  " "             "*"             "*"             " "            
## 5  ( 1 )  " "             "*"             "*"             " "            
## 6  ( 1 )  " "             " "             "*"             " "            
## 7  ( 1 )  " "             " "             "*"             " "            
## 8  ( 1 )  "*"             " "             "*"             "*"            
## 9  ( 1 )  " "             " "             "*"             " "            
## 10  ( 1 ) "*"             "*"             "*"             "*"            
## 11  ( 1 ) "*"             " "             "*"             "*"            
## 12  ( 1 ) "*"             "*"             "*"             "*"            
## 13  ( 1 ) "*"             "*"             "*"             "*"            
##           TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 1  ( 1 )  " "              " "              " "             
## 2  ( 1 )  " "              " "              " "             
## 3  ( 1 )  " "              " "              " "             
## 4  ( 1 )  " "              " "              " "             
## 5  ( 1 )  " "              " "              " "             
## 6  ( 1 )  " "              "*"              "*"             
## 7  ( 1 )  " "              "*"              "*"             
## 8  ( 1 )  " "              " "              "*"             
## 9  ( 1 )  " "              "*"              "*"             
## 10  ( 1 ) " "              " "              "*"             
## 11  ( 1 ) " "              "*"              "*"             
## 12  ( 1 ) " "              "*"              "*"             
## 13  ( 1 ) "*"              "*"              "*"             
##           TEAM_FIELDING_E TEAM_FIELDING_DP
## 1  ( 1 )  " "             " "             
## 2  ( 1 )  " "             " "             
## 3  ( 1 )  " "             " "             
## 4  ( 1 )  "*"             " "             
## 5  ( 1 )  "*"             " "             
## 6  ( 1 )  "*"             " "             
## 7  ( 1 )  "*"             "*"             
## 8  ( 1 )  "*"             "*"             
## 9  ( 1 )  "*"             "*"             
## 10  ( 1 ) "*"             "*"             
## 11  ( 1 ) "*"             "*"             
## 12  ( 1 ) "*"             "*"             
## 13  ( 1 ) "*"             "*"

These are the statistics available from the summary for the best models of each size.

names(reg.summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

Plot Cp (an estimate of prediction error).

The model with the lowest Cp should be selected.

plot(reg.summary$cp, xlab="Number of Variables", ylab="Cp")

The model with the lowest Cp is the one with 12 variables.

which.min(reg.summary$cp)
## [1] 12
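
The minimum can also be marked directly on the plot. This is a small sketch that re-draws the Cp plot from above and highlights the 12-variable model:

plot(reg.summary$cp, xlab="Number of Variables", ylab="Cp")
best.cp <- which.min(reg.summary$cp)                          # 12, as shown above
points(best.cp, reg.summary$cp[best.cp], col="red", pch=19)   # highlight the lowest-Cp model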

This plot shows the Cp statistic on the y-axis. The lowest Cp value is 12, for the model with 12 variables selected (indicated by the black cells in that row); the highest Cp is 850. A white cell means the variable is out of that model; a black cell means it is in.

You’ll notice that towards the top of the plot (the lowest Cp values), the pattern of black cells is reasonably stable.

plot(regfit.full, scale="Cp")

These are the coefficients of the 12-variable model, which is the best model found by best subset regression.

coef(regfit.full, 12)
##      (Intercept)   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
##      59.05483239      -0.03384353      -0.04926792       0.18349652 
##  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB 
##       0.10026292       0.11836348       0.03331614       0.06946472 
##  TEAM_PITCHING_H TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##       0.05981299      -0.08030854      -0.05352319      -0.11886411 
## TEAM_FIELDING_DP 
##      -0.11231685

The plot below shows the adjusted R-squared value for each model size.

plot(reg.summary$adjr2, xlab="Number of Variables", ylab="adjr2")
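
The same idea applies here: the model size that maximizes adjusted R-squared can be identified with which.max (a quick sketch; output not shown):

which.max(reg.summary$adjr2)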

Using the lm function, we get the same coefficients for model 12 (all variables except TEAM_PITCHING_HR).

All the variables except TEAM_BATTING_SO (p = 0.058) are significant at the 0.05 level.

model12 <- lm(TARGET_WINS ~. - TEAM_PITCHING_HR, data=team)
summary(model12)
## 
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_PITCHING_HR, data = team)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.168  -7.256   0.142   6.945  29.896 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      59.054832   6.025125   9.801  < 2e-16 ***
## TEAM_BATTING_H   -0.033844   0.015767  -2.146  0.03197 *  
## TEAM_BATTING_2B  -0.049268   0.008874  -5.552 3.24e-08 ***
## TEAM_BATTING_3B   0.183497   0.018984   9.666  < 2e-16 ***
## TEAM_BATTING_HR   0.100263   0.009155  10.952  < 2e-16 ***
## TEAM_BATTING_BB   0.118363   0.041385   2.860  0.00428 ** 
## TEAM_BATTING_SO   0.033316   0.017529   1.901  0.05751 .  
## TEAM_BASERUN_SB   0.069465   0.005534  12.552  < 2e-16 ***
## TEAM_PITCHING_H   0.059813   0.014323   4.176 3.11e-05 ***
## TEAM_PITCHING_BB -0.080309   0.039308  -2.043  0.04119 *  
## TEAM_PITCHING_SO -0.053523   0.016559  -3.232  0.00125 ** 
## TEAM_FIELDING_E  -0.118864   0.007122 -16.690  < 2e-16 ***
## TEAM_FIELDING_DP -0.112317   0.012271  -9.153  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.18 on 1822 degrees of freedom
## Multiple R-squared:  0.4058, Adjusted R-squared:  0.4019 
## F-statistic: 103.7 on 12 and 1822 DF,  p-value: < 2.2e-16

VIF of Model 12

Looking at the VIF of each variable, some are above 5 (far above 5), which is indicative of a problematic amount of collinearity in the model.

Source: http://www.sthda.com/english/articles/39-regression-model-diagnostics/160-multicollinearity-essentials-and-vif-in-r/

For a given predictor (p), multicollinearity can be assessed by computing a score called the variance inflation factor (or VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

VIF(model12)
##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##        51.432070         2.572898         3.008175         4.410352 
##  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_H 
##       221.742560       256.175602         1.509687       110.268728 
## TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##       260.043366       239.822447         2.999999         1.372712
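
To connect these numbers to the definition above, a VIF can also be computed by hand: regress one predictor on all of the other predictors in the model and apply VIF = 1 / (1 - R^2). A sketch for TEAM_BATTING_BB, which should reproduce the value reported above:

# R-squared from regressing TEAM_BATTING_BB on the other predictors of model 12
r2 <- summary(lm(TEAM_BATTING_BB ~ . - TARGET_WINS - TEAM_PITCHING_HR, data=team))$r.squared
1 / (1 - r2)  # variance inflation factor for TEAM_BATTING_BB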

Standardized Residuals

The article linked below discusses what standardized residual plots should look like for an acceptable model.

Source: https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/

Standardized residual plots should have these characteristics:

  1. they’re pretty symmetrically distributed, tending to cluster towards the middle of the plot.
  2. they’re clustered around the lower single digits of the y-axis (e.g., 0.5 or 1.5, not 30 or 150).
  3. in general, there aren’t any clear patterns.

The standardized residual plot of model 12 seems to display the characteristics described above. Generally, the residuals fall between -3 and 3 (not a big range), they are symmetrically distributed around the zero horizontal line (indicating low errors between actual and predicted values), and I do not see any curved patterns.

model12.predict <- predict(model12)
model12.stdres <- rstandard(model12)
plot(model12.predict, model12.stdres, ylab="Standardized Residuals", xlab="Predicted Target Wins",  main="Model 12") 
abline(0, 0)
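
As a quick numeric check of the claim that the residuals stay between -3 and 3, we can count how many standardized residuals fall outside that range (a sketch reusing model12.stdres from above):

sum(abs(model12.stdres) > 3)  # number of observations with |standardized residual| > 3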