I followed the Week 7 tutorial video on best subset regression and applied the steps shown there to the baseball data set we worked on for homework 1.
library(ISLR)
library(regclass)
Read baseball stats from homework 1
team <- read.csv(file="moneyball-training-data.csv", header=TRUE, sep=",")
nrow(team)
## [1] 2276
There are 2,276 rows, and several variables have missing data.
summary(team)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
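To quantify the missingness before deciding which columns to drop, a quick check (not shown in the tutorial video) counts the NAs per column:
# Number of missing values in each column of the raw data.
colSums(is.na(team))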
I’m dropping TEAM_BATTING_HBP and TEAM_BASERUN_CS because these columns have too many missing values (2,085 and 772 NAs, respectively).
I’m also dropping the INDEX column, since it will not be used in the model.
team <- subset(team, select = -c(INDEX, TEAM_BATTING_HBP, TEAM_BASERUN_CS))
Omit rows that are not complete cases; 1,835 complete cases remain.
team <- na.omit(team)
nrow(team)
## [1] 1835
Summary of complete cases.
summary(team)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 38.00 Min. :1137 Min. :130.0 Min. : 11.00
## 1st Qu.: 72.00 1st Qu.:1386 1st Qu.:216.0 1st Qu.: 32.00
## Median : 82.00 Median :1448 Median :245.0 Median : 42.00
## Mean : 80.99 Mean :1457 Mean :247.8 Mean : 47.84
## 3rd Qu.: 90.00 3rd Qu.:1524 3rd Qu.:277.0 3rd Qu.: 60.00
## Max. :120.00 Max. :1876 Max. :392.0 Max. :138.00
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 4.0 Min. :273.0 Min. : 324.0 Min. : 18.0
## 1st Qu.: 77.0 1st Qu.:473.0 1st Qu.: 602.5 1st Qu.: 62.0
## Median :119.0 Median :525.0 Median : 816.0 Median : 91.0
## Mean :116.5 Mean :530.3 Mean : 786.8 Mean :100.8
## 3rd Qu.:156.0 3rd Qu.:586.0 3rd Qu.: 956.5 3rd Qu.:131.0
## Max. :264.0 Max. :878.0 Max. :1399.0 Max. :367.0
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## Min. :1137 Min. : 4.0 Min. : 325 Min. : 345.0
## 1st Qu.:1409 1st Qu.: 80.0 1st Qu.: 488 1st Qu.: 641.0
## Median :1493 Median :123.0 Median : 542 Median : 829.0
## Mean :1525 Mean :120.7 Mean : 554 Mean : 818.0
## 3rd Qu.:1602 3rd Qu.:160.0 3rd Qu.: 609 3rd Qu.: 972.5
## Max. :2394 Max. :343.0 Max. :1090 Max. :1781.0
## TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 65.0 Min. : 72.0
## 1st Qu.:122.0 1st Qu.:136.0
## Median :143.0 Median :151.0
## Mean :159.9 Mean :150.4
## 3rd Qu.:181.5 3rd Qu.:165.0
## Max. :515.0 Max. :228.0
Best subset regression searches through all possible regression models at every subset size and finds the best model of each size. This produces a sequence of best models, one for each size. The leaps library will help us do this.
library(leaps)
Call the regsubsets function from the leaps library to perform the best subset regression. By default the function only considers subsets of up to 8 variables. We have a total of 13 explanatory variables (ncol(team) returns 14 because it also counts the TARGET_WINS response), so specify nvmax = 13 to include all 13.
ncol(team)
## [1] 14
Below you see the best-fitting subset of each size.
regfit.full <- leaps::regsubsets(TARGET_WINS~., data=team, nvmax=13)
(reg.summary <- summary(regfit.full))
## Subset selection object
## Call: regsubsets.formula(TARGET_WINS ~ ., data = team, nvmax = 13)
## 13 Variables (and intercept)
## Forced in Forced out
## TEAM_BATTING_H FALSE FALSE
## TEAM_BATTING_2B FALSE FALSE
## TEAM_BATTING_3B FALSE FALSE
## TEAM_BATTING_HR FALSE FALSE
## TEAM_BATTING_BB FALSE FALSE
## TEAM_BATTING_SO FALSE FALSE
## TEAM_BASERUN_SB FALSE FALSE
## TEAM_PITCHING_H FALSE FALSE
## TEAM_PITCHING_HR FALSE FALSE
## TEAM_PITCHING_BB FALSE FALSE
## TEAM_PITCHING_SO FALSE FALSE
## TEAM_FIELDING_E FALSE FALSE
## TEAM_FIELDING_DP FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 1 ( 1 ) "*" " " " " " "
## 2 ( 1 ) "*" " " " " " "
## 3 ( 1 ) "*" " " " " " "
## 4 ( 1 ) " " " " " " "*"
## 5 ( 1 ) " " " " "*" "*"
## 6 ( 1 ) " " " " "*" "*"
## 7 ( 1 ) " " " " "*" "*"
## 8 ( 1 ) " " " " "*" "*"
## 9 ( 1 ) "*" "*" "*" "*"
## 10 ( 1 ) " " "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*"
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) "*" " " " " " "
## 3 ( 1 ) "*" " " "*" " "
## 4 ( 1 ) " " "*" "*" " "
## 5 ( 1 ) " " "*" "*" " "
## 6 ( 1 ) " " " " "*" " "
## 7 ( 1 ) " " " " "*" " "
## 8 ( 1 ) "*" " " "*" "*"
## 9 ( 1 ) " " " " "*" " "
## 10 ( 1 ) "*" "*" "*" "*"
## 11 ( 1 ) "*" " " "*" "*"
## 12 ( 1 ) "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*"
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " "*" "*"
## 7 ( 1 ) " " "*" "*"
## 8 ( 1 ) " " " " "*"
## 9 ( 1 ) " " "*" "*"
## 10 ( 1 ) " " " " "*"
## 11 ( 1 ) " " "*" "*"
## 12 ( 1 ) " " "*" "*"
## 13 ( 1 ) "*" "*" "*"
## TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## 11 ( 1 ) "*" "*"
## 12 ( 1 ) "*" "*"
## 13 ( 1 ) "*" "*"
These are the different statistics available for the best models of each size.
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
Plot Cp, an estimate of prediction error. The model with the lowest Cp should be selected.
plot(reg.summary$cp, xlab="Number of Variables", ylab="Cp")
The model with the lowest Cp is the one with 12 variables.
which.min(reg.summary$cp)
## [1] 12
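To make the minimum easier to spot, we can redraw the Cp plot and mark the best point (a small optional addition):
# Highlight the subset size that minimizes Cp.
best.cp <- which.min(reg.summary$cp)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp")
points(best.cp, reg.summary$cp[best.cp], col = "red", pch = 20, cex = 2)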
The plot below shows the Cp statistic on the y-axis. The lowest Cp value is 12 (the model with 12 variables selected) and the highest is 850. Black areas indicate a variable is in the model; white areas indicate it is out. You’ll notice that towards the top, the black band is reasonably stable.
plot(regfit.full, scale="Cp")
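The same plot can be drawn for other criteria; plot.regsubsets also accepts scale = "bic", "adjr2", and "r2". For example:
# Rank the models by BIC instead of Cp.
plot(regfit.full, scale = "bic")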
These are the coefficients of model 12, the best model selected by best subset regression (lowest Cp).
coef(regfit.full, 12)
## (Intercept) TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 59.05483239 -0.03384353 -0.04926792 0.18349652
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 0.10026292 0.11836348 0.03331614 0.06946472
## TEAM_PITCHING_H TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## 0.05981299 -0.08030854 -0.05352319 -0.11886411
## TEAM_FIELDING_DP
## -0.11231685
The plot below shows the adjusted R-squared value for each model.
plot(reg.summary$adjr2, xlab="Number of Variables", ylab="adjr2")
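We can identify the model that maximizes adjusted R-squared the same way we found the Cp minimum:
# Index of the subset size with the highest adjusted R-squared.
which.max(reg.summary$adjr2)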
Using the lm function, we can fit model 12 directly (all variables except TEAM_PITCHING_HR) and we get the same coefficients. All variables except TEAM_BATTING_SO (p = 0.058) are significant at the 0.05 level.
model12 <- lm(TARGET_WINS ~. - TEAM_PITCHING_HR, data=team)
summary(model12)
##
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_PITCHING_HR, data = team)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.168 -7.256 0.142 6.945 29.896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.054832 6.025125 9.801 < 2e-16 ***
## TEAM_BATTING_H -0.033844 0.015767 -2.146 0.03197 *
## TEAM_BATTING_2B -0.049268 0.008874 -5.552 3.24e-08 ***
## TEAM_BATTING_3B 0.183497 0.018984 9.666 < 2e-16 ***
## TEAM_BATTING_HR 0.100263 0.009155 10.952 < 2e-16 ***
## TEAM_BATTING_BB 0.118363 0.041385 2.860 0.00428 **
## TEAM_BATTING_SO 0.033316 0.017529 1.901 0.05751 .
## TEAM_BASERUN_SB 0.069465 0.005534 12.552 < 2e-16 ***
## TEAM_PITCHING_H 0.059813 0.014323 4.176 3.11e-05 ***
## TEAM_PITCHING_BB -0.080309 0.039308 -2.043 0.04119 *
## TEAM_PITCHING_SO -0.053523 0.016559 -3.232 0.00125 **
## TEAM_FIELDING_E -0.118864 0.007122 -16.690 < 2e-16 ***
## TEAM_FIELDING_DP -0.112317 0.012271 -9.153 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.18 on 1822 degrees of freedom
## Multiple R-squared: 0.4058, Adjusted R-squared: 0.4019
## F-statistic: 103.7 on 12 and 1822 DF, p-value: < 2.2e-16
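As a quick consistency check (not part of the tutorial), we can confirm that lm reproduces the coefficients reported by regsubsets for the 12-variable model:
# The two coefficient vectors should agree up to numerical precision.
all.equal(coef(regfit.full, 12), coef(model12))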
For a given predictor, multicollinearity can be assessed by computing the variance inflation factor (VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.
Looking at the VIF of each variable in model 12, several are above 5 (some far above 5), which indicates a problematic amount of collinearity in the model.
VIF(model12)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 51.432070 2.572898 3.008175 4.410352
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
## 221.742560 256.175602 1.509687 110.268728
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 260.043366 239.822447 2.999999 1.372712
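One way to follow up (a sketch; the choice of which variables to drop is purely illustrative) is to remove the pitching counterparts of the batting statistics, which carry the largest VIFs here, and recompute:
# Illustrative only: drop the pitching variables whose VIFs exceed 100
# and re-check whether the remaining VIFs come down.
model.reduced <- lm(TARGET_WINS ~ . - TEAM_PITCHING_HR - TEAM_PITCHING_H
                    - TEAM_PITCHING_BB - TEAM_PITCHING_SO, data = team)
VIF(model.reduced)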
For an acceptable model, the standardized residual plot should have these characteristics: residuals mostly between -3 and 3, points symmetrically distributed around the zero horizontal line, and no curved patterns.
The standardized residual plot of model 12 (below) seems to display these characteristics. Generally, the residuals are between -3 and 3 (not a big range), they are symmetrically distributed along the zero horizontal line (indicating low error between actual and predicted values), and I do not see any curved patterns.
model12.predict <- predict(model12)
model12.stdres <- rstandard(model12)
plot(model12.predict, model12.stdres, ylab="Standardized Residuals", xlab="Predicted Target Wins", main="Model 12")
abline(0, 0)
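Optionally, dashed reference lines at +/-3 make the stated range easier to judge on the plot:
# Dashed horizontal lines at +/- 3 standardized residuals.
abline(h = c(-3, 3), lty = 2, col = "gray40")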