library(tidyverse)
library(Stat2Data)
#install.packages("leaps")
library(leaps)
4.3 Major League Baseball winning percentage. Consider the MLB2007Standings file described in Exercise 3.12 on page 152. In this exercise, you will consider models with four variables to predict winning percentages (WinPct) using any of the predictors except Wins and Losses-those would make it too easy! Don’t worry if you are unfamiliar with baseball terminology and some of the acronyms for variable names have little meaning. Although knowledge of the context for data is often helpful for choosing good models, you should use software to build the models requested in this exercise rather than using specific baseball knowledge.
data(MLB2007Standings)
head(MLB2007Standings)
## Team League Wins Losses WinPct BattingAvg Runs Hits HR
## 1 Arizona Diamondbacks NL 90 72 0.556 0.250 712 1350 171
## 2 Atlanta Braves NL 84 78 0.519 0.275 810 1562 176
## 3 Baltimore Orioles AL 69 93 0.426 0.272 756 1529 142
## 4 Boston Red Sox AL 96 66 0.593 0.279 867 1561 166
## 5 Chicago Cubs NL 85 77 0.525 0.271 752 1530 151
## 6 Chicago White Sox AL 72 90 0.444 0.246 693 1341 190
## Doubles Triples RBI SB OBP SLG ERA HitsAllowed Walks StrikeOuts
## 1 286 40 687 109 0.321 0.413 4.13 1446 546 1088
## 2 328 27 781 64 0.339 0.435 4.11 1442 537 1106
## 3 306 30 718 144 0.333 0.412 5.17 1491 696 1087
## 4 352 35 829 96 0.362 0.444 3.87 1350 482 1149
## 5 340 28 711 86 0.333 0.422 4.04 1340 573 1211
## 6 249 20 667 78 0.318 0.404 4.77 1556 499 1015
## Saves WHIP
## 1 51 1.38
## 2 36 1.36
## 3 30 1.52
## 4 45 1.27
## 5 39 1.32
## 6 42 1.43
MLB2007Standings <- MLB2007Standings %>%
select(-Wins, -Losses, -Team)
head(MLB2007Standings)
## League WinPct BattingAvg Runs Hits HR Doubles Triples RBI SB OBP
## 1 NL 0.556 0.250 712 1350 171 286 40 687 109 0.321
## 2 NL 0.519 0.275 810 1562 176 328 27 781 64 0.339
## 3 AL 0.426 0.272 756 1529 142 306 30 718 144 0.333
## 4 AL 0.593 0.279 867 1561 166 352 35 829 96 0.362
## 5 NL 0.525 0.271 752 1530 151 340 28 711 86 0.333
## 6 AL 0.444 0.246 693 1341 190 249 20 667 78 0.318
## SLG ERA HitsAllowed Walks StrikeOuts Saves WHIP
## 1 0.413 4.13 1446 546 1088 51 1.38
## 2 0.435 4.11 1442 537 1106 36 1.36
## 3 0.412 5.17 1491 696 1087 30 1.52
## 4 0.444 3.87 1350 482 1149 45 1.27
## 5 0.422 4.04 1340 573 1211 39 1.32
## 6 0.404 4.77 1556 499 1015 42 1.43
For part a: The answer in the back of the book is wrong (they used forward selection). I would like you to use stepwise regression, as the question asks. For part d, please use AIC instead of Mallow’s Cp.
A. Use stepwise regression until you have a four-predictor model for WinPct. You may need to adjust the criteria if the procedure won’t give sufficient steps initially. Write down the predictors in this model and its R2 value.
ModStep<- regsubsets(WinPct ~.,
data = MLB2007Standings,
nbest = 1, # 1 best model for each number of predictors
nvmax = 4, # 4 for a 4 predictor variable limit on number of variables
force.in = NULL, force.out = NULL,
method = "seqrep")
with(summary(ModStep), data.frame(cp, outmat))
## cp LeagueNL BattingAvg Runs Hits HR Doubles Triples RBI
## 1 ( 1 ) 104.8799157
## 2 ( 1 ) 17.7229487 *
## 3 ( 1 ) 7.7227532 *
## 4 ( 1 ) 0.1863323 * *
## SB OBP SLG ERA HitsAllowed Walks StrikeOuts Saves WHIP
## 1 ( 1 ) *
## 2 ( 1 ) *
## 3 ( 1 ) * *
## 4 ( 1 ) * *
As we can see from the summary the best predictor variables are BattingAvg, Runs, Saves, and WHIP (Number of walks and hits per inning pitched). Now we have to find the model and its R^2 value.
Mod<-lm(WinPct~BattingAvg+Runs+Saves+WHIP, data=MLB2007Standings)
summary(Mod)
##
## Call:
## lm(formula = WinPct ~ BattingAvg + Runs + Saves + WHIP, data = MLB2007Standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.028470 -0.011783 0.000509 0.013115 0.047619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.717e-01 1.310e-01 1.311 0.201920
## BattingAvg 1.441e+00 4.784e-01 3.011 0.005875 **
## Runs 3.187e-04 7.887e-05 4.041 0.000446 ***
## Saves 4.173e-03 7.059e-04 5.911 3.61e-06 ***
## WHIP -3.357e-01 4.852e-02 -6.918 2.98e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01833 on 25 degrees of freedom
## Multiple R-squared: 0.9115, Adjusted R-squared: 0.8973
## F-statistic: 64.37 on 4 and 25 DF, p-value: 8.509e-13
The adjusted R^2 of this model is .8973 which means that 89.73% of variation in winning percentage (WinPct) is explained by this model.
b. Use backward elimination until you have a four-predictor model for WinPct. You may need to adjust the criteria if the procedure stops too soon to continue eliminating predictors. Write down the predictors in this model and its R2 value.
Modback<- regsubsets(WinPct ~.,
data = MLB2007Standings,
nbest = 1, # 1 best model for each number of predictors
nvmax = 4, # 4 for a 4 predictor variable limit on number of variables
method = "backward")
summary(Modback)
## Subset selection object
## Call: regsubsets.formula(WinPct ~ ., data = MLB2007Standings, nbest = 1,
## nvmax = 4, method = "backward")
## 17 Variables (and intercept)
## Forced in Forced out
## LeagueNL FALSE FALSE
## BattingAvg FALSE FALSE
## Runs FALSE FALSE
## Hits FALSE FALSE
## HR FALSE FALSE
## Doubles FALSE FALSE
## Triples FALSE FALSE
## RBI FALSE FALSE
## SB FALSE FALSE
## OBP FALSE FALSE
## SLG FALSE FALSE
## ERA FALSE FALSE
## HitsAllowed FALSE FALSE
## Walks FALSE FALSE
## StrikeOuts FALSE FALSE
## Saves FALSE FALSE
## WHIP FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: backward
## LeagueNL BattingAvg Runs Hits HR Doubles Triples RBI SB OBP SLG
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " "*" " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " "*" " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " "*" " " "*" " " " " " " " " " " " "
## ERA HitsAllowed Walks StrikeOuts Saves WHIP
## 1 ( 1 ) " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " "*" "*"
## 4 ( 1 ) " " " " " " " " "*" "*"
with(summary(Modback), data.frame(cp, outmat))
## cp LeagueNL BattingAvg Runs Hits HR Doubles Triples RBI
## 1 ( 1 ) 112.369101
## 2 ( 1 ) 25.793061 *
## 3 ( 1 ) 5.508871 *
## 4 ( 1 ) 3.689755 * *
## SB OBP SLG ERA HitsAllowed Walks StrikeOuts Saves WHIP
## 1 ( 1 ) *
## 2 ( 1 ) *
## 3 ( 1 ) * *
## 4 ( 1 ) * *
As we can see from the summary the 4 best predictors of winning percentage (WinPct) are Runs, HR (Number of home runs hit), Saves and WHIP (Number of walks and hits per inning pitched). Now we have to find the R^2 for this model.
Mod2<-lm(WinPct~Runs+HR+Saves+WHIP,data=MLB2007Standings)
summary(Mod2)
##
## Call:
## lm(formula = WinPct ~ Runs + HR + Saves + WHIP, data = MLB2007Standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.026796 -0.012707 -0.000922 0.010269 0.043383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.902e-01 1.035e-01 3.768 0.000896 ***
## Runs 5.743e-04 6.392e-05 8.985 2.66e-09 ***
## HR -3.059e-04 1.524e-04 -2.008 0.055613 .
## Saves 3.940e-03 7.567e-04 5.206 2.19e-05 ***
## WHIP -3.153e-01 5.490e-02 -5.743 5.53e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01986 on 25 degrees of freedom
## Multiple R-squared: 0.8961, Adjusted R-squared: 0.8795
## F-statistic: 53.93 on 4 and 25 DF, p-value: 6.193e-12
The adjusted R^2 for this model is .8795 which means 87.95% of the variation of the Winning percentage is explained by this model.
c. Use a “best subsets” procedure to determine which four-predictors together would explain the most variability in WinPct. Write down the predictors in this model and its R2 value.
best <- regsubsets(WinPct ~ ., data=MLB2007Standings, nbest=1, nvmax = 4)
with(summary(best), data.frame(rsq, adjr2, cp, rss, outmat))
## rsq adjr2 cp rss LeagueNL BattingAvg
## 1 ( 1 ) 0.4262137 0.4057213 104.8799157 0.054457981
## 2 ( 1 ) 0.8170838 0.8035345 17.7229487 0.017360552
## 3 ( 1 ) 0.8793993 0.8654838 5.5088713 0.011446199
## 4 ( 1 ) 0.9115018 0.8973421 0.1863323 0.008399355 *
## Runs Hits HR Doubles Triples RBI SB OBP SLG ERA HitsAllowed Walks
## 1 ( 1 ) *
## 2 ( 1 ) * *
## 3 ( 1 ) *
## 4 ( 1 ) *
## StrikeOuts Saves WHIP
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 ) * *
## 4 ( 1 ) * *
As we can see from the summary the best predictor variables are BattingAvg, Runs, Saves, and WHIP (Number of walks and hits per inning pitched). Now we have to find the model and its R^2 value. It is the best by rsq, adjr2, cp , rss.
Mod3<-lm(WinPct~BattingAvg+Runs+Saves+WHIP, data=MLB2007Standings)
summary(Mod3)
##
## Call:
## lm(formula = WinPct ~ BattingAvg + Runs + Saves + WHIP, data = MLB2007Standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.028470 -0.011783 0.000509 0.013115 0.047619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.717e-01 1.310e-01 1.311 0.201920
## BattingAvg 1.441e+00 4.784e-01 3.011 0.005875 **
## Runs 3.187e-04 7.887e-05 4.041 0.000446 ***
## Saves 4.173e-03 7.059e-04 5.911 3.61e-06 ***
## WHIP -3.357e-01 4.852e-02 -6.918 2.98e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01833 on 25 degrees of freedom
## Multiple R-squared: 0.9115, Adjusted R-squared: 0.8973
## F-statistic: 64.37 on 4 and 25 DF, p-value: 8.509e-13
The adjusted R^2 of this model is .8973 which means that 89.73% of variation in winning percentage (WinPct) is explained by this model.
d. Find the value of AIC for each of the models produced in (a-c).
AIC(Mod) #Stepwise Model
## [1] -148.2876
AIC(Mod2) #Backward Model
## [1] -143.4865
AIC(Mod3) #Best Model
## [1] -148.2876
The AIC for the Best Model, which was the same model as the Stepwise Model was lower than the Backward Model’s AIC as such the Backward Model is not as good a predictor as the other models.
I would prefer the Best Model. It trys all the subsets and yes it is a lot of subsets to go through but it will have the most actuate results.