library(tidyverse)
library(Stat2Data)
#install.packages("leaps")
library(leaps)

Exercise 4.3

4.3 Major League Baseball winning percentage. Consider the MLB2007Standings file described in Exercise 3.12 on page 152. In this exercise, you will consider models with four variables to predict winning percentages (WinPct) using any of the predictors except Wins and Losses-those would make it too easy! Don’t worry if you are unfamiliar with baseball terminology and some of the acronyms for variable names have little meaning. Although knowledge of the context for data is often helpful for choosing good models, you should use software to build the models requested in this exercise rather than using specific baseball knowledge.

data(MLB2007Standings)
head(MLB2007Standings)
##                   Team League Wins Losses WinPct BattingAvg Runs Hits  HR
## 1 Arizona Diamondbacks     NL   90     72  0.556      0.250  712 1350 171
## 2       Atlanta Braves     NL   84     78  0.519      0.275  810 1562 176
## 3    Baltimore Orioles     AL   69     93  0.426      0.272  756 1529 142
## 4       Boston Red Sox     AL   96     66  0.593      0.279  867 1561 166
## 5         Chicago Cubs     NL   85     77  0.525      0.271  752 1530 151
## 6    Chicago White Sox     AL   72     90  0.444      0.246  693 1341 190
##   Doubles Triples RBI  SB   OBP   SLG  ERA HitsAllowed Walks StrikeOuts
## 1     286      40 687 109 0.321 0.413 4.13        1446   546       1088
## 2     328      27 781  64 0.339 0.435 4.11        1442   537       1106
## 3     306      30 718 144 0.333 0.412 5.17        1491   696       1087
## 4     352      35 829  96 0.362 0.444 3.87        1350   482       1149
## 5     340      28 711  86 0.333 0.422 4.04        1340   573       1211
## 6     249      20 667  78 0.318 0.404 4.77        1556   499       1015
##   Saves WHIP
## 1    51 1.38
## 2    36 1.36
## 3    30 1.52
## 4    45 1.27
## 5    39 1.32
## 6    42 1.43
MLB2007Standings <- MLB2007Standings %>%
  select(-Wins, -Losses, -Team)
head(MLB2007Standings)
##   League WinPct BattingAvg Runs Hits  HR Doubles Triples RBI  SB   OBP
## 1     NL  0.556      0.250  712 1350 171     286      40 687 109 0.321
## 2     NL  0.519      0.275  810 1562 176     328      27 781  64 0.339
## 3     AL  0.426      0.272  756 1529 142     306      30 718 144 0.333
## 4     AL  0.593      0.279  867 1561 166     352      35 829  96 0.362
## 5     NL  0.525      0.271  752 1530 151     340      28 711  86 0.333
## 6     AL  0.444      0.246  693 1341 190     249      20 667  78 0.318
##     SLG  ERA HitsAllowed Walks StrikeOuts Saves WHIP
## 1 0.413 4.13        1446   546       1088    51 1.38
## 2 0.435 4.11        1442   537       1106    36 1.36
## 3 0.412 5.17        1491   696       1087    30 1.52
## 4 0.444 3.87        1350   482       1149    45 1.27
## 5 0.422 4.04        1340   573       1211    39 1.32
## 6 0.404 4.77        1556   499       1015    42 1.43

For part a: The answer in the back of the book is wrong (they used forward selection). I would like you to use stepwise regression, as the question asks. For part d, please use AIC instead of Mallow’s Cp.

A. Use stepwise regression until you have a four-predictor model for WinPct. You may need to adjust the criteria if the procedure won’t give sufficient steps initially. Write down the predictors in this model and its R2 value.

ModStep<- regsubsets(WinPct ~.,
               data = MLB2007Standings,
               nbest = 1,       # 1 best model for each number of predictors
               nvmax = 4,    # 4 for a 4 predictor variable limit on number of variables
               force.in = NULL, force.out = NULL,
               method = "seqrep")

with(summary(ModStep), data.frame(cp, outmat))
##                   cp LeagueNL BattingAvg Runs Hits HR Doubles Triples RBI
## 1  ( 1 ) 104.8799157                                                     
## 2  ( 1 )  17.7229487                                                    *
## 3  ( 1 )   7.7227532                                                    *
## 4  ( 1 )   0.1863323                   *    *                            
##          SB OBP SLG ERA HitsAllowed Walks StrikeOuts Saves WHIP
## 1  ( 1 )              *                                        
## 2  ( 1 )              *                                        
## 3  ( 1 )                                                 *    *
## 4  ( 1 )                                                 *    *

As we can see from the summary the best predictor variables are BattingAvg, Runs, Saves, and WHIP (Number of walks and hits per inning pitched). Now we have to find the model and its R^2 value.

Mod<-lm(WinPct~BattingAvg+Runs+Saves+WHIP, data=MLB2007Standings)
summary(Mod)
## 
## Call:
## lm(formula = WinPct ~ BattingAvg + Runs + Saves + WHIP, data = MLB2007Standings)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.028470 -0.011783  0.000509  0.013115  0.047619 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.717e-01  1.310e-01   1.311 0.201920    
## BattingAvg   1.441e+00  4.784e-01   3.011 0.005875 ** 
## Runs         3.187e-04  7.887e-05   4.041 0.000446 ***
## Saves        4.173e-03  7.059e-04   5.911 3.61e-06 ***
## WHIP        -3.357e-01  4.852e-02  -6.918 2.98e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01833 on 25 degrees of freedom
## Multiple R-squared:  0.9115, Adjusted R-squared:  0.8973 
## F-statistic: 64.37 on 4 and 25 DF,  p-value: 8.509e-13

The adjusted R^2 of this model is .8973 which means that 89.73% of variation in winning percentage (WinPct) is explained by this model.

b. Use backward elimination until you have a four-predictor model for WinPct. You may need to adjust the criteria if the procedure stops too soon to continue eliminating predictors. Write down the predictors in this model and its R2 value.

Modback<- regsubsets(WinPct ~.,
               data = MLB2007Standings,
               nbest = 1,       # 1 best model for each number of predictors
               nvmax = 4,    # 4 for a 4 predictor variable limit on number of variables
               method = "backward")

summary(Modback)
## Subset selection object
## Call: regsubsets.formula(WinPct ~ ., data = MLB2007Standings, nbest = 1, 
##     nvmax = 4, method = "backward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## LeagueNL        FALSE      FALSE
## BattingAvg      FALSE      FALSE
## Runs            FALSE      FALSE
## Hits            FALSE      FALSE
## HR              FALSE      FALSE
## Doubles         FALSE      FALSE
## Triples         FALSE      FALSE
## RBI             FALSE      FALSE
## SB              FALSE      FALSE
## OBP             FALSE      FALSE
## SLG             FALSE      FALSE
## ERA             FALSE      FALSE
## HitsAllowed     FALSE      FALSE
## Walks           FALSE      FALSE
## StrikeOuts      FALSE      FALSE
## Saves           FALSE      FALSE
## WHIP            FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: backward
##          LeagueNL BattingAvg Runs Hits HR  Doubles Triples RBI SB  OBP SLG
## 1  ( 1 ) " "      " "        " "  " "  " " " "     " "     " " " " " " " "
## 2  ( 1 ) " "      " "        "*"  " "  " " " "     " "     " " " " " " " "
## 3  ( 1 ) " "      " "        "*"  " "  " " " "     " "     " " " " " " " "
## 4  ( 1 ) " "      " "        "*"  " "  "*" " "     " "     " " " " " " " "
##          ERA HitsAllowed Walks StrikeOuts Saves WHIP
## 1  ( 1 ) " " " "         " "   " "        " "   "*" 
## 2  ( 1 ) " " " "         " "   " "        " "   "*" 
## 3  ( 1 ) " " " "         " "   " "        "*"   "*" 
## 4  ( 1 ) " " " "         " "   " "        "*"   "*"
with(summary(Modback), data.frame(cp, outmat))
##                  cp LeagueNL BattingAvg Runs Hits HR Doubles Triples RBI
## 1  ( 1 ) 112.369101                                                     
## 2  ( 1 )  25.793061                        *                            
## 3  ( 1 )   5.508871                        *                            
## 4  ( 1 )   3.689755                        *       *                    
##          SB OBP SLG ERA HitsAllowed Walks StrikeOuts Saves WHIP
## 1  ( 1 )                                                      *
## 2  ( 1 )                                                      *
## 3  ( 1 )                                                 *    *
## 4  ( 1 )                                                 *    *

As we can see from the summary the 4 best predictors of winning percentage (WinPct) are Runs, HR (Number of home runs hit), Saves and WHIP (Number of walks and hits per inning pitched). Now we have to find the R^2 for this model.

Mod2<-lm(WinPct~Runs+HR+Saves+WHIP,data=MLB2007Standings)
summary(Mod2)
## 
## Call:
## lm(formula = WinPct ~ Runs + HR + Saves + WHIP, data = MLB2007Standings)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.026796 -0.012707 -0.000922  0.010269  0.043383 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.902e-01  1.035e-01   3.768 0.000896 ***
## Runs         5.743e-04  6.392e-05   8.985 2.66e-09 ***
## HR          -3.059e-04  1.524e-04  -2.008 0.055613 .  
## Saves        3.940e-03  7.567e-04   5.206 2.19e-05 ***
## WHIP        -3.153e-01  5.490e-02  -5.743 5.53e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01986 on 25 degrees of freedom
## Multiple R-squared:  0.8961, Adjusted R-squared:  0.8795 
## F-statistic: 53.93 on 4 and 25 DF,  p-value: 6.193e-12

The adjusted R^2 for this model is .8795 which means 87.95% of the variation of the Winning percentage is explained by this model.

c. Use a “best subsets” procedure to determine which four-predictors together would explain the most variability in WinPct. Write down the predictors in this model and its R2 value.

best <- regsubsets(WinPct ~ ., data=MLB2007Standings, nbest=1, nvmax = 4)
with(summary(best), data.frame(rsq, adjr2, cp, rss, outmat))
##                rsq     adjr2          cp         rss LeagueNL BattingAvg
## 1  ( 1 ) 0.4262137 0.4057213 104.8799157 0.054457981                    
## 2  ( 1 ) 0.8170838 0.8035345  17.7229487 0.017360552                    
## 3  ( 1 ) 0.8793993 0.8654838   5.5088713 0.011446199                    
## 4  ( 1 ) 0.9115018 0.8973421   0.1863323 0.008399355                   *
##          Runs Hits HR Doubles Triples RBI SB OBP SLG ERA HitsAllowed Walks
## 1  ( 1 )                                               *                  
## 2  ( 1 )                                *              *                  
## 3  ( 1 )    *                                                             
## 4  ( 1 )    *                                                             
##          StrikeOuts Saves WHIP
## 1  ( 1 )                      
## 2  ( 1 )                      
## 3  ( 1 )                *    *
## 4  ( 1 )                *    *

As we can see from the summary the best predictor variables are BattingAvg, Runs, Saves, and WHIP (Number of walks and hits per inning pitched). Now we have to find the model and its R^2 value. It is the best by rsq, adjr2, cp , rss.

Mod3<-lm(WinPct~BattingAvg+Runs+Saves+WHIP, data=MLB2007Standings)
summary(Mod3)
## 
## Call:
## lm(formula = WinPct ~ BattingAvg + Runs + Saves + WHIP, data = MLB2007Standings)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.028470 -0.011783  0.000509  0.013115  0.047619 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.717e-01  1.310e-01   1.311 0.201920    
## BattingAvg   1.441e+00  4.784e-01   3.011 0.005875 ** 
## Runs         3.187e-04  7.887e-05   4.041 0.000446 ***
## Saves        4.173e-03  7.059e-04   5.911 3.61e-06 ***
## WHIP        -3.357e-01  4.852e-02  -6.918 2.98e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01833 on 25 degrees of freedom
## Multiple R-squared:  0.9115, Adjusted R-squared:  0.8973 
## F-statistic: 64.37 on 4 and 25 DF,  p-value: 8.509e-13

The adjusted R^2 of this model is .8973 which means that 89.73% of variation in winning percentage (WinPct) is explained by this model.

d. Find the value of AIC for each of the models produced in (a-c).

AIC(Mod) #Stepwise Model
## [1] -148.2876
AIC(Mod2) #Backward Model
## [1] -143.4865
AIC(Mod3) #Best Model
## [1] -148.2876

The AIC for the Best Model, which was the same model as the Stepwise Model was lower than the Backward Model’s AIC as such the Backward Model is not as good a predictor as the other models.

  1. Assuming that these three procedures didn’t all give the same four-predictor model, which model would you prefer? Explain why.

I would prefer the Best Model. It trys all the subsets and yes it is a lot of subsets to go through but it will have the most actuate results.