WHIP (Walks plus Hits per Inning Pitched) is a sabermetric measure of how many runners a pitcher allows per inning pitched. Because the dataset does not include innings pitched, the statistic is modified here to “Walks plus Hits per Game Played” (WHGP). WHGP measures a team’s success in preventing batters from reaching base. From a defensive perspective, a lower WHGP score indicates better performance in preventing batters from reaching base, whereas, from an offensive perspective, a higher score indicates a propensity to get batters on base.
Model 3 will focus on WHGP as a predictor by examining the significance of this metric specifically in combination with other offensive and defensive metrics to determine the optimal regression model for predicting wins.
\[WHGP = (PITCHING\_H + PITCHING\_BB)/162\]
where:
* PITCHING_H is Team Pitching Hits Allowed
* PITCHING_BB is Team Pitching Walks Allowed
Games played, as a team statistic, is set to 162 for the calculation of WHGP.
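As a minimal sketch of the calculation above, the snippet below derives WHGP for a small made-up data frame. The data frame `toy` and its values are hypothetical and stand in for `mbstats`; only the column names and the constant 162 come from the definition.

```r
# Illustrative season totals for three hypothetical teams (not real data)
toy <- data.frame(
  PITCHING_H  = c(1400, 1500, 1650),  # hits allowed
  PITCHING_BB = c(500, 550, 620)      # walks allowed
)

# Games played is fixed at 162, as in the definition of WHGP
games_played <- 162

# WHGP = (hits allowed + walks allowed) / games played
toy$WHGP <- (toy$PITCHING_H + toy$PITCHING_BB) / games_played

print(round(toy$WHGP, 2))
```

On the real data the same line would be applied to `mbstats` before fitting, assuming the columns carry these names.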
Forward stepwise selection is applied using WHGP as the starting predictor variable. For this model, the base variables (those provided in the dataset) plus BATTING_TB and BATTING_1B are considered. Because of the collinearity among the BATTING-related predictor variables, BATTING_TB is used in place of BATTING_1B, BATTING_2B, BATTING_3B, and BATTING_HR.
Of note, leaving PITCHING_BB and PITCHING_H as candidates yields a better model based on adjusted R-squared and AIC values, despite their collinearity with WHGP, which is by definition a linear combination of these two variables.
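Because WHGP is computed directly from PITCHING_H and PITCHING_BB, the dependence among the three is exact rather than approximate. The sketch below uses simulated data (the data frame `toy`, its values, and the response formula are all made up for illustration) to show two consequences: once WHGP is in the model, adding PITCHING_H or PITCHING_BB gives the same fit, and adding both leaves one coefficient inestimable.

```r
set.seed(1)
n <- 50

# Simulated pitching totals (illustrative values only)
toy <- data.frame(
  PITCHING_H  = round(rnorm(n, mean = 1500, sd = 100)),
  PITCHING_BB = round(rnorm(n, mean = 550, sd = 60))
)
toy$WHGP <- (toy$PITCHING_H + toy$PITCHING_BB) / 162

# Made-up response, used only so that lm() has something to fit
toy$TARGET_WINS <- 80 + 0.01 * toy$PITCHING_H - 0.02 * toy$PITCHING_BB +
  rnorm(n, sd = 5)

# Adding either component to WHGP spans the same column space
fit_h  <- lm(TARGET_WINS ~ WHGP + PITCHING_H,  data = toy)
fit_bb <- lm(TARGET_WINS ~ WHGP + PITCHING_BB, data = toy)
print(all.equal(deviance(fit_h), deviance(fit_bb)))  # same residual sum of squares

# Adding both makes the design matrix rank deficient;
# lm() reports NA for the last aliased term
fit_all <- lm(TARGET_WINS ~ WHGP + PITCHING_H + PITCHING_BB, data = toy)
print(coef(fit_all))
```

This is consistent with PITCHING_H and PITCHING_BB showing identical Sum of Sq, RSS, and AIC values whenever both are candidates next to WHGP.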
# Forward stepwise selection by AIC, starting from TARGET_WINS ~ WHGP
model <- step(lm(TARGET_WINS ~ WHGP, data = mbstats), direction = "forward",
              scope = ~ BATTING_TB + BATTING_BB + BATTING_SO +
                BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_BB +
                PITCHING_SO + FIELDING_E + FIELDING_DP + WHGP)
## Start: AIC=10262.9
## TARGET_WINS ~ WHGP
##
## Df Sum of Sq RSS AIC
## + BATTING_TB 1 37403 315598 10043
## + FIELDING_E 1 26377 326624 10111
## + PITCHING_HR 1 19034 333968 10155
## + BATTING_BB 1 14957 338044 10179
## + PITCHING_H 1 3177 349824 10247
## + PITCHING_BB 1 3177 349824 10247
## + BATTING_SO 1 2254 350747 10252
## + FIELDING_DP 1 1012 351989 10259
## <none> 353001 10263
## + BASERUN_SB 1 41 352960 10265
## + PITCHING_SO 1 20 352981 10265
##
## Step: AIC=10043.25
## TARGET_WINS ~ WHGP + BATTING_TB
##
## Df Sum of Sq RSS AIC
## + BASERUN_SB 1 7495.8 308102 9997.7
## + BATTING_BB 1 5686.4 309912 10009.3
## + FIELDING_E 1 4838.5 310760 10014.7
## + PITCHING_H 1 3170.2 312428 10025.3
## + PITCHING_BB 1 3170.2 312428 10025.3
## + FIELDING_DP 1 2840.6 312757 10027.4
## + PITCHING_SO 1 2382.9 313215 10030.2
## + BATTING_SO 1 1426.5 314172 10036.3
## <none> 315598 10043.2
## + PITCHING_HR 1 279.8 315318 10043.5
##
## Step: AIC=9997.68
## TARGET_WINS ~ WHGP + BATTING_TB + BASERUN_SB
##
## Df Sum of Sq RSS AIC
## + FIELDING_E 1 26335.7 281767 9822.8
## + BATTING_BB 1 9489.8 298612 9937.8
## + PITCHING_H 1 6302.4 301800 9958.8
## + PITCHING_BB 1 6302.4 301800 9958.8
## + PITCHING_SO 1 3095.3 305007 9979.7
## + BATTING_SO 1 2118.5 305984 9986.0
## <none> 308102 9997.7
## + PITCHING_HR 1 2.7 308100 9999.7
## + FIELDING_DP 1 0.5 308102 9999.7
##
## Step: AIC=9822.85
## TARGET_WINS ~ WHGP + BATTING_TB + BASERUN_SB + FIELDING_E
##
## Df Sum of Sq RSS AIC
## + PITCHING_SO 1 13905.2 267861 9724.7
## + BATTING_SO 1 12878.1 268888 9732.3
## + FIELDING_DP 1 6847.5 274919 9776.2
## + PITCHING_HR 1 3150.1 278616 9802.6
## + BATTING_BB 1 2277.8 279489 9808.8
## + PITCHING_H 1 508.3 281258 9821.3
## + PITCHING_BB 1 508.3 281258 9821.3
## <none> 281767 9822.8
##
## Step: AIC=9724.69
## TARGET_WINS ~ WHGP + BATTING_TB + BASERUN_SB + FIELDING_E + PITCHING_SO
##
## Df Sum of Sq RSS AIC
## + FIELDING_DP 1 11651.9 256209 9638.7
## + BATTING_BB 1 3402.6 264459 9701.4
## + PITCHING_H 1 2912.0 264949 9705.1
## + PITCHING_BB 1 2912.0 264949 9705.1
## + PITCHING_HR 1 2320.8 265541 9709.5
## <none> 267861 9724.7
## + BATTING_SO 1 164.4 267697 9725.5
##
## Step: AIC=9638.68
## TARGET_WINS ~ WHGP + BATTING_TB + BASERUN_SB + FIELDING_E + PITCHING_SO +
## FIELDING_DP
##
## Df Sum of Sq RSS AIC
## + BATTING_BB 1 4484.9 251725 9605.7
## + PITCHING_H 1 3826.0 252383 9610.9
## + PITCHING_BB 1 3826.0 252383 9610.9
## + PITCHING_HR 1 2960.4 253249 9617.7
## + BATTING_SO 1 407.3 255802 9637.5
## <none> 256209 9638.7
##
## Step: AIC=9605.73
## TARGET_WINS ~ WHGP + BATTING_TB + BASERUN_SB + FIELDING_E + PITCHING_SO +
## FIELDING_DP + BATTING_BB
##
## Df Sum of Sq RSS AIC
## + BATTING_SO 1 1754.0 249971 9593.9
## + PITCHING_HR 1 1448.6 250276 9596.3
## + PITCHING_BB 1 436.0 251289 9604.3
## + PITCHING_H 1 436.0 251289 9604.3
## <none> 251725 9605.7
##
## Step: AIC=9593.89
## TARGET_WINS ~ WHGP + BATTING_TB + BASERUN_SB + FIELDING_E + PITCHING_SO +
## FIELDING_DP + BATTING_BB + BATTING_SO
##
## Df Sum of Sq RSS AIC
## + PITCHING_BB 1 2580.87 247390 9575.4
## + PITCHING_H 1 2580.87 247390 9575.4
## + PITCHING_HR 1 382.06 249588 9592.9
## <none> 249971 9593.9
##
## Step: AIC=9575.35
## TARGET_WINS ~ WHGP + BATTING_TB + BASERUN_SB + FIELDING_E + PITCHING_SO +
## FIELDING_DP + BATTING_BB + BATTING_SO + PITCHING_BB
##
## Df Sum of Sq RSS AIC
## + PITCHING_HR 1 2523.4 244866 9557.1
## <none> 247390 9575.4
##
## Step: AIC=9557.06
## TARGET_WINS ~ WHGP + BATTING_TB + BASERUN_SB + FIELDING_E + PITCHING_SO +
## FIELDING_DP + BATTING_BB + BATTING_SO + PITCHING_BB + PITCHING_HR
##
## Df Sum of Sq RSS AIC
## <none> 244866 9557.1
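The AIC values in the trace can be reproduced from the reported RSS. For a linear model, `step()` relies on `extractAIC()`, which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients including the intercept. The sketch below assumes n = 1979, a value inferred from the 1969 residual degrees of freedom and 10 coefficients reported in the summary of this model; the helper name `aic_from_rss` is hypothetical.

```r
# Assumed sample size: residual df (1969) + estimated coefficients (10)
n <- 1969 + 10

# AIC as used by extractAIC() for lm fits (up to an additive constant)
aic_from_rss <- function(rss, edf, n) {
  n * log(rss / n) + 2 * edf
}

# Starting model TARGET_WINS ~ WHGP: intercept + 1 slope, RSS = 353001
aic_start <- aic_from_rss(rss = 353001, edf = 2, n = n)

# Final stepwise model: intercept + 10 slopes, RSS = 244866
aic_final <- aic_from_rss(rss = 244866, edf = 11, n = n)

print(round(c(start = aic_start, final = aic_final), 1))
```

Since the RSS values in the trace are rounded, the results should agree with the printed AIC values only to about one decimal place.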
The resulting model includes the following ten predictor variables:
## TARGET_WINS ~ WHGP + BATTING_TB + BASERUN_SB + FIELDING_E + PITCHING_SO +
## FIELDING_DP + BATTING_BB + BATTING_SO + PITCHING_BB + PITCHING_HR
Reviewing the model summary, BATTING_TB does not prove to be a significant predictor variable. Consequently, it is dropped from the regression model. After updating and re-examining the model, the AIC value drops slightly (from 9557.06 to 9555.06) and adjusted R-squared increases slightly.
model <- update(model, . ~ . - BATTING_TB)
summary(model)
##
## Call:
## lm(formula = TARGET_WINS ~ WHGP + BASERUN_SB + FIELDING_E + PITCHING_SO +
## FIELDING_DP + BATTING_BB + BATTING_SO + PITCHING_BB + PITCHING_HR,
## data = mbstats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.068 -7.703 0.046 7.340 46.801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.509465 5.343901 10.013 < 2e-16 ***
## WHGP 4.812634 0.425928 11.299 < 2e-16 ***
## BASERUN_SB 0.073945 0.004943 14.961 < 2e-16 ***
## FIELDING_E -0.077672 0.004327 -17.950 < 2e-16 ***
## PITCHING_SO 0.035557 0.012467 2.852 0.00439 **
## FIELDING_DP -0.134489 0.012843 -10.472 < 2e-16 ***
## BATTING_BB 0.179262 0.021777 8.232 3.32e-16 ***
## BATTING_SO -0.060580 0.013318 -4.549 5.73e-06 ***
## PITCHING_BB -0.169435 0.020887 -8.112 8.66e-16 ***
## PITCHING_HR 0.070050 0.008447 8.293 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.15 on 1969 degrees of freedom
## Multiple R-squared: 0.3586, Adjusted R-squared: 0.3557
## F-statistic: 122.3 on 9 and 1969 DF, p-value: < 2.2e-16
extractAIC(model)
## [1] 10.000 9555.062
formula(model)
## TARGET_WINS ~ WHGP + BASERUN_SB + FIELDING_E + PITCHING_SO +
## FIELDING_DP + BATTING_BB + BATTING_SO + PITCHING_BB + PITCHING_HR
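As a quick consistency check on the summary, adjusted R-squared can be recomputed from the multiple R-squared, the number of observations, and the number of predictors. The sketch assumes n = 1979 (1969 residual degrees of freedom plus 10 coefficients); the variable names are illustrative.

```r
r_squared <- 0.3586   # Multiple R-squared from the summary
n_pred    <- 9        # predictors in the reduced model
n_obs     <- 1969 + 1 + n_pred  # residual df + intercept + predictors

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_pred - 1)

print(round(adj_r_squared, 4))
```

The penalty is small here because the number of observations is large relative to the number of predictors.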
The resulting regression equation to predict wins using WHGP plus other predictors is given below:
\[ \widehat{TARGET\_WINS} = 53.51 + 4.81WHGP + 0.074BASERUN\_SB - 0.078FIELDING\_E + 0.036PITCHING\_SO - 0.134FIELDING\_DP \\ + 0.18BATTING\_BB - 0.06BATTING\_SO - 0.17PITCHING\_BB + 0.07PITCHING\_HR \]
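To illustrate how the fitted model turns team statistics into a prediction, the sketch below applies the coefficients reported in the model summary to one hypothetical team. The team's statistics are invented for illustration; only the coefficient values come from the summary output.

```r
# Coefficients copied from the model summary
coefs <- c(
  "(Intercept)" = 53.509465,
  WHGP          = 4.812634,
  BASERUN_SB    = 0.073945,
  FIELDING_E    = -0.077672,
  PITCHING_SO   = 0.035557,
  FIELDING_DP   = -0.134489,
  BATTING_BB    = 0.179262,
  BATTING_SO    = -0.060580,
  PITCHING_BB   = -0.169435,
  PITCHING_HR   = 0.070050
)

# Hypothetical season totals for one team (illustrative values only)
team <- c(
  "(Intercept)" = 1,
  WHGP          = (1500 + 550) / 162,  # (hits allowed + walks allowed) / 162
  BASERUN_SB    = 100,
  FIELDING_E    = 150,
  PITCHING_SO   = 800,
  FIELDING_DP   = 150,
  BATTING_BB    = 550,
  BATTING_SO    = 800,
  PITCHING_BB   = 550,
  PITCHING_HR   = 100
)

# The prediction is the sum of coefficient * value, matched by name
predicted_wins <- sum(coefs * team[names(coefs)])
print(round(predicted_wins, 1))
```

With the fitted object available, `predict(model, newdata = ...)` would give the same kind of result without copying coefficients by hand.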
Examination of the residual plots shows some indication of constant variance; however, the Residuals vs. Fitted plot may show slight fanning from left to right.
The Q-Q plot of the standardized residuals looks very close to normal, with a few noted outliers in the tails: observations 1577, 346, and 1762. These observations may be contributing to the slight fanning in the Residuals vs. Fitted plot.
## 1762 346 1577
## 1 1978 1979
Several transformations, such as a power transformation and reciprocal values of predictors (FIELDING_E and FIELDING_DP), were applied. However, none yielded better models as determined by the resulting adjusted R-squared and F-statistics.
Looking for the overlap of outliers and high leverage points, we see that three observations in particular (observations 346, 1577, and 1762) may be impacting the model. These observations may be candidates for removal.
## rstudent unadjusted p-value Bonferonni p
## 1577 4.23866 2.3526e-05 0.046557
## StudRes Hat CookD
## 52 2.064950 0.085878762 0.039992802
## 243 2.082459 0.061690808 0.028463738
## 346 4.006469 0.013576257 0.021924654
## 634 1.512645 0.096132306 0.024319481
## 1577 4.238660 0.011247788 0.020263324
## 1762 -3.705244 0.005742056 0.007877772
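The quantities in the tables above (studentized residuals, hat values, Cook's distance, and a Bonferroni-adjusted p-value) can all be computed with base R. The output format resembles `car::outlierTest()` and `car::influencePlot()`, but the source does not show the calls, so that is an assumption. The sketch below uses simulated data with one planted outlier; the variables `x`, `y`, and `fit` are hypothetical and unrelated to `mbstats`.

```r
set.seed(42)
n <- 100

# Simulated simple regression with one planted outlier at observation 10
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[10] <- y[10] + 10

fit <- lm(y ~ x)

# Externally studentized residuals, leverage, and Cook's distance
diagnostics <- data.frame(
  StudRes = rstudent(fit),
  Hat     = hatvalues(fit),
  CookD   = cooks.distance(fit)
)

# Largest absolute studentized residual and its p-values
worst   <- which.max(abs(diagnostics$StudRes))
t_worst <- diagnostics$StudRes[worst]
p_unadj <- 2 * pt(-abs(t_worst), df = df.residual(fit) - 1)  # two-sided t test
p_bonf  <- min(1, p_unadj * n)                               # Bonferroni: multiply by n

print(diagnostics[worst, ])
print(c(unadjusted = p_unadj, bonferroni = p_bonf))
```

The same adjustment is visible in the reported output: the unadjusted p-value of 2.3526e-05 for observation 1577 multiplied by 1979 observations gives approximately 0.0466.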
Among the predictor variables included in Model 3, Walks plus Hits per Game Played has the largest coefficient, nearly 5. From an offensive perspective, a team with a higher WHGP metric will be more likely to win. This makes intuitive sense given that the more often batters reach base, the greater the chance of scoring. However, the significance of the variable should be considered relative to others such as stolen bases and walks: WHGP is measured per game while the other predictors are season totals, so the coefficients are not on a common scale, and BASERUN_SB and FIELDING_E have larger absolute t values than WHGP in the model summary.
VARIABLE | COEFFICIENT |
---|---|
Intercept | 53.50946 |
WHGP | 4.812634 |
BASERUN_SB | 0.073945 |
FIELDING_E | -0.077672 |
PITCHING_SO | 0.035557 |
FIELDING_DP | -0.134489 |
BATTING_BB | 0.179262 |
BATTING_SO | -0.060580 |
PITCHING_BB | -0.169435 |
PITCHING_HR | 0.070050 |
Residual Standard Error | Adjusted R-squared | F statistic | AIC | Predicted Accuracy (Train) |
---|---|---|---|---|
11.15 | 0.3557 | 122.3 | 9555 | 89.6% |