Source files can be found at !github-link

variable | definition | effect | Mo | M1 | M2 |
---|---|---|---|---|---|
INDEX | Identification Variable | None | N | N | N |
TARGET_WINS | Number of wins | Positive | Y | Y | Y |
TEAM_BATTING_H | Base Hits by batters | Removed | N | N | N |
TEAM_BATTING_1B | Singles by batters (1B) | Positive | Y | Y | Y |
TEAM_BATTING_2B | Doubles by batters (2B) | Positive | Y | Y | Y |
TEAM_BATTING_3B | Triples by batters (3B) | Positive | Y | Y | Y |
TEAM_BATTING_HR | Homeruns by batters (4B) | Positive | Y | Y | Y |
TEAM_BATTING_BB | Walks by batters | Positive | Y | Y | Y |
TEAM_BATTING_SO | Strikeouts by batters | Negative | Y | Y | Y |
TEAM_BATTING_HBP | Batters hit by pitch | Removed | N | N | N |
TEAM_BASERUN_SB | Stolen bases | Removed | N | N | N |
TEAM_BASERUN_CS | Caught stealing | Removed | N | N | N |
TEAM_PITCHING_H | Hits allowed | Negative | Y | Y | Y |
TEAM_PITCHING_HR | Homeruns allowed | Removed | N | N | N |
TEAM_PITCHING_BB | Walks allowed | Negative | Y | Y | Y |
TEAM_PITCHING_SO | Strikeouts by pitchers | Positive | Y | Y | Y |
TEAM_FIELDING_E | Errors | Negative | Y | Y | Y |
TEAM_FIELDING_DP | Double Plays | Positive | Y | Y | Y |

Many of the statistics in the provided data sets have been extrapolated from base statistics of the deadball era (circa 1900-1920) and earlier. Outliers that need to be adjusted can be identified using the reference link !baseball-almanac. Note that during the deadball era a comparatively soft ball was used, which had a dramatic effect on power hitting and pitching statistics. Any adjustment that reconciles data from this period with observations that include post-WWII statistics should be bounded by the later era's limits so that the distributions aren't skewed.

variable | NA count | NA % | action |
---|---|---|---|
TEAM_BATTING_SO | 102 | 4.48 | impute w/ median |
TEAM_BASERUN_SB | 131 | 5.75 | impute w/ median |
TEAM_BASERUN_CS | 772 | 33.89 | removed variable |
TEAM_BATTING_HBP | 2085 | 91.53 | removed variable |
TEAM_PITCHING_SO | 102 | 4.48 | impute w/ median |
TEAM_FIELDING_DP | 286 | 12.55 | impute w/ median |

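Missing-value counts and percentages of this kind can be recomputed directly from a data frame. The following is a minimal sketch on a small made-up data frame; `toy` and its columns are hypothetical stand-ins, not the baseball data.

```r
# Hypothetical stand-in for the real data frame
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 3, 4),
  c = c(1, 2, 3, 4)
)

# Number and percentage of missing values per column
na.count <- colSums(is.na(toy))
na.pct   <- round(100 * na.count / nrow(toy), 2)

# Keep only the columns that actually have missing values
na.table <- data.frame(variable = names(na.count),
                       NA.count = na.count,
                       NA.pct   = na.pct,
                       row.names = NULL)
na.table <- na.table[na.table$NA.count > 0, ]
print(na.table)
```

Applied to the real data, the same two lines (`colSums(is.na(...))` and the division by `nrow(...)`) would be run on the full data frame before any columns are dropped.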
Deleting missing cases is the simplest strategy for dealing with missing data. It avoids the complexity and possible biases introduced by more sophisticated methods. The drawback is throwing away information that might allow more precise inference. If relatively few cases contain missing values, so that deletion still leaves a large dataset, or if the goal is to communicate a simple data analysis method, the deletion strategy is satisfactory.
Standard errors are larger after deleting cases because fewer records are available to fit the model. Larger standard errors result in less precise estimates. (Faraway, LMR 2015, p.200)
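Case deletion is also what `lm()` does by default, which is where the "observations deleted due to missingness" lines in the model summaries come from. A small sketch, using a made-up data frame (`toy` is hypothetical), shows how to count the rows a model will actually use:

```r
# Hypothetical toy data: 6 rows, 2 of them incomplete
toy <- data.frame(
  y  = c(3.1, 4.9, 7.2, 9.1, 10.8, 13.2),
  x1 = c(1, 2, 3, NA, 5, 6),
  x2 = c(2, 1, 4, 3, NA, 5)
)

# Rows with no missing values in any column
n.complete <- sum(complete.cases(toy))

# lm() drops incomplete rows under the default na.action (na.omit)
fit <- lm(y ~ x1 + x2, data = toy)
n.used <- nobs(fit)

cat("rows:", nrow(toy), " complete:", n.complete, " used by lm:", n.used, "\n")
```

Comparing `nrow()` with `nobs()` before interpreting a fit is a quick way to see how much data the deletion strategy has cost.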
Single imputation . . causes bias, while deletion causes a loss of information. Multiple imputation is a way to reduce the bias caused by single imputation. The problem with single imputation is the value tends to be less variable than the value we would have seen because it does not include the error variation normally seen in observed data. The idea behind multiple imputation is to reinclude that error variation. (Faraway, LMR 2015, p.202)
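The loss of variability under single imputation is easy to see on a tiny example. The sketch below uses a made-up vector (`x` is hypothetical) and fills every gap with the observed median, then compares the variance before and after:

```r
# Hypothetical toy vector with two missing values
x <- c(2, 4, NA, 6, 8, NA, 10)

# Single imputation: every NA gets the same value (the observed median)
x.imp <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

var.observed <- var(x, na.rm = TRUE)  # variance of the 5 observed values
var.imputed  <- var(x.imp)            # variance after filling in the median

cat("observed:", var.observed, " imputed:", round(var.imputed, 2), "\n")
```

The imputed values sit exactly at the centre of the distribution, so they add observations without adding spread. That is the error variation multiple imputation tries to put back.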
Multiple imputation can be done using the Amelia package. Per Faraway, the method assumes the data are multivariate normal, so heavily skewed variables should be log-transformed first.
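As a rough illustration of what such a call might look like, the sketch below builds a made-up data frame with one skewed column and passes it to `amelia()`. The data, column names, and the choice of `m = 5` are assumptions for illustration only; this is not the analysis run in this report, and the Amelia step is skipped if the package is not installed.

```r
# Hypothetical toy data; x3 is right-skewed and strictly positive
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n))
toy$x2 <- toy$x1 + rnorm(n)
toy$x3 <- exp(0.5 * toy$x1 + rnorm(n))
toy$x2[sample(n, 20)] <- NA
toy$x3[sample(n, 20)] <- NA

if (requireNamespace("Amelia", quietly = TRUE)) {
  # m completed data sets; `logs` asks Amelia to log-transform x3
  fit <- Amelia::amelia(toy, m = 5, logs = "x3", p2s = 0)
  # Each element of fit$imputations is a completed copy of the data
  print(sapply(fit$imputations, function(d) sum(is.na(d))))
} else {
  message("Amelia is not installed; skipping the imputation step")
}
```

The analysis model would then be fit once per completed data set and the estimates pooled, rather than fit to a single filled-in copy.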
BSO <- lm(data=BB.df, TEAM_BATTING_SO~.)
summary(BSO)
##
## Call:
## lm(formula = TEAM_BATTING_SO ~ ., data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -132.359 -4.984 -1.532 4.132 119.497
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.602514 8.211806 4.336 1.53e-05 ***
## TARGET_WINS 0.059394 0.031249 1.901 0.057505 .
## TEAM_BATTING_2B 0.284286 0.020965 13.560 < 2e-16 ***
## TEAM_BATTING_3B 0.197878 0.030376 6.514 9.42e-11 ***
## TEAM_BATTING_HR 0.287345 0.022301 12.885 < 2e-16 ***
## TEAM_BATTING_BB 0.808344 0.052042 15.533 < 2e-16 ***
## TEAM_BASERUN_SB 0.029634 0.007671 3.863 0.000116 ***
## TEAM_PITCHING_H -0.260618 0.018219 -14.304 < 2e-16 ***
## TEAM_PITCHING_BB -0.783418 0.049234 -15.912 < 2e-16 ***
## TEAM_PITCHING_SO 0.937860 0.002983 314.446 < 2e-16 ***
## TEAM_FIELDING_E 0.042656 0.010161 4.198 2.82e-05 ***
## TEAM_FIELDING_DP -0.031382 0.016741 -1.875 0.061003 .
## TEAM_BATTING_1B 0.252070 0.020234 12.457 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.59 on 1822 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.9961, Adjusted R-squared: 0.9961
## F-statistic: 3.882e+04 on 12 and 1822 DF, p-value: < 2.2e-16
#remove Double Plays + Pitching Strikeouts + Stolen Bases from linear model due to missing data
BSO.1 <- lm(data=BB.df, TEAM_BATTING_SO~. -TEAM_FIELDING_DP -TEAM_PITCHING_SO -TEAM_BASERUN_SB)
summary(BSO.1)
##
## Call:
## lm(formula = TEAM_BATTING_SO ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_SO -
## TEAM_BASERUN_SB, data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -370.83 -82.29 -1.58 77.00 403.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1989.8234 44.5651 44.650 < 2e-16 ***
## TARGET_WINS -0.6958 0.2404 -2.894 0.00384 **
## TEAM_BATTING_2B -0.2356 0.1752 -1.345 0.17887
## TEAM_BATTING_3B -2.1561 0.2483 -8.685 < 2e-16 ***
## TEAM_BATTING_HR 0.8995 0.1818 4.948 8.17e-07 ***
## TEAM_BATTING_BB 1.1335 0.4351 2.605 0.00927 **
## TEAM_PITCHING_H 0.4099 0.1517 2.701 0.00697 **
## TEAM_PITCHING_BB -1.2974 0.4114 -3.154 0.00164 **
## TEAM_FIELDING_E -0.3473 0.0782 -4.440 9.52e-06 ***
## TEAM_BATTING_1B -1.4755 0.1643 -8.981 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 113.8 on 1825 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.726, Adjusted R-squared: 0.7247
## F-statistic: 537.4 on 9 and 1825 DF, p-value: < 2.2e-16
#remove Doubles from linear model
BSO.2 <- lm(data=BB.df, TEAM_BATTING_SO~. -TEAM_FIELDING_DP -TEAM_PITCHING_SO -TEAM_BASERUN_SB - TEAM_BATTING_2B)
summary(BSO.2)
##
## Call:
## lm(formula = TEAM_BATTING_SO ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_SO -
## TEAM_BASERUN_SB - TEAM_BATTING_2B, data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -369.33 -81.63 -1.84 77.82 400.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1985.10053 44.43632 44.673 < 2e-16 ***
## TARGET_WINS -0.65687 0.23869 -2.752 0.005982 **
## TEAM_BATTING_3B -2.00367 0.22090 -9.070 < 2e-16 ***
## TEAM_BATTING_HR 1.07932 0.12322 8.759 < 2e-16 ***
## TEAM_BATTING_BB 0.61003 0.19455 3.136 0.001742 **
## TEAM_PITCHING_H 0.22486 0.06403 3.512 0.000455 ***
## TEAM_PITCHING_BB -0.80239 0.18378 -4.366 1.34e-05 ***
## TEAM_FIELDING_E -0.31244 0.07381 -4.233 2.42e-05 ***
## TEAM_BATTING_1B -1.28900 0.08813 -14.626 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 113.9 on 1826 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.7258, Adjusted R-squared: 0.7246
## F-statistic: 604.1 on 8 and 1826 DF, p-value: < 2.2e-16
##All p-values are low with a 604.1 F-statistic and adjusted R squared of 0.7246
#take a look
par(mfrow=c(2,2))
plot(BSO.2)
plot(BSO.2$residuals)
#predict strikeouts from the reduced model, rounded to whole counts
pred.BSO <- round(predict(BSO.2, BB.df))
#helper: keep observed values and fill NAs with the supplied replacements
impute <- function(a, a.impute) {
  ifelse(is.na(a), a.impute, a)
}
BSO.imp.1 <- impute(BB.df$TEAM_BATTING_SO, pred.BSO)
#place back in the data frame with imputed data for SO's
BB.df$TEAM_BATTING_SO <- BSO.imp.1
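The reason for dropping predictors that have their own missing data can be shown on a small example. In the sketch below (the data frame `toy` and its columns are hypothetical), a model that uses a predictor with a gap cannot produce a prediction for the row that needs imputing, while the reduced model can:

```r
# Hypothetical toy data: y has one gap, x2 has a gap in the same row
toy <- data.frame(
  y  = c(10, 12, NA, 16, 18, 20),
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(5, 3, NA, 8, 6, 10)
)

# Same helper as in the text: keep observed values, fill NAs
impute <- function(a, a.impute) ifelse(is.na(a), a.impute, a)

# Model that uses x2: no prediction is possible where x2 is missing
fit.full  <- lm(y ~ x1 + x2, data = toy)
pred.full <- predict(fit.full, newdata = toy)

# Model without x2: every row gets a prediction
fit.red  <- lm(y ~ x1, data = toy)
pred.red <- round(predict(fit.red, newdata = toy))

y.full <- impute(toy$y, pred.full)  # row 3 is still NA
y.red  <- impute(toy$y, pred.red)   # row 3 is filled in
print(cbind(y = toy$y, y.full, y.red))
```

Checking `anyNA()` on the imputed column afterwards is a cheap safeguard against a predictor with missing values slipping into the imputation model.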
PSO <- lm(data=BB.df, TEAM_PITCHING_SO~.)
summary(PSO)
##
## Call:
## lm(formula = TEAM_PITCHING_SO ~ ., data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -129.311 -3.946 1.108 4.509 146.072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.347331 8.720911 -0.154 0.87724
## TARGET_WINS -0.106518 0.032955 -3.232 0.00125 **
## TEAM_BATTING_2B -0.304767 0.022118 -13.779 < 2e-16 ***
## TEAM_BATTING_3B -0.243916 0.031959 -7.632 3.70e-14 ***
## TEAM_BATTING_HR -0.277931 0.023736 -11.709 < 2e-16 ***
## TEAM_BATTING_BB -0.850550 0.055016 -15.460 < 2e-16 ***
## TEAM_BATTING_SO 1.046965 0.003330 314.446 < 2e-16 ***
## TEAM_BASERUN_SB -0.011832 0.008133 -1.455 0.14590
## TEAM_PITCHING_H 0.281213 0.019203 14.644 < 2e-16 ***
## TEAM_PITCHING_BB 0.821919 0.052070 15.785 < 2e-16 ***
## TEAM_FIELDING_E -0.060329 0.010695 -5.641 1.96e-08 ***
## TEAM_FIELDING_DP 0.020598 0.017698 1.164 0.24463
## TEAM_BATTING_1B -0.288504 0.021221 -13.595 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.36 on 1822 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.9959, Adjusted R-squared: 0.9958
## F-statistic: 3.647e+04 on 12 and 1822 DF, p-value: < 2.2e-16
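#remove Double Plays + Stolen Bases from linear model due to missing data
#note: the response in this formula is TEAM_BATTING_SO, so TEAM_PITCHING_SO enters as a predictor;
#predict() therefore returns NA wherever TEAM_PITCHING_SO is missing and those rows are not imputed
PSO.1 <- lm(data=BB.df, TEAM_BATTING_SO~. -TEAM_FIELDING_DP -TEAM_BASERUN_SB)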
summary(PSO.1)
##
## Call:
## lm(formula = TEAM_BATTING_SO ~ . - TEAM_FIELDING_DP - TEAM_BASERUN_SB,
## data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.886 -5.019 -1.589 4.139 123.070
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.256643 7.714190 3.144 0.00169 **
## TARGET_WINS 0.115006 0.028933 3.975 7.31e-05 ***
## TEAM_BATTING_2B 0.282973 0.021075 13.427 < 2e-16 ***
## TEAM_BATTING_3B 0.198834 0.030522 6.514 9.41e-11 ***
## TEAM_BATTING_HR 0.264950 0.021885 12.107 < 2e-16 ***
## TEAM_BATTING_BB 0.823104 0.052216 15.764 < 2e-16 ***
## TEAM_PITCHING_H -0.257552 0.018304 -14.071 < 2e-16 ***
## TEAM_PITCHING_BB -0.799155 0.049377 -16.185 < 2e-16 ***
## TEAM_PITCHING_SO 0.944062 0.002671 353.488 < 2e-16 ***
## TEAM_FIELDING_E 0.059574 0.009453 6.302 3.68e-10 ***
## TEAM_BATTING_1B 0.248658 0.020305 12.246 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.66 on 1824 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.9961, Adjusted R-squared: 0.996
## F-statistic: 4.609e+04 on 10 and 1824 DF, p-value: < 2.2e-16
#all low P value and F statistic of 46090 with adj R squared of 0.996
#take a look
par(mfrow=c(2,2))
plot(PSO.1)
plot(PSO.1$residuals)
#place back in the data frame with imputed data for pitching SO's
pred.PSO <- round(predict(PSO.1, BB.df))
PSO.imp.1 <- impute(BB.df$TEAM_PITCHING_SO, pred.PSO)
BB.df$TEAM_PITCHING_SO <- PSO.imp.1
SB <- lm(data=BB.df, TEAM_BASERUN_SB~.)
summary(SB)
##
## Call:
## lm(formula = TEAM_BASERUN_SB ~ ., data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.937 -29.450 -3.022 25.189 185.149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -161.43340 24.81951 -6.504 1.01e-10 ***
## TARGET_WINS 1.14570 0.09128 12.552 < 2e-16 ***
## TEAM_BATTING_2B -0.10549 0.06686 -1.578 0.114819
## TEAM_BATTING_3B 0.03292 0.09346 0.352 0.724736
## TEAM_BATTING_HR -0.61146 0.06939 -8.812 < 2e-16 ***
## TEAM_BATTING_BB 0.16800 0.16840 0.998 0.318610
## TEAM_BATTING_SO 0.27416 0.07097 3.863 0.000116 ***
## TEAM_PITCHING_H 0.14060 0.05835 2.409 0.016073 *
## TEAM_PITCHING_BB -0.15324 0.15978 -0.959 0.337660
## TEAM_PITCHING_SO -0.09806 0.06740 -1.455 0.145899
## TEAM_FIELDING_E 0.29777 0.03026 9.840 < 2e-16 ***
## TEAM_FIELDING_DP -0.34705 0.05032 -6.898 7.27e-12 ***
## TEAM_BATTING_1B -0.08062 0.06409 -1.258 0.208540
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.33 on 1822 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.3903, Adjusted R-squared: 0.3863
## F-statistic: 97.21 on 12 and 1822 DF, p-value: < 2.2e-16
#remove Double Plays from linear model due to missing data
SB.1 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP)
summary(SB.1)
##
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP, data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.161 -29.240 -4.324 25.584 192.740
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -218.43393 23.70040 -9.216 < 2e-16 ***
## TARGET_WINS 1.31500 0.08903 14.770 < 2e-16 ***
## TEAM_BATTING_2B -0.11774 0.06769 -1.739 0.08213 .
## TEAM_BATTING_3B 0.01353 0.09460 0.143 0.88632
## TEAM_BATTING_HR -0.67268 0.06970 -9.652 < 2e-16 ***
## TEAM_BATTING_BB 0.16989 0.17054 0.996 0.31928
## TEAM_BATTING_SO 0.30317 0.07174 4.226 2.5e-05 ***
## TEAM_PITCHING_H 0.15633 0.05905 2.648 0.00818 **
## TEAM_PITCHING_BB -0.16923 0.16179 -1.046 0.29571
## TEAM_PITCHING_SO -0.11346 0.06822 -1.663 0.09646 .
## TEAM_FIELDING_E 0.35943 0.02928 12.276 < 2e-16 ***
## TEAM_BATTING_1B -0.11318 0.06472 -1.749 0.08051 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.85 on 1823 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.3744, Adjusted R-squared: 0.3706
## F-statistic: 99.19 on 11 and 1823 DF, p-value: < 2.2e-16
#remove Pitching Walks from linear model
SB.2 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB)
summary(SB.2)
##
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB,
## data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.609 -29.102 -4.136 25.813 192.235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -220.49651 23.61883 -9.336 < 2e-16 ***
## TARGET_WINS 1.32083 0.08886 14.864 < 2e-16 ***
## TEAM_BATTING_2B -0.06879 0.04891 -1.407 0.15974
## TEAM_BATTING_3B 0.05934 0.08386 0.708 0.47927
## TEAM_BATTING_HR -0.62211 0.05020 -12.394 < 2e-16 ***
## TEAM_BATTING_BB -0.00796 0.01310 -0.608 0.54355
## TEAM_BATTING_SO 0.32976 0.06709 4.915 9.66e-07 ***
## TEAM_PITCHING_H 0.10908 0.03803 2.868 0.00417 **
## TEAM_PITCHING_SO -0.13834 0.06394 -2.163 0.03064 *
## TEAM_FIELDING_E 0.36540 0.02872 12.723 < 2e-16 ***
## TEAM_BATTING_1B -0.06298 0.04343 -1.450 0.14715
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.85 on 1824 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.374, Adjusted R-squared: 0.3706
## F-statistic: 109 on 10 and 1824 DF, p-value: < 2.2e-16
#remove Triples from linear model
SB.3 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB -TEAM_BATTING_3B)
summary(SB.3)
##
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB -
## TEAM_BATTING_3B, data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.018 -29.148 -3.989 25.795 193.727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.203e+02 2.361e+01 -9.331 < 2e-16 ***
## TARGET_WINS 1.334e+00 8.682e-02 15.368 < 2e-16 ***
## TEAM_BATTING_2B -7.737e-02 4.737e-02 -1.633 0.102606
## TEAM_BATTING_HR -6.383e-01 4.468e-02 -14.284 < 2e-16 ***
## TEAM_BATTING_BB -8.298e-03 1.309e-02 -0.634 0.526267
## TEAM_BATTING_SO 3.478e-01 6.205e-02 5.605 2.4e-08 ***
## TEAM_PITCHING_H 1.203e-01 3.458e-02 3.478 0.000517 ***
## TEAM_PITCHING_SO -1.567e-01 5.844e-02 -2.681 0.007400 **
## TEAM_FIELDING_E 3.701e-01 2.795e-02 13.242 < 2e-16 ***
## TEAM_BATTING_1B -7.369e-02 4.070e-02 -1.810 0.070390 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.85 on 1825 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.3739, Adjusted R-squared: 0.3708
## F-statistic: 121.1 on 9 and 1825 DF, p-value: < 2.2e-16
#remove Walks from linear model
SB.4 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB -TEAM_BATTING_3B -TEAM_BATTING_BB)
summary(SB.4)
##
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB -
## TEAM_BATTING_3B - TEAM_BATTING_BB, data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.336 -29.443 -4.327 26.045 193.400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -225.15763 22.35695 -10.071 < 2e-16 ***
## TARGET_WINS 1.32179 0.08457 15.630 < 2e-16 ***
## TEAM_BATTING_2B -0.08121 0.04698 -1.729 0.084001 .
## TEAM_BATTING_HR -0.64425 0.04367 -14.752 < 2e-16 ***
## TEAM_BATTING_SO 0.35256 0.06158 5.725 1.21e-08 ***
## TEAM_PITCHING_H 0.12251 0.03440 3.562 0.000378 ***
## TEAM_PITCHING_SO -0.16050 0.05812 -2.762 0.005810 **
## TEAM_FIELDING_E 0.37090 0.02791 13.289 < 2e-16 ***
## TEAM_BATTING_1B -0.07475 0.04066 -1.838 0.066159 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.84 on 1826 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.3737, Adjusted R-squared: 0.371
## F-statistic: 136.2 on 8 and 1826 DF, p-value: < 2.2e-16
#remove Singles from linear model
SB.5 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB -TEAM_BATTING_3B -TEAM_BATTING_BB - TEAM_BATTING_1B)
summary(SB.5)
##
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB -
## TEAM_BATTING_3B - TEAM_BATTING_BB - TEAM_BATTING_1B, data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.841 -29.320 -3.894 25.575 194.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -242.88942 20.18259 -12.035 < 2e-16 ***
## TARGET_WINS 1.34304 0.08383 16.021 < 2e-16 ***
## TEAM_BATTING_2B -0.02368 0.03505 -0.675 0.4995
## TEAM_BATTING_HR -0.59121 0.03280 -18.023 < 2e-16 ***
## TEAM_BATTING_SO 0.25759 0.03355 7.679 2.60e-14 ***
## TEAM_PITCHING_H 0.06550 0.01489 4.398 1.16e-05 ***
## TEAM_PITCHING_SO -0.06699 0.02814 -2.381 0.0174 *
## TEAM_FIELDING_E 0.38719 0.02648 14.619 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.87 on 1827 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.3726, Adjusted R-squared: 0.3702
## F-statistic: 155 on 7 and 1827 DF, p-value: < 2.2e-16
#remove Doubles from linear model
SB.6 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB -TEAM_BATTING_3B -TEAM_BATTING_BB - TEAM_BATTING_1B - TEAM_BATTING_2B)
summary(SB.6)
##
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB -
## TEAM_BATTING_3B - TEAM_BATTING_BB - TEAM_BATTING_1B - TEAM_BATTING_2B,
## data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.516 -29.068 -4.233 25.520 193.624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -238.52378 19.11672 -12.477 < 2e-16 ***
## TARGET_WINS 1.35020 0.08314 16.240 < 2e-16 ***
## TEAM_BATTING_HR -0.59070 0.03279 -18.015 < 2e-16 ***
## TEAM_BATTING_SO 0.24459 0.02748 8.902 < 2e-16 ***
## TEAM_PITCHING_H 0.05854 0.01075 5.445 5.89e-08 ***
## TEAM_PITCHING_SO -0.05579 0.02273 -2.455 0.0142 *
## TEAM_FIELDING_E 0.39219 0.02542 15.426 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.86 on 1828 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.3724, Adjusted R-squared: 0.3703
## F-statistic: 180.8 on 6 and 1828 DF, p-value: < 2.2e-16
#remove Pitching Strikeouts
SB.7 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB -TEAM_BATTING_3B -TEAM_BATTING_BB - TEAM_BATTING_1B - TEAM_BATTING_2B - TEAM_PITCHING_SO)
summary(SB.7)
##
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB -
## TEAM_BATTING_3B - TEAM_BATTING_BB - TEAM_BATTING_1B - TEAM_BATTING_2B -
## TEAM_PITCHING_SO, data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.263 -29.291 -3.527 25.638 192.679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.088e+02 1.481e+01 -14.10 < 2e-16 ***
## TARGET_WINS 1.389e+00 8.175e-02 16.99 < 2e-16 ***
## TEAM_BATTING_HR -5.624e-01 3.073e-02 -18.30 < 2e-16 ***
## TEAM_BATTING_SO 1.797e-01 7.532e-03 23.86 < 2e-16 ***
## TEAM_PITCHING_H 3.769e-02 6.601e-03 5.71 1.32e-08 ***
## TEAM_FIELDING_E 3.986e-01 2.532e-02 15.74 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.92 on 1829 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.3703, Adjusted R-squared: 0.3686
## F-statistic: 215.1 on 5 and 1829 DF, p-value: < 2.2e-16
#Add Singles back into linear model
SB.8 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB -TEAM_BATTING_3B -TEAM_BATTING_BB - TEAM_BATTING_2B - TEAM_PITCHING_SO)
summary(SB.8)
##
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB -
## TEAM_BATTING_3B - TEAM_BATTING_BB - TEAM_BATTING_2B - TEAM_PITCHING_SO,
## data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.164 -29.210 -3.464 25.666 193.809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.302e+02 2.232e+01 -10.314 < 2e-16 ***
## TARGET_WINS 1.375e+00 8.248e-02 16.667 < 2e-16 ***
## TEAM_BATTING_HR -5.563e-01 3.110e-02 -17.889 < 2e-16 ***
## TEAM_BATTING_SO 1.843e-01 8.342e-03 22.094 < 2e-16 ***
## TEAM_PITCHING_H 3.255e-02 7.726e-03 4.213 2.64e-05 ***
## TEAM_FIELDING_E 3.974e-01 2.534e-02 15.687 < 2e-16 ***
## TEAM_BATTING_1B 2.509e-02 1.960e-02 1.280 0.201
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.91 on 1828 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.3709, Adjusted R-squared: 0.3688
## F-statistic: 179.6 on 6 and 1828 DF, p-value: < 2.2e-16
#remove Pitching Hits from linear model
SB.9 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB -TEAM_BATTING_3B -TEAM_BATTING_BB - TEAM_BATTING_2B - TEAM_PITCHING_SO - TEAM_PITCHING_H)
summary(SB.9)
##
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB -
## TEAM_BATTING_3B - TEAM_BATTING_BB - TEAM_BATTING_2B - TEAM_PITCHING_SO -
## TEAM_PITCHING_H, data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.27 -29.23 -3.23 26.13 193.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.318e+02 2.241e+01 -10.343 < 2e-16 ***
## TARGET_WINS 1.394e+00 8.273e-02 16.852 < 2e-16 ***
## TEAM_BATTING_HR -5.147e-01 2.962e-02 -17.375 < 2e-16 ***
## TEAM_BATTING_SO 1.816e-01 8.355e-03 21.735 < 2e-16 ***
## TEAM_FIELDING_E 4.111e-01 2.524e-02 16.284 < 2e-16 ***
## TEAM_BATTING_1B 6.801e-02 1.682e-02 4.043 5.49e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.1 on 1829 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.3648, Adjusted R-squared: 0.3631
## F-statistic: 210.1 on 5 and 1829 DF, p-value: < 2.2e-16
#all low P value and F statistic of 210.1 with adj R squared of 0.3631
#take a look
par(mfrow=c(2,2))
plot(SB.9)
plot(SB.9$residuals)
#place back in the data base with imputed data for SB's
pred.SB <- round(predict(SB.9, BB.df))
SB.imp.1 <- impute(BB.df$TEAM_BASERUN_SB, pred.SB)
BB.df$TEAM_BASERUN_SB <- SB.imp.1
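The drop-one-predictor-at-a-time pattern used above can be written more compactly with `update()`. The sketch below is only an illustration on the built-in `mtcars` data with an assumed 0.05 cut-off; it automates the mechanical part of the procedure and does not reproduce judgment calls such as adding Singles back in.

```r
# Built-in mtcars data used purely as a stand-in
fit <- lm(mpg ~ cyl + disp + hp + wt + qsec, data = mtcars)
alpha <- 0.05  # hypothetical cut-off for "low" p-values

repeat {
  coefs <- summary(fit)$coefficients
  p <- coefs[-1, 4]                 # p-values, intercept excluded
  names(p) <- rownames(coefs)[-1]
  if (length(p) == 0 || max(p) <= alpha) break
  worst <- names(which.max(p))      # least significant predictor
  # refit without that predictor instead of retyping the formula
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

print(formula(fit))
```

Using `update()` keeps each step to a single changed term, which makes it harder for the comment and the formula to drift apart.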
DP <- lm(data=BB.df, TEAM_FIELDING_DP~.)
summary(DP)
##
## Call:
## lm(formula = TEAM_FIELDING_DP ~ ., data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.332 -12.811 -0.552 12.205 69.169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.474e+02 1.094e+01 13.473 < 2e-16 ***
## TARGET_WINS -3.483e-01 4.127e-02 -8.439 < 2e-16 ***
## TEAM_BATTING_2B 8.637e-04 1.289e-02 0.067 0.94660
## TEAM_BATTING_3B -4.693e-02 3.174e-02 -1.479 0.13938
## TEAM_BATTING_HR 9.574e-02 1.675e-02 5.716 1.27e-08 ***
## TEAM_BATTING_BB 1.752e-01 2.754e-02 6.363 2.49e-10 ***
## TEAM_BATTING_SO -9.557e-02 1.540e-02 -6.205 6.72e-10 ***
## TEAM_BASERUN_SB -7.459e-02 1.078e-02 -6.922 6.10e-12 ***
## TEAM_PITCHING_H 1.087e-02 4.166e-03 2.608 0.00918 **
## TEAM_PITCHING_BB -1.296e-01 2.516e-02 -5.150 2.87e-07 ***
## TEAM_PITCHING_SO 6.869e-02 1.396e-02 4.921 9.36e-07 ***
## TEAM_FIELDING_E -8.336e-02 1.027e-02 -8.115 8.67e-16 ***
## TEAM_BATTING_1B 2.330e-02 9.891e-03 2.355 0.01860 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.31 on 1875 degrees of freedom
## (388 observations deleted due to missingness)
## Multiple R-squared: 0.3646, Adjusted R-squared: 0.3605
## F-statistic: 89.65 on 12 and 1875 DF, p-value: < 2.2e-16
#remove batting 2B's
DP.1 <- lm(data=BB.df, TEAM_FIELDING_DP~.-TEAM_BATTING_2B)
summary(DP.1)
##
## Call:
## lm(formula = TEAM_FIELDING_DP ~ . - TEAM_BATTING_2B, data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.312 -12.803 -0.549 12.210 69.140
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 147.499122 10.893086 13.541 < 2e-16 ***
## TARGET_WINS -0.348500 0.041091 -8.481 < 2e-16 ***
## TEAM_BATTING_3B -0.046453 0.030914 -1.503 0.13311
## TEAM_BATTING_HR 0.095923 0.016511 5.810 7.34e-09 ***
## TEAM_BATTING_BB 0.175497 0.027232 6.444 1.47e-10 ***
## TEAM_BATTING_SO -0.095552 0.015396 -6.206 6.65e-10 ***
## TEAM_BASERUN_SB -0.074570 0.010768 -6.925 5.97e-12 ***
## TEAM_PITCHING_H 0.010962 0.003907 2.806 0.00507 **
## TEAM_PITCHING_BB -0.129803 0.024943 -5.204 2.16e-07 ***
## TEAM_PITCHING_SO 0.068675 0.013952 4.922 9.31e-07 ***
## TEAM_FIELDING_E -0.083572 0.009765 -8.559 < 2e-16 ***
## TEAM_BATTING_1B 0.023282 0.009886 2.355 0.01862 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.3 on 1876 degrees of freedom
## (388 observations deleted due to missingness)
## Multiple R-squared: 0.3646, Adjusted R-squared: 0.3609
## F-statistic: 97.85 on 11 and 1876 DF, p-value: < 2.2e-16
#remove batting 3B's
DP.2 <- lm(data=BB.df, TEAM_FIELDING_DP~.-TEAM_BATTING_2B -TEAM_BATTING_3B)
summary(DP.2)
##
## Call:
## lm(formula = TEAM_FIELDING_DP ~ . - TEAM_BATTING_2B - TEAM_BATTING_3B,
## data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.434 -12.959 -0.585 12.448 68.003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 146.353372 10.870006 13.464 < 2e-16 ***
## TARGET_WINS -0.361663 0.040160 -9.006 < 2e-16 ***
## TEAM_BATTING_HR 0.100120 0.016278 6.150 9.42e-10 ***
## TEAM_BATTING_BB 0.178287 0.027178 6.560 6.94e-11 ***
## TEAM_BATTING_SO -0.094626 0.015388 -6.149 9.49e-10 ***
## TEAM_BASERUN_SB -0.075508 0.010754 -7.021 3.06e-12 ***
## TEAM_PITCHING_H 0.011727 0.003875 3.026 0.00251 **
## TEAM_PITCHING_BB -0.132288 0.024897 -5.314 1.20e-07 ***
## TEAM_PITCHING_SO 0.069184 0.013953 4.958 7.75e-07 ***
## TEAM_FIELDING_E -0.085506 0.009683 -8.831 < 2e-16 ***
## TEAM_BATTING_1B 0.020890 0.009760 2.140 0.03245 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.31 on 1877 degrees of freedom
## (388 observations deleted due to missingness)
## Multiple R-squared: 0.3638, Adjusted R-squared: 0.3604
## F-statistic: 107.3 on 10 and 1877 DF, p-value: < 2.2e-16
#remove batting 1B's
DP.3 <- lm(data=BB.df, TEAM_FIELDING_DP~.-TEAM_BATTING_2B -TEAM_BATTING_3B - TEAM_BATTING_1B)
summary(DP.3)
##
## Call:
## lm(formula = TEAM_FIELDING_DP ~ . - TEAM_BATTING_2B - TEAM_BATTING_3B -
## TEAM_BATTING_1B, data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.318 -12.867 -0.594 12.183 70.274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 165.329937 6.295067 26.263 < 2e-16 ***
## TARGET_WINS -0.359644 0.040187 -8.949 < 2e-16 ***
## TEAM_BATTING_HR 0.089824 0.015566 5.770 9.23e-09 ***
## TEAM_BATTING_BB 0.186157 0.026954 6.907 6.77e-12 ***
## TEAM_BATTING_SO -0.093203 0.015389 -6.057 1.68e-09 ***
## TEAM_BASERUN_SB -0.074197 0.010747 -6.904 6.88e-12 ***
## TEAM_PITCHING_H 0.017287 0.002878 6.007 2.27e-09 ***
## TEAM_PITCHING_BB -0.139872 0.024667 -5.670 1.65e-08 ***
## TEAM_PITCHING_SO 0.063535 0.013714 4.633 3.86e-06 ***
## TEAM_FIELDING_E -0.093118 0.009015 -10.330 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.33 on 1878 degrees of freedom
## (388 observations deleted due to missingness)
## Multiple R-squared: 0.3623, Adjusted R-squared: 0.3592
## F-statistic: 118.5 on 9 and 1878 DF, p-value: < 2.2e-16
#remove all remaining batting
DP.4 <- lm(data=BB.df, TEAM_FIELDING_DP~.-TEAM_BATTING_2B -TEAM_BATTING_3B - TEAM_BATTING_1B - TEAM_BATTING_HR - TEAM_BATTING_SO - TEAM_BATTING_BB)
summary(DP.4)
##
## Call:
## lm(formula = TEAM_FIELDING_DP ~ . - TEAM_BATTING_2B - TEAM_BATTING_3B -
## TEAM_BATTING_1B - TEAM_BATTING_HR - TEAM_BATTING_SO - TEAM_BATTING_BB,
## data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.111 -13.561 -0.185 12.995 68.351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 161.304819 4.246078 37.989 < 2e-16 ***
## TARGET_WINS -0.237801 0.039123 -6.078 1.47e-09 ***
## TEAM_BASERUN_SB -0.107620 0.010144 -10.610 < 2e-16 ***
## TEAM_PITCHING_H 0.015502 0.002162 7.171 1.06e-12 ***
## TEAM_PITCHING_BB 0.028665 0.005106 5.614 2.28e-08 ***
## TEAM_PITCHING_SO -0.007577 0.002308 -3.283 0.00104 **
## TEAM_FIELDING_E -0.087006 0.008629 -10.083 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.87 on 1881 degrees of freedom
## (388 observations deleted due to missingness)
## Multiple R-squared: 0.3248, Adjusted R-squared: 0.3226
## F-statistic: 150.8 on 6 and 1881 DF, p-value: < 2.2e-16
#remove pitching strikeouts
DP.5 <- lm(data=BB.df, TEAM_FIELDING_DP~.-TEAM_BATTING_2B -TEAM_BATTING_3B - TEAM_BATTING_1B - TEAM_BATTING_HR - TEAM_BATTING_SO - TEAM_BATTING_BB - TEAM_PITCHING_SO)
summary(DP.5)
##
## Call:
## lm(formula = TEAM_FIELDING_DP ~ . - TEAM_BATTING_2B - TEAM_BATTING_3B -
## TEAM_BATTING_1B - TEAM_BATTING_HR - TEAM_BATTING_SO - TEAM_BATTING_BB -
## TEAM_PITCHING_SO, data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.517 -13.493 -0.397 13.052 67.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 154.939459 3.787463 40.909 < 2e-16 ***
## TARGET_WINS -0.201265 0.037604 -5.352 9.75e-08 ***
## TEAM_BASERUN_SB -0.118868 0.009573 -12.418 < 2e-16 ***
## TEAM_PITCHING_H 0.013728 0.002099 6.542 7.81e-11 ***
## TEAM_PITCHING_BB 0.027092 0.005097 5.315 1.19e-07 ***
## TEAM_FIELDING_E -0.075351 0.007885 -9.556 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.92 on 1882 degrees of freedom
## (388 observations deleted due to missingness)
## Multiple R-squared: 0.3209, Adjusted R-squared: 0.3191
## F-statistic: 177.8 on 5 and 1882 DF, p-value: < 2.2e-16
#all low P value and F statistic of 177.8 with adj R squared of 0.3191
#take a look
par(mfrow=c(2,2))
plot(DP.5)
plot(DP.5$residuals)
#place back in the data frame with imputed data for DP's
pred.DP <- round(predict(DP.5, BB.df))
DP.imp.1 <- impute(BB.df$TEAM_FIELDING_DP, pred.DP)
BB.df$TEAM_FIELDING_DP <- DP.imp.1
#test new data set
summary(BB.df)
## TARGET_WINS TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## Min. : 0.00 Min. : 69.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:208.0 1st Qu.: 34.00 1st Qu.: 42.00
## Median : 82.00 Median :238.0 Median : 47.00 Median :102.00
## Mean : 80.79 Mean :241.2 Mean : 55.25 Mean : 99.61
## 3rd Qu.: 92.00 3rd Qu.:273.0 3rd Qu.: 72.00 3rd Qu.:147.00
## Max. :146.00 Max. :458.0 Max. :223.00 Max. :264.00
##
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
## Min. : 0.0 Min. : 0 Min. : 0.0 Min. : 1137
## 1st Qu.:451.0 1st Qu.: 542 1st Qu.: 67.0 1st Qu.: 1419
## Median :512.0 Median : 730 Median :106.0 Median : 1518
## Mean :501.6 Mean : 727 Mean :139.6 Mean : 1779
## 3rd Qu.:580.0 3rd Qu.: 925 3rd Qu.:172.0 3rd Qu.: 1682
## Max. :878.0 Max. :1399 Max. :697.0 Max. :30132
##
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 32.0
## 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:123.8
## Median : 536.5 Median : 813.5 Median : 159.0 Median :145.0
## Mean : 553.0 Mean : 817.7 Mean : 246.5 Mean :141.8
## 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:162.0
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :466.0
## NA's :102
## TEAM_BATTING_1B
## Min. : 709.0
## 1st Qu.: 990.8
## Median :1050.0
## Mean :1073.2
## 3rd Qu.:1129.0
## Max. :2112.0
##
#most ever errors by a team is 639 by 1883 Philadelphia. Prorating to 162 games gives a value of 1046.
BB.df$TEAM_FIELDING_E[which(BB.df$TEAM_FIELDING_E > 1046)] <- 159
#most hits ever by a team in a season is 1783, so to be conservative replace all pitching hits > 3000 with the median (1518)
BB.df$TEAM_PITCHING_H[which(BB.df$TEAM_PITCHING_H >3000)] <- 1518
The most strikeouts thrown in a single season (unadjusted) that I could find was fewer than 1400.
#REVISIT THIS WITH A LEVERAGE POINT TEST FOR BAD OUTLIERS!
#The TEAM_PITCHING_SO variable has 25 outliers that are far beyond the most team pitched strikeouts
#that have ever occurred. Here's the line of R code to find the count:
nrow(data.frame(which(BB.df$TEAM_PITCHING_SO > 1450)))
## [1] 25
#Since those outliers skew the distribution severely, I would suggest that rather than transform the
#variable via a power transform we simply set those outliers to a more reasonable value, such as the
#3rd quartile (968) or the median (813.5). The following line of R code uses the median, rounded down to 813:
#replace pitching SO outliers with the median
BB.df$TEAM_PITCHING_SO[which(BB.df$TEAM_PITCHING_SO >1450)] <- 813
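The leverage check flagged in the note above could look roughly like the sketch below. It uses toy data rather than BB.df, and the names `toy`, `so`, `wins`, and the 2(p+1)/n cutoff are assumptions chosen for illustration (the cutoff is a common rule of thumb, not something the original analysis specifies).

```r
# Minimal sketch: flag high-leverage observations in a simple regression.
# All data here are simulated; one value mimics the 19278 strikeout outlier.
set.seed(1)
toy <- data.frame(so = c(rnorm(49, mean = 800, sd = 100), 19278))
toy$wins <- 80 + 0.01 * (pmin(toy$so, 1400) - 800) + rnorm(50, mean = 0, sd = 5)

fit <- lm(wins ~ so, data = toy)

# Leverage (hat values) and a rule-of-thumb cutoff of 2 * (p + 1) / n,
# where length(coef(fit)) counts the intercept plus the predictors.
h <- hatvalues(fit)
cutoff <- 2 * length(coef(fit)) / nrow(toy)
flagged <- which(h > cutoff)

# Standardized residuals help separate "bad" leverage points (large residual)
# from "good" ones (extreme x, but consistent with the fit).
std_res <- rstandard(fit)
print(data.frame(row = flagged, leverage = h[flagged], std_res = std_res[flagged]))
```

Rows that are flagged and also have a large standardized residual are the candidates for replacement or removal; rows with high leverage but small residuals may be legitimate extreme seasons.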
#First argument must be strictly positive.
#summary(powerTransform(cbind(TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_PITCHING_SO, TEAM_FIELDING_DP) ~ 1, BB.df))
#This function call yields the "best" Box-Cox power transform exponent for each variable relative to the
#full data set rather than limiting the results to a single response/predictor pair. The estimates are
#found in the column labeled "Est. Power". We need to round up or down to the nearest common transform
#as described here (reposted from Blackboard discussion):
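Rounding an estimated power to the nearest common transform can be sketched in base R as follows. The grid matches the `lambda` vector used in this analysis; the helper names `nearest_common_lambda` and `apply_power` and the estimates in `est` are hypothetical, made up for illustration rather than taken from `powerTransform` output.

```r
# Grid of common transform exponents (0 stands for the log transform).
common_lambdas <- c(-1, -0.5, -0.33, -0.25, 0, 0.25, 0.33, 0.5, 1)

# Hypothetical helper: snap each estimated power to the closest grid value.
nearest_common_lambda <- function(est, grid = common_lambdas) {
  vapply(est, function(e) grid[which.min(abs(grid - e))], numeric(1))
}

# Hypothetical helper: apply a power transform, using log() when lambda is 0.
# x must be strictly positive, as with the Box-Cox family.
apply_power <- function(x, lambda) {
  if (lambda == 0) log(x) else x^lambda
}

# Made-up estimates standing in for the "Est. Power" column.
est <- c(TEAM_BATTING_SO = 0.92, TEAM_BASERUN_SB = 0.18, TEAM_FIELDING_DP = 0.04)
rounded <- nearest_common_lambda(est)
print(rounded)

# Example: transform a made-up stolen base vector with its rounded exponent.
sb <- c(67, 106, 172, 697)
print(apply_power(sb, rounded[["TEAM_BASERUN_SB"]]))
```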
BB.df$TARGET_WINS[which(BB.df$TARGET_WINS <= 0)] <- 1
m1 <- lm(BB.df$TARGET_WINS~log(BB.df$TEAM_BATTING_1B))
#http://stats.stackexchange.com/questions/137059/find-distribution-and-transform-to-normal-distribution
lambda <- c(-1,-0.5, -0.33, -0.25, 0, 0.25, 0.33, 0.5,1)
#invResPlot(m1,lambda)
#inverseResponsePlot(m1,key=TRUE)
#lambda <- c(-1,-0.1811955,0,1)
#RSS <- c(6847.993,6761.037,6764.793,6901.701)
#plot(lambda,RSS,type="l",ylab=expression(RSS(lambda)),xlab=expression(lambda))
#-1/3
#ty <- y^(-1/3)
#plot(density(ty,kern="gaussian"),type="l",main="Gaussian kernel density estimate",xlab=expression(Y^(-1/3)))
#rug(ty)
summary(BB.df)
Generalized Equation for Multiple Regression \[ \begin{aligned} \widehat{wins} &= \hat{\beta}_0 + \hat{\beta}_1 \times singles + \hat{\beta}_2 \times doubles + \hat{\beta}_3 \times triples + \hat{\beta}_4 \times homeruns + \hat{\beta}_5 \times walks + \hat{\beta}_6 \times strikeouts + \dots \end{aligned} \]
#wins x full panel
pairs(TARGET_WINS~.,
data=BB.df,pch=".",gap=.5,upper.panel=panel.smooth)
m1 <- lm(TARGET_WINS~., data = BB.df)
StanRes1 <- rstandard(m1)
summary(m1)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = BB.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.377 -8.058 0.343 8.595 72.394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.379e+01 5.533e+00 2.492 0.01278 *
## TEAM_BATTING_2B 3.971e-02 7.687e-03 5.166 2.61e-07 ***
## TEAM_BATTING_3B 9.976e-02 1.623e-02 6.147 9.39e-10 ***
## TEAM_BATTING_HR 1.132e-01 8.898e-03 12.720 < 2e-16 ***
## TEAM_BATTING_BB 3.157e-02 3.677e-03 8.586 < 2e-16 ***
## TEAM_BATTING_SO -5.136e-03 3.879e-03 -1.324 0.18567
## TEAM_BASERUN_SB 2.326e-02 4.711e-03 4.938 8.49e-07 ***
## TEAM_PITCHING_H -2.147e-03 1.576e-03 -1.363 0.17311
## TEAM_PITCHING_BB -2.888e-05 2.383e-03 -0.012 0.99033
## TEAM_PITCHING_SO 1.365e-03 3.126e-03 0.437 0.66243
## TEAM_FIELDING_E -9.015e-03 2.893e-03 -3.116 0.00186 **
## TEAM_FIELDING_DP -8.592e-02 1.351e-02 -6.359 2.47e-10 ***
## TEAM_BATTING_1B 3.868e-02 3.663e-03 10.561 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.11 on 2161 degrees of freedom
## (102 observations deleted due to missingness)
## Multiple R-squared: 0.2953, Adjusted R-squared: 0.2914
## F-statistic: 75.45 on 12 and 2161 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#SINGLES: transform skew before adding to model - high p-value
plot(density(BB.df$TEAM_BATTING_1B),main="Singles");rug(BB.df$TEAM_BATTING_1B)
plot(m1$model$TEAM_BATTING_1B,StanRes1,xlab="Singles",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BATTING_1B,StanRes1),lty=2,col=2)
#DOUBLES: add - high p-value
plot(density(BB.df$TEAM_BATTING_2B),main="Doubles");rug(BB.df$TEAM_BATTING_2B)
plot(m1$model$TEAM_BATTING_2B,StanRes1,xlab="Doubles",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BATTING_2B,StanRes1),lty=2,col=2)
#TRIPLES: transform skew before adding to model - high p-value
plot(density(BB.df$TEAM_BATTING_3B),main="Triples");rug(BB.df$TEAM_BATTING_3B)
plot(m1$model$TEAM_BATTING_3B,StanRes1,xlab="Triples",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BATTING_3B,StanRes1),lty=2,col=2)
#HOMERUNS: transform bimodal skew before adding to model - high p-value
plot(density(BB.df$TEAM_BATTING_HR),main="Homeruns");rug(BB.df$TEAM_BATTING_HR)
plot(m1$model$TEAM_BATTING_HR,StanRes1,xlab="Homeruns",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BATTING_HR,StanRes1),lty=2,col=2)
#WALKS: add
plot(density(BB.df$TEAM_BATTING_BB),main="Walks");rug(BB.df$TEAM_BATTING_BB)
plot(m1$model$TEAM_BATTING_BB,StanRes1,xlab="Walks",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BATTING_BB,StanRes1),lty=2,col=2)
#STRIKEOUTS: transform bimodal skew before adding to model high p-value
plot(density(BB.df$TEAM_BATTING_SO),main="Strikeouts");rug(BB.df$TEAM_BATTING_SO)
plot(m1$model$TEAM_BATTING_SO,StanRes1,xlab="Strikeouts",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BATTING_SO,StanRes1),lty=2,col=2)
#HIT BY PITCH: removed variable
#plot(density(BB.df$TEAM_BATTING_HBP,na.rm=TRUE),main="Hit By Pitch");rug(BB.df$TEAM_BATTING_HBP)
#plot(m1$model$TEAM_BATTING_HBP,StanRes1,xlab="Hit By Pitch",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BATTING_HBP,StanRes1),lty=2,col=2)
#STOLEN BASES: transform skew before adding to model
plot(density(BB.df$TEAM_BASERUN_SB),main="Stolen Bases");rug(BB.df$TEAM_BASERUN_SB)
plot(m1$model$TEAM_BASERUN_SB,StanRes1,xlab="Stolen Bases",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BASERUN_SB,StanRes1),lty=2,col=2)
#CAUGHT STEALING: removed variable
#plot(density(BB.df$TEAM_BASERUN_CS),main="Caught Stealing");rug(BB.df$TEAM_BASERUN_CS)
#plot(m1$model$TEAM_BASERUN_CS,StanRes1,xlab="Caught Stealing",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BASERUN_CS,StanRes1),lty=2,col=2)
#PITCHING HITS: removed variable
#plot(density(BB.df$TEAM_PITCHING_H),main="Pitching Hits");rug(BB.df$TEAM_PITCHING_H)
#plot(m1$model$TEAM_PITCHING_H,StanRes1,xlab="Pitching Hits",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_PITCHING_H,StanRes1),lty=2,col=2)
#PITCHING HOMERUNS: removed variable
#plot(density(BB.df$TEAM_PITCHING_HR),main="Pitching Homeruns");rug(BB.df$TEAM_PITCHING_HR)
#plot(m1$model$TEAM_PITCHING_HR,StanRes1,xlab="Pitching Homeruns",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_PITCHING_HR,StanRes1),lty=2,col=2)
#PITCHING WALKS: correlated with batting walks
plot(density(BB.df$TEAM_PITCHING_BB),main="Pitching Walks");rug(BB.df$TEAM_PITCHING_BB)
plot(m1$model$TEAM_PITCHING_BB,StanRes1,xlab="Pitching Walks",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_PITCHING_BB,StanRes1),lty=2,col=2)
#PITCHING STRIKEOUTS: add* - high p-value
plot(density(BB.df$TEAM_PITCHING_SO, na.rm = TRUE),main="Pitching Strikeouts");rug(BB.df$TEAM_PITCHING_SO)
plot(m1$model$TEAM_PITCHING_SO,StanRes1,xlab="Pitching Strikeouts",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_PITCHING_SO,StanRes1),lty=2,col=2)
#FIELDING ERRORS: transform skew before adding to model
plot(density(BB.df$TEAM_FIELDING_E),main="Fielding Errors");rug(BB.df$TEAM_FIELDING_E)
plot(m1$model$TEAM_FIELDING_E,StanRes1,xlab="Fielding Errors",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_FIELDING_E,StanRes1),lty=2,col=2)
#DOUBLE PLAYS: add
plot(density(BB.df$TEAM_FIELDING_DP),main="Fielding Doubleplay");rug(BB.df$TEAM_FIELDING_DP)
plot(m1$model$TEAM_FIELDING_DP,StanRes1,xlab="Fielding Doubleplays",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_FIELDING_DP,StanRes1),lty=2,col=2)
par(mfrow=c(1,1))
\[ \begin{aligned} \widehat{wins} &= \hat{\beta}_0 + \hat{\beta}_1 \times doubles + \hat{\beta}_2 \times walks + \hat{\beta}_3 \times pitching\ strikeouts + \dots + \hat{\beta}_4 \times doubleplays \end{aligned} \]
The most hits in a single season (unadjusted) is 1783 (NL Philadelphia Phillies, 1930). The most doubles in a single season (unadjusted) is 376 (AL Texas, 2008). The most triples in a single season (unadjusted) is 153 (NL Baltimore, 1894). Records that exceed these amounts should be adjusted either to NA or the median.
The 102 strikeout NAs can remain as long as they’re not counted as observations in descriptive statistics. The 2085 hit by pitch NAs disqualify this field from use in the model.
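Keeping NAs out of the descriptive statistics can be done with `na.rm = TRUE` and an explicit count of non-missing values. The sketch below uses a made-up vector `so` standing in for the strikeout column.

```r
# Made-up strikeout values with two missing entries (illustration only).
so <- c(542, 730, NA, 925, NA)

# Count only the non-missing values as observations.
n_obs <- sum(!is.na(so))

# Descriptive statistics that skip the NAs instead of returning NA.
med <- median(so, na.rm = TRUE)
avg <- mean(so, na.rm = TRUE)

print(c(n_obs = n_obs, median = med, mean = avg))
```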
The most walks in a single season (unadjusted) is 835 (AL Boston Red Sox, 1949). The fewest walks in a single season (unadjusted) is 282 (NL St. Louis Cardinals, 1908). Records that fall outside this range should be adjusted either to NA or the median.
The 131 stolen base NAs and 772 caught stealing NAs can remain as long as they’re not counted as observations in descriptive statistics.
The most stolen bases in a single season (unadjusted) is 426 (NL New York, 1893). The most times caught stealing in a single season (unadjusted) is 191 (AL New York, 1914). Records that exceed these amounts should be adjusted either to NA or the median.
These statistics are collinear and may be better used as a single derived statistic, the expected value of team stolen bases: E(SB) = SB * likelihood of success, where the likelihood of success is SB / SB attempts.
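A minimal sketch of that derived statistic follows. It assumes that stolen base attempts equal SB + CS, which the text does not state explicitly; the helper name `expected_sb` and the toy values are hypothetical.

```r
# Hypothetical helper: E(SB) = SB * (SB / attempts), with attempts assumed
# to be SB + CS. Returns NA when there are no attempts or CS is missing.
expected_sb <- function(sb, cs) {
  attempts <- sb + cs
  rate <- ifelse(attempts > 0, sb / attempts, NA_real_)
  sb * rate
}

# Toy rows standing in for BB.df (values made up for illustration).
toy <- data.frame(TEAM_BASERUN_SB = c(100, 60, 0, 150),
                  TEAM_BASERUN_CS = c(50, 20, 0, NA))
toy$E_SB <- expected_sb(toy$TEAM_BASERUN_SB, toy$TEAM_BASERUN_CS)
print(toy)
```

Because roughly a third of the caught stealing values are missing, this statistic would be NA for those rows unless caught stealing is imputed first.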
286 Fielding error NAs can remain as long as they’re not counted in descriptive statistics as observations.
The most fielding errors in a single season (unadjusted) is 639 (NL Philadelphia, 1883). The most fielding errors in a single season (unadjusted) post WWII is 234 (NL Philadelphia, 1945). Records that exceed the post WWII amount should be adjusted either to NA or the median.
The most hits given up in a single season (unadjusted) that I could find was fewer than 2000. The most homeruns given up in a single season (unadjusted) that I could find was fewer than 250. The most walks given up in a single season (unadjusted) that I could find was fewer than 800. The most strikeouts thrown in a single season (unadjusted) that I could find was fewer than 1400. Records that exceed these amounts should be adjusted either to NA or the median.
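The adjustment rule described in these notes (set records beyond a historical limit to NA or the median) can be sketched as one reusable function. The name `cap_outliers` and its arguments are hypothetical, and the toy data frame only mimics the shape of BB.df.

```r
# Hypothetical helper: replace values outside [lower, upper] with either the
# median of the in-range values or NA. Existing NAs are left untouched.
cap_outliers <- function(x, upper, lower = -Inf, replace_with = c("median", "NA")) {
  replace_with <- match.arg(replace_with)
  out_of_range <- !is.na(x) & (x > upper | x < lower)
  fill <- if (replace_with == "median") median(x[!out_of_range], na.rm = TRUE) else NA
  x[out_of_range] <- fill
  x
}

# Toy data with one impossible value per column (illustration only).
toy <- data.frame(TEAM_PITCHING_SO = c(600, 800, 1000, 19278, NA),
                  TEAM_PITCHING_H = c(1400, 1500, 1600, 30132, 1450))

# Strikeouts above 1400 become the in-range median; hits above 2000 become NA.
toy$TEAM_PITCHING_SO <- cap_outliers(toy$TEAM_PITCHING_SO, upper = 1400)
toy$TEAM_PITCHING_H <- cap_outliers(toy$TEAM_PITCHING_H, upper = 2000, replace_with = "NA")
print(toy)
```

One design choice to note: the median here is computed from the in-range values only, so the outliers being replaced do not influence their own replacement value. That differs slightly from hard-coding the median reported by `summary()`.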