Data Dictionary
Exploring the data
Handle Missing Data
Impute Missing Data
Build model for Pitching SO
Preliminary Transformation
Correlations
Full Panel
Model Mo (Baseline) using forward selection
Model Mo
Model M1
Model M2
Model M3
Predictions and Assessment
Conclusion
Appendices

Source files can be found at !github-link

Data Dictionary

variable	definition	effect	Mo	M1	M2
`INDEX`	Identification Variable	None	N	N	N
`TARGET_WINS`	Number of wins	Positive	Y	Y	Y
`TEAM_BATTING_H`	Base Hits by batters	Removed	N	N	N
`TEAM_BATTING_1B`	Singles by batters (1B)	Positive	Y	Y	Y
`TEAM_BATTING_2B`	Doubles by batters (2B)	Positive	Y	Y	Y
`TEAM_BATTING_3B`	Triples by batters (3B)	Positive	Y	Y	Y
`TEAM_BATTING_HR`	Homeruns by batters (4B)	Positive	Y	Y	Y
`TEAM_BATTING_BB`	Walks by batters	Positive	Y	Y	Y
`TEAM_BATTING_SO`	Strikeouts by batters	Negative	Y	Y	Y
`TEAM_BATTING_HBP`	Batters hit by pitch	Removed	N	N	N
`TEAM_BASERUN_SB`	Stolen bases	Removed	N	N	N
`TEAM_BASERUN_CS`	Caught stealing	Removed	N	N	N
`TEAM_PITCHING_H`	Hits allowed	Negative	Y	Y	Y
`TEAM_PITCHING_HR`	Homeruns allowed	Removed	N	N	N
`TEAM_PITCHING_BB`	Walks allowed	Negative	Y	Y	Y
`TEAM_PITCHING_SO`	Strikeouts by pitchers	Positive	Y	Y	Y
`TEAM_FIELDING_E`	Errors	Negative	Y	Y	Y
`TEAM_FIELDING_DP`	Double Plays	Positive	Y	Y	Y

Exploring the data

Many of the statistics in the provided data sets have been extrapolated from base statistics of the deadball era (circa 1900-1920) and earlier. Outliers that need adjusting can be identified using the reference link !baseball-almanac. Note that during the deadball era a softer ball was used, which had a dramatic effect on power-hitting and pitching statistics. Any adjustments that reconcile data from this period with post-WWII observations should be bounded by the later era's limits so that the distributions aren't skewed.
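
The bounding idea can be sketched as a simple cap on each variable. The helper name `clamp_to_bounds` and the limits below are hypothetical placeholders; real limits would have to come from the later era's records.

```r
# Hypothetical helper: cap a numeric vector at lower/upper limits.
# NA values stay NA so they can be imputed in a separate step.
clamp_to_bounds <- function(x, lower, upper) {
  stopifnot(lower <= upper)
  pmin(pmax(x, lower), upper)
}

# Illustrative values only (not real team statistics).
so <- c(150, 900, NA, 20000)
so_bounded <- clamp_to_bounds(so, lower = 300, upper = 1600)
print(so_bounded)
```

Capping keeps every row in the data set, at the cost of piling extreme observations up at the limits.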

Handle Missing Data

variable	NA count	NA %	action
`TEAM_BATTING_SO`	102	4.48	impute w/ median
`TEAM_BASERUN_SB`	131	5.75	impute w/ median
`TEAM_BASERUN_CS`	772	33.89	removed variable
`TEAM_BATTING_HBP`	2085	91.53	removed variable
`TEAM_PITCHING_SO`	102	4.48	impute w/ median
`TEAM_FIELDING_DP`	286	12.55	impute w/ median
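
A table like the one above can be produced with a few lines of base R. This is a minimal sketch on a made-up data frame; `na_summary` is a hypothetical helper, and `toy` merely stands in for `BB.df`.

```r
# Summarise missing values per column: count and percentage of rows.
na_summary <- function(df) {
  na_count <- colSums(is.na(df))
  out <- data.frame(
    variable = names(na_count),
    na_count = as.integer(na_count),
    na_pct   = unname(round(100 * na_count / nrow(df), 2)),
    row.names = NULL
  )
  # Keep only the columns that actually have gaps.
  out[out$na_count > 0, ]
}

# Toy data frame with made-up values.
toy <- data.frame(a = c(1, NA, 3, 4), b = c(NA, NA, 1, 2), c = 1:4)
res <- na_summary(toy)
print(res)
```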

Deleting missing cases is the simplest strategy for dealing with missing data. It avoids the complexity and possible biases introduced by more sophisticated methods. The drawback is that it throws away information that might allow more precise inference. If relatively few cases contain missing values, so that deletion still leaves a large dataset, or if the goal is to communicate a simple data analysis method, the deletion strategy is satisfactory.

Standard errors are larger after deleting cases because fewer records are available to fit the model. Larger standard errors result in less precise estimates. (Faraway, LMR 2015, p.200)
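
The effect on standard errors can be seen in a small simulation. The data here are simulated and unrelated to the baseball data set; `slope_se` is a hypothetical helper introduced only for this illustration.

```r
set.seed(42)
n <- 500
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
full <- data.frame(x = x, y = y)

# Blank out 60% of the responses at random. lm() then drops those rows,
# because its default na.action is na.omit (case deletion).
holes <- full
holes$y[sample(n, size = 0.6 * n)] <- NA

# Standard error of the slope estimate for a given data frame.
slope_se <- function(df) {
  summary(lm(y ~ x, data = df))$coefficients["x", "Std. Error"]
}

se_full    <- slope_se(full)
se_deleted <- slope_se(holes)
print(c(full = se_full, deleted = se_deleted))
```

With only 40% of the rows left, the slope's standard error should come out noticeably larger than the full-data value.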

Single imputation . . causes bias, while deletion causes a loss of information. Multiple imputation is a way to reduce the bias caused by single imputation. The problem with single imputation is that the imputed value tends to be less variable than the value we would have observed, because it does not include the error variation normally seen in observed data. The idea behind multiple imputation is to reinclude that error variation. (Faraway, LMR 2015, p.202)
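
The loss of variability is easy to demonstrate with mean imputation on simulated data. This is a sketch of the general point, not of Faraway's procedure; the noisy variant only illustrates the idea of putting error variation back and is not a full multiple-imputation scheme.

```r
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)
x_missing <- x
x_missing[sample(200, 60)] <- NA
gaps <- is.na(x_missing)

obs_mean <- mean(x_missing, na.rm = TRUE)
obs_sd   <- sd(x_missing, na.rm = TRUE)

# Single imputation: every gap receives the same value (the observed mean).
x_single <- ifelse(gaps, obs_mean, x_missing)

# Noisy variant: add random error with the observed spread to each imputed value.
x_noisy <- x_single
x_noisy[gaps] <- obs_mean + rnorm(sum(gaps), sd = obs_sd)

var_observed <- var(x_missing, na.rm = TRUE)
var_single   <- var(x_single)
print(c(observed = var_observed, single = var_single, noisy = var(x_noisy)))
```

Mean imputation adds no new squared deviation but increases the sample size, so the variance of the completed vector is always below the variance of the observed values.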

Multiple imputation can be done using the Amelia package. Per Faraway, the assumption is that the data are multivariate normal, so heavily skewed variables should be log-transformed first.
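
A minimal sketch of what such a call might look like, assuming the Amelia package is installed. The wrapper `run_amelia_sketch`, the toy data, and the choice of `m = 5` are illustrative assumptions and were not part of the original analysis.

```r
# Sketch only: runs the imputation when the Amelia package is available.
run_amelia_sketch <- function(df, skewed_cols, m = 5) {
  if (!requireNamespace("Amelia", quietly = TRUE)) {
    message("Amelia is not installed; skipping the imputation sketch")
    return(NULL)
  }
  # logs  = columns to log-transform before imputation
  # p2s=0 = suppress progress output
  fit <- Amelia::amelia(df, m = m, logs = skewed_cols, p2s = 0)
  fit$imputations  # list of m completed data frames
}

# Made-up data: 'skewed' is strictly positive and right-skewed.
set.seed(7)
n  <- 100
x1 <- rnorm(n)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(n),
                  skewed = exp(0.5 * x1 + rnorm(n)))
toy$x2[sample(n, 10)] <- NA
toy$skewed[sample(n, 10)] <- NA

imps <- run_amelia_sketch(toy, skewed_cols = "skewed")
```

Each element of the returned list is one completed copy of the data; a model would be fitted to every copy and the results pooled.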

Impute Missing Data

Build model for batting SO using Gelman approach

BSO <- lm(data=BB.df, TEAM_BATTING_SO~.)
summary(BSO)

## 
## Call:
## lm(formula = TEAM_BATTING_SO ~ ., data = BB.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -132.359   -4.984   -1.532    4.132  119.497 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.602514   8.211806   4.336 1.53e-05 ***
## TARGET_WINS       0.059394   0.031249   1.901 0.057505 .  
## TEAM_BATTING_2B   0.284286   0.020965  13.560  < 2e-16 ***
## TEAM_BATTING_3B   0.197878   0.030376   6.514 9.42e-11 ***
## TEAM_BATTING_HR   0.287345   0.022301  12.885  < 2e-16 ***
## TEAM_BATTING_BB   0.808344   0.052042  15.533  < 2e-16 ***
## TEAM_BASERUN_SB   0.029634   0.007671   3.863 0.000116 ***
## TEAM_PITCHING_H  -0.260618   0.018219 -14.304  < 2e-16 ***
## TEAM_PITCHING_BB -0.783418   0.049234 -15.912  < 2e-16 ***
## TEAM_PITCHING_SO  0.937860   0.002983 314.446  < 2e-16 ***
## TEAM_FIELDING_E   0.042656   0.010161   4.198 2.82e-05 ***
## TEAM_FIELDING_DP -0.031382   0.016741  -1.875 0.061003 .  
## TEAM_BATTING_1B   0.252070   0.020234  12.457  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.59 on 1822 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.9961, Adjusted R-squared:  0.9961 
## F-statistic: 3.882e+04 on 12 and 1822 DF,  p-value: < 2.2e-16

#remove Double Plays + Pitching Strikeouts + Stolen Bases from linear model due to missing data
BSO.1 <- lm(data=BB.df, TEAM_BATTING_SO~. -TEAM_FIELDING_DP -TEAM_PITCHING_SO -TEAM_BASERUN_SB)
summary(BSO.1)

## 
## Call:
## lm(formula = TEAM_BATTING_SO ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_SO - 
##     TEAM_BASERUN_SB, data = BB.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -370.83  -82.29   -1.58   77.00  403.79 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1989.8234    44.5651  44.650  < 2e-16 ***
## TARGET_WINS        -0.6958     0.2404  -2.894  0.00384 ** 
## TEAM_BATTING_2B    -0.2356     0.1752  -1.345  0.17887    
## TEAM_BATTING_3B    -2.1561     0.2483  -8.685  < 2e-16 ***
## TEAM_BATTING_HR     0.8995     0.1818   4.948 8.17e-07 ***
## TEAM_BATTING_BB     1.1335     0.4351   2.605  0.00927 ** 
## TEAM_PITCHING_H     0.4099     0.1517   2.701  0.00697 ** 
## TEAM_PITCHING_BB   -1.2974     0.4114  -3.154  0.00164 ** 
## TEAM_FIELDING_E    -0.3473     0.0782  -4.440 9.52e-06 ***
## TEAM_BATTING_1B    -1.4755     0.1643  -8.981  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 113.8 on 1825 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.726,  Adjusted R-squared:  0.7247 
## F-statistic: 537.4 on 9 and 1825 DF,  p-value: < 2.2e-16

#remove Doubles from linear model
BSO.2 <- lm(data=BB.df, TEAM_BATTING_SO~. -TEAM_FIELDING_DP -TEAM_PITCHING_SO -TEAM_BASERUN_SB - TEAM_BATTING_2B)
summary(BSO.2)

## 
## Call:
## lm(formula = TEAM_BATTING_SO ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_SO - 
##     TEAM_BASERUN_SB - TEAM_BATTING_2B, data = BB.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -369.33  -81.63   -1.84   77.82  400.41 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1985.10053   44.43632  44.673  < 2e-16 ***
## TARGET_WINS        -0.65687    0.23869  -2.752 0.005982 ** 
## TEAM_BATTING_3B    -2.00367    0.22090  -9.070  < 2e-16 ***
## TEAM_BATTING_HR     1.07932    0.12322   8.759  < 2e-16 ***
## TEAM_BATTING_BB     0.61003    0.19455   3.136 0.001742 ** 
## TEAM_PITCHING_H     0.22486    0.06403   3.512 0.000455 ***
## TEAM_PITCHING_BB   -0.80239    0.18378  -4.366 1.34e-05 ***
## TEAM_FIELDING_E    -0.31244    0.07381  -4.233 2.42e-05 ***
## TEAM_BATTING_1B    -1.28900    0.08813 -14.626  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 113.9 on 1826 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.7258, Adjusted R-squared:  0.7246 
## F-statistic: 604.1 on 8 and 1826 DF,  p-value: < 2.2e-16

##All p-values are low with a 604.1 F-statistic and adjusted R squared of 0.7246
#take a look
par(mfrow=c(2,2))
plot(BSO.2)

plot(BSO.2$residuals)

#prediction function
pred.BSO <- round(predict(BSO.2, BB.df))
impute <- function (a, a.impute){
  ifelse (is.na(a), a.impute,a)
}
BSO.imp.1 <- impute(BB.df$TEAM_BATTING_SO, pred.BSO)

#place back in the data base with imputed data for SO's
BB.df$TEAM_BATTING_SO <- BSO.imp.1
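
The fit, predict, and fill steps above can be wrapped in one function. This is a sketch on made-up data; `impute_from_lm` is a hypothetical helper, not part of the original code. It also shows that `predict()` returns `NA` for any row with a missing predictor, so some gaps can survive the fill.

```r
# Hypothetical helper: fill NAs in `target` with rounded predictions from `fit`.
impute_from_lm <- function(df, target, fit) {
  pred <- round(predict(fit, newdata = df))
  gaps <- is.na(df[[target]])
  # predict() yields NA when any predictor in the row is missing,
  # so a gap in such a row stays unfilled.
  df[[target]] <- ifelse(gaps, pred, df[[target]])
  df
}

# Toy data: 'so' has two gaps; the second one also lacks its predictor.
toy <- data.frame(
  so = c(10, 20, NA, 40, 50, NA),
  h  = c(1, 2, 3, 4, 5, NA)
)
fit <- lm(so ~ h, data = toy)
filled <- impute_from_lm(toy, "so", fit)
print(filled$so)
```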

Build model for Pitching SO

PSO <- lm(data=BB.df, TEAM_PITCHING_SO~.)
summary(PSO)

## 
## Call:
## lm(formula = TEAM_PITCHING_SO ~ ., data = BB.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -129.311   -3.946    1.108    4.509  146.072 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.347331   8.720911  -0.154  0.87724    
## TARGET_WINS      -0.106518   0.032955  -3.232  0.00125 ** 
## TEAM_BATTING_2B  -0.304767   0.022118 -13.779  < 2e-16 ***
## TEAM_BATTING_3B  -0.243916   0.031959  -7.632 3.70e-14 ***
## TEAM_BATTING_HR  -0.277931   0.023736 -11.709  < 2e-16 ***
## TEAM_BATTING_BB  -0.850550   0.055016 -15.460  < 2e-16 ***
## TEAM_BATTING_SO   1.046965   0.003330 314.446  < 2e-16 ***
## TEAM_BASERUN_SB  -0.011832   0.008133  -1.455  0.14590    
## TEAM_PITCHING_H   0.281213   0.019203  14.644  < 2e-16 ***
## TEAM_PITCHING_BB  0.821919   0.052070  15.785  < 2e-16 ***
## TEAM_FIELDING_E  -0.060329   0.010695  -5.641 1.96e-08 ***
## TEAM_FIELDING_DP  0.020598   0.017698   1.164  0.24463    
## TEAM_BATTING_1B  -0.288504   0.021221 -13.595  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.36 on 1822 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.9959, Adjusted R-squared:  0.9958 
## F-statistic: 3.647e+04 on 12 and 1822 DF,  p-value: < 2.2e-16

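#remove Double Plays + Stolen Bases from linear model due to missing data
#note: the response below is TEAM_BATTING_SO, not TEAM_PITCHING_SO, and TEAM_PITCHING_SO stays in as a predictor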
PSO.1 <- lm(data=BB.df, TEAM_BATTING_SO~. -TEAM_FIELDING_DP -TEAM_BASERUN_SB)
summary(PSO.1)

## 
## Call:
## lm(formula = TEAM_BATTING_SO ~ . - TEAM_FIELDING_DP - TEAM_BASERUN_SB, 
##     data = BB.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -136.886   -5.019   -1.589    4.139  123.070 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      24.256643   7.714190   3.144  0.00169 ** 
## TARGET_WINS       0.115006   0.028933   3.975 7.31e-05 ***
## TEAM_BATTING_2B   0.282973   0.021075  13.427  < 2e-16 ***
## TEAM_BATTING_3B   0.198834   0.030522   6.514 9.41e-11 ***
## TEAM_BATTING_HR   0.264950   0.021885  12.107  < 2e-16 ***
## TEAM_BATTING_BB   0.823104   0.052216  15.764  < 2e-16 ***
## TEAM_PITCHING_H  -0.257552   0.018304 -14.071  < 2e-16 ***
## TEAM_PITCHING_BB -0.799155   0.049377 -16.185  < 2e-16 ***
## TEAM_PITCHING_SO  0.944062   0.002671 353.488  < 2e-16 ***
## TEAM_FIELDING_E   0.059574   0.009453   6.302 3.68e-10 ***
## TEAM_BATTING_1B   0.248658   0.020305  12.246  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.66 on 1824 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.9961, Adjusted R-squared:  0.996 
## F-statistic: 4.609e+04 on 10 and 1824 DF,  p-value: < 2.2e-16

#all low P value and F statistic of 46090 with adj R squared of 0.996
#take a look
par(mfrow=c(2,2))
plot(PSO.1)

plot(PSO.1$residuals)

#place back in the model with imputed data for SO's
pred.PSO <- round(predict(PSO.1, BB.df))
PSO.imp.1 <- impute(BB.df$TEAM_PITCHING_SO, pred.PSO)
BB.df$TEAM_PITCHING_SO <- PSO.imp.1
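
The Call in the PSO.1 output shows TEAM_BATTING_SO as the response with TEAM_PITCHING_SO among the predictors, so it is worth checking how many gaps an imputation step actually fills. The sketch below uses made-up data and hypothetical names (`count_filled`, `bat_so`, `pitch_so`, `hits`) to contrast the two set-ups.

```r
# Count how many missing values an imputation step actually filled.
count_filled <- function(before, after) {
  sum(is.na(before) & !is.na(after))
}

# Made-up data: pitch_so has two gaps.
toy <- data.frame(
  bat_so   = c(100, 200, 300, 400, 500, 600),
  pitch_so = c(112, 205, NA, 418, 497, NA),
  hits     = c(9, 7, 8, 5, 6, 4)
)
gaps <- is.na(toy$pitch_so)

# Model A: the column to be imputed is used as a predictor.
fit_a  <- lm(bat_so ~ pitch_so + hits, data = toy)
pred_a <- predict(fit_a, newdata = toy)   # NA wherever pitch_so is NA
after_a <- ifelse(gaps, pred_a, toy$pitch_so)

# Model B: the column to be imputed is the response.
fit_b  <- lm(pitch_so ~ bat_so + hits, data = toy)
pred_b <- predict(fit_b, newdata = toy)
after_b <- ifelse(gaps, round(pred_b), toy$pitch_so)

filled_a <- count_filled(toy$pitch_so, after_a)
filled_b <- count_filled(toy$pitch_so, after_b)
print(c(model_a = filled_a, model_b = filled_b))
```

In the first set-up the prediction is `NA` exactly where the value is missing, so nothing is filled; in the second, both gaps receive a value.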

Build model for SB

SB <- lm(data=BB.df, TEAM_BASERUN_SB~.)
summary(SB)

## 
## Call:
## lm(formula = TEAM_BASERUN_SB ~ ., data = BB.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -98.937 -29.450  -3.022  25.189 185.149 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -161.43340   24.81951  -6.504 1.01e-10 ***
## TARGET_WINS         1.14570    0.09128  12.552  < 2e-16 ***
## TEAM_BATTING_2B    -0.10549    0.06686  -1.578 0.114819    
## TEAM_BATTING_3B     0.03292    0.09346   0.352 0.724736    
## TEAM_BATTING_HR    -0.61146    0.06939  -8.812  < 2e-16 ***
## TEAM_BATTING_BB     0.16800    0.16840   0.998 0.318610    
## TEAM_BATTING_SO     0.27416    0.07097   3.863 0.000116 ***
## TEAM_PITCHING_H     0.14060    0.05835   2.409 0.016073 *  
## TEAM_PITCHING_BB   -0.15324    0.15978  -0.959 0.337660    
## TEAM_PITCHING_SO   -0.09806    0.06740  -1.455 0.145899    
## TEAM_FIELDING_E     0.29777    0.03026   9.840  < 2e-16 ***
## TEAM_FIELDING_DP   -0.34705    0.05032  -6.898 7.27e-12 ***
## TEAM_BATTING_1B    -0.08062    0.06409  -1.258 0.208540    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.33 on 1822 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.3903, Adjusted R-squared:  0.3863 
## F-statistic: 97.21 on 12 and 1822 DF,  p-value: < 2.2e-16

#remove Double Plays from linear model due to missing data
SB.1 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP)
summary(SB.1)

## 
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP, data = BB.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.161  -29.240   -4.324   25.584  192.740 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -218.43393   23.70040  -9.216  < 2e-16 ***
## TARGET_WINS         1.31500    0.08903  14.770  < 2e-16 ***
## TEAM_BATTING_2B    -0.11774    0.06769  -1.739  0.08213 .  
## TEAM_BATTING_3B     0.01353    0.09460   0.143  0.88632    
## TEAM_BATTING_HR    -0.67268    0.06970  -9.652  < 2e-16 ***
## TEAM_BATTING_BB     0.16989    0.17054   0.996  0.31928    
## TEAM_BATTING_SO     0.30317    0.07174   4.226  2.5e-05 ***
## TEAM_PITCHING_H     0.15633    0.05905   2.648  0.00818 ** 
## TEAM_PITCHING_BB   -0.16923    0.16179  -1.046  0.29571    
## TEAM_PITCHING_SO   -0.11346    0.06822  -1.663  0.09646 .  
## TEAM_FIELDING_E     0.35943    0.02928  12.276  < 2e-16 ***
## TEAM_BATTING_1B    -0.11318    0.06472  -1.749  0.08051 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.85 on 1823 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.3744, Adjusted R-squared:  0.3706 
## F-statistic: 99.19 on 11 and 1823 DF,  p-value: < 2.2e-16

#remove Pitching Walks from linear model
SB.2 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB)
summary(SB.2)

## 
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB, 
##     data = BB.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -105.609  -29.102   -4.136   25.813  192.235 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -220.49651   23.61883  -9.336  < 2e-16 ***
## TARGET_WINS         1.32083    0.08886  14.864  < 2e-16 ***
## TEAM_BATTING_2B    -0.06879    0.04891  -1.407  0.15974    
## TEAM_BATTING_3B     0.05934    0.08386   0.708  0.47927    
## TEAM_BATTING_HR    -0.62211    0.05020 -12.394  < 2e-16 ***
## TEAM_BATTING_BB    -0.00796    0.01310  -0.608  0.54355    
## TEAM_BATTING_SO     0.32976    0.06709   4.915 9.66e-07 ***
## TEAM_PITCHING_H     0.10908    0.03803   2.868  0.00417 ** 
## TEAM_PITCHING_SO   -0.13834    0.06394  -2.163  0.03064 *  
## TEAM_FIELDING_E     0.36540    0.02872  12.723  < 2e-16 ***
## TEAM_BATTING_1B    -0.06298    0.04343  -1.450  0.14715    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.85 on 1824 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.374,  Adjusted R-squared:  0.3706 
## F-statistic:   109 on 10 and 1824 DF,  p-value: < 2.2e-16

#remove Triples from linear model
SB.3 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB -TEAM_BATTING_3B)
summary(SB.3)

## 
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB - 
##     TEAM_BATTING_3B, data = BB.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -105.018  -29.148   -3.989   25.795  193.727 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.203e+02  2.361e+01  -9.331  < 2e-16 ***
## TARGET_WINS       1.334e+00  8.682e-02  15.368  < 2e-16 ***
## TEAM_BATTING_2B  -7.737e-02  4.737e-02  -1.633 0.102606    
## TEAM_BATTING_HR  -6.383e-01  4.468e-02 -14.284  < 2e-16 ***
## TEAM_BATTING_BB  -8.298e-03  1.309e-02  -0.634 0.526267    
## TEAM_BATTING_SO   3.478e-01  6.205e-02   5.605  2.4e-08 ***
## TEAM_PITCHING_H   1.203e-01  3.458e-02   3.478 0.000517 ***
## TEAM_PITCHING_SO -1.567e-01  5.844e-02  -2.681 0.007400 ** 
## TEAM_FIELDING_E   3.701e-01  2.795e-02  13.242  < 2e-16 ***
## TEAM_BATTING_1B  -7.369e-02  4.070e-02  -1.810 0.070390 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.85 on 1825 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.3739, Adjusted R-squared:  0.3708 
## F-statistic: 121.1 on 9 and 1825 DF,  p-value: < 2.2e-16

#remove Walks from linear model
SB.4 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB -TEAM_BATTING_3B -TEAM_BATTING_BB)
summary(SB.4)

## 
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB - 
##     TEAM_BATTING_3B - TEAM_BATTING_BB, data = BB.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.336  -29.443   -4.327   26.045  193.400 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -225.15763   22.35695 -10.071  < 2e-16 ***
## TARGET_WINS         1.32179    0.08457  15.630  < 2e-16 ***
## TEAM_BATTING_2B    -0.08121    0.04698  -1.729 0.084001 .  
## TEAM_BATTING_HR    -0.64425    0.04367 -14.752  < 2e-16 ***
## TEAM_BATTING_SO     0.35256    0.06158   5.725 1.21e-08 ***
## TEAM_PITCHING_H     0.12251    0.03440   3.562 0.000378 ***
## TEAM_PITCHING_SO   -0.16050    0.05812  -2.762 0.005810 ** 
## TEAM_FIELDING_E     0.37090    0.02791  13.289  < 2e-16 ***
## TEAM_BATTING_1B    -0.07475    0.04066  -1.838 0.066159 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.84 on 1826 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.3737, Adjusted R-squared:  0.371 
## F-statistic: 136.2 on 8 and 1826 DF,  p-value: < 2.2e-16

#remove Singles from linear model
SB.5 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB -TEAM_BATTING_3B -TEAM_BATTING_BB - TEAM_BATTING_1B)
summary(SB.5)

## 
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB - 
##     TEAM_BATTING_3B - TEAM_BATTING_BB - TEAM_BATTING_1B, data = BB.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -105.841  -29.320   -3.894   25.575  194.213 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -242.88942   20.18259 -12.035  < 2e-16 ***
## TARGET_WINS         1.34304    0.08383  16.021  < 2e-16 ***
## TEAM_BATTING_2B    -0.02368    0.03505  -0.675   0.4995    
## TEAM_BATTING_HR    -0.59121    0.03280 -18.023  < 2e-16 ***
## TEAM_BATTING_SO     0.25759    0.03355   7.679 2.60e-14 ***
## TEAM_PITCHING_H     0.06550    0.01489   4.398 1.16e-05 ***
## TEAM_PITCHING_SO   -0.06699    0.02814  -2.381   0.0174 *  
## TEAM_FIELDING_E     0.38719    0.02648  14.619  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.87 on 1827 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.3726, Adjusted R-squared:  0.3702 
## F-statistic:   155 on 7 and 1827 DF,  p-value: < 2.2e-16

#remove Doubles from linear model
SB.6 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB -TEAM_BATTING_3B -TEAM_BATTING_BB - TEAM_BATTING_1B - TEAM_BATTING_2B)
summary(SB.6)

## 
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB - 
##     TEAM_BATTING_3B - TEAM_BATTING_BB - TEAM_BATTING_1B - TEAM_BATTING_2B, 
##     data = BB.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.516  -29.068   -4.233   25.520  193.624 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -238.52378   19.11672 -12.477  < 2e-16 ***
## TARGET_WINS         1.35020    0.08314  16.240  < 2e-16 ***
## TEAM_BATTING_HR    -0.59070    0.03279 -18.015  < 2e-16 ***
## TEAM_BATTING_SO     0.24459    0.02748   8.902  < 2e-16 ***
## TEAM_PITCHING_H     0.05854    0.01075   5.445 5.89e-08 ***
## TEAM_PITCHING_SO   -0.05579    0.02273  -2.455   0.0142 *  
## TEAM_FIELDING_E     0.39219    0.02542  15.426  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.86 on 1828 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.3724, Adjusted R-squared:  0.3703 
## F-statistic: 180.8 on 6 and 1828 DF,  p-value: < 2.2e-16

#remove Pitching Strikeouts
SB.7 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB -TEAM_BATTING_3B -TEAM_BATTING_BB - TEAM_BATTING_1B - TEAM_BATTING_2B - TEAM_PITCHING_SO)
summary(SB.7)

## 
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB - 
##     TEAM_BATTING_3B - TEAM_BATTING_BB - TEAM_BATTING_1B - TEAM_BATTING_2B - 
##     TEAM_PITCHING_SO, data = BB.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.263  -29.291   -3.527   25.638  192.679 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2.088e+02  1.481e+01  -14.10  < 2e-16 ***
## TARGET_WINS      1.389e+00  8.175e-02   16.99  < 2e-16 ***
## TEAM_BATTING_HR -5.624e-01  3.073e-02  -18.30  < 2e-16 ***
## TEAM_BATTING_SO  1.797e-01  7.532e-03   23.86  < 2e-16 ***
## TEAM_PITCHING_H  3.769e-02  6.601e-03    5.71 1.32e-08 ***
## TEAM_FIELDING_E  3.986e-01  2.532e-02   15.74  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.92 on 1829 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.3703, Adjusted R-squared:  0.3686 
## F-statistic: 215.1 on 5 and 1829 DF,  p-value: < 2.2e-16

#Add Singles back into linear model
SB.8 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB -TEAM_BATTING_3B -TEAM_BATTING_BB - TEAM_BATTING_2B - TEAM_PITCHING_SO)
summary(SB.8)

## 
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB - 
##     TEAM_BATTING_3B - TEAM_BATTING_BB - TEAM_BATTING_2B - TEAM_PITCHING_SO, 
##     data = BB.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.164  -29.210   -3.464   25.666  193.809 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2.302e+02  2.232e+01 -10.314  < 2e-16 ***
## TARGET_WINS      1.375e+00  8.248e-02  16.667  < 2e-16 ***
## TEAM_BATTING_HR -5.563e-01  3.110e-02 -17.889  < 2e-16 ***
## TEAM_BATTING_SO  1.843e-01  8.342e-03  22.094  < 2e-16 ***
## TEAM_PITCHING_H  3.255e-02  7.726e-03   4.213 2.64e-05 ***
## TEAM_FIELDING_E  3.974e-01  2.534e-02  15.687  < 2e-16 ***
## TEAM_BATTING_1B  2.509e-02  1.960e-02   1.280    0.201    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.91 on 1828 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.3709, Adjusted R-squared:  0.3688 
## F-statistic: 179.6 on 6 and 1828 DF,  p-value: < 2.2e-16

#remove Pitching Hits from linear model
SB.9 <- lm(data=BB.df, TEAM_BASERUN_SB~.-TEAM_FIELDING_DP -TEAM_PITCHING_BB -TEAM_BATTING_3B -TEAM_BATTING_BB - TEAM_BATTING_2B - TEAM_PITCHING_SO - TEAM_PITCHING_H)
summary(SB.9)

## 
## Call:
## lm(formula = TEAM_BASERUN_SB ~ . - TEAM_FIELDING_DP - TEAM_PITCHING_BB - 
##     TEAM_BATTING_3B - TEAM_BATTING_BB - TEAM_BATTING_2B - TEAM_PITCHING_SO - 
##     TEAM_PITCHING_H, data = BB.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -107.27  -29.23   -3.23   26.13  193.70 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2.318e+02  2.241e+01 -10.343  < 2e-16 ***
## TARGET_WINS      1.394e+00  8.273e-02  16.852  < 2e-16 ***
## TEAM_BATTING_HR -5.147e-01  2.962e-02 -17.375  < 2e-16 ***
## TEAM_BATTING_SO  1.816e-01  8.355e-03  21.735  < 2e-16 ***
## TEAM_FIELDING_E  4.111e-01  2.524e-02  16.284  < 2e-16 ***
## TEAM_BATTING_1B  6.801e-02  1.682e-02   4.043 5.49e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.1 on 1829 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.3648, Adjusted R-squared:  0.3631 
## F-statistic: 210.1 on 5 and 1829 DF,  p-value: < 2.2e-16

#all low P value and F statistic of 210.1 with adj R squared of 0.3631
#take a look
par(mfrow=c(2,2))
plot(SB.9)

plot(SB.9$residuals)

#place back in the data base with imputed data for SB's
pred.SB <- round(predict(SB.9, BB.df))
SB.imp.1 <- impute(BB.df$TEAM_BASERUN_SB, pred.SB)
BB.df$TEAM_BASERUN_SB <- SB.imp.1
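
The step-by-step removal of terms used for SB.1 through SB.9 can be written more compactly with `update()`, which refits a model after changing its formula. This is a sketch on simulated data with hypothetical variable names; it shows the mechanism only and does not reproduce the selections made above.

```r
set.seed(3)
toy <- data.frame(x1 = rnorm(80), x2 = rnorm(80), x3 = rnorm(80))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(80)   # x3 is pure noise

# Start from the full model, as with lm(TEAM_BASERUN_SB ~ ., ...).
m0 <- lm(y ~ ., data = toy)

# Drop one term without retyping the accumulated list of exclusions.
m1 <- update(m0, . ~ . - x3)
print(names(coef(m1)))

# drop1() reports an F test for removing each remaining term in turn.
print(drop1(m1, test = "F"))
```

Each further removal is another `update()` call on the previous model, which keeps the chain of formulas short and avoids copy errors.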

Build model to replace DP

DP <- lm(data=BB.df, TEAM_FIELDING_DP~.)
summary(DP)

## 
## Call:
## lm(formula = TEAM_FIELDING_DP ~ ., data = BB.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.332 -12.811  -0.552  12.205  69.169 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.474e+02  1.094e+01  13.473  < 2e-16 ***
## TARGET_WINS      -3.483e-01  4.127e-02  -8.439  < 2e-16 ***
## TEAM_BATTING_2B   8.637e-04  1.289e-02   0.067  0.94660    
## TEAM_BATTING_3B  -4.693e-02  3.174e-02  -1.479  0.13938    
## TEAM_BATTING_HR   9.574e-02  1.675e-02   5.716 1.27e-08 ***
## TEAM_BATTING_BB   1.752e-01  2.754e-02   6.363 2.49e-10 ***
## TEAM_BATTING_SO  -9.557e-02  1.540e-02  -6.205 6.72e-10 ***
## TEAM_BASERUN_SB  -7.459e-02  1.078e-02  -6.922 6.10e-12 ***
## TEAM_PITCHING_H   1.087e-02  4.166e-03   2.608  0.00918 ** 
## TEAM_PITCHING_BB -1.296e-01  2.516e-02  -5.150 2.87e-07 ***
## TEAM_PITCHING_SO  6.869e-02  1.396e-02   4.921 9.36e-07 ***
## TEAM_FIELDING_E  -8.336e-02  1.027e-02  -8.115 8.67e-16 ***
## TEAM_BATTING_1B   2.330e-02  9.891e-03   2.355  0.01860 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.31 on 1875 degrees of freedom
##   (388 observations deleted due to missingness)
## Multiple R-squared:  0.3646, Adjusted R-squared:  0.3605 
## F-statistic: 89.65 on 12 and 1875 DF,  p-value: < 2.2e-16

#remove batting 2B's
DP.1 <- lm(data=BB.df, TEAM_FIELDING_DP~.-TEAM_BATTING_2B)
summary(DP.1)

## 
## Call:
## lm(formula = TEAM_FIELDING_DP ~ . - TEAM_BATTING_2B, data = BB.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.312 -12.803  -0.549  12.210  69.140 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      147.499122  10.893086  13.541  < 2e-16 ***
## TARGET_WINS       -0.348500   0.041091  -8.481  < 2e-16 ***
## TEAM_BATTING_3B   -0.046453   0.030914  -1.503  0.13311    
## TEAM_BATTING_HR    0.095923   0.016511   5.810 7.34e-09 ***
## TEAM_BATTING_BB    0.175497   0.027232   6.444 1.47e-10 ***
## TEAM_BATTING_SO   -0.095552   0.015396  -6.206 6.65e-10 ***
## TEAM_BASERUN_SB   -0.074570   0.010768  -6.925 5.97e-12 ***
## TEAM_PITCHING_H    0.010962   0.003907   2.806  0.00507 ** 
## TEAM_PITCHING_BB  -0.129803   0.024943  -5.204 2.16e-07 ***
## TEAM_PITCHING_SO   0.068675   0.013952   4.922 9.31e-07 ***
## TEAM_FIELDING_E   -0.083572   0.009765  -8.559  < 2e-16 ***
## TEAM_BATTING_1B    0.023282   0.009886   2.355  0.01862 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.3 on 1876 degrees of freedom
##   (388 observations deleted due to missingness)
## Multiple R-squared:  0.3646, Adjusted R-squared:  0.3609 
## F-statistic: 97.85 on 11 and 1876 DF,  p-value: < 2.2e-16

#remove batting 3B's
DP.2 <- lm(data=BB.df, TEAM_FIELDING_DP~.-TEAM_BATTING_2B -TEAM_BATTING_3B)
summary(DP.2)

## 
## Call:
## lm(formula = TEAM_FIELDING_DP ~ . - TEAM_BATTING_2B - TEAM_BATTING_3B, 
##     data = BB.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.434 -12.959  -0.585  12.448  68.003 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      146.353372  10.870006  13.464  < 2e-16 ***
## TARGET_WINS       -0.361663   0.040160  -9.006  < 2e-16 ***
## TEAM_BATTING_HR    0.100120   0.016278   6.150 9.42e-10 ***
## TEAM_BATTING_BB    0.178287   0.027178   6.560 6.94e-11 ***
## TEAM_BATTING_SO   -0.094626   0.015388  -6.149 9.49e-10 ***
## TEAM_BASERUN_SB   -0.075508   0.010754  -7.021 3.06e-12 ***
## TEAM_PITCHING_H    0.011727   0.003875   3.026  0.00251 ** 
## TEAM_PITCHING_BB  -0.132288   0.024897  -5.314 1.20e-07 ***
## TEAM_PITCHING_SO   0.069184   0.013953   4.958 7.75e-07 ***
## TEAM_FIELDING_E   -0.085506   0.009683  -8.831  < 2e-16 ***
## TEAM_BATTING_1B    0.020890   0.009760   2.140  0.03245 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.31 on 1877 degrees of freedom
##   (388 observations deleted due to missingness)
## Multiple R-squared:  0.3638, Adjusted R-squared:  0.3604 
## F-statistic: 107.3 on 10 and 1877 DF,  p-value: < 2.2e-16

#remove batting 1B's
DP.3 <- lm(data=BB.df, TEAM_FIELDING_DP~.-TEAM_BATTING_2B -TEAM_BATTING_3B - TEAM_BATTING_1B)
summary(DP.3)

## 
## Call:
## lm(formula = TEAM_FIELDING_DP ~ . - TEAM_BATTING_2B - TEAM_BATTING_3B - 
##     TEAM_BATTING_1B, data = BB.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.318 -12.867  -0.594  12.183  70.274 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      165.329937   6.295067  26.263  < 2e-16 ***
## TARGET_WINS       -0.359644   0.040187  -8.949  < 2e-16 ***
## TEAM_BATTING_HR    0.089824   0.015566   5.770 9.23e-09 ***
## TEAM_BATTING_BB    0.186157   0.026954   6.907 6.77e-12 ***
## TEAM_BATTING_SO   -0.093203   0.015389  -6.057 1.68e-09 ***
## TEAM_BASERUN_SB   -0.074197   0.010747  -6.904 6.88e-12 ***
## TEAM_PITCHING_H    0.017287   0.002878   6.007 2.27e-09 ***
## TEAM_PITCHING_BB  -0.139872   0.024667  -5.670 1.65e-08 ***
## TEAM_PITCHING_SO   0.063535   0.013714   4.633 3.86e-06 ***
## TEAM_FIELDING_E   -0.093118   0.009015 -10.330  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.33 on 1878 degrees of freedom
##   (388 observations deleted due to missingness)
## Multiple R-squared:  0.3623, Adjusted R-squared:  0.3592 
## F-statistic: 118.5 on 9 and 1878 DF,  p-value: < 2.2e-16

#remove all remaining batting
DP.4 <- lm(data=BB.df, TEAM_FIELDING_DP~.-TEAM_BATTING_2B -TEAM_BATTING_3B - TEAM_BATTING_1B - TEAM_BATTING_HR - TEAM_BATTING_SO - TEAM_BATTING_BB)
summary(DP.4)

## 
## Call:
## lm(formula = TEAM_FIELDING_DP ~ . - TEAM_BATTING_2B - TEAM_BATTING_3B - 
##     TEAM_BATTING_1B - TEAM_BATTING_HR - TEAM_BATTING_SO - TEAM_BATTING_BB, 
##     data = BB.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.111 -13.561  -0.185  12.995  68.351 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      161.304819   4.246078  37.989  < 2e-16 ***
## TARGET_WINS       -0.237801   0.039123  -6.078 1.47e-09 ***
## TEAM_BASERUN_SB   -0.107620   0.010144 -10.610  < 2e-16 ***
## TEAM_PITCHING_H    0.015502   0.002162   7.171 1.06e-12 ***
## TEAM_PITCHING_BB   0.028665   0.005106   5.614 2.28e-08 ***
## TEAM_PITCHING_SO  -0.007577   0.002308  -3.283  0.00104 ** 
## TEAM_FIELDING_E   -0.087006   0.008629 -10.083  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.87 on 1881 degrees of freedom
##   (388 observations deleted due to missingness)
## Multiple R-squared:  0.3248, Adjusted R-squared:  0.3226 
## F-statistic: 150.8 on 6 and 1881 DF,  p-value: < 2.2e-16

#remove pitching strikeouts
DP.5 <- lm(data=BB.df, TEAM_FIELDING_DP~.-TEAM_BATTING_2B -TEAM_BATTING_3B - TEAM_BATTING_1B - TEAM_BATTING_HR - TEAM_BATTING_SO - TEAM_BATTING_BB - TEAM_PITCHING_SO)
summary(DP.5)

## 
## Call:
## lm(formula = TEAM_FIELDING_DP ~ . - TEAM_BATTING_2B - TEAM_BATTING_3B - 
##     TEAM_BATTING_1B - TEAM_BATTING_HR - TEAM_BATTING_SO - TEAM_BATTING_BB - 
##     TEAM_PITCHING_SO, data = BB.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.517 -13.493  -0.397  13.052  67.500 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      154.939459   3.787463  40.909  < 2e-16 ***
## TARGET_WINS       -0.201265   0.037604  -5.352 9.75e-08 ***
## TEAM_BASERUN_SB   -0.118868   0.009573 -12.418  < 2e-16 ***
## TEAM_PITCHING_H    0.013728   0.002099   6.542 7.81e-11 ***
## TEAM_PITCHING_BB   0.027092   0.005097   5.315 1.19e-07 ***
## TEAM_FIELDING_E   -0.075351   0.007885  -9.556  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.92 on 1882 degrees of freedom
##   (388 observations deleted due to missingness)
## Multiple R-squared:  0.3209, Adjusted R-squared:  0.3191 
## F-statistic: 177.8 on 5 and 1882 DF,  p-value: < 2.2e-16

#all low p-values; F-statistic of 177.8 with adjusted R-squared of 0.3191
#take a look
par(mfrow=c(2,2))
plot(DP.5)

plot(DP.5$residuals)

#place the imputed double play (DP) values back in the data frame
pred.DP <- round(predict(DP.5, BB.df))
DP.imp.1 <- impute(BB.df$TEAM_FIELDING_DP, pred.DP)
BB.df$TEAM_FIELDING_DP <- DP.imp.1
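The imputation step above relies on `impute()`, which appears to come from the Hmisc package (an assumption, since no `library()` call is visible in this section). As a rough base-R sketch of the same idea, the following fits a regression on the complete rows and fills only the missing responses with rounded predictions. The data frame `toy` and its columns are made up for illustration and are not part of the baseball data.

```r
# Synthetic stand-in for BB.df; column names are illustrative only.
set.seed(1)
toy <- data.frame(x1 = rnorm(200, 100, 15), x2 = rnorm(200, 150, 30))
toy$y <- round(160 - 0.2 * toy$x1 - 0.1 * toy$x2 + rnorm(200, 0, 5))
toy$y[sample(200, 25)] <- NA            # knock out 25 responses

# lm() drops rows with a missing response under the default na.action.
fit <- lm(y ~ x1 + x2, data = toy)

# Predict for every row, then fill only the rows that were missing.
missing <- is.na(toy$y)
toy$y[missing] <- round(predict(fit, newdata = toy))[missing]

sum(is.na(toy$y))                       # count of values still missing
```

Note that a row whose predictors are themselves missing would still receive an `NA` prediction, so predictors with gaps need to be imputed or excluded first.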

#test new data set
summary(BB.df)

##   TARGET_WINS     TEAM_BATTING_2B TEAM_BATTING_3B  TEAM_BATTING_HR 
##  Min.   :  0.00   Min.   : 69.0   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:208.0   1st Qu.: 34.00   1st Qu.: 42.00  
##  Median : 82.00   Median :238.0   Median : 47.00   Median :102.00  
##  Mean   : 80.79   Mean   :241.2   Mean   : 55.25   Mean   : 99.61  
##  3rd Qu.: 92.00   3rd Qu.:273.0   3rd Qu.: 72.00   3rd Qu.:147.00  
##  Max.   :146.00   Max.   :458.0   Max.   :223.00   Max.   :264.00  
##                                                                    
##  TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :   0    Min.   :  0.0   Min.   : 1137  
##  1st Qu.:451.0   1st Qu.: 542    1st Qu.: 67.0   1st Qu.: 1419  
##  Median :512.0   Median : 730    Median :106.0   Median : 1518  
##  Mean   :501.6   Mean   : 727    Mean   :139.6   Mean   : 1779  
##  3rd Qu.:580.0   3rd Qu.: 925    3rd Qu.:172.0   3rd Qu.: 1682  
##  Max.   :878.0   Max.   :1399    Max.   :697.0   Max.   :30132  
##                                                                 
##  TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   : 32.0   
##  1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0   1st Qu.:123.8   
##  Median : 536.5   Median :  813.5   Median : 159.0   Median :145.0   
##  Mean   : 553.0   Mean   :  817.7   Mean   : 246.5   Mean   :141.8   
##  3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2   3rd Qu.:162.0   
##  Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :466.0   
##                   NA's   :102                                        
##  TEAM_BATTING_1B 
##  Min.   : 709.0  
##  1st Qu.: 990.8  
##  Median :1050.0  
##  Mean   :1073.2  
##  3rd Qu.:1129.0  
##  Max.   :2112.0  
##

Preliminary Transformation

FIELDING ERRORS TRANSFORMATION

#most ever errors by a team is 639 by 1883 Philadelphia.  Prorating to 162 games gives a value of 1046.
BB.df$TEAM_FIELDING_E[which(BB.df$TEAM_FIELDING_E > 1046)] <- 159

PITCHING HITS TRANSFORMATION

#The most hits ever recorded by a team is 1730. To be conservative, replace all pitching hits > 3000 with the median (1518).
BB.df$TEAM_PITCHING_H[which(BB.df$TEAM_PITCHING_H >3000)] <- 1518
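The two replacements above hard-code medians read off the `summary()` output. A small helper can compute the median instead, so the constant does not go stale if the data change. This is only a sketch: `cap_with_median` is a hypothetical name, and the input values are made up.

```r
# Hypothetical helper: replace values above `limit` with the column median.
# The median is taken over all non-missing values, outliers included,
# which mirrors the hard-coded constants used above.
cap_with_median <- function(x, limit) {
  med <- median(x, na.rm = TRUE)
  x[which(x > limit)] <- med   # which() ignores NA comparisons
  x
}

errors <- c(120, 150, 159, 180, NA, 1898)   # made-up values
capped <- cap_with_median(errors, limit = 1046)
capped
```

Applied to the real data this would read, for example, `BB.df$TEAM_FIELDING_E <- cap_with_median(BB.df$TEAM_FIELDING_E, 1046)`.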

PITCHING STRIKEOUT TRANSFORMATION

The most strikeouts thrown in a single season (unadjusted) that I could find was fewer than 1400.

#REVISIT THIS WITH A LEVERAGE POINT TEST FOR BAD OUTLIERS!
#The TEAM_PITCHING_SO variable has 25 outliers that are far beyond the most team pitched strikeouts 
#that have ever occurred. Here's the line of R code to find the count:
nrow(data.frame(which(BB.df$TEAM_PITCHING_SO > 1450)))

## [1] 25

#Since those outliers skew the distribution severely, rather than transform the 
#variable via a power transform we simply set those outliers to a more reasonable value. 
#The 3rd quartile (968) was one candidate; the line below uses the median (813.5, truncated to 813), 
#consistent with the median replacement used for pitching hits and errors.
BB.df$TEAM_PITCHING_SO[which(BB.df$TEAM_PITCHING_SO >1450)] <- 813
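The REVISIT note above asks for a leverage point test. One possible sketch uses base R's `hatvalues()` and `cooks.distance()` with rule-of-thumb cutoffs. The cutoffs `2 * (p + 1) / n` and `4 / n` are common conventions rather than anything prescribed by this report, and the data below are synthetic.

```r
set.seed(2)
n  <- 100
so <- c(rnorm(n - 2, 800, 150), 12000, 19000)   # two implausible strikeout totals
wins <- 80 + 0.01 * (pmin(so, 1400) - 800) + rnorm(n, 0, 10)
fit  <- lm(wins ~ so)

p   <- length(coef(fit)) - 1          # number of predictors
lev <- hatvalues(fit)                 # leverage of each observation
cd  <- cooks.distance(fit)            # influence on the fitted coefficients

high_leverage <- which(lev > 2 * (p + 1) / n)   # rule-of-thumb leverage cutoff
influential   <- which(cd > 4 / n)              # rule-of-thumb Cook's distance cutoff

high_leverage
```

Rows flagged as high leverage are candidates for inspection, not automatic removal; a point is "bad" only if it also has a large residual or Cook's distance.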

BATTING STRIKEOUT TRANSFORMATION

#First argument must be strictly positive.
#summary(powerTransform(cbind(TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_PITCHING_SO, TEAM_FIELDING_DP) ~ 1, BB.df))
#This function call yields the "best" Box-Cox power transform exponent for each variable relative to
#the full data set, rather than limiting the results to being based solely on a single
#response/predictor pair. The estimates are found in the column labeled "Est. Power". We need to
#round up or down to the nearest common transform, as described in the Blackboard discussion.
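The commented-out call above uses `powerTransform()` from the car package. As a hedged single-variable sketch of the same workflow, `boxcox()` from MASS (assumed to be installed) can estimate the exponent, which is then snapped to the nearest value on the lambda grid used later in this section. The variable `so` is synthetic, and the `+ 1` shift is one simple way to satisfy the strictly-positive requirement; it is an assumption, not the report's chosen fix.

```r
library(MASS)   # provides boxcox(); assumed to be installed

set.seed(3)
so <- round(rlnorm(500, meanlog = 6.5, sdlog = 0.4))  # made-up, right-skewed counts
so[1] <- 0                                            # the real columns contain zeros

so_pos <- so + 1   # shift so that every value is strictly positive

# Profile the Box-Cox log-likelihood for an intercept-only model.
bc <- boxcox(so_pos ~ 1, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Snap the estimate to the nearest exponent on the grid used in this report.
common <- c(-1, -0.5, -0.33, -0.25, 0, 0.25, 0.33, 0.5, 1)
lambda_use <- common[which.min(abs(common - lambda_hat))]

# An exponent of 0 denotes the log transform.
so_t <- if (lambda_use == 0) log(so_pos) else so_pos^lambda_use
```

Rounding to a conventional exponent keeps the transformed variable interpretable (log, square root, reciprocal) at a small cost in likelihood.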

BATTING SINGLE TRANSFORMATION

BB.df$TARGET_WINS[which(BB.df$TARGET_WINS <= 0)] <- 1
m1 <- lm(BB.df$TARGET_WINS~log(BB.df$TEAM_BATTING_1B))
#http://stats.stackexchange.com/questions/137059/find-distribution-and-transform-to-normal-distribution
lambda <- c(-1,-0.5, -0.33, -0.25, 0, 0.25, 0.33, 0.5,1)
#invResPlot(m1,lambda)
#inverseResponsePlot(m1,key=TRUE)
#lambda <- c(-1,-0.1811955,0,1)
#RSS <- c(6847.993,6761.037,6764.793,6901.701)
#plot(lambda,RSS,type="l",ylab=expression(RSS(lambda)),xlab=expression(lambda))
#-1/3
#ty <- y^(-1/3)
#plot(density(ty,kern="gaussian"),type="l",main="Gaussian kernel density estimate",xlab=expression(Y^(-1/3)))
#rug(ty)
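The commented-out lines above sketch an inverse response plot and an RSS-versus-lambda curve. The quantity such a plot minimizes can be computed directly: regress the fitted values on each power of the response and compare residual sums of squares. The following is an illustration on synthetic data, not a reproduction of the commented-out results; `pow` is a hypothetical helper.

```r
set.seed(4)
x <- runif(300, 1, 10)
y <- (2 + 0.5 * x + rnorm(300, 0, 0.02))^2   # made-up data: linear in sqrt(y)
fit  <- lm(y ~ x)
yhat <- fitted(fit)

# Power family, with the log standing in at lambda = 0.
pow <- function(v, lam) if (lam == 0) log(v) else v^lam

lambda <- c(-1, -0.5, -0.33, -0.25, 0, 0.25, 0.33, 0.5, 1)

# RSS from regressing the fitted values on each transformed response.
RSS <- sapply(lambda, function(lam) deviance(lm(yhat ~ pow(y, lam))))

best <- lambda[which.min(RSS)]
best
```

Calling `plot(lambda, RSS, type = "l")` afterwards draws the curve that the commented-out `plot()` line describes.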

Correlations

summary(BB.df)

Generalized Equation for Multiple Regression \[ \begin{aligned} \widehat{wins} &= \hat{\beta}_0 + \hat{\beta}_1 \times singles + \hat{\beta}_2 \times doubles + \hat{\beta}_3 \times triples + \hat{\beta}_4 \times homeruns + \hat{\beta}_5 \times walks + \hat{\beta}_6 \times strikeouts + \dots \end{aligned} \]

Full Panel

#wins x full panel
pairs(TARGET_WINS~.,
        data=BB.df,pch=".",gap=.5,upper.panel=panel.smooth)

m1 <- lm(TARGET_WINS~., data = BB.df)
StanRes1 <- rstandard(m1)

summary(m1)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = BB.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.377  -8.058   0.343   8.595  72.394 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.379e+01  5.533e+00   2.492  0.01278 *  
## TEAM_BATTING_2B   3.971e-02  7.687e-03   5.166 2.61e-07 ***
## TEAM_BATTING_3B   9.976e-02  1.623e-02   6.147 9.39e-10 ***
## TEAM_BATTING_HR   1.132e-01  8.898e-03  12.720  < 2e-16 ***
## TEAM_BATTING_BB   3.157e-02  3.677e-03   8.586  < 2e-16 ***
## TEAM_BATTING_SO  -5.136e-03  3.879e-03  -1.324  0.18567    
## TEAM_BASERUN_SB   2.326e-02  4.711e-03   4.938 8.49e-07 ***
## TEAM_PITCHING_H  -2.147e-03  1.576e-03  -1.363  0.17311    
## TEAM_PITCHING_BB -2.888e-05  2.383e-03  -0.012  0.99033    
## TEAM_PITCHING_SO  1.365e-03  3.126e-03   0.437  0.66243    
## TEAM_FIELDING_E  -9.015e-03  2.893e-03  -3.116  0.00186 ** 
## TEAM_FIELDING_DP -8.592e-02  1.351e-02  -6.359 2.47e-10 ***
## TEAM_BATTING_1B   3.868e-02  3.663e-03  10.561  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.11 on 2161 degrees of freedom
##   (102 observations deleted due to missingness)
## Multiple R-squared:  0.2953, Adjusted R-squared:  0.2914 
## F-statistic: 75.45 on 12 and 2161 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
#SINGLES:                 transform skew before adding to model - highly significant (low p-value)
plot(density(BB.df$TEAM_BATTING_1B),main="Singles");rug(BB.df$TEAM_BATTING_1B)
plot(m1$model$TEAM_BATTING_1B,StanRes1,xlab="Singles",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BATTING_1B,StanRes1),lty=2,col=2)

#DOUBLES:                 add - highly significant (low p-value)
plot(density(BB.df$TEAM_BATTING_2B),main="Doubles");rug(BB.df$TEAM_BATTING_2B)
plot(m1$model$TEAM_BATTING_2B,StanRes1,xlab="Doubles",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BATTING_2B,StanRes1),lty=2,col=2)

#TRIPLES:                 transform skew before adding to model - highly significant (low p-value)
plot(density(BB.df$TEAM_BATTING_3B),main="Triples");rug(BB.df$TEAM_BATTING_3B)
plot(m1$model$TEAM_BATTING_3B,StanRes1,xlab="Triples",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BATTING_3B,StanRes1),lty=2,col=2)

#HOMERUNS:                transform bimodal skew before adding to model - highly significant (low p-value)
plot(density(BB.df$TEAM_BATTING_HR),main="Homeruns");rug(BB.df$TEAM_BATTING_HR)
plot(m1$model$TEAM_BATTING_HR,StanRes1,xlab="Homeruns",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BATTING_HR,StanRes1),lty=2,col=2)

#WALKS:                   add
plot(density(BB.df$TEAM_BATTING_BB),main="Walks");rug(BB.df$TEAM_BATTING_BB)
plot(m1$model$TEAM_BATTING_BB,StanRes1,xlab="Walks",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BATTING_BB,StanRes1),lty=2,col=2)

#STRIKEOUTS:              transform bimodal skew before adding to model high p-value
plot(density(BB.df$TEAM_BATTING_SO),main="Strikeouts");rug(BB.df$TEAM_BATTING_SO)
plot(m1$model$TEAM_BATTING_SO,StanRes1,xlab="Strikeouts",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BATTING_SO,StanRes1),lty=2,col=2)

#HIT BY PITCH:            removed variable 
#plot(density(BB.df$TEAM_BATTING_HBP,na.rm=TRUE),main="Hit By Pitch");rug(BB.df$TEAM_BATTING_HBP)
#plot(m1$model$TEAM_BATTING_HBP,StanRes1,xlab="Hit By Pitch",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BATTING_HBP,StanRes1),lty=2,col=2)

#STOLEN BASES:            transform skew before adding to model 
plot(density(BB.df$TEAM_BASERUN_SB),main="Stolen Bases");rug(BB.df$TEAM_BASERUN_SB)
plot(m1$model$TEAM_BASERUN_SB,StanRes1,xlab="Stolen Bases",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BASERUN_SB,StanRes1),lty=2,col=2)

#CAUGHT STEALING:         removed variable
#plot(density(BB.df$TEAM_BASERUN_CS),main="Caught Stealing");rug(BB.df$TEAM_BASERUN_CS)
#plot(m1$model$TEAM_BASERUN_CS,StanRes1,xlab="Caught Stealing",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_BASERUN_CS,StanRes1),lty=2,col=2)

#PITCHING HITS:           removed variable
#plot(density(BB.df$TEAM_PITCHING_H),main="Pitching Hits");rug(BB.df$TEAM_PITCHING_H)
#plot(m1$model$TEAM_PITCHING_H,StanRes1,xlab="Pitching Hits",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_PITCHING_H,StanRes1),lty=2,col=2)

#PITCHING HOMERUNS:       removed variable
#plot(density(BB.df$TEAM_PITCHING_HR),main="Pitching Homeruns");rug(BB.df$TEAM_PITCHING_HR)
#plot(m1$model$TEAM_PITCHING_HR,StanRes1,xlab="Pitching Homeruns",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_PITCHING_HR,StanRes1),lty=2,col=2)

#PITCHING WALKS:          correlated with batting walks
plot(density(BB.df$TEAM_PITCHING_BB),main="Pitching Walks");rug(BB.df$TEAM_PITCHING_BB)
plot(m1$model$TEAM_PITCHING_BB,StanRes1,xlab="Pitching Walks",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_PITCHING_BB,StanRes1),lty=2,col=2)

#PITCHING STRIKEOUTS:     add* - high p-value
plot(density(BB.df$TEAM_PITCHING_SO, na.rm = TRUE),main="Pitching Strikeouts");rug(BB.df$TEAM_PITCHING_SO)
plot(m1$model$TEAM_PITCHING_SO,StanRes1,xlab="Pitching Strikeouts",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_PITCHING_SO,StanRes1),lty=2,col=2)

#FIELDING ERRORS:         transform skew before adding to model   
plot(density(BB.df$TEAM_FIELDING_E),main="Fielding Errors");rug(BB.df$TEAM_FIELDING_E)
plot(m1$model$TEAM_FIELDING_E,StanRes1,xlab="Fielding Errors",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_FIELDING_E,StanRes1),lty=2,col=2)

#DOUBLE PLAYS:            add
plot(density(BB.df$TEAM_FIELDING_DP),main="Fielding Doubleplay");rug(BB.df$TEAM_FIELDING_DP)
plot(m1$model$TEAM_FIELDING_DP,StanRes1,xlab="Fielding Doubleplays",ylab="Standardized Residuals");abline(lsfit(m1$model$TEAM_FIELDING_DP,StanRes1),lty=2,col=2)
par(mfrow=c(1,1))

Model Mo (Baseline) using forward selection

\[ \begin{aligned} \widehat{wins} &= \hat{\beta}_0 + \hat{\beta}_1 \times doubles + \hat{\beta}_2 \times walks + \hat{\beta}_3 \times pitching\ strikeouts + \dots + \hat{\beta}_k \times doubleplays \end{aligned} \]
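The baseline model is described as built by forward selection, but the selection code is not shown in this section. One way to sketch the procedure is base R's `step()`, which adds terms by AIC rather than by p-value; whether the report used AIC is an assumption. The data frame `toy` and its columns are synthetic.

```r
set.seed(5)
n <- 300
toy <- data.frame(doubles = rnorm(n, 240, 45),
                  walks   = rnorm(n, 500, 120),
                  noise   = rnorm(n))            # unrelated to wins by construction
toy$wins <- 20 + 0.1 * toy$doubles + 0.07 * toy$walks + rnorm(n, 0, 10)

# Start from the intercept-only model and let step() add terms one at a time.
null_fit <- lm(wins ~ 1, data = toy)
fwd <- step(null_fit,
            scope = list(lower = ~ 1, upper = ~ doubles + walks + noise),
            direction = "forward", trace = 0)

selected <- attr(terms(fwd), "term.labels")
selected
```

Setting `trace = 1` prints the AIC table at each step, which shows the order in which predictors enter.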

Model Mo

Model M1

Model M2

Model M3

Predictions and Assessment

Conclusion

Appendices . . .

Slugging

The most hits in a single season (unadjusted) is 1783–NL Philadelphia Phillies 1930. The most doubles in a single season (unadjusted) is 376–AL Texas in 2008. The most triples in a single season (unadjusted) is 153–NL Baltimore in 1894. Records that exceed these amounts should be adjusted either to NA or the median.

On-Base

102 strikeout NAs can remain as long as they’re not counted in descriptive statistics as observations. The 2085 hit by pitch NAs disqualify this field from use in the model.

The most walks in a single season (unadjusted) is 835–AL Boston Red Sox in 1949. The fewest walks in a single season (unadjusted) is 282–NL St. Louis Cardinals in 1908. Records that fall outside this range should be adjusted either to NA or the median.

Base Running

The 131 stolen base NAs and 772 caught stealing NAs can remain as long as they’re not counted in descriptive statistics as observations.

The most stolen bases in a single season (unadjusted) is 426–NL New York in 1893. The most times caught stealing in a single season (unadjusted) is 191–AL NY in 1914. Records that exceed these amounts should be adjusted either to NA or the median.

These statistics are collinear and may be better used as a single derived statistic, the expected value of team stolen bases: E(SB) = SB × likelihood of success, where likelihood of success = SB / SB attempts.
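A minimal sketch of the derived statistic follows, assuming that stolen base attempts equal stolen bases plus caught stealing (the text does not define attempts, so this is an assumption). The `teams` data frame is made up; the column names `SB`, `CS` and `E_SB` are illustrative.

```r
# Made-up team rows; SB = stolen bases, CS = caught stealing.
teams <- data.frame(SB = c(100, 60, 0), CS = c(25, 60, 0))

attempts <- teams$SB + teams$CS                          # assumption: attempts = SB + CS
success  <- ifelse(attempts > 0, teams$SB / attempts, NA) # undefined with no attempts
teams$E_SB <- teams$SB * success

teams
```

For the first row this gives 100 × (100 / 125) = 80; a team with no attempts gets `NA` rather than a division-by-zero result.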

Fielding

286 fielding double play NAs can remain as long as they’re not counted in descriptive statistics as observations.

The most fielding errors in a single season (unadjusted) is 639–NL Philadelphia in 1883. The most fielding errors in a single season (unadjusted) post WWII is 234–NL Philadelphia in 1945. Records that exceed the post WWII amount should be adjusted either to NA or the median.

Pitching

The most hits given up in a single season (unadjusted) that I could find was fewer than 2000. The most homeruns given up in a single season (unadjusted) that I could find was fewer than 250. The most walks given up in a single season (unadjusted) that I could find was fewer than 800. The most strikeouts thrown in a single season (unadjusted) that I could find was fewer than 1400. Records that exceed these amounts should be adjusted either to NA or the median.
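The rule repeated through these appendices (set records beyond a historical limit to NA) can be applied from a single lookup of limits. The sketch below uses four of the record values quoted above, and the two toy rows reuse the median and maximum of each column from the `summary(BB.df)` output. The names `limits` and `toy` are hypothetical.

```r
# Historical single-season records quoted in the appendices.
limits <- c(TEAM_BATTING_2B = 376, TEAM_BATTING_3B = 153,
            TEAM_BATTING_BB = 835, TEAM_BASERUN_SB = 426)

# Row 1: column medians; row 2: column maxima from summary(BB.df).
toy <- data.frame(TEAM_BATTING_2B = c(238, 458),
                  TEAM_BATTING_3B = c(47, 223),
                  TEAM_BATTING_BB = c(512, 878),
                  TEAM_BASERUN_SB = c(106, 697))

# Blank out any value that exceeds the record for its column.
for (col in names(limits)) {
  too_big <- which(toy[[col]] > limits[[col]])
  toy[[col]][too_big] <- NA
}

na_counts <- colSums(is.na(toy))
na_counts
```

Swapping the `NA` assignment for the column median gives the other option mentioned in the text.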

Moneyball Multiple Regression Model