The money ball data set contains 2276 observations representing a baseball team’s season with 17 integer variables representing various offensive and defensive baseball statistics collected over the past 100+ years. The below box plots demonstrates the wide range of values that the data holds. Most of the variables contain significant outliers. Note the log scale.
money.ball %>%
select(-INDEX) %>%
gather() %>%
ggplot() +
geom_boxplot(aes(key, value)) +
scale_y_log10() +
coord_flip()
There is a sizeable amount of missing data as well. For example, the ‘hit by pitch’ category is missing over 90% of its data. There was a period of time in baseball where hitting a batter with a pitch was considered a ball (and not a free base). This may be part of the reason that this statistic was not tracked.
money.ball %>%
map_dbl(~sum(is.na(.))/nrow(money.ball)) %>%
kable()
| x | |
|---|---|
| INDEX | 0.0000000 |
| TARGET_WINS | 0.0000000 |
| TEAM_BATTING_H | 0.0000000 |
| TEAM_BATTING_2B | 0.0000000 |
| TEAM_BATTING_3B | 0.0000000 |
| TEAM_BATTING_HR | 0.0000000 |
| TEAM_BATTING_BB | 0.0000000 |
| TEAM_BATTING_SO | 0.0448155 |
| TEAM_BASERUN_SB | 0.0575571 |
| TEAM_BASERUN_CS | 0.3391916 |
| TEAM_BATTING_HBP | 0.9160808 |
| TEAM_PITCHING_H | 0.0000000 |
| TEAM_PITCHING_HR | 0.0000000 |
| TEAM_PITCHING_BB | 0.0000000 |
| TEAM_PITCHING_SO | 0.0448155 |
| TEAM_FIELDING_E | 0.0000000 |
| TEAM_FIELDING_DP | 0.1256591 |
The TEAM_BASERUN_CS statistic and TEAM_BATTING_HBP statistics are missing a sizeable amount of observations. When a statistic is missing a few observation (for example, TEAM_PITCHING_SO) modifications can be made to address this issue. However, as such a large proportion are missing I think its better to ignore these columns.
money.ball %<>%
select(-TEAM_BASERUN_CS, -TEAM_BATTING_HBP)
There are 19 observations that are missing large chunks of information. This would require estimating so much of their information that I do not believe it is representative of the original team. These observations will be removed.
money.ball %>%
filter(TEAM_BATTING_SO == 0) %>%
select(INDEX, TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_PITCHING_SO, TEAM_FIELDING_DP) %>%
kable('html') %>%
kable_styling(bootstrap_options = c('striped', 'hover'))
| INDEX | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_PITCHING_SO | TEAM_FIELDING_DP |
|---|---|---|---|---|
| 325 | 0 | NA | 0 | NA |
| 326 | 0 | NA | 0 | NA |
| 435 | 0 | NA | 0 | NA |
| 459 | 0 | NA | 0 | NA |
| 952 | 0 | NA | 0 | NA |
| 953 | 0 | NA | 0 | NA |
| 1106 | 0 | NA | 0 | NA |
| 1107 | 0 | NA | 0 | NA |
| 1347 | 0 | 0 | 0 | NA |
| 1498 | 0 | NA | 0 | NA |
| 1502 | 0 | NA | 0 | NA |
| 1503 | 0 | NA | 0 | NA |
| 2037 | 0 | NA | 0 | NA |
| 2038 | 0 | NA | 0 | NA |
| 2048 | 0 | NA | 0 | NA |
| 2049 | 0 | NA | 0 | NA |
| 2253 | 0 | NA | 0 | NA |
| 2254 | 0 | NA | 0 | NA |
| 2486 | 0 | NA | 0 | NA |
| 2493 | 0 | NA | 0 | NA |
money.ball %<>%
filter(TEAM_BATTING_SO != 0)
There are 4 observations whose values lead me to believe they are mistakes. This includes a team with 0 wins and 3 teams with certain statistics much larger than any other team. I believe these may be extrapolation errors. These observations have been removed.
money.ball %<>%
filter(!INDEX %in% c(1347, 2380, 1494, 1769))
Next, we need to address the missing values from the above noted columns. We will use the caret package to make predicted values for the missing pieces of information. Finally, we remove the index column as it is no longer needed.
dummy.vars <- dummyVars(~ ., data = money.ball[, c(-1, -2)])
train.dummy <- predict(dummy.vars, money.ball[, c(-1, -2)])
pre.process <- preProcess(train.dummy, method='bagImpute')
imputed.data <- predict(pre.process, train.dummy)
money.ball %<>%
mutate(TEAM_BATTING_SO = imputed.data[, 6],
TEAM_BASERUN_SB = imputed.data[, 7],
TEAM_PITCHING_SO = imputed.data[, 11],
TEAM_FIELDING_DP = imputed.data[, 13]
)
money.ball %<>%
select(-INDEX)
The first model we will build consists of every single predictor provided.
l.all <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR +
TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + log(TEAM_PITCHING_H) +
TEAM_PITCHING_HR + log(TEAM_PITCHING_BB) + log(TEAM_PITCHING_SO) + TEAM_FIELDING_E +
TEAM_FIELDING_DP, money.ball)
summary(l.all)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + log(TEAM_PITCHING_H) + TEAM_PITCHING_HR +
## log(TEAM_PITCHING_BB) + log(TEAM_PITCHING_SO) + TEAM_FIELDING_E +
## TEAM_FIELDING_DP, data = money.ball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.571 -7.963 0.114 7.914 62.705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -71.926335 22.301096 -3.225 0.001278 **
## TEAM_BATTING_H 0.035095 0.005688 6.170 8.13e-10 ***
## TEAM_BATTING_2B -0.023971 0.009144 -2.621 0.008821 **
## TEAM_BATTING_3B 0.094736 0.017574 5.391 7.79e-08 ***
## TEAM_BATTING_HR 0.108282 0.031844 3.400 0.000685 ***
## TEAM_BATTING_BB 0.062572 0.008872 7.053 2.35e-12 ***
## TEAM_BATTING_SO -0.030650 0.004888 -6.270 4.35e-10 ***
## TEAM_BASERUN_SB 0.037098 0.004606 8.055 1.31e-15 ***
## log(TEAM_PITCHING_H) 18.782666 4.540913 4.136 3.67e-05 ***
## TEAM_PITCHING_HR -0.033798 0.028692 -1.178 0.238939
## log(TEAM_PITCHING_BB) -23.850051 3.970986 -6.006 2.23e-09 ***
## log(TEAM_PITCHING_SO) 17.790456 3.225685 5.515 3.90e-08 ***
## TEAM_FIELDING_E -0.042668 0.003485 -12.244 < 2e-16 ***
## TEAM_FIELDING_DP -0.113234 0.013899 -8.147 6.28e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.42 on 2137 degrees of freedom
## Multiple R-squared: 0.3271, Adjusted R-squared: 0.323
## F-statistic: 79.9 on 13 and 2137 DF, p-value: < 2.2e-16
\[\widehat{WINS}=0.03\times BATTING_H + -0.02\times BATTING_{2B} + 0.09\times BATTING_{3B} + 0.11\times BATTING_{HR} + \]
\[ 0.06\times BATTING_{BB} + -0.03\times BATTING_{SO} + 0.04\times BASERUN_{SB} + 18.47\times log(PITCHING_{H}) + \]
\[ -0.03\times PITCHING_{HR} + -23.85\times log(PITCHING_{BB}) + 17.79\times log(PITCHING_{SO}) + -0.04\times FIELDING_{E} + \]
\[ -0.11\times FIELDING_{DP} - 69.94\]
This model meets all the required assumptions for regression [see appendix]. However, there is much left to be desired by this model. First, it is incredibly complex, featuring 13 predictors, including 1 that is not statistically significant. Many of the predictors are highly correlated. There are a few instances of simpsons perceived paradox in the coefficients of the predictors. ‘Doubles’ (TEAM_BATTING_2B) has a negative slope which seems to indicate that as more doubles are hit, the number of wins decreases and ‘hits allowed’ (TEAM_PITCHING_H) has a positive slope which seems to indicate that as more hits are allowed, the number of wins increases.
This model is a good baseline, but it can be improved.
The second model will be built with backwards selection.
l.back <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR +
TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + log(TEAM_PITCHING_H) +
log(TEAM_PITCHING_BB) + log(TEAM_PITCHING_SO) + TEAM_FIELDING_E +
TEAM_FIELDING_DP, money.ball)
summary(l.back)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + log(TEAM_PITCHING_H) + log(TEAM_PITCHING_BB) +
## log(TEAM_PITCHING_SO) + TEAM_FIELDING_E + TEAM_FIELDING_DP,
## data = money.ball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.500 -8.022 0.163 7.923 62.054
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -53.370702 15.787726 -3.381 0.000736 ***
## TEAM_BATTING_H 0.036097 0.005624 6.418 1.70e-10 ***
## TEAM_BATTING_2B -0.024002 0.009145 -2.625 0.008739 **
## TEAM_BATTING_3B 0.091170 0.017313 5.266 1.53e-07 ***
## TEAM_BATTING_HR 0.072537 0.009658 7.510 8.60e-14 ***
## TEAM_BATTING_BB 0.065057 0.008618 7.549 6.46e-14 ***
## TEAM_BATTING_SO -0.030364 0.004883 -6.219 6.01e-10 ***
## TEAM_BASERUN_SB 0.036880 0.004602 8.013 1.82e-15 ***
## log(TEAM_PITCHING_H) 17.219002 4.342956 3.965 7.59e-05 ***
## log(TEAM_PITCHING_BB) -25.232839 3.793858 -6.651 3.69e-11 ***
## log(TEAM_PITCHING_SO) 17.603164 3.222056 5.463 5.22e-08 ***
## TEAM_FIELDING_E -0.041522 0.003346 -12.408 < 2e-16 ***
## TEAM_FIELDING_DP -0.113522 0.013898 -8.168 5.29e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.42 on 2138 degrees of freedom
## Multiple R-squared: 0.3266, Adjusted R-squared: 0.3228
## F-statistic: 86.42 on 12 and 2138 DF, p-value: < 2.2e-16
\[\widehat{WINS}=0.04\times BATTING_H + -0.02\times BATTING_{2B} + 0.09\times BATTING_{3B} + 0.07\times BATTING_{HR} + \] \[ 0.06\times BATTING_{BB} + -0.03\times BATTING_{SO} + 0.04\times BASERUN_{SB} + 17.01\times log(PITCHING_H) +\] \[ -25.16\times log(PITCHING_{BB}) + 17.61\times log(PITCHING_{SO}) + \] \[ -0.04\times FIELDING_{E} + -0.11\times FIELDING_{DP} - 52.55\]
This model also meets all the assumptions for regression [see appendix]. This model is an improvement over the previous model, but only slightly. Backwards selection has removed 1 predictor but remains just as accurate as the first model. The \(R^2\) of both models is about \(0.32\).
We will try to get a bit agressive with the final model.
The third model will be built with newly created variables and provide me with some added leeway in making model decisions.
I began by combining all the positive offensive statistics. Each statistic is weighted by the number of bases a player advances. Singles, walks, and stolen base are weighted 1, doubles are weighted 2, triples are weighted 3 and homeruns are weighted 4. The new observation is called TEAM_BATTING_POS. The goals is to represent the number of bases of advancement each team produces each season.
money.ball.2 <- money.ball %>%
mutate(TEAM_BATTING_POS = TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BATTING_2B * 1 +
TEAM_BATTING_3B * 2 + TEAM_BATTING_HR * 3) %>%
dplyr::select(-TEAM_BATTING_H, -TEAM_BATTING_BB, -TEAM_BATTING_2B, -TEAM_BATTING_3B, -TEAM_BATTING_HR, -TEAM_BASERUN_SB)
l.combine <- lm(TARGET_WINS ~ TEAM_BATTING_SO + log(TEAM_PITCHING_BB) + log(TEAM_PITCHING_SO) +
TEAM_FIELDING_E + TEAM_BATTING_POS + TEAM_FIELDING_DP, money.ball.2)
summary(l.combine)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_SO + log(TEAM_PITCHING_BB) +
## log(TEAM_PITCHING_SO) + TEAM_FIELDING_E + TEAM_BATTING_POS +
## TEAM_FIELDING_DP, data = money.ball.2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.944 -8.024 0.219 8.137 55.811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.255206 13.105453 -0.248 0.804
## TEAM_BATTING_SO -0.039498 0.003704 -10.665 < 2e-16 ***
## log(TEAM_PITCHING_BB) -10.634259 1.606392 -6.620 4.53e-11 ***
## log(TEAM_PITCHING_SO) 17.945431 2.363292 7.593 4.62e-14 ***
## TEAM_FIELDING_E -0.028599 0.002268 -12.610 < 2e-16 ***
## TEAM_BATTING_POS 0.030976 0.001211 25.577 < 2e-16 ***
## TEAM_FIELDING_DP -0.124395 0.012740 -9.764 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.59 on 2144 degrees of freedom
## Multiple R-squared: 0.3064, Adjusted R-squared: 0.3045
## F-statistic: 157.9 on 6 and 2144 DF, p-value: < 2.2e-16
\[\widehat{WINS}=0.03\times BATTING_{POS} + -0.04\times BATTING_{SO} + -10.60\times log(PITCHING_{BB}) \] \[+ 17.86\times log(PITCHING_{SO}) + -0.02\times FIELDING_{E} + -0.12\times FIELDING_{DP} - 3.21\]
This model meets all the assumptions for regression. [see appendix]
While this model does lose some predictive power with an \(R^2\) of \(0.30\) down from \(0.32\), it may be considered a worthy tradeoff. First, it has the fewest predictors of any of the models at 6. In addition, none of the predictors fall victim to simpson’s perceived paradox and thus make intuitive and logical sense. This may aid in understanding for general managers and non-statistically inclinded people. Each predictor is highly significant, indicating that we have boiled down the information to the most critical pieces of information.
There does still appears to be high collinearity in the model, however, this was present in all 3 of the created models.
Ultimately, I selected the third model. While it does have the least predictive power, that small loss is offset by the intuitive logic of the model and it’s more simple make up.
There are a number of issues with the dataset that should be considered. The data provided was collected over a large period of time yet it lacked the time information. Thus the assumption that the residuals are independent could not be checked. This is important as the data was collected over a long period of time and the game of baseball has changed immensly in that time. Just how helpful is data from the 1800s for predicting games in 2000? I would imagine very little.
The model was used to predict the data of the test set. A sample of the results are below.
test.set <- read_csv('C:\\Users\\Brian\\Desktop\\GradClasses\\Summer18\\621\\621week1\\moneyball-evaluation-data.csv')
dummy.vars <- dummyVars(~ ., data = test.set[, -1])
train.dummy <- predict(dummy.vars, test.set[, -1])
pre.process <- preProcess(train.dummy, method='bagImpute')
imputed.data <- predict(pre.process, train.dummy)
test.set %<>%
mutate(TEAM_BATTING_SO = imputed.data[, 6],
TEAM_BASERUN_SB = imputed.data[, 7],
TEAM_PITCHING_SO = imputed.data[, 13],
TEAM_FIELDING_DP = imputed.data[, 15]
) %>%
mutate(TEAM_BATTING_POS = TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BATTING_2B * 1 +
TEAM_BATTING_3B * 2 + TEAM_BATTING_HR * 3) %>%
select(-TEAM_BATTING_H, -TEAM_BATTING_BB, -TEAM_BATTING_2B, -TEAM_BATTING_3B, -TEAM_BATTING_HR, -TEAM_BASERUN_SB, -INDEX, -TEAM_BASERUN_CS, -TEAM_BATTING_HBP)
predict(l.combine, newdata = test.set, interval='prediction') %>%
head(10) %>%
kable('html') %>%
kable_styling(bootstrap_options = c('striped', 'hover'))
| fit | lwr | upr |
|---|---|---|
| 59.36482 | 34.62673 | 84.10290 |
| 62.13709 | 37.41233 | 86.86184 |
| 71.96963 | 47.27310 | 96.66615 |
| 87.95768 | 63.25565 | 112.65971 |
| 75.51073 | 50.54426 | 100.47719 |
| 72.45458 | 47.70299 | 97.20616 |
| 74.37915 | 49.65643 | 99.10188 |
| 74.35472 | 49.63684 | 99.07261 |
| 69.56843 | 44.86000 | 94.27685 |
| 73.56611 | 48.86168 | 98.27055 |
In conclusion, a successfull regression was fit using 6 predictors with an \(R^2\) of \(0.3047\). This information was used to make predictions for a test set. The model is ready for deployment by the general managers however I wish to stress that without verifying the effect of collecting information over the course of a century, there are still concerns regarding the complete veracity of the information. Also, there remains high collinearity between the predictors that should be addressed.
Assumptions must be validated to ensure that the regression model is valid. A light review was the first two models is followed by a more indepth review of the final, selected model.
Model #1
After log transforming 3 of the predictors, all the of the predictors are linearly related to the response variable. The data is roughly normal although there are upper outliers with a long tail. The data is homoscedactic.
Unfortunately the collinearity is very high in many of the predictors. This will need to be addressed.
par(mfrow=c(2,2))
plot(l.all)
car::residualPlots(l.all)
## Test stat Pr(>|Test stat|)
## TEAM_BATTING_H 4.2821 1.934e-05 ***
## TEAM_BATTING_2B 0.3924 0.694803
## TEAM_BATTING_3B -1.8325 0.067022 .
## TEAM_BATTING_HR -0.5800 0.561967
## TEAM_BATTING_BB 9.0592 < 2.2e-16 ***
## TEAM_BATTING_SO -5.4974 4.315e-08 ***
## TEAM_BASERUN_SB -1.6904 0.091102 .
## log(TEAM_PITCHING_H) -1.3393 0.180605
## TEAM_PITCHING_HR -1.5999 0.109773
## log(TEAM_PITCHING_BB) -2.8516 0.004392 **
## log(TEAM_PITCHING_SO) -2.8505 0.004407 **
## TEAM_FIELDING_E -0.4036 0.686540
## TEAM_FIELDING_DP 3.9047 9.726e-05 ***
## Tukey test -1.1622 0.245158
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
faraway::vif(model.matrix(l.all)[, -1])
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 8.632924 2.383983 3.350857
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## 48.951213 14.697851 19.025412
## TEAM_BASERUN_SB log(TEAM_PITCHING_H) TEAM_PITCHING_HR
## 2.187496 20.609660 40.490980
## log(TEAM_PITCHING_BB) log(TEAM_PITCHING_SO) TEAM_FIELDING_E
## 10.585458 15.004125 7.259933
## TEAM_FIELDING_DP
## 1.663723
Model #2
Due to the high level of similiarity to the previous model, there is no major difference between the analysis of this model and the previous model. This is due to the fact that the backwards selection was only able to elminate one variable.
An anova test demonstrates that there is no evidence of a statistically significant difference between the two models.
anova(l.all, l.back)
## Analysis of Variance Table
##
## Model 1: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## log(TEAM_PITCHING_H) + TEAM_PITCHING_HR + log(TEAM_PITCHING_BB) +
## log(TEAM_PITCHING_SO) + TEAM_FIELDING_E + TEAM_FIELDING_DP
## Model 2: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## log(TEAM_PITCHING_H) + log(TEAM_PITCHING_BB) + log(TEAM_PITCHING_SO) +
## TEAM_FIELDING_E + TEAM_FIELDING_DP
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2137 329487
## 2 2138 329701 -1 -213.95 1.3876 0.2389
Model #3
All the of the predictors are linearly related to the response variable as seen in the scatterplots. The data is very close to normal with minimal outliers. The data is homoscedactic. The residual plots show homoscedasticity across all predictors and the final fitted values.
Collinearity is not nearly as bad as the previous models but is still higher than we would like for 2 predictors. This will need to be addressed.
pairs(TARGET_WINS ~ TEAM_BATTING_SO + log(TEAM_PITCHING_H) + log(TEAM_PITCHING_BB) +
log(TEAM_PITCHING_SO) + TEAM_FIELDING_E + TEAM_BATTING_POS + TEAM_FIELDING_DP, money.ball.2)
par(mfrow=c(2,2))
plot(l.combine)
car::residualPlots(l.combine)
## Test stat Pr(>|Test stat|)
## TEAM_BATTING_SO -1.5812 0.1139790
## log(TEAM_PITCHING_BB) -0.9950 0.3198315
## log(TEAM_PITCHING_SO) -1.9312 0.0535894 .
## TEAM_FIELDING_E -0.0046 0.9963666
## TEAM_BATTING_POS -3.7520 0.0001801 ***
## TEAM_FIELDING_DP 4.8369 1.413e-06 ***
## Tukey test -2.6141 0.0089455 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
faraway::vif(model.matrix(l.combine)[, -1])
## TEAM_BATTING_SO log(TEAM_PITCHING_BB) log(TEAM_PITCHING_SO)
## 10.630395 1.686177 7.839502
## TEAM_FIELDING_E TEAM_BATTING_POS TEAM_FIELDING_DP
## 2.993364 1.723225 1.360611
Finally, the termplot shows that our transformation selections are all strong.
par(mfrow=c(2,3))
termplot(l.combine, partial.resid=TRUE, smooth=panel.smooth, terms=1:6)