Week 1 Assignment

By Brian Weinfeld

June 7th, 2018

DATA EXPLORATION

The money ball data set contains 2276 observations representing a baseball team’s season with 17 integer variables representing various offensive and defensive baseball statistics collected over the past 100+ years. The below box plots demonstrates the wide range of values that the data holds. Most of the variables contain significant outliers. Note the log scale.

money.ball %>%
  select(-INDEX) %>%
  gather() %>%
  ggplot() +
    geom_boxplot(aes(key, value)) +
    scale_y_log10() +
    coord_flip()

There is a sizeable amount of missing data as well. For example, the ‘hit by pitch’ category is missing over 90% of its data. There was a period of time in baseball where hitting a batter with a pitch was considered a ball (and not a free base). This may be part of the reason that this statistic was not tracked.

money.ball %>%
  map_dbl(~sum(is.na(.))/nrow(money.ball)) %>%
  kable()

	x
INDEX	0.0000000
TARGET_WINS	0.0000000
TEAM_BATTING_H	0.0000000
TEAM_BATTING_2B	0.0000000
TEAM_BATTING_3B	0.0000000
TEAM_BATTING_HR	0.0000000
TEAM_BATTING_BB	0.0000000
TEAM_BATTING_SO	0.0448155
TEAM_BASERUN_SB	0.0575571
TEAM_BASERUN_CS	0.3391916
TEAM_BATTING_HBP	0.9160808
TEAM_PITCHING_H	0.0000000
TEAM_PITCHING_HR	0.0000000
TEAM_PITCHING_BB	0.0000000
TEAM_PITCHING_SO	0.0448155
TEAM_FIELDING_E	0.0000000
TEAM_FIELDING_DP	0.1256591

DATA PREPARATION

The TEAM_BASERUN_CS statistic and TEAM_BATTING_HBP statistics are missing a sizeable amount of observations. When a statistic is missing a few observation (for example, TEAM_PITCHING_SO) modifications can be made to address this issue. However, as such a large proportion are missing I think its better to ignore these columns.

money.ball %<>%
  select(-TEAM_BASERUN_CS, -TEAM_BATTING_HBP)

There are 19 observations that are missing large chunks of information. This would require estimating so much of their information that I do not believe it is representative of the original team. These observations will be removed.

money.ball %>%
  filter(TEAM_BATTING_SO == 0) %>%
  select(INDEX, TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_PITCHING_SO, TEAM_FIELDING_DP) %>%
  kable('html') %>%
  kable_styling(bootstrap_options = c('striped', 'hover'))

INDEX	TEAM_BASERUN_SB	TEAM_FIELDING_DP
325	NA	NA
326	NA	NA
435	NA	NA
459	NA	NA
952	NA	NA
953	NA	NA
1106	NA	NA
1107	NA	NA
1347	0	NA
1498	NA	NA
1502	NA	NA
1503	NA	NA
2037	NA	NA
2038	NA	NA
2048	NA	NA
2049	NA	NA
2253	NA	NA
2254	NA	NA
2486	NA	NA
2493	NA	NA

money.ball %<>%
  filter(TEAM_BATTING_SO != 0)

There are 4 observations whose values lead me to believe they are mistakes. This includes a team with 0 wins and 3 teams with certain statistics much larger than any other team. I believe these may be extrapolation errors. These observations have been removed.

money.ball %<>%
  filter(!INDEX %in% c(1347, 2380, 1494, 1769))

Next, we need to address the missing values from the above noted columns. We will use the caret package to make predicted values for the missing pieces of information. Finally, we remove the index column as it is no longer needed.

dummy.vars <- dummyVars(~ ., data = money.ball[, c(-1, -2)])
train.dummy <- predict(dummy.vars, money.ball[, c(-1, -2)])


pre.process <- preProcess(train.dummy, method='bagImpute')
imputed.data <- predict(pre.process, train.dummy)

money.ball %<>%
  mutate(TEAM_BATTING_SO = imputed.data[, 6],
         TEAM_BASERUN_SB = imputed.data[, 7],
         TEAM_PITCHING_SO = imputed.data[, 11],
         TEAM_FIELDING_DP = imputed.data[, 13]
         )

money.ball %<>%
  select(-INDEX)

BUILD MODEL

The first model we will build consists of every single predictor provided.

l.all <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR +
              TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + log(TEAM_PITCHING_H) +
              TEAM_PITCHING_HR + log(TEAM_PITCHING_BB) + log(TEAM_PITCHING_SO) + TEAM_FIELDING_E + 
              TEAM_FIELDING_DP, money.ball)
summary(l.all)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + log(TEAM_PITCHING_H) + TEAM_PITCHING_HR + 
##     log(TEAM_PITCHING_BB) + log(TEAM_PITCHING_SO) + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP, data = money.ball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.571  -7.963   0.114   7.914  62.705 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -71.926335  22.301096  -3.225 0.001278 ** 
## TEAM_BATTING_H          0.035095   0.005688   6.170 8.13e-10 ***
## TEAM_BATTING_2B        -0.023971   0.009144  -2.621 0.008821 ** 
## TEAM_BATTING_3B         0.094736   0.017574   5.391 7.79e-08 ***
## TEAM_BATTING_HR         0.108282   0.031844   3.400 0.000685 ***
## TEAM_BATTING_BB         0.062572   0.008872   7.053 2.35e-12 ***
## TEAM_BATTING_SO        -0.030650   0.004888  -6.270 4.35e-10 ***
## TEAM_BASERUN_SB         0.037098   0.004606   8.055 1.31e-15 ***
## log(TEAM_PITCHING_H)   18.782666   4.540913   4.136 3.67e-05 ***
## TEAM_PITCHING_HR       -0.033798   0.028692  -1.178 0.238939    
## log(TEAM_PITCHING_BB) -23.850051   3.970986  -6.006 2.23e-09 ***
## log(TEAM_PITCHING_SO)  17.790456   3.225685   5.515 3.90e-08 ***
## TEAM_FIELDING_E        -0.042668   0.003485 -12.244  < 2e-16 ***
## TEAM_FIELDING_DP       -0.113234   0.013899  -8.147 6.28e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.42 on 2137 degrees of freedom
## Multiple R-squared:  0.3271, Adjusted R-squared:  0.323 
## F-statistic:  79.9 on 13 and 2137 DF,  p-value: < 2.2e-16

\[\widehat{WINS}=0.03\times BATTING_H + -0.02\times BATTING_{2B} + 0.09\times BATTING_{3B} + 0.11\times BATTING_{HR} + \]

\[ 0.06\times BATTING_{BB} + -0.03\times BATTING_{SO} + 0.04\times BASERUN_{SB} + 18.47\times log(PITCHING_{H}) + \]

\[ -0.03\times PITCHING_{HR} + -23.85\times log(PITCHING_{BB}) + 17.79\times log(PITCHING_{SO}) + -0.04\times FIELDING_{E} + \]

\[ -0.11\times FIELDING_{DP} - 69.94\]

This model meets all the required assumptions for regression [see appendix]. However, there is much left to be desired by this model. First, it is incredibly complex, featuring 13 predictors, including 1 that is not statistically significant. Many of the predictors are highly correlated. There are a few instances of simpsons perceived paradox in the coefficients of the predictors. ‘Doubles’ (TEAM_BATTING_2B) has a negative slope which seems to indicate that as more doubles are hit, the number of wins decreases and ‘hits allowed’ (TEAM_PITCHING_H) has a positive slope which seems to indicate that as more hits are allowed, the number of wins increases.

This model is a good baseline, but it can be improved.

The second model will be built with backwards selection.

l.back <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR +
              TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + log(TEAM_PITCHING_H) +
              log(TEAM_PITCHING_BB) + log(TEAM_PITCHING_SO) + TEAM_FIELDING_E + 
              TEAM_FIELDING_DP, money.ball)
summary(l.back)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + log(TEAM_PITCHING_H) + log(TEAM_PITCHING_BB) + 
##     log(TEAM_PITCHING_SO) + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = money.ball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.500  -8.022   0.163   7.923  62.054 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -53.370702  15.787726  -3.381 0.000736 ***
## TEAM_BATTING_H          0.036097   0.005624   6.418 1.70e-10 ***
## TEAM_BATTING_2B        -0.024002   0.009145  -2.625 0.008739 ** 
## TEAM_BATTING_3B         0.091170   0.017313   5.266 1.53e-07 ***
## TEAM_BATTING_HR         0.072537   0.009658   7.510 8.60e-14 ***
## TEAM_BATTING_BB         0.065057   0.008618   7.549 6.46e-14 ***
## TEAM_BATTING_SO        -0.030364   0.004883  -6.219 6.01e-10 ***
## TEAM_BASERUN_SB         0.036880   0.004602   8.013 1.82e-15 ***
## log(TEAM_PITCHING_H)   17.219002   4.342956   3.965 7.59e-05 ***
## log(TEAM_PITCHING_BB) -25.232839   3.793858  -6.651 3.69e-11 ***
## log(TEAM_PITCHING_SO)  17.603164   3.222056   5.463 5.22e-08 ***
## TEAM_FIELDING_E        -0.041522   0.003346 -12.408  < 2e-16 ***
## TEAM_FIELDING_DP       -0.113522   0.013898  -8.168 5.29e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.42 on 2138 degrees of freedom
## Multiple R-squared:  0.3266, Adjusted R-squared:  0.3228 
## F-statistic: 86.42 on 12 and 2138 DF,  p-value: < 2.2e-16

\[\widehat{WINS}=0.04\times BATTING_H + -0.02\times BATTING_{2B} + 0.09\times BATTING_{3B} + 0.07\times BATTING_{HR} + \] \[ 0.06\times BATTING_{BB} + -0.03\times BATTING_{SO} + 0.04\times BASERUN_{SB} + 17.01\times log(PITCHING_H) +\] \[ -25.16\times log(PITCHING_{BB}) + 17.61\times log(PITCHING_{SO}) + \] \[ -0.04\times FIELDING_{E} + -0.11\times FIELDING_{DP} - 52.55\]

This model also meets all the assumptions for regression [see appendix]. This model is an improvement over the previous model, but only slightly. Backwards selection has removed 1 predictor but remains just as accurate as the first model. The \(R^2\) of both models is about \(0.32\).

We will try to get a bit agressive with the final model.

The third model will be built with newly created variables and provide me with some added leeway in making model decisions.

I began by combining all the positive offensive statistics. Each statistic is weighted by the number of bases a player advances. Singles, walks, and stolen base are weighted 1, doubles are weighted 2, triples are weighted 3 and homeruns are weighted 4. The new observation is called TEAM_BATTING_POS. The goals is to represent the number of bases of advancement each team produces each season.

money.ball.2 <- money.ball %>%
  mutate(TEAM_BATTING_POS = TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BATTING_2B * 1 + 
           TEAM_BATTING_3B * 2 + TEAM_BATTING_HR * 3) %>%
  dplyr::select(-TEAM_BATTING_H, -TEAM_BATTING_BB, -TEAM_BATTING_2B, -TEAM_BATTING_3B, -TEAM_BATTING_HR, -TEAM_BASERUN_SB)

l.combine <- lm(TARGET_WINS ~ TEAM_BATTING_SO + log(TEAM_PITCHING_BB) + log(TEAM_PITCHING_SO) + 
                  TEAM_FIELDING_E + TEAM_BATTING_POS + TEAM_FIELDING_DP, money.ball.2)
summary(l.combine)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_SO + log(TEAM_PITCHING_BB) + 
##     log(TEAM_PITCHING_SO) + TEAM_FIELDING_E + TEAM_BATTING_POS + 
##     TEAM_FIELDING_DP, data = money.ball.2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.944  -8.024   0.219   8.137  55.811 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -3.255206  13.105453  -0.248    0.804    
## TEAM_BATTING_SO        -0.039498   0.003704 -10.665  < 2e-16 ***
## log(TEAM_PITCHING_BB) -10.634259   1.606392  -6.620 4.53e-11 ***
## log(TEAM_PITCHING_SO)  17.945431   2.363292   7.593 4.62e-14 ***
## TEAM_FIELDING_E        -0.028599   0.002268 -12.610  < 2e-16 ***
## TEAM_BATTING_POS        0.030976   0.001211  25.577  < 2e-16 ***
## TEAM_FIELDING_DP       -0.124395   0.012740  -9.764  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.59 on 2144 degrees of freedom
## Multiple R-squared:  0.3064, Adjusted R-squared:  0.3045 
## F-statistic: 157.9 on 6 and 2144 DF,  p-value: < 2.2e-16

\[\widehat{WINS}=0.03\times BATTING_{POS} + -0.04\times BATTING_{SO} + -10.60\times log(PITCHING_{BB}) \] \[+ 17.86\times log(PITCHING_{SO}) + -0.02\times FIELDING_{E} + -0.12\times FIELDING_{DP} - 3.21\]

This model meets all the assumptions for regression. [see appendix]

While this model does lose some predictive power with an \(R^2\) of \(0.30\) down from \(0.32\), it may be considered a worthy tradeoff. First, it has the fewest predictors of any of the models at 6. In addition, none of the predictors fall victim to simpson’s perceived paradox and thus make intuitive and logical sense. This may aid in understanding for general managers and non-statistically inclinded people. Each predictor is highly significant, indicating that we have boiled down the information to the most critical pieces of information.

There does still appears to be high collinearity in the model, however, this was present in all 3 of the created models.

SELECT MODELS

Ultimately, I selected the third model. While it does have the least predictive power, that small loss is offset by the intuitive logic of the model and it’s more simple make up.

There are a number of issues with the dataset that should be considered. The data provided was collected over a large period of time yet it lacked the time information. Thus the assumption that the residuals are independent could not be checked. This is important as the data was collected over a long period of time and the game of baseball has changed immensly in that time. Just how helpful is data from the 1800s for predicting games in 2000? I would imagine very little.

The model was used to predict the data of the test set. A sample of the results are below.

test.set <- read_csv('C:\\Users\\Brian\\Desktop\\GradClasses\\Summer18\\621\\621week1\\moneyball-evaluation-data.csv') 

dummy.vars <- dummyVars(~ ., data = test.set[, -1])
train.dummy <- predict(dummy.vars, test.set[, -1])

pre.process <- preProcess(train.dummy, method='bagImpute')
imputed.data <- predict(pre.process, train.dummy)

test.set %<>%
  mutate(TEAM_BATTING_SO = imputed.data[, 6],
         TEAM_BASERUN_SB = imputed.data[, 7],
         TEAM_PITCHING_SO = imputed.data[, 13],
         TEAM_FIELDING_DP = imputed.data[, 15]
         ) %>%
  mutate(TEAM_BATTING_POS = TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BATTING_2B * 1 + 
           TEAM_BATTING_3B * 2 + TEAM_BATTING_HR * 3) %>%
  select(-TEAM_BATTING_H, -TEAM_BATTING_BB, -TEAM_BATTING_2B, -TEAM_BATTING_3B, -TEAM_BATTING_HR, -TEAM_BASERUN_SB, -INDEX, -TEAM_BASERUN_CS, -TEAM_BATTING_HBP)

predict(l.combine, newdata = test.set, interval='prediction') %>%
  head(10) %>%
  kable('html') %>%
  kable_styling(bootstrap_options = c('striped', 'hover'))

fit	lwr	upr
59.36482	34.62673	84.10290
62.13709	37.41233	86.86184
71.96963	47.27310	96.66615
87.95768	63.25565	112.65971
75.51073	50.54426	100.47719
72.45458	47.70299	97.20616
74.37915	49.65643	99.10188
74.35472	49.63684	99.07261
69.56843	44.86000	94.27685
73.56611	48.86168	98.27055

In conclusion, a successfull regression was fit using 6 predictors with an \(R^2\) of \(0.3047\). This information was used to make predictions for a test set. The model is ready for deployment by the general managers however I wish to stress that without verifying the effect of collecting information over the course of a century, there are still concerns regarding the complete veracity of the information. Also, there remains high collinearity between the predictors that should be addressed.

Appendix

Assumptions must be validated to ensure that the regression model is valid. A light review was the first two models is followed by a more indepth review of the final, selected model.

Model #1

After log transforming 3 of the predictors, all the of the predictors are linearly related to the response variable. The data is roughly normal although there are upper outliers with a long tail. The data is homoscedactic.

Unfortunately the collinearity is very high in many of the predictors. This will need to be addressed.

par(mfrow=c(2,2))
plot(l.all)

car::residualPlots(l.all)

##                       Test stat Pr(>|Test stat|)    
## TEAM_BATTING_H           4.2821        1.934e-05 ***
## TEAM_BATTING_2B          0.3924         0.694803    
## TEAM_BATTING_3B         -1.8325         0.067022 .  
## TEAM_BATTING_HR         -0.5800         0.561967    
## TEAM_BATTING_BB          9.0592        < 2.2e-16 ***
## TEAM_BATTING_SO         -5.4974        4.315e-08 ***
## TEAM_BASERUN_SB         -1.6904         0.091102 .  
## log(TEAM_PITCHING_H)    -1.3393         0.180605    
## TEAM_PITCHING_HR        -1.5999         0.109773    
## log(TEAM_PITCHING_BB)   -2.8516         0.004392 ** 
## log(TEAM_PITCHING_SO)   -2.8505         0.004407 ** 
## TEAM_FIELDING_E         -0.4036         0.686540    
## TEAM_FIELDING_DP         3.9047        9.726e-05 ***
## Tukey test              -1.1622         0.245158    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

faraway::vif(model.matrix(l.all)[, -1])

##        TEAM_BATTING_H       TEAM_BATTING_2B       TEAM_BATTING_3B 
##              8.632924              2.383983              3.350857 
##       TEAM_BATTING_HR       TEAM_BATTING_BB       TEAM_BATTING_SO 
##             48.951213             14.697851             19.025412 
##       TEAM_BASERUN_SB  log(TEAM_PITCHING_H)      TEAM_PITCHING_HR 
##              2.187496             20.609660             40.490980 
## log(TEAM_PITCHING_BB) log(TEAM_PITCHING_SO)       TEAM_FIELDING_E 
##             10.585458             15.004125              7.259933 
##      TEAM_FIELDING_DP 
##              1.663723

Model #2

Due to the high level of similiarity to the previous model, there is no major difference between the analysis of this model and the previous model. This is due to the fact that the backwards selection was only able to elminate one variable.

An anova test demonstrates that there is no evidence of a statistically significant difference between the two models.

anova(l.all, l.back)

## Analysis of Variance Table
## 
## Model 1: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     log(TEAM_PITCHING_H) + TEAM_PITCHING_HR + log(TEAM_PITCHING_BB) + 
##     log(TEAM_PITCHING_SO) + TEAM_FIELDING_E + TEAM_FIELDING_DP
## Model 2: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     log(TEAM_PITCHING_H) + log(TEAM_PITCHING_BB) + log(TEAM_PITCHING_SO) + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   2137 329487                           
## 2   2138 329701 -1   -213.95 1.3876 0.2389

Model #3

All the of the predictors are linearly related to the response variable as seen in the scatterplots. The data is very close to normal with minimal outliers. The data is homoscedactic. The residual plots show homoscedasticity across all predictors and the final fitted values.

Collinearity is not nearly as bad as the previous models but is still higher than we would like for 2 predictors. This will need to be addressed.

pairs(TARGET_WINS ~ TEAM_BATTING_SO + log(TEAM_PITCHING_H) + log(TEAM_PITCHING_BB) + 
        log(TEAM_PITCHING_SO) + TEAM_FIELDING_E + TEAM_BATTING_POS + TEAM_FIELDING_DP, money.ball.2)

par(mfrow=c(2,2))
plot(l.combine)

car::residualPlots(l.combine)

##                       Test stat Pr(>|Test stat|)    
## TEAM_BATTING_SO         -1.5812        0.1139790    
## log(TEAM_PITCHING_BB)   -0.9950        0.3198315    
## log(TEAM_PITCHING_SO)   -1.9312        0.0535894 .  
## TEAM_FIELDING_E         -0.0046        0.9963666    
## TEAM_BATTING_POS        -3.7520        0.0001801 ***
## TEAM_FIELDING_DP         4.8369        1.413e-06 ***
## Tukey test              -2.6141        0.0089455 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

faraway::vif(model.matrix(l.combine)[, -1])

##       TEAM_BATTING_SO log(TEAM_PITCHING_BB) log(TEAM_PITCHING_SO) 
##             10.630395              1.686177              7.839502 
##       TEAM_FIELDING_E      TEAM_BATTING_POS      TEAM_FIELDING_DP 
##              2.993364              1.723225              1.360611

Finally, the termplot shows that our transformation selections are all strong.

par(mfrow=c(2,3))
termplot(l.combine, partial.resid=TRUE, smooth=panel.smooth, terms=1:6)