Given a set of professional baseball teams and their performance metrics for a 162-game season, how can we build a model that can be used to predict the number of wins for each team? The data set contains metrics of teams from 1871 to 2006. The assignment description can be found here.
library(tidyverse)
library(corrplot)
library(reshape2) # melt function for distributions of variables
library(moments) # determine skewness of residuals
library(MASS) # Box-Cox transformation
library(broom)
library(knitr)
Description of variables
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
## 'data.frame': 2276 obs. of 17 variables:
## $ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
## $ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
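TARGET_WINS, the response variable, is the number of wins
by the baseball team. Several columns contain missing values, most notably the
TEAM_BATTING_HBP column. We will take a closer look at how
much is missing. The table below gives the percentage of missing values in each column.

| Variable | Missing (%) |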
|---|---|
| INDEX | 0.000000 |
| TARGET_WINS | 0.000000 |
| TEAM_BATTING_H | 0.000000 |
| TEAM_BATTING_2B | 0.000000 |
| TEAM_BATTING_3B | 0.000000 |
| TEAM_BATTING_HR | 0.000000 |
| TEAM_BATTING_BB | 0.000000 |
| TEAM_BATTING_SO | 4.481547 |
| TEAM_BASERUN_SB | 5.755712 |
| TEAM_BASERUN_CS | 33.919156 |
| TEAM_BATTING_HBP | 91.608084 |
| TEAM_PITCHING_H | 0.000000 |
| TEAM_PITCHING_HR | 0.000000 |
| TEAM_PITCHING_BB | 0.000000 |
| TEAM_PITCHING_SO | 4.481547 |
| TEAM_FIELDING_E | 0.000000 |
| TEAM_FIELDING_DP | 12.565905 |
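
A table like the one above can be produced in a couple of lines of base R. The sketch below uses a small made-up data frame (`toy`) in place of the real training data, and the output column names `variable` and `pct_missing` are chosen here for illustration only.

```r
# Toy stand-in for the training data; the values are invented for illustration.
toy <- data.frame(
  TARGET_WINS      = c(70, 82, 91, 65, 77),
  TEAM_BATTING_SO  = c(842, NA, 917, 922, 920),
  TEAM_BATTING_HBP = c(NA, NA, NA, 54, NA)
)

# is.na() returns a logical matrix; the column means of TRUE/FALSE are the
# fraction of missing values, so multiplying by 100 gives a percentage.
pct_missing <- colMeans(is.na(toy)) * 100

missing_tbl <- data.frame(
  variable    = names(pct_missing),
  pct_missing = unname(pct_missing)
)
print(missing_tbl)
```

Passing `missing_tbl` to `knitr::kable()` would render it as a table in the knitted report.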

- About 92% of TEAM_BATTING_HBP is missing.
- About 34% of TEAM_BASERUN_CS is missing.

There exists a strong positive correlation between:

- TEAM_BASERUN_CS, TEAM_BASERUN_SB
- TEAM_BATTING_SO, TEAM_PITCHING_SO
- TEAM_BATTING_BB, TEAM_PITCHING_BB
- TEAM_BATTING_HR, TEAM_PITCHING_HR
- TEAM_PITCHING_H, TEAM_BATTING_2B
- TEAM_BATTING_H, TEAM_PITCHING_H

There exists a moderately positive correlation between:

- TARGET_WINS, TEAM_PITCHING_BB
- TARGET_WINS, TEAM_BATTING_BB
- TARGET_WINS, TEAM_PITCHING_HR

There exists a moderately negative correlation between:

- TARGET_WINS, TEAM_FIELDING_E
- TEAM_PITCHING_H, TEAM_PITCHING_SO
- TEAM_BATTING_H, TEAM_PITCHING_SO

We can further visualize the correlations against
TARGET_WINS using scatterplots:
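
The correlations against TARGET_WINS discussed above can be computed along the following lines. This is a minimal sketch on simulated data, not the original analysis code: the data frame `toy` and its coefficients are invented, and `use = "pairwise.complete.obs"` is one reasonable way (an assumption here) to cope with the missing values.

```r
set.seed(1)
n <- 200

# Simulated predictors and a response that depends on them (made-up numbers).
errors <- rpois(n, 150)
walks  <- rpois(n, 500)
toy <- data.frame(
  TARGET_WINS     = round(80 - 0.3 * (errors - 150) +
                            0.15 * (walks - 500) + rnorm(n, sd = 5)),
  TEAM_FIELDING_E = errors,
  TEAM_BATTING_BB = walks
)

# Knock out a few values to mimic the gaps in the real data.
toy$TEAM_BATTING_BB[sample(n, 10)] <- NA

# Each pairwise correlation uses only the rows where both variables are present.
cors <- cor(toy, use = "pairwise.complete.obs")

# Keep only the correlations of each predictor with the response.
target_cors <- cors["TARGET_WINS", colnames(cors) != "TARGET_WINS"]
print(round(sort(target_cors), 2))
```

The full matrix `cors` is the kind of object that `corrplot::corrplot()` takes as input, and a scatterplot for a single pair could be drawn with `plot(toy$TEAM_FIELDING_E, toy$TARGET_WINS)`.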
We can use histograms to check whether any of the variables have non-normal distributions:
One of the key takeaways here is that strikeouts by batters have a
bimodal distribution. Several variables, such as strikeouts by pitchers
and walks allowed, are skewed right. The response
variable, TARGET_WINS, is approximately normally distributed.
Going forward, we will drop the TEAM_BATTING_HBP
variable since it has too many missing values.
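
Dropping a sparsely populated column matters because `lm()` by default discards every row that has a missing value in any model variable. The sketch below illustrates the effect on a tiny made-up data frame (`toy`); the values are not from the real data.

```r
# Toy data: TEAM_BATTING_HBP is almost entirely missing (invented values).
toy <- data.frame(
  INDEX            = 1:6,
  TARGET_WINS      = c(70, 82, 91, 65, 77, 88),
  TEAM_BASERUN_SB  = c(NA, 37, 46, 43, 49, 107),
  TEAM_BATTING_HBP = c(NA, NA, NA, 54, NA, NA)
)

# Rows with no missing values at all, before dropping the column.
rows_before <- sum(complete.cases(toy))

# Drop the column by name using base R only.
toy_reduced <- toy[, setdiff(names(toy), "TEAM_BATTING_HBP")]

# Far more rows are complete once the sparse column is gone.
rows_after <- sum(complete.cases(toy_reduced))

cat("Complete rows before:", rows_before, " after:", rows_after, "\n")
```

With `dplyr` loaded, `dplyr::select(toy, -TEAM_BATTING_HBP)` would be an equivalent way to drop the column.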
We can start by fitting a model with all the predictors.
The skewness of the residuals is -0.01. Although the sign is negative, the magnitude is so close to zero that the residuals are essentially symmetric.
If the magnitude of the skew were larger, we could use the Box-Cox method to determine which transformation to apply to reduce it.
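
As a rough illustration of this step, the sketch below fits a linear model with all predictors on simulated data and computes the skewness of its residuals. The data frame `toy` and the helper `skew()` are hypothetical; `skew()` implements the moment-based definition (third central moment divided by the second central moment raised to the power 1.5), which to my understanding is also what `moments::skewness()` computes.

```r
set.seed(42)
n <- 300

# Simulated predictors and response with symmetric (normal) noise.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$TARGET_WINS <- 80 + 5 * toy$x1 - 3 * toy$x2 + rnorm(n, sd = 10)

# "~ ." regresses the response on every other column in the data frame.
fit <- lm(TARGET_WINS ~ ., data = toy)

# Moment-based skewness: m3 / m2^(3/2), using population (divide-by-n) moments.
skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

res_skew <- skew(residuals(fit))
print(round(res_skew, 3))
```

A value near zero indicates roughly symmetric residuals; a clearly positive or negative value would point to a right or left skew.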
Below are the estimates of the coefficients of Model 1.
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = dplyr::select(df.train,
## -"INDEX"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.5627 -6.6932 -0.1328 6.5249 27.8525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.912438 6.642839 8.718 < 2e-16 ***
## TEAM_BATTING_H 0.015434 0.019626 0.786 0.4318
## TEAM_BATTING_2B -0.070472 0.009369 -7.522 9.36e-14 ***
## TEAM_BATTING_3B 0.161551 0.022192 7.280 5.43e-13 ***
## TEAM_BATTING_HR 0.073952 0.085392 0.866 0.3866
## TEAM_BATTING_BB 0.043765 0.046454 0.942 0.3463
## TEAM_BATTING_SO 0.018250 0.023463 0.778 0.4368
## TEAM_BASERUN_SB 0.035880 0.008687 4.130 3.83e-05 ***
## TEAM_BASERUN_CS 0.052124 0.018227 2.860 0.0043 **
## TEAM_PITCHING_H 0.019044 0.018381 1.036 0.3003
## TEAM_PITCHING_HR 0.022997 0.082092 0.280 0.7794
## TEAM_PITCHING_BB -0.004180 0.044692 -0.094 0.9255
## TEAM_PITCHING_SO -0.038176 0.022447 -1.701 0.0892 .
## TEAM_FIELDING_E -0.155876 0.009946 -15.672 < 2e-16 ***
## TEAM_FIELDING_DP -0.112885 0.013137 -8.593 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.556 on 1471 degrees of freedom
## (790 observations deleted due to missingness)
## Multiple R-squared: 0.4386, Adjusted R-squared: 0.4333
## F-statistic: 82.1 on 14 and 1471 DF, p-value: < 2.2e-16
Eight of the 14 predictors have the wrong sign, going against
the theoretical effects on TARGET_WINS mentioned in the
beginning.
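Model 2 keeps only the predictors that were significant at the 0.001 level in Model 1. Its residuals have a skewness of 0.022, which is slightly larger in magnitude than that of the first model but still close to zero.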
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BASERUN_SB + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP,
## data = dplyr::select(df.train, -"INDEX"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.785 -8.525 0.040 8.464 40.564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.907583 3.066561 27.036 < 2e-16 ***
## TEAM_BATTING_2B 0.038364 0.006738 5.693 1.44e-08 ***
## TEAM_BATTING_3B 0.258746 0.016866 15.341 < 2e-16 ***
## TEAM_BASERUN_SB 0.054453 0.005679 9.588 < 2e-16 ***
## TEAM_FIELDING_E -0.120219 0.006655 -18.066 < 2e-16 ***
## TEAM_FIELDING_DP -0.065316 0.013372 -4.884 1.12e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.94 on 1931 degrees of freedom
## (339 observations deleted due to missingness)
## Multiple R-squared: 0.2245, Adjusted R-squared: 0.2224
## F-statistic: 111.8 on 5 and 1931 DF, p-value: < 2.2e-16
The negative coefficient on TEAM_FIELDING_DP is counterintuitive, since turning more double plays would be expected to help a team win.
We can use the Box-Cox method to see if a transformation should be applied to address the skewness.
Lambda = 1.1515
The lambda corresponding to the maximum log-likelihood is close to 1, so a transformation is not necessary.
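
A lambda like the one above can be obtained from the profile log-likelihood returned by `MASS::boxcox()`. The sketch below runs on simulated data, so the resulting value will differ from 1.1515; the data frame `toy` and the lambda grid are assumptions made for illustration. Calling the function as `MASS::boxcox()` rather than attaching the package avoids masking `dplyr::select`.

```r
set.seed(7)
n <- 300

# Simulated data with a strictly positive response, as Box-Cox requires.
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- 60 + 3 * toy$x + rnorm(n, sd = 4)

fit <- lm(y ~ x, data = toy)

# With plotit = FALSE, boxcox() returns the lambda grid (x) and the
# corresponding profile log-likelihood values (y) instead of drawing a plot.
bc <- MASS::boxcox(fit, lambda = seq(-2, 3, by = 0.05), plotit = FALSE)

# The suggested lambda is the grid value with the highest log-likelihood.
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)
```

A lambda near 1 corresponds to leaving the response untransformed, while values near 0 or 0.5 would suggest a log or square-root transformation, respectively.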
The metric I use to evaluate the models is adjusted
R-squared. For Model 1, it is 0.43. This means that about 43% of the variance
in TARGET_WINS is explained by the predictors, after adjusting for the number of predictors. For Model 2,
it is 0.22. Therefore, Model 1 should be used to make predictions.
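
As a sanity check, the adjusted R-squared values can be recomputed from the figures in the two model summaries. The helper `adj_r2()` below is a hypothetical name; the sample sizes are derived from the reported residual degrees of freedom (n = residual df + number of predictors + 1).

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Model 1: R-squared 0.4386, 14 predictors, 1471 residual df -> n = 1486.
m1 <- adj_r2(0.4386, n = 1471 + 14 + 1, p = 14)

# Model 2: R-squared 0.2245, 5 predictors, 1931 residual df -> n = 1937.
m2 <- adj_r2(0.2245, n = 1931 + 5 + 1, p = 5)

print(round(c(model1 = m1, model2 = m2), 3))
```

This should print roughly 0.433 and 0.222, in line with the summaries above; small differences in the last digit come from the R-squared inputs already being rounded. For a fitted model object `fit`, the same quantity is available as `summary(fit)$adj.r.squared`.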