In this homework assignment we will explore, analyze and model a data set containing 2276 professional baseball team records from the years 1871 to 2006. Our objective is to build a multiple linear regression model on the given training data to predict the number of wins for each team in the test data.
| Variable_Name | Definition | Theoretical_Effect |
|---|---|---|
| INDEX | Identification variable (do not use) | None |
| TARGET_WINS | Number of wins | — |
| TEAM_BATTING_H | Base hits (1B, 2B, 3B, HR) | Positive |
| TEAM_BATTING_2B | Doubles | Positive |
| TEAM_BATTING_3B | Triples | Positive |
| TEAM_BATTING_HR | Homeruns | Positive |
| TEAM_BATTING_BB | Walks | Positive |
| TEAM_BATTING_HBP | Hit by pitch | Positive |
| TEAM_BATTING_SO | Strikeouts | Negative |
| TEAM_BASERUN_SB | Stolen bases | Positive |
| TEAM_BASERUN_CS | Caught stealing | Negative |
| TEAM_FIELDING_E | Errors | Negative |
| TEAM_FIELDING_DP | Double plays | Positive |
| TEAM_PITCHING_BB | Walks allowed | Negative |
| TEAM_PITCHING_H | Hits allowed | Negative |
| TEAM_PITCHING_HR | Homeruns allowed | Negative |
| TEAM_PITCHING_SO | Strikeouts by pitchers | Positive |
The moneyball training data set contains 16 variables, excluding the index, and 2,276 observations. Each observational unit represents a single team’s statistics for one season. There are 15 predictor variables, all of which are counts of various actions in baseball such as base hits, home runs, strikeouts, stolen bases, caught stealing, hits allowed and more.
As seen below in our numerical summary, the data contains NA values in six variables (TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_PITCHING_SO, and TEAM_FIELDING_DP). These NA values will be addressed in the data preparation. TEAM_BATTING_HBP stands out with 2,085 NAs out of 2,276 observations (about 92%). There are also variables whose maximum values lie far outside the interquartile range, such as TEAM_PITCHING_H (max 30,132 against a third quartile of 1,682) and TEAM_PITCHING_SO (max 19,278 against a third quartile of 968).
## Rows: 2,276
## Columns: 16
## $ TARGET_WINS <int> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72, 7…
## $ TEAM_BATTING_H <int> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 1391,…
## $ TEAM_BATTING_2B <int> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, 179…
## $ TEAM_BATTING_3B <int> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41, 2…
## $ TEAM_BATTING_HR <int> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, 95,…
## $ TEAM_BATTING_BB <int> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, 374…
## $ TEAM_BATTING_SO <int> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 827, …
## $ TEAM_BASERUN_SB <int> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119, 221…
## $ TEAM_BASERUN_CS <int> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 109, …
## $ TEAM_BATTING_HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TEAM_PITCHING_H <int> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 1391,…
## $ TEAM_PITCHING_HR <int> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, 95,…
## $ TEAM_PITCHING_BB <int> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, 391…
## $ TEAM_PITCHING_SO <int> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 827,…
## $ TEAM_FIELDING_E <int> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131, 11…
## $ TEAM_FIELDING_DP <int> NA, 155, 153, 156, 168, 149, 186, 136, 169, 159, 141,…
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 0 0 0 0
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 0 0 102 131
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 772 2085 0 0
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 0 102 0 286
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
##
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0
## Median :102.00 Median :512.0 Median : 750.0 Median :101.0
## Mean : 99.61 Mean :501.6 Mean : 735.6 Mean :124.8
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## NA's :102 NA's :131
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 0.0 Min. :29.00 Min. : 1137 Min. : 0.0
## 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419 1st Qu.: 50.0
## Median : 49.0 Median :58.00 Median : 1518 Median :107.0
## Mean : 52.8 Mean :59.36 Mean : 1779 Mean :105.7
## 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682 3rd Qu.:150.0
## Max. :201.0 Max. :95.00 Max. :30132 Max. :343.0
## NA's :772 NA's :2085
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:131.0
## Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
## Mean : 553.0 Mean : 817.7 Mean : 246.5 Mean :146.4
## 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:164.0
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
## NA's :102 NA's :286
The histograms and box plots above provide a better understanding of the distribution of our predictor variables. Most variables have a roughly normal distribution, while others show strong left or right skew. The box plots also point to possible data entry errors, as may be the case for TEAM_PITCHING_SO.
The correlation heat map shows the relationship between each predictor
and the target variable, as well as the relationships among predictors.
The correlations mostly match the theoretical effects given in the
introduction, with some exceptions. One example is TEAM_BASERUN_CS,
whose correlation with wins is slightly positive (0.02240407) even
though being caught stealing is expected to have a negative impact on wins.
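A minimal sketch of how correlations against the target could be computed in base R. The toy data frame and its column names are made up for illustration, and the `use = "pairwise.complete.obs"` handling of NA values is an assumption, since the report does not state how missing values were treated for the heat map.

```r
# Toy data standing in for the training set (values are invented).
toy <- data.frame(
  wins            = c(60, 70, 75, 80, 85, 95),
  hits            = c(1300, 1380, 1400, 1450, 1500, 1600),
  caught_stealing = c(50, NA, 45, 55, NA, 52)
)

# Pairwise-complete correlations: each pair of columns uses the rows
# where both values are present (an assumption, see the note above).
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Correlation of every variable with the target, strongest positive first.
cor_with_target <- sort(cor_mat[, "wins"], decreasing = TRUE)
print(round(cor_with_target, 3))
```

Comparing the sign of each entry with the Theoretical_Effect column of the variable table is a quick way to spot exceptions such as TEAM_BASERUN_CS.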
Looking more closely at the outliers in TEAM_PITCHING_SO (strikeouts recorded by a team’s pitchers), we can see that the records for these teams are also paired with a TEAM_PITCHING_HR (home runs allowed by the pitchers) value of 0, so it stands to reason that these outliers are not data errors.
For the outliers in TEAM_PITCHING_H (hits allowed by pitchers), the distribution suggests that they are likely not data errors either. There are other, infrequent recorded values between the outliers and the IQR of the variable, so the outliers are plausible recorded values that happen to fall far out on the distribution’s right tail.
The variable TEAM_BATTING_HBP, which represents a batter being hit by a pitch, was removed because being hit is largely outside the batter’s control and is not a repeatable skill. The variable also contained 2,085 NA values out of 2,276 total observations.
## 'data.frame': 2276 obs. of 15 variables:
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
## $ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
Near-zero-variance variables are variables whose observed values barely change across observations. Because of this, they contribute little to the analysis and introduce unnecessary complexity along with multicollinearity risk. No variables in our data were found to have near-zero variance.
For data imputation, we computed the percentage of missing values in each column, shown below, and applied imputation to the columns with missing data, using a 5% missing rate as our reference point.
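The check described above can be sketched in base R as follows. The helper `near_zero_var` is hypothetical and is not the function used in the report, which does not name its tool. The two thresholds (a frequency ratio of 95/5 and 10% unique values) follow the defaults we understand `caret::nearZeroVar` to use and should be read as assumptions.

```r
# Flag near-zero-variance columns using two heuristics:
#   freq_ratio:     count of the most common value / count of the second most common
#   percent_unique: number of distinct values as a percentage of rows
near_zero_var <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  flags <- vapply(df, function(x) {
    counts <- sort(table(x), decreasing = TRUE)  # NA values are ignored by table()
    freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
    percent_unique <- 100 * length(counts) / length(x)
    freq_ratio > freq_cut && percent_unique < unique_cut
  }, logical(1))
  names(df)[flags]
}

# Toy example: one well-spread column and one that is almost constant.
set.seed(1)
toy <- data.frame(
  varied       = rnorm(200),
  constant_ish = c(rep(0, 198), 1, 2)
)
flagged <- near_zero_var(toy)
print(flagged)
```

A column is flagged only when both conditions hold, so a count variable with many distinct values is never flagged, whatever its skew.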
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 0.000000 0.000000 0.000000 0.000000
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 0.000000 0.000000 4.481547 5.755712
## TEAM_BASERUN_CS TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 33.919156 0.000000 0.000000 0.000000
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 4.481547 0.000000 12.565905
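The percentages above are the NA count in each column divided by the number of rows. A minimal base R sketch on a made-up data frame follows; the 5% cutoff and its direction in the last step are assumptions for illustration.

```r
# Toy data frame standing in for the training data (values are invented).
toy <- data.frame(
  TARGET_WINS     = c(70, 82, 75, 80),
  TEAM_BATTING_SO = c(842, NA, 917, 922),
  TEAM_BASERUN_CS = c(NA, 28, NA, 30)
)

# Count and percentage of missing values per column.
na_count <- colSums(is.na(toy))
na_pct   <- colMeans(is.na(toy)) * 100
print(na_pct)

# Columns whose missing rate exceeds a chosen cutoff (5% is an assumption).
cols_over_cutoff <- names(na_pct)[na_pct > 5]

# Worked check against the report's own numbers:
# 102 missing TEAM_BATTING_SO values out of 2276 rows.
batting_so_pct <- 102 / 2276 * 100
print(batting_so_pct)
```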
We used multiple imputation with the MICE predictive mean matching (PMM) method to fill in the missing values.
For the first model we chose to include all of the predictor variables. This allows us to see which features have a significant influence on our dependent variable, TARGET_WINS.
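To illustrate the idea behind predictive mean matching, here is a simplified single-column sketch in base R. It is not the MICE implementation: the helper `pmm_impute` is hypothetical, it produces one completed data set rather than several, and it skips the random draw of regression parameters that, as we understand it, MICE performs. The toy columns `hits` and `steals` are invented.

```r
# Simplified predictive mean matching for one numeric column.
# For each missing value, find the k observed rows whose predicted
# values are closest and copy the observed value of a random donor.
pmm_impute <- function(df, target, predictors, k = 5) {
  miss <- is.na(df[[target]])
  fit <- lm(reformulate(predictors, response = target), data = df[!miss, ])
  pred_obs  <- predict(fit, newdata = df[!miss, ])
  pred_miss <- predict(fit, newdata = df[miss, ])
  observed  <- df[[target]][!miss]
  df[[target]][miss] <- vapply(pred_miss, function(p) {
    donors <- order(abs(pred_obs - p))[seq_len(min(k, length(observed)))]
    observed[donors[sample.int(length(donors), 1)]]
  }, numeric(1))
  df
}

# Toy data: steals depends loosely on hits, with 15 values removed.
set.seed(42)
n <- 100
toy <- data.frame(hits = rnorm(n, 1450, 100))
toy$steals <- round(0.1 * toy$hits + rnorm(n, 0, 10))
toy$steals[sample(n, 15)] <- NA

filled <- pmm_impute(toy, "steals", "hits")
print(sum(is.na(filled$steals)))
```

Because every imputed value is copied from an observed donor, imputed counts stay within the range of values that actually occur in the data, which suits count variables like the ones here.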
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR +
## TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP,
## data = Training_imp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.066 -8.413 0.173 8.114 47.738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.6652346 5.1731357 6.508 9.37e-11 ***
## TEAM_BATTING_H 0.0431257 0.0035895 12.014 < 2e-16 ***
## TEAM_BATTING_2B -0.0199054 0.0088954 -2.238 0.025337 *
## TEAM_BATTING_3B 0.0412403 0.0164442 2.508 0.012215 *
## TEAM_BATTING_HR 0.0576471 0.0265424 2.172 0.029968 *
## TEAM_BATTING_BB 0.0130473 0.0056243 2.320 0.020440 *
## TEAM_BATTING_SO -0.0150600 0.0024780 -6.077 1.43e-09 ***
## TEAM_BASERUN_SB 0.0494468 0.0054066 9.146 < 2e-16 ***
## TEAM_BASERUN_CS 0.0020950 0.0110596 0.189 0.849777
## TEAM_PITCHING_H 0.0013758 0.0003859 3.566 0.000371 ***
## TEAM_PITCHING_HR 0.0236405 0.0235842 1.002 0.316263
## TEAM_PITCHING_BB -0.0036554 0.0040041 -0.913 0.361385
## TEAM_PITCHING_SO 0.0015600 0.0008943 1.744 0.081220 .
## TEAM_FIELDING_E -0.0415048 0.0027079 -15.327 < 2e-16 ***
## TEAM_FIELDING_DP -0.1119556 0.0124114 -9.020 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.66 on 2261 degrees of freedom
## Multiple R-squared: 0.358, Adjusted R-squared: 0.354
## F-statistic: 90.06 on 14 and 2261 DF, p-value: < 2.2e-16
For the second model we narrowed down the variable selection. Because TEAM_PITCHING_HR is highly collinear with TEAM_BATTING_HR, we removed TEAM_PITCHING_HR. In addition, we removed TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_PITCHING_SO, and TEAM_FIELDING_DP because they contain missing values in the original data. Our reasoning is that removing these variables makes the model more reliable, since it no longer depends on imputed values and is less complex.
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_PITCHING_H +
## TEAM_PITCHING_BB + TEAM_FIELDING_E, data = Training_imp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.776 -8.875 0.097 8.860 55.466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.290e+00 3.443e+00 2.117 0.034376 *
## TEAM_BATTING_H 4.848e-02 3.207e-03 15.118 < 2e-16 ***
## TEAM_BATTING_2B -2.582e-02 9.057e-03 -2.851 0.004400 **
## TEAM_BATTING_3B 1.011e-01 1.665e-02 6.072 1.48e-09 ***
## TEAM_BATTING_HR 3.672e-02 7.749e-03 4.739 2.28e-06 ***
## TEAM_BATTING_BB -7.926e-05 4.585e-03 -0.017 0.986208
## TEAM_PITCHING_H -1.312e-03 3.683e-04 -3.561 0.000377 ***
## TEAM_PITCHING_BB 1.036e-02 2.802e-03 3.695 0.000225 ***
## TEAM_FIELDING_E -1.664e-02 2.368e-03 -7.025 2.81e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.48 on 2267 degrees of freedom
## Multiple R-squared: 0.27, Adjusted R-squared: 0.2675
## F-statistic: 104.8 on 8 and 2267 DF, p-value: < 2.2e-16
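For our third model, our group used a variation of the backward selection process in which we removed the variables with the lowest p-values noted in Models 1 and 2. Only variables with p-values greater than 0.05 were included in this model. Note that this differs from conventional backward elimination, which removes the predictors with the highest p-values.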
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_SO + TEAM_BASERUN_CS +
## TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_BATTING_BB, data = Training_imp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.659 -8.994 0.549 9.297 70.322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.658983 1.850740 34.397 < 2e-16 ***
## TEAM_BATTING_SO -0.021016 0.001696 -12.388 < 2e-16 ***
## TEAM_BASERUN_CS 0.083583 0.007696 10.860 < 2e-16 ***
## TEAM_PITCHING_HR 0.116163 0.007706 15.075 < 2e-16 ***
## TEAM_PITCHING_BB -0.009051 0.002172 -4.166 3.21e-05 ***
## TEAM_BATTING_BB 0.037613 0.003223 11.669 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.4 on 2270 degrees of freedom
## Multiple R-squared: 0.1657, Adjusted R-squared: 0.1638
## F-statistic: 90.14 on 5 and 2270 DF, p-value: < 2.2e-16
Although Model 1 shows higher multicollinearity among certain predictors, our analysis identified it as the strongest of the three regression models. It achieved the lowest residual standard error (12.66) and the highest adjusted R² (0.354), making it our most accurate predictor of team wins among the candidates. Model 1’s residuals appear approximately normally distributed, and its Q-Q plot looks close to normal.
Model 1 suggests that for a baseball team to increase its number of wins in a season, it should focus on increasing its home runs and stolen bases. TEAM_BATTING_HR has the largest positive coefficient at 0.05765, and TEAM_BASERUN_SB has the second largest at 0.04945. Conversely, teams should minimize fielding errors, as TEAM_FIELDING_E has a highly significant negative coefficient of -0.04150. The largest negative coefficient in Model 1 actually belongs to TEAM_FIELDING_DP (-0.11196), which runs against the positive theoretical effect listed for double plays.
The variable TEAM_BATTING_HR is highly correlated with TEAM_PITCHING_HR; however, both of these variables have a large theoretical impact on the probability of winning. Hitting a home run or allowing a home run directly influences the game’s score, and therefore our group decided to keep both variables.
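The comparison can be checked by recomputing adjusted R² from the R², sample size, and predictor counts printed in the three model summaries. A minimal sketch, where `adj_r2` is a helper name introduced here for illustration:

```r
# Adjusted R-squared from R-squared, sample size n, and predictor count p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 2276  # observations in the training data

# Values copied from the three model summaries in this report.
models <- data.frame(
  model = c("Model 1", "Model 2", "Model 3"),
  r2    = c(0.358, 0.27, 0.1657),
  p     = c(14, 8, 5),
  rse   = c(12.66, 13.48, 14.4)
)
models$adj_r2 <- adj_r2(models$r2, n, models$p)
print(models)

# Model with the highest adjusted R-squared and the lowest residual standard error.
best_by_adj_r2 <- models$model[which.max(models$adj_r2)]
best_by_rse    <- models$model[which.min(models$rse)]
```

The recomputed values agree with the printed ones to about three decimal places; the small differences for Models 2 and 3 come from the rounding of R² in the summaries.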
Model 1 variables VIF
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 3.823342 2.460052 2.995896 36.657149
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 6.756380 5.274069 4.349937 4.373084
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 4.182680 29.664612 6.297724 3.336076
## TEAM_FIELDING_E TEAM_FIELDING_DP
## 5.399699 1.872039
Model 2 variables VIF
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 2.691190 2.248967 2.707698 2.755238
## TEAM_BATTING_BB TEAM_PITCHING_H TEAM_PITCHING_BB TEAM_FIELDING_E
## 3.958646 3.361075 2.720094 3.642208
Model 3 variables VIF
## TEAM_BATTING_SO TEAM_BASERUN_CS TEAM_PITCHING_HR TEAM_PITCHING_BB
## 1.909613 1.635908 2.446552 1.432176
## TEAM_BATTING_BB
## 1.714261
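For reference, a variance inflation factor can be computed by regressing each predictor on all the others and taking 1 / (1 - R²). The sketch below does this in base R on invented data; `vif_manual` is a hypothetical helper, not the function that produced the tables above, and the toy columns only mimic the situation where batting and pitching home runs move together.

```r
# VIF for each column of a data frame of predictors:
# regress the column on all other columns, then compute 1 / (1 - R^2).
vif_manual <- function(df) {
  vapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  }, numeric(1))
}

# Toy predictors: hr_pit is nearly a copy of hr_bat, errors is unrelated.
set.seed(7)
n <- 500
hr_bat <- rnorm(n, 100, 40)
toy <- data.frame(
  hr_bat = hr_bat,
  hr_pit = hr_bat + rnorm(n, 0, 4),
  errors = rnorm(n, 150, 30)
)

vifs <- vif_manual(toy)
print(round(vifs, 2))
```

In this toy case the two home run columns receive very large VIFs while the unrelated column stays near 1, the same pattern as TEAM_BATTING_HR and TEAM_PITCHING_HR in the Model 1 table.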
Using Model 1, we generated predicted TARGET_WINS values for the evaluation data, shown below.
## [1] 61.70438 64.43788 74.03427 87.39829 58.94786 77.30199 86.13339
## [8] 76.26872 69.82539 73.39817 68.68975 82.94084 82.04394 83.30519
## [15] 86.00371 78.02754 73.63939 78.06545 71.50434 91.30627 81.36126
## [22] 83.82291 79.61094 72.07780 82.58964 88.28316 48.71756 74.33875
## [29] 82.72964 74.07607 90.01052 85.66996 81.48934 82.88474 78.94106
## [36] 86.30069 75.49494 89.97919 86.62608 91.18688 82.82761 90.68766
## [43] 26.96493 109.79863 97.22876 98.13209 100.82611 76.25749 68.20711
## [50] 79.56018 76.91483 85.61544 75.67395 73.50105 74.54285 78.78853
## [57] 92.67873 76.20721 64.58450 81.16847 88.29978 73.38585 88.15314
## [64] 86.27224 85.34943 108.55313 73.01577 79.03907 78.59596 88.13572
## [71] 84.77313 70.74176 77.95723 90.39901 80.00471 83.91870 82.31571
## [78] 83.67792 72.69503 77.56226 84.84680 87.35468 96.60434 74.03809
## [85] 84.48714 81.67617 83.82346 83.89574 89.98801 90.31530 83.03474
## [92] 83.68749 73.71849 87.69547 86.27199 85.21599 87.84104 101.48732
## [99] 85.53824 86.51020 78.84594 74.09628 83.65425 84.05378 78.11537
## [106] 63.05545 57.92238 76.62968 86.48213 57.39852 85.01666 86.85096
## [113] 94.61449 91.90134 81.10868 77.98767 85.54428 81.09600 73.48884
## [120] 77.50156 99.09390 69.19853 69.67346 68.15842 68.00319 88.09358
## [127] 90.02270 76.59586 92.76469 91.37175 85.09122 79.84423 79.90539
## [134] 85.03472 87.59056 71.73025 74.05494 77.55132 89.23137 81.18155
## [141] 63.94014 73.66388 90.29776 71.64263 71.34484 71.42443 76.51099
## [148] 78.86705 78.93489 82.97546 82.38224 80.33354 53.00456 68.93829
## [155] 76.46388 70.76381 89.54568 68.43599 90.84364 75.78387 102.72792
## [162] 107.37037 93.87661 103.47792 97.22779 89.54158 81.77328 82.51216
## [169] 73.62276 80.78200 89.72416 89.20888 80.09017 93.92141 82.66462
## [176] 72.87966 77.64884 70.23489 73.58130 79.10529 90.23682 88.50916
## [183] 86.02718 84.53059 84.86146 99.18920 87.99015 65.03207 64.47068
## [190] 115.06085 70.88524 84.05417 76.60068 77.27529 79.08308 67.76610
## [197] 78.06713 84.38418 79.32341 82.97360 73.77323 78.59396 72.37519
## [204] 91.71000 81.53258 83.28816 77.10427 76.87136 82.76409 72.50850
## [211] 104.82097 89.74709 81.07565 64.70332 67.65049 82.83278 78.40176
## [218] 94.62109 77.53758 78.18899 77.52839 74.00609 80.55173 72.57908
## [225] 70.83159 75.07318 81.50358 78.42806 81.18371 84.41642 81.95687
## [232] 93.48219 78.70169 89.32369 79.60386 74.66871 82.09326 77.39837
## [239] 88.68597 72.03140 88.47934 86.42141 83.39604 81.54812 60.92730
## [246] 88.06432 81.04736 85.19076 72.97059 84.39924 79.99281 62.77491
## [253] 95.70136 33.87954 69.47688 76.60465 82.90241 84.59043 76.51166
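A minimal sketch of this prediction step, using `lm` and `predict` on made-up data. The data frames `train` and `eval_data` and their columns are invented for illustration; in practice the evaluation data would need the same preparation as the training data (dropping TEAM_BATTING_HBP and imputing missing values) before `predict` can return a value for every row.

```r
# Toy training data with a known linear relationship plus noise.
set.seed(123)
train <- data.frame(
  hits   = rnorm(50, 1450, 100),
  errors = rnorm(50, 150, 30)
)
train$wins <- 10 + 0.05 * train$hits - 0.04 * train$errors + rnorm(50, 0, 5)

# Toy evaluation data: same predictor columns, no target column.
eval_data <- data.frame(
  hits   = c(1300, 1500, 1700),
  errors = c(200, 150, 100)
)

fit <- lm(wins ~ hits + errors, data = train)

# Point predictions, one per evaluation row.
pred <- predict(fit, newdata = eval_data)

# The same predictions with 95% prediction intervals (columns fit, lwr, upr).
pred_int <- predict(fit, newdata = eval_data, interval = "prediction", level = 0.95)
print(round(pred_int, 2))
```

Reporting the interval alongside each point prediction gives a sense of the uncertainty, which is considerable for a model with a residual standard error of about 12.66 wins.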