————————————————————————————————————————–
The original Training data set comprises 17 elements and 2276 total observations. Of those 17 elements, INDEX is simply an index value used for sorting, while TARGET_WINS is the response variable for our regression models. The remaining 15 elements are all potential predictor variables for our linear models. A summary table for the data set is provided below.
| variables | n | mean | sd | med | min | max | range | skew | kurtosis | se | NAs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TARGET_WINS | 2276 | 81 | 16 | 82 | 0 | 146 | 146 | -0.40 | 1.03 | 0.33 | |
| TEAM_BATTING_H | 2276 | 1470 | 145 | 1454 | 891 | 2554 | 1663 | 1.57 | 7.28 | 3.03 | |
| TEAM_BATTING_2B | 2276 | 241 | 47 | 238 | 69 | 458 | 389 | 0.22 | 0.01 | 0.98 | |
| TEAM_BATTING_3B | 2276 | 55 | 28 | 47 | 0 | 223 | 223 | 1.11 | 1.50 | 0.59 | |
| TEAM_BATTING_HR | 2276 | 100 | 61 | 102 | 0 | 264 | 264 | 0.19 | -0.96 | 1.27 | |
| TEAM_BATTING_BB | 2276 | 502 | 123 | 512 | 0 | 878 | 878 | -1.03 | 2.18 | 2.57 | |
| TEAM_BATTING_SO | 2174 | 736 | 240 | 750 | 0 | 1399 | 1399 | -0.30 | -0.32 | 5.33 | 102 |
| TEAM_BASERUN_SB | 2145 | 125 | 88 | 101 | 0 | 697 | 697 | 1.97 | 5.49 | 1.90 | 131 |
| TEAM_BASERUN_CS | 1504 | 53 | 23 | 49 | 0 | 201 | 201 | 1.98 | 7.62 | 0.59 | 772 |
| TEAM_BATTING_HBP | 191 | 59 | 13 | 58 | 29 | 95 | 66 | 0.32 | -0.11 | 0.94 | 2085 |
| TEAM_PITCHING_H | 2276 | 1779 | 1407 | 1518 | 1137 | 30132 | 28995 | 10.33 | 141.84 | 29.49 | |
| TEAM_PITCHING_HR | 2276 | 106 | 61 | 107 | 0 | 343 | 343 | 0.29 | -0.60 | 1.28 | |
| TEAM_PITCHING_BB | 2276 | 553 | 166 | 536 | 0 | 3645 | 3645 | 6.74 | 96.97 | 3.49 | |
| TEAM_PITCHING_SO | 2174 | 818 | 553 | 813 | 0 | 19278 | 19278 | 22.17 | 671.19 | 11.86 | 102 |
| TEAM_FIELDING_E | 2276 | 246 | 228 | 159 | 65 | 1898 | 1833 | 2.99 | 10.97 | 4.77 | |
| TEAM_FIELDING_DP | 1990 | 146 | 26 | 149 | 52 | 228 | 176 | -0.39 | 0.18 | 0.59 | 286 |
At first glance, this table shows missing values in 6 fields (especially TEAM_BATTING_HBP and TEAM_BASERUN_CS). In addition, several variables such as TEAM_PITCHING_H, TEAM_PITCHING_BB and TEAM_PITCHING_SO exhibit severe skew and kurtosis. The box plot visualizes some significant outliers in several data columns, especially in TEAM_PITCHING_H and TEAM_PITCHING_SO.
Using the cor function across the data frame, we notice some strong correlations. TEAM_BATTING_H obviously has some collinearity with TEAM_BATTING_2B, TEAM_BATTING_3B and TEAM_BATTING_HR, as these values are subsets of hits. TEAM_BATTING_BB and TEAM_PITCHING_BB are strongly correlated, as are TEAM_PITCHING_HR and TEAM_BATTING_HR. Since we are focusing on wins, the following table shows each variable’s correlation with wins when the NA’s are omitted:
| Value | Correlation with Wins |
|---|---|
| TEAM_BATTING_H | 0.46994665 |
| TEAM_BATTING_2B | 0.31298400 |
| TEAM_BATTING_3B | -0.12434586 |
| TEAM_BATTING_HR | 0.42241683 |
| TEAM_BATTING_BB | 0.46868793 |
| TEAM_BATTING_SO | -0.22889273 |
| TEAM_BASERUN_SB | 0.01483639 |
| TEAM_BASERUN_CS | -0.17875598 |
| TEAM_BATTING_HBP | 0.07350424 |
| TEAM_PITCHING_H | 0.47123431 |
| TEAM_PITCHING_HR | 0.42246683 |
| TEAM_PITCHING_BB | 0.46839882 |
| TEAM_PITCHING_SO | -0.22936481 |
| TEAM_FIELDING_E | -0.38668800 |
| TEAM_FIELDING_DP | 0.13168916 |
As a result of missing data, severe outliers, and collinearity there is a clear need for data preparation and transformation.
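The correlations above can be illustrated with a small sketch. It computes a Pearson correlation after dropping any pair with a missing value, which is one way to read "NA’s omitted"; the function name and the toy values are hypothetical and are not taken from the data set. Note that omitting NA rows across the whole data frame (listwise deletion) would keep fewer rows than this pairwise version.

```python
import math

def pearson_complete_cases(xs, ys):
    """Pearson correlation using only pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values (None marks a missing observation).
hits = [1400, 1500, None, 1450, 1600]
wins = [70, 85, 80, 78, 95]
r = pearson_complete_cases(hits, wins)
```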
————————————————————————————————————————–
Our data preparation efforts for the training data set include the creation of one new derived variable, removing four predictor variables, imputing values for the remaining variables that had missing values (NA’s), and removal of a relatively small number of records that contained clearly egregious outlier values for particular variables. The results of these efforts were subsequently used as the basis for each of the five different linear models we created and evaluated.
We began our data preparation efforts by creating a new variable, TEAM_BATTING_1B, which represents offensive single-base hits (created by subtracting the TEAM_BATTING doubles, triples and home runs from the TEAM_BATTING_H variable). We believe that separating out singles from the other unique hit values will minimize collinearity. The TEAM_BATTING_H variable is then removed from the data set since it is simply a linear combination of its component variables.
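A minimal sketch of the singles derivation follows. The function name is hypothetical, and the inputs are the column means from the original summary table, used purely as illustrative values.

```python
def singles(hits, doubles, triples, home_runs):
    """Singles are the hits that are not doubles, triples, or home runs."""
    return hits - doubles - triples - home_runs

# Column means from the original summary table, used only as illustrative inputs.
team_batting_1b = singles(hits=1470, doubles=241, triples=55, home_runs=100)
```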
The results of our data exploration efforts lead us to drop three other variables from the data set:
TEAM_BATTING_HBP: The TEAM_BATTING_HBP variable has very little correlation with the TARGET_WINS response variable and also contains 2085 missing values out of a total of 2276. Since it would be very difficult to accurately impute such a large proportion of any variable’s missing values, we choose to exclude the variable from our analysis.
TEAM_BASERUN_CS: This variable is strongly correlated (65.5%) with the TEAM_BASERUN_SB variable and is the 2nd largest source of NA’s in our data set. These combined facts lead us to exclude the variable from our analysis.
TEAM_PITCHING_HR: This variable is 97% correlated with TEAM_BATTING_HR. In fact, 815 cases (more than 35% of our total cases) have IDENTICAL values for pitched and batted HR’s. This high degree of correlation may be due to the time series nature of the data: as baseball evolved, more home runs were hit, which naturally causes the number of pitched home runs to increase. The statistics are basically opposite sides of the same coin to a large degree (even if there may be some variability between individual teams in any given year). The fact that these two variables are nearly perfectly correlated indicates that one of them can legitimately be removed from the data set, and we chose TEAM_PITCHING_HR since we believe the batting HR metric will be more predictive of TARGET_WINS than will the pitching HR statistic.
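To give a sense of scale for a 97% correlation: in a model with only two predictors, the variance inflation factor of each predictor reduces to 1 / (1 - r^2), where r is the correlation between them. The sketch below applies that two-predictor special case; it is an illustration only, not the VIF computation used on the full models.

```python
def vif_two_predictors(r):
    """VIF of either predictor in a two-predictor model with correlation r."""
    return 1.0 / (1.0 - r ** 2)

# Correlation between TEAM_PITCHING_HR and TEAM_BATTING_HR reported above.
vif = vif_two_predictors(0.97)
```

The resulting value is roughly 17, which is far above commonly used VIF cutoffs such as 5.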
After removing these values, our next step is to impute the remaining missing data. To do this we will use a linear regression approach recommended by Faraway (p.201) and Fox (p.611). We are not using the mean or median as a replacement value for NA’s since regression yields imputed values that are much more consistent with the actual distribution of the data while introducing much less potential bias.
In the process of building each model we run analysis to ensure that there are no collinearity issues and all p-values are < \(.05\). Each model produces imputed distributions of the subject variables that are consistent with those of the original NA-populated data. It is our belief that this consistency indicates that the resulting predicted values for the missing values are an improvement over simply filling the NA’s with a mean or median. The replacement of the NA’s with numerical values allows us to run our final models on all records, not just those without NA’s. For consistency we will use the same approach with the evaluation data.
The variables with imputation regression models are described below:
TEAM_BATTING_SO: The adjusted \(R^2\) value for this regression model is 0.7223, and the model yields a distribution matching that of the variable prior to the NA’s replacement. For this predictor we impute a total of 102 missing values via regression.
TEAM_PITCHING_SO: The adjusted \(R^2\) value for this regression model is 0.9952. We impute 102 missing values for strikeouts.
TEAM_BASERUN_SB: The adjusted \(R^2\) value for this regression model is 0.3427. Despite the adjusted \(R^2\) being low relative to the models described above, the model yields a distribution matching that of the variable beforehand. Our model replaces 131 missing stolen base values.
TEAM_FIELDING_DP: The adjusted \(R^2\) for this model is 0.3904. We impute 286 missing values with a distribution similar to that of the previous data.
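The general idea of regression imputation can be sketched as follows. This is a simplified illustration with a single predictor and ordinary least squares; the predictors actually used in each imputation model are not listed here, so the function names, the choice of predictor, and the toy values are all assumptions.

```python
def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def impute_by_regression(predictor, target):
    """Fill missing target values (None) with predictions from observed rows."""
    observed = [(x, y) for x, y in zip(predictor, target) if y is not None]
    intercept, slope = fit_simple_ols([x for x, _ in observed],
                                      [y for _, y in observed])
    return [y if y is not None else intercept + slope * x
            for x, y in zip(predictor, target)]

# Hypothetical toy values: a fully observed predictor and a target with gaps.
batting_hr = [40, 90, 120, 150, 100]
batting_so = [480, 690, None, 1050, None]
filled = impute_by_regression(batting_hr, batting_so)
```

Observed values are left untouched; only the `None` entries are replaced by fitted values.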
Our final data processing step is to eliminate some clearly egregious outliers identified via research through baseball-almanac.com, as suggested by Sheather (p. 57). For example, the record for the most pitching strikeouts in a single season is 1450 by the 2014 Cleveland Indians. Therefore we know that any records having TEAM_PITCHING_SO values above that point are aberrations.
Similarly, the most errors by team in a single season are 639 by Philadelphia in 1883. Prorating to 162 games we calculate that we should discard any records containing TEAM_FIELDING_E values above 1046.
The TEAM_PITCHING_H variable also appears to have numerous egregious outliers. For example, the most offensive hits by a team in a single season is 1730. As such, it is highly unlikely that any pitching staff would surrender more than 3000 hits in a single season. Such a total would indicate the team allows more than 18 hits per game. As such, any records having a TEAM_PITCHING_H value > 3000 are removed from the data set.
As a result of this research, we feel confident in removing 104 records with egregious outliers that are impossible from a historical perspective. Using this SME knowledge will help to normalize our data and improve the expected performance of our linear models.
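The threshold arithmetic can be checked with a short sketch. The 99-game length of Philadelphia’s 1883 season is an assumption that is not stated in this report; with it, the prorated error total matches the 1046 cutoff. The record layout and function name are hypothetical.

```python
SEASON_GAMES = 162
GAMES_PLAYED_1883 = 99  # assumption: season length for Philadelphia in 1883

# Prorate the single-season error record to a 162-game schedule.
MAX_FIELDING_E = round(639 * SEASON_GAMES / GAMES_PLAYED_1883)
MAX_PITCHING_SO = 1450   # single-season team strikeout record cited above
MAX_PITCHING_H = 3000    # cutoff chosen above for hits allowed

hits_per_game_at_cutoff = MAX_PITCHING_H / SEASON_GAMES

def is_plausible(record):
    """Return True when a record stays within all three historical limits."""
    return (record["TEAM_PITCHING_SO"] <= MAX_PITCHING_SO
            and record["TEAM_FIELDING_E"] <= MAX_FIELDING_E
            and record["TEAM_PITCHING_H"] <= MAX_PITCHING_H)

# Hypothetical records used only to exercise the filter.
records = [
    {"TEAM_PITCHING_SO": 800, "TEAM_FIELDING_E": 150, "TEAM_PITCHING_H": 1500},
    {"TEAM_PITCHING_SO": 19278, "TEAM_FIELDING_E": 150, "TEAM_PITCHING_H": 1500},
]
kept = [rec for rec in records if is_plausible(rec)]
```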
The charts below show that our data transformation process dramatically improves the data. There are still a few outliers, but on a dramatically smaller scale, with a particularly significant change for the TEAM_PITCHING_H and TEAM_PITCHING_SO variables.
In addition, the table below shows how much the skew and kurtosis have improved in comparison with the original data set.
| variables | n | mean | sd | med | min | max | range | skew | kurtosis | se | NAs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TARGET_WINS | 2172 | 81 | 15 | 82 | 21 | 135 | 114 | -0.22 | 0.08 | 0.31 | |
| TEAM_BATTING_2B | 2172 | 242 | 46 | 239 | 118 | 458 | 340 | 0.23 | -0.13 | 0.98 | |
| TEAM_BATTING_3B | 2172 | 54 | 27 | 47 | 11 | 190 | 179 | 0.99 | 0.65 | 0.58 | |
| TEAM_BATTING_HR | 2172 | 103 | 59 | 107 | 3 | 264 | 261 | 0.16 | -0.93 | 1.27 | |
| TEAM_BATTING_BB | 2172 | 516 | 100 | 518 | 73 | 878 | 805 | -0.32 | 0.96 | 2.15 | |
| TEAM_BATTING_SO | 2172 | 744 | 226 | 745 | 252 | 1399 | 1147 | 0.06 | -0.98 | 4.85 | |
| TEAM_BASERUN_SB | 2172 | 131 | 93 | 103 | 18 | 697 | 679 | 1.75 | 3.91 | 1.99 | |
| TEAM_PITCHING_H | 2172 | 1575 | 256 | 1508 | 1137 | 2960 | 1823 | 2.10 | 5.88 | 5.48 | |
| TEAM_PITCHING_BB | 2172 | 551 | 107 | 538 | 144 | 1123 | 979 | 0.70 | 1.40 | 2.29 | |
| TEAM_PITCHING_SO | 2172 | 789 | 223 | 797 | 301 | 1434 | 1133 | 0.15 | -0.63 | 4.79 | |
| TEAM_FIELDING_E | 2172 | 213 | 148 | 155 | 65 | 965 | 900 | 2.18 | 4.55 | 3.18 | |
| TEAM_FIELDING_DP | 2172 | 143 | 28 | 146 | 56 | 228 | 172 | -0.30 | -0.14 | 0.59 | |
| TEAM_BATTING_1B | 2172 | 1061 | 102 | 1046 | 811 | 1656 | 845 | 0.89 | 1.49 | 2.20 | |
Our training data set with the NA’s filled and the outliers removed can be found here:
https://github.com/spsstudent15/2016-02-621-W1/blob/master/621-HW1-Clean-Data.csv
We did use other model-specific data transformations, including Box-Cox power transforms and linear combinations of variables. These model-specific transformations are discussed within the individual model writeups provided in Part 3.
————————————————————————————————————————–
Our first model applies simple Backward Selection methods through the use of p-values and variance inflation factors (VIF) against all 12 remaining predictor variables. Simply removing the TEAM_BATTING_1B variable yields a model with all p-values less than \(.05\). However, VIF analysis shows evidence of multiple collinear variables within the model. Subsequent removals of TEAM_PITCHING_SO and TEAM_PITCHING_BB due to collinearity yield a model calling for the removal of TEAM_BATTING_2B on the basis of its p-value.
The final model of these iterations shows clear evidence of a number of outliers in R’s summary diagnostic plots. We remove these outliers via a series of additional iterations, yielding the following final model (which once again includes TEAM_BATTING_2B, as the previous step improved the statistical significance of the variable):
| Coefficient | Variable |
|---|---|
| 66.261 | Intercept |
| - 0.017 | TEAM_BATTING_2B |
| + 0.150 | TEAM_BATTING_3B |
| + 0.109 | TEAM_BATTING_HR |
| + 0.022 | TEAM_BATTING_BB |
| - 0.019 | TEAM_BATTING_SO |
| + 0.065 | TEAM_BASERUN_SB |
| + 0.016 | TEAM_PITCHING_H |
| - 0.075 | TEAM_FIELDING_E |
| - 0.109 | TEAM_FIELDING_DP |
| RSE | R^2 | Adj. R^2 | F Stat. | MSE |
|---|---|---|---|---|
| 11.49 | 0.3598 | 0.3572 | 134.4 | 132 |
However, the diagnostic plots of that model show a lack of linearity between the response variable TARGET_WINS and the predictor variable TEAM_FIELDING_E. Furthermore, the plots of standardized residuals against each of the predictor variables show evidence of non-constant variability for variables such as TEAM_BATTING_HR, TEAM_BATTING_SO, TEAM_BASERUN_SB, and TEAM_FIELDING_E. Therefore, we transform the TEAM_FIELDING_E variable using a Box-Cox recommended power transform of (-1), or (1/y), and rebuild the model. The resulting Added Variable plots show that all predictors are linearly related to the response, and we see an improvement in the variability of the residuals relative to TEAM_FIELDING_E. Furthermore, the plot of Y against the fitted values shows an improvement in the linearity of the model.
The characteristic equation for this improved model is as follows:
| Coefficient | Variable |
|---|---|
| 52.88 | Intercept |
| + 0.168 | TEAM_BATTING_3B |
| + 0.096 | TEAM_BATTING_HR |
| + 0.027 | TEAM_BATTING_BB |
| - 0.027 | TEAM_BATTING_SO |
| + 0.034 | TEAM_BASERUN_SB |
| + 0.004 | TEAM_PITCHING_H |
| + 3252.31 | 1/TEAM_FIELDING_E |
| - 0.102 | TEAM_FIELDING_DP |
| RSE | R^2 | Adj. R^2 | F Stat. | MSE |
|---|---|---|---|---|
| 11.86 | 0.3168 | 0.3143 | 124.8 | 141 |
The coefficients for TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_SO, and TEAM_BASERUN_SB all make sense intuitively. The TEAM_FIELDING_DP coefficient surprises since baseball fans believe that more defensive double plays will improve a team’s chances of winning games. However, the variable itself is negatively correlated with TARGET_WINS (see the Data Exploration section), validating the negative coefficient. Similarly, the coefficient for TEAM_PITCHING_H is also counterintuitive, but the variable is actually positively correlated with TARGET_WINS. Finally, TEAM_FIELDING_E changes from negative in the earlier model to positive here. However, the coefficient now applies to the transformed version of the variable rather than the nominal values of the variable.
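The sign change on TEAM_FIELDING_E can be made concrete with a short sketch. A Box-Cox transform with power -1 is (y^-1 - 1) / (-1) = 1 - 1/y, a linear function of the reciprocal, so a positive coefficient on 1/TEAM_FIELDING_E still means that more errors reduce predicted wins. The function name is hypothetical. The inputs 65 and 155 are the minimum and median of the prepared data, and 500 is an arbitrary larger value.

```python
COEF_RECIPROCAL_ERRORS = 3252.31  # coefficient on 1/TEAM_FIELDING_E reported above

def errors_contribution(errors):
    """Wins contributed by the reciprocal-errors term of the improved model."""
    return COEF_RECIPROCAL_ERRORS / errors

# The contribution shrinks as the error count grows.
contributions = [errors_contribution(e) for e in (65, 155, 500)]
```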
While this model is an improvement over earlier iterations, we still see component variables that appear to lack constant variability relative to the residuals for variables such as TEAM_BASERUN_SB. The lack of constant variability in the residuals is likely related to the skewed nature of the distributions of those individual variables. In our next models we attempt to address some of the skew issues by creating combinations of various variables.
————————————————————————————————————————–
This model employs a linear combination of four of the predictor variables to calculate the baseball statistic known as “Total Bases”. Total Bases is calculated using what our data set refers to as “TEAM_BATTING” variables as follows:
TOTAL_BASES = TEAM_BATTING_1B + (2 * TEAM_BATTING_2B) + (3 * TEAM_BATTING_3B) + (4 * TEAM_BATTING_HR)
Inclusion of this new variable allows us to eliminate the four component variables from the model. In fact, the TOTAL_BASES variable appears to be nearly normally distributed, thereby negating the skew issues that were evident with its component variables.
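As a quick illustration, the standard Total Bases weighting (one base per single, two per double, three per triple, four per home run) can be written as a small function. The function name is hypothetical, and the inputs are column means from the prepared data set, used purely as example values.

```python
def total_bases(singles, doubles, triples, home_runs):
    """Standard Total Bases: each hit weighted by the bases it earns."""
    return singles + 2 * doubles + 3 * triples + 4 * home_runs

# Column means from the prepared data set, used only as illustrative inputs.
tb = total_bases(singles=1061, doubles=242, triples=54, home_runs=103)
```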
This model applies simple Backward Selection methods through the use of p-values and variance inflation factors (VIF) against a derived value for total bases and the remaining 8 predictors. Three iterations of p-value / VIF backward selection remove TEAM_PITCHING_SO and TEAM_PITCHING_BB from the model. All other variables remain statistically significant with no significant collinearity. However, evidence of multiple outliers is found through R’s summary diagnostic plots, forcing several additional iterations that result in the following model:
| Coefficient | Variable |
|---|---|
| 48.486 | Intercept |
| + 0.022 | TEAM_BATTING_BB |
| - 0.015 | TEAM_BATTING_SO |
| + 0.063 | TEAM_BASERUN_SB |
| + 0.010 | TEAM_PITCHING_H |
| - 0.064 | TEAM_FIELDING_E |
| - 0.117 | TEAM_FIELDING_DP |
| + 0.018 | TOTAL_BASES |
| RSE | R^2 | Adj. R^2 | F Stat. | MSE |
|---|---|---|---|---|
| 11.7 | 0.3365 | 0.3343 | 156 | 137 |
Once again, the diagnostic plots of that model show a lack of linearity between the response variable TARGET_WINS and one of the predictor variables (TEAM_FIELDING_E). Furthermore, the plots of standardized residuals against each of the predictor variables show evidence of non-constant variability for variables such as TEAM_BATTING_SO, TEAM_BASERUN_SB, and TEAM_FIELDING_E. Using a Box-Cox recommended power transform of (-1), or (1/y), we transform TEAM_FIELDING_E and create a new model. The resulting Added Variable plots show that all predictors are linearly related to the response, and the variability of the residuals improves. Furthermore, the plot of Y against the fitted values shows an improvement in the linearity of the model. Therefore, this model appears to be an improvement over the first TOTAL_BASES model, and the equation indicated by the model is as follows:
| Coefficient | Variable |
|---|---|
| 39.164 | Intercept |
| + 0.025 | TEAM_BATTING_BB |
| - 0.025 | TEAM_BATTING_SO |
| + 0.038 | TEAM_BASERUN_SB |
| + 2714.54 | 1/TEAM_FIELDING_E |
| - 0.115 | TEAM_FIELDING_DP |
| + 0.0197 | TOTAL_BASES |
| RSE | R^2 | Adj. R^2 | F Stat. | MSE |
|---|---|---|---|---|
| 11.97 | 0.3048 | 0.3029 | 157.5 | 143 |
Like the first model, the coefficients for TEAM_BATTING_BB, TEAM_BATTING_SO, TEAM_BASERUN_SB, and TOTAL_BASES all make sense intuitively. The TEAM_FIELDING_DP coefficient is negative, matching the correlation with TARGET_WINS. However, the coefficient for TEAM_FIELDING_E changes from negative in the earlier model to positive here, as it now applies to the reciprocal of the variable.
————————————————————————————————————————–
Our third model improves upon the “Total Bases” model by extending the TOTAL_BASES variable to include the TEAM_BATTING_BB and TEAM_BASERUN_SB variables, as both represent basepath advancements by a team’s offense. “Total Bases Plus” (TB_PLUS) is calculated using TEAM_BATTING and TEAM_BASERUN variables as follows:
TB_PLUS = TOTAL_BASES + TEAM_BATTING_BB + TEAM_BASERUN_SB
Including this new variable allows us to eliminate the two additional component variables from the model. In fact, the TB_PLUS variable, like the TOTAL_BASES variable from model 2, appears to be nearly normally distributed, thereby negating any skew issues evident in its component variables. A histogram of the distribution of the derived TB_PLUS variable is shown below:
This model also applies simple Backward Selection methods through the use of p-values and variance inflation factors (VIF) against the derived value for TB_PLUS and the remaining 6 predictor variables. Four iterations of p-value / VIF backward selection remove TEAM_PITCHING_H, TEAM_PITCHING_SO and TEAM_PITCHING_BB from the model. All other variables remain statistically significant with no significant collinearity. However, once again we find evidence of multiple outliers via R’s summary diagnostic plots, and a series of additional iterations yields the following final model:
| Coefficient | Variable |
|---|---|
| 52.330 | Intercept |
| - 0.016 | TEAM_BATTING_SO |
| - 0.034 | TEAM_FIELDING_E |
| - 0.154 | TEAM_FIELDING_DP |
| + 0.025 | TB_PLUS |
| RSE | R^2 | Adj. R^2 | F Stat. | MSE |
|---|---|---|---|---|
| 12.12 | 0.2944 | 0.2931 | 225.5 | 145 |
The coefficients for TEAM_BATTING_SO and TEAM_FIELDING_E make sense intuitively: the more strikeouts a team’s offense has, the less likely it is to put the ball in play, and the more fielding errors a team commits, the more likely they are to lose games. We see the same negative trend with TEAM_FIELDING_DP as in models 1 and 2. Most importantly, the coefficient for TB_PLUS positively correlates with the response variable as expected. As with the first two models, the diagnostic plots for this approach unfortunately show a lack of linearity between the response variable TARGET_WINS and the predictor variable TEAM_FIELDING_E. Furthermore, the plots of standardized residuals against each of the predictor variables demonstrate evidence of non-constant variability for the variables TEAM_BATTING_SO and TEAM_FIELDING_E.
As in model 2, we transform the TEAM_FIELDING_E predictor using its reciprocal. The resulting Added Variable plots show that all predictors are linearly related to the response, and we find an improvement in the variability of the residuals relative to TEAM_FIELDING_E. Furthermore, the plot of Y against the fitted values shows a non-skewed linear relationship. The characteristic equation indicated by the model is as follows:
| Coefficient | Variable |
|---|---|
| 42.160 | Intercept |
| - 0.023 | TEAM_BATTING_SO |
| + 2366.82 | 1/TEAM_FIELDING_E |
| - 0.140 | TEAM_FIELDING_DP |
| + 0.022 | TB_PLUS |
| RSE | R^2 | Adj. R^2 | F Stat. | MSE |
|---|---|---|---|---|
| 12.13 | 0.2932 | 0.2919 | 223.3 | 147 |
This model is an improvement over the first TB_PLUS model when the residual plots are considered, and the number of predictor variables used is two fewer than that of the “Total Bases” model. As in model 2, the coefficient for TEAM_FIELDING_E changes from negative to positive, and once again this is due to the fact that the coefficient now applies to the transformed version of the variable rather than the nominal values of the variable.
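To show how the characteristic equation is read, the sketch below evaluates the transformed TB_PLUS model for a single team. The function name is hypothetical. The strikeout, error, and double-play inputs are the mean, median, and mean of the prepared data respectively, and the TB_PLUS value is a made-up illustrative input.

```python
def predict_wins_tb_plus(batting_so, fielding_e, fielding_dp, tb_plus):
    """Evaluate the transformed TB_PLUS model with the coefficients reported above."""
    return (42.160
            - 0.023 * batting_so
            + 2366.82 / fielding_e   # reciprocal transform of TEAM_FIELDING_E
            - 0.140 * fielding_dp
            + 0.022 * tb_plus)

# Illustrative inputs only; tb_plus in particular is a hypothetical value.
wins = predict_wins_tb_plus(batting_so=744, fielding_e=155,
                            fielding_dp=143, tb_plus=2766)
```

With these inputs the prediction lands at roughly 81 wins, in line with the mean of TARGET_WINS.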
————————————————————————————————————————–
Sabermetrics has become the rage in baseball, popularized by Billy Beane and the data set we are exploring. As a result, we built a model that centers around one of these advanced analytics, known as BsR or base runs. This statistic (designed by David Smyth in the 1990’s) estimates the number of runs a team SHOULD score, adding an intriguing element to a data set which does not include runs (see http://tangotiger.net/wiki_archive/Base_Runs.html for more information). The formula for BsR is as follows:
BSR = A * B / (B + C) + D, where:
A = TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_BB
B = 1.02 * (1.4 * TEAM_TOTAL_BASES - 0.6 * TEAM_BATTING_H + 0.1 * TEAM_BATTING_BB)
C = AT BATS - TEAM_BATTING_H (which we approximated with 3 * TEAM_BATTING_H, as the average batting average is around 0.250)
D = TEAM_BATTING_HR
Since we eliminate the value of TEAM_BATTING_H, we sum up singles, doubles, triples and home runs in the actual code, and the approach for TEAM_TOTAL_BASES is described in model 2. The data for BSR exhibit a fairly normal distribution.
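The BSR calculation described above can be sketched as a single function. It follows the A, B, C, D terms exactly as given in this section, rebuilds hits from their components, and assumes the standard Total Bases weighting (1, 2, 3 and 4 bases per hit type). The function name is hypothetical, and the inputs are column means from the prepared data set used as example values.

```python
def base_runs(singles, doubles, triples, home_runs, walks):
    """BSR = A * B / (B + C) + D, using the terms defined in this section."""
    hits = singles + doubles + triples + home_runs
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    a = singles + doubles + triples + walks                     # baserunners
    b = 1.02 * (1.4 * total_bases - 0.6 * hits + 0.1 * walks)   # advancement
    c = 3 * hits                                                # approximate outs
    d = home_runs                                               # automatic runs
    return a * b / (b + c) + d

# Column means from the prepared data set, used only as illustrative inputs.
bsr = base_runs(singles=1061, doubles=242, triples=54, home_runs=103, walks=516)
```

A team whose only offense is home runs has A = 0, so its BSR equals its home run total, which is a convenient sanity check on the formula.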
Since BSR is a combination of all of the batting variables, we eliminate them from the regression, resulting in a very strong model on the first iteration. All p-values are very low, and the variance inflation factors are all below 5, showing no problems with collinearity. The characteristic equation indicated by the model is as follows:
| Coefficient | Variable |
|---|---|
| 40.687320 | Intercept |
| + 0.062189 | BSR |
| - 0.116615 | TEAM_FIELDING_DP |
| - 0.058885 | TEAM_FIELDING_E |
| + 0.060347 | TEAM_BASERUN_SB |
| - 0.011457 | TEAM_PITCHING_SO |
| + 0.019419 | TEAM_PITCHING_H |
| - 0.017603 | TEAM_PITCHING_BB |
| RSE | R^2 | Adj. R^2 | F Stat. | MSE |
|---|---|---|---|---|
| 11.99 | 0.3229 | 0.3207 | 147.4 | 144 |
The coefficients for this model overall do make sense. Errors and pitching walks contribute to fewer wins, and stolen bases and the BSR metric have a strong influence on increasing wins. Double plays do have a slightly negative value, although this could be explained by a team allowing a large number of baserunners (and as mentioned above it matches the correlation with wins). The positive impact of allowing pitching hits is puzzling, but once again it agrees with the trends we see in the data provided. Overall, there is a lot to like about this model as well.
————————————————————————————————————————–
This model began with transformations to a simple regression of each predictor against wins, showing improvements in the normality of the distribution and more uniformly distributed residuals. In each case, we apply a Box-Cox transformation to improve the skew of predictor variables BEFORE creating the multi-regression model. The model then applies a simple Forward Selection strategy, adding variables two-at-a-time until none can be found that improve the model as measured by adjusted \(R^2\). We also derive SLUGGING and FIELDING_YIELD variables by combining some of the variables.
Pre-model transformations of individual predictor variables are as follows:
Multiple iterations backwards and forwards based on p-values and VIF values result in a model whose diagnostic plots show relatively good linearity between the response variable TARGET_WINS and the 6 predictor variables. Furthermore, the plots of standardized residuals against each of the predictor variables show evidence of relatively uniform variability for each variable except the derived predictor FIELDING, which has two clusters. The model equation is as follows:
| Coefficient | Variable |
|---|---|
| 78.730 | Intercept |
| + 28.760 | 1/(TEAM_PITCHING_BB^6) |
| - 21.360 | 1/(TEAM_BATTING_1B^2) |
| + 1.450 | (SLUGGING^3) / (SLUGGING^5) |
| - 125.800 | 1/(TEAM_BATTING_SB^25) |
| + 3.747e6 | (FIELDING^10) / (FIELDING^9) |
| - .1037 | (TEAM_PITCHING_SO^2) / (TEAM_PITCHING_SO^3) |
| RSE | R^2 | Adj. R^2 | F Stat. | MSE |
|---|---|---|---|---|
| 11.97 | 0.3142 | 0.3123 | 164.2 | 143 |
The coefficients for this model are counterintuitive largely due to the transformations applied to each predictor variable. For example, one would think TEAM_BATTING_1B would correlate positively with winning, yet its coefficient is strongly negative. The same is true of TEAM_BATTING_SB and FIELDING. The sign and magnitude of each coefficient are the result of the transformations as well as the linear model. This does not invalidate the model; it simply means the coefficients become less useful as a check of the model’s fidelity.
————————————————————————————————————————–
The chart below summarizes the model statistics for all five of our models. The models are listed from left to right in accordance with the order in which they were described in Part 3.
| Metric | General Model | Total Bases | TB PLUS | Sabermetrics | Box-Cox First |
|---|---|---|---|---|---|
| RSE | 11.86 | 11.97 | 12.12 | 11.99 | 11.97 |
| R^2 | 0.3168 | 0.3048 | 0.2994 | 0.3229 | 0.3142 |
| Adj. R^2 | 0.3143 | 0.3029 | 0.2931 | 0.3207 | 0.3123 |
| F Stat. | 124.8 | 157.5 | 224.3 | 147.4 | 164.2 |
| MSE | 141 | 143 | 147 | 144 | 143 |
All five of our models converge on similar \(R^2\) values, RSE’s, and MSE’s, and all yield residuals that are distributed normally without significant evidence of highly leveraged outliers. No significant collinearity exists within any of the five models for any of their component predictor variables.
Of the five, we eliminate the General Model as it has the least favorable residual characteristics, with multiple predictors showing non-constant variability relative to the residuals. It also creates a relatively low F-statistic when compared to the other models.
Of the remaining four models, the Total Bases model was a great improvement over the General Model. However, it too displays some lack of constant variability of residuals relative to a couple of the predictor variables, so we can eliminate that one as well.
The Box-Cox First model makes use of a recommended Box-Cox transform for each individual predictor variable before the linear model is constructed via forward selection. While this model yields results similar to the others, we believe it is overly complex with the number of variables that comprise the model and the difficulty we have in explaining the predictor coefficients.
Of the remaining two, the Sabermetrics model yields a slightly larger \(R^2\) value, suggesting a slightly better fit. However, the TB Plus model is simpler, as it uses only 4 predictor variables, while possessing a much larger F-statistic. Because the F-statistic weighs the variability explained against the number of predictors used, such a large difference indicates that the TB Plus model explains nearly as much of the variability of the training data with far fewer predictors than the Sabermetrics model. Therefore, we select the TB PLUS model as the basis for our prediction of TARGET_WINS for the Evaluation data set.
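The relationship between \(R^2\), adjusted \(R^2\), and the F-statistic can be sketched with the standard formulas for a linear model with n observations and p predictors. The function names are hypothetical, and the sample size of 2172 for the Sabermetrics model is an assumption (that every prepared record was used). Under that assumption the formulas reproduce the reported adjusted \(R^2\) and F-statistic to within rounding.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fit on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def f_statistic(r2, n, p):
    """Overall F-statistic: explained variance per predictor over residual variance."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))

# Sabermetrics model: R^2 = 0.3229 with 7 predictors; n = 2172 is an assumption.
saber_adj = adjusted_r2(0.3229, n=2172, p=7)
saber_f = f_statistic(0.3229, n=2172, p=7)

# Same R^2 with fewer predictors gives a larger F, all else being equal.
fewer_predictors_f = f_statistic(0.3229, n=2172, p=4)
```

This is why a model with a slightly lower \(R^2\) but far fewer predictors can still post the larger F-statistic.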
To ensure the model’s efficacy when applied to the Evaluation data set, we apply the same set of transformations used on the Training data set prior to building our individual models. The results of those transformations (which include filling in the missing values) can be found here:
https://github.com/spsstudent15/2016-02-621-W1/blob/master/621-HW1-Clean-EvalData-.csv
Then, the additional transformations required for the TB PLUS model are applied to ensure conformity between the Training and Evaluation data sets.
The TB PLUS model is then applied to yield a set of INDEX / TARGET_WINS pairs. Since displaying the full set of predicted values would consume a large number of pages, a sample of the first 10 rows of that data set is displayed as an example:
| INDEX | TARGET_WINS |
|---|---|
| 9 | 61 |
| 10 | 66 |
| 14 | 72 |
| 47 | 86 |
| 60 | 66 |
| 63 | 73 |
| 74 | 82 |
| 83 | 71 |
| 98 | 69 |
| 120 | 74 |
Here are statistics for predicted wins vs. the original training data set and our prepared version from part 2:
| Set | n | Mean | sd | Median | Min | Max | Range | Skew | Kurtosis | SE |
|---|---|---|---|---|---|---|---|---|---|---|
| Predicted Model | 259 | 81 | 9 | 81 | 50 | 104 | 54 | -0.05 | 0.35 | 0.55 |
| Training Data | 2276 | 81 | 16 | 82 | 0 | 146 | 146 | -0.40 | 1.03 | 0.33 |
| Prepared Data | 2172 | 81 | 15 | 82 | 21 | 135 | 114 | -0.22 | 0.08 | 0.31 |
The standard errors of the model and the coefficients of its component variables can be used to compute confidence intervals (CIs), prediction intervals (PIs), and perform hypothesis tests on the coefficients. However, such efforts are beyond the scope of this assignment.
The predicted wins in the evaluation data make sense. They have a mean and median of 81 and show little skew. The predictions range between 50 and 104, which is reasonable but less varied than both the original training data and the data we processed to build the model. Judging by these key statistics alone, our model performs well. After exploring the unique challenges of the data due to outliers and missing data, we improved our data through a new field (singles), the elimination of 4 columns and 104 rows, and imputation of the remaining missing values. After developing 5 distinct models using backward iterations, forward iterations, additional value combinations, Box-Cox transformations and sabermetrics, we selected the best model. While not perfect, the resulting predictions are a step forward in predicting how often a baseball team will win.
The full set of predicted TARGET_WINS can be found at the following web link: https://github.com/spsstudent15/2016-02-621-W1/blob/master/HW1-PRED-EVAL-WINS-ONLY.csv
————————————————————————————————————————–
Diez, D. M., Barr, C. D., & Cetinkaya-Rundel, M. (2015). OpenIntro Statistics, Third Edition. Open Source. Print.
Faraway, J. J. (2015). Extending Linear Models with R, Second Edition. Boca Raton, FL: Chapman & Hall/CRC. Print.
Faraway, J. J. (2015). Linear Models with R, Second Edition. Boca Raton, FL: Chapman & Hall/CRC. Print.
Fox, J. (2016). Applied Regression Analysis and Generalized Linear Models, Third Edition. Los Angeles, CA: Sage. Print.
Sheather, S. J. (2009). A Modern Approach to Regression with R. New York, NY: Springer. Print.