————————————————————————————————————————–

Part 1: Data Exploration

Data Summary

The original Training data set contains 17 variables and 2276 observations. Of those 17 variables, INDEX is simply an index value used for sorting, while TARGET_WINS is the response variable for our regression models. The remaining 15 variables are all potential predictors for our linear models. A summary table for the data set is provided below.

variables	n	mean	sd	med	min	max	range	skew	kurtosis	se	NAs
TARGET_WINS	2276	81	16	82	0	146	146	-0.40	1.03	0.33
TEAM_BATTING_H	2276	1470	145	1454	891	2554	1663	1.57	7.28	3.03
TEAM_BATTING_2B	2276	241	47	238	69	458	389	0.22	0.01	0.98
TEAM_BATTING_3B	2276	55	28	47	0	223	223	1.11	1.50	0.59
TEAM_BATTING_HR	2276	100	61	102	0	264	264	0.19	-0.96	1.27
TEAM_BATTING_BB	2276	502	123	512	0	878	878	-1.03	2.18	2.57
TEAM_BATTING_SO	2174	736	240	750	0	1399	1399	-0.30	-0.32	5.33	102
TEAM_BASERUN_SB	2145	125	88	101	0	697	697	1.97	5.49	1.90	131
TEAM_BASERUN_CS	1504	53	23	49	0	201	201	1.98	7.62	0.59	772
TEAM_BATTING_HBP	191	59	13	58	29	95	66	0.32	-0.11	0.94	2085
TEAM_PITCHING_H	2276	1779	1407	1518	1137	30132	28995	10.33	141.84	29.49
TEAM_PITCHING_HR	2276	106	61	107	0	343	343	0.29	-0.60	1.28
TEAM_PITCHING_BB	2276	553	166	536	0	3645	3645	6.74	96.97	3.49
TEAM_PITCHING_SO	2174	818	553	813	0	19278	19278	22.17	671.19	11.86	102
TEAM_FIELDING_E	2276	246	228	159	65	1898	1833	2.99	10.97	4.77
TEAM_FIELDING_DP	1990	146	26	149	52	228	176	-0.39	0.18	0.59	286

At first glance, this table shows missing values in 6 fields (most severely TEAM_BATTING_HBP and TEAM_BASERUN_CS). In addition, several variables, such as TEAM_PITCHING_H, TEAM_PITCHING_BB and TEAM_PITCHING_SO, exhibit extreme skew and kurtosis. The box plot visualizes some significant outliers in several data columns, especially in TEAM_PITCHING_H and TEAM_PITCHING_SO.
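As a rough illustration of where the n, se, skew and NAs columns come from, the sketch below summarizes a single column in Python. It assumes a simple moment-based skewness and uses `None` to stand in for NA; the exact skewness variant behind the table above is not stated, so the numbers may differ slightly from those reported.

```python
import math
import statistics

def describe(values):
    """Summarize one column; None marks a missing value (NA)."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    sd = statistics.stdev(present)          # sample standard deviation
    # Moment-based skewness (an assumption, see the note above).
    m2 = sum((v - mean) ** 2 for v in present) / n
    m3 = sum((v - mean) ** 3 for v in present) / n
    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "se": sd / math.sqrt(n),            # standard error of the mean
        "skew": m3 / m2 ** 1.5,
        "NAs": len(values) - n,
    }

# Hypothetical toy column, not taken from the data set.
toy = [60, 75, 81, 82, 90, None, 146]
summary = describe(toy)
```

A single large value (146 here) is enough to pull the skew positive, which is the same pattern the pitching variables show on a much larger scale.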

Correlation Plot

Using the cor function across the data frame, we notice some strong correlations. TEAM_BATTING_H is naturally collinear with TEAM_BATTING_2B, TEAM_BATTING_3B and TEAM_BATTING_HR, since doubles, triples and home runs are all subsets of hits. TEAM_BATTING_BB and TEAM_PITCHING_BB are strongly correlated, as are TEAM_PITCHING_HR and TEAM_BATTING_HR. Since we are focusing on wins, the following table shows each variable’s correlation with TARGET_WINS when the NA’s are omitted:

Value	Correlation with Wins
TEAM_BATTING_H	0.46994665
TEAM_BATTING_2B	0.31298400
TEAM_BATTING_3B	-0.12434586
TEAM_BATTING_HR	0.42241683
TEAM_BATTING_BB	0.46868793
TEAM_BATTING_SO	-0.22889273
TEAM_BASERUN_SB	0.01483639
TEAM_BASERUN_CS	-0.17875598
TEAM_BATTING_HBP	0.07350424
TEAM_PITCHING_H	0.47123431
TEAM_PITCHING_HR	0.42246683
TEAM_PITCHING_BB	0.46839882
TEAM_PITCHING_SO	-0.22936481
TEAM_FIELDING_E	-0.38668800
TEAM_FIELDING_DP	0.13168916
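The correlations above depend on how missing values are dropped. The following Python sketch computes a Pearson correlation using only the rows where both values are present. This pairwise handling is an assumption made for illustration and may differ from the way NA’s were omitted for the table.

```python
import math

def pearson_complete(xs, ys):
    """Pearson correlation using only rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy columns; None stands in for NA.
hits = [1400, 1450, None, 1500, 1600]
wins = [70, 78, 81, 85, 95]
r = pearson_complete(hits, wins)
```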

Conclusion of Data Exploration

As a result of missing data, severe outliers, and collinearity there is a clear need for data preparation and transformation.

————————————————————————————————————————–

Part 2: Data Preparation

Our data preparation efforts for the training data set include the creation of one new derived variable, removing four predictor variables, imputing values for the remaining variables that had missing values (NA’s), and removal of a relatively small number of records that contained clearly egregious outlier values for particular variables. The results of these efforts were subsequently used as the basis for each of the five different linear models we created and evaluated.

Step 1: New Variable Creation

We began our data preparation efforts by creating a new variable, TEAM_BATTING_1B, which represents offensive single-base hits. It is created by subtracting the TEAM_BATTING doubles, triples and home runs from the TEAM_BATTING_H variable. We believe that separating singles from the other hit types will minimize collinearity. The TEAM_BATTING_H variable is then removed from the data set since it is simply a linear combination of its component variables.

Step 2: Removal of Variables Due to Collinearity and/or Missing Values

The results of our data exploration efforts lead us to drop three other variables from the data set:

TEAM_BATTING_HBP: The TEAM_BATTING_HBP variable has very little correlation with the TARGET_WINS response variable and also contains 2085 missing values out of a total of 2276 observations. Since it would be very difficult to accurately impute such a large proportion of any variable’s missing values, we choose to exclude the variable from our analysis.
TEAM_BASERUN_CS: This variable is strongly correlated (65.5%) with the TEAM_BASERUN_SB variable and is the 2nd largest source of NA’s in our data set. These combined facts lead us to exclude the variable from our analysis.
TEAM_PITCHING_HR: This variable is 97% correlated with TEAM_BATTING_HR. In fact, 815 cases (more than 35% of our total cases) have IDENTICAL values for pitched and batted HR’s. This high degree of correlation may be due to the time series nature of the data: as baseball evolved, more home runs were hit, which naturally causes the number of pitched home runs to increase. The statistics are basically opposite sides of the same coin to a large degree (even if there may be some variability between individual teams in any given year). The fact that these two variables are nearly perfectly correlated indicates that one of them can legitimately be removed from the data set, and we chose TEAM_PITCHING_HR since we believe the batting HR metric will be more predictive of TARGET_WINS than will the pitching HR statistic.
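To give a sense of scale for these correlations, the sketch below uses the two-predictor special case of the variance inflation factor, VIF = 1 / (1 - r^2). This is only an approximation for illustration: in a full model the VIF depends on all predictors jointly, not on a single pairwise correlation.

```python
def vif_from_correlation(r):
    """VIF of a predictor whose only companion predictor correlates r with it."""
    return 1.0 / (1.0 - r ** 2)

vif_hr = vif_from_correlation(0.97)    # TEAM_PITCHING_HR vs TEAM_BATTING_HR
vif_cs = vif_from_correlation(0.655)   # TEAM_BASERUN_CS vs TEAM_BASERUN_SB
```

Under this simplification the home run pair gives a VIF of roughly 17, far above the common rule-of-thumb limits of 5 or 10, while the caught-stealing pair stays below 2. That is consistent with dropping TEAM_BASERUN_CS mainly for its missing values rather than for its correlation alone.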

Step 3: Imputation of Missing Values (NA’s)

Filling Missing Values in the Training Data Set

After removing these variables, our next step is to impute the remaining missing data. To do this we use a linear regression approach recommended by Faraway (p.201) and Fox (p.611). We do not use the mean or median as a replacement value for NA’s, since regression yields imputed values that are much more consistent with the actual distribution of the data while introducing much less potential bias.

In the process of building each imputation model, we check that there are no collinearity issues and that all p-values are < \(.05\). Each model produces imputed distributions of the subject variables that are consistent with those of the original NA-populated data. It is our belief that this consistency indicates that the resulting predicted values for the missing values are an improvement over simply filling the NA’s with a mean or median. The replacement of the NA’s with numerical values allows us to run our final models on all records, not just those without NA’s. For consistency we use the same approach with the evaluation data.

The variables with imputation regression models are described below:

TEAM_BATTING_SO: The adjusted \(R^2\) value for this regression model is 0.7223, and the model yields a distribution matching that of the variable prior to the NA’s replacement. For this predictor we impute a total of 102 missing values via regression.
TEAM_PITCHING_SO: The adjusted \(R^2\) value for this regression model is 0.9952. We impute 102 missing values for strikeouts.
TEAM_BASERUN_SB: The adjusted \(R^2\) value for this regression model is 0.3427. Despite the adjusted \(R^2\) being low relative to the models described above, the model yields a distribution matching that of the variable beforehand. Our model replaces 131 missing stolen base values.
TEAM_FIELDING_DP: The adjusted \(R^2\) for this model is 0.3904. We impute 286 missing values with a similar distribution to the previous data.
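The idea behind regression imputation can be sketched in a few lines of Python. This is a deliberately reduced version with a single predictor and hypothetical values; the imputation models described above may use several predictors, and the function names here are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def impute_by_regression(target, predictor):
    """Fill None entries of target with predictions from predictor."""
    observed = [(x, y) for x, y in zip(predictor, target) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed],
                                [y for _, y in observed])
    return [y if y is not None else intercept + slope * x
            for x, y in zip(predictor, target)]

# Hypothetical toy values: pitching strikeouts with one NA,
# batting strikeouts complete.
batting_so = [600, 700, 800, 900]
pitching_so = [650, 750, None, 950]
filled = impute_by_regression(pitching_so, batting_so)
```

Observed values are left untouched; only the missing entry is replaced by the fitted value, which is why the imputed distribution tends to track the observed one more closely than a constant mean or median would.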

Step 4: Removal of Extreme Outliers From the Training Data Set

Our final data processing step is to eliminate some clearly egregious outliers identified via research through baseball-almanac.com, as suggested by Sheather (p. 57). For example, the record for the most pitching strikeouts in a single season is 1450 by the 2014 Cleveland Indians. Therefore we know that any records having TEAM_PITCHING_SO values above that point are aberrations.

Similarly, the most errors by a team in a single season is 639, by Philadelphia in 1883. Prorating to 162 games, we calculate that we should discard any records containing TEAM_FIELDING_E values above 1046.

The TEAM_PITCHING_H variable also appears to have numerous egregious outliers. For example, the most offensive hits by a team in a single season is 1730. As such, it is highly unlikely that any pitching staff would surrender more than 3000 hits in a single season. Such a total would indicate the team allows more than 18 hits per game. Therefore, any records having a TEAM_PITCHING_H value > 3000 are removed from the data set.

As a result of this research, we feel confident in removing 104 records with egregious outliers that are impossible from a historical perspective. Using this SME knowledge will help to normalize our data and improve the expected performance of our linear models.
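The three cutoffs can be expressed as a simple filter. In the Python sketch below, the 99-game length of Philadelphia's 1883 season is an assumption that is not stated in the text; it is used because it reproduces the 1046 threshold when rounded up. The example records are hypothetical.

```python
import math

# Assumption: Philadelphia played 99 games in 1883 (not stated in the text).
GAMES_1883 = 99
MAX_ERRORS = math.ceil(639 * 162 / GAMES_1883)   # prorated to 162 games
MAX_PITCHING_SO = 1450
MAX_PITCHING_H = 3000

def is_plausible(row):
    """Return True when a record stays within all three historical limits."""
    return (row["TEAM_PITCHING_SO"] <= MAX_PITCHING_SO
            and row["TEAM_FIELDING_E"] <= MAX_ERRORS
            and row["TEAM_PITCHING_H"] <= MAX_PITCHING_H)

# Hypothetical records for illustration only.
rows = [
    {"TEAM_PITCHING_SO": 813, "TEAM_FIELDING_E": 159, "TEAM_PITCHING_H": 1518},
    {"TEAM_PITCHING_SO": 19278, "TEAM_FIELDING_E": 159, "TEAM_PITCHING_H": 1518},
]
kept = [r for r in rows if is_plausible(r)]
```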

Results of Imputation for Missing Values and Outlier Removal Process

The charts below show that our data transformation process dramatically improves the data. There are still a few outliers, but on a dramatically smaller scale, with a particularly significant change for the TEAM_PITCHING_H and TEAM_PITCHING_SO variables.

In addition, the table below shows how much the skew and kurtosis have improved in comparison with the original data set.

variables	n	mean	sd	med	min	max	range	skew	kurtosis	se
TARGET_WINS	2172	81	15	82	21	135	114	-0.22	0.08	0.31
TEAM_BATTING_2B	2172	242	46	239	118	458	340	0.23	-0.13	0.98
TEAM_BATTING_3B	2172	54	27	47	11	190	179	0.99	0.65	0.58
TEAM_BATTING_HR	2172	103	59	107	3	264	261	0.16	-0.93	1.27
TEAM_BATTING_BB	2172	516	100	518	73	878	805	-0.32	0.96	2.15
TEAM_BATTING_SO	2172	744	226	745	252	1399	1147	0.06	-0.98	4.85
TEAM_BASERUN_SB	2172	131	93	103	18	697	679	1.75	3.91	1.99
TEAM_PITCHING_H	2172	1575	256	1508	1137	2960	1823	2.10	5.88	5.48
TEAM_PITCHING_BB	2172	551	107	538	144	1123	979	0.70	1.40	2.29
TEAM_PITCHING_SO	2172	789	223	797	301	1434	1133	0.15	-0.63	4.79
TEAM_FIELDING_E	2172	213	148	155	65	965	900	2.18	4.55	3.18
TEAM_FIELDING_DP	2172	143	28	146	56	228	172	-0.30	-0.14	0.59
TEAM_BATTING_1B	2172	1061	102	1046	811	1656	845	0.89	1.49	2.20

Our training data set with the NA’s filled and the outliers removed can be found here:

https://github.com/spsstudent15/2016-02-621-W1/blob/master/621-HW1-Clean-Data.csv

Step 5: Other Data Preparation Transformations: Refer to Model Descriptions

We did use other model-specific data transformations, including Box-Cox power transforms and linear combinations of variables. These model-specific transformations are discussed within the individual model writeups provided in Part 3.

————————————————————————————————————————–

Part 3: Build Models

Model 1: General Model Using Backward Selection

Approach:

Our first model applies simple Backward Selection methods through the use of p-values and variance inflation factors (VIF) against all 12 remaining predictor variables. Simply removing the TEAM_BATTING_1B variable yields a model with all p-values less than \(.05\). However, VIF analysis shows evidence of multiple collinear variables within the model. Subsequent removals of TEAM_PITCHING_SO and TEAM_PITCHING_BB due to collinearity yield a model calling for the removal of TEAM_BATTING_2B on the basis of its p-value.

The final model of these iterations shows clear evidence of a number of outliers in R’s summary diagnostic plots. We remove these outliers via a series of additional iterations, yielding the following final model (which once again includes TEAM_BATTING_2B, as the previous step improved the statistical significance of the variable):

Coefficient	Variable
66.261	Intercept
- 0.017	TEAM_BATTING_2B
+ 0.150	TEAM_BATTING_3B
+ 0.109	TEAM_BATTING_HR
+ 0.022	TEAM_BATTING_BB
- 0.019	TEAM_BATTING_SO
+ 0.065	TEAM_BASERUN_SB
+ 0.016	TEAM_PITCHING_H
- 0.075	TEAM_FIELDING_E
- 0.109	TEAM_FIELDING_DP

RSE	R^2	Adj. R^2	F Stat.	MSE
11.49	0.3598	0.3572	134.4	132

Additional Iterations:

However, the diagnostic plots of that model show a lack of linearity between the response variable TARGET_WINS and the predictor variable TEAM_FIELDING_E. Furthermore, the plots of standardized residuals against each of the predictor variables show evidence of non-constant variability for variables such as TEAM_BATTING_HR, TEAM_BATTING_SO, TEAM_BASERUN_SB, and TEAM_FIELDING_E. Therefore, we transform the TEAM_FIELDING_E variable using a Box-Cox recommended power transform of (-1), or (1/y), and rebuild the model. The resulting Added Variable plots show that all predictors are linearly related to the response, and we see an improvement in the variability of the residuals relative to TEAM_FIELDING_E. Furthermore, the plot of Y against the fitted values shows an improvement in the linearity of the model.

The characteristic equation for this improved model is as follows:

Coefficient	Variable
52.88	Intercept
+ 0.168	TEAM_BATTING_3B
+ 0.096	TEAM_BATTING_HR
+ 0.027	TEAM_BATTING_BB
- 0.027	TEAM_BATTING_SO
+ 0.034	TEAM_BASERUN_SB
+ 0.004	TEAM_PITCHING_H
+ 3252.31	1/TEAM_FIELDING_E
- 0.102	TEAM_FIELDING_DP

RSE	R^2	Adj. R^2	F Stat.	MSE
11.86	0.3168	0.3143	124.8	141

Conclusions:

The coefficients for TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_SO, and TEAM_BASERUN_SB all make sense intuitively. The TEAM_FIELDING_DP coefficient surprises since baseball fans believe that more defensive double plays will improve a team’s chances of winning games. However, the variable itself is negatively correlated with TARGET_WINS (see the Data Exploration section), validating the negative coefficient. Similarly, the coefficient for TEAM_PITCHING_H is also counterintuitive, but the variable is actually positively correlated with TARGET_WINS. Finally, TEAM_FIELDING_E changes from negative in the earlier model to positive here. However, the coefficient now applies to the transformed version of the variable rather than the nominal values of the variable.
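The sign change for TEAM_FIELDING_E is easier to see numerically. The Python sketch below evaluates the reciprocal term of the transformed model at two neighboring error counts, using the reported coefficient of 3252.31 and the prepared-data median of 155 errors as an example point.

```python
COEF_RECIPROCAL_E = 3252.31   # coefficient on 1/TEAM_FIELDING_E (transformed model)

def error_term(errors):
    """Contribution of fielding errors to predicted wins in the transformed model."""
    return COEF_RECIPROCAL_E / errors

# Effect of one additional error at 155 errors (the prepared-data median).
delta = error_term(156) - error_term(155)
```

Here `delta` is about -0.13 wins, so the positive coefficient on 1/TEAM_FIELDING_E still implies that more errors reduce predicted wins. Because the term is nonlinear, the penalty per error shrinks as the error count grows.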

While this model is an improvement over earlier iterations, we still see component variables that appear to lack constant variability relative to the residuals for variables such as TEAM_BASERUN_SB. The lack of constant variability in the residuals is likely related to the skewed nature of the distributions of those individual variables. In our next models we attempt to address some of the skew issues by creating combinations of various variables.

————————————————————————————————————————–

Model 2: Total Bases

Approach:

This model employs a linear combination of four of the predictor variables to calculate the baseball statistic known as “Total Bases”. Total Bases is calculated using what our data set refers to as “TEAM_BATTING” variables as follows:

Singles + (2 * Doubles) + (3 * Triples) + (4 * Home Runs)

Inclusion of this new variable allows us to eliminate the four component variables from the model. In fact, the TOTAL_BASES variable appears to be nearly normally distributed, thereby negating the skew issues that were evident with its component variables.
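A minimal Python sketch of the derived variables follows. The function names are illustrative, and the example row uses the rounded column means from the prepared-data summary table rather than any actual team season.

```python
def singles(hits, doubles, triples, home_runs):
    """TEAM_BATTING_1B: hits that were not doubles, triples, or home runs."""
    return hits - doubles - triples - home_runs

def total_bases(singles_, doubles, triples, home_runs):
    """TOTAL_BASES: each hit weighted by the number of bases it earns."""
    return singles_ + 2 * doubles + 3 * triples + 4 * home_runs

# Rounded column means from the prepared-data summary, used as an example row.
tb = total_bases(1061, 242, 54, 103)
```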

This model applies simple Backward Selection methods through the use of p-values and variance inflation factors (VIF) against a derived value for total bases and the remaining 8 predictors. Three iterations of p-value / VIF backward selection remove TEAM_PITCHING_SO and TEAM_PITCHING_BB from the model. All other variables remain statistically significant with no significant collinearity. However, evidence of multiple outliers is found through R’s summary diagnostic plots, forcing several additional iterations resulting in the following model:

Coefficient	Variable
48.486	Intercept
+ 0.022	TEAM_BATTING_BB
- 0.015	TEAM_BATTING_SO
+ 0.063	TEAM_BASERUN_SB
+ 0.010	TEAM_PITCHING_H
- 0.064	TEAM_FIELDING_E
- 0.117	TEAM_FIELDING_DP
+ 0.018	TOTAL_BASES

RSE	R^2	Adj. R^2	F Stat.	MSE
11.7	0.3365	0.3343	156	137

Transformation Iterations

Once again, the diagnostic plots of that model show a lack of linearity between the response variable TARGET_WINS and one of the predictor variables (TEAM_FIELDING_E). Furthermore, the plots of standardized residuals against each of the predictor variables show evidence of non-constant variability for variables such as TEAM_BATTING_SO, TEAM_BASERUN_SB, and TEAM_FIELDING_E. Using a Box-Cox recommended power transform of (-1), or (1/y), we transform TEAM_FIELDING_E and create a new model. The resulting Added Variable plots show that all predictors are linearly related to the response, and the variability of the residuals improves. Furthermore, the plot of Y against the fitted values shows an improvement in the linearity of the model. Therefore, this model appears to be an improvement over the first TOTAL_BASES model, and the equation indicated by the model is as follows:

Coefficient	Variable
39.164	Intercept
+ 0.025	TEAM_BATTING_BB
- 0.025	TEAM_BATTING_SO
+ 0.038	TEAM_BASERUN_SB
+ 2714.54	1/TEAM_FIELDING_E
- 0.115	TEAM_FIELDING_DP
+ 0.0197	TOTAL_BASES

RSE	R^2	Adj. R^2	F Stat.	MSE
11.97	0.3048	0.3029	157.5	143

Conclusions:

Like the first model, the coefficients for TEAM_BATTING_BB, TEAM_BATTING_SO, TEAM_BASERUN_SB, and TOTAL_BASES all make sense intuitively. The TEAM_FIELDING_DP coefficient is negative, matching the correlation with TARGET_WINS. However, the coefficient for TEAM_FIELDING_E changes from negative in the earlier model to positive here, as it now applies to the reciprocal of the variable.

————————————————————————————————————————–

Model 3: Total Bases PLUS

Our third model improves upon the “Total Bases” model by extending the TOTAL_BASES variable to include the TEAM_BATTING_BB and TEAM_BASERUN_SB variables, as both represent basepath advancements by a team’s offense. “Total Bases Plus” (TB_PLUS) is calculated using TEAM_BATTING and TEAM_BASERUN variables as follows:

Singles + (2 * Doubles) + (3 * Triples) + (4 * Home Runs) + BB + SB

Including this new variable allows us to eliminate the two additional component variables from the model. In fact, the TB_PLUS variable, like the TOTAL_BASES variable from model 2, appears to be nearly normally distributed, thereby negating any skew issues evident in its component variables. A histogram of the distribution of the derived TB_PLUS variable is shown below:

This model also applies simple Backward Selection methods through the use of p-values and variance inflation factors (VIF) against the derived value for TB_PLUS and the remaining 6 predictor variables. Four iterations of p-value / VIF backward selection remove TEAM_PITCHING_H, TEAM_PITCHING_SO and TEAM_PITCHING_BB from the model. All other variables remain statistically significant with no significant collinearity. However, once again we find evidence of multiple outliers via R’s summary diagnostic plots, and a series of additional iterations yields the following final model:

Coefficient	Variable
52.330	Intercept
- 0.016	TEAM_BATTING_SO
- 0.034	TEAM_FIELDING_E
- 0.154	TEAM_FIELDING_DP
+ 0.025	TB_PLUS

RSE	R^2	Adj. R^2	F Stat.	MSE
12.12	0.2944	0.2931	225.5	145

Transformation Iterations

The coefficients for TEAM_BATTING_SO and TEAM_FIELDING_E make sense intuitively: the more strikeouts a team’s offense has, the less likely it is to put the ball in play, and the more fielding errors a team commits, the more likely they are to lose games. We see the same negative trend with TEAM_FIELDING_DP as in models 1 and 2. Most importantly, the coefficient for TB_PLUS positively correlates with the response variable as expected. As with the first two models, the diagnostic plots for this approach unfortunately show a lack of linearity between the response variable TARGET_WINS and the predictor variable TEAM_FIELDING_E. Furthermore, the plots of standardized residuals against each of the predictor variables demonstrate evidence of non-constant variability for the variables TEAM_BATTING_SO and TEAM_FIELDING_E.

As in model 2, we transform the TEAM_FIELDING_E predictor using its reciprocal. The resulting Added Variable plots show that all predictors are linearly related to the response, and we find an improvement in the variability of the residuals relative to TEAM_FIELDING_E. Furthermore, the plot of Y against the fitted values shows a non-skewed linear relationship. The characteristic equation indicated by the model is as follows:

Coefficient	Variable
42.160	Intercept
- 0.023	TEAM_BATTING_SO
+ 2366.82	1/TEAM_FIELDING_E
- 0.140	TEAM_FIELDING_DP
+ 0.022	TB_PLUS

RSE	R^2	Adj. R^2	F Stat.	MSE
12.13	0.2932	0.2919	223.3	147

Conclusion:

This model is an improvement over the first TB_PLUS model when the residual plots are considered, and the number of predictor variables used is two fewer than that of the “Total Bases” model. As in model 2, the coefficient for TEAM_FIELDING_E changes from negative to positive, and once again this is due to the fact that the coefficient now applies to the transformed version of the variable rather than the nominal values of the variable.

————————————————————————————————————————–

Model 4: Sabermetrics Model

Approach

Sabermetrics has become the rage in baseball, popularized by Billy Beane and the data set we are exploring. As a result, we built a model that centers on one of these advanced analytics, known as BsR or base runs. This statistic (designed by David Smyth in the 1990’s) estimates the number of runs a team SHOULD score, adding an intriguing element to a data set which does not include runs (see http://tangotiger.net/wiki_archive/Base_Runs.html for more information). The formula for BsR is as follows:

BSR = A * B / (B + C) + D, where:

A = TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_BB
B = 1.02 * (1.4 * TEAM_TOTAL_BASES - 0.6 * TEAM_BATTING_H + 0.1 * TEAM_BATTING_BB)
C = AT BATS - TEAM_BATTING_H (which we approximated with 3 * TEAM_BATTING_H, as the average batting average is around 0.250)
D = TEAM_BATTING_HR

Since we eliminate the value of TEAM_BATTING_H, we sum up singles, doubles, triples and home runs in the actual code, and the approach for TEAM_TOTAL_BASES is described in model 2. The data for BSR exhibit a fairly normal distribution.
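The BSR definition above translates directly into code. The Python sketch below follows the formula exactly as written in this section, including the 3 * TEAM_BATTING_H approximation for C, and rebuilds hits from their components. The function name is illustrative, and the example inputs are the rounded column means from the prepared-data summary, not a real team season.

```python
def base_runs(singles, doubles, triples, home_runs, walks):
    """BSR as defined in this section, with AB - H approximated by 3 * H."""
    hits = singles + doubles + triples + home_runs
    tb = singles + 2 * doubles + 3 * triples + 4 * home_runs
    a = singles + doubles + triples + walks
    b = 1.02 * (1.4 * tb - 0.6 * hits + 0.1 * walks)
    c = 3 * hits
    d = home_runs
    return a * b / (b + c) + d

# Rounded column means from the prepared-data summary as an example row.
bsr = base_runs(1061, 242, 54, 103, 516)
```

For this example row the estimate is roughly 726 runs, which is a plausible season total for an average team.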

Since BSR is a combination of all of the batting variables, we eliminate them from the regression, resulting in a very strong model on the first iteration. All p-values are very low, and the variance inflation factors (VIF) are all below 5, showing no problems with collinearity. The characteristic equation indicated by the model is as follows:

Coefficient	Variable
40.687320	Intercept
+ 0.062189	BSR
- 0.116615	TEAM_FIELDING_DP
- 0.058885	TEAM_FIELDING_E
+ 0.060347	TEAM_BASERUN_SB
- 0.011457	TEAM_PITCHING_SO
+ 0.019419	TEAM_PITCHING_H
- 0.017603	TEAM_PITCHING_BB

RSE	R^2	Adj. R^2	F Stat.	MSE
11.99	0.3229	0.3207	147.4	144

Conclusion:

The coefficients for this model overall do make sense. Errors and pitching walks contribute to fewer wins, and stolen bases and the BSR metric have a strong influence on increasing wins. Double plays do have a slightly negative value, although this could be explained by a team allowing a large number of baserunners (and, as mentioned above, it matches the correlation with wins). The positive impact of allowing pitching hits is puzzling, but once again it agrees with the trends we see in the data provided. Overall, there is a lot to like about this model as well.

————————————————————————————————————————–

Model 5: Box-Cox First

Approach:

This model began with transformations to a simple regression of each predictor against wins, showing improvements in the normality of the distribution and more uniformly distributed residuals. In each case, we apply a Box-Cox transformation to improve the skew of predictor variables BEFORE creating the multi-regression model. The model then applies a simple Forward Selection strategy, adding variables two-at-a-time until none can be found that improve the model as measured by adjusted \(R^2\). We also derive SLUGGING and FIELDING_YIELD variables by combining some of the variables.

Pre-model transformations of individual predictor variables are as follows:

TEAM_PITCHING_BB (Box-Cox transform) \(\lambda = 1/6\) => y^(1/6)
TEAM_BATTING_1B (derived variable) => TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR; (Box-Cox transform) \(\lambda = -2\) => 1/(y^2)
SLUGGING (derived variable) => 2 * TEAM_BATTING_3B + TEAM_BATTING_HR; (Box-Cox transform) \(\lambda = 3/5\) => y^(3/5)
TEAM_BASERUN_SB (Box-Cox transform) \(\lambda = -1/25\) => y^(-1/25)
FIELDING_YIELD (derived variable) => TEAM_FIELDING_E + TEAM_FIELDING_DP; (Box-Cox transform) \(\lambda = -9/10\) => y^(-9/10)
TEAM_PITCHING_SO (Box-Cox transform) \(\lambda = 2/3\) => y^(2/3)
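To illustrate why a power transform helps with skew, the Python sketch below applies the standard Box-Cox form, (y^lambda - 1) / lambda, to a hypothetical right-skewed column and compares a moment-based skewness before and after. The data are invented for the example. For positive lambda the standard form differs from the plain power y^lambda listed above only by a shift and a rescaling, so the skewness is the same; for negative lambda the plain power additionally reverses the ordering of the values.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y for power lam."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

def skewness(values):
    """Moment-based skewness of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed column, not taken from the data set.
raw = [100, 120, 130, 150, 200, 400, 900]
transformed = [box_cox(v, 1 / 6) for v in raw]

skew_before = skewness(raw)
skew_after = skewness(transformed)
```

In this toy case the skewness drops but does not vanish, which mirrors the report's observation that the transformed predictors are improved rather than perfectly symmetric.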

Multiple iterations backwards and forwards, based on p-values and VIF values, result in a model whose diagnostic plots show relatively good linearity between the response variable TARGET_WINS and the 6 predictor variables. Furthermore, the plots of standardized residuals against each of the predictor variables show evidence of relatively uniform variability for each variable except the derived predictor FIELDING, which has two clusters. The model equation is as follows:

Coefficient	Variable
78.730	Intercept
+ 28.760	TEAM_PITCHING_BB^(1/6)
- 21.360	1/(TEAM_BATTING_1B^2)
+ 1.450	SLUGGING^(3/5)
- 125.800	TEAM_BASERUN_SB^(-1/25)
+ 3.747e6	FIELDING^(-9/10)
- 0.1037	TEAM_PITCHING_SO^(2/3)

RSE	R^2	Adj. R^2	F Stat.	MSE
11.97	0.3142	0.3123	164.2	143

Conclusion:

The coefficients for this model are counterintuitive, largely due to the transformations applied to each predictor variable. For example, one would think TEAM_BATTING_1B would correlate positively with winning, yet its coefficient is strongly negative. The same is true of TEAM_BASERUN_SB and FIELDING. The sign and magnitude of each coefficient are the result of the transformations as well as the linear model. This does not invalidate the model; it simply means the coefficients become less useful as a check of the model’s fidelity.

————————————————————————————————————————–

Part 4. Select Models

Step 1: Compare Key statistics

The chart below summarizes the model statistics for all five of our models. The models are listed from left to right in accordance with the order in which they were described in Part 3.

Metric	General Model	Total Bases	TB PLUS	Sabermetrics	Box-Cox First
RSE	11.86	11.97	12.12	11.99	11.97
R^2	0.3168	0.3048	0.2994	0.3229	0.3142
Adj. R^2	0.3143	0.3029	0.2931	0.3207	0.3123
F Stat.	124.8	157.5	224.3	147.4	164.2
MSE	141	143	147	144	143

All five of our models converge on similar \(R^2\) values, RSE’s, and MSE’s, and all yield residuals that are distributed normally without significant evidence of highly leveraged outliers. No significant collinearity exists within any of the five models for any of their component predictor variables.

Step 2: Pick the top two

Of the five, we eliminate the General Model as it has the least favorable residual characteristics, with multiple predictors showing non-constant variability relative to the residuals. It also creates a relatively low F-statistic when compared to the other models.

Of the remaining four models, the Total Bases model was a great improvement over the General Model. However, it too displays some lack of constant variability of residuals relative to a couple of the predictor variables, so we can eliminate that one as well.

The Box-Cox First model makes use of a recommended Box-Cox transform for each individual predictor variable before the linear model is constructed via forward selection. While this model yields results similar to the others, we believe it is overly complex with the number of variables that comprise the model and the difficulty we have in explaining the predictor coefficients.

Step 3: Pick the “best” model

Of the remaining two, the Sabermetrics model yields a slightly larger \(R^2\) value, suggesting a slightly better fit. However, the TB Plus model is simpler, as it uses only 4 predictor variables, while possessing a much larger F-statistic. Such a large difference in F-statistics indicates that the TB Plus model is explaining more of the variability of the training data than is the Sabermetrics model. Therefore, we select the TB PLUS model as the basis for our prediction of TARGET_WINS for the Evaluation data set.

Step 4: Apply to the evaluation data

To ensure the model’s efficacy when applied to the Evaluation data set, we apply the same set of transformations used on the Training data set prior to building our individual models. The results of those transformations (which include filling in the missing values) can be found here:

https://github.com/spsstudent15/2016-02-621-W1/blob/master/621-HW1-Clean-EvalData-.csv

Then, the additional transformations required for the TB PLUS model are applied to ensure conformity between the Training and Evaluation data sets.
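As a sketch of what this step involves, the Python example below derives TB_PLUS for one team and applies the transformed TB PLUS equation reported in Part 3. It assumes that the transformed equation is the one carried forward to prediction. The function names are illustrative, and the example team is hypothetical, built from rounded means and medians of the prepared data.

```python
def tb_plus(singles, doubles, triples, home_runs, walks, stolen_bases):
    """Total Bases Plus: total bases extended with walks and stolen bases."""
    return (singles + 2 * doubles + 3 * triples + 4 * home_runs
            + walks + stolen_bases)

def predict_wins(batting_so, fielding_e, fielding_dp, tb_plus_value):
    """Transformed TB PLUS equation with the coefficients reported in Part 3."""
    return (42.160
            - 0.023 * batting_so
            + 2366.82 / fielding_e
            - 0.140 * fielding_dp
            + 0.022 * tb_plus_value)

# Hypothetical team built from rounded means and medians of the prepared data.
team_tb_plus = tb_plus(1061, 242, 54, 103, 516, 131)
wins = predict_wins(744, 155, 143, team_tb_plus)
```

For this roughly average team the equation returns about 81 wins, in line with the mean of TARGET_WINS in the training data.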

The TB PLUS model is then applied to yield a set of INDEX / TARGET_WINS pairs. Since displaying the full set of predicted values would consume a large number of pages, a sample of the first 10 rows of that data set is displayed as an example:

INDEX	TARGET_WINS
9	61
10	66
14	72
47	86
60	66
63	73
74	82
83	71
98	69
120	74

Here are statistics for predicted wins vs. the original training data set and our prepared version from part 2:

Set	n	Mean	sd	Median	Min	Max	Range	Skew	Kurtosis	SE
Predicted Model	259	81	9	81	50	104	54	-0.05	0.35	0.55
Training Data	2276	81	16	82	0	146	146	-0.40	1.03	0.33
Prepared Data	2172	81	15	82	21	135	114	-0.22	0.08	0.31

Using the Model for Inference

The standard errors of the model and the coefficients of its component variables can be used to compute confidence intervals (CIs), prediction intervals (PIs), and perform hypothesis tests on the coefficients. However, such efforts are beyond the scope of this assignment.

Conclusion:

The predicted wins in the evaluation data make sense. They have a mean and median of 81 and show little skew. The predictions range between 50 and 104, which is reasonable but not as varied as the values in both the original training data and the data we processed to build the model. Based just on the key statistics, our model performs well. After exploring the unique challenges of the data due to outliers and missing data, we improved our data through a new field (singles), elimination of 3 columns and 104 rows, and imputation of the remaining missing values. After developing 5 distinct models using backward iterations, forward iterations, additional value combinations, Box-Cox transformation and sabermetrics, we selected the best model. While not perfect, the resulting predictions are a step forward in predicting how often a baseball team will win.

The full set of predicted TARGET_WINS can be found at the following web link: https://github.com/spsstudent15/2016-02-621-W1/blob/master/HW1-PRED-EVAL-WINS-ONLY.csv

————————————————————————————————————————–

References

Bibliography

Diez, D. M., Barr, C. D., & Cetinkaya-Rundel, M. (2015). OpenIntro Statistics, Third Edition. Open Source. Print.

Faraway, J. J. (2015). Extending linear models with R, Second Edition. Boca Raton, FL: Chapman & Hall/CRC. Print.

Faraway, J. J. (2015). Linear models with R, Second Edition. Boca Raton, FL: Chapman & Hall/CRC. Print.

Fox, J. (2016). Applied Regression Analysis and Generalized Linear Models, Third Edition. Los Angeles, CA: Sage. Print.

Sheather, S. J. (2009). A Modern Approach to Regression with R. New York, NY: Springer. Print.

Resource Links

http://www.baseball-almanac.com/

https://raw.githubusercontent.com/spsstudent15/2016-02-621-W1

http://tangotiger.net/wiki_archive/Base_Runs.html

Data 621 Homework 1: Moneyball

Critical Thinking Group 2 - Armenoush Aslanian-Persico, James Topor, Jeff Nieman, Scott Karr
