The goal of this paper is to build an optimal multiple linear regression model for the annual number of wins per team based on predictor values. The data is an annual record of baseball wins per team along with the corresponding explanatory statistics. The paper outlines the procedure used to create three experimental regression models and explains how the best model was selected.
The data has 2276 rows, with each record representing a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162-game season.
The data has 15 attributes, of which 14 represent our predictor variables comprised of annual hitting, pitching, baserunning, and fielding statistics. Additional predictor variables are created and this is explained in section 3.
The response variable is the number of wins, or TARGET_WINS, which has a mean of 80.8 and standard deviation of 15.8. It follows an approximately normal distribution, and we will not treat any values as outliers.
Predictor variables are much more diverse in shape and distribution (see graphical series below). The explanatory/predictor variables are the remaining variables that may have a positive or negative impact on the number of wins.

## Correlations in the data

Correlations between variables were calculated, with particular interest in those > 0.5. One interesting result is that TARGET_WINS is not strongly correlated with any single predictor (no pair involving it exceeds 0.5), whereas pitching and batting home runs are very strongly related (0.97).
## Var1 Var2 value
## 1 TARGET_WINS TARGET_WINS 1.0000000
## 2 TEAM_BATTING_H TEAM_BATTING_H 1.0000000
## 3 TEAM_BATTING_2B TEAM_BATTING_2B 1.0000000
## 4 TEAM_BATTING_3B TEAM_BATTING_3B 1.0000000
## 5 TEAM_BATTING_HR TEAM_BATTING_HR 1.0000000
## 6 TEAM_BATTING_BB TEAM_BATTING_BB 1.0000000
## 7 TEAM_BATTING_SO TEAM_BATTING_SO 1.0000000
## 8 TEAM_BASERUN_SB TEAM_BASERUN_SB 1.0000000
## 9 TEAM_BASERUN_CS TEAM_BASERUN_CS 1.0000000
## 10 TEAM_BATTING_HBP TEAM_BATTING_HBP 1.0000000
## 11 TEAM_PITCHING_H TEAM_PITCHING_H 1.0000000
## 12 TEAM_PITCHING_HR TEAM_PITCHING_HR 1.0000000
## 13 TEAM_PITCHING_BB TEAM_PITCHING_BB 1.0000000
## 14 TEAM_PITCHING_SO TEAM_PITCHING_SO 1.0000000
## 15 TEAM_FIELDING_E TEAM_FIELDING_E 1.0000000
## 16 TEAM_FIELDING_DP TEAM_FIELDING_DP 1.0000000
## 17 TEAM_PITCHING_HR TEAM_BATTING_HR 0.9693714
## 18 TEAM_BATTING_HR TEAM_PITCHING_HR 0.9693714
## 19 TEAM_FIELDING_E TEAM_PITCHING_H 0.6677590
## 20 TEAM_PITCHING_H TEAM_FIELDING_E 0.6677590
## 21 TEAM_BATTING_2B TEAM_BATTING_H 0.5628497
## 22 TEAM_BATTING_H TEAM_BATTING_2B 0.5628497
## 23 TEAM_BATTING_BB TEAM_BATTING_HR 0.5137348
## 24 TEAM_BATTING_HR TEAM_BATTING_BB 0.5137348
## 25 TEAM_FIELDING_E TEAM_BATTING_3B 0.5097784
## 26 TEAM_BATTING_3B TEAM_FIELDING_E 0.5097784
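A listing like the one above can be produced by flattening a correlation matrix into one row per variable pair and filtering on the 0.5 threshold. The following is a minimal sketch on made-up data, not the code used for this report; the `toy` data frame and its column names are assumptions for illustration, and it also drops the self-correlations of 1.0.

```r
# Toy data standing in for the training set (all values are made up).
set.seed(1)
n <- 200
hr_bat <- rnorm(n, mean = 100, sd = 30)
toy <- data.frame(
  WINS        = rnorm(n, mean = 80, sd = 15),
  BATTING_HR  = hr_bat,
  PITCHING_HR = hr_bat + rnorm(n, mean = 0, sd = 5),  # built to track BATTING_HR
  FIELDING_E  = rnorm(n, mean = 150, sd = 40)
)

# Correlation matrix; pairwise deletion tolerates missing values.
cm <- cor(toy, use = "pairwise.complete.obs")

# Flatten to long format: one row per (Var1, Var2) pair.
pairs <- as.data.frame(as.table(cm), responseName = "value")

# Keep pairs above 0.5 and drop the trivial self-correlations.
strong <- subset(pairs,
                 value > 0.5 & as.character(Var1) != as.character(Var2))
strong <- strong[order(-strong$value), ]
print(strong)
```

Each pair appears twice (once in each order), which matches the duplicated rows in the output above.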
Some predictor variables were chosen to evaluate the outliers and distribution of the data. The high number of ‘outliers’ may indicate a need for data transformations. In this case the base hits by batter and the field errors have a large amount of variability in the data.
## Warning: Removed 102 rows containing non-finite values (stat_boxplot).
The data raised some concerns, chiefly missing values. In total there are 3478 missing fields, which amounts to 9.55% of the available data. The following graphic shows that completeness varies widely across variables. This is discussed in greater detail and resolved in section 3 of this document.
For the HBP (hit by pitch) and DP (double plays) columns, the minimum observed values are greater than 0 and a large share of values is missing (10% of DP and 92% of HBP). While many of these values may genuinely be missing, it is unreasonable to believe that the true values are 0. To simplify the process, missing values in these columns were set to 0 only for the purpose of calculating new columns; these are then removed, together with other columns containing missing values (CS, SB, SO), before the model is run.
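The missing-value totals can be computed directly from the data frame. The sketch below uses a small made-up data frame (`toy` is an assumption, not the report's data) to show the overall count, the overall percentage, and the per-column percentage. For reference, 3478 missing fields out of 2276 × 16 cells gives 9.55%.

```r
# Small made-up data frame with some missing values.
toy <- data.frame(
  TEAM_BATTING_H   = c(1400, 1500, NA, 1450),
  TEAM_BATTING_HBP = c(NA, NA, NA, 50),
  TEAM_FIELDING_DP = c(150, NA, 140, 160)
)

# Total number of missing cells and the share of all cells they represent.
total_missing <- sum(is.na(toy))
pct_missing   <- 100 * total_missing / (nrow(toy) * ncol(toy))

# Share of missing values per column, to see which variables are least complete.
pct_by_column <- 100 * colMeans(is.na(toy))

print(total_missing)
print(round(pct_missing, 2))
print(pct_by_column)
```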
Additionally, since (in baseball) the attacking team is the team whose turn it is at bat it might be interesting to see their success rates for those occurrences where a successful hit-at-bat occurs:
TEAM_BATTING_1B was deduced from TEAM_BATTING_H, as the number of singles, calculated through subtracting TEAM_BATTING_3B, TEAM_BATTING_HR, and TEAM_BATTING_2B from TEAM_BATTING_H.
TOTAL_BASES was added to show the total number of bases gained after successfully reaching base, summed per team per year. For instance, a triple (TEAM_BATTING_3B) is worth 3 bases and a walk (TEAM_BATTING_BB) is worth 1 base, while a caught stealing (TEAM_BASERUN_CS) subtracts a base.
TEAM_BATTING_HITSBASE was added to show positive attacking at bat situations (base hits + homeruns + hit by pitch)
TEAM_BATTING_ALLPOS was added to show positive attacking situations in batting and stolen bases
TEAM_BATTING_ALLPOSATTCK was added to show the number of positive attacking situations
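As an illustration of the first two derived columns, the sketch below computes singles and a weighted total-bases figure on two made-up team seasons. The weights follow the description above (single 1, double 2, triple 3, home run 4, walk 1, stolen base 1, caught stealing -1); the numbers in `toy` are assumptions for illustration only.

```r
# Two made-up team seasons.
toy <- data.frame(
  TEAM_BATTING_H  = c(1500, 1400),
  TEAM_BATTING_2B = c(250, 200),
  TEAM_BATTING_3B = c(40, 30),
  TEAM_BATTING_HR = c(150, 100),
  TEAM_BATTING_BB = c(500, 450),
  TEAM_BASERUN_SB = c(100, 80),
  TEAM_BASERUN_CS = c(40, 30)
)

# Singles are hits that are not doubles, triples, or home runs.
toy$TEAM_BATTING_1B <- with(toy,
  TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR)

# Total bases: each event is weighted by the number of bases it gains or loses.
toy$TOTAL_BASES <- with(toy,
  TEAM_BATTING_1B + 2 * TEAM_BATTING_2B + 3 * TEAM_BATTING_3B +
  4 * TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB - TEAM_BASERUN_CS)

print(toy[, c("TEAM_BATTING_1B", "TOTAL_BASES")])
```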
The defending team is mostly effective with the pitcher, so some added variables for pitching could be interesting:
Other columns were added to see a form of success ratio:
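The pitching and ratio columns can be sketched as follows, using the same definitions that are applied to the evaluation data later in this report (TEAM_PITCHING_BADPITCH, TEAM_PITCHING_BASESGIVEN, TEAM_PITCHRATIO, TEAM_POSRATIO, SBPERCENT). The input values in `toy` are made up, and TEAM_BATTING_ALLPOSATTCK is supplied directly here rather than derived.

```r
# Two made-up team seasons; ALLPOSATTCK is given directly for brevity.
toy <- data.frame(
  TEAM_PITCHING_H          = c(1400, 1500),
  TEAM_PITCHING_BB         = c(500, 600),
  TEAM_PITCHING_HR         = c(120, 150),
  TEAM_PITCHING_SO         = c(800, 1000),
  TEAM_BASERUN_SB          = c(90, 60),
  TEAM_BASERUN_CS          = c(30, 40),
  TEAM_BATTING_ALLPOSATTCK = c(3030, 2250)
)

# Pitching outcomes that put a runner on base.
toy$TEAM_PITCHING_BADPITCH   <- toy$TEAM_PITCHING_H + toy$TEAM_PITCHING_BB
toy$TEAM_PITCHING_BASESGIVEN <- toy$TEAM_PITCHING_BADPITCH + toy$TEAM_PITCHING_HR

# Ratio columns: home runs allowed per strikeout, attack versus bases given,
# and the share of steal attempts that succeed.
toy$TEAM_PITCHRATIO <- toy$TEAM_PITCHING_HR / toy$TEAM_PITCHING_SO
toy$TEAM_POSRATIO   <- toy$TEAM_BATTING_ALLPOSATTCK / toy$TEAM_PITCHING_BASESGIVEN
toy$SBPERCENT       <- toy$TEAM_BASERUN_SB /
                       (toy$TEAM_BASERUN_SB + toy$TEAM_BASERUN_CS)

print(toy[, c("TEAM_PITCHRATIO", "TEAM_POSRATIO", "SBPERCENT")])
```

Note that any ratio with a zero denominator (for example, a team with no steal attempts) yields `NaN` or `Inf`, so such rows need separate handling.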
Some columns have been removed from the analysis: TEAM_BASERUN_SB, TEAM_BASERUN_CS, and TEAM_BATTING_HBP. There is a large number of NAs in the HBP column (over 90% of the values are missing), and applying a zero value to them is misleading. By including the values that are present in our newly created columns, we can preserve the value of gaining a base on an attacking play by being hit by a pitch, without the noise. A similar methodology and thought process was used for SB and CS.
Diagnostic plots of the variables were used to discover whether any variables needed a mathematical transformation (e.g. a log or square root):
Skewed variables needed to be adjusted, so the log form of the following variables was used to normalize the data:
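To illustrate why a log transformation helps, the sketch below generates a right-skewed count variable, computes a simple moment-based skewness before and after taking logs, and shows why zero values are a problem. The simulated `errors` vector and the `skew` helper are assumptions for illustration and are not part of the original analysis.

```r
# Simulated right-skewed counts, loosely resembling fielding errors.
set.seed(42)
errors <- round(rlnorm(500, meanlog = 5, sdlog = 0.6))

# Moment-based sample skewness (hypothetical helper, no extra packages needed).
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

skew_raw <- skew(errors)       # clearly positive: long right tail
skew_log <- skew(log(errors))  # much closer to zero after the transform

print(c(raw = skew_raw, logged = skew_log))

# log(0) is -Inf, so variables containing zeros need separate treatment
# (for example, removing the zeros first or using log1p()).
print(log(0))
```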
The following variables are skewed but were not adjusted:

- TEAM_PITCHING_H (still has too many high values)
- TEAM_PITCHING_BADPITCH (high values skew)
- TEAM_PITCHING_BASESGIVEN (high values skew)
- TEAM_POSRATIO (low values skew)

These fields need to be fixed:

- TEAM_FIELDING_DP (0 values skew)
- TOTAL_BASES (0 values are skewing; should remove 0’s)
- SBPERCENT (0 values are skewing)
Remaining NA values were omitted from the data set before the models were created.
As a measure of automation accuracy and a test against more traditional models, a stepwise approach was performed using a forward AIC (Akaike information criterion) method.
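A forward AIC search of this kind can be run with `step()` from base R's stats package, starting from an intercept-only model and adding one term at a time. The sketch below uses simulated data; the variable names `x1`, `x2`, `x3` and the data-generating process are assumptions for illustration and do not reproduce the model fitted in this report.

```r
# Simulated data: y depends on x1 and x2; x3 is pure noise.
set.seed(123)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 80 + 5 * d$x1 - 3 * d$x2 + rnorm(n, sd = 2)

# Start from the intercept-only model and let step() add terms by AIC.
null_model <- lm(y ~ 1, data = d)
step_fit <- step(
  null_model,
  scope     = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
  direction = "forward",
  trace     = 0  # suppress the step-by-step log
)

print(formula(step_fit))
print(AIC(step_fit))
```

At each step the candidate term that lowers AIC the most is added, and the search stops when no addition lowers AIC further.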
Current optimal results from stepwise method with diagnostic plots
#stepResults
sumary(stepResults)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 300.9533710 48.3434790 6.2253 5.825e-10
## TOTAL_BASES_B 0.0617399 0.0048655 12.6894 < 2.2e-16
## TEAM_PITCHING_SO -0.0083047 0.0027682 -3.0000 0.0027323
## TEAM_FIELDING_E -23.4927159 1.2899569 -18.2120 < 2.2e-16
## TEAM_BATTING_2B -0.0933401 0.0100098 -9.3249 < 2.2e-16
## TEAM_BATTING_HR -0.1768531 0.0209915 -8.4250 < 2.2e-16
## TEAM_BATTING_BB 0.0341314 0.0112872 3.0239 0.0025266
## TEAM_PITCHRATIO 41.4345456 12.2679678 3.3775 0.0007454
## TEAM_FIELDING_DP -0.0408027 0.0095856 -4.2566 2.170e-05
## TEAM_BATTING_ALLPOS -0.0332843 0.0069695 -4.7757 1.919e-06
## SBPERCENT 6.5380272 1.8148520 3.6025 0.0003228
## TEAM_POSRATIO -22.1353822 5.7268715 -3.8652 0.0001145
## TEAM_PITCHING_BB -26.8874857 7.8596741 -3.4209 0.0006364
## TEAM_PITCHING_BADPITCH 0.0118748 0.0075544 1.5719 0.1161273
## TEAM_BATTING_3B 1.8213870 1.0330535 1.7631 0.0780327
## TEAM_PITCHING_H -0.0107521 0.0076242 -1.4103 0.1586149
##
## n = 2042, p = 16, Residual SE = 11.27498, R-Squared = 0.39
par(mfrow=c(2,2))
plot(stepResults)
Discussion of coefficients: In the stepwise-generated model, many of the selected variables are linearly related to each other. The first 12 variables are statistically significant despite the multicollinearity. The final 3 are not significant at the 5% level and could arguably be removed. The relationships between the predictor variables result in some surprising coefficients that are difficult to interpret directly. For example, the Total Bases variable has the expected positive impact on wins, but the number of doubles has a negative impact. Another example is the positive coefficient for Pitch Ratio: this variable measures bad versus good pitching outcomes, and a higher value means relatively more ‘bad’ outcomes (home runs allowed). The positive relationship, however, is probably related to the strange fact that pitching home runs allowed is highly positively correlated with batting home runs, a counterintuitive relationship in the raw data.
Using backward elimination, and removing high p-value variables to get the highest adjusted \(R^2\), the final regression model is:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.0021e+02 1.0418e+02 7.6811 2.438e-14
## TEAM_BATTING_H 8.7263e+01 1.9424e+01 4.4926 7.431e-06
## TEAM_BATTING_3B 4.6571e+00 1.3191e+00 3.5304 0.0004242
## TEAM_BATTING_HR -2.3127e-01 3.4381e-02 -6.7268 2.250e-11
## TEAM_PITCHING_H -1.2302e-02 7.4217e-03 -1.6576 0.0975558
## TEAM_PITCHING_BB -2.6626e+01 5.9743e+00 -4.4568 8.773e-06
## TEAM_PITCHING_SO -1.0034e-02 2.8204e-03 -3.5575 0.0003830
## TEAM_FIELDING_E -2.3383e+01 1.2864e+00 -18.1764 < 2.2e-16
## TEAM_FIELDING_DP -4.1636e-02 9.6607e-03 -4.3098 1.712e-05
## TEAM_BATTING_HITSBASE -1.6331e+02 1.8108e+01 -9.0184 < 2.2e-16
## TOTAL_BASES_B 9.8376e-02 1.6043e-02 6.1322 1.039e-09
## TEAM_BATTING_ALLPOSATTCK -3.7203e-02 1.6391e-02 -2.2697 0.0233317
## TEAM_PITCHING_BADPITCH 1.3279e-02 7.3068e-03 1.8174 0.0693041
## TEAM_PITCHRATIO 2.8194e+01 1.2390e+01 2.2755 0.0229800
## TEAM_POSRATIO -2.1759e+01 5.5920e+00 -3.8911 0.0001030
## SBPERCENT 7.1149e+00 1.7895e+00 3.9759 7.257e-05
##
## n = 2042, p = 16, Residual SE = 11.26632, R-Squared = 0.39
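One way to carry out a backward elimination of this kind is to start from the full model and repeatedly drop the predictor with the highest p-value until every remaining term falls below a cutoff, checking adjusted \(R^2\) along the way. The sketch below runs on simulated data; the 0.05 cutoff, the variable names, and the loop itself are assumptions for illustration, not the exact procedure used for the model above.

```r
# Simulated data: y depends on x1 and x2; the noise columns are irrelevant.
set.seed(7)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                noise1 = rnorm(n), noise2 = rnorm(n))
d$y <- 80 + 5 * d$x1 - 3 * d$x2 + rnorm(n, sd = 2)

predictors <- c("x1", "x2", "noise1", "noise2")
alpha <- 0.05  # assumed cutoff for keeping a term

repeat {
  fit <- lm(reformulate(predictors, response = "y"), data = d)
  # p-values of the predictors (first row is the intercept, so drop it).
  p <- summary(fit)$coefficients[-1, "Pr(>|t|)", drop = FALSE]
  worst <- which.max(p[, 1])
  # Stop when every term is significant or only one predictor is left.
  if (p[worst, 1] <= alpha || length(predictors) == 1) break
  predictors <- setdiff(predictors, rownames(p)[worst])
}

adj_r2 <- summary(fit)$adj.r.squared
print(predictors)
print(adj_r2)
```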
Discussion of coefficients: All remaining variables after elimination are statistically significant at the 10% level, and all but TEAM_PITCHING_H and TEAM_PITCHING_BADPITCH are significant at the 5% level. The overall \(R^2\) decreased by 1 point with the elimination of 7 variables. TOTAL_BASES: The final model indicates a positive impact on total wins of about 0.098 per base gained in batting situations. Because of the relationship between TOTAL_BASES and other base-gain variables such as TEAM_BATTING_HR, TEAM_BATTING_HITSBASE, and TEAM_BATTING_ALLPOSATTCK, we see unexpected negative coefficients for those variables. FIELDING_DP: There is an unexpected negative impact on total wins of about 0.042 per double play.
The ‘simple’ model was designed to include a limited number of input variables using a more intuitive approach to the selection as a comparison to the backwards elimination or stepwise approach. This model was created by examining the correlation between variables, choosing those that have high relationships with the target variable and lower relationships with each other. There are 3 general types of variables to choose from - batting, pitching and fielding. A single variable was chosen from each category to represent these factors.
Variables chosen:

- TOTAL_BASES: a linear combination of the batting variables, weighted by base number, lending greater significance to higher bases.
- TEAM_FIELDING_E: the log of the number of fielding errors.
- TEAM_PITCHRATIO: the ratio of home runs allowed to strikeouts by pitchers (TEAM_PITCHING_HR / TEAM_PITCHING_SO).
Model Output
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.3600281 4.2072616 6.2654 4.525e-10
## TOTAL_BASES_B 0.0248483 0.0012942 19.2004 < 2.2e-16
## TEAM_FIELDING_E -2.8387979 0.5977432 -4.7492 2.185e-06
## TEAM_PITCHRATIO -2.1453440 6.4311239 -0.3336 0.7387
##
## n = 2042, p = 4, Residual SE = 12.48170, R-Squared = 0.25
Coefficients: Two of the three coefficients selected for the model were statistically significant. TOTAL_BASES: Wins increase by 0.0248 for each additional total base. TOTAL_BASES is a linear combination of all bases gained, weighted by base number. Because of its high values (mean of 2814 and standard deviation of 257), the impact of TOTAL_BASES on the number of wins is not negligible. TEAM_FIELDING_E: Wins decrease as the number of fielding errors increases. Because our variable is the log of the errors, the relationship is not linear in the raw error count; instead, wins change by -2.839 for each one-unit increase in the log of fielding errors. TEAM_PITCHRATIO: This variable was not helpful in explaining wins; it has a high p-value (0.74) and was removed from the final version shown below.
Model was re-evaluated excluding the PitchRatio variable, which slightly alters the coefficients.
Revised Model Output
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.41700087 4.20287748 6.2855 3.986e-10
## TOTAL_BASES_B 0.02457139 0.00099255 24.7558 < 2.2e-16
## TEAM_FIELDING_E -2.75847597 0.54698313 -5.0431 4.986e-07
##
## n = 2042, p = 3, Residual SE = 12.47898, R-Squared = 0.25
Equation with transformed variables: Wins = 26.417 + 0.0246 * TOTAL_BASES - 2.7585 * TEAM_FIELDING_E
Equation with original variables:
Wins = 26.417 + 0.0246 * (TEAM_BATTING_BB + TEAM_BATTING_1B + 2 * TEAM_BATTING_2B + 3 * TEAM_BATTING_3B + 4 * TEAM_BATTING_HR + TEAM_BASERUN_SB) - 2.7585 * log(TEAM_FIELDING_E)
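As a worked example, the revised simple model can be applied by hand using the coefficients from the revised model output, including the intercept. The helper `predict_wins_simple` and the example team (2814 total bases, the reported mean, and a made-up 150 fielding errors) are assumptions for illustration.

```r
# Coefficients copied from the revised model output above.
predict_wins_simple <- function(total_bases, fielding_errors) {
  26.41700087 +
    0.02457139 * total_bases -
    2.75847597 * log(fielding_errors)  # the model uses the log of raw errors
}

# Hypothetical team: average total bases and 150 fielding errors.
wins_hat <- predict_wins_simple(total_bases = 2814, fielding_errors = 150)
print(round(wins_hat, 2))

# Effect of doubling the error count: the change depends only on log(2),
# not on the starting number of errors.
delta <- predict_wins_simple(2814, 300) - predict_wins_simple(2814, 150)
print(round(delta, 2))
```

Because errors enter through a log, a doubling of fielding errors lowers predicted wins by the same amount whether the team starts from 100 or 300 errors.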
A comparison of the stepwise, elimination, and simple models included evaluating the adjusted \(R^2\) values, the Akaike Information Criterion (AIC), and the Bayesian Information Criterion (BIC).
We first assume that the model with the highest adjusted \(R^2\) is the “best” model.
The adjusted \(R^2\) values for the stepwise, elimination, and simple models are 0.3872123, 0.3881538, and 0.2493513. By this criterion the stepwise and elimination models are nearly tied, with the elimination model ahead by about 0.001, and both clearly outperform the simple model. Let’s see whether this result can be confirmed by another model selection method. We focus next on the Akaike Information Criterion.
The idea here is that the model with the smallest AIC value is the “best”. We are interested in the difference in AIC between fitted models.
The AIC values for the three models are 15706.724, 15703.584, and 16108.146.
According to the AIC, the stepwise and elimination models are again within about 3 points of each other, with the elimination model slightly lower, while the simple model is roughly 400 points higher. The stepwise model is the one carried forward as our best model. We can double-check the results with the BIC.
BIC is used in the same way as AIC, with a heavier penalty on the number of parameters. The model with the smallest BIC is the “best” in the set of models fitted.
The BIC values for the three models are 15802.292, 15799.153, and 16130.632. The BIC shows the same ordering: the stepwise and elimination models differ by about 3 points, and both are well below the simple model.
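The three criteria can be collected into one table with `summary()$adj.r.squared`, `AIC()`, and `BIC()`. The sketch below compares two models fitted to simulated data; the models, names, and data are assumptions for illustration and are not the three models from this report.

```r
# Simulated data where both x1 and x2 matter.
set.seed(99)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 80 + 5 * d$x1 - 3 * d$x2 + rnorm(n, sd = 2)

# Two candidate models: one with both predictors, one with only x1.
fits <- list(
  larger = lm(y ~ x1 + x2, data = d),
  simple = lm(y ~ x1, data = d)
)

# One row per model; higher adjusted R^2 and lower AIC/BIC are better.
comparison <- data.frame(
  model  = names(fits),
  adj_r2 = sapply(fits, function(m) summary(m)$adj.r.squared),
  AIC    = sapply(fits, AIC),
  BIC    = sapply(fits, BIC),
  row.names = NULL
)
print(comparison)

best_by_aic <- comparison$model[which.min(comparison$AIC)]
best_by_bic <- comparison$model[which.min(comparison$BIC)]
```

AIC and BIC values are only comparable between models fitted to the same response on the same rows, which is why missing values should be removed before any of the candidates is fitted.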
Evaluation data received the same variable transformations and handling of NA values as the original training data set.
library(dplyr)  # provides %>% and select(), used below
evalData<-evaluationData
#applying same logic used for training data to deal w/ na's
evalData$TEAM_BATTING_HBP[is.na(evalData$TEAM_BATTING_HBP)] <- 0
evalData$TEAM_FIELDING_DP[is.na(evalData$TEAM_FIELDING_DP)] <- 0
evalData$TEAM_BATTING_1B<-evalData$TEAM_BATTING_H- evalData$TEAM_BATTING_2B- evalData$TEAM_BATTING_3B- evalData$TEAM_BATTING_HR
temp1B<-evalData$TEAM_BATTING_1B;temp1B[is.na(temp1B)]<-0
temp2B<-evalData$TEAM_BATTING_2B*2;temp2B[is.na(temp2B)]<-0
temp3B<-evalData$TEAM_BATTING_3B*3;temp3B[is.na(temp3B)]<-0
tempHR<-evalData$TEAM_BATTING_HR*4;tempHR[is.na(tempHR)]<-0
tempSB<-as.numeric(evalData$TEAM_BASERUN_SB);tempSB[is.na(tempSB)]<-0
tempCS<-evalData$TEAM_BASERUN_CS*-1;tempCS[is.na(tempCS)]<-0
tempBB<-as.numeric(evalData$TEAM_BATTING_BB);tempBB[is.na(tempBB)]<-0
tempHBP<-as.numeric(evalData$TEAM_BATTING_HBP);tempHBP[is.na(tempHBP)]<-0
evalData$TOTAL_BASES_A<-temp1B+temp2B+temp3B+tempHR+tempSB+tempCS+tempBB+tempHBP
evalData$TOTAL_BASES_B<-(evalData$TEAM_BATTING_2B*2)+(evalData$TEAM_BATTING_3B*3)+(evalData$TEAM_BATTING_HR*4)+ evalData$TEAM_BASERUN_SB+evalData$TEAM_BATTING_BB+evalData$TEAM_BATTING_1B
evalData$TEAM_BATTING_HITSBASE <- evalData$TEAM_BATTING_H + evalData$TEAM_BATTING_2B + evalData$TEAM_BATTING_3B
evalData$TEAM_BATTING_ALLPOS <- evalData$TEAM_BATTING_H + evalData$TEAM_BATTING_BB + evalData$TEAM_BATTING_HBP
evalData$TEAM_BATTING_ALLPOSATTCK <- evalData$TEAM_BATTING_H + evalData$TEAM_BATTING_2B + evalData$TEAM_BATTING_3B + evalData$TEAM_BATTING_HR + evalData$TEAM_BATTING_BB + evalData$TEAM_BATTING_HBP + evalData$TEAM_BASERUN_SB
evalData$TEAM_PITCHING_BADPITCH <- evalData$TEAM_PITCHING_H + evalData$TEAM_PITCHING_BB
evalData$TEAM_PITCHING_BASESGIVEN <- evalData$TEAM_PITCHING_H + evalData$TEAM_PITCHING_BB + evalData$TEAM_PITCHING_HR
evalData$TEAM_PITCHRATIO <- evalData$TEAM_PITCHING_HR / evalData$TEAM_PITCHING_SO
evalData$TEAM_POSRATIO <- evalData$TEAM_BATTING_ALLPOSATTCK / evalData$TEAM_PITCHING_BASESGIVEN
evalData$SBPERCENT <- evalData$TEAM_BASERUN_SB / (evalData$TEAM_BASERUN_SB + evalData$TEAM_BASERUN_CS)
evalData$TEAM_BATTING_H <- log(evalData$TEAM_BATTING_H)
evalData$TEAM_BATTING_3B <- log(evalData$TEAM_BATTING_3B)
evalData$TEAM_PITCHING_BB <- log(evalData$TEAM_PITCHING_BB)
evalData$TEAM_FIELDING_E <- log(evalData$TEAM_FIELDING_E)
evalData$TEAM_BATTING_HITSBASE <- log(evalData$TEAM_BATTING_HITSBASE)
evalData$TEAM_BATTING_1B <- log(evalData$TEAM_BATTING_1B)
evalData<-evalData%>%select(-TEAM_BATTING_HBP)
evalData$TEAM_BASERUN_SB <- NULL
evalData$TEAM_BASERUN_CS <- NULL
# variables actually needed for our model; creating a subset so we don't worry about the NAs in other fields
evalDatasub <- evalData%>%select(TOTAL_BASES_B, TEAM_PITCHING_SO, TEAM_FIELDING_E, TEAM_BATTING_2B, TEAM_BATTING_HR,TEAM_BATTING_BB, TEAM_PITCHRATIO,TEAM_FIELDING_DP,TEAM_BATTING_ALLPOS,TEAM_POSRATIO,TEAM_PITCHING_BB,TEAM_PITCHING_BADPITCH,TEAM_PITCHING_H,TEAM_BATTING_3B,SBPERCENT)
# 259 obs,31 are NA
evalDatasub<-na.omit(evalDatasub)
evalDatasub$TARGET_WINS<-predict(best_model, newdata=evalDatasub, type='response')
We were able to use our selected model to predict TARGET_WINS for the 228 complete observations out of the 259 available in the evaluation data. The mean of the predicted values was 81.2145352, with a standard deviation of 8.0708116. This can be compared to the mean and standard deviation of the training data set used to create the model: 80.7908612 and 15.7521525.
Comparing the density of actual wins vs. predicted wins:
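The narrower spread of the predictions is expected from a linear model. For a least-squares fit with an intercept, the fitted values on the training data have the same mean as the response, and their variance equals \(R^2\) times the variance of the response. The sketch below checks this on simulated data; the data and names are assumptions for illustration, and on separate evaluation data the relationship holds only approximately.

```r
# Simulated training data with a moderately predictive variable.
set.seed(2024)
n <- 500
train <- data.frame(x = rnorm(n))
train$wins <- 80 + 8 * train$x + rnorm(n, sd = 12)

fit  <- lm(wins ~ x, data = train)
pred <- fitted(fit)

# Means agree; the predictions are less spread out than the actual values.
print(c(mean_actual = mean(train$wins), mean_pred = mean(pred)))
print(c(sd_actual = sd(train$wins), sd_pred = sd(pred)))

# The variance ratio equals R-squared for an OLS fit with an intercept.
var_ratio <- var(pred) / var(train$wins)
r_squared <- summary(fit)$r.squared
print(c(var_ratio = var_ratio, r_squared = r_squared))
```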
***