Introduction and Data Overview

The purpose of this assignment is to predict the number of baseball games a team will win based on a data set of historic performance data and wins. First, missing data is accounted for and, where possible, null values and outliers are imputed using variable-specific regression tree models. Then four linear models are attempted using different variable selection techniques with the purpose of minimizing prediction error. Additionally, a boosted regression tree model is tested following the four linear models to attempt to improve predictive accuracy. Finally, R code is provided to score a fresh data file and make predictions for teams not included in the original sample.

Exploratory Data Analysis and Prep

Missing Data

The dataset includes information on about 2,200 baseball teams between 1871 and 2006. All stats are adjusted for a 162-game season. Two variables (batters hit by pitch and baserunners caught stealing) were removed because more than 15% of their values were missing (Figure 0.1). For the remaining variables, null values are imputed using variable-specific regression trees. One new variable, total_bases, is also added to consolidate all non-home-run hit variables into one count of bases earned from hits (hit = 1 base, double = 2 bases, etc).
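
The three preparation steps above can be sketched roughly as follows. This is a minimal illustration on simulated data, not the report's actual code: the column names (singles, doubles, triples, hbp, so), the helper impute_with_tree, and the exact total_bases formula are assumptions made for the example.

```r
library(rpart)  # recommended package shipped with standard R installs

set.seed(1)
# Simulated stand-in for the team data; all column names are hypothetical.
n <- 200
teams <- data.frame(
  singles = rpois(n, 1000),
  doubles = rpois(n, 250),
  triples = rpois(n, 50),
  hbp     = rpois(n, 60),
  so      = rpois(n, 900)
)
teams$so <- teams$so + 0.2 * teams$singles
teams$hbp[sample(n, 150)] <- NA   # 75% missing: should be dropped
teams$so[sample(n, 10)]   <- NA   # 5% missing: should be imputed

# 1. Drop variables that are more than 15% missing.
missing_share <- colMeans(is.na(teams))
teams <- teams[, missing_share <= 0.15, drop = FALSE]

# 2. Impute one variable from a regression tree fit on the complete rows.
impute_with_tree <- function(df, target) {
  miss <- is.na(df[[target]])
  if (!any(miss)) return(df)
  fit <- rpart(reformulate(".", response = target),
               data = df[!miss, , drop = FALSE], method = "anova")
  df[[target]][miss] <- predict(fit, newdata = df[miss, , drop = FALSE])
  df
}
teams <- impute_with_tree(teams, "so")

# 3. Bases earned from non-home-run hits (assumed weighting: 1, 2, 3).
teams$total_bases <- teams$singles + 2 * teams$doubles + 3 * teams$triples
```

Fitting the tree only on rows where the target is observed keeps the imputed values from influencing their own model.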

Skewed Distributions and Outliers

For building an OLS linear model, we need data that is not skewed heavily by outliers or overly clustered in any one area. To remedy skew problems, we first identify outliers (values more than 1.5 x IQR below the first quartile or above the third quartile), convert those values to nulls, and impute those null values using regression trees.
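
As a small sketch of the outlier rule, the helper below (flag_outliers_as_na is a hypothetical name, and the input vector is made up) replaces values outside the 1.5 x IQR fences with NA so that they can be imputed in the same way as values that were missing to begin with.

```r
# Replace values beyond the 1.5 x IQR fences with NA so they can be imputed later.
flag_outliers_as_na <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)] <- NA
  x
}

x <- c(10, 12, 11, 13, 12, 11, 95, 10, 12, -40)
cleaned <- flag_outliers_as_na(x)   # 95 and -40 become NA
```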

Initial Correlations

One last data exploration step is to determine which variables correlate most strongly with TARGET_WINS. No variable alone explains a great deal of the variance in the data, but total_bases explains the most, followed by TEAM_PITCHING_BB and TEAM_PITCHING_HR (Figure 0.2).
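
One way to produce such a ranking is sketched below on simulated data. Only the column names come from the report; the values and the strength of the relationship are invented for illustration.

```r
set.seed(2)
n <- 300
# Simulated stand-in data; only the column names come from the report.
df <- data.frame(total_bases      = rnorm(n, 2000, 150),
                 TEAM_PITCHING_BB = rnorm(n, 550, 80),
                 TEAM_PITCHING_HR = rnorm(n, 100, 40))
df$TARGET_WINS <- 0.05 * df$total_bases + rnorm(n, sd = 12)

predictors <- setdiff(names(df), "TARGET_WINS")
cors <- sapply(df[predictors], cor, y = df$TARGET_WINS,
               use = "pairwise.complete.obs")
ranked <- sort(abs(cors), decreasing = TRUE)
r_squared <- ranked^2   # share of variance each variable explains on its own
```

Squaring a pairwise correlation gives the R-squared of the corresponding one-variable regression, which is the sense in which a single variable "explains" variance here.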

Model Selection

Four OLS models will be developed for this project using different methods of variable selection: manual, backwards stepwise, bi-directional stepwise, and principal components regression. Additionally, a boosted regression tree model will be developed for comparison with the four main models. The evaluation criterion is root mean squared error (RMSE) calculated on a holdout sample of the training data. The best model will be the one that minimizes RMSE. A baseline model that simply applies the average win count to every team has an RMSE of 15 games, so the model developed for this report should improve meaningfully on that figure to be considered successful.
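
The evaluation setup can be sketched as below, using simulated win totals rather than the real data. The rmse helper and the variable names are assumptions for the example; the 80/20 split mirrors the holdout described under Model Evaluation.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(3)
wins <- round(rnorm(1000, mean = 81, sd = 15))   # simulated win totals

# 80/20 split: fit on the training rows, evaluate on the holdout rows.
holdout_idx  <- sample(length(wins), size = 0.2 * length(wins))
train_wins   <- wins[-holdout_idx]
holdout_wins <- wins[holdout_idx]

# Baseline: predict the training mean for every holdout team.
baseline_rmse <- rmse(holdout_wins, mean(train_wins))
```

For a mean-only model the holdout RMSE is close to the standard deviation of wins, which is why it serves as the floor that every fitted model has to beat.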

Simple Heuristic OLS Regression

A simple linear regression model uses the three variables that correlate most strongly with win count: total_bases, TEAM_PITCHING_BB, and TEAM_PITCHING_HR. The model explains only 22% of the variance in the training data. It predicts an average of 81 wins with an RMSE of 14 games, only one game better than the baseline that applies the average win count to every team.


Call:
lm(formula = TARGET_WINS ~ total_bases + TEAM_PITCHING_BB + TEAM_PITCHING_HR, 
    data = train.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-66.325  -8.940   0.743   9.029  49.221 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       8.159602   3.383572   2.412    0.016 *  
total_bases       0.027487   0.001479  18.581  < 2e-16 ***
TEAM_PITCHING_BB -0.005120   0.003408  -1.502    0.133    
TEAM_PITCHING_HR  0.030349   0.005696   5.329 1.11e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14 on 1817 degrees of freedom
Multiple R-squared:  0.2178,    Adjusted R-squared:  0.2165 
F-statistic: 168.7 on 3 and 1817 DF,  p-value: < 2.2e-16

Backwards Variable Selection

Backwards variable selection techniques use the Akaike Information Criterion (AIC) to select the model that best explains the variance in the existing data without overfitting. The method initially trains a model using all available variables and sequentially removes variables based on their impact on AIC. For each number of variables in a model, the backwards selection algorithm returns the best variables to include. In out-of-sample testing, marginal improvements to RMSE diminish after 6 variables are included in the model (Figure 2.1).

Similar to the simple heuristic model, residuals appear to be normally distributed with a bit more uncertainty on the ends of the distribution of predicted values. The six-variable model predicts an average of 81 wins per team with an RMSE of 13 games.
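
As a rough sketch of AIC-based backward elimination, the example below uses step() from base R's stats package on simulated data. The report does not say which function it used, so the choice of step() is an assumption, and the variables x1 to x6 are invented.

```r
set.seed(4)
n <- 400
# Simulated data: x1 and x2 drive the response, x3 to x6 are pure noise.
X <- as.data.frame(matrix(rnorm(n * 6), ncol = 6,
                          dimnames = list(NULL, paste0("x", 1:6))))
X$y <- 80 + 5 * X$x1 - 4 * X$x2 + rnorm(n, sd = 3)

full <- lm(y ~ ., data = X)
# step() drops one term at a time for as long as doing so lowers AIC.
backward <- step(full, direction = "backward", trace = 0)
kept <- names(coef(backward))[-1]   # selected predictors, intercept excluded
```

Note that step() returns a single AIC-preferred model. Getting the best subset for each model size, as described above, would need a different routine.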

Bi-Directional Variable Selection

Bi-directional techniques work similarly to backwards variable selection, but at each iteration of model building, variables can be added or removed. Again, for each number of variables in a model, the algorithm returns the best variables to include. In out-of-sample testing, marginal improvements to RMSE again diminish when 6 variables are included in the model (Figure 3.1).

Model residuals appear to be normally distributed with a bit more uncertainty on the ends of the distribution of predicted values. Similar to backwards variable selection, the six-variable bi-directional linear regression model predicts an average of 81 wins per team with an RMSE of 13 games.
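
A bi-directional search can be sketched with the same base R step() function by starting from an intercept-only model and supplying a scope. As before, the data are simulated and the use of step() is an assumption rather than the report's confirmed method.

```r
set.seed(5)
n <- 400
# Simulated data: x1 and x2 drive the response, x3 to x6 are pure noise.
X <- as.data.frame(matrix(rnorm(n * 6), ncol = 6,
                          dimnames = list(NULL, paste0("x", 1:6))))
X$y <- 80 + 5 * X$x1 - 4 * X$x2 + rnorm(n, sd = 3)

null_model <- lm(y ~ 1, data = X)
upper <- reformulate(paste0("x", 1:6))   # ~ x1 + x2 + ... + x6

# At every iteration step() considers both adding and dropping a term.
both <- step(null_model, scope = list(lower = ~ 1, upper = upper),
             direction = "both", trace = 0)
selected <- names(coef(both))[-1]
```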

Principal Components Regression

Principal components regression involves fitting a principal components model for dimension reduction and then using the components that explain the bulk of the variation in the data as regressors in a linear model. Figure 4.1 illustrates the results of the principal components algorithm, which concentrated roughly 80% of the variation in the data in the first three components.

The principal components regression model explained less variance with greater uncertainty than any of the other models tested (81 wins per team predicted with RMSE of 14). This is unlikely to be the ideal model for predicting outcomes.
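
The two stages of principal components regression can be sketched with prcomp() and lm() from base R. The predictors below are simulated from two latent factors, and the 80% threshold is borrowed from the figure description above. None of this is the report's own code.

```r
set.seed(6)
n <- 300
# Simulated predictors driven by two latent factors, so a few components dominate.
f1 <- rnorm(n); f2 <- rnorm(n)
X <- cbind(a = f1 + rnorm(n, sd = 0.3), b = f1 + rnorm(n, sd = 0.3),
           c = f2 + rnorm(n, sd = 0.3), d = f2 + rnorm(n, sd = 0.3),
           e = f1 - f2 + rnorm(n, sd = 0.3))
y <- 81 + 6 * f1 + rnorm(n, sd = 5)

# Stage 1: principal components on centered, scaled predictors.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.80)[1]   # smallest number of components reaching 80%

# Stage 2: ordinary least squares on the first k component scores.
scores <- as.data.frame(pca$x[, seq_len(k), drop = FALSE])
pcr_fit <- lm(y ~ ., data = data.frame(y = y, scores))
```

Because the components are chosen for the variance they capture in the predictors and not for their relationship with wins, a component that explains much of the predictor variation may still predict the response poorly.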

Model Evaluation

The primary evaluation criterion for the four predictive models is root mean squared error calculated using a holdout set of 20% of the training data. Figure 5.1 shows scatter plots of predicted values and actual values for each model as well as a reference line demonstrating the pattern of a set of perfect predictions. Figure 5.2 ranks each model according to RMSE.

All of the models follow similar patterns in Figure 5.1. Principal components regression follows the line of perfect fit more loosely than the rest, while the predictions of the two stepwise models cluster more tightly around it. The R code for applying the model to fresh data is in Appendix B.

Conclusion

To predict the number of wins a team will earn in a season, four OLS regression models were attempted. The first model used heuristic variable selection. The next two used variations of stepwise selection. The last model was a principal components regression model. According to RMSE calculated out of sample, the two stepwise models performed best, with bi-directional stepwise variable selection producing the final winning model.

Appendices

Appendix A: Boosted Regression Tree

Regression trees are built by recursively partitioning the data at the splits that minimize the sum of squared errors and predicting the average response within each partition. Boosting fits many small trees in sequence, each one to the residuals left by the trees before it. Like the other models, the boosted regression tree predictions average 81 wins, but the root mean squared error is lower at 11 games. Residuals from the resulting model are normally distributed with a mean of zero, so this model shows promise for reducing prediction error.
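
To make the idea concrete, the sketch below hand-rolls gradient boosting for squared error using shallow rpart trees. It is an illustration of the technique on simulated data, not the package or settings used in this report. The tree count, shrinkage, and depth are arbitrary assumptions.

```r
library(rpart)  # recommended package shipped with standard R installs

set.seed(7)
n <- 500
# Simulated data with a nonlinear signal that a straight line would miss.
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 81 + 20 * sin(2 * pi * d$x1) + 10 * (d$x2 > 0.5) + rnorm(n, sd = 5)
train <- d[1:400, ]; test <- d[401:500, ]

rmse <- function(a, p) sqrt(mean((a - p)^2))

n_trees <- 100; shrinkage <- 0.1
pred_train <- rep(mean(train$y), nrow(train))   # start from the training mean
pred_test  <- rep(mean(train$y), nrow(test))
for (i in seq_len(n_trees)) {
  # Fit a shallow tree to the current residuals, then take a small step toward it.
  work <- data.frame(x1 = train$x1, x2 = train$x2, resid = train$y - pred_train)
  tree <- rpart(resid ~ x1 + x2, data = work, method = "anova",
                control = rpart.control(maxdepth = 2, cp = 0))
  pred_train <- pred_train + shrinkage * predict(tree, newdata = train)
  pred_test  <- pred_test  + shrinkage * predict(tree, newdata = test)
}

baseline_rmse <- rmse(test$y, mean(train$y))
boosted_rmse  <- rmse(test$y, pred_test)
```

The shrinkage factor keeps any single tree from dominating, which is the usual guard against overfitting in boosted models. A dedicated boosting package would add refinements such as row subsampling.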

How does the boosted regression tree stack up against the other four models?

[1] 80.59166
[1] 80.79086

The pattern in Figure 6.1 is similar to that of the simple heuristic model. However, the boosted regression tree model shows a significant improvement in RMSE of almost two full games.

Appendix B: Applying the Model Out of Sample

The code below applies the transformations and model algorithms to implement the boosted regression tree model.

Appendix C: Analysis Code

All code for this project can be found on my GitHub page here.

