DATA 621 Homework #1

Introduction

We have been given a dataset of 2276 records, each summarizing one season for a Major League Baseball team. The records span 1871 to 2006 inclusive. All statistics have been adjusted to match the performance of a 162-game season. The objective is to build a linear regression model to predict the number of wins for a team.

Working Theory

We are working on the premise that there are “good” teams and there are “bad” teams. The good teams win more than the bad teams. We are assuming that some of the predictors will be higher for the good teams than for the bad teams. Consequently, we can use these variables to predict how many games a team will win in a season.

Notes About the Data

There are some difficulties with this dataset. First, it covers a very wide time period. We know there are different “eras” of baseball, and this data spans several of them. Have the fundamental relationships between winning and these predictors changed over time? We think they have. If so, this will be a challenge.

Data Exploration

First Look at the Data

We will first look at the data to get a sense of what we have.

Variable           Min.      1st Qu.   Median    Mean      3rd Qu.   Max.      NA's
TARGET_WINS        0.00      71.00     82.00     80.79     92.00     146.00    0
TEAM_BATTING_H     891       1383      1454      1469      1537      2554      0
TEAM_BATTING_2B    69.0      208.0     238.0     241.2     273.0     458.0     0
TEAM_BATTING_3B    0.00      34.00     47.00     55.25     72.00     223.00    0
TEAM_BATTING_HR    0.00      42.00     102.00    99.61     147.00    264.00    0
TEAM_BATTING_BB    0.0       451.0     512.0     501.6     580.0     878.0     0
TEAM_BATTING_SO    0.0       548.0     750.0     735.6     930.0     1399.0    102
TEAM_BASERUN_SB    0.0       66.0      101.0     124.8     156.0     697.0     131
TEAM_BASERUN_CS    0.0       38.0      49.0      52.8      62.0      201.0     772
TEAM_BATTING_HBP   29.00     50.50     58.00     59.36     67.00     95.00     2085
TEAM_PITCHING_H    1137      1419      1518      1779      1682      30132     0
TEAM_PITCHING_HR   0.0       50.0      107.0     105.7     150.0     343.0     0
TEAM_PITCHING_BB   0.0       476.0     536.5     553.0     611.0     3645.0    0
TEAM_PITCHING_SO   0.0       615.0     813.5     817.7     968.0     19278.0   102
TEAM_FIELDING_E    65.0      127.0     159.0     246.5     249.2     1898.0    0
TEAM_FIELDING_DP   52.0      131.0     149.0     146.4     164.0     228.0     286

Some initial observations:

  • The response variable (TARGET_WINS) looks to be normally distributed. This supports the working theory that there are good teams and bad teams. There are also a lot of average teams.
  • There are also quite a few variables with missing values. We may need to deal with these in order to have the largest data set possible for modeling.
  • A few variables are bimodal (TEAM_BATTING_HR, TEAM_BATTING_SO, TEAM_PITCHING_HR). Some of them also have missing values, and the bimodal shape will make those values harder to fill in.
  • Some variables are right skewed (TEAM_BASERUN_CS, TEAM_BASERUN_SB, etc.). This might support the good team theory. It may also introduce non-normally distributed residuals in the model. We shall see.

Correlations

Let’s take a look at the correlations. The following are the correlations computed from the complete cases only:

Correlations with Response Variable

Let’s take a look at how the predictors are correlated with the response variable:

Variable Variable Correlation
TARGET_WINS TEAM_PITCHING_H 0.4712343
TARGET_WINS TEAM_BATTING_H 0.4699467
TARGET_WINS TEAM_BATTING_BB 0.4686879
TARGET_WINS TEAM_PITCHING_BB 0.4683988
TARGET_WINS TEAM_PITCHING_HR 0.4224668
TARGET_WINS TEAM_BATTING_HR 0.4224168
TARGET_WINS TEAM_FIELDING_E -0.3866880
TARGET_WINS TEAM_BATTING_2B 0.3129840
TARGET_WINS TEAM_PITCHING_SO -0.2293648
TARGET_WINS TEAM_BATTING_SO -0.2288927
TARGET_WINS TEAM_FIELDING_DP -0.1958660
TARGET_WINS TEAM_BASERUN_CS -0.1787560
TARGET_WINS TEAM_BATTING_3B -0.1243459
TARGET_WINS TEAM_BATTING_HBP 0.0735042
TARGET_WINS TEAM_BASERUN_SB 0.0148364

It looks like the hits, walks, home runs, and errors have the strongest correlations with wins. None of these correlations are particularly strong. This suggests there is a lot of ‘noise’ in these relationships.

It is interesting to note that allowing hits is positively correlated with wins. How strange! It is also noteworthy that pitching strikeouts are negatively correlated with winning, which does not make any sense. When one examines the scatter plots above, it becomes apparent that these correlations are being affected by some outliers.

Strong Correlations (Absolute Value > 0.5)

Are any predictors correlated with each other? We will only look for “strong” correlations:

Variable Variable Correlation
TEAM_BATTING_HR TEAM_PITCHING_HR 0.9999326
TEAM_BATTING_BB TEAM_PITCHING_BB 0.9998814
TEAM_BATTING_SO TEAM_PITCHING_SO 0.9997684
TEAM_BATTING_H TEAM_PITCHING_H 0.9991927
TEAM_BASERUN_SB TEAM_BASERUN_CS 0.6247378
TEAM_BATTING_H TEAM_BATTING_2B 0.5617729
TEAM_BATTING_2B TEAM_PITCHING_H 0.5604535

There are 4 pairs of variables with a correlation that is almost 1! Each batting statistic is nearly identical to its pitching counterpart. We will need to be careful not to introduce multicollinearity into our model.
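The screening step described above can be sketched in a few lines. This is a minimal Python illustration, not the code used to produce the table; the helper names `pearson` and `strong_pairs` and the toy values are assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def strong_pairs(columns, threshold=0.5):
    """Return (name_a, name_b, r) for every pair with |r| above threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, r))
    # Strongest relationships first, mirroring the ordering of the table.
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Hypothetical toy values, NOT the homework data.
toy = {
    "BATTING_HR": [10, 40, 90, 150, 200],
    "PITCHING_HR": [12, 41, 88, 151, 199],
    "FIELDING_E": [300, 120, 250, 130, 280],
}
result = strong_pairs(toy)
for name_a, name_b, r in result:
    print(f"{name_a:<12} {name_b:<12} {r:.4f}")
```

When two predictors are flagged with a correlation this close to 1, keeping only one of the pair is the simplest way to avoid unstable coefficient estimates.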

Strange Data Values

Missing Values

During our first look at the data it was noted that there were variables that are missing data. Here’s a look at what variables are missing data and how big of a problem it is:

Variable Missing Data   Number of Records   Share of Total
TEAM_BATTING_HBP        2085                92%
TEAM_BASERUN_CS         772                 34%
TEAM_FIELDING_DP        286                 13%
TEAM_BASERUN_SB         131                 5.8%
TEAM_BATTING_SO         102                 4.5%
TEAM_PITCHING_SO        102                 4.5%
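As a quick arithmetic check, the shares above follow from dividing each count by the 2276 records in the dataset. The Python sketch below only re-derives the percentages from the counts in the table; the function name `share` is an assumption for illustration.

```python
TOTAL_RECORDS = 2276  # dataset size stated in the introduction

# Missing-value counts copied from the table above.
missing = {
    "TEAM_BATTING_HBP": 2085,
    "TEAM_BASERUN_CS": 772,
    "TEAM_FIELDING_DP": 286,
    "TEAM_BASERUN_SB": 131,
    "TEAM_BATTING_SO": 102,
    "TEAM_PITCHING_SO": 102,
}


def share(count, total=TOTAL_RECORDS):
    """Percentage of all records that `count` represents."""
    return 100.0 * count / total


shares = {name: share(count) for name, count in missing.items()}
for name, pct in shares.items():
    print(f"{name:<18}{missing[name]:>6}{pct:>8.1f}%")
```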

The hit-by-pitch variable (TEAM_BATTING_HBP) is missing over 90% of its data. We will exclude it from consideration in our model.

Caught stealing a base (TEAM_BASERUN_CS) is next on the list. It may be possible to predict it using TEAM_BASERUN_SB since they are strongly correlated, but there are 131 records where both are missing.

The strike outs are going to be a little tricky because of their bimodal distribution. All 102

Zero Values

There are also variables that have very low values, including zeros. Let’s see how big of a problem this is:

Variable With Zeros   Number of Records   Share of Total
TEAM_BATTING_SO       20                  0.9%
TEAM_PITCHING_SO      20                  0.9%
TEAM_BATTING_HR       15                  0.7%
TEAM_PITCHING_HR      15                  0.7%
TEAM_BASERUN_SB       2                   0.1%
TEAM_BATTING_3B       2                   0.1%
TARGET_WINS           1                   0%
TEAM_BASERUN_CS       1                   0%
TEAM_BATTING_BB       1                   0%
TEAM_PITCHING_BB      1                   0%

This isn’t nearly as large of a problem as the missing values.

Deeper Dive into the Variables

TEAM_BATTING_3B

This field represents triples hit by the team. Triples aren’t very common because the ball stays in the field of play (unlike a home run), yet the batter still needs enough time to reach third base.

Looking at the distribution, the value of zero doesn’t look too unusual. Even if it were, the value is not likely to have a large impact.

TEAM_BATTING_BB

This variable represents when the batter is “walked” by the pitcher (also known as Base on Balls):

Four balls will walk a batter in modern baseball; however, that wasn’t always the case. A century or more ago (within the date range of this data set) a walk took as many as 9 balls [1]. Knowing this, and looking at the left tail of the values above, it is not unreasonable that there might be a season with no walks. Like triples above, leaving the one zero data point in is unlikely to adversely impact any regression, since there are valid values nearby.

TEAM_BATTING_SO

Here we saw some NA values, 102 of them to be specific. Plus we have 20 zero values as well.

First, the zero values seem nigh-impossible. An entire season (162 games) without a single batter being struck out seems highly suspect, let alone 20 of them in the dataset.

We will replace these values with imputed values, but the distribution looks to be bimodal, so using a mean or median (which is squarely between those peaks) may cause some issues with the model. So, instead, we will impute values using regression.

We will impute values for this variable by looking at its nearest neighbors (based on other variables) and taking a weighted average of their values.
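The nearest-neighbor idea can be illustrated with a short Python sketch. It is not the routine used for this report: the function name `knn_impute`, the inverse-distance weighting, the choice of k, and the toy rows are all assumptions made for the example.

```python
from math import sqrt


def knn_impute(rows, target, features, k=3):
    """Fill missing `target` values (None) with an inverse-distance
    weighted average of the k nearest rows that have a value."""
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        # Euclidean distance on the chosen features, nearest first.
        scored = sorted(
            (sqrt(sum((row[f] - c[f]) ** 2 for f in features)), c[target])
            for c in complete
        )[:k]
        # Closer neighbors get larger weights; the small constant
        # avoids division by zero for identical rows.
        weights = [1.0 / (d + 1e-9) for d, _ in scored]
        estimate = sum(w * v for w, (_, v) in zip(weights, scored)) / sum(weights)
        filled.append({**row, target: estimate})
    return filled


# Hypothetical toy rows, NOT the homework data.
rows = [
    {"H": 1400, "BB": 500, "SO": 700},
    {"H": 1450, "BB": 520, "SO": 760},
    {"H": 1500, "BB": 540, "SO": 820},
    {"H": 1460, "BB": 525, "SO": None},
]
filled = knn_impute(rows, target="SO", features=["H", "BB"], k=2)
print(round(filled[3]["SO"], 1))
```

Because the estimate is a local average, it falls near one of the two peaks of a bimodal variable rather than in the valley between them, which is the motivation given above for not using the overall mean or median. In practice the features would usually be rescaled first so that no single variable dominates the distance.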

TEAM_PITCHING_BB

Here we have no NA values and a single zero:

As we did with walks above, we can assume that it is possible to have no walks (and therefore pitch no walks). So, we will leave the zero alone.

However, there are some really high values in the data, which strains reality a little. We could take anything defined as an outlier (more than \(1.5 \cdot \text{IQR}\) beyond the quartiles) and set it to NA so those records will be excluded from any model we build with this variable. But when you do the math, the maximum seems extreme but plausible. For example, the greatest number of games in an MLB season is 162 (currently). With a max value of 3,645 walks pitched, you get 22.5 walks per game on average. Divided equally amongst 9 innings, it comes out to 2.5 walks per inning.

I’d be surprised that any pitcher wouldn’t be removed after an inning or two of 2-3 walks, but neither can we rule it out as a possibility.
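The arithmetic behind this judgment can be written out explicitly. The Python sketch below uses the quartiles and maximum of TEAM_PITCHING_BB from the first summary table and assumes the usual convention that outliers lie more than 1.5 IQR beyond the quartiles; the variable names are illustrative only.

```python
# Quartiles and maximum of TEAM_PITCHING_BB from the summary table.
Q1, Q3, MAX_BB = 476.0, 611.0, 3645.0
GAMES, INNINGS = 162, 9

iqr = Q3 - Q1                    # 135.0
lower_fence = Q1 - 1.5 * iqr     # 273.5
upper_fence = Q3 + 1.5 * iqr     # 813.5

walks_per_game = MAX_BB / GAMES              # 22.5
walks_per_inning = walks_per_game / INNINGS  # 2.5

print(f"IQR fences: {lower_fence} to {upper_fence}")
print(f"Max season: {walks_per_game} walks per game, "
      f"{walks_per_inning} per inning")
```

Under that convention anything above 813.5 walks would count as an outlier, so the maximum of 3,645 sits far outside the fence even though the per-inning rate is not physically impossible.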

TEAM_PITCHING_SO

This variable represents strikeouts pitched. We see that there are 102 NA values and a lot of extremely high values:

The extreme values need to be handled. First, a typical game will be 9 innings in length, and in each inning you can only pitch 3 strikeouts (because then your part of the inning is over). Those 27 potential strikeouts multiplied by 162 games means an upper limit near 4,374 a season.

Games can go beyond 9 innings, but even if every game in a season was as long as the longest ever MLB game (26 innings) you can only have 12,636 strikeouts. So, the max value of 19,278 is invalid.

We’ll make a high-yet-reasonable assumption of a mean of 11 innings per game, which gives a ceiling of 5,346 strikeouts. We will treat any value above that as invalid and set it to NA so it will be imputed prior to modeling.
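The three ceilings quoted above all come from the same product of innings, outs per inning, and games. A minimal Python sketch of that calculation and of the NA-flagging rule follows; the names `strikeout_ceiling` and `invalidate` and the sample values are hypothetical, and `None` stands in for NA.

```python
GAMES = 162
OUTS_PER_INNING = 3


def strikeout_ceiling(innings_per_game):
    """Most strikeouts a team could pitch in a season of this length."""
    return innings_per_game * OUTS_PER_INNING * GAMES


regulation = strikeout_ceiling(9)   # 4374
longest = strikeout_ceiling(26)     # 12636
cutoff = strikeout_ceiling(11)      # 5346


def invalidate(values, limit=cutoff):
    """Replace values above the limit with None so they can be imputed."""
    return [None if v is not None and v > limit else v for v in values]


# Hypothetical sample, including the dataset's reported maximum of 19278.
cleaned = invalidate([615, 19278, None, 5346, 5347])
print(regulation, longest, cutoff, cleaned)
```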

Data Preparation

Results

Here’s what the data look like after imputation and correction:

Variable           Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
TARGET_WINS        0.00      71.00     82.00     80.79     92.00     146.00
TEAM_BATTING_H     891       1383      1454      1469      1537      2554
TEAM_BATTING_2B    69.0      208.0     238.0     241.2     273.0     458.0
TEAM_BATTING_3B    0.00      34.00     47.00     55.25     72.00     223.00
TEAM_BATTING_HR    0.00      42.00     102.00    99.61     147.00    264.00
TEAM_BATTING_BB    0.0       451.0     512.0     501.6     580.0     878.0
TEAM_BATTING_SO    66.0      554.0     732.0     735.2     925.0     1399.0
TEAM_BASERUN_SB    0.0       67.0      104.8     124.8     154.0     697.0
TEAM_BASERUN_CS    0.00      43.00     58.00     70.49     91.00     201.00
TEAM_PITCHING_H    1137      1419      1518      1779      1682      30132
TEAM_PITCHING_HR   0.0       50.0      107.0     105.7     150.0     343.0
TEAM_PITCHING_BB   0.0       476.0     536.5     553.0     611.0     3645.0
TEAM_PITCHING_SO   0         619       797       796       957       4224
TEAM_FIELDING_E    65.0      127.0     159.0     246.5     249.2     1898.0
TEAM_FIELDING_DP   52.0      130.0     147.0     145.4     162.0     228.0
TEAM_BATTING_1B    709.0     990.8     1050.0    1073.2    1129.0    2112.0

Model Building

We will divide the data into training and test sets using a 70/30 split. We will build our models on the training set and evaluate them on the test set.
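A 70/30 split can be sketched as a seeded shuffle followed by a cut. This Python example is an assumption about the general technique, not the procedure used here: the function name, the seed, and the rounding rule are illustrative. The model output in this report implies a training set of 1595 rows (1580 residual degrees of freedom plus 15 coefficients), so the actual split method evidently differed slightly from this sketch, which yields 1593.

```python
import random


def train_test_split(rows, train_share=0.7, seed=42):
    """Shuffle a copy of the rows and cut it at the requested share."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the 2276 records.
train, test = train_test_split(list(range(2276)))
print(len(train), len(test))
```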

Kitchen Sink Model

We will begin with a “kitchen sink” model.


Call:
lm(formula = TARGET_WINS ~ ., data = moneyball_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.393  -8.577  -0.072   8.370  61.635 

Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       5.9460976  7.0651290   0.842  0.40013    
TEAM_BATTING_H    0.0546947  0.0045070  12.135  < 2e-16 ***
TEAM_BATTING_2B  -0.0237996  0.0110657  -2.151  0.03165 *  
TEAM_BATTING_3B   0.0368255  0.0205168   1.795  0.07286 .  
TEAM_BATTING_HR   0.0860825  0.0354739   2.427  0.01535 *  
TEAM_BATTING_BB  -0.0001871  0.0060299  -0.031  0.97526    
TEAM_BATTING_SO  -0.0016401  0.0043993  -0.373  0.70934    
TEAM_BASERUN_SB  -0.0011231  0.0066068  -0.170  0.86503    
TEAM_BASERUN_CS   0.1097642  0.0188991   5.808 7.63e-09 ***
TEAM_PITCHING_H  -0.0006330  0.0005148  -1.229  0.21908    
TEAM_PITCHING_HR -0.0019716  0.0313430  -0.063  0.94985    
TEAM_PITCHING_BB  0.0128312  0.0039117   3.280  0.00106 ** 
TEAM_PITCHING_SO -0.0020795  0.0031650  -0.657  0.51126    
TEAM_FIELDING_E  -0.0204896  0.0029772  -6.882 8.47e-12 ***
TEAM_FIELDING_DP -0.1077032  0.0169111  -6.369 2.49e-10 ***
TEAM_BATTING_1B          NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.95 on 1580 degrees of freedom
Multiple R-squared:  0.3286,    Adjusted R-squared:  0.3226 
F-statistic: 55.23 on 14 and 1580 DF,  p-value: < 2.2e-16

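It explains roughly a third of the variation in wins (adjusted R-squared of 0.32), but a lot of the variables are not statistically significant. TEAM_BATTING_1B has no estimate because it is perfectly collinear with other predictors, which is the singularity noted in the output.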

Simple Model

Let’s try to create a simpler model. We will pick variables that had high correlations with wins and exclude the pitching variables, which would introduce multicollinearity issues.


Call:
lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
    TEAM_FIELDING_E, data = moneyball_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-51.833  -9.025   0.218   8.837  50.486 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.800109   4.004492   1.199    0.231    
TEAM_BATTING_H   0.048614   0.002519  19.299  < 2e-16 ***
TEAM_BATTING_BB  0.015707   0.003792   4.142 3.62e-05 ***
TEAM_FIELDING_E -0.012961   0.002115  -6.129 1.11e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.8 on 1591 degrees of freedom
Multiple R-squared:  0.2314,    Adjusted R-squared:  0.2299 
F-statistic: 159.7 on 3 and 1591 DF,  p-value: < 2.2e-16

This model does not fit the data as well (adjusted R-squared of 0.23), but all of its variables are statistically significant.
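To make the fitted equation concrete, the coefficients from the output above can be applied to a single team. The Python sketch below plugs in the median values of the three predictors from the first summary table; the function name `predict_wins` is an assumption for illustration.

```python
# Coefficients copied from the simple model's summary output.
COEF = {
    "(Intercept)": 4.800109,
    "TEAM_BATTING_H": 0.048614,
    "TEAM_BATTING_BB": 0.015707,
    "TEAM_FIELDING_E": -0.012961,
}


def predict_wins(hits, walks, errors):
    """Predicted season wins under the simple linear model."""
    return (
        COEF["(Intercept)"]
        + COEF["TEAM_BATTING_H"] * hits
        + COEF["TEAM_BATTING_BB"] * walks
        + COEF["TEAM_FIELDING_E"] * errors
    )


# A team at the median of each predictor (1454 hits, 512 walks, 159 errors).
median_team = predict_wins(hits=1454, walks=512, errors=159)
print(round(median_team, 2))
```

The result is about 81.5 wins, close to the median of 82 for TARGET_WINS. Read the other way, the hits coefficient implies that 100 additional hits are worth roughly 4.9 wins with walks and errors held fixed.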

Higher Order Stepwise Regression

For the third model we will use a stepwise regression method using a backwards elimination process. We also introduce some higher order polynomial variables.


Call:
lm(formula = poly_call[2], data = moneyball_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-41.197  -7.453  -0.130   7.130  57.064 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            1.721e+02  7.268e+01   2.368 0.018011 *  
TEAM_BATTING_2B       -1.264e+00  7.704e-01  -1.641 0.101053    
TEAM_PITCHING_H        2.634e-02  6.173e-03   4.267 2.10e-05 ***
TEAM_PITCHING_BB       2.401e-01  4.163e-02   5.766 9.76e-09 ***
TEAM_FIELDING_E       -2.257e-01  2.575e-02  -8.765  < 2e-16 ***
TEAM_FIELDING_DP      -3.092e+00  1.875e+00  -1.649 0.099352 .  
I(TEAM_BATTING_2B^2)   1.257e-02  4.968e-03   2.530 0.011503 *  
I(TEAM_BATTING_3B^2)   1.914e-03  4.189e-04   4.569 5.29e-06 ***
I(TEAM_BATTING_BB^2)  -7.454e-04  1.490e-04  -5.004 6.26e-07 ***
I(TEAM_BATTING_SO^2)   1.190e-04  2.980e-05   3.994 6.80e-05 ***
I(TEAM_BASERUN_SB^2)   4.939e-05  1.323e-05   3.733 0.000196 ***
I(TEAM_PITCHING_H^2)  -5.287e-06  1.114e-06  -4.747 2.25e-06 ***
I(TEAM_PITCHING_HR^2)  1.352e-03  3.473e-04   3.894 0.000103 ***
I(TEAM_PITCHING_BB^2) -3.054e-04  5.339e-05  -5.720 1.27e-08 ***
I(TEAM_PITCHING_SO^2) -5.613e-05  9.962e-06  -5.634 2.09e-08 ***
I(TEAM_FIELDING_E^2)   3.503e-04  6.256e-05   5.600 2.53e-08 ***
I(TEAM_FIELDING_DP^2)  3.198e-02  2.113e-02   1.514 0.130285    
I(TEAM_BATTING_2B^3)  -4.506e-05  1.389e-05  -3.244 0.001205 ** 
I(TEAM_BATTING_3B^3)  -6.372e-06  2.690e-06  -2.369 0.017974 *  
I(TEAM_BATTING_HR^3)   8.187e-07  4.736e-07   1.729 0.084096 .  
I(TEAM_BATTING_BB^3)   1.565e-06  3.265e-07   4.793 1.80e-06 ***
I(TEAM_BATTING_SO^3)  -1.283e-07  4.649e-08  -2.759 0.005864 ** 
I(TEAM_BASERUN_CS^3)   1.147e-05  2.496e-06   4.594 4.70e-06 ***
I(TEAM_PITCHING_H^3)   3.317e-10  7.906e-11   4.195 2.88e-05 ***
I(TEAM_PITCHING_HR^3) -8.134e-06  2.985e-06  -2.725 0.006497 ** 
I(TEAM_PITCHING_BB^3)  1.488e-07  2.539e-08   5.860 5.64e-09 ***
I(TEAM_PITCHING_SO^3)  3.989e-08  7.580e-09   5.262 1.62e-07 ***
I(TEAM_FIELDING_E^3)  -2.455e-07  5.708e-08  -4.301 1.80e-05 ***
I(TEAM_FIELDING_DP^3) -1.515e-04  1.027e-04  -1.475 0.140401    
I(TEAM_BATTING_1B^3)   1.097e-08  1.086e-09  10.106  < 2e-16 ***
I(TEAM_BATTING_2B^4)   5.422e-08  1.422e-08   3.814 0.000142 ***
I(TEAM_BATTING_BB^4)  -8.775e-10  2.085e-10  -4.208 2.72e-05 ***
I(TEAM_BATTING_SO^4)   3.654e-11  2.070e-11   1.765 0.077712 .  
I(TEAM_BASERUN_CS^4)  -4.467e-08  1.378e-08  -3.242 0.001212 ** 
I(TEAM_PITCHING_H^4)  -6.315e-15  1.742e-15  -3.626 0.000297 ***
I(TEAM_PITCHING_HR^4)  1.345e-08  6.111e-09   2.200 0.027934 *  
I(TEAM_PITCHING_BB^4) -2.198e-11  3.762e-12  -5.843 6.24e-09 ***
I(TEAM_PITCHING_SO^4) -7.847e-12  1.473e-12  -5.329 1.13e-07 ***
I(TEAM_FIELDING_E^4)   5.595e-11  1.732e-11   3.230 0.001262 ** 
I(TEAM_FIELDING_DP^4)  2.668e-07  1.824e-07   1.463 0.143723    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.54 on 1555 degrees of freedom
Multiple R-squared:  0.4748,    Adjusted R-squared:  0.4616 
F-statistic: 36.04 on 39 and 1555 DF,  p-value: < 2.2e-16

This model has the highest adjusted R-squared but has some variables that are statistically insignificant.
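The adjusted R-squared values reported for the three models can be reproduced from the multiple R-squared, the number of predictors, and the training-set size. The Python sketch below applies the standard formula, 1 - (1 - R²)(n - 1)/(n - p - 1), with n = 1595 inferred from the residual degrees of freedom in the outputs; the names are illustrative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


N_TRAIN = 1595  # residual df + estimated coefficients, e.g. 1580 + 15

# (multiple R-squared, predictors, adjusted R-squared as reported)
models = {
    "kitchen sink": (0.3286, 14, 0.3226),
    "simple": (0.2314, 3, 0.2299),
    "stepwise": (0.4748, 39, 0.4616),
}

computed = {
    name: adjusted_r2(r2, N_TRAIN, p) for name, (r2, p, _) in models.items()
}
for name, value in computed.items():
    print(f"{name:<13}{value:.4f}  (reported {models[name][2]})")
```

The penalty grows with the number of predictors: the stepwise model gives up about 0.013 of its R-squared to the adjustment, against about 0.0015 for the simple model.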

Model Selection

In order to select which model is the “best” we will evaluate each one against the test set. We will examine the difference between the predicted and actual values. Since wins are whole numbers and the predict function generates floating point numbers, I will round the predictions.
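The comparison just described can be sketched as follows. This Python example assumes the error is computed as rounded prediction minus actual wins, which the report does not state; the function name `error_summary` and the toy numbers are hypothetical. The `inclusive` quantile method is used because it matches the default quartile definition of R's `summary`.

```python
from statistics import mean, quantiles


def error_summary(predicted, actual):
    """Six-number summary of rounded prediction errors."""
    errors = [round(p) - a for p, a in zip(predicted, actual)]
    q1, median, q3 = quantiles(errors, n=4, method="inclusive")
    return {
        "Min.": min(errors),
        "1st Qu.": q1,
        "Median": median,
        "Mean": mean(errors),
        "3rd Qu.": q3,
        "Max.": max(errors),
    }


# Hypothetical predictions and actual wins, NOT the homework data.
predicted = [80.4, 91.6, 70.2, 85.3, 60.9]
actual = [82, 90, 75, 85, 60]
summary = error_summary(predicted, actual)
for label, value in summary.items():
    print(f"{label:<8}{value:>7.2f}")
```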

Final Model Selection

In deciding which model to use to make my final predictions I will look at the amount of error in the predictions:

Statistic   Kitchen Sink   Simple      Stepwise
Min.        -47.0000       -46.0000    -175.000
1st Qu.     -9.0000        -10.0000    -9.000
Median      1.0000         0.0000      -1.000
Mean        -0.2819        -0.3862     -0.141
3rd Qu.     8.0000         9.0000      7.000
Max.        49.0000        47.0000     435.000

The stepwise regression model did not perform well with out-of-sample data: its prediction errors range from -175 to 435 wins. The high adjusted R-squared is probably evidence of overfitting. The simple and “kitchen sink” models have similar performance. I will use the simple model because I prefer to “keep it simple, stupid.” The residuals of the simple model are mostly normally distributed. The adjusted R-squared (0.23) indicates about a quarter of the variation is explained by the model. I will complete this homework by generating predictions on the evaluation set using the simple model.

Mike Silva

2019-09-25