Background

Given a set of professional baseball teams and their performance metrics for a 162-game season, how can we build a model that can be used to predict the number of wins for each team? The data set contains metrics of teams from 1871 to 2006. The assignment description can be found here.

Data Exploration

library(tidyverse)
library(corrplot)
library(reshape2)  # melt function for distributions of variables
library(moments)   # determine skewness of residuals
library(MASS)   # Box-Cox transformation
library(broom)
library(knitr)

Description of variables

##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286
## 'data.frame':    2276 obs. of  17 variables:
##  $ INDEX           : int  1 2 3 4 5 6 7 8 11 12 ...
##  $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
##  $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : int  842 1075 917 922 920 973 1062 1027 922 827 ...
##  $ TEAM_BASERUN_SB : int  NA 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : int  NA 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_BATTING_HBP: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
##  $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: int  5456 1082 917 928 920 973 1062 1033 922 827 ...
##  $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
##  $ TEAM_FIELDING_DP: int  NA 155 153 156 168 149 186 136 169 159 ...
  • There are 2,276 observations and 17 columns.
  • Of these columns, 15 are predictors.
  • The remaining two are an index column (INDEX) and the response variable, TARGET_WINS.
  • Each observation represents one baseball team's season.
  • TARGET_WINS, the response variable, is the number of wins by the baseball team.
  • There appear to be many missing values in the TEAM_BATTING_HBP column. We will take a closer look at how much is missing.

Missing values

Variable            Percent missing
INDEX                      0.000000
TARGET_WINS                0.000000
TEAM_BATTING_H             0.000000
TEAM_BATTING_2B            0.000000
TEAM_BATTING_3B            0.000000
TEAM_BATTING_HR            0.000000
TEAM_BATTING_BB            0.000000
TEAM_BATTING_SO            4.481547
TEAM_BASERUN_SB            5.755712
TEAM_BASERUN_CS           33.919156
TEAM_BATTING_HBP          91.608084
TEAM_PITCHING_H            0.000000
TEAM_PITCHING_HR           0.000000
TEAM_PITCHING_BB           0.000000
TEAM_PITCHING_SO           4.481547
TEAM_FIELDING_E            0.000000
TEAM_FIELDING_DP          12.565905
  • Over 90% of TEAM_BATTING_HBP is missing.
  • Over one-third of TEAM_BASERUN_CS is missing.
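  • The other columns with missing data (TEAM_FIELDING_DP, TEAM_BASERUN_SB, TEAM_BATTING_SO and TEAM_PITCHING_SO) are each missing less than 13% of their values.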
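
The percentages above can be reproduced with a short calculation. The sketch below is an assumption about how such a table might be built, not the code used in this report; `toy` is a made-up stand-in for the training data, and only its column names come from the data set.

```r
# Hypothetical stand-in for the training data (values are invented).
toy <- data.frame(
  TARGET_WINS      = c(70, 82, 91, 65),
  TEAM_BATTING_SO  = c(842, NA, 917, 922),
  TEAM_BATTING_HBP = c(NA, NA, NA, 58)
)

# is.na() gives a logical matrix; the column means are the share of NAs.
pct_missing <- colMeans(is.na(toy)) * 100

# One row per column, with its percentage of missing values.
missing_tbl <- data.frame(
  variable        = names(pct_missing),
  percent_missing = unname(pct_missing)
)
print(missing_tbl)
```

Working with `colMeans(is.na(...))` avoids an explicit loop and keeps the result in the same column order as the data frame.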

Correlation plot

There are strong positive correlations between:

  • TEAM_BASERUN_CS, TEAM_BASERUN_SB
  • TEAM_BATTING_SO, TEAM_PITCHING_SO
  • TEAM_BATTING_BB, TEAM_PITCHING_BB
  • TEAM_BATTING_HR, TEAM_PITCHING_HR
  • TEAM_PITCHING_H, TEAM_BATTING_2B
  • TEAM_BATTING_H, TEAM_PITCHING_H

There are moderately positive correlations between:

  • TARGET_WINS, TEAM_PITCHING_BB
  • TARGET_WINS, TEAM_BATTING_BB
  • TARGET_WINS, TEAM_PITCHING_HR

There are moderately negative correlations between:

  • TARGET_WINS, TEAM_FIELDING_E
  • TEAM_PITCHING_H, TEAM_PITCHING_SO
  • TEAM_BATTING_H, TEAM_PITCHING_SO
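
Because several columns contain missing values, the correlation matrix has to be told how to handle them. The following is a minimal sketch on simulated data, assuming pairwise-complete observations are acceptable; the data frame `toy` and its values are invented, and the `corrplot` call is only attempted if that package is installed.

```r
set.seed(1)
n <- 200
so_bat <- rnorm(n, 750, 150)

# Simulated stand-in: two strongly related strikeout columns and an
# error column that lowers wins. Values are invented for illustration.
toy <- data.frame(
  TEAM_BATTING_SO  = so_bat,
  TEAM_PITCHING_SO = so_bat + rnorm(n, 0, 30),
  TEAM_FIELDING_E  = rnorm(n, 160, 40)
)
toy$TARGET_WINS <- 100 - 0.1 * toy$TEAM_FIELDING_E + rnorm(n, 0, 5)
toy$TEAM_BATTING_SO[1:10] <- NA  # introduce some missing values

# Each pairwise correlation uses the rows where both columns are present.
cor_mat <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_mat, 2))

# Optional plot, skipped when corrplot is not available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "color", type = "upper")
}
```

With the default `use = "everything"`, any column containing an `NA` would produce `NA` correlations, which is why the `use` argument is set explicitly here.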

We can further visualize the correlations against TARGET_WINS using scatterplots:

Distributions

We can visualize the variables using histograms to check for non-normal distributions:

One of the key takeaways here is that strikeouts by batters have a bimodal distribution. Several variables, such as strikeouts by pitchers and walks allowed, are skewed right. The response variable, TARGET_WINS, is approximately normally distributed.
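
As a rough illustration of how these shapes show up in a histogram, the sketch below simulates one roughly normal, one bimodal and one right-skewed variable and plots them side by side with base graphics. The data are invented, and this is not the plotting code used for the report.

```r
set.seed(5)

# Simulated columns with three different shapes (values are invented).
toy <- data.frame(
  TARGET_WINS      = rnorm(300, 81, 15),                            # roughly normal
  TEAM_BATTING_SO  = c(rnorm(150, 500, 60), rnorm(150, 950, 80)),   # two clusters, bimodal
  TEAM_PITCHING_BB = rexp(300, rate = 1 / 550)                      # long right tail
)

# One panel per variable.
op <- par(mfrow = c(1, ncol(toy)))
for (v in names(toy)) {
  hist(toy[[v]], breaks = 30, main = v, xlab = v)
}
par(op)  # restore the previous plotting layout
```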

Build Models

Going forward, we will drop the TEAM_BATTING_HBP variable since it has too many missing values.

Model 1

We can start by fitting a model with all the predictors.
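
A minimal sketch of this step on simulated data follows. It uses base R subsetting instead of the `dplyr::select` call that appears in the model output, and `df.toy` with its values is a hypothetical stand-in for the training data.

```r
set.seed(42)
n <- 100

# Hypothetical stand-in for the training data (values are invented).
df.toy <- data.frame(
  INDEX            = seq_len(n),
  TEAM_BATTING_H   = rnorm(n, 1450, 100),
  TEAM_FIELDING_E  = rnorm(n, 160, 40),
  TEAM_BATTING_HBP = NA_real_          # mostly-missing column, dropped below
)
df.toy$TARGET_WINS <- 20 + 0.05 * df.toy$TEAM_BATTING_H -
  0.1 * df.toy$TEAM_FIELDING_E + rnorm(n, 0, 5)

# Remove the identifier and the column with too many missing values.
drop_cols  <- c("INDEX", "TEAM_BATTING_HBP")
model_data <- df.toy[, !(names(df.toy) %in% drop_cols)]

# "~ ." uses every remaining column as a predictor.
m1 <- lm(TARGET_WINS ~ ., data = model_data)
print(summary(m1))
```

By default `lm` drops every row that has a missing value in any variable used by the model, which is what the "observations deleted due to missingness" line in a model summary reports.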

The skewness of the residuals is -0.01. The sign is negative, but the magnitude is so close to zero that the residuals are effectively symmetric.

If the magnitude of the skew were larger, we could attempt to use the Box-Cox method to determine what transformation we should apply to decrease the skew.
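
For reference, residual skewness can be computed directly from its moment-based definition (third central moment divided by the second central moment raised to the power 3/2). The helper `skew` below is a hypothetical name written for this sketch; the `moments` package loaded earlier should give a comparable figure, but that equivalence is an assumption here. The data are simulated.

```r
# Moment-based skewness: m3 / m2^(3/2), using population (1/n) moments.
skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(7)
x <- rnorm(500)
y <- 2 + 3 * x + rnorm(500)   # linear signal with symmetric noise
fit <- lm(y ~ x)

res_skew <- skew(residuals(fit))   # expected to be near zero
print(res_skew)

# For contrast, a clearly right-skewed sample gives a positive value.
print(skew(c(0, 0, 0, 4)))
```

Values near zero indicate symmetric residuals; a positive value indicates a longer right tail and a negative value a longer left tail.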

Below are the estimates of the coefficients of Model 1.

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = dplyr::select(df.train, 
##     -"INDEX"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.5627  -6.6932  -0.1328   6.5249  27.8525 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      57.912438   6.642839   8.718  < 2e-16 ***
## TEAM_BATTING_H    0.015434   0.019626   0.786   0.4318    
## TEAM_BATTING_2B  -0.070472   0.009369  -7.522 9.36e-14 ***
## TEAM_BATTING_3B   0.161551   0.022192   7.280 5.43e-13 ***
## TEAM_BATTING_HR   0.073952   0.085392   0.866   0.3866    
## TEAM_BATTING_BB   0.043765   0.046454   0.942   0.3463    
## TEAM_BATTING_SO   0.018250   0.023463   0.778   0.4368    
## TEAM_BASERUN_SB   0.035880   0.008687   4.130 3.83e-05 ***
## TEAM_BASERUN_CS   0.052124   0.018227   2.860   0.0043 ** 
## TEAM_PITCHING_H   0.019044   0.018381   1.036   0.3003    
## TEAM_PITCHING_HR  0.022997   0.082092   0.280   0.7794    
## TEAM_PITCHING_BB -0.004180   0.044692  -0.094   0.9255    
## TEAM_PITCHING_SO -0.038176   0.022447  -1.701   0.0892 .  
## TEAM_FIELDING_E  -0.155876   0.009946 -15.672  < 2e-16 ***
## TEAM_FIELDING_DP -0.112885   0.013137  -8.593  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.556 on 1471 degrees of freedom
##   (790 observations deleted due to missingness)
## Multiple R-squared:  0.4386, Adjusted R-squared:  0.4333 
## F-statistic:  82.1 on 14 and 1471 DF,  p-value: < 2.2e-16

Eight of the 14 predictors in the model have the wrong signs, contradicting their theoretical effects on TARGET_WINS given in the assignment description.

  • TEAM_BATTING_3B
  • TEAM_BATTING_HR
  • TEAM_BATTING_BB
  • TEAM_BATTING_SO
  • TEAM_PITCHING_HR
  • TEAM_PITCHING_BB
  • TEAM_PITCHING_SO
  • TEAM_FIELDING_DP

Model 2: Variable Selection by Intuition

The residuals of Model 2 have a skewness of 0.022. This is slightly larger in magnitude than for the first model, but still very close to zero.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BASERUN_SB + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = dplyr::select(df.train, -"INDEX"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.785  -8.525   0.040   8.464  40.564 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      82.907583   3.066561  27.036  < 2e-16 ***
## TEAM_BATTING_2B   0.038364   0.006738   5.693 1.44e-08 ***
## TEAM_BATTING_3B   0.258746   0.016866  15.341  < 2e-16 ***
## TEAM_BASERUN_SB   0.054453   0.005679   9.588  < 2e-16 ***
## TEAM_FIELDING_E  -0.120219   0.006655 -18.066  < 2e-16 ***
## TEAM_FIELDING_DP -0.065316   0.013372  -4.884 1.12e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.94 on 1931 degrees of freedom
##   (339 observations deleted due to missingness)
## Multiple R-squared:  0.2245, Adjusted R-squared:  0.2224 
## F-statistic: 111.8 on 5 and 1931 DF,  p-value: < 2.2e-16

The negative sign of the TEAM_FIELDING_DP coefficient is counterintuitive, since turning more double plays would be expected to help a team win.

We can use the Box-Cox method to see if a transformation should be applied to address the skewness.

Lambda = 1.1515

The lambda corresponding to the maximum log-likelihood is close to 1, so a transformation is not necessary.
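
The lambda search can be sketched with `boxcox` from the MASS package loaded earlier. The example below runs on simulated data with a strictly positive response, which Box-Cox requires; the variable names, the lambda grid and the data are assumptions made for illustration, not the settings used in this report.

```r
library(MASS)

set.seed(3)
n <- 300
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- 20 + 10 * toy$x + rnorm(n, 0, 3)   # linear in x, so no transform is needed
stopifnot(all(toy$y > 0))                   # Box-Cox needs a positive response

fit <- lm(y ~ x, data = toy, y = TRUE)

# Profile log-likelihood over a grid of lambda values (no plot drawn).
bc <- boxcox(fit, lambda = seq(-2, 3, by = 0.01), plotit = FALSE)

# Lambda at the maximum of the profile log-likelihood.
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)
```

A lambda near 1 leaves the response essentially unchanged, a lambda near 0 corresponds to a log transform, and a lambda near 0.5 to a square root.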

Model Evaluation

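I use adjusted R-squared to evaluate the models. For Model 1 it is 0.43, meaning that about 43% of the variance in TARGET_WINS is explained by the predictors after adjusting for how many predictors are used. For Model 2 it is 0.22. On this metric, Model 1 should be used to make predictions. Note that the two models were fit on different subsets of the data (1,486 complete rows for Model 1 and 1,937 for Model 2, because rows with missing predictor values are dropped), so the comparison is indicative rather than exact.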
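
The two model summaries report different numbers of rows deleted due to missingness (790 and 339), so the models were not fit on the same observations. The sketch below, on simulated data with invented names, shows how adjusted R-squared is computed and one possible way to refit the smaller model on the rows shared with the larger one before comparing. It is an illustration of the idea, not part of the original analysis.

```r
set.seed(11)
n <- 200
toy <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
toy$wins <- 80 + 5 * toy$a + 3 * toy$b + rnorm(n, 0, 4)
toy$c[1:60] <- NA   # a predictor with missing values

full  <- lm(wins ~ a + b + c, data = toy)  # drops the 60 incomplete rows
small <- lm(wins ~ a, data = toy)          # uses all 200 rows

# Adjusted R-squared from its definition:
# 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p predictors.
adj_r2 <- function(fit) {
  r2 <- summary(fit)$r.squared
  n  <- nobs(fit)
  p  <- length(coef(fit)) - 1
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Refit the smaller model on the rows the larger model actually used.
common       <- toy[complete.cases(toy), ]
small_common <- lm(wins ~ a, data = common)

print(c(full = nobs(full), small = nobs(small), small_common = nobs(small_common)))
print(c(full = adj_r2(full), small_common = adj_r2(small_common)))
```

Comparing `full` with `small_common` puts both models on the same observations, so any difference in adjusted R-squared reflects the predictors and not the sample.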