Homework 1 - Money Ball

Data Exploration

Load data and rename columns. I removed TEAM in all column names since it wasn’t informative.

path = '/home/kenan/Documents/learning/masters/CUNY-SPS-Masters-DS/DATA_621/hw1/'
df <- read.csv(paste0(path, 'moneyball-training-data.csv')) %>%
  select(-INDEX)
names(df) <- gsub('TEAM_', '', x=names(df))

The training data has 16 columns and 2276 observations, the INDEX column was removed as stated. The target column is TARGET_WINS and the remaining columns are all potential predictor variables for a linear models

summary(df)

##   TARGET_WINS       BATTING_H      BATTING_2B      BATTING_3B    
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##                                                                  
##    BATTING_HR       BATTING_BB      BATTING_SO       BASERUN_SB   
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##                                   NA's   :102      NA's   :131    
##    BASERUN_CS     BATTING_HBP      PITCHING_H     PITCHING_HR   
##  Min.   :  0.0   Min.   :29.00   Min.   : 1137   Min.   :  0.0  
##  1st Qu.: 38.0   1st Qu.:50.50   1st Qu.: 1419   1st Qu.: 50.0  
##  Median : 49.0   Median :58.00   Median : 1518   Median :107.0  
##  Mean   : 52.8   Mean   :59.36   Mean   : 1779   Mean   :105.7  
##  3rd Qu.: 62.0   3rd Qu.:67.00   3rd Qu.: 1682   3rd Qu.:150.0  
##  Max.   :201.0   Max.   :95.00   Max.   :30132   Max.   :343.0  
##  NA's   :772     NA's   :2085                                   
##   PITCHING_BB      PITCHING_SO        FIELDING_E      FIELDING_DP   
##  Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   : 52.0  
##  1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0   1st Qu.:131.0  
##  Median : 536.5   Median :  813.5   Median : 159.0   Median :149.0  
##  Mean   : 553.0   Mean   :  817.7   Mean   : 246.5   Mean   :146.4  
##  3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2   3rd Qu.:164.0  
##  Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :228.0  
##                   NA's   :102                        NA's   :286

These variables are all on a different scale which is ok for multiple linear regression. Some variables are missing values are others look like they might contain a lot of outliars.

visdat::vis_miss(df, sort_miss = TRUE)

df <- df %>% select(-BATTING_HBP)

BATTING_HBP is dropped because it is missing 91% of observations. No targets are missing.

correlation <- cor(df)
corrplot.mixed(correlation, tl.col = 'black', tl.pos = 'lt')

From the correlation plot we see some strong correlations between variables and some between predictor and target. It looks like BATTING_H, BATTING_2B, BATTING_BB have the strongest correlation with TARGET_WIN. BATTING_HR and PITCHING_HR have a strong correlation of 0.97 so one can be dropped.

inspectdf::inspect_num(df) %>% show_plot()

TARGET_WINS, FIELDING_DP, BATTING_H, BATTING_2B, BATTING_BB are all normally distributed BASERUN_CS, BATTING_3B, FIELDGING_E, BASERUN_SB are right skewed PITCHING_HR, FIELDING_E, BASERUN_SB, PITCHING_BB, PITCHING_SO, PITCHING_H are heavily skewed with PITCHING_HR being unimodal. All non-normally distributed variable might need for feature engineering PITCHING_SO could be dropped since it has only a few unique values, therefore we can drop PITCHING_SO.

Overall it looks like the BATTING variables have the strongest correlation with the TARGET_WINS

df %>% stack() %>% ggplot(aes(x=ind, y=values)) + geom_boxplot() + coord_flip()

## Warning: Removed 1393 rows containing non-finite values (stat_boxplot).

From the boxplot we can see PICHING_H has a lot of outliars making it a predictor that will need extra work

Data Preparation

With the data exploration complete we can now drop certain columns and impute missing values. The NA will be replaced with the median value of the cells since we are working with ints.

BATTING_HBP and PITCHING_SO are dropped for the above reason

df <- df %>% 
  mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .)) %>% 
  select(-PITCHING_SO)

From knowledge of baseball and a bit of searching new features were created as a linear combination of the current.

df <- df %>% 
  mutate(OBP = (BATTING_H + BATTING_BB) / (BATTING_H + BATTING_BB - BASERUN_CS),
         BAV = BATTING_H / (BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR))

Where OBP is On Base Percentage and BAV is Batting Average

Build Models

Model 1

This model will use the imputted data frame and all predictor columns (after removals)

m1 <- lm(TARGET_WINS ~ ., data=df)
summary(m1)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.664  -8.535   0.264   8.158  59.676 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.747e+02  2.476e+02   0.706  0.48048    
## BATTING_H    8.425e-02  9.789e-03   8.606  < 2e-16 ***
## BATTING_2B  -1.650e-01  3.439e-02  -4.798 1.70e-06 ***
## BATTING_3B  -8.388e-02  3.812e-02  -2.200  0.02789 *  
## BATTING_HR  -6.980e-02  4.221e-02  -1.653  0.09839 .  
## BATTING_BB   3.435e-03  5.970e-03   0.575  0.56512    
## BATTING_SO  -6.395e-03  2.286e-03  -2.797  0.00520 ** 
## BASERUN_SB   2.598e-02  4.358e-03   5.962 2.89e-09 ***
## BASERUN_CS  -9.470e-02  1.376e-01  -0.688  0.49130    
## PITCHING_H  -8.749e-04  3.676e-04  -2.380  0.01741 *  
## PITCHING_HR -1.376e-02  2.313e-02  -0.595  0.55190    
## PITCHING_BB  8.718e-03  3.053e-03   2.856  0.00433 ** 
## FIELDING_E  -1.905e-02  2.480e-03  -7.681 2.33e-14 ***
## FIELDING_DP -1.292e-01  1.306e-02  -9.892  < 2e-16 ***
## OBP          1.457e+02  2.432e+02   0.599  0.54915    
## BAV         -3.669e+02  8.306e+01  -4.417 1.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.05 on 2260 degrees of freedom
## Multiple R-squared:  0.3185, Adjusted R-squared:  0.3139 
## F-statistic:  70.4 on 15 and 2260 DF,  p-value: < 2.2e-16

Model 2

For this model I will use only theoretical effect the positive impact on wins

theory_pos <- c('TARGET_WINS', 'BATTING_H', 'BATTING_3B', 'BATTING_HR', 'BATTING_BB', 'BASERUN_SB', 'FIELDING_DP')
df2 <- df %>% select(all_of(theory_pos))

m2 <- lm(TARGET_WINS ~ ., data=df2)
summary(m2)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.604  -8.257   0.173   8.731  56.570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.985121   3.442151   3.772 0.000166 ***
## BATTING_H    0.040411   0.002384  16.951  < 2e-16 ***
## BATTING_3B   0.070408   0.016445   4.281 1.93e-05 ***
## BATTING_HR   0.064145   0.007515   8.536  < 2e-16 ***
## BATTING_BB   0.030630   0.002824  10.848  < 2e-16 ***
## BASERUN_SB   0.014557   0.003962   3.674 0.000244 ***
## FIELDING_DP -0.129550   0.013038  -9.936  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.41 on 2269 degrees of freedom
## Multiple R-squared:  0.2767, Adjusted R-squared:  0.2748 
## F-statistic: 144.7 on 6 and 2269 DF,  p-value: < 2.2e-16

Model 3

For this model only predictors that have greater than 15% correlation with TARGET_WINS

cor_cols <- data.frame(correlation) %>% filter(TARGET_WINS > 0.2) %>% rownames()
df3 <- df %>% select(cor_cols)

## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(cor_cols)` instead of `cor_cols` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

m3 <- lm(TARGET_WINS ~ ., data=df3)
summary(m3)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = df3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.753  -8.719   0.529   9.176  48.847 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.753191   3.384614  -0.518    0.605    
## BATTING_H    0.045218   0.002538  17.819   <2e-16 ***
## BATTING_2B  -0.004206   0.008088  -0.520    0.603    
## BATTING_BB   0.034136   0.002557  13.348   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.92 on 2272 degrees of freedom
## Multiple R-squared:  0.2196, Adjusted R-squared:  0.2185 
## F-statistic: 213.1 on 3 and 2272 DF,  p-value: < 2.2e-16

Model 4

From the 3 models above the p-vals for BATTING_H, BATTING_2B, BASERUN_SB, FIELDING_E, BAV indicate that they play a strong role in predicting TARGET_WINS. Therefore this last model will only use them as predictors

low_pvals <- c('BATTING_H', 'BATTING_2B', 'BASERUN_SB', 'FIELDING_E', 'BAV', 'TARGET_WINS')
df4 <- df %>% select(low_pvals)

## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(low_pvals)` instead of `low_pvals` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

m4 <- lm(formula = TARGET_WINS ~., data=df4)
summary(m4)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = df4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.395  -9.163   0.093   8.722  61.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.633e+01  1.432e+01   6.728 2.17e-11 ***
## BATTING_H    6.113e-02  2.963e-03  20.632  < 2e-16 ***
## BATTING_2B  -7.139e-02  1.229e-02  -5.809 7.15e-09 ***
## BASERUN_SB   3.653e-02  3.546e-03  10.303  < 2e-16 ***
## FIELDING_E  -2.178e-02  1.638e-03 -13.301  < 2e-16 ***
## BAV         -1.106e+02  1.746e+01  -6.335 2.86e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.37 on 2270 degrees of freedom
## Multiple R-squared:  0.2816, Adjusted R-squared:   0.28 
## F-statistic: 177.9 on 5 and 2270 DF,  p-value: < 2.2e-16

Select Models

The table below evaluates all of the models with RMSE, MSE, adjR2 and complexity. Both RMSE and adjR2 quantify how well a regression model fits a dataset. The RMSE tells us how well a regression model can predict the value of the response variable in absolute terms while adjR2 tells us how well a model can predict the value of the response variable in percentage terms. Since this is a multivariate regression problem we evaluates with adjR2 as opposed to R2. Complexity is determined by the number of variables used by the model.

MSE = c(mean(m1$residuals^2), mean(m2$residuals^2), mean(m3$residuals^2), mean(m4$residuals^2))
RMSE = sqrt(MSE)
adjR2 = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared, summary(m3)$adj.r.squared, summary(m4)$adj.r.squared)
model=c('m1','m2','m3','m4')
variables = c(length(m1$coefficients), length(m2$coefficients), length(m3$coefficients), length(m4$coefficients))

mdf <- data.frame(model=model, variables=variables, RMSE=RMSE, MSE=MSE, adjR2=adjR2)

From the table above m1 is the best model with the lowest RMSE/MSE and highest adjR2; however it is the most complex. m3 is the simplest with 4 variables but the worst in terms of all other metrics. The best model seem to be m4 with 6 variables and an adjR2 that is only 3 point below the best.

Evaluation

The evaluation dataset has to go through the same preprocessing as the training data. We impute missing values with the median and create the BAV column, then select the important X values to feed into the model m4.

edf <- read.csv(paste0(path, 'moneyball-evaluation-data.csv'))
names(edf) <- gsub('TEAM_', '', x=names(edf))
edf <- edf %>% 
  mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .)) %>%
  mutate(BAV = BATTING_H / (BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR)) %>%
  select(low_pvals[1:5])
edf$TARGET_WINS <- predict(m4, edf)

glimpse(edf)

## Rows: 259
## Columns: 6
## $ BATTING_H   <int> 1209, 1221, 1395, 1539, 1445, 1431, 1430, 1385, 1259, 1397…
## $ BATTING_2B  <int> 170, 151, 183, 309, 203, 236, 219, 158, 177, 212, 243, 239…
## $ BASERUN_SB  <dbl> 62, 54, 59, 148, 92, 92, 365, 185, 150, 52, 64, 48, 31, 27…
## $ FIELDING_E  <int> 140, 135, 156, 124, 616, 572, 490, 328, 226, 184, 200, 150…
## $ BAV         <dbl> 0.8086957, 0.8200134, 0.8205882, 0.7558939, 0.8396281, 0.8…
## $ TARGET_WINS <dbl> 67.86022, 68.51494, 76.52813, 87.43947, 67.23702, 66.36193…

Comparing statistics from the known TARGET_WINS to the predicted TARGET_WINS

summary(df$TARGET_WINS)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   71.00   82.00   80.79   92.00  146.00

summary(edf$TARGET_WINS)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   28.59   76.44   81.29   80.48   85.83  111.71

Our predicted values have a similar mean and IQR range; however the min and max vary. This is still kind of surprising because the model had an adjR2 of 0.28 which isn’t very good but still captured enough of the relationship between X~y to make reasonable predictions on the evaluation set.