Load data and rename columns. I removed TEAM in all column names since it wasn’t informative.
path = '/home/kenan/Documents/learning/masters/CUNY-SPS-Masters-DS/DATA_621/hw1/'
df <- read.csv(paste0(path, 'moneyball-training-data.csv')) %>%
select(-INDEX)
names(df) <- gsub('TEAM_', '', x=names(df))
The training data has 16 columns and 2276 observations, the INDEX column was removed as stated. The target column is TARGET_WINS and the remaining columns are all potential predictor variables for a linear models
summary(df)
## TARGET_WINS BATTING_H BATTING_2B BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
##
## BATTING_HR BATTING_BB BATTING_SO BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0
## Median :102.00 Median :512.0 Median : 750.0 Median :101.0
## Mean : 99.61 Mean :501.6 Mean : 735.6 Mean :124.8
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## NA's :102 NA's :131
## BASERUN_CS BATTING_HBP PITCHING_H PITCHING_HR
## Min. : 0.0 Min. :29.00 Min. : 1137 Min. : 0.0
## 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419 1st Qu.: 50.0
## Median : 49.0 Median :58.00 Median : 1518 Median :107.0
## Mean : 52.8 Mean :59.36 Mean : 1779 Mean :105.7
## 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682 3rd Qu.:150.0
## Max. :201.0 Max. :95.00 Max. :30132 Max. :343.0
## NA's :772 NA's :2085
## PITCHING_BB PITCHING_SO FIELDING_E FIELDING_DP
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:131.0
## Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
## Mean : 553.0 Mean : 817.7 Mean : 246.5 Mean :146.4
## 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:164.0
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
## NA's :102 NA's :286
These variables are all on a different scale which is ok for multiple linear regression. Some variables are missing values are others look like they might contain a lot of outliars.
visdat::vis_miss(df, sort_miss = TRUE)
df <- df %>% select(-BATTING_HBP)
BATTING_HBP is dropped because it is missing 91% of observations. No targets are missing.
correlation <- cor(df)
corrplot.mixed(correlation, tl.col = 'black', tl.pos = 'lt')
From the correlation plot we see some strong correlations between variables and some between predictor and target. It looks like BATTING_H, BATTING_2B, BATTING_BB have the strongest correlation with TARGET_WIN. BATTING_HR and PITCHING_HR have a strong correlation of 0.97 so one can be dropped.
inspectdf::inspect_num(df) %>% show_plot()
TARGET_WINS, FIELDING_DP, BATTING_H, BATTING_2B, BATTING_BB are all normally distributed BASERUN_CS, BATTING_3B, FIELDGING_E, BASERUN_SB are right skewed PITCHING_HR, FIELDING_E, BASERUN_SB, PITCHING_BB, PITCHING_SO, PITCHING_H are heavily skewed with PITCHING_HR being unimodal. All non-normally distributed variable might need for feature engineering PITCHING_SO could be dropped since it has only a few unique values, therefore we can drop PITCHING_SO.
Overall it looks like the BATTING variables have the strongest correlation with the TARGET_WINS
df %>% stack() %>% ggplot(aes(x=ind, y=values)) + geom_boxplot() + coord_flip()
## Warning: Removed 1393 rows containing non-finite values (stat_boxplot).
From the boxplot we can see PICHING_H has a lot of outliars making it a predictor that will need extra work
With the data exploration complete we can now drop certain columns and impute missing values. The NA will be replaced with the median value of the cells since we are working with ints.
BATTING_HBP and PITCHING_SO are dropped for the above reason
df <- df %>%
mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .)) %>%
select(-PITCHING_SO)
From knowledge of baseball and a bit of searching new features were created as a linear combination of the current.
df <- df %>%
mutate(OBP = (BATTING_H + BATTING_BB) / (BATTING_H + BATTING_BB - BASERUN_CS),
BAV = BATTING_H / (BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR))
Where OBP is On Base Percentage and BAV is Batting Average
This model will use the imputted data frame and all predictor columns (after removals)
m1 <- lm(TARGET_WINS ~ ., data=df)
summary(m1)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.664 -8.535 0.264 8.158 59.676
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.747e+02 2.476e+02 0.706 0.48048
## BATTING_H 8.425e-02 9.789e-03 8.606 < 2e-16 ***
## BATTING_2B -1.650e-01 3.439e-02 -4.798 1.70e-06 ***
## BATTING_3B -8.388e-02 3.812e-02 -2.200 0.02789 *
## BATTING_HR -6.980e-02 4.221e-02 -1.653 0.09839 .
## BATTING_BB 3.435e-03 5.970e-03 0.575 0.56512
## BATTING_SO -6.395e-03 2.286e-03 -2.797 0.00520 **
## BASERUN_SB 2.598e-02 4.358e-03 5.962 2.89e-09 ***
## BASERUN_CS -9.470e-02 1.376e-01 -0.688 0.49130
## PITCHING_H -8.749e-04 3.676e-04 -2.380 0.01741 *
## PITCHING_HR -1.376e-02 2.313e-02 -0.595 0.55190
## PITCHING_BB 8.718e-03 3.053e-03 2.856 0.00433 **
## FIELDING_E -1.905e-02 2.480e-03 -7.681 2.33e-14 ***
## FIELDING_DP -1.292e-01 1.306e-02 -9.892 < 2e-16 ***
## OBP 1.457e+02 2.432e+02 0.599 0.54915
## BAV -3.669e+02 8.306e+01 -4.417 1.05e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.05 on 2260 degrees of freedom
## Multiple R-squared: 0.3185, Adjusted R-squared: 0.3139
## F-statistic: 70.4 on 15 and 2260 DF, p-value: < 2.2e-16
For this model I will use only theoretical effect the positive impact on wins
theory_pos <- c('TARGET_WINS', 'BATTING_H', 'BATTING_3B', 'BATTING_HR', 'BATTING_BB', 'BASERUN_SB', 'FIELDING_DP')
df2 <- df %>% select(all_of(theory_pos))
m2 <- lm(TARGET_WINS ~ ., data=df2)
summary(m2)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.604 -8.257 0.173 8.731 56.570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.985121 3.442151 3.772 0.000166 ***
## BATTING_H 0.040411 0.002384 16.951 < 2e-16 ***
## BATTING_3B 0.070408 0.016445 4.281 1.93e-05 ***
## BATTING_HR 0.064145 0.007515 8.536 < 2e-16 ***
## BATTING_BB 0.030630 0.002824 10.848 < 2e-16 ***
## BASERUN_SB 0.014557 0.003962 3.674 0.000244 ***
## FIELDING_DP -0.129550 0.013038 -9.936 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.41 on 2269 degrees of freedom
## Multiple R-squared: 0.2767, Adjusted R-squared: 0.2748
## F-statistic: 144.7 on 6 and 2269 DF, p-value: < 2.2e-16
For this model only predictors that have greater than 15% correlation with TARGET_WINS
cor_cols <- data.frame(correlation) %>% filter(TARGET_WINS > 0.2) %>% rownames()
df3 <- df %>% select(cor_cols)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(cor_cols)` instead of `cor_cols` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
m3 <- lm(TARGET_WINS ~ ., data=df3)
summary(m3)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = df3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.753 -8.719 0.529 9.176 48.847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.753191 3.384614 -0.518 0.605
## BATTING_H 0.045218 0.002538 17.819 <2e-16 ***
## BATTING_2B -0.004206 0.008088 -0.520 0.603
## BATTING_BB 0.034136 0.002557 13.348 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.92 on 2272 degrees of freedom
## Multiple R-squared: 0.2196, Adjusted R-squared: 0.2185
## F-statistic: 213.1 on 3 and 2272 DF, p-value: < 2.2e-16
From the 3 models above the p-vals for BATTING_H, BATTING_2B, BASERUN_SB, FIELDING_E, BAV indicate that they play a strong role in predicting TARGET_WINS. Therefore this last model will only use them as predictors
low_pvals <- c('BATTING_H', 'BATTING_2B', 'BASERUN_SB', 'FIELDING_E', 'BAV', 'TARGET_WINS')
df4 <- df %>% select(low_pvals)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(low_pvals)` instead of `low_pvals` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
m4 <- lm(formula = TARGET_WINS ~., data=df4)
summary(m4)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = df4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.395 -9.163 0.093 8.722 61.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.633e+01 1.432e+01 6.728 2.17e-11 ***
## BATTING_H 6.113e-02 2.963e-03 20.632 < 2e-16 ***
## BATTING_2B -7.139e-02 1.229e-02 -5.809 7.15e-09 ***
## BASERUN_SB 3.653e-02 3.546e-03 10.303 < 2e-16 ***
## FIELDING_E -2.178e-02 1.638e-03 -13.301 < 2e-16 ***
## BAV -1.106e+02 1.746e+01 -6.335 2.86e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.37 on 2270 degrees of freedom
## Multiple R-squared: 0.2816, Adjusted R-squared: 0.28
## F-statistic: 177.9 on 5 and 2270 DF, p-value: < 2.2e-16
The table below evaluates all of the models with RMSE, MSE, adjR2 and complexity. Both RMSE and adjR2 quantify how well a regression model fits a dataset. The RMSE tells us how well a regression model can predict the value of the response variable in absolute terms while adjR2 tells us how well a model can predict the value of the response variable in percentage terms. Since this is a multivariate regression problem we evaluates with adjR2 as opposed to R2. Complexity is determined by the number of variables used by the model.
MSE = c(mean(m1$residuals^2), mean(m2$residuals^2), mean(m3$residuals^2), mean(m4$residuals^2))
RMSE = sqrt(MSE)
adjR2 = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared, summary(m3)$adj.r.squared, summary(m4)$adj.r.squared)
model=c('m1','m2','m3','m4')
variables = c(length(m1$coefficients), length(m2$coefficients), length(m3$coefficients), length(m4$coefficients))
mdf <- data.frame(model=model, variables=variables, RMSE=RMSE, MSE=MSE, adjR2=adjR2)
From the table above m1 is the best model with the lowest RMSE/MSE and highest adjR2; however it is the most complex. m3 is the simplest with 4 variables but the worst in terms of all other metrics. The best model seem to be m4 with 6 variables and an adjR2 that is only 3 point below the best.
The evaluation dataset has to go through the same preprocessing as the training data. We impute missing values with the median and create the BAV column, then select the important X values to feed into the model m4.
edf <- read.csv(paste0(path, 'moneyball-evaluation-data.csv'))
names(edf) <- gsub('TEAM_', '', x=names(edf))
edf <- edf %>%
mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .)) %>%
mutate(BAV = BATTING_H / (BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR)) %>%
select(low_pvals[1:5])
edf$TARGET_WINS <- predict(m4, edf)
glimpse(edf)
## Rows: 259
## Columns: 6
## $ BATTING_H <int> 1209, 1221, 1395, 1539, 1445, 1431, 1430, 1385, 1259, 1397…
## $ BATTING_2B <int> 170, 151, 183, 309, 203, 236, 219, 158, 177, 212, 243, 239…
## $ BASERUN_SB <dbl> 62, 54, 59, 148, 92, 92, 365, 185, 150, 52, 64, 48, 31, 27…
## $ FIELDING_E <int> 140, 135, 156, 124, 616, 572, 490, 328, 226, 184, 200, 150…
## $ BAV <dbl> 0.8086957, 0.8200134, 0.8205882, 0.7558939, 0.8396281, 0.8…
## $ TARGET_WINS <dbl> 67.86022, 68.51494, 76.52813, 87.43947, 67.23702, 66.36193…
Comparing statistics from the known TARGET_WINS to the predicted TARGET_WINS
summary(df$TARGET_WINS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 71.00 82.00 80.79 92.00 146.00
summary(edf$TARGET_WINS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28.59 76.44 81.29 80.48 85.83 111.71
Our predicted values have a similar mean and IQR range; however the min and max vary. This is still kind of surprising because the model had an adjR2 of 0.28 which isn’t very good but still captured enough of the relationship between X~y to make reasonable predictions on the evaluation set.