Homework #1 Assignment Requirements
Author: James Mundy
Date: March 1, 2020
In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
The goal of the Moneyball assignment is to develop three or more regression models that predict team wins.
As an avid baseball fan of many years and having recently read Analyzing Baseball Data with R, I begin this assignment with a the following premise:
Baseball teams win games (Target_Wins) by scoring more runs than their opponent. Thus the number of wins earned over a season must be related to both the number of runs scored and allowed.
As a result I will seek to utilize variables that capture runs scored and allowed. Below I have listed four metrics utilized in baseball to capture run scoring and one metric that addresses runs allowed.
Batting Average (AVG) - One of the oldest and most universal tools to measure a hitter’s success at the plate, batting average is determined by dividing a player’s hits by his total at-bats.
On base Percentage (OBP) - OBP refers to how frequently a batter reaches base per plate appearance. Times on base include hits, walks and hit-by-pitches, but do not include errors, times reached on a fielder’s choice or a dropped third strike.
Slugging Percentage (SLG) - Slugging percentage represents the total number of bases a player records per at-bat. Unlike on-base percentage, slugging percentage deals only with hits and does not include walks and hit-by-pitches in its equation.
On-base Plus Slugging (OPS) - OPS adds on-base percentage and slugging percentage to get one number that unites the two. It’s meant to combine how well a hitter can reach base, with how well he can hit for average and for power.
Run Differential (RD) - In baseball, run differential is a cumulative team statistic that combines offensive and defensive scoring. Run differential is calculated by subtracting runs allowed from runs scored.
I will use the Moneyball data set to calculate these five statistics (or similar alternatives) and use these statistics as the basis of my regression models.
I will utilize the SKIM package and ggplot charts to perform my exploratory data analysis. The objective of this is analysis is to develop a better understanding of the data to include its shape, central tendencies, completeness (missing data) and its correlation to our response variable Target_Wins. Here are some key takeaways from my data exploration:
Below you will find the Skim function results, density charts, scatter plots and correlation tables and plots that support the take always above and provide additional detailed information.
mbData <- read.csv("https://raw.githubusercontent.com/ilyakats/CUNY-DATA621/master/hw1/moneyball-training-data.csv")
mbData <- mbData %>%
select(-INDEX)
mbData2 <- mbData
mbTest <- read.csv("https://raw.githubusercontent.com/ilyakats/CUNY-DATA621/master/hw1/moneyball-evaluation-data.csv")
mbTest <- mbTest %>%
select(-INDEX)
The skim function is an alternative to the summary function. It displays most of the numerical attributes from summary, but it also displays missing values, more quantile information and an inline histogram for each variable.
Name | mbData |
Number of rows | 2276 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
numeric | 16 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
TARGET_WINS | 0 | 1.00 | 80.79 | 15.75 | 0 | 71.0 | 82.0 | 92.00 | 146 | ▁▁▇▅▁ |
TEAM_BATTING_H | 0 | 1.00 | 1469.27 | 144.59 | 891 | 1383.0 | 1454.0 | 1537.25 | 2554 | ▁▇▂▁▁ |
TEAM_BATTING_2B | 0 | 1.00 | 241.25 | 46.80 | 69 | 208.0 | 238.0 | 273.00 | 458 | ▁▆▇▂▁ |
TEAM_BATTING_3B | 0 | 1.00 | 55.25 | 27.94 | 0 | 34.0 | 47.0 | 72.00 | 223 | ▇▇▂▁▁ |
TEAM_BATTING_HR | 0 | 1.00 | 99.61 | 60.55 | 0 | 42.0 | 102.0 | 147.00 | 264 | ▇▆▇▅▁ |
TEAM_BATTING_BB | 0 | 1.00 | 501.56 | 122.67 | 0 | 451.0 | 512.0 | 580.00 | 878 | ▁▁▇▇▁ |
TEAM_BATTING_SO | 102 | 0.96 | 735.61 | 248.53 | 0 | 548.0 | 750.0 | 930.00 | 1399 | ▁▆▇▇▁ |
TEAM_BASERUN_SB | 131 | 0.94 | 124.76 | 87.79 | 0 | 66.0 | 101.0 | 156.00 | 697 | ▇▃▁▁▁ |
TEAM_BASERUN_CS | 772 | 0.66 | 52.80 | 22.96 | 0 | 38.0 | 49.0 | 62.00 | 201 | ▃▇▁▁▁ |
TEAM_BATTING_HBP | 2085 | 0.08 | 59.36 | 12.97 | 29 | 50.5 | 58.0 | 67.00 | 95 | ▂▇▇▅▁ |
TEAM_PITCHING_H | 0 | 1.00 | 1779.21 | 1406.84 | 1137 | 1419.0 | 1518.0 | 1682.50 | 30132 | ▇▁▁▁▁ |
TEAM_PITCHING_HR | 0 | 1.00 | 105.70 | 61.30 | 0 | 50.0 | 107.0 | 150.00 | 343 | ▇▇▆▁▁ |
TEAM_PITCHING_BB | 0 | 1.00 | 553.01 | 166.36 | 0 | 476.0 | 536.5 | 611.00 | 3645 | ▇▁▁▁▁ |
TEAM_PITCHING_SO | 102 | 0.96 | 817.73 | 553.09 | 0 | 615.0 | 813.5 | 968.00 | 19278 | ▇▁▁▁▁ |
TEAM_FIELDING_E | 0 | 1.00 | 246.48 | 227.77 | 65 | 127.0 | 159.0 | 249.25 | 1898 | ▇▁▁▁▁ |
TEAM_FIELDING_DP | 286 | 0.87 | 146.39 | 26.23 | 52 | 131.0 | 149.0 | 164.00 | 228 | ▁▂▇▆▁ |
mbData %>%
gather(variable, value, TARGET_WINS:TEAM_FIELDING_DP) %>%
ggplot(., aes(value)) +
geom_density(fill = "Blue", color="Blue") +
facet_wrap(~variable, scales ="free", ncol = 4)
mbData %>%
gather(variable, value, TARGET_WINS:TEAM_FIELDING_DP) %>%
ggplot(., aes(x= variable, y=value)) +
geom_boxplot() +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = element_blank())
mbData %>%
gather(variable, value, -TARGET_WINS) %>%
ggplot(., aes(value, TARGET_WINS)) +
geom_point(fill = "blue", color="blue") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = "Wins")
temp <- mbData %>%
cor(., use = "complete.obs") #%>%
temp[lower.tri(temp, diag=TRUE)] <- ""
temp <- temp %>%
as.data.frame() %>%
rownames_to_column() %>%
gather(Variable, Correlation, -rowname) %>%
filter(Variable != rowname) %>%
filter(Correlation != "") %>%
mutate(Correlation = as.numeric(Correlation)) %>%
rename(` Variable` = rowname) %>%
arrange(desc(abs(Correlation)))
## Warning: attributes are not identical across measure variables;
## they will be dropped
Variable | Variable | Correlation |
---|---|---|
TARGET_WINS | TEAM_PITCHING_H | 0.4712343 |
TARGET_WINS | TEAM_BATTING_H | 0.4699467 |
TARGET_WINS | TEAM_BATTING_BB | 0.4686879 |
TARGET_WINS | TEAM_PITCHING_BB | 0.4683988 |
TARGET_WINS | TEAM_PITCHING_HR | 0.4224668 |
TARGET_WINS | TEAM_BATTING_HR | 0.4224168 |
TARGET_WINS | TEAM_FIELDING_E | -0.3866880 |
TARGET_WINS | TEAM_BATTING_2B | 0.3129840 |
TARGET_WINS | TEAM_PITCHING_SO | -0.2293648 |
TARGET_WINS | TEAM_BATTING_SO | -0.2288927 |
TARGET_WINS | TEAM_FIELDING_DP | -0.1958660 |
TARGET_WINS | TEAM_BASERUN_CS | -0.1787560 |
TARGET_WINS | TEAM_BATTING_3B | -0.1243459 |
TARGET_WINS | TEAM_BATTING_HBP | 0.0735042 |
TARGET_WINS | TEAM_BASERUN_SB | 0.0148364 |
Now with our enhanced understanding of the data set, I will begin to prepare the data for model building. My data preparation will include addressing missing data, outliers and feature engineering or creating new variables.
Six variables in the Moneyball data set are missing data. The variables are set forth below along with each variable’s complete_rate as provided by the Skim function.
Two of these variables (TEAM_BATTING_HBP and TEAM_BASERUN_CS) are inputs to our new variables. Therefore, in each case, NA values are replaced with the major league baseball 2019 and 2018 average for Hit By Pitch (HBP) and Caught Stealing (CS). The values were obtained from Team rankings.com and are 65 and 30, respectively.
The Target Wins box plot below introduces the possibility of outliers in our response variable. I will utilize MLB historical win/loss information to eliminate these outliers.
mbData2 %>%
gather(variable, value, TARGET_WINS) %>%
ggplot(., aes(x= variable, y=value)) +
geom_boxplot() +
labs(x = element_blank(), y = element_blank())
Use dplyr to filter out the outlier rows.
mbData2 %>%
gather(variable, value, TARGET_WINS) %>%
ggplot(., aes(x= variable, y=value)) +
geom_boxplot() +
labs(x = element_blank(), y = element_blank())
Now, with the missing data and outlier challenges addressed, I will direct my attention to the creation of our five new variables and the selection of variables for the regression model(s).
We will supplement the the Moneyball data set with five new variables: On base Percentage (OBP), Slugging Percentage (SLG), On Base Plus Slugging (OPS) and Batting Average (AVG) and Run Differential (RD). Plate Appearances and At Bats are two additional variables that are inputs in a new variable calculations. The definitions for each of these variables follow:
On Base Percentage = (H + BB + HBP) / (AB + BB + HBP + SF)
Slugging Percentage = (1B + 22B + 33B + 4*HR) / AB
Batting Average = H / AB
Plate Appearance(PA) = H + BB + K + HBP + SH + SF + DI + E+ DFO
Total Bases (TB) = (1B + 22B + 33B + 4*HR)
Where: H = Hit, BB = Walk, K = Strikeout, HBP = Hit by Pitch, SH = Sacrifice Hit, SF = Sacrifice Fly, DI = Defensive Interference, E = Fielding error,DFO = Defensive Fielding Out.
The provided data set does not include all the variable necessary to calculate these metrics. Therefore, I will create and utilize the following alternative definitions:
On Base Percentage = (H + BB + HBP) / PA
Slugging Percentage = (1B + 22B + 33B + 4*HR) / (PA-BB-HBP)
Batting Average = H / (PA-BB-HBP)
Plate Appearance = H + BB + HBP + (162 * 9 * 3) - cs - DP
Run Differential = (Team_Batting_H + Team_Batting_HR) - (Team_Pitching_H + Team_Pitching_HR)
mbData2 <- mbData2 %>%
mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B -TEAM_BATTING_3B -TEAM_BATTING_HR) %>%
mutate(PA = TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_HBP + 4374 - TEAM_BASERUN_CS) %>%
mutate(AB = TEAM_BATTING_H + 4374 - TEAM_BASERUN_CS)
mbData2 <- mbData2 %>%
mutate(OBP = (TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_HBP)/ PA) %>%
mutate(SLG = ((TEAM_BATTING_1B*1) +(2*TEAM_BATTING_2B)+(TEAM_BATTING_3B*3)+(TEAM_BATTING_HR*4))/AB) %>%
mutate(TB = ((TEAM_BATTING_1B*1) +(2*TEAM_BATTING_2B)+(TEAM_BATTING_3B*3)+(TEAM_BATTING_HR*4))) %>%
mutate(AVG = TEAM_BATTING_H/AB) %>%
mutate(OPS = OBP + SLG) %>%
mutate(RD = TEAM_BATTING_H + TEAM_BATTING_HR - TEAM_PITCHING_H - TEAM_PITCHING_HR) %>%
mutate(NI = TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_FIELDING_E)
I’ve replace missing data, removed outliers, created new variables. I’ve also selected eight explanatory variables for use in my regression models. Below I have recreated some of the exploratory data visuals for the revised data set.
Note - After some preliminary model building I decided to expand my variable set to increase my chances of producing a good model.
The Skim function reveals a cleaner and smaller revised data set
Name | mbData2 |
Number of rows | 2262 |
Number of columns | 26 |
_______________________ | |
Column type frequency: | |
numeric | 26 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
TARGET_WINS | 0 | 1.00 | 80.70 | 15.18 | 21.00 | 71.00 | 82.00 | 91.00 | 120.00 | ▁▂▇▇▂ |
TEAM_BATTING_H | 0 | 1.00 | 1468.03 | 139.35 | 992.00 | 1383.00 | 1454.00 | 1536.00 | 2554.00 | ▁▇▁▁▁ |
TEAM_BATTING_2B | 0 | 1.00 | 241.17 | 46.34 | 69.00 | 208.00 | 238.00 | 272.75 | 458.00 | ▁▆▇▂▁ |
TEAM_BATTING_3B | 0 | 1.00 | 55.10 | 27.80 | 0.00 | 34.00 | 47.00 | 72.00 | 223.00 | ▇▇▂▁▁ |
TEAM_BATTING_HR | 0 | 1.00 | 100.08 | 60.41 | 0.00 | 42.00 | 102.50 | 147.00 | 264.00 | ▇▆▇▅▁ |
TEAM_BATTING_BB | 0 | 1.00 | 503.37 | 120.17 | 29.00 | 452.00 | 513.00 | 580.00 | 878.00 | ▁▁▇▆▁ |
TEAM_BATTING_SO | 101 | 0.96 | 738.37 | 245.84 | 0.00 | 551.00 | 752.00 | 932.00 | 1399.00 | ▁▆▇▇▁ |
TEAM_BASERUN_SB | 123 | 0.95 | 124.72 | 87.63 | 0.00 | 66.00 | 101.00 | 156.00 | 697.00 | ▇▃▁▁▁ |
TEAM_BASERUN_CS | 0 | 1.00 | 45.18 | 21.56 | 11.00 | 30.00 | 38.00 | 55.00 | 201.00 | ▇▃▁▁▁ |
TEAM_BATTING_HBP | 0 | 1.00 | 64.52 | 4.07 | 29.00 | 65.00 | 65.00 | 65.00 | 95.00 | ▁▁▇▁▁ |
TEAM_PITCHING_H | 0 | 1.00 | 1745.98 | 1258.05 | 1137.00 | 1418.00 | 1517.00 | 1679.00 | 30132.00 | ▇▁▁▁▁ |
TEAM_PITCHING_HR | 0 | 1.00 | 106.01 | 61.07 | 0.00 | 50.00 | 108.00 | 150.00 | 343.00 | ▇▇▆▁▁ |
TEAM_PITCHING_BB | 0 | 1.00 | 553.57 | 164.44 | 119.00 | 477.00 | 537.00 | 611.00 | 3645.00 | ▇▁▁▁▁ |
TEAM_PITCHING_SO | 101 | 0.96 | 819.04 | 553.58 | 0.00 | 615.00 | 814.00 | 968.00 | 19278.00 | ▇▁▁▁▁ |
TEAM_FIELDING_E | 0 | 1.00 | 242.21 | 218.81 | 65.00 | 127.00 | 158.00 | 248.00 | 1898.00 | ▇▁▁▁▁ |
TEAM_FIELDING_DP | 275 | 0.88 | 146.47 | 26.15 | 52.00 | 131.00 | 149.00 | 164.00 | 228.00 | ▁▂▇▆▁ |
TEAM_BATTING_1B | 0 | 1.00 | 1071.68 | 124.85 | 709.00 | 990.00 | 1050.00 | 1128.00 | 2112.00 | ▃▇▁▁▁ |
PA | 0 | 1.00 | 6364.74 | 179.05 | 5543.00 | 6254.00 | 6362.00 | 6472.00 | 7229.00 | ▁▂▇▂▁ |
AB | 0 | 1.00 | 5796.85 | 142.00 | 5336.00 | 5708.00 | 5782.50 | 5867.00 | 6854.00 | ▁▇▁▁▁ |
OBP | 0 | 1.00 | 0.32 | 0.02 | 0.22 | 0.31 | 0.32 | 0.33 | 0.40 | ▁▁▇▅▁ |
SLG | 0 | 1.00 | 0.37 | 0.04 | 0.24 | 0.34 | 0.37 | 0.39 | 0.48 | ▁▃▇▅▁ |
TB | 0 | 1.00 | 2119.65 | 255.25 | 1295.00 | 1947.00 | 2126.50 | 2284.00 | 3290.00 | ▁▆▇▁▁ |
AVG | 0 | 1.00 | 0.25 | 0.02 | 0.19 | 0.24 | 0.25 | 0.26 | 0.37 | ▁▇▃▁▁ |
OPS | 0 | 1.00 | 0.68 | 0.05 | 0.46 | 0.65 | 0.69 | 0.72 | 0.87 | ▁▂▇▅▁ |
RD | 0 | 1.00 | -283.88 | 1223.06 | -28458.00 | -108.00 | -78.00 | 0.00 | 29.00 | ▁▁▁▁▇ |
NI | 0 | 1.00 | 2647.77 | 1465.38 | 1814.00 | 2206.00 | 2360.00 | 2579.00 | 33156.00 | ▇▁▁▁▁ |
The density plot of RD reveals significant skew. A transformation may need to be employed. Home Runs remains bi-modal.
mbData2 %>%
gather(variable, value, TARGET_WINS:RD) %>%
ggplot(., aes(value)) +
geom_density(fill = "Blue", color="Blue") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = element_blank())
The Scatter plots generally reveal a linear relationship between the response variable and New variables.
mbData2 %>%
gather(variable, value, -TARGET_WINS) %>%
ggplot(., aes(value, TARGET_WINS)) +
geom_point(fill = "blue", color="blue") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = "Wins")
## Warning: Removed 600 rows containing non-finite values (stat_smooth).
## Warning: Removed 600 rows containing missing values (geom_point).
Here is the correlations table for the revised data set.
temp2 <- mbData2 %>%
cor(., use = "complete.obs") #%>%
temp2[lower.tri(temp2, diag=TRUE)] <- ""
temp2 <- temp2 %>%
as.data.frame() %>%
rownames_to_column() %>%
gather(Variable, Correlation, -rowname) %>%
filter(Variable != rowname) %>%
filter(Correlation != "") %>%
mutate(Correlation = as.numeric(Correlation)) %>%
rename(` Variable` = rowname) %>%
arrange(desc(abs(Correlation)))
temp2 %>%
filter(` Variable` == "TARGET_WINS") %>%
kable() %>%
kable_styling()
Variable | Variable | Correlation |
---|---|---|
TARGET_WINS | OBP | 0.4425900 |
TARGET_WINS | PA | 0.4358739 |
TARGET_WINS | OPS | 0.4239949 |
TARGET_WINS | TB | 0.3891767 |
TARGET_WINS | SLG | 0.3743504 |
TARGET_WINS | AVG | 0.3526064 |
TARGET_WINS | TEAM_BATTING_H | 0.3522027 |
TARGET_WINS | AB | 0.3447749 |
TARGET_WINS | TEAM_BATTING_BB | 0.3015885 |
TARGET_WINS | TEAM_PITCHING_BB | 0.2717246 |
TARGET_WINS | NI | 0.2699488 |
TARGET_WINS | TEAM_PITCHING_H | 0.2212359 |
TARGET_WINS | TEAM_BATTING_HR | 0.2201378 |
TARGET_WINS | TEAM_PITCHING_HR | 0.2197311 |
TARGET_WINS | TEAM_BATTING_2B | 0.2133341 |
TARGET_WINS | TEAM_FIELDING_E | -0.1848233 |
TARGET_WINS | TEAM_BATTING_1B | 0.1599655 |
TARGET_WINS | TEAM_BATTING_3B | 0.1227178 |
TARGET_WINS | TEAM_BASERUN_SB | 0.1213532 |
TARGET_WINS | TEAM_PITCHING_SO | -0.0655594 |
TARGET_WINS | TEAM_BATTING_SO | -0.0594744 |
TARGET_WINS | TEAM_FIELDING_DP | -0.0366452 |
TARGET_WINS | TEAM_BATTING_HBP | 0.0207607 |
TARGET_WINS | TEAM_BASERUN_CS | -0.0084982 |
TARGET_WINS | RD | -0.0065063 |
I will use my selected variable to build several models to predict wins. The variables selected reflect my strategy of using variables that are related to runs scored and/or runs allowed. I will utilize the training data set to create the various models. Next, in subsequent sections I will select the best model and apply the test data set to that model. I will provide summary, broom:tidy and brooom:glance data for each model.
Note - As I tested various models, I came across an interesting finding. If I only included data rows that were populated with Hit By Pitch values the R-Squared of all models increase by 10 to 15 percentage points. I determined however, despite the favorable R-Squared results, that resulting data set would be too small to develop viable models.
##
## Call:
## lm(formula = TARGET_WINS ~ OBP + OPS + AVG + RD, data = mbData2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.174 -8.878 0.249 9.216 67.940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.817e+01 4.975e+00 -5.662 1.69e-08 ***
## OBP 1.630e+02 3.417e+01 4.771 1.95e-06 ***
## OPS 3.931e+01 1.128e+01 3.485 0.000502 ***
## AVG 1.199e+02 2.795e+01 4.290 1.86e-05 ***
## RD 1.406e-03 2.643e-04 5.319 1.15e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.45 on 2257 degrees of freedom
## Multiple R-squared: 0.2163, Adjusted R-squared: 0.2149
## F-statistic: 155.7 on 4 and 2257 DF, p-value: < 2.2e-16
m7 <- lm(TARGET_WINS ~ OPS + RD + TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_PITCHING_BB, data = mbData2)
summary(m7)
##
## Call:
## lm(formula = TARGET_WINS ~ OPS + RD + TEAM_PITCHING_H + TEAM_FIELDING_E +
## TEAM_PITCHING_BB, data = mbData2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.795 -9.064 0.202 9.087 42.246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.168667 3.961676 0.295 0.768
## OPS 22.722267 11.059706 2.055 0.040 *
## RD 0.042823 0.004197 10.204 < 2e-16 ***
## TEAM_PITCHING_H 0.040885 0.004145 9.864 < 2e-16 ***
## TEAM_FIELDING_E -0.009496 0.002205 -4.306 1.73e-05 ***
## TEAM_PITCHING_BB 0.012742 0.002178 5.850 5.62e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.35 on 2256 degrees of freedom
## Multiple R-squared: 0.2278, Adjusted R-squared: 0.2261
## F-statistic: 133.1 on 5 and 2256 DF, p-value: < 2.2e-16
m8 <- lm(TARGET_WINS ~ RD + TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_PITCHING_BB, data = mbData2)
summary(m8)
##
## Call:
## lm(formula = TARGET_WINS ~ RD + TEAM_PITCHING_H + TEAM_FIELDING_E +
## TEAM_PITCHING_BB, data = mbData2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.786 -8.903 0.169 9.111 41.434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.701446 3.292833 1.731 0.0835 .
## RD 0.050192 0.002180 23.026 < 2e-16 ***
## TEAM_PITCHING_H 0.048185 0.002135 22.565 < 2e-16 ***
## TEAM_FIELDING_E -0.012065 0.001818 -6.636 4.03e-11 ***
## TEAM_PITCHING_BB 0.014524 0.001999 7.265 5.10e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.36 on 2257 degrees of freedom
## Multiple R-squared: 0.2263, Adjusted R-squared: 0.225
## F-statistic: 165.1 on 4 and 2257 DF, p-value: < 2.2e-16
In the Build phase of this project, I built 9 regression models. I have narrowed those 9 models down to 3 models that use to determine the best model. To select the best model I will test my final 3 models against the Moneyball evaluation data set. To accomplish this I will first create the same variable that were created in the training data set. Then I will use the models to predict Wins in the evaluation data set (Note -Target Wins is not included in the evaluation set).
In the final analysis, Models 4, 7 and 8 all produce very similar results. However, Owing to the counter intuitive coefficients in models 7 and 8, I selected model 4 as the best model. Model 4 benefits from brand name baseball metrics, intuitive coefficients, and an R-Squared of 22%. Model 8 is a very close second owing to the elegance of it only using Run Differential, Hits Allowed, Walks Allowed and Fielding Errors. Model 8 also had a higher F-Statistic. All Three model produce what I believe are good predictions. See below for additional detail.
#Clean-up missing data
mbTest <- mbTest %>%
mutate(TEAM_BATTING_HBP = replace_na(TEAM_BATTING_HBP,65)) %>%
mutate(TEAM_BASERUN_CS = replace_na(TEAM_BASERUN_CS,30))
mbTest <- mbTest %>%
mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B -TEAM_BATTING_3B -TEAM_BATTING_HR) %>%
mutate(PA = TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_HBP + 4374 - TEAM_BASERUN_CS) %>%
mutate(AB = TEAM_BATTING_H + 4374 - TEAM_BASERUN_CS)
mbTest <- mbTest %>%
mutate(OBP = (TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_HBP)/ PA) %>%
mutate(SLG = ((TEAM_BATTING_1B*1) +(2*TEAM_BATTING_2B)+(TEAM_BATTING_3B*3)+(TEAM_BATTING_HR*4))/AB) %>%
mutate(TB = ((TEAM_BATTING_1B*1) +(2*TEAM_BATTING_2B)+(TEAM_BATTING_3B*3)+(TEAM_BATTING_HR*4))) %>%
mutate(AVG = TEAM_BATTING_H/AB) %>%
mutate(OPS = OBP + SLG) %>%
mutate(RD = TEAM_BATTING_H + TEAM_BATTING_HR - TEAM_PITCHING_H - TEAM_PITCHING_HR) %>%
mutate(NI = TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_FIELDING_E)
Model 4 uses sabermetric statistics On base Percentage, On base Plus Slugging, Batting Average and Run Differential to produce a model with five statistically significant variables, high t stats and zero p-values. The coefficients are all intuitive (all positive) and the predicted results are reasonable. R-Squared of 21.6%
##
## Call:
## lm(formula = TARGET_WINS ~ OBP + OPS + AVG + RD, data = mbData2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.174 -8.878 0.249 9.216 67.940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.817e+01 4.975e+00 -5.662 1.69e-08 ***
## OBP 1.630e+02 3.417e+01 4.771 1.95e-06 ***
## OPS 3.931e+01 1.128e+01 3.485 0.000502 ***
## AVG 1.199e+02 2.795e+01 4.290 1.86e-05 ***
## RD 1.406e-03 2.643e-04 5.319 1.15e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.45 on 2257 degrees of freedom
## Multiple R-squared: 0.2163, Adjusted R-squared: 0.2149
## F-statistic: 155.7 on 4 and 2257 DF, p-value: < 2.2e-16
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.07 76.92 81.43 80.43 85.29 104.32
Model 7 is a derivative of model 4. It eliminates On base Percentage and Batting Average and replaces them with some fielding and pitching metrics (Hits Allowed, Walks Allowed and Errors). The R-Squared of this model increased slightly to (22.7%), however, two of the coefficients are counter intuitive. One would expect Wins to go down with Hit allowed and Walks allowed increase.
##
## Call:
## lm(formula = TARGET_WINS ~ OPS + RD + TEAM_PITCHING_H + TEAM_FIELDING_E +
## TEAM_PITCHING_BB, data = mbData2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.795 -9.064 0.202 9.087 42.246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.168667 3.961676 0.295 0.768
## OPS 22.722267 11.059706 2.055 0.040 *
## RD 0.042823 0.004197 10.204 < 2e-16 ***
## TEAM_PITCHING_H 0.040885 0.004145 9.864 < 2e-16 ***
## TEAM_FIELDING_E -0.009496 0.002205 -4.306 1.73e-05 ***
## TEAM_PITCHING_BB 0.012742 0.002178 5.850 5.62e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.35 on 2256 degrees of freedom
## Multiple R-squared: 0.2278, Adjusted R-squared: 0.2261
## F-statistic: 133.1 on 5 and 2256 DF, p-value: < 2.2e-16
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.56 76.83 81.49 80.46 85.40 106.93
Model 8 is also a derivative of model 4. It eliminates On base Plus Slugging, On base Percentage and Batting Average and includes the same fielding a pitching metrics as Model 7 (Hits Allowed, Walks Allowed and Errors). The R-Squared of this model was almost identical to Model 7 at 22.6%. Model 8 suffered from the same issue as Model 7. Hits allowed and Walks allowed both have positive coefficients which I believe is counter intuitive.
##
## Call:
## lm(formula = TARGET_WINS ~ RD + TEAM_PITCHING_H + TEAM_FIELDING_E +
## TEAM_PITCHING_BB, data = mbData2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.786 -8.903 0.169 9.111 41.434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.701446 3.292833 1.731 0.0835 .
## RD 0.050192 0.002180 23.026 < 2e-16 ***
## TEAM_PITCHING_H 0.048185 0.002135 22.565 < 2e-16 ***
## TEAM_FIELDING_E -0.012065 0.001818 -6.636 4.03e-11 ***
## TEAM_PITCHING_BB 0.014524 0.001999 7.265 5.10e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.36 on 2257 degrees of freedom
## Multiple R-squared: 0.2263, Adjusted R-squared: 0.225
## F-statistic: 165.1 on 4 and 2257 DF, p-value: < 2.2e-16
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.05 77.01 81.45 80.50 85.39 107.78