Homework #1 Assignment Requirements
Author: James Mundy
Date: March 1, 2020
In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
The goal of the Moneyball assignment is to develop three or more regression models that predict team wins.
As an avid baseball fan of many years and having recently read Analyzing Baseball Data with R, I begin this assignment with a the following premise:
Baseball teams win games (Target_Wins) by scoring more runs than their opponent. Thus the number of wins earned over a season must be related to both the number of runs scored and allowed.
As a result I will seek to utilize variables that capture runs scored and allowed. Below I have listed four metrics utilized in baseball to capture run scoring and one metric that addresses runs allowed.
Batting Average (AVG) - One of the oldest and most universal tools to measure a hitter’s success at the plate, batting average is determined by dividing a player’s hits by his total at-bats.
On base Percentage (OBP) - OBP refers to how frequently a batter reaches base per plate appearance. Times on base include hits, walks and hit-by-pitches, but do not include errors, times reached on a fielder’s choice or a dropped third strike.
Slugging Percentage (SLG) - Slugging percentage represents the total number of bases a player records per at-bat. Unlike on-base percentage, slugging percentage deals only with hits and does not include walks and hit-by-pitches in its equation.
On-base Plus Slugging (OPS) - OPS adds on-base percentage and slugging percentage to get one number that unites the two. It’s meant to combine how well a hitter can reach base, with how well he can hit for average and for power.
Run Differential (RD) - In baseball, run differential is a cumulative team statistic that combines offensive and defensive scoring. Run differential is calculated by subtracting runs allowed from runs scored.
I will use the Moneyball data set to calculate these five statistics (or similar alternatives) and use these statistics as the basis of my regression models.
I will utilize the SKIM package and ggplot charts to perform my exploratory data analysis. The objective of this is analysis is to develop a better understanding of the data to include its shape, central tendencies, completeness (missing data) and its correlation to our response variable Target_Wins. Here are some key takeaways from my data exploration:
Below you will find the Skim function results, density charts, scatter plots and correlation tables and plots that support the take always above and provide additional detailed information.
mbData <- read.csv("https://raw.githubusercontent.com/ilyakats/CUNY-DATA621/master/hw1/moneyball-training-data.csv")
mbData <- mbData %>%
select(-INDEX)
mbData2 <- mbData
mbTest <- read.csv("https://raw.githubusercontent.com/ilyakats/CUNY-DATA621/master/hw1/moneyball-evaluation-data.csv")
mbTest <- mbTest %>%
select(-INDEX)
The skim function is an alternative to the summary function. It displays most of the numerical attributes from summary, but it also displays missing values, more quantile information and an inline histogram for each variable.
Name | mbData |
Number of rows | 2276 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
numeric | 16 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
TARGET_WINS | 0 | 1.00 | 80.79 | 15.75 | 0 | 71.0 | 82.0 | 92.00 | 146 | ▁▁▇▅▁ |
TEAM_BATTING_H | 0 | 1.00 | 1469.27 | 144.59 | 891 | 1383.0 | 1454.0 | 1537.25 | 2554 | ▁▇▂▁▁ |
TEAM_BATTING_2B | 0 | 1.00 | 241.25 | 46.80 | 69 | 208.0 | 238.0 | 273.00 | 458 | ▁▆▇▂▁ |
TEAM_BATTING_3B | 0 | 1.00 | 55.25 | 27.94 | 0 | 34.0 | 47.0 | 72.00 | 223 | ▇▇▂▁▁ |
TEAM_BATTING_HR | 0 | 1.00 | 99.61 | 60.55 | 0 | 42.0 | 102.0 | 147.00 | 264 | ▇▆▇▅▁ |
TEAM_BATTING_BB | 0 | 1.00 | 501.56 | 122.67 | 0 | 451.0 | 512.0 | 580.00 | 878 | ▁▁▇▇▁ |
TEAM_BATTING_SO | 102 | 0.96 | 735.61 | 248.53 | 0 | 548.0 | 750.0 | 930.00 | 1399 | ▁▆▇▇▁ |
TEAM_BASERUN_SB | 131 | 0.94 | 124.76 | 87.79 | 0 | 66.0 | 101.0 | 156.00 | 697 | ▇▃▁▁▁ |
TEAM_BASERUN_CS | 772 | 0.66 | 52.80 | 22.96 | 0 | 38.0 | 49.0 | 62.00 | 201 | ▃▇▁▁▁ |
TEAM_BATTING_HBP | 2085 | 0.08 | 59.36 | 12.97 | 29 | 50.5 | 58.0 | 67.00 | 95 | ▂▇▇▅▁ |
TEAM_PITCHING_H | 0 | 1.00 | 1779.21 | 1406.84 | 1137 | 1419.0 | 1518.0 | 1682.50 | 30132 | ▇▁▁▁▁ |
TEAM_PITCHING_HR | 0 | 1.00 | 105.70 | 61.30 | 0 | 50.0 | 107.0 | 150.00 | 343 | ▇▇▆▁▁ |
TEAM_PITCHING_BB | 0 | 1.00 | 553.01 | 166.36 | 0 | 476.0 | 536.5 | 611.00 | 3645 | ▇▁▁▁▁ |
TEAM_PITCHING_SO | 102 | 0.96 | 817.73 | 553.09 | 0 | 615.0 | 813.5 | 968.00 | 19278 | ▇▁▁▁▁ |
TEAM_FIELDING_E | 0 | 1.00 | 246.48 | 227.77 | 65 | 127.0 | 159.0 | 249.25 | 1898 | ▇▁▁▁▁ |
TEAM_FIELDING_DP | 286 | 0.87 | 146.39 | 26.23 | 52 | 131.0 | 149.0 | 164.00 | 228 | ▁▂▇▆▁ |
temp <- mbData %>%
cor(., use = "complete.obs") #%>%
temp[lower.tri(temp, diag=TRUE)] <- ""
temp <- temp %>%
as.data.frame() %>%
rownames_to_column() %>%
gather(Variable, Correlation, -rowname) %>%
filter(Variable != rowname) %>%
filter(Correlation != "") %>%
mutate(Correlation = as.numeric(Correlation)) %>%
rename(` Variable` = rowname) %>%
arrange(desc(abs(Correlation)))
## Warning: attributes are not identical across measure variables;
## they will be dropped
Variable | Variable | Correlation |
---|---|---|
TARGET_WINS | TEAM_PITCHING_H | 0.4712343 |
TARGET_WINS | TEAM_BATTING_H | 0.4699467 |
TARGET_WINS | TEAM_BATTING_BB | 0.4686879 |
TARGET_WINS | TEAM_PITCHING_BB | 0.4683988 |
TARGET_WINS | TEAM_PITCHING_HR | 0.4224668 |
TARGET_WINS | TEAM_BATTING_HR | 0.4224168 |
TARGET_WINS | TEAM_FIELDING_E | -0.3866880 |
TARGET_WINS | TEAM_BATTING_2B | 0.3129840 |
TARGET_WINS | TEAM_PITCHING_SO | -0.2293648 |
TARGET_WINS | TEAM_BATTING_SO | -0.2288927 |
TARGET_WINS | TEAM_FIELDING_DP | -0.1958660 |
TARGET_WINS | TEAM_BASERUN_CS | -0.1787560 |
TARGET_WINS | TEAM_BATTING_3B | -0.1243459 |
TARGET_WINS | TEAM_BATTING_HBP | 0.0735042 |
TARGET_WINS | TEAM_BASERUN_SB | 0.0148364 |
Now with our enhanced understanding of the data set, I will begin to prepare the data for model building. My data preparation will include addressing missing data, outliers and feature engineering or creating new variables.
Six variables in the Moneyball data set are missing data. The variables are set forth below along with each variable’s complete_rate as provided by the Skim function.
Two of these variables (TEAM_BATTING_HBP and TEAM_BASERUN_CS) are inputs to our new variables. Therefore, in each case, NA values are replaced with the major league baseball 2019 and 2018 average for Hit By Pitch (HBP) and Caught Stealing (CS). The values were obtained from Team rankings.com and are 65 and 30, respectively.
The Target Wins box plot below introduces the possibility of outliers in our response variable. I will utilize MLB historical win/loss information to eliminate these outliers.
Use dplyr to filter out the outlier rows.
mbData2 %>%
gather(variable, value, TARGET_WINS) %>%
ggplot(., aes(x= variable, y=value)) +
geom_boxplot() +
labs(x = element_blank(), y = element_blank())
Now, with the missing data and outlier challenges addressed, I will direct my attention to the creation of our five new variables and the selection of variables for the regression model(s).
We will supplement the the Moneyball data set with five new variables: On base Percentage (OBP), Slugging Percentage (SLG), On Base Plus Slugging (OPS) and Batting Average (AVG) and Run Differential (RD). Plate Appearances and At Bats are two additional variables that are inputs in a new variable calculations. The definitions for each of these variables follow:
On Base Percentage = (H + BB + HBP) / (AB + BB + HBP + SF)
Slugging Percentage = (1B + 22B + 33B + 4*HR) / AB
Batting Average = H / AB
Plate Appearance(PA) = H + BB + K + HBP + SH + SF + DI + E+ DFO
Total Bases (TB) = (1B + 22B + 33B + 4*HR)
Where: H = Hit, BB = Walk, K = Strikeout, HBP = Hit by Pitch, SH = Sacrifice Hit, SF = Sacrifice Fly, DI = Defensive Interference, E = Fielding error,DFO = Defensive Fielding Out.
The provided data set does not include all the variable necessary to calculate these metrics. Therefore, I will create and utilize the following alternative definitions:
On Base Percentage = (H + BB + HBP) / PA
Slugging Percentage = (1B + 22B + 33B + 4*HR) / (PA-BB-HBP)
Batting Average = H / (PA-BB-HBP)
Plate Appearance = H + BB + HBP + (162 * 9 * 3) - cs - DP
Run Differential = (Team_Batting_H + Team_Batting_HR) - (Team_Pitching_H + Team_Pitching_HR)
mbData2 <- mbData2 %>%
mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B -TEAM_BATTING_3B -TEAM_BATTING_HR) %>%
mutate(PA = TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_HBP + 4374 - TEAM_BASERUN_CS) %>%
mutate(AB = TEAM_BATTING_H + 4374 - TEAM_BASERUN_CS)
mbData2 <- mbData2 %>%
mutate(OBP = (TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_HBP)/ PA) %>%
mutate(SLG = ((TEAM_BATTING_1B*1) +(2*TEAM_BATTING_2B)+(TEAM_BATTING_3B*3)+(TEAM_BATTING_HR*4))/AB) %>%
mutate(TB = ((TEAM_BATTING_1B*1) +(2*TEAM_BATTING_2B)+(TEAM_BATTING_3B*3)+(TEAM_BATTING_HR*4))) %>%
mutate(AVG = TEAM_BATTING_H/AB) %>%
mutate(OPS = OBP + SLG) %>%
mutate(RD = TEAM_BATTING_H + TEAM_BATTING_HR - TEAM_PITCHING_H - TEAM_PITCHING_HR) %>%
mutate(NI = TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_FIELDING_E)
I’ve replace missing data, removed outliers, created new variables. I’ve also selected eight explanatory variables for use in my regression models. Below I have recreated some of the exploratory data visuals for the revised data set.
Note - After some preliminary model building I decided to expand my variable set to increase my chances of producing a good model.
The Skim function reveals a cleaner and smaller revised data set
Name | mbData2 |
Number of rows | 2262 |
Number of columns | 26 |
_______________________ | |
Column type frequency: | |
numeric | 26 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
TARGET_WINS | 0 | 1.00 | 80.70 | 15.18 | 21.00 | 71.00 | 82.00 | 91.00 | 120.00 | ▁▂▇▇▂ |
TEAM_BATTING_H | 0 | 1.00 | 1468.03 | 139.35 | 992.00 | 1383.00 | 1454.00 | 1536.00 | 2554.00 | ▁▇▁▁▁ |
TEAM_BATTING_2B | 0 | 1.00 | 241.17 | 46.34 | 69.00 | 208.00 | 238.00 | 272.75 | 458.00 | ▁▆▇▂▁ |
TEAM_BATTING_3B | 0 | 1.00 | 55.10 | 27.80 | 0.00 | 34.00 | 47.00 | 72.00 | 223.00 | ▇▇▂▁▁ |
TEAM_BATTING_HR | 0 | 1.00 | 100.08 | 60.41 | 0.00 | 42.00 | 102.50 | 147.00 | 264.00 | ▇▆▇▅▁ |
TEAM_BATTING_BB | 0 | 1.00 | 503.37 | 120.17 | 29.00 | 452.00 | 513.00 | 580.00 | 878.00 | ▁▁▇▆▁ |
TEAM_BATTING_SO | 101 | 0.96 | 738.37 | 245.84 | 0.00 | 551.00 | 752.00 | 932.00 | 1399.00 | ▁▆▇▇▁ |
TEAM_BASERUN_SB | 123 | 0.95 | 124.72 | 87.63 |