Data Analysis of Foul Ball Impact on Fans
Data scientists often need to take data in one form and transform it to simplify downstream analysis. This is accomplished through tidying and transformation operations. Although this task was completed in R, it could also have been tackled in other languages such as Python.
Task - choose one of the provided datasets on fivethirtyeight.com
Selected - We Watched 906 Foul Balls To Find Out Where The Most Dangerous Ones Land, by Annette Choi. Filed under MLB. Published Jul. 15, 2019.
The dataset covers 906 foul balls collected from the most foul-heavy day at each of the 10 stadiums that produced the most foul balls, as of June 5, 2019. The primary focus of the dataset was to observe where the most dangerous ones landed. More specifically, the subset we create will indicate which type of hit was seen most often in specific zones of the stadium.
# Load the tidyverse; it already attaches ggplot2, so the separate library(ggplot2) call further down is harmless but optional
library(tidyverse)
## -- Attaching packages --------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
Next we read the original data file directly from a CSV, which in this case is stored in my GitHub DATA607 repository.
foulball_dataframe <- read.csv('https://raw.githubusercontent.com/johnm1990/DATA607/master/foulballs_data.csv')
# Use dim() to get the dimensions (rows, columns) of the data frame
dim(foulball_dataframe)
## [1] 906 7
After loading the data we see 906 rows (entries) and 7 columns (variables). Next we will create a subset of the columns in our selected dataset and rename those columns to more meaningful terms in place of the non-intuitive abbreviations.
# Use colnames() to retrieve the column names of our data frame
colnames(foulball_dataframe)
## [1] "ï..matchup" "game_date" "type_of_hit" "exit_velocity"
## [5] "predicted_zone" "camera_zone" "used_zone"
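The first column name in the output above, "ï..matchup", looks like residue from a UTF-8 byte order mark (BOM) at the start of the CSV rather than part of the real name. Below is a minimal sketch of one way to avoid it, assuming the file does begin with a BOM: pass fileEncoding = "UTF-8-BOM" to read.csv(). The tiny file written here is a hypothetical stand-in so the example runs without network access.

```r
# Write a tiny stand-in CSV that starts with a UTF-8 byte order mark (EF BB BF).
# The contents are hypothetical; only the leading BOM matters for this demo.
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
body <- charToRaw("matchup,game_date\nTeam A vs Team B,2019-05-01\n")
writeBin(c(bom, body), tmp)

# Reading with fileEncoding = "UTF-8-BOM" drops the mark before the header is parsed.
clean <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
colnames(clean)

# Remove the temporary file.
unlink(tmp)
```

The matchup column is not used in the rest of this analysis, so the garbled name does not affect the results that follow.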
From the main dataset we can build a smaller one containing only a subset of the columns. For the purpose of this assignment we will use Type of Hit, Exit Velocity, and Used Zone. “Type of hit” describes how the ball came off the bat (for example Fly, Ground, or Line). “Exit Velocity” is the speed of the baseball as it comes off the bat. Lastly, “Used Zone” refers to the seating location in the stadium. For the sake of brevity: Zone 1 is the area referred to as behind the dugout, Zones 6 and 7 are at the opposite end of the stadium near the foul poles, and Zones 2 through 5 lie in between.
foulball_subset_dataframe <- select(foulball_dataframe, 'type_of_hit','exit_velocity','used_zone')
head(foulball_subset_dataframe,15)
## type_of_hit exit_velocity used_zone
## 1 Ground NA 1
## 2 Fly NA 4
## 3 Fly 56.9 4
## 4 Fly 78.8 1
## 5 Fly NA 2
## 6 Ground NA 1
## 7 Fly 74.8 2
## 8 Ground NA 1
## 9 Fly 70.7 4
## 10 Fly 73.4 4
## 11 Fly 76.0 5
## 12 Line NA 1
## 13 Fly 72.1 2
## 14 Fly NA 4
## 15 Line 95.9 5
We can now use rename() to give our columns shorter labels that should be easier to understand.
foulball_info_dataframe <- rename(foulball_subset_dataframe, Type = type_of_hit, Velocity = exit_velocity, Zone = used_zone)
# Use colnames() to confirm the renamed column labels
colnames(foulball_info_dataframe)
## [1] "Type" "Velocity" "Zone"
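The subset and rename steps can also be combined, because select() accepts new_name = old_name pairs. A minimal sketch, assuming dplyr is available; the toy data frame holds the first three rows printed by head() above plus a placeholder matchup column, and the names toy and toy_info are introduced only for illustration.

```r
library(dplyr)

# Toy stand-in for foulball_dataframe: the first three rows shown in the
# head() output above, plus a placeholder matchup column (hypothetical values).
toy <- data.frame(
  matchup       = c("game 1", "game 1", "game 1"),
  type_of_hit   = c("Ground", "Fly", "Fly"),
  exit_velocity = c(NA, NA, 56.9),
  used_zone     = c(1L, 4L, 4L)
)

# select() keeps only the listed columns and renames them in the same call.
toy_info <- toy %>%
  select(Type = type_of_hit, Velocity = exit_velocity, Zone = used_zone)

colnames(toy_info)
```

Keeping the two steps separate, as in the main analysis, is equally valid and makes each intermediate result easy to inspect.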
# Use head() to show the first 15 entries with the newly renamed columns
head(foulball_info_dataframe,15)
## Type Velocity Zone
## 1 Ground NA 1
## 2 Fly NA 4
## 3 Fly 56.9 4
## 4 Fly 78.8 1
## 5 Fly NA 2
## 6 Ground NA 1
## 7 Fly 74.8 2
## 8 Ground NA 1
## 9 Fly 70.7 4
## 10 Fly 73.4 4
## 11 Fly 76.0 5
## 12 Line NA 1
## 13 Fly 72.1 2
## 14 Fly NA 4
## 15 Line 95.9 5
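Several Velocity values in the preview are NA, so any summary of exit velocity has to decide how to treat them. The sketch below rebuilds only the 15 rows printed above (not the full 906-row dataset) and shows one option: count the missing values, then average the recorded ones per hit type with na.rm = TRUE. The names preview and velocity_by_type are illustrative.

```r
library(dplyr)

# The 15 rows printed by head() above, re-entered by hand.
preview <- data.frame(
  Type = c("Ground", "Fly", "Fly", "Fly", "Fly", "Ground", "Fly", "Ground",
           "Fly", "Fly", "Fly", "Line", "Fly", "Fly", "Line"),
  Velocity = c(NA, NA, 56.9, 78.8, NA, NA, 74.8, NA,
               70.7, 73.4, 76.0, NA, 72.1, NA, 95.9),
  Zone = c(1L, 4L, 4L, 1L, 2L, 1L, 2L, 1L, 4L, 4L, 5L, 1L, 2L, 4L, 5L)
)

# How many of the previewed rows have no recorded exit velocity?
sum(is.na(preview$Velocity))   # 7 of the 15 rows

# Per-type summary that ignores missing values. A type with no recorded
# velocities at all (Ground, in this preview) yields NaN for the mean.
velocity_by_type <- preview %>%
  group_by(Type) %>%
  summarize(
    n_recorded    = sum(!is.na(Velocity)),
    mean_velocity = mean(Velocity, na.rm = TRUE),
    .groups = "drop"
  )
velocity_by_type
```

Because so many velocities are missing, the counts in the rest of this analysis rely on Type and Zone rather than on Velocity.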
We use group_by() together with summarize() to count how many foul balls there were for each type of hit.
foulball_info_dataframe %>%
group_by(Type) %>%
summarize(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
## Type count
## <chr> <int>
## 1 Batter hits self 17
## 2 Fly 522
## 3 Ground 226
## 4 Line 87
## 5 Pop Up 54
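dplyr also provides count(), a shorthand for the group_by() plus summarize(n()) pattern used above that returns an ungrouped result without the message seen in the output. A minimal sketch on the 15 preview rows printed earlier, re-entered by hand, so the totals here describe the preview only and not the full dataset:

```r
library(dplyr)

# Hit types of the 15 rows shown by head() earlier in the document.
preview <- data.frame(
  Type = c("Ground", "Fly", "Fly", "Fly", "Fly", "Ground", "Fly", "Ground",
           "Fly", "Fly", "Fly", "Line", "Fly", "Fly", "Line")
)

# Long form, with .groups = "drop" to state the ungrouping explicitly.
long_form <- preview %>%
  group_by(Type) %>%
  summarize(count = n(), .groups = "drop")

# Shorthand: count() groups, tallies, and ungroups in one step.
short_form <- preview %>%
  count(Type, name = "count")

# Both approaches give the same per-type totals.
all.equal(long_form$count, short_form$count)
```

Either form works for the summaries in this report; count() simply saves a line when a plain tally is all that is needed.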
Similarly, we use the same functions to count how many foul balls landed in each zone.
foulball_info_dataframe %>%
group_by(Zone) %>%
summarize(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 2
## Zone count
## <int> <int>
## 1 1 278
## 2 2 96
## 3 3 80
## 4 4 215
## 5 5 226
## 6 6 6
## 7 7 5
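The per-zone totals are easier to compare as shares of the whole and as a chart. The following is a rough sketch, with the seven totals copied by hand from the table above; the names zone_counts, share, and zone_plot are introduced here only for illustration, and the plot styling is an arbitrary choice.

```r
library(ggplot2)

# Zone totals copied from the summary table above.
zone_counts <- data.frame(
  Zone  = 1:7,
  count = c(278, 96, 80, 215, 226, 6, 5)
)

# Share of all recorded foul balls that landed in each zone, in percent.
zone_counts$share <- round(100 * zone_counts$count / sum(zone_counts$count), 1)
zone_counts

# Bar chart of the totals; factor(Zone) treats the zones as discrete categories.
zone_plot <- ggplot(zone_counts, aes(x = factor(Zone), y = count)) +
  geom_col() +
  labs(x = "Zone", y = "Number of foul balls")
zone_plot
```

On these totals, Zones 1, 4, and 5 together account for roughly 79% of the 906 foul balls.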
Lastly, we group by both columns to count how many foul balls of each hit type landed in each zone.
foulball_info_dataframe %>%
group_by(Type, Zone) %>%
summarize(count=n())
## `summarise()` regrouping output by 'Type' (override with `.groups` argument)
## # A tibble: 18 x 3
## # Groups: Type [5]
## Type Zone count
## <chr> <int> <int>
## 1 Batter hits self 1 17
## 2 Fly 1 80
## 3 Fly 2 58
## 4 Fly 3 56
## 5 Fly 4 165
## 6 Fly 5 152
## 7 Fly 6 6
## 8 Fly 7 5
## 9 Ground 1 73
## 10 Ground 2 38
## 11 Ground 3 23
## 12 Ground 4 40
## 13 Ground 5 52
## 14 Line 1 54
## 15 Line 3 1
## 16 Line 4 10
## 17 Line 5 22
## 18 Pop Up 1 54
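The grouped counts above can be checked more directly by reshaping them into a Type by Zone table. This sketch uses base R's xtabs() on the 18 rows copied by hand from the output; the names counts, type_by_zone, and top_type are illustrative.

```r
# Type-by-Zone counts copied from the grouped summary above.
counts <- data.frame(
  Type = c("Batter hits self", rep("Fly", 7), rep("Ground", 5),
           rep("Line", 4), "Pop Up"),
  Zone = c(1, 1:7, 1:5, 1, 3, 4, 5, 1),
  count = c(17, 80, 58, 56, 165, 152, 6, 5, 73, 38, 23, 40, 52,
            54, 1, 10, 22, 54)
)

# Cross-tabulate into a Type x Zone table; absent combinations become 0.
type_by_zone <- xtabs(count ~ Type + Zone, data = counts)
type_by_zone

# Most frequent hit type in each zone (which.max finds the largest count per column).
top_type <- rownames(type_by_zone)[apply(type_by_zone, 2, which.max)]
names(top_type) <- colnames(type_by_zone)
top_type

# Number of distinct hit types observed in each zone.
colSums(type_by_zone > 0)
```

On these counts, Fly is the most frequent type in all seven zones, and Zone 1 is the only zone in which all five hit types appear.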
In conclusion, based on the subset we’ve created for the purpose of this assignment, fly balls are the most common type of foul ball in every zone. Additionally, Zone 1, with 278 occurrences, has the highest count and is the only zone in which all five hit types appear. Zones 6 and 7 see the fewest foul balls (6 and 5 respectively, all of them fly balls). To extend the work on this dataset, it would be interesting to examine how many of the foul balls that reached fans did so because of gaps in the safety netting. MLB teams have reportedly been installing more netting around the field to protect fans, so it would also be interesting to build another dataset comparing foul ball impacts on fans before and after the nets were installed.