DATA607_Assignment1

Loading Data into a Data Frame

Data Analysis of Foul Ball Impact on fans

Often the objective of data scientists is to take data in one form and transform it for simpler downstream analysis. This is accomplished through tidying and transformation operations. Although this task was completed in R, it could also have been tackled in other languages such as Python.

Task - choose one of the provided datasets on fivethirtyeight.com

Selected - “We Watched 906 Foul Balls To Find Out Where The Most Dangerous Ones Land” by Annette Choi (filed under MLB, published Jul. 15, 2019)

INTRODUCTION

The dataset contains 906 foul balls collected from the most foul-heavy day at each of the 10 stadiums that produced the most foul balls, as of June 5, 2019. The primary focus of the dataset was to observe where the most dangerous ones landed. More specifically, the subset we create will show which types of hit were seen most often in specific zones of the stadium.

#Load the tidyverse and ggplot2 libraries
library(tidyverse)

## -- Attaching packages --------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)

Acquiring our Data

The read.csv() function lets us load the original data file directly from its URL. In this case the CSV file is stored in my GitHub DATA607 repository.

foulball_dataframe <- read.csv('https://raw.githubusercontent.com/johnm1990/DATA607/master/foulballs_data.csv')

#Use the dim() function to get the dimensions of the data frame:
dim(foulball_dataframe)

## [1] 906   7

After reading in the data, we see 906 rows (entries) and 7 columns (variables). Next, we will create a subset of the columns in the selected dataset. In addition, we will rename the columns with more meaningful terms and replace non-intuitive abbreviations.

#Using the colnames() function we can retrieve the column names of our data frame
colnames(foulball_dataframe)

## [1] "ï..matchup"     "game_date"      "type_of_hit"    "exit_velocity" 
## [5] "predicted_zone" "camera_zone"    "used_zone"
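
The “ï..” prefix on the first column name is the usual symptom of a UTF-8 byte-order mark (BOM) at the start of a CSV file being read as ordinary characters. A minimal sketch of one way to avoid it, assuming the file really does begin with a BOM, is to pass fileEncoding = "UTF-8-BOM" to read.csv(). The temporary file and its single data row below are hypothetical stand-ins so that the example runs without network access.

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (EF BB BF).
# The data row is a made-up placeholder, not taken from the real dataset.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("matchup,game_date\nplaceholder,placeholder\n"), con)
close(con)

# "UTF-8-BOM" tells R to drop the byte-order mark while reading
bom_safe <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
colnames(bom_safe)
```

The same argument could be added to the read.csv() call on the GitHub URL, which should leave the first column named simply matchup.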

Analyzing Data Collected from the 10 most foul-ball-heavy game days to form Subset

From the main dataset we can build a smaller data frame containing a subset of the columns. For the purpose of this assignment we will use Type of Hit, Exit Velocity and Used Zone. “Type of Hit” is the kind of batted ball (for example fly, ground or line). “Exit Velocity” is the speed of the baseball as it comes off the bat. Lastly, “Used Zone” refers to the area of the stadium seating where the ball landed. For brevity's sake, Zone 1 is the area referred to as behind the dugout, Zones 6 and 7 are at the opposite end of the stadium near the foul poles, and Zones 2 through 5 lie in between.

foulball_subset_dataframe <- select(foulball_dataframe, 'type_of_hit','exit_velocity','used_zone')
head(foulball_subset_dataframe,15)

##    type_of_hit exit_velocity used_zone
## 1       Ground            NA         1
## 2          Fly            NA         4
## 3          Fly          56.9         4
## 4          Fly          78.8         1
## 5          Fly            NA         2
## 6       Ground            NA         1
## 7          Fly          74.8         2
## 8       Ground            NA         1
## 9          Fly          70.7         4
## 10         Fly          73.4         4
## 11         Fly          76.0         5
## 12        Line            NA         1
## 13         Fly          72.1         2
## 14         Fly            NA         4
## 15        Line          95.9         5

We can now use rename() to give our columns labels that are easier to understand.

foulball_info_dataframe <- rename(foulball_subset_dataframe,Type='type_of_hit',Velocity='exit_velocity',Zone="used_zone")
#Use the colnames() function to confirm the renamed column labels
colnames(foulball_info_dataframe)

## [1] "Type"     "Velocity" "Zone"

#We use head() function to get an example of 15 entries using our newly renamed columns
head(foulball_info_dataframe,15)

##      Type Velocity Zone
## 1  Ground       NA    1
## 2     Fly       NA    4
## 3     Fly     56.9    4
## 4     Fly     78.8    1
## 5     Fly       NA    2
## 6  Ground       NA    1
## 7     Fly     74.8    2
## 8  Ground       NA    1
## 9     Fly     70.7    4
## 10    Fly     73.4    4
## 11    Fly     76.0    5
## 12   Line       NA    1
## 13    Fly     72.1    2
## 14    Fly       NA    4
## 15   Line     95.9    5
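
The select and rename steps above can also be combined, because dplyr's select() accepts new_name = old_name pairs. The following is a minimal sketch rather than the method used in this report; sample_raw is a hypothetical stand-in for foulball_dataframe built from the first four rows shown above, plus one placeholder column that gets dropped.

```r
library(dplyr)

# Hypothetical stand-in for foulball_dataframe (first four rows shown above).
# game_date holds placeholder text, not real values from the dataset.
sample_raw <- data.frame(
  game_date     = rep("placeholder", 4),
  type_of_hit   = c("Ground", "Fly", "Fly", "Fly"),
  exit_velocity = c(NA, NA, 56.9, 78.8),
  used_zone     = c(1L, 4L, 4L, 1L)
)

# Keep three columns and rename them in a single call
sample_info <- select(sample_raw,
                      Type     = type_of_hit,
                      Velocity = exit_velocity,
                      Zone     = used_zone)
colnames(sample_info)
```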

Exploratory Data Analysis

We use ‘group_by’ together with ‘summarize’ to count how many foul balls there were for each type of hit.

foulball_info_dataframe %>%
 group_by(Type) %>%
 summarize(count=n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 5 x 2
##   Type             count
##   <chr>            <int>
## 1 Batter hits self    17
## 2 Fly                522
## 3 Ground             226
## 4 Line                87
## 5 Pop Up              54
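
As a side note, dplyr's count() is a shorthand for the group_by() plus summarize(n()) pattern. A minimal sketch, assuming a small tibble named sample_hits (a hypothetical name) that holds only the Type values of the first 15 rows shown earlier:

```r
library(dplyr)
library(tibble)

# Type values of the first 15 rows printed earlier in this report
sample_hits <- tibble(
  Type = c("Ground", "Fly", "Fly", "Fly", "Fly", "Ground", "Fly", "Ground",
           "Fly", "Fly", "Fly", "Line", "Fly", "Fly", "Line")
)

# Long form used in this report
by_summarize <- sample_hits %>%
  group_by(Type) %>%
  summarize(count = n())

# Shorthand: count() groups, tallies and ungroups in one call
by_count <- sample_hits %>%
  count(Type, name = "count")

by_count
```

Both objects hold the same three rows, so the choice between them is mostly a matter of readability.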

Similarly, we use the same functions to count how many foul balls landed in each Zone.

foulball_info_dataframe %>%
 group_by(Zone) %>%
 summarize(count=n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 7 x 2
##    Zone count
##   <int> <int>
## 1     1   278
## 2     2    96
## 3     3    80
## 4     4   215
## 5     5   226
## 6     6     6
## 7     7     5

Lastly, we use the same functions to get a combined count of foul balls by hit type within each Zone.

foulball_info_dataframe %>%
 group_by(Type, Zone) %>%
 summarize(count=n())

## `summarise()` regrouping output by 'Type' (override with `.groups` argument)

## # A tibble: 18 x 3
## # Groups:   Type [5]
##    Type              Zone count
##    <chr>            <int> <int>
##  1 Batter hits self     1    17
##  2 Fly                  1    80
##  3 Fly                  2    58
##  4 Fly                  3    56
##  5 Fly                  4   165
##  6 Fly                  5   152
##  7 Fly                  6     6
##  8 Fly                  7     5
##  9 Ground               1    73
## 10 Ground               2    38
## 11 Ground               3    23
## 12 Ground               4    40
## 13 Ground               5    52
## 14 Line                 1    54
## 15 Line                 3     1
## 16 Line                 4    10
## 17 Line                 5    22
## 18 Pop Up               1    54

Analysis Graphics
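
One way to visualize the per-zone counts is a simple bar chart. The following is a minimal sketch with ggplot2, assuming the zone totals from the summary above are re-entered by hand as zone_counts (a hypothetical name); the axis labels and title are illustrative choices.

```r
library(ggplot2)

# Per-zone totals copied from the Zone summary above
zone_counts <- data.frame(
  Zone  = 1:7,
  count = c(278, 96, 80, 215, 226, 6, 5)
)

# Treat Zone as a category so each zone gets its own bar
zone_plot <- ggplot(zone_counts, aes(x = factor(Zone), y = count)) +
  geom_col() +
  labs(x = "Zone", y = "Number of foul balls",
       title = "Foul balls by stadium zone")

print(zone_plot)
```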

Conclusion

In conclusion, based on the subset we’ve created for this assignment, fly balls are the most common type of foul ball in every zone. Additionally, Zone 1, with 278 occurrences, sees the greatest variety of foul balls, while Zones 6 and 7 see the fewest foul balls. To extend the work on this dataset, it would be interesting to examine how many of the foul balls that struck fans did so because of gaps in the safety netting. MLB is said to have been gradually installing more netting around the field to protect fans, so it would also be interesting to build another dataset comparing foul ball impacts on fans before and after the safety nets were installed.