PITCHf/x, created and maintained by Sportvision, is a system that tracks the speeds and trajectories of pitched baseballs. The system made its debut in the 2006 MLB playoffs and is installed in every MLB stadium (wiki).
We will make pitch location charts and visualize them in several ways.
We will use one game: Pittsburgh Pirates vs. Washington Nationals, June 20th, 2015. There is a YouTube video of the game; watch it after working through the visualizations.
Luckily, the dataset is publicly available (here’s the link). If we dig deeper, the data is stored as XML files. R has a very nice package that scrapes the data for us and sets up data frames we can use.
So, let’s get started with the coding.
The package is “pitchRx”. Install it and load it. To scrape the data, call scrape() with the game id.
Then list the datasets available in bdata. ‘atbat’ and ‘pitch’ are the ones we are interested in.
install.packages('pitchRx')
library(pitchRx)
#store in a variable
bdata <- scrape(game.ids = "gid_2015_06_20_pitmlb_wasmlb_1")
## http://gd2.mlb.com/components/game/mlb/year_2015/month_06/day_20/gid_2015_06_20_pitmlb_wasmlb_1/inning/inning_all.xml
#get the datasets
names(bdata)
## [1] "atbat" "action" "pitch" "po" "runner"
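The game id encodes the date and the two teams, and the URL printed by scrape() above seems to follow directly from it. As a minimal sketch, assuming that pattern holds (it is inferred from the output above, not from any documentation), a hypothetical helper gid_to_url() could rebuild the URL:

```r
# Hypothetical helper (not part of pitchRx): rebuild the inning_all.xml URL
# from a game id, assuming the URL pattern seen in the scrape() output above.
gid_to_url <- function(gid) {
  # "gid_2015_06_20_pitmlb_wasmlb_1" -> "gid" "2015" "06" "20" "pitmlb" "wasmlb" "1"
  parts <- strsplit(gid, "_")[[1]]
  sprintf(
    "http://gd2.mlb.com/components/game/mlb/year_%s/month_%s/day_%s/%s/inning/inning_all.xml",
    parts[2], parts[3], parts[4], gid
  )
}

url <- gid_to_url("gid_2015_06_20_pitmlb_wasmlb_1")
print(url)
```

This only builds the string; it does not check that the file is still being served.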
Explore the dataset:
#Atbat Dataset
head(bdata$atbat)
## pitcher batter num b s o start_tfs start_tfs_zulu stand b_height
## 1 453286 543281 1 0 0 1 200534 2015-06-20T20:05:34Z R 5-8
## 2 453286 543281 20 1 3 1 204606 2015-06-20T20:46:06Z R 5-8
## 3 453286 543281 60 0 2 3 222530 2015-06-20T22:25:30Z R 5-8
## 4 453286 543281 44 1 2 1 215033 2015-06-20T21:50:33Z R 5-8
## 5 453286 516782 2 1 1 2 200615 2015-06-20T20:06:15Z R 6-1
## 6 453286 516782 45 0 3 2 215201 2015-06-20T21:52:01Z R 6-1
## p_throws atbat_des
## 1 R Josh Harrison pops out to second baseman Danny Espinosa.
## 2 R Josh Harrison strikes out swinging.
## 3 R Josh Harrison flies out to left fielder Michael Taylor.
## 4 R Josh Harrison flies out to left fielder Michael Taylor.
## 5 R Starling Marte flies out to right fielder Bryce Harper.
## 6 R Starling Marte strikes out on a foul tip.
## atbat_des_es
## 1 Josh Harrison batea elevadito de out a segunda base Danny Espinosa.
## 2 Josh Harrison se poncha tirándole.
## 3 Josh Harrison batea elevado de out a jardinero izquierdo Michael Taylor.
## 4 Josh Harrison batea elevado de out a jardinero izquierdo Michael Taylor.
## 5 Starling Marte batea elevado de out a jardinero derecho Bryce Harper.
## 6 Starling Marte se poncha con foul tip.
## event_num event event_es play_guid
## 1 5 Pop Out Elevado de Out 19c450e5-07b6-47b4-8a54-d5cebd2cfb7c
## 2 137 Strikeout Ponche 8dbee2e9-681e-4292-b399-373f032730e3
## 3 466 Flyout Elevado de Out cbc63121-510c-4bf1-980d-c92b5920d211
## 4 342 Flyout Elevado de Out 69262bbe-5f9b-4356-8e16-30103f3aa629
## 5 11 Flyout Elevado de Out ba0e609f-93da-4931-b2a6-1ad283528fb9
## 6 348 Strikeout Ponche 0d1a059e-ee8b-412b-a6ad-989c3032aefb
## home_team_runs away_team_runs
## 1 0 0
## 2 0 0
## 3 6 0
## 4 5 0
## 5 0 0
## 6 5 0
## url
## 1 http://gd2.mlb.com/components/game/mlb/year_2015/month_06/day_20/gid_2015_06_20_pitmlb_wasmlb_1/inning/inning_all.xml
## 2 http://gd2.mlb.com/components/game/mlb/year_2015/month_06/day_20/gid_2015_06_20_pitmlb_wasmlb_1/inning/inning_all.xml
## 3 http://gd2.mlb.com/components/game/mlb/year_2015/month_06/day_20/gid_2015_06_20_pitmlb_wasmlb_1/inning/inning_all.xml
## 4 http://gd2.mlb.com/components/game/mlb/year_2015/month_06/day_20/gid_2015_06_20_pitmlb_wasmlb_1/inning/inning_all.xml
## 5 http://gd2.mlb.com/components/game/mlb/year_2015/month_06/day_20/gid_2015_06_20_pitmlb_wasmlb_1/inning/inning_all.xml
## 6 http://gd2.mlb.com/components/game/mlb/year_2015/month_06/day_20/gid_2015_06_20_pitmlb_wasmlb_1/inning/inning_all.xml
## inning_side inning next_ score event2 event2_es batter_name
## 1 top 1 Y <NA> <NA> <NA> Josh Harrison
## 2 top 4 Y <NA> <NA> <NA> Josh Harrison
## 3 top 9 N <NA> <NA> <NA> Josh Harrison
## 4 top 7 Y <NA> <NA> <NA> Josh Harrison
## 5 top 1 Y <NA> <NA> <NA> Starling Marte
## 6 top 7 Y <NA> <NA> <NA> Starling Marte
## pitcher_name gameday_link date
## 1 Max Scherzer gid_2015_06_20_pitmlb_wasmlb_1 2015_06_20
## 2 Max Scherzer gid_2015_06_20_pitmlb_wasmlb_1 2015_06_20
## 3 Max Scherzer gid_2015_06_20_pitmlb_wasmlb_1 2015_06_20
## 4 Max Scherzer gid_2015_06_20_pitmlb_wasmlb_1 2015_06_20
## 5 Max Scherzer gid_2015_06_20_pitmlb_wasmlb_1 2015_06_20
## 6 Max Scherzer gid_2015_06_20_pitmlb_wasmlb_1 2015_06_20
#pitch Dataset
head(bdata$pitch)
## des des_es id type tfs tfs_zulu
## 1 In play, out(s) En juego, out(s) 3 X 200552 2015-06-20T20:05:52Z
## 2 Ball Bola mala 7 B 200625 2015-06-20T20:06:25Z
## 3 Called Strike Strike cantado 8 S 200636 2015-06-20T20:06:36Z
## 4 In play, out(s) En juego, out(s) 9 X 200655 2015-06-20T20:06:55Z
## 5 Called Strike Strike cantado 13 S 200735 2015-06-20T20:07:35Z
## 6 In play, out(s) En juego, out(s) 14 X 200753 2015-06-20T20:07:53Z
## x y event_num sv_id
## 1 131.94 171.12 3 150620_160711
## 2 163.92 175.44 7 150620_160749
## 3 77.85 189.69 8 150620_160801
## 4 80.29 162.99 9 150620_160815
## 5 107.70 169.36 13 150620_160900
## 6 87.46 184.27 14 150620_160915
## play_guid start_speed end_speed sz_top sz_bot
## 1 19c450e5-07b6-47b4-8a54-d5cebd2cfb7c 91.8 85.7 3.57 1.57
## 2 0bf4bee5-4ddc-44d9-8972-29f1f0140919 92.9 85.8 3.42 1.40
## 3 129e197d-4bbd-4bab-a19f-63eea6a53b8a 93.5 87.0 3.42 1.32
## 4 ba0e609f-93da-4931-b2a6-1ad283528fb9 94.3 87.4 3.47 1.53
## 5 ab9a787d-2846-4d2f-9637-4a29474a33b7 93.4 86.8 3.60 1.60
## 6 0bba014d-0989-46c6-9705-d25fc9f65a4f 93.8 87.1 3.50 1.60
## pfx_x pfx_z px pz x0 y0 z0 vx0 vy0 vz0 ax
## 1 -8.51 6.12 -0.392 2.506 -2.907 50 5.337 9.675 -134.282 -3.674 -15.799
## 2 -9.50 4.87 -1.231 2.346 -3.143 50 5.324 8.458 -135.870 -3.730 -17.836
## 3 -8.66 6.35 1.027 1.818 -2.758 50 5.373 13.304 -136.414 -5.942 -16.535
## 4 -6.14 7.85 0.963 2.807 -2.698 50 5.517 12.175 -137.735 -4.307 -11.912
## 5 -8.64 7.75 0.244 2.571 -2.909 50 5.351 11.584 -136.473 -4.335 -16.492
## 6 -7.77 6.67 0.775 2.019 -2.891 50 5.204 12.708 -136.949 -5.103 -14.934
## ay az break_y break_angle break_length pitch_type
## 1 24.451 -20.736 23.9 33.4 5.6 FF
## 2 28.675 -22.954 23.8 34.2 6.2 FF
## 3 26.177 -19.973 23.9 33.7 5.4 FF
## 4 27.864 -16.875 23.8 28.2 4.2 FF
## 5 26.550 -17.298 23.8 38.2 4.9 FF
## 6 27.016 -19.279 23.8 31.7 5.0 FF
## type_confidence zone nasty spin_dir spin_rate cc mt
## 1 0.903 4 49 234.096 2107.295
## 2 2.000 13 64 242.664 2143.887
## 3 2.000 14 57 233.576 2185.439
## 4 2.000 12 54 217.904 2042.330
## 5 2.000 6 44 227.948 2361.037
## 6 2.000 14 55 229.190 2090.203
## url
## 1 http://gd2.mlb.com/components/game/mlb/year_2015/month_06/day_20/gid_2015_06_20_pitmlb_wasmlb_1/inning/inning_all.xml
## 2 http://gd2.mlb.com/components/game/mlb/year_2015/month_06/day_20/gid_2015_06_20_pitmlb_wasmlb_1/inning/inning_all.xml
## 3 http://gd2.mlb.com/components/game/mlb/year_2015/month_06/day_20/gid_2015_06_20_pitmlb_wasmlb_1/inning/inning_all.xml
## 4 http://gd2.mlb.com/components/game/mlb/year_2015/month_06/day_20/gid_2015_06_20_pitmlb_wasmlb_1/inning/inning_all.xml
## 5 http://gd2.mlb.com/components/game/mlb/year_2015/month_06/day_20/gid_2015_06_20_pitmlb_wasmlb_1/inning/inning_all.xml
## 6 http://gd2.mlb.com/components/game/mlb/year_2015/month_06/day_20/gid_2015_06_20_pitmlb_wasmlb_1/inning/inning_all.xml
## inning_side inning next_ num on_1b on_2b on_3b
## 1 top 1 Y 1 NA NA NA
## 2 top 1 Y 2 NA NA NA
## 3 top 1 Y 2 NA NA NA
## 4 top 1 Y 2 NA NA NA
## 5 top 1 Y 3 NA NA NA
## 6 top 1 Y 3 NA NA NA
## gameday_link count
## 1 gid_2015_06_20_pitmlb_wasmlb_1 0-0
## 2 gid_2015_06_20_pitmlb_wasmlb_1 0-0
## 3 gid_2015_06_20_pitmlb_wasmlb_1 1-0
## 4 gid_2015_06_20_pitmlb_wasmlb_1 1-1
## 5 gid_2015_06_20_pitmlb_wasmlb_1 0-0
## 6 gid_2015_06_20_pitmlb_wasmlb_1 0-1
Why type bdata$ again and again? Let’s put both into new variables.
atbat <- bdata$atbat
pitch <- bdata$pitch
Why join the two datasets? Because for any particular pitch we want to know who the batter was, and that information lives in the atbat data frame.
Looking at the datasets above, we can see a column ‘num’ that is common to both. ‘num’ gives a unique number to each at-bat in the game.
So we will join the two data frames by ‘num’. To do this we bring in the dplyr package.
library(dplyr)
#store into new variable
nh <- inner_join(atbat, pitch, by="num")
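To see what the join does, here is a small sketch on toy data. The two data frames below are invented stand-ins for atbat and pitch, not the real game data. It shows that each pitch row picks up its batter, and that columns present in both inputs (here inning) come back with .x and .y suffixes.

```r
library(dplyr)

# Toy stand-ins for bdata$atbat and bdata$pitch (all values are invented)
atbat_toy <- data.frame(
  num = c(1, 2),
  inning = c(1, 1),
  batter_name = c("Batter A", "Batter B")
)
pitch_toy <- data.frame(
  num = c(1, 2, 2, 2),
  inning = c(1, 1, 1, 1),
  des = c("In play, out(s)", "Ball", "Called Strike", "In play, out(s)")
)

# One row per pitch; the at-bat columns are repeated for every pitch in it
joined <- inner_join(atbat_toy, pitch_toy, by = "num")
print(names(joined))
print(joined)
```

The atbat copy of a shared column gets the .x suffix, which is why names such as inning.x show up after the join.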
If you want to know what the columns mean, click here.
To get the column names we use:
names(atbat)
## [1] "pitcher" "batter" "num" "b"
## [5] "s" "o" "start_tfs" "start_tfs_zulu"
## [9] "stand" "b_height" "p_throws" "atbat_des"
## [13] "atbat_des_es" "event_num" "event" "event_es"
## [17] "play_guid" "home_team_runs" "away_team_runs" "url"
## [21] "inning_side" "inning" "next_" "score"
## [25] "event2" "event2_es" "batter_name" "pitcher_name"
## [29] "gameday_link" "date"
names(pitch)
## [1] "des" "des_es" "id"
## [4] "type" "tfs" "tfs_zulu"
## [7] "x" "y" "event_num"
## [10] "sv_id" "play_guid" "start_speed"
## [13] "end_speed" "sz_top" "sz_bot"
## [16] "pfx_x" "pfx_z" "px"
## [19] "pz" "x0" "y0"
## [22] "z0" "vx0" "vy0"
## [25] "vz0" "ax" "ay"
## [28] "az" "break_y" "break_angle"
## [31] "break_length" "pitch_type" "type_confidence"
## [34] "zone" "nasty" "spin_dir"
## [37] "spin_rate" "cc" "mt"
## [40] "url" "inning_side" "inning"
## [43] "next_" "num" "on_1b"
## [46] "on_2b" "on_3b" "gameday_link"
## [49] "count"
Now we will see which columns to pick.
We are only interested in the ‘top’ inning_side, so we keep only those rows first and then select the columns. Because inning_side and inning exist in both data frames, the join suffixes the atbat copies with .x.
nh <- inner_join(atbat, pitch, by="num")%>%
filter(inning_side.x=='top')%>%
select(num,start_tfs,stand,event,inning.x,batter_name,des,tfs,start_speed,px,pz,pitch_type)
#Take a look
head(nh)
## num start_tfs stand event inning.x batter_name des
## 1 1 200534 R Pop Out 1 Josh Harrison In play, out(s)
## 2 20 204606 R Strikeout 4 Josh Harrison Ball
## 3 20 204606 R Strikeout 4 Josh Harrison Called Strike
## 4 20 204606 R Strikeout 4 Josh Harrison Foul Tip
## 5 20 204606 R Strikeout 4 Josh Harrison Swinging Strike
## 6 60 222530 R Flyout 9 Josh Harrison Foul
## tfs start_speed px pz pitch_type
## 1 200552 91.8 -0.392 2.506 FF
## 2 204703 93.5 -0.998 3.833 FF
## 3 204715 85.5 0.327 2.947 SL
## 4 204730 85.3 -0.729 2.850 SL
## 5 204751 95.1 0.354 2.152 FF
## 6 222539 95.7 -0.132 2.680 FF
Now that we have built our data frame, it’s time to do some visualization.
First we’ll build the strike zone. The strike zone varies from batter to batter because it depends on the batter’s height, so we will use an average strike zone.
Our strike zone will be 1.9 ft × 1.9 ft, and we will look at it from the catcher’s view, behind the batter.
[Image: Strike Zone]
First, we make an x coordinate vector and a z coordinate vector and combine them into a data frame.
x <- c(-.95,.95,.95,-.95,-.95)
z <- c(1.6,1.6,3.5,3.5,1.6)
#store in a data frame (data_frame() is deprecated, so use data.frame())
sz <- data.frame(x,z)
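The same corner values can also be used to check whether a pitch crossed inside our averaged zone. The helper in_zone() below is hypothetical and only a rough sketch: it treats each pitch as a point and ignores the ball's radius and the batter-specific sz_top and sz_bot values.

```r
# Corners of the averaged strike zone, same values as above
x <- c(-.95, .95, .95, -.95, -.95)
z <- c(1.6, 1.6, 3.5, 3.5, 1.6)

# Hypothetical helper: TRUE when (px, pz) lies inside the rectangle
in_zone <- function(px, pz) {
  px >= min(x) & px <= max(x) & pz >= min(z) & pz <= max(z)
}

# px and pz of the first two pitches shown in head(bdata$pitch)
inside <- in_zone(c(-0.392, -1.231), c(2.506, 2.346))
print(inside)
```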
Now we will use ggplot to draw the strike zone. px and pz give each pitch’s location (coordinates), and we will use start_speed for the size of each point.
library(ggplot2)
ggplot()+
geom_path(data = sz, aes(x=x, y=z))+
coord_equal()+
xlab("feet from home plate")+
ylab("feet above the ground")+
geom_point(data = nh,aes(x=px,y=pz,size=start_speed))+
scale_size(range = c(0.01,3))
Now we will differentiate pitches by pitch_type: fastball, slider, changeup, etc.
ggplot()+
geom_path(data = sz, aes(x=x, y=z))+
coord_equal()+
xlab("feet from home plate")+
ylab("feet above the ground")+
geom_point(data = nh,aes(x=px,y=pz,size=start_speed, color=pitch_type))+
scale_size(range = c(0.01,3))
OK, this looks cool, but the legend shows pitch_type abbreviations instead of full descriptions.
So we will change them to full descriptions.
First we look up each pitch_type code and replace it with its full name.
#store pitch_type in new variable
pitch_desc <- nh$pitch_type
#search abbr. and change to full name
pitch_desc[which(pitch_desc=='FF')] <- "four-seam fastball"
pitch_desc[which(pitch_desc=='SL')] <- "slider"
pitch_desc[which(pitch_desc=='FC')] <- "fastball cutter"
pitch_desc[which(pitch_desc=='CU')] <- "curveball"
pitch_desc[which(pitch_desc=='CH')] <- "changeup"
#Add new column to the dataframe
nh$pitch_desc <- pitch_desc
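The same mapping can be written as a named lookup vector, which keeps all the codes and names in one place. This is only a sketch on a toy vector of codes; pitch_names and codes are names made up for the example.

```r
# Lookup table: abbreviation -> full description (same pairs as above)
pitch_names <- c(
  FF = "four-seam fastball",
  SL = "slider",
  FC = "fastball cutter",
  CU = "curveball",
  CH = "changeup"
)

# Toy vector of codes standing in for nh$pitch_type
codes <- c("FF", "SL", "CH", "FF")

# Index the lookup by code; unname() drops the codes from the result
full <- unname(pitch_names[codes])
print(full)
```

One difference to keep in mind: a code that is missing from the lookup becomes NA here, while the which() version above leaves the abbreviation unchanged.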
Now, change pitch_type to pitch_desc
ggplot()+
geom_path(data = sz, aes(x=x, y=z))+
coord_equal()+
xlab("feet from home plate")+
ylab("feet above the ground")+
geom_point(data = nh,aes(x=px,y=pz,size=start_speed, color=pitch_desc))+
scale_size(range = c(0.01,3))
Now, that looks more informative.
ggplot()+
geom_path(data = sz, aes(x=x, y=z))+
coord_equal()+
xlab("feet from home plate")+
ylab("feet above the ground")+
geom_point(data = nh,aes(x=px,y=pz,size=start_speed, color=pitch_desc))+
scale_size(range = c(0.005,2.5))+
scale_color_hue(h=c(180,0), c=100, l=50)
#pitch type visualization
#Manual Color Specification
ggplot()+
geom_path(data = sz, aes(x=x, y=z))+
coord_equal()+
xlab("feet from home plate")+
ylab("feet above the ground")+
geom_point(data = nh,aes(x=px,y=pz,size=start_speed, color=pitch_desc))+
scale_size(range = c(0.005,2.5))+
scale_color_manual(values = c('green','blue','pink','red','black'))
Now we are interested in right-handed and left-handed batters. We want ‘L’ for left-handed and ‘R’ for right-handed on the plot.
We will use stand to do that. We will put ‘L’ at x = 1.5 and ‘R’ at x = -1.5; keep in mind that we are looking from the catcher’s position.
#Store stand in new variable
stand_xcoord <- nh$stand
#Initialize
stand_xcoord[which(stand_xcoord=='L')] <- 1.5
stand_xcoord[which(stand_xcoord=='R')] <- -1.5
#change to numeric
stand_xcoord <- as.numeric(stand_xcoord)
#Create new column in dataframe
nh$stand_xcoord <- stand_xcoord
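A shorter way to get the same numeric column is ifelse(), which skips the detour through character values. A sketch on a toy stand vector:

```r
# Toy vector standing in for nh$stand
stand <- c("R", "L", "R")

# 'L' goes to x = 1.5; everything else (here only 'R') goes to x = -1.5
stand_xcoord <- ifelse(stand == "L", 1.5, -1.5)
print(stand_xcoord)
```

This assumes stand only ever holds 'L' or 'R'; any other value would silently land at -1.5.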
Now we will plot text using geom_text.
ggplot()+
geom_path(data = sz, aes(x=x, y=z))+
coord_equal()+
xlab("feet from home plate")+
ylab("feet above the ground")+
geom_point(data = nh,aes(x=px,y=pz,size=start_speed, color=pitch_desc))+
scale_size(range = c(0.005,2.5))+
scale_color_manual(values = c('red','blue','green','yellow','black'))+
facet_wrap(~stand)+
geom_text(data = nh,aes(label=stand, x = stand_xcoord),y=2.5,size=6)
Now, we are interested in plotting batter name and inning on the plot.
ggplot()+
geom_path(data = sz, aes(x=x, y=z))+
coord_equal()+
xlab("feet from home plate")+
ylab("feet above the ground")+
geom_point(data = nh,aes(x=px,y=pz,size=start_speed, color=pitch_desc))+
scale_size(range = c(0.005,2.5))+
scale_color_manual(values = c('green','blue','yellow','red','black'))+
facet_wrap(~num)+
geom_text(data = nh,aes(label=stand, x = stand_xcoord),y=2.5,size=4)+
geom_text(data = nh,aes(label=batter_name),x=0,y=0.5,size=2.5)+
geom_text(data = nh,aes(label=inning.x),x=0,y=4.5,size=2.2)
Now, we will plot pitches of specific batter and specific inning.
#Specific Batter and inning
batter <- "Pedro Alvarez"
inning <- 5
#create atbat dataframe
ab <- nh%>%filter(batter_name==batter, inning.x==inning)
Plot the inning and batter name as the title of the plot
ggplot()+
geom_path(data = sz, aes(x=x, y=z))+
coord_equal()+
xlab("feet from home plate")+
ylab("feet above the ground")+
geom_point(data = ab,aes(x=px,y=pz,size=start_speed, color=pitch_desc))+
scale_size(range = c(0.005,2.5))+
scale_color_manual(values = c('black','blue','red','yellow','green'))+
geom_text(data = ab,aes(label=stand, x = stand_xcoord),y=2.5,size=5)+
xlim(-2,2)+
ylim(0,4.5)+
ggtitle(paste("Inning",inning," | ",batter))
Next, we want to label each pitch with what happened on it. We will get that information from the ‘des’ column.
#labelling pitches
des <- nh$des
event <- nh$event
#'In play, out(s)' does not say what happened, so use the at-bat event instead.
des[which(des=='In play, out(s)')] <- event[which(des=='In play, out(s)')]
nh$des2 <- des
ab <- nh%>%filter(batter_name==batter, inning.x==inning)
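The replacement above only handles 'In play, out(s)'. Other games can also contain descriptions such as 'In play, no out', so a slightly more general sketch matches every description that starts with "In play". The vectors below are toy values, and whether you want this broader rule is a judgment call.

```r
# Toy stand-ins for nh$des and nh$event (values are invented)
des <- c("Ball", "In play, out(s)", "Foul", "In play, no out")
event <- c("Strikeout", "Flyout", "Single", "Single")

# Use the at-bat event wherever the pitch description starts with "In play"
des2 <- ifelse(grepl("^In play", des), event, des)
print(des2)
```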
Now plot the data.
ggplot()+
geom_path(data = sz, aes(x=x, y=z))+
coord_equal()+
xlab("feet from home plate")+
ylab("feet above the ground")+
geom_point(data = ab,aes(x=px,y=pz,size=start_speed, color=pitch_desc))+
scale_size(range = c(0.005,2.5))+
scale_color_manual(values = c('black','blue','red','yellow','green'))+
geom_text(data = ab,aes(label=stand, x = stand_xcoord),y=2.5,size=5)+
xlim(-2,2)+
ylim(0,4.5)+
ggtitle(paste("Inning",inning," | ",batter))+
geom_text(data = ab, aes(label=des2, x=px, y=pz),size=2.5, vjust=2)
Now, that’s what we wanted. great!!!
Next, we want to number the pitches: which one was first, which second, and so on. We will use the ‘tfs’ column for the ordering.
#Number the pitches within each at-bat
#Sort by at-bat and time, then count rows within each at-bat
nh <- nh%>%
arrange(num, tfs)%>%
group_by(num)%>%
mutate(pitch_enum = row_number())%>%
ungroup()
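To check what the enumeration should look like, here is a sketch on toy data using only base R. The data frame toy and its values are invented; ave() applies seq_along() within each at-bat after the rows are sorted by at-bat and time.

```r
# Toy pitches from two at-bats, deliberately out of order (values invented)
toy <- data.frame(
  num = c(2, 1, 2, 2, 1),
  tfs = c(200636, 200552, 200625, 200655, 200560)
)

# Sort by at-bat, then by time within the at-bat
toy <- toy[order(toy$num, toy$tfs), ]

# Number the rows 1, 2, 3, ... separately inside each at-bat
toy$pitch_enum <- ave(seq_along(toy$num), toy$num, FUN = seq_along)
print(toy)
```

At-bat 1 gets the numbers 1 and 2, and at-bat 2 gets 1, 2 and 3.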
Now plot the data and visualize the pitch enumeration.
#visualize pitch enum
#Specific Batter and inning
batter <- "Jose Tabata"
inning <- 9
#create atbat dataframe
ab <- nh%>%filter(batter_name==batter, inning.x==inning)
ggplot()+
geom_path(data = sz, aes(x=x, y=z))+
coord_equal()+
xlab("feet from home plate")+
ylab("feet above the ground")+
geom_point(data = ab,aes(x=px,y=pz,size=start_speed, color=pitch_desc))+
scale_size(range = c(0.005,2.5))+
scale_color_manual(values = c('violet','red','black','yellow','green'))+
geom_text(data = ab,aes(label=stand, x = stand_xcoord),y=2.5,size=5)+
xlim(-2,2)+
ylim(0,4.5)+
ggtitle(paste("Inning",inning," | ",batter))+
geom_text(data = ab, aes(label=des2, x=px, y=pz),size=2.5, vjust=2)+
geom_text(data = ab, aes(label=pitch_enum, x=px, y=pz),size=2.5, vjust=-1.2)
Now we want to make a plot like the one above for every at-bat and save each as an image.
First we will fix the colors for each pitch description, so that a given description has the same color in every plot.
colors <- c("red","blue","orange","green","purple")
names(colors) <- c('four-seam fastball','slider','fastball cutter','curveball','changeup')
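Because the vector is named, indexing it by description always returns the same color, no matter which pitches appear in a given at-bat or in what order. A small sketch, where pitches is a toy stand-in for the descriptions found in one at-bat:

```r
colors <- c("red", "blue", "orange", "green", "purple")
names(colors) <- c('four-seam fastball', 'slider', 'fastball cutter', 'curveball', 'changeup')

# Toy example: an at-bat that only saw sliders and four-seam fastballs
pitches <- c("slider", "four-seam fastball")

# Subsetting by name keeps each description tied to its color
picked <- colors[pitches]
print(picked)
```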
Next, using a for loop, we will save a plot for every at-bat using ggsave().
#One plot per at-bat
for(i in unique(nh$num)){
#create atbat dataframe
ab <- nh%>%filter(num==i)
batter <- ab$batter_name[1]
inning <- ab$inning.x[1]
pitches <- unique(ab$pitch_desc)
#Speed Size should be same in every plot
zmax <-(max(ab$start_speed)-75.4)/22
zmin <-(min(ab$start_speed)-75.4)/22
plot <- ggplot()+
geom_path(data = sz, aes(x=x, y=z))+
coord_equal()+
xlab("feet from home plate")+
ylab("feet above the ground")+
geom_point(data = ab,aes(x=px,y=pz,size=start_speed, color=pitch_desc))+
scale_size(range = c(2.495*zmin+0.005,2.495*zmax+0.005))+
scale_color_manual(values = colors[pitches])+
geom_text(data = ab,aes(label=stand, x = stand_xcoord),y=2.5,size=5)+
xlim(-2,2)+
ylim(0,5)+
ggtitle(paste("Inning",inning," | ",batter))+
geom_text(data = ab, aes(label=des2, x=px, y=pz),size=2.5, vjust=2)+
geom_text(data = ab, aes(label=pitch_enum, x=px, y=pz),size=2.5, vjust=-1.2)
ggsave(paste("atbat",i,".png",sep = ""), plot)
}
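The zmin and zmax lines in the loop rescale each at-bat's speeds so that a given speed gets the same point size in every plot. The constants 75.4 and 22 appear to be the slowest pitch and the speed range for this game; that reading is an assumption, as the values are hard-coded above. The arithmetic can be sketched as a hypothetical helper speed_to_size():

```r
# Hypothetical helper: map a pitch speed (mph) to a point size,
# using the same constants as the loop above.
speed_to_size <- function(speed, min_speed = 75.4, span = 22) {
  z <- (speed - min_speed) / span   # 0 at the slowest pitch, 1 at the fastest
  2.495 * z + 0.005                 # stretch to the size range 0.005 to 2.5
}

sizes <- speed_to_size(c(75.4, 86.4, 97.4))
print(sizes)
```

A pitch at 75.4 mph maps to size 0.005 and one at 97.4 mph to about 2.5, matching the scale_size(range = c(0.005,2.5)) used in the earlier plots. Passing fixed limits to scale_size() is another way to keep sizes comparable across plots.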
Now we will look at some of the plots we saved as images.