PITCHf/x Data

We will be using a data set that’s quite similar to the one you used on your recent homework. We’ll start (as usual) by loading the data and loading the needed packages. One difference between this data and your last set is that all of these pitches are called pitches. I also included the names of the catchers and umpires in addition to the names of the pitchers.

# Reading in data on pitches from MLB 2016 regular season
p <- read.csv('/home/rstudioshared/shared_files/data/called_pitches.csv')

# loading the dplyr and ggplot2 packages
library(dplyr)
library(ggplot2)

# looking at the first six rows of data
head(p)

Called Strike Percentage

First, let’s see what percentage of called pitches are strikes. We can do this by taking the mean of called_strike, which takes on the value of 1 if a pitch is a called strike and 0 if it is a ball. This is similar to how we used the “shot_made_flag” in our jump shot data.

p %>% summarize(strike_percentage = mean(called_strike))
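To see why the mean works here, a minimal sketch on made-up values may help. The vector below is hypothetical, not taken from the data set; it only shows that the mean of a 0/1 variable equals the share of 1s.

```r
# Hypothetical 0/1 outcomes for ten called pitches (1 = strike, 0 = ball)
called_strike <- c(1, 0, 0, 1, 1, 0, 0, 0, 1, 0)

# The mean of a 0/1 vector is the share of 1s
strike_share <- mean(called_strike)

# The same value computed the long way: strikes divided by pitches
strike_share_by_hand <- sum(called_strike == 1) / length(called_strike)

print(strike_share)
print(strike_share_by_hand)
```

Strictly speaking this is a proportion between 0 and 1; multiply by 100 if you want it as a percentage.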

We might also be interested in how this percentage varies by top/bottom of the inning, inning and pitch_group. This is a job for the group_by function:

p %>% group_by(inning_top_bottom) %>% summarize(strike_percentage = mean(called_strike))
p %>% group_by(pitch_group) %>% summarize(strike_percentage = mean(called_strike))
p %>% group_by(inning) %>% summarize(strike_percentage = mean(called_strike))

As usual, we might want to know how many data points fall in each category, and we can use the length function to figure that out:

p %>% group_by(inning) %>% summarize(strike_percentage = mean(called_strike),
                                     called_pitches = length(called_strike))

Lastly, we could plot this data. Let’s put inning on the x-axis, strike_percentage on the y-axis and size our points by the number of called pitches they represent.

p %>% group_by(inning) %>% summarize(strike_percentage = mean(called_strike),
                                     called_pitches = length(called_strike)) %>%
  ggplot() + geom_point(aes(x=inning, y=strike_percentage, size=called_pitches))
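The counts above come from length(called_strike). dplyr also provides n(), which counts the rows in the current group without naming a column. The sketch below uses a small hypothetical data frame (toy is not part of the course data) to show that the two give the same counts.

```r
library(dplyr)

# Hypothetical toy data: six called pitches across two innings
toy <- data.frame(
  inning        = c(1, 1, 1, 2, 2, 2),
  called_strike = c(1, 0, 0, 1, 1, 0)
)

# n() counts the rows in each group, the same number that
# length(called_strike) returns inside summarize()
by_inning <- toy %>%
  group_by(inning) %>%
  summarize(strike_percentage  = mean(called_strike),
            called_pitches     = n(),
            called_pitches_len = length(called_strike))

print(by_inning)
```

Either form works; n() is just a little shorter and does not depend on which column you pick.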

The Shape of the Strike Zone

px and pz are the horizontal and vertical locations of the pitch. We can group by these locations rounded to the nearest 1/10th of a foot using the following code:

p %>% group_by(pxbin = round(px, 1), pzbin=round(pz,1)) %>% 
  summarize(strike_percentage = mean(called_strike),
            called_pitches = length(called_strike))

You can see that this gives us far too many locations to look at, even if we (as below) filter out locations with fewer than 100 pitches:

p %>% group_by(pxbin = round(px, 1), pzbin=round(pz,1)) %>% 
  summarize(strike_percentage = mean(called_strike),
            called_pitches = length(called_strike)) %>% 
  filter(called_pitches>=100)

However, we can make sense of this data by plotting it!

We’ll use the (rounded) horizontal and vertical locations of the pitches as the x- and y-coordinates of the points, and color each point by the strike_percentage at that location, by adding to our previous code:

p %>% group_by(pxbin = round(px, 1), pzbin=round(pz,1)) %>% 
  summarize(strike_percentage = mean(called_strike),
            called_pitches = length(called_strike)) %>% filter(called_pitches>=100) %>%
  ggplot() + geom_point(aes(x=pxbin, y=pzbin, col=strike_percentage))

This looks a bit better, perhaps, if we change the color scale:

p %>% group_by(pxbin = round(px, 1), pzbin=round(pz,1)) %>% 
  summarize(strike_percentage = mean(called_strike),
            called_pitches = length(called_strike)) %>% filter(called_pitches>=100) %>%
  ggplot() + geom_point(aes(x=pxbin, y=pzbin, col=strike_percentage)) + 
  scale_colour_gradient(low="white", high="red")
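As an aside on the binning used throughout this section, here is a small sketch of what round(px, 1) and round(pz, 1) do to individual pitches. The locations below are hypothetical values chosen for illustration.

```r
# Hypothetical pitch locations in feet
px <- c(-0.12, 0.04, 0.06, 0.14)
pz <- c( 2.47, 2.53, 2.48, 2.51)

# Rounding to one decimal place snaps each pitch to the nearest 1/10th of a foot
pxbin <- round(px, 1)
pzbin <- round(pz, 1)

# The last two pitches land in the same (pxbin, pzbin) location
print(data.frame(px, pz, pxbin, pzbin))
```

Pitches that share both rounded values end up in the same group, which is why each point in the plots stands for a 1/10th of a foot by 1/10th of a foot square.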

Filtering Pitches

Notice that (almost!) all of the called strikes fall within a 3 foot by 3 foot box – within 1.5 feet of the center of the plate and between 1 and 4 feet high. Pitches outside of that box are almost surely balls. Let’s create a new data set, p2, that has only pitches within that box.

p2 <- p %>% filter(px>= -1.5 & px <= 1.5, pz>= 1, pz<=4)
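The filter above relies on very few called strikes lying outside the box. One way to check that is to compare the number of called strikes before and after filtering. The sketch below does this on a hypothetical toy data frame; with the real data you would use p in place of toy.

```r
library(dplyr)

# Hypothetical toy data: five called pitches
toy <- data.frame(
  px            = c(0.0, 1.2, -1.8, 0.3, 2.0),
  pz            = c(2.5, 3.9,  2.0, 0.5, 4.5),
  called_strike = c(1,   0,    0,   0,   0)
)

# Same conditions as the filter used to build p2
in_box <- toy %>% filter(px >= -1.5 & px <= 1.5, pz >= 1, pz <= 4)

# Called strikes that the filter throws away
strikes_dropped <- sum(toy$called_strike) - sum(in_box$called_strike)

print(nrow(in_box))
print(strikes_dropped)
```

If strikes_dropped is small relative to the total number of called strikes, restricting attention to the box costs us very little.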

The Size of the Strike Zone

Let’s round locations to the nearest 10th of a foot again, using this smaller data set, and for each location determine whether more than half of the pitches are called strikes. For each location, we’ll create a new variable in_strikezone that is TRUE if more than half of the pitches in that location are strikes and FALSE otherwise:

p2 %>% group_by(pxbin = round(px, 1), pzbin=round(pz,1)) %>% 
  summarize(strike_percentage = mean(called_strike),
            called_pitches = length(called_strike)) %>% ungroup() %>%
  mutate(in_strikezone = strike_percentage>0.5)

Next, let’s count up how many locations are in the strike zone. Summing in_strikezone counts the locations where it is TRUE:

p2 %>% group_by(pxbin = round(px, 1), pzbin=round(pz,1)) %>% 
  summarize(strike_percentage = mean(called_strike),
            called_pitches = length(called_strike)) %>% ungroup() %>%
  mutate(in_strikezone = strike_percentage>0.5) %>%
  summarize(locations_in_strikezone = sum(in_strikezone))

You should find that 341 locations are within the strike zone. Each of these locations is 1/10th of a foot by 1/10th of a foot. In other words, each location has an area of 1/100th of a square foot. Therefore, we can conclude that the strike zone is roughly 341/100 = 3.41 square feet in area.

We might be able to do a bit better than this, however. Instead of calling a location with 60% strikes entirely within the strike zone and a location with 40% strikes entirely outside of the strike zone we could count the former location as 0.6 in and the latter as 0.4 in and then sum up these values. This time, I will divide by 100 within the code to put the final value in terms of square feet.

p2 %>% group_by(pxbin = round(px, 1), pzbin=round(pz,1)) %>% 
  summarize(strike_percentage = mean(called_strike),
            called_pitches = length(called_strike)) %>% ungroup() %>%
  summarize(strike_zone_size = sum(strike_percentage)/100)

Now, we get a value of 3.43 square feet. How big is this? The calculation below shows that this is equivalent to a square roughly 22 inches on each side. To put this in perspective, the plate is 17 inches wide.

12*sqrt(3.43)
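To see how the two area estimates above can differ, here is a sketch on four hypothetical locations (toy_bins and its strike rates are made up for illustration). The first estimate counts a location fully once its strike rate passes 50%; the second counts each location in proportion to its strike rate.

```r
library(dplyr)

# Hypothetical strike rates for four 0.1 ft x 0.1 ft locations
toy_bins <- data.frame(
  pxbin             = c(0.0, 0.1, 0.0, 0.1),
  pzbin             = c(2.5, 2.5, 2.6, 2.6),
  strike_percentage = c(1.0, 0.6, 0.3, 0.0)
)

areas <- toy_bins %>%
  summarize(
    # Method 1: a location counts fully if more than half its pitches are strikes
    threshold_area = sum(strike_percentage > 0.5) / 100,
    # Method 2: each location counts in proportion to its strike rate
    weighted_area  = sum(strike_percentage) / 100
  )

print(areas)
```

Here two locations pass the 50% threshold, giving 0.02 square feet, while the strike rates sum to 1.9, giving 0.019 square feet. On the real data the two estimates are close because over-counts and under-counts near the edge of the zone largely cancel.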

Bigger Bins

For the upcoming analysis, we’re going to want to make sure that we have called pitches in each bin so, to play it safe, let’s make larger bins. These bins will be 1/5th of a foot on each side, or 1/25th of a square foot in area. We can make these bins using the cut function:

p2 <- p2 %>% mutate(pxbin = cut(px, seq(-1.5, 1.5, 0.2)),
              pzbin = cut(pz, seq(1, 4, 0.2)))

Let’s try our strike zone size calculation again using these bins. Note that now we need to divide by 25 rather than 100 (because there are now only 25 locations per square foot):

p2 %>% group_by(pxbin, pzbin) %>% 
  summarize(strike_percentage = mean(called_strike),
            called_pitches = length(called_strike)) %>% ungroup() %>%
  summarize(strike_zone_size = sum(strike_percentage)/25)

Good news! Our strike zone is still 3.43 square feet in area.
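One detail of cut is worth knowing. By default its intervals are open on the left and closed on the right, so a value exactly equal to the lowest break point gets no bin (NA). The sketch below uses hypothetical px values to show this, along with the include.lowest = TRUE argument that closes the first interval. Whether any pitch in the real data sits exactly on the lower edge is something you would need to check.

```r
# Break points every 0.2 ft from -1.5 to 1.5: 16 breaks, 15 bins
breaks <- seq(-1.5, 1.5, 0.2)

# Hypothetical horizontal locations, including one exactly on the lower edge
px <- c(-1.5, -1.4, 0.0, 1.45)

# Default: intervals look like (a, b], so -1.5 itself falls outside every bin
default_bins <- cut(px, breaks)

# include.lowest = TRUE closes the first interval, so -1.5 is kept
closed_bins <- cut(px, breaks, include.lowest = TRUE)

print(length(breaks) - 1)  # number of bins
print(default_bins)
print(closed_bins)
```

With 15 bins in each direction, the full grid has 15 x 15 = 225 possible locations.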

Pitchers’ Umps and Hitters’ Umps

How does the strike zone size depend on the umpire? To find out, we can group by location as well as umpire and then calculate the size of the strike zone for each umpire. Let’s also order the data to find the umpires with the largest and smallest strike zones:

p2 %>% group_by(uname, pxbin, pzbin) %>% 
  summarize(strike_percentage = mean(called_strike),
            called_pitches = length(called_strike)) %>% group_by(uname) %>%
  summarize(strike_zone_size = sum(strike_percentage)/25) %>% 
  arrange(strike_zone_size)

p2 %>% group_by(uname, pxbin, pzbin) %>% 
  summarize(strike_percentage = mean(called_strike),
            called_pitches = length(called_strike)) %>% group_by(uname) %>%
  summarize(strike_zone_size = sum(strike_percentage)/25) %>% 
  arrange(desc(strike_zone_size))

It’s possible that some of these umpires didn’t call very many pitches, so it would be a good idea to determine the number of pitches each umpire saw and filter out umpires who didn’t see very many. Here are the largest strike zones among umpires who saw at least 1000 pitches:

p2 %>% group_by(uname, pxbin, pzbin) %>% 
  summarize(strike_percentage = mean(called_strike),
            called_pitches = length(called_strike)) %>% group_by(uname) %>%
  summarize(strike_zone_size = sum(strike_percentage)/25, called_pitches=sum(called_pitches)) %>% 
  filter(called_pitches>=1000) %>%
  arrange(desc(strike_zone_size))

Bill Miller calls a large strike zone and maybe now we can understand Bryce Harper’s frustration.
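One caveat about these per-umpire sizes: summing strike rates over locations quietly treats any location where an umpire saw no pitches as contributing nothing, so an umpire with empty bins looks like he has a smaller zone. The sketch below illustrates this with hypothetical data (the umpire names, the bin column and the values are all made up) by counting how many bins each umpire actually has.

```r
library(dplyr)

# Hypothetical toy data: two umpires, two possible bins "a" and "b"
toy <- data.frame(
  uname         = c("Ump A", "Ump A", "Ump A", "Ump B", "Ump B"),
  bin           = c("a",     "a",     "b",     "a",     "a"),
  called_strike = c(1,       1,       1,       1,       1)
)

coverage <- toy %>%
  group_by(uname, bin) %>%
  summarize(strike_percentage = mean(called_strike)) %>%
  group_by(uname) %>%
  summarize(bins_seen          = n(),                    # bins with at least one pitch
            summed_strike_rate = sum(strike_percentage)) # what the size calculation adds up

print(coverage)
```

Both umpires call every pitch a strike, yet Ump B's total is half of Ump A's because he never saw a pitch in bin "b". With the real bins there are 15 x 15 = 225 possible locations, so adding a bins_seen column to the umpire summary is a quick way to see who is missing some.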

Catcher Framing

Catchers are trained to have soft hands and to make pitches on the edges of the strike zone look better than they are. Some catchers are better at this than others of course – in fact, this may well be the most important element of a catcher’s defense. Try modifying the code above to find the catchers who give their pitchers the largest and smallest strike zones.

Veteran Pitchers?

Do some pitchers get the benefit of the doubt while others don’t? Find the pitchers with the smallest and largest strike zones.

Home Cooking?

Does the home team get a friendlier strike zone than the visitors? Look carefully at the data set and then find the strike zone size for home and away teams.

Batter Handedness?

Umpires stand in the slot between the catcher and the batter, but their positions for left-handed and right-handed batters may not be mirror images, and this could affect their strike zones. Who gets the more favorable strike zone, lefties or righties?

What else can you find?

What other factors could affect the size of the strike zone? See what you can find.
