Introduction

When should you expect to see a fastball versus an off-speed pitch?

At what velocity is a pitch expected to be thrown?

These questions are important for batters and pitchers alike. Batters can get an idea of what to expect during an at bat, and pitchers can see how they stack up against other pitchers for a given pitch.

MLB Pitch Data, which covers every pitch thrown in the 2015-2018 MLB seasons, will be analyzed and visualized to show the distribution of pitch speeds and the proportion of each pitch type used through an at bat.

Some fun stats of interest to all baseball fans are included as well!

Packages Required

library(readr) # import data
library(magrittr) # use pipe operator
library(tidyr) # tidy messy data
library(dplyr) # manipulate data
library(purrr) # apply functions to multiple variables
library(broom) # create data frames from other formats
library(ggplot2) # visualize data

Data Preparation

The MLB Pitch Data 2015-2018 is posted on Kaggle. It was scraped from a webpage that is part of MLB.com. This collection of data includes the following 4 data sets as csv files: atbats.csv, games.csv, pitches.csv, and player_names.csv.

To start, the data is loaded into R.

atbats <- read_csv("data/atbats.csv")
games <- read_csv("data/games.csv")
pitches <- read_csv("data/pitches.csv")
player_names <- read_csv("data/player_names.csv")

Here are descriptions for each variable in these data sets. Because there are 70 columns within this data, any variable not used for analysis will not be described. However, all variables were initially considered when cleaning data and may be seen in those steps.

At Bats

Games

Pitches

Player Names

Data Cleaning

At Bats

The structure of the data should be evaluated and modified as necessary.

ID numbers are stored as numerics but don’t actually represent a numeric value. Additionally, all actual numeric values here should logically only take on integer values. For example, there can’t be 2.3 outs in the 4.6th inning with 3.7 runs for the pitcher’s team!

To preserve the original imported data, a new data frame atbats_clean is used as changes are made to data types. The structure is then checked again to ensure the correct changes were made.

str(atbats)

atbats_clean <- atbats

atbats_clean$ab_id <- as.character(atbats_clean$ab_id)
atbats_clean$batter_id <- as.character(atbats_clean$batter_id)
atbats_clean$g_id <- as.character(atbats_clean$g_id)
atbats_clean$pitcher_id <- as.character(atbats_clean$pitcher_id)
atbats_clean$inning <- as.integer(atbats_clean$inning)
atbats_clean$o <- as.integer(atbats_clean$o)
atbats_clean$p_score <- as.integer(atbats_clean$p_score)

str(atbats_clean)
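The same conversions could also be declared at import time instead of afterwards. The following is a minimal sketch, assuming readr 2.0 or later (for `I()` literal input). The two-row CSV text and its ID values are made up for illustration and only mimic a few of the atbats columns.

```r
library(readr)

# Hypothetical two-row extract standing in for data/atbats.csv
sample_csv <- I("ab_id,batter_id,inning,o,p_score\n2015000001,572761,1,1,0\n2015000002,518792,1,2,0\n")

# IDs are read as text; counts are read as integers
atbats_typed <- read_csv(
  sample_csv,
  col_types = cols(
    ab_id     = col_character(),
    batter_id = col_character(),
    inning    = col_integer(),
    o         = col_integer(),
    p_score   = col_integer()
  )
)

str(atbats_typed)
```

Declaring types up front keeps the import and the type decisions in one place, at the cost of having to list every column whose default guess is not wanted.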

After adjusting the structure of the data, summary statistics will give a better idea of the data that will be worked with.

summary(atbats)
##      ab_id             batter_id         event          
##  Min.   :2.015e+09   Min.   :112526   Length:740389     
##  1st Qu.:2.016e+09   1st Qu.:457759   Class :character  
##  Median :2.017e+09   Median :519317   Mode  :character  
##  Mean   :2.017e+09   Mean   :520223                     
##  3rd Qu.:2.018e+09   3rd Qu.:592273                     
##  Max.   :2.018e+09   Max.   :673633                     
##       g_id               inning             o            p_score      
##  Min.   :201500001   Min.   : 1.000   Min.   :0.000   Min.   : 0.000  
##  1st Qu.:201600013   1st Qu.: 3.000   1st Qu.:1.000   1st Qu.: 0.000  
##  Median :201700015   Median : 5.000   Median :2.000   Median : 1.000  
##  Mean   :201651556   Mean   : 5.008   Mean   :1.677   Mean   : 2.286  
##  3rd Qu.:201800006   3rd Qu.: 7.000   3rd Qu.:2.000   3rd Qu.: 4.000  
##  Max.   :201802431   Max.   :19.000   Max.   :3.000   Max.   :25.000  
##    p_throws           pitcher_id        stand              top         
##  Length:740389      Min.   :112526   Length:740389      Mode :logical  
##  Class :character   1st Qu.:462136   Class :character   FALSE:363106   
##  Mode  :character   Median :534910   Mode  :character   TRUE :377283   
##                     Mean   :526830                                     
##                     3rd Qu.:592836                                     
##                     Max.   :673633

The most important check here is for nonsensical numeric values, such as a negative value or more than 3 outs. Additionally, you can see that the top variable is a logical and has more TRUEs than FALSEs, which makes sense because the bottom of the 9th (or final inning of the game) often does not need to be played if the home team is leading.
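The range checks described above can also be written as explicit assertions rather than read off the summary. This is only a sketch on a toy data frame with invented values; the column names follow atbats_clean, and the thresholds are the ones named in the text.

```r
# Toy stand-in for atbats_clean; the values are invented for illustration
toy_atbats <- data.frame(
  inning  = c(1L, 5L, 9L),
  o       = c(1L, 3L, 2L),
  p_score = c(0L, 2L, 7L)
)

# Each check is TRUE when every value falls in a sensible range
checks <- c(
  inning_positive = all(toy_atbats$inning >= 1),
  outs_0_to_3     = all(toy_atbats$o >= 0 & toy_atbats$o <= 3),
  score_nonneg    = all(toy_atbats$p_score >= 0)
)

checks

# Stops with an error if any check fails
stopifnot(all(checks))
```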

To further check the character variables, the unique() function was used to ensure that the pitcher’s throwing hand and the batter’s stance could only be L for left or R for right, and to make sure all event descriptions made sense.

unique(atbats_clean$event)
unique(atbats_clean$p_throws)
unique(atbats_clean$stand)

Next, missing values were checked and there are none in the atbats data.

sum(is.na(atbats))
## [1] 0

Games

games was checked and cleaned in much the same way as atbats.

Summary statistics can be seen below.

str(games)

games_clean <- games

games_clean$g_id <- as.character(games_clean$g_id)
games_clean$attendance <- as.integer(games_clean$attendance)
games_clean$away_final_score <- as.integer(games_clean$away_final_score)
games_clean$home_final_score <- as.integer(games_clean$home_final_score)

str(games_clean)
summary(games_clean)
##    attendance    away_final_score  away_team              date           
##  Min.   :    0   Min.   : 0.000   Length:9718        Min.   :2015-04-05  
##  1st Qu.:21791   1st Qu.: 2.000   Class :character   1st Qu.:2016-04-03  
##  Median :30113   Median : 4.000   Mode  :character   Median :2017-04-02  
##  Mean   :29765   Mean   : 4.372                      Mean   :2017-01-01  
##  3rd Qu.:37953   3rd Qu.: 6.000                      3rd Qu.:2018-03-29  
##  Max.   :56310   Max.   :24.000                      Max.   :2018-10-01  
##   elapsed_time       g_id           home_final_score  home_team        
##  Min.   : 75.0   Length:9718        Min.   : 0.000   Length:9718       
##  1st Qu.:167.0   Class :character   1st Qu.: 2.000   Class :character  
##  Median :182.0   Mode  :character   Median : 4.000   Mode  :character  
##  Mean   :184.8                      Mean   : 4.538                     
##  3rd Qu.:199.0                      3rd Qu.: 6.000                     
##  Max.   :409.0                      Max.   :25.000                     
##   start_time        umpire_1B          umpire_2B        
##  Length:9718       Length:9718        Length:9718       
##  Class1:hms        Class :character   Class :character  
##  Class2:difftime   Mode  :character   Mode  :character  
##  Mode  :numeric                                         
##                                                         
##                                                         
##   umpire_3B          umpire_HP          venue_name       
##  Length:9718        Length:9718        Length:9718       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##    weather              wind               delay         
##  Length:9718        Length:9718        Min.   :   0.000  
##  Class :character   Class :character   1st Qu.:   0.000  
##  Mode  :character   Mode  :character   Median :   0.000  
##                                        Mean   :   3.886  
##                                        3rd Qu.:   0.000  
##                                        Max.   :1860.000
unique(games_clean$away_team)
unique(games_clean$home_team)
unique(games_clean$venue_name)

This data does have a few missing values. They are only in the umpire_2B variable. Because this variable isn’t used for analysis, all observations will be kept in the data set.

sum(is.na(games))
## [1] 3
games %>% 
  map_df(~sum(is.na(.))) %>% 
  gather(variable, num_missing) %>% 
  filter(num_missing > 0)
## # A tibble: 1 x 2
##   variable  num_missing
##   <chr>           <int>
## 1 umpire_2B           3

Pitches

After importing the pitches data and checking the Global Environment, you can see it’s a large data set with many variables. So, missing values were handled and only the needed variables were selected first.

The code below displays the total number of missing values and breaks down which variables contain them. The variables to select were determined based on which variables had no missing values and which were needed regardless of missing values.

sum(is.na(pitches))
colSums(is.na(pitches))
pitches %>% 
  select(ab_id, b_count, end_speed, outs, on_1b, on_2b, on_3b, pitch_type, s_count, start_speed, type) %>% 
  map_df(~sum(is.na(.))) %>% 
  gather(variable, num_missing) %>% 
  filter(num_missing > 0)
## # A tibble: 3 x 2
##   variable    num_missing
##   <chr>             <int>
## 1 end_speed         14114
## 2 pitch_type        14189
## 3 start_speed       14114

The number of missing values was just under half a percent of all pitch data, so pitches_clean is created here by omitting any observations with missing values. A check is then done to make sure all missing values were removed.

pitches_clean <- pitches %>% 
  select(ab_id, b_count, end_speed, outs, on_1b, on_2b, on_3b, pitch_type, s_count, start_speed, type) %>% 
  na.omit()

sum(is.na(pitches_clean))
## [1] 0
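Before dropping rows, the share that na.omit() would remove can be measured directly. The sketch below uses a small invented data frame rather than the real pitches data; `toy_pitches` and its values are placeholders.

```r
# Toy data with missing values; not the real pitches data
toy_pitches <- data.frame(
  start_speed = c(92.1, NA, 88.4, 95.0),
  pitch_type  = c("FF", NA, "SL", "FF")
)

# complete.cases() flags rows with no NA in any column,
# so the mean of its negation is the share of rows na.omit() would drop
incomplete_share <- mean(!complete.cases(toy_pitches))
incomplete_share

# Rows that remain after dropping incomplete observations
rows_kept <- nrow(na.omit(toy_pitches))
rows_kept
```

Counting incomplete rows rather than missing cells matters when one row is missing several variables at once, since such a row is only dropped once.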

Here the same structure check and data type changes were performed as they were with the previous 2 data sets.

str(pitches_clean)

pitches_clean$ab_id <- as.character(pitches_clean$ab_id)
pitches_clean$b_count <- as.integer(pitches_clean$b_count)
pitches_clean$outs <- as.integer(pitches_clean$outs)
pitches_clean$s_count <- as.integer(pitches_clean$s_count)
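Because the same kind of type change is repeated for each data set, it could also be expressed in one step. This is a sketch only, assuming dplyr 1.0 or later for across(); `toy_raw` is an invented stand-in, not the real pitches data.

```r
library(dplyr)

# Toy stand-in for the selected pitches columns; values are invented
toy_raw <- tibble(
  ab_id   = c(2015000001, 2015000001),
  b_count = c(0, 1),
  s_count = c(1, 1),
  outs    = c(0, 0)
)

toy_typed <- toy_raw %>%
  mutate(
    across(ends_with("_id"), as.character),         # IDs become text
    across(c(b_count, s_count, outs), as.integer)   # counts become integers
  )

str(toy_typed)
```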

Summary statistics are explored and checks of the character variables are performed as well.

summary(pitches_clean)
##     ab_id              b_count         end_speed          outs       
##  Length:2852973     Min.   :0.0000   Min.   :32.40   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:77.90   1st Qu.:0.0000  
##  Mode  :character   Median :1.0000   Median :82.50   Median :1.0000  
##                     Mean   :0.8807   Mean   :81.36   Mean   :0.9813  
##                     3rd Qu.:2.0000   3rd Qu.:85.40   3rd Qu.:2.0000  
##                     Max.   :4.0000   Max.   :96.90   Max.   :2.0000  
##    on_1b           on_2b           on_3b          pitch_type       
##  Mode :logical   Mode :logical   Mode :logical   Length:2852973    
##  FALSE:1989943   FALSE:2326343   FALSE:2582291   Class :character  
##  TRUE :863030    TRUE :526630    TRUE :270682    Mode  :character  
##                                                                    
##                                                                    
##                                                                    
##     s_count        start_speed         type          
##  Min.   :0.0000   Min.   : 33.90   Length:2852973    
##  1st Qu.:0.0000   1st Qu.: 84.30   Class :character  
##  Median :1.0000   Median : 89.70   Mode  :character  
##  Mean   :0.8832   Mean   : 88.38                     
##  3rd Qu.:2.0000   3rd Qu.: 93.00                     
##  Max.   :2.0000   Max.   :105.00
unique(pitches_clean$type)
unique(pitches_clean$pitch_type)

From the summary, you can see that there are ball counts of 4. This does not make sense because when a pitch is thrown, there cannot already be 4 balls. Since there are only 14 instances of this, these observations were filtered out.

pitches_clean <- filter(pitches_clean, b_count < 4)

Pitch type is a big part of this analysis, so this variable needs to be explored further. Here you can see the proportion of each pitch type thrown from all 2852959 pitches.

pitch_type_table <- table(pitches_clean$pitch_type)
pitch_type_table %>% 
  prop.table() %>% 
  round(3) %>% 
  as_tibble() %>% 
  rename(pitch = Var1, proportion = n) %>% 
  arrange(desc(proportion))
## # A tibble: 18 x 2
##    pitch proportion
##    <chr>      <dbl>
##  1 FF         0.356
##  2 SL         0.158
##  3 FT         0.118
##  4 CH         0.103
##  5 SI         0.085
##  6 CU         0.082
##  7 FC         0.052
##  8 KC         0.023
##  9 FS         0.015
## 10 KN         0.004
## 11 IN         0.002
## 12 AB         0    
## 13 EP         0    
## 14 FA         0    
## 15 FO         0    
## 16 PO         0    
## 17 SC         0    
## 18 UN         0

The Kaggle page linked above can be viewed to see which pitch each abbreviation corresponds to. Here pitch types will be broken down into 4 buckets:

breakingballs <- c("CU","KC","SC","SL")
changeups <- c("CH","KN","EP")
fastballs <- c("FC","FF","FS","FT","SI")
other_pitches <- c("FO","PO","IN","UN","AB","FA")

pitches_clean$pitch_type_b <- pitches_clean$pitch_type
pitches_clean$pitch_type_b[pitches_clean$pitch_type %in% fastballs] <- "FB"
pitches_clean$pitch_type_b[pitches_clean$pitch_type %in% breakingballs] <- "BB"
# Note: the changeup bucket is labeled "CU", which is also the raw code for a curveball
pitches_clean$pitch_type_b[pitches_clean$pitch_type %in% changeups] <- "CU"
pitches_clean$pitch_type_b[pitches_clean$pitch_type %in% other_pitches] <- "OT"
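The bucketing can also be written as a single case_when() call. The sketch below is one possible alternative, not the approach used above, and runs on an invented five-row tibble. One difference to be aware of: here any code not in the first three vectors falls into "OT", whereas the assignments above leave an unlisted code unchanged.

```r
library(dplyr)

breakingballs <- c("CU", "KC", "SC", "SL")
changeups     <- c("CH", "KN", "EP")
fastballs     <- c("FC", "FF", "FS", "FT", "SI")

# Toy set of raw pitch codes; not the real data
toy <- tibble(pitch_type = c("FF", "SL", "CH", "CU", "PO"))

toy <- toy %>%
  mutate(pitch_type_b = case_when(
    pitch_type %in% fastballs     ~ "FB",
    pitch_type %in% breakingballs ~ "BB",
    pitch_type %in% changeups     ~ "CU",  # bucket label for changeups, as in the text
    TRUE                          ~ "OT"   # every remaining code
  ))

toy
```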

Here’s a view of the proportions for each of these new pitch types:

pitch_type_b_table <- table(pitches_clean$pitch_type_b)
pitch_type_b_table %>% 
  prop.table() %>% 
  round(3) %>% 
  as_tibble() %>% 
  rename(pitch = Var1, proportion = n) %>% 
  arrange(desc(proportion)) %>% 
  ggplot(aes(x = pitch, y = proportion, fill = pitch)) + 
  geom_bar(stat = "identity") +
  geom_text(aes(label = proportion), vjust = -0.25) +
  ggtitle("Pitch Type Proportions")

Player Names

This data set is simply an ID, first name, and last name. The id variable was renamed to player_id for clarity later when joining data. It was also changed to a character value, as all other IDs have been.

str(player_names)
sum(is.na(player_names))
player_names_clean <- player_names
names(player_names_clean)[1] <- "player_id"
player_names_clean$player_id <- as.character(player_names_clean$player_id)
str(player_names_clean)

Exploratory Data Analysis

First, the relationship between pitch velocity and pitch type will be explored.

A data frame speed_means is created to get mean velocity for each pitch type as well as its standard deviation. Then, the distribution and frequencies for each pitch type can be viewed.

speed_means <- pitches_clean %>% 
                subset(pitch_type_b %in% c("FB","BB","CU")) %>% 
                group_by(pitch_type_b) %>% 
                summarise(Mean = mean(start_speed),
                          SD = sd(start_speed))

ggplot(subset(pitches_clean, pitch_type_b %in% c("FB","BB","CU")), aes(x = start_speed, fill = pitch_type_b)) +
  geom_histogram(binwidth = 1, color = "grey30") +
  geom_vline(data = speed_means, aes(xintercept = Mean), linetype = "dashed") +
  facet_grid(~ pitch_type_b) +
  xlim(60,105) +
  ylab("Frequency") +
  xlab("Pitch Speed (mph)") +
  ggtitle("Pitch Velocity by Pitch Type") +
  scale_fill_discrete(name = "Pitch Type", labels = c("Breaking Balls", "Changeups", "Fastballs")) +
  theme(panel.grid.minor = element_blank(),
        axis.ticks = element_blank())

You can see that fastballs are thrown much more often than breaking balls or changeups. Additionally, fastballs have a tighter distribution. This makes sense as fastballs are more of a standard pitch to be thrown as fast as possible, whereas breaking balls and changeups are used to throw off the batter by having more motion on the ball or confusing their timing.

Now, when can we expect to see each of these pitches throughout an at bat?

A table can be created to see how many pitches are thrown at each combination of balls and strikes throughout the at bat.

count_table <- table(pitches_clean$b_count, pitches_clean$s_count)
(count_prop <- prop.table(count_table))
##    
##              0          1          2
##   0 0.25822804 0.12827910 0.06466129
##   1 0.10140735 0.10241717 0.09478300
##   2 0.03487327 0.05296851 0.08074354
##   3 0.01108709 0.02198630 0.04856537

There are 12 possible ball-strike counts, from a 0-0 count to a 3-2 count. A new variable bs_count is created to hold the count before each pitch.

pitches_clean$bs_count <- paste(pitches_clean$b_count, pitches_clean$s_count, sep = "-")
head(pitches_clean[,c("ab_id","bs_count")],7)
## # A tibble: 7 x 2
##   ab_id      bs_count
##   <chr>      <chr>   
## 1 2015000001 0-0     
## 2 2015000001 0-1     
## 3 2015000001 0-2     
## 4 2015000001 0-2     
## 5 2015000001 1-2     
## 6 2015000001 2-2     
## 7 2015000002 0-0

Here you can see the result of creating this variable and how it flows from one pitch to the next through an at bat. This flows correctly as the first 2 pitches were strikes, followed by a foul ball which kept that count at 0-2, then two balls were thrown before the final pitch of the at bat when there was a 2-2 count.

This next section of code creates a table showing the pitch types for each ball-strike count.

type_count_table <- table(pitches_clean$pitch_type_b, pitches_clean$bs_count)
(type_count_prop <- prop.table(type_count_table))
##     
##               0-0          0-1          0-2          1-0          1-1
##   BB 6.546992e-02 3.817896e-02 2.377847e-02 2.036447e-02 2.849287e-02
##   CU 1.976544e-02 1.573279e-02 6.062828e-03 1.335911e-02 1.453543e-02
##   FB 1.723880e-01 7.427271e-02 3.475690e-02 6.715554e-02 5.931596e-02
##   OT 6.046354e-04 9.463858e-05 6.309239e-05 5.282235e-04 7.290676e-05
##     
##               1-2          2-0          2-1          2-2          3-0
##   BB 3.474358e-02 3.860904e-03 1.049437e-02 2.555207e-02 2.274831e-04
##   CU 1.066016e-02 3.158826e-03 6.975565e-03 1.006779e-02 2.001431e-04
##   FB 4.931196e-02 2.729482e-02 3.546774e-02 4.507706e-02 1.007270e-02
##   OT 6.729855e-05 5.587182e-04 3.084517e-05 4.661827e-05 5.867592e-04
##     
##               3-1          3-2
##   BB 2.075389e-03 1.019573e-02
##   CU 1.453929e-03 4.886155e-03
##   FB 1.842543e-02 3.346631e-02
##   OT 3.154619e-05 1.717515e-05

The table shows proportions as it relates to all 2852959 pitches. We want it to be the proportion within each ball-strike count.

Here a data frame is created from the table and each value is divided by its column sum to get the within-count proportion. The data frame is then reshaped from wide to long format to be used for visualizing in the next step.

typeVcount <- type_count_table %>% 
  tidy() %>% 
  select(pitch_type_b = Var1, bs_count = Var2, count = n) %>% 
  spread(bs_count, count)

typeVcount_colSums <- colSums(typeVcount[,-1])

for (i in 1:(ncol(typeVcount) - 1)) {
  typeVcount[,i + 1] <- typeVcount[,i + 1]/typeVcount_colSums[i]
}

typeVcount_long <- typeVcount %>% 
  gather(bs_count, prop, -pitch_type_b)
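Base R can produce the same within-count proportions straight from a two-way table by passing margin = 2 to prop.table(). This is a sketch on invented counts, shown as an alternative to the loop above rather than a replacement for it; `toy_counts` is a placeholder.

```r
# Toy counts: rows are pitch buckets, columns are ball-strike counts (invented numbers)
toy_counts <- as.table(matrix(
  c(30, 10, 60,
    20, 10, 70),
  nrow = 3,
  dimnames = list(pitch_type_b = c("BB", "CU", "FB"),
                  bs_count     = c("0-0", "1-0"))
))

# margin = 2 divides each cell by its column total
within_count <- prop.table(toy_counts, margin = 2)
within_count

# Each column now sums to 1
colSums(within_count)

# Long format with one row per bucket and count, ready for plotting
within_count_long <- as.data.frame(within_count, responseName = "prop")
within_count_long
```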

Now that we have the data cleaned up and formatted correctly, we will visualize pitch type use throughout the count.

ggplot(subset(typeVcount_long, pitch_type_b %in% c("BB","CU","FB")), 
       aes(x = bs_count, y = prop, fill = pitch_type_b)) + 
  geom_bar(stat = "identity", position = position_dodge()) +
  ylab("Proportion") +
  xlab("Ball-Strike Count") +
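  # Counts are ordered by the number of pitches thrown so far; "3-2" is not listed in limits, so it is left off the axis
  scale_x_discrete(limits = c("0-0","0-1","1-0","0-2","1-1","2-0","1-2","2-1","3-0","2-2","3-1")) +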
  scale_fill_discrete(name = "Pitch Type", labels = c("Breaking Balls", "Changeups", "Fastballs")) +
  ggtitle("Pitch Type Used through an At Bat")

As you step through each count, whether it’s a 1-pitch count (i.e., 0-1 or 1-0) or a 4-pitch count (i.e., 2-2 or 3-1), fastballs are used more often when the pitcher is at a disadvantage (generally meaning there are more balls than strikes in the count). For example, moving from a 1-2 count through 2-1 to a 3-0 count, there is a sharp drop-off in the use of breaking balls and changeups.
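The trend across the three-pitch counts can be checked by hand from the overall proportions printed earlier. The sketch below copies the 1-2, 2-1, and 3-0 columns of the printed type_count_prop table and rescales each column to sum to 1; the object names are placeholders.

```r
# Proportions copied from the printed type_count_prop table for three counts
printed <- matrix(
  c(3.474358e-02, 1.066016e-02, 4.931196e-02, 6.729855e-05,   # 1-2
    1.049437e-02, 6.975565e-03, 3.546774e-02, 3.084517e-05,   # 2-1
    2.274831e-04, 2.001431e-04, 1.007270e-02, 5.867592e-04),  # 3-0
  nrow = 4,
  dimnames = list(c("BB", "CU", "FB", "OT"), c("1-2", "2-1", "3-0"))
)

# Rescale each column so it sums to 1, then read off the fastball row
within_count <- prop.table(printed, margin = 2)
fb_share <- within_count["FB", ]
round(fb_share, 3)
```

The fastball share comes out at roughly 52% on 1-2, 67% on 2-1, and 91% on 3-0, which matches the pattern described above.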

With the pitch-type analysis done, here are some fun stats found within the data!

Fastest Pitch

Keeping with pitch analysis, here are the 3 instances of the top recorded pitch speed of 105 mph.

pitches_clean %>% 
  filter(start_speed == max(start_speed)) %>% 
  inner_join(atbats_clean, by = "ab_id") %>% 
  select(start_speed, pitch_type, batter_id, pitcher_id) %>% 
  inner_join(player_names_clean, by = c("pitcher_id" = "player_id")) %>% 
  inner_join(player_names_clean, by = c("batter_id" = "player_id"), suffix = c("_pitcher", "_batter")) %>% 
  select(MPH = start_speed, PitchType = pitch_type,
         PitcherFirst = first_name_pitcher, PitcherLast = last_name_pitcher,
         BatterFirst = first_name_batter, BatterLast = last_name_batter)
## # A tibble: 3 x 6
##     MPH PitchType PitcherFirst PitcherLast BatterFirst BatterLast
##   <dbl> <chr>     <chr>        <chr>       <chr>       <chr>     
## 1   105 FF        Aroldis      Chapman     J.J.        Hardy     
## 2   105 SI        Jordan       Hicks       Odubel      Herrera   
## 3   105 SI        Jordan       Hicks       Odubel      Herrera

All of these pitches came in the 9th inning, when closers came out to finish the game. Closers don’t throw as many pitches as starters and other relief pitchers, and the pitches they do throw are expected to be elite and thrown at very high speeds, as seen here. Amazingly, two of these pitches were thrown by the same pitcher in the same at bat!

Here’s an article about this at bat: “Hicks hits 105 mph – twice – on radar gun.”

The slowest pitch of the at bat was 103.7 mph…

Game with the Highest Attendance

games_clean %>% 
  filter(attendance == max(attendance)) %>% 
  select(Date = date, Attendance = attendance, Away = away_team, Home = home_team)
## # A tibble: 1 x 4
##   Date       Attendance Away  Home 
##   <date>          <int> <chr> <chr>
## 1 2018-07-21      56310 sfn   oak

Longest Game (most innings and elapsed time)

atbats_clean %>% 
  filter(inning == max(inning)) %>% 
  select(g_id) %>% 
  unique() %>% 
  inner_join(games_clean, by = "g_id") %>% 
  select(Date = date, elapsed_time, Away = away_team, Home = home_team, AwayScore = away_final_score, HomeScore = home_final_score) %>% 
  arrange(desc(elapsed_time))
## # A tibble: 3 x 6
##   Date       elapsed_time Away  Home  AwayScore HomeScore
##   <date>            <dbl> <chr> <chr>     <int>     <int>
## 1 2015-04-10          409 bos   nya           6         5
## 2 2016-07-01          373 cle   tor           2         1
## 3 2017-09-05          360 tor   bos           2         3

Most Runs Scored in a Game

atbats_clean %>% 
  filter(p_score == max(p_score)) %>% 
  select(g_id) %>% 
  unique() %>% 
  inner_join(games_clean, by = "g_id") %>% 
  select(Date = date, Away = away_team, Home = home_team, AwayScore = away_final_score, HomeScore = home_final_score)
## # A tibble: 1 x 5
##   Date       Away  Home  AwayScore HomeScore
##   <date>     <chr> <chr>     <int>     <int>
## 1 2018-07-31 nyn   was           4        25

Summary

When should you expect to see a fastball versus an off-speed pitch? At what velocity is a pitch expected to be thrown?

We now know the answers to these questions.

The major insights found include the following:

This information can be useful for all baseball players whether preparing for an at bat or analyzing pitcher ability.