The Data

This dataset contains data on each pitch thrown during the Major League Baseball 2015-2018 regular seasons. 4 tables were downloaded from a Kaggle page, which scraped its data from here. The tables available and their definitions are below:

atbats - Contains the information on what batter and pitcher was involved in the at bat scenario and what was the outcome of the event
games - Contains the information on all of the games that have taken place during the seasons like attendance and final score
pitches - Contains the information on each of the pitches that have taken place and includes type of pitch as well as speed
player_names - Contains the information on all of the players first and last names

The goal of this study is to try and understand which pitchers are performing the best when the stakes are the highest. Being “clutch” is a coveted quality of any athlete and pitchers are no different. They are often tasked with dealing with pivotal situations like performance in the last innings of the games or when the bases end up being loaded or performance against good batters. We will review situations such as these to glean what pitchers stand unphased on the mound and have the perverbial “ice” in the veins.

The Process

Cleaning

We will begin by loading all of the required packages for the project.

library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)

The data.table package will be used for bringing in the data into the R environment, the dplyr package’s main function will be for combining the data, the tidyr package will be used to unite certain variables, and the ggplot2 will be used to display graphs.

Next step is to load each of the files into data.frames

atbats <- fread("atbats.csv")
games <- fread("games.csv")
pitches <- fread("pitches.csv")
player_names <- fread("player_names.csv")

Here are the columns that are available for use and as you can see there are quite a lot of them. We will be mainly focusing on the outcomes of the pitcher and batter as opposed to the types of pitches the pitcher throws.

names(atbats)

##  [1] "ab_id"      "batter_id"  "event"      "g_id"       "inning"    
##  [6] "o"          "p_score"    "p_throws"   "pitcher_id" "stand"     
## [11] "top"

names(games)

##  [1] "attendance"       "away_final_score" "away_team"       
##  [4] "date"             "elapsed_time"     "g_id"            
##  [7] "home_final_score" "home_team"        "start_time"      
## [10] "umpire_1B"        "umpire_2B"        "umpire_3B"       
## [13] "umpire_HP"        "venue_name"       "weather"         
## [16] "wind"             "delay"

names(pitches)

##  [1] "ab_id"           "ax"              "ay"             
##  [4] "az"              "b_count"         "b_score"        
##  [7] "break_angle"     "break_length"    "break_y"        
## [10] "code"            "end_speed"       "nasty"          
## [13] "on_1b"           "on_2b"           "on_3b"          
## [16] "outs"            "pfx_x"           "pfx_z"          
## [19] "pitch_num"       "pitch_type"      "px"             
## [22] "pz"              "s_count"         "spin_dir"       
## [25] "spin_rate"       "start_speed"     "sz_bot"         
## [28] "sz_top"          "type"            "type_confidence"
## [31] "vx0"             "vy0"             "vz0"            
## [34] "x"               "x0"              "y"              
## [37] "y0"              "z0"              "zone"

names(player_names)

## [1] "id"         "first_name" "last_name"

We will be joining all of the datasets and one of the future issues we will come across is the fact that we have to join to the player_names table twice. Once, to match pitcher names and another to match against batter names. The code below addresses that issue by creating a copy, one for batters and the other for pitchers. The other issue is that the ID columns are not character values, so we will be changing that as well.

batters <- player_names
pitchers <- player_names
names(batters) <- c("batter_id", "bat_first_name", "bat_last_name")
names(pitchers) <- c("pitcher_id", "pit_first_name", "pit_last_name")

batters$batter_id <- as.character(batters$batter_id)
pitchers$pitcher_id <- as.character(pitchers$pitcher_id)
pitches$ab_id <- as.character(pitches$ab_id)

atbats_clean <- atbats %>%
  mutate(ab_id = as.character(ab_id),
         batter_id = as.character(batter_id),
         pitcher_id = as.character(pitcher_id))

As you will see by the code below the only dataset from the group that has any missing values is the pitches dataset. The max amount of missing values is 14,189. This is less than 1% of the entire pitches dataset so I have elected to remove them for simplicity’s sake.

colSums(is.na(atbats))

##      ab_id  batter_id      event       g_id     inning          o 
##          0          0          0          0          0          0 
##    p_score   p_throws pitcher_id      stand        top 
##          0          0          0          0          0

colSums(is.na(games))

##       attendance away_final_score        away_team             date 
##                0                0                0                0 
##     elapsed_time             g_id home_final_score        home_team 
##                0                0                0                0 
##       start_time        umpire_1B        umpire_2B        umpire_3B 
##                0                0                0                0 
##        umpire_HP       venue_name          weather             wind 
##                0                0                0                0 
##            delay 
##                0

colSums(is.na(pitches))

##           ab_id              ax              ay              az 
##               0           14189           14189           14189 
##         b_count         b_score     break_angle    break_length 
##               0               0           14189           14189 
##         break_y            code       end_speed           nasty 
##           14189               0           14114           14189 
##           on_1b           on_2b           on_3b            outs 
##               0               0               0               0 
##           pfx_x           pfx_z       pitch_num      pitch_type 
##           14142           14142               0               0 
##              px              pz         s_count        spin_dir 
##           14189           14189               0           14189 
##       spin_rate     start_speed          sz_bot          sz_top 
##           14189           14114            2083            2083 
##            type type_confidence             vx0             vy0 
##               0           14189           14189           14189 
##             vz0               x              x0               y 
##           14189               0           14189               0 
##              y0              z0            zone 
##           14189           14189           14189

colSums(is.na(player_names))

##         id first_name  last_name 
##          0          0          0

pitches_clean <- na.omit(pitches)

Transforming

The following code brings all of the tables together into a main data set. It also creates a count for how many batters are on base at the time of the pitch, a variable we will use later on in the analysis.

main_data <- games %>%
  left_join(atbats_clean, by = "g_id") %>%
  left_join(batters, by = "batter_id") %>%
  left_join(pitchers, by = "pitcher_id") %>%
  left_join(pitches_clean, by = "ab_id") %>%
  mutate(amount_on_b = on_1b + on_2b + on_3b) %>%
  unite(pit_full_name, pit_last_name, pit_first_name, sep = ", ") %>%
  unite(bat_full_name, bat_last_name, bat_first_name, sep = ", ") 

main_data %>%
  select(g_id, ab_id, event, inning, bat_full_name, pit_full_name, amount_on_b) %>%
  summarize(Number_of_games = n_distinct(g_id),
            Number_of_at_bats = n_distinct(ab_id),
            Different_events = n_distinct(event),
            Number_of_innings = n_distinct(inning),
            Total_batters = n_distinct(bat_full_name),
            Total_pitchers = n_distinct(pit_full_name),
            Avg_on_base = mean(amount_on_b, na.rm = TRUE))

##   Number_of_games Number_of_at_bats Different_events Number_of_innings
## 1            9718            740389               30                19
##   Total_batters Total_pitchers Avg_on_base
## 1          1684           1330   0.5819691

ggplot(main_data, aes(x = amount_on_b)) +
  geom_histogram() + 
  scale_y_continuous(name = "# of Pitches", labels = scales::comma) +
  scale_x_continuous(name = "Players on base") + 
  ggtitle("Number of pitches with players on base")

Exploration

There are various types of pitchers within the MLB and they are used in a variety of ways. We will be categorizing them in this project by the amount of pitches they have thrown. In order to determine a cutoff point we need to explore the data. We start by getting a count of how many pitches each pitcher threw across the 3 years during all innings.

pitch_count <- main_data %>%
  group_by(pit_full_name) %>%
 tally()

summary(pitch_count$n)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0   267.2  1015.0  2147.2  3011.8 13488.0

ggplot(pitch_count, aes(x = n)) +
  geom_density() + 
  scale_y_continuous(name = "Density") +
  scale_x_continuous(name = "Pitches Thrown") + 
  ggtitle("Density Plot of Amount of Pitches Thrown per Pitcher")

As you can see by the density plot above, we see a good cutoff for the number of pitches thrown for a pretty consistent pitcher is around 5000 (another option could be 3000, 3rd Quantile).

We will do the same thing as before but this time, filter for pitches thrown in later innings.

pitch_count_late <- main_data %>%
  filter(inning > 6) %>%
  group_by(pit_full_name) %>%
  tally()

summary(pitch_count_late$n)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    97.0   351.0   724.5   915.5  4778.0

ggplot(pitch_count_late, aes(x = n)) +
  geom_histogram() + 
  scale_y_continuous(name = "# of Pitchers") +
  scale_x_continuous(name = "Pitches Thrown", labels = scales::comma) + 
  ggtitle("Histogram of Amount of Pitches Thrown per Pitcher", 
          subtitle = "Pitches thrown within the 7th, 8th, and 9th Inning")

From the histogram above, we see a good cutoff for the number of pitches thrown for a pretty consistent pitcher in later innings is 1000. I will actually take the lower mean cutoff of 725 because restricting it to a higher value may constrict the data to only including pitchers who are closers. I want it to be open to starting pitchers who last longer as well.

Last but not least, the code below returns the possibe outcomes for an at-bat situation. We will be focused on outcomes that cause a favorable result for the pitcher.

distinct(main_data, event)

##                   event
## 1             Groundout
## 2                Double
## 3                Single
## 4             Strikeout
## 5                  Walk
## 6            Runner Out
## 7                Flyout
## 8              Forceout
## 9               Pop Out
## 10          Intent Walk
## 11              Lineout
## 12             Home Run
## 13               Triple
## 14         Hit By Pitch
## 15     Grounded Into DP
## 16             Sac Bunt
## 17      Fielders Choice
## 18       Bunt Groundout
## 19          Field Error
## 20          Double Play
## 21              Sac Fly
## 22  Fielders Choice Out
## 23         Bunt Pop Out
## 24 Catcher Interference
## 25       Strikeout - DP
## 26  Batter Interference
## 27           Sac Fly DP
## 28         Bunt Lineout
## 29    Sacrifice Bunt DP
## 30          Triple Play

Analysis

Now that we have the main dataset and our cutoff points, we can start to dig for the information we are after. We will start by finding out what pitchers, that threw over our cutoff of 5000 pitches, allow the fewest batters on base across the board.

least_on_base <- main_data %>%
  group_by(pit_full_name) %>%
  summarize(avg_on_base = mean(amount_on_b, na.rm = TRUE) ) %>%
  arrange(avg_on_base)
  
least_on_base_new <- least_on_base %>%
  left_join(pitch_count, by = "pit_full_name") %>%
  filter(n > 5000) %>%
  arrange(avg_on_base)

ggplot(least_on_base_new, aes(x = avg_on_base)) +
  geom_histogram() +
  scale_x_continuous(name = "Average on base", labels = scales::percent) +
  scale_y_continuous(name = "# of Pitchers") + 
  ggtitle("Average amount of players on base for pitchers")

The graph indicates that pitchers from our criteria set typically have a 50%-60% chance of having a player on base at the time of the pitch.

Now we’ll find what pitchers, who threw for 725 or more pitches, managed to get a favorable result for the at-bat during the last few innings.

favorable_outcomes <- c("Strikeout", "Groundout", "Flyout", 
                        "Lineout", "Bunt Lineout", "Bunt Pop Out",
                        "Bunt Groundout", "Runner Out")

pit_late_good_results <- main_data %>%
  left_join(pitch_count_late,by = "pit_full_name") %>%
  filter(n >= 725) %>%
  select(pit_full_name, inning, event, ab_id) %>%
  group_by(pit_full_name) %>%
  filter(inning > 6, event %in% favorable_outcomes) %>%
  summarize(at_bat_cnt = n_distinct(ab_id)) %>%
  arrange(desc(at_bat_cnt))

Now let’s find what pitchers managed to get a favorable result for the at-bat while there are any batters on bases.

pit_onb_good_results <- main_data %>%
  group_by(pit_full_name) %>%
  filter( amount_on_b > 0, event %in% favorable_outcomes) %>%
  summarize(distinct_at_b = n_distinct(ab_id)) %>%
    arrange(desc(distinct_at_b)) %>%
  top_n(100)

Summary

We have all of our criteria and now we want to see what pitchers are a part of all the groups. This will tell us what pitchers for the past 3 MLB seasons have allowed the fewest batters on base, performed the best in the last innings of the game, and gotten favorable results when players end up on base. All of these items point to well performing pitchers that have been able to get results when it matters the most.

pit_common <- semi_join(pit_onb_good_results, pit_late_good_results, least_on_base_new, by = "pit_full_name")

pit_common$pit_full_name

##  [1] "Sale, Chris"        "Archer, Chris"      "Scherzer, Max"     
##  [4] "Porcello, Rick"     "Cole, Gerrit"       "Lester, Jon"       
##  [7] "Quintana, Jose"     "Hamels, Cole"       "Martinez, Carlos"  
## [10] "Kluber, Corey"      "Verlander, Justin"  "Bauer, Trevor"     
## [13] "Leake, Mike"        "Teheran, Julio"     "Arrieta, Jake"     
## [16] "Price, David"       "Gausman, Kevin"     "Greinke, Zack"     
## [19] "Roark, Tanner"      "deGrom, Jacob"      "Bumgarner, Madison"
## [22] "Estrada, Marco"     "Hammel, Jason"      "Colon, Bartolo"    
## [25] "Corbin, Patrick"    "Keuchel, Dallas"    "Carrasco, Carlos"  
## [28] "Samardzija, Jeff"   "Cueto, Johnny"      "Duffy, Danny"      
## [31] "Nola, Aaron"        "Tanaka, Masahiro"   "Kershaw, Clayton"  
## [34] "Gray, Sonny"        "Strasburg, Stephen" "Hernandez, Felix"  
## [37] "McHugh, Collin"     "Santiago, Hector"   "Moore, Matt"       
## [40] "Sanchez, Anibal"    "Pomeranz, Drew"     "Dickey, R.A."      
## [43] "Garcia, Jaime"      "Stroman, Marcus"    "Severino, Luis"    
## [46] "Santana, Ervin"     "Bundy, Dylan"       "Richard, Clayton"  
## [49] "Sanchez, Aaron"     "Montgomery, Mike"

By no means is this the only way to find out how pitchers are performing in intense situations. If someone were to continue this analysis they could do a comparison of the performance of a pitcher vs accomplished batters, efficiency on the mound (do they have favorable results in less pitches), or zero in on performances between the top performing teams. My process was just small angle of what possible questions can be answered out there by the world of data analysis. Thank you for your time and have a nice day!

MLB Pitching Data Analysis

Charles Brendan Cooper

2019-04-27