This dataset contains data on each pitch thrown during the Major League Baseball 2015-2018 regular seasons. 4 tables were downloaded from a Kaggle page, which scraped its data from here. The tables available and their definitions are below:
atbats - Contains the information on what batter and pitcher was involved in the at bat scenario and what was the outcome of the event
games - Contains the information on all of the games that have taken place during the seasons like attendance and final score
pitches - Contains the information on each of the pitches that have taken place and includes type of pitch as well as speed
player_names - Contains the information on all of the players first and last names
The goal of this study is to try and understand which pitchers are performing the best when the stakes are the highest. Being “clutch” is a coveted quality of any athlete and pitchers are no different. They are often tasked with dealing with pivotal situations like performance in the last innings of the games or when the bases end up being loaded or performance against good batters. We will review situations such as these to glean what pitchers stand unphased on the mound and have the perverbial “ice” in the veins.
We will begin by loading all of the required packages for the project.
library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)
The data.table
package will be used for bringing in the data into the R environment, the dplyr
package’s main function will be for combining the data, the tidyr
package will be used to unite certain variables, and the ggplot2
will be used to display graphs.
Next step is to load each of the files into data.frames
atbats <- fread("atbats.csv")
games <- fread("games.csv")
pitches <- fread("pitches.csv")
player_names <- fread("player_names.csv")
Here are the columns that are available for use and as you can see there are quite a lot of them. We will be mainly focusing on the outcomes of the pitcher and batter as opposed to the types of pitches the pitcher throws.
names(atbats)
## [1] "ab_id" "batter_id" "event" "g_id" "inning"
## [6] "o" "p_score" "p_throws" "pitcher_id" "stand"
## [11] "top"
names(games)
## [1] "attendance" "away_final_score" "away_team"
## [4] "date" "elapsed_time" "g_id"
## [7] "home_final_score" "home_team" "start_time"
## [10] "umpire_1B" "umpire_2B" "umpire_3B"
## [13] "umpire_HP" "venue_name" "weather"
## [16] "wind" "delay"
names(pitches)
## [1] "ab_id" "ax" "ay"
## [4] "az" "b_count" "b_score"
## [7] "break_angle" "break_length" "break_y"
## [10] "code" "end_speed" "nasty"
## [13] "on_1b" "on_2b" "on_3b"
## [16] "outs" "pfx_x" "pfx_z"
## [19] "pitch_num" "pitch_type" "px"
## [22] "pz" "s_count" "spin_dir"
## [25] "spin_rate" "start_speed" "sz_bot"
## [28] "sz_top" "type" "type_confidence"
## [31] "vx0" "vy0" "vz0"
## [34] "x" "x0" "y"
## [37] "y0" "z0" "zone"
names(player_names)
## [1] "id" "first_name" "last_name"
We will be joining all of the datasets and one of the future issues we will come across is the fact that we have to join to the player_names
table twice. Once, to match pitcher names and another to match against batter names. The code below addresses that issue by creating a copy, one for batters and the other for pitchers. The other issue is that the ID columns are not character values, so we will be changing that as well.
batters <- player_names
pitchers <- player_names
names(batters) <- c("batter_id", "bat_first_name", "bat_last_name")
names(pitchers) <- c("pitcher_id", "pit_first_name", "pit_last_name")
batters$batter_id <- as.character(batters$batter_id)
pitchers$pitcher_id <- as.character(pitchers$pitcher_id)
pitches$ab_id <- as.character(pitches$ab_id)
atbats_clean <- atbats %>%
mutate(ab_id = as.character(ab_id),
batter_id = as.character(batter_id),
pitcher_id = as.character(pitcher_id))
As you will see by the code below the only dataset from the group that has any missing values is the pitches
dataset. The max amount of missing values is 14,189. This is less than 1% of the entire pitches
dataset so I have elected to remove them for simplicity’s sake.
colSums(is.na(atbats))
## ab_id batter_id event g_id inning o
## 0 0 0 0 0 0
## p_score p_throws pitcher_id stand top
## 0 0 0 0 0
colSums(is.na(games))
## attendance away_final_score away_team date
## 0 0 0 0
## elapsed_time g_id home_final_score home_team
## 0 0 0 0
## start_time umpire_1B umpire_2B umpire_3B
## 0 0 0 0
## umpire_HP venue_name weather wind
## 0 0 0 0
## delay
## 0
colSums(is.na(pitches))
## ab_id ax ay az
## 0 14189 14189 14189
## b_count b_score break_angle break_length
## 0 0 14189 14189
## break_y code end_speed nasty
## 14189 0 14114 14189
## on_1b on_2b on_3b outs
## 0 0 0 0
## pfx_x pfx_z pitch_num pitch_type
## 14142 14142 0 0
## px pz s_count spin_dir
## 14189 14189 0 14189
## spin_rate start_speed sz_bot sz_top
## 14189 14114 2083 2083
## type type_confidence vx0 vy0
## 0 14189 14189 14189
## vz0 x x0 y
## 14189 0 14189 0
## y0 z0 zone
## 14189 14189 14189
colSums(is.na(player_names))
## id first_name last_name
## 0 0 0
pitches_clean <- na.omit(pitches)
The following code brings all of the tables together into a main data set. It also creates a count for how many batters are on base at the time of the pitch, a variable we will use later on in the analysis.
main_data <- games %>%
left_join(atbats_clean, by = "g_id") %>%
left_join(batters, by = "batter_id") %>%
left_join(pitchers, by = "pitcher_id") %>%
left_join(pitches_clean, by = "ab_id") %>%
mutate(amount_on_b = on_1b + on_2b + on_3b) %>%
unite(pit_full_name, pit_last_name, pit_first_name, sep = ", ") %>%
unite(bat_full_name, bat_last_name, bat_first_name, sep = ", ")
main_data %>%
select(g_id, ab_id, event, inning, bat_full_name, pit_full_name, amount_on_b) %>%
summarize(Number_of_games = n_distinct(g_id),
Number_of_at_bats = n_distinct(ab_id),
Different_events = n_distinct(event),
Number_of_innings = n_distinct(inning),
Total_batters = n_distinct(bat_full_name),
Total_pitchers = n_distinct(pit_full_name),
Avg_on_base = mean(amount_on_b, na.rm = TRUE))
## Number_of_games Number_of_at_bats Different_events Number_of_innings
## 1 9718 740389 30 19
## Total_batters Total_pitchers Avg_on_base
## 1 1684 1330 0.5819691
ggplot(main_data, aes(x = amount_on_b)) +
geom_histogram() +
scale_y_continuous(name = "# of Pitches", labels = scales::comma) +
scale_x_continuous(name = "Players on base") +
ggtitle("Number of pitches with players on base")
There are various types of pitchers within the MLB and they are used in a variety of ways. We will be categorizing them in this project by the amount of pitches they have thrown. In order to determine a cutoff point we need to explore the data. We start by getting a count of how many pitches each pitcher threw across the 3 years during all innings.
pitch_count <- main_data %>%
group_by(pit_full_name) %>%
tally()
summary(pitch_count$n)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 267.2 1015.0 2147.2 3011.8 13488.0
ggplot(pitch_count, aes(x = n)) +
geom_density() +
scale_y_continuous(name = "Density") +
scale_x_continuous(name = "Pitches Thrown") +
ggtitle("Density Plot of Amount of Pitches Thrown per Pitcher")
As you can see by the density plot above, we see a good cutoff for the number of pitches thrown for a pretty consistent pitcher is around 5000 (another option could be 3000, 3rd Quantile).
We will do the same thing as before but this time, filter for pitches thrown in later innings.
pitch_count_late <- main_data %>%
filter(inning > 6) %>%
group_by(pit_full_name) %>%
tally()
summary(pitch_count_late$n)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 97.0 351.0 724.5 915.5 4778.0
ggplot(pitch_count_late, aes(x = n)) +
geom_histogram() +
scale_y_continuous(name = "# of Pitchers") +
scale_x_continuous(name = "Pitches Thrown", labels = scales::comma) +
ggtitle("Histogram of Amount of Pitches Thrown per Pitcher",
subtitle = "Pitches thrown within the 7th, 8th, and 9th Inning")
From the histogram above, we see a good cutoff for the number of pitches thrown for a pretty consistent pitcher in later innings is 1000. I will actually take the lower mean cutoff of 725 because restricting it to a higher value may constrict the data to only including pitchers who are closers. I want it to be open to starting pitchers who last longer as well.
Last but not least, the code below returns the possibe outcomes for an at-bat situation. We will be focused on outcomes that cause a favorable result for the pitcher.
distinct(main_data, event)
## event
## 1 Groundout
## 2 Double
## 3 Single
## 4 Strikeout
## 5 Walk
## 6 Runner Out
## 7 Flyout
## 8 Forceout
## 9 Pop Out
## 10 Intent Walk
## 11 Lineout
## 12 Home Run
## 13 Triple
## 14 Hit By Pitch
## 15 Grounded Into DP
## 16 Sac Bunt
## 17 Fielders Choice
## 18 Bunt Groundout
## 19 Field Error
## 20 Double Play
## 21 Sac Fly
## 22 Fielders Choice Out
## 23 Bunt Pop Out
## 24 Catcher Interference
## 25 Strikeout - DP
## 26 Batter Interference
## 27 Sac Fly DP
## 28 Bunt Lineout
## 29 Sacrifice Bunt DP
## 30 Triple Play
Now that we have the main dataset and our cutoff points, we can start to dig for the information we are after. We will start by finding out what pitchers, that threw over our cutoff of 5000 pitches, allow the fewest batters on base across the board.
least_on_base <- main_data %>%
group_by(pit_full_name) %>%
summarize(avg_on_base = mean(amount_on_b, na.rm = TRUE) ) %>%
arrange(avg_on_base)
least_on_base_new <- least_on_base %>%
left_join(pitch_count, by = "pit_full_name") %>%
filter(n > 5000) %>%
arrange(avg_on_base)
ggplot(least_on_base_new, aes(x = avg_on_base)) +
geom_histogram() +
scale_x_continuous(name = "Average on base", labels = scales::percent) +
scale_y_continuous(name = "# of Pitchers") +
ggtitle("Average amount of players on base for pitchers")
The graph indicates that pitchers from our criteria set typically have a 50%-60% chance of having a player on base at the time of the pitch.
Now we’ll find what pitchers, who threw for 725 or more pitches, managed to get a favorable result for the at-bat during the last few innings.
favorable_outcomes <- c("Strikeout", "Groundout", "Flyout",
"Lineout", "Bunt Lineout", "Bunt Pop Out",
"Bunt Groundout", "Runner Out")
pit_late_good_results <- main_data %>%
left_join(pitch_count_late,by = "pit_full_name") %>%
filter(n >= 725) %>%
select(pit_full_name, inning, event, ab_id) %>%
group_by(pit_full_name) %>%
filter(inning > 6, event %in% favorable_outcomes) %>%
summarize(at_bat_cnt = n_distinct(ab_id)) %>%
arrange(desc(at_bat_cnt))
Now let’s find what pitchers managed to get a favorable result for the at-bat while there are any batters on bases.
pit_onb_good_results <- main_data %>%
group_by(pit_full_name) %>%
filter( amount_on_b > 0, event %in% favorable_outcomes) %>%
summarize(distinct_at_b = n_distinct(ab_id)) %>%
arrange(desc(distinct_at_b)) %>%
top_n(100)
We have all of our criteria and now we want to see what pitchers are a part of all the groups. This will tell us what pitchers for the past 3 MLB seasons have allowed the fewest batters on base, performed the best in the last innings of the game, and gotten favorable results when players end up on base. All of these items point to well performing pitchers that have been able to get results when it matters the most.
pit_common <- semi_join(pit_onb_good_results, pit_late_good_results, least_on_base_new, by = "pit_full_name")
pit_common$pit_full_name
## [1] "Sale, Chris" "Archer, Chris" "Scherzer, Max"
## [4] "Porcello, Rick" "Cole, Gerrit" "Lester, Jon"
## [7] "Quintana, Jose" "Hamels, Cole" "Martinez, Carlos"
## [10] "Kluber, Corey" "Verlander, Justin" "Bauer, Trevor"
## [13] "Leake, Mike" "Teheran, Julio" "Arrieta, Jake"
## [16] "Price, David" "Gausman, Kevin" "Greinke, Zack"
## [19] "Roark, Tanner" "deGrom, Jacob" "Bumgarner, Madison"
## [22] "Estrada, Marco" "Hammel, Jason" "Colon, Bartolo"
## [25] "Corbin, Patrick" "Keuchel, Dallas" "Carrasco, Carlos"
## [28] "Samardzija, Jeff" "Cueto, Johnny" "Duffy, Danny"
## [31] "Nola, Aaron" "Tanaka, Masahiro" "Kershaw, Clayton"
## [34] "Gray, Sonny" "Strasburg, Stephen" "Hernandez, Felix"
## [37] "McHugh, Collin" "Santiago, Hector" "Moore, Matt"
## [40] "Sanchez, Anibal" "Pomeranz, Drew" "Dickey, R.A."
## [43] "Garcia, Jaime" "Stroman, Marcus" "Severino, Luis"
## [46] "Santana, Ervin" "Bundy, Dylan" "Richard, Clayton"
## [49] "Sanchez, Aaron" "Montgomery, Mike"
By no means is this the only way to find out how pitchers are performing in intense situations. If someone were to continue this analysis they could do a comparison of the performance of a pitcher vs accomplished batters, efficiency on the mound (do they have favorable results in less pitches), or zero in on performances between the top performing teams. My process was just small angle of what possible questions can be answered out there by the world of data analysis. Thank you for your time and have a nice day!