As a longtime baseball fan, I became intrigued by the public dataset
bigquery-public-data.baseball available on Google BigQuery,
and developed a number of questions while performing an initial
exploration of the schema:
Since this data set was too large to simply download and manipulate via Excel / Google Sheets, it was necessary to query subsets of the data and build my own tables. The approach that I settled on involved creating two new tables in BigQuery:
To understand the meaning of certain aspects of the raw data (like
definitions of outcomeIDs), I relied on the Sportsradar
MLB v7 Statistics Feeds reference. Since this document is
dynamically updated (last version as I write this being November of
2023), there is no publicly available, precise reference for values
recorded in 2016 (at least, none that I could find). So it’s possible
error could be introduced that way (for example, the current Sportsradar
reference does not provide a definition for
startingStrikes.)
While I am a long-suffering fan of the Detroit Tigers, I don’t believe that bias intruded on this analysis, since none of the questions I was interested in focused on performance at the team level.
A handful of cleaning steps were necessary in the data preparation that followed:
To help identify pitching events that both included a full count
and did not include a full count, I concatenated gameID,
inningNumber, inningHalf, and
atBatSequenceNumber.
To simplify the list of full count pitching outcomes, I grouped
the actual list of 36 outcomes to a summarized list of 20 outcomes - for
example, counting outcomeId kKL (Strike Looking) and kKS
(Strike Swinging) as full count strikeouts without making any
distinction between the two.
To create the tables described in the Introduction above, I wrote individual queries for full count strikeouts, walks, and other outcomes; as well as non-full count pitching outcomes. As an example, here’s the query for compiling all full count strikeout events:
SELECT CONCAT (gameId,".",inningNumber, ".", inningHalf, ".", atBatEventSequenceNumber) AS fcPitches, CONCAT (gameId, ".",inningNumber) AS fcInnings, description, outcomeId, outcomeDescription, CONCAT (hitterLastName, ", ",hitterFirstName) AS hitterFullName, hitterId, hitterBatHand, hitterPitchCount, CONCAT (pitcherLastName, ", ",pitcherFirstName) AS pitcherFullName, pitcherId, pitcherThrowHand, pitcherPitchCount, pitchTypeDescription, balls, strikes
FROM `bigquery-public-data.baseball.games_wide`
WHERE balls =3 AND strikes =3 AND (outcomeId LIKE 'kK_' OR outcomeId LIKE 'oK_T_'OR outcomeId LIKE 'aK_AD_') AND outcomeId <> '-'
ORDER BY outcomeDescription;
CREATE TABLE `singular-glow-360823.baseball_2016_games.nfc_outcomes_clean` AS
SELECT DISTINCT * FROM `singular-glow-360823.baseball_2016_games.nfc_outcomes`;
# preparing data
fco_counts <- table(fco_clean$simpleOutcomeId)
fco_counts_df <- as.data.frame(fco_counts)
colnames(fco_counts_df) <- c('outcome','outcome_count')
fco_counts_df_ord <- fco_counts_df[order(fco_counts_df$outcome_count, decreasing = TRUE),]
str(fco_counts_df_ord)
## 'data.frame': 12 obs. of 2 variables:
## $ outcome : Factor w/ 12 levels "double","fielders choice",..: 4 12 10 5 3 9 7 1 6 8 ...
## $ outcome_count: int 14205 7478 5903 3157 2585 2322 950 740 546 409 ...
#drawing plot
library(ggplot2)
library(forcats)
fco_counts_df_ord %>%
mutate(outcome = fct_reorder(outcome, outcome_count)) %>%
ggplot(aes(x = outcome, y = outcome_count)) +
geom_bar(stat = "identity", fill = "#f68060", alpha = .6, width = .4) +
coord_flip() +
xlab("") +
theme_bw()
SELECT n.pitcherFullName, COUNT (DISTINCT n.nfcPitches) AS nfcEvents, COUNT (DISTINCT f.fcPitches) AS fcEvents, (COUNT (DISTINCT n.nfcPitches) + COUNT (DISTINCT f.fcPitches)) AS totalEvents, ((COUNT (DISTINCT f.fcPitches)) / (COUNT (DISTINCT f.fcPitches) + COUNT (DISTINCT n.nfcPitches))) AS pctFullCountEvents
FROM `singular-glow-360823.baseball_2016_games.nfc_outcomes_clean` n
FULL OUTER JOIN `baseball_2016_games.full_count_outcomes_clean` f ON n.pitcherId = f.pitcherId
GROUP BY n.pitcherId, n.pitcherFullName
HAVING totalEvents > 10
ORDER BY pctFullCountEvents DESC
LIMIT 5;
| pitcherFullName | nfcEvents | fcEvents | totalEvents | pctFullCountEvents |
|---|---|---|---|---|
| Magill, Matthew | 24 | 10 | 34 | 0.2941176 |
| Bethancourt, Christian | 10 | 4 | 14 | 0.2857143 |
| Olson, Tyler | 18 | 6 | 24 | 0.2500000 |
| Drabek, Kyle | 15 | 5 | 20 | 0.2500000 |
| Jackson, Luke | 85 | 28 | 113 | 0.2477876 |
Pitchers with the most full count strikeouts
SELECT pitcherFullName, COUNT (outcomeId) AS Strikeouts, COUNT (DISTINCT fcPitches) AS fcPitches, COUNT (outcomeId) / COUNT (DISTINCT fcPitches) AS strikeoutsPerFCPitch
FROM `singular-glow-360823.baseball_2016_games.full_count_outcomes_clean`
WHERE outcomeId = 'kKS' OR outcomeId = 'kKL'
GROUP BY pitcherFullName
HAVING fcPitches > 4
ORDER BY strikeoutsPerFCPitch DESC
LIMIT 5;
| pitcherFullName | Strikeouts | fcPitches | strikeoutsPerFCPitch |
|---|---|---|---|
| Davis, Wade | 16 | 12 | 1.333333 |
| Edwards, Carl | 10 | 8 | 1.250000 |
| Glasnow, Tyler | 6 | 5 | 1.200000 |
| Cishek, Steven | 20 | 17 | 1.176471 |
| Abad, Fernando | 7 | 6 | 1.166667 |
Pitchers with the most full count walks
SELECT pitcherFullName, COUNT (outcomeId) AS Walks, COUNT (DISTINCT fcPitches) AS fcPitches, COUNT (outcomeId) / COUNT (DISTINCT fcPitches) AS walksPerFCPitch
FROM `singular-glow-360823.baseball_2016_games.full_count_outcomes_clean`
WHERE outcomeId = 'bB' OR outcomeId = 'bDB' OR outcomeId = 'bIB'
GROUP BY pitcherFullName
HAVING fcPitches > 4
ORDER BY walksPerFCPitch DESC
LIMIT 5;
| pitcherFullName | Walks | fcPitches | walksPerFCPitch |
|---|---|---|---|
| Strop, Pedro | 10 | 7 | 1.428571 |
| Delabar, Steven | 7 | 5 | 1.400000 |
| Hessler, Keith | 7 | 5 | 1.400000 |
| Street, Huston | 8 | 6 | 1.333333 |
| Shreve, Chasen | 10 | 8 | 1.250000 |
Most effective pitches, full count strikeouts
SELECT pitchTypeDescription, pitcherThrowHand, hitterBatHand, COUNT (outcomeId) AS strikeouts
FROM `singular-glow-360823.baseball_2016_games.full_count_outcomes_clean`
WHERE outcomeId = 'kKS' OR outcomeId = 'kKL'
GROUP BY pitchTypeDescription, pitcherThrowHand, hitterBatHand
ORDER BY strikeouts DESC
LIMIT 5;
| pitchTypeDescription | pitcherThrowHand | hitterBatHand | strikeouts |
|---|---|---|---|
| Fastball | R | R | 1027 |
| Fastball | R | L | 628 |
| Slider | R | R | 526 |
| Fastball | L | R | 460 |
| Slider | R | L | 293 |
Riskiest pitches, full count walks
SELECT pitchTypeDescription, pitcherThrowHand, hitterBatHand, COUNT (outcomeId) AS walks
FROM `singular-glow-360823.baseball_2016_games.full_count_outcomes_clean`
WHERE outcomeId = 'bB' OR outcomeId = 'bDB' OR outcomeId = 'bIB'
GROUP BY pitchTypeDescription, pitcherThrowHand, hitterBatHand
ORDER BY walks DESC
LIMIT 5;
| pitchTypeDescription | pitcherThrowHand | hitterBatHand | walks |
|---|---|---|---|
| Fastball | R | R | 1445 |
| Fastball | R | L | 1086 |
| Fastball | L | R | 667 |
| Slider | R | R | 471 |
| Fastball | R | B | 397 |
SELECT CONCAT (pitcherLastName, ", ",pitcherFirstName) AS pitcherFullName, COUNT (DISTINCT (CONCAT (gameId,".",inningNumber, ".", inningHalf, ".", atBatEventSequenceNumber))) AS fcPitches, COUNT (DISTINCT outcomeId) AS stealCount, (COUNT (DISTINCT outcomeId) / (COUNT (DISTINCT(CONCAT (gameId,".",inningNumber, ".", inningHalf, ".", atBatEventSequenceNumber))))) AS pctStealFromFC
FROM `bigquery-public-data.baseball.games_wide`
WHERE startingBalls =3 AND startingStrikes =2 AND (rob1_outcomeId = 'SB2' OR rob2_outcomeId = 'SB3')
GROUP BY pitcherFullName
HAVING fcPitches > 1
ORDER BY pctStealFromFC DESC
LIMIT 5;
| pitcherFullName | fcPitches | stealCount | pctStealFromFC |
|---|---|---|---|
| Syndergaard, Noah | 3 | 3 | 1 |
| Wisler, Matthew | 2 | 2 | 1 |
| Peralta, Joel | 2 | 2 | 1 |
| Sanchez, Aaron | 2 | 2 | 1 |
| Harris, William | 2 | 2 | 1 |
SELECT CONCAT (hitterLastName, ", ",hitterFirstName) AS hitterFullName, COUNT (DISTINCT (CONCAT (gameId,".",inningNumber, ".", inningHalf, ".", atBatEventSequenceNumber))) AS fcPitches, COUNT (DISTINCT outcomeId) AS onBaseCount, (COUNT (DISTINCT outcomeId) / (COUNT (DISTINCT(CONCAT (gameId,".",inningNumber, ".", inningHalf, ".", atBatEventSequenceNumber))))) AS pctOnBaseFromFC
FROM `bigquery-public-data.baseball.games_wide`
WHERE startingBalls =3 AND startingStrikes =2 AND outcomeId = 'bB' OR (outcomeId LIKE 'b_B' OR outcomeId LIKE 'a_' OR outcomeId LIKE 'aHR' OR outcomeId LIKE 'a_AD_')
GROUP BY hitterFullName
HAVING fcPitches > 9
ORDER BY pctOnBaseFromFC DESC
LIMIT 5;
| hitterFullName | fcPitches | onBaseCount | pctOnBaseFromFC |
|---|---|---|---|
| Ethier, Andre | 12 | 6 | 0.5000000 |
| Lake, Junior | 12 | 6 | 0.5000000 |
| den Dekker, Matt | 11 | 5 | 0.4545455 |
| Patterson, Jordan | 11 | 5 | 0.4545455 |
| Margot, Manuel | 11 | 5 | 0.4545455 |
ggplot(data = fco_clean) +
geom_smooth(mapping = aes(x = pitcherPitchCount,y = hitterPitchCount))+
facet_wrap(~simpleOutcome,nrow=2) +
theme_bw()
The more I played around with the data, the more I understood that I could keep going as far as I wanted to - for example, filtering by a particular pitcher / batter combination, or by a pitch type, or by a particular team or group of teams (like a division). I can’t say any of the results were particularly surprising (since I had no expectations going into the exercise), but some of the results were entertaining to see as they sprung to life; for example, learning that the incidence of foul balls noticeably increased as pitch count increased above 100. Or quantifying the infrequency of hits relative to strikeouts and walks when a full count situation.
It’s certainly a data set that suggests many more questions and - if you’re a baseball fan - lots of fun while you explore.