Introduction

As a longtime baseball fan, I became intrigued by the public dataset bigquery-public-data.baseball available on Google BigQuery, and developed a number of questions while performing an initial exploration of the schema:

  1. What were the most frequent pitching outcomes of full counts?
  2. Which pitchers faced the greatest proportion of full counts during their time on the mound?
  3. Which pitchers had the highest proportion of strikeouts per full count / walks per full count?
  4. Which pitch types were most effective in delivering full count strikeouts / least effective in preventing full count walks?
  5. Which pitchers were most susceptible to stolen bases when in a full count situation?
  6. Which batters were most successful in getting on base or scoring when in a full count situation?
  7. What is the relationship between pitch count and walks or hits, in full count situations?

Since this data set was too large to simply download and manipulate via Excel / Google Sheets, it was necessary to query subsets of the data and build my own tables. The approach that I settled on involved creating two new tables in BigQuery:

  1. All full count pitching events
  2. All non-full count pitching events (only needed to answer question 2 above)

Bias

To understand the meaning of certain aspects of the raw data (like definitions of outcomeIDs), I relied on the Sportsradar MLB v7 Statistics Feeds reference. Since this document is dynamically updated (last version as I write this being November of 2023), there is no publicly available, precise reference for values recorded in 2016 (at least, none that I could find). So it’s possible error could be introduced that way (for example, the current Sportsradar reference does not provide a definition for startingStrikes.)

While I am a long-suffering fan of the Detroit Tigers, I don’t believe that bias intruded on this analysis, since none of the questions I was interested in focused on performance at the team level.

Data cleaning and preparation

A handful of cleaning steps were necessary in the data preparation that followed:

  1. To help identify pitching events that both included a full count and did not include a full count, I concatenated gameID, inningNumber, inningHalf, and atBatSequenceNumber.

  2. To simplify the list of full count pitching outcomes, I grouped the actual list of 36 outcomes to a summarized list of 20 outcomes - for example, counting outcomeId kKL (Strike Looking) and kKS (Strike Swinging) as full count strikeouts without making any distinction between the two.

  3. To create the tables described in the Introduction above, I wrote individual queries for full count strikeouts, walks, and other outcomes; as well as non-full count pitching outcomes. As an example, here’s the query for compiling all full count strikeout events:

SELECT CONCAT (gameId,".",inningNumber, ".", inningHalf, ".", atBatEventSequenceNumber) AS fcPitches, CONCAT (gameId, ".",inningNumber) AS fcInnings, description, outcomeId, outcomeDescription, CONCAT (hitterLastName, ", ",hitterFirstName) AS hitterFullName, hitterId, hitterBatHand, hitterPitchCount, CONCAT (pitcherLastName, ", ",pitcherFirstName) AS pitcherFullName, pitcherId, pitcherThrowHand, pitcherPitchCount, pitchTypeDescription, balls, strikes
FROM `bigquery-public-data.baseball.games_wide`
WHERE balls =3 AND strikes =3 AND (outcomeId LIKE 'kK_' OR outcomeId LIKE 'oK_T_'OR outcomeId LIKE 'aK_AD_') AND outcomeId <> '-'
ORDER BY outcomeDescription;
  1. For the tables I created for full count pitching events and non-full count pitching events, I removed duplicate rows as follows:
CREATE TABLE `singular-glow-360823.baseball_2016_games.nfc_outcomes_clean` AS
SELECT DISTINCT * FROM `singular-glow-360823.baseball_2016_games.nfc_outcomes`;
  1. To prevent skew, I filtered queries to ensure a minimum sample size - for example, when evaluating which pitchers had the highest proportion of full count strikeouts, I only looked at pitchers who had 5 or more full count pitches.

Approach

Q1. What were the most frequent pitching outcomes of full counts?

# preparing data
fco_counts <- table(fco_clean$simpleOutcomeId)
fco_counts_df <- as.data.frame(fco_counts)
colnames(fco_counts_df) <- c('outcome','outcome_count')
fco_counts_df_ord <- fco_counts_df[order(fco_counts_df$outcome_count, decreasing = TRUE),]
str(fco_counts_df_ord)
## 'data.frame':    12 obs. of  2 variables:
##  $ outcome      : Factor w/ 12 levels "double","fielders choice",..: 4 12 10 5 3 9 7 1 6 8 ...
##  $ outcome_count: int  14205 7478 5903 3157 2585 2322 950 740 546 409 ...
#drawing plot
library(ggplot2)
library(forcats)
fco_counts_df_ord %>%
  mutate(outcome = fct_reorder(outcome, outcome_count)) %>%
  ggplot(aes(x = outcome, y = outcome_count)) +
  geom_bar(stat = "identity", fill = "#f68060", alpha = .6, width = .4) +
  coord_flip() +
  xlab("") +
  theme_bw()

Q2. Which pitchers faced the greatest proportion of full counts during their time on the mound?

SELECT n.pitcherFullName, COUNT (DISTINCT n.nfcPitches) AS nfcEvents, COUNT (DISTINCT f.fcPitches) AS fcEvents, (COUNT (DISTINCT n.nfcPitches) + COUNT (DISTINCT f.fcPitches)) AS totalEvents, ((COUNT (DISTINCT f.fcPitches)) / (COUNT (DISTINCT f.fcPitches) + COUNT (DISTINCT n.nfcPitches))) AS pctFullCountEvents
FROM `singular-glow-360823.baseball_2016_games.nfc_outcomes_clean` n
FULL OUTER JOIN `baseball_2016_games.full_count_outcomes_clean`  f ON n.pitcherId = f.pitcherId
GROUP BY n.pitcherId, n.pitcherFullName
HAVING totalEvents > 10
ORDER BY pctFullCountEvents DESC
LIMIT 5;
5 records
pitcherFullName nfcEvents fcEvents totalEvents pctFullCountEvents
Magill, Matthew 24 10 34 0.2941176
Bethancourt, Christian 10 4 14 0.2857143
Olson, Tyler 18 6 24 0.2500000
Drabek, Kyle 15 5 20 0.2500000
Jackson, Luke 85 28 113 0.2477876

Q3. Which pitchers had the highest proportion of strikeouts per full count / walks per full count?

Pitchers with the most full count strikeouts

SELECT pitcherFullName, COUNT (outcomeId) AS Strikeouts, COUNT (DISTINCT fcPitches) AS fcPitches, COUNT (outcomeId) / COUNT (DISTINCT fcPitches) AS strikeoutsPerFCPitch
FROM `singular-glow-360823.baseball_2016_games.full_count_outcomes_clean`
WHERE outcomeId = 'kKS' OR outcomeId = 'kKL'
GROUP BY pitcherFullName
HAVING fcPitches > 4
ORDER BY strikeoutsPerFCPitch DESC
LIMIT 5;
5 records
pitcherFullName Strikeouts fcPitches strikeoutsPerFCPitch
Davis, Wade 16 12 1.333333
Edwards, Carl 10 8 1.250000
Glasnow, Tyler 6 5 1.200000
Cishek, Steven 20 17 1.176471
Abad, Fernando 7 6 1.166667

Pitchers with the most full count walks

SELECT pitcherFullName, COUNT (outcomeId) AS Walks, COUNT (DISTINCT fcPitches) AS fcPitches, COUNT (outcomeId) / COUNT (DISTINCT fcPitches) AS walksPerFCPitch
FROM `singular-glow-360823.baseball_2016_games.full_count_outcomes_clean`
WHERE outcomeId = 'bB' OR outcomeId = 'bDB' OR outcomeId = 'bIB'
GROUP BY pitcherFullName
HAVING fcPitches > 4
ORDER BY walksPerFCPitch DESC
LIMIT 5;
5 records
pitcherFullName Walks fcPitches walksPerFCPitch
Strop, Pedro 10 7 1.428571
Delabar, Steven 7 5 1.400000
Hessler, Keith 7 5 1.400000
Street, Huston 8 6 1.333333
Shreve, Chasen 10 8 1.250000

Q4. Which pitch types were most effective in delivering full count strikeouts / least effective in preventing full count walks?

Most effective pitches, full count strikeouts

SELECT pitchTypeDescription, pitcherThrowHand, hitterBatHand, COUNT (outcomeId) AS strikeouts
FROM `singular-glow-360823.baseball_2016_games.full_count_outcomes_clean`
WHERE outcomeId = 'kKS' OR outcomeId = 'kKL'
GROUP BY pitchTypeDescription, pitcherThrowHand, hitterBatHand
ORDER BY strikeouts DESC
LIMIT 5;
5 records
pitchTypeDescription pitcherThrowHand hitterBatHand strikeouts
Fastball R R 1027
Fastball R L 628
Slider R R 526
Fastball L R 460
Slider R L 293

Riskiest pitches, full count walks

SELECT pitchTypeDescription, pitcherThrowHand, hitterBatHand, COUNT (outcomeId) AS walks
FROM `singular-glow-360823.baseball_2016_games.full_count_outcomes_clean`
WHERE outcomeId = 'bB' OR outcomeId = 'bDB' OR outcomeId = 'bIB'
GROUP BY pitchTypeDescription, pitcherThrowHand, hitterBatHand
ORDER BY walks DESC
LIMIT 5;
5 records
pitchTypeDescription pitcherThrowHand hitterBatHand walks
Fastball R R 1445
Fastball R L 1086
Fastball L R 667
Slider R R 471
Fastball R B 397

Q5. Which pitchers were most susceptible to stolen bases when in a full count situation?

SELECT CONCAT (pitcherLastName, ", ",pitcherFirstName) AS pitcherFullName, COUNT (DISTINCT (CONCAT (gameId,".",inningNumber, ".", inningHalf, ".", atBatEventSequenceNumber))) AS fcPitches, COUNT (DISTINCT outcomeId) AS stealCount, (COUNT (DISTINCT outcomeId) / (COUNT (DISTINCT(CONCAT (gameId,".",inningNumber, ".", inningHalf, ".", atBatEventSequenceNumber))))) AS pctStealFromFC
FROM `bigquery-public-data.baseball.games_wide`
WHERE startingBalls =3 AND startingStrikes =2 AND (rob1_outcomeId = 'SB2' OR rob2_outcomeId = 'SB3')
GROUP BY pitcherFullName
HAVING fcPitches > 1
ORDER BY pctStealFromFC DESC
LIMIT 5;
5 records
pitcherFullName fcPitches stealCount pctStealFromFC
Syndergaard, Noah 3 3 1
Wisler, Matthew 2 2 1
Peralta, Joel 2 2 1
Sanchez, Aaron 2 2 1
Harris, William 2 2 1

Q6. Which batters were most successful in getting on base or scoring when in a full count situation?

SELECT CONCAT (hitterLastName, ", ",hitterFirstName) AS hitterFullName, COUNT (DISTINCT (CONCAT (gameId,".",inningNumber, ".", inningHalf, ".", atBatEventSequenceNumber))) AS fcPitches, COUNT (DISTINCT outcomeId) AS onBaseCount, (COUNT (DISTINCT outcomeId) / (COUNT (DISTINCT(CONCAT (gameId,".",inningNumber, ".", inningHalf, ".", atBatEventSequenceNumber))))) AS pctOnBaseFromFC
FROM `bigquery-public-data.baseball.games_wide`
WHERE startingBalls =3 AND startingStrikes =2 AND outcomeId = 'bB' OR (outcomeId LIKE 'b_B' OR outcomeId LIKE 'a_' OR outcomeId LIKE 'aHR' OR outcomeId LIKE 'a_AD_')
GROUP BY hitterFullName
HAVING fcPitches > 9
ORDER BY pctOnBaseFromFC DESC
LIMIT 5;
5 records
hitterFullName fcPitches onBaseCount pctOnBaseFromFC
Ethier, Andre 12 6 0.5000000
Lake, Junior 12 6 0.5000000
den Dekker, Matt 11 5 0.4545455
Patterson, Jordan 11 5 0.4545455
Margot, Manuel 11 5 0.4545455

Q7. What is the relationship between pitch count and walks or hits, in full count situations?

ggplot(data = fco_clean) + 
  geom_smooth(mapping = aes(x = pitcherPitchCount,y = hitterPitchCount))+
  facet_wrap(~simpleOutcome,nrow=2) +
  theme_bw()

Conclusion

The more I played around with the data, the more I understood that I could keep going as far as I wanted to - for example, filtering by a particular pitcher / batter combination, or by a pitch type, or by a particular team or group of teams (like a division). I can’t say any of the results were particularly surprising (since I had no expectations going into the exercise), but some of the results were entertaining to see as they sprung to life; for example, learning that the incidence of foul balls noticeably increased as pitch count increased above 100. Or quantifying the infrequency of hits relative to strikeouts and walks when a full count situation.

It’s certainly a data set that suggests many more questions and - if you’re a baseball fan - lots of fun while you explore.