Acquiring the data

Google BigQuery offers public datasets, and the first terabyte of data queried each month is free. One of these datasets contains pitch-by-pitch data from the 2016 MLB season. Acquiring the data this way is convenient because we do not have to set up a separate database and upload the data. I will not walk through the setup process, but I will include the code used to acquire the variables (including ones that are unused on this page) that I feel are most useful for exploring and analyzing the pitch-by-pitch data.

suppressMessages(library(plyr))
suppressMessages(library(dplyr))
suppressMessages(library(tidyr))
suppressMessages(library(plotly))
suppressMessages(library(RColorBrewer))

suppressMessages(library(bigrquery))

### Input project ID
#project <- "project_id" 

### Query with all variables we might need
#sql <- "SELECT gameId, dayNight, venueOutfieldDistances, inningHalf, atBatEventType, awayTeamName, homeTeamName, homeFinalRuns, homeFinalHits, awayFinalRuns, awayFinalHits, inningEventType,
#outcomeId, outcomeDescription, pitcherFirstName, pitcherLastName, hitterLastName, hitterFirstName, hitterBatHand,
#pitcherThrowHand, pitchType, pitchSpeed, pitchZone, pitcherPitchCount, hitterPitchCount, hitLocation, hitType,
#startingBalls, startingStrikes, startingOuts, balls, strikes, outs, is_ab_over, is_hit, is_on_base, homeCurrentTotalRuns,
#awayCurrentTotalRuns
#FROM `bigquery-public-data.baseball.games_wide`
#WHERE pitcherLastName = 'Sale'
#"

# Execute the query and store the result
#atbat <- query_exec(sql, project = project, useLegacySql = FALSE)

### My stored file
atbat <- read.csv("SaleMLB.csv")

This pulls pitch-by-pitch data for Chris Sale during the 2016 MLB season.

Q1: How many pitches per game did Chris Sale throw in the 2016 season?

atbat %>%
  group_by(gameId) %>%
  summarize(Pitches = max(pitcherPitchCount)) %>%
  ungroup() %>%
  summarize(Pitches = mean(Pitches))
## # A tibble: 1 x 1
##   Pitches
##     <dbl>
## 1    104.

Chris Sale threw an average of about 104 pitches per game in the 2016 season.
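
To see why the pipeline above takes the maximum of pitcherPitchCount within each game before averaging, here is a minimal sketch on made-up data. It assumes, as the use of max() implies, that pitcherPitchCount is a running count within a game. The toy data frame and its values are hypothetical, and base R is used so the snippet runs without extra packages.

```r
# Toy data (hypothetical values, not taken from SaleMLB.csv).
# Assumption: pitcherPitchCount is a running count that grows by one
# with every pitch the pitcher throws within a game.
toy <- data.frame(
  gameId = c("g1", "g1", "g1", "g2", "g2"),
  pitcherPitchCount = c(1, 2, 3, 1, 2)
)

# The largest running count in each game is that game's pitch total.
pitches_per_game <- tapply(toy$pitcherPitchCount, toy$gameId, max)

# Averaging the per-game totals gives pitches per game.
mean_pitches <- mean(pitches_per_game)

# Averaging the raw column instead would average the running count itself,
# which understates the per-game workload.
naive_mean <- mean(toy$pitcherPitchCount)

print(pitches_per_game)
print(c(mean_pitches = mean_pitches, naive_mean = naive_mean))
```

The two-step summary (per-game maximum, then mean) is what makes the result a per-game figure rather than an average over individual pitches.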

Q2: What types of hits did Chris Sale typically allow?

Most home runs come off fly balls, so we generally want a pitcher to have a low fly ball rate, a low line drive rate, and a high ground ball rate. However, pitchers with a high strikeout rate, such as Chris Sale, tend to allow more fly balls and line drives than “ground ball pitchers”.

hitTypes <- atbat %>%
  filter(hitType != "") %>%
  group_by(hitType) %>%
  summarize(n = n()) %>%
  mutate(freq = n / sum(n))

# 4 colors for the 4 hit types, taken from the 'Set3' palette in RColorBrewer
# (a plain character vector; no need to wrap it in a list or interpolate)
colors <- brewer.pal(4, "Set3")

plot_ly(hitTypes, labels = ~hitType, values = ~freq, type = 'pie',
             textposition = 'inside',
             textinfo = 'label+percent',
             insidetextfont = list(color = '#FFFFFF', size = 14),
             hoverinfo = 'text',
             text = ~paste(n, 'hits'),
             marker = list(colors = colors,
                           line = list(color = '#FFFFFF', width = 1)),
             showlegend = FALSE) %>%
  layout(title = 'Proportion of Hits Allowed by Hit Type',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

A fly ball (FB) rate of 24.8% is very low, especially for a strikeout pitcher. The high line drive (LD) rate is less surprising, and the high ground ball (GB) rate helps explain why he does not allow many extra-base hits. Finally, about 10% of hits were popups (PU), which is also good: it suggests batters had trouble making solid contact with the ball.

Q3: How much of a factor is fatigue in allowing hits?

Ideally, a starting pitcher will throw around 100 pitches when on adequate rest. This is often not the case, however, for two main reasons: fatigue tends to decrease fastball velocity, and batters tend to adjust to the pitcher by this mark. Since Chris Sale had an amazing 2016 season, we want to see whether fatigue is associated with a significant increase in hits.

### Is fatigue a factor in allowing hits?
mod1 <- glm(is_hit ~ pitcherPitchCount, data = atbat, family = "binomial")
summary(mod1)$coefficients
##                       Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)       -2.723269573 0.14092167 -19.32470 3.328834e-83
## pitcherPitchCount -0.002363292 0.00231709  -1.01994 3.077571e-01

The logistic regression gives no evidence that Chris Sale’s pitch count affects whether the batter he is facing gets a hit. The estimated coefficient on pitch count is slightly negative and not statistically significant at any conventional level (p ≈ 0.31). That is a rather impressive feat.
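
To put the coefficients on a more readable scale, the sketch below converts them into an odds ratio, an approximate 95% interval, and model-implied hit probabilities. It only reuses the three numbers printed in the coefficient table above. The variable names are mine, and the interval is a simple Wald (normal-approximation) interval, which is an assumption on my part rather than something the original analysis computed.

```r
# Values copied from the summary(mod1)$coefficients output.
b0  <- -2.723269573   # intercept (log-odds of a hit at pitch count 0)
b1  <- -0.002363292   # change in log-odds per additional pitch
se1 <-  0.00231709    # standard error of b1

# Odds ratio associated with 10 additional pitches.
or_10 <- exp(10 * b1)

# Approximate 95% Wald interval for the slope (normal approximation).
ci_b1 <- b1 + c(-1, 1) * qnorm(0.975) * se1

# Model-implied probability that a pitch results in a hit,
# at the 1st and the 100th pitch of a game.
p_1   <- plogis(b0 + b1 * 1)
p_100 <- plogis(b0 + b1 * 100)

print(round(c(or_10 = or_10, p_1 = p_1, p_100 = p_100), 3))
print(round(ci_b1, 4))
```

Because the interval for the slope spans zero, the small difference between the two fitted probabilities should not be read as evidence that Sale improves as his pitch count climbs.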

Q4: How many strike 3’s were looking vs swinging?

Strike3 <- atbat %>%
  filter((outcomeDescription == "Strike Looking" | outcomeDescription == "Strike Swinging") & is_ab_over == 1) %>%
  group_by(outcomeDescription) %>%
  summarize(n = n()) %>%
  mutate(freq = n / sum(n))
print(Strike3)
## # A tibble: 2 x 3
##   outcomeDescription     n  freq
##   <fct>              <int> <dbl>
## 1 Strike Looking        61 0.274
## 2 Strike Swinging      162 0.726

About 73% of Chris Sale’s strikeouts (162 of 223) were swinging rather than looking.
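
As a rough check on how precisely one season pins down that share, here is a small sketch that recomputes the proportion from the two counts in the table above and attaches an exact binomial confidence interval using base R's binom.test(). Treating each strikeout as an independent trial is a simplifying assumption of mine, not part of the original analysis.

```r
# Counts copied from the Strike3 table.
looking  <- 61
swinging <- 162
total    <- looking + swinging

# Share of third strikes that were swinging.
share_swinging <- swinging / total

# Exact (Clopper-Pearson) 95% interval for that share.
# Assumption: strikeouts are treated as independent Bernoulli trials.
ci <- binom.test(swinging, total)$conf.int

print(round(share_swinging, 3))
print(round(as.numeric(ci), 3))
```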