As I sat in quarantine, sulking about the delay of the 2020 MLB season, I realized there was no better way to numb the pain of missing baseball than to crunch some numbers behind some pitcher data. An important aspect of baseball, especially during the postseason, is knowing how long a starting pitcher will last in a game. This can be helpful for the MLB managers in understanding when they need to start warming up the bullpen and how many relief pitchers will be required to finish a game.
This notebook looks into the factors that affect the pitch count at which a manager pulls the starting pitcher from the game. In other words, I use survival analysis to attempt to model the behavior of an MLB manager: when do managers decide to relieve a starting pitcher? This assumes that the MLB managers make the correct choice of pulling starting pitchers, which is not always a true assumption. Nevertheless, here are a few takeaways from the analysis:
The notebook dives into the data preparation, engineering, and modeling that support these takeaways. Take a look!
These are the following packages that are used throughout the notebook.
library(dplyr)
library(stringr)
library(zoo)
library(lubridate)
library(survminer)
library(flexsurv)
library(knitr)
library(twextras)
library(kableExtra)
library(scales)
This notebook uses four data sets from Kaggle:
These data sets all provide MLB information from seasons 2015-2018, however this notebook will only focus on the most recent year: 2018.
# Read in the necessary data
atbats <- read.csv("C:/Users/Jackson Cabell/Documents/Pitcher_Survival/Data/atbats.csv",
stringsAsFactors = FALSE)
pitches <- read.csv("C:/Users/Jackson Cabell/Documents/Pitcher_Survival/Data/pitches.csv",
stringsAsFactors = FALSE)
games <- read.csv("C:/Users/Jackson Cabell/Documents/Pitcher_Survival/Data/games.csv",
stringsAsFactors = FALSE)
player_names <- read.csv("C:/Users/Jackson Cabell/Documents/Pitcher_Survival/Data/player_names.csv",
stringsAsFactors = FALSE)
# Add atbat info
pitch <- left_join(pitches, atbats, by = 'ab_id')
# Add game info
pitch <- left_join(pitch, games, by = 'g_id')
# Add pitcher names
player_names$p_name <- paste(player_names$first_name, player_names$last_name)
pitch <- left_join(pitch, dplyr::select(player_names,c(id, p_name)), by = c('pitcher_id' = 'id'))
# Add batter names
player_names$b_name <- paste(player_names$first_name, player_names$last_name)
pitch <- left_join(pitch, dplyr::select(player_names,c(id, b_name)), by = c('batter_id' = 'id'))
# Take out unnecessary columns
pitch <- pitch %>%
select(-c(ax, ay, az, sz_bot, sz_top, vx0, vy0, vz0, x, x0, y, y0, z0, pfx_x, pfx_z))
Now that we have the data merged to include game-level and pitch-level data for each at-bat, let’s determine which pitches were thrown by the starting pitchers. We will only focus on these data points for the remainder of the analysis.
# Create a dataset of game_id and starting pitchers
starting_pitchers_away <- pitch %>%
filter(inning == 1,
o %in% c(0, 1),
top == "False") %>%
arrange(g_id, event_num) %>%
group_by(g_id) %>%
slice(1) %>%
select(g_id, p_name, pitcher_id) %>%
rename(starter_away_name = p_name,
starter_away_id = pitcher_id) %>%
mutate(home = 0)
starting_pitchers_home <- pitch %>%
filter(inning == 1,
o %in% c(0, 1),
top == "True") %>%
arrange(g_id, event_num) %>%
group_by(g_id) %>%
slice(1) %>%
select(g_id, p_name, pitcher_id) %>%
rename(starter_home_name = p_name,
starter_home_id = pitcher_id) %>%
mutate(home = 1)
# Get pitch data for starting away pitchers
pitch_away <- inner_join(pitch, starting_pitchers_away, by = c('g_id' = 'g_id',
'pitcher_id' = 'starter_away_id')) %>%
select(-c(starter_away_name))
# Get pitch data for starting home pitchers
pitch_home <- inner_join(pitch, starting_pitchers_home, by = c('g_id' = 'g_id',
'pitcher_id' = 'starter_home_id')) %>%
select(-c(starter_home_name))
# Combine both datasets to have pitch data only for starting pitchers
starters <- rbind(pitch_home, pitch_away)
#Create unique identifier for pitcher and game
starters$gp_id <- as.numeric(paste(starters$g_id, starters$pitcher_id, sep=''))
starters <- starters %>%
arrange(gp_id,event_num)
# Fill in team for pitcher and the opp team
starters[which(starters$home==0),"team"] <- starters[which(starters$home==0), "away_team"]
starters[which(starters$home==0),"opp_team"] <- starters[which(starters$home==0), "home_team"]
starters[which(starters$home==1),"team"] <- starters[which(starters$home==1), "home_team"]
starters[which(starters$home==1),"opp_team"] <- starters[which(starters$home==1), "away_team"]
There are several variables we can create with the information provided by these data sets. Most importantly, we need to calculate the two target variables that are necessary to execute survival analysis:
# Cumulative pitch-count per pitcher in a game
starters <- starters %>%
mutate(flag = 1) %>%
group_by(gp_id) %>%
arrange(event_num) %>%
mutate(p_count = cumsum(flag)) %>%
ungroup() %>%
select(-c(flag)) %>%
arrange(gp_id, event_num) %>%
mutate(subbed = 0,# Add target variable for subbed
rownum = 1:n())
# Determine last pitch thrown (assume the pitchers are only subbed in-between batters)
last <- starters %>%
group_by(gp_id) %>%
filter(p_count == max(p_count)) %>%
mutate(subbed = 1)
starters[which(starters$rownum %in% c(last$rownum)),"subbed"] <- 1
Next, we will create several variables, both time-dependent and time-independent, that we think may be useful for predicting or explaining when a pitcher will be relieved.
The engineered time-dependent variables are the following:
The engineered time-independent variables are the following:
# Add indicator if batter gets on or gets out
outs <- starters %>%
arrange(gp_id, event_num) %>%
group_by(ab_id) %>%
filter(pitch_num == max(pitch_num),
grepl("out", str_replace_all((str_to_lower(event)), " ", ""))) %>%
mutate(out = 1)
starters$out <- ifelse(starters$rownum %in% outs$rownum, 1, 0)
# Add last pitch indicator
last_pitch_df <- starters %>%
arrange(gp_id, event_num) %>%
group_by(ab_id) %>%
filter(pitch_num == max(pitch_num)) %>%
mutate(last_pitch = 1)
starters$last_pitch <- ifelse(starters$rownum %in% last_pitch_df$rownum, 1, 0)
# Add strikeout indicator
strikeouts <- starters %>%
arrange(gp_id, event_num) %>%
mutate(flag = 1) %>%
group_by(gp_id) %>%
filter(last_pitch==1,
grepl("strikeout", str_replace_all((str_to_lower(event)), " ", ""))) %>%
mutate(strikeouts = 1) %>%
ungroup()
starters <- left_join(starters, select(strikeouts, c(rownum, strikeouts)), by = 'rownum')
# Add walk indicator
walks <- starters %>%
arrange(gp_id, event_num) %>%
mutate(flag = 1) %>%
group_by(gp_id) %>%
filter(last_pitch==1,
grepl("walk", str_replace_all((str_to_lower(event)), " ", ""))) %>%
mutate(walks = 1) %>%
ungroup()
starters <- left_join(starters, select(walks, c(rownum, walks)), by = 'rownum')
# Cumulative strikeouts and walks
starters <- starters %>%
mutate(walks = ifelse(is.na(walks), 0, walks),
strikeouts = ifelse(is.na(strikeouts), 0, strikeouts)) %>%
group_by(gp_id) %>%
arrange(event_num) %>%
mutate(cumstrikeouts = cumsum(strikeouts),
cumwalks = cumsum(walks)) %>%
ungroup() %>%
arrange(gp_id)
# Previous fastball speed
fastball <- c("FC", "FF","FT")
offspeed <- c("CH", "CU", "EP", "FS", "KC", "KN", "SC", "SI", "SL")
starters <- starters %>%
mutate(fast_speed = ifelse(starters$pitch_type %in% fastball, end_speed, NA)) %>%
group_by(gp_id) %>%
arrange(event_num) %>%
mutate(avg_fast_speed = mean(fast_speed, na.rm=TRUE),
sd_fast_speed = sd(fast_speed, na.rm=TRUE),
sdz_fast_speed = na.locf((fast_speed-avg_fast_speed)/sd_fast_speed, na.rm = FALSE)) %>%
ungroup() %>%
arrange(gp_id)
# Replace NAs w/ avg (0)
starters$sdz_fast_speed <- ifelse(is.na(starters$sdz_fast_speed), 0, starters$sdz_fast_speed)
# Previous nasty score for offspeed
starters <- starters %>%
mutate(nasty_offspeed = ifelse(starters$pitch_type %in% offspeed, nasty, NA)) %>%
group_by(gp_id) %>%
arrange(event_num) %>%
mutate(avg_nasty_offspeed = mean(nasty_offspeed, na.rm=TRUE),
sd_nasty_offspeed = sd(nasty_offspeed, na.rm=TRUE),
sdz_nasty_offspeed = na.locf((nasty_offspeed-avg_nasty_offspeed)/sd_nasty_offspeed, na.rm = FALSE)) %>%
ungroup() %>%
arrange(gp_id)
# Replace NAs w/ avg (0)
starters$sdz_nasty_offspeed <- ifelse(is.na(starters$sdz_nasty_offspeed), 0, starters$sdz_nasty_offspeed)
# Change event type to ball or strike if not the last pitch of the at-bat
starters$event <- ifelse(starters$last_pitch==0,
ifelse(starters$type=="S", "Strike", ifelse(starters$type=="B", "Ball", starters$type)),
starters$event)
# Calculate cumulative batters reached base in the game
reached <- starters %>%
mutate(reached = ifelse(last_pitch==1 & out==0, 1, 0)) %>%
filter(reached == 1) %>%
group_by(gp_id) %>%
arrange(event_num) %>%
mutate(cumreached = cumsum(reached)) %>%
ungroup() %>%
arrange(gp_id, event_num) %>%
select(cumreached, rownum)
starters <- left_join(starters, reached,'rownum')
starters <- starters %>%
group_by(gp_id) %>%
arrange(event_num) %>%
mutate(cumreached = na.locf(cumreached, na.rm = FALSE)) %>%
ungroup() %>%
arrange(gp_id, event_num)
#Replace NAs with 0s
starters$cumreached <- ifelse(is.na(starters$cumreached), 0, starters$cumreached)
# Calculate days since last start for a pitcher within a season
starters$date <- as.Date(starters$date)
starters$start_hour <- hour(as.POSIXct(starters$start_time,format="%H:%M"))+12
starters$start_hour <- ifelse(grepl("AM", starters$start_time), starters$start_hour-12, starters$start_hour)
starters$start_hour <- ifelse(starters$start_hour==24, 12, starters$start_hour)
starters <- starters %>%
mutate(start_year = year(date)) %>%
group_by(start_year, p_name) %>%
arrange(date, event_num) %>%
mutate(days_since_last_start = c(0,as.numeric(diff(date)))) %>%
mutate(days_since_last_start = ifelse(days_since_last_start==0, NA, days_since_last_start)) %>%
mutate(days_since_last_start = na.locf(days_since_last_start, na.rm = FALSE)) %>%
mutate(days_since_last_start = ifelse(is.na(days_since_last_start), 168, days_since_last_start)) %>%
ungroup()
# Calculate what start number of the year the game is (1st start, 2nd start, etc.)
starters <- starters %>%
group_by(start_year, p_name) %>%
arrange(date, event_num) %>%
mutate(game_number = cumsum(!duplicated(date))) %>%
ungroup()
# Split up weather into windspeed, temperature (F), and weather (fair, partly cloudy, etc.)
numextract <- function(string){
str_extract(string, "\\-*\\d+\\.*\\d*")
}
starters <- starters %>%
mutate(windspeed = as.numeric(numextract(wind)),
temp_f = as.numeric(numextract(weather)),
sky_weather = str_extract(starters$weather, '\\b[^,]+$'))
# Calculate run differential
starters$run_diff <- starters$p_score - starters$b_score
# Make sure starting pitchers who throw complete games are accounted for
starters$subbed <- ifelse(starters$inning==9 & starters$o==3 & starters$last_pitch==1, 0, starters$subbed)
# Add indicator for NL
nl_teams <- c("atl", "mia", "nyn", "phi","was", "chn","cin", "mil", "pit", "sln", "ari",
"col", "lan", "sdn", "sfn")
starters$league <- ifelse(starters$team %in% nl_teams, "NL", "AL")
# Collect final list of variables for analysis
df <- starters %>%
select(end_speed, code, b_score, ab_id, b_count, s_count, outs, pitch_num,
on_1b, on_2b, on_3b, event, g_id, inning, o, p_score, p_throws,
stand, top, attendance, date, start_time, venue_name, delay, p_name, b_name, home, gp_id, team,league,
opp_team, p_count, event_num, subbed, out, last_pitch, cumstrikeouts, cumwalks, sdz_fast_speed,
sdz_nasty_offspeed,run_diff, days_since_last_start, game_number, windspeed, temp_f,
sky_weather, start_hour, type, code, cumreached, pitch_type)
df <- df %>%
group_by(gp_id) %>%
arrange(event_num) %>%
mutate(start = p_count-1,
end = p_count) %>%
ungroup()
Here is Chris Sale with the 2018 Boston Red Sox: an example of what the data set looks like so far with the created target variables (pitch_count is split-up into “start” and “end” to include time-dependent variables):
# View Chris Sale
preview <- df %>% select(date, game_number, p_name, start, end, subbed, type, everything()) %>%
filter(p_name == "Chris Sale", year(as.Date(date))==2018) %>%
arrange(date, end) %>% head(10)
kable(preview, align = 'c', caption = "Chris Sale: 2018 Boston Red Sox") %>%
kable_styling(font_size = 12, position = 'center', full_width = FALSE, bootstrap_options = "hover") %>%
scroll_box(width = "100%", height = "350px")
| date | game_number | p_name | start | end | subbed | type | event_num | event | b_score | p_score | run_diff | b_count | s_count | pitch_num | out | o | inning | p_throws | stand | on_1b | on_2b | on_3b | attendance | start_hour | venue_name | delay | windspeed | temp_f | sky_weather | team | league | opp_team | cumstrikeouts | cumwalks | sdz_fast_speed | sdz_nasty_offspeed | days_since_last_start | code | cumreached | bs_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-03-29 | 1 | Chris Sale | 0 | 1 | 0 | S | 25 | X | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | L | R | 0 | 0 | 0 | 31042 | 16 | Tropicana Field | 0 | 0 | 72 | dome | bos | AL | tba | 0 | 0 | 0.0926013 | 0.0000000 | 168 | C | 0 | 0,0 |
| 2018-03-29 | 1 | Chris Sale | 1 | 2 | 0 | S | 26 | X | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 1 | L | R | 0 | 0 | 0 | 31042 | 16 | Tropicana Field | 0 | 0 | 72 | dome | bos | AL | tba | 0 | 0 | 0.0926013 | 0.1341089 | 168 | C | 0 | 0,1 |
| 2018-03-29 | 1 | Chris Sale | 2 | 3 | 0 | B | 27 | X | 0 | 0 | 0 | 0 | 2 | 3 | 0 | 1 | 1 | L | R | 0 | 0 | 0 | 31042 | 16 | Tropicana Field | 0 | 0 | 72 | dome | bos | AL | tba | 0 | 0 | 0.1965416 | 0.1341089 | 168 | B | 0 | 0,2 |
| 2018-03-29 | 1 | Chris Sale | 3 | 4 | 0 | B | 28 | X | 0 | 0 | 0 | 1 | 2 | 4 | 0 | 1 | 1 | L | R | 0 | 0 | 0 | 31042 | 16 | Tropicana Field | 0 | 0 | 72 | dome | bos | AL | tba | 0 | 0 | 0.1965416 | 0.8577383 | 168 | B | 0 | 1,2 |
| 2018-03-29 | 1 | Chris Sale | 4 | 5 | 0 | S | 29 | X | 0 | 0 | 0 | 2 | 2 | 5 | 0 | 1 | 1 | L | R | 0 | 0 | 0 | 31042 | 16 | Tropicana Field | 0 | 0 | 72 | dome | bos | AL | tba | 0 | 0 | 1.2879142 | 0.8577383 | 168 | F | 0 | 2,2 |
| 2018-03-29 | 1 | Chris Sale | 5 | 6 | 0 | S | 30 | X | 0 | 0 | 0 | 2 | 2 | 6 | 0 | 1 | 1 | L | R | 0 | 0 | 0 | 31042 | 16 | Tropicana Field | 0 | 0 | 72 | dome | bos | AL | tba | 0 | 0 | 1.4438245 | 0.8577383 | 168 | S | 0 | 2,2 |
| 2018-03-29 | 1 | Chris Sale | 6 | 7 | 0 | B | 34 | X | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | L | L | 0 | 0 | 0 | 31042 | 16 | Tropicana Field | 0 | 0 | 72 | dome | bos | AL | tba | 0 | 0 | -1.2586219 | 0.8577383 | 168 | B | 0 | 0,0 |
| 2018-03-29 | 1 | Chris Sale | 7 | 8 | 0 | S | 35 | X | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 2 | 1 | L | L | 0 | 0 | 0 | 31042 | 16 | Tropicana Field | 0 | 0 | 72 | dome | bos | AL | tba | 0 | 0 | -1.2586219 | -1.4682133 | 168 | C | 0 | 1,0 |
| 2018-03-29 | 1 | Chris Sale | 8 | 9 | 0 | S | 36 | X | 0 | 0 | 0 | 1 | 1 | 3 | 0 | 2 | 1 | L | L | 0 | 0 | 0 | 31042 | 16 | Tropicana Field | 0 | 0 | 72 | dome | bos | AL | tba | 0 | 0 | -0.4790700 | -1.4682133 | 168 | F | 0 | 1,1 |
| 2018-03-29 | 1 | Chris Sale | 9 | 10 | 0 | S | 37 | X | 0 | 0 | 0 | 1 | 2 | 4 | 0 | 2 | 1 | L | L | 0 | 0 | 0 | 31042 | 16 | Tropicana Field | 0 | 0 | 72 | dome | bos | AL | tba | 0 | 0 | 1.5477648 | -1.4682133 | 168 | S | 0 | 1,2 |
In order to predict the pitch count at which a manager will relieve a pitcher, we need to create an Accelerated Failure Time Model. This models the tenure, or the survival time, of a pitcher. AFT models do not handle time-dependent variables very well, so let’s roll-up each starting pitcher’s game data so all variables are time-independent.
# Previous fastball speed for season
fastball <- c("FC", "FF","FT")
offspeed <- c("CH", "CU", "EP", "FS", "KC", "KN", "SC", "SI", "SL")
aftdf <- df %>%
mutate(fast_speed = ifelse(df$pitch_type %in% fastball, end_speed, NA)) %>%
group_by(year(date), p_name) %>%
arrange(date, event_num) %>%
mutate(avg_fast_speed = cumMean(fast_speed),
sd_fast_speed = sqrt(cumsum(ifelse(is.na((fast_speed-cumMean(fast_speed))^2),
0, (fast_speed-cumMean(fast_speed))^2))/(seq_along(fast_speed)-1)),
sdz_fast_speed = (fast_speed-avg_fast_speed)/sd_fast_speed,
sdz_fast_speed = ifelse(is.na(sdz_fast_speed), 0, sdz_fast_speed)) %>%
ungroup() %>%
arrange(p_name, date, event_num)
aftdf <- df %>%
group_by(year(date), p_name) %>%
arrange(date, event_num) %>%
mutate(avg_fast_speed = cumMean(fast_speed),
sd_fast_speed = sqrt(cumsum(ifelse(is.na((fast_speed-cumMean(fast_speed))^2),
0, (fast_speed-cumMean(fast_speed))^2))/(seq_along(fast_speed)-1)),
sdz_fast_speed = (fast_speed-avg_fast_speed)/sd_fast_speed,
sdz_fast_speed = ifelse(is.na(sdz_fast_speed), 0, sdz_fast_speed)) %>%
ungroup() %>%
arrange(p_name, date, event_num)
# Collapse levels with separation concerns
# event: Triple Play (added to Double Play), Strikeout - DP (added to Strikeout), Sac Fly DP (added to Pop Out),
# Fielders Choice Out (added to Groundout), Fielders Choice (added to Groundout),
# Catcher Interference (added to X), Bunt Lineout (added to Bunt Pop Out), Batter Interference (added to X)
aftdf$event <- ifelse(aftdf$event %in% c("Fielders Choice Out", "Fielders Choice"), "Groundout",
ifelse(aftdf$event== "Triple Play", "Double Play",
ifelse(aftdf$event=="Strikeout - DP", "Strikeout",
ifelse(aftdf$event=="Sac Fly DP", "Pop Out",
ifelse(aftdf$event %in%
c("Catcher Interference", "Batter Interference"), "X",
ifelse(aftdf$event=="Bunt Lineout",
"Bunt Pop Out", aftdf$event))))))
aftdf$event<- ifelse(aftdf$event %in% c("Strike", "Ball"), "X", aftdf$event)
# Form together levels that are similar in meaning: Bunt Groundout and Bunt Pop Out -> Bunt Out.
# Forceout -> Groundout. Grounded Into DP -> Double Play.
aftdf$event <- ifelse(aftdf$event %in% c("Bunt Groundout", "Bunt Pop Out"), "Bunt Out",
ifelse(aftdf$event=="Forceout", "Groundout",
ifelse(aftdf$event=="Grounded Into DP", "Double Play", aftdf$event)))
# Proportion of fastballs that were below avg speed
# last b_score
# Proportion of pitches that were balls (type)
# number of at bats (ab_id)
# proportion of pitches with runners in scoring position
# Count of each event type: (Strikeout, Groundout, Single, Double Play, Walk, Pop Out, Home Run, Lineout, Double,
# Flyout, Hit By Pitch, Intent Walk, Field Error, Triple, Runner Out, Bunt Out, Sac Fly, Sac Bunt)
# proportion of batters faced that are lefty (stand)
# p_throws for pitcher
# attendance
# start_hour
# number of unique pitches thrown in a game (pitch_type)
# venue_name
# team
# opp_team
# league
# delay
# home
# reached (last of cumreached)
# days_since_last_start
# game_number
# windspeed
# temp_f
# sky_weather
aftdf_sum <- aftdf %>% select(p_name, date,sdz_fast_speed, fast_speed, b_score, type, ab_id, pitch_num,
event, stand, p_throws, attendance, start_hour, pitch_type, venue_name, team,
opp_team, league, delay, home, cumreached, sky_weather,
temp_f, windspeed, game_number, days_since_last_start,on_2b, on_3b,
subbed, end, event_num, inning) %>%
arrange(date, p_name, event_num) %>%
group_by(date, p_name) %>%
arrange(date, event_num) %>%
summarize(
fastball_count = sum(!is.na(fast_speed)),
prop_fastball_slow = ifelse(sum(!is.na(fast_speed))==0, 0.40, sum(sdz_fast_speed<0)/sum(!is.na(fast_speed))),
runs_allowed = last(b_score),
prop_balls = mean(type=="B"),
batters_faced = length(unique(ab_id)),
prop_pitches_scoring_position = mean(on_2b==1 | on_3b == 1),
strikeouts = sum(event%in% c("Strikeout", "Strikeout - DP")),
groundouts = sum(event=="Groundout"),
singles = sum(event=="Single"),
double_plays = sum(event%in%c("Double Play", "Grounded Into DP")),
fielders_choice = sum(event%in% c("Fielders Choice", "Fielders Choice Out")),
walks = sum(event=="Walk"),
pop_outs = sum(event=="Pop Out"),
home_runs = sum(event=="Home Run"),
lineouts = sum(event=="Lineout"),
doubles = sum(event=="Double"),
flyouts = sum(event=="Flyout"),
hit_by_pitch = sum(event == "Hit By Pitch"),
intentional_walk = sum(event=="Intent Walk"),
triples = sum(event=="Triple"),
runner_out = sum(event == "Runner Out"),
bunt_out = sum(event%in% c("Bunt Groundout", "Sacrifice Bunt DP", "Bunt Lineout", "Bunt Pop Out")),
sac_fly = sum(event%in% c("Sac Fly", "Sac Fly DP")),
sac_bunt = sum(event=="Sac Bunt"),
prop_batters_lefty = mean(stand=="L"),
p_throws = last(p_throws),
attendance = last(attendance),
start_hour = last(start_hour),
num_p_types = length(unique(pitch_type)),
venue_name = last(venue_name),
team_year = paste(last(team), last(year(date))),
opp_team_year = paste(last(opp_team), last(year(date))),
league = last(league),
delay = last(delay),
home = last(home),
reached = last(cumreached),
game_number = last(game_number),
windspeed = last(windspeed),
temp_f = last(temp_f),
sky_weather = last(sky_weather),
inning = max(inning),
pitch_count = max(end),
subbed = max(subbed)
) %>% ungroup() %>%
mutate(
# Collapse levels for sky_weather
# snow, rain, drizzle into overcast
# dome into roof closed
sky_weather = ifelse(sky_weather %in% c("snow", "rain", "drizzle"), "overcast",
ifelse(sky_weather %in% c("dome"), "roof closed", sky_weather))
)
# Split into training and validation (use 2018 Boston Red Sox as Validation)
valid_aft <- aftdf_sum %>% filter(team_year == 'bos 2018')
train_aft <- aftdf_sum %>% filter(team_year != 'bos 2018', year(as.Date(date)) == 2018)
One of the main assumptions of an AFT model is the failure (relief) time follows a specific distribution. We can test a few distributions to see which one best fits the errors of the model:
With each distribution test, we will fit the full model. Once we have selected the best distribution for the relief time, we will select the most informative variables for the final AFT model.
#AFT modeling
# Create an AFT model with different distributions: we must use the full model/
aft.w <- flexsurvreg(Surv(pitch_count, subbed == 1) ~ prop_fastball_slow+runs_allowed+prop_balls+
batters_faced+prop_pitches_scoring_position+strikeouts+groundouts+singles+
double_plays+fielders_choice+walks+pop_outs+home_runs+lineouts+doubles+
flyouts+hit_by_pitch+intentional_walk+triples+runner_out+bunt_out+
sac_fly+sac_bunt+prop_batters_lefty+p_throws+start_hour+
num_p_types+league+home+reached+game_number+windspeed+temp_f+sky_weather,
data = train_aft, dist = "weibull")
aft.e <- flexsurvreg(Surv(pitch_count, subbed == 1) ~ prop_fastball_slow+runs_allowed+prop_balls+
batters_faced+prop_pitches_scoring_position+strikeouts+groundouts+singles+
double_plays+fielders_choice+walks+pop_outs+home_runs+lineouts+doubles+
flyouts+hit_by_pitch+intentional_walk+triples+runner_out+bunt_out+
sac_fly+sac_bunt+prop_batters_lefty+p_throws+start_hour+
num_p_types+league+home+reached+game_number+windspeed+temp_f+sky_weather,
data = train_aft, dist = "expo")
aft.ll <- flexsurvreg(Surv(pitch_count, subbed == 1) ~ prop_fastball_slow+runs_allowed+prop_balls+
batters_faced+prop_pitches_scoring_position+strikeouts+groundouts+singles+
double_plays+fielders_choice+walks+pop_outs+home_runs+lineouts+doubles+
flyouts+hit_by_pitch+intentional_walk+triples+runner_out+bunt_out+
sac_fly+sac_bunt+prop_batters_lefty+p_throws+start_hour+
num_p_types+league+delay+home+reached+game_number+windspeed+temp_f+sky_weather,
data = train_aft, dist = "llogis")
aft.ln <- flexsurvreg(Surv(pitch_count, subbed == 1) ~ prop_fastball_slow+runs_allowed+prop_balls+
batters_faced+prop_pitches_scoring_position+strikeouts+groundouts+singles+
double_plays+fielders_choice+walks+pop_outs+home_runs+lineouts+doubles+
flyouts+hit_by_pitch+intentional_walk+triples+runner_out+bunt_out+
sac_fly+sac_bunt+prop_batters_lefty+p_throws+start_hour+
num_p_types+league+delay+home+reached+game_number+windspeed+temp_f+sky_weather,
data = train_aft, dist = "lnorm")
aft.g <- flexsurvreg(Surv(pitch_count, subbed == 1) ~ prop_fastball_slow+runs_allowed+prop_balls+
batters_faced+prop_pitches_scoring_position+strikeouts+groundouts+singles+
double_plays+fielders_choice+walks+pop_outs+home_runs+lineouts+doubles+
flyouts+hit_by_pitch+intentional_walk+triples+runner_out+bunt_out+
sac_fly+sac_bunt+prop_batters_lefty+p_throws+start_hour+
num_p_types+league+delay+home+reached+game_number+windspeed+temp_f+sky_weather,
data = train_aft, dist = "gamma")
aft.f <- flexsurvreg(Surv(pitch_count, subbed == 1) ~ prop_fastball_slow+runs_allowed+prop_balls+
batters_faced+prop_pitches_scoring_position+strikeouts+groundouts+singles+
double_plays+fielders_choice+walks+pop_outs+home_runs+lineouts+doubles+
flyouts+hit_by_pitch+intentional_walk+triples+runner_out+bunt_out+
sac_fly+sac_bunt+prop_batters_lefty+p_throws+start_hour+
num_p_types+league+delay+home+reached+game_number+windspeed+sky_weather,
data = train_aft, dist = "genf")
We can assess the graphical fit of certain distributions by plotting the theoretical and actual cumulative hazard functions.
Although none of the distributions are perfect, the Weibull, Gamma, and F distributions seem to fit the data best. We can use the Likelihood Ratio Test (LRT) to compare these nested distributions statistically.
The Weibull distribution is nested within the Gamma distribution, which is nested within the F distribution. Therefore, we will test the Weibull with the Gamma distribution first.
# Test Weibull vs. Gamma
pval.w.g <- 1 - pchisq((-2*(aft.w$loglik-aft.g$loglik)), 1)
The p-value is extremely high, ~1, meaning we fail to reject the null hypothesis that the Gamma distribution provides significantly more information than the Weibull. Therefore, the Weibull is statistically a better fit than the Gamma.
Now let’s test the Weibull with the F distribution.
# Test F vs. Weibulll
pval.w.f <- 1 - pchisq((-2*(aft.w$loglik-aft.f$loglik)), 2)
The resulting p-value of the LRT, ~0, is extremely significant. This indicates that the F distribution provides significantly more information than the Weibull distribution. Because the survreg() function in R does not handle an F distribution, however, we will use the Weibull distribution for the rest of the model process.
The next phase of the AFT model process is selecting the most informative subset of explanatory variables. We will do this by minimizing the AIC of the model with the stepAIC() function.
aft_final <- stepAIC(survreg(Surv(pitch_count, subbed == 1) ~ prop_fastball_slow+runs_allowed+prop_balls+
batters_faced+prop_pitches_scoring_position+strikeouts+groundouts+singles+
double_plays+fielders_choice+walks+pop_outs+home_runs+lineouts+doubles+
flyouts+hit_by_pitch+intentional_walk+triples+runner_out+bunt_out+
sac_fly+sac_bunt+prop_batters_lefty+p_throws+start_hour+
num_p_types+league+home+reached+game_number+windspeed+temp_f+sky_weather,
data = train_aft, dist = "weibull"))
Let’s take a look at the model performance on the validation data set: the 2018 Boston Red Sox. We can look at, on average, the difference in the actual number of pitches a starter threw and the model’s predicted number of pitches.
valid_aft$pred_pitch_count <- round(predict(aft_final, newdata=valid_aft, type="response"), 0)
mean(abs(valid_aft$pred_pitch_count - valid_aft$pitch_count))
## [1] 8.660494
The model’s predicted number of pitches a starter will throw before being relived is, on average, 8.66 pitches off from the actual pitch count for the 2018 Boston Red Sox. We can also look at the survival curve as predicted by our final AFT model versus the actual survival curve to evaluate the model.
Overall, I would say the model did all right in its predictions. The margin of error of 9 pitches is usually less than an inning of baseball, which is a pretty good margin.
Let’s take a look at the final list of variables that minimize the AIC as well as their respective percentage effect on survival time and p-values. The variables that are marked out are not statistically significant at a threshold of 0.03.
Note the interpretation of the percentage effect on survival time: the percentage effect on the expected pitch count for a starting pitcher at relief for every unit increase in the variable.
| Variable | Percentage Effect on Survival Time | P-Value |
|---|---|---|
| batters_faced | 2.98 % | 0.0000 |
| intentional_walk | 1.06 % | 0.0022 |
| double_plays | 0.76 % | 0.0000 |
| num_p_types | 0.71 % | 0.0000 |
| walks | 0.61 % | 0.0000 |
| home | 0.42 % | 0.0101 |
| prop_balls | 0.34 % | 0.0000 |
| runs_allowed | 0.26 % | 0.0004 |
| game_number | 0.01 % | 0.2198 |
| temp_f | -0.04 % | 0.0000 |
| singles | -0.24 % | 0.0519 |
| sky_weathercloudy | -0.5 % | 0.0715 |
| sky_weatherroof closed | -0.52 % | 0.0705 |
| doubles | -0.61 % | 0.0000 |
| hit_by_pitch | -0.75 % | 0.0003 |
| triples | -0.81 % | 0.0024 |
| pop_outs | -0.82 % | 0.0000 |
| sky_weatherpartly cloudy | -0.88 % | 0.0002 |
| prop_fastball_slow | -0.91 % | 0.0211 |
| fielders_choice | -0.97 % | 0.0073 |
| flyouts | -1.09 % | 0.0000 |
| lineouts | -1.19 % | 0.0000 |
| groundouts | -1.2 % | 0.0000 |
| leagueNL | -1.21 % | 0.0000 |
| sky_weatherovercast | -1.38 % | 0.0001 |
| home_runs | -1.53 % | 0.0000 |
| sky_weathersunny | -1.59 % | 0.0000 |
| runner_out | -1.76 % | 0.0000 |
| reached | -1.86 % | 0.0000 |
| prop_pitches_scoring_position | -2.14 % | 0.0246 |
| bunt_out | -2.55 % | 0.0000 |
There are five interesting takeaways I want to highlight from this table.
This model can be useful in understanding when MLB managers are likely to pull a starting pitcher. By understanding the cumulative events that have occurred up until the current pitch, we can model what the expected pitch count of the pitcher will be at the time of relief. This can give a manager a sense of time before the pitcher is likely ready to be pulled so the team can prepare for relief and get the bullpen warmed up.
However, there are a few downfalls of the analysis. To start, the expected number of pitches thrown does not tell the whole story of the time of relief. Pitchers may be pulled at lower levels of pitch count for two reasons:
These two scenarios tell two completely different stories at the same number of pitches thrown. The first has thrown a great game and was relieved in the 8th inning while the second has thrown a terrible game and was relieved in the 3rd inning. In further analysis, it would be useful to incorporate the reasons of relief (i.e., fatigue or poor performance) to understand these two stories.
Another extension of this analysis would be to model pitcher fatigue. This current model imitates the thinking of the MLB managers: when do they usually pull the starter? However, managers make mistakes. There are many cases when the starter is pulled too late and the damage they have done is irreversible. On the other hand, there are times when the manager pulls the starter prematurely and the relief pitcher does much worse than the start would have performed. It may be more useful to model pitcher fatigue at each pitch number to understand when should the starter be pulled? This could be a helpful tool for managers to utilize in optimizing the time at which a starter should be taken out of the game.