Starting Pitcher Survival

Motivation

As I sat in quarantine, sulking about the delay of the 2020 MLB season, I realized there was no better way to numb the pain of missing baseball than to crunch some numbers behind some pitcher data. An important aspect of baseball, especially during the postseason, is knowing how long a starting pitcher will last in a game. This can be helpful for the MLB managers in understanding when they need to start warming up the bullpen and how many relief pitchers will be required to finish a game.

This notebook looks into the factors that affect the pitch count at which a manager pulls the starting pitcher from the game. In other words, I use survival analysis to attempt to model the behavior of an MLB manager: when do managers decide to relieve a starting pitcher? This assumes that the MLB managers make the correct choice of pulling starting pitchers, which is not always a true assumption. Nevertheless, here are a few takeaways from the analysis:

If the pitcher is starting a home game, there is a 0.42% increase in the expected pitches he will throw than if he were starting an away game.
“Clear” skies is the best weather to increase the expected pitch count.
For every degree Fahrenheit increase in the temperature at the game, the starting pitcher is expected to throw 0.04% fewer pitches before relief.
Surprisingly, the number of ground balls allowed decreases the expected pitch count before relief. Ground balls usually result in outs or other beneficial events for the pitcher’s team. This could be a result of pitchers working through at-bats in fewer pitches, going further into the game with respect to innings but not pitch count.
NL starting pitchers throw 1.21% fewer pitches before relief than do AL starting pitchers.

The notebook dives into the data preparation, engineering, and modeling that support these takeaways. Take a look!

Required Packages

These are the following packages that are used throughout the notebook.

library(dplyr)
library(stringr)
library(zoo)
library(lubridate)
library(survminer)
library(flexsurv)
library(knitr)
library(twextras)
library(kableExtra)
library(scales)

Load and Merge the Data

This notebook uses four data sets from Kaggle:

Atbats: Game information that cannot change over the course of an at-bat.
Pitches: Pitch-level data that match up with the Atbats data set.
Player_names: Names of the players that match up with player IDs in the Atbats data set.
Games: Game-level data.

These data sets all provide MLB information from seasons 2015-2018, however this notebook will only focus on the most recent year: 2018.

# Read in the necessary data
atbats <- read.csv("C:/Users/Jackson Cabell/Documents/Pitcher_Survival/Data/atbats.csv",
                   stringsAsFactors = FALSE)
pitches <- read.csv("C:/Users/Jackson Cabell/Documents/Pitcher_Survival/Data/pitches.csv",
                   stringsAsFactors = FALSE)
games <- read.csv("C:/Users/Jackson Cabell/Documents/Pitcher_Survival/Data/games.csv",
                   stringsAsFactors = FALSE)
player_names <- read.csv("C:/Users/Jackson Cabell/Documents/Pitcher_Survival/Data/player_names.csv",
                   stringsAsFactors = FALSE)

# Add atbat info
pitch <- left_join(pitches, atbats, by = 'ab_id')

# Add game info
pitch <- left_join(pitch, games, by = 'g_id')

# Add pitcher names
player_names$p_name <- paste(player_names$first_name, player_names$last_name)
pitch <- left_join(pitch, dplyr::select(player_names,c(id, p_name)), by = c('pitcher_id' = 'id'))

# Add batter names
player_names$b_name <- paste(player_names$first_name, player_names$last_name)
pitch <- left_join(pitch, dplyr::select(player_names,c(id, b_name)), by = c('batter_id' = 'id'))

# Take out unnecessary columns
pitch <- pitch %>%
  select(-c(ax, ay, az, sz_bot, sz_top, vx0, vy0, vz0, x, x0, y, y0, z0, pfx_x, pfx_z))

Determine the Starting Pitchers

Now that we have the data merged to include game-level and pitch-level data for each at-bat, let’s determine which pitches were thrown by the starting pitchers. We will only focus on these data points for the remainder of the analysis.

# Create a dataset of game_id and starting pitchers
starting_pitchers_away <- pitch %>%
  filter(inning == 1,
         o %in% c(0, 1),
         top == "False") %>%
  arrange(g_id, event_num) %>%
  group_by(g_id) %>%
  slice(1) %>%
  select(g_id, p_name, pitcher_id) %>%
  rename(starter_away_name = p_name,
         starter_away_id = pitcher_id) %>%
  mutate(home = 0)

starting_pitchers_home <- pitch %>%
  filter(inning == 1,
         o %in% c(0, 1),
         top == "True") %>%
  arrange(g_id, event_num) %>%
  group_by(g_id) %>%
  slice(1) %>%
  select(g_id, p_name, pitcher_id) %>%
  rename(starter_home_name = p_name,
         starter_home_id = pitcher_id) %>%
  mutate(home = 1)

# Get pitch data for starting away pitchers
pitch_away <- inner_join(pitch, starting_pitchers_away, by = c('g_id' = 'g_id',
                                                               'pitcher_id' = 'starter_away_id')) %>%
  select(-c(starter_away_name))

# Get pitch data for starting home pitchers
pitch_home <- inner_join(pitch, starting_pitchers_home, by = c('g_id' = 'g_id',
                                                               'pitcher_id' = 'starter_home_id')) %>%
  select(-c(starter_home_name))

# Combine both datasets to have pitch data only for starting pitchers
starters <- rbind(pitch_home, pitch_away)

#Create unique identifier for pitcher and game
starters$gp_id <- as.numeric(paste(starters$g_id, starters$pitcher_id, sep=''))
starters <- starters %>%
  arrange(gp_id,event_num)

# Fill in team for pitcher and the opp team
starters[which(starters$home==0),"team"] <- starters[which(starters$home==0), "away_team"]
starters[which(starters$home==0),"opp_team"] <- starters[which(starters$home==0), "home_team"]
starters[which(starters$home==1),"team"] <- starters[which(starters$home==1), "home_team"]
starters[which(starters$home==1),"opp_team"] <- starters[which(starters$home==1), "away_team"]

Feature Engineering

There are several variables we can create with the information provided by these data sets. Most importantly, we need to calculate the two target variables that are necessary to execute survival analysis:

What is the event? The pitcher is relieved by the manager: subbed
When does the event occur? The pitch count of the pitcher when he is relieved: pitch_count.

# Cumulative pitch-count per pitcher in a game
starters <- starters %>%
  mutate(flag = 1) %>%
  group_by(gp_id) %>%
  arrange(event_num) %>%
  mutate(p_count = cumsum(flag)) %>%
  ungroup() %>%
  select(-c(flag)) %>%
  arrange(gp_id, event_num) %>%
  mutate(subbed = 0,# Add target variable for subbed
         rownum = 1:n())

# Determine last pitch thrown (assume the pitchers are only subbed in-between batters)
last <- starters %>%
  group_by(gp_id) %>%
  filter(p_count == max(p_count)) %>%
  mutate(subbed = 1)

starters[which(starters$rownum %in% c(last$rownum)),"subbed"] <- 1

Next, we will create several variables, both time-dependent and time-independent, that we think may be useful for predicting or explaining when a pitcher will be relieved.

The engineered time-dependent variables are the following:

The number of strikeouts the pitcher has thrown in the game at the time of the pitch.
The number of walks the pitcher has allowed in the game.
The speed of the previous fastball thrown: standardized to be positive if above the in-game average fastball speed, or negative if below the average.
The “nasty” score of the previous off-speed thrown: standardized to be positive if above the in-game average nasty score, or negative if below the average. Note: nasty was a variable already in the data, I did not create this measure.
Number of batters that reached a base.
Run differential of the game: pitcher’s team score - batter’s team score.

The engineered time-independent variables are the following:

Number of days since the pitcher last started in the season.
Number of the game started for the pitcher in the season.
Wind speed at the stadium on the day of the game.
Temperature (Fahrenheit) at the stadium on the day of the game.
Weather at the stadium on the day of the game.
League of the team the pitcher is on: American or National.

# Add indicator if batter gets on or gets out
outs <- starters %>%
  arrange(gp_id, event_num) %>%
  group_by(ab_id) %>%
  filter(pitch_num == max(pitch_num),
         grepl("out", str_replace_all((str_to_lower(event)), " ", ""))) %>%
  mutate(out = 1)

starters$out <- ifelse(starters$rownum %in% outs$rownum, 1, 0)

# Add last pitch indicator
last_pitch_df <- starters %>%
  arrange(gp_id, event_num) %>%
  group_by(ab_id) %>%
  filter(pitch_num == max(pitch_num)) %>%
  mutate(last_pitch = 1)
starters$last_pitch <- ifelse(starters$rownum %in% last_pitch_df$rownum, 1, 0)

# Add strikeout indicator
strikeouts <- starters %>%
  arrange(gp_id, event_num) %>%
  mutate(flag = 1) %>%
  group_by(gp_id) %>%
  filter(last_pitch==1,
         grepl("strikeout", str_replace_all((str_to_lower(event)), " ", ""))) %>%
  mutate(strikeouts = 1) %>%
  ungroup()

starters <- left_join(starters, select(strikeouts, c(rownum, strikeouts)), by = 'rownum')

# Add walk indicator
walks <- starters %>%
  arrange(gp_id, event_num) %>%
  mutate(flag = 1) %>%
  group_by(gp_id) %>%
  filter(last_pitch==1,
         grepl("walk", str_replace_all((str_to_lower(event)), " ", ""))) %>%
  mutate(walks = 1) %>%
  ungroup()

starters <- left_join(starters, select(walks, c(rownum, walks)), by = 'rownum')

# Cumulative strikeouts and walks
starters <- starters %>%
  mutate(walks = ifelse(is.na(walks), 0, walks),
         strikeouts = ifelse(is.na(strikeouts), 0, strikeouts)) %>%
  group_by(gp_id) %>%
  arrange(event_num) %>%
  mutate(cumstrikeouts = cumsum(strikeouts),
         cumwalks = cumsum(walks)) %>%
  ungroup() %>%
  arrange(gp_id)

# Previous fastball speed
fastball <- c("FC", "FF","FT")
offspeed <- c("CH", "CU", "EP", "FS", "KC", "KN", "SC", "SI", "SL")
starters <- starters %>%
  mutate(fast_speed = ifelse(starters$pitch_type %in% fastball, end_speed, NA)) %>%
  group_by(gp_id) %>%
  arrange(event_num) %>%
  mutate(avg_fast_speed = mean(fast_speed, na.rm=TRUE),
         sd_fast_speed = sd(fast_speed, na.rm=TRUE),
         sdz_fast_speed = na.locf((fast_speed-avg_fast_speed)/sd_fast_speed, na.rm = FALSE)) %>%
  ungroup() %>%
  arrange(gp_id)

# Replace NAs w/ avg (0)
starters$sdz_fast_speed <- ifelse(is.na(starters$sdz_fast_speed), 0, starters$sdz_fast_speed)

# Previous nasty score for offspeed
starters <- starters %>%
  mutate(nasty_offspeed = ifelse(starters$pitch_type %in% offspeed, nasty, NA)) %>%
  group_by(gp_id) %>%
  arrange(event_num) %>%
  mutate(avg_nasty_offspeed = mean(nasty_offspeed, na.rm=TRUE),
         sd_nasty_offspeed = sd(nasty_offspeed, na.rm=TRUE),
         sdz_nasty_offspeed = na.locf((nasty_offspeed-avg_nasty_offspeed)/sd_nasty_offspeed, na.rm = FALSE)) %>%
  ungroup() %>%
  arrange(gp_id)

# Replace NAs w/ avg (0)
starters$sdz_nasty_offspeed <- ifelse(is.na(starters$sdz_nasty_offspeed), 0, starters$sdz_nasty_offspeed)


# Change event type to ball or strike if not the last pitch of the at-bat
starters$event <- ifelse(starters$last_pitch==0,
                         ifelse(starters$type=="S", "Strike", ifelse(starters$type=="B", "Ball", starters$type)),
                         starters$event)

# Calculate cumulative batters reached base in the game
reached <- starters %>%
  mutate(reached = ifelse(last_pitch==1 & out==0, 1, 0)) %>%
  filter(reached == 1) %>%
  group_by(gp_id) %>%
  arrange(event_num) %>%
  mutate(cumreached = cumsum(reached)) %>%
  ungroup() %>%
  arrange(gp_id, event_num) %>%
  select(cumreached, rownum)

starters <- left_join(starters, reached,'rownum')

starters <- starters %>%
  group_by(gp_id) %>%
  arrange(event_num) %>%
  mutate(cumreached = na.locf(cumreached, na.rm = FALSE)) %>%
  ungroup() %>%
  arrange(gp_id, event_num)

#Replace NAs with 0s
starters$cumreached <- ifelse(is.na(starters$cumreached), 0, starters$cumreached)

# Calculate days since last start for a pitcher within a season
starters$date <- as.Date(starters$date)
starters$start_hour <- hour(as.POSIXct(starters$start_time,format="%H:%M"))+12
starters$start_hour <- ifelse(grepl("AM", starters$start_time), starters$start_hour-12, starters$start_hour)
starters$start_hour <- ifelse(starters$start_hour==24, 12, starters$start_hour)
starters <- starters %>%
  mutate(start_year = year(date)) %>%
  group_by(start_year, p_name) %>%
  arrange(date, event_num) %>%
  mutate(days_since_last_start = c(0,as.numeric(diff(date)))) %>%
  mutate(days_since_last_start = ifelse(days_since_last_start==0, NA, days_since_last_start)) %>%
  mutate(days_since_last_start = na.locf(days_since_last_start, na.rm = FALSE)) %>%
  mutate(days_since_last_start = ifelse(is.na(days_since_last_start), 168, days_since_last_start)) %>%
  ungroup() 


# Calculate what start number of the year the game is (1st start, 2nd start, etc.)
starters <- starters %>%
  group_by(start_year, p_name) %>%
  arrange(date, event_num) %>%
  mutate(game_number = cumsum(!duplicated(date))) %>%
  ungroup() 

# Split up weather into windspeed, temperature (F), and weather (fair, partly cloudy, etc.)
numextract <- function(string){ 
  str_extract(string, "\\-*\\d+\\.*\\d*")
} 
starters <- starters %>%
  mutate(windspeed = as.numeric(numextract(wind)),
         temp_f = as.numeric(numextract(weather)),
         sky_weather = str_extract(starters$weather, '\\b[^,]+$'))

# Calculate run differential
starters$run_diff <- starters$p_score - starters$b_score

# Make sure starting pitchers who throw complete games are accounted for
starters$subbed <- ifelse(starters$inning==9 & starters$o==3 & starters$last_pitch==1, 0, starters$subbed)

# Add indicator for NL
nl_teams <- c("atl", "mia", "nyn", "phi","was", "chn","cin", "mil", "pit", "sln", "ari",
              "col", "lan", "sdn", "sfn")
starters$league <- ifelse(starters$team %in% nl_teams, "NL", "AL")

# Collect final list of variables for analysis
df <- starters %>%
  select(end_speed, code, b_score, ab_id, b_count, s_count, outs, pitch_num,
         on_1b, on_2b, on_3b, event, g_id, inning, o, p_score, p_throws,
         stand, top, attendance, date, start_time, venue_name, delay, p_name, b_name, home, gp_id, team,league, 
         opp_team, p_count, event_num, subbed, out, last_pitch, cumstrikeouts, cumwalks, sdz_fast_speed,
         sdz_nasty_offspeed,run_diff, days_since_last_start, game_number, windspeed, temp_f,
         sky_weather, start_hour, type, code, cumreached, pitch_type)
df <- df %>%
  group_by(gp_id) %>%
  arrange(event_num) %>%
  mutate(start = p_count-1,
         end = p_count) %>%
  ungroup()

Example Pitcher

Here is Chris Sale with the 2018 Boston Red Sox: an example of what the data set looks like so far with the created target variables (pitch_count is split-up into “start” and “end” to include time-dependent variables):

# View Chris Sale
preview <- df %>% select(date, game_number, p_name, start, end, subbed, type, everything()) %>%
       filter(p_name == "Chris Sale", year(as.Date(date))==2018) %>%
       arrange(date, end) %>% head(10)

kable(preview, align = 'c', caption = "Chris Sale: 2018 Boston Red Sox") %>%
  kable_styling(font_size = 12, position = 'center', full_width = FALSE, bootstrap_options = "hover")  %>%
  scroll_box(width = "100%", height = "350px")

Chris Sale: 2018 Boston Red Sox
date	game_number	p_name	start	end	type	event_num	event	b_count	s_count	pitch_num	o	inning	p_throws	stand	attendance	start_hour	venue_name	temp_f	sky_weather	team	league	opp_team	sdz_fast_speed	sdz_nasty_offspeed	days_since_last_start	code	bs_count
2018-03-29	1	Chris Sale	0	1	S	25	X	0	0	1	1	1	L	R	31042	16	Tropicana Field	72	dome	bos	AL	tba	0.0926013	0.0000000	168	C	0,0
2018-03-29	1	Chris Sale	1	2	S	26	X	0	1	2	1	1	L	R	31042	16	Tropicana Field	72	dome	bos	AL	tba	0.0926013	0.1341089	168	C	0,1
2018-03-29	1	Chris Sale	2	3	B	27	X	0	2	3	1	1	L	R	31042	16	Tropicana Field	72	dome	bos	AL	tba	0.1965416	0.1341089	168	B	0,2
2018-03-29	1	Chris Sale	3	4	B	28	X	1	2	4	1	1	L	R	31042	16	Tropicana Field	72	dome	bos	AL	tba	0.1965416	0.8577383	168	B	1,2
2018-03-29	1	Chris Sale	4	5	S	29	X	2	2	5	1	1	L	R	31042	16	Tropicana Field	72	dome	bos	AL	tba	1.2879142	0.8577383	168	F	2,2
2018-03-29	1	Chris Sale	5	6	S	30	X	2	2	6	1	1	L	R	31042	16	Tropicana Field	72	dome	bos	AL	tba	1.4438245	0.8577383	168	S	2,2
2018-03-29	1	Chris Sale	6	7	B	34	X	0	0	1	2	1	L	L	31042	16	Tropicana Field	72	dome	bos	AL	tba	-1.2586219	0.8577383	168	B	0,0
2018-03-29	1	Chris Sale	7	8	S	35	X	1	0	2	2	1	L	L	31042	16	Tropicana Field	72	dome	bos	AL	tba	-1.2586219	-1.4682133	168	C	1,0
2018-03-29	1	Chris Sale	8	9	S	36	X	1	1	3	2	1	L	L	31042	16	Tropicana Field	72	dome	bos	AL	tba	-0.4790700	-1.4682133	168	F	1,1
2018-03-29	1	Chris Sale	9	10	S	37	X	1	2	4	2	1	L	L	31042	16	Tropicana Field	72	dome	bos	AL	tba	1.5477648	-1.4682133	168	S	1,2

Accelerated Failure Time Modeling

In order to predict the pitch count at which a manager will relieve a pitcher, we need to create an Accelerated Failure Time Model. This models the tenure, or the survival time, of a pitcher. AFT models do not handle time-dependent variables very well, so let’s roll-up each starting pitcher’s game data so all variables are time-independent.

# Previous fastball speed for season
fastball <- c("FC", "FF","FT")
offspeed <- c("CH", "CU", "EP", "FS", "KC", "KN", "SC", "SI", "SL")
aftdf <- df %>%
  mutate(fast_speed = ifelse(df$pitch_type %in% fastball, end_speed, NA)) %>%
  group_by(year(date), p_name) %>%
  arrange(date, event_num) %>%
  mutate(avg_fast_speed = cumMean(fast_speed),
         sd_fast_speed = sqrt(cumsum(ifelse(is.na((fast_speed-cumMean(fast_speed))^2),
                                            0, (fast_speed-cumMean(fast_speed))^2))/(seq_along(fast_speed)-1)),
         sdz_fast_speed = (fast_speed-avg_fast_speed)/sd_fast_speed,
         sdz_fast_speed = ifelse(is.na(sdz_fast_speed), 0, sdz_fast_speed)) %>%
  ungroup() %>%
  arrange(p_name, date, event_num)

aftdf <- df %>%
  group_by(year(date), p_name) %>%
  arrange(date, event_num) %>%
  mutate(avg_fast_speed = cumMean(fast_speed),
         sd_fast_speed = sqrt(cumsum(ifelse(is.na((fast_speed-cumMean(fast_speed))^2),
                                            0, (fast_speed-cumMean(fast_speed))^2))/(seq_along(fast_speed)-1)),
         sdz_fast_speed = (fast_speed-avg_fast_speed)/sd_fast_speed,
         sdz_fast_speed = ifelse(is.na(sdz_fast_speed), 0, sdz_fast_speed)) %>%
  ungroup() %>%
  arrange(p_name, date, event_num)

# Collapse levels with separation concerns
# event: Triple Play (added to Double Play), Strikeout - DP (added to Strikeout), Sac Fly DP (added to Pop Out),
# Fielders Choice Out (added to Groundout), Fielders Choice (added to Groundout),
# Catcher Interference (added to X), Bunt Lineout (added to Bunt Pop Out), Batter Interference (added to X)
aftdf$event <- ifelse(aftdf$event %in% c("Fielders Choice Out", "Fielders Choice"), "Groundout",
                       ifelse(aftdf$event== "Triple Play", "Double Play",
                              ifelse(aftdf$event=="Strikeout - DP", "Strikeout",
                                     ifelse(aftdf$event=="Sac Fly DP", "Pop Out",
                                            ifelse(aftdf$event %in%
                                                     c("Catcher Interference", "Batter Interference"), "X",
                                                   ifelse(aftdf$event=="Bunt Lineout",
                                                          "Bunt Pop Out", aftdf$event))))))

aftdf$event<- ifelse(aftdf$event %in% c("Strike", "Ball"), "X", aftdf$event)

# Form together levels that are similar in meaning: Bunt Groundout and Bunt Pop Out -> Bunt Out. 
# Forceout -> Groundout. Grounded Into DP -> Double Play. 
aftdf$event <- ifelse(aftdf$event %in% c("Bunt Groundout", "Bunt Pop Out"), "Bunt Out",
                       ifelse(aftdf$event=="Forceout", "Groundout",
                              ifelse(aftdf$event=="Grounded Into DP", "Double Play", aftdf$event)))


# Proportion of fastballs that were below avg speed
# last b_score
# Proportion of pitches that were balls (type)
# number of at bats (ab_id)
# proportion of pitches with runners in scoring position
# Count of each event type: (Strikeout, Groundout, Single, Double Play, Walk, Pop Out, Home Run, Lineout, Double,
# Flyout, Hit By Pitch, Intent Walk, Field Error, Triple, Runner Out, Bunt Out, Sac Fly, Sac Bunt)
# proportion of batters faced that are lefty (stand)
# p_throws for pitcher
# attendance
# start_hour
# number of unique pitches thrown in a game (pitch_type)
# venue_name
# team
# opp_team
# league 
# delay
# home
# reached (last of cumreached)
# days_since_last_start
# game_number
# windspeed
# temp_f
# sky_weather

aftdf_sum <- aftdf %>% select(p_name, date,sdz_fast_speed, fast_speed, b_score, type, ab_id, pitch_num,
                              event, stand, p_throws, attendance, start_hour, pitch_type, venue_name, team,
                              opp_team, league, delay, home, cumreached, sky_weather,
                              temp_f, windspeed, game_number, days_since_last_start,on_2b, on_3b,
                              subbed, end, event_num, inning) %>%
  arrange(date, p_name, event_num) %>%
  group_by(date, p_name) %>%
  arrange(date, event_num) %>%
  summarize(
    fastball_count = sum(!is.na(fast_speed)),
    prop_fastball_slow = ifelse(sum(!is.na(fast_speed))==0, 0.40, sum(sdz_fast_speed<0)/sum(!is.na(fast_speed))),
    runs_allowed = last(b_score),
    prop_balls = mean(type=="B"),
    batters_faced = length(unique(ab_id)),
    prop_pitches_scoring_position = mean(on_2b==1 | on_3b == 1),
    strikeouts = sum(event%in% c("Strikeout", "Strikeout - DP")),
    groundouts = sum(event=="Groundout"),
    singles = sum(event=="Single"),
    double_plays = sum(event%in%c("Double Play", "Grounded Into DP")),
    fielders_choice = sum(event%in% c("Fielders Choice", "Fielders Choice Out")),
    walks = sum(event=="Walk"),
    pop_outs = sum(event=="Pop Out"),
    home_runs = sum(event=="Home Run"),
    lineouts = sum(event=="Lineout"),
    doubles = sum(event=="Double"),
    flyouts = sum(event=="Flyout"),
    hit_by_pitch = sum(event == "Hit By Pitch"),
    intentional_walk = sum(event=="Intent Walk"),
    triples = sum(event=="Triple"),
    runner_out = sum(event == "Runner Out"),
    bunt_out = sum(event%in% c("Bunt Groundout", "Sacrifice Bunt DP", "Bunt Lineout", "Bunt Pop Out")),
    sac_fly = sum(event%in% c("Sac Fly", "Sac Fly DP")),
    sac_bunt = sum(event=="Sac Bunt"),
    prop_batters_lefty = mean(stand=="L"),
    p_throws = last(p_throws),
    attendance = last(attendance),
    start_hour = last(start_hour),
    num_p_types = length(unique(pitch_type)),
    venue_name = last(venue_name),
    team_year = paste(last(team), last(year(date))),
    opp_team_year = paste(last(opp_team), last(year(date))),
    league = last(league),
    delay = last(delay),
    home = last(home),
    reached = last(cumreached),
    game_number = last(game_number),
    windspeed = last(windspeed),
    temp_f = last(temp_f),
    sky_weather = last(sky_weather),
    inning = max(inning),
    pitch_count = max(end),
    subbed = max(subbed)
  ) %>% ungroup() %>%
  mutate(
    # Collapse levels for sky_weather
    # snow, rain, drizzle into overcast
    # dome into roof closed
        
    sky_weather = ifelse(sky_weather %in% c("snow", "rain", "drizzle"), "overcast",
                         ifelse(sky_weather %in% c("dome"), "roof closed", sky_weather))
    
  )


# Split into training and validation (use 2018 Boston Red Sox as Validation)
valid_aft <- aftdf_sum %>% filter(team_year == 'bos 2018')
train_aft <- aftdf_sum %>% filter(team_year != 'bos 2018', year(as.Date(date)) == 2018)

One of the main assumptions of an AFT model is the failure (relief) time follows a specific distribution. We can test a few distributions to see which one best fits the errors of the model:

Weibull
Exponential
Log-Logistic
Log-Normal
Gamma
F

With each distribution test, we will fit the full model. Once we have selected the best distribution for the relief time, we will select the most informative variables for the final AFT model.

#AFT modeling
# Create an AFT model with different distributions: we must use the full model/
aft.w <- flexsurvreg(Surv(pitch_count, subbed == 1) ~ prop_fastball_slow+runs_allowed+prop_balls+
                       batters_faced+prop_pitches_scoring_position+strikeouts+groundouts+singles+
                       double_plays+fielders_choice+walks+pop_outs+home_runs+lineouts+doubles+
                       flyouts+hit_by_pitch+intentional_walk+triples+runner_out+bunt_out+
                       sac_fly+sac_bunt+prop_batters_lefty+p_throws+start_hour+
                       num_p_types+league+home+reached+game_number+windspeed+temp_f+sky_weather,
                      data = train_aft, dist = "weibull")


aft.e <- flexsurvreg(Surv(pitch_count, subbed == 1) ~ prop_fastball_slow+runs_allowed+prop_balls+
                       batters_faced+prop_pitches_scoring_position+strikeouts+groundouts+singles+
                       double_plays+fielders_choice+walks+pop_outs+home_runs+lineouts+doubles+
                       flyouts+hit_by_pitch+intentional_walk+triples+runner_out+bunt_out+
                       sac_fly+sac_bunt+prop_batters_lefty+p_throws+start_hour+
                       num_p_types+league+home+reached+game_number+windspeed+temp_f+sky_weather,
                      data = train_aft, dist = "expo") 


aft.ll <- flexsurvreg(Surv(pitch_count, subbed == 1) ~ prop_fastball_slow+runs_allowed+prop_balls+
                       batters_faced+prop_pitches_scoring_position+strikeouts+groundouts+singles+
                       double_plays+fielders_choice+walks+pop_outs+home_runs+lineouts+doubles+
                       flyouts+hit_by_pitch+intentional_walk+triples+runner_out+bunt_out+
                       sac_fly+sac_bunt+prop_batters_lefty+p_throws+start_hour+
                       num_p_types+league+delay+home+reached+game_number+windspeed+temp_f+sky_weather,
                      data = train_aft, dist = "llogis")


aft.ln <- flexsurvreg(Surv(pitch_count, subbed == 1) ~ prop_fastball_slow+runs_allowed+prop_balls+
                       batters_faced+prop_pitches_scoring_position+strikeouts+groundouts+singles+
                       double_plays+fielders_choice+walks+pop_outs+home_runs+lineouts+doubles+
                       flyouts+hit_by_pitch+intentional_walk+triples+runner_out+bunt_out+
                       sac_fly+sac_bunt+prop_batters_lefty+p_throws+start_hour+
                       num_p_types+league+delay+home+reached+game_number+windspeed+temp_f+sky_weather,
                      data = train_aft, dist = "lnorm")


aft.g <- flexsurvreg(Surv(pitch_count, subbed == 1) ~ prop_fastball_slow+runs_allowed+prop_balls+
                       batters_faced+prop_pitches_scoring_position+strikeouts+groundouts+singles+
                       double_plays+fielders_choice+walks+pop_outs+home_runs+lineouts+doubles+
                       flyouts+hit_by_pitch+intentional_walk+triples+runner_out+bunt_out+
                       sac_fly+sac_bunt+prop_batters_lefty+p_throws+start_hour+
                       num_p_types+league+delay+home+reached+game_number+windspeed+temp_f+sky_weather,
                      data = train_aft, dist = "gamma")


aft.f <- flexsurvreg(Surv(pitch_count, subbed == 1) ~ prop_fastball_slow+runs_allowed+prop_balls+
                       batters_faced+prop_pitches_scoring_position+strikeouts+groundouts+singles+
                       double_plays+fielders_choice+walks+pop_outs+home_runs+lineouts+doubles+
                       flyouts+hit_by_pitch+intentional_walk+triples+runner_out+bunt_out+
                       sac_fly+sac_bunt+prop_batters_lefty+p_throws+start_hour+
                       num_p_types+league+delay+home+reached+game_number+windspeed+sky_weather,
                      data = train_aft, dist = "genf")

We can assess the graphical fit of certain distributions by plotting the theoretical and actual cumulative hazard functions.

Although none of the distributions are perfect, the Weibull, Gamma, and F distributions seem to fit the data best. We can use the Likelihood Ratio Test (LRT) to compare these nested distributions statistically.

The Weibull distribution is nested within the Gamma distribution, which is nested within the F distribution. Therefore, we will test the Weibull with the Gamma distribution first.

# Test Weibull vs. Gamma
pval.w.g <- 1 - pchisq((-2*(aft.w$loglik-aft.g$loglik)), 1)

The p-value is extremely high, ~1, meaning we fail to reject the null hypothesis that the Gamma distribution provides significantly more information than the Weibull. Therefore, the Weibull is statistically a better fit than the Gamma.

Now let’s test the Weibull with the F distribution.

# Test F vs. Weibulll
pval.w.f <- 1 - pchisq((-2*(aft.w$loglik-aft.f$loglik)), 2)

The resulting p-value of the LRT, ~0, is extremely significant. This indicates that the F distribution provides significantly more information than the Weibull distribution. Because the survreg() function in R does not handle an F distribution, however, we will use the Weibull distribution for the rest of the model process.

The next phase of the AFT model process is selecting the most informative subset of explanatory variables. We will do this by minimizing the AIC of the model with the stepAIC() function.

aft_final <- stepAIC(survreg(Surv(pitch_count, subbed == 1) ~ prop_fastball_slow+runs_allowed+prop_balls+
                       batters_faced+prop_pitches_scoring_position+strikeouts+groundouts+singles+
                       double_plays+fielders_choice+walks+pop_outs+home_runs+lineouts+doubles+
                       flyouts+hit_by_pitch+intentional_walk+triples+runner_out+bunt_out+
                       sac_fly+sac_bunt+prop_batters_lefty+p_throws+start_hour+
                       num_p_types+league+home+reached+game_number+windspeed+temp_f+sky_weather,
                      data = train_aft, dist = "weibull"))

AFT Results and Takeaways

Let’s take a look at the model performance on the validation data set: the 2018 Boston Red Sox. We can look at, on average, the difference in the actual number of pitches a starter threw and the model’s predicted number of pitches.

valid_aft$pred_pitch_count <- round(predict(aft_final, newdata=valid_aft, type="response"), 0)
mean(abs(valid_aft$pred_pitch_count - valid_aft$pitch_count))

## [1] 8.660494

The model’s predicted number of pitches a starter will throw before being relived is, on average, 8.66 pitches off from the actual pitch count for the 2018 Boston Red Sox. We can also look at the survival curve as predicted by our final AFT model versus the actual survival curve to evaluate the model.

Overall, I would say the model did all right in its predictions. The margin of error of 9 pitches is usually less than an inning of baseball, which is a pretty good margin.

Let’s take a look at the final list of variables that minimize the AIC as well as their respective percentage effect on survival time and p-values. The variables that are marked out are not statistically significant at a threshold of 0.03.

Note the interpretation of the percentage effect on survival time: the percentage effect on the expected pitch count for a starting pitcher at relief for every unit increase in the variable.

Final List of Variables for Weibull AFT Model
Variable	Percentage Effect on Survival Time	P-Value
batters_faced	2.98 %	0.0000
intentional_walk	1.06 %	0.0022
double_plays	0.76 %	0.0000
num_p_types	0.71 %	0.0000
walks	0.61 %	0.0000
home	0.42 %	0.0101
prop_balls	0.34 %	0.0000
runs_allowed	0.26 %	0.0004
game_number	0.01 %	0.2198
temp_f	-0.04 %	0.0000
singles	-0.24 %	0.0519
sky_weathercloudy	-0.5 %	0.0715
sky_weatherroof closed	-0.52 %	0.0705
doubles	-0.61 %	0.0000
hit_by_pitch	-0.75 %	0.0003
triples	-0.81 %	0.0024
pop_outs	-0.82 %	0.0000
sky_weatherpartly cloudy	-0.88 %	0.0002
prop_fastball_slow	-0.91 %	0.0211
fielders_choice	-0.97 %	0.0073
flyouts	-1.09 %	0.0000
lineouts	-1.19 %	0.0000
groundouts	-1.2 %	0.0000
leagueNL	-1.21 %	0.0000
sky_weatherovercast	-1.38 %	0.0001
home_runs	-1.53 %	0.0000
sky_weathersunny	-1.59 %	0.0000
runner_out	-1.76 %	0.0000
reached	-1.86 %	0.0000
prop_pitches_scoring_position	-2.14 %	0.0246
bunt_out	-2.55 %	0.0000

There are five interesting takeaways I want to highlight from this table.

If the pitcher is starting a home game, there is a 0.42% increase in the expected pitches he will throw than if he were starting an away game.
“Clear” skies is the best weather to increase the expected pitch count (this is the reference level for the categorical variable “sky_weather”).
The proportion of pitches thrown that were a ball and the number of walks in a game surprisingly have a positive effect on the number of pitches a starter will throw before relief. This could be a result of at-bats requiring more pitches when there are more balls thrown than strikes. If a pitcher throws more strikes, there is a chance for the hitter to hit the ball in play and complete the at-bat in fewer pitches.
For every degree Fahrenheit increase in the temperature at the game, the starting pitcher is expected to throw 0.04% fewer pitches before relief.
A surprising result is the negative effect of ground balls on the expected pitch count. Ground balls usually result in outs or other beneficial events for the pitcher’s team. This could be a result of pitchers working through at-bats in fewer pitches, going further into the game with respect to innings but not pitch count.

This model can be useful in understanding when MLB managers are likely to pull a starting pitcher. By understanding the cumulative events that have occurred up until the current pitch, we can model what the expected pitch count of the pitcher will be at the time of relief. This can give a manager a sense of time before the pitcher is likely ready to be pulled so the team can prepare for relief and get the bullpen warmed up.

However, there are a few downfalls of the analysis. To start, the expected number of pitches thrown does not tell the whole story of the time of relief. Pitchers may be pulled at lower levels of pitch count for two reasons:

The pitcher has worked efficiently in each at-bat, retiring most of the batters and keeping pitch count low even deep into the game.
The pitcher has allowed a lot of batters on base early in the game, causing the manager to take him out at a low pitch count.

These two scenarios tell two completely different stories at the same number of pitches thrown. The first has thrown a great game and was relieved in the 8th inning while the second has thrown a terrible game and was relieved in the 3rd inning. In further analysis, it would be useful to incorporate the reasons of relief (i.e., fatigue or poor performance) to understand these two stories.

Another extension of this analysis would be to model pitcher fatigue. This current model imitates the thinking of the MLB managers: when do they usually pull the starter? However, managers make mistakes. There are many cases when the starter is pulled too late and the damage they have done is irreversible. On the other hand, there are times when the manager pulls the starter prematurely and the relief pitcher does much worse than the start would have performed. It may be more useful to model pitcher fatigue at each pitch number to understand when should the starter be pulled? This could be a helpful tool for managers to utilize in optimizing the time at which a starter should be taken out of the game.