Chapter 5, Analyzing Baseball Data with R, Value of Plays Using Run Expectancy

5.2 Runs Score in the Remainder of the Inning

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.3.2

library(broom)

Now that the necessary libraries are loaded, let’s begin

setwd("C:\\Users\\james\\R_Working_Directory\\Analyzing_Baseball_Data_With_R\\baseball_R\\data")
fields <- read_csv("fields.csv")

## Rows: 97 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Description, Header
## dbl (1): Field number
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

data2016 <- read_csv("all2016.csv",
                     col_names = pull(fields, Header),
                     na=character())

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 190715 Columns: 97
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (35): GAME_ID, AWAY_TEAM_ID, PITCH_SEQ_TX, BAT_ID, BAT_HAND_CD, RESP_BAT...
## dbl (35): INN_CT, BAT_HOME_ID, OUTS_CT, BALLS_CT, STRIKES_CT, AWAY_SCORE_CT,...
## lgl (27): LEADOFF_FL, PH_FL, BAT_EVENT_FL, AB_FL, SH_FL, SF_FL, DP_FL, TP_FL...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We need to create several new variables to help us with our runs scored calculations

data2016 <- data2016 %>%
  mutate(RUNS = AWAY_SCORE_CT + HOME_SCORE_CT,
         HALF.INNING = paste(GAME_ID, INN_CT, BAT_HOME_ID),
         RUNS.SCORED = 
           (BAT_DEST_ID > 3) + (RUN1_DEST_ID > 3) +
           (RUN2_DEST_ID > 3) + (RUN3_DEST_ID > 3))

RUNS is the sum of both away and home teams run scored, HALF.INNING is a unique identifier for each half inning of every game, and RUNS.SCORED gives the number of runs scored on each play.

Next, we are going to compute the maximum total score for each half-inning

half_innings <- data2016 %>%
  group_by(HALF.INNING) %>%
  summarise(Outs.Inning = sum(EVENT_OUTS_CT),
            Runs.Inning = sum(RUNS.SCORED),
            Runs.Start = first(RUNS),
            MAX.RUNS = Runs.Inning + Runs.Start)

### Next we'll inner join this  with the data2016 data frame
#ROI stands for Remainder of Inning
data2016 <- data2016 %>%
  inner_join(half_innings, by = "HALF.INNING") %>%
  mutate(RUNS.ROI = MAX.RUNS - RUNS)

5.3 Creating the Matrix Now that we have the Runs Scored in the Remainder of the Inning Variable has been computed, we will need to create the run expectancy matrix

There are currently three variables containing the player codes of baserunners on first, second, and third base, we need to create a new three-digit variable BASES where each digit is a 1 or 0 corresponding to if the base is occupied or empty

the STATE variable will add the number of outs to the BASES variable.

An example, “011 2” would indicate there are currently runners on second and third base with two outs.

data2016 <- data2016 %>%
  mutate(BASES = 
  paste(ifelse(BASE1_RUN_ID > "", 1, 0),
        ifelse(BASE2_RUN_ID > "", 1, 0),
        ifelse(BASE3_RUN_ID > "", 1, 0), sep = ""),
  STATE = paste(BASES, OUTS_CT))

Next, we need to account for what happens after a play. We need to consider plays in our data frame only if there is a change in the runners, number of outs, or runs scored.

data2016 <- data2016 %>%
  mutate(NRUNNER1 = 
           as.numeric(RUN1_DEST_ID == 1 | BAT_DEST_ID == 1),
         NRUNNER2 = 
           as.numeric(RUN1_DEST_ID == 2 | RUN2_DEST_ID == 2 | BAT_DEST_ID == 2),
         NRUNNER3 = 
           as.numeric(RUN1_DEST_ID == 3 | RUN2_DEST_ID == 3 | RUN3_DEST_ID == 3 | BAT_DEST_ID == 3),
         NOUTS = OUTS_CT + EVENT_OUTS_CT,
         NEW.BASES = paste(NRUNNER1, NRUNNER2, NRUNNER3, sep = ""),
         NEW.STATE = paste(NEW.BASES, NOUTS)
  )

Now we need to shift our attention to playes where there is a change between STATE and NEW.STATE

data2016 <- data2016 %>%
  filter((STATE != NEW.STATE) | (RUNS.SCORED > 0))

Before the run expectancy can be computed we need to eliminate scenerios where there are incomplete half innings (which can occur at the end of the game when the winning run is scored before there are 3 outs). We want to work with only complete half innings where 3 outs are recorded.

We’ll create a new data frame with a C at the end to mark that it only includes complete half innings

data2016C <- data2016 %>%
  filter((Outs.Inning == 3))

Now it’s time to compute the expected number of runs scored in the remainder of the inning (the run expectency) for each of the 24 bases/out situations.

RUNS <- data2016C %>%
  group_by(STATE) %>%
  summarise(Mean = mean(RUNS.ROI)) %>%
  mutate(Outs = substr(STATE,5,5)) %>%
  arrange(Outs)

Next, we’ll use the matrix function to convert the RUNS data frame into a matrix called RUNS_out

RUNS_out <- matrix(round(RUNS$Mean, 2), 8, 3)
dimnames(RUNS_out)[[2]] <- c("0 outs", "1 out", "2 outs")
dimnames(RUNS_out)[[1]] <- c("000", "001", "010", "011", "100", "101", "110", "111")

Let’s see how this compares to 2002’s data, these numbers will be manually input, and were retrieved from the book.

RUNS.2002 <- matrix(c(.51, 1.40, 1.14, 1.96, .90, 1.84, 1.51, 2.33, .27, .94, .68, 1.36, .54, 1.18, .94, 1.51, .10, .36, .32, .63, .23, .52, .45, .78), 8, 3)
dimnames(RUNS.2002) <- dimnames(RUNS_out)
cbind(RUNS_out, RUNS.2002)

##     0 outs 1 out 2 outs 0 outs 1 out 2 outs
## 000   0.50  0.27   0.11   0.51  0.27   0.10
## 001   1.35  0.94   0.37   1.40  0.94   0.36
## 010   1.13  0.67   0.31   1.14  0.68   0.32
## 011   1.93  1.36   0.55   1.96  1.36   0.63
## 100   0.86  0.51   0.22   0.90  0.54   0.23
## 101   1.72  1.20   0.48   1.84  1.18   0.52
## 110   1.44  0.92   0.41   1.51  0.94   0.45
## 111   2.11  1.54   0.70   2.33  1.51   0.78

The fact that run expectancy hasn’t changed that much between 2002 and 2016 indicates few changes to average run scoring tendencies in recent history, which gives us confidence in being able to draw conclusions about run expectancy over time periods longer than a single season.

Before we continue on with the book, let’s visualize the data a bit

library(pheatmap)

## Warning: package 'pheatmap' was built under R version 4.3.2

pheatmap(RUNS_out)

#Don't really like the output, let's try using ggplot instead

#To use GGPlot we will have to turn our matrix into a data frame using the function melt
library(reshape2)

## Warning: package 'reshape2' was built under R version 4.3.2

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

# Manually create a melted data frame from the matrix
RUNS_out_melted <- expand.grid(Var1 = dimnames(RUNS_out)[[1]], Var2 = dimnames(RUNS_out)[[2]])
RUNS_out_melted$value <- as.vector((RUNS_out))

# Plotting the heatmap
ggplot(RUNS_out_melted, aes(Var2, Var1)) +
  geom_tile(aes(fill = value)) + # Fill tiles based on 'value'
  geom_text(aes(label = sprintf("%.2f", value)), color = "white", size = 3) + # Add text labels
  scale_fill_gradient(low = "blue", high = "red") +
  xlab("Outs") +
  ylab("State") +
  theme_minimal() +
  scale_x_discrete(position = "top") +
  scale_y_discrete(labels = function(x) x)

Next, let’s look at an ANOVA and Tukey Honest Significant different test and visualize the results

library(multcompView)

## Warning: package 'multcompView' was built under R version 4.3.2

library(stats)

#Performing the ANOVA
RUNS_lm <- lm(RUNS.SCORED ~ BASES, data = data2016)
anova_result <- aov(RUNS_lm)
summary(anova_result)

##                 Df Sum Sq Mean Sq F value Pr(>F)    
## BASES            7   4086   583.7    4154 <2e-16 ***
## Residuals   190548  26775     0.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Performing the Tukey HSD test
tukey_result <- TukeyHSD(anova_result)

#Print Tukey HSD results
print(tukey_result)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = RUNS_lm)
## 
## $BASES
##                diff         lwr         upr    p adj
## 001-000  0.31795713  0.30183359  0.33408067 0.00e+00
## 010-000  0.14491450  0.13522445  0.15460455 0.00e+00
## 011-000  0.45954598  0.44159742  0.47749454 0.00e+00
## 100-000  0.04512895  0.03828658  0.05197132 0.00e+00
## 101-000  0.42100151  0.40577668  0.43622635 0.00e+00
## 110-000  0.21421540  0.20359284  0.22483797 0.00e+00
## 111-000  0.66477541  0.64722863  0.68232220 0.00e+00
## 010-001 -0.17304263 -0.19118779 -0.15489746 0.00e+00
## 011-001  0.14158886  0.11797740  0.16520031 0.00e+00
## 100-001 -0.27282818 -0.28962603 -0.25603032 0.00e+00
## 101-001  0.10304439  0.08143083  0.12465794 0.00e+00
## 110-001 -0.10374172 -0.12240154 -0.08508191 0.00e+00
## 111-001  0.34681828  0.32351078  0.37012579 0.00e+00
## 011-010  0.31463148  0.29484691  0.33441605 0.00e+00
## 100-010 -0.09978555 -0.11056036 -0.08901074 0.00e+00
## 101-010  0.27608701  0.25873552  0.29343850 0.00e+00
## 110-010  0.06930090  0.05580582  0.08279599 0.00e+00
## 111-010  0.51986091  0.50044009  0.53928173 0.00e+00
## 100-011 -0.41441703 -0.43297371 -0.39586036 0.00e+00
## 101-011 -0.03854447 -0.06155160 -0.01553734 1.05e-05
## 110-011 -0.24533058 -0.26558820 -0.22507296 0.00e+00
## 111-011  0.20522943  0.18062411  0.22983475 0.00e+00
## 101-100  0.37587256  0.35993534  0.39180979 0.00e+00
## 110-100  0.16908645  0.15746585  0.18070705 0.00e+00
## 111-100  0.61964646  0.60147811  0.63781482 0.00e+00
## 110-101 -0.20678611 -0.22467510 -0.18889712 0.00e+00
## 111-101  0.24377390  0.22107882  0.26646898 0.00e+00
## 111-110  0.45056001  0.43065750  0.47046252 0.00e+00

plot(tukey_result)

library(agricolae)

## Warning: package 'agricolae' was built under R version 4.3.2

tukey.test2 <- HSD.test(anova_result, trt = 'BASES')
tukey.test2

## $statistics
##     MSerror     Df      Mean       CV
##   0.1405147 190548 0.1141082 328.5066
## 
## $parameters
##    test name.t ntr StudentizedRange alpha
##   Tukey  BASES   8          4.28631  0.05
## 
## $means
##     RUNS.SCORED       std      r          se Min Max Q25 Q50 Q75
## 000  0.03181263 0.1755018 104864 0.001157571   0   1   0   0   0
## 001  0.34976976 0.5200582   5212 0.005192287   0   2   0   0   1
## 010  0.17672713 0.4440839  15821 0.002980190   0   2   0   0   0
## 011  0.49135862 0.7832565   4166 0.005807661   0   3   0   0   1
## 100  0.07694159 0.3544243  37405 0.001938189   0   2   0   0   0
## 101  0.45281415 0.6790294   5881 0.004888046   0   3   0   0   1
## 110  0.24602804 0.6355390  12840 0.003308098   0   3   0   0   0
## 111  0.69658805 0.9408745   4367 0.005672432   0   4   0   0   1
## 
## $comparison
## NULL
## 
## $groups
##     RUNS.SCORED groups
## 111  0.69658805      a
## 011  0.49135862      b
## 101  0.45281415      c
## 001  0.34976976      d
## 110  0.24602804      e
## 010  0.17672713      f
## 100  0.07694159      g
## 000  0.03181263      h
## 
## attr(,"class")
## [1] "group"

5.4 Measuring the Success of a Batting Play

We estimate the value of the plate appearance, called the run value, by computing the difference in run expectencies in the old and new states plus the number of runs scored on the particular play…. RUN VALUE = Runs(New State) - Runs (Old State) + Runs(Scored on Play)

data2016 <- data2016 %>%
  left_join(select(RUNS, -Outs), by = "STATE") %>%
  rename(Runs.State = Mean) %>%
  left_join(select(RUNS, -Outs),
            by = c("NEW.STATE" = "STATE")) %>%
  rename(Runs.New.State = Mean) %>%
  replace_na(list(Runs.New.State = 0)) %>%
  mutate(run_value = Runs.New.State - Runs.State +
           RUNS.SCORED)

5.5 Jose Altuve and Andrew McCutchen

Master <- Lahman::People
altuve.id <- Master %>%
  filter(nameFirst == "Jose", nameLast == "Altuve") %>%
  pull(retroID)

mccutchen.id <- Master %>%
  filter(nameFirst == "Andrew", nameLast == "McCutchen") %>%
  pull(retroID)

altuve <- data2016 %>%
  filter(BAT_ID == altuve.id,
         BAT_EVENT_FL == TRUE)

mccutchen <- data2016 %>%
  filter(BAT_ID == mccutchen.id,
         BAT_EVENT_FL == TRUE)

altuve %>% 
  select(STATE, NEW.STATE, run_value) %>%
  slice(1:3)

mccutchen %>%
  select(STATE, NEW.STATE, run_value) %>%
  slice(1:3)

Let’s summarize

altuve_summary<- altuve %>%
  group_by(BASES) %>%
  summarize(N = n())

mccutchen_summary <- mccutchen %>%
  group_by(BASES) %>%
  summarize(N = n())

Let’s visualize how they actually did in each of these scenarios using the column “run value” and visualizing it using geom_jitter

Let’s start with Altuve

ggplot(altuve, aes(BASES, run_value)) +
  geom_jitter(width = 0.25, alpha = 0.5) +
  geom_hline(yintercept = 0, color = "blue") +
  xlab("RUNNERS") + 
  ggtitle("Altuve Run Value by Runner Position")

Next, let’s look at McCutchen

ggplot(mccutchen, aes(BASES, run_value)) +
  geom_jitter(width = 0.25, alpha = 0.5) +
  geom_hline(yintercept = 0, color = "blue") +
  xlab("RUNNERS") + 
  ggtitle("McCutchen Run Value by Runner Position")

To understand Altuve and McCutchen’s total run production in the 2016 Season, we will use the summarize function together with sum() and n() to get a fuller picture of how each batter did by summing the run values and including the number of opportunities in each situation.

The book didn’t include Runs per Plate Appearance, but we will include here to get a better picture.

Runs_McCutchen <-  mccutchen %>%
  group_by(BASES) %>%
  summarize(RUNS = sum(run_value), 
            PA = n()) %>%
  mutate(Runs_per_PA = RUNS/PA)

Runs_Altuve <-  altuve %>%
  group_by(BASES) %>%
  summarize(RUNS = sum(run_value), 
            PA = n()) %>%
  mutate(Runs_per_PA = RUNS/PA)

Now let’s take a look at McCutchen

Runs_McCutchen

In total, we see McCutchen produced the most runs when the bases were empty (4.49); however, looking at the runs per plate appearance data, we see McCutchen accumulated run production the fasted, with men on first and second at a rate of 0.09 runs per plate appearance of men on first and second. Additionally, wwe see with a man on first, McCutchen produced a negative run value of 3.34. Per plate appearance, we see McCutchen was worst with a runner on third, producing -0.09 runs per plate appearance with a runner on third base.

Let’s take a look at McCutcen’s total Run Expectancy over all scenarios. The measure of batting performance is known as run expectancy over the 24 base/out states. This is McCutchen’s total runs contribution over the 2016 season.

Runs_McCutchen %>%
  summarise(Run_Expectancy_24 = sum(RUNS))

In total, we see a run expectancy of about 7 runs by McCutchen in the 2016 season, over 24 base scenarios.

Now let’s look at Altuve

Runs_Altuve

The only situation where Altuve produced negative run value was when the bases were loaded. Per plate appearance, Altuve was most value in the 2016 season when runners were on second and third, with a run value of 0.191 per plate appearance. In total, with a run value of 10.194, Altuve produced the most runs with a runner on first.

Let’s look at Altuve’s total run expectancy of 24 base/out states for the 2016 season:

Runs_Altuve %>%
  summarise(Run_Expectancy_24 = sum(RUNS))

With a total run expectancy of 34.724, we see Altuve’s value almost 6 times of McCutchens.