Task 1: By hand calculation of 1 player over 3 stolen base attempts

Take 1 Goals: - display understanding of states - display understanding of expected value - applied expected value table to baseball event data

By hand calculation of these three Stolen Base attempts by JT Realmuto:


Task 2: Use R to automate the same 1 player, 3 stolen base attempt calculation

Repeat the same calculations as in Task 1, but this time using R. For one player over three stolen base attempts, determine the change in expected value between state1 and state2. Over a variety of SBA base states, not just (1,0,0) to (0,1,0)

Task 2a: read Expect Value table csv file

Read in the csv file that contains the expected value table for each base state - out state. Use the read_csv function to read the file and assign it to the dataframe expect_value.

expect_value <- read_csv('data/EV_base_out_states.csv')

Take a look at the dataframe, expect_value that was just created using the head function. In RStudio, you can view a dataframe by clicking on the table data icon on the right, in the environment pane.

head(expect_value)
## # A tibble: 6 × 6
##   base_state_name base_1 base_2 base_3 OUT_STATE EV_state
##   <chr>            <dbl>  <dbl>  <dbl>     <dbl>    <dbl>
## 1 empty                0      0      0         0    0.481
## 2 on 1st               1      0      0         0    0.859
## 3 on 2nd               0      1      0         0    1.1  
## 4 1st & 2nd            1      1      0         0    1.44 
## 5 on 3rd               0      0      1         0    1.35 
## 6 corners              1      0      1         0    1.78

The head functions shows the column names and the top 6 rows, the default number of rows. There are more rows of data, but only the 1st 6 are shown by default. You can provide the number of rows to the head function; here we provide 9 rows:

head(expect_value, 9)
## # A tibble: 9 × 6
##   base_state_name base_1 base_2 base_3 OUT_STATE EV_state
##   <chr>            <dbl>  <dbl>  <dbl>     <dbl>    <dbl>
## 1 empty                0      0      0         0    0.481
## 2 on 1st               1      0      0         0    0.859
## 3 on 2nd               0      1      0         0    1.1  
## 4 1st & 2nd            1      1      0         0    1.44 
## 5 on 3rd               0      0      1         0    1.35 
## 6 corners              1      0      1         0    1.78 
## 7 2nd & 3rd            0      1      1         0    1.96 
## 8 loaded               1      1      1         0    2.29 
## 9 empty                0      0      0         1    0.254

Let’s take time and understand what is in this important table. The variable base_state_name is a name label that is provided to help interpret the next three base_x variables. The variable base_1 shows if there is runner on 1st base; (1) indicates a running, or no runner (0). Likewise base_2 and base_3 are hot-coded variables for second and third base. OUT_STATE is the number of outs {0,1,2,3}. Finally, EV_state is the expect value, or expected runs, for each base-out state.

Add a new column to the expected value table; baserunners_tot, the total number of base runners on all three bases for each base state. Use the ‘mutate’ function to create the new column based on the addition of all three bases.

expect_value_A <- expect_value %>%
  mutate(baserunners_tot = (base_1+ base_2+ base_3))

Let’s take a quick look at the expect value. Plot the expect values of different out states.

ev_color_palettes <- c('#660510', '#991020', '#dd0000', '#ddaaaa', 
                       '#aaaadd', '#0000dd', '#201099','#100566')

expect_value_A$base_state_name <- factor(expect_value_A$base_state_name, 
                                      levels=c('loaded', '2nd & 3rd',
                                      'corners', 'on 3rd', '1st & 2nd', 
                                      'on 2nd', 'on 1st', 'empty'))

expect_value_B <- expect_value_A %>%
  filter(OUT_STATE<3)

ggplot(expect_value_B) +
  scale_color_manual(values=ev_color_palettes, name='Base State') +
  geom_point(aes(x=OUT_STATE, y=EV_state, color=base_state_name), size=4) +
  scale_x_continuous(name='Out State', limits= c(-0.5, 2.5)) +
  scale_y_continuous(name='Expected value (runs)', expand=c(0,0), limits=c(0, 2.5)) +
  theme_bw() +
  geom_hline(yintercept = 0)

Does that make sense? The expected value is highest with zero outs and decreases with increasing outs. The expected value with 3 outs is zero; the inning is over. Within each out state, the more runners on base, the higher the expected value.

TASK2b: read event csv file & select one runner: JT Realmuto

Use the read_csv() function to read in the event file. The event file contains stolen base attempts (SBA) and the base-out states of each SBA. This 1st event file will be an abbreviated set of SBA’s, hence the ‘short’ in the file name.

event_SBA_short <- read_csv('data/event_SBA_short.csv')
## Rows: 12 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): runner_name, runner_ID, SBA_outcome
## dbl (4): OUT_STATE, base_1, base_2, base_3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

You can see the column names of the df above. There are 6 columns, the details of variables in this event files are described in the data dictionary.

Let’s take a looks at the table, use the head() function again to show the event file:

head(event_SBA_short)
## # A tibble: 6 × 7
##   runner_name     runner_ID OUT_STATE base_1 base_2 base_3 SBA_outcome
##   <chr>           <chr>         <dbl>  <dbl>  <dbl>  <dbl> <chr>      
## 1 Bryce Harper    harpb003          0      1      0      0 success    
## 2 Bryce Harper    harpb003          2      0      1      0 success    
## 3 Bryce Harper    harpb003          2      1      0      0 success    
## 4 Cesar Hernandez hernc005          1      0      1      0 fail       
## 5 Cesar Hernandez hernc005          1      1      0      0 fail       
## 6 Cesar Hernandez hernc005          2      1      0      0 success

Again, the head() function is showing the first six rows. There are more base runners in the dataframe beyond those shown in these first six rows. Use the table() function to see how many rows of SBA are provided for each runner in this short event file.

table(event_SBA_short$runner_name)
## 
##    Bryce Harper Cesar Hernandez   J.T. Realmuto     Jean Segura 
##               3               3               3               3

Here the table() function is counting all the rows where each unique runner_name appears. This short event files contains, four different runners, and each runner has 3 observations (i.e. three SBA).

In Task 1, you calculate the EV change for J.T. Realmuto’s three sample SBA’s. Let’s filter the dataframe to JT, so that we can now use R to conduct the same calculation.
Use the filter() function on the event dataframe to select only one player: JT Realmuto, and assign that to the new dataframe, event_SBA_short_Realmuto:

event_SBA_short_Realmuto <- event_SBA_short %>%
  filter(runner_ID=='realj001')

Take a look at the new dataframe that was just created with JT Realmuto’s three sample SBA:

head(event_SBA_short_Realmuto)
## # A tibble: 3 × 7
##   runner_name   runner_ID OUT_STATE base_1 base_2 base_3 SBA_outcome
##   <chr>         <chr>         <dbl>  <dbl>  <dbl>  <dbl> <chr>      
## 1 J.T. Realmuto realj001          0      1      0      0 success    
## 2 J.T. Realmuto realj001          0      1      0      0 fail       
## 3 J.T. Realmuto realj001          2      1      0      0 success

Use the table() function again to check that the dataframe only contains JT’s SBA’s:

table(event_SBA_short_Realmuto$runner_name)
## 
## J.T. Realmuto 
##             3

Task 2c: Apply the initial state EV to the three the SBA events of JT Realmuto

Use the merge() function to add the expected value for each of the 24 base-out states to the JT Realmuto event dataframe, and assign this to a new dataframe event_SBA_short_Realmuto_ev.

Also, let’s rename the column EV_state to EV_state_0 to designate that this is the initial EV of the state before the SBA, at time=0.

event_SBA_short_Realmuto_ev <- merge(event_SBA_short_Realmuto, expect_value, 
                                   by=c('OUT_STATE', 'base_1', 'base_2', 'base_3'))  %>% 
  rename(EV_state_0= EV_state)

Take a look at this new dataframe, note how the expected value column, EV_state, was added for each initial base-out state. Two EV_state value are same; the initial state is the same for these two SBA’s.

head(event_SBA_short_Realmuto_ev)
##   OUT_STATE base_1 base_2 base_3   runner_name runner_ID SBA_outcome
## 1         0      1      0      0 J.T. Realmuto  realj001     success
## 2         0      1      0      0 J.T. Realmuto  realj001        fail
## 3         2      1      0      0 J.T. Realmuto  realj001     success
##   base_state_name EV_state_0
## 1          on 1st      0.859
## 2          on 1st      0.859
## 3          on 1st      0.224

Now we are ready to consider the state after the outcome of the SBA. We referred to the initial state as State0, we will refer to the state after the SBA as State2.

Task 2d: Determine the change in state after the SBA, State1.

Determining State2, the base state and out state after a SBA.

State 1 out state: Calculate the change in out state based on outcome of the stolen base attempt. Success, a successful SBA does not change the out state, or fail, a SBA that leads to an out increases the out state buy 1 out.

event_SBA_short_Realmuto_ev_B <- event_SBA_short_Realmuto_ev %>%
  mutate(S2_OUT_STATE= ifelse(SBA_outcome=='success', OUT_STATE, (1 + OUT_STATE)))

View the modified dataframe, note how the State2 out-state column has been created.

head(event_SBA_short_Realmuto_ev_B)
##   OUT_STATE base_1 base_2 base_3   runner_name runner_ID SBA_outcome
## 1         0      1      0      0 J.T. Realmuto  realj001     success
## 2         0      1      0      0 J.T. Realmuto  realj001        fail
## 3         2      1      0      0 J.T. Realmuto  realj001     success
##   base_state_name EV_state_0 S2_OUT_STATE
## 1          on 1st      0.859            0
## 2          on 1st      0.859            1
## 3          on 1st      0.224            2

Read in table with the changes in base state with a SBA.

SBA_state_change <- read_csv('data/SBA_state_change.csv')
## Rows: 8 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SBA_outcome
## dbl (6): base_1, base_2, base_3, S2_base_1, S2_base_2, S2_base_3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Merge the change in base state at State 2 with the df:

event_SBA_short_Realmuto_ev_C <- event_SBA_short_Realmuto_ev_B %>%
  merge(SBA_state_change, by=c('base_1', 'base_2', 'base_3', 'SBA_outcome'))

Take a look at the new columns added to the df:

head(event_SBA_short_Realmuto_ev_C)
##   base_1 base_2 base_3 SBA_outcome OUT_STATE   runner_name runner_ID
## 1      1      0      0        fail         0 J.T. Realmuto  realj001
## 2      1      0      0     success         0 J.T. Realmuto  realj001
## 3      1      0      0     success         2 J.T. Realmuto  realj001
##   base_state_name EV_state_0 S2_OUT_STATE S2_base_1 S2_base_2 S2_base_3
## 1          on 1st      0.859            1         0         0         0
## 2          on 1st      0.859            0         0         1         0
## 3          on 1st      0.224            2         0         1         0

add the expected value for State 2 in the same way as before.
Merge the expected value table with the df, but this time merger using the State 2 base and out states:

event_SBA_short_Realmuto_ev_D <- merge(event_SBA_short_Realmuto_ev_C, expect_value, 
        by.x=c('S2_OUT_STATE', 'S2_base_1', 'S2_base_2', 'S2_base_3'),
        by.y=c('OUT_STATE', 'base_1', 'base_2', 'base_3'))

take a look at the new columns added to the df:

head(event_SBA_short_Realmuto_ev_D)
##   S2_OUT_STATE S2_base_1 S2_base_2 S2_base_3 base_1 base_2 base_3 SBA_outcome
## 1            0         0         1         0      1      0      0     success
## 2            1         0         0         0      1      0      0        fail
## 3            2         0         1         0      1      0      0     success
##   OUT_STATE   runner_name runner_ID base_state_name.x EV_state_0
## 1         0 J.T. Realmuto  realj001            on 1st      0.859
## 2         0 J.T. Realmuto  realj001            on 1st      0.859
## 3         2 J.T. Realmuto  realj001            on 1st      0.224
##   base_state_name.y EV_state
## 1            on 2nd    1.100
## 2             empty    0.254
## 3            on 2nd    0.319

that is a getting a little complicated. Let’s use a table:

event_SBA_short_Realmuto_ev_D
S2_OUT_STATE S2_base_1 S2_base_2 S2_base_3 base_1 base_2 base_3 SBA_outcome OUT_STATE runner_name runner_ID base_state_name.x EV_state_0 base_state_name.y EV_state
0 0 1 0 1 0 0 success 0 J.T. Realmuto realj001 on 1st 0.859 on 2nd 1.100
1 0 0 0 1 0 0 fail 0 J.T. Realmuto realj001 on 1st 0.859 empty 0.254
2 0 1 0 1 0 0 success 2 J.T. Realmuto realj001 on 1st 0.224 on 2nd 0.319

see how there are now two ‘EV_state’ columns; EV_state.x & EV_state.y These are the expected values for State 1 and State 2. Let’s use the ‘rename’ function to rename them to somethings that makes more sense and is easier to read and remember. Also, now you can calculate the change in value for each of the three SBA in the data frame. Use the mutate function to create a new column, and name it ‘change_EV’.
This is the simple subtraction between the expected value of State 1 and State 2.

event_SBA_short_Realmuto_ev_E <- event_SBA_short_Realmuto_ev_D %>%
  rename(
         EV_state_1 = EV_state) %>%
  mutate(change_EV= EV_state_1- EV_state_0 )

was value created over these three SBA’s? Let’s find out.
Sum the change in value over these three SBA’s.
Remember, these three are non-representative, so this summation does not represent the players’ actual value creation or destruction.

sum(event_SBA_short_Realmuto_ev_E$change_EV)
## [1] -0.269

A positive value means the runner created value with these three SBA. A negative value means the runner destroyed value with these thee SBA.
Here we see a negative number; so in these three SBA’s JT Realmuto destroyed value. Specially, {r} sum(event_SBA_short_Realmuto_ev_E$change_EV) runs of value.

Remember, these three SBA’s are non-representative, so this summation does not represent the player’s typical, or normal, value creation or destruction. In JT Realmuto’s defense, these three example SBA were not selected randomly; a failed SBA was selected for variety. JT Realmuto is a excellent baserunner; as we can explore in the next section.


Task 3: One player over all SBA in a season

One player over all stolen base attempts, determine the change in expected value between state1 and state2. Over all SBA base states, and sum the net season change in value. Did a player X’s stolen base activity create run value or not?

Task 3a: read in the full season event file, and then select one player

As in Task 2a, use the read_csv() function to read in the SBA event file.

event_SBA <- read_csv('data/event_SBA_full.csv')
## Rows: 2532 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): runner_name, runner_ID, team_runner_cd, SBA_outcome, BAT_HAND_CD, C...
## dbl (5): OUT_STATE, base_1, base_2, base_3, BAT_LINEUP_ID
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now we are working with the full season event file; not the simple 12 rows file that we just explored. Let’s talk a look at his larger dataframe:

head(event_SBA)
## # A tibble: 6 × 13
##   runner_…¹ runne…² team_…³ OUT_S…⁴ base_1 base_2 base_3 SBA_o…⁵ BAT_L…⁶ BAT_H…⁷
##   <chr>     <chr>   <chr>     <dbl>  <dbl>  <dbl>  <dbl> <chr>     <dbl> <chr>  
## 1 Jose Abr… abrej0… CHA           2      1      0      0 success       4 L      
## 2 Ronald A… acunr0… ATL           1      1      0      0 success       2 R      
## 3 Ronald A… acunr0… ATL           0      1      0      0 success       2 R      
## 4 Ronald A… acunr0… ATL           0      1      0      1 fail          2 L      
## 5 Ronald A… acunr0… ATL           2      1      0      1 success       2 L      
## 6 Ronald A… acunr0… ATL           2      1      0      1 success       2 L      
## # … with 3 more variables: CatcherID <chr>, PIT_ID <chr>, PIT_HAND_CD <chr>,
## #   and abbreviated variable names ¹​runner_name, ²​runner_ID, ³​team_runner_cd,
## #   ⁴​OUT_STATE, ⁵​SBA_outcome, ⁶​BAT_LINEUP_ID, ⁷​BAT_HAND_CD

Most things look similar; there are also a few new columns. There is a team_runner_cd variable; this provides the team of the runner. We will use this later in Task 4.

Let’s find out how many events (stolen base attempts) are in this file? There are a few ways to determine this. One way is to use the length() function, this shows how many rows are in the column runner_ID in the dataframe `event_SBA.

length(event_SBA$runner_ID)
## [1] 2532

We can also ask a simple question of this dataframe: Over all players what was the number success and failed SBA? The sum of these two potential outcomes should be the same as the above length() result.

table(event_SBA$SBA_outcome)
## 
##    fail success 
##     579    1953

Now we will repeat the same actions as in Task 1 and 2, but with the full season record for one player. Since you have already read in the expected value table into the global environment, that task doesn’t need to be repeated.

Use the filter() function on the full season event file to select one player. This time let’s select Bryce Harper:

event_SBA_full_harper <- event_SBA %>%
  filter(runner_ID=='harpb003')

Take a look on the new Harper full season SBA dataframe:

head(event_SBA_full_harper)
## # A tibble: 6 × 13
##   runner_…¹ runne…² team_…³ OUT_S…⁴ base_1 base_2 base_3 SBA_o…⁵ BAT_L…⁶ BAT_H…⁷
##   <chr>     <chr>   <chr>     <dbl>  <dbl>  <dbl>  <dbl> <chr>     <dbl> <chr>  
## 1 Bryce Ha… harpb0… PHI           0      1      0      0 success       4 R      
## 2 Bryce Ha… harpb0… PHI           2      1      0      0 success       4 R      
## 3 Bryce Ha… harpb0… PHI           2      1      0      1 success       3 R      
## 4 Bryce Ha… harpb0… PHI           2      1      0      0 success       4 R      
## 5 Bryce Ha… harpb0… PHI           1      1      0      1 success       3 L      
## 6 Bryce Ha… harpb0… PHI           2      1      0      0 success       4 R      
## # … with 3 more variables: CatcherID <chr>, PIT_ID <chr>, PIT_HAND_CD <chr>,
## #   and abbreviated variable names ¹​runner_name, ²​runner_ID, ³​team_runner_cd,
## #   ⁴​OUT_STATE, ⁵​SBA_outcome, ⁶​BAT_LINEUP_ID, ⁷​BAT_HAND_CD

Use the table() function to calculate how many SBA’s Harper had over the full season.

table(event_SBA_full_harper$runner_name)
## 
## Bryce Harper 
##           16

The above shows how many SBA attempts this single player had over the entire season, but what were the outcomes? Use the table() function again, but this time over the SBA_outcome column.

table(event_SBA_full_harper$SBA_outcome) 
## 
##    fail success 
##       2      14

Task 3b: apply EV to SBA events

As you did earlier, use the merge() function to add the expected value for each of the 24 base-out states.

event_SBA_full_harper_ev <- merge(event_SBA_full_harper, expect_value, 
                                   by=c('OUT_STATE', 'base_1', 'base_2', 'base_3'))
Take a look at this dataframe of a single player. Note how the expected value column was added for each base-out state
the event_SBA_full_harper_ev dataframe as a table
OUT_STATE base_1 base_2 base_3 runner_name runner_ID team_runner_cd SBA_outcome BAT_LINEUP_ID BAT_HAND_CD CatcherID PIT_ID PIT_HAND_CD base_state_name EV_state
0 1 0 0 Bryce Harper harpb003 PHI success 4 R stalj001 ramiy001 R on 1st 0.859
0 1 1 0 Bryce Harper harpb003 PHI success 4 R mccab002 tehej001 R 1st & 2nd 1.437
1 1 0 0 Bryce Harper harpb003 PHI success 3 R carav001 darvy001 R on 1st 0.509

Task 3c: Calculate the change in state for SBA over the full season set

Determine State 2, base on the outcome of the SBA and determine the outcome base-out state. Calculate the change in out state based on outcome of the stolen base attempt.
Success, a successful SBA does not change the out state, or fail, a SBA that leads to an out increases the out state buy 1 out.

event_SBA_full_harper_ev_B <- event_SBA_full_harper_ev %>%
  mutate(S2_OUT_STATE= ifelse(SBA_outcome=='success', OUT_STATE, 1+OUT_STATE))
View the modified dataframe:
dataframe event_SBA_full_harper_ev_B, as a table
OUT_STATE base_1 base_2 base_3 runner_name runner_ID team_runner_cd SBA_outcome BAT_LINEUP_ID BAT_HAND_CD CatcherID PIT_ID PIT_HAND_CD base_state_name EV_state S2_OUT_STATE
0 1 0 0 Bryce Harper harpb003 PHI success 4 R stalj001 ramiy001 R on 1st 0.859 0
0 1 1 0 Bryce Harper harpb003 PHI success 4 R mccab002 tehej001 R 1st & 2nd 1.437 0
1 1 0 0 Bryce Harper harpb003 PHI success 3 R carav001 darvy001 R on 1st 0.509 1
1 1 0 0 Bryce Harper harpb003 PHI success 4 R grany001 housa001 R on 1st 0.509 1
1 1 0 0 Bryce Harper harpb003 PHI success 5 R barnt001 bowmm001 R on 1st 0.509 1
1 1 0 1 Bryce Harper harpb003 PHI success 3 L holab001 alcas001 R corners 1.130 1
1 1 0 1 Bryce Harper harpb003 PHI fail 4 R alfaj002 richt001 R corners 1.130 2
2 1 0 0 Bryce Harper harpb003 PHI success 4 R mccab002 sorom001 R on 1st 0.224 2
2 1 0 0 Bryce Harper harpb003 PHI success 4 R mccab002 soboc001 R on 1st 0.224 2
2 1 0 0 Bryce Harper harpb003 PHI success 4 R ramow001 matzs001 L on 1st 0.224 2
2 1 0 0 Bryce Harper harpb003 PHI success 4 R castj006 berrj001 R on 1st 0.224 2
2 1 0 0 Bryce Harper harpb003 PHI success 4 R flowt001 grees005 R on 1st 0.224 2
2 1 0 0 Bryce Harper harpb003 PHI success 4 R casac001 grays001 R on 1st 0.224 2
2 1 0 0 Bryce Harper harpb003 PHI fail 3 R kellc002 mcfat001 L on 1st 0.224 3
2 1 0 0 Bryce Harper harpb003 PHI success 4 R alfaj002 quijj001 L on 1st 0.224 2

Previously we created the dataframe SBA_state_change by reading in the ‘SBA_state_change.csv’ file. Merge the change in base state at State 1, post SBA, with the df:

event_SBA_full_harper_ev_C <- event_SBA_full_harper_ev_B %>%
  merge(SBA_state_change, by=c('base_1', 'base_2', 'base_3', 'SBA_outcome'))
Take a look at the new columns added to the df:
dataframe event_SBA_full_harper_ev_C, as a table
base_1 base_2 base_3 SBA_outcome OUT_STATE runner_name runner_ID team_runner_cd BAT_LINEUP_ID BAT_HAND_CD CatcherID PIT_ID PIT_HAND_CD base_state_name EV_state S2_OUT_STATE S2_base_1 S2_base_2 S2_base_3
1 0 0 fail 2 Bryce Harper harpb003 PHI 3 R kellc002 mcfat001 L on 1st 0.224 3 0 0 0
1 0 0 success 0 Bryce Harper harpb003 PHI 4 R stalj001 ramiy001 R on 1st 0.859 0 0 1 0
1 0 0 success 2 Bryce Harper harpb003 PHI 4 R ramow001 matzs001 L on 1st 0.224 2 0 1 0
1 0 0 success 1 Bryce Harper harpb003 PHI 3 R carav001 darvy001 R on 1st 0.509 1 0 1 0
1 0 0 success 1 Bryce Harper harpb003 PHI 4 R grany001 housa001 R on 1st 0.509 1 0 1 0
1 0 0 success 1 Bryce Harper harpb003 PHI 5 R barnt001 bowmm001 R on 1st 0.509 1 0 1 0
1 0 0 success 2 Bryce Harper harpb003 PHI 4 R castj006 berrj001 R on 1st 0.224 2 0 1 0
1 0 0 success 2 Bryce Harper harpb003 PHI 4 R mccab002 sorom001 R on 1st 0.224 2 0 1 0
1 0 0 success 2 Bryce Harper harpb003 PHI 4 R mccab002 soboc001 R on 1st 0.224 2 0 1 0
1 0 0 success 2 Bryce Harper harpb003 PHI 4 R alfaj002 quijj001 L on 1st 0.224 2 0 1 0
1 0 0 success 2 Bryce Harper harpb003 PHI 4 R flowt001 grees005 R on 1st 0.224 2 0 1 0
1 0 0 success 2 Bryce Harper harpb003 PHI 4 R casac001 grays001 R on 1st 0.224 2 0 1 0
1 0 1 fail 1 Bryce Harper harpb003 PHI 4 R alfaj002 richt001 R corners 1.130 2 0 0 1
1 0 1 success 1 Bryce Harper harpb003 PHI 3 L holab001 alcas001 R corners 1.130 1 0 1 1
1 0 1 success 2 Bryce Harper harpb003 PHI 3 R casac001 grays001 R corners 0.478 2 0 1 1

Add the expected value for State 1 in the same way as before.
Merge the expected value table with the df, but this time merger using the State 2 base and out states:

event_SBA_full_harper_ev_D <- merge(event_SBA_full_harper_ev_C, expect_value, 
        by.x=c('S2_OUT_STATE', 'S2_base_1', 'S2_base_2', 'S2_base_3'),
        by.y=c('OUT_STATE', 'base_1', 'base_2', 'base_3'))
take a look at the new columns added to the df:
event_SBA_full_harper_ev_D
S2_OUT_STATE S2_base_1 S2_base_2 S2_base_3 base_1 base_2 base_3 SBA_outcome OUT_STATE runner_name runner_ID team_runner_cd BAT_LINEUP_ID BAT_HAND_CD CatcherID PIT_ID PIT_HAND_CD base_state_name.x EV_state.x base_state_name.y EV_state.y
0 0 1 0 1 0 0 success 0 Bryce Harper harpb003 PHI 4 R stalj001 ramiy001 R on 1st 0.859 on 2nd 1.100
0 1 0 1 1 1 0 success 0 Bryce Harper harpb003 PHI 4 R mccab002 tehej001 R 1st & 2nd 1.437 corners 1.784
1 0 1 0 1 0 0 success 1 Bryce Harper harpb003 PHI 4 R grany001 housa001 R on 1st 0.509 on 2nd 0.664
1 0 1 0 1 0 0 success 1 Bryce Harper harpb003 PHI 5 R barnt001 bowmm001 R on 1st 0.509 on 2nd 0.664
1 0 1 0 1 0 0 success 1 Bryce Harper harpb003 PHI 3 R carav001 darvy001 R on 1st 0.509 on 2nd 0.664
1 0 1 1 1 0 1 success 1 Bryce Harper harpb003 PHI 3 L holab001 alcas001 R corners 1.130 2nd & 3rd 1.376
2 0 0 1 1 0 1 fail 1 Bryce Harper harpb003 PHI 4 R alfaj002 richt001 R corners 1.130 on 3rd 0.353
2 0 1 0 1 0 0 success 2 Bryce Harper harpb003 PHI 4 R ramow001 matzs001 L on 1st 0.224 on 2nd 0.319
2 0 1 0 1 0 0 success 2 Bryce Harper harpb003 PHI 4 R mccab002 sorom001 R on 1st 0.224 on 2nd 0.319
2 0 1 0 1 0 0 success 2 Bryce Harper harpb003 PHI 4 R mccab002 soboc001 R on 1st 0.224 on 2nd 0.319
2 0 1 0 1 0 0 success 2 Bryce Harper harpb003 PHI 4 R alfaj002 quijj001 L on 1st 0.224 on 2nd 0.319
2 0 1 0 1 0 0 success 2 Bryce Harper harpb003 PHI 4 R castj006 berrj001 R on 1st 0.224 on 2nd 0.319
2 0 1 0 1 0 0 success 2 Bryce Harper harpb003 PHI 4 R casac001 grays001 R on 1st 0.224 on 2nd 0.319
2 0 1 0 1 0 0 success 2 Bryce Harper harpb003 PHI 4 R flowt001 grees005 R on 1st 0.224 on 2nd 0.319
2 0 1 1 1 0 1 success 2 Bryce Harper harpb003 PHI 3 R casac001 grays001 R corners 0.478 2nd & 3rd 0.580

See how there are now two ‘EV_state’ columns; EV_state.x & EV_state.y These are the expected values for State 1 and State 2. Let’s use the rename() function to rename the column to somethings that makes more sense and is easier to read and remember. Also, now you can calculate the change in value for each of the three SBA in the data frame. Use the mutate function to create a new column, and name it ‘change_EV’.
This is the simple subtraction between the expected value of State 1 and State 2.

event_SBA_full_harper_ev_E <- event_SBA_full_harper_ev_D %>%
  rename(EV_state_1 = EV_state.x, 
         EV_state_2 = EV_state.y) %>%
  mutate(change_EV= EV_state_2- EV_state_1 )

was value created over these three SBA’s? Let’s find out.
Sum the change in value over the full season of SBA’s.

sum(event_SBA_full_harper_ev_E$change_EV)
## [1] 1.065

A positive value means the runner created value with their SBA’s over the full season.


Task 4: Team level SBA expected value

Now we are ready to explore a larger question: Did the team create value (or not) by SBA? Over an entire team, over all stolen base attempts, determine the change in expected value between state0 and state1. Over all SBA base states.

The event file that we need is the same as the previous task; the full season SBA events. Determine the total number of SBA per team over the full season. Use the table() function over the team_runner_cd column.

table(event_SBA$team_runner_cd)
## 
## ANA ARI ATL BAL BOS CHA CHN CIN CLE COL DET HOU KCA LAN MIA MIL MIN NYA NYN OAK 
##  75  95 101  88  83  72  54  95 115  82  73  78 101  60  69  96  43  61  74  56 
## PHI PIT SDN SEA SFN SLN TBA TEX TOR WAS 
##  79  79  87 136  55 114 106 132  53 120

What team attempted the most SBA? Is there much variability across teams?

Similar to the previous tasks, next apply the expected value table to the event file containing all SBA for the season. Also at the Out state change in State 1.

event_SBA_team <- event_SBA %>% 
  merge(expect_value, by=c('OUT_STATE', 'base_1', 'base_2', 'base_3')) %>%
  mutate(S2_OUT_STATE= ifelse(SBA_outcome=='success', OUT_STATE, 1+OUT_STATE))

And complete the State 1 by adding the change in base states:

event_SBA_team_A <- event_SBA_team %>%
  merge(SBA_state_change, by=c('base_1', 'base_2', 'base_3', 'SBA_outcome'))

Finally, reapply the expected value to State 1:

event_SBA_team_B <- event_SBA_team_A %>%
  merge(expect_value, 
        by.x=c('S2_OUT_STATE', 'S2_base_1', 'S2_base_2', 'S2_base_3'),
        by.y=c('OUT_STATE', 'base_1', 'base_2', 'base_3')) %>%
  rename(EV_state_1 = EV_state.x, 
         EV_state_2 = EV_state.y,
         base_state_name_1 = base_state_name.x,
         base_state_name_2 = base_state_name.y) %>%
  mutate(delta_EV = EV_state_2-EV_state_1)

Now we have a dataframe the calculates the change in expected value for every stolen base attempt in one season. Let’s see if over the entire league, over the full season, did SBA’s create or destroy value on average. To do this we will sum the change in expected value over all SBA’s and then divide by the number of SBA’s.

sum(event_SBA_team_B$delta_EV)/length(event_SBA_team_B$delta_EV)
## [1] 0.02210269

A positive results indicates that over all stolen base attempts from all players stolen bases attempts create value.

But, our task is to explore the team level change in expected value. As a reminder, the dataframe contains a column, team_runner_cd, that identifies the team of the runner for each SBA event.
First we need to change the variable type of team_runner_cd.

event_SBA_team_B$team_runner_cd <- as.factor(event_SBA_team_B$team_runner_cd)

Next calculate the net expect value change by team from all players.

event_SBA_team_C <- event_SBA_team_B %>%
  group_by(team_runner_cd) %>% 
  summarise(tot_delta_EV = sum(delta_EV))

Visualize the net expected value change per team from SBA

ggplot(event_SBA_team_C) +
  geom_col(aes(y=tot_delta_EV, x=reorder(team_runner_cd, desc(tot_delta_EV)))) +
  theme_bw() +
  theme(axis.text.x=element_text(angle=90,hjust=1)) +
  xlab(NULL) +
  ylab('cumulative EV change') +
  ggtitle('Team wide change in EV from SBA') +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_hline(yintercept = 0, color='red')


extra stuff - all individual players over the full season

event_SBA_team_indiv <- event_SBA_team_B %>% 
  group_by(runner_ID) %>% 
  mutate(counter=1) %>%
  summarise(delta_EV_tot= sum(delta_EV),
            SBA_tot= sum(counter))
ggplot(event_SBA_team_indiv) +
  geom_hline(yintercept = 0, color='red') +
  geom_point(aes(x=SBA_tot, y=delta_EV_tot), alpha=0.5) +
  theme_bw() + xlab('SBA per player') +
  scale_y_continuous(breaks=seq(-2.5, 5, by=0.5), 
                     name='net change in expected value from SBA')