Question:

Identify players who are the most aggressive and successful baserunners out of the batters box.

library(mgcv)

## Loading required package: nlme

## This is mgcv 1.8-41. For overview type 'help("mgcv-package")'.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::collapse() masks nlme::collapse()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo 
## 
## Attaching package: 'forecast'
## 
## The following object is masked from 'package:nlme':
## 
##     getResponse

load("G:/My Drive/Baseball/R Projects/Data/Statcast2023.RData")

# Create data of balls in play
hip <- c('single', 'double', 'triple')

hip_data <- Statcast2023 %>%
  filter(events %in% hip) %>%
  mutate(base = if_else(events == 'single', 1,
                           if_else(events == 'double', 2,
                                   if_else(events == 'triple', 3,
                                           NA)))) %>%
  select(player_name, events, bb_type, hc_x, hc_y, launch_angle, launch_speed, 
           base)

# Partition into training and validation data
RNGkind(sample.kind = 'Rounding')
set.seed(42)
train_rows <- sample(1:dim(hip_data)[1], dim(hip_data)[1]*0.8)
train_data <- hip_data[train_rows,]  
valid_data <- hip_data[-train_rows,]

Create general additive model

GAM models non-linear relationships between related variables. This model creates a “predicted bases” metric based on the chosen variables. It is most similar to an xSLG calculation, but it only considers hits in play. In this way, it focuses on baserunning.

launch_speed: exit velocity hc_x & hc_y: hit coordinates

model <- gam(base ~ s(launch_speed, launch_angle, hc_x, hc_y),
             data = train_data)
summary(model)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## base ~ s(launch_speed, launch_angle, hc_x, hc_y)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.274892   0.001782   715.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                          edf Ref.df     F p-value    
## s(launch_speed,launch_angle,hc_x,hc_y) 111.4  113.9 415.1  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.625   Deviance explained = 62.7%
## GCV = 0.090442  Scale est. = 0.090083  n = 28353

# Predict on validation data
predictions <- as.vector(predict(model, newdata = valid_data, na.action = na.pass))

pred_dat <- cbind.data.frame(valid_data$player_name, predictions, valid_data$base) %>% na.omit

accuracy <- accuracy(predictions, valid_data$base)
accuracy

##                   ME      RMSE       MAE       MPE     MAPE
## Test set 0.006925169 0.2969843 0.1673352 -3.175594 12.00658

# Add predictions on all 2023 data
season_preds <- as.vector(predict(model, newdata = hip_data))

full_data <- cbind.data.frame(hip_data$player_name, season_preds,
                              hip_data$base) %>% na.omit %>%
  rename(name = `hip_data$player_name`,
         predicted = season_preds,
         actual = `hip_data$base`) %>%
  mutate(diff = actual - predicted)

head(full_data)

##                     name predicted actual        diff
## 1          Vosler, Jason 1.9397503      3  1.06024970
## 2           McNeil, Jeff 1.0605826      1 -0.06058258
## 3           Fraley, Jake 1.3296299      1 -0.32962986
## 4      Blackmon, Charlie 0.9635115      1  0.03648851
## 5 Guerrero Jr., Vladimir 1.0514479      1 -0.05144792
## 6         Martinez, J.D. 0.9660111      1  0.03398893

# Group by player and calculate sum and mean
extra_bases <- full_data %>%
  group_by(name) %>%
  summarize(total = sum(diff),
            mean = mean(diff)) %>%
  arrange(by = desc(total))

head(extra_bases, 15) %>% arrange(by = desc(total))

## # A tibble: 15 × 3
##    name              total   mean
##    <chr>             <dbl>  <dbl>
##  1 Duran, Jarren     15.4  0.172 
##  2 Friedl, TJ        15.1  0.128 
##  3 Carroll, Corbin   14.6  0.0963
##  4 Witt Jr., Bobby   14.4  0.0986
##  5 Hayes, Ke'Bryan   13.7  0.115 
##  6 Kwan, Steven      13.1  0.0792
##  7 Henderson, Gunnar 11.3  0.0945
##  8 De La Cruz, Elly  11.1  0.142 
##  9 Kiermaier, Kevin  11.1  0.122 
## 10 Ohtani, Shohei     9.77 0.0913
## 11 Rosario, Amed      9.69 0.0757
## 12 Suzuki, Seiya      9.53 0.0756
## 13 Perdomo, Geraldo   9.20 0.0876
## 14 Marte, Ketel       9.00 0.0584
## 15 Drury, Brandon     8.98 0.0890

tail(extra_bases, 15) %>% arrange(by = total)

## # A tibble: 15 × 3
##    name               total    mean
##    <chr>              <dbl>   <dbl>
##  1 Perez, Salvador   -13.6  -0.120 
##  2 Arraez, Luis      -12.4  -0.0639
##  3 Maldonado, Martín -11.0  -0.190 
##  4 Garver, Mitch      -9.67 -0.138 
##  5 Suárez, Eugenio    -9.44 -0.0807
##  6 Ozuna, Marcell     -9.16 -0.0856
##  7 Conforto, Michael  -9.07 -0.111 
##  8 Alvarez, Yordan    -8.10 -0.0786
##  9 Díaz, Elias        -7.62 -0.0662
## 10 Rizzo, Anthony     -7.05 -0.0892
## 11 Vogelbach, Daniel  -7.01 -0.137 
## 12 Schwarber, Kyle    -7.00 -0.0933
## 13 Ruiz, Keibert      -6.93 -0.0588
## 14 Kirk, Alejandro    -6.71 -0.0771
## 15 Olson, Matt        -6.65 -0.0545

Conclusion

The tables above list the players who took the most and fewest “extra bases” out of the batters box in 2023. The leaders in this category can be understood as those who are the most aggressive out of the box, taking hustle doubles and pushing for triples when most do not. This is also a very reliable form of power production because it is more reflective of play style than present success. As the saying goes, “Speed never slumps.” Likewise, good baserunning never slumps.

Most Extra Bases

Jarren Duran (BOS)
TJ Friedl (CIN)
Corbin Carroll (ARI)
Bobby Witt Jr. (KC)
Ke’Bryan Hayes (PIT)
Steven Kwan (CLE)
Gunnar Henderson (BAL)
Elly De La Cruz (CIN)
Kevin Kiermaier (TOR)
Shohei Ohtani (LAA)

Fewest Extra Bases

Salvador Perez (KC)
Luis Arraez (MIA)
Martín Maldonado (HOU)
Mitch Garver (TEX)
Eugenio Suárez (SEA)

Extra Bases