library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-41. For overview type 'help("mgcv-package")'.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::collapse() masks nlme::collapse()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'forecast'
##
## The following object is masked from 'package:nlme':
##
## getResponse
load("G:/My Drive/Baseball/R Projects/Data/Statcast2023.RData")
# Create data of balls in play
hip <- c('single', 'double', 'triple')
hip_data <- Statcast2023 %>%
filter(events %in% hip) %>%
mutate(base = if_else(events == 'single', 1,
if_else(events == 'double', 2,
if_else(events == 'triple', 3,
NA)))) %>%
select(player_name, events, bb_type, hc_x, hc_y, launch_angle, launch_speed,
base)
# Partition into training and validation data
RNGkind(sample.kind = 'Rounding')
set.seed(42)
train_rows <- sample(1:dim(hip_data)[1], dim(hip_data)[1]*0.8)
train_data <- hip_data[train_rows,]
valid_data <- hip_data[-train_rows,]
GAM models non-linear relationships between related variables. This model creates a “predicted bases” metric based on the chosen variables. It is most similar to an xSLG calculation, but it only considers hits in play. In this way, it focuses on baserunning.
launch_speed: exit velocity hc_x & hc_y: hit coordinates
model <- gam(base ~ s(launch_speed, launch_angle, hc_x, hc_y),
data = train_data)
summary(model)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## base ~ s(launch_speed, launch_angle, hc_x, hc_y)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.274892 0.001782 715.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(launch_speed,launch_angle,hc_x,hc_y) 111.4 113.9 415.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.625 Deviance explained = 62.7%
## GCV = 0.090442 Scale est. = 0.090083 n = 28353
# Predict on validation data
predictions <- as.vector(predict(model, newdata = valid_data, na.action = na.pass))
pred_dat <- cbind.data.frame(valid_data$player_name, predictions, valid_data$base) %>% na.omit
accuracy <- accuracy(predictions, valid_data$base)
accuracy
## ME RMSE MAE MPE MAPE
## Test set 0.006925169 0.2969843 0.1673352 -3.175594 12.00658
# Add predictions on all 2023 data
season_preds <- as.vector(predict(model, newdata = hip_data))
full_data <- cbind.data.frame(hip_data$player_name, season_preds,
hip_data$base) %>% na.omit %>%
rename(name = `hip_data$player_name`,
predicted = season_preds,
actual = `hip_data$base`) %>%
mutate(diff = actual - predicted)
head(full_data)
## name predicted actual diff
## 1 Vosler, Jason 1.9397503 3 1.06024970
## 2 McNeil, Jeff 1.0605826 1 -0.06058258
## 3 Fraley, Jake 1.3296299 1 -0.32962986
## 4 Blackmon, Charlie 0.9635115 1 0.03648851
## 5 Guerrero Jr., Vladimir 1.0514479 1 -0.05144792
## 6 Martinez, J.D. 0.9660111 1 0.03398893
# Group by player and calculate sum and mean
extra_bases <- full_data %>%
group_by(name) %>%
summarize(total = sum(diff),
mean = mean(diff)) %>%
arrange(by = desc(total))
head(extra_bases, 15) %>% arrange(by = desc(total))
## # A tibble: 15 × 3
## name total mean
## <chr> <dbl> <dbl>
## 1 Duran, Jarren 15.4 0.172
## 2 Friedl, TJ 15.1 0.128
## 3 Carroll, Corbin 14.6 0.0963
## 4 Witt Jr., Bobby 14.4 0.0986
## 5 Hayes, Ke'Bryan 13.7 0.115
## 6 Kwan, Steven 13.1 0.0792
## 7 Henderson, Gunnar 11.3 0.0945
## 8 De La Cruz, Elly 11.1 0.142
## 9 Kiermaier, Kevin 11.1 0.122
## 10 Ohtani, Shohei 9.77 0.0913
## 11 Rosario, Amed 9.69 0.0757
## 12 Suzuki, Seiya 9.53 0.0756
## 13 Perdomo, Geraldo 9.20 0.0876
## 14 Marte, Ketel 9.00 0.0584
## 15 Drury, Brandon 8.98 0.0890
tail(extra_bases, 15) %>% arrange(by = total)
## # A tibble: 15 × 3
## name total mean
## <chr> <dbl> <dbl>
## 1 Perez, Salvador -13.6 -0.120
## 2 Arraez, Luis -12.4 -0.0639
## 3 Maldonado, Martín -11.0 -0.190
## 4 Garver, Mitch -9.67 -0.138
## 5 Suárez, Eugenio -9.44 -0.0807
## 6 Ozuna, Marcell -9.16 -0.0856
## 7 Conforto, Michael -9.07 -0.111
## 8 Alvarez, Yordan -8.10 -0.0786
## 9 Díaz, Elias -7.62 -0.0662
## 10 Rizzo, Anthony -7.05 -0.0892
## 11 Vogelbach, Daniel -7.01 -0.137
## 12 Schwarber, Kyle -7.00 -0.0933
## 13 Ruiz, Keibert -6.93 -0.0588
## 14 Kirk, Alejandro -6.71 -0.0771
## 15 Olson, Matt -6.65 -0.0545
The tables above list the players who took the most and fewest “extra bases” out of the batters box in 2023. The leaders in this category can be understood as those who are the most aggressive out of the box, taking hustle doubles and pushing for triples when most do not. This is also a very reliable form of power production because it is more reflective of play style than present success. As the saying goes, “Speed never slumps.” Likewise, good baserunning never slumps.