Over the past couple of weeks, I have been researching the spin axis of pitches and whether the difference between expected and actual spin axis affects certain outcomes. Here, I want to present a mini study in which I estimate the expected spin axis of a sinker from the x and z coordinates of the pitcher’s release point, and then test whether the difference between expected and actual spin axis affects the distance of batted balls. For this study, I use data from the 2022 season, which is a large enough sample to get a good fit for the expected spin axis model without having to worry about controlling for different ball types between seasons.

This first code chunk loads the data and filters down to just sinkers. Then, I use linear regression to fit a model of spin axis, where the independent variables are the release point coordinates in the x,z plane.

library(readr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v dplyr   1.0.8
## v tibble  3.1.6     v stringr 1.4.0
## v tidyr   1.2.0     v forcats 0.5.1
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Import Savant Data for 2022 (scraped using pybaseball)

statcast22 = read_csv("C:/Users/david/PycharmProjects/pythonProject/statcast22.csv")
## New names:
## * `` -> ...1
## Rows: 427974 Columns: 93
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (16): pitch_type, player_name, events, description, des, game_type, sta...
## dbl  (68): ...1, release_speed, release_pos_x, release_pos_z, batter, pitche...
## lgl   (8): spin_dir, spin_rate_deprecated, break_angle_deprecated, break_len...
## date  (1): game_date
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
si = statcast22 %>%
  filter(pitch_type == "SI") 

# Estimate Expected Spin Axis
m0 = lm(spin_axis ~ release_pos_x + release_pos_z, data = si)
summary(m0)
## 
## Call:
## lm(formula = spin_axis ~ release_pos_x + release_pos_z, data = si)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -268.056  -10.467    0.277   10.030  260.083 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   209.84334    0.59888  350.39   <2e-16 ***
## release_pos_x -20.31765    0.03271 -621.12   <2e-16 ***
## release_pos_z  -4.91457    0.10443  -47.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.72 on 66017 degrees of freedom
##   (167 observations deleted due to missingness)
## Multiple R-squared:  0.858,  Adjusted R-squared:  0.858 
## F-statistic: 1.994e+05 on 2 and 66017 DF,  p-value: < 2.2e-16

This simple model, which uses only the x and z coordinates of the release point, fits the data well, with an R-squared of 0.858. A basic calibration check is that pitches released from mirrored slots, say 11:00 and 1:00, should have roughly the same expected spin axis, just mirrored. This model passes that check.
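
The mirror check can be sketched by hand from the coefficients printed above. This is only an illustration under stated assumptions: the release points (2 feet to either side of center, 6 feet high) are hypothetical values chosen for the example, and it assumes spin axis is measured so that 180 degrees sits on the mirror line, meaning an axis of θ reflects to 360 - θ.

```r
# Coefficients copied from the summary(m0) output above
b0 <- 209.84334   # intercept
bx <- -20.31765   # release_pos_x
bz <- -4.91457    # release_pos_z

# Expected spin axis (degrees) for a release point given in feet
expected_axis <- function(x, z) b0 + bx * x + bz * z

# Hypothetical mirrored release points (assumed values, not from the data)
z0 <- 6
axis_neg <- expected_axis(-2, z0)
axis_pos <- expected_axis(2, z0)

# Reflect one prediction across 180 degrees and compare it with the other.
# A gap near zero means the two predictions are mirror images.
mirror_gap <- (360 - axis_neg) - axis_pos
print(c(neg_x = axis_neg, pos_x = axis_pos, gap = mirror_gap))
```

With these inputs the two predictions land about 41 degrees on either side of 180, and the gap between one prediction and the reflection of the other is under one degree.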

Moving to the next section, I will use a mixed effects model to estimate batted ball distance. This is a fairly simple model as well. The only random effect is ballpark, and the fixed effects are pitch location (x,z coordinates at the plate), velocity, spin rate, movement (x,z) of the pitch, the difference between actual and predicted spin axis, launch angle, exit velocity, and whether the pitcher and batter have the same handedness. The one added complexity is that the launch angle term will be smooth. The mgcv package is good for this type of problem, as it allows for both smooth terms and random effects.

si_new = si %>%
  mutate(p_axis = predict(m0, si),
         axis_diff = spin_axis - p_axis,
         adj_spin = release_spin_rate / 1000,
         same_hand = if_else(p_throws == stand, 1, 0)) %>%
  filter(!is.na(hit_distance_sc))

library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.
si_new$home_team <- as.factor(si_new$home_team) 
m0 = gam(hit_distance_sc ~ same_hand + axis_diff + plate_x + plate_z + pfx_x + pfx_z + release_speed + adj_spin + launch_speed + s(launch_angle) + s(home_team, bs = "re"), data = si_new)
summary(m0)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## hit_distance_sc ~ same_hand + axis_diff + plate_x + plate_z + 
##     pfx_x + pfx_z + release_speed + adj_spin + launch_speed + 
##     s(launch_angle) + s(home_team, bs = "re")
## 
## Parametric coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.129598   8.372474  -0.852  0.39447    
## same_hand     -3.564780   0.506435  -7.039 1.99e-12 ***
## axis_diff      0.008977   0.014615   0.614  0.53908    
## plate_x       -1.862673   0.466132  -3.996 6.46e-05 ***
## plate_z        3.484742   0.477325   7.301 2.96e-13 ***
## pfx_x         -0.266442   0.251029  -1.061  0.28852    
## pfx_z          1.890116   0.630766   2.997  0.00273 ** 
## release_speed -0.599287   0.090415  -6.628 3.47e-11 ***
## adj_spin       0.580981   1.432695   0.406  0.68510    
## launch_speed   2.296123   0.017075 134.473  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df         F p-value    
## s(launch_angle)  8.988      9 22801.836  <2e-16 ***
## s(home_team)    19.563     29     2.163  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.914   Deviance explained = 91.5%
## GCV = 1259.8  Scale est. = 1257.8    n = 23662
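
To put the axis_diff estimate in context, here is a rough back-of-the-envelope calculation using the estimate and standard error from the summary above. It is a sketch, not part of the fitted model: the 50-degree deviation is an illustrative value, and the interval uses a normal approximation, which seems reasonable with n = 23662.

```r
# Estimate and standard error for axis_diff, copied from the summary above
est <- 0.008977
se  <- 0.014615

# Approximate 95% interval per degree, using the normal quantile
ci_per_degree <- est + c(-1, 1) * qnorm(0.975) * se

# Scale to an illustrative 50-degree deviation from the expected axis
deviation <- 50
effect_ft <- est * deviation
ci_ft <- ci_per_degree * deviation
print(round(c(estimate = effect_ft, lower = ci_ft[1], upper = ci_ft[2]), 2))
```

Even at a 50-degree deviation, the point estimate is under half a foot of batted ball distance, and the approximate interval runs from about -1 to about 1.9 feet, so it comfortably includes zero.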

As you can see, the difference between actual and expected spin axis is not significant in this model. There is some hope, though. This model treats being -50 degrees relative to expectation as different from being +50 degrees. Let’s see what happens if we treat them the same by using the absolute difference.

m1 = gam(hit_distance_sc ~ same_hand + abs(axis_diff) + plate_x + plate_z + pfx_x + pfx_z + release_speed + adj_spin + launch_speed + s(launch_angle) + s(home_team, bs = "re"), data = si_new)
summary(m1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## hit_distance_sc ~ same_hand + abs(axis_diff) + plate_x + plate_z + 
##     pfx_x + pfx_z + release_speed + adj_spin + launch_speed + 
##     s(launch_angle) + s(home_team, bs = "re")
## 
## Parametric coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -7.073432   8.382390  -0.844  0.39876    
## same_hand      -3.566445   0.506482  -7.042 1.95e-12 ***
## abs(axis_diff) -0.008397   0.021860  -0.384  0.70088    
## plate_x        -1.835402   0.463460  -3.960 7.51e-05 ***
## plate_z         3.482837   0.477325   7.297 3.04e-13 ***
## pfx_x          -0.293065   0.245607  -1.193  0.23279    
## pfx_z           1.818720   0.642016   2.833  0.00462 ** 
## release_speed  -0.595261   0.090247  -6.596 4.32e-11 ***
## adj_spin        0.450811   1.428320   0.316  0.75229    
## launch_speed    2.296114   0.017075 134.470  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df         F p-value    
## s(launch_angle)  8.988      9 22800.081  <2e-16 ***
## s(home_team)    19.562     29     2.147  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.914   Deviance explained = 91.5%
## GCV = 1259.8  Scale est. = 1257.8    n = 23662

The spin axis difference is still not significant. This was surprising to me, since being an outlier tends to have some sort of large effect. Perhaps with a more sophisticated model we could find something here, but it’s not good practice to mine data in hopes of finding p-values less than 5%, so I will leave this experiment here.