MLB has advanced pitch-tracking equipment installed in every stadium.
Using pybaseball, a Python-based web scraper, I pulled pitch-by-pitch data from baseballsavant.com.
To start, I will use data from a single pitcher, Tarik Skubal.
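A minimal sketch of how the pull might look with pybaseball. The helper name fetch_pitcher_pitches and the date range in the usage comment are assumptions for illustration, not the exact query used here; playerid_lookup and statcast_pitcher are the pybaseball calls for resolving a player id and scraping that pitcher's rows.

```python
def fetch_pitcher_pitches(last_name, first_name, start_dt, end_dt):
    """Return one pitcher's pitch-by-pitch Statcast rows as a DataFrame.

    Hypothetical helper: the name and argument layout are illustrative.
    """
    # Imported lazily so this file still loads where pybaseball is absent.
    from pybaseball import playerid_lookup, statcast_pitcher

    # playerid_lookup returns one row per matching player; key_mlbam is the
    # id that Baseball Savant queries are keyed on.
    ids = playerid_lookup(last_name, first_name)
    if ids.empty:
        raise ValueError(f"no player found for {first_name} {last_name}")
    mlbam_id = int(ids["key_mlbam"].iloc[0])

    # Scrapes Baseball Savant for every pitch thrown in the date range.
    return statcast_pitcher(start_dt, end_dt, mlbam_id)


# Example call (needs network access; the dates are an assumed range):
# pitches = fetch_pitcher_pitches("skubal", "tarik", "2024-03-28", "2024-09-30")
# print(pitches["pitch_type"].value_counts())
```

Looking the id up by name avoids hard-coding a player id that could be mistyped.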
Isolated numeric variables (kinematics, spin, movement) for modeling.
Training Set: 70%
Testing Set: 20%
Validation Set: 10%
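One way the 70/20/10 split could be produced, sketched with NumPy on row indices. The function name split_indices, the seed, and the 1000-row example are assumptions; this version shuffles without stratifying by pitch type, which the actual split may or may not have done.

```python
import numpy as np


def split_indices(n_rows, train=0.7, test=0.2, seed=0):
    """Shuffle row indices and cut them into train / test / validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)          # random ordering of 0..n_rows-1
    n_train = int(round(train * n_rows))
    n_test = int(round(test * n_rows))
    train_idx = order[:n_train]
    test_idx = order[n_train:n_train + n_test]
    val_idx = order[n_train + n_test:]       # the remainder, about 10%
    return train_idx, test_idx, val_idx


# Illustrative size only; the real data frame has one row per pitch.
train_idx, test_idx, val_idx = split_indices(1000)
print(len(train_idx), len(test_idx), len(val_idx))
```

Splitting indices rather than the data itself keeps the three sets disjoint by construction and lets the same split be reused across feature sets.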
The model achieves 99% balanced accuracy on Skubal’s pitches.
How does it perform on other pitchers?
Ryan Weathers: LHP. Similar arsenal, but throws a sweeper instead of a traditional curveball.
Patrick Sandoval: LHP. Throws the exact same 5 pitches as Skubal, but with a less elite movement profile.
Jake Arrieta: RHP. Retired, but threw mostly the same pitch mix.
Does a good job with Sliders, Changeups, and Sinkers.
Misclassified some fastballs as sinkers.
Weathers throws a Sweeper instead of a Curveball.
A sweeper is somewhere between a slider and a curveball.
Sweepers were classified as a mix of slider and curveball.
Overall, the model generalizes very well to Sandoval.
97%+ accuracy for slider, curveball, fastball, changeup.
Sinker results were mixed, with frequent misclassifications.
Sweeper again classified as curveball or slider.
Generalizing the model to a RHP did not work well.
This makes sense: RHPs and LHPs throw from opposite sides.
When a RHP throws a slider it breaks left, while a LHP's slider breaks right.
So for a variable like api_break_x_arm, we get a positive value for a LHP and a negative value for a RHP.
This clearly causes classification issues when the model splits on x-axis variables.
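The sign problem can be shown with a toy example. The break values below are made up, and the single-threshold rule is a stand-in for one split of a tree-style model rather than the model used here. The RHP data is assumed to be an exact mirror image of the LHP data, which real pitchers only approximate.

```python
import numpy as np

# Made-up horizontal-break values for a LHP: sliders positive, changeups negative.
lhp_break = np.array([0.9, 1.1, 1.0, -0.8, -1.0, -0.9])
lhp_label = np.array(["SL", "SL", "SL", "CH", "CH", "CH"])


def predict(break_x, threshold=0.0):
    """One split learned on the LHP: positive break -> slider, else changeup."""
    return np.where(break_x > threshold, "SL", "CH")


# Assumption: a RHP throwing the same pitches produces the mirrored sign.
rhp_break = -lhp_break
rhp_label = lhp_label.copy()

lhp_acc = float(np.mean(predict(lhp_break) == lhp_label))
rhp_acc = float(np.mean(predict(rhp_break) == rhp_label))
print(lhp_acc, rhp_acc)  # 1.0 0.0
```

The same split that separates the LHP's pitches perfectly gets every mirrored RHP pitch wrong, which is the failure pattern described for x-axis variables.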
Can a more general model be built off a random sample of ALL pitchers?
At what n do we see a diminishing return for our model?
A sample of size n will be drawn from a data frame of 1.5 million pitches from 2023-2024.
Right- and left-handed models are built separately for each n.
The data are unbalanced, so we will evaluate with balanced accuracy.
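A rough sketch of the sampling step, assuming handedness is available as an array of "L"/"R" values (as in Statcast's p_throws column). The function name sample_by_hand, the toy pool, and the values of n are illustrative only; model fitting is left out.

```python
import numpy as np


def sample_by_hand(p_throws, n, seed=0):
    """Draw n row indices without replacement and group them by handedness."""
    p_throws = np.asarray(p_throws)
    rng = np.random.default_rng(seed)
    picked = rng.choice(len(p_throws), size=n, replace=False)
    # One index array per hand; each would feed its own model.
    return {hand: picked[p_throws[picked] == hand] for hand in ("L", "R")}


# Toy pool standing in for the full pitch data frame.
pool = np.array(["R"] * 700 + ["L"] * 300)
for n in (50, 100, 200):
    groups = sample_by_hand(pool, n)
    print(n, len(groups["L"]), len(groups["R"]))
```

Because the draw is from all pitchers at once, the L/R share of each sample follows the pool, so the left-handed model sees fewer rows than the right-handed one at the same n.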
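Balanced accuracy is the mean of per-class recall, so a rare pitch type counts as much as a common one. A small worked example with made-up labels shows why it matters here; scikit-learn's balanced_accuracy_score computes the same quantity, and the hand-rolled version below is only for illustration.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Made-up labels: 8 fastballs, 2 curveballs, and a model that always says fastball.
y_true = ["FF"] * 8 + ["CU"] * 2
y_pred = ["FF"] * 10

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain)     # 0.8
print(balanced)  # 0.5
```

Plain accuracy rewards the always-fastball model for the class imbalance, while balanced accuracy exposes that it never identifies a curveball.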