Pitch Prediction Analysis

Data Setup

MLB has advanced pitch-tracking equipment installed in every stadium.
Using pybaseball, a Python-based web scraper, I pulled pitch-by-pitch data from baseballsavant.com.
To start, the analysis uses data from a single pitcher, Tarik Skubal.
Isolated numeric variables (kinematics, spin, movement) for modeling.
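
A minimal sketch of the data pull and feature isolation, assuming pybaseball's playerid_lookup and statcast_pitcher helpers. The feature list and the function names below are illustrative assumptions, not the exact ones used in this analysis.

```python
import pandas as pd

# Illustrative subset of Statcast columns (an assumption, not the full feature set).
FEATURES = ["release_speed", "release_spin_rate", "pfx_x", "pfx_z"]


def fetch_pitches(last: str, first: str, start: str, end: str) -> pd.DataFrame:
    """Download pitch-by-pitch Statcast data for one pitcher (needs network access)."""
    # Imported lazily so the rest of the sketch works without pybaseball installed.
    from pybaseball import playerid_lookup, statcast_pitcher

    ids = playerid_lookup(last, first)
    mlbam_id = int(ids["key_mlbam"].iloc[0])
    return statcast_pitcher(start, end, mlbam_id)


def numeric_features(pitches: pd.DataFrame, features=FEATURES) -> pd.DataFrame:
    """Keep the pitch_type label plus numeric predictors, dropping incomplete rows."""
    cols = ["pitch_type"] + [c for c in features if c in pitches.columns]
    return pitches[cols].dropna().reset_index(drop=True)
```

Usage would look like numeric_features(fetch_pitches("skubal", "tarik", start, end)), with start and end given as "YYYY-MM-DD" strings for the season of interest.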

PCA Analysis

PCA - Biplot
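
The numbers behind a biplot can be sketched as follows. This is a NumPy-only version under the assumption that the features are standardized first; the function name is hypothetical and the analysis itself may have used a library PCA instead.

```python
import numpy as np


def pca_biplot_data(X, n_components: int = 2):
    """Return (scores, loadings, explained_variance_ratio) for standardized X."""
    X = np.asarray(X, dtype=float)
    # Standardize so features measured in different units are comparable.
    # Assumes no column is constant (its standard deviation would be zero).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # The right singular vectors of Z are the principal axes.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # points in the biplot
    loadings = Vt[:n_components].T                   # arrows in the biplot
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, loadings, explained[:n_components]
```

Each row of scores is one pitch in the plane of the first two components, and each row of loadings is the arrow for one input variable.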

Model Splitting and Training

Training Set: 70%
Testing Set: 20%
Validation Set: 10%
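
One way to produce the 70/20/10 split is two successive calls to scikit-learn's train_test_split. Stratifying by pitch type and the fixed seed are assumptions made for this sketch.

```python
from sklearn.model_selection import train_test_split


def split_70_20_10(X, y, seed=0):
    """Return (train, test, validation) pairs of (X, y), stratified by pitch type."""
    # First carve off the 70% training set.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=0.70, stratify=y, random_state=seed
    )
    # Split the remaining 30% two-to-one into test (20%) and validation (10%).
    X_test, X_val, y_test, y_val = train_test_split(
        X_rest, y_rest, train_size=2 / 3, stratify=y_rest, random_state=seed
    )
    return (X_train, y_train), (X_test, y_test), (X_val, y_val)
```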

CART

Random Forest

XGBoost
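
The three tree-based models named above could be fit along these lines. The hyperparameters are placeholder assumptions, not tuned values from this analysis, and XGBoost is treated as optional so the sketch still runs without it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier


def fit_tree_models(X_train, y_train, seed=0):
    """Fit CART, random forest, and (if installed) XGBoost on the same data."""
    X_train = np.asarray(X_train, dtype=float)
    encoder = LabelEncoder()
    y_enc = encoder.fit_transform(y_train)  # XGBoost expects integer labels 0..k-1

    models = {
        "cart": DecisionTreeClassifier(max_depth=5, random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    try:
        from xgboost import XGBClassifier

        models["xgboost"] = XGBClassifier(
            n_estimators=100, max_depth=4, random_state=seed
        )
    except ImportError:
        pass  # xgboost is optional in this sketch

    for model in models.values():
        model.fit(X_train, y_enc)
    return models, encoder
```

Predictions come back as integers, so encoder.inverse_transform(model.predict(X)) recovers the pitch-type labels.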

Model Generalization

The model achieves 99% balanced accuracy on Skubal’s pitches.
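
For reference, balanced accuracy is the mean of the per-class recalls, so rare pitch types count as much as common ones. A small pure-Python sketch with toy labels (not real results):

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every pitch type counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy example: always predicting the majority class is 80% accurate overall,
# but only 50% in balanced terms (recall 1.0 for FF, 0.0 for CU).
y_true = ["FF"] * 8 + ["CU"] * 2
y_pred = ["FF"] * 10
print(balanced_accuracy(y_true, y_pred))  # 0.5
```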

How does it perform on other pitchers?

Ryan Weathers: LHP. Similar arsenal, but throws a sweeper instead of a traditional curveball.
Patrick Sandoval: LHP. Throws the exact same 5 pitches as Skubal, but with a less elite movement profile.
Jake Arrieta: RHP. Retired, but threw mostly the same pitch mix.

Generalization - Ryan Weathers

Does a good job with Sliders, Changeups, and Sinkers.
Misclassified some fastballs as sinkers.
Weathers throws a Sweeper instead of a Curveball.
- A sweeper is somewhere between a slider and curveball
- Classifications mixed between slider and curveball
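
A per-pitch breakdown like this can be read off a row-normalized confusion matrix. The sketch below uses pandas.crosstab; the function name is hypothetical, and any labels passed in (for example the Statcast codes ST for sweeper, SL for slider, CU for curveball) are up to the caller.

```python
import pandas as pd


def confusion_table(y_true, y_pred) -> pd.DataFrame:
    """Row-normalized confusion matrix: each row shows where a true pitch type went."""
    return pd.crosstab(
        pd.Series(y_true, name="actual"),
        pd.Series(y_pred, name="predicted"),
        normalize="index",
    )
```

A model trained on a pitcher with no sweeper can never predict ST, so the ST row shows how those pitches are divided among the classes the model does know.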

Generalization - Patrick Sandoval

Overall the model generalizes very well to Sandoval
- 97%+ accuracy for slider, curveball, fastball, changeup.
- Sinker results were only fair, with frequent misclassifications.
- Sweeper again classified as curveball or slider.

Generalization - Jake Arrieta

The model did not generalize well to a RHP.
This makes sense: RHP and LHP throw from opposite sides.
- When a RHP throws a slider it breaks left, while a LHP's slider breaks right.
- So a variable like api_break_x_arm would be positive for a LHP and negative for a RHP.
- This clearly causes classification issues when splitting on x-axis-based variables.
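
To illustrate the sign issue, the sketch below negates horizontal columns for right-handers so both hands share one sign convention. This is not something the analysis above did; the column list and function name are assumptions, and whether a given column (including the one named above) really flips sign should be checked in the data before adding it.

```python
import pandas as pd

# Horizontal Statcast columns assumed to change sign with handedness.
X_AXIS_COLS = ["pfx_x", "release_pos_x"]


def mirror_to_lhp(pitches: pd.DataFrame, cols=X_AXIS_COLS) -> pd.DataFrame:
    """Negate x-axis columns for RHP rows; LHP rows and the input are left unchanged."""
    out = pitches.copy()
    is_rhp = out["p_throws"] == "R"
    present = [c for c in cols if c in out.columns]
    out.loc[is_rhp, present] = -out.loc[is_rhp, present]
    return out
```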

Scaling Up - Random Sample

Can a more general model be built off a random sample of ALL pitchers?
At what n do we see diminishing returns for our model?
A sample of size n is drawn from a data frame of 1.5 million pitches from 2023-2024.
Right- and left-handed models are built separately for each n.
The data are unbalanced, so we evaluate with balanced accuracy.
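
The experiment can be sketched as a loop over the grid of n for one handedness. The random forest, the 80/20 holdout inside each sample, and the function name are assumptions for illustration; p_throws and pitch_type are the Statcast columns for pitcher handedness and pitch label.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split


def accuracy_by_sample_size(pitches, features, hand, n_grid, seed=0):
    """Balanced accuracy on a held-out split for each sample size n, one handedness."""
    features = list(features)
    pool = pitches[pitches["p_throws"] == hand].dropna(subset=features + ["pitch_type"])
    results = {}
    for n in n_grid:
        # Draw n pitches at random (capped at the pool size).
        sample = pool.sample(n=min(n, len(pool)), random_state=seed)
        X_train, X_test, y_train, y_test = train_test_split(
            sample[features], sample["pitch_type"], test_size=0.2, random_state=seed
        )
        model = RandomForestClassifier(n_estimators=100, random_state=seed)
        model.fit(X_train, y_train)
        results[n] = balanced_accuracy_score(y_test, model.predict(X_test))
    return results
```

Calling it once with hand="R" and once with hand="L" gives one accuracy curve per model, and the n where the curve flattens marks the point of diminishing returns.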

RHP Model - For grid of n

LHP Model - For grid of n