Pitch Prediction Analysis

Data Setup

  • MLB has advanced pitch-tracking equipment installed in every stadium.

  • Using pybaseball, a Python-based web scraper, I pulled pitch-by-pitch data from baseballsavant.com.

  • Starting with data from a single pitcher, Tarik Skubal.

  • Isolated numeric variables (kinematics, spin, movement) for modeling.

PCA Analysis

PCA - Biplot

Model Splitting and Training

  • Training Set: 70%

  • Testing Set: 20%

  • Validation Set: 10%
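
The 70/20/10 split above can be sketched as follows. This is a minimal illustration using only the standard library, not necessarily how the split was done for these slides; the function name split_indices and the fixed seed are assumptions made for the example.

```python
import random


def split_indices(n_rows, train=0.70, test=0.20, seed=0):
    """Shuffle row indices and cut them into train / test / validation parts.

    Hypothetical helper: whatever is left after the train and test cuts
    (about 10% here) becomes the validation set.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible

    n_train = round(n_rows * train)
    n_test = round(n_rows * test)

    train_idx = indices[:n_train]
    test_idx = indices[n_train:n_train + n_test]
    valid_idx = indices[n_train + n_test:]
    return train_idx, test_idx, valid_idx


train_idx, test_idx, valid_idx = split_indices(1000)
print(len(train_idx), len(test_idx), len(valid_idx))
```

The three index lists are disjoint, so no pitch appears in more than one set. With pitch types that are unevenly represented, a stratified split may be preferable so every set contains each pitch type.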

CART

Random Forest

XGBoost

Model Generalization

The model achieves 99% balanced accuracy on Skubal’s pitches.

How does it perform on other pitchers?

  • Ryan Weathers: LHP. Similar arsenal, but throws a sweeper instead of a traditional curveball.

  • Patrick Sandoval: LHP. Throws the exact same 5 pitches as Skubal, but with a less elite movement profile.

  • Jake Arrieta: RHP. Retired, but threw mostly the same pitch mix.

Generalization - Ryan Weathers

  • Does a good job with Sliders, Changeups, and Sinkers.

  • Misclassified some fastballs as sinkers.

  • Weathers throws a Sweeper instead of a Curveball.

    • A sweeper falls somewhere between a slider and a curveball.

    • The model's classifications were split between slider and curveball.

Generalization - Patrick Sandoval

  • Overall, the model generalizes very well to Sandoval.

    • 97%+ accuracy for slider, curveball, fastball, and changeup.

    • Sinker results were only fair; it was often misclassified.

    • Sweeper again classified as curveball or slider.

Generalization - Jake Arrieta

  • The model did not generalize well to a RHP.

  • This makes sense: RHPs and LHPs throw from opposite sides.

    • So when a RHP throws a slider it breaks left, while for a LHP it breaks right.

    • This would mean for a variable like api_break_x_arm we would get a positive value for a LHP, and a negative value for a RHP.

    • This will clearly cause classification issues when the model splits on x-axis-based variables.
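
The sign problem can be illustrated with a single tree-style split. This is a toy sketch: the function classify_by_break, the threshold of 0.0, and the break values are made up for illustration and are not measured data. It follows the sign convention stated above (positive api_break_x_arm for a LHP, negative for a RHP).

```python
def classify_by_break(api_break_x_arm, threshold=0.0):
    """One tree-style split, as might be learned from LHP data only.

    Hypothetical rule: horizontal break above the threshold -> slider ("SL").
    """
    return "SL" if api_break_x_arm > threshold else "other"


# Illustrative values only: same magnitude of break, opposite sign by handedness.
lhp_slider_break = 0.5
rhp_slider_break = -0.5

print(classify_by_break(lhp_slider_break))  # SL
print(classify_by_break(rhp_slider_break))  # other
```

The same pitch type lands on opposite sides of the split depending on handedness, so a tree trained only on a LHP has no way to recognize it from a RHP.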

Scaling Up - Random Sample

  • Can a more general model be built off a random sample of ALL pitchers?

  • At what n do we see diminishing returns for our model?

  • A sample size of n will be drawn from a data frame with 1.5 million pitches from 2023-2024.

  • Right- and left-handed models are built separately for each n.

  • The pitch classes are unbalanced, so we evaluate with balanced accuracy.
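
The sampling and scoring steps above can be sketched as follows. This is a rough illustration under stated assumptions: pitches are represented as dictionaries with a p_throws key holding "R" or "L", and the helper names draw_sample and balanced_accuracy are hypothetical. Balanced accuracy is computed here as the mean of per-class recall.

```python
import random
from collections import defaultdict


def draw_sample(pitches, hand, n, seed=0):
    """Draw up to n pitches at random from pitchers of one handedness.

    Assumes each pitch is a dict with a "p_throws" key ("R" or "L").
    """
    pool = [p for p in pitches if p["p_throws"] == hand]
    return random.Random(seed).sample(pool, min(n, len(pool)))


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare pitch types count as much as common ones."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy labels, not real results: 8 fastballs, 2 curveballs, always guessing "FF".
y_true = ["FF"] * 8 + ["CU"] * 2
y_pred = ["FF"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

In the toy case, plain accuracy rewards always guessing the most common pitch, while balanced accuracy drops to 0.5 because the curveball is never identified. Calling draw_sample once per handedness for each n in the grid gives the separate RHP and LHP training samples.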

RHP Model - For grid of n

LHP Model - For grid of n