MLB Pitch Prediction Analysis

Garett Prestholdt

Project Motivation

Every time a pitch is thrown in an MLB stadium, the scoreboard flashes a classification (Sinker, Slider, 4-Seam) within seconds. As a fan, I wanted to look under the hood of that process.

  • My Goal: Reverse-engineer the “black box” of Statcast to see if I can get close to the MLB model.
    • Understand which ball-flight variables have the most impact on the prediction.

Data & Tools

  • Data Source: baseballsavant.mlb.com

  • Collection Tool: pybaseball (Open-source Python package by James LeDoux).

  • Pre-classified Targets: The dataset includes MLB’s official pitch classifications to evaluate model accuracy.

  • Filtering: Reduced the raw dataset to 23 numeric predictor variables.

pitch_type   release_speed   release_pos_x   az
FF           93.2            -3.35           -11.643
SL           87.3            -3.33           -24.934
CU           80.9            -3.07           -40.127
  • pitch_type: Classification made by MLB’s proprietary algorithm

    • I don’t know exactly how MLB arrives at it
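As a rough sketch of the collection step, the snippet below pulls one pitcher's Statcast rows with pybaseball and keeps the pitch label plus a few predictors. The `PREDICTORS` list is only a subset of columns named in these slides (not the full set of 23), and the helper names `feature_columns` and `load_pitcher` are hypothetical, introduced here for illustration.

```python
# Subset of the Statcast predictor columns named in these slides (not all 23).
PREDICTORS = [
    "release_speed", "release_pos_x", "release_pos_y", "release_pos_z",
    "ax", "ay", "az", "plate_x", "plate_z", "pfx_x", "pfx_z",
    "release_extension", "release_spin_rate", "spin_axis",
]


def feature_columns(available):
    """Return the predictors present in `available`, in PREDICTORS order."""
    present = set(available)
    return [col for col in PREDICTORS if col in present]


def load_pitcher(last, first, start_dt, end_dt):
    """Download one pitcher's pitches (needs pybaseball and network access)."""
    # Imported lazily so the helper above works without pybaseball installed.
    from pybaseball import playerid_lookup, statcast_pitcher

    ids = playerid_lookup(last, first)
    mlbam_id = int(ids["key_mlbam"].iloc[0])  # assumes the first match is correct
    df = statcast_pitcher(start_dt, end_dt, mlbam_id)
    cols = ["pitch_type"] + feature_columns(df.columns)
    return df[cols].dropna()
```

A call might look like `load_pitcher("skubal", "tarik", "2023-03-30", "2024-10-01")`; the date range here is a placeholder, not the one used for this project.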

Research Questions

  • Core Research Questions:

    1. How accurately can pitch types be predicted using purely kinematic data?
    2. Which specific variables carry the most weight in predicting a pitch?
    3. Can this modeling framework be easily scaled to evaluate other pitchers?
  • Techniques: Principal Component Analysis (PCA) and XGBoost Classification.

Pitch Tracking Coordinate System

  • x -> horizontal distance from the center of home plate

  • y -> the baseball’s distance from home plate

  • z -> vertical distance (height) above the ground

    • Ex: (x, z) = (0, 0) is ground level at the center of home plate; the middle of the strike zone sits a few feet above that point

Predictor Variables (1)

  • ax/ay/az -> acceleration of the pitch in the x, y, and z directions

  • release_pos_x/z -> x, z coordinates of where the ball leaves the pitcher’s hand

  • release_pos_y -> how far from the plate the pitcher releases the ball

Predictor Variables (2)

  • plate_x/z -> x, z coordinates of where the pitch crosses home plate

  • api_break_x/z -> Total Horizontal/Vertical Movement

    • How far the pitch moves between its x, z release point and its x, z plate coordinates, including the effects of both gravity and spin

Predictor Variables (3)

  • pfx_x/z -> Induced Horizontal/Vertical Break

    • The horizontal/vertical break between release point and home plate, compared to a pitch thrown at the same speed, just with no spin.

    • Shows how the spin is manipulating the shape of the pitch.

  • release_extension -> Distance from the rubber to where the pitcher releases the ball.

  • spin_axis vs release_spin_rate -> Direction of the spin (degrees) vs how fast the ball spins (rpm)

  • release_speed -> Velocity of the pitch the moment it leaves the pitcher’s hand.
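To make the "compared to a pitch with no spin" idea behind pfx_z concrete, here is a small sketch that estimates induced vertical break from az. It assumes constant acceleration over the whole flight, az in ft/s^2, and an illustrative flight time of 0.42 s; it is an approximation for intuition, not Statcast's exact calculation. The function name `induced_break_ft` is hypothetical.

```python
G_FT_S2 = 32.174  # standard gravity in ft/s^2


def induced_break_ft(accel, flight_time_s, vertical=False):
    """Displacement (ft) caused by non-gravity acceleration over the flight.

    Assumes constant acceleration. On the vertical axis gravity is added
    back, so a spinless pitch (az = -g) has zero induced break.
    """
    spin_accel = accel + G_FT_S2 if vertical else accel
    return 0.5 * spin_accel * flight_time_s ** 2


# az values from the sample table; the 0.42 s flight time is an assumption.
for pitch, az in [("FF", -11.643), ("SL", -24.934), ("CU", -40.127)]:
    print(pitch, round(induced_break_ft(az, 0.42, vertical=True), 2))
```

Under these assumptions the four-seam fastball comes out positive (it drops less than gravity alone would predict) and the curveball comes out negative (it drops more), which matches the pitch descriptions in the overview.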

Brief Overview of Common Pitches

  • Changeup: Slower, tails down

  • Fastball: Very Straight

  • Curveball: Slow, lots of vertical movement

  • Slider: Moderate Speed, more horizontal movement

What is PCA?

  • 23D to 2D: Allows us to visualize 23 variables in 2D
  • Natural Clustering: Reveals how pitches naturally group based on their physical traits.
  • “Principal Components”: 23 total components combine to explain 100% of the data’s variance.
    • We only plot PC1 and PC2 because they capture the vast majority of that variance.
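The reduction described above can be sketched with NumPy alone. This is a generic PCA via SVD on standardized columns, run on synthetic data standing in for the pitch features; it is not necessarily the routine used to produce the plots in this deck, and the function name `pca` is just for illustration.

```python
import numpy as np


def pca(X, n_components=2):
    """PCA via SVD on standardized columns.

    Returns (scores, loadings, explained_variance_ratio).
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # center and scale
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    ratio = S ** 2 / np.sum(S ** 2)        # share of variance per component
    loadings = Vt[:n_components].T         # one row per original variable
    scores = Z @ loadings                  # each pitch projected onto the PCs
    return scores, loadings, ratio


# Synthetic stand-in: 200 "pitches", 4 features driven by 2 hidden factors.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0], base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1], base[:, 1] + 0.1 * rng.normal(size=200),
])
scores, loadings, ratio = pca(X)
```

The `scores` columns are what a PC1 vs PC2 scatter plot shows, and each row of `loadings` gives the direction of one variable's arrow in a biplot. The full `ratio` vector always sums to 1, which is the "100% of the variance" statement above.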

PCA - Biplot

  • Unless otherwise noted, the data come from Tarik Skubal (LHP, ~13k pitches)
  • The points appear to form clusters

PCA - Biplot

  • Adding color, we see that the pitches cluster by type

PCA - Biplot

  • Direction: If an arrow points toward a cluster, those pitches have higher-than-average values for that metric (e.g., higher Velocity or Spin).

Model Building

  • XGBoost: A collection of decision trees, similar to CART or a Random Forest

  • The data are split 70/20/10 into train, test, and validation sets.
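A 70/20/10 split can be sketched in a few lines. This version shuffles once with a fixed seed and slices; the function name and the seed are illustrative choices, and whether the original split was stratified by pitch type is not stated in the slides.

```python
import random


def split_70_20_10(rows, seed=0):
    """Shuffle rows and split them into train (70%), test (20%), validate (10%)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_test = n * 20 // 100
    train = rows[:n_train]
    test = rows[n_train:n_train + n_test]
    validate = rows[n_train + n_test:]  # whatever remains, about 10%
    return train, test, validate


train, test, validate = split_70_20_10(range(1000))
```

Every row lands in exactly one of the three sets, so no pitch used for training is also used for evaluation.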

CART - Decision Tree

XGBoost Decision Tree - Example

  • XGBoost is a collection of “smart” decision trees

  • “Smart” in the way that each tree tries to “learn” from the mistakes of the tree before it.
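The "learning from the previous tree's mistakes" idea can be shown with a toy example. The sketch below is plain gradient boosting for regression with one-split trees (stumps) on a single feature; XGBoost itself adds regularization, second-order gradient information, and multi-class handling, none of which is reproduced here. All names and settings are illustrative.

```python
import numpy as np


def fit_stump(x, residual):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so both sides are non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]


def boost(x, y, n_trees=20, learning_rate=0.3):
    """Fit stumps one after another, each on the current residuals."""
    pred = np.full(len(y), y.mean())
    errors = []
    for _ in range(n_trees):
        residual = y - pred  # the mistakes of the ensemble so far
        threshold, left_val, right_val = fit_stump(x, residual)
        pred = pred + learning_rate * np.where(x <= threshold, left_val, right_val)
        errors.append(float(np.mean((y - pred) ** 2)))
    return pred, errors


x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x)
pred, errors = boost(x, y)
```

Each entry in `errors` is the training mean squared error after adding one more stump, so the list shrinks as later trees correct what earlier ones got wrong.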

XGBoost- Model Performance

  • Model Accuracy = 99.4%

    • Did an excellent job predicting all pitches.
    • Very close to what the MLB algorithm would predict
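For reference, accuracy here means the share of pitches where the model's label matches MLB's label. A minimal sketch of that calculation, plus a tally of which labels get confused, is below. The six labels are made up for illustration and are not results from this project.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of pitches whose predicted label matches the MLB label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def confusion_counts(actual, predicted):
    """Count each (actual, predicted) pair, e.g. ("SI", "FF")."""
    return Counter(zip(actual, predicted))


# Made-up labels for illustration only.
actual = ["FF", "FF", "SL", "CH", "SI", "SI"]
predicted = ["FF", "FF", "SL", "CH", "FF", "SI"]
print(accuracy(actual, predicted))
print(confusion_counts(actual, predicted)[("SI", "FF")])
```

The off-diagonal pairs in the tally show which pitch types the model mixes up, which a single accuracy number hides.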

XGBoost - Variable Importance

  • Variables with high importance have a large impact on separating pitches from one another.

Model Generalization

Can I apply Skubal’s Model to predict what other pitchers are throwing?

  • Patrick Sandoval: LHP. Throws the same 5 pitches Skubal does, and a Sweeper
  • Jake Arrieta: RHP. Retired, but threw mostly the same pitch mix.

Generalization - Patrick Sandoval

  • Model Accuracy = 94.0%

  • Sinker (SI) did all right, but was often misclassified as a four-seam fastball (FF) or changeup (CH).

  • Sweeper (ST) is not in the original model

    • Misclassified as Slider (SL) or Curveball (CU)

Generalization - Jake Arrieta

  • Model Accuracy: 19.4%, very poor

  • RHP and LHP throw from opposite sides

    • So when an RHP throws a slider it breaks left, while for an LHP it breaks right.

    • This causes misclassification whenever the trees split on x-axis-based variables.
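The sign problem is easy to see in code. The sketch below mirrors a pitch across the x-axis by negating a few x-based columns, which is roughly what a right-hander's pitch looks like to a model trained on a left-hander. Mirroring is shown only to illustrate the issue; it is not the approach used in this project, and the column list and function name are assumptions. Spin axis would also need to be reflected and is left out of this sketch.

```python
# x-based Statcast columns named in these slides (assumed list, not exhaustive).
X_COLUMNS = ("release_pos_x", "plate_x", "pfx_x", "ax")


def mirror_pitch(pitch):
    """Return a copy of the pitch with x-based values negated."""
    mirrored = dict(pitch)
    for col in X_COLUMNS:
        if col in mirrored:
            mirrored[col] = -mirrored[col]
    return mirrored


# release_speed and release_pos_x come from the sample table; pfx_x is made up.
lhp_slider = {"pitch_type": "SL", "release_speed": 87.3,
              "release_pos_x": -3.33, "pfx_x": 0.4}
as_rhp = mirror_pitch(lhp_slider)
print(as_rhp["release_pos_x"], as_rhp["pfx_x"])
```

A tree split such as "pfx_x > 0" sends the original and the mirrored pitch down opposite branches, even though both are sliders thrown at the same speed.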

Scaling Up - Random Sample

  • Can a more general model be built off a random sample of ALL pitchers?

  • Will draw a sample from a data frame with 1.5 million pitches from 2023-2024.

  • Separate models are built for right-handed and left-handed pitchers.

  • At what sample size do we see diminishing returns?
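Drawing a per-handedness sample might look like the sketch below. It assumes each pitch is a dict with Statcast's `p_throws` field ("R" or "L"); the function name, the seed, and the toy data are illustrative. Repeating the draw at several values of `n` and refitting each time is one way to look for the point of diminishing returns.

```python
import random


def sample_by_hand(pitches, n, seed=0):
    """Group pitches by p_throws, then draw up to n from each group."""
    rng = random.Random(seed)
    groups = {}
    for pitch in pitches:
        groups.setdefault(pitch["p_throws"], []).append(pitch)
    return {hand: rng.sample(rows, min(n, len(rows)))
            for hand, rows in groups.items()}


# Toy stand-in for the full data frame: 200 RHP pitches and 100 LHP pitches.
pitches = [{"p_throws": "R" if i % 3 else "L", "id": i} for i in range(300)]
samples = sample_by_hand(pitches, 50)
```

Each group is sampled without replacement, so a pitch can appear at most once in its handedness sample.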

How Large of a Sample of Pitches?

RHP using Sample of 25,000

  • Model Accuracy: 86.9%

LHP using Sample of 25,000

  • Model Accuracy: 90.6%

RHP vs LHP Variable Importance

  • The same five variables are the most important in both the RHP and LHP models, though in a different order

Answering Research Questions

  1. How accurately can pitch types be predicted using purely kinematic data?
    • Individual-pitcher models came very close to MLB’s classifications (99.4% accuracy)
  2. Which specific variables carry the most weight in predicting a pitch?
    • Variables related to spin and direction
    • Acceleration-X, Spin Rate/Axis, Vertical/Horizontal Movement
  3. Can this modeling framework be easily scaled to evaluate other pitchers?
    • Yes, but with more error than an individual-pitcher model
    • 86.9% for the RHP model
    • 90.6% for the LHP model

Future Work

  • I would shift toward pre-pitch prediction:

    • Predict which type of pitch will be thrown based on count and batter statistics.
  • This would be more useful to teams:

    • Could help hitters prepare for opposing starting pitchers

Useful Courses

  • DSCI 326 - Data management with large datasets; Python-heavy, with some machine learning.

  • DSCI 310 - Making more effective data visualizations

  • STAT 450 - Gave me a lot of familiarity with R and ggplot

Thank You!

Questions?

Sources

https://i.redd.it/68jn94t68zgb1.png

https://rocklandpeakperformance.com/wp-content/uploads/2019/10/Pitch-Shapes.png