Garett Prestholdt
Every time a pitch is thrown in an MLB stadium, the scoreboard flashes a classification (Sinker, Slider, 4-Seam) within seconds. As a fan, I wanted to look under the hood of that process.
Data Source: baseballsavant.mlb.com
Collection Tool: pybaseball (Open-source Python package by James LeDoux).
Pre-classified Targets: The dataset includes MLB’s official pitch classifications to evaluate model accuracy.
Filtered the Raw Dataset: Down to 23 numeric predictor variables
| pitch_type | release_speed | release_pos_x | … | az |
| FF | 93.2 | -3.35 | … | -11.643 |
| SL | 87.3 | -3.33 | … | -24.934 |
| CU | 80.9 | -3.07 | … | -40.127 |
pitch_type: Classification made by proprietary MLB algorithm
Core Research Questions:
Techniques: Principal Component Analysis (PCA) and XGBoost Classification.
x -> horizontal distance from the center of home plate
y -> the baseball’s distance from home plate
z -> vertical distance from the middle of the strike zone
ax/y/z -> acceleration of the pitch in the respective direction
release_pos_x/z -> x, z coordinates of where the ball leaves the pitcher’s hand
release_pos_y -> how far from the plate the pitcher releases the ball
plate_x/z -> x, z coordinates of where the pitch crosses home plate
api_break_x/z -> Total Horizontal/Vertical Movement
pfx_x/z -> Induced Horizontal/Vertical Break
The horizontal/vertical break between release point and home plate, compared to a pitch thrown at the same speed, just with no spin.
Shows how the spin is manipulating the shape of the pitch.
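The idea of comparing against a spinless pitch can be sketched with basic kinematics. This is a simplified illustration, not Statcast's actual computation: it assumes constant acceleration, treats the no-spin reference pitch as one acted on by gravity alone (drag ignored), and uses an assumed flight time of 0.40 s for every pitch. The az values are taken from the sample table above; the function names are hypothetical.

```python
G_FT_S2 = 32.174  # standard gravity in ft/s^2


def movement_inches(accel_ft_s2: float, flight_time_s: float) -> float:
    """Displacement (inches) produced by a constant acceleration over the flight."""
    return 0.5 * accel_ft_s2 * flight_time_s ** 2 * 12.0


def induced_vertical_break(az_ft_s2: float, flight_time_s: float) -> float:
    """Vertical movement with gravity's contribution removed (assumption: no drag)."""
    return movement_inches(az_ft_s2 + G_FT_S2, flight_time_s)


# az values from the sample table; the flight time is an assumed round number.
FLIGHT_TIME_S = 0.40
for pitch, az in [("FF", -11.643), ("SL", -24.934), ("CU", -40.127)]:
    total = movement_inches(az, FLIGHT_TIME_S)
    induced = induced_vertical_break(az, FLIGHT_TIME_S)
    print(f"{pitch}: total vertical movement {total:6.1f} in, induced break {induced:+6.1f} in")
```

Under these assumptions the four-seamer shows positive induced break (it drops less than gravity alone would predict), while the curveball shows negative induced break (it drops more).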
release_extension -> Distance from rubber to where the pitcher releases the ball.
Spin Axis vs Spin Rate (rpm)
Release Speed -> velocity at the moment the ball leaves the pitcher’s hand.
XGBoost: A collection of decision trees, similar to CART or Random Forest
Data split 70/20/10 into Train, Test, and Validate sets.
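A 70/20/10 split can be sketched in plain Python as follows. The function name, the fixed seed, and the use of integers as stand-ins for pitch rows are assumptions for illustration; the original analysis may have split the data differently (for example, stratified by pitch type).

```python
import random


def split_70_20_10(rows, seed=0):
    """Shuffle rows and return (train, test, validate) lists in a 70/20/10 ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = len(shuffled) * 7 // 10
    n_test = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validate = shuffled[n_train + n_test:]  # remainder, roughly 10%
    return train, test, validate


# Integers stand in for pitch rows here.
pitches = list(range(1000))
train, test, validate = split_70_20_10(pitches)
print(len(train), len(test), len(validate))
```

Shuffling before slicing matters because pitch data is usually ordered by date, and an unshuffled split would train on early-season pitches and test on late-season ones.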
XGBoost is a collection of “smart” decision trees
“Smart” in that each tree tries to “learn” from the mistakes of the trees before it.
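The "learning from mistakes" idea can be shown with a stripped-down version of gradient boosting. This is not XGBoost itself: it is a minimal sketch using squared-error loss, one-split trees (stumps), a single numeric feature, and a regression target rather than pitch-type classes. All names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees, learning_rate=0.5):
    """Fit stumps one after another; return them with the training MSE after each round."""
    predictions = [0.0] * len(xs)
    stumps, history = [], []
    for _ in range(n_trees):
        # The residuals are the mistakes left over from all earlier trees.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return stumps, history


# Toy data, made up for illustration.
xs = list(range(10))
ys = [0, 0, 1, 1, 3, 3, 6, 6, 10, 10]
stumps, history = boost(xs, ys, n_trees=20)
print(f"MSE after 1 tree: {history[0]:.3f}, after 20 trees: {history[-1]:.3f}")
```

Each new stump is fit to what the ensemble still gets wrong, so the training error shrinks round by round. A random forest, by contrast, builds its trees independently of one another.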
Model Accuracy = 99.4%
Can I apply Skubal’s Model to predict what other pitchers are throwing?
Model Accuracy = 94.0%
Sinker (SI) did alright but was often misclassified as a four-seam fastball (FF) or changeup (CH).
Sweeper (ST) not in original model
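Both observations above can be checked by counting (true, predicted) label pairs. The sketch below uses made-up labels purely for illustration; they are not the results reported in this deck, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of pitches whose predicted type matches the true type."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


# Made-up labels for illustration only.
truth = ["SI", "SI", "SI", "SI", "FF", "FF", "CH", "ST", "ST"]
predicted = ["SI", "SI", "FF", "CH", "FF", "FF", "CH", "SL", "SL"]

counts = confusion_counts(truth, predicted)
for (true_label, pred_label), n in sorted(counts.items()):
    if true_label != pred_label:
        print(f"{true_label} misclassified as {pred_label}: {n}")

# A classifier can only output labels it saw during training.
trained_classes = {"SI", "FF", "CH", "SL"}  # hypothetical training label set
unseen = set(truth) - trained_classes
print("Pitch types the model can never predict:", unseen)
```

Any pitch type missing from the training labels is guaranteed to be misclassified, which puts a ceiling on accuracy regardless of how well the model fits.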
Model Accuracy: 19.4%, very poor
RHP and LHP throw from opposite sides
So when a RHP throws a slider, it breaks left, while a LHP’s slider breaks right.
This will cause classification issues when trying to split on x-axis based variables.
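A toy example makes the failure concrete. The rule, the break values, and the sign convention (negative means leftward break) are all assumptions for illustration; a real tree would split on many variables, but any split on an x-axis feature has the same problem.

```python
def rhp_rule(horizontal_break_in):
    """Hypothetical one-split rule learned from a right-hander: leftward break means slider."""
    return "SL" if horizontal_break_in < 0 else "FF"


def accuracy(rule, pitches):
    """Fraction of (break, label) pairs the rule classifies correctly."""
    return sum(rule(brk) == label for brk, label in pitches) / len(pitches)


# Made-up horizontal breaks in inches; negative = breaks left (assumed convention).
rhp_pitches = [(-6.0, "SL"), (-5.2, "SL"), (7.1, "FF"), (6.4, "FF")]
# The same pitch shapes thrown from the other side: every sign is flipped.
lhp_pitches = [(6.0, "SL"), (5.2, "SL"), (-7.1, "FF"), (-6.4, "FF")]

print("RHP accuracy:", accuracy(rhp_rule, rhp_pitches))
print("LHP accuracy:", accuracy(rhp_rule, lhp_pitches))
```

The rule that is perfect on the right-hander gets every left-handed pitch wrong, because the threshold it learned sits on the wrong side of zero for mirrored data.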
Can a more general model be built off a random sample of ALL pitchers?
Will draw a sample from a data frame with 1.5 million pitches from 2023-2024.
Separate models built for right-handed and left-handed pitchers.
At what sample size do we see diminishing returns?
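One simple way to answer the sample-size question is to train at several sizes, record the accuracy of each, and look for the point where the next increase stops paying off. The sketch below covers only that last step. The function name, the 0.005 threshold, and the (sample size, accuracy) pairs are hypothetical, not measured results.

```python
def diminishing_point(curve, min_gain=0.005):
    """Return the first sample size after which the next step adds less than min_gain accuracy."""
    for (size, acc), (_, next_acc) in zip(curve, curve[1:]):
        if next_acc - acc < min_gain:
            return size
    return None  # every step still improved accuracy by at least min_gain


# Hypothetical (sample size, accuracy) pairs, sorted by sample size.
curve = [(1_000, 0.80), (5_000, 0.88), (10_000, 0.91), (50_000, 0.913), (100_000, 0.914)]
print("Returns diminish after:", diminishing_point(curve))
```

The threshold is a judgment call: a smaller min_gain pushes the answer toward larger samples, so it should reflect how much accuracy is worth the extra training time.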
I would shift to more pre-pitch prediction:
Would be more useful to teams:
DSCI 326 - Data Management with large data, Python-heavy, some machine learning.
DSCI 310 - Making more effective data visualizations
STAT 450 - Gave me a lot of familiarity with R and ggplot
Questions?
https://i.redd.it/68jn94t68zgb1.png
https://rocklandpeakperformance.com/wp-content/uploads/2019/10/Pitch-Shapes.png