Background

This exploratory data analysis report is based on data from the 2023 Major League Baseball (MLB) playoffs. It covers player data from the games in each round: Wild Card, Division Series, League Championship Series and World Series. The data set is structured pitch by pitch, with variables that describe each individual pitch.

Dataset Description

There were 11,829 observations, one per pitch, covering 148 batters and 136 pitchers. In total, 12 teams competed for the trophy.

I created a pitch type table to use as a reference, especially for the visualisations and later modelling. The filter, group-by and summarise functions were also used to create data sets containing only the metrics I wanted to compare and visualise.

Columns consisting entirely of N/A values were deleted, as they carried no information. The subset function was used to select the columns to remove from the original data set.
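As an illustration of the clean-up step, here is a minimal Python sketch of dropping columns that are missing in every row. It is not the code used in the report: the row layout (a list of dictionaries) and the column names `release_speed` and `spin_dir` are assumptions made for the example.

```python
def drop_empty_columns(rows):
    """Return copies of the rows without columns that are missing in every row."""
    if not rows:
        return []
    columns = list(rows[0].keys())
    # Keep a column if at least one row has a non-missing value in it.
    keep = [c for c in columns if any(r.get(c) is not None for r in rows)]
    return [{c: r.get(c) for c in keep} for r in rows]


# Hypothetical toy rows; "spin_dir" stands in for an entirely empty column.
pitches = [
    {"pitch_type": "FF", "release_speed": 95.1, "spin_dir": None},
    {"pitch_type": "SL", "release_speed": None, "spin_dir": None},
]
cleaned = drop_empty_columns(pitches)
print(list(cleaned[0].keys()))
```

A column with only some missing values, such as `release_speed` here, is kept; only columns that are empty everywhere are removed.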

Variables

Pitch Type: refers to the technique used for a pitch. Pitch types can differ in velocity, angle and spin.

Pitch Type Reference Table
pitch_type	pitch_name
FC	Cutter
CU	Curveball
FS	Splitter
FF	Four-Seam Fastball
SL	Slider
KC	Knuckle Curve
CH	Change-up
SI	Sinker
ST	Sweeper
PO	Pitchout
SV	Slurve

Pitch types leading to swinging strikes

This bar chart gives a quick view of the different pitch types and how often each led to a swinging strike. FF (four-seam fastball) clearly dominates the count of swinging strikes. It is important to note, however, that this does not mean FF has the best success rate; it only shows that it was the pitch type that most often produced a swinging strike, which may simply reflect how often it is thrown.
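The difference between a raw count and a success rate can be made concrete with a short Python sketch. This is an illustrative example on made-up pitches, not the report's code; the field name `description` for the pitch outcome is an assumption.

```python
from collections import Counter

# Outcomes treated as swinging strikes (names taken from the outcome table).
WHIFFS = {"swinging_strike", "swinging_strike_blocked"}


def swinging_strike_summary(pitches):
    """Map pitch_type -> (swinging-strike count, swinging strikes per pitch thrown)."""
    thrown = Counter(p["pitch_type"] for p in pitches)
    whiffs = Counter(p["pitch_type"] for p in pitches if p["description"] in WHIFFS)
    return {t: (whiffs[t], whiffs[t] / thrown[t]) for t in thrown}


# Hypothetical toy data: FF is thrown far more often than SL.
pitches = (
    [{"pitch_type": "FF", "description": "swinging_strike"}] * 3
    + [{"pitch_type": "FF", "description": "ball"}] * 7
    + [{"pitch_type": "SL", "description": "swinging_strike"}] * 2
    + [{"pitch_type": "SL", "description": "called_strike"}] * 2
)
summary = swinging_strike_summary(pitches)
print(summary)
```

In this toy data FF has the larger count (3 against 2) but the lower rate (0.3 against 0.5), which is exactly the situation a count-only bar chart cannot show.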

Swinging strike hit zone

Here a simple hexbin plot shows where in the hit zone the swinging strikes were located, as an example of displaying the most effective spots for leading a batter into a swinging strike.
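The idea behind a binned location plot can be sketched in a few lines of Python. Note that this example uses square cells rather than hexagons to keep the arithmetic simple, and that the 0.5 ft cell size and the sample points are assumptions, not values from the report.

```python
import math
from collections import Counter


def bin_locations(points, cell=0.5):
    """Count (plate_x, plate_z) points per square cell with side `cell` feet."""
    counts = Counter()
    for x, z in points:
        # floor() keeps negative plate_x values in the correct cell.
        counts[(math.floor(x / cell), math.floor(z / cell))] += 1
    return counts


# Hypothetical swinging-strike locations as (plate_x, plate_z) pairs.
points = [(0.1, 2.2), (0.3, 2.4), (-0.2, 2.1), (0.9, 3.6)]
counts = bin_locations(points)
print(counts.most_common(1))
```

The cell with the highest count is the densest spot; a hexbin plot applies the same counting step to hexagonal cells and maps each count to a colour.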

Outcome Description

This table describes the possible outcomes of each pitch.

Pitch Outcome Reference Table
Outcome	Description
called_strike	Pitch taken for a strike (umpire calls it)
ball	Pitch outside the strike zone, not swung at
foul	Batter makes contact but the ball lands foul
swinging_strike_blocked	Swing and miss on a pitch blocked by the catcher
swinging_strike	Batter swings and misses
hit_into_play	Ball is put into play (fair territory)
blocked_ball	Pitch in the dirt called a ball and blocked by the catcher
foul_tip	Foul ball tipped directly into the catcher’s glove
foul_bunt	Foul ball resulting from a bunt attempt
hit_by_pitch	Batter hit by the pitch
pitchout	Intentional pitch thrown wide to prevent a stolen base

Other important variables

Pitch Variable Reference Table
Variable	Description
plate_x	Horizontal position of the ball when it crosses home plate from the catcher's perspective.
plate_z	Vertical position of the ball when it crosses home plate from the catcher's perspective.
release_pos_x	Horizontal release position of the ball measured in feet from the catcher's perspective.
release_pos_z	Vertical release position of the ball measured in feet from the catcher's perspective.
zone	Zone location of the ball when it crosses the plate from the catcher's perspective.
ax	Acceleration of the pitch in the x-dimension (ft/s²), determined at y=50 feet.
ay	Acceleration of the pitch in the y-dimension (ft/s²), determined at y=50 feet.
az	Acceleration of the pitch in the z-dimension (ft/s²), determined at y=50 feet.
inning	Pre-pitch inning number.
outs_when_up	Pre-pitch number of outs.

DATA USE CASES

Use case 1: Clutch pitchers on the four-seam fastball

This use case analyses the pitchers from the World Series teams, Arizona and Texas, and compares how they performed in the World Series games with how they performed in the other finals games. The idea is to see which pitchers were more ‘clutch’ than others, based on their pitch speed on the four-seam fastball (FF).

The expected outcome is a clear difference between pitchers, with some increasing their pitch speed far more than others.

*Pitchers who did not pitch in both sets of games were removed so that this comparison could be made.

Average Four-Seam Fastball Pitch Speed
Comparison Between World Series and Finals
Player	World Series (MPH)	Finals (MPH)
Bradford, Cody	90.03	88.50
Chapman, Aroldis	98.33	95.31
Eovaldi, Nathan	95.30	89.77
Frías, Luis	96.20	90.73
Gallen, Zac	94.17	90.08
Ginkel, Kevin	96.15	94.09
Heaney, Andrew	92.42	88.59
Kelly, Merrill	92.96	90.85
Leclerc, José	95.95	92.64
Montgomery, Jordan	90.70	89.73
Nelson, Kyle	91.37	87.42
Nelson, Ryne	93.25	93.90
Pfaadt, Brandon	93.28	90.66
Sborz, Josh	95.50	90.47
Scherzer, Max	93.50	89.57
Sewald, Paul	91.83	90.18
Smith, Will	92.50	84.80
Stratton, Chris	92.10	89.44

One limitation is that the number of pitches thrown is not shown, so the averages may be skewed by small samples. Other limitations are outside factors that affect these numbers, such as which batters each pitcher mostly faced and their skill level, injury, sickness, weather conditions and so on. Many conditions would need to be accounted for before claiming that this is an accurate representation of the most ‘clutch’ players.
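One way to address the sample-size limitation is to report the pitch count next to each average. The Python sketch below shows the idea on made-up data; it is not the report's code, and the field names `pitcher`, `stage` and `release_speed`, the stage labels and the pitcher names are all assumptions.

```python
from collections import defaultdict
from statistics import mean

STAGES = {"world_series", "finals"}


def ff_speed_by_stage(pitches):
    """Map pitcher -> {stage: (mean mph, pitch count)} for four-seam fastballs,
    keeping only pitchers who threw at least one in both stages."""
    speeds = defaultdict(lambda: defaultdict(list))
    for p in pitches:
        if p["pitch_type"] == "FF" and p["release_speed"] is not None:
            speeds[p["pitcher"]][p["stage"]].append(p["release_speed"])
    return {
        name: {s: (round(mean(v), 2), len(v)) for s, v in by_stage.items()}
        for name, by_stage in speeds.items()
        if STAGES.issubset(by_stage)
    }


# Hypothetical toy data: Pitcher B never pitched in the World Series.
pitches = [
    {"pitcher": "Pitcher A", "stage": "world_series", "pitch_type": "FF", "release_speed": 95.0},
    {"pitcher": "Pitcher A", "stage": "world_series", "pitch_type": "FF", "release_speed": 96.0},
    {"pitcher": "Pitcher A", "stage": "finals", "pitch_type": "FF", "release_speed": 93.0},
    {"pitcher": "Pitcher B", "stage": "finals", "pitch_type": "FF", "release_speed": 90.0},
]
result = ff_speed_by_stage(pitches)
print(result)
```

With the count shown beside each mean, a large gap that rests on only a handful of pitches is easy to spot and discount.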

Use case 2: Analysing variable and model importance

The idea of this use case is to analyse the variables in the data set and use them in two candidate models to predict whether a pitch results in a strike. Comparing the two models shows which is more accurate at making this prediction, along with which variables are the most important.

After tuning both an XGBoost (XGB) model and a random forest (RF) model and selecting the best version of each, here are the accuracy results for the models, along with each variable's importance.

Model Comparison
- Accuracy: percentage of correct predictions (XGB outperforms Random Forest, 0.658 against 0.644).
- ROC-AUC: measures the ability to distinguish between classes (Random Forest is slightly better, 0.7092 against 0.7091).
- Brier: measures the accuracy of probabilistic predictions; lower is better (Random Forest is the best, 0.214 against 0.232).

Variable importance refers to which variables influence the prediction the most. Zone was shown to be the most impactful in predicting the outcome.

Model Comparison
.metric	.estimate	model
accuracy	0.6440000	rf
roc_auc	0.7092392	rf
brier_class	0.2144250	rf
accuracy	0.6580000	xgb
roc_auc	0.7090629	xgb
brier_class	0.2319823	xgb
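For readers unfamiliar with the three metrics in the table, the following Python sketch shows one common way to compute each for a binary strike / no-strike prediction. It uses textbook definitions on made-up labels and probabilities; the modelling library used for the report may scale or define them slightly differently, and the 0.5 decision threshold is an assumption.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Share of pitches whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    return sum(int(a == b) for a, b in zip(y_true, preds)) / len(y_true)


def brier_score(y_true, y_prob):
    """Mean squared gap between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Chance that a random strike is scored above a random non-strike
    (ties count as half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = strike) and predicted strike probabilities.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
print(accuracy(y_true, y_prob), brier_score(y_true, y_prob), roc_auc(y_true, y_prob))
```

Accuracy only looks at the thresholded prediction, while ROC-AUC and the Brier score use the predicted probabilities themselves, which is why two models can rank differently on each metric.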

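Here a generalised linear model (GLM) is used to look at each variable's odds ratio, that is, how it changes the odds of a pitch ending up as a strike. For example, pitch type FS (splitter) has an odds ratio of 2.32 (CI 1.15 to 4.76, p = 0.019), so choosing it is associated with roughly 2.3 times the odds of a strike relative to the reference pitch type, and the effect is statistically significant. KC, by contrast, has an odds ratio of 0.80 that is not significant (p = 0.605).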

R² (Tjur): 0.099, which indicates the model has modest explanatory power: on average, predicted strike probabilities differ by only about 10 percentage points between strikes and non-strikes.

	Is Strike
Predictors	Odds Ratios	CI	p
(Intercept)	2.68	0.54 – 13.35	0.229
plate x	1.05	0.90 – 1.23	0.523
pitch type [CU]	1.02	0.56 – 1.87	0.939
pitch type [FC]	0.76	0.43 – 1.34	0.343
pitch type [FF]	1.01	0.59 – 1.74	0.975
pitch type [FS]	2.32	1.15 – 4.76	0.019
pitch type [KC]	0.80	0.35 – 1.83	0.605
pitch type [SI]	1.04	0.63 – 1.72	0.885
pitch type [SL]	1.45	0.90 – 2.34	0.127
pitch type [ST]	1.89	1.02 – 3.52	0.043
pitch type [SV]	2.49	0.23 – 55.52	0.466
zone	0.89	0.86 – 0.92	<0.001
strikes	0.61	0.54 – 0.70	<0.001
inning	1.02	0.98 – 1.06	0.367
ax	0.99	0.98 – 1.01	0.308
ay	1.00	0.95 – 1.04	0.885
az	0.98	0.96 – 1.01	0.232
Observations	1500
R² Tjur	0.099
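To help read the "Odds Ratios" and "R² Tjur" rows, here is a minimal Python sketch of how such numbers are typically derived from a fitted logistic model. It is not the report's code: the Wald-style interval with z = 1.96, the standard error, the 50% starting probability and the toy predictions are all assumptions made for illustration.

```python
import math
from statistics import mean


def odds_ratio(coef, std_err, z=1.96):
    """Turn a log-odds coefficient into (odds ratio, CI lower, CI upper)."""
    return (
        math.exp(coef),
        math.exp(coef - z * std_err),
        math.exp(coef + z * std_err),
    )


def apply_odds_ratio(prob, ratio):
    """Probability obtained after multiplying the odds of `prob` by `ratio`."""
    odds = prob / (1 - prob) * ratio
    return odds / (1 + odds)


def tjur_r2(y_true, y_prob):
    """Mean predicted probability for strikes minus that for non-strikes."""
    strikes = [p for y, p in zip(y_true, y_prob) if y == 1]
    others = [p for y, p in zip(y_true, y_prob) if y == 0]
    return mean(strikes) - mean(others)


# A coefficient of log(0.89) corresponds to the odds ratio reported for zone.
ratio, low, high = odds_ratio(math.log(0.89), std_err=0.017)  # std_err is hypothetical
# Effect of a one-unit increase in zone on a hypothetical 50% strike chance.
shifted = apply_odds_ratio(0.5, 0.89)
# Hypothetical labels (1 = strike) and predicted probabilities.
r2 = tjur_r2([1, 1, 0, 0], [0.8, 0.6, 0.3, 0.1])
print(round(ratio, 2), round(shifted, 3), round(r2, 2))
```

An odds ratio multiplies the odds, not the probability, so its effect on the probability depends on the starting point; this is why "times more likely" is only a loose reading of an odds ratio.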

A limitation here is that no further models were included in the comparison. More analysis of the individual variables, to find which have the strongest correlations, is also something to be done in the future.

MLB Exploratory Data Analysis

Nic Krotiris

2025-04-09