Background

This exploratory data analysis report is based on data from the 2023 Major League Baseball (MLB) playoffs. It covers player data from the games in each round: the Wild Card Series, Division Series, League Championship Series and World Series. The data set is structured pitch by pitch, with each row carrying variables that describe that pitch.

Dataset Description

There were 11,829 observations, one per pitch, involving 148 batters and 136 pitchers. In total, 12 teams competed for the trophy.

I created a pitch type table to use as a reference, especially for the visualisations and later modelling. The filter, group-by and summarise functions were also used to create data sets containing only the metrics I wanted to compare and visualise.

Columns consisting entirely of N/A values were deleted, as they carried no information. The subset function was used to select the columns to remove from the original data set.
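The report does not show its code, so the following is only a minimal sketch of the column clean-up described above, written in Python with the standard library rather than the tooling the analysis itself used. The rows, the column names and the helper `drop_all_missing_columns` are hypothetical, and `None` stands in for N/A.

```python
# Hypothetical rows; None stands in for N/A. Column names are illustrative only.
rows = [
    {"pitch_type": "FF", "release_speed": 95.3, "spin_dir": None},
    {"pitch_type": "SL", "release_speed": None, "spin_dir": None},
    {"pitch_type": "CH", "release_speed": 86.1, "spin_dir": None},
]


def drop_all_missing_columns(records):
    """Return copies of the records without columns that are missing in every row."""
    columns = list(records[0]) if records else []
    # Keep a column if at least one row has a real value for it.
    keep = [c for c in columns if any(r.get(c) is not None for r in records)]
    return [{c: r.get(c) for c in keep} for r in records]


cleaned = drop_all_missing_columns(rows)
print(list(cleaned[0]))  # column names that survive the clean-up
```

Note that a column with only some missing values (here `release_speed`) is kept; only columns that are empty in every row are dropped.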

Variables

Pitch Type refers to the technique used for the pitch. Pitch types can differ in velocity, angle and spin.

Pitch Type Reference Table
pitch_type  pitch_name
FC          Cutter
CU          Curveball
FS          Splitter
FF          Four-Seam Fastball
SL          Slider
KC          Knuckle Curve
CH          Change-up
SI          Sinker
ST          Sweeper
PO          Pitchout
SV          Slurve

Pitch types leading to swing strikes

This bar chart gives a quick view of the different pitch types and how often each led to a swinging strike. FF (four-seam fastball) clearly accounts for the most swinging strikes. It is important to note, however, that this does not mean it has the best success rate; it only shows that it was the pitch type that most often produced a swinging strike, which may simply reflect how often it was thrown.
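To illustrate the difference between a raw count and a success rate, here is a small sketch in Python. The pitch list is made up, and the rate is taken per pitch thrown, which is only one possible definition (a rate per swing would be another).

```python
from collections import Counter

# Hypothetical (pitch_type, outcome) pairs, not taken from the playoff data.
pitches = [
    ("FF", "swinging_strike"), ("FF", "ball"), ("FF", "called_strike"),
    ("FF", "foul"), ("FF", "swinging_strike"), ("FF", "hit_into_play"),
    ("FF", "ball"), ("FF", "ball"),
    ("SL", "swinging_strike"), ("SL", "ball"),
]

# Count how often each pitch type was thrown and how often it drew a swinging strike.
thrown = Counter(pitch_type for pitch_type, _ in pitches)
whiffs = Counter(pitch_type for pitch_type, outcome in pitches
                 if outcome == "swinging_strike")

# Rate per pitch thrown (an assumed definition for this sketch).
rates = {pitch_type: whiffs[pitch_type] / thrown[pitch_type] for pitch_type in thrown}

for pitch_type in thrown:
    print(pitch_type, whiffs[pitch_type], f"{rates[pitch_type]:.2f}")
```

In this toy data FF has more swinging strikes than SL (2 against 1) but a lower rate (0.25 against 0.50), which is the distinction the paragraph above draws.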

Swing strike hit zone

Here we have a simple hexbin plot showing where swinging strikes were located in the strike zone, as an example of displaying the most effective spots for leading a batter into a swinging strike.

Outcome Description

This shows the different possible outcomes of each pitch.

Pitch Outcome Reference Table
Outcome Description
called_strike Pitch taken for a strike (umpire calls it)
ball Pitch outside the strike zone, not swung at
foul Ball hit foul (not a foul tip or foul bunt)
swinging_strike_blocked Swing and miss on a pitch blocked by the catcher
swinging_strike Batter swings and misses
hit_into_play Ball is put into play (fair territory)
blocked_ball Ball in the dirt blocked by the catcher
foul_tip Foul ball tipped directly into the catcher’s glove
foul_bunt Foul ball resulting from a bunt attempt
hit_by_pitch Batter hit by the pitch
pitchout Intentional pitch thrown wide to prevent a stolen base

Other important variables

Pitch Variable Reference Table
Variable Description
plate_x Horizontal position of the ball when it crosses home plate from the catcher's perspective.
plate_z Vertical position of the ball when it crosses home plate from the catcher's perspective.
release_pos_x Horizontal release position of the ball measured in feet from the catcher's perspective.
release_pos_z Vertical release position of the ball measured in feet from the catcher's perspective.
zone Zone location of the ball when it crosses the plate from the catcher's perspective.
ax Acceleration of the pitch in the x-dimension (ft/s²), determined at y=50 feet.
ay Acceleration of the pitch in the y-dimension (ft/s²), determined at y=50 feet.
az Acceleration of the pitch in the z-dimension (ft/s²), determined at y=50 feet.
inning Pre-pitch inning number.
outs_when_up Pre-pitch number of outs.

DATA USE CASES

Use case 1: Clutch players for the four-seam fastball

This use case analyses the pitchers from the two World Series teams, Arizona and Texas, and compares how they performed in the World Series games with how they performed in the other finals games (the earlier playoff rounds). The idea is to see which players were more 'clutch' than others based on their pitch speeds for the FF (four-seam fastball).

The expected outcome is a clear difference between players, with some showing a more marked increase in pitch speed than others.

*Players who did not pitch in both sets of games were removed so that the comparison could be made.
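The filtering and averaging steps described above could be sketched roughly as follows. This is an illustration in Python with made-up rows, not the code behind the report; the pitcher names, the stage labels `world_series` and `earlier_rounds`, and the tuple layout are all assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (pitcher, stage, pitch_type, speed in mph).
pitches = [
    ("Pitcher A", "world_series", "FF", 95.0),
    ("Pitcher A", "world_series", "FF", 96.0),
    ("Pitcher A", "earlier_rounds", "FF", 93.0),
    ("Pitcher A", "earlier_rounds", "SL", 85.0),  # not a four-seam fastball
    ("Pitcher B", "earlier_rounds", "FF", 91.0),  # never pitched in the World Series
]

# Collect four-seam fastball speeds per (pitcher, stage).
speeds = defaultdict(list)
for pitcher, stage, pitch_type, mph in pitches:
    if pitch_type == "FF":
        speeds[(pitcher, stage)].append(mph)

stages = ("world_series", "earlier_rounds")
pitchers = sorted({pitcher for pitcher, _ in speeds})

# Keep only pitchers with fastballs in both stages; store (mean speed, pitch count).
summary = {}
for pitcher in pitchers:
    if all((pitcher, stage) in speeds for stage in stages):
        summary[pitcher] = {
            stage: (mean(speeds[(pitcher, stage)]), len(speeds[(pitcher, stage)]))
            for stage in stages
        }

print(summary)
```

Keeping the pitch count next to each average makes it easier to spot averages that rest on only a handful of pitches.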

Average Four-Seam Fastball Pitch Speed
Comparison Between World Series and Finals
Player World Series (MPH) Finals (MPH)
Bradford, Cody 90.03 88.50
Chapman, Aroldis 98.33 95.31
Eovaldi, Nathan 95.30 89.77
Frías, Luis 96.20 90.73
Gallen, Zac 94.17 90.08
Ginkel, Kevin 96.15 94.09
Heaney, Andrew 92.42 88.59
Kelly, Merrill 92.96 90.85
Leclerc, José 95.95 92.64
Montgomery, Jordan 90.70 89.73
Nelson, Kyle 91.37 87.42
Nelson, Ryne 93.25 93.90
Pfaadt, Brandon 93.28 90.66
Sborz, Josh 95.50 90.47
Scherzer, Max 93.50 89.57
Sewald, Paul 91.83 90.18
Smith, Will 92.50 84.80
Stratton, Chris 92.10 89.44
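As a quick way to read the table, the difference between the two columns can be computed for each pitcher. The sketch below is in Python and uses only the values from the table above; the variable names are illustrative.

```python
# (pitcher, World Series mph, Finals mph), copied from the table above.
table = [
    ("Bradford, Cody", 90.03, 88.50),
    ("Chapman, Aroldis", 98.33, 95.31),
    ("Eovaldi, Nathan", 95.30, 89.77),
    ("Frías, Luis", 96.20, 90.73),
    ("Gallen, Zac", 94.17, 90.08),
    ("Ginkel, Kevin", 96.15, 94.09),
    ("Heaney, Andrew", 92.42, 88.59),
    ("Kelly, Merrill", 92.96, 90.85),
    ("Leclerc, José", 95.95, 92.64),
    ("Montgomery, Jordan", 90.70, 89.73),
    ("Nelson, Kyle", 91.37, 87.42),
    ("Nelson, Ryne", 93.25, 93.90),
    ("Pfaadt, Brandon", 93.28, 90.66),
    ("Sborz, Josh", 95.50, 90.47),
    ("Scherzer, Max", 93.50, 89.57),
    ("Sewald, Paul", 91.83, 90.18),
    ("Smith, Will", 92.50, 84.80),
    ("Stratton, Chris", 92.10, 89.44),
]

# Difference in mph (World Series minus Finals), largest increase first.
deltas = sorted(
    ((round(ws - other, 2), name) for name, ws, other in table),
    reverse=True,
)

for delta, name in deltas:
    print(f"{name:20s} {delta:+.2f}")

# Pitchers whose average was lower in the World Series.
slower = [name for delta, name in deltas if delta < 0]
```

On these figures the largest increase is Smith, Will (+7.70 mph), and Nelson, Ryne is the only pitcher whose average was lower in the World Series (-0.65 mph).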

One limitation is that the table does not show the number of pitches thrown, so averages based on only a few pitches may be skewed. Other factors that could affect these numbers include which batters each pitcher mostly faced and their skill level, injury, sickness, weather conditions, etc. Many such conditions would need to be accounted for before claiming that this is an accurate representation of the most 'clutch' players.

Use case 2: Analysing variable and model importance

The idea of this use case is to analyse the variables in the data set and use them in two candidate models to predict whether a pitch results in a strike. Comparing the two models shows which is more accurate at making this prediction, along with which variables are the most important.

After tuning both the XGBoost (XGB) and random forest (RF) models and selecting the best configuration of each, here are the results for the accuracy of the models, along with each variable's importance.

Model Comparison
- Accuracy: percentage of correct predictions (XGB outperforms Random Forest).
- ROC-AUC: measures the ability to distinguish between classes (Random Forest is marginally better).
- Brier: measures the accuracy of probabilistic predictions; lower is better (Random Forest is better).

Variable importance refers to which variables have the most influence on the prediction. Zone was shown to be the most impactful in predicting the outcome.

Model Comparison
.metric .estimate model
accuracy 0.6440000 rf
roc_auc 0.7092392 rf
brier_class 0.2144250 rf
accuracy 0.6580000 xgb
roc_auc 0.7090629 xgb
brier_class 0.2319823 xgb
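For readers unfamiliar with the three metrics in the table, the sketch below shows how each can be computed for a binary outcome. It is a plain Python illustration on made-up labels and probabilities, not the code that produced the figures above, and the 0.5 classification threshold is an assumption.

```python
def accuracy(y_true, p_hat, threshold=0.5):
    """Share of predictions that match the label after thresholding the probability."""
    preds = [1 if p >= threshold else 0 for p in p_hat]
    return sum(int(y == pred) for y, pred in zip(y_true, preds)) / len(y_true)


def brier(y_true, p_hat):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)


def roc_auc(y_true, p_hat):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = strike) and predicted strike probabilities.
y_true = [1, 0, 1, 0]
p_hat = [0.9, 0.2, 0.4, 0.6]

print(accuracy(y_true, p_hat), roc_auc(y_true, p_hat), brier(y_true, p_hat))
```

Accuracy only looks at the thresholded class, while ROC-AUC and the Brier score use the predicted probabilities themselves, which is why the ranking of two models can differ between the metrics.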

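Here a generalised linear model (logistic regression) is used to estimate an odds ratio for each variable, showing how it changes the odds of a pitch ending up as a strike. For example, in the table below the splitter (FS) has an odds ratio of 2.32 (p = 0.019): the odds of a strike are about 2.3 times those of the reference pitch type, and the effect is statistically significant. KC, by contrast, has an odds ratio of 0.80 and is not significant (p = 0.605).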

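Tjur's R² is 0.099 (see the table below), which indicates that the model has modest explanatory power: the average predicted strike probability differs by only about 0.10 between pitches that ended as strikes and those that did not.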

  Is Strike
Predictors Odds Ratios CI p
(Intercept) 2.68 0.54 – 13.35 0.229
plate x 1.05 0.90 – 1.23 0.523
pitch type [CU] 1.02 0.56 – 1.87 0.939
pitch type [FC] 0.76 0.43 – 1.34 0.343
pitch type [FF] 1.01 0.59 – 1.74 0.975
pitch type [FS] 2.32 1.15 – 4.76 0.019
pitch type [KC] 0.80 0.35 – 1.83 0.605
pitch type [SI] 1.04 0.63 – 1.72 0.885
pitch type [SL] 1.45 0.90 – 2.34 0.127
pitch type [ST] 1.89 1.02 – 3.52 0.043
pitch type [SV] 2.49 0.23 – 55.52 0.466
zone 0.89 0.86 – 0.92 <0.001
strikes 0.61 0.54 – 0.70 <0.001
inning 1.02 0.98 – 1.06 0.367
ax 0.99 0.98 – 1.01 0.308
ay 1.00 0.95 – 1.04 0.885
az 0.98 0.96 – 1.01 0.232
Observations 1500
R2 Tjur 0.099
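An odds ratio multiplies the odds, not the probability, so it is worth seeing what the values in the table above mean in probability terms. The Python sketch below converts an odds ratio into a new probability from an assumed baseline, and also shows how Tjur's R² is defined. The baseline probability of 0.5 and the toy labels and probabilities are assumptions chosen for illustration.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability obtained after multiplying the baseline odds by an odds ratio."""
    odds = baseline_prob / (1 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)


def tjur_r2(y_true, p_hat):
    """Mean predicted probability among positives minus the mean among negatives."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


# Odds ratios taken from the table above; the 0.5 baseline is an assumption.
fs_prob = apply_odds_ratio(0.5, 2.32)       # pitch type [FS]
strikes_prob = apply_odds_ratio(0.5, 0.61)  # one additional strike in the count

print(round(fs_prob, 3), round(strikes_prob, 3))

# Toy labels and predicted probabilities to show the Tjur R² calculation.
toy_r2 = tjur_r2([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
print(round(toy_r2, 2))
```

From a 50% baseline, an odds ratio of 2.32 corresponds to a strike probability of roughly 70%, not a probability 2.32 times as large, and an odds ratio of 0.61 corresponds to roughly 38%.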

A limitation here is that no further models were included in the comparison. Further analysis of the individual variables, to find which have the highest correlations with the outcome, is also left for future work.