This exploratory data analysis report is based on data taken from the 2023 Major League Baseball (MLB) playoffs. It consists of athlete data from the games in each of the brackets: wild card, divisional series, league championship series and the World Series. The data set is structured pitch by pitch, with variables that give information about each individual pitch.
There were 11,829 observations in total, covering 148 batters and 136 pitchers. There were 12 teams in total competing for the trophy.
I created a pitch type table to use as a reference, especially for the visualisations and later modelling. The filter, group-by and summarise functions were also used to create datasets that include only the metrics I wanted to compare and visualise.
Columns consisting entirely of N/A values were deleted, as they carry no information. The subset function was used to select the columns to remove from the original dataset.
Pitch type refers to the technique used to throw a pitch. Pitch types differ in velocity, angle and spin.
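The cleaning and summarising steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the report's own code: the rows and the column names `release_speed` and `spin_dir` are hypothetical, and `None` stands in for N/A.

```python
from collections import defaultdict

# Hypothetical pitch-level rows; None stands in for N/A.
rows = [
    {"pitch_type": "FF", "release_speed": 95.1, "spin_dir": None},
    {"pitch_type": "SL", "release_speed": 86.4, "spin_dir": None},
    {"pitch_type": "FF", "release_speed": 96.3, "spin_dir": None},
]

def drop_all_na_columns(rows):
    """Return copies of the rows without columns that are N/A in every row."""
    columns = list(rows[0].keys()) if rows else []
    keep = [c for c in columns if any(r[c] is not None for r in rows)]
    return [{c: r[c] for c in keep} for r in rows]

def mean_by_group(rows, group_col, value_col):
    """Group rows by group_col and return the mean of value_col per group."""
    buckets = defaultdict(list)
    for r in rows:
        if r[value_col] is not None:
            buckets[r[group_col]].append(r[value_col])
    return {group: sum(vals) / len(vals) for group, vals in buckets.items()}

cleaned = drop_all_na_columns(rows)
mean_speed = mean_by_group(cleaned, "pitch_type", "release_speed")
```

Only columns that are empty in every row are dropped, so a column with even one recorded value is kept for later inspection.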
Pitch Type Reference Table

| pitch_type | pitch_name |
|---|---|
| FC | Cutter |
| CU | Curveball |
| FS | Splitter |
| FF | Four-Seam Fastball |
| SL | Slider |
| KC | Knuckle Curve |
| CH | Change-up |
| SI | Sinker |
| ST | Sweeper |
| PO | Pitchout |
| SV | Slurve |
This bar chart gives a quick overview of the different pitch types and how they lead to differing results. FF (four-seam fastball) clearly produces the largest number of swinging strikes. It is important to note, however, that this does not mean it has the best success rate; it only shows that it was the most common pitch type behind a swinging strike.
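The difference between the raw count of swinging strikes and the success rate can be made concrete with a small sketch. The pitches below are invented for illustration and are not taken from the playoff data.

```python
from collections import Counter

# Hypothetical (pitch_type, outcome) pairs, not real playoff data.
pitches = [
    ("FF", "swinging_strike"), ("FF", "ball"), ("FF", "foul"),
    ("FF", "swinging_strike"), ("FF", "called_strike"), ("FF", "ball"),
    ("SL", "swinging_strike"), ("SL", "ball"),
]

# How often each pitch type was thrown, and how often it drew a swing and miss.
thrown = Counter(pitch_type for pitch_type, _ in pitches)
whiffs = Counter(pt for pt, outcome in pitches if outcome == "swinging_strike")

# Swinging strikes per pitch thrown, for each pitch type.
rate = {pt: whiffs[pt] / thrown[pt] for pt in thrown}

most_whiffs = whiffs.most_common(1)[0][0]   # highest raw count
best_rate = max(rate, key=rate.get)         # highest rate per pitch thrown
```

In this toy sample FF has the most swinging strikes simply because it is thrown most often, while SL has the higher rate per pitch.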
Swing strike hit zone
Here we have a simple hex diagram showing where swinging strikes were located in the hit zone, as an example of identifying the most effective spots for drawing a swinging strike.
The table below lists the possible outcomes of each pitch.
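The idea behind the location plot is to count swinging strikes per small region of the zone. The sketch below uses square cells as a simpler stand-in for hexagons; the points and the cell size are hypothetical, with each point read as a (`plate_x`, `plate_z`) pair.

```python
import math
from collections import Counter

def grid_counts(points, cell=0.5):
    """Count points per square cell (square bins stand in for hexagons)."""
    counts = Counter()
    for x, z in points:
        counts[(math.floor(x / cell), math.floor(z / cell))] += 1
    return counts

# Hypothetical (plate_x, plate_z) locations of swinging strikes.
swinging_strikes = [(0.1, 2.2), (0.3, 2.4), (-0.2, 1.9), (0.4, 2.1)]

counts = grid_counts(swinging_strikes)
busiest_cell, busiest_count = counts.most_common(1)[0]
```

Cells with the highest counts correspond to the darkest regions of a density plot.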
Pitch Outcome Reference Table

| Outcome | Description |
|---|---|
| called_strike | Pitch taken for a strike (umpire calls it) |
| ball | Pitch outside the strike zone, not swung at |
| foul | Ball hit foul (with less than two strikes) |
| swinging_strike_blocked | Swing and miss on a pitch blocked by the catcher |
| swinging_strike | Batter swings and misses |
| hit_into_play | Ball is put into play (fair territory) |
| blocked_ball | Pitch called a ball that the catcher has to block |
| foul_tip | Foul ball tipped directly into the catcher’s glove |
| foul_bunt | Foul ball resulting from a bunt attempt |
| hit_by_pitch | Batter hit by the pitch |
| pitchout | Intentional pitch thrown wide to prevent a stolen base |

Pitch Variable Reference Table

| Variable | Description |
|---|---|
| plate_x | Horizontal position of the ball when it crosses home plate from the catcher's perspective. |
| plate_z | Vertical position of the ball when it crosses home plate from the catcher's perspective. |
| release_pos_x | Horizontal release position of the ball measured in feet from the catcher's perspective. |
| release_pos_z | Vertical release position of the ball measured in feet from the catcher's perspective. |
| zone | Zone location of the ball when it crosses the plate from the catcher's perspective. |
| ax | Acceleration of the pitch in the x-dimension (ft/s²), determined at y=50 feet. |
| ay | Acceleration of the pitch in the y-dimension (ft/s²), determined at y=50 feet. |
| az | Acceleration of the pitch in the z-dimension (ft/s²), determined at y=50 feet. |
| inning | Pre-pitch inning number. |
| outs_when_up | Pre-pitch number of outs. |
This section analyses the pitchers from the two World Series teams, Arizona and Texas, and compares how they performed in the World Series games with how they performed in the other finals games. The idea is to see which players were more ‘clutch’ than others, based on their pitching speeds for the FF (four-seam fastball).
The expected outcome is a clear difference between players, with some improving their pitching speeds more drastically than others.
*Players who did not pitch in both sets of games were removed so that this comparison could be made.
Average Four-Seam Fastball Speed: Comparison Between World Series and Finals

| Player | World Series (MPH) | Finals (MPH) |
|---|---|---|
| Bradford, Cody | 90.03 | 88.50 |
| Chapman, Aroldis | 98.33 | 95.31 |
| Eovaldi, Nathan | 95.30 | 89.77 |
| Frías, Luis | 96.20 | 90.73 |
| Gallen, Zac | 94.17 | 90.08 |
| Ginkel, Kevin | 96.15 | 94.09 |
| Heaney, Andrew | 92.42 | 88.59 |
| Kelly, Merrill | 92.96 | 90.85 |
| Leclerc, José | 95.95 | 92.64 |
| Montgomery, Jordan | 90.70 | 89.73 |
| Nelson, Kyle | 91.37 | 87.42 |
| Nelson, Ryne | 93.25 | 93.90 |
| Pfaadt, Brandon | 93.28 | 90.66 |
| Sborz, Josh | 95.50 | 90.47 |
| Scherzer, Max | 93.50 | 89.57 |
| Sewald, Paul | 91.83 | 90.18 |
| Smith, Will | 92.50 | 84.80 |
| Stratton, Chris | 92.10 | 89.44 |
Limitations include not displaying the number of pitches thrown, as the averages might be skewed by small samples. Other limitations are factors that affect these numbers, such as which batters each pitcher mostly faced and their skill level, injury, sickness, weather conditions, etc. There are many conditions to account for before saying that this is categorically an accurate representation of the most ‘clutch’ players.
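A minimal sketch of the comparison, which also reports the number of pitches behind each average to address the first limitation. The pitcher names, stage labels and speeds are hypothetical, and the code is an illustration in plain Python rather than the code used for the table.

```python
from collections import defaultdict

STAGES = ("World Series", "Finals")

# Hypothetical rows: (pitcher, stage, pitch_type, speed in mph).
pitches = [
    ("Pitcher A", "World Series", "FF", 95.0),
    ("Pitcher A", "World Series", "FF", 96.0),
    ("Pitcher A", "Finals", "FF", 93.0),
    ("Pitcher A", "Finals", "SL", 85.0),
    ("Pitcher B", "Finals", "FF", 91.0),
]

def ff_speed_comparison(pitches):
    """Mean FF speed and pitch count per stage, for pitchers seen in both stages."""
    speeds = defaultdict(lambda: defaultdict(list))
    for pitcher, stage, pitch_type, mph in pitches:
        if pitch_type == "FF":
            speeds[pitcher][stage].append(mph)

    result = {}
    for pitcher, by_stage in speeds.items():
        # Keep only pitchers who threw an FF in both stages.
        if all(stage in by_stage for stage in STAGES):
            result[pitcher] = {
                stage: (round(sum(by_stage[stage]) / len(by_stage[stage]), 2),
                        len(by_stage[stage]))
                for stage in STAGES
            }
    return result

comparison = ff_speed_comparison(pitches)
```

Each entry is a `(mean_mph, n_pitches)` pair, so an average built on one or two pitches is easy to spot.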
The idea of this use case is to analyse the variables in the data set and use them in two candidate models to predict whether a pitch results in a strike. Building two different models and comparing them shows which is more accurate at making this prediction, along with which variables are the most important.
After tuning both the XGB and RF models and selecting the best configuration of each, here are the accuracy results for the models, along with each variable's importance.
Model Comparison
- Accuracy: the percentage of correct predictions. (XGB outperforms Random Forest.)
- ROC-AUC: measures the ability to distinguish between classes. (Random Forest is slightly better.)
- Brier: measures the accuracy of probabilistic predictions; lower is better. (Random Forest is the best.)

Variable importance refers to which variables influence the prediction the most. Zone was shown to be the most impactful in predicting the outcome.
Model Comparison

| .metric | .estimate | model |
|---|---|---|
| accuracy | 0.6440000 | rf |
| roc_auc | 0.7092392 | rf |
| brier_class | 0.2144250 | rf |
| accuracy | 0.6580000 | xgb |
| roc_auc | 0.7090629 | xgb |
| brier_class | 0.2319823 | xgb |
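The three metrics in the table can be computed from true labels and predicted strike probabilities as sketched below. These are the standard binary definitions written in plain Python; it is an assumption that they match the report's metric implementations exactly, and the labels and probabilities are invented.

```python
def accuracy(y_true, p_hat, threshold=0.5):
    """Share of predictions that match the label after thresholding."""
    preds = [1 if p >= threshold else 0 for p in p_hat]
    return sum(int(y == pred) for y, pred in zip(y_true, preds)) / len(y_true)

def brier_score(y_true, p_hat):
    """Mean squared gap between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)

def roc_auc(y_true, p_hat):
    """Chance that a random positive is scored above a random negative (ties count half)."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = strike) and predicted strike probabilities.
y_true = [1, 0, 1, 0]
p_hat = [0.9, 0.4, 0.3, 0.2]

acc = accuracy(y_true, p_hat)
brier = brier_score(y_true, p_hat)
auc = roc_auc(y_true, p_hat)
```

Accuracy depends on the chosen threshold, while ROC-AUC and the Brier score use the probabilities directly, which is one reason two models can rank differently across the three metrics.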
Here we use a generalised linear model to estimate an odds ratio for each variable, showing how it changes the odds of a pitch ending up as a strike. For example, pitch type FS has an odds ratio of 2.32 (CI 1.15 to 4.76, p = 0.019), so a splitter has roughly 2.3 times the odds of ending up as a strike compared with the reference pitch type, and this effect is statistically significant.
Tjur's R² is 0.099, which indicates the model has modest explanatory power.
Is Strike

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| (Intercept) | 2.68 | 0.54 – 13.35 | 0.229 |
| plate x | 1.05 | 0.90 – 1.23 | 0.523 |
| pitch type [CU] | 1.02 | 0.56 – 1.87 | 0.939 |
| pitch type [FC] | 0.76 | 0.43 – 1.34 | 0.343 |
| pitch type [FF] | 1.01 | 0.59 – 1.74 | 0.975 |
| pitch type [FS] | 2.32 | 1.15 – 4.76 | 0.019 |
| pitch type [KC] | 0.80 | 0.35 – 1.83 | 0.605 |
| pitch type [SI] | 1.04 | 0.63 – 1.72 | 0.885 |
| pitch type [SL] | 1.45 | 0.90 – 2.34 | 0.127 |
| pitch type [ST] | 1.89 | 1.02 – 3.52 | 0.043 |
| pitch type [SV] | 2.49 | 0.23 – 55.52 | 0.466 |
| zone | 0.89 | 0.86 – 0.92 | <0.001 |
| strikes | 0.61 | 0.54 – 0.70 | <0.001 |
| inning | 1.02 | 0.98 – 1.06 | 0.367 |
| ax | 0.99 | 0.98 – 1.01 | 0.308 |
| ay | 1.00 | 0.95 – 1.04 | 0.885 |
| az | 0.98 | 0.96 – 1.01 | 0.232 |
| Observations | 1500 | | |
| R² Tjur | 0.099 | | |
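To help read the table, the sketch below shows how an odds ratio acts on a probability and how Tjur's R² is defined (mean predicted probability for strikes minus mean predicted probability for non-strikes). The 0.5 baseline probability and the example predictions are hypothetical; only the odds ratio of 0.61 for `strikes` comes from the table.

```python
def apply_odds_ratio(baseline_prob, odds_ratio, units=1):
    """Probability after multiplying the baseline odds by odds_ratio ** units."""
    baseline_odds = baseline_prob / (1 - baseline_prob)
    new_odds = baseline_odds * odds_ratio ** units
    return new_odds / (1 + new_odds)

def tjur_r2(y_true, p_hat):
    """Mean predicted probability of positives minus that of negatives."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

# Odds ratio for `strikes` from the table; the 0.5 baseline is an assumption.
one_more_strike = apply_odds_ratio(0.5, 0.61)
two_more_strikes = apply_odds_ratio(0.5, 0.61, units=2)

# Hypothetical labels and predictions to illustrate Tjur's R².
r2 = tjur_r2([1, 0, 1, 0], [0.9, 0.4, 0.3, 0.2])
```

An odds ratio multiplies odds rather than probabilities, so its effect on the probability depends on the baseline it is applied to.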
A limitation of this analysis is that no further models were included for comparison. More analysis of the individual variables, to find which have the highest correlations with the outcome, is also something to be done in the future.