The dataset is a Statcast Search CSV export from Baseball Savant, MLB’s Statcast data site. It contains pitch-by-pitch information from the 2023 MLB postseason, including pitch types, speeds, movement, and batted-ball outcomes. Each row records one pitch, with data on both the pitcher and the batter, and this export is restricted to high-stakes postseason games. That makes it a useful resource for analyzing player and team strategy at the highest level.
Database Loading and Libraries
#Libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(dplyr)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(rlang)
##
## Attaching package: 'rlang'
##
## The following objects are masked from 'package:purrr':
##
## %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
## flatten_raw, invoke, splice
library(tidyr)
library(lubridate)
library(stringr)
library(purrr)
#Database
df <- read.csv("pitch_data_2023_mlb_post.csv")
head(df)
## pitch_type game_date release_speed release_pos_x release_pos_z
## 1 FC 2023-11-01 91.5 -2.60 5.44
## 2 CU 2023-11-01 76.2 -2.27 5.73
## 3 FS 2023-11-01 87.3 -2.61 5.49
## 4 FS 2023-11-01 88.6 -2.39 5.61
## 5 FS 2023-11-01 85.6 -2.38 5.64
## 6 FS 2023-11-01 88.5 -2.51 5.51
## player_name batter pitcher events description spin_dir
## 1 Eovaldi, Nathan 672695 543135 strikeout called_strike NA
## 2 Eovaldi, Nathan 672695 543135 called_strike NA
## 3 Eovaldi, Nathan 672695 543135 ball NA
## 4 Eovaldi, Nathan 672695 543135 foul NA
## 5 Eovaldi, Nathan 446334 543135 strikeout swinging_strike_blocked NA
## 6 Eovaldi, Nathan 446334 543135 swinging_strike NA
## spin_rate_deprecated break_angle_deprecated break_length_deprecated zone
## 1 NA NA NA 7
## 2 NA NA NA 8
## 3 NA NA NA 13
## 4 NA NA NA 14
## 5 NA NA NA 14
## 6 NA NA NA 13
## des game_type stand p_throws home_team
## 1 Geraldo Perdomo called out on strikes. W L R AZ
## 2 Geraldo Perdomo called out on strikes. W L R AZ
## 3 Geraldo Perdomo called out on strikes. W L R AZ
## 4 Geraldo Perdomo called out on strikes. W L R AZ
## 5 Evan Longoria strikes out swinging. W R R AZ
## 6 Evan Longoria strikes out swinging. W R R AZ
## away_team type hit_location bb_type balls strikes game_year pfx_x pfx_z
## 1 TEX S 2 1 2 2023 0.25 0.74
## 2 TEX S NA 1 1 2023 0.99 -0.37
## 3 TEX B NA 0 1 2023 -0.86 0.11
## 4 TEX S NA 0 0 2023 -0.92 0.24
## 5 TEX S 2 0 2 2023 -1.00 0.17
## 6 TEX S NA 0 1 2023 -1.08 0.13
## plate_x plate_z on_3b on_2b on_1b outs_when_up inning inning_topbot hc_x hc_y
## 1 -0.47 1.94 NA NA NA 2 6 Bot NA NA
## 2 -0.10 1.96 NA NA NA 2 6 Bot NA NA
## 3 -0.35 1.02 NA NA NA 2 6 Bot NA NA
## 4 0.18 1.30 NA NA NA 2 6 Bot NA NA
## 5 0.21 0.62 NA NA NA 1 6 Bot NA NA
## 6 -0.13 0.93 NA NA NA 1 6 Bot NA NA
## tfs_deprecated tfs_zulu_deprecated fielder_2 umpire sv_id vx0 vy0
## 1 NA NA 641680 NA NA 4.866354 -133.2660
## 2 NA NA 641680 NA NA 2.772168 -110.9465
## 3 NA NA 641680 NA NA 7.274464 -126.8861
## 4 NA NA 641680 NA NA 8.263622 -128.7932
## 5 NA NA 641680 NA NA 8.191721 -124.3931
## 6 NA NA 641680 NA NA 8.101944 -128.6535
## vz0 ax ay az sz_top sz_bot hit_distance_sc
## 1 -4.6442287 2.079220 26.02426 -22.51388 3.40 1.57 NA
## 2 -0.2128294 7.719352 19.68724 -35.40345 3.30 1.53 NA
## 3 -4.8384666 -10.799094 25.31579 -30.23172 3.43 1.53 NA
## 4 -4.9791683 -11.987677 26.02350 -28.68755 3.40 1.57 3
## 5 -5.9595113 -11.965423 23.96371 -29.38752 3.67 1.73 NA
## 6 -5.4465204 -13.627585 24.91191 -29.78662 3.67 1.73 NA
## launch_speed launch_angle effective_speed release_spin_rate release_extension
## 1 NA NA 92.2 2281 6.5
## 2 NA NA 76.6 1928 6.5
## 3 NA NA 88.0 1498 6.7
## 4 90.7 -29 89.2 1638 6.7
## 5 NA NA 86.3 1574 6.7
## 6 NA NA 89.3 1634 6.7
## game_pk pitcher_1 fielder_2_1 fielder_3 fielder_4 fielder_5 fielder_6
## 1 748534 543135 641680 663993 543760 673962 608369
## 2 748534 543135 641680 663993 543760 673962 608369
## 3 748534 543135 641680 663993 543760 673962 608369
## 4 748534 543135 641680 663993 543760 673962 608369
## 5 748534 543135 641680 663993 543760 673962 608369
## 6 748534 543135 641680 663993 543760 673962 608369
## fielder_7 fielder_8 fielder_9 release_pos_y estimated_ba_using_speedangle
## 1 694497 665750 608671 54.03 NA
## 2 694497 665750 608671 53.97 NA
## 3 694497 665750 608671 53.79 NA
## 4 694497 665750 608671 53.82 NA
## 5 694497 665750 608671 53.79 NA
## 6 694497 665750 608671 53.78 NA
## estimated_woba_using_speedangle woba_value woba_denom babip_value iso_value
## 1 NA 0 1 0 0
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA 0 1 0 0
## 6 NA NA NA NA NA
## launch_speed_angle at_bat_number pitch_number pitch_name home_score
## 1 NA 46 4 Cutter 0
## 2 NA 46 3 Curveball 0
## 3 NA 46 2 Split-Finger 0
## 4 NA 46 1 Split-Finger 0
## 5 NA 45 3 Split-Finger 0
## 6 NA 45 2 Split-Finger 0
## away_score bat_score fld_score post_away_score post_home_score post_bat_score
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## post_fld_score if_fielding_alignment of_fielding_alignment spin_axis
## 1 0 Strategic Strategic 200
## 2 0 Infield shade Strategic 44
## 3 0 Infield shade Strategic 240
## 4 0 Infield shade Strategic 237
## 5 0 Infield shade Standard 241
## 6 0 Standard Standard 236
## delta_home_win_exp delta_run_exp
## 1 -0.017 -0.074
## 2 0.000 -0.027
## 3 0.000 0.012
## 4 0.000 -0.017
## 5 -0.024 -0.107
## 6 0.000 -0.036
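In the preview above, events is blank on pitches that did not end a plate appearance, and read.csv() keeps those blanks as empty strings rather than NA. The following is a minimal sketch of one way to read them as missing values with the na.strings argument of read.csv(). The inline CSV text and the name toy are stand-ins for the real file and are not part of the original analysis.

```r
# Toy CSV text standing in for the real file (three rows copied from the preview)
csv_text <- "pitch_type,events,release_speed
FC,strikeout,91.5
CU,,76.2
FS,,87.3"

# Treat empty fields as NA in addition to the literal string "NA"
toy <- read.csv(text = csv_text, na.strings = c("", "NA"))

# Blank events are now missing values and can be counted directly
n_missing_events <- sum(is.na(toy$events))
print(n_missing_events)
```

Whether blanks should become NA depends on the later analysis; code that filters on events == "" would need to switch to is.na(events).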
Basic Overview
cat("Dataset Structure and Summary \n")
## Dataset Structure and Summary
str(df)
## 'data.frame': 11829 obs. of 92 variables:
## $ pitch_type : chr "FC" "CU" "FS" "FS" ...
## $ game_date : chr "2023-11-01" "2023-11-01" "2023-11-01" "2023-11-01" ...
## $ release_speed : num 91.5 76.2 87.3 88.6 85.6 88.5 91.4 96.1 88 76.2 ...
## $ release_pos_x : num -2.6 -2.27 -2.61 -2.39 -2.38 -2.51 -2.59 -2.53 -2.52 -2.16 ...
## $ release_pos_z : num 5.44 5.73 5.49 5.61 5.64 5.51 5.43 5.38 5.55 5.76 ...
## $ player_name : chr "Eovaldi, Nathan" "Eovaldi, Nathan" "Eovaldi, Nathan" "Eovaldi, Nathan" ...
## $ batter : int 672695 672695 672695 672695 446334 446334 446334 677950 677950 677950 ...
## $ pitcher : int 543135 543135 543135 543135 543135 543135 543135 543135 543135 543135 ...
## $ events : chr "strikeout" "" "" "" ...
## $ description : chr "called_strike" "called_strike" "ball" "foul" ...
## $ spin_dir : logi NA NA NA NA NA NA ...
## $ spin_rate_deprecated : logi NA NA NA NA NA NA ...
## $ break_angle_deprecated : logi NA NA NA NA NA NA ...
## $ break_length_deprecated : logi NA NA NA NA NA NA ...
## $ zone : int 7 8 13 14 14 13 5 2 8 14 ...
## $ des : chr "Geraldo Perdomo called out on strikes." "Geraldo Perdomo called out on strikes." "Geraldo Perdomo called out on strikes." "Geraldo Perdomo called out on strikes." ...
## $ game_type : chr "W" "W" "W" "W" ...
## $ stand : chr "L" "L" "L" "L" ...
## $ p_throws : chr "R" "R" "R" "R" ...
## $ home_team : chr "AZ" "AZ" "AZ" "AZ" ...
## $ away_team : chr "TEX" "TEX" "TEX" "TEX" ...
## $ type : chr "S" "S" "B" "S" ...
## $ hit_location : int 2 NA NA NA 2 NA NA 1 NA NA ...
## $ bb_type : chr "" "" "" "" ...
## $ balls : int 1 1 0 0 0 0 0 0 0 0 ...
## $ strikes : int 2 1 1 0 2 1 0 2 2 2 ...
## $ game_year : int 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
## $ pfx_x : num 0.25 0.99 -0.86 -0.92 -1 -1.08 0.39 -0.97 -1.13 0.72 ...
## $ pfx_z : num 0.74 -0.37 0.11 0.24 0.17 0.13 0.94 1.08 0.15 -0.2 ...
## $ plate_x : num -0.47 -0.1 -0.35 0.18 0.21 -0.13 0.24 -0.09 -0.16 0.96 ...
## $ plate_z : num 1.94 1.96 1.02 1.3 0.62 0.93 2.69 3.3 1.36 1.82 ...
## $ on_3b : int NA NA NA NA NA NA NA NA NA NA ...
## $ on_2b : int NA NA NA NA NA NA NA NA NA NA ...
## $ on_1b : int NA NA NA NA NA NA NA NA NA NA ...
## $ outs_when_up : int 2 2 2 2 1 1 1 0 0 0 ...
## $ inning : int 6 6 6 6 6 6 6 6 6 6 ...
## $ inning_topbot : chr "Bot" "Bot" "Bot" "Bot" ...
## $ hc_x : num NA NA NA NA NA ...
## $ hc_y : num NA NA NA NA NA ...
## $ tfs_deprecated : logi NA NA NA NA NA NA ...
## $ tfs_zulu_deprecated : logi NA NA NA NA NA NA ...
## $ fielder_2 : int 641680 641680 641680 641680 641680 641680 641680 641680 641680 641680 ...
## $ umpire : logi NA NA NA NA NA NA ...
## $ sv_id : logi NA NA NA NA NA NA ...
## $ vx0 : num 4.87 2.77 7.27 8.26 8.19 ...
## $ vy0 : num -133 -111 -127 -129 -124 ...
## $ vz0 : num -4.644 -0.213 -4.838 -4.979 -5.96 ...
## $ ax : num 2.08 7.72 -10.8 -11.99 -11.97 ...
## $ ay : num 26 19.7 25.3 26 24 ...
## $ az : num -22.5 -35.4 -30.2 -28.7 -29.4 ...
## $ sz_top : num 3.4 3.3 3.43 3.4 3.67 3.67 3.67 3.24 3.24 3.24 ...
## $ sz_bot : num 1.57 1.53 1.53 1.57 1.73 1.73 1.73 1.44 1.44 1.44 ...
## $ hit_distance_sc : int NA NA NA 3 NA NA 230 1 2 44 ...
## $ launch_speed : num NA NA NA 90.7 NA NA 72.5 73.3 71.9 91.9 ...
## $ launch_angle : int NA NA NA -29 NA NA 34 -68 -40 0 ...
## $ effective_speed : num 92.2 76.6 88 89.2 86.3 89.3 92.3 96 88.6 76.6 ...
## $ release_spin_rate : int 2281 1928 1498 1638 1574 1634 2271 2313 1492 1702 ...
## $ release_extension : num 6.5 6.5 6.7 6.7 6.7 6.7 6.6 6.6 6.5 6.5 ...
## $ game_pk : int 748534 748534 748534 748534 748534 748534 748534 748534 748534 748534 ...
## $ pitcher_1 : int 543135 543135 543135 543135 543135 543135 543135 543135 543135 543135 ...
## $ fielder_2_1 : int 641680 641680 641680 641680 641680 641680 641680 641680 641680 641680 ...
## $ fielder_3 : int 663993 663993 663993 663993 663993 663993 663993 663993 663993 663993 ...
## $ fielder_4 : int 543760 543760 543760 543760 543760 543760 543760 543760 543760 543760 ...
## $ fielder_5 : int 673962 673962 673962 673962 673962 673962 673962 673962 673962 673962 ...
## $ fielder_6 : int 608369 608369 608369 608369 608369 608369 608369 608369 608369 608369 ...
## $ fielder_7 : int 694497 694497 694497 694497 694497 694497 694497 694497 694497 694497 ...
## $ fielder_8 : int 665750 665750 665750 665750 665750 665750 665750 665750 665750 665750 ...
## $ fielder_9 : int 608671 608671 608671 608671 608671 608671 608671 608671 608671 608671 ...
## $ release_pos_y : num 54 54 53.8 53.8 53.8 ...
## $ estimated_ba_using_speedangle : num NA NA NA NA NA NA NA 0.274 NA NA ...
## $ estimated_woba_using_speedangle: num NA NA NA NA NA NA NA 0.25 NA NA ...
## $ woba_value : num 0 NA NA NA 0 NA NA 0 NA NA ...
## $ woba_denom : int 1 NA NA NA 1 NA NA 1 NA NA ...
## $ babip_value : int 0 NA NA NA 0 NA NA 0 NA NA ...
## $ iso_value : int 0 NA NA NA 0 NA NA 0 NA NA ...
## $ launch_speed_angle : int NA NA NA NA NA NA NA 2 NA NA ...
## $ at_bat_number : int 46 46 46 46 45 45 45 44 44 44 ...
## $ pitch_number : int 4 3 2 1 3 2 1 5 4 3 ...
## $ pitch_name : chr "Cutter" "Curveball" "Split-Finger" "Split-Finger" ...
## $ home_score : int 0 0 0 0 0 0 0 0 0 0 ...
## $ away_score : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bat_score : int 0 0 0 0 0 0 0 0 0 0 ...
## $ fld_score : int 0 0 0 0 0 0 0 0 0 0 ...
## $ post_away_score : int 0 0 0 0 0 0 0 0 0 0 ...
## $ post_home_score : int 0 0 0 0 0 0 0 0 0 0 ...
## $ post_bat_score : int 0 0 0 0 0 0 0 0 0 0 ...
## $ post_fld_score : int 0 0 0 0 0 0 0 0 0 0 ...
## $ if_fielding_alignment : chr "Strategic" "Infield shade" "Infield shade" "Infield shade" ...
## $ of_fielding_alignment : chr "Strategic" "Strategic" "Strategic" "Strategic" ...
## $ spin_axis : int 200 44 240 237 241 236 199 229 242 37 ...
## $ delta_home_win_exp : num -0.017 0 0 0 -0.024 0 0 -0.033 0 0 ...
## $ delta_run_exp : num -0.074 -0.027 0.012 -0.017 -0.107 -0.036 -0.026 -0.152 0 0 ...
cat("\nDimensions of df:", dim(df), "\n")
##
## Dimensions of df: 11829 92
summary(df)
## pitch_type game_date release_speed release_pos_x
## Length:11829 Length:11829 Min. : 71.40 Min. :-3.910
## Class :character Class :character 1st Qu.: 85.50 1st Qu.:-2.190
## Mode :character Mode :character Median : 91.40 Median :-1.620
## Mean : 90.05 Mean :-1.051
## 3rd Qu.: 94.70 3rd Qu.:-0.410
## Max. :103.70 Max. : 4.500
##
## release_pos_z player_name batter pitcher
## Min. :3.190 Length:11829 Min. :444482 Min. :434378
## 1st Qu.:5.450 Class :character 1st Qu.:592663 1st Qu.:571760
## Median :5.770 Mode :character Median :656775 Median :624133
## Mean :5.765 Mean :622921 Mean :614818
## 3rd Qu.:6.100 3rd Qu.:669221 3rd Qu.:664353
## Max. :7.270 Max. :694497 Max. :700363
##
## events description spin_dir spin_rate_deprecated
## Length:11829 Length:11829 Mode:logical Mode:logical
## Class :character Class :character NA's:11829 NA's:11829
## Mode :character Mode :character
##
##
##
##
## break_angle_deprecated break_length_deprecated zone
## Mode:logical Mode:logical Min. : 1.000
## NA's:11829 NA's:11829 1st Qu.: 6.000
## Median :11.000
## Mean : 9.274
## 3rd Qu.:13.000
## Max. :14.000
##
## des game_type stand p_throws
## Length:11829 Length:11829 Length:11829 Length:11829
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## home_team away_team type hit_location
## Length:11829 Length:11829 Length:11829 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:2.000
## Mode :character Mode :character Mode :character Median :5.000
## Mean :5.004
## 3rd Qu.:7.000
## Max. :9.000
## NA's :9157
## bb_type balls strikes game_year
## Length:11829 Min. :0.000 Min. :0.000 Min. :2023
## Class :character 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:2023
## Mode :character Median :1.000 Median :1.000 Median :2023
## Mean :0.863 Mean :0.897 Mean :2023
## 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:2023
## Max. :3.000 Max. :2.000 Max. :2023
##
## pfx_x pfx_z plate_x plate_z
## Min. :-1.98000 Min. :-1.6500 Min. :-5.07000 Min. :-2.300
## 1st Qu.:-0.84000 1st Qu.: 0.1000 1st Qu.:-0.49000 1st Qu.: 1.560
## Median :-0.23000 Median : 0.7100 Median : 0.07000 Median : 2.220
## Mean :-0.09191 Mean : 0.6041 Mean : 0.06783 Mean : 2.225
## 3rd Qu.: 0.63000 3rd Qu.: 1.2900 3rd Qu.: 0.62000 3rd Qu.: 2.890
## Max. : 2.19000 Max. : 1.9500 Max. : 4.11000 Max. : 6.250
##
## on_3b on_2b on_1b outs_when_up
## Min. :446334 Min. :444482 Min. :446334 Min. :0.0000
## 1st Qu.:606115 1st Qu.:592885 1st Qu.:592663 1st Qu.:0.0000
## Median :641680 Median :656941 Median :645277 Median :1.0000
## Mean :623920 Mean :623594 Mean :622009 Mean :0.9716
## 3rd Qu.:672515 3rd Qu.:670242 3rd Qu.:669016 3rd Qu.:2.0000
## Max. :694497 Max. :694497 Max. :694497 Max. :2.0000
## NA's :10719 NA's :9734 NA's :8139
## inning inning_topbot hc_x hc_y
## Min. : 1.000 Length:11829 Min. : 6.51 Min. : 24.94
## 1st Qu.: 3.000 Class :character 1st Qu.:101.19 1st Qu.: 89.94
## Median : 5.000 Mode :character Median :123.11 Median :124.88
## Mean : 4.937 Mean :125.48 Mean :122.47
## 3rd Qu.: 7.000 3rd Qu.:153.15 3rd Qu.:155.82
## Max. :11.000 Max. :238.77 Max. :225.80
## NA's :9787 NA's :9787
## tfs_deprecated tfs_zulu_deprecated fielder_2 umpire
## Mode:logical Mode:logical Min. :455117 Mode:logical
## NA's:11829 NA's:11829 1st Qu.:592663 NA's:11829
## Median :641680
## Mean :620693
## 3rd Qu.:672515
## Max. :680777
##
## sv_id vx0 vy0 vz0
## Mode:logical Min. :-16.3130 Min. :-150.0 Min. :-15.431
## NA's:11829 1st Qu.: -0.1746 1st Qu.:-137.6 1st Qu.: -6.144
## Median : 4.6091 Median :-132.8 Median : -4.125
## Mean : 3.0365 Mean :-130.9 Mean : -4.093
## 3rd Qu.: 6.9592 3rd Qu.:-124.4 3rd Qu.: -2.074
## Max. : 16.4404 Max. :-104.0 Max. : 7.695
##
## ax ay az sz_top
## Min. :-25.899 Min. :17.17 Min. :-47.645 Min. :2.700
## 1st Qu.:-11.646 1st Qu.:25.35 1st Qu.:-30.604 1st Qu.:3.300
## Median : -3.157 Median :28.51 Median :-23.135 Median :3.430
## Mean : -2.141 Mean :28.48 Mean :-23.577 Mean :3.426
## 3rd Qu.: 6.631 3rd Qu.:31.45 3rd Qu.:-14.898 3rd Qu.:3.570
## Max. : 26.118 Max. :41.47 Max. : -3.574 Max. :4.120
##
## sz_bot hit_distance_sc launch_speed launch_angle
## Min. :1.110 Min. : 0.0 Min. : 3.50 Min. :-85.00
## 1st Qu.:1.540 1st Qu.: 17.0 1st Qu.: 73.10 1st Qu.: -6.00
## Median :1.620 Median :161.0 Median : 82.30 Median : 20.00
## Mean :1.618 Mean :153.3 Mean : 82.95 Mean : 17.04
## 3rd Qu.:1.700 3rd Qu.:239.0 3rd Qu.: 95.20 3rd Qu.: 42.00
## Max. :1.990 Max. :461.0 Max. :117.10 Max. : 88.00
## NA's :7958 NA's :7978 NA's :7974
## effective_speed release_spin_rate release_extension game_pk
## Min. : 0.00 Min. : 658 Min. :4.900 Min. :748534
## 1st Qu.: 85.60 1st Qu.:2186 1st Qu.:6.200 1st Qu.:748545
## Median : 91.50 Median :2358 Median :6.500 Median :748556
## Mean : 90.24 Mean :2328 Mean :6.505 Mean :748558
## 3rd Qu.: 95.00 3rd Qu.:2517 3rd Qu.:6.800 3rd Qu.:748569
## Max. :104.50 Max. :3504 Max. :8.300 Max. :748585
## NA's :1 NA's :1
## pitcher_1 fielder_2_1 fielder_3 fielder_4
## Min. :434378 Min. :455117 Min. :456781 Min. :514888
## 1st Qu.:571760 1st Qu.:592663 1st Qu.:547180 1st Qu.:543760
## Median :624133 Median :641680 Median :572233 Median :606466
## Mean :614818 Mean :620693 Mean :589998 Mean :602297
## 3rd Qu.:664353 3rd Qu.:672515 3rd Qu.:663993 3rd Qu.:666397
## Max. :700363 Max. :680777 Max. :666135 Max. :681082
##
## fielder_5 fielder_6 fielder_7 fielder_8
## Min. :446334 Min. :500743 Min. :444482 Min. :518792
## 1st Qu.:602104 1st Qu.:607208 1st Qu.:650559 1st Qu.:665506
## Median :663586 Median :621043 Median :666971 Median :671739
## Mean :617049 Mean :630840 Mean :644639 Mean :662034
## 3rd Qu.:670623 3rd Qu.:670764 3rd Qu.:670541 3rd Qu.:677950
## Max. :683002 Max. :691783 Max. :694497 Max. :686217
##
## fielder_9 release_pos_y estimated_ba_using_speedangle
## Min. :502054 Min. :52.22 Min. :0.001
## 1st Qu.:592206 1st Qu.:53.68 1st Qu.:0.077
## Median :663656 Median :53.98 Median :0.254
## Mean :633090 Mean :54.00 Mean :0.340
## 3rd Qu.:666969 3rd Qu.:54.30 3rd Qu.:0.571
## Max. :682998 Max. :55.55 Max. :0.997
## NA's :9789
## estimated_woba_using_speedangle woba_value woba_denom babip_value
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.00
## 1st Qu.:0.082 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.00
## Median :0.252 Median :0.000 Median :1.000 Median :0.00
## Mean :0.397 Mean :0.323 Mean :0.993 Mean :0.18
## 3rd Qu.:0.592 3rd Qu.:0.700 3rd Qu.:1.000 3rd Qu.:0.00
## Max. :1.997 Max. :2.000 Max. :1.000 Max. :1.00
## NA's :9789 NA's :8751 NA's :8754 NA's :8751
## iso_value launch_speed_angle at_bat_number pitch_number
## Min. :0.00 Min. :1.000 Min. : 1.00 Min. : 1.000
## 1st Qu.:0.00 1st Qu.:2.000 1st Qu.:19.00 1st Qu.: 1.000
## Median :0.00 Median :3.000 Median :38.00 Median : 3.000
## Mean :0.15 Mean :3.286 Mean :38.45 Mean : 2.887
## 3rd Qu.:0.00 3rd Qu.:4.000 3rd Qu.:57.00 3rd Qu.: 4.000
## Max. :3.00 Max. :6.000 Max. :92.00 Max. :15.000
## NA's :8751 NA's :9789
## pitch_name home_score away_score bat_score
## Length:11829 Min. : 0.000 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Mode :character Median : 1.000 Median : 2.000 Median : 1.000
## Mean : 1.645 Mean : 2.763 Mean : 2.191
## 3rd Qu.: 3.000 3rd Qu.: 4.000 3rd Qu.: 3.000
## Max. :10.000 Max. :11.000 Max. :11.000
##
## fld_score post_away_score post_home_score post_bat_score
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 1.000 Median : 2.000 Median : 1.000 Median : 2.000
## Mean : 2.217 Mean : 2.779 Mean : 1.656 Mean : 2.219
## 3rd Qu.: 3.000 3rd Qu.: 4.000 3rd Qu.: 3.000 3rd Qu.: 3.000
## Max. :11.000 Max. :11.000 Max. :10.000 Max. :11.000
##
## post_fld_score if_fielding_alignment of_fielding_alignment spin_axis
## Min. : 0.000 Length:11829 Length:11829 Min. : 4.0
## 1st Qu.: 0.000 Class :character Class :character 1st Qu.:130.0
## Median : 1.000 Mode :character Mode :character Median :204.0
## Mean : 2.217 Mean :175.9
## 3rd Qu.: 3.000 3rd Qu.:223.0
## Max. :11.000 Max. :359.0
##
## delta_home_win_exp delta_run_exp
## Min. :-0.5810000 Min. :-1.130000
## 1st Qu.: 0.0000000 1st Qu.:-0.065000
## Median : 0.0000000 Median :-0.017000
## Mean :-0.0004862 Mean :-0.001553
## 3rd Qu.: 0.0000000 3rd Qu.: 0.036000
## Max. : 0.5790000 Max. : 2.729000
## NA's :1
Number of observations
num_obs <- nrow(df)
cat("\nThe number of observations is:", num_obs, "\n")
##
## The number of observations is: 11829
Unique Athletes / Participants
unique_pitchers <- unique(df$pitcher)
unique_batters <- unique(df$batter)
cat("\nNumber of unique pitchers:", length(unique_pitchers), "\n")
##
## Number of unique pitchers: 136
cat("Number of unique batters:", length(unique_batters), "\n")
## Number of unique batters: 148
#Total Unique Athletes from all relevant columns
athlete_cols <- c("pitcher", "batter", "on_1b", "on_2b", "on_3b",
"fielder_2", "fielder_3", "fielder_4", "fielder_5",
"fielder_6", "fielder_7", "fielder_8", "fielder_9")
all_athletes <- unlist(df[, athlete_cols])
all_athletes <- unique(all_athletes[!is.na(all_athletes)])
cat("Total number of unique athletes across all relevant columns:", length(all_athletes), "\n")
## Total number of unique athletes across all relevant columns: 292
Unique Teams
teams <- unique(c(df$home_team, df$away_team))
cat("\nUnique teams in the dataset:\n")
##
## Unique teams in the dataset:
print(teams)
## [1] "AZ" "TEX" "PHI" "HOU" "MIN" "ATL" "LAD" "BAL" "MIL" "TB" "MIA" "TOR"
cat("Total number of unique teams:", length(teams), "\n")
## Total number of unique teams: 12
Removing All-NA Columns
cols_to_remove <- c("spin_dir", "spin_rate_deprecated", "break_angle_deprecated",
"break_length_deprecated", "tfs_deprecated", "tfs_zulu_deprecated",
"umpire", "sv_id")
df <- df %>% select(-any_of(cols_to_remove))
cat("Columns after removal:\n")
## Columns after removal:
print(names(df))
## [1] "pitch_type" "game_date"
## [3] "release_speed" "release_pos_x"
## [5] "release_pos_z" "player_name"
## [7] "batter" "pitcher"
## [9] "events" "description"
## [11] "zone" "des"
## [13] "game_type" "stand"
## [15] "p_throws" "home_team"
## [17] "away_team" "type"
## [19] "hit_location" "bb_type"
## [21] "balls" "strikes"
## [23] "game_year" "pfx_x"
## [25] "pfx_z" "plate_x"
## [27] "plate_z" "on_3b"
## [29] "on_2b" "on_1b"
## [31] "outs_when_up" "inning"
## [33] "inning_topbot" "hc_x"
## [35] "hc_y" "fielder_2"
## [37] "vx0" "vy0"
## [39] "vz0" "ax"
## [41] "ay" "az"
## [43] "sz_top" "sz_bot"
## [45] "hit_distance_sc" "launch_speed"
## [47] "launch_angle" "effective_speed"
## [49] "release_spin_rate" "release_extension"
## [51] "game_pk" "pitcher_1"
## [53] "fielder_2_1" "fielder_3"
## [55] "fielder_4" "fielder_5"
## [57] "fielder_6" "fielder_7"
## [59] "fielder_8" "fielder_9"
## [61] "release_pos_y" "estimated_ba_using_speedangle"
## [63] "estimated_woba_using_speedangle" "woba_value"
## [65] "woba_denom" "babip_value"
## [67] "iso_value" "launch_speed_angle"
## [69] "at_bat_number" "pitch_number"
## [71] "pitch_name" "home_score"
## [73] "away_score" "bat_score"
## [75] "fld_score" "post_away_score"
## [77] "post_home_score" "post_bat_score"
## [79] "post_fld_score" "if_fielding_alignment"
## [81] "of_fielding_alignment" "spin_axis"
## [83] "delta_home_win_exp" "delta_run_exp"
These columns were removed because every value in them is NA.
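The list of empty columns was typed out by hand. As a hedged alternative, the sketch below finds columns that contain only NA programmatically. The toy data frame and its values are illustrative, and only base R is used.

```r
# Toy data frame: two columns are entirely NA, one is partly NA
toy <- data.frame(
  release_speed = c(91.5, 76.2, 87.3),
  spin_dir      = c(NA, NA, NA),
  umpire        = c(NA, NA, NA),
  launch_speed  = c(NA, 90.7, NA)
)

# A column is "all NA" when it has zero non-missing values
all_na_cols <- names(toy)[colSums(!is.na(toy)) == 0]
print(all_na_cols)

# Drop only those columns; partly missing columns are kept
toy_clean <- toy[, !(names(toy) %in% all_na_cols), drop = FALSE]
print(names(toy_clean))
```

Applied to the full dataset, the same check would be expected to flag the eight columns listed in cols_to_remove, since summary(df) reports 11829 NA values for each of them.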
Converting game_date to a Date class
df <- df %>%
mutate(game_date = as.Date(game_date, format = "%Y-%m-%d"))
This converts game_date from a character column to the Date class so that it can be sorted and filtered chronologically.
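A short sketch of what the conversion buys, using base R only. The three date strings are illustrative (only 2023-11-01 appears in the preview), and the last line shows that a string in a different layout silently becomes NA, which is worth checking after any conversion.

```r
# Illustrative date strings in the same layout as game_date
game_date_chr <- c("2023-10-03", "2023-10-27", "2023-11-01")

parsed <- as.Date(game_date_chr, format = "%Y-%m-%d")
print(class(parsed))   # Date objects support range() and arithmetic
print(range(parsed))

# Number of days between the earliest and latest date
span_days <- as.numeric(diff(range(parsed)))
print(span_days)

# A string that does not match the format is returned as NA, not an error
bad <- as.Date("11/01/2023", format = "%Y-%m-%d")
print(is.na(bad))
```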
Converting numeric columns that are being read as characters or factors
num_cols <- c("release_speed", "release_pos_x", "release_pos_z", "pfx_x", "pfx_z",
"plate_x", "plate_z", "vx0", "vy0", "vz0", "ax", "ay", "az", "sz_top", "sz_bot",
"launch_speed", "launch_angle", "release_extension")
df <- df %>%
mutate(across(all_of(num_cols), as.numeric))
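The str(df) output shows these columns already stored as num or int, so the conversion is a safeguard rather than a fix. A minimal sketch, on a toy data frame with one deliberately mistyped column, of how to check which columns actually need converting and how many values are lost in the process:

```r
# Toy data: release_speed arrives as text with one unparseable entry
toy <- data.frame(
  release_speed = c("91.5", "76.2", "n/a"),
  pfx_x         = c(0.25, 0.99, -0.86),
  stringsAsFactors = FALSE
)
num_cols <- c("release_speed", "pfx_x")

# Which columns are numeric before the conversion?
before <- sapply(toy[num_cols], is.numeric)
print(before)

# Convert; unparseable strings become NA (the warning is suppressed here on purpose)
toy[num_cols] <- lapply(toy[num_cols], function(x) suppressWarnings(as.numeric(x)))

after <- sapply(toy[num_cols], is.numeric)
new_na <- sum(is.na(toy$release_speed))
print(after)
print(new_na)
```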
Removing any duplicate rows
num_duplicates <- sum(duplicated(df))
cat("Number of duplicate rows:", num_duplicates, "\n")
## Number of duplicate rows: 0
if(num_duplicates > 0) {
df <- df %>% distinct()
cat("Duplicate rows have been removed. New number of observations:", nrow(df), "\n")
}
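duplicated(df) only flags rows that match in every column. A stricter check, sketched below, tests whether the pitch identifier repeats. It assumes that game_pk, at_bat_number, and pitch_number together identify one pitch, which the source does not state; the toy rows are illustrative.

```r
# Toy rows: the last two share the same game, at-bat, and pitch number
toy <- data.frame(
  game_pk       = c(748534, 748534, 748534, 748534),
  at_bat_number = c(46, 46, 45, 45),
  pitch_number  = c(4, 3, 3, 3),
  release_speed = c(91.5, 76.2, 85.6, 85.7)
)

# Assumed natural key for a single pitch
key_cols <- c("game_pk", "at_bat_number", "pitch_number")

# Full-row duplicates versus key duplicates
n_full_dups <- sum(duplicated(toy))
n_key_dups  <- sum(duplicated(toy[, key_cols]))
print(c(full = n_full_dups, key = n_key_dups))
```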
Creating New Variables and Reformatting the Data
df <- df %>%
mutate(game_type = recode(game_type,
"E" = "Exhibition",
"S" = "Spring Training",
"R" = "Regular Season",
"F" = "Wild Card",
"D" = "Divisional Series",
"L" = "League Championship Series",
"W" = "World Series"))
The game_type codes were recoded to descriptive labels so that each postseason round is easy to identify.
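The same mapping can be expressed as a named lookup vector, which keeps the codes and labels in one reusable object. This is a base R sketch of the idea, not a replacement for the recode() call; the codes vector is illustrative.

```r
# Lookup table: names are Statcast codes, values are the labels used above
game_type_labels <- c(
  F = "Wild Card",
  D = "Divisional Series",
  L = "League Championship Series",
  W = "World Series"
)

# Illustrative codes as they would appear in the game_type column
codes <- c("W", "W", "L", "D", "F")

# Indexing by name translates every code in one step
labels <- unname(game_type_labels[codes])
print(labels)
```

A code that is missing from the lookup returns NA, which makes unexpected values easy to spot.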
df <- df %>%
mutate(launch_speed_angle = case_when(
launch_speed_angle == 1 ~ "Weak",
launch_speed_angle == 2 ~ "Topped",
launch_speed_angle == 3 ~ "Under",
launch_speed_angle == 4 ~ "Flare/Burner",
launch_speed_angle == 5 ~ "Solid Contact",
launch_speed_angle == 6 ~ "Barrel",
TRUE ~ as.character(launch_speed_angle)
))
The numeric launch_speed_angle codes were replaced with the names of the contact-quality categories they represent.
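Plain character labels sort alphabetically in tables and plots. One hedged alternative is to store the labels as a factor whose levels follow the original 1 to 6 coding, as sketched here on an illustrative vector of codes:

```r
# Labels in the order of the original numeric codes 1 to 6
contact_labels <- c("Weak", "Topped", "Under", "Flare/Burner", "Solid Contact", "Barrel")

# Illustrative codes, including a pitch with no batted ball (NA)
codes <- c(2, NA, 6, 4, 1)

contact <- factor(codes, levels = 1:6, labels = contact_labels)
print(as.character(contact))

# Levels keep the coding order, so tables and plots are not alphabetical
print(levels(contact))
```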
df <- df %>%
# Calculate pre-pitch total score and post-pitch total score
mutate(
total_score_pre = home_score + away_score,
total_score_post = post_home_score + post_away_score,
score_change = total_score_post - total_score_pre
)
Added three new columns:
total_score_pre:
The sum of the home_score and away_score columns, which hold both teams’ scores before the pitch.
total_score_post:
The sum of the post_home_score and post_away_score columns, which hold the scores immediately after the pitch.
score_change:
The difference between total_score_post and total_score_pre. For pitches that do not produce any runs the change is 0; when runs score, this column gives the number of runs added on the play.
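A worked example of the three derived columns on three made-up pitches, written in base R so that each step is visible:

```r
# Made-up scores before and after three pitches
toy <- data.frame(
  home_score      = c(0, 0, 2),
  away_score      = c(0, 1, 3),
  post_home_score = c(0, 2, 2),
  post_away_score = c(0, 1, 4)
)

toy$total_score_pre  <- toy$home_score + toy$away_score
toy$total_score_post <- toy$post_home_score + toy$post_away_score
toy$score_change     <- toy$total_score_post - toy$total_score_pre
print(toy$score_change)

# Number of pitches on which at least one run scored
n_scoring <- sum(toy$score_change > 0)
print(n_scoring)
```

The first pitch changes nothing, the second brings in two home runs scored (0 to 2), and the third adds one run for the away team.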
pitch_type
Definition: The type of pitch (e.g., fastball, curveball, slider) as derived from Statcast.
CH (Changeup), CU (Curveball), FC (Cutter), FF (Four-Seam Fastball), FS (Split-Finger Fastball), KC (Knuckle Curve), PO (Pitchout), SI (Sinker), SL (Slider), ST (Sweeper), SV (Slurve)
# Pitch type
# Count pitches by type and compute each type's share of all pitches
pitch_type_summary <- df %>%
count(pitch_type) %>%
mutate(percentage = n / sum(n) * 100)
print(pitch_type_summary)
## pitch_type n percentage
## 1 CH 1114 9.4175332
## 2 CU 1088 9.1977344
## 3 FC 754 6.3741652
## 4 FF 4343 36.7148533
## 5 FS 373 3.1532674
## 6 KC 337 2.8489306
## 7 PO 1 0.0084538
## 8 SI 1645 13.9065010
## 9 SL 1553 13.1287514
## 10 ST 600 5.0722800
## 11 SV 21 0.1775298
unique(pitch_type_summary$pitch_type)
## [1] "CH" "CU" "FC" "FF" "FS" "KC" "PO" "SI" "SL" "ST" "SV"
# Bar Chart
ggplot(pitch_type_summary, aes(x = pitch_type, y = n, fill = pitch_type)) +
geom_bar(stat = "identity") +
labs(title = "Frequency of Pitch Types", x = "Pitch Type", y = "Count") +
theme_minimal() +
theme(legend.position = "none")
Four-seam fastballs (FF) are by far the most common pitch, at 36.7% of all pitches, followed by sinkers (SI, 13.9%) and sliders (SL, 13.1%).
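The eleven codes can also be rolled up into broader families. The sketch below reuses the counts printed above; the family assignment (fastball, breaking, offspeed) is a common convention chosen for illustration and is an assumption, not something defined in the dataset.

```r
# Counts copied from the pitch_type_summary output
counts <- c(CH = 1114, CU = 1088, FC = 754, FF = 4343, FS = 373, KC = 337,
            PO = 1, SI = 1645, SL = 1553, ST = 600, SV = 21)

# Assumed grouping of pitch codes into families
family <- c(CH = "Offspeed", CU = "Breaking", FC = "Fastball", FF = "Fastball",
            FS = "Offspeed", KC = "Breaking", PO = "Other",    SI = "Fastball",
            SL = "Breaking", ST = "Breaking", SV = "Breaking")

# Sum the counts within each family and express them as percentages
family_counts <- sapply(split(counts, family[names(counts)]), sum)
family_share  <- round(100 * family_counts / sum(counts), 1)
print(family_counts)
print(family_share)
```

Under this grouping, fastballs account for a little over half of all pitches.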
release_speed
Definition: The velocity of the pitch (in mph).
# Summary
summary(df$release_speed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 71.40 85.50 91.40 90.05 94.70 103.70
# Histogram
ggplot(df, aes(x = release_speed)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
labs(title = "Distribution of Release Speeds", x = "Release Speed (mph)", y = "Frequency") +
theme_minimal()
# Boxplot to inspect outliers
ggplot(df, aes(x = "", y = release_speed)) +
geom_boxplot(fill = "lightgreen", color = "darkgreen") +
labs(title = "Boxplot of Release Speed", y = "Release Speed (mph)") +
theme_minimal()
Release speeds range from 71.4 mph to 103.7 mph. The median is 91.4 mph and half of all pitches fall between 85.5 and 94.7 mph, so most pitches sit in the upper part of the range.
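The boxplot marks outliers with the usual 1.5 × IQR rule. As a worked example, the fences can be computed by hand from the quartiles printed by summary(df$release_speed); the inputs are the rounded values shown above, so the result is approximate.

```r
# Quartiles and extremes copied from the summary() output
q1 <- 85.5
q3 <- 94.7
min_speed <- 71.4
max_speed <- 103.7

iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
print(c(lower = lower_fence, upper = upper_fence))

# Does the data extend beyond either fence?
has_low_outliers  <- min_speed < lower_fence
has_high_outliers <- max_speed > upper_fence
print(c(low = has_low_outliers, high = has_high_outliers))
```

With these inputs only the slow end crosses a fence: the slowest pitch (71.4 mph) lies just below the lower fence, while the fastest (103.7 mph) is well inside the upper one.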
release_pos_x, release_pos_z, release_pos_y
Definition:
release_pos_x: Horizontal release position (feet from catcher’s perspective).
release_pos_z: Vertical release position (feet from catcher’s perspective).
release_pos_y: Release position along the line from home plate toward the pitcher, in feet from the catcher’s perspective (about 54 ft in this dataset).
# Summaries for positions
summary(df$release_pos_x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.910 -2.190 -1.620 -1.051 -0.410 4.500
summary(df$release_pos_z)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.190 5.450 5.770 5.765 6.100 7.270
summary(df$release_pos_y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 52.22 53.68 53.98 54.00 54.30 55.55
# Scatterplot: Horizontal vs. Vertical release positions
ggplot(df, aes(x = release_pos_x, y = release_pos_z)) +
geom_point(alpha = 0.3, color = "purple") +
labs(title = "Release Position: Horizontal vs Vertical", x = "Horizontal Position (ft)", y = "Vertical Position (ft)") +
theme_minimal()
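Because release_pos_x is measured from the catcher’s perspective, right-handed pitchers release the ball at negative x values. Most release points are negative (median -1.62 ft), which suggests that right-handed pitchers threw most of the pitches in this dataset.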
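Handedness does not have to be inferred from the scatterplot, because the p_throws column records it directly. The sketch below counts pitches and distinct pitchers by throwing hand on a toy data frame; the pitcher IDs 111 and 222 and their values are hypothetical.

```r
# Toy pitches: one real ID from the preview plus two hypothetical pitchers
toy <- data.frame(
  pitcher       = c(543135, 543135, 111, 222, 222),
  p_throws      = c("R", "R", "L", "R", "R"),
  release_pos_x = c(-2.60, -2.27, 2.10, -1.50, -1.60)
)

# Pitches thrown by each hand
pitches_by_hand <- table(toy$p_throws)
print(pitches_by_hand)

# Distinct pitchers by hand: keep one row per pitcher first
pitchers <- unique(toy[, c("pitcher", "p_throws")])
pitchers_by_hand <- table(pitchers$p_throws)
print(pitchers_by_hand)

# Median horizontal release point per hand, to confirm the sign convention
print(tapply(toy$release_pos_x, toy$p_throws, median))
```

Counting pitches and counting pitchers answer different questions, so it helps to report both.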
vx0, vy0, vz0, ax, ay, az
Definition:
vx0, vy0, vz0: Components of the pitch velocity in feet per second in the x, y, z dimensions at a fixed point (y = 50 feet).
ax, ay, az: The corresponding acceleration components in feet per second squared.
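To connect these components to the more familiar mph figure, the three velocity components can be combined into a single speed and converted from feet per second. The sketch uses the first row of the preview; the variable names are illustrative.

```r
# Velocity components (ft/s) from the first pitch in the preview
vx0 <- 4.866354
vy0 <- -133.2660
vz0 <- -4.6442287

# Speed is the length of the velocity vector
speed_fps <- sqrt(vx0^2 + vy0^2 + vz0^2)

# Convert ft/s to mph: 3600 seconds per hour, 5280 feet per mile
speed_mph <- speed_fps * 3600 / 5280
print(round(speed_mph, 1))
```

The result is close to, but slightly below, the 91.5 mph release_speed recorded for that pitch, which is consistent with the components being measured at y = 50 ft rather than at the release point.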
# Scatter plot: vx0 vs. vy0
ggplot(df, aes(x = vx0, y = vy0)) +
geom_point(alpha = 0.3, color = "orange") +
labs(title = "Scatter Plot: vx0 vs. vy0", x = "vx0 (fps)", y = "vy0 (fps)") +
theme_minimal()
# Histograms for ax, ay, and az
p1 <- ggplot(df, aes(x = ax)) +
geom_histogram(bins = 30, fill = "lightblue", color = "black") +
labs(title = "Distribution of ax", x = "ax (fps²)", y = "Frequency") +
theme_minimal()
p2 <- ggplot(df, aes(x = ay)) +
geom_histogram(bins = 30, fill = "lightgreen", color = "black") +
labs(title = "Distribution of ay", x = "ay (fps²)", y = "Frequency") +
theme_minimal()
p3 <- ggplot(df, aes(x = az)) +
geom_histogram(bins = 30, fill = "salmon", color = "black") +
labs(title = "Distribution of az", x = "az (fps²)", y = "Frequency") +
theme_minimal()
p1
p2
p3
# ax vs. ay
scatter_ax_ay <- ggplot(df, aes(x = ax, y = ay)) +
geom_point(alpha = 0.3, color = "darkblue") +
labs(title = "Scatter Plot: ax vs. ay",
x = "ax (fps²)", y = "ay (fps²)") +
theme_minimal()
# ax vs. az
scatter_ax_az <- ggplot(df, aes(x = ax, y = az)) +
geom_point(alpha = 0.3, color = "darkred") +
labs(title = "Scatter Plot: ax vs. az",
x = "ax (fps²)", y = "az (fps²)") +
theme_minimal()
# ay vs. az
scatter_ay_az <- ggplot(df, aes(x = ay, y = az)) +
geom_point(alpha = 0.3, color = "darkgreen") +
labs(title = "Scatter Plot: ay vs. az",
x = "ay (fps²)", y = "az (fps²)") +
theme_minimal()
# Print scatter plots
scatter_ax_ay
scatter_ax_az
scatter_ay_az
pfx_x, pfx_z
Definition: Horizontal and vertical movement (in feet) after the release, as the ball moves toward the plate (from the catcher’s perspective).
# Summary statistics
summary(df$pfx_x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.98000 -0.84000 -0.23000 -0.09191 0.63000 2.19000
summary(df$pfx_z)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.6500 0.1000 0.7100 0.6041 1.2900 1.9500
# Histograms
ggplot(df, aes(x = pfx_x)) +
geom_histogram(bins = 30, fill = "lightblue", color = "black") +
labs(title = "Distribution of pfx_x", x = "pfx_x (ft)", y = "Frequency") +
theme_minimal()
ggplot(df, aes(x = pfx_z)) +
geom_histogram(bins = 30, fill = "salmon", color = "black") +
labs(title = "Distribution of pfx_z", x = "pfx_z (ft)", y = "Frequency") +
theme_minimal()
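The distribution of pfx_x leans negative (median -0.23 ft), which is consistent with the right-handed majority noted above, although the direction of movement also depends on pitch type.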
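Movement is stored in feet, while pitch movement is often quoted in inches. A small conversion sketch using the first three rows of the preview; the column names with the _in suffix are illustrative.

```r
# pfx values (ft) for the first three pitches in the preview
toy <- data.frame(
  pitch_type = c("FC", "CU", "FS"),
  pfx_x = c(0.25, 0.99, -0.86),
  pfx_z = c(0.74, -0.37, 0.11)
)

# 12 inches per foot
toy$pfx_x_in <- toy$pfx_x * 12
toy$pfx_z_in <- toy$pfx_z * 12

# Total movement as the length of the (x, z) movement vector, in inches
toy$movement_in <- sqrt(toy$pfx_x_in^2 + toy$pfx_z_in^2)
print(toy)
```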
plate_x, plate_z
Definition: The position of the ball as it crosses home plate from the catcher’s perspective (horizontal and vertical positions).
plate_x: This variable represents the horizontal position of the baseball as it crosses home plate. It is the left-right location relative to the center of the plate, measured in feet from the catcher’s perspective. In this case:
A value of 0 represents the center of the plate.
Negative values indicate a pitch toward the catcher’s left; positive values indicate a pitch toward the catcher’s right.
plate_z: This variable represents the vertical position of the baseball as it crosses home plate, measured in feet above the ground. This means:
Lower values indicate a pitch that is low.
Higher values indicate a pitch that is high, toward or above the top of the strike zone.
# Scatter plot of plate locations
ggplot(df, aes(x = plate_x, y = plate_z)) +
geom_point(alpha = 0.3, color = "blue") +
labs(title = "Pitch Location at Plate", x = "Plate X (ft)", y = "Plate Z (ft)") +
theme_minimal()
# Adding density contours
ggplot(df, aes(x = plate_x, y = plate_z)) +
geom_point(alpha = 0.3, color = "blue") +
geom_density_2d(color = "red") +
labs(title = "Pitch Location with Density Contours", x = "Plate X (ft)", y = "Plate Z (ft)") +
theme_minimal()
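To relate these coordinates to the strike zone, a pitch can be flagged as inside or outside a rectangular zone. This is a minimal sketch: the horizontal bound follows from the 17-inch width of home plate, while the vertical bounds (`zone_bottom_ft`, `zone_top_ft`) are assumed fixed values chosen for illustration, since the real zone varies by batter. The ball's radius is ignored, and the rows in `toy` are made up.

```r
library(dplyr)

half_plate_ft  <- 17 / 2 / 12  # home plate is 17 inches wide
zone_bottom_ft <- 1.5          # assumed lower bound, illustrative only
zone_top_ft    <- 3.5          # assumed upper bound, illustrative only

# Toy rows for illustration only; not values from the dataset.
toy <- data.frame(
  plate_x = c(0.0, 0.5, -1.2, 0.3),
  plate_z = c(2.5, 3.0,  2.0, 4.1)
)

# Flag pitches whose centre crosses the plate inside the assumed zone.
toy <- toy %>%
  mutate(in_zone = abs(plate_x) <= half_plate_ft &
                   plate_z >= zone_bottom_ft &
                   plate_z <= zone_top_ft)

print(toy)
```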
hit_distance, launch_speed, launch_angle, effective_speed
Definition:
hit_distance: Projected distance the ball travels after contact.
launch_speed: Exit velocity of the batted ball (as tracked by Statcast or estimates).
launch_angle: The angle at which the ball leaves the bat.
effective_speed: Derived speed that accounts for the pitcher’s release extension.
# Scatter plot: Launch Speed vs. Launch Angle
ggplot(df, aes(x = launch_speed, y = launch_angle)) +
geom_point(alpha = 0.3, color = "brown") +
labs(title = "Launch Speed vs. Launch Angle", x = "Launch Speed (mph)", y = "Launch Angle (°)") +
theme_minimal()
## Warning: Removed 7978 rows containing missing values or values outside the scale range
## (`geom_point()`).
At first glance, it appears that balls hit at launch angles closer to 0 degrees tend to have higher launch speeds.
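The visual impression can be checked by binning launch angle and comparing mean launch speed per bin. The bin edges below are arbitrary choices for illustration, and `toy` holds made-up values rather than rows from the dataset.

```r
library(dplyr)

# Toy rows for illustration only; not values from the dataset.
toy <- data.frame(
  launch_speed = c(101, 98, 85, 72, 95, NA),
  launch_angle = c(2, -4, 35, 60, 12, NA)
)

# Drop pitches without a batted ball, bin the angle, and average the speed.
angle_bins <- toy %>%
  filter(!is.na(launch_speed), !is.na(launch_angle)) %>%
  mutate(angle_bin = cut(launch_angle, breaks = c(-90, -10, 10, 30, 90))) %>%
  group_by(angle_bin) %>%
  summarise(mean_launch_speed = mean(launch_speed),
            n = n(),
            .groups = "drop")

print(angle_bins)
```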
release_spin, release_extension
Definition:
release_spin: Spin rate of the pitch (in RPM) as tracked by Statcast.
release_extension: The pitcher’s release extension (in feet), which can affect perceived velocity.
# Boxplot by pitch type for release spin
ggplot(df, aes(x = pitch_type, y = release_spin)) +
geom_boxplot(fill = "plum", color = "darkmagenta") +
labs(title = "Release Spin by Pitch Type", x = "Pitch Type", y = "Release Spin (RPM)") +
theme_minimal()
At first glance, curveballs appear to have a higher spin rate than the other pitch types, but they also show the most variability.
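The boxplot reading can be backed by a numeric summary: the median gives the typical spin rate and the interquartile range gives the spread. The sketch below uses made-up rows in `toy`; swapping in `df` would give the real figures.

```r
library(dplyr)

# Toy rows for illustration only; not values from the dataset.
toy <- data.frame(
  pitch_type   = c("CU", "CU", "CU", "FF", "FF", "FF"),
  release_spin = c(2500, 2900, 2700, 2300, 2350, 2250)
)

# Median and interquartile range of spin rate per pitch type.
spin_summary <- toy %>%
  filter(!is.na(release_spin)) %>%
  group_by(pitch_type) %>%
  summarise(median_spin = median(release_spin),
            iqr_spin    = IQR(release_spin),
            n           = n(),
            .groups = "drop") %>%
  arrange(desc(median_spin))

print(spin_summary)
```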
launch_speed_angle
Definition: A categorical metric based on launch speed and launch angle classified into:
1: Weak
2: Topped
3: Under
4: Flare/Burner
5: Solid Contact
6: Barrel
# Remove rows with NA in launch_speed_angle, then count and calculate percentages
launch_speed_angle_summary <- df %>%
filter(!is.na(launch_speed_angle)) %>%
count(launch_speed_angle) %>%
mutate(percentage = n / sum(n) * 100)
# Print the summary table
print(launch_speed_angle_summary)
## launch_speed_angle n percentage
## 1 Barrel 220 10.784314
## 2 Flare/Burner 486 23.823529
## 3 Solid Contact 126 6.176471
## 4 Topped 614 30.098039
## 5 Under 494 24.215686
## 6 Weak 100 4.901961
# Visualization: Bar Chart
ggplot(launch_speed_angle_summary, aes(x = launch_speed_angle, y = n, fill = launch_speed_angle)) +
geom_bar(stat = "identity") +
labs(title = "Distribution of Launch Speed/Angle Zones",
x = "Zone", y = "Count") +
theme_minimal() +
theme(legend.position = "none")
At first glance, Topped (30.1%), Under (24.2%), and Flare/Burner (23.8%) are the most common contact categories among batted balls in the dataset.
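The table and bar chart list the categories alphabetically. To show them in the order of the definition (weakest to strongest contact), the column can be converted to a factor with explicit levels. The sketch below rebuilds the summary from the counts printed above so it runs on its own; `zone_levels` and `ordered_summary` are names introduced here for illustration.

```r
library(dplyr)

# Category order taken from the definition above.
zone_levels <- c("Weak", "Topped", "Under", "Flare/Burner",
                 "Solid Contact", "Barrel")

# Counts copied from the printed summary table.
launch_speed_angle_summary <- data.frame(
  launch_speed_angle = c("Barrel", "Flare/Burner", "Solid Contact",
                         "Topped", "Under", "Weak"),
  n = c(220, 486, 126, 614, 494, 100)
)

# Reorder the categories and recompute the percentages.
ordered_summary <- launch_speed_angle_summary %>%
  mutate(launch_speed_angle = factor(launch_speed_angle, levels = zone_levels),
         percentage = n / sum(n) * 100) %>%
  arrange(launch_speed_angle)

print(ordered_summary)
```

Passing `ordered_summary` to the same `ggplot()` call would draw the bars in this order.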
game_date, game_year, game_type, game_pk
Definition:
game_date: The date on which the game was played.
game_year: The year the game took place.
game_type: Game classification (e.g., Regular Season, Spring Training, Wild Card, Divisional Series, etc.).
game_pk: Unique game identifier.
# Descriptive statistics over time
game_summary <- df %>%
count(game_year, game_type)
print(game_summary)
## game_year game_type n
## 1 2023 Divisional Series 4116
## 2 2023 League Championship Series 3942
## 3 2023 Wild Card 2281
## 4 2023 World Series 1490
# Visualization: Count of pitches by game type (each row of df is one pitch)
ggplot(df, aes(x = game_type)) +
geom_bar(fill = "coral", color = "black") +
labs(title = "Pitch Count by Game Type", x = "Game Type", y = "Number of Pitches") +
theme_minimal()
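Because each row of `df` is a pitch, `geom_bar()` and `count()` tally pitches rather than games. To count games, the distinct `game_pk` values can be counted within each game type. A minimal sketch on made-up rows (`toy`):

```r
library(dplyr)

# Toy rows for illustration only; game_pk values are made up.
toy <- data.frame(
  game_type = c("Wild Card", "Wild Card", "Wild Card",
                "World Series", "World Series"),
  game_pk   = c(1, 1, 2, 3, 3)
)

# Number of distinct games and number of pitches per game type.
games_by_type <- toy %>%
  group_by(game_type) %>%
  summarise(n_games   = n_distinct(game_pk),
            n_pitches = n(),
            .groups = "drop")

print(games_by_type)
```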
player_name, batter, pitcher, pitch_name
Definition:
player_name: The name of the player associated with the play event (may be used for context).
batter: The MLB Player Id for the batter.
pitcher: The MLB Player Id for the pitcher.
pitch_name: The pitch’s name as derived from the tracking data.
# Unique pitchers and batters
num_unique_pitchers <- df %>% distinct(pitcher) %>% nrow()
num_unique_batters <- df %>% distinct(batter) %>% nrow()
cat("Unique Pitchers:", num_unique_pitchers, "\nUnique Batters:", num_unique_batters, "\n")
## Unique Pitchers: 136
## Unique Batters: 148
# Top 10 most frequently occurring pitches
top_pitches <- df %>%
count(pitch_name, sort = TRUE) %>%
head(10)
print(top_pitches)
## pitch_name n
## 1 4-Seam Fastball 4343
## 2 Sinker 1645
## 3 Slider 1553
## 4 Changeup 1114
## 5 Curveball 1088
## 6 Cutter 754
## 7 Sweeper 600
## 8 Split-Finger 373
## 9 Knuckle Curve 337
## 10 Slurve 21
# Bar plot for top pitch names
ggplot(top_pitches, aes(x = reorder(pitch_name, n), y = n)) +
geom_bar(stat = "identity", fill = "lightgreen", color = "darkgreen") +
coord_flip() +
labs(title = "Top 10 Most Frequent Pitch Names", x = "Pitch Name", y = "Frequency") +
theme_minimal()
events, description, des
Definition:
events: The play event outcome (e.g., single, strikeout, home run).
description: A detailed description of the result of the pitch.
des: The plate appearance description from gameday.
# Remove blank
event_summary <- df %>%
filter(!is.na(events), events != "") %>%
count(events, sort = TRUE)
# View
print(event_summary)
## events n
## 1 field_out 1212
## 2 strikeout 732
## 3 single 429
## 4 walk 266
## 5 double 115
## 6 home_run 110
## 7 grounded_into_double_play 56
## 8 force_out 45
## 9 hit_by_pitch 28
## 10 field_error 15
## 11 sac_bunt 15
## 12 sac_fly 15
## 13 triple 9
## 14 double_play 8
## 15 fielders_choice_out 8
## 16 fielders_choice 6
## 17 caught_stealing_2b 3
## 18 catcher_interf 2
## 19 other_out 2
## 20 strikeout_double_play 2
# Plot
ggplot(event_summary, aes(x = reorder(events, n), y = n)) +
geom_bar(stat = "identity", fill = "#4E79A7") +
coord_flip() +
labs(title = "Frequency of Events (Excluding NAs and Blank Entries)",
x = "Event",
y = "Count") +
theme_minimal()
At first glance, field_out is the most common event in the dataset (1,212 occurrences), with strikeout second (732).
stand, p_throws
Definition:
stand: Indicates the batter’s stance (left/right).
p_throws: Indicates the pitcher’s throwing hand (left/right).
What to Explore:
Count frequencies
Compare performance metrics across handedness
# Frequency counts
batter_stand <- df %>% count(stand)
pitcher_hand <- df %>% count(p_throws)
print(batter_stand)
## stand n
## 1 L 5095
## 2 R 6734
print(pitcher_hand)
## p_throws n
## 1 L 2882
## 2 R 8947
# Visualization: Side-by-side bar plots
ggplot(df, aes(x = stand, fill = p_throws)) +
geom_bar(position = "dodge") +
labs(title = "Batter Stance by Pitcher Handedness", x = "Batter Stance", y = "Count") +
theme_minimal()
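As mentioned above, right-handers dominate: right-handed pitchers threw 8,947 of the 11,829 pitches (about 76%), and 6,734 pitches (about 57%) were thrown to right-handed batters.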
home_team, away_team, score variables
Definition:
home_team & away_team: Abbreviations representing the teams playing.
home_score & away_score: Pre-pitch scores for the home and away teams, respectively.
bat_score & fld_score: The score of the batting and fielding teams before the pitch (useful for identifying which team is at bat).
post_home_score, post_away_score, post_bat_score: The scores immediately after the pitch event.
# Visualization: Histogram of score change
ggplot(df, aes(x = score_change)) +
geom_histogram(bins = 20, fill = "steelblue", color = "black") +
labs(title = "Histogram of Score Change per Pitch", x = "Score Change", y = "Frequency") +
theme_minimal()
At first glance, it appears that most individual pitches have 0 impact on the score.
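The `score_change` column is not one of the score variables listed in the definitions. If it is not already present in `df`, one plausible derivation, assumed here and not confirmed by the source, is the batting team's score after the pitch minus its score before the pitch. The rows in `toy` are made up.

```r
library(dplyr)

# Toy rows for illustration only; not values from the dataset.
toy <- data.frame(
  bat_score      = c(0, 0, 2, 3),
  post_bat_score = c(0, 1, 2, 5)
)

# Assumed definition: runs scored by the batting team on this pitch.
toy <- toy %>%
  mutate(score_change = post_bat_score - bat_score)

# How often each score change occurs.
print(table(toy$score_change))
```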
inning, inning_topbot, outs_when_up, at_bat_number, pitch_number
Definition:
inning: The inning number for the pitch event.
inning_topbot: Indicates whether it is the top or bottom of the inning.
outs_when_up: The number of outs in the inning prior to the pitch.
at_bat_number: The plate appearance number within the game.
pitch_number: The pitch count within the plate appearance.
# Visualizing inning distribution
ggplot(df, aes(x = factor(inning))) +
geom_bar(fill = "orchid", color = "black") +
labs(title = "Distribution of Innings", x = "Inning", y = "Count") +
theme_minimal()
At first glance, very few pitches were thrown in the 10th or 11th inning, which suggests extra-inning games were rare.
HC Coordinates (hc_x, hc_y)
Definition: The coordinates of the ball when hit, representing the location where the batted ball landed.
# Scatter plot of hit coordinates
ggplot(df, aes(x = hc_x, y = hc_y)) +
geom_point(alpha = 0.3, color = "darkred") +
labs(title = "Hit Coordinates (hc_x vs. hc_y)", x = "hc_x", y = "hc_y") +
theme_minimal()
## Warning: Removed 9787 rows containing missing values or values outside the scale range
## (`geom_point()`).
At first glance, it appears that most batted balls land close to the innermost ring of points in the plot, likely the infield.
Advanced Performance Metrics
estimated_ba_using_speedangle: Estimated batting average based on launch speed and angle.
estimated_woba_using_speedangle: Estimated wOBA (weighted on-base average) using launch speed and angle.
woba_value, woba_denom: The wOBA value and its corresponding denominator.
babip_value: Batting Average on Balls In Play.
iso_value: Isolated Power – extra-base power indicator.
delta_home_win_exp, delta_run_exp: The change in home-team win expectancy and in run expectancy, respectively, from before the pitch to after it.
# Boxplot: wOBA value by pitch type
ggplot(df, aes(x = pitch_type, y = woba_value)) +
geom_boxplot(fill = "lightcyan", color = "blue") +
labs(title = "wOBA by Pitch Type", x = "Pitch Type", y = "wOBA Value") +
theme_minimal()
## Warning: Removed 8751 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
At first glance, the wOBA appears to be similar for most pitches.
fielder_2 to fielder_9
Definition: These fields store the MLB Player Ids for the players involved in fielding:
fielder_2: Typically the catcher
fielder_3: First base
fielder_4: Second base
fielder_5: Third base
fielder_6: Shortstop
fielder_7: Left field
fielder_8: Center field
fielder_9: Right field
# Frequency for fielder 2 (Catcher)
fielder2_summary <- df %>%
count(fielder_2, sort = TRUE)
print(fielder2_summary)
## fielder_2 n
## 1 641680 2390
## 2 672515 2237
## 3 592663 1758
## 4 455117 1452
## 5 680777 839
## 6 668939 498
## 7 669257 475
## 8 518595 345
## 9 661388 332
## 10 650907 283
## 11 669221 264
## 12 672386 261
## 13 673237 164
## 14 645444 160
## 15 663743 142
## 16 607732 140
## 17 595978 36
## 18 542194 32
## 19 596117 21
if_fielding_alignment, of_fielding_alignment
Definition: These variables indicate the infield and outfield defensive alignments at the time of the pitch.
# Frequency of different alignments
table(df$if_fielding_alignment)
##
## Infield shade Standard Strategic
## 19 3186 7952 672
table(df$of_fielding_alignment)
##
## Standard Strategic
## 19 11525 285
spin_axis
Definition: The spin axis in degrees (0 to 360) measured in the 2D X-Z plane. For example, 180° represents a pure backspin fastball, and 0° a pure topspin curveball.
# Summary and histogram for spin_axis
summary(df$spin_axis)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 130.0 204.0 175.9 223.0 359.0
ggplot(df, aes(x = spin_axis)) +
geom_histogram(bins = 30, fill = "thistle", color = "black") +
labs(title = "Distribution of Spin Axis", x = "Spin Axis (degrees)", y = "Frequency") +
theme_minimal()
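Because spin_axis is an angle on a 0 to 360 scale, the arithmetic mean reported by `summary()` can be misleading: values just below 360° and just above 0° are close in direction but far apart numerically. A circular mean averages the angles as unit vectors instead. The helper `circular_mean_deg()` below is a name introduced here for illustration, and the two input angles are made up.

```r
# Circular mean of angles given in degrees; NA values are dropped.
circular_mean_deg <- function(deg) {
  rad <- deg[!is.na(deg)] * pi / 180
  # Average the unit vectors, then convert the resulting direction to degrees.
  mean_rad <- atan2(mean(sin(rad)), mean(cos(rad)))
  (mean_rad * 180 / pi) %% 360
}

# Two made-up axes that point in nearly the same direction.
toy_axis <- c(350, 30)

arithmetic_mean <- mean(toy_axis)              # 190, opposite to both inputs
circular_mean   <- circular_mean_deg(toy_axis) # about 10

print(c(arithmetic = arithmetic_mean, circular = circular_mean))
```

Applied to the data, this would be `circular_mean_deg(df$spin_axis)`, ideally computed per pitch type, since the overall distribution mixes several pitch types.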
Analysis of Pitches on Game Outcomes
Idea Description
The idea here is to analyze how individual pitch events affect game outcome variables such as run expectancy and win expectancy. By analyzing variables like delta_home_win_exp and delta_run_exp, one can assess which pitches or events lead to significant shifts in the game state. This analysis could be valuable for both coaching staff and team management by providing insight into which game situations (or pitch types) are most impactful.
Preliminary Data Visualisations and Statistics
Win Expectancy Change Analysis: First, I plotted the distribution of win expectancy changes to understand how often individual pitch outcomes lead to large swings in the game situation.
ggplot(df, aes(x = delta_home_win_exp)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
labs(title = "Distribution of Home Win Expectancy Change", x = "Delta Home Win Exp", y = "Frequency") +
theme_minimal()
At first glance, most pitches have little effect on win expectancy, with the majority having a change of zero. Because this variable measures the home team’s win expectancy, a positive change means the pitch increased the home team’s likelihood of winning, and a negative change means it reduced that likelihood. A value close to zero means the pitch had little impact on the expected outcome of the game.
Run Expectancy Change by Pitch Type: Secondly, I wanted to visualize which pitch types were associated with larger changes in run expectancy.
run_exp_change <- df %>%
filter(!is.na(pitch_type)) %>%
group_by(pitch_type) %>%
summarise(avg_run_exp_change = mean(delta_run_exp, na.rm = TRUE),
count = n())
ggplot(run_exp_change, aes(x = reorder(pitch_type, avg_run_exp_change), y = avg_run_exp_change, fill = pitch_type)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Average Run Expectancy Change by Pitch Type", x = "Pitch Type", y = "Avg Run Exp Change") +
theme_minimal() +
theme(legend.position = "none")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).
As in the previous visualization, the average effect of a single pitch on run expectancy is small. That said, changeups show the highest average run expectancy change of all pitch types (0.022), while slurves show the most negative one (-0.035). Note that the slurve average is based on only 21 pitches.
A value of about 0.02 means that, on average, when a changeup is thrown, the expected number of runs increases by about 0.02. To put it in perspective, if a team was expected to score 0.50 runs on average, that expectation rises slightly to about 0.52 runs on average after a changeup.
This matters for coaches and teams: understanding which pitch types are more likely to improve (or worsen) the game situation helps them make better-informed decisions. For analysts, it provides a way to quantify the effectiveness of each pitch type beyond traditional measures like strikeouts or batting average.
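Since the averages are small and some pitch types are rare, it helps to report a standard error next to each mean so that differences between pitch types are not over-read. The sketch below extends the summary with `sd(x) / sqrt(n)` and also drops blank pitch types. The rows in `toy` are made up and do not reproduce the figures above.

```r
library(dplyr)

# Toy rows for illustration only; not values from the dataset.
toy <- data.frame(
  pitch_type    = c("CH", "CH", "CH", "CH", "SV", "SV"),
  delta_run_exp = c(0.04, -0.02, 0.01, 0.01, -0.10, 0.04)
)

# Mean run expectancy change per pitch type with its standard error.
run_exp_se <- toy %>%
  filter(!is.na(pitch_type), pitch_type != "", !is.na(delta_run_exp)) %>%
  group_by(pitch_type) %>%
  summarise(avg_run_exp_change = mean(delta_run_exp),
            n  = n(),
            se = sd(delta_run_exp) / sqrt(n),
            .groups = "drop")

print(run_exp_se)
```

A pitch type with few observations will show a large standard error, which signals that its average should be interpreted with caution.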
Outcome Analysis by Game Situation: Lastly, I wanted to examine how the score changes (or win expectancy changes) vary by different game contexts (in this case innings).
# Boxplot of score changes by inning
ggplot(df, aes(x = factor(inning), y = score_change)) +
geom_boxplot(fill = "lightcoral", color = "black") +
labs(title = "Score Change by Inning", x = "Inning", y = "Score Change") +
theme_minimal()
Expected Outcomes / Research Hypotheses
Hypothesis 1: Pitch events with larger run expectancy changes correspond to pitches in specific zones (such as those just outside the batter’s strike zone that result in weak contact).
Hypothesis 2: The impact of individual pitch outcomes on win expectancy is more noticeable in high-leverage situations (late innings or during close games).
Expected Outcomes: The identification of pitch characteristics or situations that coaches can use to improve their strategy. These insights could help in evaluating pitcher performance in critical moments and inform in-game decisions.
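Hypothesis 2 could be explored with a simple high-leverage flag. The sketch below uses an assumed proxy, not the formal leverage index: a pitch counts as high leverage when it occurs in the 7th inning or later with a score margin of at most one run. Both thresholds are arbitrary choices, and the rows in `toy` are made up.

```r
library(dplyr)

# Toy rows for illustration only; not values from the dataset.
toy <- data.frame(
  inning             = c(1, 8, 9, 3, 7),
  bat_score          = c(0, 2, 4, 1, 5),
  fld_score          = c(0, 3, 4, 5, 0),
  delta_home_win_exp = c(0.01, -0.08, 0.15, 0.00, 0.01)
)

# Assumed proxy for leverage: late inning and a close score.
leverage_summary <- toy %>%
  mutate(high_leverage = inning >= 7 & abs(bat_score - fld_score) <= 1) %>%
  group_by(high_leverage) %>%
  summarise(mean_abs_delta_win_exp = mean(abs(delta_home_win_exp), na.rm = TRUE),
            n = n(),
            .groups = "drop")

print(leverage_summary)
```

The absolute value is used because a swing in either direction counts as impact.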
Limitations or Issues in the Data
Contextual Details: While the dataset includes some context (score, inning, outs), deeper factors such as batter tendencies or defensive positioning were not captured.
Data Representativeness: Postseason data typically involves high-caliber teams and pitchers. While interesting, conclusions drawn might not generalize to regular-season scenarios.
Predicting Batted Ball Outcomes Based on Pitch Characteristics
Idea Description
The idea here is to investigate how various pitch attributes (such as pitch type, release speed, pitch location, and movement) influence the quality of contact and the resulting batted ball outcome (using metrics like exit velocity, launch angle, and advanced statistics such as estimated BABIP, wOBA, or ISO). Essentially, the goal is to develop a predictive framework that links the details of a pitch to how well (or poorly) the ball is hit.
Key Aspects of the Study:
Inputs:
Pitch characteristics including:
Pitch Type: (CH, CU, FF, etc.)
Release Speed: Velocity at which the pitch is thrown.
Pitch Location: Represented by plate_x and plate_z.
Movement Variables: Such as pfx_x and pfx_z or the acceleration measures.
Outcomes:
Batted ball outcomes measured with variables such as:
Launch Speed: How fast the ball leaves the bat.
Launch Angle: The vertical angle at which the ball is hit.
Hit Distance: The projected distance the ball travels.
Advanced Metrics: Such as estimated BABIP, ISO, and wOBA values (which relate to the quality of the contact).
Preliminary Data Visualizations and Statistics
Before building the predictive model, I explored the relationship between pitch characteristics and batted ball outcomes.
Correlation Analysis
Scatter plot to see how pitch speed (release_speed) relates to exit velocity (launch_speed):
ggplot(df, aes(x = release_speed, y = launch_speed)) +
geom_point(alpha = 0.3, color = "darkblue") +
labs(title = "Release Speed vs. Launch Speed",
x = "Release Speed (mph)", y = "Launch Speed (mph)") +
theme_minimal()
## Warning: Removed 7978 rows containing missing values or values outside the scale range
## (`geom_point()`).
At first glance, there does not appear to be a correlation. The points are widely scattered, although there seems to be a cluster around a release speed of 95 mph and a launch speed of 80 mph.
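The visual impression can be quantified with a Pearson correlation coefficient. Most pitches have no launch speed because they were not put in play, so the calculation has to be restricted to complete pairs, which `use = "complete.obs"` does. The values in `toy` are made up.

```r
# Toy rows for illustration only; not values from the dataset.
toy <- data.frame(
  release_speed = c(95, 88, 92, 80, NA),
  launch_speed  = c(100, 85, NA, 90, 70)
)

# Number of pitches with both speeds recorded.
n_complete <- sum(complete.cases(toy))

# Pearson correlation over the complete pairs only.
r <- cor(toy$release_speed, toy$launch_speed, use = "complete.obs")

print(c(n_complete = n_complete, r = r))
```

On the real data this would be `cor(df$release_speed, df$launch_speed, use = "complete.obs")`.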
Distribution by Pitch Type
Visualization of the distribution of launch speeds across different pitch types:
# Filter NA
df_filtered <- df %>%
filter(!is.na(pitch_type), !is.na(launch_speed))
ggplot(df_filtered, aes(x = pitch_type, y = launch_speed, fill = pitch_type)) +
geom_boxplot() +
labs(title = "Launch Speed by Pitch Type",
x = "Pitch Type", y = "Launch Speed (mph)") +
theme_minimal() +
theme(legend.position = "none")
At first glance, the median launch speed appears to be around 85 mph for every pitch type except SV (slurve), which is much lower.
Pitch Location and Batted Ball Quality
Examining how pitch location relates to the outcome of the batted ball in terms of exit velocity. The plot below uses the horizontal location (plate_x) only; plate_z is not plotted here:
ggplot(df, aes(x = plate_x, y = launch_speed)) +
geom_point(alpha = 0.3, color = "darkgreen") +
labs(title = "Pitch Location vs. Launch Speed",
x = "Plate X (ft)", y = "Launch Speed (mph)") +
theme_minimal()
## Warning: Removed 7978 rows containing missing values or values outside the scale range
## (`geom_point()`).
Expected Outcomes / Research Hypotheses
Hypothesis 1: Certain pitch types (for example, a four-seam fastball) are more likely to lead to higher exit velocities because of the higher release speeds and the more predictable location in the strike zone.
Hypothesis 2: The location of the pitch is key: pitches thrown in the heart of the strike zone (moderate values of plate_x and plate_z) are expected to produce more solid contact than pitches on the edges, leading to higher launch speeds and more optimal launch angles.
Expected Outcome: By analyzing these relationships, one might be able to identify specific pitch characteristics that consistently yield better contact with the ball. The study could result in valuable insights such as recommended pitch selection under certain circumstances or situational changes that maximize the likelihood of a weakly hit ball, ultimately guiding both pitching strategies and hitter approaches.
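Hypothesis 2 can be given a first check by flagging pitches in the "heart" of the zone and comparing launch speed between the two groups. The thresholds below (within 0.5 ft of the plate's center and between 2.0 and 3.0 ft high) are assumed values chosen for illustration and are not defined in the source. The rows in `toy` are made up.

```r
library(dplyr)

# Toy rows for illustration only; not values from the dataset.
toy <- data.frame(
  plate_x      = c(0.1, -0.2, 0.9, -1.1),
  plate_z      = c(2.5,  2.8, 1.2,  3.9),
  launch_speed = c(102,  98,  81,   76)
)

# Assumed definition of the heart of the zone; thresholds are illustrative.
heart_summary <- toy %>%
  filter(!is.na(launch_speed)) %>%
  mutate(heart = abs(plate_x) <= 0.5 & plate_z >= 2.0 & plate_z <= 3.0) %>%
  group_by(heart) %>%
  summarise(mean_launch_speed = mean(launch_speed),
            n = n(),
            .groups = "drop")

print(heart_summary)
```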
Discussion of Limitations or Issues in the Data
Contextual Details: While the dataset includes some context (score, inning, outs), deeper factors such as batter tendencies or defensive positioning were not captured.
Data Representativeness: Postseason data typically involves high-caliber teams and pitchers. While interesting, conclusions drawn might not generalize to regular-season scenarios.
Measurement Error and Estimation: Variables like estimated BABIP or wOBA based on speed and angle might have a lot of noise or estimation error, affecting the precision of these variables.
Causation vs. Correlation: It can be challenging to determine causality in observational data. While you may find significant correlations between pitch characteristics and batted ball outcomes, establishing causation may require more controlled experiments or adjustments for confounding variables.