Background

The dataset is a Statcast Search CSV export from MLB's Baseball Savant site. It contains pitch-by-pitch records from the 2023 MLB postseason, with variables such as pitch type, release speed, movement, and batted-ball outcomes for both pitchers and batters. Because the export is restricted to postseason games, where every pitch carries high stakes, it is a useful resource for analyzing player and team strategy at the highest level of competition.

Dataset Description

Data Loading and Libraries

#Libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(dplyr)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(rlang)   
## 
## Attaching package: 'rlang'
## 
## The following objects are masked from 'package:purrr':
## 
##     %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
##     flatten_raw, invoke, splice
library(tidyr)
library(lubridate)
library(stringr)
library(purrr)


# Dataset: Statcast Search CSV export for the 2023 postseason
df <- read.csv("pitch_data_2023_mlb_post.csv")

head(df)
##   pitch_type  game_date release_speed release_pos_x release_pos_z
## 1         FC 2023-11-01          91.5         -2.60          5.44
## 2         CU 2023-11-01          76.2         -2.27          5.73
## 3         FS 2023-11-01          87.3         -2.61          5.49
## 4         FS 2023-11-01          88.6         -2.39          5.61
## 5         FS 2023-11-01          85.6         -2.38          5.64
## 6         FS 2023-11-01          88.5         -2.51          5.51
##       player_name batter pitcher    events             description spin_dir
## 1 Eovaldi, Nathan 672695  543135 strikeout           called_strike       NA
## 2 Eovaldi, Nathan 672695  543135                     called_strike       NA
## 3 Eovaldi, Nathan 672695  543135                              ball       NA
## 4 Eovaldi, Nathan 672695  543135                              foul       NA
## 5 Eovaldi, Nathan 446334  543135 strikeout swinging_strike_blocked       NA
## 6 Eovaldi, Nathan 446334  543135                   swinging_strike       NA
##   spin_rate_deprecated break_angle_deprecated break_length_deprecated zone
## 1                   NA                     NA                      NA    7
## 2                   NA                     NA                      NA    8
## 3                   NA                     NA                      NA   13
## 4                   NA                     NA                      NA   14
## 5                   NA                     NA                      NA   14
## 6                   NA                     NA                      NA   13
##                                      des game_type stand p_throws home_team
## 1 Geraldo Perdomo called out on strikes.         W     L        R        AZ
## 2 Geraldo Perdomo called out on strikes.         W     L        R        AZ
## 3 Geraldo Perdomo called out on strikes.         W     L        R        AZ
## 4 Geraldo Perdomo called out on strikes.         W     L        R        AZ
## 5    Evan Longoria strikes out swinging.         W     R        R        AZ
## 6    Evan Longoria strikes out swinging.         W     R        R        AZ
##   away_team type hit_location bb_type balls strikes game_year pfx_x pfx_z
## 1       TEX    S            2             1       2      2023  0.25  0.74
## 2       TEX    S           NA             1       1      2023  0.99 -0.37
## 3       TEX    B           NA             0       1      2023 -0.86  0.11
## 4       TEX    S           NA             0       0      2023 -0.92  0.24
## 5       TEX    S            2             0       2      2023 -1.00  0.17
## 6       TEX    S           NA             0       1      2023 -1.08  0.13
##   plate_x plate_z on_3b on_2b on_1b outs_when_up inning inning_topbot hc_x hc_y
## 1   -0.47    1.94    NA    NA    NA            2      6           Bot   NA   NA
## 2   -0.10    1.96    NA    NA    NA            2      6           Bot   NA   NA
## 3   -0.35    1.02    NA    NA    NA            2      6           Bot   NA   NA
## 4    0.18    1.30    NA    NA    NA            2      6           Bot   NA   NA
## 5    0.21    0.62    NA    NA    NA            1      6           Bot   NA   NA
## 6   -0.13    0.93    NA    NA    NA            1      6           Bot   NA   NA
##   tfs_deprecated tfs_zulu_deprecated fielder_2 umpire sv_id      vx0       vy0
## 1             NA                  NA    641680     NA    NA 4.866354 -133.2660
## 2             NA                  NA    641680     NA    NA 2.772168 -110.9465
## 3             NA                  NA    641680     NA    NA 7.274464 -126.8861
## 4             NA                  NA    641680     NA    NA 8.263622 -128.7932
## 5             NA                  NA    641680     NA    NA 8.191721 -124.3931
## 6             NA                  NA    641680     NA    NA 8.101944 -128.6535
##          vz0         ax       ay        az sz_top sz_bot hit_distance_sc
## 1 -4.6442287   2.079220 26.02426 -22.51388   3.40   1.57              NA
## 2 -0.2128294   7.719352 19.68724 -35.40345   3.30   1.53              NA
## 3 -4.8384666 -10.799094 25.31579 -30.23172   3.43   1.53              NA
## 4 -4.9791683 -11.987677 26.02350 -28.68755   3.40   1.57               3
## 5 -5.9595113 -11.965423 23.96371 -29.38752   3.67   1.73              NA
## 6 -5.4465204 -13.627585 24.91191 -29.78662   3.67   1.73              NA
##   launch_speed launch_angle effective_speed release_spin_rate release_extension
## 1           NA           NA            92.2              2281               6.5
## 2           NA           NA            76.6              1928               6.5
## 3           NA           NA            88.0              1498               6.7
## 4         90.7          -29            89.2              1638               6.7
## 5           NA           NA            86.3              1574               6.7
## 6           NA           NA            89.3              1634               6.7
##   game_pk pitcher_1 fielder_2_1 fielder_3 fielder_4 fielder_5 fielder_6
## 1  748534    543135      641680    663993    543760    673962    608369
## 2  748534    543135      641680    663993    543760    673962    608369
## 3  748534    543135      641680    663993    543760    673962    608369
## 4  748534    543135      641680    663993    543760    673962    608369
## 5  748534    543135      641680    663993    543760    673962    608369
## 6  748534    543135      641680    663993    543760    673962    608369
##   fielder_7 fielder_8 fielder_9 release_pos_y estimated_ba_using_speedangle
## 1    694497    665750    608671         54.03                            NA
## 2    694497    665750    608671         53.97                            NA
## 3    694497    665750    608671         53.79                            NA
## 4    694497    665750    608671         53.82                            NA
## 5    694497    665750    608671         53.79                            NA
## 6    694497    665750    608671         53.78                            NA
##   estimated_woba_using_speedangle woba_value woba_denom babip_value iso_value
## 1                              NA          0          1           0         0
## 2                              NA         NA         NA          NA        NA
## 3                              NA         NA         NA          NA        NA
## 4                              NA         NA         NA          NA        NA
## 5                              NA          0          1           0         0
## 6                              NA         NA         NA          NA        NA
##   launch_speed_angle at_bat_number pitch_number   pitch_name home_score
## 1                 NA            46            4       Cutter          0
## 2                 NA            46            3    Curveball          0
## 3                 NA            46            2 Split-Finger          0
## 4                 NA            46            1 Split-Finger          0
## 5                 NA            45            3 Split-Finger          0
## 6                 NA            45            2 Split-Finger          0
##   away_score bat_score fld_score post_away_score post_home_score post_bat_score
## 1          0         0         0               0               0              0
## 2          0         0         0               0               0              0
## 3          0         0         0               0               0              0
## 4          0         0         0               0               0              0
## 5          0         0         0               0               0              0
## 6          0         0         0               0               0              0
##   post_fld_score if_fielding_alignment of_fielding_alignment spin_axis
## 1              0             Strategic             Strategic       200
## 2              0         Infield shade             Strategic        44
## 3              0         Infield shade             Strategic       240
## 4              0         Infield shade             Strategic       237
## 5              0         Infield shade              Standard       241
## 6              0              Standard              Standard       236
##   delta_home_win_exp delta_run_exp
## 1             -0.017        -0.074
## 2              0.000        -0.027
## 3              0.000         0.012
## 4              0.000        -0.017
## 5             -0.024        -0.107
## 6              0.000        -0.036
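
The `head(df)` output above shows `game_date` stored as text and empty strings in `events` and `bb_type` where no value applies. The following is a minimal sketch, not part of the original analysis, of one way to convert these on import. The `toy` data frame is a hypothetical stand-in built from values in that output; on the real data the same `mutate()` call would be applied to `df`.

```r
library(dplyr)

# Hypothetical stand-in for three columns of df (values taken from head(df))
toy <- data.frame(
  game_date = c("2023-11-01", "2023-11-01", "2023-11-01"),
  events    = c("strikeout", "", ""),
  bb_type   = c("", "", ""),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  mutate(
    # "YYYY-MM-DD" text -> Date
    game_date = as.Date(game_date),
    # Empty strings -> NA in the two text columns
    across(c(events, bb_type), ~ na_if(.x, ""))
  )

str(toy_clean)
```

An alternative that avoids the second step is to pass `na.strings = c("NA", "")` to `read.csv()`, which marks blank fields as missing at load time.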

Basic Overview

cat("Dataset Structure and Summary \n")
## Dataset Structure and Summary
str(df)
## 'data.frame':    11829 obs. of  92 variables:
##  $ pitch_type                     : chr  "FC" "CU" "FS" "FS" ...
##  $ game_date                      : chr  "2023-11-01" "2023-11-01" "2023-11-01" "2023-11-01" ...
##  $ release_speed                  : num  91.5 76.2 87.3 88.6 85.6 88.5 91.4 96.1 88 76.2 ...
##  $ release_pos_x                  : num  -2.6 -2.27 -2.61 -2.39 -2.38 -2.51 -2.59 -2.53 -2.52 -2.16 ...
##  $ release_pos_z                  : num  5.44 5.73 5.49 5.61 5.64 5.51 5.43 5.38 5.55 5.76 ...
##  $ player_name                    : chr  "Eovaldi, Nathan" "Eovaldi, Nathan" "Eovaldi, Nathan" "Eovaldi, Nathan" ...
##  $ batter                         : int  672695 672695 672695 672695 446334 446334 446334 677950 677950 677950 ...
##  $ pitcher                        : int  543135 543135 543135 543135 543135 543135 543135 543135 543135 543135 ...
##  $ events                         : chr  "strikeout" "" "" "" ...
##  $ description                    : chr  "called_strike" "called_strike" "ball" "foul" ...
##  $ spin_dir                       : logi  NA NA NA NA NA NA ...
##  $ spin_rate_deprecated           : logi  NA NA NA NA NA NA ...
##  $ break_angle_deprecated         : logi  NA NA NA NA NA NA ...
##  $ break_length_deprecated        : logi  NA NA NA NA NA NA ...
##  $ zone                           : int  7 8 13 14 14 13 5 2 8 14 ...
##  $ des                            : chr  "Geraldo Perdomo called out on strikes." "Geraldo Perdomo called out on strikes." "Geraldo Perdomo called out on strikes." "Geraldo Perdomo called out on strikes." ...
##  $ game_type                      : chr  "W" "W" "W" "W" ...
##  $ stand                          : chr  "L" "L" "L" "L" ...
##  $ p_throws                       : chr  "R" "R" "R" "R" ...
##  $ home_team                      : chr  "AZ" "AZ" "AZ" "AZ" ...
##  $ away_team                      : chr  "TEX" "TEX" "TEX" "TEX" ...
##  $ type                           : chr  "S" "S" "B" "S" ...
##  $ hit_location                   : int  2 NA NA NA 2 NA NA 1 NA NA ...
##  $ bb_type                        : chr  "" "" "" "" ...
##  $ balls                          : int  1 1 0 0 0 0 0 0 0 0 ...
##  $ strikes                        : int  2 1 1 0 2 1 0 2 2 2 ...
##  $ game_year                      : int  2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
##  $ pfx_x                          : num  0.25 0.99 -0.86 -0.92 -1 -1.08 0.39 -0.97 -1.13 0.72 ...
##  $ pfx_z                          : num  0.74 -0.37 0.11 0.24 0.17 0.13 0.94 1.08 0.15 -0.2 ...
##  $ plate_x                        : num  -0.47 -0.1 -0.35 0.18 0.21 -0.13 0.24 -0.09 -0.16 0.96 ...
##  $ plate_z                        : num  1.94 1.96 1.02 1.3 0.62 0.93 2.69 3.3 1.36 1.82 ...
##  $ on_3b                          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ on_2b                          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ on_1b                          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ outs_when_up                   : int  2 2 2 2 1 1 1 0 0 0 ...
##  $ inning                         : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ inning_topbot                  : chr  "Bot" "Bot" "Bot" "Bot" ...
##  $ hc_x                           : num  NA NA NA NA NA ...
##  $ hc_y                           : num  NA NA NA NA NA ...
##  $ tfs_deprecated                 : logi  NA NA NA NA NA NA ...
##  $ tfs_zulu_deprecated            : logi  NA NA NA NA NA NA ...
##  $ fielder_2                      : int  641680 641680 641680 641680 641680 641680 641680 641680 641680 641680 ...
##  $ umpire                         : logi  NA NA NA NA NA NA ...
##  $ sv_id                          : logi  NA NA NA NA NA NA ...
##  $ vx0                            : num  4.87 2.77 7.27 8.26 8.19 ...
##  $ vy0                            : num  -133 -111 -127 -129 -124 ...
##  $ vz0                            : num  -4.644 -0.213 -4.838 -4.979 -5.96 ...
##  $ ax                             : num  2.08 7.72 -10.8 -11.99 -11.97 ...
##  $ ay                             : num  26 19.7 25.3 26 24 ...
##  $ az                             : num  -22.5 -35.4 -30.2 -28.7 -29.4 ...
##  $ sz_top                         : num  3.4 3.3 3.43 3.4 3.67 3.67 3.67 3.24 3.24 3.24 ...
##  $ sz_bot                         : num  1.57 1.53 1.53 1.57 1.73 1.73 1.73 1.44 1.44 1.44 ...
##  $ hit_distance_sc                : int  NA NA NA 3 NA NA 230 1 2 44 ...
##  $ launch_speed                   : num  NA NA NA 90.7 NA NA 72.5 73.3 71.9 91.9 ...
##  $ launch_angle                   : int  NA NA NA -29 NA NA 34 -68 -40 0 ...
##  $ effective_speed                : num  92.2 76.6 88 89.2 86.3 89.3 92.3 96 88.6 76.6 ...
##  $ release_spin_rate              : int  2281 1928 1498 1638 1574 1634 2271 2313 1492 1702 ...
##  $ release_extension              : num  6.5 6.5 6.7 6.7 6.7 6.7 6.6 6.6 6.5 6.5 ...
##  $ game_pk                        : int  748534 748534 748534 748534 748534 748534 748534 748534 748534 748534 ...
##  $ pitcher_1                      : int  543135 543135 543135 543135 543135 543135 543135 543135 543135 543135 ...
##  $ fielder_2_1                    : int  641680 641680 641680 641680 641680 641680 641680 641680 641680 641680 ...
##  $ fielder_3                      : int  663993 663993 663993 663993 663993 663993 663993 663993 663993 663993 ...
##  $ fielder_4                      : int  543760 543760 543760 543760 543760 543760 543760 543760 543760 543760 ...
##  $ fielder_5                      : int  673962 673962 673962 673962 673962 673962 673962 673962 673962 673962 ...
##  $ fielder_6                      : int  608369 608369 608369 608369 608369 608369 608369 608369 608369 608369 ...
##  $ fielder_7                      : int  694497 694497 694497 694497 694497 694497 694497 694497 694497 694497 ...
##  $ fielder_8                      : int  665750 665750 665750 665750 665750 665750 665750 665750 665750 665750 ...
##  $ fielder_9                      : int  608671 608671 608671 608671 608671 608671 608671 608671 608671 608671 ...
##  $ release_pos_y                  : num  54 54 53.8 53.8 53.8 ...
##  $ estimated_ba_using_speedangle  : num  NA NA NA NA NA NA NA 0.274 NA NA ...
##  $ estimated_woba_using_speedangle: num  NA NA NA NA NA NA NA 0.25 NA NA ...
##  $ woba_value                     : num  0 NA NA NA 0 NA NA 0 NA NA ...
##  $ woba_denom                     : int  1 NA NA NA 1 NA NA 1 NA NA ...
##  $ babip_value                    : int  0 NA NA NA 0 NA NA 0 NA NA ...
##  $ iso_value                      : int  0 NA NA NA 0 NA NA 0 NA NA ...
##  $ launch_speed_angle             : int  NA NA NA NA NA NA NA 2 NA NA ...
##  $ at_bat_number                  : int  46 46 46 46 45 45 45 44 44 44 ...
##  $ pitch_number                   : int  4 3 2 1 3 2 1 5 4 3 ...
##  $ pitch_name                     : chr  "Cutter" "Curveball" "Split-Finger" "Split-Finger" ...
##  $ home_score                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ away_score                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bat_score                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ fld_score                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ post_away_score                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ post_home_score                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ post_bat_score                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ post_fld_score                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ if_fielding_alignment          : chr  "Strategic" "Infield shade" "Infield shade" "Infield shade" ...
##  $ of_fielding_alignment          : chr  "Strategic" "Strategic" "Strategic" "Strategic" ...
##  $ spin_axis                      : int  200 44 240 237 241 236 199 229 242 37 ...
##  $ delta_home_win_exp             : num  -0.017 0 0 0 -0.024 0 0 -0.033 0 0 ...
##  $ delta_run_exp                  : num  -0.074 -0.027 0.012 -0.017 -0.107 -0.036 -0.026 -0.152 0 0 ...
cat("\nDimensions of df:", dim(df), "\n")
## 
## Dimensions of df: 11829 92
summary(df)
##   pitch_type         game_date         release_speed    release_pos_x   
##  Length:11829       Length:11829       Min.   : 71.40   Min.   :-3.910  
##  Class :character   Class :character   1st Qu.: 85.50   1st Qu.:-2.190  
##  Mode  :character   Mode  :character   Median : 91.40   Median :-1.620  
##                                        Mean   : 90.05   Mean   :-1.051  
##                                        3rd Qu.: 94.70   3rd Qu.:-0.410  
##                                        Max.   :103.70   Max.   : 4.500  
##                                                                         
##  release_pos_z   player_name            batter          pitcher      
##  Min.   :3.190   Length:11829       Min.   :444482   Min.   :434378  
##  1st Qu.:5.450   Class :character   1st Qu.:592663   1st Qu.:571760  
##  Median :5.770   Mode  :character   Median :656775   Median :624133  
##  Mean   :5.765                      Mean   :622921   Mean   :614818  
##  3rd Qu.:6.100                      3rd Qu.:669221   3rd Qu.:664353  
##  Max.   :7.270                      Max.   :694497   Max.   :700363  
##                                                                      
##     events          description        spin_dir       spin_rate_deprecated
##  Length:11829       Length:11829       Mode:logical   Mode:logical        
##  Class :character   Class :character   NA's:11829     NA's:11829          
##  Mode  :character   Mode  :character                                      
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##  break_angle_deprecated break_length_deprecated      zone       
##  Mode:logical           Mode:logical            Min.   : 1.000  
##  NA's:11829             NA's:11829              1st Qu.: 6.000  
##                                                 Median :11.000  
##                                                 Mean   : 9.274  
##                                                 3rd Qu.:13.000  
##                                                 Max.   :14.000  
##                                                                 
##      des             game_type            stand             p_throws        
##  Length:11829       Length:11829       Length:11829       Length:11829      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   home_team          away_team             type            hit_location  
##  Length:11829       Length:11829       Length:11829       Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :5.000  
##                                                           Mean   :5.004  
##                                                           3rd Qu.:7.000  
##                                                           Max.   :9.000  
##                                                           NA's   :9157   
##    bb_type              balls          strikes        game_year   
##  Length:11829       Min.   :0.000   Min.   :0.000   Min.   :2023  
##  Class :character   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:2023  
##  Mode  :character   Median :1.000   Median :1.000   Median :2023  
##                     Mean   :0.863   Mean   :0.897   Mean   :2023  
##                     3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:2023  
##                     Max.   :3.000   Max.   :2.000   Max.   :2023  
##                                                                   
##      pfx_x              pfx_z            plate_x            plate_z      
##  Min.   :-1.98000   Min.   :-1.6500   Min.   :-5.07000   Min.   :-2.300  
##  1st Qu.:-0.84000   1st Qu.: 0.1000   1st Qu.:-0.49000   1st Qu.: 1.560  
##  Median :-0.23000   Median : 0.7100   Median : 0.07000   Median : 2.220  
##  Mean   :-0.09191   Mean   : 0.6041   Mean   : 0.06783   Mean   : 2.225  
##  3rd Qu.: 0.63000   3rd Qu.: 1.2900   3rd Qu.: 0.62000   3rd Qu.: 2.890  
##  Max.   : 2.19000   Max.   : 1.9500   Max.   : 4.11000   Max.   : 6.250  
##                                                                          
##      on_3b            on_2b            on_1b         outs_when_up   
##  Min.   :446334   Min.   :444482   Min.   :446334   Min.   :0.0000  
##  1st Qu.:606115   1st Qu.:592885   1st Qu.:592663   1st Qu.:0.0000  
##  Median :641680   Median :656941   Median :645277   Median :1.0000  
##  Mean   :623920   Mean   :623594   Mean   :622009   Mean   :0.9716  
##  3rd Qu.:672515   3rd Qu.:670242   3rd Qu.:669016   3rd Qu.:2.0000  
##  Max.   :694497   Max.   :694497   Max.   :694497   Max.   :2.0000  
##  NA's   :10719    NA's   :9734     NA's   :8139                     
##      inning       inning_topbot           hc_x             hc_y       
##  Min.   : 1.000   Length:11829       Min.   :  6.51   Min.   : 24.94  
##  1st Qu.: 3.000   Class :character   1st Qu.:101.19   1st Qu.: 89.94  
##  Median : 5.000   Mode  :character   Median :123.11   Median :124.88  
##  Mean   : 4.937                      Mean   :125.48   Mean   :122.47  
##  3rd Qu.: 7.000                      3rd Qu.:153.15   3rd Qu.:155.82  
##  Max.   :11.000                      Max.   :238.77   Max.   :225.80  
##                                      NA's   :9787     NA's   :9787    
##  tfs_deprecated tfs_zulu_deprecated   fielder_2       umpire       
##  Mode:logical   Mode:logical        Min.   :455117   Mode:logical  
##  NA's:11829     NA's:11829          1st Qu.:592663   NA's:11829    
##                                     Median :641680                 
##                                     Mean   :620693                 
##                                     3rd Qu.:672515                 
##                                     Max.   :680777                 
##                                                                    
##   sv_id              vx0                vy0              vz0         
##  Mode:logical   Min.   :-16.3130   Min.   :-150.0   Min.   :-15.431  
##  NA's:11829     1st Qu.: -0.1746   1st Qu.:-137.6   1st Qu.: -6.144  
##                 Median :  4.6091   Median :-132.8   Median : -4.125  
##                 Mean   :  3.0365   Mean   :-130.9   Mean   : -4.093  
##                 3rd Qu.:  6.9592   3rd Qu.:-124.4   3rd Qu.: -2.074  
##                 Max.   : 16.4404   Max.   :-104.0   Max.   :  7.695  
##                                                                      
##        ax                ay              az              sz_top     
##  Min.   :-25.899   Min.   :17.17   Min.   :-47.645   Min.   :2.700  
##  1st Qu.:-11.646   1st Qu.:25.35   1st Qu.:-30.604   1st Qu.:3.300  
##  Median : -3.157   Median :28.51   Median :-23.135   Median :3.430  
##  Mean   : -2.141   Mean   :28.48   Mean   :-23.577   Mean   :3.426  
##  3rd Qu.:  6.631   3rd Qu.:31.45   3rd Qu.:-14.898   3rd Qu.:3.570  
##  Max.   : 26.118   Max.   :41.47   Max.   : -3.574   Max.   :4.120  
##                                                                     
##      sz_bot      hit_distance_sc  launch_speed     launch_angle   
##  Min.   :1.110   Min.   :  0.0   Min.   :  3.50   Min.   :-85.00  
##  1st Qu.:1.540   1st Qu.: 17.0   1st Qu.: 73.10   1st Qu.: -6.00  
##  Median :1.620   Median :161.0   Median : 82.30   Median : 20.00  
##  Mean   :1.618   Mean   :153.3   Mean   : 82.95   Mean   : 17.04  
##  3rd Qu.:1.700   3rd Qu.:239.0   3rd Qu.: 95.20   3rd Qu.: 42.00  
##  Max.   :1.990   Max.   :461.0   Max.   :117.10   Max.   : 88.00  
##                  NA's   :7958    NA's   :7978     NA's   :7974    
##  effective_speed  release_spin_rate release_extension    game_pk      
##  Min.   :  0.00   Min.   : 658      Min.   :4.900     Min.   :748534  
##  1st Qu.: 85.60   1st Qu.:2186      1st Qu.:6.200     1st Qu.:748545  
##  Median : 91.50   Median :2358      Median :6.500     Median :748556  
##  Mean   : 90.24   Mean   :2328      Mean   :6.505     Mean   :748558  
##  3rd Qu.: 95.00   3rd Qu.:2517      3rd Qu.:6.800     3rd Qu.:748569  
##  Max.   :104.50   Max.   :3504      Max.   :8.300     Max.   :748585  
##  NA's   :1                          NA's   :1                         
##    pitcher_1       fielder_2_1       fielder_3        fielder_4     
##  Min.   :434378   Min.   :455117   Min.   :456781   Min.   :514888  
##  1st Qu.:571760   1st Qu.:592663   1st Qu.:547180   1st Qu.:543760  
##  Median :624133   Median :641680   Median :572233   Median :606466  
##  Mean   :614818   Mean   :620693   Mean   :589998   Mean   :602297  
##  3rd Qu.:664353   3rd Qu.:672515   3rd Qu.:663993   3rd Qu.:666397  
##  Max.   :700363   Max.   :680777   Max.   :666135   Max.   :681082  
##                                                                     
##    fielder_5        fielder_6        fielder_7        fielder_8     
##  Min.   :446334   Min.   :500743   Min.   :444482   Min.   :518792  
##  1st Qu.:602104   1st Qu.:607208   1st Qu.:650559   1st Qu.:665506  
##  Median :663586   Median :621043   Median :666971   Median :671739  
##  Mean   :617049   Mean   :630840   Mean   :644639   Mean   :662034  
##  3rd Qu.:670623   3rd Qu.:670764   3rd Qu.:670541   3rd Qu.:677950  
##  Max.   :683002   Max.   :691783   Max.   :694497   Max.   :686217  
##                                                                     
##    fielder_9      release_pos_y   estimated_ba_using_speedangle
##  Min.   :502054   Min.   :52.22   Min.   :0.001                
##  1st Qu.:592206   1st Qu.:53.68   1st Qu.:0.077                
##  Median :663656   Median :53.98   Median :0.254                
##  Mean   :633090   Mean   :54.00   Mean   :0.340                
##  3rd Qu.:666969   3rd Qu.:54.30   3rd Qu.:0.571                
##  Max.   :682998   Max.   :55.55   Max.   :0.997                
##                                   NA's   :9789                 
##  estimated_woba_using_speedangle   woba_value      woba_denom     babip_value  
##  Min.   :0.000                   Min.   :0.000   Min.   :0.000   Min.   :0.00  
##  1st Qu.:0.082                   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.00  
##  Median :0.252                   Median :0.000   Median :1.000   Median :0.00  
##  Mean   :0.397                   Mean   :0.323   Mean   :0.993   Mean   :0.18  
##  3rd Qu.:0.592                   3rd Qu.:0.700   3rd Qu.:1.000   3rd Qu.:0.00  
##  Max.   :1.997                   Max.   :2.000   Max.   :1.000   Max.   :1.00  
##  NA's   :9789                    NA's   :8751    NA's   :8754    NA's   :8751  
##    iso_value    launch_speed_angle at_bat_number    pitch_number   
##  Min.   :0.00   Min.   :1.000      Min.   : 1.00   Min.   : 1.000  
##  1st Qu.:0.00   1st Qu.:2.000      1st Qu.:19.00   1st Qu.: 1.000  
##  Median :0.00   Median :3.000      Median :38.00   Median : 3.000  
##  Mean   :0.15   Mean   :3.286      Mean   :38.45   Mean   : 2.887  
##  3rd Qu.:0.00   3rd Qu.:4.000      3rd Qu.:57.00   3rd Qu.: 4.000  
##  Max.   :3.00   Max.   :6.000      Max.   :92.00   Max.   :15.000  
##  NA's   :8751   NA's   :9789                                       
##   pitch_name          home_score       away_score       bat_score     
##  Length:11829       Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  Class :character   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
##  Mode  :character   Median : 1.000   Median : 2.000   Median : 1.000  
##                     Mean   : 1.645   Mean   : 2.763   Mean   : 2.191  
##                     3rd Qu.: 3.000   3rd Qu.: 4.000   3rd Qu.: 3.000  
##                     Max.   :10.000   Max.   :11.000   Max.   :11.000  
##                                                                       
##    fld_score      post_away_score  post_home_score  post_bat_score  
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 1.000   Median : 2.000   Median : 1.000   Median : 2.000  
##  Mean   : 2.217   Mean   : 2.779   Mean   : 1.656   Mean   : 2.219  
##  3rd Qu.: 3.000   3rd Qu.: 4.000   3rd Qu.: 3.000   3rd Qu.: 3.000  
##  Max.   :11.000   Max.   :11.000   Max.   :10.000   Max.   :11.000  
##                                                                     
##  post_fld_score   if_fielding_alignment of_fielding_alignment   spin_axis    
##  Min.   : 0.000   Length:11829          Length:11829          Min.   :  4.0  
##  1st Qu.: 0.000   Class :character      Class :character      1st Qu.:130.0  
##  Median : 1.000   Mode  :character      Mode  :character      Median :204.0  
##  Mean   : 2.217                                               Mean   :175.9  
##  3rd Qu.: 3.000                                               3rd Qu.:223.0  
##  Max.   :11.000                                               Max.   :359.0  
##                                                                              
##  delta_home_win_exp   delta_run_exp      
##  Min.   :-0.5810000   Min.   :-1.130000  
##  1st Qu.: 0.0000000   1st Qu.:-0.065000  
##  Median : 0.0000000   Median :-0.017000  
##  Mean   :-0.0004862   Mean   :-0.001553  
##  3rd Qu.: 0.0000000   3rd Qu.: 0.036000  
##  Max.   : 0.5790000   Max.   : 2.729000  
##                       NA's   :1

Number of observations

num_obs <- nrow(df)
cat("\nThe number of observations is:", num_obs, "\n")
## 
## The number of observations is: 11829

Unique Athletes / Participants

unique_pitchers <- unique(df$pitcher)
unique_batters  <- unique(df$batter)
cat("\nNumber of unique pitchers:", length(unique_pitchers), "\n")
## 
## Number of unique pitchers: 136
cat("Number of unique batters:", length(unique_batters), "\n")
## Number of unique batters: 148
#Total Unique Athletes from all relevant columns
athlete_cols <- c("pitcher", "batter", "on_1b", "on_2b", "on_3b",
                  "fielder_2", "fielder_3", "fielder_4", "fielder_5", 
                  "fielder_6", "fielder_7", "fielder_8", "fielder_9")

all_athletes <- unlist(df[, athlete_cols])
all_athletes <- unique(all_athletes[!is.na(all_athletes)])
cat("Total number of unique athletes across all relevant columns:", length(all_athletes), "\n")
## Total number of unique athletes across all relevant columns: 292

Unique Teams

teams <- unique(c(df$home_team, df$away_team))
cat("\nUnique teams in the dataset:\n")
## 
## Unique teams in the dataset:
print(teams)
##  [1] "AZ"  "TEX" "PHI" "HOU" "MIN" "ATL" "LAD" "BAL" "MIL" "TB"  "MIA" "TOR"
cat("Total number of unique teams:", length(teams), "\n")
## Total number of unique teams: 12

Data Cleaning

Removing N/A

cols_to_remove <- c("spin_dir", "spin_rate_deprecated", "break_angle_deprecated", 
                    "break_length_deprecated", "tfs_deprecated", "tfs_zulu_deprecated", 
                    "umpire", "sv_id")

df <- df %>% select(-any_of(cols_to_remove))
cat("Columns after removal:\n")
## Columns after removal:
print(names(df))
##  [1] "pitch_type"                      "game_date"                      
##  [3] "release_speed"                   "release_pos_x"                  
##  [5] "release_pos_z"                   "player_name"                    
##  [7] "batter"                          "pitcher"                        
##  [9] "events"                          "description"                    
## [11] "zone"                            "des"                            
## [13] "game_type"                       "stand"                          
## [15] "p_throws"                        "home_team"                      
## [17] "away_team"                       "type"                           
## [19] "hit_location"                    "bb_type"                        
## [21] "balls"                           "strikes"                        
## [23] "game_year"                       "pfx_x"                          
## [25] "pfx_z"                           "plate_x"                        
## [27] "plate_z"                         "on_3b"                          
## [29] "on_2b"                           "on_1b"                          
## [31] "outs_when_up"                    "inning"                         
## [33] "inning_topbot"                   "hc_x"                           
## [35] "hc_y"                            "fielder_2"                      
## [37] "vx0"                             "vy0"                            
## [39] "vz0"                             "ax"                             
## [41] "ay"                              "az"                             
## [43] "sz_top"                          "sz_bot"                         
## [45] "hit_distance_sc"                 "launch_speed"                   
## [47] "launch_angle"                    "effective_speed"                
## [49] "release_spin_rate"               "release_extension"              
## [51] "game_pk"                         "pitcher_1"                      
## [53] "fielder_2_1"                     "fielder_3"                      
## [55] "fielder_4"                       "fielder_5"                      
## [57] "fielder_6"                       "fielder_7"                      
## [59] "fielder_8"                       "fielder_9"                      
## [61] "release_pos_y"                   "estimated_ba_using_speedangle"  
## [63] "estimated_woba_using_speedangle" "woba_value"                     
## [65] "woba_denom"                      "babip_value"                    
## [67] "iso_value"                       "launch_speed_angle"             
## [69] "at_bat_number"                   "pitch_number"                   
## [71] "pitch_name"                      "home_score"                     
## [73] "away_score"                      "bat_score"                      
## [75] "fld_score"                       "post_away_score"                
## [77] "post_home_score"                 "post_bat_score"                 
## [79] "post_fld_score"                  "if_fielding_alignment"          
## [81] "of_fielding_alignment"           "spin_axis"                      
## [83] "delta_home_win_exp"              "delta_run_exp"

These columns were removed because they contain no values.
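Before dropping columns, it can help to confirm that they really are empty. The following is a minimal sketch, assuming "empty" means every value is NA or an empty string. The helper name empty_cols and the toy data are hypothetical and only stand in for df.

```r
# Hypothetical helper: names of columns whose values are all NA or "".
empty_cols <- function(data) {
  is_empty <- vapply(
    data,
    function(col) all(is.na(col) | (is.character(col) & col == "")),
    logical(1)
  )
  names(data)[is_empty]
}

# Toy data standing in for df (values are made up for illustration).
toy <- data.frame(
  release_speed = c(91.5, 76.2),
  spin_dir      = c(NA, NA),
  umpire        = c(NA, NA)
)

empty_cols(toy)
```

Running empty_cols(df) before the select() call would list the columns that are safe to remove under this definition.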

Converting game_date to a Date class

df <- df %>% 
    mutate(game_date = as.Date(game_date, format = "%Y-%m-%d"))

This ensures that game_date is stored as a Date object rather than as a character string.

Converting numeric columns that are being read as characters or factors

num_cols <- c("release_speed", "release_pos_x", "release_pos_z", "pfx_x", "pfx_z", 
              "plate_x", "plate_z", "vx0", "vy0", "vz0", "ax", "ay", "az", "sz_top", "sz_bot",
               "launch_speed", "launch_angle", "release_extension")
df <- df %>%
  mutate(across(all_of(num_cols), as.numeric))
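A numeric conversion silently turns unparseable entries into NA, so it is worth counting how many values are lost. The sketch below assumes the column may arrive as character or factor. The helper coercion_losses and the toy vector are hypothetical.

```r
# Hypothetical helper: count entries that are present before conversion
# but become NA afterwards, i.e. values that could not be parsed.
# as.character() is applied first so that factors are converted by label,
# not by their internal integer codes.
coercion_losses <- function(x) {
  converted <- suppressWarnings(as.numeric(as.character(x)))
  sum(!is.na(x) & is.na(converted))
}

# Toy vector standing in for a column such as release_speed (made-up values).
toy_speed <- c("91.5", "76.2", "n/a", NA)

coercion_losses(toy_speed)          # only "n/a" fails to parse
coercion_losses(factor(toy_speed))  # same result for a factor
```

Applying it with sapply(df[num_cols], coercion_losses) before the mutate() step would show whether any column loses data in the conversion.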

Removing any duplicate rows

num_duplicates <- sum(duplicated(df))
cat("Number of duplicate rows:", num_duplicates, "\n")
## Number of duplicate rows: 0
if(num_duplicates > 0) {
  df <- df %>% distinct()
  cat("Duplicate rows have been removed. New number of observations:", nrow(df), "\n")
}

Creating New Variables and Reformatting the Data

df <- df %>%
  mutate(game_type = recode(game_type,
                            "E" = "Exhibition",
                            "S" = "Spring Training",
                            "R" = "Regular Season",
                            "F" = "Wild Card",
                            "D" = "Divisional Series",
                            "L" = "League Championship Series",
                            "W" = "World Series"))

The game_type codes were replaced with descriptive names so that each category is easier to interpret.

df <- df %>%
  mutate(launch_speed_angle = case_when(
    launch_speed_angle == 1 ~ "Weak",
    launch_speed_angle == 2 ~ "Topped",
    launch_speed_angle == 3 ~ "Under",
    launch_speed_angle == 4 ~ "Flare/Burner",
    launch_speed_angle == 5 ~ "Solid Contact",
    launch_speed_angle == 6 ~ "Barrel",
    TRUE ~ as.character(launch_speed_angle)
  )) 

The “launch_speed_angle” codes were replaced with the names of the contact-quality categories they represent.

df <- df %>%
  # Calculate pre-pitch total score and post-pitch total score
  mutate(
    total_score_pre  = home_score + away_score,
    total_score_post = post_home_score + post_away_score,
    score_change     = total_score_post - total_score_pre
  )

Three new columns were added:

total_score_pre:

The sum of the home_score and away_score columns. These represent the scores for both teams before the pitch event.

total_score_post:

The sum of the post_home_score and post_away_score columns. These values reflect the new scores immediately after the pitch.

score_change:

This is calculated as the difference between total_score_post and total_score_pre. For pitches that don’t result in any runs, the change will be 0; if runs are scored, this column will indicate how many runs were added as a result of the play.
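One way to sanity-check score_change is to compare it with the change in the batting team's score, on the assumption that only the batting team can score on a given pitch. The sketch below uses a made-up toy data frame. The columns bat_change and consistent are hypothetical names introduced for illustration.

```r
library(dplyr)

# Toy rows standing in for df (scores are made up for illustration).
toy <- data.frame(
  home_score      = c(0, 1, 1),
  away_score      = c(2, 2, 2),
  post_home_score = c(0, 1, 3),
  post_away_score = c(2, 2, 2),
  bat_score       = c(0, 1, 1),
  post_bat_score  = c(0, 1, 3)
)

checked <- toy %>%
  mutate(
    score_change = (post_home_score + post_away_score) - (home_score + away_score),
    bat_change   = post_bat_score - bat_score,   # runs scored by the batting team
    consistent   = score_change == bat_change    # assumption: these should agree
  )

checked
```

Any row of df where consistent is FALSE would be worth inspecting by hand before the column is used in later analysis.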

Variable Explanation and visualizations

pitch_type

Definition: The type of pitch (e.g., fastball, curveball, slider) as derived from Statcast.

CH (Changeup), CU (Curveball), FC (Cutter), FF (Four-Seam Fastball), FS (Split-Finger Fastball), KC (Knuckle Curve), PO (Pitchout), SI (Sinker), SL (Slider), ST (Sweeper), SV (Slurve)

#Pitch Type

# Count pitches by type and compute each type's share
pitch_type_summary <- df %>% 
  count(pitch_type) %>% 
  mutate(percentage = n / sum(n) * 100)
print(pitch_type_summary)
##    pitch_type    n percentage
## 1          CH 1114  9.4175332
## 2          CU 1088  9.1977344
## 3          FC  754  6.3741652
## 4          FF 4343 36.7148533
## 5          FS  373  3.1532674
## 6          KC  337  2.8489306
## 7          PO    1  0.0084538
## 8          SI 1645 13.9065010
## 9          SL 1553 13.1287514
## 10         ST  600  5.0722800
## 11         SV   21  0.1775298
unique(pitch_type_summary$pitch_type)
##  [1] "CH" "CU" "FC" "FF" "FS" "KC" "PO" "SI" "SL" "ST" "SV"
# Bar Chart
ggplot(pitch_type_summary, aes(x = pitch_type, y = n, fill = pitch_type)) +
  geom_bar(stat = "identity") +
  labs(title = "Frequency of Pitch Types", x = "Pitch Type", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

At first glance, the Four-Seam Fastball is by far the most used pitch, accounting for about 37% of all pitches in the dataset.

release_speed

Definition: The velocity of the pitch (in mph).

# Summary
summary(df$release_speed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   71.40   85.50   91.40   90.05   94.70  103.70
# Histogram
ggplot(df, aes(x = release_speed)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Release Speeds", x = "Release Speed (mph)", y = "Frequency") +
  theme_minimal()

# Boxplot to inspect outliers
ggplot(df, aes(x = "", y = release_speed)) +
  geom_boxplot(fill = "lightgreen", color = "darkgreen") +
  labs(title = "Boxplot of Release Speed", y = "Release Speed (mph)") +
  theme_minimal()

At first glance, pitch speeds range from about 71 mph to 104 mph, with half of all pitches falling between 85.5 and 94.7 mph (median 91.4 mph).

release_pos_x, release_pos_z, release_pos_y

Definition:

release_pos_x: Horizontal release position (feet from catcher’s perspective).

release_pos_z: Vertical release position (feet from catcher’s perspective).

release_pos_y: Release position of pitch measured in feet from the catcher’s perspective.

# Summaries for positions
summary(df$release_pos_x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -3.910  -2.190  -1.620  -1.051  -0.410   4.500
summary(df$release_pos_z)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.190   5.450   5.770   5.765   6.100   7.270
summary(df$release_pos_y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   52.22   53.68   53.98   54.00   54.30   55.55
# Scatterplot: Horizontal vs. Vertical release positions
ggplot(df, aes(x = release_pos_x, y = release_pos_z)) +
  geom_point(alpha = 0.3, color = "purple") +
  labs(title = "Release Position: Horizontal vs Vertical", x = "Horizontal Position (ft)", y = "Vertical Position (ft)") +
  theme_minimal()

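Because positions are measured from the catcher’s perspective, right-handed pitchers release the ball at negative x values. Most points fall on the negative side, which suggests that the dataset contains more pitches from right-handed pitchers than from left-handed pitchers.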

vx0, vy0, vz0, ax, ay, az

Definition:

vx0, vy0, vz0: Components of the pitch velocity in feet per second in the x, y, z dimensions at a fixed point (y = 50 feet).

ax, ay, az: The corresponding acceleration components in feet per second squared.
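The three velocity components can be combined into a single speed, which gives a quick plausibility check against release_speed. This is a minimal sketch of that arithmetic: the magnitude of the velocity vector in feet per second, converted to miles per hour (1 mph = 5280 ft per 3600 s). The function name speed_mph and the example values are hypothetical.

```r
# Conversion factor from feet per second to miles per hour.
fps_to_mph <- 3600 / 5280

# Hypothetical helper: speed (mph) from the three velocity components (fps).
speed_mph <- function(vx0, vy0, vz0) {
  sqrt(vx0^2 + vy0^2 + vz0^2) * fps_to_mph
}

# Made-up components for a single pitch; the result is roughly 90 mph.
speed_mph(vx0 = 5, vy0 = -132, vz0 = -4)
```

Because the components are measured at y = 50 feet rather than at the release point, the result should be close to, but not identical to, release_speed.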

# Scatter plot: vx0 vs. vy0
ggplot(df, aes(x = vx0, y = vy0)) +
  geom_point(alpha = 0.3, color = "orange") +
  labs(title = "Scatter Plot: vx0 vs. vy0", x = "vx0 (fps)", y = "vy0 (fps)") +
  theme_minimal()

# Histograms for ax, ay, and az
p1 <- ggplot(df, aes(x = ax)) +
  geom_histogram(bins = 30, fill = "lightblue", color = "black") +
  labs(title = "Distribution of ax", x = "ax (fps²)", y = "Frequency") +
  theme_minimal()

p2 <- ggplot(df, aes(x = ay)) +
  geom_histogram(bins = 30, fill = "lightgreen", color = "black") +
  labs(title = "Distribution of ay", x = "ay (fps²)", y = "Frequency") +
  theme_minimal()

p3 <- ggplot(df, aes(x = az)) +
  geom_histogram(bins = 30, fill = "salmon", color = "black") +
  labs(title = "Distribution of az", x = "az (fps²)", y = "Frequency") +
  theme_minimal()


p1

p2

p3

# ax vs. ay
scatter_ax_ay <- ggplot(df, aes(x = ax, y = ay)) +
  geom_point(alpha = 0.3, color = "darkblue") +
  labs(title = "Scatter Plot: ax vs. ay",
       x = "ax (fps²)", y = "ay (fps²)") +
  theme_minimal()

# ax vs. az
scatter_ax_az <- ggplot(df, aes(x = ax, y = az)) +
  geom_point(alpha = 0.3, color = "darkred") +
  labs(title = "Scatter Plot: ax vs. az",
       x = "ax (fps²)", y = "az (fps²)") +
  theme_minimal()

# ay vs. az
scatter_ay_az <- ggplot(df, aes(x = ay, y = az)) +
  geom_point(alpha = 0.3, color = "darkgreen") +
  labs(title = "Scatter Plot: ay vs. az",
       x = "ay (fps²)", y = "az (fps²)") +
  theme_minimal()

# Print scatter plots
scatter_ax_ay

scatter_ax_az

scatter_ay_az

pfx_x, pfx_z

Definition: Horizontal and vertical movement (in feet) after the release, as the ball moves toward the plate (from the catcher’s perspective).

# Summary statistics
summary(df$pfx_x)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.98000 -0.84000 -0.23000 -0.09191  0.63000  2.19000
summary(df$pfx_z)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.6500  0.1000  0.7100  0.6041  1.2900  1.9500
# Histograms
ggplot(df, aes(x = pfx_x)) +
  geom_histogram(bins = 30, fill = "lightblue", color = "black") +
  labs(title = "Distribution of pfx_x", x = "pfx_x (ft)", y = "Frequency") +
  theme_minimal()

ggplot(df, aes(x = pfx_z)) +
  geom_histogram(bins = 30, fill = "salmon", color = "black") +
  labs(title = "Distribution of pfx_z", x = "pfx_z (ft)", y = "Frequency") +
  theme_minimal()

As with the release positions above, the horizontal movement distribution is consistent with the dataset being dominated by right-handed pitchers.

plate_x, plate_z

Definition: The position of the ball as it crosses home plate from the catcher’s perspective (horizontal and vertical positions).

plate_x: This variable represents the horizontal position of the baseball as it crosses home plate. It is the left-right location relative to the center of the plate, measured in feet from the catcher’s perspective. In this case:

A value of 0 represents the center of the plate.

Negative values indicate that the ball is to the catcher’s left (the third-base side), whereas positive values indicate that it is to the catcher’s right (the first-base side).

plate_z: This variable represents the vertical position of the baseball as it crosses home plate, measured in feet above the ground. Which means:

Lower values indicate a pitch that is low.

Higher values indicate a pitch that is higher, closer to the top of the strike zone.

# Scatter plot of plate locations
ggplot(df, aes(x = plate_x, y = plate_z)) +
  geom_point(alpha = 0.3, color = "blue") +
  labs(title = "Pitch Location at Plate", x = "Plate X (ft)", y = "Plate Z (ft)") +
  theme_minimal()

# Adding density contours
ggplot(df, aes(x = plate_x, y = plate_z)) +
  geom_point(alpha = 0.3, color = "blue") +
  geom_density_2d(color = "red") +
  labs(title = "Pitch Location with Density Contours", x = "Plate X (ft)", y = "Plate Z (ft)") +
  theme_minimal()
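The plate coordinates can also be turned into a simple in-zone flag using the sz_top and sz_bot columns. The sketch below is a simplified rule under stated assumptions: it uses the 17-inch width of home plate (half-width 8.5/12 ft) and ignores the radius of the ball, so it will not match an umpire's call exactly. The function name in_zone and the toy values are hypothetical.

```r
# Hypothetical helper: TRUE when the pitch crosses the plate inside a
# rectangular zone bounded by the plate edges and the batter's sz_bot/sz_top.
in_zone <- function(plate_x, plate_z, sz_top, sz_bot, half_width = 8.5 / 12) {
  abs(plate_x) <= half_width & plate_z >= sz_bot & plate_z <= sz_top
}

# Toy pitches standing in for df (made-up values, all in feet).
toy <- data.frame(
  plate_x = c(0.10, 1.20, -0.30),
  plate_z = c(2.5, 2.5, 4.1),
  sz_top  = c(3.4, 3.4, 3.4),
  sz_bot  = c(1.6, 1.6, 1.6)
)

toy$in_zone <- with(toy, in_zone(plate_x, plate_z, sz_top, sz_bot))
toy
```

In the toy data, the first pitch is inside the zone, the second misses wide, and the third misses high. The dataset's own zone column may already encode similar information and would be worth comparing against.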

hit_distance_sc, launch_speed, launch_angle, effective_speed

Definition:

hit_distance_sc: Projected distance the ball travels after contact.

launch_speed: Exit velocity of the batted ball (as tracked by Statcast or estimates).

launch_angle: The angle at which the ball leaves the bat.

effective_speed: Derived speed that accounts for the pitcher’s release extension.

# Scatter plot: Launch Speed vs. Launch Angle
ggplot(df, aes(x = launch_speed, y = launch_angle)) +
  geom_point(alpha = 0.3, color = "brown") +
  labs(title = "Launch Speed vs. Launch Angle", x = "Launch Speed (mph)", y = "Launch Angle (°)") +
  theme_minimal()
## Warning: Removed 7978 rows containing missing values or values outside the scale range
## (`geom_point()`).

At first glance, it appears that balls hit at launch angles closer to 0 degrees tend to have higher launch speeds.

release_spin_rate, release_extension

Definition:

release_spin_rate: Spin rate of the pitch (in RPM) as tracked by Statcast.

release_extension: The pitcher’s release extension (in feet), which can affect perceived velocity.

# Boxplot by pitch type for release spin rate
# (the column is named release_spin_rate; refer to it directly inside aes())
ggplot(df, aes(x = pitch_type, y = release_spin_rate)) +
  geom_boxplot(fill = "plum", color = "darkmagenta") +
  labs(title = "Release Spin Rate by Pitch Type", x = "Pitch Type", y = "Release Spin Rate (RPM)") +
  theme_minimal()

At first glance, Curveballs appear to have a higher spin rate than the other pitch types, but they also show the most variability.

launch_speed_angle

Definition: A categorical metric based on launch speed and launch angle classified into:

1: Weak

2: Topped

3: Under

4: Flare/Burner

5: Solid Contact

6: Barrel

# Remove rows with NA in launch_speed_angle, then count and calculate percentages
launch_speed_angle_summary <- df %>% 
  filter(!is.na(launch_speed_angle)) %>% 
  count(launch_speed_angle) %>% 
  mutate(percentage = n / sum(n) * 100)

# Print the summary table
print(launch_speed_angle_summary)
##   launch_speed_angle   n percentage
## 1             Barrel 220  10.784314
## 2       Flare/Burner 486  23.823529
## 3      Solid Contact 126   6.176471
## 4             Topped 614  30.098039
## 5              Under 494  24.215686
## 6               Weak 100   4.901961
# Visualization: Bar Chart
ggplot(launch_speed_angle_summary, aes(x = launch_speed_angle, y = n, fill = launch_speed_angle)) +
  geom_bar(stat = "identity") +
  labs(title = "Distribution of Launch Speed/Angle Zones", 
       x = "Zone", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

At first glance, Topped, Under, and Flare/Burner are the most common contact categories, at roughly 30%, 24%, and 24% of classified batted balls respectively.

game_date, game_year, game_type, game_pk

Definition:

game_date: The date on which the game was played.

game_year: The year the game took place.

game_type: Game classification (e.g., Regular Season, Spring Training, Wild Card, Divisional Series, etc.).

game_pk: Unique game identifier.

# Descriptive statistics over time
game_summary <- df %>% 
  count(game_year, game_type)
print(game_summary)
##   game_year                  game_type    n
## 1      2023          Divisional Series 4116
## 2      2023 League Championship Series 3942
## 3      2023                  Wild Card 2281
## 4      2023               World Series 1490
# Visualization: Count of Pitches by Game Type
# (each row of df is one pitch, so the bars count pitches, not games)
ggplot(df, aes(x = game_type)) +
  geom_bar(fill = "coral", color = "black") +
  labs(title = "Count of Pitches by Game Type", x = "Game Type", y = "Number of Pitches") +
  theme_minimal()
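Each row of the dataset is a single pitch, so counting rows per game_type gives pitches rather than games. To count games, the unique game_pk identifiers can be counted instead. The sketch below shows both side by side on made-up toy data. The output column names pitches and games are hypothetical.

```r
library(dplyr)

# Toy rows standing in for df (made-up game identifiers).
toy <- data.frame(
  game_type = c("Wild Card", "Wild Card", "Wild Card", "World Series", "World Series"),
  game_pk   = c(1, 1, 2, 3, 3)
)

games_by_type <- toy %>%
  group_by(game_type) %>%
  summarise(
    pitches = n(),                  # number of rows, i.e. pitches
    games   = n_distinct(game_pk)   # number of distinct games
  )

games_by_type
```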

player_name, batter, pitcher, pitch_name

Definition:

player_name: The name of the player associated with the play event (may be used for context).

batter: The MLB Player Id for the batter.

pitcher: The MLB Player Id for the pitcher.

pitch_name: The pitch’s name as derived from the tracking data.

# Unique pitchers and batters
num_unique_pitchers <- df %>% distinct(pitcher) %>% nrow()
num_unique_batters  <- df %>% distinct(batter) %>% nrow()
cat("Unique Pitchers:", num_unique_pitchers, "\nUnique Batters:", num_unique_batters, "\n")
## Unique Pitchers: 136 
## Unique Batters: 148
# Top 10 most frequently occurring pitches
top_pitches <- df %>% 
  count(pitch_name, sort = TRUE) %>% 
  head(10)
print(top_pitches)
##         pitch_name    n
## 1  4-Seam Fastball 4343
## 2           Sinker 1645
## 3           Slider 1553
## 4         Changeup 1114
## 5        Curveball 1088
## 6           Cutter  754
## 7          Sweeper  600
## 8     Split-Finger  373
## 9    Knuckle Curve  337
## 10          Slurve   21
# Bar plot for top pitch names
ggplot(top_pitches, aes(x = reorder(pitch_name, n), y = n)) +
  geom_bar(stat = "identity", fill = "lightgreen", color = "darkgreen") +
  coord_flip() +
  labs(title = "Top 10 Most Frequent Pitch Names", x = "Pitch Name", y = "Frequency") +
  theme_minimal()

events, description, des

Definition:

events: The play event outcome (e.g., single, strikeout, home run).

description: A detailed description of the result of the pitch.

des: The plate appearance description from gameday.

# Remove blank
event_summary <- df %>%
  filter(!is.na(events), events != "") %>%
  count(events, sort = TRUE)

# View
print(event_summary)
##                       events    n
## 1                  field_out 1212
## 2                  strikeout  732
## 3                     single  429
## 4                       walk  266
## 5                     double  115
## 6                   home_run  110
## 7  grounded_into_double_play   56
## 8                  force_out   45
## 9               hit_by_pitch   28
## 10               field_error   15
## 11                  sac_bunt   15
## 12                   sac_fly   15
## 13                    triple    9
## 14               double_play    8
## 15       fielders_choice_out    8
## 16           fielders_choice    6
## 17        caught_stealing_2b    3
## 18            catcher_interf    2
## 19                 other_out    2
## 20     strikeout_double_play    2
# Plot 
ggplot(event_summary, aes(x = reorder(events, n), y = n)) +
  geom_bar(stat = "identity", fill = "#4E79A7") +
  coord_flip() +
  labs(title = "Frequency of Events (Excluding NAs and Blank Entries)",
       x = "Event",
       y = "Count") +
  theme_minimal()

At first glance, field outs are the most common event in the dataset, followed by strikeouts.

stand, p_throws

Definition:

stand: Indicates the batter’s stance (left/right).

p_throws: Indicates the pitcher’s throwing hand (left/right).

What to Explore:

Count frequencies

Compare performance metrics across handedness

# Frequency counts
batter_stand <- df %>% count(stand)
pitcher_hand <- df %>% count(p_throws)
print(batter_stand)
##   stand    n
## 1     L 5095
## 2     R 6734
print(pitcher_hand)
##   p_throws    n
## 1        L 2882
## 2        R 8947
# Visualization: Side-by-side bar plots
ggplot(df, aes(x = stand, fill = p_throws)) +
  geom_bar(position = "dodge") +
  labs(title = "Batter Stance by Pitcher Handedness", x = "Batter Stance", y = "Count") +
  theme_minimal()

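As mentioned above, right-handers dominate: about 76% of pitches were thrown by right-handed pitchers (8,947 of 11,829), and about 57% were faced by batters standing right-handed (6,734 of 11,829).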

home_team, away_team, score variables

Definition:

home_team & away_team: Abbreviations representing the teams playing.

home_score & away_score: Pre-pitch scores for the home and away teams, respectively.

bat_score & fld_score: The score of the batting and fielding teams before the pitch (useful for identifying which team is at bat).

post_home_score, post_away_score, post_bat_score, post_fld_score: The scores immediately after the pitch event.

# Visualization: Histogram of score change
ggplot(df, aes(x = score_change)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "black") +
  labs(title = "Histogram of Score Change per Pitch", x = "Score Change", y = "Frequency") +
  theme_minimal()

At first glance, it appears that most individual pitches have 0 impact on the score.

inning, inning_topbot, outs_when_up, at_bat_number, pitch_number

Definition:

inning: The inning number for the pitch event.

inning_topbot: Indicates whether it is the top or bottom of the inning.

outs_when_up: The number of outs in the inning prior to the pitch.

at_bat_number: The plate appearance number within the game (a running count across both teams).

pitch_number: The pitch count within the plate appearance.

# Visualizing inning distribution
ggplot(df, aes(x = factor(inning))) +
  geom_bar(fill = "orchid", color = "black") +
  labs(title = "Distribution of Innings", x = "Inning", y = "Count") +
  theme_minimal()

At first glance, very few pitches were thrown in the 10th or 11th inning, so extra innings were rare.

HC Coordinates (hc_x, hc_y)

Definition: The coordinates of the ball when hit, representing the location where the batted ball landed.

# Scatter plot of hit coordinates
ggplot(df, aes(x = hc_x, y = hc_y)) +
  geom_point(alpha = 0.3, color = "darkred") +
  labs(title = "Hit Coordinates (hc_x vs. hc_y)", x = "hc_x", y = "hc_y") +
  theme_minimal()
## Warning: Removed 9787 rows containing missing values or values outside the scale range
## (`geom_point()`).

At first glance, it appears that most batted balls land close to the first ring visible in the plot.

Advanced Performance Metrics

estimated_ba_using_speedangle: Estimated batting average based on launch speed and angle.

estimated_woba_using_speedangle: Estimated wOBA (weighted on-base average) using launch speed and angle.

woba_value, woba_denom: The wOBA value and its corresponding denominator.

babip_value: Batting Average on Balls In Play.

iso_value: Isolated Power – extra-base power indicator.

delta_home_win_exp, delta_run_exp: The change in win expectancy and run expectancy before and after the pitch, respectively.

# Boxplot: wOBA value by pitch type
ggplot(df, aes(x = pitch_type, y = woba_value)) +
  geom_boxplot(fill = "lightcyan", color = "blue") +
  labs(title = "wOBA by Pitch Type", x = "Pitch Type", y = "wOBA Value") +
  theme_minimal()
## Warning: Removed 8751 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

At first glance, the wOBA appears to be similar for most pitches.
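A boxplot of woba_value shows the per-pitch values, but an overall wOBA for a group is usually obtained by dividing the summed woba_value by the summed woba_denom. The sketch below assumes that convention and uses made-up toy data. The output column names pa_counted and woba are hypothetical.

```r
library(dplyr)

# Toy rows standing in for df (made-up values; NA marks pitches that did
# not end a plate appearance).
toy <- data.frame(
  pitch_type = c("FF", "FF", "FF", "SL", "SL"),
  woba_value = c(0.9, 0, NA, 2, 0),
  woba_denom = c(1, 1, NA, 1, 1)
)

woba_by_pitch <- toy %>%
  group_by(pitch_type) %>%
  summarise(
    pa_counted = sum(woba_denom, na.rm = TRUE),
    woba       = sum(woba_value, na.rm = TRUE) / sum(woba_denom, na.rm = TRUE)
  )

woba_by_pitch
```

Reporting pa_counted alongside the ratio makes it clear how many plate appearances each group's wOBA is based on.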

fielder_2 to fielder_9

Definition: These fields store the MLB Player Ids for the players involved in fielding:

fielder_2: Typically the catcher

fielder_3: First base

fielder_4: Second base

fielder_5: Third base

fielder_6: Shortstop

fielder_7: Left field

fielder_8: Center field

fielder_9: Right field

# Frequency for fielder 2 (Catcher)
fielder2_summary <- df %>% 
  count(fielder_2, sort = TRUE)
print(fielder2_summary)
##    fielder_2    n
## 1     641680 2390
## 2     672515 2237
## 3     592663 1758
## 4     455117 1452
## 5     680777  839
## 6     668939  498
## 7     669257  475
## 8     518595  345
## 9     661388  332
## 10    650907  283
## 11    669221  264
## 12    672386  261
## 13    673237  164
## 14    645444  160
## 15    663743  142
## 16    607732  140
## 17    595978   36
## 18    542194   32
## 19    596117   21

if_fielding_alignment, of_fielding_alignment

Definition: These variables indicate the infield and outfield defensive alignments at the time of the pitch.

# Frequency of different alignments
table(df$if_fielding_alignment)
## 
##               Infield shade      Standard     Strategic 
##            19          3186          7952           672
table(df$of_fielding_alignment)
## 
##            Standard Strategic 
##        19     11525       285

spin_axis

Definition: The spin axis in degrees (0 to 360) measured in the 2D X-Z plane. For example, 180° represents a pure backspin fastball, and 0° a pure topspin curveball.

# Summary and histogram for spin_axis
summary(df$spin_axis)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0   130.0   204.0   175.9   223.0   359.0
ggplot(df, aes(x = spin_axis)) +
  geom_histogram(bins = 30, fill = "thistle", color = "black") +
  labs(title = "Distribution of Spin Axis", x = "Spin Axis (degrees)", y = "Frequency") +
  theme_minimal()
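Because spin_axis is an angle that wraps around at 360°, the arithmetic mean reported by summary() can be misleading: 350° and 10° are only 20° apart, yet their arithmetic mean is 180°. A circular mean avoids this by averaging the sine and cosine of each angle. The sketch below is one common way to compute it. The function name circular_mean_deg is hypothetical.

```r
# Hypothetical helper: circular mean of angles given in degrees,
# returned in the range [0, 360).
circular_mean_deg <- function(deg) {
  rad <- deg[!is.na(deg)] * pi / 180
  mean_angle <- atan2(mean(sin(rad)), mean(cos(rad)))
  (mean_angle * 180 / pi) %% 360
}

# Two angles that straddle the 0/360 boundary.
mean(c(350, 10))               # arithmetic mean: 180
circular_mean_deg(c(350, 10))  # circular mean: approximately 0 (equivalently 360)
```

Applying circular_mean_deg(df$spin_axis), ideally per pitch type, would give a more meaningful "typical" axis than the 175.9 shown in the summary.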

Data Use Cases (1)

Analysis of Pitches on Game Outcomes

Idea Description

The idea here is to analyze how individual pitch events affect game outcome variables such as run expectancy and win expectancy. By analysing variables like delta_home_win_exp and delta_run_exp, you can assess which pitches or events lead to significant shifts in the game state. This analysis could be valuable for both coaching staff and team management by providing insight into which game situations (or pitch types) are most impactful.

Preliminary Data Visualisations and Statistics

Win Expectancy Change Analysis: Firstly, I plotted the distribution of win expectancy changes to understand how often different pitch outcomes lead to large swings in game situations

ggplot(df, aes(x = delta_home_win_exp)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Home Win Expectancy Change", x = "Delta Home Win Exp", y = "Frequency") +
  theme_minimal()

At first glance, most pitches do not have a large impact: the majority leave win expectancy unchanged. Here, a positive change means the pitch increased the home team’s probability of winning, and a negative change means it reduced that probability. A value close to zero means the pitch had little effect on win expectancy.
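The histogram can be backed up with a single number: the share of pitches whose win expectancy change is exactly zero. A minimal sketch follows, using a made-up vector in place of df$delta_home_win_exp. The helper name zero_share is hypothetical.

```r
# Hypothetical helper: proportion of non-missing values that are exactly 0.
zero_share <- function(delta) {
  mean(delta == 0, na.rm = TRUE)
}

# Toy values standing in for df$delta_home_win_exp (made up).
toy_delta <- c(0, 0, 0.012, -0.035, 0, NA)

zero_share(toy_delta)
```

In the toy vector, three of the five non-missing values are zero, so the share is 0.6. Running zero_share(df$delta_home_win_exp) would quantify the spike at zero in the histogram.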

Run Expectancy Change by Pitch Type: Secondly, I wanted to visualize which pitch types were associated with larger changes in run expectancy.

run_exp_change <- df %>% 
  filter(!is.na(pitch_type)) %>% 
  group_by(pitch_type) %>% 
  summarise(avg_run_exp_change = mean(delta_run_exp, na.rm = TRUE),
            count = n())

ggplot(run_exp_change, aes(x = reorder(pitch_type, avg_run_exp_change), y = avg_run_exp_change, fill = pitch_type)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Average Run Expectancy Change by Pitch Type", x = "Pitch Type", y = "Avg Run Exp Change") +
  theme_minimal() +
  theme(legend.position = "none")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).

As the previous visualization suggested, individual pitches change run expectancy only slightly on average. Even so, Changeups have the highest average run expectancy change of any pitch type (0.022), while Slurves have a negative average (-0.035). The Slurve figure rests on only 21 pitches, so it should be read with caution.

The value of roughly 0.02 means that, on average, a changeup is associated with an increase of 0.02 in the expected number of runs. To put this in perspective, if a team was expected to score 0.50 runs before the pitch, the expectation after a changeup rises slightly to 0.52 runs on average.
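Averages of this size are easier to judge when they are shown with the number of pitches and a standard error. The sketch below adds both to the per-pitch-type summary, using made-up toy values chosen only to illustrate the calculation. The column names mean_change and se are hypothetical.

```r
library(dplyr)

# Toy rows standing in for df (made-up values for illustration).
toy <- data.frame(
  pitch_type    = c("CH", "CH", "CH", "CH", "SV", "SV"),
  delta_run_exp = c(0.05, -0.02, 0.03, 0.02, -0.10, 0.03)
)

run_exp_summary <- toy %>%
  group_by(pitch_type) %>%
  summarise(
    n           = sum(!is.na(delta_run_exp)),            # pitches with a value
    mean_change = mean(delta_run_exp, na.rm = TRUE),
    se          = sd(delta_run_exp, na.rm = TRUE) / sqrt(n)  # standard error
  )

run_exp_summary
```

A group with few pitches will show a large standard error relative to its mean, which signals that the average is not yet reliable.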

This matters for coaches and teams: understanding which pitch types are more likely to improve (or worsen) the game situation supports better-informed decisions. For analysts, it provides a way to quantify the effectiveness of each pitch type beyond traditional measures such as strikeouts or batting average.

Outcome Analysis by Game Situation: Lastly, I wanted to examine how the score changes (or win expectancy changes) vary by different game contexts (in this case innings).

# Boxplot of score changes by inning
ggplot(df, aes(x = factor(inning), y = score_change)) +
  geom_boxplot(fill = "lightcoral", color = "black") +
  labs(title = "Score Change by Inning", x = "Inning", y = "Score Change") +
  theme_minimal()

Expected Outcomes / Research Hypotheses

Hypothesis 1: Pitch events with larger run expectancy changes correspond to pitches in specific zones (such as those just outside the batter’s strike zone that result in weak contact).

Hypothesis 2: The impact of individual pitch outcomes on win expectancy is more noticeable in high-leverage situations (late innings or during close games).

Expected Outcomes: The identification of pitch characteristics or situations that coaches can use to improve their strategy. These insights could help in evaluating pitcher performance in critical moments and inform in-game decisions.
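Hypothesis 2 needs a working definition of a high-leverage situation. The sketch below uses one possible rule, which is an assumption and not part of the dataset: the 7th inning or later with the score within one run. It then compares the average absolute win expectancy swing inside and outside that group. The column names high_leverage and mean_abs_swing are hypothetical.

```r
library(dplyr)

# Toy rows standing in for df (made-up values for illustration).
toy <- data.frame(
  inning             = c(2, 8, 9, 3, 9),
  bat_score          = c(0, 3, 4, 5, 1),
  fld_score          = c(0, 3, 5, 0, 7),
  delta_home_win_exp = c(0.01, -0.12, 0.20, 0.00, 0.005)
)

leverage_summary <- toy %>%
  mutate(
    # Assumed rule: late inning and a one-run (or tied) game.
    high_leverage = inning >= 7 & abs(bat_score - fld_score) <= 1
  ) %>%
  group_by(high_leverage) %>%
  summarise(
    n              = n(),
    mean_abs_swing = mean(abs(delta_home_win_exp), na.rm = TRUE)
  )

leverage_summary
```

The thresholds (inning 7, one run) are arbitrary starting points and could be varied to see how sensitive the comparison is.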

Limitations or Issues in the Data

Contextual Details: While the dataset includes some context (score, inning, outs), deeper factors such as batter tendencies or detailed defensive positioning are not captured.

Data Representativeness: Postseason data typically involves high-caliber teams and pitchers. While interesting, conclusions drawn might not generalize to regular-season scenarios.

Data Use Cases (2)

Predicting Batted Ball Outcomes Based on Pitch Characteristics

Idea Description

The idea here is to investigate how various pitch attributes (such as pitch type, release speed, pitch location, and movement) influence the quality of contact and the resulting batted ball outcome (using metrics like exit velocity, launch angle, and advanced statistics such as estimated BABIP, wOBA, or ISO). Essentially, the goal is to develop a predictive framework that links the details of a pitch to how well (or poorly) the ball is hit.

Key Aspects of the Study:

Inputs:

Pitch characteristics including:

Pitch Type: (CH, CU, FF, etc.)

Release Speed: Velocity at which the pitch is thrown.

Pitch Location: Represented by plate_x and plate_z.

Movement Variables: Such as pfx_x and pfx_z or the acceleration measures.

Outcomes:

Batted ball outcomes measured with variables such as:

Launch Speed: How fast the ball leaves the bat.

Launch Angle: The vertical angle at which the ball is hit.

Hit Distance: The projected distance the ball travels.

Advanced Metrics: Such as estimated BABIP, ISO, and wOBA values (which relate to the quality of the contact).
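The inputs and outcomes listed above could be tied together with a simple baseline model before anything more elaborate is tried. The following is only a sketch under stated assumptions: it fits an ordinary linear regression with lm() (chosen here for illustration, not necessarily the final method), uses launch_speed as the single outcome, and runs on simulated data so the block is self-contained.

```r
# Simulated stand-in for the pitch data (values are random, not real)
set.seed(42)
n <- 200
sim <- data.frame(
  pitch_type    = factor(sample(c("FF", "CH", "CU"), n, replace = TRUE)),
  release_speed = rnorm(n, mean = 90, sd = 5),
  plate_x       = rnorm(n, mean = 0, sd = 0.8),
  plate_z       = rnorm(n, mean = 2.5, sd = 0.8),
  pfx_x         = rnorm(n, mean = 0, sd = 0.7),
  pfx_z         = rnorm(n, mean = 0.6, sd = 0.7)
)
sim$launch_speed <- rnorm(n, mean = 85, sd = 12)

# Baseline: launch speed as a linear function of the pitch attributes.
# lm() drops rows with missing values by default (na.action = na.omit).
fit <- lm(
  launch_speed ~ pitch_type + release_speed + plate_x + plate_z + pfx_x + pfx_z,
  data = sim
)

summary(fit)$r.squared   # share of variance explained
coef(fit)                # one coefficient per predictor / pitch-type level
```

With the real data, df restricted to batted balls would replace sim. Because the simulated outcome is pure noise, the fit here says nothing about baseball; it only shows the shape of the model.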

Preliminary Data Visualizations and Statistics

Before building the predictive model, I explored the relationship between pitch characteristics and batted ball outcomes.

Correlation Analysis

Scatter plot to see how pitch speed (release_speed) relates to exit velocity (launch_speed):

ggplot(df, aes(x = release_speed, y = launch_speed)) +
  geom_point(alpha = 0.3, color = "darkblue") +
  labs(title = "Release Speed vs. Launch Speed",
       x = "Release Speed (mph)", y = "Launch Speed (mph)") +
  theme_minimal()
## Warning: Removed 7978 rows containing missing values or values outside the scale range
## (`geom_point()`).

At first glance, there doesn’t appear to be a correlation. The points are widely scattered, though there seems to be a cluster around a 95 mph release speed and an 80 mph launch speed.
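The visual impression can be checked numerically. As a hedged sketch, the Pearson correlation can be computed on the rows where both values are present; the helper name speed_correlation is hypothetical, and the toy values are made up so the block runs without the CSV.

```r
# Hypothetical helper: Pearson correlation between release speed and
# launch speed, ignoring rows where either value is missing
# (launch_speed is NA for pitches that were not put in play).
speed_correlation <- function(data) {
  cor(data$release_speed, data$launch_speed, use = "complete.obs")
}

# Made-up values for illustration only
toy <- data.frame(
  release_speed = c(95.1, 88.4, 91.0, 79.5, 96.3, 84.2),
  launch_speed  = c(101.2, NA, 76.5, 88.0, NA, 92.3)
)

r <- speed_correlation(toy)
r

# With the real data: speed_correlation(df), or cor.test() for a p-value
```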

Distribution by Pitch Type

Visualization of the distribution of launch speeds across different pitch types:

# Filter NA
df_filtered <- df %>% 
  filter(!is.na(pitch_type), !is.na(launch_speed))

ggplot(df_filtered, aes(x = pitch_type, y = launch_speed, fill = pitch_type)) +
  geom_boxplot() +
  labs(title = "Launch Speed by Pitch Type",
       x = "Pitch Type", y = "Launch Speed (mph)") +
  theme_minimal() +
  theme(legend.position = "none")

At first glance, launch speeds appear to center around 85 mph for every pitch type except SV, which is much lower.
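A per-pitch-type table of counts and medians helps judge how much weight to give each box, since a pitch type with only a handful of batted balls (possibly the case for SV) can look like an outlier by chance. A minimal sketch, using made-up rows so it runs on its own; with the real data, df_filtered would take the place of toy.

```r
library(dplyr)

# Made-up rows for illustration only (not real Statcast values)
toy <- tibble(
  pitch_type   = c("FF", "FF", "FF", "CH", "CH", "SV"),
  launch_speed = c(95.0, 88.0, 102.0, 84.0, 79.0, 61.0)
)

# Count and median launch speed per pitch type, rarest types first
pitch_summary <- toy %>%
  group_by(pitch_type) %>%
  summarise(
    n_batted_balls = n(),
    median_launch  = median(launch_speed),
    .groups = "drop"
  ) %>%
  arrange(n_batted_balls)

pitch_summary
```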

Pitch Location and Batted Ball Quality

Examining how pitch location relates to exit velocity. The plot uses horizontal location (plate_x); vertical location (plate_z) can be examined the same way:

ggplot(df, aes(x = plate_x, y = launch_speed)) +
  geom_point(alpha = 0.3, color = "darkgreen") +
  labs(title = "Pitch Location vs. Launch Speed",
       x = "Plate X (ft)", y = "Launch Speed (mph)") +
  theme_minimal()
## Warning: Removed 7978 rows containing missing values or values outside the scale range
## (`geom_point()`).
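Because plate_x alone ignores pitch height, a two-dimensional view may be more informative. The sketch below bins pitches by plate_x and plate_z and colors each bin by mean launch speed using ggplot2's stat_summary_2d(); the bin count and the simulated data are assumptions made so the block is self-contained.

```r
library(ggplot2)

# Simulated stand-in for batted-ball rows (random values, not real data)
set.seed(1)
sim <- data.frame(
  plate_x      = rnorm(500, mean = 0, sd = 0.8),
  plate_z      = rnorm(500, mean = 2.5, sd = 0.8),
  launch_speed = rnorm(500, mean = 85, sd = 12)
)

# Mean launch speed within each (plate_x, plate_z) bin
p <- ggplot(sim, aes(x = plate_x, y = plate_z, z = launch_speed)) +
  stat_summary_2d(fun = mean, bins = 10) +
  scale_fill_viridis_c(name = "Mean launch\nspeed (mph)") +
  coord_fixed() +
  labs(title = "Mean Launch Speed by Pitch Location",
       x = "Plate X (ft)", y = "Plate Z (ft)") +
  theme_minimal()

p
```

With the real data, df_filtered (or df with missing launch_speed rows removed) would replace sim. Bins with few pitches will have noisy means, so the bin count is worth tuning.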

Expected Outcomes / Research Hypotheses

Hypothesis 1: Certain pitch types (for example, a four-seam fastball) are more likely to lead to higher exit velocities because of their higher release speeds and more predictable location in the strike zone.

Hypothesis 2: The location of the pitch is key: pitches thrown in the heart of the strike zone (moderate values of plate_x and plate_z) are expected to produce more solid contact than pitches on the edges, leading to higher launch speeds and more optimal launch angles.

Expected Outcome: By analyzing these relationships, one might be able to identify specific pitch characteristics that consistently yield better contact with the ball. The study could result in valuable insights, such as recommended pitch selection under certain circumstances or situational adjustments that maximize the likelihood of a weakly hit ball, ultimately guiding both pitching strategies and hitter approaches.
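Testing the location hypothesis needs an operational definition of "heart" versus "edge". The sketch below is one possible definition, not an official one: it assumes the Statcast columns plate_x, plate_z, sz_top, and sz_bot, treats the plate as 17 inches wide (about 0.71 ft on either side of center), and calls the central half of the zone in each direction the heart. The function name classify_zone and the made-up pitches are for illustration only.

```r
# Hypothetical helper: label each pitch as "heart", "edge", or "outside".
# Assumptions: plate_x/plate_z are in feet, sz_top/sz_bot give the batter's
# zone height, and "heart" is the central half of the zone in each direction.
classify_zone <- function(plate_x, plate_z, sz_top, sz_bot,
                          half_width = 17 / 12 / 2) {
  mid_z  <- (sz_top + sz_bot) / 2   # vertical center of the zone
  half_h <- (sz_top - sz_bot) / 2   # half of the zone height

  in_zone  <- abs(plate_x) <= half_width & abs(plate_z - mid_z) <= half_h
  in_heart <- abs(plate_x) <= half_width / 2 &
    abs(plate_z - mid_z) <= half_h / 2

  ifelse(in_heart, "heart", ifelse(in_zone, "edge", "outside"))
}

# Made-up pitches for illustration only
zones <- classify_zone(
  plate_x = c(0.0, 0.6, 1.2, -0.2),
  plate_z = c(2.5, 2.5, 2.5, 3.4),
  sz_top  = c(3.5, 3.5, 3.5, 3.5),
  sz_bot  = c(1.5, 1.5, 1.5, 1.5)
)
zones
```

With the real data, the label could be added with mutate(zone = classify_zone(plate_x, plate_z, sz_top, sz_bot)) and then compared against launch_speed and launch_angle by group.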

Discussion of Limitations or Issues in the Data

Contextual Details: While the dataset includes some context (score, inning, outs), deeper factors such as batter tendencies or defensive positioning are not captured.

Data Representativeness: Postseason data typically involves high-caliber teams and pitchers. While interesting, conclusions drawn might not generalize to regular-season scenarios.

Measurement Error and Estimation: Variables like estimated BABIP or wOBA are derived from launch speed and launch angle, so they may carry considerable noise or estimation error, which limits their precision.

Causation vs. Correlation: It can be challenging to determine causality in observational data. While you may find significant correlations between pitch characteristics and batted ball outcomes, establishing causation may require more controlled experiments or adjustments for confounding variables.