/— title: “KAA and KAA+ by James Orange” output: html_notebook — On 9/16/2024, Tangotiger proposed the development of Strikeouts Above Average (KAA) and an Index (KAA+). This is my attempt to create this.

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(Lahman)
## Warning: package 'Lahman' was built under R version 4.3.2

Let’s take a look at the lahman data:

pitching <- Lahman::Pitching 

pitching <- pitching %>%
  mutate (K_p_BF = SO/BFP)

We see the pitching data came in, and we have K’s (SO) and Batters Faced (BFP). Instead of simply doing K’s above average, we’ll also look at K’s per Batter Faced above Average, in order to compare relief/starting/“opener” pitchers. Additionally, with the change in strategy/deployment of all these positions year over year, might better be compared by looking at a per batter faced statistic, instead of K’s in general.

Next, let’s build a data frame which takes the average by year:

# Create a new dataframe with the average K_p_BF by yearID
MeanKperBatterFacedbyYear <- pitching %>%
  group_by(yearID) %>%
  summarise(Mean_K_p_BF = mean(K_p_BF, na.rm = TRUE),
            Mean_K = mean(SO, na.rm = TRUE) )  # Use na.rm = TRUE to remove NA values from the calculation

# View the new dataframe
print(MeanKperBatterFacedbyYear)
## # A tibble: 152 × 3
##    yearID Mean_K_p_BF Mean_K
##     <int>       <dbl>  <dbl>
##  1   1871     0.00903   9.21
##  2   1872     0.0120   11.0 
##  3   1873     0.0125   11.1 
##  4   1874     0.0105   21   
##  5   1875     0.0175   14.4 
##  6   1876     0.0353   17.3 
##  7   1877     0.0447   38.2 
##  8   1878     0.0591   49.1 
##  9   1879     0.0831   70.9 
## 10   1880     0.0661   60.3 
## # ℹ 142 more rows

Looking at the data, we see the mean K’s for 2020 down, which makes sense, considering only 60 games per team were played. This gives more evidence that the per batter faced is the correct statistic to use when developing a KAA and KAA+ index.

Now let’s quickly visualize these statistics using a line graph to get an idea of what mean K’s and mean K ber Batter Faced looks like over time:

# Using ggplot2 to create side by side line graphs
p1 <- ggplot(MeanKperBatterFacedbyYear, aes(x = yearID, y = Mean_K_p_BF)) +
  geom_line(color = "blue") +
  theme_minimal() +
  labs(title = "Mean K per Batter Faced by Year",
       x = "Year",
       y = "Mean K per Batter Faced") +
  scale_x_continuous(breaks = seq(min(MeanKperBatterFacedbyYear$yearID), 
                                  max(MeanKperBatterFacedbyYear$yearID), 
                                  by = 15)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p2 <- ggplot(MeanKperBatterFacedbyYear, aes(x = yearID, y = Mean_K)) +
  geom_line(color = "red") +
  theme_minimal() +
  labs(title = "Mean Strikeouts by Year",
       x = "Year",
       y = "Mean Strikeouts") +
  scale_x_continuous(breaks = seq(min(MeanKperBatterFacedbyYear$yearID), 
                                  max(MeanKperBatterFacedbyYear$yearID), 
                                  by = 15)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Arrange the plots side by side
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(p1, p2, ncol = 2)

It looks like everything is okay, although there seems to be a big spike around 1915 for Mean K per Batter Faced, whereas for Mean Strikeouts, we see the dip in 2020 with the shortened season. We can see the upward trend for Mean Strikeout per Batter Faced, whereas for Mean strikeouts, this trend doesn’t seem to be as apparent. We could run a time series regressing in the future for further analysis. However, the purpose here is not prediction of future values, rather the creation of a new metric which would allow for comparison over time/position.

Let’s dig into the spike in 1915 to figure out why this is occurring in the data, and then we can proceed with creation of the new metric.

# Filter the dataframe for the desired yearIDs
filtered_years <- MeanKperBatterFacedbyYear %>%
  filter(yearID %in% c(1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919))

# Print the filtered dataframe
print(filtered_years)
## # A tibble: 10 × 3
##    yearID Mean_K_p_BF Mean_K
##     <int>       <dbl>  <dbl>
##  1   1910      0.0976   46.0
##  2   1911      0.0944   43.5
##  3   1912      0.0888   38.8
##  4   1913      0.0920   39.6
##  5   1914    Inf        52.8
##  6   1915    Inf        45.7
##  7   1916      0.0960   48.1
##  8   1917      0.0881   48.3
##  9   1918      0.0730   27.7
## 10   1919      0.0736   30.5

We can see here, that Mean_K_p_BF is absent for 1914 and 1915. Let’s take a look and see why this might be.

# Filter the dataframe for the desired yearIDs
filtered_years_pitching <- pitching %>%
  filter(yearID %in% c(1914, 1915))

# Print the filtered dataframe
print(head(filtered_years_pitching))
##    playerID yearID stint teamID lgID  W  L  G GS CG SHO SV IPouts   H ER HR BB
## 1 adamsba01   1914     1    PIT   NL 13 16 40 35 19   3  1    849 253 79  5 39
## 2 adamsda01   1914     1    KCF   FL  4  9 36 14  6   0  3    408 141 53  3 52
## 3 adamska01   1914     1    CIN   NL  0  0  4  0  0   0  0     24  14  8  0  5
## 4 adamswi01   1914     1    PTF   FL  1  1 15  2  1   0  2    166  70 23  4 22
## 5 aitchra01   1914     1    BRO   NL 12  7 26 17  8   3  0    517 156 51  4 60
## 6 alexape01   1914     1    PHI   NL 27 15 46 39 32   6  1   1065 327 94  8 76
##    SO BAOpp  ERA IBB WP HBP BK  BFP GF   R SH SF GIDP     K_p_BF
## 1  91  0.24 2.51  NA  2   7  0 1116  4  97 NA NA   NA 0.08154122
## 2  38  9.99 3.51  NA  5   7  0  576 16  67 NA NA   NA 0.06597222
## 3   5  0.42 9.00  NA  0   0  0   41  4  10 NA NA   NA 0.12195122
## 4  14  9.99 3.74  NA  2   1  0  238 10  29 NA NA   NA 0.05882353
## 5  87  0.24 2.66  NA  9   3  0  720  6  71 NA NA   NA 0.12083333
## 6 214  0.24 2.38  NA  1  11  0 1459  7 133 NA NA   NA 0.14667581

It appears we have values for K_p_BF, let’s see if we can narrow this down a little more.

# Check for NA values in BFP for 1914 and 1915
na_entries <- pitching %>%
  filter(yearID %in% c(1914, 1915) & is.na(K_p_BF))

print(na_entries)
##     playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts  H ER HR BB SO
## 1  hugheve01   1914     1    BLF   FL 0 0 3  0  0   0  0     17  5  2  0  3  0
## 2  ostenfr01   1914     1    IND   FL 0 0 1  0  0   0  0      6  5  5  0  2  0
## 3  perrige01   1914     1    KCF   FL 0 0 1  0  0   0  0      2  2  1  0  1  0
## 4  porraed01   1914     1    BUF   FL 0 1 3  3  0   0  0     31 18  5  2  7  0
## 5  shermda01   1914     1    CHF   FL 0 1 1  1  0   0  0      1  0  0  0  2  0
## 6  vernojo01   1914     1    BRF   FL 0 0 1  1  0   0  0     10  4  4  0  5  0
## 7  harribe02   1915     1    KCF   FL 0 0 1  0  0   0  0      6  1  0  0  0  0
## 8  miljujo01   1915     1    PTF   FL 0 0 1  0  0   0  0      3  1  0  0  0  0
## 9  smithbo01   1915     1    BUF   FL 0 0 1  0  0   0  0      3  1  2  0  2  0
## 10 whitegi01   1915     1    NEW   FL 0 0 1  0  0   0  0      3  0  0  0  1  0
## 11 wilheka01   1915     1    BLF   FL 0 0 1  0  0   0  0      3  0  0  0  0  0
##    BAOpp   ERA IBB WP HBP BK BFP GF R SH SF GIDP K_p_BF
## 1   9.99  3.18  NA  2   0  0   0  2 4 NA NA   NA    NaN
## 2   9.99 22.50  NA  0   1  0   0  1 5 NA NA   NA    NaN
## 3   9.99 13.50  NA  0   0  0   0  1 1 NA NA   NA    NaN
## 4   9.99  4.35  NA  0   0  0   0  0 9 NA NA   NA    NaN
## 5   9.99  0.00  NA  0   0  0   0  0 2 NA NA   NA    NaN
## 6   9.99 10.80  NA  0   0  0   0  0 4 NA NA   NA    NaN
## 7   9.99  0.00  NA  0   0  0   0  1 0 NA NA   NA    NaN
## 8   9.99  0.00  NA  0   0  0   0  1 0 NA NA   NA    NaN
## 9   9.99 18.00  NA  0   1  0   0  1 2 NA NA   NA    NaN
## 10  9.99  0.00  NA  0   0  0   0  1 0 NA NA   NA    NaN
## 11  9.99  0.00  NA  0   0  0   0  0 0 NA NA   NA    NaN

Here we see quite a few NaN values, with a BFP and strikeouts of 0 (dividing by 0). To handle this, instead of trying to use a predicted value or waited average for these 11 total observations, we’ll excluded them from the data set, since their exclusion is unlikely to meaningfully change the performance data.

Next, let’s use the new mean statistics by year to develop KAA, KAA+, Strikeout per Batter Faced Above Average (KBFAA), and Index (KBFAA+)

# Remove rows where K_p_BF is NaN or NA
pitching_clean <- pitching %>%
  filter(!is.na(K_p_BF) & !is.nan(K_p_BF) & !is.infinite(K_p_BF))  # This will remove both NaN and NA

# Check for NA values in BFP for 1914 and 1915
na_entries <- pitching_clean %>%
  filter(yearID %in% c(1914, 1915) & is.na(K_p_BF))

print(na_entries)
##  [1] playerID yearID   stint    teamID   lgID     W        L        G       
##  [9] GS       CG       SHO      SV       IPouts   H        ER       HR      
## [17] BB       SO       BAOpp    ERA      IBB      WP       HBP      BK      
## [25] BFP      GF       R        SH       SF       GIDP     K_p_BF  
## <0 rows> (or 0-length row.names)

Okay, now let’s examine the data frame and visualize it again to see what it looks like, cleaned.

MeanKperBatterFacedbyYear <- pitching_clean %>%
  group_by(yearID) %>%
  summarise(Mean_K_p_BF = mean(K_p_BF, na.rm = TRUE),
            Mean_K = mean(SO, na.rm = TRUE) )  # Use na.rm = TRUE to remove NA values from the calculation

# View the new dataframe
print(MeanKperBatterFacedbyYear)
## # A tibble: 152 × 3
##    yearID Mean_K_p_BF Mean_K
##     <int>       <dbl>  <dbl>
##  1   1871     0.00903   9.21
##  2   1872     0.0120   11.0 
##  3   1873     0.0125   11.1 
##  4   1874     0.0105   21   
##  5   1875     0.0175   14.4 
##  6   1876     0.0353   17.3 
##  7   1877     0.0447   38.2 
##  8   1878     0.0591   49.1 
##  9   1879     0.0831   70.9 
## 10   1880     0.0661   60.3 
## # ℹ 142 more rows

Okay, evaluating 1914 and 1915 we can see Mean_K_p_BF seems to move in step with Mean K, which is what we would expect after removal. Let’s also visualize the cleaned data quickly to ensure we have removed the spike from K’s per Batter Faced:

# Using ggplot2 to create side by side line graphs
p1 <- ggplot(MeanKperBatterFacedbyYear, aes(x = yearID, y = Mean_K_p_BF)) +
  geom_line(color = "blue") +
  theme_minimal() +
  labs(title = "Mean K per Batter Faced by Year",
       x = "Year",
       y = "Mean K per Batter Faced") +
  scale_x_continuous(breaks = seq(min(MeanKperBatterFacedbyYear$yearID), 
                                  max(MeanKperBatterFacedbyYear$yearID), 
                                  by = 15)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p2 <- ggplot(MeanKperBatterFacedbyYear, aes(x = yearID, y = Mean_K)) +
  geom_line(color = "red") +
  theme_minimal() +
  labs(title = "Mean Strikeouts by Year",
       x = "Year",
       y = "Mean Strikeouts") +
  scale_x_continuous(breaks = seq(min(MeanKperBatterFacedbyYear$yearID), 
                                  max(MeanKperBatterFacedbyYear$yearID), 
                                  by = 15)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Arrange the plots side by side
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)

Since this is what we expected, and don’t have any further cause for concern, we can proceed with creation of the KAA, KAA+, KBFAA, and KBFAA+ statistics.I am going to elect to keep the dip in the mean K’s per year for 2020, rather than trying to normalize the entire data set for the year. We will see what the results look like.

The first step in creating these statistics, will be to add the average to the pitching_clean dataframe through a join.

pitching_clean <- pitching_clean %>%
  left_join(MeanKperBatterFacedbyYear, by = "yearID")

# View the updated pitching_clean data frame
head(pitching_clean)
##    playerID yearID stint teamID lgID  W  L  G GS CG SHO SV IPouts   H  ER HR BB
## 1 bechtge01   1871     1    PH1   NA  1  2  3  3  2   0  0     78  43  23  0 11
## 2 brainas01   1871     1    WS3   NA 12 15 30 30 30   0  0    792 361 132  4 37
## 3 fergubo01   1871     1    NY2   NA  0  0  1  0  0   0  0      3   8   3  0  0
## 4 fishech01   1871     1    RC1   NA  4 16 24 24 22   1  0    639 295 103  3 31
## 5 fleetfr01   1871     1    NY2   NA  0  1  1  1  1   0  0     27  20  10  0  3
## 6 flowedi01   1871     1    TRO   NA  0  0  1  0  0   0  0      3   1   0  0  0
##   SO BAOpp   ERA IBB WP HBP BK  BFP GF   R SH SF GIDP      K_p_BF Mean_K_p_BF
## 1  1    NA  7.96  NA  7  NA  0  146  0  42 NA NA   NA 0.006849315 0.009033541
## 2 13    NA  4.50  NA  7  NA  0 1291  0 292 NA NA   NA 0.010069713 0.009033541
## 3  0    NA 27.00  NA  2  NA  0   14  0   9 NA NA   NA 0.000000000 0.009033541
## 4 15    NA  4.35  NA 20  NA  0 1080  1 257 NA NA   NA 0.013888889 0.009033541
## 5  0    NA 10.00  NA  0  NA  0   57  0  21 NA NA   NA 0.000000000 0.009033541
## 6  0    NA  0.00  NA  0  NA  0    3  1   0 NA NA   NA 0.000000000 0.009033541
##     Mean_K
## 1 9.210526
## 2 9.210526
## 3 9.210526
## 4 9.210526
## 5 9.210526
## 6 9.210526

Now let’s create KAA, KAA+, KBFAA, KBFAA+

pitching_clean <- pitching_clean %>%
  mutate(
    KAA = SO - Mean_K,
    `KAA+` = (SO/Mean_K) * 100,
    KBFAA = K_p_BF - Mean_K_p_BF, 
    `KBFAA+` = (K_p_BF/Mean_K_p_BF) * 100
  )

head(pitching_clean)
##    playerID yearID stint teamID lgID  W  L  G GS CG SHO SV IPouts   H  ER HR BB
## 1 bechtge01   1871     1    PH1   NA  1  2  3  3  2   0  0     78  43  23  0 11
## 2 brainas01   1871     1    WS3   NA 12 15 30 30 30   0  0    792 361 132  4 37
## 3 fergubo01   1871     1    NY2   NA  0  0  1  0  0   0  0      3   8   3  0  0
## 4 fishech01   1871     1    RC1   NA  4 16 24 24 22   1  0    639 295 103  3 31
## 5 fleetfr01   1871     1    NY2   NA  0  1  1  1  1   0  0     27  20  10  0  3
## 6 flowedi01   1871     1    TRO   NA  0  0  1  0  0   0  0      3   1   0  0  0
##   SO BAOpp   ERA IBB WP HBP BK  BFP GF   R SH SF GIDP      K_p_BF Mean_K_p_BF
## 1  1    NA  7.96  NA  7  NA  0  146  0  42 NA NA   NA 0.006849315 0.009033541
## 2 13    NA  4.50  NA  7  NA  0 1291  0 292 NA NA   NA 0.010069713 0.009033541
## 3  0    NA 27.00  NA  2  NA  0   14  0   9 NA NA   NA 0.000000000 0.009033541
## 4 15    NA  4.35  NA 20  NA  0 1080  1 257 NA NA   NA 0.013888889 0.009033541
## 5  0    NA 10.00  NA  0  NA  0   57  0  21 NA NA   NA 0.000000000 0.009033541
## 6  0    NA  0.00  NA  0  NA  0    3  1   0 NA NA   NA 0.000000000 0.009033541
##     Mean_K       KAA      KAA+        KBFAA    KBFAA+
## 1 9.210526 -8.210526  10.85714 -0.002184226  75.82093
## 2 9.210526  3.789474 141.14286  0.001036172 111.47028
## 3 9.210526 -9.210526   0.00000 -0.009033541   0.00000
## 4 9.210526  5.789474 162.85714  0.004855348 153.74800
## 5 9.210526 -9.210526   0.00000 -0.009033541   0.00000
## 6 9.210526 -9.210526   0.00000 -0.009033541   0.00000

Okay, for clarity, let’s bring in the player names:

# Select relevant columns from the Master table (which contains player names)
player_info <- Lahman::People %>%
  select(playerID, nameFirst, nameLast) %>%
  mutate(nameFull = paste(nameLast, ", ", nameFirst, sep=""))

# Join the player names into the pitching_clean dataframe using playerID as the key
pitching_clean_with_names <- pitching_clean %>%
  left_join(player_info, by = "playerID")


# View the updated dataframe with player names
head(pitching_clean_with_names)
##    playerID yearID stint teamID lgID  W  L  G GS CG SHO SV IPouts   H  ER HR BB
## 1 bechtge01   1871     1    PH1   NA  1  2  3  3  2   0  0     78  43  23  0 11
## 2 brainas01   1871     1    WS3   NA 12 15 30 30 30   0  0    792 361 132  4 37
## 3 fergubo01   1871     1    NY2   NA  0  0  1  0  0   0  0      3   8   3  0  0
## 4 fishech01   1871     1    RC1   NA  4 16 24 24 22   1  0    639 295 103  3 31
## 5 fleetfr01   1871     1    NY2   NA  0  1  1  1  1   0  0     27  20  10  0  3
## 6 flowedi01   1871     1    TRO   NA  0  0  1  0  0   0  0      3   1   0  0  0
##   SO BAOpp   ERA IBB WP HBP BK  BFP GF   R SH SF GIDP      K_p_BF Mean_K_p_BF
## 1  1    NA  7.96  NA  7  NA  0  146  0  42 NA NA   NA 0.006849315 0.009033541
## 2 13    NA  4.50  NA  7  NA  0 1291  0 292 NA NA   NA 0.010069713 0.009033541
## 3  0    NA 27.00  NA  2  NA  0   14  0   9 NA NA   NA 0.000000000 0.009033541
## 4 15    NA  4.35  NA 20  NA  0 1080  1 257 NA NA   NA 0.013888889 0.009033541
## 5  0    NA 10.00  NA  0  NA  0   57  0  21 NA NA   NA 0.000000000 0.009033541
## 6  0    NA  0.00  NA  0  NA  0    3  1   0 NA NA   NA 0.000000000 0.009033541
##     Mean_K       KAA      KAA+        KBFAA    KBFAA+ nameFirst nameLast
## 1 9.210526 -8.210526  10.85714 -0.002184226  75.82093    George  Bechtel
## 2 9.210526  3.789474 141.14286  0.001036172 111.47028       Asa Brainard
## 3 9.210526 -9.210526   0.00000 -0.009033541   0.00000       Bob Ferguson
## 4 9.210526  5.789474 162.85714  0.004855348 153.74800  Cherokee   Fisher
## 5 9.210526 -9.210526   0.00000 -0.009033541   0.00000     Frank    Fleet
## 6 9.210526 -9.210526   0.00000 -0.009033541   0.00000    Dickie  Flowers
##           nameFull
## 1  Bechtel, George
## 2    Brainard, Asa
## 3    Ferguson, Bob
## 4 Fisher, Cherokee
## 5     Fleet, Frank
## 6  Flowers, Dickie

Okay, let’s create a data frame of just the player name, year, and the new statistics:

# Create a new dataframe with just player name, year, and the new statistics
player_kaa_stats <- pitching_clean_with_names %>%
  select( yearID, nameLast, nameFirst, nameFull,KAA, `KAA+`, KBFAA, `KBFAA+`)

# View the new dataframe
head(player_kaa_stats)
##   yearID nameLast nameFirst         nameFull       KAA      KAA+        KBFAA
## 1   1871  Bechtel    George  Bechtel, George -8.210526  10.85714 -0.002184226
## 2   1871 Brainard       Asa    Brainard, Asa  3.789474 141.14286  0.001036172
## 3   1871 Ferguson       Bob    Ferguson, Bob -9.210526   0.00000 -0.009033541
## 4   1871   Fisher  Cherokee Fisher, Cherokee  5.789474 162.85714  0.004855348
## 5   1871    Fleet     Frank     Fleet, Frank -9.210526   0.00000 -0.009033541
## 6   1871  Flowers    Dickie  Flowers, Dickie -9.210526   0.00000 -0.009033541
##      KBFAA+
## 1  75.82093
## 2 111.47028
## 3   0.00000
## 4 153.74800
## 5   0.00000
## 6   0.00000

Now let’s look at histograms of these statistics to get an idea of the distribution

# Histogram for KAA
p1 <- ggplot(player_kaa_stats, aes(x = KAA)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Histogram of KAA", x = "KAA", y = "Frequency")

# Histogram for KAA+
p2 <- ggplot(player_kaa_stats, aes(x = `KAA+`)) +
  geom_histogram(binwidth = 5, fill = "red", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Histogram of KAA+", x = "KAA+", y = "Frequency")

# Histogram for KBFAA
p3 <- ggplot(player_kaa_stats, aes(x = KBFAA)) +
  geom_histogram(binwidth = 0.01, fill = "green", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Histogram of KBFAA", x = "KBFAA", y = "Frequency")

# Histogram for KBFAA+
p4 <- ggplot(player_kaa_stats, aes(x = `KBFAA+`)) +
  geom_histogram(binwidth = 5, fill = "purple", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Histogram of KBFAA+", x = "KBFAA+", y = "Frequency")

# Arrange the histograms in a grid
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol = 2)

okay, now let’s get the 2023 and 2024 data from statcast, data was downladed from: https://baseballsavant.mlb.com/leaderboard/custom?year=2024%2C2023&type=pitcher&filter=&min=10&selections=pa%2Cstrikeout&chart=false&x=pa&y=pa&r=no&chartType=beeswarm&sort=xwoba&sortDir=asc and downladed into a csv:

# Set the file path
file_path <- "C:/Users/james/Downloads/stats.csv"

# Load the data
statcast_pitching_data <- read.csv(file_path)

# Check the first few rows of the data
head(statcast_pitching_data)
##   last_name..first_name player_id year  pa strikeout
## 1      Wainwright, Adam    425794 2023 484        55
## 2         Greinke, Zack    425844 2023 593        97
## 3     Verlander, Justin    434378 2023 669       144
## 4        Jansen, Kenley    445276 2023 188        52
## 5         Chavez, Jesse    445926 2023 144        39
## 6         Kluber, Corey    446372 2023 257        42

Next, let’s re-create the statistics for 2023 and 2024:

statcast_pitching_data <- statcast_pitching_data %>%
  rename(nameFull = last_name..first_name) %>%
  mutate(
    K_p_BF = strikeout/pa,
    yearID = year,
    nameLast = str_extract(nameFull, "^[^,]+"),  # Extract everything before the comma for nameLast
    nameFirst = str_extract(nameFull, "(?<=,\\s).+")
  )

head(statcast_pitching_data)
##            nameFull player_id year  pa strikeout    K_p_BF yearID   nameLast
## 1  Wainwright, Adam    425794 2023 484        55 0.1136364   2023 Wainwright
## 2     Greinke, Zack    425844 2023 593        97 0.1635750   2023    Greinke
## 3 Verlander, Justin    434378 2023 669       144 0.2152466   2023  Verlander
## 4    Jansen, Kenley    445276 2023 188        52 0.2765957   2023     Jansen
## 5     Chavez, Jesse    445926 2023 144        39 0.2708333   2023     Chavez
## 6     Kluber, Corey    446372 2023 257        42 0.1634241   2023     Kluber
##   nameFirst
## 1      Adam
## 2      Zack
## 3    Justin
## 4    Kenley
## 5     Jesse
## 6     Corey

Okay, let’s summarize

# Create a new dataframe with the average K_p_BF by yearID
MeanKperBatterFacedbyYear_statcast <- statcast_pitching_data %>%
  group_by(year) %>%
  summarise(Mean_K_p_BF = mean(K_p_BF, na.rm = TRUE),
            Mean_K = mean(strikeout, na.rm = TRUE) )  # Use na.rm = TRUE to remove NA values from the calculation

# View the new dataframe
print(MeanKperBatterFacedbyYear_statcast)
## # A tibble: 2 × 3
##    year Mean_K_p_BF Mean_K
##   <int>       <dbl>  <dbl>
## 1  2023       0.219   53.6
## 2  2024       0.216   49.8

great, now let’s append the data to the MeanKperBatterBacedbyYear data

# Rename the 'year' column to 'yearID' in the second dataframe
MeanKperBatterFacedbyYear_statcast <- MeanKperBatterFacedbyYear_statcast %>%
  rename(yearID = year)
# Append the dataframes
MeanKperBatterFacedbyYear <- bind_rows(MeanKperBatterFacedbyYear, MeanKperBatterFacedbyYear_statcast)

# Check the combined dataframe
print(MeanKperBatterFacedbyYear)
## # A tibble: 154 × 3
##    yearID Mean_K_p_BF Mean_K
##     <int>       <dbl>  <dbl>
##  1   1871     0.00903   9.21
##  2   1872     0.0120   11.0 
##  3   1873     0.0125   11.1 
##  4   1874     0.0105   21   
##  5   1875     0.0175   14.4 
##  6   1876     0.0353   17.3 
##  7   1877     0.0447   38.2 
##  8   1878     0.0591   49.1 
##  9   1879     0.0831   70.9 
## 10   1880     0.0661   60.3 
## # ℹ 144 more rows

Okay, now let’s re-do the visuals:

# Using ggplot2 to create side by side line graphs
p1 <- ggplot(MeanKperBatterFacedbyYear, aes(x = yearID, y = Mean_K_p_BF)) +
  geom_line(color = "blue") +
  theme_minimal() +
  labs(title = "Mean K per Batter Faced by Year",
       x = "Year",
       y = "Mean K per Batter Faced") +
  scale_x_continuous(breaks = seq(min(MeanKperBatterFacedbyYear$yearID), 
                                  max(MeanKperBatterFacedbyYear$yearID), 
                                  by = 15)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p2 <- ggplot(MeanKperBatterFacedbyYear, aes(x = yearID, y = Mean_K)) +
  geom_line(color = "red") +
  theme_minimal() +
  labs(title = "Mean Strikeouts by Year",
       x = "Year",
       y = "Mean Strikeouts") +
  scale_x_continuous(breaks = seq(min(MeanKperBatterFacedbyYear$yearID), 
                                  max(MeanKperBatterFacedbyYear$yearID), 
                                  by = 15)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Arrange the plots side by side
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)

We can see the two extra years in the data, which is good, and they seem to line up with the previous data, which gives us confidence in both data sets. Next, let’s proceed with adding the KAA, KAA+, KBFAA, KBFAA+ statistics for the 2023/2024 data and then appending them into the player_kaa_stats dataframe

statcast_pitching_data <- statcast_pitching_data %>%
  left_join(MeanKperBatterFacedbyYear, by = "yearID") %>%
  select(yearID, nameLast, nameFirst, nameFull, pa, strikeout, K_p_BF, Mean_K_p_BF, Mean_K) %>%
  mutate(
    KAA = strikeout - Mean_K,
    `KAA+` = (strikeout/Mean_K) * 100,
    KBFAA = K_p_BF - Mean_K_p_BF, 
    `KBFAA+` = (K_p_BF/Mean_K_p_BF) * 100
  )

# View the updated pitching_clean data frame
head(statcast_pitching_data)
##   yearID   nameLast nameFirst          nameFull  pa strikeout    K_p_BF
## 1   2023 Wainwright      Adam  Wainwright, Adam 484        55 0.1136364
## 2   2023    Greinke      Zack     Greinke, Zack 593        97 0.1635750
## 3   2023  Verlander    Justin Verlander, Justin 669       144 0.2152466
## 4   2023     Jansen    Kenley    Jansen, Kenley 188        52 0.2765957
## 5   2023     Chavez     Jesse     Chavez, Jesse 144        39 0.2708333
## 6   2023     Kluber     Corey     Kluber, Corey 257        42 0.1634241
##   Mean_K_p_BF   Mean_K        KAA      KAA+        KBFAA    KBFAA+
## 1   0.2194861 53.64698   1.353017 102.52207 -0.105849698  51.77384
## 2   0.2194861 53.64698  43.353017 180.81166 -0.055911019  74.52639
## 3   0.2194861 53.64698  90.353017 268.42143 -0.004239424  98.06848
## 4   0.2194861 53.64698  -1.646983  96.92996  0.057109684 126.01973
## 5   0.2194861 53.64698 -14.646983  72.69747  0.051347272 123.39432
## 6   0.2194861 53.64698 -11.646983  78.28958 -0.056061937  74.45763

okay, now let’s append this into the player_kaa_stats data, first let’s select only what will match the dataframe we are appending.

# Create a new dataframe with just player name, year, and the new statistics
player_kaa_stats_statcast <- statcast_pitching_data %>%
  select( yearID, nameLast, nameFirst, nameFull,KAA, `KAA+`, KBFAA, `KBFAA+`)

# View the new dataframe
head(player_kaa_stats_statcast)
##   yearID   nameLast nameFirst          nameFull        KAA      KAA+
## 1   2023 Wainwright      Adam  Wainwright, Adam   1.353017 102.52207
## 2   2023    Greinke      Zack     Greinke, Zack  43.353017 180.81166
## 3   2023  Verlander    Justin Verlander, Justin  90.353017 268.42143
## 4   2023     Jansen    Kenley    Jansen, Kenley  -1.646983  96.92996
## 5   2023     Chavez     Jesse     Chavez, Jesse -14.646983  72.69747
## 6   2023     Kluber     Corey     Kluber, Corey -11.646983  78.28958
##          KBFAA    KBFAA+
## 1 -0.105849698  51.77384
## 2 -0.055911019  74.52639
## 3 -0.004239424  98.06848
## 4  0.057109684 126.01973
## 5  0.051347272 123.39432
## 6 -0.056061937  74.45763

Next, let’s append the new player_kaa_stats_statcast with player_kaa_stats to get one dataframe that includes all the Lahman data up until 2022 and the statcast data from 2023 and 2024:

# Append the two dataframes
player_kaa_stats <- bind_rows(player_kaa_stats, player_kaa_stats_statcast)

# View the updated dataframe
head(player_kaa_stats)
##   yearID nameLast nameFirst         nameFull       KAA      KAA+        KBFAA
## 1   1871  Bechtel    George  Bechtel, George -8.210526  10.85714 -0.002184226
## 2   1871 Brainard       Asa    Brainard, Asa  3.789474 141.14286  0.001036172
## 3   1871 Ferguson       Bob    Ferguson, Bob -9.210526   0.00000 -0.009033541
## 4   1871   Fisher  Cherokee Fisher, Cherokee  5.789474 162.85714  0.004855348
## 5   1871    Fleet     Frank     Fleet, Frank -9.210526   0.00000 -0.009033541
## 6   1871  Flowers    Dickie  Flowers, Dickie -9.210526   0.00000 -0.009033541
##      KBFAA+
## 1  75.82093
## 2 111.47028
## 3   0.00000
## 4 153.74800
## 5   0.00000
## 6   0.00000

Let’s again look at the distribution of the new statistics:

# Histogram for KAA
p1 <- ggplot(player_kaa_stats, aes(x = KAA)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Histogram of KAA", x = "KAA", y = "Frequency")

# Histogram for KAA+
p2 <- ggplot(player_kaa_stats, aes(x = `KAA+`)) +
  geom_histogram(binwidth = 5, fill = "red", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Histogram of KAA+", x = "KAA+", y = "Frequency")

# Histogram for KBFAA
p3 <- ggplot(player_kaa_stats, aes(x = KBFAA)) +
  geom_histogram(binwidth = 0.01, fill = "green", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Histogram of KBFAA", x = "KBFAA", y = "Frequency")

# Histogram for KBFAA+
p4 <- ggplot(player_kaa_stats, aes(x = `KBFAA+`)) +
  geom_histogram(binwidth = 5, fill = "purple", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Histogram of KBFAA+", x = "KBFAA+", y = "Frequency")

# Arrange the histograms in a grid
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol = 2)

Now, finally, let’s look at Kerry Wood’s 1998 season data and Paul Skenes 2024 season data to compare:

skenes_statcast_data <- statcast_pitching_data %>%
  filter(nameLast == "Skenes") %>%
  mutate(K_per_Batter = strikeout/pa)

skenes_statcast_data
##   yearID nameLast nameFirst     nameFull  pa strikeout   K_p_BF Mean_K_p_BF
## 1   2024   Skenes      Paul Skenes, Paul 490       158 0.322449   0.2162191
##     Mean_K      KAA     KAA+     KBFAA   KBFAA+ K_per_Batter
## 1 49.84666 108.1533 316.9721 0.1062299 149.1307     0.322449
wood_lahman_data <- pitching_clean_with_names %>%
  filter(nameLast == "Wood" & yearID == 1998) %>%
  mutate(K_per_Batter = SO/BFP)

wood_lahman_data
##   playerID yearID stint teamID lgID  W L  G GS CG SHO SV IPouts   H ER HR BB
## 1 woodke02   1998     1    CHN   NL 13 6 26 26  1   1  0    500 117 63 14 85
##    SO BAOpp ERA IBB WP HBP BK BFP GF  R SH SF GIDP    K_p_BF Mean_K_p_BF
## 1 233 0.196 3.4   1  6  11  3 699  0 69  2  4    3 0.3333333   0.1625023
##     Mean_K      KAA     KAA+    KBFAA   KBFAA+ nameFirst nameLast    nameFull
## 1 51.11058 181.8894 455.8743 0.170831 205.1253     Kerry     Wood Wood, Kerry
##   K_per_Batter
## 1    0.3333333

Looking on the face, we see Wood’s 1998 season with a strikeout per batter faced of 33.3%, striking out 1/3 of all the batter’s he faced. Whereas Skenes is striking out 32.2% of batter’s faced. Pretty close in terms of just what percetenage of batter’s faced they are striking out, with Wood still having a slight edge.

Now let’s compare using KAA, KAA+, KBFAA, and KBFAA+

Kerry_Wood_Paul_Skenes_Comparison <- player_kaa_stats %>%
  filter((yearID == 1998 & nameLast == "Wood") | (yearID == 2024 & nameLast == "Skenes"))


Kerry_Wood_Paul_Skenes_Comparison
##   yearID nameLast nameFirst     nameFull      KAA     KAA+     KBFAA   KBFAA+
## 1   1998     Wood     Kerry  Wood, Kerry 181.8894 455.8743 0.1708310 205.1253
## 2   2024   Skenes      Paul Skenes, Paul 108.1533 316.9721 0.1062299 149.1307

Let’s visualize this a little nicer:

# Load necessary library
library(knitr)
## Warning: package 'knitr' was built under R version 4.3.3
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.3.3
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(grid)
library(gridExtra)


# Create a simple HTML table using kable
Kerry_Wood_Paul_Skenes_Comparison %>%
  kable(format = "html", table.attr = "class='table table-bordered table-hover'", caption = "<b><center>Kerry Wood and Paul Skenes Comparison Using Strikeouts Above Average(KAA) and Strikeouts per Batter Faced Above Average(KBFAA)</center></b>") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "hover", "condensed"))
Kerry Wood and Paul Skenes Comparison Using Strikeouts Above Average(KAA) and Strikeouts per Batter Faced Above Average(KBFAA)
yearID nameLast nameFirst nameFull KAA KAA+ KBFAA KBFAA+
1998 Wood Kerry Wood, Kerry 181.8894 455.8743 0.1708310 205.1253
2024 Skenes Paul Skenes, Paul 108.1533 316.9721 0.1062299 149.1307

It appears, based on this comparison, using KAA+, that Kerry Wood was about 356% better than league average in 1998, whereas Paul Skenes has 217% than league average in 2024. However, keep in mind, we see Skenes faced 490 batters to Wood’s 699. Which is why it’s important to look at Strikeouts per Batter Faced Abover Average, in 1998 Kerry Wood was about 105% better than league average, whereas Paul Skenes has been about 49% better than league average to this point.

This indicates, in terms of evaluating Kerry Wood’s 1998 season was more with Paul Skenes 2024 season using strikeouts above average or strikeouts above average per batter faced, Wood was striking out much more than the league was at the time than Paul Skenes is today. Looking at the line graphs above, we see that the trend in the league has been increase K’s per batter faced, over time. So even though Wood and Skenes are striking out about the same amount of batter’s faced, if we look at it compared to the league average at the time, Wood was doing it at a much higher rate than Skenes, compared to his peers.

If we create a new dataframe given the strikeout info provided by Tom Tango, to specifically look at the first 490 plate appearances of Wood, we would first create a data frame and join in the Mean K data:

kerry_wood_data <- data.frame(
  yearID = 1998,
  playerID = "woodke01",
  nameLast = "Wood",
  nameFirst = "Kerry", 
  nameFull = "Wood, Kerry", 
  BFA = 490,
  strikeout = 169, 
  K_p_BF = 169/490
)

# Merge Kerry Wood's data with the mean data
kerry_wood_stats <- kerry_wood_data %>%
  left_join(MeanKperBatterFacedbyYear, by = "yearID") %>%
  mutate(
    KAA = strikeout - Mean_K,
    `KAA+` = (strikeout/Mean_K) * 100,
    KBFAA = K_p_BF - Mean_K_p_BF, 
    `KBFAA+` = (K_p_BF/Mean_K_p_BF) * 100
  )

Okay, now let’s append Paul Skenes data back in for the first 490 PA comparison

kerry_wood_mod <- kerry_wood_stats %>%
  select(-playerID, -nameLast, -nameFirst) %>%  # Remove playerID column
  rename(pa = BFA)  # Rename BFA to pa

# Modify skenes_statcast_data
skenes_statcast_mod <- skenes_statcast_data %>%
  select(-K_per_Batter, -nameLast, -nameFirst)  # Remove K_per_Batter column

# Append dataframes together
skenes_wood_first_four_ninety_pa_comparison <- bind_rows(kerry_wood_mod, skenes_statcast_mod)

#rename some things
skenes_wood_first_four_ninety_pa_comparison <- skenes_wood_first_four_ninety_pa_comparison %>%
  rename(
    Year = yearID,
    Name = nameFull,
    Strikeouts = strikeout,
    `Stikeouts per Batter Faced` = K_p_BF,
    `Mean strikeouts per Batter Faced by Year` = Mean_K_p_BF
  )

print(skenes_wood_first_four_ninety_pa_comparison)
##   Year         Name  pa Strikeouts Stikeouts per Batter Faced
## 1 1998  Wood, Kerry 490        169                   0.344898
## 2 2024 Skenes, Paul 490        158                   0.322449
##   Mean strikeouts per Batter Faced by Year   Mean_K      KAA     KAA+     KBFAA
## 1                                0.1625023 51.11058 117.8894 330.6556 0.1823956
## 2                                0.2162191 49.84666 108.1533 316.9721 0.1062299
##     KBFAA+
## 1 212.2419
## 2 149.1307

Now let’s create a nicer looking table

# Create a simple HTML table using kable
skenes_wood_first_four_ninety_pa_comparison %>%
  kable(format = "html", table.attr = "class='table table-bordered table-hover'", caption = "<b><center>Kerry Wood and Paul Skenes Comparison of the first 490 Batters Faced Using Strikeouts Above Average(KAA) and Strikeouts per Batter Faced Above Average(KBFAA)</center></b>") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "hover", "condensed"))
Kerry Wood and Paul Skenes Comparison of the first 490 Batters Faced Using Strikeouts Above Average(KAA) and Strikeouts per Batter Faced Above Average(KBFAA)
Year Name pa Strikeouts Stikeouts per Batter Faced Mean strikeouts per Batter Faced by Year Mean_K KAA KAA+ KBFAA KBFAA+
1998 Wood, Kerry 490 169 0.344898 0.1625023 51.11058 117.8894 330.6556 0.1823956 212.2419
2024 Skenes, Paul 490 158 0.322449 0.2162191 49.84666 108.1533 316.9721 0.1062299 149.1307

Summary and Analysis

Kerry Wood’s 1998 Season:

Wood's numbers without league comparison are 169 strikeouts, and striking out 34.5% batters faced, (greater than 1/3 of batter's faced)
KAA of 117.8894: This indicates he struck out about 118 more batters than league average. 
KAA+ of 330.6556: This indicates that Kerry Wood was approximately 231% better than the league average in terms of strikeouts above average (KAA) in 1998. Such a high KAA+ value signifies an exceptional ability to strike out batters compared to his peers during that season.
KBFAA of .1823956: This indicates he struck out the batter's he faced at a rate of about 18.2% more than league average.
KBFAA+ of 212.2419: Reflecting on strikeouts per batter faced above average, Kerry Wood performed about 112% better than league average. This highlights his efficiency in striking out batters at a rate significantly higher than what was typical in the league.

Paul Skenes’ 2024 Season:

Skenes' numbers without league comparison are 158 strikeouts, and striking out 32.2% batters faced. 
KAA of 108.1533: This indicates he struck out about 108 more batters than league average. 
KAA+ of 316.9721: Paul Skenes shows a KAA+ approximately 217% above the league average in 2024, marking him also as an outstanding pitcher in terms of strikeouts above average, though slightly less than Wood's exceptional performance.
 KBFAA of .1062299: This indicates he struck out the batter's he faced at a rate of about 10.6% more than league average.
KBFAA+ of 149.1307: Skenes' ability to strike out batters compared to how many he faced is about 49% better than the league average. While still impressive, it is notably lower than Wood's rate, suggesting that while Skenes is effective, he does not dominate to the same extent as Wood did.

Comparative Analysis:

Despite the similar number of plate appearances (PA) each pitcher faced in their respective seasons, Kerry Wood's striking ability relative to his peers was more pronounced. His KAA, KAA+, KBFAA, and KBFAA+ scores both surpass those of Paul Skenes, indicating a higher level of dominance during his era.
The data also suggests a trend of increasing strikeouts per batter faced over the years within the league. Given this context, Wood's performance becomes even more impressive, demonstrating his capability to significantly outperform a league striking out significantly less batters than the era Skenes is pitching in.

Conclusion:

Evaluative Context: When comparing performances relative to their contemporaries, Kerry Wood's 1998 season was more dominant compared to Paul Skenes' 2024 season, both in absolute terms and relative efficiency (strikeouts per batter faced).
League Trends: Considering the trend of increasing strikeouts over time, Wood’s ability to far surpass the league average of his day emphasizes his exceptional skill during an era of relatively fewer strikeouts. This comparative advantage highlights his standout season even more when juxtaposed with Skenes, who also excels, but in a time when striking out batters is more common.

This analysis encapsulates the comparative excellence of Kerry Wood and Paul Skenes by situating their achievements within the broader statistical trends of their respective eras. While looking at face value of the counting statistics provides a pretty comparable view of both pitchers, this analysis underscores the importance of contextualizing individual statistics within league-wide performance metrics to fully appreciate the historical impact of players’ seasons.