/— title: “KAA and KAA+ by James Orange” output: html_notebook — On 9/16/2024, Tangotiger proposed the development of Strikeouts Above Average (KAA) and an Index (KAA+). This is my attempt to create this.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(Lahman)
## Warning: package 'Lahman' was built under R version 4.3.2
Let’s take a look at the lahman data:
pitching <- Lahman::Pitching
pitching <- pitching %>%
mutate (K_p_BF = SO/BFP)
We see the pitching data came in, and we have K’s (SO) and Batters Faced (BFP). Instead of simply doing K’s above average, we’ll also look at K’s per Batter Faced above Average, in order to compare relief/starting/“opener” pitchers. Additionally, with the change in strategy/deployment of all these positions year over year, might better be compared by looking at a per batter faced statistic, instead of K’s in general.
Next, let’s build a data frame which takes the average by year:
# Create a new dataframe with the average K_p_BF by yearID
MeanKperBatterFacedbyYear <- pitching %>%
group_by(yearID) %>%
summarise(Mean_K_p_BF = mean(K_p_BF, na.rm = TRUE),
Mean_K = mean(SO, na.rm = TRUE) ) # Use na.rm = TRUE to remove NA values from the calculation
# View the new dataframe
print(MeanKperBatterFacedbyYear)
## # A tibble: 152 × 3
## yearID Mean_K_p_BF Mean_K
## <int> <dbl> <dbl>
## 1 1871 0.00903 9.21
## 2 1872 0.0120 11.0
## 3 1873 0.0125 11.1
## 4 1874 0.0105 21
## 5 1875 0.0175 14.4
## 6 1876 0.0353 17.3
## 7 1877 0.0447 38.2
## 8 1878 0.0591 49.1
## 9 1879 0.0831 70.9
## 10 1880 0.0661 60.3
## # ℹ 142 more rows
Looking at the data, we see the mean K’s for 2020 down, which makes sense, considering only 60 games per team were played. This gives more evidence that the per batter faced is the correct statistic to use when developing a KAA and KAA+ index.
Now let’s quickly visualize these statistics using a line graph to get an idea of what mean K’s and mean K ber Batter Faced looks like over time:
# Using ggplot2 to create side by side line graphs
p1 <- ggplot(MeanKperBatterFacedbyYear, aes(x = yearID, y = Mean_K_p_BF)) +
geom_line(color = "blue") +
theme_minimal() +
labs(title = "Mean K per Batter Faced by Year",
x = "Year",
y = "Mean K per Batter Faced") +
scale_x_continuous(breaks = seq(min(MeanKperBatterFacedbyYear$yearID),
max(MeanKperBatterFacedbyYear$yearID),
by = 15)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p2 <- ggplot(MeanKperBatterFacedbyYear, aes(x = yearID, y = Mean_K)) +
geom_line(color = "red") +
theme_minimal() +
labs(title = "Mean Strikeouts by Year",
x = "Year",
y = "Mean Strikeouts") +
scale_x_continuous(breaks = seq(min(MeanKperBatterFacedbyYear$yearID),
max(MeanKperBatterFacedbyYear$yearID),
by = 15)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Arrange the plots side by side
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(p1, p2, ncol = 2)
It looks like everything is okay, although there seems to be a big spike
around 1915 for Mean K per Batter Faced, whereas for Mean Strikeouts, we
see the dip in 2020 with the shortened season. We can see the upward
trend for Mean Strikeout per Batter Faced, whereas for Mean strikeouts,
this trend doesn’t seem to be as apparent. We could run a time series
regressing in the future for further analysis. However, the purpose here
is not prediction of future values, rather the creation of a new metric
which would allow for comparison over time/position.
Let’s dig into the spike in 1915 to figure out why this is occurring in the data, and then we can proceed with creation of the new metric.
# Filter the dataframe for the desired yearIDs
filtered_years <- MeanKperBatterFacedbyYear %>%
filter(yearID %in% c(1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919))
# Print the filtered dataframe
print(filtered_years)
## # A tibble: 10 × 3
## yearID Mean_K_p_BF Mean_K
## <int> <dbl> <dbl>
## 1 1910 0.0976 46.0
## 2 1911 0.0944 43.5
## 3 1912 0.0888 38.8
## 4 1913 0.0920 39.6
## 5 1914 Inf 52.8
## 6 1915 Inf 45.7
## 7 1916 0.0960 48.1
## 8 1917 0.0881 48.3
## 9 1918 0.0730 27.7
## 10 1919 0.0736 30.5
We can see here, that Mean_K_p_BF is absent for 1914 and 1915. Let’s take a look and see why this might be.
# Filter the dataframe for the desired yearIDs
filtered_years_pitching <- pitching %>%
filter(yearID %in% c(1914, 1915))
# Print the filtered dataframe
print(head(filtered_years_pitching))
## playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H ER HR BB
## 1 adamsba01 1914 1 PIT NL 13 16 40 35 19 3 1 849 253 79 5 39
## 2 adamsda01 1914 1 KCF FL 4 9 36 14 6 0 3 408 141 53 3 52
## 3 adamska01 1914 1 CIN NL 0 0 4 0 0 0 0 24 14 8 0 5
## 4 adamswi01 1914 1 PTF FL 1 1 15 2 1 0 2 166 70 23 4 22
## 5 aitchra01 1914 1 BRO NL 12 7 26 17 8 3 0 517 156 51 4 60
## 6 alexape01 1914 1 PHI NL 27 15 46 39 32 6 1 1065 327 94 8 76
## SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF GIDP K_p_BF
## 1 91 0.24 2.51 NA 2 7 0 1116 4 97 NA NA NA 0.08154122
## 2 38 9.99 3.51 NA 5 7 0 576 16 67 NA NA NA 0.06597222
## 3 5 0.42 9.00 NA 0 0 0 41 4 10 NA NA NA 0.12195122
## 4 14 9.99 3.74 NA 2 1 0 238 10 29 NA NA NA 0.05882353
## 5 87 0.24 2.66 NA 9 3 0 720 6 71 NA NA NA 0.12083333
## 6 214 0.24 2.38 NA 1 11 0 1459 7 133 NA NA NA 0.14667581
It appears we have values for K_p_BF, let’s see if we can narrow this down a little more.
# Check for NA values in BFP for 1914 and 1915
na_entries <- pitching %>%
filter(yearID %in% c(1914, 1915) & is.na(K_p_BF))
print(na_entries)
## playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H ER HR BB SO
## 1 hugheve01 1914 1 BLF FL 0 0 3 0 0 0 0 17 5 2 0 3 0
## 2 ostenfr01 1914 1 IND FL 0 0 1 0 0 0 0 6 5 5 0 2 0
## 3 perrige01 1914 1 KCF FL 0 0 1 0 0 0 0 2 2 1 0 1 0
## 4 porraed01 1914 1 BUF FL 0 1 3 3 0 0 0 31 18 5 2 7 0
## 5 shermda01 1914 1 CHF FL 0 1 1 1 0 0 0 1 0 0 0 2 0
## 6 vernojo01 1914 1 BRF FL 0 0 1 1 0 0 0 10 4 4 0 5 0
## 7 harribe02 1915 1 KCF FL 0 0 1 0 0 0 0 6 1 0 0 0 0
## 8 miljujo01 1915 1 PTF FL 0 0 1 0 0 0 0 3 1 0 0 0 0
## 9 smithbo01 1915 1 BUF FL 0 0 1 0 0 0 0 3 1 2 0 2 0
## 10 whitegi01 1915 1 NEW FL 0 0 1 0 0 0 0 3 0 0 0 1 0
## 11 wilheka01 1915 1 BLF FL 0 0 1 0 0 0 0 3 0 0 0 0 0
## BAOpp ERA IBB WP HBP BK BFP GF R SH SF GIDP K_p_BF
## 1 9.99 3.18 NA 2 0 0 0 2 4 NA NA NA NaN
## 2 9.99 22.50 NA 0 1 0 0 1 5 NA NA NA NaN
## 3 9.99 13.50 NA 0 0 0 0 1 1 NA NA NA NaN
## 4 9.99 4.35 NA 0 0 0 0 0 9 NA NA NA NaN
## 5 9.99 0.00 NA 0 0 0 0 0 2 NA NA NA NaN
## 6 9.99 10.80 NA 0 0 0 0 0 4 NA NA NA NaN
## 7 9.99 0.00 NA 0 0 0 0 1 0 NA NA NA NaN
## 8 9.99 0.00 NA 0 0 0 0 1 0 NA NA NA NaN
## 9 9.99 18.00 NA 0 1 0 0 1 2 NA NA NA NaN
## 10 9.99 0.00 NA 0 0 0 0 1 0 NA NA NA NaN
## 11 9.99 0.00 NA 0 0 0 0 0 0 NA NA NA NaN
Here we see quite a few NaN values, with a BFP and strikeouts of 0 (dividing by 0). To handle this, instead of trying to use a predicted value or waited average for these 11 total observations, we’ll excluded them from the data set, since their exclusion is unlikely to meaningfully change the performance data.
Next, let’s use the new mean statistics by year to develop KAA, KAA+, Strikeout per Batter Faced Above Average (KBFAA), and Index (KBFAA+)
# Remove rows where K_p_BF is NaN or NA
pitching_clean <- pitching %>%
filter(!is.na(K_p_BF) & !is.nan(K_p_BF) & !is.infinite(K_p_BF)) # This will remove both NaN and NA
# Check for NA values in BFP for 1914 and 1915
na_entries <- pitching_clean %>%
filter(yearID %in% c(1914, 1915) & is.na(K_p_BF))
print(na_entries)
## [1] playerID yearID stint teamID lgID W L G
## [9] GS CG SHO SV IPouts H ER HR
## [17] BB SO BAOpp ERA IBB WP HBP BK
## [25] BFP GF R SH SF GIDP K_p_BF
## <0 rows> (or 0-length row.names)
Okay, now let’s examine the data frame and visualize it again to see what it looks like, cleaned.
MeanKperBatterFacedbyYear <- pitching_clean %>%
group_by(yearID) %>%
summarise(Mean_K_p_BF = mean(K_p_BF, na.rm = TRUE),
Mean_K = mean(SO, na.rm = TRUE) ) # Use na.rm = TRUE to remove NA values from the calculation
# View the new dataframe
print(MeanKperBatterFacedbyYear)
## # A tibble: 152 × 3
## yearID Mean_K_p_BF Mean_K
## <int> <dbl> <dbl>
## 1 1871 0.00903 9.21
## 2 1872 0.0120 11.0
## 3 1873 0.0125 11.1
## 4 1874 0.0105 21
## 5 1875 0.0175 14.4
## 6 1876 0.0353 17.3
## 7 1877 0.0447 38.2
## 8 1878 0.0591 49.1
## 9 1879 0.0831 70.9
## 10 1880 0.0661 60.3
## # ℹ 142 more rows
Okay, evaluating 1914 and 1915 we can see Mean_K_p_BF seems to move in step with Mean K, which is what we would expect after removal. Let’s also visualize the cleaned data quickly to ensure we have removed the spike from K’s per Batter Faced:
# Using ggplot2 to create side by side line graphs
p1 <- ggplot(MeanKperBatterFacedbyYear, aes(x = yearID, y = Mean_K_p_BF)) +
geom_line(color = "blue") +
theme_minimal() +
labs(title = "Mean K per Batter Faced by Year",
x = "Year",
y = "Mean K per Batter Faced") +
scale_x_continuous(breaks = seq(min(MeanKperBatterFacedbyYear$yearID),
max(MeanKperBatterFacedbyYear$yearID),
by = 15)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p2 <- ggplot(MeanKperBatterFacedbyYear, aes(x = yearID, y = Mean_K)) +
geom_line(color = "red") +
theme_minimal() +
labs(title = "Mean Strikeouts by Year",
x = "Year",
y = "Mean Strikeouts") +
scale_x_continuous(breaks = seq(min(MeanKperBatterFacedbyYear$yearID),
max(MeanKperBatterFacedbyYear$yearID),
by = 15)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Arrange the plots side by side
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)
Since this is what we expected, and don’t have any further cause for
concern, we can proceed with creation of the KAA, KAA+, KBFAA, and
KBFAA+ statistics.I am going to elect to keep the dip in the mean K’s
per year for 2020, rather than trying to normalize the entire data set
for the year. We will see what the results look like.
The first step in creating these statistics, will be to add the average to the pitching_clean dataframe through a join.
pitching_clean <- pitching_clean %>%
left_join(MeanKperBatterFacedbyYear, by = "yearID")
# View the updated pitching_clean data frame
head(pitching_clean)
## playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H ER HR BB
## 1 bechtge01 1871 1 PH1 NA 1 2 3 3 2 0 0 78 43 23 0 11
## 2 brainas01 1871 1 WS3 NA 12 15 30 30 30 0 0 792 361 132 4 37
## 3 fergubo01 1871 1 NY2 NA 0 0 1 0 0 0 0 3 8 3 0 0
## 4 fishech01 1871 1 RC1 NA 4 16 24 24 22 1 0 639 295 103 3 31
## 5 fleetfr01 1871 1 NY2 NA 0 1 1 1 1 0 0 27 20 10 0 3
## 6 flowedi01 1871 1 TRO NA 0 0 1 0 0 0 0 3 1 0 0 0
## SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF GIDP K_p_BF Mean_K_p_BF
## 1 1 NA 7.96 NA 7 NA 0 146 0 42 NA NA NA 0.006849315 0.009033541
## 2 13 NA 4.50 NA 7 NA 0 1291 0 292 NA NA NA 0.010069713 0.009033541
## 3 0 NA 27.00 NA 2 NA 0 14 0 9 NA NA NA 0.000000000 0.009033541
## 4 15 NA 4.35 NA 20 NA 0 1080 1 257 NA NA NA 0.013888889 0.009033541
## 5 0 NA 10.00 NA 0 NA 0 57 0 21 NA NA NA 0.000000000 0.009033541
## 6 0 NA 0.00 NA 0 NA 0 3 1 0 NA NA NA 0.000000000 0.009033541
## Mean_K
## 1 9.210526
## 2 9.210526
## 3 9.210526
## 4 9.210526
## 5 9.210526
## 6 9.210526
Now let’s create KAA, KAA+, KBFAA, KBFAA+
pitching_clean <- pitching_clean %>%
mutate(
KAA = SO - Mean_K,
`KAA+` = (SO/Mean_K) * 100,
KBFAA = K_p_BF - Mean_K_p_BF,
`KBFAA+` = (K_p_BF/Mean_K_p_BF) * 100
)
head(pitching_clean)
## playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H ER HR BB
## 1 bechtge01 1871 1 PH1 NA 1 2 3 3 2 0 0 78 43 23 0 11
## 2 brainas01 1871 1 WS3 NA 12 15 30 30 30 0 0 792 361 132 4 37
## 3 fergubo01 1871 1 NY2 NA 0 0 1 0 0 0 0 3 8 3 0 0
## 4 fishech01 1871 1 RC1 NA 4 16 24 24 22 1 0 639 295 103 3 31
## 5 fleetfr01 1871 1 NY2 NA 0 1 1 1 1 0 0 27 20 10 0 3
## 6 flowedi01 1871 1 TRO NA 0 0 1 0 0 0 0 3 1 0 0 0
## SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF GIDP K_p_BF Mean_K_p_BF
## 1 1 NA 7.96 NA 7 NA 0 146 0 42 NA NA NA 0.006849315 0.009033541
## 2 13 NA 4.50 NA 7 NA 0 1291 0 292 NA NA NA 0.010069713 0.009033541
## 3 0 NA 27.00 NA 2 NA 0 14 0 9 NA NA NA 0.000000000 0.009033541
## 4 15 NA 4.35 NA 20 NA 0 1080 1 257 NA NA NA 0.013888889 0.009033541
## 5 0 NA 10.00 NA 0 NA 0 57 0 21 NA NA NA 0.000000000 0.009033541
## 6 0 NA 0.00 NA 0 NA 0 3 1 0 NA NA NA 0.000000000 0.009033541
## Mean_K KAA KAA+ KBFAA KBFAA+
## 1 9.210526 -8.210526 10.85714 -0.002184226 75.82093
## 2 9.210526 3.789474 141.14286 0.001036172 111.47028
## 3 9.210526 -9.210526 0.00000 -0.009033541 0.00000
## 4 9.210526 5.789474 162.85714 0.004855348 153.74800
## 5 9.210526 -9.210526 0.00000 -0.009033541 0.00000
## 6 9.210526 -9.210526 0.00000 -0.009033541 0.00000
Okay, for clarity, let’s bring in the player names:
# Select relevant columns from the Master table (which contains player names)
player_info <- Lahman::People %>%
select(playerID, nameFirst, nameLast) %>%
mutate(nameFull = paste(nameLast, ", ", nameFirst, sep=""))
# Join the player names into the pitching_clean dataframe using playerID as the key
pitching_clean_with_names <- pitching_clean %>%
left_join(player_info, by = "playerID")
# View the updated dataframe with player names
head(pitching_clean_with_names)
## playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H ER HR BB
## 1 bechtge01 1871 1 PH1 NA 1 2 3 3 2 0 0 78 43 23 0 11
## 2 brainas01 1871 1 WS3 NA 12 15 30 30 30 0 0 792 361 132 4 37
## 3 fergubo01 1871 1 NY2 NA 0 0 1 0 0 0 0 3 8 3 0 0
## 4 fishech01 1871 1 RC1 NA 4 16 24 24 22 1 0 639 295 103 3 31
## 5 fleetfr01 1871 1 NY2 NA 0 1 1 1 1 0 0 27 20 10 0 3
## 6 flowedi01 1871 1 TRO NA 0 0 1 0 0 0 0 3 1 0 0 0
## SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF GIDP K_p_BF Mean_K_p_BF
## 1 1 NA 7.96 NA 7 NA 0 146 0 42 NA NA NA 0.006849315 0.009033541
## 2 13 NA 4.50 NA 7 NA 0 1291 0 292 NA NA NA 0.010069713 0.009033541
## 3 0 NA 27.00 NA 2 NA 0 14 0 9 NA NA NA 0.000000000 0.009033541
## 4 15 NA 4.35 NA 20 NA 0 1080 1 257 NA NA NA 0.013888889 0.009033541
## 5 0 NA 10.00 NA 0 NA 0 57 0 21 NA NA NA 0.000000000 0.009033541
## 6 0 NA 0.00 NA 0 NA 0 3 1 0 NA NA NA 0.000000000 0.009033541
## Mean_K KAA KAA+ KBFAA KBFAA+ nameFirst nameLast
## 1 9.210526 -8.210526 10.85714 -0.002184226 75.82093 George Bechtel
## 2 9.210526 3.789474 141.14286 0.001036172 111.47028 Asa Brainard
## 3 9.210526 -9.210526 0.00000 -0.009033541 0.00000 Bob Ferguson
## 4 9.210526 5.789474 162.85714 0.004855348 153.74800 Cherokee Fisher
## 5 9.210526 -9.210526 0.00000 -0.009033541 0.00000 Frank Fleet
## 6 9.210526 -9.210526 0.00000 -0.009033541 0.00000 Dickie Flowers
## nameFull
## 1 Bechtel, George
## 2 Brainard, Asa
## 3 Ferguson, Bob
## 4 Fisher, Cherokee
## 5 Fleet, Frank
## 6 Flowers, Dickie
Okay, let’s create a data frame of just the player name, year, and the new statistics:
# Create a new dataframe with just player name, year, and the new statistics
player_kaa_stats <- pitching_clean_with_names %>%
select( yearID, nameLast, nameFirst, nameFull,KAA, `KAA+`, KBFAA, `KBFAA+`)
# View the new dataframe
head(player_kaa_stats)
## yearID nameLast nameFirst nameFull KAA KAA+ KBFAA
## 1 1871 Bechtel George Bechtel, George -8.210526 10.85714 -0.002184226
## 2 1871 Brainard Asa Brainard, Asa 3.789474 141.14286 0.001036172
## 3 1871 Ferguson Bob Ferguson, Bob -9.210526 0.00000 -0.009033541
## 4 1871 Fisher Cherokee Fisher, Cherokee 5.789474 162.85714 0.004855348
## 5 1871 Fleet Frank Fleet, Frank -9.210526 0.00000 -0.009033541
## 6 1871 Flowers Dickie Flowers, Dickie -9.210526 0.00000 -0.009033541
## KBFAA+
## 1 75.82093
## 2 111.47028
## 3 0.00000
## 4 153.74800
## 5 0.00000
## 6 0.00000
Now let’s look at histograms of these statistics to get an idea of the distribution
# Histogram for KAA
p1 <- ggplot(player_kaa_stats, aes(x = KAA)) +
geom_histogram(binwidth = 10, fill = "blue", color = "black", alpha = 0.7) +
theme_minimal() +
labs(title = "Histogram of KAA", x = "KAA", y = "Frequency")
# Histogram for KAA+
p2 <- ggplot(player_kaa_stats, aes(x = `KAA+`)) +
geom_histogram(binwidth = 5, fill = "red", color = "black", alpha = 0.7) +
theme_minimal() +
labs(title = "Histogram of KAA+", x = "KAA+", y = "Frequency")
# Histogram for KBFAA
p3 <- ggplot(player_kaa_stats, aes(x = KBFAA)) +
geom_histogram(binwidth = 0.01, fill = "green", color = "black", alpha = 0.7) +
theme_minimal() +
labs(title = "Histogram of KBFAA", x = "KBFAA", y = "Frequency")
# Histogram for KBFAA+
p4 <- ggplot(player_kaa_stats, aes(x = `KBFAA+`)) +
geom_histogram(binwidth = 5, fill = "purple", color = "black", alpha = 0.7) +
theme_minimal() +
labs(title = "Histogram of KBFAA+", x = "KBFAA+", y = "Frequency")
# Arrange the histograms in a grid
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol = 2)
okay, now let’s get the 2023 and 2024 data from statcast, data was
downladed from: https://baseballsavant.mlb.com/leaderboard/custom?year=2024%2C2023&type=pitcher&filter=&min=10&selections=pa%2Cstrikeout&chart=false&x=pa&y=pa&r=no&chartType=beeswarm&sort=xwoba&sortDir=asc
and downladed into a csv:
# Set the file path
file_path <- "C:/Users/james/Downloads/stats.csv"
# Load the data
statcast_pitching_data <- read.csv(file_path)
# Check the first few rows of the data
head(statcast_pitching_data)
## last_name..first_name player_id year pa strikeout
## 1 Wainwright, Adam 425794 2023 484 55
## 2 Greinke, Zack 425844 2023 593 97
## 3 Verlander, Justin 434378 2023 669 144
## 4 Jansen, Kenley 445276 2023 188 52
## 5 Chavez, Jesse 445926 2023 144 39
## 6 Kluber, Corey 446372 2023 257 42
Next, let’s re-create the statistics for 2023 and 2024:
statcast_pitching_data <- statcast_pitching_data %>%
rename(nameFull = last_name..first_name) %>%
mutate(
K_p_BF = strikeout/pa,
yearID = year,
nameLast = str_extract(nameFull, "^[^,]+"), # Extract everything before the comma for nameLast
nameFirst = str_extract(nameFull, "(?<=,\\s).+")
)
head(statcast_pitching_data)
## nameFull player_id year pa strikeout K_p_BF yearID nameLast
## 1 Wainwright, Adam 425794 2023 484 55 0.1136364 2023 Wainwright
## 2 Greinke, Zack 425844 2023 593 97 0.1635750 2023 Greinke
## 3 Verlander, Justin 434378 2023 669 144 0.2152466 2023 Verlander
## 4 Jansen, Kenley 445276 2023 188 52 0.2765957 2023 Jansen
## 5 Chavez, Jesse 445926 2023 144 39 0.2708333 2023 Chavez
## 6 Kluber, Corey 446372 2023 257 42 0.1634241 2023 Kluber
## nameFirst
## 1 Adam
## 2 Zack
## 3 Justin
## 4 Kenley
## 5 Jesse
## 6 Corey
Okay, let’s summarize
# Create a new dataframe with the average K_p_BF by yearID
MeanKperBatterFacedbyYear_statcast <- statcast_pitching_data %>%
group_by(year) %>%
summarise(Mean_K_p_BF = mean(K_p_BF, na.rm = TRUE),
Mean_K = mean(strikeout, na.rm = TRUE) ) # Use na.rm = TRUE to remove NA values from the calculation
# View the new dataframe
print(MeanKperBatterFacedbyYear_statcast)
## # A tibble: 2 × 3
## year Mean_K_p_BF Mean_K
## <int> <dbl> <dbl>
## 1 2023 0.219 53.6
## 2 2024 0.216 49.8
great, now let’s append the data to the MeanKperBatterBacedbyYear data
# Rename the 'year' column to 'yearID' in the second dataframe
MeanKperBatterFacedbyYear_statcast <- MeanKperBatterFacedbyYear_statcast %>%
rename(yearID = year)
# Append the dataframes
MeanKperBatterFacedbyYear <- bind_rows(MeanKperBatterFacedbyYear, MeanKperBatterFacedbyYear_statcast)
# Check the combined dataframe
print(MeanKperBatterFacedbyYear)
## # A tibble: 154 × 3
## yearID Mean_K_p_BF Mean_K
## <int> <dbl> <dbl>
## 1 1871 0.00903 9.21
## 2 1872 0.0120 11.0
## 3 1873 0.0125 11.1
## 4 1874 0.0105 21
## 5 1875 0.0175 14.4
## 6 1876 0.0353 17.3
## 7 1877 0.0447 38.2
## 8 1878 0.0591 49.1
## 9 1879 0.0831 70.9
## 10 1880 0.0661 60.3
## # ℹ 144 more rows
Okay, now let’s re-do the visuals:
# Using ggplot2 to create side by side line graphs
p1 <- ggplot(MeanKperBatterFacedbyYear, aes(x = yearID, y = Mean_K_p_BF)) +
geom_line(color = "blue") +
theme_minimal() +
labs(title = "Mean K per Batter Faced by Year",
x = "Year",
y = "Mean K per Batter Faced") +
scale_x_continuous(breaks = seq(min(MeanKperBatterFacedbyYear$yearID),
max(MeanKperBatterFacedbyYear$yearID),
by = 15)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p2 <- ggplot(MeanKperBatterFacedbyYear, aes(x = yearID, y = Mean_K)) +
geom_line(color = "red") +
theme_minimal() +
labs(title = "Mean Strikeouts by Year",
x = "Year",
y = "Mean Strikeouts") +
scale_x_continuous(breaks = seq(min(MeanKperBatterFacedbyYear$yearID),
max(MeanKperBatterFacedbyYear$yearID),
by = 15)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Arrange the plots side by side
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)
We can see the two extra years in the data, which is good, and they seem
to line up with the previous data, which gives us confidence in both
data sets. Next, let’s proceed with adding the KAA, KAA+, KBFAA, KBFAA+
statistics for the 2023/2024 data and then appending them into the
player_kaa_stats dataframe
statcast_pitching_data <- statcast_pitching_data %>%
left_join(MeanKperBatterFacedbyYear, by = "yearID") %>%
select(yearID, nameLast, nameFirst, nameFull, pa, strikeout, K_p_BF, Mean_K_p_BF, Mean_K) %>%
mutate(
KAA = strikeout - Mean_K,
`KAA+` = (strikeout/Mean_K) * 100,
KBFAA = K_p_BF - Mean_K_p_BF,
`KBFAA+` = (K_p_BF/Mean_K_p_BF) * 100
)
# View the updated pitching_clean data frame
head(statcast_pitching_data)
## yearID nameLast nameFirst nameFull pa strikeout K_p_BF
## 1 2023 Wainwright Adam Wainwright, Adam 484 55 0.1136364
## 2 2023 Greinke Zack Greinke, Zack 593 97 0.1635750
## 3 2023 Verlander Justin Verlander, Justin 669 144 0.2152466
## 4 2023 Jansen Kenley Jansen, Kenley 188 52 0.2765957
## 5 2023 Chavez Jesse Chavez, Jesse 144 39 0.2708333
## 6 2023 Kluber Corey Kluber, Corey 257 42 0.1634241
## Mean_K_p_BF Mean_K KAA KAA+ KBFAA KBFAA+
## 1 0.2194861 53.64698 1.353017 102.52207 -0.105849698 51.77384
## 2 0.2194861 53.64698 43.353017 180.81166 -0.055911019 74.52639
## 3 0.2194861 53.64698 90.353017 268.42143 -0.004239424 98.06848
## 4 0.2194861 53.64698 -1.646983 96.92996 0.057109684 126.01973
## 5 0.2194861 53.64698 -14.646983 72.69747 0.051347272 123.39432
## 6 0.2194861 53.64698 -11.646983 78.28958 -0.056061937 74.45763
okay, now let’s append this into the player_kaa_stats data, first let’s select only what will match the dataframe we are appending.
# Create a new dataframe with just player name, year, and the new statistics
player_kaa_stats_statcast <- statcast_pitching_data %>%
select( yearID, nameLast, nameFirst, nameFull,KAA, `KAA+`, KBFAA, `KBFAA+`)
# View the new dataframe
head(player_kaa_stats_statcast)
## yearID nameLast nameFirst nameFull KAA KAA+
## 1 2023 Wainwright Adam Wainwright, Adam 1.353017 102.52207
## 2 2023 Greinke Zack Greinke, Zack 43.353017 180.81166
## 3 2023 Verlander Justin Verlander, Justin 90.353017 268.42143
## 4 2023 Jansen Kenley Jansen, Kenley -1.646983 96.92996
## 5 2023 Chavez Jesse Chavez, Jesse -14.646983 72.69747
## 6 2023 Kluber Corey Kluber, Corey -11.646983 78.28958
## KBFAA KBFAA+
## 1 -0.105849698 51.77384
## 2 -0.055911019 74.52639
## 3 -0.004239424 98.06848
## 4 0.057109684 126.01973
## 5 0.051347272 123.39432
## 6 -0.056061937 74.45763
Next, let’s append the new player_kaa_stats_statcast with player_kaa_stats to get one dataframe that includes all the Lahman data up until 2022 and the statcast data from 2023 and 2024:
# Append the two dataframes
player_kaa_stats <- bind_rows(player_kaa_stats, player_kaa_stats_statcast)
# View the updated dataframe
head(player_kaa_stats)
## yearID nameLast nameFirst nameFull KAA KAA+ KBFAA
## 1 1871 Bechtel George Bechtel, George -8.210526 10.85714 -0.002184226
## 2 1871 Brainard Asa Brainard, Asa 3.789474 141.14286 0.001036172
## 3 1871 Ferguson Bob Ferguson, Bob -9.210526 0.00000 -0.009033541
## 4 1871 Fisher Cherokee Fisher, Cherokee 5.789474 162.85714 0.004855348
## 5 1871 Fleet Frank Fleet, Frank -9.210526 0.00000 -0.009033541
## 6 1871 Flowers Dickie Flowers, Dickie -9.210526 0.00000 -0.009033541
## KBFAA+
## 1 75.82093
## 2 111.47028
## 3 0.00000
## 4 153.74800
## 5 0.00000
## 6 0.00000
Let’s again look at the distribution of the new statistics:
# Histogram for KAA
p1 <- ggplot(player_kaa_stats, aes(x = KAA)) +
geom_histogram(binwidth = 10, fill = "blue", color = "black", alpha = 0.7) +
theme_minimal() +
labs(title = "Histogram of KAA", x = "KAA", y = "Frequency")
# Histogram for KAA+
p2 <- ggplot(player_kaa_stats, aes(x = `KAA+`)) +
geom_histogram(binwidth = 5, fill = "red", color = "black", alpha = 0.7) +
theme_minimal() +
labs(title = "Histogram of KAA+", x = "KAA+", y = "Frequency")
# Histogram for KBFAA
p3 <- ggplot(player_kaa_stats, aes(x = KBFAA)) +
geom_histogram(binwidth = 0.01, fill = "green", color = "black", alpha = 0.7) +
theme_minimal() +
labs(title = "Histogram of KBFAA", x = "KBFAA", y = "Frequency")
# Histogram for KBFAA+
p4 <- ggplot(player_kaa_stats, aes(x = `KBFAA+`)) +
geom_histogram(binwidth = 5, fill = "purple", color = "black", alpha = 0.7) +
theme_minimal() +
labs(title = "Histogram of KBFAA+", x = "KBFAA+", y = "Frequency")
# Arrange the histograms in a grid
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol = 2)
Now, finally, let’s look at Kerry Wood’s 1998 season data and Paul Skenes 2024 season data to compare:
skenes_statcast_data <- statcast_pitching_data %>%
filter(nameLast == "Skenes") %>%
mutate(K_per_Batter = strikeout/pa)
skenes_statcast_data
## yearID nameLast nameFirst nameFull pa strikeout K_p_BF Mean_K_p_BF
## 1 2024 Skenes Paul Skenes, Paul 490 158 0.322449 0.2162191
## Mean_K KAA KAA+ KBFAA KBFAA+ K_per_Batter
## 1 49.84666 108.1533 316.9721 0.1062299 149.1307 0.322449
wood_lahman_data <- pitching_clean_with_names %>%
filter(nameLast == "Wood" & yearID == 1998) %>%
mutate(K_per_Batter = SO/BFP)
wood_lahman_data
## playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H ER HR BB
## 1 woodke02 1998 1 CHN NL 13 6 26 26 1 1 0 500 117 63 14 85
## SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF GIDP K_p_BF Mean_K_p_BF
## 1 233 0.196 3.4 1 6 11 3 699 0 69 2 4 3 0.3333333 0.1625023
## Mean_K KAA KAA+ KBFAA KBFAA+ nameFirst nameLast nameFull
## 1 51.11058 181.8894 455.8743 0.170831 205.1253 Kerry Wood Wood, Kerry
## K_per_Batter
## 1 0.3333333
Looking on the face, we see Wood’s 1998 season with a strikeout per batter faced of 33.3%, striking out 1/3 of all the batter’s he faced. Whereas Skenes is striking out 32.2% of batter’s faced. Pretty close in terms of just what percetenage of batter’s faced they are striking out, with Wood still having a slight edge.
Now let’s compare using KAA, KAA+, KBFAA, and KBFAA+
Kerry_Wood_Paul_Skenes_Comparison <- player_kaa_stats %>%
filter((yearID == 1998 & nameLast == "Wood") | (yearID == 2024 & nameLast == "Skenes"))
Kerry_Wood_Paul_Skenes_Comparison
## yearID nameLast nameFirst nameFull KAA KAA+ KBFAA KBFAA+
## 1 1998 Wood Kerry Wood, Kerry 181.8894 455.8743 0.1708310 205.1253
## 2 2024 Skenes Paul Skenes, Paul 108.1533 316.9721 0.1062299 149.1307
Let’s visualize this a little nicer:
# Load necessary library
library(knitr)
## Warning: package 'knitr' was built under R version 4.3.3
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.3.3
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(grid)
library(gridExtra)
# Create a simple HTML table using kable
Kerry_Wood_Paul_Skenes_Comparison %>%
kable(format = "html", table.attr = "class='table table-bordered table-hover'", caption = "<b><center>Kerry Wood and Paul Skenes Comparison Using Strikeouts Above Average(KAA) and Strikeouts per Batter Faced Above Average(KBFAA)</center></b>") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "hover", "condensed"))
yearID | nameLast | nameFirst | nameFull | KAA | KAA+ | KBFAA | KBFAA+ |
---|---|---|---|---|---|---|---|
1998 | Wood | Kerry | Wood, Kerry | 181.8894 | 455.8743 | 0.1708310 | 205.1253 |
2024 | Skenes | Paul | Skenes, Paul | 108.1533 | 316.9721 | 0.1062299 | 149.1307 |
It appears, based on this comparison, using KAA+, that Kerry Wood was about 356% better than league average in 1998, whereas Paul Skenes has 217% than league average in 2024. However, keep in mind, we see Skenes faced 490 batters to Wood’s 699. Which is why it’s important to look at Strikeouts per Batter Faced Abover Average, in 1998 Kerry Wood was about 105% better than league average, whereas Paul Skenes has been about 49% better than league average to this point.
This indicates, in terms of evaluating Kerry Wood’s 1998 season was more with Paul Skenes 2024 season using strikeouts above average or strikeouts above average per batter faced, Wood was striking out much more than the league was at the time than Paul Skenes is today. Looking at the line graphs above, we see that the trend in the league has been increase K’s per batter faced, over time. So even though Wood and Skenes are striking out about the same amount of batter’s faced, if we look at it compared to the league average at the time, Wood was doing it at a much higher rate than Skenes, compared to his peers.
If we create a new dataframe given the strikeout info provided by Tom Tango, to specifically look at the first 490 plate appearances of Wood, we would first create a data frame and join in the Mean K data:
kerry_wood_data <- data.frame(
yearID = 1998,
playerID = "woodke01",
nameLast = "Wood",
nameFirst = "Kerry",
nameFull = "Wood, Kerry",
BFA = 490,
strikeout = 169,
K_p_BF = 169/490
)
# Merge Kerry Wood's data with the mean data
kerry_wood_stats <- kerry_wood_data %>%
left_join(MeanKperBatterFacedbyYear, by = "yearID") %>%
mutate(
KAA = strikeout - Mean_K,
`KAA+` = (strikeout/Mean_K) * 100,
KBFAA = K_p_BF - Mean_K_p_BF,
`KBFAA+` = (K_p_BF/Mean_K_p_BF) * 100
)
Okay, now let’s append Paul Skenes data back in for the first 490 PA comparison
kerry_wood_mod <- kerry_wood_stats %>%
select(-playerID, -nameLast, -nameFirst) %>% # Remove playerID column
rename(pa = BFA) # Rename BFA to pa
# Modify skenes_statcast_data
skenes_statcast_mod <- skenes_statcast_data %>%
select(-K_per_Batter, -nameLast, -nameFirst) # Remove K_per_Batter column
# Append dataframes together
skenes_wood_first_four_ninety_pa_comparison <- bind_rows(kerry_wood_mod, skenes_statcast_mod)
#rename some things
skenes_wood_first_four_ninety_pa_comparison <- skenes_wood_first_four_ninety_pa_comparison %>%
rename(
Year = yearID,
Name = nameFull,
Strikeouts = strikeout,
`Stikeouts per Batter Faced` = K_p_BF,
`Mean strikeouts per Batter Faced by Year` = Mean_K_p_BF
)
print(skenes_wood_first_four_ninety_pa_comparison)
## Year Name pa Strikeouts Stikeouts per Batter Faced
## 1 1998 Wood, Kerry 490 169 0.344898
## 2 2024 Skenes, Paul 490 158 0.322449
## Mean strikeouts per Batter Faced by Year Mean_K KAA KAA+ KBFAA
## 1 0.1625023 51.11058 117.8894 330.6556 0.1823956
## 2 0.2162191 49.84666 108.1533 316.9721 0.1062299
## KBFAA+
## 1 212.2419
## 2 149.1307
Now let’s create a nicer looking table
# Create a simple HTML table using kable
skenes_wood_first_four_ninety_pa_comparison %>%
kable(format = "html", table.attr = "class='table table-bordered table-hover'", caption = "<b><center>Kerry Wood and Paul Skenes Comparison of the first 490 Batters Faced Using Strikeouts Above Average(KAA) and Strikeouts per Batter Faced Above Average(KBFAA)</center></b>") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "hover", "condensed"))
Year | Name | pa | Strikeouts | Stikeouts per Batter Faced | Mean strikeouts per Batter Faced by Year | Mean_K | KAA | KAA+ | KBFAA | KBFAA+ |
---|---|---|---|---|---|---|---|---|---|---|
1998 | Wood, Kerry | 490 | 169 | 0.344898 | 0.1625023 | 51.11058 | 117.8894 | 330.6556 | 0.1823956 | 212.2419 |
2024 | Skenes, Paul | 490 | 158 | 0.322449 | 0.2162191 | 49.84666 | 108.1533 | 316.9721 | 0.1062299 | 149.1307 |
Summary and Analysis
Kerry Wood’s 1998 Season:
Wood's numbers without league comparison are 169 strikeouts, and striking out 34.5% batters faced, (greater than 1/3 of batter's faced)
KAA of 117.8894: This indicates he struck out about 118 more batters than league average.
KAA+ of 330.6556: This indicates that Kerry Wood was approximately 231% better than the league average in terms of strikeouts above average (KAA) in 1998. Such a high KAA+ value signifies an exceptional ability to strike out batters compared to his peers during that season.
KBFAA of .1823956: This indicates he struck out the batter's he faced at a rate of about 18.2% more than league average.
KBFAA+ of 212.2419: Reflecting on strikeouts per batter faced above average, Kerry Wood performed about 112% better than league average. This highlights his efficiency in striking out batters at a rate significantly higher than what was typical in the league.
Paul Skenes’ 2024 Season:
Skenes' numbers without league comparison are 158 strikeouts, and striking out 32.2% batters faced.
KAA of 108.1533: This indicates he struck out about 108 more batters than league average.
KAA+ of 316.9721: Paul Skenes shows a KAA+ approximately 217% above the league average in 2024, marking him also as an outstanding pitcher in terms of strikeouts above average, though slightly less than Wood's exceptional performance.
KBFAA of .1062299: This indicates he struck out the batter's he faced at a rate of about 10.6% more than league average.
KBFAA+ of 149.1307: Skenes' ability to strike out batters compared to how many he faced is about 49% better than the league average. While still impressive, it is notably lower than Wood's rate, suggesting that while Skenes is effective, he does not dominate to the same extent as Wood did.
Comparative Analysis:
Despite the similar number of plate appearances (PA) each pitcher faced in their respective seasons, Kerry Wood's striking ability relative to his peers was more pronounced. His KAA, KAA+, KBFAA, and KBFAA+ scores both surpass those of Paul Skenes, indicating a higher level of dominance during his era.
The data also suggests a trend of increasing strikeouts per batter faced over the years within the league. Given this context, Wood's performance becomes even more impressive, demonstrating his capability to significantly outperform a league striking out significantly less batters than the era Skenes is pitching in.
Conclusion:
Evaluative Context: When comparing performances relative to their contemporaries, Kerry Wood's 1998 season was more dominant compared to Paul Skenes' 2024 season, both in absolute terms and relative efficiency (strikeouts per batter faced).
League Trends: Considering the trend of increasing strikeouts over time, Wood’s ability to far surpass the league average of his day emphasizes his exceptional skill during an era of relatively fewer strikeouts. This comparative advantage highlights his standout season even more when juxtaposed with Skenes, who also excels, but in a time when striking out batters is more common.
This analysis encapsulates the comparative excellence of Kerry Wood and Paul Skenes by situating their achievements within the broader statistical trends of their respective eras. While looking at face value of the counting statistics provides a pretty comparable view of both pitchers, this analysis underscores the importance of contextualizing individual statistics within league-wide performance metrics to fully appreciate the historical impact of players’ seasons.