library(dplyr)
library(tidyr)
library(ggplot2)
library(mclust)
library(jagsUI)
library(readr)
library(knitr)
library(tibble)
library(vegan)
library(ggrepel)
library(ggnewscale)
library(MASS)
Major League Baseball’s Statcast system assigns discrete pitch-type labels — including curveball (CU), slider (SL), and sweeper (ST) — to every pitch via a proprietary classifier that draws on movement, velocity, and spin. The sweeper label was introduced publicly around 2022 and has since grown rapidly in usage, raising a fundamental question: does this label identify a genuinely novel pitch type, or does it represent a refinement of an already continuous breaking-ball movement spectrum?
An important caveat governs all analyses here. Because Statcast’s classifier itself uses movement variables as inputs, any analysis that regresses movement features against Statcast labels is partly circular — the labels and the features are not independent. This analysis therefore characterises the geometry of the movement space underlying the taxonomy rather than independently validating the labels. The question is whether the movement data, taken on their own terms, support discrete categories or a continuum.
The emergence of the sweeper label may represent refinement of an already continuous breaking-ball movement spectrum rather than the appearance of an entirely novel pitch type. The traditional distinction between curveballs, sliders, and sweepers maps loosely onto what pitchers and coaches have long called the “slurve” — a pitch that resists clean categorisation because it occupies intermediate space on the horizontal/vertical break continuum. The analyses below evaluate this possibility formally across three complementary stages.
breaking_raw <- read_csv("C:/Users/fishd/OneDrive/Manuscripts/Statcast/WD/statcast_breaking.csv",
show_col_types = FALSE)
count_table <- breaking_raw |>
dplyr::filter(pitch_type %in% c("CU", "SL", "ST")) |>
dplyr::count(game_year, pitch_type) |>
tidyr::pivot_wider(names_from = pitch_type, values_from = n, values_fill = 0) |>
mutate(
Total = CU + SL + ST,
CU_pct = round(100 * CU / Total, 1),
SL_pct = round(100 * SL / Total, 1),
ST_pct = round(100 * ST / Total, 1)
) |>
dplyr::select(game_year, CU, CU_pct, SL, SL_pct, ST, ST_pct)
kable(count_table,
col.names = c("Season", "CU (n)", "CU (%)", "SL (n)", "SL (%)", "ST (n)", "ST (%)"),
caption = "Pitch counts and proportions by type and season.",
format.args = list(big.mark = ","))
| Season | CU (n) | CU (%) | SL (n) | SL (%) | ST (n) | ST (%) |
|---|---|---|---|---|---|---|
| 2,020 | 18,142 | 31.3 | 37,050 | 63.9 | 2,823 | 4.9 |
| 2,021 | 41,890 | 27.9 | 94,951 | 63.3 | 13,159 | 8.8 |
| 2,022 | 39,483 | 26.3 | 87,730 | 58.5 | 22,787 | 15.2 |
| 2,023 | 34,433 | 23.0 | 82,638 | 55.1 | 32,929 | 22.0 |
| 2,024 | 33,708 | 22.5 | 78,389 | 52.3 | 37,903 | 25.3 |
| 2,025 | 35,421 | 23.6 | 75,205 | 50.1 | 39,374 | 26.2 |
label_colours <- c(CU = "#0072B2", SL = "#D55E00", ST = "#009E73")
label_names <- c(CU = "Curveball", SL = "Slider", ST = "Sweeper")
pct_tbl <- breaking_raw |>
dplyr::filter(pitch_type %in% c("CU", "SL", "ST")) |>
dplyr::count(game_year, pitch_type) |>
group_by(game_year) |>
mutate(pct = 100 * n / sum(n)) |>
ungroup()
ggplot(pct_tbl, aes(game_year, pct, colour = pitch_type, group = pitch_type)) +
geom_line(linewidth = 0.9) +
geom_point(size = 2.5) +
scale_colour_manual(values = label_colours, labels = label_names) +
scale_y_continuous(labels = scales::label_number(suffix = "%")) +
scale_x_continuous(breaks = min(pct_tbl$game_year):max(pct_tbl$game_year)) +
labs(x = "Season", y = "Share of breaking balls", colour = NULL,
title = "Breaking ball composition over time") +
theme_bw() +
theme(panel.grid.minor = element_blank(), legend.position = "bottom")
Share of breaking balls classified as curveball, slider, or sweeper by season. The sweeper share has grown consistently since 2020, largely at the expense of the slider — consistent with reclassification along a pre-existing movement continuum rather than the emergence of a genuinely new pitch type.
REQUIRED_COLS <- c("pitcher", "game_year", "pitch_type", "p_throws",
"pfx_x", "pfx_z", "release_speed", "release_spin_rate",
"spin_axis", "description", "zone")
breaking <- breaking_raw |>
dplyr::filter(pitch_type %in% c("CU", "SL", "ST")) |>
dplyr::select(any_of(REQUIRED_COLS)) |>
mutate(
pfx_x_adj = if_else(p_throws == "L", -pfx_x, pfx_x),
spin_axis_adj = if_else(p_throws == "L", 360 - spin_axis, spin_axis),
whiff = as.integer(description == "swinging_strike"),
chase = as.integer(
description %in% c("swinging_strike", "swinging_strike_blocked",
"foul", "foul_tip", "hit_into_play") &
zone %in% c(11, 12, 13, 14)
),
label_cu = as.integer(pitch_type == "CU"),
label_sl = as.integer(pitch_type == "SL"),
pitcher_id = as.integer(factor(pitcher)),
year_f = factor(game_year)
) |>
drop_na(pfx_x_adj, pfx_z, release_speed, release_spin_rate, spin_axis_adj) |>
mutate(
pfx_x_std = as.numeric(scale(pfx_x_adj)),
pfx_z_std = as.numeric(scale(pfx_z)),
velo_std = as.numeric(scale(release_speed)),
spin_std = as.numeric(scale(release_spin_rate)),
spin_axis_std = as.numeric(scale(spin_axis_adj))
) |>
mutate(
pfx_x_std = residuals(lm(pfx_x_std ~ year_f, data = pick(everything()))),
pfx_z_std = residuals(lm(pfx_z_std ~ year_f, data = pick(everything()))),
velo_std = residuals(lm(velo_std ~ year_f, data = pick(everything()))),
spin_std = residuals(lm(spin_std ~ year_f, data = pick(everything()))),
spin_axis_std = residuals(lm(spin_axis_std ~ year_f, data = pick(everything())))
)
cat(sprintf(
"%s pitches — CU: %s SL: %s ST: %s | %s unique pitchers\n",
format(nrow(breaking), big.mark = ","),
format(sum(breaking$pitch_type == "CU"), big.mark = ","),
format(sum(breaking$pitch_type == "SL"), big.mark = ","),
format(sum(breaking$pitch_type == "ST"), big.mark = ","),
format(n_distinct(breaking$pitcher), big.mark = ",")
))
## 804,590 pitches — CU: 202,261 SL: 454,138 ST: 148,191 | 1,607 unique pitchers
PCA identifies the dominant axes of covariation in pitch shape across
five standardised, year-residualised features: horizontal break
(pfx_x_adj, sign-flipped for LHP), vertical break
(pfx_z), velocity, spin rate, and spin axis
(spin_axis_adj, mirrored for LHP). Year effects are
residualised from all features prior to analysis so that the movement
geometry reflects within-year pitch physics rather than secular
classifier drift. PCA is used here because it preserves global geometry
— it provides an interpretable low-dimensional representation of the
full movement space without imposing cluster structure, making it
well-suited for evaluating whether the space is continuous or discretely
partitioned.
The convergent validity question is whether the three Statcast labels occupy clearly separable regions of this space, or whether they instead grade into one another along shared movement axes.
pca_obj <- readRDS("C:/Users/fishd/OneDrive/Manuscripts/Statcast/WD/fit_pca.rds")
rda_fit <- pca_obj$rda_fit
all_scores <- pca_obj$all_scores
loadings_raw <- pca_obj$loadings_raw
var_exp <- pca_obj$var_exp
cum_var <- pca_obj$cum_var
var_tbl <- data.frame(
PC = paste0("PC", seq_along(var_exp)),
Proportion = round(var_exp, 3),
Cumulative = round(cum_var, 3),
check.names = FALSE
)
kable(var_tbl, caption = "Variance explained by each principal component.")
| PC | Proportion | Cumulative | |
|---|---|---|---|
| PC1 | PC1 | 0.508 | 0.508 |
| PC2 | PC2 | 0.202 | 0.710 |
| PC3 | PC3 | 0.153 | 0.863 |
| PC4 | PC4 | 0.085 | 0.948 |
| PC5 | PC5 | 0.052 | 1.000 |
ldg <- loadings_raw |>
dplyr::select(Feature = feature, PC1, PC2) |>
mutate(
Feature = recode(Feature,
pfx_x_std = "Horiz. break (pfx_x, adj.)",
pfx_z_std = "Vert. break (pfx_z)",
velo_std = "Release speed",
spin_std = "Spin rate",
spin_axis_std = "Spin axis (adj.)"
),
across(where(is.numeric), ~ round(., 3))
)
kable(ldg, caption = "PCA loadings (vegan rda).")
| Feature | PC1 | PC2 | |
|---|---|---|---|
| pfx_x_std | Horiz. break (pfx_x, adj.) | 0.397 | 0.179 |
| pfx_z_std | Vert. break (pfx_z) | -0.426 | 0.152 |
| velo_std | Release speed | -0.417 | 0.250 |
| spin_std | Spin rate | 0.277 | 0.447 |
| spin_axis_std | Spin axis (adj.) | -0.466 | 0.056 |
PC1 primarily reflects a horizontal-versus-vertical movement gradient, with spin axis loading in the opposite direction to horizontal break — consistent with the expected movement characteristics of sweepers (high horizontal break, low spin axis after handedness adjustment) versus curveballs (low horizontal break, high spin axis). PC2 captures a secondary dimension reflecting velocity and spin rate, largely independent of movement direction.
set.seed(1)
scores_plot <- all_scores |> slice_sample(n = min(nrow(all_scores), 25000))
ggplot(scores_plot, aes(x = PC1, y = PC2, colour = pitch_type, fill = pitch_type)) +
geom_point(alpha = 0.04, size = 0.35) +
stat_density_2d(
geom = "polygon",
contour_var = "ndensity",
breaks = 0.5,
alpha = 0.4,
linewidth = 0
) +
geom_segment(
data = loadings_raw,
aes(x = 0, xend = PC1, y = 0, yend = PC2),
colour = "black", linewidth = 0.65, alpha = 0.55,
arrow = arrow(length = unit(0.01, "npc")),
inherit.aes = FALSE
) +
geom_label_repel(
data = loadings_raw,
aes(x = PC1, y = PC2, label = label),
size = 4, alpha = 0.85,
label.padding = unit(0.3, "lines"),
inherit.aes = FALSE
) +
scale_colour_manual(values = label_colours, labels = label_names) +
scale_fill_manual(values = label_colours, labels = label_names) +
labs(
x = sprintf("PC1 (%.1f%% variance)", var_exp[1] * 100),
y = sprintf("PC2 (%.1f%% variance)", var_exp[2] * 100),
colour = NULL, fill = NULL,
title = sprintf("Breaking-ball movement space (%d–2025)",
min(breaking$game_year))
) +
theme_minimal(base_size = 14) +
theme(legend.position = "bottom") +
guides(
colour = guide_legend(override.aes = list(alpha = 1, size = 2)),
fill = guide_legend(override.aes = list(alpha = 0.5))
)
PC1 vs PC2 coloured by Statcast pitch label. Shaded regions show the 50% kernel density contour for each label. The three labels grade into one another along PC1 rather than occupying well-separated regions, consistent with a continuous movement spectrum.
If the three labels occupy clearly discrete regions of movement space, a linear discriminant classifier trained on the five features should recover them with high accuracy. Poor classification accuracy — particularly systematic confusion between adjacent categories — is consistent with a movement continuum rather than discrete types. LDA is fit on a random subsample (n = 30,000) to make computation tractable; leave-one-out cross-validation is used to avoid overfitting.
set.seed(7)
lda_df <- breaking |>
group_by(pitch_type) |>
slice_sample(n = 10000) |>
ungroup()
lda_fit <- lda(pitch_type ~ pfx_x_std + pfx_z_std + velo_std +
spin_std + spin_axis_std,
data = lda_df, CV = TRUE)
lda_tab <- table(Predicted = lda_fit$class, Actual = lda_df$pitch_type)
lda_acc <- round(100 * sum(diag(lda_tab)) / sum(lda_tab), 1)
kable(lda_tab,
caption = sprintf(
"LDA leave-one-out confusion matrix (overall accuracy = %.1f%%). Rows are predicted labels; columns are Statcast labels. Systematic off-diagonal mass, especially between SL and ST, indicates the movement space does not cleanly support three discrete categories.", lda_acc))
| CU | SL | ST | |
|---|---|---|---|
| CU | 8370 | 576 | 435 |
| SL | 557 | 8014 | 736 |
| ST | 1073 | 1410 | 8829 |
An overall accuracy substantially below 100%, combined with systematic confusion between adjacent categories (particularly SL ↔︎ ST), supports the interpretation that Statcast labels impose discrete boundaries on a continuous movement distribution.
GMM is applied to the same five scaled features to ask whether an unsupervised partition of the movement space — with no knowledge of Statcast labels — recovers three components corresponding to CU, SL, and ST. BIC selects the number of components \(G\) and covariance parameterisation jointly. BIC search is conducted on a random subsample (n = 5,000) across G = 2–9 and four covariance structures (VVV, VVI, VEV, EEE); the full-data model is fit at G = 6, the elbow of the BIC curve beyond which marginal gains are negligible. Cluster–label alignment is quantified with the Adjusted Rand Index (ARI). This stage addresses discriminant validity: does a purely data-driven partition recover MLB’s operational taxonomy?
gmm_obj <- readRDS("C:/Users/fishd/OneDrive/Manuscripts/Statcast/WD/fit_gmm.rds")
fit_gmm <- gmm_obj$fit_gmm
ari <- gmm_obj$ari
bic_df <- gmm_obj$bic_df
bic_best <- gmm_obj$bic_best
cluster_df <- gmm_obj$cluster_df
cat(sprintf("GMM fit: G = %d, model = %s | ARI = %.4f\n",
fit_gmm$G, fit_gmm$modelName, ari))
## GMM fit: G = 6, model = VVV | ARI = 0.2709
ggplot(bic_df, aes(G, BIC, group = model, colour = model)) +
geom_line(alpha = 0.4, linewidth = 0.5) +
geom_point(data = bic_best, aes(G, BIC), colour = "black", size = 2.5) +
geom_vline(xintercept = 6, linetype = "dashed", colour = "red") +
scale_x_continuous(breaks = 2:9) +
labs(x = "Number of GMM components (G)", y = "BIC", colour = "Model",
title = "GMM BIC curve — reported G = 6 (VVV)") +
theme_bw() +
theme(panel.grid.minor = element_blank(), legend.position = "right",
legend.text = element_text(size = 8))
BIC values across candidate GMM components (subsample). The marked elbow after G=5 indicates diminishing returns from additional components. The data-preferred solution (G=6) exceeds the three-label taxonomy, suggesting richer latent structure than CU/SL/ST implies.
cluster_colours <- setNames(scales::hue_pal()(6), as.character(1:6))
ggplot(cluster_df, aes(PC1, PC2)) +
geom_point(aes(colour = pitch_type), alpha = 0.15, size = 0.4) +
scale_colour_manual(
name = "Statcast label",
values = label_colours,
labels = label_names,
guide = guide_legend(override.aes = list(alpha = 1, size = 2,
linetype = "blank", linewidth = 0))
) +
new_scale_colour() +
stat_ellipse(
aes(group = cluster, colour = cluster),
type = "norm", level = 0.95, linetype = "dashed", linewidth = 0.6
) +
scale_colour_manual(
name = "GMM cluster",
values = cluster_colours,
labels = paste("Cluster", 1:6),
guide = guide_legend(override.aes = list(alpha = 1, size = 0,
linetype = "dashed", linewidth = 0.6))
) +
labs(
x = sprintf("PC1 (%.1f%%)", var_exp[1] * 100),
y = sprintf("PC2 (%.1f%%)", var_exp[2] * 100),
title = "GMM clusters (G = 6) vs. Statcast labels"
) +
theme_bw() +
theme(panel.grid.minor = element_blank(), legend.position = "right")
GMM cluster assignments overlaid on PC space, coloured by Statcast label. Ellipses show 95% coverage regions. No cluster is dominated by a single pitch type, and adjacent clusters span label boundaries — consistent with a continuous movement spectrum rather than discrete categories.
uncertainty_df <- data.frame(
uncertainty = fit_gmm$uncertainty,
pitch_type = breaking$pitch_type
)
ggplot(uncertainty_df, aes(x = uncertainty, fill = pitch_type, colour = pitch_type)) +
geom_histogram(binwidth = 0.02, alpha = 0.6, position = "identity") +
facet_wrap(~ pitch_type, ncol = 1,
labeller = labeller(pitch_type = label_names)) +
scale_fill_manual(values = label_colours, guide = "none") +
scale_colour_manual(values = label_colours, guide = "none") +
labs(x = "Classification uncertainty (1 − max posterior probability)",
y = "Count",
title = "GMM posterior classification uncertainty by pitch type") +
theme_bw() +
theme(panel.grid.minor = element_blank(), strip.background = element_blank())
Posterior classification uncertainty (1 − max component probability) by pitch type. The slider distribution is the most diffuse, consistent with its intermediate position on the movement continuum. Curveball and sweeper show lower uncertainty, consistent with their positions at opposing ends of the spectrum.
tab <- table(Cluster = fit_gmm$classification, Label = breaking$pitch_type)
kable(tab, caption = "GMM cluster (G = 6) × Statcast label contingency table.")
| CU | SL | ST |
|---|---|---|
| 42835 | 42313 | 10147 |
| 21401 | 68802 | 130131 |
| 586 | 183813 | 1653 |
| 124760 | 2725 | 2213 |
| 11065 | 137134 | 3173 |
| 1614 | 19351 | 874 |
ARI interpretation: ARI = 1 indicates perfect cluster–label correspondence; ARI = 0 indicates correspondence no better than chance. Values below ~0.2 suggest the data-driven partition bears little relationship to the CU/SL/ST taxonomy.
Two Bayesian logistic models with pitcher-level random intercepts estimate whether the Statcast pitch-type label is associated with whiff or chase probability above and beyond the five physical features. These associations should be interpreted with caution as outcome differences across pitch types are likely confounded by pitcher archetypes (pitchers who throw sweepers may differ systematically from those who throw curveballs), platoon effects, count leverage, and usage context. The models do not establish causal effects of the label itself.
ST (sweeper) is the reference category. beta_cu and
beta_sl estimate the whiff (or chase) log-odds of curveball
and slider relative to sweeper after adjusting for horizontal break,
vertical break, velocity, spin rate, and spin axis. This stage addresses
predictive validity: do the labels carry incremental
associations with outcomes once physical features are adjusted for?
Pitcher random intercepts are centred at zero with precision
tau_alpha estimated from the data. Year effects are
residualised from all features upstream and are not included as a
separate model term. Models were fit on a stratified random sample of
approximately 50,000 pitches (~16,667 per pitch type) to ensure balanced
pitch-type representation. Bayesian inference was conducted using JAGS
via jagsUI (3 chains, 5,000 burn-in, 10,000 iterations,
thinning = 2).
fit_whiff <- readRDS("C:/Users/fishd/OneDrive/Manuscripts/Statcast/WD/fit_whiff.rds")
fit_chase <- readRDS("C:/Users/fishd/OneDrive/Manuscripts/Statcast/WD/fit_chase.rds")
rhat_w <- unlist(fit_whiff$Rhat); rhat_w <- rhat_w[!is.na(rhat_w)]
rhat_c <- unlist(fit_chase$Rhat); rhat_c <- rhat_c[!is.na(rhat_c)]
cat(sprintf("Whiff — max R-hat: %.3f | nodes > 1.1: %d/%d | %s\n",
max(rhat_w), sum(rhat_w > 1.1), length(rhat_w),
if (sum(rhat_w > 1.1) == 0) "CONVERGED" else "NOT CONVERGED"))
## Whiff — max R-hat: 1.003 | nodes > 1.1: 0/10 | CONVERGED
cat(sprintf("Chase — max R-hat: %.3f | nodes > 1.1: %d/%d | %s\n",
max(rhat_c), sum(rhat_c > 1.1), length(rhat_c),
if (sum(rhat_c > 1.1) == 0) "CONVERGED" else "NOT CONVERGED"))
## Chase — max R-hat: 1.001 | nodes > 1.1: 0/10 | CONVERGED
target_params <- c("beta_pfx_x", "beta_pfx_z", "beta_velo", "beta_spin",
"beta_spin_axis", "beta_cu", "beta_sl", "beta_cu_vs_sl",
"tau_alpha")
param_labels <- c(
beta_pfx_x = "Horiz. break (β_pfx_x)",
beta_pfx_z = "Vert. break (β_pfx_z)",
beta_velo = "Velocity (β_velo)",
beta_spin = "Spin rate (β_spin)",
beta_spin_axis = "Spin axis (β_spin_axis)",
beta_cu = "CU vs. ST (β_cu)",
beta_sl = "SL vs. ST (β_sl)",
beta_cu_vs_sl = "CU vs. SL (β_cu − β_sl, derived)",
tau_alpha = "Pitcher RE precision (τ_α)"
)
make_post_tbl <- function(fit) {
fit$summary[target_params, c("mean", "sd", "2.5%", "97.5%", "Rhat"),
drop = FALSE] |>
as.data.frame() |>
rownames_to_column("Parameter") |>
mutate(
Parameter = recode(Parameter, !!!param_labels),
across(where(is.numeric), ~ round(., 3))
)
}
kable(
make_post_tbl(fit_whiff),
col.names = c("Parameter", "Post. mean", "Post. SD", "2.5%", "97.5%", "R-hat"),
caption = paste(
"Whiff model posterior summaries. ST is the reference category.",
"β_cu and β_sl are the adjusted log-odds of whiff for curveball and slider",
"relative to sweeper. Credible intervals excluding zero indicate a label",
"association with whiff beyond the five movement features, though",
"unmeasured confounders may contribute to these associations."
)
)
| Parameter | Post. mean | Post. SD | 2.5% | 97.5% | R-hat |
|---|---|---|---|---|---|
| Horiz. break (β_pfx_x) | -0.331 | 0.024 | -0.379 | -0.284 | 1.001 |
| Vert. break (β_pfx_z) | -0.309 | 0.027 | -0.363 | -0.256 | 1.000 |
| Velocity (β_velo) | 0.334 | 0.031 | 0.273 | 0.394 | 1.001 |
| Spin rate (β_spin) | 0.124 | 0.024 | 0.077 | 0.172 | 1.000 |
| Spin axis (β_spin_axis) | 0.104 | 0.023 | 0.059 | 0.148 | 1.000 |
| CU vs. ST (β_cu) | -1.800 | 0.062 | -1.921 | -1.676 | 1.002 |
| SL vs. ST (β_sl) | -1.770 | 0.047 | -1.862 | -1.681 | 1.000 |
| CU vs. SL (β_cu − β_sl, derived) | -0.030 | 0.072 | -0.172 | 0.111 | 1.003 |
| Pitcher RE precision (τ_α) | 1.183 | 0.088 | 1.019 | 1.363 | 1.002 |
kable(
make_post_tbl(fit_chase),
col.names = c("Parameter", "Post. mean", "Post. SD", "2.5%", "97.5%", "R-hat"),
caption = paste(
"Chase model posterior summaries. ST is the reference category.",
"β_cu and β_sl are the adjusted log-odds of chasing an out-of-zone pitch",
"for curveball and slider relative to sweeper."
)
)
| Parameter | Post. mean | Post. SD | 2.5% | 97.5% | R-hat |
|---|---|---|---|---|---|
| Horiz. break (β_pfx_x) | -0.208 | 0.021 | -0.250 | -0.166 | 1.001 |
| Vert. break (β_pfx_z) | -0.456 | 0.024 | -0.503 | -0.408 | 1.000 |
| Velocity (β_velo) | 0.468 | 0.028 | 0.415 | 0.523 | 1.000 |
| Spin rate (β_spin) | 0.100 | 0.022 | 0.057 | 0.142 | 1.000 |
| Spin axis (β_spin_axis) | 0.074 | 0.021 | 0.031 | 0.117 | 1.000 |
| CU vs. ST (β_cu) | -1.492 | 0.053 | -1.597 | -1.389 | 1.001 |
| SL vs. ST (β_sl) | -1.521 | 0.040 | -1.601 | -1.441 | 1.000 |
| CU vs. SL (β_cu − β_sl, derived) | 0.029 | 0.063 | -0.095 | 0.153 | 1.000 |
| Pitcher RE precision (τ_α) | 1.565 | 0.114 | 1.350 | 1.799 | 1.001 |
compute_pred_probs <- function(fit) {
samps <- do.call(rbind, lapply(fit$samples, as.data.frame))
set.seed(42)
alpha_draw <- rnorm(nrow(samps), 0, 1 / sqrt(samps$tau_alpha))
data.frame(
ST = plogis(alpha_draw),
CU = plogis(alpha_draw + samps$beta_cu),
SL = plogis(alpha_draw + samps$beta_sl)
)
}
summarise_probs <- function(prob_df, outcome) {
prob_df |>
pivot_longer(everything(), names_to = "Pitch", values_to = "p") |>
group_by(Pitch) |>
summarise(
Mean = round(mean(p), 3),
Lower = round(quantile(p, 0.025), 3),
Upper = round(quantile(p, 0.975), 3),
.groups = "drop"
) |>
mutate(Outcome = outcome)
}
probs_whiff <- compute_pred_probs(fit_whiff)
probs_chase <- compute_pred_probs(fit_chase)
prob_tbl <- bind_rows(
summarise_probs(probs_whiff, "Whiff"),
summarise_probs(probs_chase, "Chase")
) |>
mutate(Pitch = recode(Pitch, CU = "Curveball", SL = "Slider", ST = "Sweeper")) |>
dplyr::select(Outcome, Pitch, Mean, Lower, Upper)
kable(
prob_tbl,
col.names = c("Outcome", "Pitch type", "Post. mean", "2.5%", "97.5%"),
caption = paste(
"Posterior predicted probabilities for each pitch type at mean covariate",
"values, marginalised over the pitcher random effect distribution.",
"Wide credible intervals for the sweeper reflect heterogeneity among",
"pitchers who throw the pitch."
)
)
| Outcome | Pitch type | Post. mean | 2.5% | 97.5% |
|---|---|---|---|---|
| Whiff | Curveball | 0.174 | 0.026 | 0.507 |
| Whiff | Slider | 0.178 | 0.026 | 0.513 |
| Whiff | Sweeper | 0.498 | 0.138 | 0.860 |
| Chase | Curveball | 0.209 | 0.044 | 0.524 |
| Chase | Slider | 0.205 | 0.043 | 0.512 |
| Chase | Sweeper | 0.498 | 0.170 | 0.828 |
prob_long <- bind_rows(
probs_whiff |>
pivot_longer(everything(), names_to = "Pitch", values_to = "p") |>
mutate(Outcome = "Whiff"),
probs_chase |>
pivot_longer(everything(), names_to = "Pitch", values_to = "p") |>
mutate(Outcome = "Chase")
) |>
mutate(Pitch = recode(Pitch, CU = "Curveball", SL = "Slider", ST = "Sweeper"))
pitch_colours <- c(Curveball = "#0072B2", Slider = "#D55E00", Sweeper = "#009E73")
ggplot(prob_long, aes(x = p, fill = Pitch, colour = Pitch)) +
geom_density(alpha = 0.35, linewidth = 0.8) +
facet_wrap(~ Outcome, scales = "free_y") +
scale_fill_manual(values = pitch_colours) +
scale_colour_manual(values = pitch_colours) +
labs(x = "Predicted probability", y = "Posterior density",
fill = NULL, colour = NULL,
title = "Adjusted predicted probabilities by pitch type") +
theme_bw() +
theme(panel.grid.minor = element_blank(), strip.background = element_blank(),
legend.position = "bottom")
Posterior predicted probability distributions for each pitch type at mean covariate values. Curveball and slider distributions are nearly indistinguishable on both outcomes; the sweeper is associated with substantially higher whiff and chase rates, though these associations likely reflect pitcher archetypes and movement extremity rather than a categorically distinct pitch type.
extract_samples <- function(fit, param) {
do.call(rbind, lapply(fit$samples, as.data.frame))[[param]]
}
make_contrast_df <- function(fit, outcome_label) {
bind_rows(
data.frame(value = extract_samples(fit, "beta_cu"),
contrast = "CU vs. ST", outcome = outcome_label),
data.frame(value = extract_samples(fit, "beta_sl"),
contrast = "SL vs. ST", outcome = outcome_label),
data.frame(value = extract_samples(fit, "beta_cu_vs_sl"),
contrast = "CU vs. SL (derived)", outcome = outcome_label)
)
}
contrast_df <- bind_rows(
make_contrast_df(fit_whiff, "Whiff"),
make_contrast_df(fit_chase, "Chase")
)
contrast_colours <- c("CU vs. ST" = "#0072B2",
"SL vs. ST" = "#D55E00",
"CU vs. SL (derived)" = "#CC79A7")
ci_df <- contrast_df |>
group_by(contrast, outcome) |>
summarise(lo = quantile(value, 0.025), hi = quantile(value, 0.975),
.groups = "drop")
ggplot(contrast_df, aes(value, fill = contrast, colour = contrast)) +
geom_density(alpha = 0.3, linewidth = 0.8) +
geom_vline(xintercept = 0, linetype = "dashed", colour = "grey30") +
geom_segment(
data = ci_df,
aes(x = lo, xend = hi, y = -Inf, yend = -Inf, colour = contrast),
linewidth = 2, lineend = "round", inherit.aes = FALSE,
show.legend = FALSE
) +
facet_grid(outcome ~ contrast, scales = "free_y") +
scale_fill_manual(values = contrast_colours) +
scale_colour_manual(values = contrast_colours) +
labs(x = expression(beta ~ "(log-odds)"), y = "Posterior density",
title = "Posterior label associations with pitch outcomes",
fill = NULL, colour = NULL) +
theme_bw() +
theme(panel.grid.minor = element_blank(), legend.position = "none",
strip.background = element_blank())
Posterior distributions for the three pairwise label contrasts. ST is the reference category. The dashed line marks zero. Curveball and slider are indistinguishable from each other on both outcomes; both differ from the sweeper, though these associations are likely confounded by pitcher archetypes and movement extremity.
get_ci <- function(fit, param) round(fit$summary[param, c("2.5%", "97.5%")], 3)
ci_includes_zero <- function(ci) ci[1] < 0 & ci[2] > 0
w_cu <- get_ci(fit_whiff, "beta_cu")
w_sl <- get_ci(fit_whiff, "beta_sl")
w_cv <- get_ci(fit_whiff, "beta_cu_vs_sl")
c_cu <- get_ci(fit_chase, "beta_cu")
c_sl <- get_ci(fit_chase, "beta_sl")
c_cv <- get_ci(fit_chase, "beta_cu_vs_sl")
summary_df <- data.frame(
Stage = c("PCA", "LDA continuum test", "GMM",
"Whiff — CU vs. ST", "Whiff — SL vs. ST", "Whiff — CU vs. SL",
"Chase — CU vs. ST", "Chase — SL vs. ST", "Chase — CU vs. SL"),
Finding = c(
sprintf("PC1 = %.1f%%; PC1+PC2 = %.1f%%", var_exp[1]*100, cum_var[2]*100),
sprintf("Overall LDA accuracy = %.1f%%", lda_acc),
sprintf("G = 6 (VVV, elbow); ARI = %.3f", ari),
sprintf("95%% CI [%.3f, %.3f]%s", w_cu[1], w_cu[2],
if (ci_includes_zero(w_cu)) " — includes zero" else " — excludes zero"),
sprintf("95%% CI [%.3f, %.3f]%s", w_sl[1], w_sl[2],
if (ci_includes_zero(w_sl)) " — includes zero" else " — excludes zero"),
sprintf("95%% CI [%.3f, %.3f]%s", w_cv[1], w_cv[2],
if (ci_includes_zero(w_cv)) " — includes zero" else " — excludes zero"),
sprintf("95%% CI [%.3f, %.3f]%s", c_cu[1], c_cu[2],
if (ci_includes_zero(c_cu)) " — includes zero" else " — excludes zero"),
sprintf("95%% CI [%.3f, %.3f]%s", c_sl[1], c_sl[2],
if (ci_includes_zero(c_sl)) " — includes zero" else " — excludes zero"),
sprintf("95%% CI [%.3f, %.3f]%s", c_cv[1], c_cv[2],
if (ci_includes_zero(c_cv)) " — includes zero" else " — excludes zero")
),
Interpretation = c(
if (cum_var[2] > 0.85) "Most variance in two dimensions — low-dimensional continuum"
else "Variance spread across PCs — higher-dimensional movement space",
if (lda_acc < 70) "Poor recovery — movement space does not support three discrete categories"
else if (lda_acc < 85) "Moderate recovery — partial but imperfect label separability"
else "Strong recovery — labels are separable in movement space",
if (ari < 0.2) "Low ARI — unsupervised clusters do not recover CU/SL/ST"
else if (ari < 0.5) "Moderate ARI — partial label–cluster alignment"
else "High ARI — clusters align well with labels",
if (ci_includes_zero(w_cu)) "No association beyond movement" else "Association beyond movement",
if (ci_includes_zero(w_sl)) "No association beyond movement" else "Association beyond movement",
if (ci_includes_zero(w_cv)) "CU and SL indistinguishable on whiff" else "CU and SL distinguishable on whiff",
if (ci_includes_zero(c_cu)) "No association beyond movement" else "Association beyond movement",
if (ci_includes_zero(c_sl)) "No association beyond movement" else "Association beyond movement",
if (ci_includes_zero(c_cv)) "CU and SL indistinguishable on chase" else "CU and SL distinguishable on chase"
)
)
kable(summary_df, caption = "Summary of findings across all three analysis stages.")
| Stage | Finding | Interpretation |
|---|---|---|
| PCA | PC1 = 50.8%; PC1+PC2 = 71.0% | Variance spread across PCs — higher-dimensional movement space |
| LDA continuum test | Overall LDA accuracy = 84.0% | Moderate recovery — partial but imperfect label separability |
| GMM | G = 6 (VVV, elbow); ARI = 0.271 | Moderate ARI — partial label–cluster alignment |
| Whiff — CU vs. ST | 95% CI [-1.921, -1.676] — excludes zero | Association beyond movement |
| Whiff — SL vs. ST | 95% CI [-1.862, -1.681] — excludes zero | Association beyond movement |
| Whiff — CU vs. SL | 95% CI [-0.172, 0.111] — includes zero | CU and SL indistinguishable on whiff |
| Chase — CU vs. ST | 95% CI [-1.597, -1.389] — excludes zero | Association beyond movement |
| Chase — SL vs. ST | 95% CI [-1.601, -1.441] — excludes zero | Association beyond movement |
| Chase — CU vs. SL | 95% CI [-0.095, 0.153] — includes zero | CU and SL indistinguishable on chase |
Take-home: The evidence across all three stages is consistent with breaking-ball taxonomy existing on a continuous movement spectrum rather than comprising three discrete pitch types. PCA reveals a single dominant movement gradient along which curveball, slider, and sweeper grade into one another; LDA shows that the five movement features do not cleanly recover the three labels; and GMM selects six components rather than three, with only moderate alignment between data-driven clusters and Statcast labels. The slider occupies the intermediate space on this spectrum — what pitchers and coaches have historically called the slurve — and shows the highest classification uncertainty of the three types.
The Bayesian models show that curveball and slider are associated with lower whiff and chase rates than sweeper after adjusting for movement features, but the curveball/slider distinction itself carries no additional association with either outcome. These associations should not be interpreted causally: sweepers are thrown by a distinct population of pitchers in distinct contexts, and the label-outcome associations likely reflect those differences rather than an independent effect of the label. The wide credible intervals on sweeper predicted probabilities are consistent with substantial pitcher-level heterogeneity.
The emergence of the sweeper as a formal Statcast category may represent refinement of an already continuous breaking-ball movement spectrum rather than the appearance of an entirely novel pitch type. The modern pitch taxonomy imposes linguistic categories — curveball, slider, sweeper — on aerodynamic variation that is better characterised as continuous, with the sweeper occupying one extreme of a horizontal break gradient that grades without a clear natural boundary into the slider, and the slider grading in turn into the curveball. Analytical frameworks that treat these labels as discrete inputs should account for the substantial movement-space overlap among adjacent categories.