library(dplyr)
library(tidyr)
library(ggplot2)
library(mclust)
library(jagsUI)
library(readr)
library(knitr)
library(tibble)
library(vegan)
library(ggrepel)
library(ggnewscale)
library(MASS)

Introduction

Major League Baseball’s Statcast system assigns discrete pitch-type labels — including curveball (CU), slider (SL), and sweeper (ST) — to every pitch via a proprietary classifier that draws on movement, velocity, and spin. The sweeper label was introduced publicly around 2022 and has since grown rapidly in usage, raising a fundamental question: does this label identify a genuinely novel pitch type, or does it represent a refinement of an already continuous breaking-ball movement spectrum?

An important caveat governs all analyses here. Because Statcast’s classifier itself uses movement variables as inputs, any analysis that regresses movement features against Statcast labels is partly circular — the labels and the features are not independent. This analysis therefore characterises the geometry of the movement space underlying the taxonomy rather than independently validating the labels. The question is whether the movement data, taken on their own terms, support discrete categories or a continuum.

The emergence of the sweeper label may represent refinement of an already continuous breaking-ball movement spectrum rather than the appearance of an entirely novel pitch type. The traditional distinction between curveballs, sliders, and sweepers maps loosely onto what pitchers and coaches have long called the “slurve” — a pitch that resists clean categorisation because it occupies intermediate space on the horizontal/vertical break continuum. The analyses below evaluate this possibility formally across three complementary stages.

Data

breaking_raw <- read_csv("C:/Users/fishd/OneDrive/Manuscripts/Statcast/WD/statcast_breaking.csv",
                         show_col_types = FALSE)
count_table <- breaking_raw |>
  dplyr::filter(pitch_type %in% c("CU", "SL", "ST")) |>
  dplyr::count(game_year, pitch_type) |>
  tidyr::pivot_wider(names_from = pitch_type, values_from = n, values_fill = 0) |>
  mutate(
    Total  = CU + SL + ST,
    CU_pct = round(100 * CU / Total, 1),
    SL_pct = round(100 * SL / Total, 1),
    ST_pct = round(100 * ST / Total, 1)
  ) |>
  dplyr::select(game_year, CU, CU_pct, SL, SL_pct, ST, ST_pct)

kable(count_table,
      col.names   = c("Season", "CU (n)", "CU (%)", "SL (n)", "SL (%)", "ST (n)", "ST (%)"),
      caption     = "Pitch counts and proportions by type and season.",
      format.args = list(big.mark = ","))
Pitch counts and proportions by type and season.
Season CU (n) CU (%) SL (n) SL (%) ST (n) ST (%)
2,020 18,142 31.3 37,050 63.9 2,823 4.9
2,021 41,890 27.9 94,951 63.3 13,159 8.8
2,022 39,483 26.3 87,730 58.5 22,787 15.2
2,023 34,433 23.0 82,638 55.1 32,929 22.0
2,024 33,708 22.5 78,389 52.3 37,903 25.3
2,025 35,421 23.6 75,205 50.1 39,374 26.2
label_colours <- c(CU = "#0072B2", SL = "#D55E00", ST = "#009E73")
label_names   <- c(CU = "Curveball", SL = "Slider",  ST = "Sweeper")

pct_tbl <- breaking_raw |>
  dplyr::filter(pitch_type %in% c("CU", "SL", "ST")) |>
  dplyr::count(game_year, pitch_type) |>
  group_by(game_year) |>
  mutate(pct = 100 * n / sum(n)) |>
  ungroup()

ggplot(pct_tbl, aes(game_year, pct, colour = pitch_type, group = pitch_type)) +
  geom_line(linewidth = 0.9) +
  geom_point(size = 2.5) +
  scale_colour_manual(values = label_colours, labels = label_names) +
  scale_y_continuous(labels = scales::label_number(suffix = "%")) +
  scale_x_continuous(breaks = min(pct_tbl$game_year):max(pct_tbl$game_year)) +
  labs(x = "Season", y = "Share of breaking balls", colour = NULL,
       title = "Breaking ball composition over time") +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), legend.position = "bottom")
Share of breaking balls classified as curveball, slider, or sweeper by season. The sweeper share has grown consistently since 2020, largely at the expense of the slider — consistent with reclassification along a pre-existing movement continuum rather than the emergence of a genuinely new pitch type.

Share of breaking balls classified as curveball, slider, or sweeper by season. The sweeper share has grown consistently since 2020, largely at the expense of the slider — consistent with reclassification along a pre-existing movement continuum rather than the emergence of a genuinely new pitch type.

REQUIRED_COLS <- c("pitcher", "game_year", "pitch_type", "p_throws",
                   "pfx_x", "pfx_z", "release_speed", "release_spin_rate",
                   "spin_axis", "description", "zone")

breaking <- breaking_raw |>
  dplyr::filter(pitch_type %in% c("CU", "SL", "ST")) |>
  dplyr::select(any_of(REQUIRED_COLS)) |>
  mutate(
    pfx_x_adj     = if_else(p_throws == "L", -pfx_x, pfx_x),
    spin_axis_adj = if_else(p_throws == "L", 360 - spin_axis, spin_axis),
    whiff         = as.integer(description == "swinging_strike"),
    chase         = as.integer(
                      description %in% c("swinging_strike", "swinging_strike_blocked",
                                         "foul", "foul_tip", "hit_into_play") &
                      zone %in% c(11, 12, 13, 14)
                    ),
    label_cu      = as.integer(pitch_type == "CU"),
    label_sl      = as.integer(pitch_type == "SL"),
    pitcher_id    = as.integer(factor(pitcher)),
    year_f        = factor(game_year)
  ) |>
  drop_na(pfx_x_adj, pfx_z, release_speed, release_spin_rate, spin_axis_adj) |>
  mutate(
    pfx_x_std     = as.numeric(scale(pfx_x_adj)),
    pfx_z_std     = as.numeric(scale(pfx_z)),
    velo_std      = as.numeric(scale(release_speed)),
    spin_std      = as.numeric(scale(release_spin_rate)),
    spin_axis_std = as.numeric(scale(spin_axis_adj))
  ) |>
  mutate(
    pfx_x_std     = residuals(lm(pfx_x_std     ~ year_f, data = pick(everything()))),
    pfx_z_std     = residuals(lm(pfx_z_std     ~ year_f, data = pick(everything()))),
    velo_std      = residuals(lm(velo_std      ~ year_f, data = pick(everything()))),
    spin_std      = residuals(lm(spin_std      ~ year_f, data = pick(everything()))),
    spin_axis_std = residuals(lm(spin_axis_std ~ year_f, data = pick(everything())))
  )

cat(sprintf(
  "%s pitches — CU: %s  SL: %s  ST: %s  |  %s unique pitchers\n",
  format(nrow(breaking),                   big.mark = ","),
  format(sum(breaking$pitch_type == "CU"), big.mark = ","),
  format(sum(breaking$pitch_type == "SL"), big.mark = ","),
  format(sum(breaking$pitch_type == "ST"), big.mark = ","),
  format(n_distinct(breaking$pitcher),     big.mark = ",")
))
## 804,590 pitches — CU: 202,261  SL: 454,138  ST: 148,191  |  1,607 unique pitchers

Stage 1: PCA

PCA identifies the dominant axes of covariation in pitch shape across five standardised, year-residualised features: horizontal break (pfx_x_adj, sign-flipped for LHP), vertical break (pfx_z), velocity, spin rate, and spin axis (spin_axis_adj, mirrored for LHP). Year effects are residualised from all features prior to analysis so that the movement geometry reflects within-year pitch physics rather than secular classifier drift. PCA is used here because it preserves global geometry — it provides an interpretable low-dimensional representation of the full movement space without imposing cluster structure, making it well-suited for evaluating whether the space is continuous or discretely partitioned.

The convergent validity question is whether the three Statcast labels occupy clearly separable regions of this space, or whether they instead grade into one another along shared movement axes.

pca_obj      <- readRDS("C:/Users/fishd/OneDrive/Manuscripts/Statcast/WD/fit_pca.rds")
rda_fit      <- pca_obj$rda_fit
all_scores   <- pca_obj$all_scores
loadings_raw <- pca_obj$loadings_raw
var_exp      <- pca_obj$var_exp
cum_var      <- pca_obj$cum_var
var_tbl <- data.frame(
  PC         = paste0("PC", seq_along(var_exp)),
  Proportion = round(var_exp, 3),
  Cumulative = round(cum_var, 3),
  check.names = FALSE
)
kable(var_tbl, caption = "Variance explained by each principal component.")
Variance explained by each principal component.
PC Proportion Cumulative
PC1 PC1 0.508 0.508
PC2 PC2 0.202 0.710
PC3 PC3 0.153 0.863
PC4 PC4 0.085 0.948
PC5 PC5 0.052 1.000
ldg <- loadings_raw |>
  dplyr::select(Feature = feature, PC1, PC2) |>
  mutate(
    Feature = recode(Feature,
      pfx_x_std     = "Horiz. break (pfx_x, adj.)",
      pfx_z_std     = "Vert. break (pfx_z)",
      velo_std      = "Release speed",
      spin_std      = "Spin rate",
      spin_axis_std = "Spin axis (adj.)"
    ),
    across(where(is.numeric), ~ round(., 3))
  )
kable(ldg, caption = "PCA loadings (vegan rda).")
PCA loadings (vegan rda).
Feature PC1 PC2
pfx_x_std Horiz. break (pfx_x, adj.) 0.397 0.179
pfx_z_std Vert. break (pfx_z) -0.426 0.152
velo_std Release speed -0.417 0.250
spin_std Spin rate 0.277 0.447
spin_axis_std Spin axis (adj.) -0.466 0.056

PC1 primarily reflects a horizontal-versus-vertical movement gradient, with spin axis loading in the opposite direction to horizontal break — consistent with the expected movement characteristics of sweepers (high horizontal break, low spin axis after handedness adjustment) versus curveballs (low horizontal break, high spin axis). PC2 captures a secondary dimension reflecting velocity and spin rate, largely independent of movement direction.

set.seed(1)
scores_plot <- all_scores |> slice_sample(n = min(nrow(all_scores), 25000))

ggplot(scores_plot, aes(x = PC1, y = PC2, colour = pitch_type, fill = pitch_type)) +
  geom_point(alpha = 0.04, size = 0.35) +
  stat_density_2d(
    geom        = "polygon",
    contour_var = "ndensity",
    breaks      = 0.5,
    alpha       = 0.4,
    linewidth   = 0
  ) +
  geom_segment(
    data = loadings_raw,
    aes(x = 0, xend = PC1, y = 0, yend = PC2),
    colour = "black", linewidth = 0.65, alpha = 0.55,
    arrow = arrow(length = unit(0.01, "npc")),
    inherit.aes = FALSE
  ) +
  geom_label_repel(
    data = loadings_raw,
    aes(x = PC1, y = PC2, label = label),
    size = 4, alpha = 0.85,
    label.padding = unit(0.3, "lines"),
    inherit.aes = FALSE
  ) +
  scale_colour_manual(values = label_colours, labels = label_names) +
  scale_fill_manual(values   = label_colours, labels = label_names) +
  labs(
    x      = sprintf("PC1 (%.1f%% variance)", var_exp[1] * 100),
    y      = sprintf("PC2 (%.1f%% variance)", var_exp[2] * 100),
    colour = NULL, fill = NULL,
    title  = sprintf("Breaking-ball movement space (%d–2025)",
                     min(breaking$game_year))
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "bottom") +
  guides(
    colour = guide_legend(override.aes = list(alpha = 1, size = 2)),
    fill   = guide_legend(override.aes = list(alpha = 0.5))
  )
PC1 vs PC2 coloured by Statcast pitch label. Shaded regions show the 50% kernel density contour for each label. The three labels grade into one another along PC1 rather than occupying well-separated regions, consistent with a continuous movement spectrum.

PC1 vs PC2 coloured by Statcast pitch label. Shaded regions show the 50% kernel density contour for each label. The three labels grade into one another along PC1 rather than occupying well-separated regions, consistent with a continuous movement spectrum.

Formal continuum test: LDA misclassification

If the three labels occupy clearly discrete regions of movement space, a linear discriminant classifier trained on the five features should recover them with high accuracy. Poor classification accuracy — particularly systematic confusion between adjacent categories — is consistent with a movement continuum rather than discrete types. LDA is fit on a random subsample (n = 30,000) to make computation tractable; leave-one-out cross-validation is used to avoid overfitting.

set.seed(7)
lda_df <- breaking |>
  group_by(pitch_type) |>
  slice_sample(n = 10000) |>
  ungroup()

lda_fit  <- lda(pitch_type ~ pfx_x_std + pfx_z_std + velo_std +
                              spin_std + spin_axis_std,
                data = lda_df, CV = TRUE)

lda_tab  <- table(Predicted = lda_fit$class, Actual = lda_df$pitch_type)
lda_acc  <- round(100 * sum(diag(lda_tab)) / sum(lda_tab), 1)

kable(lda_tab,
      caption = sprintf(
        "LDA leave-one-out confusion matrix (overall accuracy = %.1f%%). Rows are predicted labels; columns are Statcast labels. Systematic off-diagonal mass, especially between SL and ST, indicates the movement space does not cleanly support three discrete categories.", lda_acc))
LDA leave-one-out confusion matrix (overall accuracy = 84.0%). Rows are predicted labels; columns are Statcast labels. Systematic off-diagonal mass, especially between SL and ST, indicates the movement space does not cleanly support three discrete categories.
CU SL ST
CU 8370 576 435
SL 557 8014 736
ST 1073 1410 8829

An overall accuracy substantially below 100%, combined with systematic confusion between adjacent categories (particularly SL ↔︎ ST), supports the interpretation that Statcast labels impose discrete boundaries on a continuous movement distribution.


Stage 2: Gaussian Mixture Modelling

GMM is applied to the same five scaled features to ask whether an unsupervised partition of the movement space — with no knowledge of Statcast labels — recovers three components corresponding to CU, SL, and ST. BIC selects the number of components \(G\) and covariance parameterisation jointly. BIC search is conducted on a random subsample (n = 5,000) across G = 2–9 and four covariance structures (VVV, VVI, VEV, EEE); the full-data model is fit at G = 6, the elbow of the BIC curve beyond which marginal gains are negligible. Cluster–label alignment is quantified with the Adjusted Rand Index (ARI). This stage addresses discriminant validity: does a purely data-driven partition recover MLB’s operational taxonomy?

gmm_obj    <- readRDS("C:/Users/fishd/OneDrive/Manuscripts/Statcast/WD/fit_gmm.rds")
fit_gmm    <- gmm_obj$fit_gmm
ari        <- gmm_obj$ari
bic_df     <- gmm_obj$bic_df
bic_best   <- gmm_obj$bic_best
cluster_df <- gmm_obj$cluster_df

cat(sprintf("GMM fit: G = %d, model = %s  |  ARI = %.4f\n",
            fit_gmm$G, fit_gmm$modelName, ari))
## GMM fit: G = 6, model = VVV  |  ARI = 0.2709
ggplot(bic_df, aes(G, BIC, group = model, colour = model)) +
  geom_line(alpha = 0.4, linewidth = 0.5) +
  geom_point(data = bic_best, aes(G, BIC), colour = "black", size = 2.5) +
  geom_vline(xintercept = 6, linetype = "dashed", colour = "red") +
  scale_x_continuous(breaks = 2:9) +
  labs(x = "Number of GMM components (G)", y = "BIC", colour = "Model",
       title = "GMM BIC curve — reported G = 6 (VVV)") +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), legend.position = "right",
        legend.text = element_text(size = 8))
BIC values across candidate GMM components (subsample). The marked elbow after G=5 indicates diminishing returns from additional components. The data-preferred solution (G=6) exceeds the three-label taxonomy, suggesting richer latent structure than CU/SL/ST implies.

BIC values across candidate GMM components (subsample). The marked elbow after G=5 indicates diminishing returns from additional components. The data-preferred solution (G=6) exceeds the three-label taxonomy, suggesting richer latent structure than CU/SL/ST implies.

cluster_colours <- setNames(scales::hue_pal()(6), as.character(1:6))

ggplot(cluster_df, aes(PC1, PC2)) +
  geom_point(aes(colour = pitch_type), alpha = 0.15, size = 0.4) +
  scale_colour_manual(
    name   = "Statcast label",
    values = label_colours,
    labels = label_names,
    guide  = guide_legend(override.aes = list(alpha = 1, size = 2,
                                              linetype = "blank", linewidth = 0))
  ) +
  new_scale_colour() +
  stat_ellipse(
    aes(group = cluster, colour = cluster),
    type = "norm", level = 0.95, linetype = "dashed", linewidth = 0.6
  ) +
  scale_colour_manual(
    name   = "GMM cluster",
    values = cluster_colours,
    labels = paste("Cluster", 1:6),
    guide  = guide_legend(override.aes = list(alpha = 1, size = 0,
                                              linetype = "dashed", linewidth = 0.6))
  ) +
  labs(
    x     = sprintf("PC1 (%.1f%%)", var_exp[1] * 100),
    y     = sprintf("PC2 (%.1f%%)", var_exp[2] * 100),
    title = "GMM clusters (G = 6) vs. Statcast labels"
  ) +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), legend.position = "right")
GMM cluster assignments overlaid on PC space, coloured by Statcast label. Ellipses show 95% coverage regions. No cluster is dominated by a single pitch type, and adjacent clusters span label boundaries — consistent with a continuous movement spectrum rather than discrete categories.

GMM cluster assignments overlaid on PC space, coloured by Statcast label. Ellipses show 95% coverage regions. No cluster is dominated by a single pitch type, and adjacent clusters span label boundaries — consistent with a continuous movement spectrum rather than discrete categories.

uncertainty_df <- data.frame(
  uncertainty = fit_gmm$uncertainty,
  pitch_type  = breaking$pitch_type
)

ggplot(uncertainty_df, aes(x = uncertainty, fill = pitch_type, colour = pitch_type)) +
  geom_histogram(binwidth = 0.02, alpha = 0.6, position = "identity") +
  facet_wrap(~ pitch_type, ncol = 1,
             labeller = labeller(pitch_type = label_names)) +
  scale_fill_manual(values   = label_colours, guide = "none") +
  scale_colour_manual(values = label_colours, guide = "none") +
  labs(x = "Classification uncertainty (1 − max posterior probability)",
       y = "Count",
       title = "GMM posterior classification uncertainty by pitch type") +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), strip.background = element_blank())
Posterior classification uncertainty (1 − max component probability) by pitch type. The slider distribution is the most diffuse, consistent with its intermediate position on the movement continuum. Curveball and sweeper show lower uncertainty, consistent with their positions at opposing ends of the spectrum.

Posterior classification uncertainty (1 − max component probability) by pitch type. The slider distribution is the most diffuse, consistent with its intermediate position on the movement continuum. Curveball and sweeper show lower uncertainty, consistent with their positions at opposing ends of the spectrum.

tab <- table(Cluster = fit_gmm$classification, Label = breaking$pitch_type)
kable(tab, caption = "GMM cluster (G = 6) × Statcast label contingency table.")
GMM cluster (G = 6) × Statcast label contingency table.
CU SL ST
42835 42313 10147
21401 68802 130131
586 183813 1653
124760 2725 2213
11065 137134 3173
1614 19351 874

ARI interpretation: ARI = 1 indicates perfect cluster–label correspondence; ARI = 0 indicates correspondence no better than chance. Values below ~0.2 suggest the data-driven partition bears little relationship to the CU/SL/ST taxonomy.


Stage 3: Bayesian Hierarchical Models

Two Bayesian logistic models with pitcher-level random intercepts estimate whether the Statcast pitch-type label is associated with whiff or chase probability above and beyond the five physical features. These associations should be interpreted with caution as outcome differences across pitch types are likely confounded by pitcher archetypes (pitchers who throw sweepers may differ systematically from those who throw curveballs), platoon effects, count leverage, and usage context. The models do not establish causal effects of the label itself.

ST (sweeper) is the reference category. beta_cu and beta_sl estimate the whiff (or chase) log-odds of curveball and slider relative to sweeper after adjusting for horizontal break, vertical break, velocity, spin rate, and spin axis. This stage addresses predictive validity: do the labels carry incremental associations with outcomes once physical features are adjusted for?

Pitcher random intercepts are centred at zero with precision tau_alpha estimated from the data. Year effects are residualised from all features upstream and are not included as a separate model term. Models were fit on a stratified random sample of approximately 50,000 pitches (~16,667 per pitch type) to ensure balanced pitch-type representation. Bayesian inference was conducted using JAGS via jagsUI (3 chains, 5,000 burn-in, 10,000 iterations, thinning = 2).

fit_whiff <- readRDS("C:/Users/fishd/OneDrive/Manuscripts/Statcast/WD/fit_whiff.rds")
fit_chase <- readRDS("C:/Users/fishd/OneDrive/Manuscripts/Statcast/WD/fit_chase.rds")
rhat_w <- unlist(fit_whiff$Rhat); rhat_w <- rhat_w[!is.na(rhat_w)]
rhat_c <- unlist(fit_chase$Rhat); rhat_c <- rhat_c[!is.na(rhat_c)]

cat(sprintf("Whiff — max R-hat: %.3f | nodes > 1.1: %d/%d | %s\n",
            max(rhat_w), sum(rhat_w > 1.1), length(rhat_w),
            if (sum(rhat_w > 1.1) == 0) "CONVERGED" else "NOT CONVERGED"))
## Whiff — max R-hat: 1.003 | nodes > 1.1: 0/10 | CONVERGED
cat(sprintf("Chase — max R-hat: %.3f | nodes > 1.1: %d/%d | %s\n",
            max(rhat_c), sum(rhat_c > 1.1), length(rhat_c),
            if (sum(rhat_c > 1.1) == 0) "CONVERGED" else "NOT CONVERGED"))
## Chase — max R-hat: 1.001 | nodes > 1.1: 0/10 | CONVERGED
target_params <- c("beta_pfx_x", "beta_pfx_z", "beta_velo", "beta_spin",
                   "beta_spin_axis", "beta_cu", "beta_sl", "beta_cu_vs_sl",
                   "tau_alpha")

param_labels <- c(
  beta_pfx_x     = "Horiz. break (β_pfx_x)",
  beta_pfx_z     = "Vert. break (β_pfx_z)",
  beta_velo      = "Velocity (β_velo)",
  beta_spin      = "Spin rate (β_spin)",
  beta_spin_axis = "Spin axis (β_spin_axis)",
  beta_cu        = "CU vs. ST (β_cu)",
  beta_sl        = "SL vs. ST (β_sl)",
  beta_cu_vs_sl  = "CU vs. SL (β_cu − β_sl, derived)",
  tau_alpha      = "Pitcher RE precision (τ_α)"
)

make_post_tbl <- function(fit) {
  fit$summary[target_params, c("mean", "sd", "2.5%", "97.5%", "Rhat"),
              drop = FALSE] |>
    as.data.frame() |>
    rownames_to_column("Parameter") |>
    mutate(
      Parameter = recode(Parameter, !!!param_labels),
      across(where(is.numeric), ~ round(., 3))
    )
}

kable(
  make_post_tbl(fit_whiff),
  col.names = c("Parameter", "Post. mean", "Post. SD", "2.5%", "97.5%", "R-hat"),
  caption   = paste(
    "Whiff model posterior summaries. ST is the reference category.",
    "β_cu and β_sl are the adjusted log-odds of whiff for curveball and slider",
    "relative to sweeper. Credible intervals excluding zero indicate a label",
    "association with whiff beyond the five movement features, though",
    "unmeasured confounders may contribute to these associations."
  )
)
Whiff model posterior summaries. ST is the reference category. β_cu and β_sl are the adjusted log-odds of whiff for curveball and slider relative to sweeper. Credible intervals excluding zero indicate a label association with whiff beyond the five movement features, though unmeasured confounders may contribute to these associations.
Parameter Post. mean Post. SD 2.5% 97.5% R-hat
Horiz. break (β_pfx_x) -0.331 0.024 -0.379 -0.284 1.001
Vert. break (β_pfx_z) -0.309 0.027 -0.363 -0.256 1.000
Velocity (β_velo) 0.334 0.031 0.273 0.394 1.001
Spin rate (β_spin) 0.124 0.024 0.077 0.172 1.000
Spin axis (β_spin_axis) 0.104 0.023 0.059 0.148 1.000
CU vs. ST (β_cu) -1.800 0.062 -1.921 -1.676 1.002
SL vs. ST (β_sl) -1.770 0.047 -1.862 -1.681 1.000
CU vs. SL (β_cu − β_sl, derived) -0.030 0.072 -0.172 0.111 1.003
Pitcher RE precision (τ_α) 1.183 0.088 1.019 1.363 1.002
kable(
  make_post_tbl(fit_chase),
  col.names = c("Parameter", "Post. mean", "Post. SD", "2.5%", "97.5%", "R-hat"),
  caption   = paste(
    "Chase model posterior summaries. ST is the reference category.",
    "β_cu and β_sl are the adjusted log-odds of chasing an out-of-zone pitch",
    "for curveball and slider relative to sweeper."
  )
)
Chase model posterior summaries. ST is the reference category. β_cu and β_sl are the adjusted log-odds of chasing an out-of-zone pitch for curveball and slider relative to sweeper.
Parameter Post. mean Post. SD 2.5% 97.5% R-hat
Horiz. break (β_pfx_x) -0.208 0.021 -0.250 -0.166 1.001
Vert. break (β_pfx_z) -0.456 0.024 -0.503 -0.408 1.000
Velocity (β_velo) 0.468 0.028 0.415 0.523 1.000
Spin rate (β_spin) 0.100 0.022 0.057 0.142 1.000
Spin axis (β_spin_axis) 0.074 0.021 0.031 0.117 1.000
CU vs. ST (β_cu) -1.492 0.053 -1.597 -1.389 1.001
SL vs. ST (β_sl) -1.521 0.040 -1.601 -1.441 1.000
CU vs. SL (β_cu − β_sl, derived) 0.029 0.063 -0.095 0.153 1.000
Pitcher RE precision (τ_α) 1.565 0.114 1.350 1.799 1.001
compute_pred_probs <- function(fit) {
  samps <- do.call(rbind, lapply(fit$samples, as.data.frame))
  set.seed(42)
  alpha_draw <- rnorm(nrow(samps), 0, 1 / sqrt(samps$tau_alpha))
  data.frame(
    ST = plogis(alpha_draw),
    CU = plogis(alpha_draw + samps$beta_cu),
    SL = plogis(alpha_draw + samps$beta_sl)
  )
}

summarise_probs <- function(prob_df, outcome) {
  prob_df |>
    pivot_longer(everything(), names_to = "Pitch", values_to = "p") |>
    group_by(Pitch) |>
    summarise(
      Mean  = round(mean(p), 3),
      Lower = round(quantile(p, 0.025), 3),
      Upper = round(quantile(p, 0.975), 3),
      .groups = "drop"
    ) |>
    mutate(Outcome = outcome)
}

probs_whiff <- compute_pred_probs(fit_whiff)
probs_chase <- compute_pred_probs(fit_chase)

prob_tbl <- bind_rows(
  summarise_probs(probs_whiff, "Whiff"),
  summarise_probs(probs_chase, "Chase")
) |>
  mutate(Pitch = recode(Pitch, CU = "Curveball", SL = "Slider", ST = "Sweeper")) |>
  dplyr::select(Outcome, Pitch, Mean, Lower, Upper)

kable(
  prob_tbl,
  col.names = c("Outcome", "Pitch type", "Post. mean", "2.5%", "97.5%"),
  caption   = paste(
    "Posterior predicted probabilities for each pitch type at mean covariate",
    "values, marginalised over the pitcher random effect distribution.",
    "Wide credible intervals for the sweeper reflect heterogeneity among",
    "pitchers who throw the pitch."
  )
)
Posterior predicted probabilities for each pitch type at mean covariate values, marginalised over the pitcher random effect distribution. Wide credible intervals for the sweeper reflect heterogeneity among pitchers who throw the pitch.
Outcome Pitch type Post. mean 2.5% 97.5%
Whiff Curveball 0.174 0.026 0.507
Whiff Slider 0.178 0.026 0.513
Whiff Sweeper 0.498 0.138 0.860
Chase Curveball 0.209 0.044 0.524
Chase Slider 0.205 0.043 0.512
Chase Sweeper 0.498 0.170 0.828
prob_long <- bind_rows(
  probs_whiff |>
    pivot_longer(everything(), names_to = "Pitch", values_to = "p") |>
    mutate(Outcome = "Whiff"),
  probs_chase |>
    pivot_longer(everything(), names_to = "Pitch", values_to = "p") |>
    mutate(Outcome = "Chase")
) |>
  mutate(Pitch = recode(Pitch, CU = "Curveball", SL = "Slider", ST = "Sweeper"))

pitch_colours <- c(Curveball = "#0072B2", Slider = "#D55E00", Sweeper = "#009E73")

ggplot(prob_long, aes(x = p, fill = Pitch, colour = Pitch)) +
  geom_density(alpha = 0.35, linewidth = 0.8) +
  facet_wrap(~ Outcome, scales = "free_y") +
  scale_fill_manual(values   = pitch_colours) +
  scale_colour_manual(values = pitch_colours) +
  labs(x = "Predicted probability", y = "Posterior density",
       fill = NULL, colour = NULL,
       title = "Adjusted predicted probabilities by pitch type") +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), strip.background = element_blank(),
        legend.position = "bottom")
Posterior predicted probability distributions for each pitch type at mean covariate values. Curveball and slider distributions are nearly indistinguishable on both outcomes; the sweeper is associated with substantially higher whiff and chase rates, though these associations likely reflect pitcher archetypes and movement extremity rather than a categorically distinct pitch type.

Posterior predicted probability distributions for each pitch type at mean covariate values. Curveball and slider distributions are nearly indistinguishable on both outcomes; the sweeper is associated with substantially higher whiff and chase rates, though these associations likely reflect pitcher archetypes and movement extremity rather than a categorically distinct pitch type.

extract_samples <- function(fit, param) {
  do.call(rbind, lapply(fit$samples, as.data.frame))[[param]]
}

make_contrast_df <- function(fit, outcome_label) {
  bind_rows(
    data.frame(value = extract_samples(fit, "beta_cu"),
               contrast = "CU vs. ST", outcome = outcome_label),
    data.frame(value = extract_samples(fit, "beta_sl"),
               contrast = "SL vs. ST", outcome = outcome_label),
    data.frame(value = extract_samples(fit, "beta_cu_vs_sl"),
               contrast = "CU vs. SL (derived)", outcome = outcome_label)
  )
}

contrast_df <- bind_rows(
  make_contrast_df(fit_whiff, "Whiff"),
  make_contrast_df(fit_chase, "Chase")
)

contrast_colours <- c("CU vs. ST"           = "#0072B2",
                      "SL vs. ST"           = "#D55E00",
                      "CU vs. SL (derived)" = "#CC79A7")

ci_df <- contrast_df |>
  group_by(contrast, outcome) |>
  summarise(lo = quantile(value, 0.025), hi = quantile(value, 0.975),
            .groups = "drop")

ggplot(contrast_df, aes(value, fill = contrast, colour = contrast)) +
  geom_density(alpha = 0.3, linewidth = 0.8) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey30") +
  geom_segment(
    data = ci_df,
    aes(x = lo, xend = hi, y = -Inf, yend = -Inf, colour = contrast),
    linewidth = 2, lineend = "round", inherit.aes = FALSE,
    show.legend = FALSE
  ) +
  facet_grid(outcome ~ contrast, scales = "free_y") +
  scale_fill_manual(values   = contrast_colours) +
  scale_colour_manual(values = contrast_colours) +
  labs(x = expression(beta ~ "(log-odds)"), y = "Posterior density",
       title = "Posterior label associations with pitch outcomes",
       fill = NULL, colour = NULL) +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), legend.position = "none",
        strip.background = element_blank())
Posterior distributions for the three pairwise label contrasts. ST is the reference category. The dashed line marks zero. Curveball and slider are indistinguishable from each other on both outcomes; both differ from the sweeper, though these associations are likely confounded by pitcher archetypes and movement extremity.

Posterior distributions for the three pairwise label contrasts. ST is the reference category. The dashed line marks zero. Curveball and slider are indistinguishable from each other on both outcomes; both differ from the sweeper, though these associations are likely confounded by pitcher archetypes and movement extremity.


Summary

get_ci <- function(fit, param) round(fit$summary[param, c("2.5%", "97.5%")], 3)
ci_includes_zero <- function(ci) ci[1] < 0 & ci[2] > 0

w_cu <- get_ci(fit_whiff, "beta_cu")
w_sl <- get_ci(fit_whiff, "beta_sl")
w_cv <- get_ci(fit_whiff, "beta_cu_vs_sl")
c_cu <- get_ci(fit_chase, "beta_cu")
c_sl <- get_ci(fit_chase, "beta_sl")
c_cv <- get_ci(fit_chase, "beta_cu_vs_sl")

summary_df <- data.frame(
  Stage = c("PCA", "LDA continuum test", "GMM",
            "Whiff — CU vs. ST", "Whiff — SL vs. ST", "Whiff — CU vs. SL",
            "Chase — CU vs. ST", "Chase — SL vs. ST", "Chase — CU vs. SL"),
  Finding = c(
    sprintf("PC1 = %.1f%%; PC1+PC2 = %.1f%%", var_exp[1]*100, cum_var[2]*100),
    sprintf("Overall LDA accuracy = %.1f%%", lda_acc),
    sprintf("G = 6 (VVV, elbow); ARI = %.3f", ari),
    sprintf("95%% CI [%.3f, %.3f]%s", w_cu[1], w_cu[2],
            if (ci_includes_zero(w_cu)) " — includes zero" else " — excludes zero"),
    sprintf("95%% CI [%.3f, %.3f]%s", w_sl[1], w_sl[2],
            if (ci_includes_zero(w_sl)) " — includes zero" else " — excludes zero"),
    sprintf("95%% CI [%.3f, %.3f]%s", w_cv[1], w_cv[2],
            if (ci_includes_zero(w_cv)) " — includes zero" else " — excludes zero"),
    sprintf("95%% CI [%.3f, %.3f]%s", c_cu[1], c_cu[2],
            if (ci_includes_zero(c_cu)) " — includes zero" else " — excludes zero"),
    sprintf("95%% CI [%.3f, %.3f]%s", c_sl[1], c_sl[2],
            if (ci_includes_zero(c_sl)) " — includes zero" else " — excludes zero"),
    sprintf("95%% CI [%.3f, %.3f]%s", c_cv[1], c_cv[2],
            if (ci_includes_zero(c_cv)) " — includes zero" else " — excludes zero")
  ),
  Interpretation = c(
    if (cum_var[2] > 0.85) "Most variance in two dimensions — low-dimensional continuum"
    else "Variance spread across PCs — higher-dimensional movement space",
    if (lda_acc < 70) "Poor recovery — movement space does not support three discrete categories"
    else if (lda_acc < 85) "Moderate recovery — partial but imperfect label separability"
    else "Strong recovery — labels are separable in movement space",
    if (ari < 0.2) "Low ARI — unsupervised clusters do not recover CU/SL/ST"
    else if (ari < 0.5) "Moderate ARI — partial label–cluster alignment"
    else "High ARI — clusters align well with labels",
    if (ci_includes_zero(w_cu)) "No association beyond movement" else "Association beyond movement",
    if (ci_includes_zero(w_sl)) "No association beyond movement" else "Association beyond movement",
    if (ci_includes_zero(w_cv)) "CU and SL indistinguishable on whiff" else "CU and SL distinguishable on whiff",
    if (ci_includes_zero(c_cu)) "No association beyond movement" else "Association beyond movement",
    if (ci_includes_zero(c_sl)) "No association beyond movement" else "Association beyond movement",
    if (ci_includes_zero(c_cv)) "CU and SL indistinguishable on chase" else "CU and SL distinguishable on chase"
  )
)

kable(summary_df, caption = "Summary of findings across all three analysis stages.")
Summary of findings across all three analysis stages.
Stage Finding Interpretation
PCA PC1 = 50.8%; PC1+PC2 = 71.0% Variance spread across PCs — higher-dimensional movement space
LDA continuum test Overall LDA accuracy = 84.0% Moderate recovery — partial but imperfect label separability
GMM G = 6 (VVV, elbow); ARI = 0.271 Moderate ARI — partial label–cluster alignment
Whiff — CU vs. ST 95% CI [-1.921, -1.676] — excludes zero Association beyond movement
Whiff — SL vs. ST 95% CI [-1.862, -1.681] — excludes zero Association beyond movement
Whiff — CU vs. SL 95% CI [-0.172, 0.111] — includes zero CU and SL indistinguishable on whiff
Chase — CU vs. ST 95% CI [-1.597, -1.389] — excludes zero Association beyond movement
Chase — SL vs. ST 95% CI [-1.601, -1.441] — excludes zero Association beyond movement
Chase — CU vs. SL 95% CI [-0.095, 0.153] — includes zero CU and SL indistinguishable on chase

Take-home: The evidence across all three stages is consistent with breaking-ball taxonomy existing on a continuous movement spectrum rather than comprising three discrete pitch types. PCA reveals a single dominant movement gradient along which curveball, slider, and sweeper grade into one another; LDA shows that the five movement features do not cleanly recover the three labels; and GMM selects six components rather than three, with only moderate alignment between data-driven clusters and Statcast labels. The slider occupies the intermediate space on this spectrum — what pitchers and coaches have historically called the slurve — and shows the highest classification uncertainty of the three types.

The Bayesian models show that curveball and slider are associated with lower whiff and chase rates than sweeper after adjusting for movement features, but the curveball/slider distinction itself carries no additional association with either outcome. These associations should not be interpreted causally: sweepers are thrown by a distinct population of pitchers in distinct contexts, and the label-outcome associations likely reflect those differences rather than an independent effect of the label. The wide credible intervals on sweeper predicted probabilities are consistent with substantial pitcher-level heterogeneity.

The emergence of the sweeper as a formal Statcast category may represent refinement of an already continuous breaking-ball movement spectrum rather than the appearance of an entirely novel pitch type. The modern pitch taxonomy imposes linguistic categories — curveball, slider, sweeper — on aerodynamic variation that is better characterised as continuous, with the sweeper occupying one extreme of a horizontal break gradient that grades without a clear natural boundary into the slider, and the slider grading in turn into the curveball. Analytical frameworks that treat these labels as discrete inputs should account for the substantial movement-space overlap among adjacent categories.