Breaking Ball Taxonomy: Curveball, Slider, and Sweeper

library(dplyr)
library(tidyr)
library(ggplot2)
library(mclust)
library(jagsUI)
library(readr)
library(knitr)
library(tibble)
library(vegan)
library(ggrepel)
library(ggnewscale)
library(MASS)

Introduction

Major League Baseball’s Statcast system assigns discrete pitch-type labels — including curveball (CU), slider (SL), and sweeper (ST) — to every pitch via a proprietary classifier that draws on movement, velocity, and spin. The sweeper label was introduced publicly around 2022 and has since grown rapidly in usage, raising a fundamental question: does this label identify a genuinely novel pitch type, or does it represent a refinement of an already continuous breaking-ball movement spectrum?

An important caveat governs all analyses here. Because Statcast’s classifier itself uses movement variables as inputs, any analysis that regresses movement features against Statcast labels is partly circular — the labels and the features are not independent. This analysis therefore characterises the geometry of the movement space underlying the taxonomy rather than independently validating the labels. The question is whether the movement data, taken on their own terms, support discrete categories or a continuum.

The emergence of the sweeper label may represent refinement of an already continuous breaking-ball movement spectrum rather than the appearance of an entirely novel pitch type. The traditional distinction between curveballs, sliders, and sweepers maps loosely onto what pitchers and coaches have long called the “slurve” — a pitch that resists clean categorisation because it occupies intermediate space on the horizontal/vertical break continuum. The analyses below evaluate this possibility formally across three complementary stages.

Data

breaking_raw <- read_csv("C:/Users/fishd/OneDrive/Manuscripts/Statcast/WD/statcast_breaking.csv",
                         show_col_types = FALSE)

count_table <- breaking_raw |>
  dplyr::filter(pitch_type %in% c("CU", "SL", "ST")) |>
  dplyr::count(game_year, pitch_type) |>
  tidyr::pivot_wider(names_from = pitch_type, values_from = n, values_fill = 0) |>
  mutate(
    Total  = CU + SL + ST,
    CU_pct = round(100 * CU / Total, 1),
    SL_pct = round(100 * SL / Total, 1),
    ST_pct = round(100 * ST / Total, 1)
  ) |>
  dplyr::select(game_year, CU, CU_pct, SL, SL_pct, ST, ST_pct)

kable(count_table,
      col.names   = c("Season", "CU (n)", "CU (%)", "SL (n)", "SL (%)", "ST (n)", "ST (%)"),
      caption     = "Pitch counts and proportions by type and season.",
      format.args = list(big.mark = ","))

Pitch counts and proportions by type and season.
Season	CU (n)	CU (%)	SL (n)	SL (%)	ST (n)	ST (%)
2,020	18,142	31.3	37,050	63.9	2,823	4.9
2,021	41,890	27.9	94,951	63.3	13,159	8.8
2,022	39,483	26.3	87,730	58.5	22,787	15.2
2,023	34,433	23.0	82,638	55.1	32,929	22.0
2,024	33,708	22.5	78,389	52.3	37,903	25.3
2,025	35,421	23.6	75,205	50.1	39,374	26.2

label_colours <- c(CU = "#0072B2", SL = "#D55E00", ST = "#009E73")
label_names   <- c(CU = "Curveball", SL = "Slider",  ST = "Sweeper")

pct_tbl <- breaking_raw |>
  dplyr::filter(pitch_type %in% c("CU", "SL", "ST")) |>
  dplyr::count(game_year, pitch_type) |>
  group_by(game_year) |>
  mutate(pct = 100 * n / sum(n)) |>
  ungroup()

ggplot(pct_tbl, aes(game_year, pct, colour = pitch_type, group = pitch_type)) +
  geom_line(linewidth = 0.9) +
  geom_point(size = 2.5) +
  scale_colour_manual(values = label_colours, labels = label_names) +
  scale_y_continuous(labels = scales::label_number(suffix = "%")) +
  scale_x_continuous(breaks = min(pct_tbl$game_year):max(pct_tbl$game_year)) +
  labs(x = "Season", y = "Share of breaking balls", colour = NULL,
       title = "Breaking ball composition over time") +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), legend.position = "bottom")

Share of breaking balls classified as curveball, slider, or sweeper by season. The sweeper share has grown consistently since 2020, largely at the expense of the slider — consistent with reclassification along a pre-existing movement continuum rather than the emergence of a genuinely new pitch type.

REQUIRED_COLS <- c("pitcher", "game_year", "pitch_type", "p_throws",
                   "pfx_x", "pfx_z", "release_speed", "release_spin_rate",
                   "spin_axis", "description", "zone")

breaking <- breaking_raw |>
  dplyr::filter(pitch_type %in% c("CU", "SL", "ST")) |>
  dplyr::select(any_of(REQUIRED_COLS)) |>
  mutate(
    pfx_x_adj     = if_else(p_throws == "L", -pfx_x, pfx_x),
    spin_axis_adj = if_else(p_throws == "L", 360 - spin_axis, spin_axis),
    whiff         = as.integer(description == "swinging_strike"),
    chase         = as.integer(
                      description %in% c("swinging_strike", "swinging_strike_blocked",
                                         "foul", "foul_tip", "hit_into_play") &
                      zone %in% c(11, 12, 13, 14)
                    ),
    label_cu      = as.integer(pitch_type == "CU"),
    label_sl      = as.integer(pitch_type == "SL"),
    pitcher_id    = as.integer(factor(pitcher)),
    year_f        = factor(game_year)
  ) |>
  drop_na(pfx_x_adj, pfx_z, release_speed, release_spin_rate, spin_axis_adj) |>
  mutate(
    pfx_x_std     = as.numeric(scale(pfx_x_adj)),
    pfx_z_std     = as.numeric(scale(pfx_z)),
    velo_std      = as.numeric(scale(release_speed)),
    spin_std      = as.numeric(scale(release_spin_rate)),
    spin_axis_std = as.numeric(scale(spin_axis_adj))
  ) |>
  mutate(
    pfx_x_std     = residuals(lm(pfx_x_std     ~ year_f, data = pick(everything()))),
    pfx_z_std     = residuals(lm(pfx_z_std     ~ year_f, data = pick(everything()))),
    velo_std      = residuals(lm(velo_std      ~ year_f, data = pick(everything()))),
    spin_std      = residuals(lm(spin_std      ~ year_f, data = pick(everything()))),
    spin_axis_std = residuals(lm(spin_axis_std ~ year_f, data = pick(everything())))
  )

cat(sprintf(
  "%s pitches — CU: %s  SL: %s  ST: %s  |  %s unique pitchers\n",
  format(nrow(breaking),                   big.mark = ","),
  format(sum(breaking$pitch_type == "CU"), big.mark = ","),
  format(sum(breaking$pitch_type == "SL"), big.mark = ","),
  format(sum(breaking$pitch_type == "ST"), big.mark = ","),
  format(n_distinct(breaking$pitcher),     big.mark = ",")
))

## 804,590 pitches — CU: 202,261  SL: 454,138  ST: 148,191  |  1,607 unique pitchers

Stage 1: PCA

PCA identifies the dominant axes of covariation in pitch shape across five standardised, year-residualised features: horizontal break (pfx_x_adj, sign-flipped for LHP), vertical break (pfx_z), velocity, spin rate, and spin axis (spin_axis_adj, mirrored for LHP). Year effects are residualised from all features prior to analysis so that the movement geometry reflects within-year pitch physics rather than secular classifier drift. PCA is used here because it preserves global geometry — it provides an interpretable low-dimensional representation of the full movement space without imposing cluster structure, making it well-suited for evaluating whether the space is continuous or discretely partitioned.

The convergent validity question is whether the three Statcast labels occupy clearly separable regions of this space, or whether they instead grade into one another along shared movement axes.

pca_obj      <- readRDS("C:/Users/fishd/OneDrive/Manuscripts/Statcast/WD/fit_pca.rds")
rda_fit      <- pca_obj$rda_fit
all_scores   <- pca_obj$all_scores
loadings_raw <- pca_obj$loadings_raw
var_exp      <- pca_obj$var_exp
cum_var      <- pca_obj$cum_var

var_tbl <- data.frame(
  PC         = paste0("PC", seq_along(var_exp)),
  Proportion = round(var_exp, 3),
  Cumulative = round(cum_var, 3),
  check.names = FALSE
)
kable(var_tbl, caption = "Variance explained by each principal component.")

Variance explained by each principal component.
	PC	Proportion	Cumulative
PC1	PC1	0.508	0.508
PC2	PC2	0.202	0.710
PC3	PC3	0.153	0.863
PC4	PC4	0.085	0.948
PC5	PC5	0.052	1.000

ldg <- loadings_raw |>
  dplyr::select(Feature = feature, PC1, PC2) |>
  mutate(
    Feature = recode(Feature,
      pfx_x_std     = "Horiz. break (pfx_x, adj.)",
      pfx_z_std     = "Vert. break (pfx_z)",
      velo_std      = "Release speed",
      spin_std      = "Spin rate",
      spin_axis_std = "Spin axis (adj.)"
    ),
    across(where(is.numeric), ~ round(., 3))
  )
kable(ldg, caption = "PCA loadings (vegan rda).")

PCA loadings (vegan rda).
	Feature	PC1	PC2
pfx_x_std	Horiz. break (pfx_x, adj.)	0.397	0.179
pfx_z_std	Vert. break (pfx_z)	-0.426	0.152
velo_std	Release speed	-0.417	0.250
spin_std	Spin rate	0.277	0.447
spin_axis_std	Spin axis (adj.)	-0.466	0.056

PC1 primarily reflects a horizontal-versus-vertical movement gradient, with spin axis loading in the opposite direction to horizontal break — consistent with the expected movement characteristics of sweepers (high horizontal break, low spin axis after handedness adjustment) versus curveballs (low horizontal break, high spin axis). PC2 captures a secondary dimension reflecting velocity and spin rate, largely independent of movement direction.

set.seed(1)
scores_plot <- all_scores |> slice_sample(n = min(nrow(all_scores), 25000))

ggplot(scores_plot, aes(x = PC1, y = PC2, colour = pitch_type, fill = pitch_type)) +
  geom_point(alpha = 0.04, size = 0.35) +
  stat_density_2d(
    geom        = "polygon",
    contour_var = "ndensity",
    breaks      = 0.5,
    alpha       = 0.4,
    linewidth   = 0
  ) +
  geom_segment(
    data = loadings_raw,
    aes(x = 0, xend = PC1, y = 0, yend = PC2),
    colour = "black", linewidth = 0.65, alpha = 0.55,
    arrow = arrow(length = unit(0.01, "npc")),
    inherit.aes = FALSE
  ) +
  geom_label_repel(
    data = loadings_raw,
    aes(x = PC1, y = PC2, label = label),
    size = 4, alpha = 0.85,
    label.padding = unit(0.3, "lines"),
    inherit.aes = FALSE
  ) +
  scale_colour_manual(values = label_colours, labels = label_names) +
  scale_fill_manual(values   = label_colours, labels = label_names) +
  labs(
    x      = sprintf("PC1 (%.1f%% variance)", var_exp[1] * 100),
    y      = sprintf("PC2 (%.1f%% variance)", var_exp[2] * 100),
    colour = NULL, fill = NULL,
    title  = sprintf("Breaking-ball movement space (%d–2025)",
                     min(breaking$game_year))
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "bottom") +
  guides(
    colour = guide_legend(override.aes = list(alpha = 1, size = 2)),
    fill   = guide_legend(override.aes = list(alpha = 0.5))
  )

PC1 vs PC2 coloured by Statcast pitch label. Shaded regions show the 50% kernel density contour for each label. The three labels grade into one another along PC1 rather than occupying well-separated regions, consistent with a continuous movement spectrum.

Formal continuum test: LDA misclassification

If the three labels occupy clearly discrete regions of movement space, a linear discriminant classifier trained on the five features should recover them with high accuracy. Poor classification accuracy — particularly systematic confusion between adjacent categories — is consistent with a movement continuum rather than discrete types. LDA is fit on a random subsample (n = 30,000) to make computation tractable; leave-one-out cross-validation is used to avoid overfitting.

set.seed(7)
lda_df <- breaking |>
  group_by(pitch_type) |>
  slice_sample(n = 10000) |>
  ungroup()

lda_fit  <- lda(pitch_type ~ pfx_x_std + pfx_z_std + velo_std +
                              spin_std + spin_axis_std,
                data = lda_df, CV = TRUE)

lda_tab  <- table(Predicted = lda_fit$class, Actual = lda_df$pitch_type)
lda_acc  <- round(100 * sum(diag(lda_tab)) / sum(lda_tab), 1)

kable(lda_tab,
      caption = sprintf(
        "LDA leave-one-out confusion matrix (overall accuracy = %.1f%%). Rows are predicted labels; columns are Statcast labels. Systematic off-diagonal mass, especially between SL and ST, indicates the movement space does not cleanly support three discrete categories.", lda_acc))

LDA leave-one-out confusion matrix (overall accuracy = 84.0%). Rows are predicted labels; columns are Statcast labels. Systematic off-diagonal mass, especially between SL and ST, indicates the movement space does not cleanly support three discrete categories.
	CU	SL	ST
CU	8370	576	435
SL	557	8014	736
ST	1073	1410	8829

An overall accuracy substantially below 100%, combined with systematic confusion between adjacent categories (particularly SL ↔︎ ST), supports the interpretation that Statcast labels impose discrete boundaries on a continuous movement distribution.

Stage 2: Gaussian Mixture Modelling

GMM is applied to the same five scaled features to ask whether an unsupervised partition of the movement space — with no knowledge of Statcast labels — recovers three components corresponding to CU, SL, and ST. BIC selects the number of components \(G\) and covariance parameterisation jointly. BIC search is conducted on a random subsample (n = 5,000) across G = 2–9 and four covariance structures (VVV, VVI, VEV, EEE); the full-data model is fit at G = 6, the elbow of the BIC curve beyond which marginal gains are negligible. Cluster–label alignment is quantified with the Adjusted Rand Index (ARI). This stage addresses discriminant validity: does a purely data-driven partition recover MLB’s operational taxonomy?

gmm_obj    <- readRDS("C:/Users/fishd/OneDrive/Manuscripts/Statcast/WD/fit_gmm.rds")
fit_gmm    <- gmm_obj$fit_gmm
ari        <- gmm_obj$ari
bic_df     <- gmm_obj$bic_df
bic_best   <- gmm_obj$bic_best
cluster_df <- gmm_obj$cluster_df

cat(sprintf("GMM fit: G = %d, model = %s  |  ARI = %.4f\n",
            fit_gmm$G, fit_gmm$modelName, ari))

## GMM fit: G = 6, model = VVV  |  ARI = 0.2709

ggplot(bic_df, aes(G, BIC, group = model, colour = model)) +
  geom_line(alpha = 0.4, linewidth = 0.5) +
  geom_point(data = bic_best, aes(G, BIC), colour = "black", size = 2.5) +
  geom_vline(xintercept = 6, linetype = "dashed", colour = "red") +
  scale_x_continuous(breaks = 2:9) +
  labs(x = "Number of GMM components (G)", y = "BIC", colour = "Model",
       title = "GMM BIC curve — reported G = 6 (VVV)") +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), legend.position = "right",
        legend.text = element_text(size = 8))

BIC values across candidate GMM components (subsample). The marked elbow after G=5 indicates diminishing returns from additional components. The data-preferred solution (G=6) exceeds the three-label taxonomy, suggesting richer latent structure than CU/SL/ST implies.

cluster_colours <- setNames(scales::hue_pal()(6), as.character(1:6))

ggplot(cluster_df, aes(PC1, PC2)) +
  geom_point(aes(colour = pitch_type), alpha = 0.15, size = 0.4) +
  scale_colour_manual(
    name   = "Statcast label",
    values = label_colours,
    labels = label_names,
    guide  = guide_legend(override.aes = list(alpha = 1, size = 2,
                                              linetype = "blank", linewidth = 0))
  ) +
  new_scale_colour() +
  stat_ellipse(
    aes(group = cluster, colour = cluster),
    type = "norm", level = 0.95, linetype = "dashed", linewidth = 0.6
  ) +
  scale_colour_manual(
    name   = "GMM cluster",
    values = cluster_colours,
    labels = paste("Cluster", 1:6),
    guide  = guide_legend(override.aes = list(alpha = 1, size = 0,
                                              linetype = "dashed", linewidth = 0.6))
  ) +
  labs(
    x     = sprintf("PC1 (%.1f%%)", var_exp[1] * 100),
    y     = sprintf("PC2 (%.1f%%)", var_exp[2] * 100),
    title = "GMM clusters (G = 6) vs. Statcast labels"
  ) +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), legend.position = "right")

GMM cluster assignments overlaid on PC space, coloured by Statcast label. Ellipses show 95% coverage regions. No cluster is dominated by a single pitch type, and adjacent clusters span label boundaries — consistent with a continuous movement spectrum rather than discrete categories.

uncertainty_df <- data.frame(
  uncertainty = fit_gmm$uncertainty,
  pitch_type  = breaking$pitch_type
)

ggplot(uncertainty_df, aes(x = uncertainty, fill = pitch_type, colour = pitch_type)) +
  geom_histogram(binwidth = 0.02, alpha = 0.6, position = "identity") +
  facet_wrap(~ pitch_type, ncol = 1,
             labeller = labeller(pitch_type = label_names)) +
  scale_fill_manual(values   = label_colours, guide = "none") +
  scale_colour_manual(values = label_colours, guide = "none") +
  labs(x = "Classification uncertainty (1 − max posterior probability)",
       y = "Count",
       title = "GMM posterior classification uncertainty by pitch type") +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), strip.background = element_blank())

Posterior classification uncertainty (1 − max component probability) by pitch type. The slider distribution is the most diffuse, consistent with its intermediate position on the movement continuum. Curveball and sweeper show lower uncertainty, consistent with their positions at opposing ends of the spectrum.

tab <- table(Cluster = fit_gmm$classification, Label = breaking$pitch_type)
kable(tab, caption = "GMM cluster (G = 6) × Statcast label contingency table.")

GMM cluster (G = 6) × Statcast label contingency table.
CU	SL	ST
42835	42313	10147
21401	68802	130131
586	183813	1653
124760	2725	2213
11065	137134	3173
1614	19351	874

ARI interpretation: ARI = 1 indicates perfect cluster–label correspondence; ARI = 0 indicates correspondence no better than chance. Values below ~0.2 suggest the data-driven partition bears little relationship to the CU/SL/ST taxonomy.

Stage 3: Bayesian Hierarchical Models

Two Bayesian logistic models with pitcher-level random intercepts estimate whether the Statcast pitch-type label is associated with whiff or chase probability above and beyond the five physical features. These associations should be interpreted with caution as outcome differences across pitch types are likely confounded by pitcher archetypes (pitchers who throw sweepers may differ systematically from those who throw curveballs), platoon effects, count leverage, and usage context. The models do not establish causal effects of the label itself.

ST (sweeper) is the reference category. beta_cu and beta_sl estimate the whiff (or chase) log-odds of curveball and slider relative to sweeper after adjusting for horizontal break, vertical break, velocity, spin rate, and spin axis. This stage addresses predictive validity: do the labels carry incremental associations with outcomes once physical features are adjusted for?

Pitcher random intercepts are centred at zero with precision tau_alpha estimated from the data. Year effects are residualised from all features upstream and are not included as a separate model term. Models were fit on a stratified random sample of approximately 50,000 pitches (~16,667 per pitch type) to ensure balanced pitch-type representation. Bayesian inference was conducted using JAGS via jagsUI (3 chains, 5,000 burn-in, 10,000 iterations, thinning = 2).

fit_whiff <- readRDS("C:/Users/fishd/OneDrive/Manuscripts/Statcast/WD/fit_whiff.rds")
fit_chase <- readRDS("C:/Users/fishd/OneDrive/Manuscripts/Statcast/WD/fit_chase.rds")

rhat_w <- unlist(fit_whiff$Rhat); rhat_w <- rhat_w[!is.na(rhat_w)]
rhat_c <- unlist(fit_chase$Rhat); rhat_c <- rhat_c[!is.na(rhat_c)]

cat(sprintf("Whiff — max R-hat: %.3f | nodes > 1.1: %d/%d | %s\n",
            max(rhat_w), sum(rhat_w > 1.1), length(rhat_w),
            if (sum(rhat_w > 1.1) == 0) "CONVERGED" else "NOT CONVERGED"))

## Whiff — max R-hat: 1.003 | nodes > 1.1: 0/10 | CONVERGED

cat(sprintf("Chase — max R-hat: %.3f | nodes > 1.1: %d/%d | %s\n",
            max(rhat_c), sum(rhat_c > 1.1), length(rhat_c),
            if (sum(rhat_c > 1.1) == 0) "CONVERGED" else "NOT CONVERGED"))

## Chase — max R-hat: 1.001 | nodes > 1.1: 0/10 | CONVERGED

target_params <- c("beta_pfx_x", "beta_pfx_z", "beta_velo", "beta_spin",
                   "beta_spin_axis", "beta_cu", "beta_sl", "beta_cu_vs_sl",
                   "tau_alpha")

param_labels <- c(
  beta_pfx_x     = "Horiz. break (β_pfx_x)",
  beta_pfx_z     = "Vert. break (β_pfx_z)",
  beta_velo      = "Velocity (β_velo)",
  beta_spin      = "Spin rate (β_spin)",
  beta_spin_axis = "Spin axis (β_spin_axis)",
  beta_cu        = "CU vs. ST (β_cu)",
  beta_sl        = "SL vs. ST (β_sl)",
  beta_cu_vs_sl  = "CU vs. SL (β_cu − β_sl, derived)",
  tau_alpha      = "Pitcher RE precision (τ_α)"
)

make_post_tbl <- function(fit) {
  fit$summary[target_params, c("mean", "sd", "2.5%", "97.5%", "Rhat"),
              drop = FALSE] |>
    as.data.frame() |>
    rownames_to_column("Parameter") |>
    mutate(
      Parameter = recode(Parameter, !!!param_labels),
      across(where(is.numeric), ~ round(., 3))
    )
}

kable(
  make_post_tbl(fit_whiff),
  col.names = c("Parameter", "Post. mean", "Post. SD", "2.5%", "97.5%", "R-hat"),
  caption   = paste(
    "Whiff model posterior summaries. ST is the reference category.",
    "β_cu and β_sl are the adjusted log-odds of whiff for curveball and slider",
    "relative to sweeper. Credible intervals excluding zero indicate a label",
    "association with whiff beyond the five movement features, though",
    "unmeasured confounders may contribute to these associations."
  )
)

Whiff model posterior summaries. ST is the reference category. β_cu and β_sl are the adjusted log-odds of whiff for curveball and slider relative to sweeper. Credible intervals excluding zero indicate a label association with whiff beyond the five movement features, though unmeasured confounders may contribute to these associations.
Parameter	Post. mean	Post. SD	2.5%	97.5%	R-hat
Horiz. break (β_pfx_x)	-0.331	0.024	-0.379	-0.284	1.001
Vert. break (β_pfx_z)	-0.309	0.027	-0.363	-0.256	1.000
Velocity (β_velo)	0.334	0.031	0.273	0.394	1.001
Spin rate (β_spin)	0.124	0.024	0.077	0.172	1.000
Spin axis (β_spin_axis)	0.104	0.023	0.059	0.148	1.000
CU vs. ST (β_cu)	-1.800	0.062	-1.921	-1.676	1.002
SL vs. ST (β_sl)	-1.770	0.047	-1.862	-1.681	1.000
CU vs. SL (β_cu − β_sl, derived)	-0.030	0.072	-0.172	0.111	1.003
Pitcher RE precision (τ_α)	1.183	0.088	1.019	1.363	1.002

kable(
  make_post_tbl(fit_chase),
  col.names = c("Parameter", "Post. mean", "Post. SD", "2.5%", "97.5%", "R-hat"),
  caption   = paste(
    "Chase model posterior summaries. ST is the reference category.",
    "β_cu and β_sl are the adjusted log-odds of chasing an out-of-zone pitch",
    "for curveball and slider relative to sweeper."
  )
)

Chase model posterior summaries. ST is the reference category. β_cu and β_sl are the adjusted log-odds of chasing an out-of-zone pitch for curveball and slider relative to sweeper.
Parameter	Post. mean	Post. SD	2.5%	97.5%	R-hat
Horiz. break (β_pfx_x)	-0.208	0.021	-0.250	-0.166	1.001
Vert. break (β_pfx_z)	-0.456	0.024	-0.503	-0.408	1.000
Velocity (β_velo)	0.468	0.028	0.415	0.523	1.000
Spin rate (β_spin)	0.100	0.022	0.057	0.142	1.000
Spin axis (β_spin_axis)	0.074	0.021	0.031	0.117	1.000
CU vs. ST (β_cu)	-1.492	0.053	-1.597	-1.389	1.001
SL vs. ST (β_sl)	-1.521	0.040	-1.601	-1.441	1.000
CU vs. SL (β_cu − β_sl, derived)	0.029	0.063	-0.095	0.153	1.000
Pitcher RE precision (τ_α)	1.565	0.114	1.350	1.799	1.001

compute_pred_probs <- function(fit) {
  samps <- do.call(rbind, lapply(fit$samples, as.data.frame))
  set.seed(42)
  alpha_draw <- rnorm(nrow(samps), 0, 1 / sqrt(samps$tau_alpha))
  data.frame(
    ST = plogis(alpha_draw),
    CU = plogis(alpha_draw + samps$beta_cu),
    SL = plogis(alpha_draw + samps$beta_sl)
  )
}

summarise_probs <- function(prob_df, outcome) {
  prob_df |>
    pivot_longer(everything(), names_to = "Pitch", values_to = "p") |>
    group_by(Pitch) |>
    summarise(
      Mean  = round(mean(p), 3),
      Lower = round(quantile(p, 0.025), 3),
      Upper = round(quantile(p, 0.975), 3),
      .groups = "drop"
    ) |>
    mutate(Outcome = outcome)
}

probs_whiff <- compute_pred_probs(fit_whiff)
probs_chase <- compute_pred_probs(fit_chase)

prob_tbl <- bind_rows(
  summarise_probs(probs_whiff, "Whiff"),
  summarise_probs(probs_chase, "Chase")
) |>
  mutate(Pitch = recode(Pitch, CU = "Curveball", SL = "Slider", ST = "Sweeper")) |>
  dplyr::select(Outcome, Pitch, Mean, Lower, Upper)

kable(
  prob_tbl,
  col.names = c("Outcome", "Pitch type", "Post. mean", "2.5%", "97.5%"),
  caption   = paste(
    "Posterior predicted probabilities for each pitch type at mean covariate",
    "values, marginalised over the pitcher random effect distribution.",
    "Wide credible intervals for the sweeper reflect heterogeneity among",
    "pitchers who throw the pitch."
  )
)

Posterior predicted probabilities for each pitch type at mean covariate values, marginalised over the pitcher random effect distribution. Wide credible intervals for the sweeper reflect heterogeneity among pitchers who throw the pitch.
Outcome	Pitch type	Post. mean	2.5%	97.5%
Whiff	Curveball	0.174	0.026	0.507
Whiff	Slider	0.178	0.026	0.513
Whiff	Sweeper	0.498	0.138	0.860
Chase	Curveball	0.209	0.044	0.524
Chase	Slider	0.205	0.043	0.512
Chase	Sweeper	0.498	0.170	0.828

prob_long <- bind_rows(
  probs_whiff |>
    pivot_longer(everything(), names_to = "Pitch", values_to = "p") |>
    mutate(Outcome = "Whiff"),
  probs_chase |>
    pivot_longer(everything(), names_to = "Pitch", values_to = "p") |>
    mutate(Outcome = "Chase")
) |>
  mutate(Pitch = recode(Pitch, CU = "Curveball", SL = "Slider", ST = "Sweeper"))

pitch_colours <- c(Curveball = "#0072B2", Slider = "#D55E00", Sweeper = "#009E73")

ggplot(prob_long, aes(x = p, fill = Pitch, colour = Pitch)) +
  geom_density(alpha = 0.35, linewidth = 0.8) +
  facet_wrap(~ Outcome, scales = "free_y") +
  scale_fill_manual(values   = pitch_colours) +
  scale_colour_manual(values = pitch_colours) +
  labs(x = "Predicted probability", y = "Posterior density",
       fill = NULL, colour = NULL,
       title = "Adjusted predicted probabilities by pitch type") +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), strip.background = element_blank(),
        legend.position = "bottom")

Posterior predicted probability distributions for each pitch type at mean covariate values. Curveball and slider distributions are nearly indistinguishable on both outcomes; the sweeper is associated with substantially higher whiff and chase rates, though these associations likely reflect pitcher archetypes and movement extremity rather than a categorically distinct pitch type.

extract_samples <- function(fit, param) {
  do.call(rbind, lapply(fit$samples, as.data.frame))[[param]]
}

make_contrast_df <- function(fit, outcome_label) {
  bind_rows(
    data.frame(value = extract_samples(fit, "beta_cu"),
               contrast = "CU vs. ST", outcome = outcome_label),
    data.frame(value = extract_samples(fit, "beta_sl"),
               contrast = "SL vs. ST", outcome = outcome_label),
    data.frame(value = extract_samples(fit, "beta_cu_vs_sl"),
               contrast = "CU vs. SL (derived)", outcome = outcome_label)
  )
}

contrast_df <- bind_rows(
  make_contrast_df(fit_whiff, "Whiff"),
  make_contrast_df(fit_chase, "Chase")
)

contrast_colours <- c("CU vs. ST"           = "#0072B2",
                      "SL vs. ST"           = "#D55E00",
                      "CU vs. SL (derived)" = "#CC79A7")

ci_df <- contrast_df |>
  group_by(contrast, outcome) |>
  summarise(lo = quantile(value, 0.025), hi = quantile(value, 0.975),
            .groups = "drop")

ggplot(contrast_df, aes(value, fill = contrast, colour = contrast)) +
  geom_density(alpha = 0.3, linewidth = 0.8) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey30") +
  geom_segment(
    data = ci_df,
    aes(x = lo, xend = hi, y = -Inf, yend = -Inf, colour = contrast),
    linewidth = 2, lineend = "round", inherit.aes = FALSE,
    show.legend = FALSE
  ) +
  facet_grid(outcome ~ contrast, scales = "free_y") +
  scale_fill_manual(values   = contrast_colours) +
  scale_colour_manual(values = contrast_colours) +
  labs(x = expression(beta ~ "(log-odds)"), y = "Posterior density",
       title = "Posterior label associations with pitch outcomes",
       fill = NULL, colour = NULL) +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), legend.position = "none",
        strip.background = element_blank())

Posterior distributions for the three pairwise label contrasts. ST is the reference category. The dashed line marks zero. Curveball and slider are indistinguishable from each other on both outcomes; both differ from the sweeper, though these associations are likely confounded by pitcher archetypes and movement extremity.

Summary

get_ci <- function(fit, param) round(fit$summary[param, c("2.5%", "97.5%")], 3)
ci_includes_zero <- function(ci) ci[1] < 0 & ci[2] > 0

w_cu <- get_ci(fit_whiff, "beta_cu")
w_sl <- get_ci(fit_whiff, "beta_sl")
w_cv <- get_ci(fit_whiff, "beta_cu_vs_sl")
c_cu <- get_ci(fit_chase, "beta_cu")
c_sl <- get_ci(fit_chase, "beta_sl")
c_cv <- get_ci(fit_chase, "beta_cu_vs_sl")

summary_df <- data.frame(
  Stage = c("PCA", "LDA continuum test", "GMM",
            "Whiff — CU vs. ST", "Whiff — SL vs. ST", "Whiff — CU vs. SL",
            "Chase — CU vs. ST", "Chase — SL vs. ST", "Chase — CU vs. SL"),
  Finding = c(
    sprintf("PC1 = %.1f%%; PC1+PC2 = %.1f%%", var_exp[1]*100, cum_var[2]*100),
    sprintf("Overall LDA accuracy = %.1f%%", lda_acc),
    sprintf("G = 6 (VVV, elbow); ARI = %.3f", ari),
    sprintf("95%% CI [%.3f, %.3f]%s", w_cu[1], w_cu[2],
            if (ci_includes_zero(w_cu)) " — includes zero" else " — excludes zero"),
    sprintf("95%% CI [%.3f, %.3f]%s", w_sl[1], w_sl[2],
            if (ci_includes_zero(w_sl)) " — includes zero" else " — excludes zero"),
    sprintf("95%% CI [%.3f, %.3f]%s", w_cv[1], w_cv[2],
            if (ci_includes_zero(w_cv)) " — includes zero" else " — excludes zero"),
    sprintf("95%% CI [%.3f, %.3f]%s", c_cu[1], c_cu[2],
            if (ci_includes_zero(c_cu)) " — includes zero" else " — excludes zero"),
    sprintf("95%% CI [%.3f, %.3f]%s", c_sl[1], c_sl[2],
            if (ci_includes_zero(c_sl)) " — includes zero" else " — excludes zero"),
    sprintf("95%% CI [%.3f, %.3f]%s", c_cv[1], c_cv[2],
            if (ci_includes_zero(c_cv)) " — includes zero" else " — excludes zero")
  ),
  Interpretation = c(
    if (cum_var[2] > 0.85) "Most variance in two dimensions — low-dimensional continuum"
    else "Variance spread across PCs — higher-dimensional movement space",
    if (lda_acc < 70) "Poor recovery — movement space does not support three discrete categories"
    else if (lda_acc < 85) "Moderate recovery — partial but imperfect label separability"
    else "Strong recovery — labels are separable in movement space",
    if (ari < 0.2) "Low ARI — unsupervised clusters do not recover CU/SL/ST"
    else if (ari < 0.5) "Moderate ARI — partial label–cluster alignment"
    else "High ARI — clusters align well with labels",
    if (ci_includes_zero(w_cu)) "No association beyond movement" else "Association beyond movement",
    if (ci_includes_zero(w_sl)) "No association beyond movement" else "Association beyond movement",
    if (ci_includes_zero(w_cv)) "CU and SL indistinguishable on whiff" else "CU and SL distinguishable on whiff",
    if (ci_includes_zero(c_cu)) "No association beyond movement" else "Association beyond movement",
    if (ci_includes_zero(c_sl)) "No association beyond movement" else "Association beyond movement",
    if (ci_includes_zero(c_cv)) "CU and SL indistinguishable on chase" else "CU and SL distinguishable on chase"
  )
)

kable(summary_df, caption = "Summary of findings across all three analysis stages.")

Summary of findings across all three analysis stages.
Stage	Finding	Interpretation
PCA	PC1 = 50.8%; PC1+PC2 = 71.0%	Variance spread across PCs — higher-dimensional movement space
LDA continuum test	Overall LDA accuracy = 84.0%	Moderate recovery — partial but imperfect label separability
GMM	G = 6 (VVV, elbow); ARI = 0.271	Moderate ARI — partial label–cluster alignment
Whiff — CU vs. ST	95% CI [-1.921, -1.676] — excludes zero	Association beyond movement
Whiff — SL vs. ST	95% CI [-1.862, -1.681] — excludes zero	Association beyond movement
Whiff — CU vs. SL	95% CI [-0.172, 0.111] — includes zero	CU and SL indistinguishable on whiff
Chase — CU vs. ST	95% CI [-1.597, -1.389] — excludes zero	Association beyond movement
Chase — SL vs. ST	95% CI [-1.601, -1.441] — excludes zero	Association beyond movement
Chase — CU vs. SL	95% CI [-0.095, 0.153] — includes zero	CU and SL indistinguishable on chase

Take-home: The evidence across all three stages is consistent with breaking-ball taxonomy existing on a continuous movement spectrum rather than comprising three discrete pitch types. PCA reveals a single dominant movement gradient along which curveball, slider, and sweeper grade into one another; LDA shows that the five movement features do not cleanly recover the three labels; and GMM selects six components rather than three, with only moderate alignment between data-driven clusters and Statcast labels. The slider occupies the intermediate space on this spectrum — what pitchers and coaches have historically called the slurve — and shows the highest classification uncertainty of the three types.

The Bayesian models show that curveball and slider are associated with lower whiff and chase rates than sweeper after adjusting for movement features, but the curveball/slider distinction itself carries no additional association with either outcome. These associations should not be interpreted causally: sweepers are thrown by a distinct population of pitchers in distinct contexts, and the label-outcome associations likely reflect those differences rather than an independent effect of the label. The wide credible intervals on sweeper predicted probabilities are consistent with substantial pitcher-level heterogeneity.

The emergence of the sweeper as a formal Statcast category may represent refinement of an already continuous breaking-ball movement spectrum rather than the appearance of an entirely novel pitch type. The modern pitch taxonomy imposes linguistic categories — curveball, slider, sweeper — on aerodynamic variation that is better characterised as continuous, with the sweeper occupying one extreme of a horizontal break gradient that grades without a clear natural boundary into the slider, and the slider grading in turn into the curveball. Analytical frameworks that treat these labels as discrete inputs should account for the substantial movement-space overlap among adjacent categories.