Coors Field Analysis

0.1 Introduction

Coors Field is famous for its hitter-friendly environment, but the true question remains: do hitters, especially Rockies hitters, capitalize on this advantage effectively? This project investigates how batted ball outcomes at Coors differ from expectations and identifies approaches to optimize offensive output in Denver.

0.2 Data Preparation

library(tidyverse)
library(tidymodels)
library(lubridate)
library(gt)
library(nnet)
library(workflows)   
library(tibble)

Load in Statcast Data

statcast_22 <- read_csv('/Users/williamgifford/Statcast Data/statcast_pitch_data22.csv')
statcast_23 <- read_csv('/Users/williamgifford/Statcast Data/statcast_pitch_data23.csv')
statcast_24 <- read_csv('/Users/williamgifford/Statcast Data/statcast_pitch_data24.csv')

batted_balls <- bind_rows(statcast_22, statcast_23, statcast_24) %>%
  filter(type == "X", !is.na(bb_type)) %>%
  mutate(
    is_coors = home_team == "COL"
  )

We begin by combining Statcast data from the 2023 and 2024 seasons, filtering for valid batted ball events. A new variable, is_coors, flags events that occurred at Coors Field. This binary flag allows us to compare outcomes at Coors to other MLB parks.

0.3 Outcome Distributions

batted_balls %>%
  ggplot(aes(x = launch_speed, fill = is_coors)) +
  geom_density(alpha = 0.5) +
  labs(title = "Exit Velocity Distribution", fill = "Coors")

Warning: Removed 3330 rows containing non-finite outside the scale range
(`stat_density()`).

bb_type_dist <- batted_balls %>%
  group_by(is_coors, bb_type) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(is_coors) %>%
  mutate(pct = n / sum(n))

This chart compares the distribution of how hard a ball are hit in game, we see a slight shift to the right indicating that the middle of the data shows a minute difference in the exit velocity across all batted balls but nothing too crazy.

batted_balls_filtered <- batted_balls %>%
  filter(
    !(
      (home_team == "COL" & inning_topbot == "Bot") |
      (away_team == "COL" & inning_topbot == "Top")
    )
  )

Remove Rockies Hitters to control for player skill and swing profile

bb_type_dist_filtered <- batted_balls_filtered %>%
  group_by(is_coors, bb_type) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(is_coors) %>%
  mutate(pct = n / sum(n))

ggplot(bb_type_dist_filtered, aes(x = bb_type, y = pct, fill = is_coors)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(values = c("TRUE" = "#7570b3", "FALSE" = "#1b9e77"),
                    labels = c("Other Parks", "Coors Field")) +
  labs(
    title = "Distribution of Batted Ball Types (Non-Rockies Hitters Only)",
    x = "Batted Ball Type",
    y = "Percentage",
    fill = "Location"
  ) +
  theme_minimal()

This chart compares the distribution of batted ball types (groundballs, flyballs, line drives, popups) between Coors Field and all other stadiums. A notable shift toward more line drives or flyballs at Coors may hint at differences in hitter approach or environmental effects on batted ball trajectories.

batted_balls_filtered_cleaned <- batted_balls_filtered %>%
  mutate(
    outcome_group = case_when(
      events == "single" ~ "Single",
      events == "double" ~ "Double",
      events == "triple" ~ "Triple",
      events == "home_run" ~ "Home Run",
      TRUE ~ "Out"
    )
  )

event_summary_cleaned <- batted_balls_filtered_cleaned %>%
  group_by(is_coors, outcome_group) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(is_coors) %>%
  mutate(pct = n / sum(n))

ggplot(event_summary_cleaned, aes(x = outcome_group, y = pct, fill = is_coors)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(
    values = c("TRUE" = "#1b9e77", "FALSE" = "#7570b3"),
    labels = c("Other Parks", "Coors Field")
  ) +
  labs(
    title = "Batted Ball Outcomes (Grouped, Non-Rockies Hitters)",
    x = "Outcome",
    y = "Proportion",
    fill = "Location"
  ) +
  theme_minimal()

woba_by_bbtype <- batted_balls_filtered %>%
  filter(!is.na(woba_value), !is.na(bb_type)) %>%
  group_by(is_coors, bb_type) %>%
  summarise(mean_woba = mean(woba_value), .groups = "drop")

ggplot(woba_by_bbtype, aes(x = bb_type, y = mean_woba, fill = is_coors)) +
  geom_col(position = "dodge") +
  scale_fill_manual(
    values = c("TRUE" = "#1b9e77", "FALSE" = "#7570b3"),
    labels = c("Other Parks", "Coors Field")
  ) +
  labs(
    title = "Average wOBA by Batted Ball Type (Non-Rockies Hitters)",
    x = "Batted Ball Type",
    y = "Average wOBA",
    fill = "Location"
  ) +
  theme_minimal()

bb_type_table <- batted_balls_filtered %>%
  count(is_coors, bb_type) %>%
  group_by(is_coors) %>%
  mutate(rate = n / sum(n)) %>%
  select(is_coors, bb_type, rate) %>%
  pivot_wider(names_from = is_coors, values_from = rate, names_prefix = "Coors_") %>%
  mutate(
    Difference = Coors_TRUE - Coors_FALSE,
    Category = "Batted Ball Type"
  ) %>%
  rename(
    Type = bb_type,
    `Other Parks` = Coors_FALSE,
    `Coors Field` = Coors_TRUE
  )

event_table <- batted_balls_filtered %>%
  mutate(
    outcome_group = case_when(
      events == "single" ~ "Single",
      events == "double" ~ "Double",
      events == "triple" ~ "Triple",
      events == "home_run" ~ "Home Run",
      TRUE ~ "Out"
    )
  ) %>%
  count(is_coors, outcome_group) %>%
  group_by(is_coors) %>%
  mutate(rate = n / sum(n)) %>%
  select(is_coors, outcome_group, rate) %>%
  pivot_wider(names_from = is_coors, values_from = rate, names_prefix = "Coors_") %>%
  mutate(
    Difference = Coors_TRUE - Coors_FALSE,
    Category = "Batted Ball Outcome"
  ) %>%
  rename(
    Type = outcome_group,
    `Other Parks` = Coors_FALSE,
    `Coors Field` = Coors_TRUE
  )

woba_table <- batted_balls_filtered %>%
  filter(!is.na(woba_value), !is.na(bb_type)) %>%
  group_by(is_coors, bb_type) %>%
  summarise(mean_woba = mean(woba_value), .groups = "drop") %>%
  pivot_wider(names_from = is_coors, values_from = mean_woba, names_prefix = "Coors_") %>%
  mutate(
    Difference = Coors_TRUE - Coors_FALSE,
    Category = "Average wOBA by BB Type"
  ) %>%
  rename(
    Type = bb_type,
    `Other Parks` = Coors_FALSE,
    `Coors Field` = Coors_TRUE
  )

combined_table <- bind_rows(bb_type_table, event_table, woba_table)

combined_table %>%
  arrange(Category, desc(Difference)) %>%
  gt() %>%
  fmt_percent(columns = c(`Other Parks`, `Coors Field`, Difference),
              decimals = 1) %>%
  tab_header(
    title = "Coors Field vs. Other Parks: Distribution and wOBA Differences (Non-Rockies Hitters)"
  ) %>%
  tab_spanner(label = "Rate or wOBA", columns = c(`Other Parks`, `Coors Field`, Difference)) %>%
  cols_label(
    Type = "Type / Outcome",
    `Other Parks` = "Other Parks",
    `Coors Field` = "Coors Field",
    Difference = "Δ (Coors – Other)",
    Category = "Group"
  )

Coors Field vs. Other Parks: Distribution and wOBA Differences (Non-Rockies Hitters)
Type / Outcome	Rate or wOBA			Group
Type / Outcome	Other Parks	Coors Field	Δ (Coors – Other)	Group
fly_ball	42.7%	49.9%	7.2%	Average wOBA by BB Type
line_drive	65.5%	68.4%	2.9%	Average wOBA by BB Type
ground_ball	25.2%	25.2%	−0.0%	Average wOBA by BB Type
popup	1.8%	1.2%	−0.5%	Average wOBA by BB Type
Single	20.8%	22.9%	2.1%	Batted Ball Outcome
Double	6.4%	7.1%	0.7%	Batted Ball Outcome
Triple	0.5%	1.2%	0.6%	Batted Ball Outcome
Home Run	4.5%	4.6%	0.1%	Batted Ball Outcome
Out	67.8%	64.2%	−3.6%	Batted Ball Outcome
line_drive	23.7%	25.9%	2.2%	Batted Ball Type
ground_ball	43.0%	44.0%	0.9%	Batted Ball Type
fly_ball	26.2%	25.6%	−0.6%	Batted Ball Type
popup	7.1%	4.5%	−2.6%	Batted Ball Type

woba_xwoba_summary <- batted_balls_filtered %>%
  filter(!is.na(woba_value), !is.na(estimated_woba_using_speedangle)) %>%
  group_by(is_coors) %>%
  summarise(
    mean_woba = mean(woba_value),
    mean_xwoba = mean(estimated_woba_using_speedangle),
    difference = mean(woba_value) - mean(estimated_woba_using_speedangle),
    .groups = "drop"
  )

residual_data <- batted_balls_filtered %>%
  filter(!is.na(woba_value), !is.na(estimated_woba_using_speedangle)) %>%
  mutate(
    woba_residual = woba_value - estimated_woba_using_speedangle,
    location = if_else(is_coors, "Coors Field", "Other Parks")
  )

ggplot(residual_data, aes(y = location, x = woba_residual, fill = location)) +
  geom_boxplot(alpha = 0.7, outlier.size = 0.5) +
  scale_fill_manual(values = c("Coors Field" = "#1b9e77", "Other Parks" = "#7570b3")) +
  labs(
    title = "Residuals: wOBA – xwOBA by Location (Non-Rockies Hitters)",
    x = "Residual (wOBA – xwOBA)",
    y = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray30")

woba_xwoba_summary %>%
  mutate(
    Location = if_else(is_coors, "Coors Field", "Other Parks")
  ) %>%
  select(Location, mean_woba, mean_xwoba, difference) %>%
  gt() %>%
  fmt_number(columns = where(is.numeric), decimals = 3) %>%
  cols_label(
    mean_woba = "Actual wOBA",
    mean_xwoba = "Expected xwOBA",
    difference = "Difference (wOBA – xwOBA)"
  ) %>%
  tab_header(
    title = "Actual vs Expected wOBA by Location (Non-Rockies Hitters)"
  )

Actual vs Expected wOBA by Location (Non-Rockies Hitters)
Location	Actual wOBA	Expected xwOBA	Difference (wOBA – xwOBA)
Other Parks	0.377	0.367	0.010
Coors Field	0.418	0.379	0.039

0.4 Multinomial LR Batted Ball Outcomes

model_data <- batted_balls_filtered %>%
  filter(
    events %in% c("single", "double", "triple", "home_run", 
                  "field_out", "fly_out", "ground_out", "lineout", "pop_out"),
    !is.na(launch_speed), !is.na(launch_angle)
  ) %>%
  mutate(
    outcome_group = case_when(
      events == "single" ~ "Single",
      events == "double" ~ "Double",
      events == "triple" ~ "Triple",
      events == "home_run" ~ "Home_Run",
      TRUE ~ "Out"
    ),
    outcome_group = factor(outcome_group, levels = c("Out", "Single", "Double", "Triple", "Home_Run"))
  )

Here we prepare the data for modeling by grouping event outcomes into a manageable set of categories. This binning reflects not just baseball conventions but also the value implications for wOBA and overall offensive impact.

model_data <- model_data %>%
  select(outcome_group, launch_speed, launch_angle, is_coors)

We focus on just the relevant predictor variables—launch characteristics and park indicator—for a clean modeling input.

set.seed(42)
library(tidymodels)

data_split <- initial_split(model_data, prop = 0.8, strata = outcome_group)
train_data <- training(data_split)
test_data  <- testing(data_split)

multi_spec <- multinom_reg(penalty = NULL) %>%
  set_engine("nnet") %>%
  set_mode("classification")

multi_fit_direct <- fit(
  multi_spec,
  outcome_group ~ launch_speed + launch_angle + is_coors,
  data = train_data
)

multi_fit_direct %>%
  tidy()

# A tibble: 16 × 6
   y.level  term          estimate std.error statistic   p.value
   <chr>    <chr>            <dbl>     <dbl>     <dbl>     <dbl>
 1 Single   (Intercept)   -3.04     0.0328      -92.5  0        
 2 Single   launch_speed   0.0244   0.000368     66.3  0        
 3 Single   launch_angle  -0.0147   0.000181    -81.4  0        
 4 Single   is_coorsTRUE   0.101    0.0342        2.95 3.20e-  3
 5 Double   (Intercept)  -11.2      0.0937     -119.   0        
 6 Double   launch_speed   0.0954   0.000945    101.   0        
 7 Double   launch_angle   0.00386  0.000378     10.2  1.74e- 24
 8 Double   is_coorsTRUE   0.161    0.0545        2.96 3.05e-  3
 9 Triple   (Intercept)  -15.2      0.339       -44.8  0        
10 Triple   launch_speed   0.110    0.00335      32.8  5.23e-236
11 Triple   launch_angle   0.0109   0.00131       8.36 6.44e- 17
12 Triple   is_coorsTRUE   0.833    0.128         6.51 7.67e- 11
13 Home_Run (Intercept)  -38.3      0.276      -139.   0        
14 Home_Run launch_speed   0.341    0.00252     135.   0        
15 Home_Run launch_angle   0.0805   0.000821     98.1  0        
16 Home_Run is_coorsTRUE   0.179    0.0775        2.31 2.06e-  2

Here we view the model coefficients. These represent the change in log-odds for each outcome class relative to the baseline (in this case, an out), associated with one-unit increases in the predictors. So right away we see some pretty heavy indications that Coors field has a meaningful imact on the results of batted balls. With all of the hit types attached to Coors field exhibiting strong upward change in log-odds.

test_preds <- predict(multi_fit_direct, new_data = test_data, type = "class")

test_results <- test_data %>%
  select(outcome_group) %>%
  bind_cols(test_preds)

test_probs <- predict(multi_fit_direct, new_data = test_data, type = "prob") %>%
  bind_cols(test_data)

test_probs <- predict(multi_fit_direct, new_data = test_data, type = "prob") %>%
  bind_cols(test_data %>% select(outcome_group))

calib_long_eq <- test_probs %>%
  pivot_longer(
    cols = starts_with(".pred_"),
    names_to = "predicted_class",
    values_to = "pred_prob"
  ) %>%
  mutate(
    predicted_class = str_remove(predicted_class, ".pred_"),
    is_true_class = if_else(outcome_group == predicted_class, 1, 0)
  ) %>%
  group_by(predicted_class) %>%
  mutate(prob_bin = ntile(pred_prob, 5)) %>%
  ungroup()

calib_curve_eq <- calib_long_eq %>%
  group_by(predicted_class, prob_bin) %>%
  summarise(
    avg_pred = mean(pred_prob),
    obs_rate = mean(is_true_class),
    count = n(),
    .groups = "drop"
  )

ggplot(calib_curve_eq, aes(x = avg_pred, y = obs_rate)) +
  geom_line(color = "firebrick", linewidth = 1.1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray40") +
  facet_wrap(~ predicted_class) +
  labs(
    title = "Calibration Curves by Outcome (Equal-Frequency Bins)",
    x = "Predicted Probability (Bin Avg)",
    y = "Observed Frequency"
  ) +
  theme_minimal()

This plot shows calibration curves for each predicted batted ball outcome, comparing the model’s predicted probabilities to the actual observed frequencies across equal-frequency bins. A well-calibrated model would follow the dashed 45° line. The “Out” outcome is fairly well calibrated, while rarer events like “Triple” and “Home Run” show less stability due to limited high-confidence predictions. These curves help us assess how trustworthy the model’s probability estimates are, especially important when using them to compute expected values like xwOBA. Given a confusion matrix in this context would illustrate the model predicting very well, as outs still make up 70% of the data set, everything would return either an out or a homerun.

sim_grid <- expand_grid(
  launch_speed = seq(40, 125, by = 1),
  launch_angle = seq(-100, 100, by = 1),
  is_coors = c(TRUE, FALSE)
)

grid_preds <- predict(multi_fit_direct, new_data = sim_grid, type = "prob") %>%
  bind_cols(sim_grid)

plot_data <- grid_preds %>%
  select(launch_speed, launch_angle, is_coors, .pred_Home_Run, .pred_Single, .pred_Out,
         .pred_Double, .pred_Triple) %>%
  mutate(location = if_else(is_coors, "Coors Field", "Other Parks"))

prob_diffs <- plot_data %>%
  select(launch_speed, launch_angle, is_coors, starts_with(".pred_")) %>%
  pivot_wider(
    names_from = is_coors,
    values_from = starts_with(".pred_"),
    names_sep = "_"
  )

prob_diffs <- prob_diffs %>%
  mutate(
    delta_single = .pred_Single_TRUE - .pred_Single_FALSE,
    delta_double = .pred_Double_TRUE - .pred_Double_FALSE,
    delta_triple =.pred_Triple_TRUE - .pred_Triple_FALSE,
    delta_hr = .pred_Home_Run_TRUE - .pred_Home_Run_FALSE,
    delta_out = .pred_Out_TRUE - .pred_Out_FALSE
  )

This grid setup allows us to view the different probabilities of all of the hit types across all launch angles and exit velocities. This is best displayed in the form of heatmap where you can see bands or clusters of batted ball types that are more likely to be that particular hit:

ggplot(prob_diffs, aes(x = launch_speed, y = launch_angle, fill = delta_triple)) +
  geom_tile() +
  scale_fill_gradient2(
    low = "blue", mid = "white", high = "darkred", midpoint = 0,
    limits = c(-0.05, 0.05)
  ) +
  labs(
    title = "Coors Field Boost: Δ Probability of Triple",
    x = "Exit Velocity (mph)",
    y = "Launch Angle (°)",
    fill = "Δ P(Triple)"
  ) +
  theme_minimal()

The red areas represent combinations where triples are more likely at Coors, particularly at mid-to-high exit velocities and modest launch angles. This visualization helps isolate the batted ball profiles most affected by Coors’ expansive outfield, showing where the environment most alters expected outcomes.

ggplot(prob_diffs, aes(x = launch_speed, y = launch_angle, fill = delta_double)) +
  geom_tile() +
  scale_fill_gradient2(
    low = "blue", mid = "white", high = "darkred", midpoint = 0,
    limits = c(-0.05, 0.05)
  ) +
  labs(
    title = "Coors Field Boost: Δ Probability of Double",
    x = "Exit Velocity (mph)",
    y = "Launch Angle (°)",
    fill = "Δ P(Double)"
  ) +
  theme_minimal()

Similar story to the triples although perhaps a little more modest of a difference across the field.

ggplot(prob_diffs, aes(x = launch_speed, y = launch_angle, fill = delta_single)) +
  geom_tile() +
  scale_fill_gradient2(
    low = "blue", mid = "white", high = "darkred", midpoint = 0,
    limits = c(-0.05, 0.05)
  ) +
  labs(
    title = "Coors Field Boost: Δ Probability of Single",
    x = "Exit Velocity (mph)",
    y = "Launch Angle (°)",
    fill = "Δ P(Single)"
  ) +
  theme_minimal()

This is interesting given how much more a single is likely for essentially every batted ball type under the home run threshold.

ggplot(prob_diffs, aes(x = launch_speed, y = launch_angle, fill = delta_hr)) +
  geom_tile() +
  scale_fill_gradient2(
    low = "blue", mid = "white", high = "darkred", midpoint = 0,
    limits = c(-0.05, 0.05)
  ) +
  labs(
    title = "Coors Field Boost: Δ Probability of Home Run",
    x = "Exit Velocity (mph)",
    y = "Launch Angle (°)",
    fill = "Δ P(Home Run)"
  ) +
  theme_minimal()

ggplot(prob_diffs, aes(x = launch_speed, y = launch_angle, fill = delta_out)) +
  geom_tile() +
  scale_fill_gradient2(
    low = "blue", mid = "white", high = "darkred", midpoint = 0,
    limits = c(-0.05, 0.05)
  ) +
  labs(
    title = "Coors Field Boost: Δ Probability of Out",
    x = "Exit Velocity (mph)",
    y = "Launch Angle (°)",
    fill = "Δ P(Out)"
  ) +
  theme_minimal()

prob_diffs <- prob_diffs %>%
  mutate(
    coors_xwobacon = 
      .pred_Single_TRUE * 0.882 +
      .pred_Double_TRUE * 1.254 +
      .pred_Triple_TRUE * 1.590 +
      .pred_Home_Run_TRUE * 2.050
  )

ggplot(prob_diffs, aes(x = launch_speed, y = launch_angle, fill = coors_xwobacon)) +
  geom_tile() +
  scale_fill_viridis_c(option = "plasma", name = "xwOBAcon (Coors)") +
  labs(
    title = "Expected wOBA on Contact at Coors Field",
    x = "Exit Velocity (mph)",
    y = "Launch Angle (°)"
  ) +
  theme_minimal()

league_grid <- sim_grid %>%
  filter(is_coors == FALSE)

league_preds <- predict(multi_fit_direct, new_data = league_grid, type = "prob") %>%
  bind_cols(league_grid)

league_preds <- league_preds %>%
  mutate(
    league_xwobacon =
      .pred_Single * 0.882 +
      .pred_Double * 1.254 +
      .pred_Triple * 1.590 +
      .pred_Home_Run * 2.050
  )

xwoba_surface <- prob_diffs %>%
  select(launch_speed, launch_angle, coors_xwobacon) %>%
  left_join(
    league_preds %>%
      select(launch_speed, launch_angle, league_xwobacon),
    by = c("launch_speed", "launch_angle")
  ) %>%
  mutate(delta_xwoba = coors_xwobacon - league_xwobacon)

ggplot(xwoba_surface, aes(x = launch_speed, y = launch_angle, fill = delta_xwoba)) +
  geom_tile() +
  scale_fill_gradient2(
    low = "blue", mid = "white", high = "darkred", midpoint = 0,
    name = "Δ xwOBAcon (Coors - League)"
  ) +
  labs(
    title = "Coors Field Boost: Difference in Expected wOBA on Contact",
    x = "Exit Velocity (mph)",
    y = "Launch Angle (°)"
  ) +
  theme_minimal()

This is ultimately the story of the model, the thick band hovering around the “barrel” zone in which the xwoba is almost a full tenth higher in that zone compared to the mean of other parks. Lets now take a look at who would benefit most from this “Coors Barrell Boost” testing against real players and their batted ball metrics with this weighting.

batter_data <- batted_balls_filtered %>%
  filter(!is.na(launch_speed), !is.na(launch_angle)) %>%
  select(batter, des, launch_speed, launch_angle, is_coors)

library(stringr)

batter_data <- batter_data %>%
  mutate(
    player_name = str_extract(des, "^[^\\s]+\\s[^\\s]+")  
  ) %>%
    filter(!player_name %in% c("Umpire reviewed", "Umpire review", "Crew Chief"))

batter_data <- batter_data %>%
  mutate(
    launch_speed = round(launch_speed),
    launch_angle = round(launch_angle)
  )

batter_eval <- batter_data %>%
  left_join(
    prob_diffs %>%
      select(launch_speed, launch_angle, coors_xwobacon),
    by = c("launch_speed", "launch_angle")
  ) %>%
  filter(!is.na(coors_xwobacon))

hitter_eval <- batter_eval %>%
  group_by(player_name) %>%
  summarise(
    n_batted_balls = n(),
    avg_coors_xwobacon = mean(coors_xwobacon),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_coors_xwobacon))

hitter_eval %>%
  filter(n_batted_balls >= 200) %>% 
  slice_max(avg_coors_xwobacon, n = 25) %>%
  ggplot(aes(x = reorder(player_name, avg_coors_xwobacon), y = avg_coors_xwobacon)) +
  geom_col(fill = "firebrick") +
  coord_flip() +
  labs(
    title = "Top Hitters by Coors-Optimized xwOBAcon",
    x = NULL,
    y = "Average Coors xwOBAcon"
  ) +
  theme_minimal()

All familiar names at the top of the Coors xWobacon leader board, up and down the list, people who consistently hit the ball hard and in the air, this does not answer the question of who has the most to gain.

all_pa_data <- bind_rows(statcast_22, statcast_23, statcast_24) %>%
  filter(!is.na(events)) %>%
  mutate(
    is_coors = home_team == "COL",
    launch_speed = round(launch_speed),
    launch_angle = round(launch_angle)
  )

# Join Coors xwOBAcon where launch data exists
pa_with_xwoba <- all_pa_data %>%
  left_join(
    prob_diffs %>%
      select(launch_speed, launch_angle, coors_xwobacon),
    by = c("launch_speed", "launch_angle")
  ) %>%
  mutate(
    xwoba_event = case_when(
      events == "walk" ~ 0.689,
      events == "hit_by_pitch" ~ 0.720,
      !is.na(coors_xwobacon) ~ coors_xwobacon,
      TRUE ~ 0  # strikeouts, outs, etc.
    )
  )

cleaned_xwoba <- pa_with_xwoba %>%
  filter(!(xwoba_event == 0 & events %in% c("single", "double", "triple", "home_run")))

player_xwoba <- cleaned_xwoba %>%
  mutate(player_name = str_extract(des, "^[^\\s]+\\s[^\\s]+")) %>%
  filter(!player_name %in% c("Umpire reviewed", "Umpire review", "Crew Chief")) %>%
  group_by(player_name) %>%
  summarise(
    n_pa = n(),
    coors_xwoba = mean(xwoba_event, na.rm = TRUE),
    true_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
    woba = mean(woba_value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(coors_xwoba))

player_xwoba %>%
  mutate(delta = coors_xwoba - true_xwoba) %>%
  filter(n_pa >= 1000) %>%
  ggplot(aes(x = true_xwoba, y = coors_xwoba)) +
  geom_point(aes(size = n_pa, color = delta), alpha = 0.8) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  scale_color_gradient2(low = "blue", high = "darkred", mid = "gray80", midpoint = 0) +
  labs(
    title = "Coors vs League xwOBA per Player",
    x = "League xwOBA",
    y = "Coors xwOBA",
    color = "Coors – League",
    size = "PA"
  ) +
  theme_minimal()

Its clear that most all of the players in the major leagues would benefit from playing all of their games in Colorado

player_xwoba %>%
  filter(n_pa >= 1000) %>%
  mutate(coors_boost = coors_xwoba - true_xwoba) %>%
  arrange(desc(coors_boost)) %>%
  slice_head(n = 20) %>%
  ggplot(aes(x = reorder(player_name, coors_boost), y = coors_boost)) +
  geom_col(fill = "firebrick") +
  coord_flip() +
  labs(
    title = "Top 20 Coors-Boosted Hitters by xwOBA Differential",
    subtitle = "2022-2024 Aggregated: Min 1000PA",
    x = "Player",
    y = "Coors xwOBA – Standard xwOBA"
  ) +
  theme_minimal()

Here are the players who gain the most value from Coors Field based on the difference between their predicted wOBA at Coors versus the league average environment. These are mostly right-handed or power-first hitters whose batted ball profiles—particularly high exit velocities and certain launch angles—take better advantage of Coors’ spacious outfield and thin air. Notably, big names like Giancarlo Stanton and Vladimir Guerrero Jr. top the list, reinforcing the idea that certain elite hitters’ production is even more amplified in Denver.

# Calculate the difference and slice both tails
top_vs_bottom <- player_xwoba %>%
  filter(n_pa >= 1000) %>%
  mutate(woba_vs_coors_x = woba - coors_xwoba) %>%
  arrange(woba_vs_coors_x) %>%
  slice_head(n = 10) %>%
  bind_rows(
    player_xwoba %>%
      filter(n_pa >= 1000) %>%
      mutate(woba_vs_coors_x = woba - coors_xwoba) %>%
      arrange(desc(woba_vs_coors_x)) %>%
      slice_head(n = 10)
  ) %>%
  mutate(
    color_group = if_else(woba_vs_coors_x > 0, "Overperformed", "Underperformed")
  )

# Plot with color distinction
ggplot(top_vs_bottom, aes(x = reorder(player_name, woba_vs_coors_x), y = woba_vs_coors_x, fill = color_group)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("Overperformed" = "firebrick", "Underperformed" = "steelblue")) +
  labs(
    title = "Biggest Over/Underperformers vs Coors xwOBA",
    subtitle = "2022–2024 Aggregated: Min. 1000 PA",
    x = "Player",
    y = "Actual wOBA – Coors xwOBA",
    fill = NULL
  ) +
  theme_minimal()

Lastly here we see the difference in a players actual weighted on base average over the last 3 season against the the their predicted weighted on base average with the Coors field probabilities. The common theme among the top 10 under performers are big power bats that elevate the ball, and the common theme among the over performers is that they are primarily contact first hitters. Both Tovar and Blackmon themselves play on the Colorado Rockies, which is a good indicator that our model works decently well, given over half of their games are in the Coors xwOBA environemtnt.

0.5 Multiple Linear Regression Predicting The Coors Hangover Effect

A common phenomenon that is talked about in the circles of Rockies fans is the “Coors Hangover Effect”, an idea that after being in such a run happy environment where the ball moves less, the games immediately following a three game stint at Coors field are particularity bad as hitters adjust to their normal conditions. To test this I am going to run an MLR that predicts a vareity of indicators of hitting performance using the after_coors_flag and team as predictors. The after_coors_flag is a variable that kicks in for the following three games that a team has played at Coors field. Given we only have three years worth of data in our set, we will only examine teams within the NL West division as they have the highest n of instances that they played games after being at Coors Field.

all_pitch_data <- bind_rows(statcast_22, statcast_23, statcast_24) |>
   mutate(
    is_coors = home_team == "COL"
  )

team_games_raw <- all_pitch_data %>%
  mutate(
    batting_team = if_else(inning_topbot == "Top", away_team, home_team),
    pitching_team = if_else(inning_topbot == "Top", home_team, away_team)
  )

whiff_by_game <- all_pitch_data %>%
  mutate(
    batting_team = if_else(inning_topbot == "Top", away_team, home_team),
    is_whiff = description %in% c("swinging_strike", "swinging_strike_blocked", "missed_bunt"),
    is_swing = description %in% c("swinging_strike", "swinging_strike_blocked", "foul", "foul_tip", "missed_bunt, hit_into_play", "hit_into_play_score", "hit_into_play_no_out", "foul_bunt")
  ) %>%
  group_by(batting_team, game_date) %>%
  summarise(
    total_whiffs = sum(is_whiff, na.rm = TRUE),
    total_swings = sum(is_swing, na.rm = TRUE),
    whiff_rate = total_whiffs / total_swings,
    .groups = "drop"
  )

whiff_by_game <- whiff_by_game %>%
  filter(game_date >= as.Date("2022-04-08"))

pa_stats_by_game <- all_pitch_data %>%
  filter(!is.na(events)) %>%
  mutate(
    batting_team = if_else(inning_topbot == "Top", away_team, home_team),
    is_walk = events == "walk",
    is_pa = TRUE  # all rows are PA-ending now
  ) %>%
  group_by(batting_team, game_date) %>%
  summarise(
    n_pa = n(),
    team_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
    walk_rate = mean(is_walk, na.rm = TRUE),
    total_walks = sum(is_walk),
    .groups = "drop"
  )

team_game_summary <- left_join(pa_stats_by_game, whiff_by_game, by = c("batting_team", "game_date")) |>
  filter(game_date >= as.Date("2022-04-08"))

team_game_summary <- team_game_summary|>
  filter(n_pa <= 60)

team_games_lagged <- team_games_raw %>%
  arrange(batting_team, game_date) %>%
  group_by(batting_team) %>%
  mutate(
    coors_game = if_else(is_coors, game_date, as.Date(NA)),
    coors_game = zoo::na.locf(coors_game, na.rm = FALSE),  # Fill forward the date of last Coors game
    games_since_coors = as.integer(game_date - coors_game),
    after_coors = case_when(
      games_since_coors == 1 ~ "after_coors_1",
      games_since_coors == 2 ~ "after_coors_2",
      games_since_coors == 3 ~ "after_coors_3",
      TRUE ~ NA_character_
    )
  ) %>%
  ungroup()

lagged_after_coors <- team_games_lagged %>%
  select(batting_team, game_date, after_coors) %>%
  distinct(batting_team, game_date, .keep_all = TRUE)

team_game_summary <- team_game_summary %>%
  left_join(lagged_after_coors, by = c("batting_team", "game_date"))

nl_west_teams <- c("COL", "LAD", "SD", "SF", "AZ")

nlw_games <- team_game_summary %>%
  filter(batting_team %in% nl_west_teams)

nlw_games <- nlw_games %>%
  mutate(after_coors_flag = !is.na(after_coors))

xwoba_mod <- linear_reg() %>%
  set_engine("lm")

xwoba_fit <- fit(
  xwoba_mod,
  team_xwoba ~ after_coors_flag * batting_team,
  data = nlw_games
)

tidy(xwoba_fit) %>%
  filter(str_detect(term, "after_coors_flag")) %>%
  arrange(estimate)

# A tibble: 5 × 5
  term                                 estimate std.error statistic p.value
  <chr>                                   <dbl>     <dbl>     <dbl>   <dbl>
1 after_coors_flagTRUE                 -0.0251     0.0154    -1.62    0.105
2 after_coors_flagTRUE:batting_teamCOL  0.00831    0.0173     0.481   0.631
3 after_coors_flagTRUE:batting_teamLAD  0.0175     0.0229     0.767   0.443
4 after_coors_flagTRUE:batting_teamSD   0.0219     0.0218     1.00    0.315
5 after_coors_flagTRUE:batting_teamSF   0.0271     0.0218     1.24    0.214

The first MLR fit produced no statistically significant coefficnets when predicting the teams weighted on base average for games after being in Colorado, lets move on to predicting whiff % and walk % and see if we get different results.

walk_mod <- linear_reg() %>%
  set_engine("lm")

walk_fit <- fit(
  walk_mod,
  walk_rate ~ after_coors_flag * batting_team,
  data = nlw_games
)

tidy(walk_fit) %>%
  filter(str_detect(term, "after_coors_flag")) %>%
  arrange(estimate)

# A tibble: 5 × 5
  term                                 estimate std.error statistic p.value
  <chr>                                   <dbl>     <dbl>     <dbl>   <dbl>
1 after_coors_flagTRUE:batting_teamSD  -0.0177     0.0160    -1.11    0.269
2 after_coors_flagTRUE:batting_teamLAD -0.00246    0.0167    -0.147   0.883
3 after_coors_flagTRUE:batting_teamCOL -0.00134    0.0127    -0.106   0.916
4 after_coors_flagTRUE                 -0.00119    0.0113    -0.105   0.916
5 after_coors_flagTRUE:batting_teamSF   0.0202     0.0160     1.26    0.207

whiff_mod <- linear_reg() %>%
  set_engine("lm")

whiff_fit <- fit(
  whiff_mod,
  whiff_rate ~ after_coors_flag * batting_team,
  data = nlw_games
)

tidy(whiff_fit) %>%
  filter(str_detect(term, "after_coors_flag")) %>%
  arrange(estimate)

# A tibble: 5 × 5
  term                                 estimate std.error statistic p.value
  <chr>                                   <dbl>     <dbl>     <dbl>   <dbl>
1 after_coors_flagTRUE:batting_teamSD  -0.0398     0.0302    -1.32    0.188
2 after_coors_flagTRUE:batting_teamCOL -0.0316     0.0239    -1.32    0.187
3 after_coors_flagTRUE:batting_teamSF   0.00339    0.0302     0.112   0.911
4 after_coors_flagTRUE:batting_teamLAD  0.0161     0.0316     0.509   0.611
5 after_coors_flagTRUE                  0.0190     0.0214     0.890   0.374

Neither the whiff % or walk % had statistically significant coefficients, more work should be done analyzing this problem, perhaps aggregating more years worth of data or changing the modeling technique. For now though let’s move on to the pitching analysis.

1 Pitching Analysis

1.1 Pitch Type Summary Tables

# Filter out Rockies pitchers
filtered_data <- all_pa_data %>%
   filter(!(home_team == "COL" & inning_topbot == "Top"))


summary_table <- all_pa_data %>%
  group_by(pitch_type, is_coors) %>%
  summarise(xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE), n = n(), .groups = "drop") %>%
  pivot_wider(
    names_from = is_coors,
    values_from = c(xwoba, n),
    names_glue = "{.value}_{is_coors}"
  ) %>%
  mutate(delta = xwoba_TRUE - xwoba_FALSE) %>%
  arrange(desc(delta))

summary_table %>%
  filter(!is.na(delta)) %>%
  filter(n_TRUE >= 500) %>%
  ggplot(aes(x = reorder(pitch_type, delta), y = delta, fill = delta > 0)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = "darkred", "FALSE" = "steelblue"),
                    labels = c("Better at Coors", "Worse at Coors")) +
  labs(
    title = "Change in xwOBA by Pitch Type at Coors Field",
    x = "Pitch Type",
    y = "Δ xwOBA (Coors - Other Parks)",
    fill = "Performance"
  ) +
  theme_minimal()

This first graph is a quick glance of the types of pitches that work well at Coors compared to all other fields in the league. Here we see that the sweeper, curveball, and knucklecurve are the only three pitches that out performed in Colorado compared to everywhere else. Changeups, fastballs, sinkers, sliders, and cutters all perfomred better in different parks compared to Coors field. a reasonable result given the altitude efferts in Denver.

summary_table %>%
  filter(!is.na(xwoba_TRUE), !is.na(xwoba_FALSE)) %>%
  filter(n_TRUE > 500) %>%
  ggplot(aes(x = xwoba_FALSE, y = xwoba_TRUE, label = pitch_type)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray60") +
  geom_point(color = "purple", size = 3) +
  ggrepel::geom_text_repel(max.overlaps = 10) +
  labs(
    title = "xwOBA at Coors vs Other Parks",
    x = "xwOBA (Other Parks)",
    y = "xwOBA (Coors Field)"
  ) +
  theme_minimal()

1.2 Lasso Regression Predicting Pitch Movement by Specs

This Lasso regression model predicts the horizontal movement of pitches using inputs such as velocity, spin rate, and spin axis. It also includes a binary flag indicating whether the game was played at Coors Field to test if the park environment affects pitch movement. This approach can be replicated across different pitch types and pitcher handedness. For this analysis, we focus specifically on sliders thrown by right-handed pitchers.

sliders_rhp <- all_pitch_data %>%
  filter(
    p_throws == "R",
    pitch_type == "SL", 
    !is.na(api_break_x_arm),
    !is.na(api_break_z_with_gravity)
  ) %>%
  select(
    api_break_x_arm,
    api_break_z_with_gravity,
    release_spin_rate,
    spin_axis,             
    release_speed,
    release_extension,
    is_coors
  ) %>%
  drop_na() %>%
  mutate(is_coors = factor(is_coors)) %>%
  mutate(api_break_x_arm = api_break_x_arm * 12) %>%
  mutate(api_break_z_with_gravity = api_break_z_with_gravity * 12)

# Lasso spec
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

# Recipe with interactions
slider_rec <- recipe(api_break_x_arm ~ ., data = sliders_rhp) %>%
  step_interact(terms = ~ release_spin_rate:spin_axis) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

slider_wf <- workflow() %>%
  add_model(lasso_spec) %>%
  add_recipe(slider_rec)

set.seed(42)
cv_folds <- vfold_cv(sliders_rhp, v = 5)

tune_results <- tune_grid(
  slider_wf,
  resamples = cv_folds,
  grid = 30,
  metrics = metric_set(rmse)
)

best_lasso <- select_best(tune_results, metric = "rmse")
final_lasso <- finalize_workflow(slider_wf, best_lasso)
final_fit <- fit(final_lasso, data = sliders_rhp)

final_fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  filter(estimate != 0) %>%
  arrange(desc(abs(estimate)))

Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
ℹ Please use `extract_fit_parsnip()` instead.

# A tibble: 7 × 3
  term                          estimate  penalty
  <chr>                            <dbl>    <dbl>
1 (Intercept)                     -5.32  1.15e-10
2 release_speed                    1.84  1.15e-10
3 release_spin_rate               -1.54  1.15e-10
4 release_spin_rate_x_spin_axis    1.24  1.15e-10
5 api_break_z_with_gravity         0.513 1.15e-10
6 release_extension               -0.389 1.15e-10
7 is_coors_TRUE.                   0.171 1.15e-10

Because break is measured relative to the pitcher’s arm side, more negative values indicate more total movement. The model confirms that higher release speed is associated with less movement, while greater spin rate leads to more arm-side break. The interaction between spin rate and spin axis is also negative, reinforcing that not just the amount but the type of spin affects break.

Interestingly, vertical break (with gravity) has a positive relationship, meaning pitches that drop more vertically tend to sweep less horizontally. Release extension has a negative coefficient, suggesting that pitches thrown farther from the mound tend to move more arm-side. Finally, the positive effect of Coors Field (is_coors_TRUE) implies that sliders break less at Coors than elsewhere, aligning with expectations about how thin air diminishes pitch movement.

library(yardstick)

preds <- predict(final_fit, new_data = sliders_rhp) %>%
  bind_cols(sliders_rhp)

metrics_regression <- metric_set(rmse, mae, rsq)

metrics_regression(preds, truth = api_break_x_arm, estimate = .pred)

# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       3.67 
2 mae     standard       2.84 
3 rsq     standard       0.345

preds %>%
  ggplot(aes(x = .pred, y = api_break_x_arm)) +
  geom_point(alpha = 0.4) +
  geom_abline(linetype = "dashed", color = "red") +
  labs(
    title = "Predicted vs Actual Horizontal Break (RHP Sliders)",
    x = "Predicted Break (inches)",
    y = "Actual Break (inches)"
  ) +
  theme_minimal()

preds %>%
  ggplot(aes(x = .pred, y = api_break_x_arm, color = is_coors)) +
  geom_point(alpha = 0.3) +
  geom_abline(linetype = "dashed", color = "red") +
  labs(
    title = "Predicted vs Actual Horizontal Break\nColored by Coors Field",
    x = "Predicted Break (inches)",
    y = "Actual Break (inches)",
    color = "Coors Field"
  ) +
  theme_minimal()

preds %>%
  mutate(resid = api_break_x_arm - .pred) %>%
  ggplot(aes(x = .pred, y = resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Residual Plot: Horizontal Break Predictions",
    x = "Predicted Break (inches)",
    y = "Residual (Actual - Predicted)"
  ) +
  theme_minimal()

To better capture potential non-linear relationships between pitch characteristics and horizontal break, we constructed a polynomial regression recipe. This preprocessing pipeline includes third-degree polynomial terms for both release speed and spin rate, which allows the model to flexibly account for curvature in how these variables influence break.

poly_rec <- recipe(api_break_x_arm ~ release_speed + release_spin_rate + spin_axis + release_extension + is_coors, data = sliders_rhp) %>%
  step_poly(release_speed, degree = 3) %>%
  step_poly(release_spin_rate, degree = 3) %>%
  step_dummy(all_nominal_predictors()) %>%  
  step_normalize(all_numeric_predictors())

poly_wf <- workflow() %>%
  add_recipe(poly_rec) %>%
  add_model(lasso_spec)

set.seed(42)
cv_folds <- vfold_cv(sliders_rhp, v = 5)

tune_results <- tune_grid(
  poly_wf,
  resamples = cv_folds,
  grid = 30,
  metrics = metric_set(rmse, rsq)
)

best_lasso <- select_best(tune_results, metric = "rmse")
final_lasso <- finalize_workflow(poly_wf, best_lasso)
final_fit <- fit(final_lasso, data = sliders_rhp)

final_fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  filter(estimate != 0) %>%
  arrange(desc(abs(estimate)))

# A tibble: 10 × 3
   term                     estimate  penalty
   <chr>                       <dbl>    <dbl>
 1 (Intercept)               -5.32   1.15e-10
 2 release_speed_poly_1       1.55   1.15e-10
 3 release_spin_rate_poly_1  -1.24   1.15e-10
 4 spin_axis                  1.11   1.15e-10
 5 release_speed_poly_3      -0.525  1.15e-10
 6 release_extension         -0.344  1.15e-10
 7 release_speed_poly_2      -0.310  1.15e-10
 8 release_spin_rate_poly_2  -0.221  1.15e-10
 9 is_coors_TRUE.             0.176  1.15e-10
10 release_spin_rate_poly_3   0.0236 1.15e-10

Compared to the original lasso model, we now see that the relationship between release speed and horizontal break is more nuanced: the first-degree term is strongly positive, while the second- and third-degree terms are increasingly negative, suggesting diminishing returns and possible curvature in the effect. The same pattern appears for spin rate, where the strongest influence is still negative, but now captured more flexibly through multiple degrees. The effect of Coors Field remains positive, consistent with the earlier model, reinforcing the idea that pitches tend to break less horizontally in the Coors environment. Overall, this model offers a more expressive fit without drastically changing the interpretation.

preds <- predict(final_fit, new_data = sliders_rhp) %>%
  bind_cols(sliders_rhp)

metrics_regression <- metric_set(rmse, mae, rsq)

metrics_regression(preds, truth = api_break_x_arm, estimate = .pred)

# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       3.63 
2 mae     standard       2.82 
3 rsq     standard       0.358

In this variation of the model, we aim to explore whether the effects of pitch characteristics on horizontal break differ specifically at Coors Field. To do this, we include interaction terms between is_coors and both release_speed and release_spin_rate. These interactions allow the model to estimate whether the relationship between velocity or spin and movement changes in the Coors environment. We again apply dummy encoding to handle the binary Coors indicator and normalize all numeric predictors for consistent model scaling. This approach helps isolate environment-specific behavior in pitch movement, providing a more targeted lens on how Coors affects physical pitch outcomes.

poly_rec <- recipe(api_break_x_arm ~ release_speed + release_spin_rate + spin_axis + release_extension + is_coors, data = sliders_rhp) %>%
  step_interact(terms = ~ release_speed:is_coors) %>%
  step_interact(terms = ~ release_spin_rate:is_coors) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

set.seed(42)
cv_folds <- vfold_cv(sliders_rhp, v = 5)

tune_results <- tune_grid(
  workflow() %>%
    add_model(lasso_spec) %>%
    add_recipe(poly_rec),
  resamples = cv_folds,
  grid = 30,
  metrics = metric_set(rmse, rsq)
)

best_lasso <- select_best(tune_results, metric = "rmse")
final_fit <- finalize_workflow(workflow() %>% add_model(lasso_spec) %>% add_recipe(poly_rec), best_lasso) %>%
  fit(data = sliders_rhp)

final_fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  filter(estimate != 0) %>%
  arrange(desc(abs(estimate)))

# A tibble: 6 × 3
  term                             estimate  penalty
  <chr>                               <dbl>    <dbl>
1 (Intercept)                        -5.32  1.15e-10
2 release_speed                       1.57  1.15e-10
3 release_spin_rate                  -1.27  1.15e-10
4 spin_axis                           1.08  1.15e-10
5 release_extension                  -0.364 1.15e-10
6 release_spin_rate_x_is_coorsTRUE    0.186 1.15e-10

The main effects for release speed and spin rate are consistent with previous models: more velocity reduces break, while higher spin increases arm-side movement. The positive coefficient on the release_spin_rate × is_coors interaction suggests that spin is less effective at generating horizontal break at Coors compared to other parks. This aligns with expectations around decreased air resistance in Denver, which weakens the Magnus effect. The rest of the model remains stable, reinforcing the robustness of these variables across specifications.

# Set representative values for other features
mean_speed <- mean(sliders_rhp$release_speed, na.rm = TRUE)
mean_axis <- mean(sliders_rhp$spin_axis, na.rm = TRUE)
mean_ext  <- mean(sliders_rhp$release_extension, na.rm = TRUE)

# Create spin rate range and prediction grid
spin_range <- seq(min(sliders_rhp$release_spin_rate), max(sliders_rhp$release_spin_rate), length.out = 100)

pred_grid <- expand.grid(
  release_spin_rate = spin_range,
  release_speed = mean_speed,
  spin_axis = mean_axis,
  release_extension = mean_ext,
  is_coors = factor(c(TRUE, FALSE), levels = c(FALSE, TRUE))  # match factor levels
)

pred_grid_with_preds <- predict(final_fit, new_data = pred_grid) %>%
  bind_cols(pred_grid)

ggplot(pred_grid_with_preds, aes(x = release_spin_rate, y = .pred, color = is_coors)) +
  geom_line(size = 1.2) +
  labs(
    title = "Predicted Horizontal Break vs Spin Rate",
    subtitle = "Right-Handed Sliders — Coors vs Other Parks",
    x = "Spin Rate (RPM)",
    y = "Predicted Horizontal Break (inches)",
    color = "Coors Field"
  ) +
  theme_minimal()

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

This graph illustrates the difference in total horizontal beak by spin rate between pitches thrown at altitude and those thrown in other parks. This process would be able to be replicated every single pitch and handedness to study which pitches are affected the most by the thinner air in Colorado.

1.3 Linear Discriminate Analysis Predicting Pitches thrown at Coors Field

This section applies a Linear Discriminant Analysis (LDA) model to investigate whether we can distinguish between pitches thrown at Coors Field and those thrown elsewhere based solely on their physical characteristics, such as velocity, spin rate, movement, and location. The goal isn’t necessarily high classification accuracy, but rather to see if Coors-specific environmental effects leave a detectable signature in the measurable features of a pitch. A meaningful separation would suggest that even with no explicit park label, the model can pick up on how pitches “behave differently” at altitude.

lda_data <- all_pa_data %>%
  filter(pitch_type %in% c("FF", "SL", "CH", "CU" )) %>%  
  select(
    is_coors,
    pfx_x, pfx_z,
    release_spin_rate,
    release_speed,
    release_extension,
    spin_axis
  ) %>%
  drop_na() %>%
  mutate(is_coors = factor(is_coors))

library(dplyr)

coors <- lda_data %>% filter(is_coors == TRUE)
non_coors_sample <- lda_data %>% filter(is_coors == FALSE) %>%
  slice_sample(n = nrow(coors))  # takes same number as coors

lda_balanced <- bind_rows(coors, non_coors_sample) %>%
  mutate(is_coors = factor(is_coors))  # ensure factor

library(discrim)


Attaching package: 'discrim'

The following object is masked from 'package:dials':

    smoothness

lda_spec <- discrim_linear() %>%
  set_engine("MASS") %>%
  set_mode("classification")

lda_rec <- recipe(is_coors ~ ., data = lda_balanced) %>%
  step_normalize(all_predictors())

lda_wf <- workflow() %>%
  add_recipe(lda_rec) %>%
  add_model(lda_spec)

set.seed(42)
data_split <- initial_split(lda_balanced, strata = is_coors)
train <- training(data_split)
test <- testing(data_split)

lda_fit <- fit(lda_wf, data = train)

predict(lda_fit, new_data = test) %>%
  bind_cols(test) %>%
  metrics(truth = is_coors, estimate = .pred_class)

# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.566
2 kap      binary         0.133

lda_preds <- predict(lda_fit, new_data = test) %>%
  bind_cols(test)

library(yardstick)

conf_mat(lda_preds, truth = is_coors, estimate = .pred_class)

          Truth
Prediction FALSE TRUE
     FALSE  1767 1368
     TRUE   1242 1641

With an accuracy of roughly 55.7%, our LDA model performs only slightly better than random guessing in classifying whether a pitch occurred at Coors Field or not. While this is modest, it still suggests that there are subtle, measurable differences in pitch characteristics at altitude, just not strong enough to drive high prediction performance on their own. This aligns with our understanding that while the Coors environment does affect pitch behavior, the effects are nuanced and may be confounded by other factors like pitcher intent, mechanics, or game context.

lda_model <- extract_fit_parsnip(lda_fit)$fit 

lda_projection <- predict(lda_model, newdata = test)

lda_ld1_data <- test %>%
  mutate(LD1 = lda_projection$x[, 1])

library(ggplot2)

ggplot(lda_ld1_data, aes(x = LD1, fill = is_coors)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "LDA Projection: LD1 (Coors vs Non-Coors)",
    x = "LD1 Score (Discriminant Axis 1)",
    y = "Density",
    fill = "At Coors Field"
  ) +
  theme_minimal()

This plot shows the distribution of LDA scores for pitches thrown at Coors Field versus all other parks. While the two curves mostly overlap, there are slight shifts in their shape,especially in the tails, suggesting some separation in pitch characteristics based on location. The LDA model has attempted to project pitch features into a single dimension that best distinguishes Coors from non-Coors pitches. The small separation explains the modest model accuracy, but it also validates that Coors Field does alter pitch traits enough to detect, albeit subtly.