Week 7 Data Dive: Hypothesis Testing

Introduction

This analysis explores U.S. airline performance from 2021 to 2023 using two hypothesis tests grounded in different statistical philosophies. Each record in the dataset summarizes one carrier at one airport in one month and includes arrival delays, weather-related metrics, and cancellations.

We conduct two A/B tests:

Hypothesis 1 (Neyman-Pearson Framework): Did average arrival delays change between 2021 and 2023?
Hypothesis 2 (Fisher Significance Testing): Do higher weather-related disruptions lead to more cancellations?

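Hypothesis 1 fixes its decision rule and error rates (alpha, power, minimum effect size) before looking at the data. Hypothesis 2 sets no such thresholds in advance and treats the p-value as a graded measure of evidence against H0.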

library(tidyverse)

Airline_Delay_Post_COVID_2021_2023 <- read.csv("Airline_Delay_Post_COVID_2021_2023.csv")

names(Airline_Delay_Post_COVID_2021_2023)

##  [1] "year"                "month"               "carrier"            
##  [4] "carrier_name"        "airport"             "airport_name"       
##  [7] "arr_flights"         "arr_del15"           "carrier_ct"         
## [10] "weather_ct"          "nas_ct"              "security_ct"        
## [13] "late_aircraft_ct"    "arr_cancelled"       "arr_diverted"       
## [16] "arr_delay"           "carrier_delay"       "weather_delay"      
## [19] "nas_delay"           "security_delay"      "late_aircraft_delay"

Hypothesis 1: Neyman-Pearson Framework

Research Question

Did airline arrival delays change from 2021 to 2023?

Group A: Year 2021
Group B: Year 2023
Main Variable: arr_delay (total arrival delay in minutes per carrier-airport-month record)

Define Hypotheses

\[H_0: \mu_{2021} = \mu_{2023}\] \[H_A: \mu_{2021} \neq \mu_{2023}\]

This is a two-sided test. We are open to delays either improving or worsening over the post-COVID recovery period.

Design Parameters

Parameter	Value	Rationale
Alpha (Type I error)	0.05	Standard threshold. We accept a 5% chance of falsely concluding delays changed when they did not.
Power (1 minus Beta)	0.80	We want an 80% chance of detecting a true change. Missing a real operational shift could misdirect policy decisions.
Minimum effect size	5 minutes	A 5-minute difference in average arrival delay is operationally meaningful for airline scheduling and passenger satisfaction at the individual flight level.

A Type I error here means concluding delays changed when they did not, which could cause airlines to falsely credit or blame recovery initiatives. A Type II error means missing a real change, which is costly if airlines are benchmarking post-COVID performance. An 80% power level balances these two risks appropriately.
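
To make the two error rates concrete, the following simulation is a minimal sketch on synthetic normal data, not on the airline records. The group SD of 2, the true difference of 1, and the helper name `rejects` are illustrative assumptions. Under H0 the share of rejections should land near alpha, and with the sample size returned by `power.t.test()` the share of rejections under the alternative should land near 0.80.

```r
# Simulation sketch: all settings below are illustrative assumptions,
# not values taken from the airline data.
set.seed(42)
alpha  <- 0.05
n_sims <- 2000

# Sample size per group that power.t.test() reports for 80% power
# to detect a 1-unit difference when the SD is 2.
n_per_group <- ceiling(power.t.test(delta = 1, sd = 2, sig.level = alpha,
                                    power = 0.80)$n)

# Returns TRUE when a Welch t-test rejects H0 at the chosen alpha.
rejects <- function(true_diff) {
  a <- rnorm(n_per_group, mean = 0, sd = 2)
  b <- rnorm(n_per_group, mean = true_diff, sd = 2)
  t.test(a, b)$p.value < alpha
}

type1_rate <- mean(replicate(n_sims, rejects(0)))  # H0 true: false alarms
power_hat  <- mean(replicate(n_sims, rejects(1)))  # H0 false: detections

cat("Simulated Type I error rate:", type1_rate, "\n")
cat("Simulated power:", power_hat, "\n")
```

The simulated rates fluctuate from run to run; increasing `n_sims` tightens them around the nominal values.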

Sample Size Check

table(Airline_Delay_Post_COVID_2021_2023$year)

## 
##  2021  2022  2023 
## 19954 20345  4612

The dataset contains 19,954 observations for 2021 and 4,612 for 2023. We check whether this is sufficient to detect a 5-minute difference.

delay_2021 <- subset(Airline_Delay_Post_COVID_2021_2023, year == 2021)$arr_delay
delay_2023 <- subset(Airline_Delay_Post_COVID_2021_2023, year == 2023)$arr_delay

sd_est <- sd(delay_2021, na.rm = TRUE)
cat("Estimated SD (2021):", round(sd_est, 2), "\n")

## Estimated SD (2021): 10515.31

power_result <- power.t.test(
  delta = 5,
  sd = sd_est,
  sig.level = 0.05,
  power = 0.80,
  type = "two.sample",
  alternative = "two.sided"
)
cat("Required n per group:", ceiling(power_result$n), "\n")

## Required n per group: 69429167

cat("Available n (2021):", length(delay_2021), "| Available n (2023):", length(delay_2023), "\n")

## Available n (2021): 19954 | Available n (2023): 4612

The required sample size (~69 million per group) far exceeds what is available. We do not have enough data to reliably detect a 5-minute difference at this scale.

This is a data structure problem. arr_delay represents total aggregate delay across all flights for a given carrier, airport, and month in a single record. It is not a per-flight measurement. The SD of ~10,515 minutes reflects that aggregation, making a 5-minute threshold statistically invisible relative to the natural variance in these totals. The implied Cohen’s d for a 5-minute difference at this SD is approximately 0.000475, which explains why ~69 million observations would be needed.
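
The required sample size can also be approximated by hand, which shows where the ~69 million comes from. This sketch uses the standard normal-approximation formula n ≈ 2(z_{1-alpha/2} + z_{1-beta})^2 / d^2 with the SD hard-coded from the output above. `power.t.test()` solves the exact t-based version, so the two should agree closely but not digit for digit.

```r
# Inputs copied from the design table and the SD output above.
delta  <- 5          # minimum effect size, minutes
sd_est <- 10515.31   # SD of arr_delay in 2021
alpha  <- 0.05
power  <- 0.80

# Standardized effect size (Cohen's d) implied by the 5-minute threshold.
d <- delta / sd_est

# Normal-approximation sample size per group for a two-sided, two-sample test.
n_approx <- 2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / d^2

cat("Cohen's d:", signif(d, 3), "\n")
cat("Approximate n per group:", format(ceiling(n_approx), big.mark = ","), "\n")
```

Because n scales with 1/d^2, halving the detectable difference quadruples the required sample size.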

Test for Difference in Means

t_test_result <- t.test(delay_2021, delay_2023)
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  delay_2021 and delay_2023
## t = -7.2747, df = 5758.8, p-value = 3.935e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2101.199 -1209.138
## sample estimates:
## mean of x mean of y 
##  3342.028  4997.197

Interpretation

p_val <- t_test_result$p.value
mean_2021 <- mean(delay_2021, na.rm = TRUE)
mean_2023 <- mean(delay_2023, na.rm = TRUE)
diff_means <- mean_2023 - mean_2021
cohens_d <- diff_means / sd_est

cat("Mean arr_delay 2021:", round(mean_2021, 2), "minutes\n")

## Mean arr_delay 2021: 3342.03 minutes

cat("Mean arr_delay 2023:", round(mean_2023, 2), "minutes\n")

## Mean arr_delay 2023: 4997.2 minutes

cat("Difference (2023 - 2021):", round(diff_means, 2), "minutes\n")

## Difference (2023 - 2021): 1655.17 minutes

cat("Cohen's d:", round(cohens_d, 4), "\n")

## Cohen's d: 0.1574

cat("p-value:", format(p_val, scientific = TRUE), "\n")

## p-value: 3.934722e-13

Cohen’s d of approximately 0.157, computed here by dividing the difference in means by the 2021 SD, is a small effect by conventional standards (Cohen, 1988). The difference is statistically detectable but modest relative to the record-to-record variance in the data. The test rejected H0 not because the design was adequately powered for a 5-minute difference, but because the observed difference of 1,655 minutes is more than 300 times that threshold, and power grows with the size of the true effect.
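
A complementary way to read the power analysis is to ask what difference the available data could detect. The sketch below estimates the minimum detectable difference at alpha = 0.05 and 80% power. It assumes the 2021 SD applies to both years, uses a normal approximation, and takes the group sizes from the `length()` output above, which may include records with missing delays.

```r
# Inputs copied from the outputs above.
sd_est <- 10515.31   # 2021 SD, assumed here to apply to both years
n_2021 <- 19954
n_2023 <- 4612
alpha  <- 0.05
power  <- 0.80

# Standard error of the difference in means under the equal-SD assumption.
se_diff <- sd_est * sqrt(1 / n_2021 + 1 / n_2023)

# Smallest true difference detectable with the stated alpha and power.
mde <- (qnorm(1 - alpha / 2) + qnorm(power)) * se_diff

cat("Approximate minimum detectable difference:", round(mde), "minutes\n")
```

Under these assumptions the detectable difference is in the hundreds of minutes: far above the 5-minute design threshold, but well below the observed 1,655 minutes.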

Visualization

Airline_Delay_Post_COVID_2021_2023 %>%
  filter(year %in% c(2021, 2023)) %>%
  mutate(year = factor(year)) %>%
  # na.rm = TRUE is passed through to median() so missing delays do not break the ordering
  ggplot(aes(x = reorder(year, arr_delay, FUN = median, na.rm = TRUE), y = arr_delay, fill = year)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.8) +
  # coord_cartesian() zooms the view without dropping rows from the boxplot statistics
  coord_cartesian(ylim = c(0, quantile(
    filter(Airline_Delay_Post_COVID_2021_2023, year %in% c(2021, 2023))$arr_delay, 0.95, na.rm = TRUE
  ))) +
  scale_fill_manual(values = c("2021" = "skyblue", "2023" = "steelblue")) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Arrival Delay Comparison: 2021 vs 2023",
    subtitle = "Years sorted by median delay; y-axis zoomed to the 95th percentile, outlier points hidden",
    x = "Year (Sorted by Median Delay)",
    y = "Arrival Delay (Minutes)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Insight: The boxplot shows that 2023 has a substantially wider interquartile range and longer upper whisker than 2021, indicating greater variability in aggregate delays rather than a simple uniform shift upward. The medians are relatively close, but the spread in 2023 is much larger, suggesting that while some carrier-airport pairs experienced much higher total delays, others remained comparable to 2021 levels. This uneven pattern is consistent with the t-test result: the statistically significant difference in means appears to be driven by a subset of high-delay carrier-airport-month records.

Conclusion

Since p ≈ 3.9e-13 is far below the pre-specified alpha of 0.05, we reject H0. Mean total arrival delay per carrier-airport-month record rose by approximately 1,655 minutes between 2021 and 2023 (95% CI: 1,209 to 2,101 minutes). However, the design was underpowered for our 5-minute threshold given the aggregated nature of arr_delay and its SD of ~10,515 minutes. The result reaches significance only because the observed difference is orders of magnitude larger than that threshold. Cohen’s d of 0.157 indicates the effect is detectable but small relative to overall variance. Because arr_delay is a total, part of the increase may reflect more flights per record rather than longer delays per flight.

Further investigation worth pursuing: The 2023 sample covers only 4,612 records compared to roughly 20,000 for 2021. If the 2023 data are concentrated in high-delay months or busy airports due to incomplete year coverage, the apparent worsening could reflect sampling bias rather than a true year-over-year trend. A follow-up analysis controlling for month and airport would isolate whether the increase persists after accounting for that imbalance.
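
A minimal sketch of the month-matched part of that follow-up is shown below. It runs on a small simulated data frame whose column names mirror the dataset; every value in `toy` is made up for illustration. Dividing by arr_flights to get delay minutes per arriving flight is an added assumption, not part of the analysis above.

```r
# Toy stand-in for the airline data: column names follow the dataset,
# but every value below is simulated for illustration only.
set.seed(7)
toy <- rbind(
  data.frame(year = 2021, month = rep(1:12, each = 20)),
  data.frame(year = 2023, month = rep(1:3,  each = 20))  # partial year
)
toy$arr_flights <- rpois(nrow(toy), lambda = 200) + 1
toy$arr_delay   <- rgamma(nrow(toy), shape = 2, scale = 8) * toy$arr_flights

# Keep only months observed in both years so the comparison is like-for-like.
common_months <- intersect(toy$month[toy$year == 2021],
                           toy$month[toy$year == 2023])
matched <- toy[toy$month %in% common_months, ]

# Delay minutes per arriving flight removes the effect of traffic volume.
matched$delay_per_flight <- matched$arr_delay / matched$arr_flights
matched$year_f <- factor(matched$year)

# Welch t-test on the matched, volume-normalized records.
t.test(delay_per_flight ~ year_f, data = matched)
```

On the real data the same steps would apply to Airline_Delay_Post_COVID_2021_2023 in place of `toy`; matching on airport as well would require a further join or a regression with airport as a covariate.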

Hypothesis 2: Fisher Significance Testing

Research Question

Do higher weather-related disruptions increase cancellations?

Group A: High weather impact (above median weather_ct)
Group B: Low weather impact (at or below median weather_ct)
Main Variable: arr_cancelled (number of cancelled arrivals per carrier-airport-month record)

In Fisher’s framework, we do not pre-specify power or Type II error. We compute the p-value and assess how surprising the data would be if H0 were true.

Create Weather Groups

Airline_Delay_Post_COVID_2021_2023$weather_group <-
  ifelse(
    Airline_Delay_Post_COVID_2021_2023$weather_ct >
      median(Airline_Delay_Post_COVID_2021_2023$weather_ct, na.rm = TRUE),
    "High Weather",
    "Low Weather"
  )

table(Airline_Delay_Post_COVID_2021_2023$weather_group)

## 
## High Weather  Low Weather 
##        22431        22446

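The median split produces two nearly equal groups: 22,431 High Weather records and 22,446 Low Weather records. Balanced groups make efficient use of the sample in a two-sample comparison. Together the groups cover 44,877 of the 44,911 records; the remaining 34 presumably have a missing weather_ct, which ifelse() turns into NA and table() omits.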

Define Hypotheses

\[H_0: \mu_{\text{High}} = \mu_{\text{Low}}\] \[H_A: \mu_{\text{High}} \neq \mu_{\text{Low}}\]

Test for Difference in Means

weather_test <- t.test(arr_cancelled ~ weather_group, data = Airline_Delay_Post_COVID_2021_2023)
weather_test

## 
##  Welch Two Sample t-test
## 
## data:  arr_cancelled by weather_group
## t = 37.377, df = 23148, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group High Weather and group Low Weather is not equal to 0
## 95 percent confidence interval:
##  10.23567 11.36862
## sample estimates:
## mean in group High Weather  mean in group Low Weather 
##                  12.334403                   1.532255

Test Assumptions and Validity

A Welch two-sample t-test is appropriate here for three reasons specific to this data:

First, with approximately 22,400 observations per group, the Central Limit Theorem makes the normal approximation to the sampling distribution of each group mean reasonable even though the raw cancellation counts are heavily skewed.

Second, the Welch variant does not assume equal variances. For count data the variance tends to grow with the mean, so a High Weather mean of 12.3 cancellations versus 1.5 for Low Weather makes unequal variances very likely. The reported df of 23,148, about half the roughly 44,900 a pooled-variance test would use, is consistent with that.

Third, the grouping variable separates 44,877 records into two non-overlapping groups, and the 8x difference in group means is large enough that mild violations of the test assumptions would not change the conclusion. One caveat: the same carrier-airport pairs appear in many months, so records are not fully independent and the true uncertainty is probably somewhat larger than the test reports.
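
One way to check that a conclusion like this does not lean on the normal approximation is a permutation test, which makes no distributional assumption about the counts. The sketch below is run on simulated negative-binomial counts whose means loosely echo the two groups; the data, the dispersion value, and the variable names are assumptions for illustration, not the airline records.

```r
# Simulated, heavily skewed counts; these are NOT the airline data.
set.seed(123)
high <- rnbinom(500, mu = 12,  size = 0.5)
low  <- rnbinom(500, mu = 1.5, size = 0.5)

obs_diff <- mean(high) - mean(low)
pooled   <- c(high, low)
n_high   <- length(high)

# Shuffle group labels and recompute the difference in means each time.
n_perm <- 2000
perm_diffs <- replicate(n_perm, {
  idx <- sample(length(pooled), n_high)
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided permutation p-value (the add-one correction avoids p = 0).
p_perm  <- (sum(abs(perm_diffs) >= abs(obs_diff)) + 1) / (n_perm + 1)
p_welch <- t.test(high, low)$p.value

cat("Permutation p-value:", signif(p_perm, 3), "\n")
cat("Welch p-value:", signif(p_welch, 3), "\n")
```

The permutation p-value cannot fall below 1 / (n_perm + 1), so with a large effect it mainly confirms the direction of the Welch result rather than reproducing its exact magnitude.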

Visualization 1: Group Comparison

Airline_Delay_Post_COVID_2021_2023 %>%
  filter(!is.na(weather_group)) %>%
  ggplot(aes(x = weather_group, y = arr_cancelled, fill = weather_group)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.15, size = 0.8, color = "gray30") +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 4, color = "red",
               aes(group = weather_group)) +
  scale_y_continuous(labels = scales::comma) +
  # Zoom with coord_cartesian() rather than scale limits: scale limits drop
  # out-of-range rows before the boxplot and the group mean are computed
  coord_cartesian(ylim = c(0, quantile(Airline_Delay_Post_COVID_2021_2023$arr_cancelled,
                                       0.97, na.rm = TRUE))) +
  scale_fill_manual(values = c("High Weather" = "tomato", "Low Weather" = "lightgreen")) +
  labs(
    title = "Cancellations by Weather Impact Group",
    subtitle = "Red diamond shows group mean; jitter reveals density of overlapping observations",
    x = "Weather Impact Group",
    y = "Number of Cancellations"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Insight: The red diamond (group mean) sits at approximately 12.3 for High Weather versus 1.5 for Low Weather, an 8x difference visible through the jitter. The High Weather group has a noticeably heavier upper tail, meaning more carrier-airport-month records with large cancellation counts. Practically, a difference of ~10.8 cancellations per record is operationally meaningful: for a hypothetical record with 200 flights in the month, it corresponds to roughly 5 percentage points of cancellation rate (10.8 / 200 ≈ 5.4%), which affects both passenger experience and airline revenue. That translation assumes both groups have similar flight volumes; if High Weather records come disproportionately from busier airports, part of the gap reflects volume rather than weather.

Visualization 2: Weather Delay vs. Cancellations

Airline_Delay_Post_COVID_2021_2023 %>%
  filter(!is.na(weather_delay), !is.na(arr_cancelled)) %>%
  ggplot(aes(x = weather_delay, y = arr_cancelled)) +
  geom_point(alpha = 0.2, size = 0.9, color = "steelblue",
             position = position_jitter(width = 0.5, height = 0.3)) +
  geom_smooth(method = "lm", se = TRUE, color = "red", linewidth = 1.2) +
  coord_cartesian(ylim = c(0, 500)) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Weather Delay vs. Cancellations",
    subtitle = "Zoomed to data-dense region; line of best fit with 95% confidence interval",
    x = "Weather Delay (Minutes)",
    y = "Number of Cancellations"
  ) +
  theme_minimal(base_size = 13)

Insight: Zooming to the data-dense region (below 500 cancellations) makes both the upward trend and the confidence band clearly visible. As weather delay minutes increase, cancellations rise on average. The narrow shaded band reflects the large sample: the fitted average trend is estimated precisely where most observations are concentrated, although individual records still scatter widely around the line. This plot extends the group comparison by showing the relationship is continuous rather than a binary artifact of how the groups were defined.

Conclusion

cat("p-value:", format(weather_test$p.value, scientific = TRUE), "\n")

## p-value: 6.09722e-297

cat("Mean cancellations (High Weather):", round(weather_test$estimate[1], 2), "\n")

## Mean cancellations (High Weather): 12.33

cat("Mean cancellations (Low Weather):", round(weather_test$estimate[2], 2), "\n")

## Mean cancellations (Low Weather): 1.53

cat("Difference in means:", round(weather_test$estimate[1] - weather_test$estimate[2], 2), "\n")

## Difference in means: 10.8

The p-value of approximately 6.1e-297 is effectively zero. In Fisher's terms, data this extreme would be extraordinarily surprising if H0 were true, so the evidence against H0 is overwhelming. We conclude that weather disruptions have a statistically and practically significant association with cancellations, a relationship that appears both in the group comparison and as a continuous trend across weather delay values. Because the data are observational, this supports weather as a likely driver of cancellations but does not by itself establish causation.
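
The p-value printed above can be recovered directly from the t statistic and the Welch degrees of freedom reported in the test output. This sketch hard-codes those two rounded values, so the result should match the reported p-value only approximately.

```r
# t statistic and Welch degrees of freedom copied from the test output above.
t_stat <- 37.377
df     <- 23148

# Two-sided p-value: probability of a |t| at least this large if H0 were true.
p_two_sided <- 2 * pt(-abs(t_stat), df = df)

cat("p-value:", format(p_two_sided, scientific = TRUE), "\n")
```

Doubling the lower-tail probability is what makes the test two-sided, matching the alternative hypothesis stated for this comparison.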

Further investigation worth pursuing: The median split treats all above-median records as equally high weather regardless of intensity. A multivariate model isolating weather from correlated factors like NAS delays and late aircraft counts would clarify how much of the cancellation effect is uniquely attributable to weather versus co-occurring disruptions.
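
The multivariate follow-up could start from something like the sketch below. It fits an ordinary linear model to a simulated data frame: the column names follow the dataset, but the values and the data-generating rule are assumptions for illustration. Including arr_flights as a control for traffic volume is also an assumption, and a count model such as Poisson regression may suit real cancellation data better.

```r
# Simulated stand-in for the airline data: column names follow the dataset,
# values and the data-generating model are illustrative assumptions.
set.seed(99)
n <- 1000
toy <- data.frame(arr_flights = rpois(n, 300) + 1)
toy$weather_ct       <- rpois(n, toy$arr_flights * 0.01)
toy$nas_ct           <- rpois(n, toy$arr_flights * 0.05)
toy$late_aircraft_ct <- rpois(n, toy$arr_flights * 0.06)
toy$arr_cancelled    <- rpois(n, 0.5 * toy$weather_ct + 0.02 * toy$arr_flights)

# Regress cancellations on weather while holding volume and the other
# disruption counts fixed; the weather_ct coefficient is then the
# association with cancellations net of those covariates.
fit <- lm(arr_cancelled ~ weather_ct + nas_ct + late_aircraft_ct + arr_flights,
          data = toy)

round(coef(summary(fit)), 3)
```

Using weather_ct as a continuous predictor also addresses the first concern above, since it no longer treats all above-median records as equally affected.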

Overall Insights

Taken together, these results suggest post-COVID airline performance has not improved uniformly: aggregate arrival delays per record rose between 2021 and 2023, while weather remains a dominant and measurable correlate of cancellations. Both findings are statistically robust, but carrier-level and seasonally controlled analyses are needed before drawing firm operational conclusions.

2026-03-01
