This analysis explores airline performance from 2021 to 2023 using two hypothesis tests grounded in different statistical philosophies. The dataset contains one record per carrier, airport, and month (referred to below as route-month records), including arrival delays, weather-related metrics, and cancellations across U.S. carriers.
We conduct two A/B tests. Hypothesis 1 follows the Neyman-Pearson framework, which focuses on pre-specified decision rules and error rates. Hypothesis 2 follows Fisher's framework, which interprets evidence through the p-value alone.
library(tidyverse)
Airline_Delay_Post_COVID_2021_2023 <- read.csv("Airline_Delay_Post_COVID_2021_2023.csv")
names(Airline_Delay_Post_COVID_2021_2023)
## [1] "year" "month" "carrier"
## [4] "carrier_name" "airport" "airport_name"
## [7] "arr_flights" "arr_del15" "carrier_ct"
## [10] "weather_ct" "nas_ct" "security_ct"
## [13] "late_aircraft_ct" "arr_cancelled" "arr_diverted"
## [16] "arr_delay" "carrier_delay" "weather_delay"
## [19] "nas_delay" "security_delay" "late_aircraft_delay"
Did airline arrival delays change from 2021 to 2023?
Response variable: arr_delay (total arrival delay in minutes per route-month record)
\[H_0: \mu_{2021} = \mu_{2023}\]
\[H_A: \mu_{2021} \neq \mu_{2023}\]
This is a two-sided test. We are open to delays either improving or worsening after the post-COVID recovery period.
| Parameter | Value | Rationale |
|---|---|---|
| Alpha (Type I error) | 0.05 | Standard threshold. We accept a 5% chance of falsely concluding delays changed when they did not. |
| Power (1 minus Beta) | 0.80 | We want an 80% chance of detecting a true change. Missing a real operational shift could misdirect policy decisions. |
| Minimum effect size | 5 minutes | A 5-minute difference in average arrival delay is operationally meaningful for airline scheduling and passenger satisfaction at the individual flight level. |
A Type I error here means concluding delays changed when they did not, which could cause airlines to falsely credit or blame recovery initiatives. A Type II error means missing a real change, which is costly if airlines are benchmarking post-COVID performance. An 80% power level balances these two risks appropriately.
table(Airline_Delay_Post_COVID_2021_2023$year)
##
## 2021 2022 2023
## 19954 20345 4612
The dataset contains 19,954 observations for 2021 and 4,612 for 2023. We check whether this is sufficient to detect a 5-minute difference.
delay_2021 <- subset(Airline_Delay_Post_COVID_2021_2023, year == 2021)$arr_delay
delay_2023 <- subset(Airline_Delay_Post_COVID_2021_2023, year == 2023)$arr_delay
sd_est <- sd(delay_2021, na.rm = TRUE)
cat("Estimated SD (2021):", round(sd_est, 2), "\n")
## Estimated SD (2021): 10515.31
power_result <- power.t.test(
delta = 5,
sd = sd_est,
sig.level = 0.05,
power = 0.80,
type = "two.sample",
alternative = "two.sided"
)
cat("Required n per group:", ceiling(power_result$n), "\n")
## Required n per group: 69429167
cat("Available n (2021):", length(delay_2021), "| Available n (2023):", length(delay_2023), "\n")
## Available n (2021): 19954 | Available n (2023): 4612
The required sample size (~69 million per group) far exceeds what is available. We do not have enough data to reliably detect a 5-minute difference at this scale.
This is a data structure problem. arr_delay represents
total aggregate delay across all flights for a given carrier, airport,
and month in a single record. It is not a per-flight measurement. The SD
of ~10,515 minutes reflects that aggregation, making a 5-minute
threshold statistically invisible relative to the natural variance in
these totals. The implied Cohen’s d for a 5-minute difference at this SD
is approximately 0.000475, which explains why ~69 million observations
would be needed.
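The ~69 million figure can be checked by hand with the standard normal-approximation formula for a two-sided, two-sample comparison of means. This is only a sketch of the arithmetic: power.t.test uses the noncentral t distribution, so the two results should agree closely but not exactly. Here \(\delta\) is the 5-minute minimum effect, \(\sigma\) is the 2021 SD printed above, and \(z_{1-\alpha/2} \approx 1.960\), \(z_{1-\beta} \approx 0.842\) are the normal quantiles for alpha = 0.05 and power = 0.80.

\[d = \frac{\delta}{\sigma} = \frac{5}{10515.31} \approx 4.755 \times 10^{-4}\]
\[n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2} = \frac{2\,(1.960 + 0.842)^2}{(4.755 \times 10^{-4})^2} \approx 6.94 \times 10^{7} \text{ per group}\]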
t_test_result <- t.test(delay_2021, delay_2023)
t_test_result
##
## Welch Two Sample t-test
##
## data: delay_2021 and delay_2023
## t = -7.2747, df = 5758.8, p-value = 3.935e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2101.199 -1209.138
## sample estimates:
## mean of x mean of y
## 3342.028 4997.197
p_val <- t_test_result$p.value
mean_2021 <- mean(delay_2021, na.rm = TRUE)
mean_2023 <- mean(delay_2023, na.rm = TRUE)
diff_means <- mean_2023 - mean_2021
cohens_d <- diff_means / sd_est
cat("Mean arr_delay 2021:", round(mean_2021, 2), "minutes\n")
## Mean arr_delay 2021: 3342.03 minutes
cat("Mean arr_delay 2023:", round(mean_2023, 2), "minutes\n")
## Mean arr_delay 2023: 4997.2 minutes
cat("Difference (2023 - 2021):", round(diff_means, 2), "minutes\n")
## Difference (2023 - 2021): 1655.17 minutes
cat("Cohen's d:", round(cohens_d, 4), "\n")
## Cohen's d: 0.1574
cat("p-value:", format(p_val, scientific = TRUE), "\n")
## p-value: 3.934722e-13
Cohen’s d of approximately 0.157 (computed here with the 2021 SD as the standardizer rather than a pooled SD) is a small effect by conventional standards (Cohen, 1988). The difference is statistically detectable, but modest relative to the total variance in the data. The test reached significance not because we were adequately powered for a 5-minute difference, but because the observed difference of 1,655 minutes was large enough to be detected with the available sample.
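One way to make this point concrete is to turn the power calculation around and ask what difference the available sample could detect. The sketch below is not part of the original analysis. It assumes the 2021 SD printed earlier applies to both groups and, as a conservative simplification, uses the smaller 2023 group size for both groups. The names `sd_2021`, `n_small`, and `mde` are introduced here for illustration only.

```r
# Minimum detectable effect (MDE) at alpha = 0.05 and power = 0.80.
# Assumptions: sd_2021 is copied from the printed output above, and the
# smaller group size (2023) stands in for both groups.
sd_2021 <- 10515.31
n_small <- 4612

# Leaving delta unspecified makes power.t.test() solve for it.
mde <- power.t.test(
  n = n_small,
  sd = sd_2021,
  sig.level = 0.05,
  power = 0.80,
  type = "two.sample",
  alternative = "two.sided"
)$delta

cat("Approximate MDE (minutes):", round(mde), "\n")
cat("Observed difference exceeds MDE:", 1655.17 > mde, "\n")
```

If the MDE is well below the observed 1,655 minutes, that would be consistent with the test detecting the change despite being far from adequate for a 5-minute threshold.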
Airline_Delay_Post_COVID_2021_2023 %>%
filter(year %in% c(2021, 2023)) %>%
mutate(year = factor(year)) %>%
ggplot(aes(x = reorder(year, arr_delay, FUN = function(x) median(x, na.rm = TRUE)), y = arr_delay, fill = year)) +
geom_boxplot(outlier.shape = NA, alpha = 0.8) +
coord_cartesian(ylim = c(0, quantile(
filter(Airline_Delay_Post_COVID_2021_2023, year %in% c(2021, 2023))$arr_delay, 0.95, na.rm = TRUE
))) +
scale_fill_manual(values = c("2021" = "skyblue", "2023" = "steelblue")) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Arrival Delay Comparison: 2021 vs 2023",
subtitle = "Years sorted by median delay; outliers beyond 95th percentile excluded for readability",
x = "Year (Sorted by Median Delay)",
y = "Arrival Delay (Minutes)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
Insight: The boxplot shows that 2023 has a substantially wider interquartile range and longer upper whisker than 2021, indicating greater variability in aggregate delays rather than a simple uniform shift upward. The medians are relatively close, but the spread in 2023 is much larger, suggesting that while some routes experienced significantly higher delays, others remained comparable to 2021 levels. This uneven pattern is consistent with the t-test finding of a statistically significant mean difference driven by a subset of high-delay route-month records.
Since p < 0.001, we reject H0. Mean total arrival delay increased by approximately 1,655 minutes per route-month record between 2021 and 2023. However, the test was underpowered for our 5-minute threshold given the aggregated nature of arr_delay and its SD of ~10,515 minutes. The result achieves significance only because the observed difference is orders of magnitude larger than that threshold. Cohen’s d of 0.157 indicates the effect is small relative to overall variance.
Further investigation worth pursuing: The 2023 sample covers only 4,612 records compared to roughly 20,000 for 2021. If the 2023 data are concentrated in high-delay months or busy airports due to incomplete year coverage, the apparent worsening could reflect sampling bias rather than a true year-over-year trend. A follow-up analysis controlling for month and airport would isolate whether the increase persists after accounting for that imbalance.
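A first step toward that follow-up could be to repeat the comparison using only the months that appear in both years. The sketch below is an illustration under assumptions, not the analysis itself. `compare_common_months` is a hypothetical helper, the data frame is assumed to have the columns year, month, and arr_delay, and the toy data are simulated purely so the block runs on its own.

```r
# Sketch: re-run the 2021 vs 2023 comparison on months present in both years.
# `df` is assumed to contain the columns year, month, and arr_delay.
compare_common_months <- function(df) {
  common <- intersect(df$month[df$year == 2021], df$month[df$year == 2023])
  sub <- df[df$year %in% c(2021, 2023) & df$month %in% common, ]
  list(
    months_used = sort(common),
    test = t.test(arr_delay ~ year, data = sub)  # Welch test by default
  )
}

# Hypothetical toy data for illustration only (not the airline data):
# 2021 covers all 12 months, 2023 covers only January through April.
set.seed(1)
toy <- data.frame(
  year = rep(c(2021, 2023), times = c(120, 40)),
  month = c(rep(1:12, each = 10), rep(1:4, each = 10)),
  arr_delay = rexp(160, rate = 1 / 3000)
)
res <- compare_common_months(toy)
res$months_used
res$test
```

On the real data, the call would be `compare_common_months(Airline_Delay_Post_COVID_2021_2023)`. Matching on airport as well would need a further restriction that this sketch does not attempt.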
Do higher weather-related disruptions increase cancellations?
Groups: High Weather (weather_ct above the median) and Low Weather (weather_ct at or below the median)
Response variable: arr_cancelled (number of cancelled arrivals per route-month record)
In Fisher’s framework, we do not pre-specify power or Type II error. We compute the p-value and assess how surprising the data would be if H0 were true.
Airline_Delay_Post_COVID_2021_2023$weather_group <-
ifelse(
Airline_Delay_Post_COVID_2021_2023$weather_ct >
median(Airline_Delay_Post_COVID_2021_2023$weather_ct, na.rm = TRUE),
"High Weather",
"Low Weather"
)
table(Airline_Delay_Post_COVID_2021_2023$weather_group)
##
## High Weather Low Weather
## 22431 22446
The median split produces two nearly equal groups: 22,431 High Weather records and 22,446 Low Weather records. This balance is ideal for a two-sample comparison.
\[H_0: \mu_{\text{High}} = \mu_{\text{Low}}\] \[H_A: \mu_{\text{High}} \neq \mu_{\text{Low}}\]
weather_test <- t.test(arr_cancelled ~ weather_group, data = Airline_Delay_Post_COVID_2021_2023)
weather_test
##
## Welch Two Sample t-test
##
## data: arr_cancelled by weather_group
## t = 37.377, df = 23148, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group High Weather and group Low Weather is not equal to 0
## 95 percent confidence interval:
## 10.23567 11.36862
## sample estimates:
## mean in group High Weather mean in group Low Weather
## 12.334403 1.532255
A Welch two-sample t-test is appropriate here for three reasons specific to this data:
First, with approximately 22,400 observations per group, the Central Limit Theorem implies the sampling distribution of each group mean is approximately normal even though the raw cancellation counts are skewed.
Second, the Welch variant does not assume equal variances. Given the High Weather group has a mean of 12.3 cancellations versus 1.5 for Low Weather, the variances are almost certainly unequal, making the Welch adjustment a necessary correction rather than a formality.
Third, the grouping variable cleanly separates 44,877 records into two non-overlapping groups. The 8x difference in group means is large enough that even mild violations of test assumptions would not change the conclusion.
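For reference, the Welch statistic and its approximate degrees of freedom (the Welch-Satterthwaite formula) can be written out as follows. The subscripts H and L stand for the High Weather and Low Weather groups, and \(s^2\) and \(n\) are each group's sample variance and size. This is the textbook form, shown to make the second reason concrete.

\[t = \frac{\bar{x}_{H} - \bar{x}_{L}}{\sqrt{\dfrac{s_{H}^2}{n_{H}} + \dfrac{s_{L}^2}{n_{L}}}}\]
\[\nu \approx \frac{\left(\dfrac{s_{H}^2}{n_{H}} + \dfrac{s_{L}^2}{n_{L}}\right)^2}{\dfrac{(s_{H}^2/n_{H})^2}{n_{H} - 1} + \dfrac{(s_{L}^2/n_{L})^2}{n_{L} - 1}}\]

When one group's variance term dominates, \(\nu\) approaches that group's \(n - 1\). The reported df of 23,148 is close to the size of a single group rather than the combined total, which is what this formula would give if the High Weather variance is much larger than the Low Weather variance.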
Airline_Delay_Post_COVID_2021_2023 %>%
filter(!is.na(weather_group)) %>%
ggplot(aes(x = weather_group, y = arr_cancelled, fill = weather_group)) +
geom_boxplot(outlier.shape = NA, alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.15, size = 0.8, color = "gray30") +
stat_summary(fun = mean, geom = "point", shape = 18, size = 4, color = "red",
aes(group = weather_group)) +
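scale_y_continuous(labels = scales::comma) +
# Zoom with coord_cartesian() so the boxplot and mean use all records, not only those inside the limits
coord_cartesian(ylim = c(0, quantile(Airline_Delay_Post_COVID_2021_2023$arr_cancelled,
0.97, na.rm = TRUE))) +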
scale_fill_manual(values = c("High Weather" = "tomato", "Low Weather" = "lightgreen")) +
labs(
title = "Cancellations by Weather Impact Group",
subtitle = "Red diamond shows group mean; jitter reveals density of overlapping observations",
x = "Weather Impact Group",
y = "Number of Cancellations"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
Insight: The red diamond (group mean) sits at approximately 12.3 for High Weather versus 1.5 for Low Weather, an 8x difference visible through the jitter. The High Weather group has a noticeably heavier upper tail, meaning more route-month combinations with large cancellation counts. Practically, a difference of ~10.8 cancellations per route-month record is operationally significant: for a route operating 200 flights per month (an illustrative figure, not one taken from the data), this would amount to roughly 5 percentage points of cancellation rate (10.8 / 200 ≈ 5.4%), which meaningfully affects both passenger experience and airline revenue.
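The 200-flight illustration can be replaced with the data's own flight volumes by comparing cancellation rates instead of raw counts. The sketch below is a hedged example, not part of the original analysis. It assumes arr_flights is a reasonable denominator for arr_cancelled, `cancel_rate_by_group` is a hypothetical helper, and the toy data are made up so the block runs on its own.

```r
# Sketch: compare cancellation *rates* rather than raw counts.
# Assumption: arr_flights is a reasonable denominator for arr_cancelled.
cancel_rate_by_group <- function(df) {
  ok <- !is.na(df$weather_group) & !is.na(df$arr_flights) &
    !is.na(df$arr_cancelled) & df$arr_flights > 0
  sub <- df[ok, ]
  # Sum cancellations and flights within each weather group
  totals <- aggregate(cbind(arr_cancelled, arr_flights) ~ weather_group,
                      data = sub, FUN = sum)
  totals$cancel_rate <- totals$arr_cancelled / totals$arr_flights
  totals
}

# Hypothetical toy data for illustration only (not the airline data).
toy <- data.frame(
  weather_group = c("High Weather", "High Weather", "Low Weather", "Low Weather"),
  arr_flights   = c(1000, 500, 100, 50),
  arr_cancelled = c(30, 15, 3, 0)
)
rates <- cancel_rate_by_group(toy)
rates
```

Because records with many weather delays may also be records with many flights, a rate-based comparison like this helps separate the weather association from sheer traffic volume.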
Airline_Delay_Post_COVID_2021_2023 %>%
filter(!is.na(weather_delay), !is.na(arr_cancelled)) %>%
ggplot(aes(x = weather_delay, y = arr_cancelled)) +
geom_point(alpha = 0.2, size = 0.9, color = "steelblue",
position = position_jitter(width = 0.5, height = 0.3)) +
geom_smooth(method = "lm", se = TRUE, color = "red", linewidth = 1.2) +
coord_cartesian(ylim = c(0, 500)) +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Weather Delay vs. Cancellations",
subtitle = "Zoomed to data-dense region; line of best fit with 95% confidence interval",
x = "Weather Delay (Minutes)",
y = "Number of Cancellations"
) +
theme_minimal(base_size = 13)
Insight: Zooming to the data-dense region (below 500 cancellations) makes both the upward trend and the confidence band clearly visible. As weather delay minutes increase, cancellations rise consistently. The narrow shaded band indicates the fitted line is estimated precisely where most observations are concentrated. Note that this plot uses weather_delay (minutes) rather than weather_ct (the count used to define the groups). It extends the group comparison by showing the relationship is continuous rather than a binary artifact of how the groups were defined.
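The visual trend could also be summarized with a single number. Since both variables are heavily skewed, a rank-based (Spearman) correlation is one reasonable choice. This is a sketch under that assumption, not a method used in the original analysis. `weather_cancel_cor` is a hypothetical helper and the toy data are simulated for illustration.

```r
# Sketch: rank-based association between weather delay minutes and cancellations.
# `df` is assumed to contain the columns weather_delay and arr_cancelled.
weather_cancel_cor <- function(df) {
  # exact = FALSE avoids exact p-value computation, which fails with ties
  cor.test(df$weather_delay, df$arr_cancelled,
           method = "spearman", exact = FALSE)
}

# Hypothetical toy data for illustration only (not the airline data).
set.seed(42)
toy <- data.frame(weather_delay = rexp(200, rate = 1 / 300))
toy$arr_cancelled <- rpois(200, lambda = 1 + toy$weather_delay / 100)

res_cor <- weather_cancel_cor(toy)
res_cor$estimate
res_cor$p.value
```

On the real data, the call would be `weather_cancel_cor(Airline_Delay_Post_COVID_2021_2023)`. cor.test drops incomplete pairs, so no separate NA filtering is needed.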
cat("p-value:", format(weather_test$p.value, scientific = TRUE), "\n")
## p-value: 6.09722e-297
cat("Mean cancellations (High Weather):", round(weather_test$estimate[1], 2), "\n")
## Mean cancellations (High Weather): 12.33
cat("Mean cancellations (Low Weather):", round(weather_test$estimate[2], 2), "\n")
## Mean cancellations (Low Weather): 1.53
cat("Difference in means:", round(weather_test$estimate[1] - weather_test$estimate[2], 2), "\n")
## Difference in means: 10.8
The p-value of approximately 6.1e-297 is effectively zero. This is not merely significant at the 0.05 level but represents overwhelming evidence against H0. We conclude that weather disruptions are a statistically and practically significant driver of cancellations, a relationship that holds both at the group level and continuously across the full range of weather delay values.
Further investigation worth pursuing: The median split treats all above-median records as equally high weather regardless of intensity. A multivariate model isolating weather from correlated factors like NAS delays and late aircraft counts would clarify how much of the cancellation effect is uniquely attributable to weather versus co-occurring disruptions.
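A minimal sketch of such a model is shown below. It is one possible specification under stated assumptions, not the author's method. A quasi-Poisson regression is chosen here because cancellations are overdispersed counts, log(arr_flights) is used as an exposure offset, and weather_ct enters as a continuous predictor instead of a median split. `fit_cancel_model` is a hypothetical helper and the toy data are simulated.

```r
# Sketch: count model for cancellations with an exposure offset.
# `df` is assumed to contain arr_cancelled, arr_flights, weather_ct,
# nas_ct, and late_aircraft_ct.
fit_cancel_model <- function(df) {
  sub <- df[!is.na(df$arr_flights) & df$arr_flights > 0, ]
  glm(
    arr_cancelled ~ weather_ct + nas_ct + late_aircraft_ct +
      offset(log(arr_flights)),
    family = quasipoisson(link = "log"),
    data = sub
  )
}

# Hypothetical toy data for illustration only (not the airline data).
set.seed(7)
n <- 300
toy <- data.frame(
  arr_flights = sample(50:500, n, replace = TRUE),
  weather_ct = rexp(n, rate = 1),
  nas_ct = rexp(n, rate = 1 / 5),
  late_aircraft_ct = rexp(n, rate = 1 / 8)
)
# Simulated truth: only weather_ct affects the cancellation rate
toy$arr_cancelled <- rpois(n, lambda = toy$arr_flights * 0.01 * exp(0.3 * toy$weather_ct))

fit <- fit_cancel_model(toy)
round(coef(fit), 3)
```

The coefficient on weather_ct then describes the association with the cancellation rate while holding the other delay counts fixed, which is closer to the question of what is uniquely attributable to weather.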
Taken together, these results suggest post-COVID airline performance has not improved uniformly: aggregate delays worsened between 2021 and 2023, while weather remains a dominant and measurable driver of cancellations. Both findings are statistically robust but point toward the need for carrier-level and seasonally controlled analyses before drawing firm operational conclusions.