Airline_Delay_Post_COVID_2021_2023 <- read.csv("Airline_Delay_Post_COVID_2021_2023.csv")
df <- Airline_Delay_Post_COVID_2021_2023
# Inspect column names
names(Airline_Delay_Post_COVID_2021_2023)
## [1] "year" "month" "carrier"
## [4] "carrier_name" "airport" "airport_name"
## [7] "arr_flights" "arr_del15" "carrier_ct"
## [10] "weather_ct" "nas_ct" "security_ct"
## [13] "late_aircraft_ct" "arr_cancelled" "arr_diverted"
## [16] "arr_delay" "carrier_delay" "weather_delay"
## [19] "nas_delay" "security_delay" "late_aircraft_delay"
Rather than using the raw count of delayed flights
(arr_del15), we compute each row’s delay rate, meaning the
proportion of arriving flights delayed 15+ minutes. A row is labeled
high_delay = 1 if its delay rate exceeds the dataset
median, and high_delay = 0 otherwise.
Why this variable is worth modeling: Delay rate adjusts for flight volume. A carrier operating 1,000 flights with 400 delayed is meaningfully different from one with 50 flights and 10 delayed, even though the raw counts look different. Modeling whether a carrier-airport-month combination falls in the “high delay” half of the distribution tells us something actionable about which operational conditions systematically produce worse outcomes.
Why not use the delay-minute columns as predictors?
Columns like carrier_delay and weather_delay
are the minute-level components of the delay outcome itself. Using them
to predict arr_del15 causes complete separation, meaning
the model can predict the outcome perfectly by construction, producing
infinite coefficients. Instead, we use cancel_rate,
divert_rate, and month, which are variables
that correlate with high delay but are not mathematical components of
it.
Airline_Delay_Post_COVID_2021_2023 <- Airline_Delay_Post_COVID_2021_2023 %>%
filter(arr_flights > 0) %>%
mutate(
delay_rate = arr_del15 / arr_flights,
cancel_rate = arr_cancelled / arr_flights,
divert_rate = arr_diverted / arr_flights,
high_delay = ifelse(delay_rate > median(delay_rate, na.rm = TRUE), 1, 0),
month = as.factor(month)
) %>%
filter(!is.na(delay_rate), !is.na(high_delay))
# Quick check — should be roughly 50/50
table(Airline_Delay_Post_COVID_2021_2023$high_delay)
##
## 0 1
## 22499 22366
The near-even split (22,499 vs. 22,366) confirms this is a well-balanced binary outcome, which is ideal for logistic regression and avoids class-imbalance issues.
Insight and Significance: The choice of delay rate
over raw delay counts is itself an analytical decision with real
consequences. Had we used raw arr_del15 counts, larger
airlines would appear systematically worse simply because they operate
more flights, masking which carriers are genuinely underperforming
relative to their size. By modeling a proportional outcome, we ensure
the question being answered is operationally meaningful: which
conditions make a carrier-airport combination more likely to fall in the
worse half of the industry? A further question this raises is whether
the median is the right threshold, or whether a different cutoff such as
the 75th percentile would better capture what passengers would consider
an unacceptably delayed operation.
logit_model <- glm(
high_delay ~ cancel_rate + divert_rate + month,
data = Airline_Delay_Post_COVID_2021_2023,
family = binomial,
na.action = na.exclude
)
summary(logit_model)
##
## Call:
## glm(formula = high_delay ~ cancel_rate + divert_rate + month,
## family = binomial, data = Airline_Delay_Post_COVID_2021_2023,
## na.action = na.exclude)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.38452 0.03097 -12.417 < 2e-16 ***
## cancel_rate 5.55694 0.28330 19.615 < 2e-16 ***
## divert_rate 3.48832 1.04673 3.333 0.000860 ***
## month2 -0.03122 0.04167 -0.749 0.453683
## month3 0.14098 0.04154 3.394 0.000689 ***
## month4 -0.18469 0.04663 -3.960 7.48e-05 ***
## month5 0.05989 0.04604 1.301 0.193391
## month6 0.95377 0.04673 20.412 < 2e-16 ***
## month7 1.08396 0.04769 22.727 < 2e-16 ***
## month8 0.78826 0.04625 17.043 < 2e-16 ***
## month9 -0.41187 0.04707 -8.750 < 2e-16 ***
## month10 -0.20411 0.04635 -4.404 1.06e-05 ***
## month11 0.09184 0.04584 2.004 0.045103 *
## month12 1.01037 0.04779 21.143 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 62196 on 44864 degrees of freedom
## Residual deviance: 59072 on 44851 degrees of freedom
## AIC: 59100
##
## Number of Fisher Scoring iterations: 4
This model allows us to quantify how operational disruptions and seasonal patterns translate into the probability of a high-delay outcome for a given carrier-airport-month combination.
odds_ratios <- exp(cbind(
OR = coef(logit_model),
Lower = confint.default(logit_model)[, 1],
Upper = confint.default(logit_model)[, 2]
))
round(odds_ratios, 4)
## OR Lower Upper
## (Intercept) 0.6808 0.6407 0.7234
## cancel_rate 259.0291 148.6637 451.3277
## divert_rate 32.7308 4.2070 254.6468
## month2 0.9693 0.8933 1.0517
## month3 1.1514 1.0614 1.2491
## month4 0.8314 0.7587 0.9109
## month5 1.0617 0.9701 1.1620
## month6 2.5955 2.3683 2.8444
## month7 2.9564 2.6925 3.2460
## month8 2.1996 2.0089 2.4083
## month9 0.6624 0.6040 0.7264
## month10 0.8154 0.7446 0.8929
## month11 1.0962 1.0020 1.1992
## month12 2.7466 2.5010 3.0163
How to read this table: An odds ratio (OR) of 1.0
means a predictor has no effect on the odds of
high_delay = 1. OR greater than 1 means higher values of
that predictor increase the odds of a high-delay observation; OR less
than 1 means they decrease it. Any predictor whose 95% CI excludes 1.0
is statistically significant at the 5% level.
cancel_rate (OR = 259.03, 95% CI 148.66 to 451.33): This is the strongest predictor in the model. The odds ratio indicates a very strong relationship between cancellation rate and high delay probability. Because a full 0–100% increase is unrealistic, a more practical interpretation is that a 1 percentage point increase in cancellation rate increases the odds of high delay by approximately 5.7%. The confidence interval is wide due to the rarity of high cancellation rates but clearly excludes 1.0, confirming strong statistical significance.
divert_rate (OR = 32.73, 95% CI 4.21 to
254.65): Diversions are rare but extreme events. An OR of 32.73
means that a carrier-airport-month with a meaningfully higher diversion
rate has roughly 33 times the odds of also being a high-delay
observation. The CI is very wide (4.21 to 254.65), which reflects how
infrequently diversions occur. There are few data points at high
diversion rates, making the estimate uncertain. However, the CI still
firmly excludes 1.0 (p = 0.00086), confirming that diversion spikes are
a genuine signal of systemic operational stress that co-occurs with
elevated delays.
Month coefficients (vs. January baseline): Each month OR compares that month’s odds of being a high-delay observation to January, which is the reference level. The results reveal a clear and interpretable seasonal pattern:
June (OR = 2.60), July (OR = 2.96), August (OR = 2.20), and December (OR = 2.75) all have ORs well above 2, with CIs that firmly exclude 1.0. A carrier-airport observation in any of these months has roughly 2 to 3 times the odds of being a high-delay observation compared to January, reflecting peak summer travel congestion and winter holiday disruption respectively.
September (OR = 0.66) and April (OR = 0.83) have ORs below 1, meaning these months are actually lower delay-risk than January. September in particular, coming after summer congestion ends and before the holiday season begins, is the most delay-favorable month in the dataset.
February (OR = 0.97) and May (OR = 1.06) have CIs that cross 1.0, meaning their difference from January is not statistically significant at the 5% level. These months are effectively indistinguishable from January in terms of delay risk.
Overall model fit: The null deviance (62,196) drops
to a residual deviance of 59,072 after adding our predictors, a
reduction of 3,124 on 13 degrees of freedom. This confirms that
cancel_rate, divert_rate, and
month collectively provide genuine explanatory power beyond
simply predicting the majority class for every row. That said, remaining
variation in the residual deviance suggests additional predictors such
as airport or carrier identity could further improve model
performance.
cancel_rateConfidence intervals are computed using the Wald method via
confint.default() rather than profile likelihood, as the
Wald approach ensures numerical stability given the large coefficient
estimates in this model.
coef_est <- coef(summary(logit_model))["cancel_rate", "Estimate"]
se_est <- coef(summary(logit_model))["cancel_rate", "Std. Error"]
# 95% Wald CI on the log-odds scale: Estimate +/- 1.96 x SE
lower_log <- coef_est - 1.96 * se_est
upper_log <- coef_est + 1.96 * se_est
# Exponentiate to get the CI on the odds-ratio scale
lower_or <- exp(lower_log)
upper_or <- exp(upper_log)
point_or <- exp(coef_est)
cat("cancel_rate log-odds estimate: ", round(coef_est, 4), "\n")
## cancel_rate log-odds estimate: 5.5569
cat("Standard error: ", round(se_est, 4), "\n\n")
## Standard error: 0.2833
cat("95% CI (log-odds): (", round(lower_log, 4), ",", round(upper_log, 4), ")\n")
## 95% CI (log-odds): ( 5.0017 , 6.1122 )
cat("95% CI (odds ratio): (", round(lower_or, 4), ",", round(upper_or, 4), ")\n")
## 95% CI (odds ratio): ( 148.6622 , 451.3323 )
cat("Point estimate (OR): ", round(point_or, 4), "\n")
## Point estimate (OR): 259.0291
How the CI was built: Using the Wald method, meaning Estimate plus or minus 1.96 times the Standard Error on the log-odds scale, then exponentiated to get the odds-ratio CI.
The log-odds estimate for cancel_rate is 5.5569 with a
standard error of 0.2833. Applying the formula:
What this means in plain language: We are 95%
confident that the true odds ratio for cancel_rate is
between 148.66 and 451.33. In other words, a carrier-airport-month with
a higher cancellation rate has between 149 and 451 times the odds of
being a high-delay observation compared to one with near-zero
cancellations, all else equal.
Because this interval does not include 1.0, we can conclude with very high confidence that the association between cancellation rate and high delay rate is real and not due to chance. The width of the interval (148 to 451) reflects genuine uncertainty at the extremes: very high cancellation rates are rare in the data, so the model has fewer observations to pin down the exact effect size at those levels. Despite this width, even the lower bound of 149 represents a large and operationally meaningful effect.
Insight and Significance: The CI tells us something
beyond just statistical significance. Its width is itself informative.
The interval spanning 148 to 451 is wide because extreme cancellation
events are rare in the data, meaning the model has limited observations
at the high end of cancel_rate to precisely estimate the
effect. This is a signal that collecting more data during
high-disruption periods such as major storms or staffing crises would be
the highest-value investment for refining this model. A further question
worth investigating is whether a bootstrapped CI, which makes fewer
distributional assumptions than the Wald method, would produce a
meaningfully different interval given the rarity of high-cancellation
observations.
Insight 1: Cancellations and delays are symptoms of the same
disruptions. The OR of 259 for cancel_rate, the
largest in the model, tells us that cancellations and delays are not
independent outcomes that airlines can trade off against each other.
When a carrier is cancelling flights at a given airport in a given
month, it is very likely also delaying the flights that do operate. Both
are driven by the same root causes: severe weather, staffing crises, and
ATC or infrastructure failures. This means that reducing delay rates
requires addressing those root causes directly, not simply shifting
outcomes from one metric to the other.
Insight 2: Seasonality creates predictable, recurring delay risk. June, July, August, and December all have ORs between 2.2 and 3.0, meaning these months carry 2 to 3 times the delay odds of January. This reflects the structural reality of peak summer travel and winter holiday congestion combined with weather disruptions. September stands out as the lowest-risk month (OR = 0.66), offering travelers a meaningful advantage in on-time performance. This pattern is stable enough to be actionable: airlines should pre-position recovery resources in peak months, and travelers should factor seasonality into their booking decisions.
Insight 3: The model is well-specified, and key predictors
such as cancel_rate and divert_rate provide
strong and statistically significant explanatory power. While
most month levels are also significant, February and May are not
distinguishable from January at the 5% level, suggesting delay risk in
those months is not meaningfully different from the winter baseline. The
residual deviance drops by over 3,100 from the null model, confirming
genuine explanatory power. The model converged in just 4 Fisher Scoring
iterations, indicating no numerical issues.
Further questions for investigation:
Airport-level effects: Do specific hub airports (ORD, ATL, JFK)
drive a disproportionate share of high-delay observations, independent
of carrier and month? Adding airport as a predictor could
reveal whether delay risk is concentrated geographically.
Interaction between month and cancel_rate: Does the relationship between cancellations and delays strengthen in certain seasons? A winter cancellation cascade may have a larger knock-on effect than a summer one due to fewer re-routing and recovery options.
Carrier heterogeneity: Are some airlines’ delay patterns more predictable than others? Running carrier-stratified models would reveal whether the same predictors work uniformly across all carriers or whether different airlines have structurally different delay drivers.
Pre-departure forecasting: Could we predict high-delay probability using only information available before departure, such as scheduled load factor, historical airport congestion, or weather forecasts? That would be far more actionable for travelers than retrospective analysis.