Week 10 Data Dive

Airline_Delay_Post_COVID_2021_2023 <- read.csv("Airline_Delay_Post_COVID_2021_2023.csv")

df <- Airline_Delay_Post_COVID_2021_2023

# Inspect column names
names(Airline_Delay_Post_COVID_2021_2023)

##  [1] "year"                "month"               "carrier"            
##  [4] "carrier_name"        "airport"             "airport_name"       
##  [7] "arr_flights"         "arr_del15"           "carrier_ct"         
## [10] "weather_ct"          "nas_ct"              "security_ct"        
## [13] "late_aircraft_ct"    "arr_cancelled"       "arr_diverted"       
## [16] "arr_delay"           "carrier_delay"       "weather_delay"      
## [19] "nas_delay"           "security_delay"      "late_aircraft_delay"

Select Outcome and Predictor Variables

Binary Variable: High Delay Rate

Rather than using the raw count of delayed flights (arr_del15), we compute each row’s delay rate, meaning the proportion of arriving flights delayed 15+ minutes. A row is labeled high_delay = 1 if its delay rate exceeds the dataset median, and high_delay = 0 otherwise.

Why this variable is worth modeling: Delay rate adjusts for flight volume. A carrier operating 1,000 flights with 400 delayed is meaningfully different from one with 50 flights and 10 delayed, even though the raw counts look different. Modeling whether a carrier-airport-month combination falls in the “high delay” half of the distribution tells us something actionable about which operational conditions systematically produce worse outcomes.

Why not use the delay-minute columns as predictors? Columns like carrier_delay and weather_delay are the minute-level components of the delay outcome itself. Using them to predict arr_del15 causes complete separation, meaning the model can predict the outcome perfectly by construction, producing infinite coefficients. Instead, we use cancel_rate, divert_rate, and month, which are variables that correlate with high delay but are not mathematical components of it.

Airline_Delay_Post_COVID_2021_2023 <- Airline_Delay_Post_COVID_2021_2023 %>%
  filter(arr_flights > 0) %>%
  mutate(
    delay_rate  = arr_del15 / arr_flights,
    cancel_rate = arr_cancelled / arr_flights,
    divert_rate = arr_diverted / arr_flights,
    high_delay  = ifelse(delay_rate > median(delay_rate, na.rm = TRUE), 1, 0),
    month       = as.factor(month)
  ) %>%
  filter(!is.na(delay_rate), !is.na(high_delay))

# Quick check — should be roughly 50/50
table(Airline_Delay_Post_COVID_2021_2023$high_delay)

## 
##     0     1 
## 22499 22366

The near-even split (22,499 vs. 22,366) confirms this is a well-balanced binary outcome, which is ideal for logistic regression and avoids class-imbalance issues.

Insight and Significance: The choice of delay rate over raw delay counts is itself an analytical decision with real consequences. Had we used raw arr_del15 counts, larger airlines would appear systematically worse simply because they operate more flights, masking which carriers are genuinely underperforming relative to their size. By modeling a proportional outcome, we ensure the question being answered is operationally meaningful: which conditions make a carrier-airport combination more likely to fall in the worse half of the industry? A further question this raises is whether the median is the right threshold, or whether a different cutoff such as the 75th percentile would better capture what passengers would consider an unacceptably delayed operation.

Build Logistic Regression Model

logit_model <- glm(
  high_delay ~ cancel_rate + divert_rate + month,
  data      = Airline_Delay_Post_COVID_2021_2023,
  family    = binomial,
  na.action = na.exclude
)

summary(logit_model)

## 
## Call:
## glm(formula = high_delay ~ cancel_rate + divert_rate + month, 
##     family = binomial, data = Airline_Delay_Post_COVID_2021_2023, 
##     na.action = na.exclude)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.38452    0.03097 -12.417  < 2e-16 ***
## cancel_rate  5.55694    0.28330  19.615  < 2e-16 ***
## divert_rate  3.48832    1.04673   3.333 0.000860 ***
## month2      -0.03122    0.04167  -0.749 0.453683    
## month3       0.14098    0.04154   3.394 0.000689 ***
## month4      -0.18469    0.04663  -3.960 7.48e-05 ***
## month5       0.05989    0.04604   1.301 0.193391    
## month6       0.95377    0.04673  20.412  < 2e-16 ***
## month7       1.08396    0.04769  22.727  < 2e-16 ***
## month8       0.78826    0.04625  17.043  < 2e-16 ***
## month9      -0.41187    0.04707  -8.750  < 2e-16 ***
## month10     -0.20411    0.04635  -4.404 1.06e-05 ***
## month11      0.09184    0.04584   2.004 0.045103 *  
## month12      1.01037    0.04779  21.143  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 62196  on 44864  degrees of freedom
## Residual deviance: 59072  on 44851  degrees of freedom
## AIC: 59100
## 
## Number of Fisher Scoring iterations: 4

This model allows us to quantify how operational disruptions and seasonal patterns translate into the probability of a high-delay outcome for a given carrier-airport-month combination.

Interpret the Coefficients

odds_ratios <- exp(cbind(
  OR    = coef(logit_model),
  Lower = confint.default(logit_model)[, 1],
  Upper = confint.default(logit_model)[, 2]
))

round(odds_ratios, 4)

##                   OR    Lower    Upper
## (Intercept)   0.6808   0.6407   0.7234
## cancel_rate 259.0291 148.6637 451.3277
## divert_rate  32.7308   4.2070 254.6468
## month2        0.9693   0.8933   1.0517
## month3        1.1514   1.0614   1.2491
## month4        0.8314   0.7587   0.9109
## month5        1.0617   0.9701   1.1620
## month6        2.5955   2.3683   2.8444
## month7        2.9564   2.6925   3.2460
## month8        2.1996   2.0089   2.4083
## month9        0.6624   0.6040   0.7264
## month10       0.8154   0.7446   0.8929
## month11       1.0962   1.0020   1.1992
## month12       2.7466   2.5010   3.0163

How to read this table: An odds ratio (OR) of 1.0 means a predictor has no effect on the odds of high_delay = 1. OR greater than 1 means higher values of that predictor increase the odds of a high-delay observation; OR less than 1 means they decrease it. Any predictor whose 95% CI excludes 1.0 is statistically significant at the 5% level.

cancel_rate (OR = 259.03, 95% CI 148.66 to 451.33): This is the strongest predictor in the model. The odds ratio indicates a very strong relationship between cancellation rate and high delay probability. Because a full 0–100% increase is unrealistic, a more practical interpretation is that a 1 percentage point increase in cancellation rate increases the odds of high delay by approximately 5.7%. The confidence interval is wide due to the rarity of high cancellation rates but clearly excludes 1.0, confirming strong statistical significance.

divert_rate (OR = 32.73, 95% CI 4.21 to 254.65): Diversions are rare but extreme events. An OR of 32.73 means that a carrier-airport-month with a meaningfully higher diversion rate has roughly 33 times the odds of also being a high-delay observation. The CI is very wide (4.21 to 254.65), which reflects how infrequently diversions occur. There are few data points at high diversion rates, making the estimate uncertain. However, the CI still firmly excludes 1.0 (p = 0.00086), confirming that diversion spikes are a genuine signal of systemic operational stress that co-occurs with elevated delays.

Month coefficients (vs. January baseline): Each month OR compares that month’s odds of being a high-delay observation to January, which is the reference level. The results reveal a clear and interpretable seasonal pattern:

June (OR = 2.60), July (OR = 2.96), August (OR = 2.20), and December (OR = 2.75) all have ORs well above 2, with CIs that firmly exclude 1.0. A carrier-airport observation in any of these months has roughly 2 to 3 times the odds of being a high-delay observation compared to January, reflecting peak summer travel congestion and winter holiday disruption respectively.
September (OR = 0.66) and April (OR = 0.83) have ORs below 1, meaning these months are actually lower delay-risk than January. September in particular, coming after summer congestion ends and before the holiday season begins, is the most delay-favorable month in the dataset.
February (OR = 0.97) and May (OR = 1.06) have CIs that cross 1.0, meaning their difference from January is not statistically significant at the 5% level. These months are effectively indistinguishable from January in terms of delay risk.

Overall model fit: The null deviance (62,196) drops to a residual deviance of 59,072 after adding our predictors, a reduction of 3,124 on 13 degrees of freedom. This confirms that cancel_rate, divert_rate, and month collectively provide genuine explanatory power beyond simply predicting the majority class for every row. That said, remaining variation in the residual deviance suggests additional predictors such as airport or carrier identity could further improve model performance.

Confidence Interval for `cancel_rate`

Confidence intervals are computed using the Wald method via confint.default() rather than profile likelihood, as the Wald approach ensures numerical stability given the large coefficient estimates in this model.

coef_est <- coef(summary(logit_model))["cancel_rate", "Estimate"]
se_est   <- coef(summary(logit_model))["cancel_rate", "Std. Error"]

# 95% Wald CI on the log-odds scale: Estimate +/- 1.96 x SE
lower_log <- coef_est - 1.96 * se_est
upper_log <- coef_est + 1.96 * se_est

# Exponentiate to get the CI on the odds-ratio scale
lower_or <- exp(lower_log)
upper_or <- exp(upper_log)
point_or <- exp(coef_est)

cat("cancel_rate log-odds estimate: ", round(coef_est, 4), "\n")

## cancel_rate log-odds estimate:  5.5569

cat("Standard error:                ", round(se_est,   4), "\n\n")

## Standard error:                 0.2833

cat("95% CI (log-odds):   (", round(lower_log, 4), ",", round(upper_log, 4), ")\n")

## 95% CI (log-odds):   ( 5.0017 , 6.1122 )

cat("95% CI (odds ratio): (", round(lower_or,  4), ",", round(upper_or,  4), ")\n")

## 95% CI (odds ratio): ( 148.6622 , 451.3323 )

cat("Point estimate (OR): ",  round(point_or,  4), "\n")

## Point estimate (OR):  259.0291

How the CI was built: Using the Wald method, meaning Estimate plus or minus 1.96 times the Standard Error on the log-odds scale, then exponentiated to get the odds-ratio CI.

The log-odds estimate for cancel_rate is 5.5569 with a standard error of 0.2833. Applying the formula:

Lower bound (log-odds): 5.5569 minus (1.96 x 0.2833) = 5.0017
Upper bound (log-odds): 5.5569 plus (1.96 x 0.2833) = 6.1122
Exponentiated: (e to the power of 5.0017, e to the power of 6.1122) = (148.66, 451.33)

What this means in plain language: We are 95% confident that the true odds ratio for cancel_rate is between 148.66 and 451.33. In other words, a carrier-airport-month with a higher cancellation rate has between 149 and 451 times the odds of being a high-delay observation compared to one with near-zero cancellations, all else equal.

Because this interval does not include 1.0, we can conclude with very high confidence that the association between cancellation rate and high delay rate is real and not due to chance. The width of the interval (148 to 451) reflects genuine uncertainty at the extremes: very high cancellation rates are rare in the data, so the model has fewer observations to pin down the exact effect size at those levels. Despite this width, even the lower bound of 149 represents a large and operationally meaningful effect.

Insight and Significance: The CI tells us something beyond just statistical significance. Its width is itself informative. The interval spanning 148 to 451 is wide because extreme cancellation events are rare in the data, meaning the model has limited observations at the high end of cancel_rate to precisely estimate the effect. This is a signal that collecting more data during high-disruption periods such as major storms or staffing crises would be the highest-value investment for refining this model. A further question worth investigating is whether a bootstrapped CI, which makes fewer distributional assumptions than the Wald method, would produce a meaningfully different interval given the rarity of high-cancellation observations.

Insights, Significance, and Further Questions

Insight 1: Cancellations and delays are symptoms of the same disruptions. The OR of 259 for cancel_rate, the largest in the model, tells us that cancellations and delays are not independent outcomes that airlines can trade off against each other. When a carrier is cancelling flights at a given airport in a given month, it is very likely also delaying the flights that do operate. Both are driven by the same root causes: severe weather, staffing crises, and ATC or infrastructure failures. This means that reducing delay rates requires addressing those root causes directly, not simply shifting outcomes from one metric to the other.

Insight 2: Seasonality creates predictable, recurring delay risk. June, July, August, and December all have ORs between 2.2 and 3.0, meaning these months carry 2 to 3 times the delay odds of January. This reflects the structural reality of peak summer travel and winter holiday congestion combined with weather disruptions. September stands out as the lowest-risk month (OR = 0.66), offering travelers a meaningful advantage in on-time performance. This pattern is stable enough to be actionable: airlines should pre-position recovery resources in peak months, and travelers should factor seasonality into their booking decisions.

Insight 3: The model is well-specified, and key predictors such as cancel_rate and divert_rate provide strong and statistically significant explanatory power. While most month levels are also significant, February and May are not distinguishable from January at the 5% level, suggesting delay risk in those months is not meaningfully different from the winter baseline. The residual deviance drops by over 3,100 from the null model, confirming genuine explanatory power. The model converged in just 4 Fisher Scoring iterations, indicating no numerical issues.

Further questions for investigation:

Airport-level effects: Do specific hub airports (ORD, ATL, JFK) drive a disproportionate share of high-delay observations, independent of carrier and month? Adding airport as a predictor could reveal whether delay risk is concentrated geographically.
Interaction between month and cancel_rate: Does the relationship between cancellations and delays strengthen in certain seasons? A winter cancellation cascade may have a larger knock-on effect than a summer one due to fewer re-routing and recovery options.
Carrier heterogeneity: Are some airlines’ delay patterns more predictable than others? Running carrier-stratified models would reveal whether the same predictors work uniformly across all carriers or whether different airlines have structurally different delay drivers.
Pre-departure forecasting: Could we predict high-delay probability using only information available before departure, such as scheduled load factor, historical airport congestion, or weather forecasts? That would be far more actionable for travelers than retrospective analysis.