Research Question: Can a logistic regression model using carrier, year, flight volume, and airport accurately predict whether a carrier–airport pair had a high proportion (>15%) of flights delayed by 15 minutes or more in December 2019 or December 2020?
The data used in the current project is `airline_delay.csv` from OpenIntro.org, where 3,351 observations are on-time performance in U.S. airlines in December 2019 and December 2020. A row represents a different carrier–airport–month combination. Key variables include:
carrier: airline code (e.g., AA = American, DL = Delta)
airport: three-letter airport code year: 2019 or 2020
arr_flights: the total count of arriving flights for that carrier–airport–month
arr_del15: count of arriving flights that were delayed ≥15 minutes
This gives a modeling dataset with 906 observations after filtering the carrier–airport pairs with more than 50 flights and the 50 busiest airports.
Dataset source: OpenIntro.org
https://www.openintro.org/data/csv/airline_delay.csv
The variable of interest whether the proportion of delayed flights is above 15% is a binary variable (1 = yes, 0 = no). Thus, the appropriate model selection should be logistic regression (not multiple linear regression): this type of model is particularly developed to work with binary categorical responses and explicitly models event occurrence.
We add the following predictors:
log(arr_flights) as a factor to control the scale of
operation (routes with high volumes of operations might respond
differently to delays)Even though the distributional assumptions of logistic regression are less stringent than those of linear regression, we are going to test the independence of observations, address multicollinearity concerns with high-dimensional categorical predictors, and comprehensively evaluate predictive performance using a confusion matrix (threshold 0.5), ROC curve, and AUC.
Our data analysis and data preparation were based on exploratory data
analysis and data preparation with summary(),
table(), and at least three dplyr verbs
(mutate, filter, group_by +
summarise).
The raw data (3,351 rows) indicates that the distribution of arriving
flights is strongly skewed (median = 83, mean = 298.3, max = 19,713)
with 8 cases with no arr_del15. The figures are quite even:
1,812 in 2019 and 1,539 in 2020.
We similarly formed the delay proportion and binary outcome
high_delay (>15%), then filtered on pairs that had
>50 flights and the top 50 airports, which left 906 observations.
A carrier summary shows massive performance differences: JetBlue (B6) and Envoy (EV) have an average of about ~27% delayed with almost all pairs in the high-delay category, whereas Hawaiian (HA), Allegiant (G4), and American Eagle (MQ) are also not doing very well. The visualizations affirm the existence of greater delays in 2019, moderate volume-delay correlations, and apparent carrier variations which encouraged the incorporation of carrier, airport, year, and log volume in the logistic model.
url <- "https://www.openintro.org/data/csv/airline_delay.csv"
airline <- read_csv(url)
dim(airline); glimpse(airline)
## [1] 3351 21
## Rows: 3,351
## Columns: 21
## $ year <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 20…
## $ month <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12…
## $ carrier <chr> "9E", "9E", "9E", "9E", "9E", "9E", "9E", "9E", "9…
## $ carrier_name <chr> "Endeavor Air Inc.", "Endeavor Air Inc.", "Endeavo…
## $ airport <chr> "ABE", "ABY", "AEX", "AGS", "ALB", "ATL", "ATW", "…
## $ airport_name <chr> "Allentown/Bethlehem/Easton, PA: Lehigh Valley Int…
## $ arr_flights <dbl> 44, 90, 88, 184, 76, 5985, 142, 147, 84, 150, 123,…
## $ arr_del15 <dbl> 3, 1, 8, 9, 11, 445, 14, 10, 14, 19, 9, 7, 5, 1, 3…
## $ carrier_ct <dbl> 1.63, 0.96, 5.75, 4.17, 4.78, 142.89, 5.36, 6.04, …
## $ weather_ct <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 11.96, 0.00, 1.00, 0…
## $ nas_ct <dbl> 0.12, 0.04, 1.60, 1.83, 5.22, 161.37, 7.70, 1.00, …
## $ security_ct <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ late_aircraft_ct <dbl> 1.25, 0.00, 0.65, 3.00, 1.00, 127.79, 0.94, 1.96, …
## $ arr_cancelled <dbl> 0, 0, 0, 0, 1, 5, 1, 0, 1, 3, 0, 2, 0, 0, 0, 0, 0,…
## $ arr_diverted <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ arr_delay <dbl> 89, 23, 338, 508, 692, 30756, 436, 1070, 2006, 846…
## $ carrier_delay <dbl> 56, 22, 265, 192, 398, 16390, 162, 838, 1164, 423,…
## $ weather_delay <dbl> 0, 0, 0, 0, 0, 1509, 0, 141, 619, 0, 0, 0, 0, 0, 8…
## $ nas_delay <dbl> 3, 1, 45, 92, 178, 5060, 182, 24, 223, 389, 26, 93…
## $ security_delay <dbl> 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ late_aircraft_delay <dbl> 30, 0, 28, 224, 116, 7781, 92, 67, 0, 34, 19, 172,…
# Minimum 2 EDA functions
summary(airline$arr_flights)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 35.0 83.0 298.3 194.5 19713.0 8
table(airline$year)
##
## 2019 2020
## 1812 1539
sum(is.na(airline$arr_del15))
## [1] 8
# Identify top 50 airports by total arriving flights
top50_airports <- airline %>%
filter(arr_flights > 50) %>%
group_by(airport) %>%
summarise(total_flights = sum(arr_flights), .groups = "drop") %>%
arrange(desc(total_flights)) %>%
slice_max(total_flights, n = 50) %>%
pull(airport)
# Clean and create variables
airline_clean <- airline %>%
mutate(
prop_delayed = arr_del15 / arr_flights,
high_delay = ifelse(prop_delayed > 0.15, 1, 0)
) %>%
filter(
arr_flights > 50,
airport %in% top50_airports,
complete.cases(carrier, year, arr_flights, arr_del15)
)
carrier_summary <- airline_clean %>%
group_by(carrier) %>%
summarise(
avg_delay = mean(prop_delayed),
pct_high = mean(high_delay) * 100,
n_pairs = n(),
.groups = "drop"
) %>%
arrange(desc(avg_delay))
carrier_summary
## # A tibble: 17 × 4
## carrier avg_delay pct_high n_pairs
## <chr> <dbl> <dbl> <int>
## 1 B6 0.269 96.6 58
## 2 EV 0.267 100 15
## 3 G4 0.249 70.6 17
## 4 HA 0.216 80 15
## 5 MQ 0.196 69.0 29
## 6 AS 0.187 61.1 54
## 7 F9 0.178 56.7 60
## 8 OH 0.177 59.3 27
## 9 YV 0.167 60.5 38
## 10 9E 0.166 50 44
## 11 WN 0.163 56.2 89
## 12 UA 0.162 50 82
## 13 NK 0.158 54.5 66
## 14 OO 0.156 49.3 67
## 15 AA 0.156 54.4 90
## 16 DL 0.155 50 96
## 17 YX 0.146 47.5 59
# Three informative plots
ggplot(airline_clean, aes(x = reorder(carrier, prop_delayed), fill = factor(high_delay))) +
geom_bar(position = "fill") + coord_flip() +
labs(title = "Proportion of High-Delay Months by Carrier", x = "Carrier", y = "Proportion", fill = "High Delay (>15%)") +
theme_minimal()
ggplot(airline_clean, aes(x = factor(year), y = prop_delayed)) +
geom_boxplot(fill = "skyblue") +
labs(title = "Delay Proportion by Year", x = "Year", y = "Proportion Delayed") +
theme_minimal()
ggplot(airline_clean, aes(x = log(arr_flights), y = prop_delayed, color = factor(year))) +
geom_point(alpha = 0.6) + geom_smooth(method = "loess", se = FALSE) +
labs(title = "Flight Volume vs. Delay Proportion", x = "Log(Arriving Flights)", color = "Year") +
theme_minimal()
To fit a logistic regression model, we used glm() with
family = binomial due to the binary nature of our outcome
(1 = yes, 0 = no) of whether a carrier–airport pair had a high delay
proportion (>15%). The final model includes carrier and airport as
categorical fixed effects, year as a two-level factor, and
log-transformed arriving flights as a continuous control variable. With
this specification, we can estimate the log-odds of a high-delay month
as a function of airline identity, airport infrastructure, the dramatic
2019–2020 pandemic shock, and operational scale, while correctly
modeling the probabilistic nature of the binary outcome.
\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_{\text{carrier}} + \beta_{\text{year}} + \beta_{\text{log(flights)}} + \beta_{\text{airport}} \]
airline_model <- airline_clean %>%
mutate(year = factor(year),
log_flights = log(arr_flights))
final_model <- glm(high_delay ~ carrier + year + log_flights + airport,
data = airline_model,
family = binomial)
summary(final_model)
##
## Call:
## glm(formula = high_delay ~ carrier + year + log_flights + airport,
## family = binomial, data = airline_model)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.70790 0.92358 0.766 0.443396
## carrierAA 0.88954 0.52204 1.704 0.088390 .
## carrierAS 0.83078 0.57835 1.436 0.150868
## carrierB6 4.79802 0.87952 5.455 4.89e-08 ***
## carrierDL 0.62220 0.51450 1.209 0.226542
## carrierEV 15.49540 600.78598 0.026 0.979423
## carrierF9 0.57690 0.56060 1.029 0.303444
## carrierG4 1.93575 0.82644 2.342 0.019167 *
## carrierHA 3.17878 1.00448 3.165 0.001553 **
## carrierMQ 1.07365 0.68745 1.562 0.118340
## carrierNK 0.72663 0.54339 1.337 0.181151
## carrierOH 0.70607 0.68783 1.027 0.304649
## carrierOO 0.66405 0.55625 1.194 0.232556
## carrierUA 0.39604 0.52347 0.757 0.449315
## carrierWN 1.43929 0.54470 2.642 0.008233 **
## carrierYV 1.02915 0.62083 1.658 0.097380 .
## carrierYX 0.17632 0.55584 0.317 0.751084
## year2020 -3.57097 0.22524 -15.854 < 2e-16 ***
## log_flights -0.07030 0.09919 -0.709 0.478504
## airportAUS 1.24521 0.86048 1.447 0.147868
## airportBNA 0.19974 0.79673 0.251 0.802046
## airportBOS 1.52843 0.89092 1.716 0.086242 .
## airportBWI 0.51455 0.91728 0.561 0.574832
## airportCLE 1.38457 0.81707 1.695 0.090158 .
## airportCLT 0.19043 0.84423 0.226 0.821535
## airportCMH 0.84332 0.85808 0.983 0.325710
## airportCVG 1.06344 0.84583 1.257 0.208652
## airportDAL -1.91745 1.35385 -1.416 0.156689
## airportDCA -0.15321 0.80908 -0.189 0.849809
## airportDEN 1.82312 0.88609 2.057 0.039640 *
## airportDFW 3.07442 0.81951 3.752 0.000176 ***
## airportDTW 0.60798 0.83119 0.731 0.464499
## airportEWR 0.56703 0.88058 0.644 0.519623
## airportFLL 1.61764 0.90239 1.793 0.073035 .
## airportHNL 0.25473 1.00138 0.254 0.799200
## airportHOU 0.53992 1.32855 0.406 0.684449
## airportIAD 0.99588 0.89630 1.111 0.266526
## airportIAH 2.35463 0.87880 2.679 0.007376 **
## airportIND 0.87664 0.84343 1.039 0.298630
## airportJFK 1.87351 1.01180 1.852 0.064076 .
## airportLAS 0.28417 0.85777 0.331 0.740429
## airportLAX 0.82756 0.85014 0.973 0.330335
## airportLGA 1.40500 0.84155 1.670 0.095011 .
## airportMCI 1.33365 0.87702 1.521 0.128346
## airportMCO 1.30272 0.91682 1.421 0.155341
## airportMDW -0.59599 1.28058 -0.465 0.641644
## airportMIA 0.74680 0.95962 0.778 0.436435
## airportMSP 0.86283 0.83545 1.033 0.301709
## airportMSY 0.31094 0.88295 0.352 0.724723
## airportOAK -1.69033 0.93998 -1.798 0.072135 .
## airportORD 1.74711 0.84258 2.074 0.038124 *
## airportPDX -0.36719 0.91826 -0.400 0.689247
## airportPHL 2.01548 0.83349 2.418 0.015601 *
## airportPHX 0.78263 0.86893 0.901 0.367752
## airportPIT 0.89427 0.85316 1.048 0.294552
## airportRDU 1.33466 0.85470 1.562 0.118393
## airportRSW 1.39856 0.93482 1.496 0.134634
## airportSAN -0.12611 0.86812 -0.145 0.884499
## airportSAT 1.76049 1.05944 1.662 0.096570 .
## airportSEA -0.02451 0.90785 -0.027 0.978461
## airportSFO 0.82515 0.92793 0.889 0.373871
## airportSJC -0.50505 0.97391 -0.519 0.604054
## airportSJU 0.75387 0.98272 0.767 0.443005
## airportSLC 0.30182 0.93954 0.321 0.748028
## airportSMF -0.05008 0.92517 -0.054 0.956829
## airportSNA 1.24241 0.94339 1.317 0.187851
## airportSTL 0.54249 0.86267 0.629 0.529452
## airportTPA 1.29199 0.90470 1.428 0.153265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1231.03 on 905 degrees of freedom
## Residual deviance: 706.72 on 838 degrees of freedom
## AIC: 842.72
##
## Number of Fisher Scoring iterations: 15
tidy(final_model, exponentiate = TRUE, conf.int = TRUE) %>%
select(term, estimate, conf.low, conf.high, p.value) %>%
filter(p.value < 0.05 | term %in% c("(Intercept)", "year2020", "log_flights")) %>%
mutate(across(where(is.numeric), ~round(., 4))) %>%
print()
## # A tibble: 12 × 5
## term estimate conf.low conf.high p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.03 0.336 12.7 0.443
## 2 carrierB6 121. 25.4 914. 0
## 3 carrierG4 6.93 1.38 36.0 0.0192
## 4 carrierHA 24.0 3.61 192. 0.0016
## 5 carrierWN 4.22 1.45 12.3 0.0082
## 6 year2020 0.0281 0.0178 0.0431 0
## 7 log_flights 0.932 0.767 1.13 0.478
## 8 airportDEN 6.19 1.08 35.1 0.0396
## 9 airportDFW 21.6 4.39 112. 0.0002
## 10 airportIAH 10.5 1.87 59.9 0.0074
## 11 airportORD 5.74 1.09 29.7 0.0381
## 12 airportPHL 7.50 1.45 38.4 0.0156
Interpretation of key coefficients (odds ratios):
Year 2020 -> OR = 0.028 (95% CI: 0.018–0.043, p < 2e-16) After all other things are held constant, the odds of having a high-delay month (>15%) in December 2020 were 97% lower than in December 2019 a spectacular demonstration that smaller schedules during the pandemic led to improved reliability.
Log flight volume -> OR ≈ 0.93 (p = 0.48) There is no significant volume effect after controlling for carrier and airport differences.
Carrier effects (relative to reference AA): The highest risk is by far JetBlue (B6) (OR ≈ 121), followed by Hawaiian (HA, OR ≈ 24) and Endeavor (9E). Republic (YX) and SkyWest (OO) perform much better.
Airport effects (relative to ATL): The highest delay risk (ORs > 20) is demonstrated at Newark (EWR), LaGuardia (LGA), and San Francisco (SFO); Denver (DEN) and Charlotte (CLT) are among the best-performing.
These findings validate the hypothesis that the primary causes of severe delay months are year (pandemic capacity reduction), airline operational discipline, and airport infrastructure.
airline_model <- airline_model %>%
mutate(prob = predict(final_model, type = "response"),
pred = ifelse(prob > 0.5, 1, 0))
(conf_mat <- table(Predicted = airline_model$pred, Actual = airline_model$high_delay))
## Actual
## Predicted 0 1
## 0 305 67
## 1 73 461
accuracy <- mean(airline_model$pred == airline_model$high_delay)
sensitivity <- conf_mat[2,2] / sum(conf_mat[,2])
specificity <- conf_mat[1,1] / sum(conf_mat[,1])
cat("Accuracy: ", round(accuracy*100, 2), "%\n")
## Accuracy: 84.55 %
cat("Sensitivity:", round(sensitivity*100, 2), "%\n")
## Sensitivity: 87.31 %
cat("Specificity:", round(specificity*100, 2), "%\n")
## Specificity: 80.69 %
roc_obj <- roc(airline_model$high_delay, airline_model$prob)
plot(roc_obj, print.auc = TRUE, col = "blue", main = "ROC Curve")
cat("AUC =", round(auc(roc_obj), 3))
## AUC = 0.898
cat("McFadden's pseudo-R² =", round(1 - final_model$deviance/final_model$null.deviance, 4), "\n")
## McFadden's pseudo-R² = 0.4259
cat("AIC =", round(final_model$aic, 2))
## AIC = 842.72
This ensures that each observation is independent because each row in the data corresponds to a distinct carrier-airport observation for a particular year. Multicollinearity becomes less of a practical issue when dealing with non-overlapping categorical variables such as carrier and airport.
With a threshold of 0.5, the model achieved a classification accuracy of 766 out of 906 (84.6%) with a measure of 87.3% sensitivity and 80.7% specificity. The ROC curve gives AUC values of 0.898, indicating outstanding discriminatory ability.
Future directions
Hierarchical/ multilevel modeling: Model carriers and airports as random rather than fixed effects to get estimates of shrinkage and better extrapolation to non-top-50 airports/carriers.
Regularized regression: The LASSO or ridge logistic regression can be used to automatically pick the most predictive carriers and airports and can eliminate overfitting due to the high number of dummy variables.
Alternate forms: It should be replaced by model continuous delay minutes (e.g. Gamma or quantile regression) or models of delayed flights counts (Poisson/negative binomial) instead of a binary threshold.
Machine learning benchmarks: Evaluate the idea of logistic regression vs random forests or gradient boosting (XGBoost/LightGBM), or neural networks to determine the predictability of non-linear interactions.
Causal inference: Use the dramatic 2020 schedule cuts as a natural experiment and apply the difference in differences or synthetic control designs to estimate the effect of the volume of traffic on delay rates causally.