A. Introduction

Research Question: Can a logistic regression model using carrier, year, flight volume, and airport accurately predict whether a carrier–airport pair had a high proportion (>15%) of flights delayed by 15 minutes or more in December 2019 or December 2020?

The data used in the current project is `airline_delay.csv` from OpenIntro.org, where 3,351 observations are on-time performance in U.S. airlines in December 2019 and December 2020. A row represents a different carrier–airport–month combination. Key variables include:

This gives a modeling dataset with 906 observations after filtering the carrier–airport pairs with more than 50 flights and the 50 busiest airports.

Dataset source: OpenIntro.org
https://www.openintro.org/data/csv/airline_delay.csv

Justification of Modeling Approach

The variable of interest whether the proportion of delayed flights is above 15% is a binary variable (1 = yes, 0 = no). Thus, the appropriate model selection should be logistic regression (not multiple linear regression): this type of model is particularly developed to work with binary categorical responses and explicitly models event occurrence.

We add the following predictors:

Even though the distributional assumptions of logistic regression are less stringent than those of linear regression, we are going to test the independence of observations, address multicollinearity concerns with high-dimensional categorical predictors, and comprehensively evaluate predictive performance using a confusion matrix (threshold 0.5), ROC curve, and AUC.

B. Data Analysis

Our data analysis and data preparation were based on exploratory data analysis and data preparation with summary(), table(), and at least three dplyr verbs (mutate, filter, group_by + summarise).

The raw data (3,351 rows) indicates that the distribution of arriving flights is strongly skewed (median = 83, mean = 298.3, max = 19,713) with 8 cases with no arr_del15. The figures are quite even: 1,812 in 2019 and 1,539 in 2020.

We similarly formed the delay proportion and binary outcome high_delay (>15%), then filtered on pairs that had >50 flights and the top 50 airports, which left 906 observations.

A carrier summary shows massive performance differences: JetBlue (B6) and Envoy (EV) have an average of about ~27% delayed with almost all pairs in the high-delay category, whereas Hawaiian (HA), Allegiant (G4), and American Eagle (MQ) are also not doing very well. The visualizations affirm the existence of greater delays in 2019, moderate volume-delay correlations, and apparent carrier variations which encouraged the incorporation of carrier, airport, year, and log volume in the logistic model.

url <- "https://www.openintro.org/data/csv/airline_delay.csv"
airline <- read_csv(url)
dim(airline); glimpse(airline)
## [1] 3351   21
## Rows: 3,351
## Columns: 21
## $ year                <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 20…
## $ month               <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12…
## $ carrier             <chr> "9E", "9E", "9E", "9E", "9E", "9E", "9E", "9E", "9…
## $ carrier_name        <chr> "Endeavor Air Inc.", "Endeavor Air Inc.", "Endeavo…
## $ airport             <chr> "ABE", "ABY", "AEX", "AGS", "ALB", "ATL", "ATW", "…
## $ airport_name        <chr> "Allentown/Bethlehem/Easton, PA: Lehigh Valley Int…
## $ arr_flights         <dbl> 44, 90, 88, 184, 76, 5985, 142, 147, 84, 150, 123,…
## $ arr_del15           <dbl> 3, 1, 8, 9, 11, 445, 14, 10, 14, 19, 9, 7, 5, 1, 3…
## $ carrier_ct          <dbl> 1.63, 0.96, 5.75, 4.17, 4.78, 142.89, 5.36, 6.04, …
## $ weather_ct          <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 11.96, 0.00, 1.00, 0…
## $ nas_ct              <dbl> 0.12, 0.04, 1.60, 1.83, 5.22, 161.37, 7.70, 1.00, …
## $ security_ct         <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ late_aircraft_ct    <dbl> 1.25, 0.00, 0.65, 3.00, 1.00, 127.79, 0.94, 1.96, …
## $ arr_cancelled       <dbl> 0, 0, 0, 0, 1, 5, 1, 0, 1, 3, 0, 2, 0, 0, 0, 0, 0,…
## $ arr_diverted        <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ arr_delay           <dbl> 89, 23, 338, 508, 692, 30756, 436, 1070, 2006, 846…
## $ carrier_delay       <dbl> 56, 22, 265, 192, 398, 16390, 162, 838, 1164, 423,…
## $ weather_delay       <dbl> 0, 0, 0, 0, 0, 1509, 0, 141, 619, 0, 0, 0, 0, 0, 8…
## $ nas_delay           <dbl> 3, 1, 45, 92, 178, 5060, 182, 24, 223, 389, 26, 93…
## $ security_delay      <dbl> 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ late_aircraft_delay <dbl> 30, 0, 28, 224, 116, 7781, 92, 67, 0, 34, 19, 172,…
# Minimum 2 EDA functions
summary(airline$arr_flights)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0    35.0    83.0   298.3   194.5 19713.0       8
table(airline$year)
## 
## 2019 2020 
## 1812 1539
sum(is.na(airline$arr_del15))
## [1] 8
# Identify top 50 airports by total arriving flights
top50_airports <- airline %>%
  filter(arr_flights > 50) %>%
  group_by(airport) %>%
  summarise(total_flights = sum(arr_flights), .groups = "drop") %>%
  arrange(desc(total_flights)) %>%
  slice_max(total_flights, n = 50) %>%
  pull(airport)

# Clean and create variables 
airline_clean <- airline %>%
  mutate(
    prop_delayed = arr_del15 / arr_flights,
    high_delay   = ifelse(prop_delayed > 0.15, 1, 0)
  ) %>%
  filter(
    arr_flights > 50,
    airport %in% top50_airports,
    complete.cases(carrier, year, arr_flights, arr_del15)
  )

carrier_summary <- airline_clean %>%
  group_by(carrier) %>%
  summarise(
    avg_delay = mean(prop_delayed),
    pct_high  = mean(high_delay) * 100,
    n_pairs   = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_delay))

carrier_summary
## # A tibble: 17 × 4
##    carrier avg_delay pct_high n_pairs
##    <chr>       <dbl>    <dbl>   <int>
##  1 B6          0.269     96.6      58
##  2 EV          0.267    100        15
##  3 G4          0.249     70.6      17
##  4 HA          0.216     80        15
##  5 MQ          0.196     69.0      29
##  6 AS          0.187     61.1      54
##  7 F9          0.178     56.7      60
##  8 OH          0.177     59.3      27
##  9 YV          0.167     60.5      38
## 10 9E          0.166     50        44
## 11 WN          0.163     56.2      89
## 12 UA          0.162     50        82
## 13 NK          0.158     54.5      66
## 14 OO          0.156     49.3      67
## 15 AA          0.156     54.4      90
## 16 DL          0.155     50        96
## 17 YX          0.146     47.5      59
# Three informative plots
ggplot(airline_clean, aes(x = reorder(carrier, prop_delayed), fill = factor(high_delay))) +
  geom_bar(position = "fill") + coord_flip() +
  labs(title = "Proportion of High-Delay Months by Carrier", x = "Carrier", y = "Proportion", fill = "High Delay (>15%)") +
  theme_minimal()

ggplot(airline_clean, aes(x = factor(year), y = prop_delayed)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Delay Proportion by Year", x = "Year", y = "Proportion Delayed") +
  theme_minimal()

ggplot(airline_clean, aes(x = log(arr_flights), y = prop_delayed, color = factor(year))) +
  geom_point(alpha = 0.6) + geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Flight Volume vs. Delay Proportion", x = "Log(Arriving Flights)", color = "Year") +
  theme_minimal()

C. Regression Analysis

To fit a logistic regression model, we used glm() with family = binomial due to the binary nature of our outcome (1 = yes, 0 = no) of whether a carrier–airport pair had a high delay proportion (>15%). The final model includes carrier and airport as categorical fixed effects, year as a two-level factor, and log-transformed arriving flights as a continuous control variable. With this specification, we can estimate the log-odds of a high-delay month as a function of airline identity, airport infrastructure, the dramatic 2019–2020 pandemic shock, and operational scale, while correctly modeling the probabilistic nature of the binary outcome.

\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_{\text{carrier}} + \beta_{\text{year}} + \beta_{\text{log(flights)}} + \beta_{\text{airport}} \]

airline_model <- airline_clean %>%
  mutate(year = factor(year),
         log_flights = log(arr_flights))

final_model <- glm(high_delay ~ carrier + year + log_flights + airport,
                   data = airline_model,
                   family = binomial)

summary(final_model)
## 
## Call:
## glm(formula = high_delay ~ carrier + year + log_flights + airport, 
##     family = binomial, data = airline_model)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.70790    0.92358   0.766 0.443396    
## carrierAA     0.88954    0.52204   1.704 0.088390 .  
## carrierAS     0.83078    0.57835   1.436 0.150868    
## carrierB6     4.79802    0.87952   5.455 4.89e-08 ***
## carrierDL     0.62220    0.51450   1.209 0.226542    
## carrierEV    15.49540  600.78598   0.026 0.979423    
## carrierF9     0.57690    0.56060   1.029 0.303444    
## carrierG4     1.93575    0.82644   2.342 0.019167 *  
## carrierHA     3.17878    1.00448   3.165 0.001553 ** 
## carrierMQ     1.07365    0.68745   1.562 0.118340    
## carrierNK     0.72663    0.54339   1.337 0.181151    
## carrierOH     0.70607    0.68783   1.027 0.304649    
## carrierOO     0.66405    0.55625   1.194 0.232556    
## carrierUA     0.39604    0.52347   0.757 0.449315    
## carrierWN     1.43929    0.54470   2.642 0.008233 ** 
## carrierYV     1.02915    0.62083   1.658 0.097380 .  
## carrierYX     0.17632    0.55584   0.317 0.751084    
## year2020     -3.57097    0.22524 -15.854  < 2e-16 ***
## log_flights  -0.07030    0.09919  -0.709 0.478504    
## airportAUS    1.24521    0.86048   1.447 0.147868    
## airportBNA    0.19974    0.79673   0.251 0.802046    
## airportBOS    1.52843    0.89092   1.716 0.086242 .  
## airportBWI    0.51455    0.91728   0.561 0.574832    
## airportCLE    1.38457    0.81707   1.695 0.090158 .  
## airportCLT    0.19043    0.84423   0.226 0.821535    
## airportCMH    0.84332    0.85808   0.983 0.325710    
## airportCVG    1.06344    0.84583   1.257 0.208652    
## airportDAL   -1.91745    1.35385  -1.416 0.156689    
## airportDCA   -0.15321    0.80908  -0.189 0.849809    
## airportDEN    1.82312    0.88609   2.057 0.039640 *  
## airportDFW    3.07442    0.81951   3.752 0.000176 ***
## airportDTW    0.60798    0.83119   0.731 0.464499    
## airportEWR    0.56703    0.88058   0.644 0.519623    
## airportFLL    1.61764    0.90239   1.793 0.073035 .  
## airportHNL    0.25473    1.00138   0.254 0.799200    
## airportHOU    0.53992    1.32855   0.406 0.684449    
## airportIAD    0.99588    0.89630   1.111 0.266526    
## airportIAH    2.35463    0.87880   2.679 0.007376 ** 
## airportIND    0.87664    0.84343   1.039 0.298630    
## airportJFK    1.87351    1.01180   1.852 0.064076 .  
## airportLAS    0.28417    0.85777   0.331 0.740429    
## airportLAX    0.82756    0.85014   0.973 0.330335    
## airportLGA    1.40500    0.84155   1.670 0.095011 .  
## airportMCI    1.33365    0.87702   1.521 0.128346    
## airportMCO    1.30272    0.91682   1.421 0.155341    
## airportMDW   -0.59599    1.28058  -0.465 0.641644    
## airportMIA    0.74680    0.95962   0.778 0.436435    
## airportMSP    0.86283    0.83545   1.033 0.301709    
## airportMSY    0.31094    0.88295   0.352 0.724723    
## airportOAK   -1.69033    0.93998  -1.798 0.072135 .  
## airportORD    1.74711    0.84258   2.074 0.038124 *  
## airportPDX   -0.36719    0.91826  -0.400 0.689247    
## airportPHL    2.01548    0.83349   2.418 0.015601 *  
## airportPHX    0.78263    0.86893   0.901 0.367752    
## airportPIT    0.89427    0.85316   1.048 0.294552    
## airportRDU    1.33466    0.85470   1.562 0.118393    
## airportRSW    1.39856    0.93482   1.496 0.134634    
## airportSAN   -0.12611    0.86812  -0.145 0.884499    
## airportSAT    1.76049    1.05944   1.662 0.096570 .  
## airportSEA   -0.02451    0.90785  -0.027 0.978461    
## airportSFO    0.82515    0.92793   0.889 0.373871    
## airportSJC   -0.50505    0.97391  -0.519 0.604054    
## airportSJU    0.75387    0.98272   0.767 0.443005    
## airportSLC    0.30182    0.93954   0.321 0.748028    
## airportSMF   -0.05008    0.92517  -0.054 0.956829    
## airportSNA    1.24241    0.94339   1.317 0.187851    
## airportSTL    0.54249    0.86267   0.629 0.529452    
## airportTPA    1.29199    0.90470   1.428 0.153265    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1231.03  on 905  degrees of freedom
## Residual deviance:  706.72  on 838  degrees of freedom
## AIC: 842.72
## 
## Number of Fisher Scoring iterations: 15
tidy(final_model, exponentiate = TRUE, conf.int = TRUE) %>%
  select(term, estimate, conf.low, conf.high, p.value) %>%
  filter(p.value < 0.05 | term %in% c("(Intercept)", "year2020", "log_flights")) %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  print()
## # A tibble: 12 × 5
##    term        estimate conf.low conf.high p.value
##    <chr>          <dbl>    <dbl>     <dbl>   <dbl>
##  1 (Intercept)   2.03     0.336    12.7     0.443 
##  2 carrierB6   121.      25.4     914.      0     
##  3 carrierG4     6.93     1.38     36.0     0.0192
##  4 carrierHA    24.0      3.61    192.      0.0016
##  5 carrierWN     4.22     1.45     12.3     0.0082
##  6 year2020      0.0281   0.0178    0.0431  0     
##  7 log_flights   0.932    0.767     1.13    0.478 
##  8 airportDEN    6.19     1.08     35.1     0.0396
##  9 airportDFW   21.6      4.39    112.      0.0002
## 10 airportIAH   10.5      1.87     59.9     0.0074
## 11 airportORD    5.74     1.09     29.7     0.0381
## 12 airportPHL    7.50     1.45     38.4     0.0156

Interpretation of key coefficients (odds ratios):

These findings validate the hypothesis that the primary causes of severe delay months are year (pandemic capacity reduction), airline operational discipline, and airport infrastructure.

D. Model Assumptions and Diagnostics

airline_model <- airline_model %>%
  mutate(prob = predict(final_model, type = "response"),
         pred = ifelse(prob > 0.5, 1, 0))
(conf_mat <- table(Predicted = airline_model$pred, Actual = airline_model$high_delay))
##          Actual
## Predicted   0   1
##         0 305  67
##         1  73 461
accuracy    <- mean(airline_model$pred == airline_model$high_delay)
sensitivity <- conf_mat[2,2] / sum(conf_mat[,2])
specificity <- conf_mat[1,1] / sum(conf_mat[,1])

cat("Accuracy:   ", round(accuracy*100, 2), "%\n")
## Accuracy:    84.55 %
cat("Sensitivity:", round(sensitivity*100, 2), "%\n")
## Sensitivity: 87.31 %
cat("Specificity:", round(specificity*100, 2), "%\n")
## Specificity: 80.69 %
roc_obj <- roc(airline_model$high_delay, airline_model$prob)
plot(roc_obj, print.auc = TRUE, col = "blue", main = "ROC Curve")

cat("AUC =", round(auc(roc_obj), 3))
## AUC = 0.898
cat("McFadden's pseudo-R² =", round(1 - final_model$deviance/final_model$null.deviance, 4), "\n")
## McFadden's pseudo-R² = 0.4259
cat("AIC =", round(final_model$aic, 2))
## AIC = 842.72

This ensures that each observation is independent because each row in the data corresponds to a distinct carrier-airport observation for a particular year. Multicollinearity becomes less of a practical issue when dealing with non-overlapping categorical variables such as carrier and airport.

With a threshold of 0.5, the model achieved a classification accuracy of 766 out of 906 (84.6%) with a measure of 87.3% sensitivity and 80.7% specificity. The ROC curve gives AUC values of 0.898, indicating outstanding discriminatory ability.

Future directions

  1. Hierarchical/ multilevel modeling: Model carriers and airports as random rather than fixed effects to get estimates of shrinkage and better extrapolation to non-top-50 airports/carriers.

  2. Regularized regression: The LASSO or ridge logistic regression can be used to automatically pick the most predictive carriers and airports and can eliminate overfitting due to the high number of dummy variables.

  3. Alternate forms: It should be replaced by model continuous delay minutes (e.g. Gamma or quantile regression) or models of delayed flights counts (Poisson/negative binomial) instead of a binary threshold.

  4. Machine learning benchmarks: Evaluate the idea of logistic regression vs random forests or gradient boosting (XGBoost/LightGBM), or neural networks to determine the predictability of non-linear interactions.

  5. Causal inference: Use the dramatic 2020 schedule cuts as a natural experiment and apply the difference in differences or synthetic control designs to estimate the effect of the volume of traffic on delay rates causally.

F. References