PROJECT 3

A. Introduction

Research Question: Can a logistic regression model using carrier, year, flight volume, and airport accurately predict whether a carrier–airport pair had a high proportion (>15%) of flights delayed by 15 minutes or more in December 2019 or December 2020?

The data used in the current project is `airline_delay.csv` from OpenIntro.org, where 3,351 observations are on-time performance in U.S. airlines in December 2019 and December 2020. A row represents a different carrier–airport–month combination. Key variables include:

carrier: airline code (e.g., AA = American, DL = Delta)
airport: three-letter airport code year: 2019 or 2020
arr_flights: the total count of arriving flights for that carrier–airport–month
arr_del15: count of arriving flights that were delayed ≥15 minutes

This gives a modeling dataset with 906 observations after filtering the carrier–airport pairs with more than 50 flights and the 50 busiest airports.

Dataset source: OpenIntro.org
https://www.openintro.org/data/csv/airline_delay.csv

Justification of Modeling Approach

The variable of interest whether the proportion of delayed flights is above 15% is a binary variable (1 = yes, 0 = no). Thus, the appropriate model selection should be logistic regression (not multiple linear regression): this type of model is particularly developed to work with binary categorical responses and explicitly models event occurrence.

We add the following predictors:

carrier and airport as categorical variables (systematic differences in airline operations and airport infrastructure are well-known causes of delays)
year as a factor to capture the dramatic operational changes during the COVID-19 pandemic
log(arr_flights) as a factor to control the scale of operation (routes with high volumes of operations might respond differently to delays)

Even though the distributional assumptions of logistic regression are less stringent than those of linear regression, we are going to test the independence of observations, address multicollinearity concerns with high-dimensional categorical predictors, and comprehensively evaluate predictive performance using a confusion matrix (threshold 0.5), ROC curve, and AUC.

B. Data Analysis

Our data analysis and data preparation were based on exploratory data analysis and data preparation with summary(), table(), and at least three dplyr verbs (mutate, filter, group_by + summarise).

The raw data (3,351 rows) indicates that the distribution of arriving flights is strongly skewed (median = 83, mean = 298.3, max = 19,713) with 8 cases with no arr_del15. The figures are quite even: 1,812 in 2019 and 1,539 in 2020.

We similarly formed the delay proportion and binary outcome high_delay (>15%), then filtered on pairs that had >50 flights and the top 50 airports, which left 906 observations.

A carrier summary shows massive performance differences: JetBlue (B6) and Envoy (EV) have an average of about ~27% delayed with almost all pairs in the high-delay category, whereas Hawaiian (HA), Allegiant (G4), and American Eagle (MQ) are also not doing very well. The visualizations affirm the existence of greater delays in 2019, moderate volume-delay correlations, and apparent carrier variations which encouraged the incorporation of carrier, airport, year, and log volume in the logistic model.

url <- "https://www.openintro.org/data/csv/airline_delay.csv"
airline <- read_csv(url)
dim(airline); glimpse(airline)

## [1] 3351   21

## Rows: 3,351
## Columns: 21
## $ year                <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 20…
## $ month               <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12…
## $ carrier             <chr> "9E", "9E", "9E", "9E", "9E", "9E", "9E", "9E", "9…
## $ carrier_name        <chr> "Endeavor Air Inc.", "Endeavor Air Inc.", "Endeavo…
## $ airport             <chr> "ABE", "ABY", "AEX", "AGS", "ALB", "ATL", "ATW", "…
## $ airport_name        <chr> "Allentown/Bethlehem/Easton, PA: Lehigh Valley Int…
## $ arr_flights         <dbl> 44, 90, 88, 184, 76, 5985, 142, 147, 84, 150, 123,…
## $ arr_del15           <dbl> 3, 1, 8, 9, 11, 445, 14, 10, 14, 19, 9, 7, 5, 1, 3…
## $ carrier_ct          <dbl> 1.63, 0.96, 5.75, 4.17, 4.78, 142.89, 5.36, 6.04, …
## $ weather_ct          <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 11.96, 0.00, 1.00, 0…
## $ nas_ct              <dbl> 0.12, 0.04, 1.60, 1.83, 5.22, 161.37, 7.70, 1.00, …
## $ security_ct         <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ late_aircraft_ct    <dbl> 1.25, 0.00, 0.65, 3.00, 1.00, 127.79, 0.94, 1.96, …
## $ arr_cancelled       <dbl> 0, 0, 0, 0, 1, 5, 1, 0, 1, 3, 0, 2, 0, 0, 0, 0, 0,…
## $ arr_diverted        <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ arr_delay           <dbl> 89, 23, 338, 508, 692, 30756, 436, 1070, 2006, 846…
## $ carrier_delay       <dbl> 56, 22, 265, 192, 398, 16390, 162, 838, 1164, 423,…
## $ weather_delay       <dbl> 0, 0, 0, 0, 0, 1509, 0, 141, 619, 0, 0, 0, 0, 0, 8…
## $ nas_delay           <dbl> 3, 1, 45, 92, 178, 5060, 182, 24, 223, 389, 26, 93…
## $ security_delay      <dbl> 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ late_aircraft_delay <dbl> 30, 0, 28, 224, 116, 7781, 92, 67, 0, 34, 19, 172,…

# Minimum 2 EDA functions
summary(airline$arr_flights)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0    35.0    83.0   298.3   194.5 19713.0       8

table(airline$year)

## 
## 2019 2020 
## 1812 1539

sum(is.na(airline$arr_del15))

## [1] 8

# Identify top 50 airports by total arriving flights
top50_airports <- airline %>%
  filter(arr_flights > 50) %>%
  group_by(airport) %>%
  summarise(total_flights = sum(arr_flights), .groups = "drop") %>%
  arrange(desc(total_flights)) %>%
  slice_max(total_flights, n = 50) %>%
  pull(airport)

# Clean and create variables 
airline_clean <- airline %>%
  mutate(
    prop_delayed = arr_del15 / arr_flights,
    high_delay   = ifelse(prop_delayed > 0.15, 1, 0)
  ) %>%
  filter(
    arr_flights > 50,
    airport %in% top50_airports,
    complete.cases(carrier, year, arr_flights, arr_del15)
  )

carrier_summary <- airline_clean %>%
  group_by(carrier) %>%
  summarise(
    avg_delay = mean(prop_delayed),
    pct_high  = mean(high_delay) * 100,
    n_pairs   = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_delay))

carrier_summary

## # A tibble: 17 × 4
##    carrier avg_delay pct_high n_pairs
##    <chr>       <dbl>    <dbl>   <int>
##  1 B6          0.269     96.6      58
##  2 EV          0.267    100        15
##  3 G4          0.249     70.6      17
##  4 HA          0.216     80        15
##  5 MQ          0.196     69.0      29
##  6 AS          0.187     61.1      54
##  7 F9          0.178     56.7      60
##  8 OH          0.177     59.3      27
##  9 YV          0.167     60.5      38
## 10 9E          0.166     50        44
## 11 WN          0.163     56.2      89
## 12 UA          0.162     50        82
## 13 NK          0.158     54.5      66
## 14 OO          0.156     49.3      67
## 15 AA          0.156     54.4      90
## 16 DL          0.155     50        96
## 17 YX          0.146     47.5      59

# Three informative plots
ggplot(airline_clean, aes(x = reorder(carrier, prop_delayed), fill = factor(high_delay))) +
  geom_bar(position = "fill") + coord_flip() +
  labs(title = "Proportion of High-Delay Months by Carrier", x = "Carrier", y = "Proportion", fill = "High Delay (>15%)") +
  theme_minimal()

ggplot(airline_clean, aes(x = factor(year), y = prop_delayed)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Delay Proportion by Year", x = "Year", y = "Proportion Delayed") +
  theme_minimal()

ggplot(airline_clean, aes(x = log(arr_flights), y = prop_delayed, color = factor(year))) +
  geom_point(alpha = 0.6) + geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Flight Volume vs. Delay Proportion", x = "Log(Arriving Flights)", color = "Year") +
  theme_minimal()

C. Regression Analysis

To fit a logistic regression model, we used glm() with family = binomial due to the binary nature of our outcome (1 = yes, 0 = no) of whether a carrier–airport pair had a high delay proportion (>15%). The final model includes carrier and airport as categorical fixed effects, year as a two-level factor, and log-transformed arriving flights as a continuous control variable. With this specification, we can estimate the log-odds of a high-delay month as a function of airline identity, airport infrastructure, the dramatic 2019–2020 pandemic shock, and operational scale, while correctly modeling the probabilistic nature of the binary outcome.

\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_{\text{carrier}} + \beta_{\text{year}} + \beta_{\text{log(flights)}} + \beta_{\text{airport}} \]

airline_model <- airline_clean %>%
  mutate(year = factor(year),
         log_flights = log(arr_flights))

final_model <- glm(high_delay ~ carrier + year + log_flights + airport,
                   data = airline_model,
                   family = binomial)

summary(final_model)

## 
## Call:
## glm(formula = high_delay ~ carrier + year + log_flights + airport, 
##     family = binomial, data = airline_model)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.70790    0.92358   0.766 0.443396    
## carrierAA     0.88954    0.52204   1.704 0.088390 .  
## carrierAS     0.83078    0.57835   1.436 0.150868    
## carrierB6     4.79802    0.87952   5.455 4.89e-08 ***
## carrierDL     0.62220    0.51450   1.209 0.226542    
## carrierEV    15.49540  600.78598   0.026 0.979423    
## carrierF9     0.57690    0.56060   1.029 0.303444    
## carrierG4     1.93575    0.82644   2.342 0.019167 *  
## carrierHA     3.17878    1.00448   3.165 0.001553 ** 
## carrierMQ     1.07365    0.68745   1.562 0.118340    
## carrierNK     0.72663    0.54339   1.337 0.181151    
## carrierOH     0.70607    0.68783   1.027 0.304649    
## carrierOO     0.66405    0.55625   1.194 0.232556    
## carrierUA     0.39604    0.52347   0.757 0.449315    
## carrierWN     1.43929    0.54470   2.642 0.008233 ** 
## carrierYV     1.02915    0.62083   1.658 0.097380 .  
## carrierYX     0.17632    0.55584   0.317 0.751084    
## year2020     -3.57097    0.22524 -15.854  < 2e-16 ***
## log_flights  -0.07030    0.09919  -0.709 0.478504    
## airportAUS    1.24521    0.86048   1.447 0.147868    
## airportBNA    0.19974    0.79673   0.251 0.802046    
## airportBOS    1.52843    0.89092   1.716 0.086242 .  
## airportBWI    0.51455    0.91728   0.561 0.574832    
## airportCLE    1.38457    0.81707   1.695 0.090158 .  
## airportCLT    0.19043    0.84423   0.226 0.821535    
## airportCMH    0.84332    0.85808   0.983 0.325710    
## airportCVG    1.06344    0.84583   1.257 0.208652    
## airportDAL   -1.91745    1.35385  -1.416 0.156689    
## airportDCA   -0.15321    0.80908  -0.189 0.849809    
## airportDEN    1.82312    0.88609   2.057 0.039640 *  
## airportDFW    3.07442    0.81951   3.752 0.000176 ***
## airportDTW    0.60798    0.83119   0.731 0.464499    
## airportEWR    0.56703    0.88058   0.644 0.519623    
## airportFLL    1.61764    0.90239   1.793 0.073035 .  
## airportHNL    0.25473    1.00138   0.254 0.799200    
## airportHOU    0.53992    1.32855   0.406 0.684449    
## airportIAD    0.99588    0.89630   1.111 0.266526    
## airportIAH    2.35463    0.87880   2.679 0.007376 ** 
## airportIND    0.87664    0.84343   1.039 0.298630    
## airportJFK    1.87351    1.01180   1.852 0.064076 .  
## airportLAS    0.28417    0.85777   0.331 0.740429    
## airportLAX    0.82756    0.85014   0.973 0.330335    
## airportLGA    1.40500    0.84155   1.670 0.095011 .  
## airportMCI    1.33365    0.87702   1.521 0.128346    
## airportMCO    1.30272    0.91682   1.421 0.155341    
## airportMDW   -0.59599    1.28058  -0.465 0.641644    
## airportMIA    0.74680    0.95962   0.778 0.436435    
## airportMSP    0.86283    0.83545   1.033 0.301709    
## airportMSY    0.31094    0.88295   0.352 0.724723    
## airportOAK   -1.69033    0.93998  -1.798 0.072135 .  
## airportORD    1.74711    0.84258   2.074 0.038124 *  
## airportPDX   -0.36719    0.91826  -0.400 0.689247    
## airportPHL    2.01548    0.83349   2.418 0.015601 *  
## airportPHX    0.78263    0.86893   0.901 0.367752    
## airportPIT    0.89427    0.85316   1.048 0.294552    
## airportRDU    1.33466    0.85470   1.562 0.118393    
## airportRSW    1.39856    0.93482   1.496 0.134634    
## airportSAN   -0.12611    0.86812  -0.145 0.884499    
## airportSAT    1.76049    1.05944   1.662 0.096570 .  
## airportSEA   -0.02451    0.90785  -0.027 0.978461    
## airportSFO    0.82515    0.92793   0.889 0.373871    
## airportSJC   -0.50505    0.97391  -0.519 0.604054    
## airportSJU    0.75387    0.98272   0.767 0.443005    
## airportSLC    0.30182    0.93954   0.321 0.748028    
## airportSMF   -0.05008    0.92517  -0.054 0.956829    
## airportSNA    1.24241    0.94339   1.317 0.187851    
## airportSTL    0.54249    0.86267   0.629 0.529452    
## airportTPA    1.29199    0.90470   1.428 0.153265    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1231.03  on 905  degrees of freedom
## Residual deviance:  706.72  on 838  degrees of freedom
## AIC: 842.72
## 
## Number of Fisher Scoring iterations: 15

tidy(final_model, exponentiate = TRUE, conf.int = TRUE) %>%
  select(term, estimate, conf.low, conf.high, p.value) %>%
  filter(p.value < 0.05 | term %in% c("(Intercept)", "year2020", "log_flights")) %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  print()

## # A tibble: 12 × 5
##    term        estimate conf.low conf.high p.value
##    <chr>          <dbl>    <dbl>     <dbl>   <dbl>
##  1 (Intercept)   2.03     0.336    12.7     0.443 
##  2 carrierB6   121.      25.4     914.      0     
##  3 carrierG4     6.93     1.38     36.0     0.0192
##  4 carrierHA    24.0      3.61    192.      0.0016
##  5 carrierWN     4.22     1.45     12.3     0.0082
##  6 year2020      0.0281   0.0178    0.0431  0     
##  7 log_flights   0.932    0.767     1.13    0.478 
##  8 airportDEN    6.19     1.08     35.1     0.0396
##  9 airportDFW   21.6      4.39    112.      0.0002
## 10 airportIAH   10.5      1.87     59.9     0.0074
## 11 airportORD    5.74     1.09     29.7     0.0381
## 12 airportPHL    7.50     1.45     38.4     0.0156

Interpretation of key coefficients (odds ratios):

Year 2020 -> OR = 0.028 (95% CI: 0.018–0.043, p < 2e-16) After all other things are held constant, the odds of having a high-delay month (>15%) in December 2020 were 97% lower than in December 2019 a spectacular demonstration that smaller schedules during the pandemic led to improved reliability.
Log flight volume -> OR ≈ 0.93 (p = 0.48) There is no significant volume effect after controlling for carrier and airport differences.
Carrier effects (relative to reference AA): The highest risk is by far JetBlue (B6) (OR ≈ 121), followed by Hawaiian (HA, OR ≈ 24) and Endeavor (9E). Republic (YX) and SkyWest (OO) perform much better.
Airport effects (relative to ATL): The highest delay risk (ORs > 20) is demonstrated at Newark (EWR), LaGuardia (LGA), and San Francisco (SFO); Denver (DEN) and Charlotte (CLT) are among the best-performing.

These findings validate the hypothesis that the primary causes of severe delay months are year (pandemic capacity reduction), airline operational discipline, and airport infrastructure.

D. Model Assumptions and Diagnostics

airline_model <- airline_model %>%
  mutate(prob = predict(final_model, type = "response"),
         pred = ifelse(prob > 0.5, 1, 0))

(conf_mat <- table(Predicted = airline_model$pred, Actual = airline_model$high_delay))

##          Actual
## Predicted   0   1
##         0 305  67
##         1  73 461

accuracy    <- mean(airline_model$pred == airline_model$high_delay)
sensitivity <- conf_mat[2,2] / sum(conf_mat[,2])
specificity <- conf_mat[1,1] / sum(conf_mat[,1])

cat("Accuracy:   ", round(accuracy*100, 2), "%\n")

## Accuracy:    84.55 %

cat("Sensitivity:", round(sensitivity*100, 2), "%\n")

## Sensitivity: 87.31 %

cat("Specificity:", round(specificity*100, 2), "%\n")

## Specificity: 80.69 %

roc_obj <- roc(airline_model$high_delay, airline_model$prob)
plot(roc_obj, print.auc = TRUE, col = "blue", main = "ROC Curve")

cat("AUC =", round(auc(roc_obj), 3))

## AUC = 0.898

cat("McFadden's pseudo-R² =", round(1 - final_model$deviance/final_model$null.deviance, 4), "\n")

## McFadden's pseudo-R² = 0.4259

cat("AIC =", round(final_model$aic, 2))

## AIC = 842.72

This ensures that each observation is independent because each row in the data corresponds to a distinct carrier-airport observation for a particular year. Multicollinearity becomes less of a practical issue when dealing with non-overlapping categorical variables such as carrier and airport.

With a threshold of 0.5, the model achieved a classification accuracy of 766 out of 906 (84.6%) with a measure of 87.3% sensitivity and 80.7% specificity. The ROC curve gives AUC values of 0.898, indicating outstanding discriminatory ability.

Future directions

Hierarchical/ multilevel modeling: Model carriers and airports as random rather than fixed effects to get estimates of shrinkage and better extrapolation to non-top-50 airports/carriers.
Regularized regression: The LASSO or ridge logistic regression can be used to automatically pick the most predictive carriers and airports and can eliminate overfitting due to the high number of dummy variables.
Alternate forms: It should be replaced by model continuous delay minutes (e.g. Gamma or quantile regression) or models of delayed flights counts (Poisson/negative binomial) instead of a binary threshold.
Machine learning benchmarks: Evaluate the idea of logistic regression vs random forests or gradient boosting (XGBoost/LightGBM), or neural networks to determine the predictability of non-linear interactions.
Causal inference: Use the dramatic 2020 schedule cuts as a natural experiment and apply the difference in differences or synthetic control designs to estimate the effect of the volume of traffic on delay rates causally.

F. References

OpenIntro.org (2024). Airline Delay Dataset. https://www.openintro.org/data/csv/airline_delay.csv