Introduction

This analysis examines what factors drive airline delays using linear regression on the FAA’s post-COVID dataset (2021-2023). The central research question is: How do component delays relate to overall per-flight delays?

We model per-flight delay as a function of four per-flight delay components: weather, carrier operations, air traffic control (National Airspace System, NAS), and late-arriving aircraft. This approach identifies which delay sources have the strongest impact on passenger experience.

Data Preparation

library(tidyverse)
library(ggplot2)

df <- read.csv("Airline_Delay_Post_COVID_2021_2023.csv")

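# Create per-flight variables so predictors match the response variable's scale
# Response: delay_per_flight = total delay minutes / number of arriving flights
# Predictors: *_prop = delay minutes of one type / number of arriving flights
# (despite the suffix, these are minutes per flight, not proportions)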
df <- df %>%
  mutate(
    delay_per_flight = arr_delay / arr_flights,
    weather_prop = weather_delay / arr_flights,
    carrier_prop = carrier_delay / arr_flights,
    nas_prop = nas_delay / arr_flights,
    late_aircraft_prop = late_aircraft_delay / arr_flights
  )

cat("Sample size:", nrow(df), "airline-airport-month observations\n")

## Sample size: 44911 airline-airport-month observations

cat("Mean delay per flight:", round(mean(df$delay_per_flight, na.rm = TRUE), 2), "minutes\n")

## Mean delay per flight: 12.36 minutes

cat("SD delay per flight:", round(sd(df$delay_per_flight, na.rm = TRUE), 2), "minutes\n")

## SD delay per flight: 11.09 minutes

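Why per-flight scaling? The response variable (delay_per_flight) is total arrival delay divided by the number of arriving flights, so it is measured in minutes per flight. The predictors are scaled the same way: despite the _prop suffix, each is a per-flight average in minutes, not a proportion. Using aggregate totals instead would mix scales, because totals grow with flight volume while the response does not.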
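A minimal sketch of the scaling step, using a made-up three-row data frame (the values are invented for illustration and are not from the dataset). It also shows one possible guard for rows with zero flights, which would otherwise produce NaN; whether such rows occur in the real file is an assumption, and the helper name per_flight is hypothetical.

```r
# Toy data: values are invented for illustration only
toy <- data.frame(
  arr_flights   = c(100, 50, 0),
  arr_delay     = c(1200, 300, 0),
  weather_delay = c(200, 30, 0)
)

# Hypothetical helper: divide a total by the flight count,
# returning NA when there are no flights instead of NaN or Inf
per_flight <- function(total, flights) {
  ifelse(flights > 0, total / flights, NA_real_)
}

toy$delay_per_flight <- per_flight(toy$arr_delay, toy$arr_flights)
toy$weather_prop     <- per_flight(toy$weather_delay, toy$arr_flights)

print(toy)
```

Both new columns are in minutes per flight, so a one-unit change in weather_prop is directly comparable to a one-unit change in delay_per_flight.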

Exploratory Data Analysis

Distribution of Delay Per Flight

p99 <- quantile(df$delay_per_flight, 0.99, na.rm=TRUE)
median_val <- median(df$delay_per_flight, na.rm=TRUE)

ggplot(df, aes(x = delay_per_flight)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "black", alpha = 0.7) +
  xlim(0, p99) +
  geom_vline(aes(xintercept = median_val, color = "Median"), linetype = "dashed", linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  scale_color_manual(values = c("Median" = "red")) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Distribution of Delay Per Flight",
       x = "Delay (minutes)", y = "Frequency") +
  theme_minimal()

Insight: The distribution is right-skewed, with a median delay of 10.12 minutes per flight. Half of the observations fall at or below that value, while a long right tail extends to 47.5 minutes at the 99th percentile. This skewness signals a departure from normality, which is typical of real-world delay data.
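Right skew can be quantified rather than judged by eye. The sketch below computes the moment-based sample skewness and compares mean with median on synthetic gamma-distributed values; the data, the seed, and the helper name skewness are illustrative assumptions, not part of the original analysis.

```r
set.seed(42)

# Synthetic right-skewed "delays" (gamma-distributed); not the real data
x <- rgamma(10000, shape = 1.5, scale = 8)

# Sample skewness: third central moment divided by the
# second central moment raised to the power 1.5
skewness <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

cat("Mean:", round(mean(x), 2), " Median:", round(median(x), 2), "\n")
cat("Skewness:", round(skewness(x), 2), "\n")
```

A positive skewness together with a mean above the median is the pattern expected for a right-skewed variable. The same function could be applied to df$delay_per_flight.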

Airlines: Substantial Variation

airline_summary <- df %>%
  group_by(carrier_name) %>%
  summarise(avg_delay = mean(delay_per_flight, na.rm=TRUE), .groups='drop') %>%
  arrange(desc(avg_delay))

ggplot(airline_summary, aes(x = reorder(carrier_name, avg_delay), y = avg_delay)) +
  geom_col(fill = "darkred", alpha = 0.8) +
  geom_text(aes(label = round(avg_delay, 1)), hjust = -0.1, size = 2.5) +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Average Delay Per Flight by Airline", x = "Airline", y = "Delay (minutes)") +
  theme_minimal()

delay_range <- max(airline_summary$avg_delay) - min(airline_summary$avg_delay)
cat("Delay range:", round(delay_range, 1), "minutes\n")

## Delay range: 12.2 minutes

Insight: Airlines differ substantially: the best performer averages ~7.2 minutes of delay per flight and the worst ~19.5 minutes, a spread of about 12.2 minutes. This variation justifies investigating delay drivers.

Weather and Total Delay Relationship

ggplot(df, aes(x = weather_prop, y = delay_per_flight)) +
  geom_point(alpha = 0.25, size = 1.5, position = position_jitter(width = 0.001, height = 0.1)) +
  geom_smooth(method = "lm", color = "red", se = TRUE, alpha = 0.2, linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  xlim(0, quantile(df$weather_prop, 0.99, na.rm=TRUE)) +
  ylim(0, quantile(df$delay_per_flight, 0.99, na.rm=TRUE)) +
  labs(title = "Weather Delay Per Flight vs Total Delay Per Flight",
       x = "Weather Delay Per Flight (min)", y = "Total Delay Per Flight (min)") +
  theme_minimal()

cor_weather <- cor(df$weather_prop, df$delay_per_flight, use = "complete.obs")
cat("Correlation:", round(cor_weather, 3), "\n")

## Correlation: 0.501

Insight: The correlation between weather delay and total delay per flight is moderate and positive (r = 0.501). This corresponds to approximately 25% of the variation explained (r² ≈ 0.251). This indicates that while weather is a meaningful contributor to overall delays, it does not fully explain delay variability on its own, confirming the need for a multi-factor model that includes additional operational and system-level components.
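The step from r to "variation explained" relies on the fact that, for a regression with a single predictor, R² equals the squared Pearson correlation. A short check on synthetic data (the variables and seed are illustrative assumptions):

```r
set.seed(7)
n <- 1000
x <- rexp(n, rate = 1)               # synthetic predictor
y <- 10 + 3 * x + rnorm(n, sd = 5)   # synthetic response

r <- cor(x, y)
fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared

# For simple linear regression these two quantities coincide
print(all.equal(r^2, r_squared))

# Arithmetic behind the figure quoted in the text
print(0.501^2)   # 0.251001
```

This identity holds only for one predictor; with several predictors, R² is no longer the square of any single pairwise correlation.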

Linear Regression Model

model <- lm(delay_per_flight ~ weather_prop + carrier_prop + nas_prop + late_aircraft_prop, 
            data = df)

summary(model)

## 
## Call:
## lm(formula = delay_per_flight ~ weather_prop + carrier_prop + 
##     nas_prop + late_aircraft_prop, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.444 -0.038 -0.029 -0.023 42.981 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        0.0186348  0.0027515    6.773 1.28e-11 ***
## weather_prop       1.0003998  0.0003529 2835.128  < 2e-16 ***
## carrier_prop       1.0007338  0.0002896 3455.081  < 2e-16 ***
## nas_prop           1.0035155  0.0006640 1511.425  < 2e-16 ***
## late_aircraft_prop 1.0013315  0.0003455 2898.371  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3708 on 44872 degrees of freedom
##   (34 observations deleted due to missingness)
## Multiple R-squared:  0.9989, Adjusted R-squared:  0.9989 
## F-statistic: 1.002e+07 on 4 and 44872 DF,  p-value: < 2.2e-16

r2 <- summary(model)$r.squared
adj_r2 <- summary(model)$adj.r.squared

Model Interpretation

Overall Fit: R² = 0.9989 indicates an almost perfect fit due to the additive construction of the response variable. In other words, total delay is mathematically constructed as the sum of weather, carrier, NAS, and late aircraft delays, so the model reconstructs the outcome rather than discovering independent causal relationships.
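The effect of an additive construction can be reproduced on synthetic data. In the sketch below, a total is built as the sum of four components plus a small omitted term (hypothetical, standing in for whatever keeps the real fit just short of exact). All distributions and scales are assumptions chosen for illustration.

```r
set.seed(123)
n <- 5000

# Synthetic per-flight components in minutes (not the real data)
weather <- rexp(n, rate = 1 / 0.7)
carrier <- rexp(n, rate = 1 / 4)
nas     <- rexp(n, rate = 1 / 2.5)
late    <- rexp(n, rate = 1 / 5)

# Hypothetical small component that is left out of the model
omitted <- rexp(n, rate = 1 / 0.03)

# The response is the sum of its parts by construction
total <- weather + carrier + nas + late + omitted

sim_fit <- lm(total ~ weather + carrier + nas + late)
print(round(coef(sim_fit), 3))
print(summary(sim_fit)$r.squared)
```

The slopes come out close to 1 and R² close to 1 even though nothing causal was modeled, which is the same pattern as in the summary output.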

Coefficients:

Weather Proportion (coef = 1.0004, p < 0.001): Each additional minute of weather delay per flight increases total delay by ~1 minute. Weather delays translate nearly one-to-one to passenger delays.

Carrier Proportion (coef = 1.0007, p < 0.001): Each additional minute of carrier (operational) delay per flight increases total delay by ~1 minute. This is the controllable component: airline operations directly affect passenger experience.

NAS Proportion (coef = 1.0035, p < 0.001): NAS delays are statistically significant, with their contribution similar in magnitude to other delay components, reflecting system-level congestion rather than airline-specific control.

Late Aircraft Proportion (coef = 1.0013, p < 0.001): Each minute of late arrival cascades to the next flight, increasing total delay by ~1 minute.

All predictors are statistically significant (p < 0.001). Given the additive construction of the response, this is expected: it shows that each component enters the total, not that each is an independent driver.

Key Finding: Coefficients close to 1 indicate that each component delay translates almost one-to-one into total delay, consistent with the additive structure of the dataset. This confirms that delays are constructed as additive components rather than independent drivers.
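The t values in the summary output test each coefficient against 0. Since the key finding concerns closeness to 1, a more targeted check is a t-test of each slope against 1. The sketch below does this using only the estimates, standard errors, and residual degrees of freedom printed in the summary output; the object names are illustrative.

```r
# Estimates and standard errors copied from the summary(model) output
est <- c(weather_prop = 1.0003998, carrier_prop = 1.0007338,
         nas_prop = 1.0035155, late_aircraft_prop = 1.0013315)
se  <- c(weather_prop = 0.0003529, carrier_prop = 0.0002896,
         nas_prop = 0.0006640, late_aircraft_prop = 0.0003455)
df_resid <- 44872   # residual degrees of freedom from the output

# t statistic and two-sided p-value for H0: coefficient = 1
t_vs_one <- (est - 1) / se
p_vs_one <- 2 * pt(abs(t_vs_one), df = df_resid, lower.tail = FALSE)

print(round(cbind(t_vs_one, p_vs_one), 4))
```

With the reported figures, the weather coefficient sits about 1.1 standard errors above 1, while the NAS coefficient sits about 5.3 standard errors above it. The gaps are tiny in absolute terms (at most about 0.0035), so this is a statistical rather than a practical distinction, and the standard errors themselves assume constant variance.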

Model Diagnostics

par(mfrow = c(2, 2))
plot(model, main = "")

par(mfrow = c(1, 1))

Residuals vs Fitted: Some heteroskedasticity visible—residual spread increases at higher fitted values.

Q-Q Plot: Deviation at tails indicates heavier tails than normal distribution—typical of real-world data with occasional extreme delays.

Scale-Location: Confirms heteroskedasticity (upward trend in red line).

Residuals vs Leverage: A few influential points exist but represent genuine extreme events, not errors.

Assessment: The model moderately violates the normality and constant-variance assumptions. With a large sample (n = 44,877 after 34 of the 44,911 rows were dropped for missingness), the coefficient estimates remain stable, but the conventional standard errors should be read with some caution. Predictions for typical delays (5-20 min) are solid; extreme predictions (>40 min) should be viewed with caution.
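One common response to heteroskedasticity is to keep the OLS coefficients but recompute their standard errors with a heteroskedasticity-consistent (sandwich) estimator. The sketch below computes the basic HC0 version by hand in base R on synthetic data whose noise grows with the predictor. It is not part of the original analysis, and all names and values are illustrative assumptions.

```r
set.seed(1)
n <- 500
x <- runif(n, 0, 10)
# Noise standard deviation increases with x (heteroskedastic by design)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + 0.5 * x)

fit <- lm(y ~ x)
X <- model.matrix(fit)   # n x 2 design matrix (intercept, x)
e <- residuals(fit)

# HC0 sandwich: (X'X)^-1  X' diag(e^2) X  (X'X)^-1
bread  <- solve(crossprod(X))
meat   <- crossprod(X * e)   # each row of X scaled by its residual
vc_hc0 <- bread %*% meat %*% bread

se_ols <- sqrt(diag(vcov(fit)))
se_hc0 <- sqrt(diag(vc_hc0))

print(round(cbind(se_ols, se_hc0), 4))
```

The same steps could be applied to the fitted model object by replacing fit with model. Packaged implementations with small-sample corrections exist, but the hand computation keeps the example free of extra dependencies.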

Conclusions

Summary: The model shows that total delay is primarily determined by the additive combination of its component delays (weather, carrier, NAS, and late aircraft). This does not indicate a simple system; rather, it reflects that the response variable is mathematically constructed from the predictors. The analysis confirms that delays decompose additively across weather, carrier operations, NAS congestion, and aircraft cascading effects.

Key Findings:
1. Weather, carrier operations, and delay cascades each contribute substantially (coefficients are approximately equal to 1.0 across all components)
2. NAS delays contribute meaningfully within the additive structure
3. Model fit is extremely high due to the additive structure of the variables rather than independent predictive power

Limitations:
- Heteroskedasticity suggests prediction uncertainty increases for high delays
- Non-normal residuals limit extreme prediction reliability
- Many unmeasured confounders (airport-specific factors, holiday effects, labor issues)

Practical Implications: Because the fit reflects an additive identity, airlines can use this model to forecast total delay only to the extent that the component delays themselves can be forecast. Focus should be on controllable factors (carrier delays) while managing expectations for external factors (weather, system congestion).

Week 11: Data Dive

2026-04-04