Airline Delay Analysis (2021–2023)

Audience

This analysis is prepared for airline operations managers and aviation stakeholders who want to understand the major causes of flight delays in the United States.

Objective

The goal of this project is to identify the main factors contributing to airline arrival delays: weather, carrier issues, National Airspace System (NAS) congestion, and late-arriving aircraft.

Background

Flight delays have remained a major issue even after COVID-19. Understanding the sources of delay can help airlines improve scheduling, reduce costs, and improve customer satisfaction.

Load Data

library(tidyverse)
df <- read.csv("Airline_Delay_Post_COVID_2021_2023.csv")

head(df) 
  year month carrier      carrier_name airport
1 2023     3      9E Endeavor Air Inc.     ABY
2 2023     3      9E Endeavor Air Inc.     AEX
3 2023     3      9E Endeavor Air Inc.     AGS
4 2023     3      9E Endeavor Air Inc.     ALB
5 2023     3      9E Endeavor Air Inc.     ATL
6 2023     3      9E Endeavor Air Inc.     ATW
                                           airport_name arr_flights arr_del15
1                Albany, GA: Southwest Georgia Regional          89         8
2              Alexandria, LA: Alexandria International          62         8
3           Augusta, GA: Augusta Regional at Bush Field          11         2
4                      Albany, NY: Albany International         201        27
5 Atlanta, GA: Hartsfield-Jackson Atlanta International        1598       222
6                  Appleton, WI: Appleton International          47         8
  carrier_ct weather_ct nas_ct security_ct late_aircraft_ct arr_cancelled
1       4.46       1.00   1.61           0             0.93             1
2       3.95       0.37   1.29           0             2.40             0
3       1.00       0.00   0.00           0             1.00             0
4      13.04       0.46   7.06           0             6.44             7
5      57.22       8.08  61.32           0            95.38             8
6       1.42       1.80   3.62           0             1.16             0
  arr_diverted arr_delay carrier_delay weather_delay nas_delay security_delay
1            1       412           262            38        53              0
2            0       357           188             7        44              0
3            0        60            24             0         0              0
4            1      1336           742            13       220              0
5            6     18248          7265           774      3458              0
6            1       674           114           353       167              0
  late_aircraft_delay
1                  59
2                 118
3                  36
4                 361
5                6751
6                  40
str(df)
'data.frame':   44911 obs. of  21 variables:
 $ year               : int  2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
 $ month              : int  3 3 3 3 3 3 3 3 3 3 ...
 $ carrier            : chr  "9E" "9E" "9E" "9E" ...
 $ carrier_name       : chr  "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." ...
 $ airport            : chr  "ABY" "AEX" "AGS" "ALB" ...
 $ airport_name       : chr  "Albany, GA: Southwest Georgia Regional" "Alexandria, LA: Alexandria International" "Augusta, GA: Augusta Regional at Bush Field" "Albany, NY: Albany International" ...
 $ arr_flights        : num  89 62 11 201 1598 ...
 $ arr_del15          : num  8 8 2 27 222 8 7 9 8 0 ...
 $ carrier_ct         : num  4.46 3.95 1 13.04 57.22 ...
 $ weather_ct         : num  1 0.37 0 0.46 8.08 1.8 0 0 0.88 0 ...
 $ nas_ct             : num  1.61 1.29 0 7.06 61.32 ...
 $ security_ct        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ late_aircraft_ct   : num  0.93 2.4 1 6.44 95.38 ...
 $ arr_cancelled      : num  1 0 0 7 8 0 0 1 2 1 ...
 $ arr_diverted       : num  1 0 0 1 6 1 0 0 0 0 ...
 $ arr_delay          : num  412 357 60 1336 18248 ...
 $ carrier_delay      : num  262 188 24 742 7265 ...
 $ weather_delay      : num  38 7 0 13 774 353 0 0 29 0 ...
 $ nas_delay          : num  53 44 0 220 3458 ...
 $ security_delay     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ late_aircraft_delay: num  59 118 36 361 6751 ...

Data Preparation

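We normalize each measure by the number of arriving flights so that records from busy and quiet airports can be compared: delay_per_flight is the total arrival delay in minutes per arriving flight, and each *_prop column is the share of arriving flights delayed by that cause.

df <- df %>%
  mutate(
    # Total delay minutes per arriving flight
    delay_per_flight = arr_delay / arr_flights,
    # Share of arriving flights delayed by each cause
    weather_prop = weather_ct / arr_flights,
    carrier_prop = carrier_ct / arr_flights,
    nas_prop = nas_ct / arr_flights,
    late_aircraft_prop = late_aircraft_ct / arr_flights
  )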
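
The ratios above are undefined when arr_flights is zero or missing. The source does not show whether such rows exist, so the following is only a minimal sketch of one way to guard the division. The helper name safe_ratio and the toy values are hypothetical, and base R is used so the example runs without extra packages.

```r
# Toy rows (illustrative values, not taken from the dataset).
toy <- data.frame(
  arr_flights = c(100, 0, NA),
  arr_delay   = c(500, 0, NA)
)

# Hypothetical helper: divide only where the denominator is positive;
# otherwise return NA so the row is excluded explicitly later on.
safe_ratio <- function(num, den) {
  ifelse(!is.na(den) & den > 0, num / den, NA_real_)
}

toy$delay_per_flight <- safe_ratio(toy$arr_delay, toy$arr_flights)
print(toy)

# Number of rows that would be excluded from later modelling.
n_dropped <- sum(is.na(toy$delay_per_flight))
print(n_dropped)
```

Counting the excluded rows at this stage makes it easier to reconcile the row count with the number of observations a later model reports as dropped.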

Exploratory Data Analysis

  1. Average Number of Delayed Flights by Cause
df %>%
  summarise(
    weather = mean(weather_ct, na.rm = TRUE),
    carrier = mean(carrier_ct, na.rm = TRUE),
    nas = mean(nas_ct, na.rm = TRUE),
    late_aircraft = mean(late_aircraft_ct, na.rm = TRUE)
  )
   weather  carrier      nas late_aircraft
1 2.362735 22.91071 15.06738      20.33411
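
The means above are easier to compare as shares. A short sketch, using the four values printed in the output; note that security delays are not part of this summary, so the shares are relative to these four causes only.

```r
# Mean delayed-flight counts per record, copied from the summary output above.
mean_ct <- c(weather = 2.362735, carrier = 22.91071,
             nas = 15.06738, late_aircraft = 20.33411)

# Share of each cause among these four causes (security is not included).
share <- mean_ct / sum(mean_ct)
print(round(100 * share, 1))
```

On these figures, carrier and late-aircraft delays together account for roughly 71% of delayed flights across the four causes, while weather accounts for about 4%.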
  2. Average Delay by Year
df %>%
  group_by(year) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = year, y = avg_delay)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 2021:2023) +  # whole-year ticks only
  labs(
    title = "Average Airline Delay by Year",
    x = "Year",
    y = "Mean total arrival delay per record (minutes)"
  )

  3. Top 10 Airports with Highest Delays
df %>%
  group_by(airport_name) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(airport_name, avg_delay), y = avg_delay)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Airports by Mean Total Arrival Delay",
    x = "Airport",
    y = "Mean total arrival delay per record (minutes)"
  )
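
Both plots above average arr_delay, which appears to be the total delay minutes in a carrier-airport-month record, so high-traffic groups can rank highest largely because they handle more flights. The following is a minimal sketch of a flight-weighted alternative: total delay minutes divided by total arriving flights per group. The helper name delay_per_flight_by and the toy values are hypothetical.

```r
# Toy records (illustrative values only): one row per carrier-airport-month.
toy <- data.frame(
  airport     = c("AAA", "AAA", "BBB"),
  arr_flights = c(1000, 500, 10),
  arr_delay   = c(9000, 6000, 300)   # total delay minutes in the record
)

# Hypothetical helper: total minutes divided by total flights per group,
# returned as a named vector sorted from highest to lowest.
delay_per_flight_by <- function(data, group) {
  minutes <- tapply(data$arr_delay, data[[group]], sum, na.rm = TRUE)
  flights <- tapply(data$arr_flights, data[[group]], sum, na.rm = TRUE)
  ratio <- setNames(as.vector(minutes / flights), names(minutes))
  ratio[order(ratio, decreasing = TRUE)]
}

by_airport <- delay_per_flight_by(toy, "airport")
print(by_airport)
```

In the toy data, airport AAA has the larger mean total delay per record, but BBB has the higher delay per flight. The same helper could be called with "year" as the grouping column. Groups with very few flights will give noisy ratios, so a minimum-volume filter may be worth adding.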

Regression Analysis

We fit a linear regression of delay per flight (in minutes) on the share of arriving flights delayed by each cause.

model <- lm(
  delay_per_flight ~ weather_prop + carrier_prop + nas_prop + late_aircraft_prop,
  data = df
)

summary(model)

Call:
lm(formula = delay_per_flight ~ weather_prop + carrier_prop + 
    nas_prop + late_aircraft_prop, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-133.70   -2.79   -0.70    1.41  563.35 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -0.50539    0.06929  -7.294 3.05e-13 ***
weather_prop       152.20813    1.61962  93.978  < 2e-16 ***
carrier_prop        66.00279    0.55890 118.094  < 2e-16 ***
nas_prop            41.34244    0.71850  57.540  < 2e-16 ***
late_aircraft_prop  75.99525    0.64395 118.014  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.544 on 44872 degrees of freedom
  (34 observations deleted due to missingness)
Multiple R-squared:  0.5371,    Adjusted R-squared:  0.5371 
F-statistic: 1.302e+04 on 4 and 44872 DF,  p-value: < 2.2e-16
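
One way to read the slopes above: if each cause adds a roughly constant number of minutes per affected flight, then the slope on that cause's proportion estimates those minutes. The simulation below is a sketch of that reasoning only. The two causes, the "true" minutes, and the noise level are all assumed values, not estimates from the dataset.

```r
set.seed(1)
n <- 2000

# Simulated shares of flights delayed by two hypothetical causes.
weather_prop <- runif(n, 0, 0.05)
carrier_prop <- runif(n, 0, 0.20)

# Assumed "true" minutes per delayed flight for the simulation.
true_minutes <- c(weather = 150, carrier = 60)

# Delay per flight = sum over causes of (share delayed * minutes) + noise.
delay_per_flight <- true_minutes[["weather"]] * weather_prop +
  true_minutes[["carrier"]] * carrier_prop +
  rnorm(n, sd = 1)

# The fitted slopes should land close to the assumed minutes.
sim_fit <- lm(delay_per_flight ~ weather_prop + carrier_prop)
print(round(coef(sim_fit), 1))
```

Real records will not follow this structure exactly, which is one reason the fitted model leaves a large share of the variance unexplained.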

Interpretation

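  • Weather delays reflect external conditions that airlines cannot control.

  • Carrier delays reflect problems within the airline's control, such as maintenance, crew, or ground handling.

  • NAS delays relate to air traffic congestion and other National Airspace System constraints.

  • Late aircraft delays occur when an earlier flight on the same aircraft arrives late, so the delay cascades to later flights.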

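All four predictors are statistically significant (p < 2e-16), and together they explain about 54% of the variance in delay per flight (R-squared = 0.537). Weather has the largest coefficient (152.2), followed by late aircraft (76.0), carrier (66.0), and NAS (41.3). Because each predictor is a share of flights, a 1 percentage-point increase in the weather share corresponds to about 1.5 more minutes of delay per flight, holding the other shares fixed. These are associations in aggregated records, not causal estimates.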
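
Since the predictors are proportions, the slopes are easier to read per percentage point. A short sketch of the arithmetic, using the estimates printed in the regression output:

```r
# Slope estimates copied from the summary(model) output above.
slopes <- c(weather_prop = 152.20813, carrier_prop = 66.00279,
            nas_prop = 41.34244, late_aircraft_prop = 75.99525)

# Minutes of delay per flight associated with a 1 percentage-point
# (0.01) increase in each proportion, holding the others fixed.
per_point <- sort(slopes * 0.01, decreasing = TRUE)
print(round(per_point, 2))
```

This gives about 1.52 minutes for weather, 0.76 for late aircraft, 0.66 for carrier, and 0.41 for NAS.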

Conclusion

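Flight delays are driven by both operational and environmental factors. Carrier and late-aircraft delays are the most frequent causes (on average about 22.9 and 20.3 delayed flights per record, compared with 2.4 for weather), so improving turnaround time and reducing late-arriving aircraft may significantly reduce total delay. Weather delays are rarer but have the largest regression coefficient, which suggests that each weather-delayed flight is associated with more delay minutes than flights delayed by other causes.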