Audience
This analysis is prepared for airline operations managers and aviation stakeholders who want to understand the major causes of flight delays in the United States.
Objective
The goal of this project is to identify the main factors contributing to airline arrival delays: weather, carrier issues, National Airspace System (NAS) congestion, and late-arriving aircraft.
Background
Flight delays have remained a major issue even after COVID-19. Understanding the sources of delay can help airlines improve scheduling, reduce costs, and improve customer satisfaction.
Load Data
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read.csv("Airline_Delay_Post_COVID_2021_2023.csv")
head(df)
year month carrier carrier_name airport
1 2023 3 9E Endeavor Air Inc. ABY
2 2023 3 9E Endeavor Air Inc. AEX
3 2023 3 9E Endeavor Air Inc. AGS
4 2023 3 9E Endeavor Air Inc. ALB
5 2023 3 9E Endeavor Air Inc. ATL
6 2023 3 9E Endeavor Air Inc. ATW
airport_name arr_flights arr_del15
1 Albany, GA: Southwest Georgia Regional 89 8
2 Alexandria, LA: Alexandria International 62 8
3 Augusta, GA: Augusta Regional at Bush Field 11 2
4 Albany, NY: Albany International 201 27
5 Atlanta, GA: Hartsfield-Jackson Atlanta International 1598 222
6 Appleton, WI: Appleton International 47 8
carrier_ct weather_ct nas_ct security_ct late_aircraft_ct arr_cancelled
1 4.46 1.00 1.61 0 0.93 1
2 3.95 0.37 1.29 0 2.40 0
3 1.00 0.00 0.00 0 1.00 0
4 13.04 0.46 7.06 0 6.44 7
5 57.22 8.08 61.32 0 95.38 8
6 1.42 1.80 3.62 0 1.16 0
arr_diverted arr_delay carrier_delay weather_delay nas_delay security_delay
1 1 412 262 38 53 0
2 0 357 188 7 44 0
3 0 60 24 0 0 0
4 1 1336 742 13 220 0
5 6 18248 7265 774 3458 0
6 1 674 114 353 167 0
late_aircraft_delay
1 59
2 118
3 36
4 361
5 6751
6 40
str(df)
'data.frame': 44911 obs. of 21 variables:
$ year : int 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
$ month : int 3 3 3 3 3 3 3 3 3 3 ...
$ carrier : chr "9E" "9E" "9E" "9E" ...
$ carrier_name : chr "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." ...
$ airport : chr "ABY" "AEX" "AGS" "ALB" ...
$ airport_name : chr "Albany, GA: Southwest Georgia Regional" "Alexandria, LA: Alexandria International" "Augusta, GA: Augusta Regional at Bush Field" "Albany, NY: Albany International" ...
$ arr_flights : num 89 62 11 201 1598 ...
$ arr_del15 : num 8 8 2 27 222 8 7 9 8 0 ...
$ carrier_ct : num 4.46 3.95 1 13.04 57.22 ...
$ weather_ct : num 1 0.37 0 0.46 8.08 1.8 0 0 0.88 0 ...
$ nas_ct : num 1.61 1.29 0 7.06 61.32 ...
$ security_ct : num 0 0 0 0 0 0 0 0 0 0 ...
$ late_aircraft_ct : num 0.93 2.4 1 6.44 95.38 ...
$ arr_cancelled : num 1 0 0 7 8 0 0 1 2 1 ...
$ arr_diverted : num 1 0 0 1 6 1 0 0 0 0 ...
$ arr_delay : num 412 357 60 1336 18248 ...
$ carrier_delay : num 262 188 24 742 7265 ...
$ weather_delay : num 38 7 0 13 774 353 0 0 29 0 ...
$ nas_delay : num 53 44 0 220 3458 ...
$ security_delay : num 0 0 0 0 0 0 0 0 0 0 ...
$ late_aircraft_delay: num 59 118 36 361 6751 ...
Data Preparation
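Raw delay counts and minutes scale with traffic volume, so we divide each by the number of arriving flights (arr_flights). This gives the delay minutes per flight and the share of flights delayed by each cause, which can be compared fairly across airports and carriers of different sizes.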
df <- df %>%
  mutate(
    delay_per_flight   = arr_delay / arr_flights,
    weather_prop       = weather_ct / arr_flights,
    carrier_prop       = carrier_ct / arr_flights,
    nas_prop           = nas_ct / arr_flights,
    late_aircraft_prop = late_aircraft_ct / arr_flights
  )
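The ratios above are undefined for any row with zero arriving flights. Whether such rows exist in this dataset is not shown, so the following is only a defensive sketch on made-up rows. The data frame `toy` and the helper `safe_ratio()` are hypothetical names introduced for illustration.

```r
library(dplyr)

# Hypothetical toy rows (not from the dataset), including a zero-flight row.
toy <- data.frame(
  arr_flights = c(100, 0, 50),
  arr_delay   = c(500, 0, 100),
  weather_ct  = c(2, 0, 1)
)

# Hypothetical helper: divide only where the denominator is positive,
# otherwise return NA so the row is dropped by na.rm = TRUE or lm().
safe_ratio <- function(num, den) {
  if_else(den > 0, num / den, NA_real_)
}

toy_prepared <- toy %>%
  mutate(
    delay_per_flight = safe_ratio(arr_delay, arr_flights),
    weather_prop     = safe_ratio(weather_ct, arr_flights)
  )

toy_prepared
```

Returning NA rather than NaN or Inf keeps later summaries and the regression from failing on non-finite values.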
Exploratory Data Analysis
Average Delay Causes
df %>%
  summarise(
    weather       = mean(weather_ct, na.rm = TRUE),
    carrier       = mean(carrier_ct, na.rm = TRUE),
    nas           = mean(nas_ct, na.rm = TRUE),
    late_aircraft = mean(late_aircraft_ct, na.rm = TRUE)
  )
weather carrier nas late_aircraft
1 2.362735 22.91071 15.06738 20.33411
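The means above are easier to compare as shares. The sketch below copies the four printed values and divides each by their total. Security delays are not part of this summary, so the shares cover only these four causes.

```r
# Mean delayed-flight counts per cause, copied from the summary output above.
mean_counts <- c(
  weather       = 2.362735,
  carrier       = 22.91071,
  nas           = 15.06738,
  late_aircraft = 20.33411
)

# Share of each cause among these four categories (security is excluded).
cause_share <- mean_counts / sum(mean_counts)

# Percentages rounded to one decimal place.
round(100 * cause_share, 1)
```

On these figures, carrier issues account for roughly 38% of delayed flights, late aircraft about 34%, NAS about 25%, and weather about 4%.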
Average Delay by Year
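# arr_delay is the total delay in minutes for one carrier-airport-month row,
# so this averages row totals rather than delay per flight.
df %>%
  group_by(year) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = year, y = avg_delay)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average Airline Delay by Year",
    x = "Year",
    y = "Average Delay (minutes per carrier-airport-month)"
  )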
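Because the plot averages row totals, years with more traffic at large airports can look worse even if the typical flight is no later. One possible complement, sketched here on made-up rows, is a flight-weighted delay per flight: total delay minutes divided by total arrivals in each year. The data frame `toy` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy rows (not from the dataset): one small and one large
# airport in each of two years.
toy <- data.frame(
  year        = c(2021, 2021, 2022, 2022),
  arr_flights = c(10, 1000, 10, 1000),
  arr_delay   = c(100, 5000, 300, 8000)
)

yearly <- toy %>%
  # Keep only rows where both columns are present so the sums stay aligned.
  filter(!is.na(arr_flights), !is.na(arr_delay)) %>%
  group_by(year) %>%
  summarise(
    # Mean of row totals, as in the plot above.
    avg_total_delay  = mean(arr_delay),
    # Flight-weighted delay per flight: total minutes over total arrivals.
    delay_per_flight = sum(arr_delay) / sum(arr_flights)
  )

yearly
```

The two columns answer different questions: the first tracks total delay burden per row, the second tracks how late an average flight was.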
Top Airports with Highest Delays
# Ranks airports by mean total delay minutes per carrier-airport-month row,
# which tends to favor high-traffic airports.
df %>%
  group_by(airport_name) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(airport_name, avg_delay), y = avg_delay)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Airports by Delay",
    x = "Airport",
    y = "Average Delay (minutes per carrier-airport-month)"
  )
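A ranking on total minutes mostly reflects airport size. As an alternative view, the sketch below ranks airports by delay minutes per flight and drops airports below a minimum traffic volume. The toy rows and the `min_flights` cutoff are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy rows (not from the dataset).
toy <- data.frame(
  airport_name = c("Airport A", "Airport A", "Airport B", "Airport C"),
  arr_flights  = c(1000, 1200, 40, 500),
  arr_delay    = c(12000, 15000, 2000, 4000)
)

# Hypothetical cutoff that excludes very low-traffic airports, whose
# per-flight averages are noisy.
min_flights <- 100

ranked <- toy %>%
  group_by(airport_name) %>%
  summarise(
    total_flights    = sum(arr_flights, na.rm = TRUE),
    delay_per_flight = sum(arr_delay, na.rm = TRUE) / total_flights
  ) %>%
  filter(total_flights >= min_flights) %>%
  slice_max(delay_per_flight, n = 10)

ranked
```

In this toy example Airport B has the highest delay per flight but is excluded because it has only 40 arrivals, which shows why the cutoff matters.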
Regression Analysis
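We fit a linear regression of delay per flight on the share of flights delayed by each cause to estimate how strongly each cause is associated with delay.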
model <- lm(
  delay_per_flight ~ weather_prop + carrier_prop + nas_prop + late_aircraft_prop,
  data = df
)
summary(model)
Call:
lm(formula = delay_per_flight ~ weather_prop + carrier_prop +
nas_prop + late_aircraft_prop, data = df)
Residuals:
Min 1Q Median 3Q Max
-133.70 -2.79 -0.70 1.41 563.35
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.50539 0.06929 -7.294 3.05e-13 ***
weather_prop 152.20813 1.61962 93.978 < 2e-16 ***
carrier_prop 66.00279 0.55890 118.094 < 2e-16 ***
nas_prop 41.34244 0.71850 57.540 < 2e-16 ***
late_aircraft_prop 75.99525 0.64395 118.014 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.544 on 44872 degrees of freedom
(34 observations deleted due to missingness)
Multiple R-squared: 0.5371, Adjusted R-squared: 0.5371
F-statistic: 1.302e+04 on 4 and 44872 DF, p-value: < 2.2e-16
Interpretation
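Weather delays reflect external conditions outside an airline's control. They are the least frequent cause (about 2.4 delayed flights per carrier-airport-month on average) but have the largest coefficient (152.2).
Carrier delays reflect airline operational efficiency. They are the most frequent cause (about 22.9 on average), with a coefficient of 66.0.
NAS delays relate to air traffic congestion and have the smallest coefficient (41.3).
Late aircraft delays show cascading operational issues, where a late inbound aircraft delays its next flight. They are the second most frequent cause (about 20.3 on average) and have the second-largest coefficient (76.0).
All four coefficients are statistically significant (p < 2e-16), and the model explains about 54% of the variance in delay per flight (R-squared 0.537). Because each predictor is a proportion, a one-percentage-point increase in the share of weather-delayed flights is associated with about 1.5 more minutes of delay per flight, holding the other shares fixed.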
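To make the coefficients concrete, the sketch below copies the printed estimates and converts them to minutes per one-percentage-point change in each proportion. It then computes a fitted value for one hypothetical row. The values in `x` are illustrative and do not come from the dataset.

```r
# Coefficient estimates copied from the summary(model) output above.
coefs <- c(
  intercept          = -0.50539,
  weather_prop       = 152.20813,
  carrier_prop       = 66.00279,
  nas_prop           = 41.34244,
  late_aircraft_prop = 75.99525
)

# Minutes of delay per flight associated with a one-percentage-point (0.01)
# increase in each proportion, holding the others fixed.
per_point <- coefs[-1] * 0.01
round(per_point, 2)

# Fitted delay per flight for a hypothetical row (illustrative values only).
x <- c(
  weather_prop       = 0.01,
  carrier_prop       = 0.05,
  nas_prop           = 0.03,
  late_aircraft_prop = 0.04
)
fitted_delay <- coefs[["intercept"]] + sum(coefs[names(x)] * x)
fitted_delay
```

These are associations from an observational model, so they describe how delay per flight varies with each share rather than what would happen if one cause were removed.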
Conclusion
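Flight delays are driven by both operational and environmental factors. Carrier and late aircraft delays are the most frequent causes and both carry large coefficients, so improving turnaround time and reducing late aircraft delays may substantially reduce total delay. Weather delays are rarer but are associated with the most delay per affected flight.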