This data dive explores U.S. domestic airline delays from 2021 to 2023 using a modified dataset from the Bureau of Transportation Statistics. Each observation represents an airline–airport–time period combination and includes information on arrival delays and contributing delay causes. The goal of this analysis is to understand how airline delays are distributed, how they vary across airlines, and how they have changed during the post-COVID recovery period.
# Load libraries
library(dplyr)
library(ggplot2)
library(tidyr)
Airline_Delay_Post_COVID_2021_2023 <- read.csv("Airline_Delay_Post_COVID_2021_2023.csv")
# See all column names in your dataset
names(Airline_Delay_Post_COVID_2021_2023)
## [1] "year" "month" "carrier"
## [4] "carrier_name" "airport" "airport_name"
## [7] "arr_flights" "arr_del15" "carrier_ct"
## [10] "weather_ct" "nas_ct" "security_ct"
## [13] "late_aircraft_ct" "arr_cancelled" "arr_diverted"
## [16] "arr_delay" "carrier_delay" "weather_delay"
## [19] "nas_delay" "security_delay" "late_aircraft_delay"
Airline_Delay_Post_COVID_2021_2023 %>%
summarise(
arr_min = min(arr_delay, na.rm = TRUE),
arr_q1 = quantile(arr_delay, 0.25, na.rm = TRUE),
arr_median = median(arr_delay, na.rm = TRUE),
arr_mean = mean(arr_delay, na.rm = TRUE),
arr_q3 = quantile(arr_delay, 0.75, na.rm = TRUE),
arr_max = max(arr_delay, na.rm = TRUE),
carrier_delay_mean = mean(carrier_delay, na.rm = TRUE),
weather_delay_mean = mean(weather_delay, na.rm = TRUE),
nas_delay_mean = mean(nas_delay, na.rm = TRUE),
security_delay_mean = mean(security_delay, na.rm = TRUE),
late_aircraft_delay_mean = mean(late_aircraft_delay, na.rm = TRUE)
)
## arr_min arr_q1 arr_median arr_mean arr_q3 arr_max carrier_delay_mean
## 1 0 321 941 4061.559 2687 337375 1612.297
## weather_delay_mean nas_delay_mean security_delay_mean
## 1 245.4794 697.7206 10.51554
## late_aircraft_delay_mean
## 1 1495.546
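The summary statistics show a median of 941 total arrival-delay minutes per observation, against a mean of about 4,062 and a maximum of 337,375. Because each observation is an airline–airport–time period aggregate rather than a single flight, these values are accumulated delay minutes, not per-flight delays. The mean is more than four times the median, and the maximum is far larger still, revealing a strongly right-skewed distribution: most airline–airport combinations accumulate relatively few delay minutes, while a small number account for very large totals. Those large totals may reflect high-traffic airports as well as severe disruption events, so both traffic volume and rare disruptions matter when interpreting the upper tail.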
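One way to quantify the skew described above is to ask what share of all delay minutes comes from the largest observations. The sketch below is illustrative only: `top_share()` is a hypothetical helper, not part of the original analysis, and `toy_delay` holds made-up values standing in for `arr_delay`.

```r
# Hypothetical helper: share of total delay minutes contributed by the
# top `p` fraction of observations (rows), ignoring missing values.
top_share <- function(x, p = 0.01) {
  x <- sort(x[!is.na(x)], decreasing = TRUE)
  n_top <- max(1, ceiling(length(x) * p))
  sum(x[seq_len(n_top)]) / sum(x)
}

# Made-up values standing in for arr_delay (not real data)
toy_delay <- c(0, 120, 300, 450, 900, 1000, 1500, 2700, 4000, 90000)

# Share of all delay minutes held by the top 10% of rows
top_share(toy_delay, p = 0.10)
```

Applied to the real column, a call such as `top_share(Airline_Delay_Post_COVID_2021_2023$arr_delay, 0.01)` would indicate how concentrated the delay minutes are in the top 1% of rows.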
Airline_Delay_Post_COVID_2021_2023 %>%
count(carrier, sort = TRUE)
## carrier n
## 1 OO 6295
## 2 MQ 3773
## 3 DL 3471
## 4 G4 3379
## 5 AA 2941
## 6 9E 2837
## 7 WN 2830
## 8 UA 2802
## 9 F9 2520
## 10 OH 2467
## 11 YV 2320
## 12 YX 2240
## 13 AS 2127
## 14 B6 1702
## 15 NK 1477
## 16 QX 1157
## 17 HA 573
Based on the data summaries and project goals, this analysis investigates how arrival delays differ across airlines, how they are distributed, and which delay causes contribute the most minutes.
# Average arrival delay by airline
Airline_Delay_Post_COVID_2021_2023 %>%
group_by(carrier) %>%
summarise(avg_arrival_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(avg_arrival_delay))
## # A tibble: 17 × 2
## carrier avg_arrival_delay
## <chr> <dbl>
## 1 WN 10927.
## 2 AA 9491.
## 3 B6 6974.
## 4 UA 5618.
## 5 DL 5321.
## 6 NK 5110.
## 7 OO 3491.
## 8 YX 3475.
## 9 F9 2564.
## 10 HA 2260.
## 11 OH 2243.
## 12 AS 2104.
## 13 YV 2013.
## 14 9E 1795.
## 15 G4 1628.
## 16 MQ 1505.
## 17 QX 1257.
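This aggregation shows that mean arrival-delay minutes per observation vary widely across airlines, from about 10,927 for WN to about 1,257 for QX. Because each row aggregates all of a carrier's arrivals at one airport in one period, these means are confounded by flight volume: carriers that operate many flights per airport accumulate more delay minutes even if their per-flight delays are similar. The differences may also reflect scheduling practices, network congestion, or operational efficiency, but a per-flight measure is needed before drawing that conclusion.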
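Because each row aggregates many flights, a per-flight measure can be a fairer basis for comparing carriers than the mean of row totals. The following is a minimal sketch on made-up data, assuming `arr_flights` counts arriving flights for the row; the `toy` data frame and the `per_flight` name are illustrative only.

```r
library(dplyr)

# Made-up rows in the same shape as the dataset (not real values)
toy <- data.frame(
  carrier     = c("AA", "AA", "QX", "QX"),
  arr_flights = c(1000, 500, 100, 50),
  arr_delay   = c(12000, 6000, 1500, 900)
)

per_flight <- toy %>%
  # Keep rows where both quantities are present so the ratio is consistent
  filter(!is.na(arr_delay), !is.na(arr_flights)) %>%
  group_by(carrier) %>%
  summarise(
    total_delay      = sum(arr_delay),
    total_flights    = sum(arr_flights),
    delay_per_flight = total_delay / total_flights
  ) %>%
  arrange(desc(delay_per_flight))

per_flight
```

In this toy example the carrier with the smaller delay total ranks higher once delay minutes are divided by flights, which is the kind of reordering that volume differences can hide.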
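# Histogram of arrival delays
# Each row is an airline-airport-period aggregate, so the y-axis counts
# observations (not individual flights) and the x-axis is total delay minutes.
ggplot(
  Airline_Delay_Post_COVID_2021_2023 %>% filter(!is.na(arr_delay)),
  aes(x = arr_delay)
) +
  geom_histogram(binwidth = 50, fill = "lightblue", color = "black") +
  labs(
    title = "Distribution of Arrival Delays",
    x = "Total Arrival Delay (minutes)",
    y = "Number of Observations"
  )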
The histogram shows that most observations have relatively small total arrival delays, while a small number have very large totals. This pattern suggests that the upper tail, though uncommon, plays a significant role in overall delay outcomes. Airlines may benefit from identifying which airports and periods produce these extreme totals and focusing mitigation efforts there.
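With a distribution this skewed, a histogram on the raw scale tends to compress most observations into the first few bins. One option, sketched below on made-up values, is to plot `log10(arr_delay + 1)` instead; the `+ 1` offset is an assumption made here so that zero-minute rows remain defined, and `toy`, `log_delay`, and `p` are illustrative names.

```r
library(ggplot2)

# Made-up values spanning several orders of magnitude (not real data)
toy <- data.frame(arr_delay = c(0, 15, 120, 321, 941, 2687, 40000, 337375))

# Transform before plotting; +1 keeps zero-delay rows on the axis
toy$log_delay <- log10(toy$arr_delay + 1)

p <- ggplot(toy, aes(x = log_delay)) +
  geom_histogram(bins = 10, fill = "lightblue", color = "black") +
  labs(
    title = "Distribution of Arrival Delays (log scale)",
    x = "log10(Total Arrival Delay in minutes + 1)",
    y = "Number of Observations"
  )

p
```

The log scale spreads out the bulk of the observations while still showing the extreme values, at the cost of an axis that is less direct to read in minutes.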
# Total contribution of each delay cause
Airline_Delay_Post_COVID_2021_2023 %>%
summarise(
total_carrier_delay = sum(carrier_delay, na.rm = TRUE),
total_weather_delay = sum(weather_delay, na.rm = TRUE),
total_nas_delay = sum(nas_delay, na.rm = TRUE),
total_security_delay = sum(security_delay, na.rm = TRUE),
total_late_aircraft_delay = sum(late_aircraft_delay, na.rm = TRUE)
)
## total_carrier_delay total_weather_delay total_nas_delay total_security_delay
## 1 72355046 11016381 31311609 471906
## total_late_aircraft_delay
## 1 67115605
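Carrier delays (about 72.4 million minutes) and late-aircraft delays (about 67.1 million minutes) dominate overall delays, followed by NAS delays (about 31.3 million minutes). Weather (about 11.0 million minutes) and security (about 0.5 million minutes) are much smaller contributors. Airlines may therefore gain the most from addressing issues within their own control and improving aircraft turnaround, with airspace congestion as the next largest source.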
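To make the relative sizes easier to compare, the totals printed above can be converted to percentage shares. This short sketch hard-codes the five totals from the output; the vector names are labels chosen here for readability.

```r
# Totals copied from the summarise() output above (minutes)
totals <- c(
  carrier       = 72355046,
  weather       = 11016381,
  nas           = 31311609,
  security      = 471906,
  late_aircraft = 67115605
)

# Percentage share of each cause, rounded to one decimal place
shares <- round(100 * totals / sum(totals), 1)

sort(shares, decreasing = TRUE)
```

On these totals, carrier and late-aircraft delays together account for roughly three quarters of all delay minutes.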
# Top 5 airlines by total arrival delays
top_carriers <- Airline_Delay_Post_COVID_2021_2023 %>%
group_by(carrier) %>%
summarise(total_arr_delay = sum(arr_delay, na.rm = TRUE)) %>%
arrange(desc(total_arr_delay)) %>%
slice_head(n = 5) %>%
pull(carrier)
# Stacked bar chart
Airline_Delay_Post_COVID_2021_2023 %>%
filter(carrier %in% top_carriers) %>%
select(carrier, carrier_delay, weather_delay, nas_delay, security_delay, late_aircraft_delay) %>%
pivot_longer(cols = -carrier, names_to = "delay_type", values_to = "minutes") %>%
group_by(carrier, delay_type) %>%
summarise(total_minutes = sum(minutes, na.rm = TRUE), .groups = "drop") %>%
ggplot(aes(x = carrier, y = total_minutes, fill = delay_type)) +
geom_col(position = "stack") +
labs(
title = "Contribution of Delay Causes for Top 5 Airlines",
x = "Carrier",
y = "Total Delay Minutes",
fill = "Delay Cause"
)
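For the top 5 airlines by total arrival-delay minutes, the stacked bars show how much each cause contributes for each carrier. In the overall totals, carrier and late-aircraft delays are the two largest categories, with NAS delays third, so these are the operational areas airlines should examine first when trying to reduce the largest delay totals.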
# Scatter plot: Arrival delay vs Late Aircraft delay
ggplot(Airline_Delay_Post_COVID_2021_2023, aes(x = late_aircraft_delay, y = arr_delay)) +
geom_point(alpha = 0.3, color = "darkblue") +
labs(
title = "Relationship Between Arrival Delays and Late Aircraft Delays",
x = "Late Aircraft Delay (minutes)",
y = "Arrival Delay (minutes)"
)
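There is a clear positive relationship between late-aircraft delay minutes and total arrival-delay minutes: observations with more late-aircraft delay tend to have much higher total arrival delay. Part of this relationship is expected by construction, since late-aircraft delay is one of the cause categories that make up arrival delay. Even so, given its large share of total minutes, airlines can target scheduling and aircraft rotation efficiency to reduce these delays.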
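The strength of the relationship in the scatter plot can be summarised with a correlation coefficient. The sketch below uses made-up data and assumes, for illustration, that `arr_delay` is the sum of late-aircraft delay and all other causes; `other_delay` is a hypothetical column that does not exist in the dataset.

```r
# Made-up rows (not real data); other_delay is a hypothetical stand-in
# for the sum of carrier, weather, NAS, and security delay minutes.
toy <- data.frame(
  late_aircraft_delay = c(0, 100, 400, 900, 2500),
  other_delay         = c(50, 300, 200, 1100, 1500)
)

# Assumption for this sketch: total arrival delay is the sum of its causes
toy$arr_delay <- toy$late_aircraft_delay + toy$other_delay

# Pearson measures linear association; Spearman is rank-based and
# less sensitive to the extreme values seen in this kind of data
pearson_r  <- cor(toy$late_aircraft_delay, toy$arr_delay, use = "complete.obs")
spearman_r <- cor(toy$late_aircraft_delay, toy$arr_delay,
                  use = "complete.obs", method = "spearman")

c(pearson = pearson_r, spearman = spearman_r)
```

Because one variable is a component of the other, a high correlation here is partly built in; comparing late-aircraft delay against the remaining causes, or against per-flight delay, would give a less circular picture.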