# Load libraries
library(dplyr)
library(ggplot2)
library(tidyr)
library(scales)
Airline_Delay_Post_COVID_2021_2023 <- read.csv("Airline_Delay_Post_COVID_2021_2023.csv")
df <- Airline_Delay_Post_COVID_2021_2023
I created four new variables:
total_delay_minutes: the sum of all five delay cause categories (carrier, weather, NAS, security, and late aircraft), giving a single comprehensive measure of total delay burden per observation.
delay_pct: the percentage of flights delayed more than 15 minutes, calculated as (arr_del15 / arr_flights) * 100. Using a percentage instead of a raw count avoids unfairly penalizing airlines that simply operate more flights.
avg_carrier_delay: the average delay in minutes per carrier-caused delay event (carrier_delay / carrier_ct). This normalizes carrier delay by frequency so we can compare the severity of each event rather than just how many occurred. Observations where carrier_ct is zero are assigned NA to avoid division by zero.
delay_rate: the proportion of arriving flights delayed 15+ minutes (arr_del15 / arr_flights). This is delay_pct on a 0 to 1 scale; it serves as the response variable in Pair 2 and captures airline reliability as passengers experience it.
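# Derive the four analysis variables from the raw delay columns
df <- df %>%
  mutate(
    # Total minutes of delay across all five cause categories
    total_delay_minutes = carrier_delay + weather_delay +
      nas_delay + security_delay + late_aircraft_delay,
    # Share of arriving flights delayed 15+ minutes, as a percentage
    delay_pct = (arr_del15 / arr_flights) * 100,
    # Minutes per carrier-caused delay event; NA when there were none
    avg_carrier_delay = ifelse(carrier_ct > 0,
                               carrier_delay / carrier_ct,
                               NA_real_),
    # Same share as delay_pct, kept as a proportion (0 to 1)
    delay_rate = arr_del15 / arr_flights
  )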
These new variables make comparisons across airlines fairer: rates and per-event averages do not grow automatically with the number of flights operated, whereas raw counts do.
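To illustrate the division-by-zero guard and the relationship between delay_pct and delay_rate, here is a minimal base-R sketch. The data frame `toy` and all of its values are hypothetical and are not drawn from the airline dataset.

```r
# Toy data (hypothetical values, not from the airline dataset)
toy <- data.frame(
  arr_flights   = c(100, 20, 50),
  arr_del15     = c(25, 4, 0),
  carrier_ct    = c(10, 0, 0),
  carrier_delay = c(450, 0, 0)
)

# Unguarded division: 0 / 0 evaluates to NaN in R
naive <- toy$carrier_delay / toy$carrier_ct

# Guarded division, mirroring the ifelse() pattern used above
guarded <- ifelse(toy$carrier_ct > 0,
                  toy$carrier_delay / toy$carrier_ct,
                  NA_real_)

# delay_pct is delay_rate expressed on a 0-100 scale
delay_rate <- toy$arr_del15 / toy$arr_flights
delay_pct  <- delay_rate * 100

print(naive)
print(guarded)
print(delay_pct)
```

The guard makes the missing value explicit (NA) for rows with no carrier-caused delays instead of leaving a NaN behind.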
Variables for analysis:
Explanatory variable (X): arr_flights - original column
Response variable (Y): total_delay_minutes - created variable
This direction is appropriate because the number of flights logically influences the total amount of delay minutes accumulated.
ggplot(df, aes(x = arr_flights, y = total_delay_minutes)) +
  geom_point(alpha = 0.2, size = 1.2, color = "steelblue") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Number of Arriving Flights vs Total Delay Minutes",
    subtitle = "Each point = one carrier–airport–month observation (2021–2023)",
    x = "Number of Arriving Flights",
    y = "Total Delay Minutes"
  ) +
  theme_minimal(base_size = 15)
The scatterplot shows a strong positive relationship: as the number of arriving flights increases, total delay minutes also increase. The pattern appears approximately linear, supporting the use of Pearson’s correlation coefficient. Variability is slightly higher for large hubs, but the linear trend still holds.
A few high-volume airports have extreme total delay minutes, but they follow the overall trend and do not distort the linear relationship.
r1 <- cor(df$arr_flights, df$total_delay_minutes, use = "complete.obs")
cat("Pearson r (arr_flights vs total_delay_minutes):", round(r1, 4))
## Pearson r (arr_flights vs total_delay_minutes): 0.9004
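The correlation is strong and positive (r ≈ 0.90), confirming the nearly linear pattern observed in the scatterplot. Squaring r gives r² ≈ 0.81, so about 81% of the variation in total delay minutes is linearly associated with flight volume. Total delay minutes therefore track operational volume closely, although a correlation alone does not establish that volume causes the delays.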
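As a sketch of what cor() computes with use = "complete.obs", the example below rebuilds Pearson's r from its definition on a small pair of vectors. The vectors `x` and `y` are hypothetical values chosen only for illustration.

```r
# Hypothetical toy vectors; the fifth pair has a missing x value
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, 9, 10)

# use = "complete.obs" keeps only pairs where both values are present
keep <- complete.cases(x, y)
xc <- x[keep]
yc <- y[keep]

# Pearson r from its definition: covariance over the product of the SDs
r_manual  <- cov(xc, yc) / (sd(xc) * sd(yc))
r_builtin <- cor(x, y, use = "complete.obs")

cat("manual:", round(r_manual, 4), " built-in:", round(r_builtin, 4), "\n")
```

The two values agree, which shows that incomplete pairs are dropped before the coefficient is computed rather than being imputed.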
t.test(df$total_delay_minutes)
##
## One Sample t-test
##
## data: df$total_delay_minutes
## t = 70.298, df = 44876, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 3948.316 4174.800
## sample estimates:
## mean of x
## 4061.558
The 95% confidence interval for mean total delay minutes is approximately 3,948 to 4,175, around a sample mean of about 4,060 minutes per carrier–airport–month observation. Here t.test() is used only for its confidence interval; the accompanying test of whether the mean equals zero is not of interest.
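The interval reported by t.test() can be reproduced by hand as the sample mean plus or minus a t critical value times the standard error. A minimal sketch follows, using a hypothetical vector `x` rather than the airline data.

```r
# Hypothetical delay totals, for illustration only
x <- c(120, 340, 95, 410, 275, 180, 60, 505)

n      <- length(x)
se     <- sd(x) / sqrt(n)          # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # two-sided 95% critical value

# Manual interval: mean +/- t * SE
ci_manual <- mean(x) + c(-1, 1) * t_crit * se

# Interval extracted from the t.test() result object
ci_ttest <- as.numeric(t.test(x)$conf.int)

print(round(ci_manual, 2))
print(round(ci_ttest, 2))
```

Because the standard error shrinks with the square root of n, a sample of roughly 45,000 observations yields a narrow interval even when individual observations vary widely.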
We use the delay_rate variable created above:
Variables are as follows:
Explanatory (X): arr_flights - original column
Response (Y): delay_rate - created variable
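arr_flights is used as X instead of arr_del15 because arr_del15 is the numerator of delay_rate (arr_del15 / arr_flights). Correlating a variable with a ratio built from it would artificially inflate the correlation.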
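The inflation concern can be sketched with a small simulation. Everything below is an assumption made for illustration: the sample size, the range of flight counts, and the Beta distribution for the underlying delay probability are hypothetical, and delay probability is generated independently of flight volume by construction.

```r
set.seed(1)
n <- 5000

# Simulated flight volumes and delay probabilities, independent by design
arr_flights <- sample(10:2000, n, replace = TRUE)
true_rate   <- rbeta(n, 4, 16)   # mean 0.2; an arbitrary illustrative choice

# Delayed flights are a binomial draw from each observation's flights
arr_del15  <- rbinom(n, size = arr_flights, prob = true_rate)
delay_rate <- arr_del15 / arr_flights

# X independent of the ratio's construction: correlation near zero
r_flights <- cor(arr_flights, delay_rate)

# X that is also the ratio's numerator: correlation pushed upward
r_del15 <- cor(arr_del15, delay_rate)

cat("r using arr_flights:", round(r_flights, 3), "\n")
cat("r using arr_del15:  ", round(r_del15, 3), "\n")
```

Even though volume and delay probability are unrelated in this simulation, arr_del15 correlates clearly with delay_rate simply because it appears in the ratio.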
ggplot(df, aes(x = arr_flights, y = delay_rate)) +
  geom_jitter(alpha = 0.2, size = 1.0, color = "darkorange",
              width = 10, height = 0.002) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(
    title = "Number of Arriving Flights vs Delay Rate",
    subtitle = "Each point = one carrier–airport–month observation (2021–2023)",
    x = "Number of Arriving Flights",
    y = "Delay Rate (Proportion of Flights Delayed)"
  ) +
  theme_minimal(base_size = 13)
The scatterplot displays a clear funnel-shaped pattern. For small operations (low arr_flights), delay rates vary widely, ranging from near 0% to well above 50%. In contrast, larger operations cluster within a narrower band, generally around 15–30% delay rates. This indicates that while smaller carrier–airport combinations experience substantial month-to-month variability, larger operations exhibit more stable and consistent delay proportions. Importantly, the plot does not show a systematic increase in delay rate as flight volume grows; instead, variability decreases with scale.
A small number of observations show extremely high delay rates, particularly among low-volume operations. These points occur where a limited number of flights experienced unusually high disruption within a single month. Because they are concentrated among small operations and do not form a broader pattern, they do not alter the overall relationship observed in the plot.
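A funnel shape of this kind is what sampling variability alone would produce, since a proportion based on few flights is noisier than one based on many. The simulation below is a sketch under a deliberately simple assumption: every observation shares the same delay probability of 0.2, and the bin cut-offs (50 and 500 flights) are arbitrary illustrative choices.

```r
set.seed(42)
n <- 6000

# Simulated flight volumes (hypothetical range)
arr_flights <- sample(5:3000, n, replace = TRUE)

# Assumption: one shared delay probability for every observation
arr_del15  <- rbinom(n, size = arr_flights, prob = 0.2)
delay_rate <- arr_del15 / arr_flights

# Group observations into volume bins (cut-offs are arbitrary)
size_bin <- cut(arr_flights,
                breaks = c(0, 50, 500, 3000),
                labels = c("small", "medium", "large"))

# Spread of the delay rate within each bin
sd_by_bin <- tapply(delay_rate, size_bin, sd)
print(round(sd_by_bin, 3))
```

The spread shrinks from the small bin to the large bin even though the underlying probability never changes, so part of the variability among small operations in the real data may reflect small denominators rather than worse performance.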
r2 <- cor(df$arr_flights, df$delay_rate, use = "complete.obs")
cat("Pearson r (arr_flights vs delay_rate):", round(r2, 4))
## Pearson r (arr_flights vs delay_rate): -0.0031
The correlation between arr_flights and delay_rate is essentially zero (r ≈ -0.003), consistent with the scatterplot: there is no linear association between operational scale and delay proportion. Pearson's r measures only linear association, so it does not capture the change in variability seen in the funnel shape. While larger operations accumulate more total delay minutes, they are not systematically more delayed on a per-flight basis. This reinforces that delay_rate is a more meaningful performance measure than raw delay counts.
t.test(df$delay_rate)
##
## One Sample t-test
##
## data: df$delay_rate
## t = 351.3, df = 44864, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.1904073 0.1925439
## sample estimates:
## mean of x
## 0.1914756
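The 95% confidence interval for the mean delay rate is approximately 19.0% to 19.3%. Because this is an unweighted mean across carrier–airport–month observations, it says that the average observation had about one in five flights delayed during 2021–2023; it is not the same as the share of all flights delayed, which would weight each observation by its flight count. The narrow interval reflects the large sample size and provides a stable baseline for evaluating airline performance.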
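The mean of per-observation delay rates and the overall share of flights delayed can differ when observations vary in size. A minimal sketch with two hypothetical observations makes the distinction concrete.

```r
# Two hypothetical observations: one tiny operation, one large one
toy <- data.frame(
  arr_flights = c(10, 1000),
  arr_del15   = c(5, 100)
)
toy$delay_rate <- toy$arr_del15 / toy$arr_flights   # 0.5 and 0.1

# Unweighted mean of the rates: each observation counts equally
mean_of_rates <- mean(toy$delay_rate)

# Pooled rate: each flight counts equally
pooled_rate <- sum(toy$arr_del15) / sum(toy$arr_flights)

# The pooled rate equals the flight-weighted mean of the rates
weighted_rate <- weighted.mean(toy$delay_rate, w = toy$arr_flights)

cat("mean of rates:", round(mean_of_rates, 3), "\n")
cat("pooled rate:  ", round(pooled_rate, 3), "\n")
```

Which figure is appropriate depends on the question: the unweighted mean describes a typical carrier–airport–month, while the pooled rate describes a typical flight.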
This analysis highlights the distinction between operational volume and operational performance. While larger operations generate more total delay minutes, delay proportion remains relatively stable across scale. By combining visualization, correlation analysis, and interval estimation, we obtain a clearer understanding of airline reliability during the post-COVID recovery period.