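library(dplyr)    # %>%, group_by(), summarise(), filter(), slice_head(), pull()
library(ggplot2)  # ggplot() and the geoms used in the charts
Airline_Delay_Post_COVID_2021_2023 <- read.csv("Airline_Delay_Post_COVID_2021_2023.csv")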
names(Airline_Delay_Post_COVID_2021_2023)
## [1] "year" "month" "carrier"
## [4] "carrier_name" "airport" "airport_name"
## [7] "arr_flights" "arr_del15" "carrier_ct"
## [10] "weather_ct" "nas_ct" "security_ct"
## [13] "late_aircraft_ct" "arr_cancelled" "arr_diverted"
## [16] "arr_delay" "carrier_delay" "weather_delay"
## [19] "nas_delay" "security_delay" "late_aircraft_delay"
This analysis examines airline delay data from 2021–2023 to understand factors that influence arrival delays. Airline delays are important because they impact passenger travel time, airline scheduling, and airport efficiency.
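Two statistical methods are used in this analysis: a one-way ANOVA comparing arrival delay across airlines (carrier_name), and a simple linear regression of arrival delay on the number of arriving flights (arr_flights).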
Understanding these relationships may help airlines and airport managers improve scheduling and reduce delays.
The response variable selected is arr_delay, which represents the total arrival delay in minutes. This variable is valuable because arrival delays directly affect passenger travel experience and airline performance.
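The mean of arr_delay is calculated for each airline. Each row of the dataset is a carrier-airport-month record, so this is the average total delay in minutes per record, not per flight, and carriers that operate more flights per record will tend to rank higher.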
avg_delay_airline <- Airline_Delay_Post_COVID_2021_2023 %>%
  group_by(carrier_name) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE))
ggplot(avg_delay_airline, aes(x = reorder(carrier_name, avg_delay), y = avg_delay)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Average Arrival Delay by Airline",
    x = "Airline",
    y = "Average Delay (minutes)"
  ) +
  theme_minimal()
Southwest Airlines Co. has the highest average arrival delay, followed closely by American Airlines Inc. At the other end, Horizon Air shows the lowest average delay. Because these averages are totals per carrier-airport-month record, the gap may reflect smaller flight volumes as well as better on-time performance compared to the larger carriers.
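Because arr_delay is a total for each carrier-airport-month record, its mean depends on how many flights a record covers. The sketch below shows one way to get a per-flight figure instead: divide total delay minutes by total arriving flights for each carrier. It uses base R and a small made-up data frame (`toy`, hypothetical values) that only borrows the dataset's column names.

```r
# Hypothetical toy data with the dataset's column names (values are made up).
toy <- data.frame(
  carrier_name = c("Carrier A", "Carrier A", "Carrier B", "Carrier B"),
  arr_flights  = c(100, 300, 20, 30),
  arr_delay    = c(1000, 5000, 400, 350)
)

# Sum delay minutes and flights per carrier, then divide to get minutes per flight.
totals <- aggregate(cbind(arr_delay, arr_flights) ~ carrier_name, data = toy, FUN = sum)
totals$delay_per_flight <- totals$arr_delay / totals$arr_flights

# Mean of the record-level totals (the quantity charted above), for comparison.
record_means <- tapply(toy$arr_delay, toy$carrier_name, mean)
totals$mean_record_delay <- as.numeric(record_means[as.character(totals$carrier_name)])

print(totals)
```

In this toy case both carriers have the same delay per flight, yet their record-level means differ by a wide margin, which is why the two measures can rank airlines differently.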
Grouped bar charts help compare the major causes of delays across airlines.
delay_causes <- Airline_Delay_Post_COVID_2021_2023 %>%
  group_by(carrier_name) %>%
  summarise(
    carrier = mean(carrier_delay, na.rm = TRUE),
    weather = mean(weather_delay, na.rm = TRUE),
    nas = mean(nas_delay, na.rm = TRUE),
    late_aircraft = mean(late_aircraft_delay, na.rm = TRUE)
  ) %>%
  tidyr::pivot_longer(cols = -carrier_name,
                      names_to = "delay_type",
                      values_to = "delay_minutes")
ggplot(delay_causes,
       aes(x = carrier_name, y = delay_minutes, fill = delay_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Average Delay Causes by Airline",
    x = "Airline",
    y = "Average Delay (minutes)",
    fill = "Delay Type"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Late aircraft delays are the dominant cause for Southwest Airlines Co., while carrier-caused delays are notably high for American Airlines Inc. and JetBlue Airways. Of the four causes plotted (security_delay is not included in the chart), weather delays are consistently the smallest contributor across all airlines, suggesting that most delay accumulation is operationally driven rather than weather-driven.
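The comparison of causes can also be expressed as shares of total delay minutes. The sketch below is a minimal base R illustration on made-up numbers (`toy` is hypothetical). It uses the dataset's five delay columns, including security_delay, which the chart above leaves out.

```r
# Hypothetical toy records (made-up minutes) using the dataset's delay columns.
toy <- data.frame(
  carrier_delay       = c(400, 600),
  weather_delay       = c(50, 50),
  nas_delay           = c(200, 300),
  security_delay      = c(0, 0),
  late_aircraft_delay = c(350, 550)
)

# Total minutes per cause, then each cause as a percentage of all delay minutes.
cause_totals <- colSums(toy, na.rm = TRUE)
cause_share  <- round(100 * cause_totals / sum(cause_totals), 1)

print(sort(cause_share, decreasing = TRUE))
```

Shares sum to 100%, so they are easier to compare across carriers of different sizes than raw average minutes.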
Arrival delays often contain extreme values that create long tails in the distribution. To improve readability, we keep only records below the 95th percentile of arr_delay, which removes the most extreme 5% of values from the histogram.
delay_threshold <- Airline_Delay_Post_COVID_2021_2023 %>%
  filter(arr_delay < quantile(arr_delay, 0.95, na.rm = TRUE))
ggplot(delay_threshold, aes(x = arr_delay)) +
  geom_histogram(bins = 40, fill = "skyblue", color = "black", na.rm = TRUE) +
  scale_x_continuous(breaks = seq(0, max(delay_threshold$arr_delay), by = 5000)) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Distribution of Arrival Delays (Outliers Removed)",
    x = "Arrival Delay (minutes)",
    # Each row is a carrier-airport-month record, not an individual flight.
    y = "Number of Records"
  ) +
  theme_minimal()
The distribution is strongly right-skewed, and the majority of observations fall below 2,500 minutes, with a long tail extending toward 15,000+ minutes. This indicates that while most carrier-airport-month combinations experience moderate delays, a small number of cases accumulate extremely large totals, likely driven by major disruption events.
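To make the 95th-percentile threshold concrete, the sketch below applies the same rule to a short made-up vector (hypothetical values, base R only). It also counts how many records are dropped for being missing and how many for sitting at or above the cutoff.

```r
# Hypothetical arr_delay values (made up), including one NA and one extreme record.
arr_delay <- c(120, 300, 450, 800, 1500, 2200, NA, 95000)

# 95th-percentile cutoff, ignoring missing values.
cutoff <- quantile(arr_delay, 0.95, na.rm = TRUE)

# A comparison with NA yields NA; which() keeps only positions that are TRUE,
# mirroring how dplyr::filter() drops rows whose condition evaluates to NA.
kept <- arr_delay[which(arr_delay < cutoff)]

n_missing <- sum(is.na(arr_delay))                  # dropped because arr_delay is NA
n_above   <- sum(arr_delay >= cutoff, na.rm = TRUE) # dropped as extreme values

cat("cutoff:", unname(cutoff), "| kept:", length(kept),
    "| missing:", n_missing, "| above cutoff:", n_above, "\n")
```

Reporting both counts alongside the histogram makes clear how much of the data the plot omits.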
The categorical explanatory variable selected is carrier_name, which represents the airline carrier.
Different airlines may have different operational efficiencies, which could influence delay times.
The dataset contains more than 10 carriers, so we restrict the ANOVA to the 10 carriers with the highest total flight volume.
top_10_carriers <- Airline_Delay_Post_COVID_2021_2023 %>%
  group_by(carrier_name) %>%
  summarise(total_flights = sum(arr_flights, na.rm = TRUE)) %>%
  arrange(desc(total_flights)) %>%
  slice_head(n = 10) %>%
  pull(carrier_name)
df_anova <- Airline_Delay_Post_COVID_2021_2023 %>%
  filter(carrier_name %in% top_10_carriers)
Null Hypothesis (H₀): The average arrival delay is the same for all airlines.
Alternative Hypothesis (H₁): At least one airline has a different average arrival delay.
anova_model <- aov(arr_delay ~ carrier_name, data = df_anova)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## carrier_name 9 261402506488 29044722943 150.6 <0.0000000000000002 ***
## Residuals 30331 5850012277173 192872384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 27 observations deleted due to missingness
ggplot(df_anova,
       aes(x = carrier_name, y = arr_delay)) +
  geom_boxplot(fill = "lightblue", na.rm = TRUE) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Arrival Delay Distribution by Airline",
    x = "Airline",
    y = "Arrival Delay (minutes)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_cartesian(ylim = c(0, 50000))
The ANOVA output shows an F-statistic of 150.6 with a p-value far below 0.0001. We therefore reject the null hypothesis, and there is sufficient evidence to conclude that at least one airline has a significantly different mean arrival delay compared to the others.
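The F-statistic can be reproduced by hand from the sums of squares and degrees of freedom printed in the ANOVA table. The sketch below does that arithmetic. It also computes eta squared (the share of total variation attributable to carrier), an effect-size measure that is not part of the original output and is added here only as an illustration.

```r
# Values copied from the ANOVA table above.
ss_carrier  <- 261402506488
ss_residual <- 5850012277173
df_carrier  <- 9
df_residual <- 30331

# Mean squares are sums of squares divided by their degrees of freedom.
ms_carrier  <- ss_carrier / df_carrier
ms_residual <- ss_residual / df_residual

# F is the ratio of between-group to within-group mean squares.
f_stat <- ms_carrier / ms_residual

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, df_carrier, df_residual, lower.tail = FALSE)

# Eta squared (added illustration): between-group SS over total SS.
eta_sq <- ss_carrier / (ss_carrier + ss_residual)

cat("F =", round(f_stat, 1), "| eta squared =", round(eta_sq, 3), "\n")
```

The recomputed F matches the reported 150.6. Eta squared comes out at roughly 4%, so although the difference between carriers is statistically significant, carrier identity alone accounts for a small share of the variation in record-level delay totals.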
From the boxplot, Southwest Airlines Co. shows the highest median and widest IQR, indicating both larger and more variable delays. Envoy Air and Endeavor Air Inc. show notably tighter, lower distributions. It would therefore not be safe to assume that all airlines perform equally in terms of delays. Passengers choosing between airlines should consider delay performance as a meaningful differentiator. Airlines with higher average delays should review operational processes such as turnaround times and crew scheduling as areas for improvement.
The continuous explanatory variable selected is arr_flights, representing the number of arriving flights. Higher flight traffic could increase airport congestion and contribute to delays.
lm_model <- lm(arr_delay ~ arr_flights, data = Airline_Delay_Post_COVID_2021_2023)
summary(lm_model)
##
## Call:
## lm(formula = arr_delay ~ arr_flights, data = Airline_Delay_Post_COVID_2021_2023)
##
## Residuals:
## Min 1Q Median 3Q Max
## -130360 -634 -200 259 172886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.30061 26.71593 3.38 0.000725 ***
## arr_flights 12.42247 0.02833 438.47 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5324 on 44875 degrees of freedom
## (34 observations deleted due to missingness)
## Multiple R-squared: 0.8108, Adjusted R-squared: 0.8108
## F-statistic: 1.923e+05 on 1 and 44875 DF, p-value: < 0.00000000000000022
ggplot(Airline_Delay_Post_COVID_2021_2023,
       aes(x = arr_flights, y = arr_delay)) +
  geom_point(alpha = 0.4, na.rm = TRUE) +
  geom_smooth(method = "lm", se = FALSE, color = "red", na.rm = TRUE) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Relationship Between Number of Flights and Arrival Delay",
    x = "Number of Arriving Flights",
    y = "Arrival Delay (minutes)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The scatter plot shows a positive, roughly linear relationship between flight volume and arrival delay, supporting the use of a linear model for the overall trend. Points cluster tightly along the regression line at lower flight volumes (0–5,000 flights) and spread more at higher volumes, suggesting increasing variability as congestion grows. This widening spread points to non-constant residual variance, so the fitted line is a reasonable summary of the trend, but the reported standard errors and p-values should be read with some caution.
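One simple way to check whether the spread really grows with flight volume is to compare the residual standard deviation in the lower and upper halves of the predictor. The sketch below runs that check on simulated data (hypothetical values, with noise deliberately made proportional to flight volume), not on the airline dataset. The same steps could be applied to lm_model and arr_flights.

```r
set.seed(1)

# Simulated, hypothetical data: noise grows in proportion to flight volume.
n       <- 500
flights <- runif(n, min = 10, max = 5000)
delay   <- 90 + 12 * flights + rnorm(n, sd = 0.5 * flights)

fit <- lm(delay ~ flights)
res <- resid(fit)

# Split observations at the median flight volume and compare residual spread.
low     <- flights <= median(flights)
sd_low  <- sd(res[low])
sd_high <- sd(res[!low])
spread_ratio <- sd_high / sd_low

cat("residual SD (low volume): ", round(sd_low), "\n")
cat("residual SD (high volume):", round(sd_high), "\n")
cat("ratio:", round(spread_ratio, 2), "\n")
```

A ratio well above 1 indicates that residuals fan out at higher volumes. In that case, options such as modelling delay per flight or using robust standard errors may be worth considering.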
Intercept (90.3 minutes): When the number of arriving flights is zero, the model predicts a baseline arrival delay of approximately 90.3 minutes. A record with zero arriving flights cannot accumulate arrival delay, so this value is best read as an offset that positions the fitted line rather than as a physically meaningful quantity.
Slope (12.42 minutes per flight): Each additional arriving flight in a carrier-airport-month record is associated with approximately 12.42 more minutes of total arrival delay on average. This indicates that records with more arriving flights accumulate more total delay. On its own, it does not show that individual flights are delayed longer at busier airports.
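The fitted equation can be used directly to estimate total delay for a given flight count. The sketch below plugs the reported coefficients into a small helper (`predict_total_delay` is a hypothetical name introduced here for illustration) and evaluates it at two example volumes.

```r
# Coefficients copied from the regression output above.
intercept <- 90.30061
slope     <- 12.42247

# Hypothetical helper: predicted total arrival delay (minutes) for a record
# with n_flights arriving flights, using the fitted line.
predict_total_delay <- function(n_flights) {
  intercept + slope * n_flights
}

example_flights <- c(100, 1000)
predicted <- predict_total_delay(example_flights)

print(data.frame(arr_flights = example_flights,
                 predicted_arr_delay = round(predicted, 1)))
```

With the model object available, predict(lm_model, newdata = data.frame(arr_flights = c(100, 1000))) should give the same values up to rounding of the coefficients.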
R² = 0.81: The model explains 81% of the variance in total arrival delay using flight volume alone, a strong result for a single-predictor model. The remaining 19% is not explained by flight volume and may reflect other factors such as weather, NAS congestion, and carrier-specific issues.
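For a regression with a single predictor, R², the F-statistic, and the slope's t value are linked, which gives a quick consistency check on the printed output. The sketch below uses only numbers copied from the summary above. Small discrepancies are expected because the printed values are rounded.

```r
# Values copied from the regression summary above.
r_squared <- 0.8108
df_resid  <- 44875
t_slope   <- 438.47

# With one predictor: F = R^2 / (1 - R^2) * residual df, and F = t^2.
f_from_r2 <- r_squared / (1 - r_squared) * df_resid
f_from_t  <- t_slope^2

# The correlation between arr_flights and arr_delay is the square root of R^2
# (positive here because the slope is positive).
r <- sqrt(r_squared)

cat("F from R-squared:", signif(f_from_r2, 4), "\n")
cat("F from t value:  ", signif(f_from_t, 4), "\n")
cat("correlation r:   ", round(r, 3), "\n")
```

Both routes land close to the reported F-statistic of 1.923e+05, and the implied correlation is about 0.90.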
Recommendation: Airlines and airport planners could use the slope (~12.4 minutes per flight) as a congestion benchmark. Adding flights to an already busy route carries a predictable delay cost, and airports approaching high flight-volume thresholds should consider introducing scheduling buffers or staggered departure windows to prevent delay accumulation.
Future analysis could investigate:
These factors could provide deeper insights into the causes of airline delays.