Is there an association between the primary cause of flight delays and the year (2019 vs. 2020)?
Dataset: airline delay
Dimensions: 3,351 rows × 21 variables
Cases: Each row represents delay summary data for a specific airline at a specific U.S. airport in December 2019 or December 2020
Variables:
- year: 2019 or 2020
- counts of delays by cause: carrier_ct, weather_ct, nas_ct, security_ct, late_aircraft_ct
- main_cause: new categorical variable for the primary cause of delay (the cause with the highest count)
To explore whether the main causes of flight delays changed between 2019 and 2020, I will first create a new variable, main_cause, identifying the delay type with the highest count per observation. I will then summarize and visualize the distribution of causes by year using bar plots.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
getwd()
## [1] "C:/Users/vick_/OneDrive/Área de Trabalho/Project2"
airline_delay <- read.csv("airline_delay.csv")
summary(airline_delay)
## year month carrier carrier_name
## Min. :2019 Min. :12 Length:3351 Length:3351
## 1st Qu.:2019 1st Qu.:12 Class :character Class :character
## Median :2019 Median :12 Mode :character Mode :character
## Mean :2019 Mean :12
## 3rd Qu.:2020 3rd Qu.:12
## Max. :2020 Max. :12
##
## airport airport_name arr_flights arr_del15
## Length:3351 Length:3351 Min. : 1.0 Min. : 0
## Class :character Class :character 1st Qu.: 35.0 1st Qu.: 5
## Mode :character Mode :character Median : 83.0 Median : 12
## Mean : 298.3 Mean : 51
## 3rd Qu.: 194.5 3rd Qu.: 33
## Max. :19713.0 Max. :2289
## NA's :8 NA's :8
## carrier_ct weather_ct nas_ct security_ct
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 1.49 1st Qu.: 0.000 1st Qu.: 0.82 1st Qu.: 0.0000
## Median : 4.75 Median : 0.060 Median : 2.98 Median : 0.0000
## Mean : 16.07 Mean : 1.443 Mean : 16.18 Mean : 0.1373
## 3rd Qu.: 12.26 3rd Qu.: 1.010 3rd Qu.: 8.87 3rd Qu.: 0.0000
## Max. :697.00 Max. :89.420 Max. :1039.54 Max. :17.3100
## NA's :8 NA's :8 NA's :8 NA's :8
## late_aircraft_ct arr_cancelled arr_diverted arr_delay
## Min. : 0.00 Min. : 0.000 Min. : 0.0000 Min. : 0
## 1st Qu.: 0.90 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 230
## Median : 3.28 Median : 0.000 Median : 0.0000 Median : 746
## Mean : 17.17 Mean : 2.885 Mean : 0.5758 Mean : 3334
## 3rd Qu.: 10.24 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 2096
## Max. :819.66 Max. :224.000 Max. :42.0000 Max. :160383
## NA's :8 NA's :8 NA's :8 NA's :8
## carrier_delay weather_delay nas_delay security_delay
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.000
## 1st Qu.: 68.5 1st Qu.: 0.0 1st Qu.: 21.5 1st Qu.: 0.000
## Median : 272.0 Median : 3.0 Median : 106.0 Median : 0.000
## Mean : 1144.8 Mean : 177.6 Mean : 749.6 Mean : 5.401
## 3rd Qu.: 830.5 3rd Qu.: 82.0 3rd Qu.: 362.0 3rd Qu.: 0.000
## Max. :55215.0 Max. :14219.0 Max. :82064.0 Max. :553.000
## NA's :8 NA's :8 NA's :8 NA's :8
## late_aircraft_delay
## Min. : 0
## 1st Qu.: 31
## Median : 205
## Mean : 1257
## 3rd Qu.: 724
## Max. :75179
## NA's :8
str(airline_delay)
## 'data.frame': 3351 obs. of 21 variables:
## $ year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ month : int 12 12 12 12 12 12 12 12 12 12 ...
## $ carrier : chr "9E" "9E" "9E" "9E" ...
## $ carrier_name : chr "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." ...
## $ airport : chr "ABE" "ABY" "AEX" "AGS" ...
## $ airport_name : chr "Allentown/Bethlehem/Easton, PA: Lehigh Valley International" "Albany, GA: Southwest Georgia Regional" "Alexandria, LA: Alexandria International" "Augusta, GA: Augusta Regional at Bush Field" ...
## $ arr_flights : int 44 90 88 184 76 5985 142 147 84 150 ...
## $ arr_del15 : int 3 1 8 9 11 445 14 10 14 19 ...
## $ carrier_ct : num 1.63 0.96 5.75 4.17 4.78 ...
## $ weather_ct : num 0 0 0 0 0 ...
## $ nas_ct : num 0.12 0.04 1.6 1.83 5.22 ...
## $ security_ct : num 0 0 0 0 0 1 0 0 0 0 ...
## $ late_aircraft_ct : num 1.25 0 0.65 3 1 ...
## $ arr_cancelled : int 0 0 0 0 1 5 1 0 1 3 ...
## $ arr_diverted : int 1 0 1 0 0 0 0 1 1 0 ...
## $ arr_delay : int 89 23 338 508 692 30756 436 1070 2006 846 ...
## $ carrier_delay : int 56 22 265 192 398 16390 162 838 1164 423 ...
## $ weather_delay : int 0 0 0 0 0 1509 0 141 619 0 ...
## $ nas_delay : int 3 1 45 92 178 5060 182 24 223 389 ...
## $ security_delay : int 0 0 0 0 0 16 0 0 0 0 ...
## $ late_aircraft_delay: int 30 0 28 224 116 7781 92 67 0 34 ...
This chunk creates a new variable, main_cause, that records the primary cause of delay for each row: the cause with the highest delay count. The conditions are checked in order, so a tie goes to the first cause listed.
airline_delay <- airline_delay %>%
  mutate(main_cause = case_when(
    # Each branch asks whether one cause is at least as large as all the others.
    carrier_ct >= weather_ct & carrier_ct >= nas_ct & carrier_ct >= security_ct & carrier_ct >= late_aircraft_ct ~ "Carrier",
    weather_ct >= carrier_ct & weather_ct >= nas_ct & weather_ct >= security_ct & weather_ct >= late_aircraft_ct ~ "Weather",
    nas_ct >= carrier_ct & nas_ct >= weather_ct & nas_ct >= security_ct & nas_ct >= late_aircraft_ct ~ "NAS",
    security_ct >= carrier_ct & security_ct >= weather_ct & security_ct >= nas_ct & security_ct >= late_aircraft_ct ~ "Security",
    # Catch-all: rows where late_aircraft_ct is strictly largest.
    # Rows with NA counts also land here, because NA comparisons match no earlier branch.
    TRUE ~ "Late Aircraft"
  ))
print(head(airline_delay))
## year month carrier carrier_name airport
## 1 2020 12 9E Endeavor Air Inc. ABE
## 2 2020 12 9E Endeavor Air Inc. ABY
## 3 2020 12 9E Endeavor Air Inc. AEX
## 4 2020 12 9E Endeavor Air Inc. AGS
## 5 2020 12 9E Endeavor Air Inc. ALB
## 6 2020 12 9E Endeavor Air Inc. ATL
## airport_name arr_flights
## 1 Allentown/Bethlehem/Easton, PA: Lehigh Valley International 44
## 2 Albany, GA: Southwest Georgia Regional 90
## 3 Alexandria, LA: Alexandria International 88
## 4 Augusta, GA: Augusta Regional at Bush Field 184
## 5 Albany, NY: Albany International 76
## 6 Atlanta, GA: Hartsfield-Jackson Atlanta International 5985
## arr_del15 carrier_ct weather_ct nas_ct security_ct late_aircraft_ct
## 1 3 1.63 0.00 0.12 0 1.25
## 2 1 0.96 0.00 0.04 0 0.00
## 3 8 5.75 0.00 1.60 0 0.65
## 4 9 4.17 0.00 1.83 0 3.00
## 5 11 4.78 0.00 5.22 0 1.00
## 6 445 142.89 11.96 161.37 1 127.79
## arr_cancelled arr_diverted arr_delay carrier_delay weather_delay nas_delay
## 1 0 1 89 56 0 3
## 2 0 0 23 22 0 1
## 3 0 1 338 265 0 45
## 4 0 0 508 192 0 92
## 5 1 0 692 398 0 178
## 6 5 0 30756 16390 1509 5060
## security_delay late_aircraft_delay main_cause
## 1 0 30 Carrier
## 2 0 0 Carrier
## 3 0 28 Carrier
## 4 0 224 Carrier
## 5 0 116 NAS
## 6 16 7781 NAS
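The classification above sends rows with missing counts to "Late Aircraft" and rows with no delays at all to "Carrier". A minimal sketch of one alternative, assuming such rows should instead be left unclassified, is shown below. The `toy` data frame, `cause_labels`, and the "no delays means NA" rule are illustrative assumptions, not part of the original analysis.

```r
# Toy data (illustrative values only, not read from airline_delay.csv)
toy <- data.frame(
  carrier_ct       = c(1.63, 4.78, 0, NA),
  weather_ct       = c(0.00, 0.00, 0, NA),
  nas_ct           = c(0.12, 5.22, 0, NA),
  security_ct      = c(0.00, 0.00, 0, NA),
  late_aircraft_ct = c(1.25, 1.00, 0, NA)
)

# Labels in the same order as the count columns (hypothetical helper vector)
cause_labels <- c("Carrier", "Weather", "NAS", "Security", "Late Aircraft")
counts <- as.matrix(toy[, c("carrier_ct", "weather_ct", "nas_ct",
                            "security_ct", "late_aircraft_ct")])

# Assumption: rows with any missing count, or with zero delays overall,
# are left as NA rather than assigned to a category by default.
usable <- complete.cases(counts) & rowSums(counts) > 0

toy$main_cause <- NA_character_
# max.col() returns the column index of each row's maximum;
# ties.method = "first" keeps the left-to-right tie order used above.
toy$main_cause[usable] <- cause_labels[
  max.col(counts[usable, , drop = FALSE], ties.method = "first")
]

print(toy$main_cause)
```

Because `table()` drops NA values by default, rows left unclassified this way would be excluded from a contingency table built on main_cause.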
This chunk counts the records for each combination of year and main_cause, then plots those counts as side-by-side bars.
delay_summary <- airline_delay %>%
  group_by(year, main_cause) %>%
  # .groups = "drop" returns an ungrouped result and avoids the grouping message
  summarise(count = n(), .groups = "drop")
ggplot(delay_summary, aes(x = main_cause, y = count, fill = factor(year))) +
  geom_col(position = "dodge") +
  labs(title = "Main Causes of Flight Delays by Year",
       x = "Delay Cause", y = "Count of Records",
       fill = "Year") +
  theme_minimal()
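The bar plot compares raw record counts, which can be hard to read if the two years contribute different numbers of rows. A minimal sketch of converting a year-by-cause table to within-year shares with `prop.table()` follows. The counts in `toy_table` are made up for illustration and are not the real data.

```r
# Toy counts (illustrative only, not taken from airline_delay)
toy_table <- matrix(
  c(50, 30, 20,
    10, 20, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(year = c("2019", "2020"),
                  main_cause = c("Carrier", "Late Aircraft", "NAS"))
)

# margin = 1 divides each row by its row total,
# so every year's shares sum to 1.
year_shares <- prop.table(toy_table, margin = 1)
print(round(year_shares, 2))
```

On the real data, the same call could be applied to `table(airline_delay$year, airline_delay$main_cause)`.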
### Statistical Analysis
H0: There is no association between the year and the primary cause of flight delays.
Ha: There is an association between the year and the primary cause of flight delays.
cause_table <- table(airline_delay$year, airline_delay$main_cause)
chi_result <- chisq.test(cause_table)
## Warning in chisq.test(cause_table): Chi-squared approximation may be incorrect
chi_result
##
## Pearson's Chi-squared test
##
## data: cause_table
## X-squared = 442.84, df = 4, p-value < 2.2e-16
The chi-squared test produced a statistic of χ² = 442.84 with 4 degrees of freedom and a p-value < 2.2e-16. Using a significance level of α = 0.05, the p-value is far below α, so we reject the null hypothesis. This indicates a statistically significant association between the year and the primary cause of flight delays. R also warned that the chi-squared approximation may be incorrect, which usually means at least one expected cell count is small, so the p-value should be treated as approximate. In practice, the result means that the distribution of main delay causes shifted from 2019 to 2020, plausibly reflecting the impact of the COVID-19 pandemic on airline operations and air traffic patterns. The contingency table also shows which delay types increased or decreased, directly linking the observed frequencies to the significant test result.
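The warning printed by `chisq.test()` can be followed up by inspecting the expected counts and, if some are small, by requesting a Monte Carlo p-value. The sketch below shows both steps on a made-up table with one sparse column. `toy_table` and its values are illustrative assumptions, and the choice of B = 10000 replicates is arbitrary.

```r
# Toy table (illustrative only) with one sparse category
toy_table <- matrix(
  c(120,  90, 2,
     60, 130, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(year = c("2019", "2020"),
                  main_cause = c("Carrier", "NAS", "Security"))
)

# Standard test; the small-expected-count warning is suppressed here
# only because we inspect the expected counts directly.
chi_toy <- suppressWarnings(chisq.test(toy_table))
print(round(chi_toy$expected, 2))
small_cells <- sum(chi_toy$expected < 5)  # cells below the usual rule of thumb

# Monte Carlo p-value, which does not rely on the chi-squared approximation
set.seed(1)
chi_sim <- chisq.test(toy_table, simulate.p.value = TRUE, B = 10000)
print(chi_sim$p.value)

# Standardized residuals: large absolute values flag the cells
# that contribute most to the association.
print(round(chi_toy$stdres, 2))
```

On the real data, the same three steps could be run on `cause_table`. With a statistic as large as the one reported above, the simulated p-value would be expected to lead to the same decision, though that has not been checked here.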
The analysis revealed a statistically significant association between the year and the primary cause of flight delays. In 2019, delays were more often caused by air carrier and NAS issues, whereas in 2020, weather-related delays and late aircraft became more prominent. This shift likely reflects the impact of the COVID-19 pandemic, which altered air traffic patterns, airline operations, and passenger demand. These findings directly support the results of the Chi-Squared Test of Independence, which indicated that the differences observed were unlikely to occur by chance.
These results have important implications for understanding how external factors, such as global events or policy changes, can influence the patterns of operational delays in the airline industry.
For future research, it would be valuable to examine monthly or seasonal trends over multiple years to identify longer-term patterns, as well as to compare domestic and international airports to assess whether certain types of delays are more sensitive to external disruptions. Additionally, incorporating other factors such as airport size, flight volume, or airline-specific operational policies could provide a deeper understanding of the drivers behind these changes.
OpenIntro. airline_delay. OpenIntro, 2025, https://www.openintro.org/data/index.php?data=airline_delay. Accessed 10 Nov. 2025.