Research Question

Is there an association between the primary cause of flight delays and the year (2019 vs. 2020)?

Dataset Overview

Dataset: airline_delay (OpenIntro)

Dimensions: 3,351 rows × 21 variables

Cases: Each row represents delay summary data for a specific airline at a specific U.S. airport in December 2019 or December 2020

Variables:

- year: 2019 or 2020
- carrier_ct, weather_ct, nas_ct, security_ct, late_aircraft_ct: count of delays by cause
- main_cause: new categorical variable for the primary cause of delay (the cause with the highest count)

Data Analysis

To explore whether the main causes of flight delays changed between 2019 and 2020, I will first create a new variable, main_cause, identifying the delay type with the highest count per observation. I will then summarize and visualize the distribution of causes by year using bar plots.

Chunk 1 - Load the data + explore

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)


getwd()
## [1] "C:/Users/vick_/OneDrive/Área de Trabalho/Project2"
airline_delay <- read.csv("airline_delay.csv")

summary(airline_delay)
##       year          month      carrier          carrier_name      
##  Min.   :2019   Min.   :12   Length:3351        Length:3351       
##  1st Qu.:2019   1st Qu.:12   Class :character   Class :character  
##  Median :2019   Median :12   Mode  :character   Mode  :character  
##  Mean   :2019   Mean   :12                                        
##  3rd Qu.:2020   3rd Qu.:12                                        
##  Max.   :2020   Max.   :12                                        
##                                                                   
##    airport          airport_name        arr_flights        arr_del15   
##  Length:3351        Length:3351        Min.   :    1.0   Min.   :   0  
##  Class :character   Class :character   1st Qu.:   35.0   1st Qu.:   5  
##  Mode  :character   Mode  :character   Median :   83.0   Median :  12  
##                                        Mean   :  298.3   Mean   :  51  
##                                        3rd Qu.:  194.5   3rd Qu.:  33  
##                                        Max.   :19713.0   Max.   :2289  
##                                        NA's   :8         NA's   :8     
##    carrier_ct       weather_ct         nas_ct         security_ct     
##  Min.   :  0.00   Min.   : 0.000   Min.   :   0.00   Min.   : 0.0000  
##  1st Qu.:  1.49   1st Qu.: 0.000   1st Qu.:   0.82   1st Qu.: 0.0000  
##  Median :  4.75   Median : 0.060   Median :   2.98   Median : 0.0000  
##  Mean   : 16.07   Mean   : 1.443   Mean   :  16.18   Mean   : 0.1373  
##  3rd Qu.: 12.26   3rd Qu.: 1.010   3rd Qu.:   8.87   3rd Qu.: 0.0000  
##  Max.   :697.00   Max.   :89.420   Max.   :1039.54   Max.   :17.3100  
##  NA's   :8        NA's   :8        NA's   :8         NA's   :8        
##  late_aircraft_ct arr_cancelled      arr_diverted       arr_delay     
##  Min.   :  0.00   Min.   :  0.000   Min.   : 0.0000   Min.   :     0  
##  1st Qu.:  0.90   1st Qu.:  0.000   1st Qu.: 0.0000   1st Qu.:   230  
##  Median :  3.28   Median :  0.000   Median : 0.0000   Median :   746  
##  Mean   : 17.17   Mean   :  2.885   Mean   : 0.5758   Mean   :  3334  
##  3rd Qu.: 10.24   3rd Qu.:  2.000   3rd Qu.: 0.0000   3rd Qu.:  2096  
##  Max.   :819.66   Max.   :224.000   Max.   :42.0000   Max.   :160383  
##  NA's   :8        NA's   :8         NA's   :8         NA's   :8       
##  carrier_delay     weather_delay       nas_delay       security_delay   
##  Min.   :    0.0   Min.   :    0.0   Min.   :    0.0   Min.   :  0.000  
##  1st Qu.:   68.5   1st Qu.:    0.0   1st Qu.:   21.5   1st Qu.:  0.000  
##  Median :  272.0   Median :    3.0   Median :  106.0   Median :  0.000  
##  Mean   : 1144.8   Mean   :  177.6   Mean   :  749.6   Mean   :  5.401  
##  3rd Qu.:  830.5   3rd Qu.:   82.0   3rd Qu.:  362.0   3rd Qu.:  0.000  
##  Max.   :55215.0   Max.   :14219.0   Max.   :82064.0   Max.   :553.000  
##  NA's   :8         NA's   :8         NA's   :8         NA's   :8        
##  late_aircraft_delay
##  Min.   :    0      
##  1st Qu.:   31      
##  Median :  205      
##  Mean   : 1257      
##  3rd Qu.:  724      
##  Max.   :75179      
##  NA's   :8
str(airline_delay)
## 'data.frame':    3351 obs. of  21 variables:
##  $ year               : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ month              : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ carrier            : chr  "9E" "9E" "9E" "9E" ...
##  $ carrier_name       : chr  "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." ...
##  $ airport            : chr  "ABE" "ABY" "AEX" "AGS" ...
##  $ airport_name       : chr  "Allentown/Bethlehem/Easton, PA: Lehigh Valley International" "Albany, GA: Southwest Georgia Regional" "Alexandria, LA: Alexandria International" "Augusta, GA: Augusta Regional at Bush Field" ...
##  $ arr_flights        : int  44 90 88 184 76 5985 142 147 84 150 ...
##  $ arr_del15          : int  3 1 8 9 11 445 14 10 14 19 ...
##  $ carrier_ct         : num  1.63 0.96 5.75 4.17 4.78 ...
##  $ weather_ct         : num  0 0 0 0 0 ...
##  $ nas_ct             : num  0.12 0.04 1.6 1.83 5.22 ...
##  $ security_ct        : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ late_aircraft_ct   : num  1.25 0 0.65 3 1 ...
##  $ arr_cancelled      : int  0 0 0 0 1 5 1 0 1 3 ...
##  $ arr_diverted       : int  1 0 1 0 0 0 0 1 1 0 ...
##  $ arr_delay          : int  89 23 338 508 692 30756 436 1070 2006 846 ...
##  $ carrier_delay      : int  56 22 265 192 398 16390 162 838 1164 423 ...
##  $ weather_delay      : int  0 0 0 0 0 1509 0 141 619 0 ...
##  $ nas_delay          : int  3 1 45 92 178 5060 182 24 223 389 ...
##  $ security_delay     : int  0 0 0 0 0 16 0 0 0 0 ...
##  $ late_aircraft_delay: int  30 0 28 224 116 7781 92 67 0 34 ...

Chunk 2 - Create new variable main_cause

This chunk creates a new variable, main_cause, that records the primary cause of delay for each row in the dataset: the cause with the highest delay count.

airline_delay <- airline_delay %>%
  mutate(main_cause = case_when(
    # Branches are checked in order, so a tie goes to the first cause listed.
    carrier_ct >= weather_ct & carrier_ct >= nas_ct & carrier_ct >= security_ct & carrier_ct >= late_aircraft_ct ~ "Carrier",
    weather_ct >= carrier_ct & weather_ct >= nas_ct & weather_ct >= security_ct & weather_ct >= late_aircraft_ct ~ "Weather",
    nas_ct >= carrier_ct & nas_ct >= weather_ct & nas_ct >= security_ct & nas_ct >= late_aircraft_ct ~ "NAS",
    security_ct >= carrier_ct & security_ct >= weather_ct & security_ct >= nas_ct & security_ct >= late_aircraft_ct ~ "Security",
    # Fallback: late aircraft has the highest count. Rows with missing counts
    # also land here, because none of the conditions above evaluates to TRUE.
    TRUE ~ "Late Aircraft"
  ))

print(head(airline_delay))
##   year month carrier      carrier_name airport
## 1 2020    12      9E Endeavor Air Inc.     ABE
## 2 2020    12      9E Endeavor Air Inc.     ABY
## 3 2020    12      9E Endeavor Air Inc.     AEX
## 4 2020    12      9E Endeavor Air Inc.     AGS
## 5 2020    12      9E Endeavor Air Inc.     ALB
## 6 2020    12      9E Endeavor Air Inc.     ATL
##                                                  airport_name arr_flights
## 1 Allentown/Bethlehem/Easton, PA: Lehigh Valley International          44
## 2                      Albany, GA: Southwest Georgia Regional          90
## 3                    Alexandria, LA: Alexandria International          88
## 4                 Augusta, GA: Augusta Regional at Bush Field         184
## 5                            Albany, NY: Albany International          76
## 6       Atlanta, GA: Hartsfield-Jackson Atlanta International        5985
##   arr_del15 carrier_ct weather_ct nas_ct security_ct late_aircraft_ct
## 1         3       1.63       0.00   0.12           0             1.25
## 2         1       0.96       0.00   0.04           0             0.00
## 3         8       5.75       0.00   1.60           0             0.65
## 4         9       4.17       0.00   1.83           0             3.00
## 5        11       4.78       0.00   5.22           0             1.00
## 6       445     142.89      11.96 161.37           1           127.79
##   arr_cancelled arr_diverted arr_delay carrier_delay weather_delay nas_delay
## 1             0            1        89            56             0         3
## 2             0            0        23            22             0         1
## 3             0            1       338           265             0        45
## 4             0            0       508           192             0        92
## 5             1            0       692           398             0       178
## 6             5            0     30756         16390          1509      5060
##   security_delay late_aircraft_delay main_cause
## 1              0                  30    Carrier
## 2              0                   0    Carrier
## 3              0                  28    Carrier
## 4              0                 224    Carrier
## 5              0                 116        NAS
## 6             16                7781        NAS
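One caveat with the case_when() call in this chunk: a row with missing counts satisfies none of the comparisons, so it falls through to the final branch and is labelled "Late Aircraft". The summary() output reports 8 NA values in each count column, so a small number of rows may be affected. A minimal base R sketch of one way to keep such rows as NA instead is shown here. The toy data frame and the labels vector are made up for illustration and are not part of the original analysis.

```r
# Toy stand-in for the five delay-count columns (values are illustrative only).
toy <- data.frame(
  carrier_ct       = c(1.63, 4.78, NA),
  weather_ct       = c(0.00, 0.00, NA),
  nas_ct           = c(0.12, 5.22, NA),
  security_ct      = c(0.00, 0.00, NA),
  late_aircraft_ct = c(1.25, 1.00, NA)
)

# Labels in the same order as the columns above.
labels <- c("Carrier", "Weather", "NAS", "Security", "Late Aircraft")

counts   <- as.matrix(toy)
complete <- complete.cases(counts)  # TRUE where no count is missing

# Start with NA everywhere, then fill in only the complete rows.
# ties.method = "first" mirrors the top-to-bottom order of case_when().
main_cause <- rep(NA_character_, nrow(counts))
main_cause[complete] <- labels[
  max.col(counts[complete, , drop = FALSE], ties.method = "first")
]

toy$main_cause <- main_cause
print(toy)
```

Because table() drops NA values by default, rows labelled this way would be left out of a contingency table rather than counted under "Late Aircraft".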

Chunk 3 - Summarize counts by year and main_cause

This chunk counts the number of rows for each combination of year and main_cause, which gives the frequencies used in the plot below.

delay_summary <- airline_delay %>%
  group_by(year, main_cause) %>%
  summarise(count = n())
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

Chunk 4 - Visualization

ggplot(delay_summary, aes(x = main_cause, y = count, fill = factor(year))) +
  geom_col(position = "dodge") +
  labs(title = "Main Causes of Flight Delays by Year",
       x = "Delay Cause", y = "Count of Records",
       fill = "Year") +
  theme_minimal()

Statistical Analysis

H0: There is no association between the year and the primary cause of flight delays.

Ha: There is an association between the year and the primary cause of flight delays.

Test: Chi-Squared Test of Independence
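For reference, the standard form of the test statistic is given below. Here O_ij is the observed count in row i and column j of the contingency table, E_ij is the expected count under independence, R_i and C_j are the row and column totals, and n is the total number of records. These symbols are notation introduced here, not taken from the original analysis. With 2 years and 5 delay causes, the degrees of freedom come out to 4.

```latex
\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{n},
\qquad
df = (r - 1)(c - 1) = (2 - 1)(5 - 1) = 4
\]
```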

Chunk 1 - Contingency Table

cause_table <- table(airline_delay$year, airline_delay$main_cause)
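The table is stored but not printed. A short sketch of how it could be inspected is shown here, using a made-up 2 × 5 table in place of cause_table (the counts are illustrative, not the real data). Row proportions are useful because the two years need not contain the same number of records.

```r
# Made-up counts standing in for cause_table; NOT the real data.
toy_table <- as.table(matrix(
  c(120, 90, 60, 2, 30,
    150, 40, 80, 1, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    year = c("2019", "2020"),
    main_cause = c("Carrier", "Late Aircraft", "NAS", "Security", "Weather")
  )
))

# Raw counts with row and column totals added.
print(addmargins(toy_table))

# Share of each cause within a year (each row sums to 1).
row_props <- prop.table(toy_table, margin = 1)
print(round(row_props, 3))
```

Applied to the real data, the same two calls on cause_table would show which causes gained or lost share between 2019 and 2020.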

Chunk 2 - Chi-Squared Test

chi_result <- chisq.test(cause_table)
## Warning in chisq.test(cause_table): Chi-squared approximation may be incorrect

Visualize Results

chi_result
## 
##  Pearson's Chi-squared test
## 
## data:  cause_table
## X-squared = 442.84, df = 4, p-value < 2.2e-16
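The warning from chisq.test() typically appears when one or more expected cell counts are small (a common rule of thumb is below 5). A hedged sketch of how this could be checked, and how a Monte Carlo p-value could be requested instead, is shown here on a made-up table. The counts, the seed, and B = 2000 are illustrative choices, not results from the airline data.

```r
# Made-up counts standing in for cause_table; NOT the real data.
toy_table <- as.table(matrix(
  c(120, 90, 60, 2, 30,
    150, 40, 80, 1, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    year = c("2019", "2020"),
    main_cause = c("Carrier", "Late Aircraft", "NAS", "Security", "Weather")
  )
))

# Expected counts under independence; small values trigger the warning.
res <- suppressWarnings(chisq.test(toy_table))
print(round(res$expected, 2))
small_cells <- sum(res$expected < 5)
cat("Cells with expected count below 5:", small_cells, "\n")

# Monte Carlo p-value, which does not rely on the chi-squared approximation.
set.seed(1)
sim <- chisq.test(toy_table, simulate.p.value = TRUE, B = 2000)
print(sim$p.value)
```

On the real data, the equivalent check would be chi_result$expected. If only a rare category such as Security has small expected counts, a simulated p-value is one way to confirm that the conclusion does not depend on the approximation.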

Results Interpreted

The chi-squared test produced a statistic of χ² = 442.84 with 4 degrees of freedom and a p-value < 2.2e-16. Using a significance level of α = 0.05, the p-value is far below α, so we reject the null hypothesis. This indicates a statistically significant association between the year and the primary cause of flight delays. R also warned that the chi-squared approximation may be incorrect, which usually means that some expected cell counts are small, so the approximation should be treated with some caution even though the statistic is very large. In practice, the result means that the distribution of main delay causes differed between December 2019 and December 2020, plausibly reflecting the impact of the COVID-19 pandemic on airline operations and air traffic patterns, although the test itself cannot establish the reason. The contingency table (cause_table) shows which delay types became more or less common across the two years.

Conclusion and Future Directions

The analysis revealed a statistically significant association between the year and the primary cause of flight delays. In 2019, delays were more often attributed to air carrier and NAS issues, whereas in 2020, weather-related delays and late aircraft became more prominent. This shift plausibly reflects the impact of the COVID-19 pandemic, which altered air traffic patterns, airline operations, and passenger demand. These observations are consistent with the Chi-Squared Test of Independence, which indicated that the differences between years are unlikely to be due to chance alone. More broadly, the results illustrate how external factors, such as global events or policy changes, can influence the pattern of operational delays in the airline industry.

For future research, it would be valuable to examine monthly or seasonal trends over multiple years to identify longer-term patterns, and to compare domestic and international airports to assess whether certain types of delays are more sensitive to external disruptions. Incorporating other factors such as airport size, flight volume, or airline-specific operational policies could also provide a deeper understanding of the drivers behind these changes.

References

OpenIntro. airline_delay. OpenIntro, https://www.openintro.org/data/index.php?data=airline_delay. Accessed 10 Nov. 2025.