Is there sufficient statistical evidence, at the 5% significance level, to conclude that the average number of flight delays caused by the National Aviation System is lower than the average number caused by Air Carriers?
The dataset I am using to answer my research question is called Airline Delays. It contains 3,351 rows and 21 variables and describes flight delays across U.S. airports and airlines in December 2019 and December 2020. Each row represents a specific airline at a specific airport. The dataset includes variables such as the year and month, the airline name, and the airport where the flights arrived. The variables I am focusing on are “carrier_ct”, the number of flights delayed for reasons within the air carrier’s control (for example, crew or maintenance problems), and “nas_ct”, the number of flights delayed because of the National Aviation System, often due to heavy air traffic. The dataset was obtained from the Bureau of Transportation Statistics, which collects and publishes data on airline on-time performance across U.S. airports (U.S. Department of Transportation).
For this project, I began my data analysis by checking the structure of the dataset to see which class each variable belongs to. I then used the summary() function to obtain the five-number summary (minimum, first quartile, median, third quartile, and maximum) and the mean for my variables, which helped me gain a better understanding of the dataset.
Next, I created a new dataset containing only the variables relevant to my research question using the select() function. I used the colSums(is.na()) function to determine how many missing values (NAs) were present. I noticed that the number of NAs was the same across several columns, which led me to discover that all the NAs appeared in the same rows. Since those rows would not provide useful information for my analysis, I decided to omit them from the dataset.
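The check that all NAs fall in the same rows can be sketched as follows. This is a minimal illustration on made-up data, not the real file: the data frame toy and the positions of its NA values are assumptions chosen to mimic the pattern described above.

```r
# Toy data (hypothetical): two rows are missing both counts.
toy <- data.frame(
  carrier_ct = c(1.63, 0.96, NA, 4.17, NA, 142.89),
  nas_ct     = c(0.12, 0.04, NA, 1.83, NA, 161.37)
)

# Missing values per column, as in colSums(is.na(flights1))
na_per_column <- colSums(is.na(toy))

# Missing values per row: each row is either complete (0) or fully missing (2)
na_per_row <- rowSums(is.na(toy))
table(na_per_row)

# TRUE when every row with an NA is missing all selected columns
all_or_nothing <- all(na_per_row %in% c(0, ncol(toy)))

# Dropping incomplete rows then removes exactly those rows
toy_clean <- na.omit(toy)
nrow(toy_clean)
```

If all_or_nothing is TRUE, na.omit() discards only rows that carry no information for either variable.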
After cleaning the data, I used the mutate() function to create a new variable, delay_difference, that subtracts the Air Carrier count from the National Aviation System count in each row, so I could see how the two counts differ row by row. I also renamed the variables to clearer and more descriptive names to make the dataset easier to interpret, especially since I planned to convert it into a long format with headers like delay_type and delay_count. Seeing “nas_ct” under delay_type could confuse the audience.
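Because each row holds both counts for the same airline and airport, the difference column could also feed a test directly. The sketch below is an illustration under assumptions, not part of the analysis reported here: it uses only the six rows that head(delay_data) prints in the code section as a toy sample (the names toy, one_sample, and paired are hypothetical) and shows that a one-sample t-test on the differences matches a paired t-test.

```r
# Toy sample: the six rows printed by head(delay_data)
toy <- data.frame(
  carrier_ct = c(1.63, 0.96, 5.75, 4.17, 4.78, 142.89),
  nas_ct     = c(0.12, 0.04, 1.60, 1.83, 5.22, 161.37)
)

# Same definition as the mutate() step: NAS count minus Air Carrier count
toy$delay_difference <- toy$nas_ct - toy$carrier_ct

# One-sample t-test on the differences (H_a: mean difference < 0) ...
one_sample <- t.test(toy$delay_difference, mu = 0, alternative = "less")

# ... gives the same statistic and p-value as a paired t-test
paired <- t.test(toy$nas_ct, toy$carrier_ct,
                 paired = TRUE, alternative = "less")

c(one_sample = unname(one_sample$statistic),
  paired     = unname(paired$statistic))
```

Six rows are far too few to draw conclusions from; the point is only how delay_difference relates to a paired comparison.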
I then transformed the dataset into a long format, which made it easier to create clear and effective visualizations. Next, I used the arrange() function to sort the long dataset by delay count in descending order, which showed me the largest delay counts and the delay type each one belongs to.
Finally, I created a jitter plot to visually compare individual delay counts for Air Carrier and National Aviation System delays. This plot allowed me to see how the data points were distributed for both groups and where they cluster, which gives a rough sense of the typical count in each group. By examining the plot, I could judge whether one type of delay tends to have higher counts than the other.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 4.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(RColorBrewer)
flights <- read.csv('airline_delay.csv')
str(flights)
## 'data.frame': 3351 obs. of 21 variables:
## $ year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ month : int 12 12 12 12 12 12 12 12 12 12 ...
## $ carrier : chr "9E" "9E" "9E" "9E" ...
## $ carrier_name : chr "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." ...
## $ airport : chr "ABE" "ABY" "AEX" "AGS" ...
## $ airport_name : chr "Allentown/Bethlehem/Easton, PA: Lehigh Valley International" "Albany, GA: Southwest Georgia Regional" "Alexandria, LA: Alexandria International" "Augusta, GA: Augusta Regional at Bush Field" ...
## $ arr_flights : int 44 90 88 184 76 5985 142 147 84 150 ...
## $ arr_del15 : int 3 1 8 9 11 445 14 10 14 19 ...
## $ carrier_ct : num 1.63 0.96 5.75 4.17 4.78 ...
## $ weather_ct : num 0 0 0 0 0 ...
## $ nas_ct : num 0.12 0.04 1.6 1.83 5.22 ...
## $ security_ct : num 0 0 0 0 0 1 0 0 0 0 ...
## $ late_aircraft_ct : num 1.25 0 0.65 3 1 ...
## $ arr_cancelled : int 0 0 0 0 1 5 1 0 1 3 ...
## $ arr_diverted : int 1 0 1 0 0 0 0 1 1 0 ...
## $ arr_delay : int 89 23 338 508 692 30756 436 1070 2006 846 ...
## $ carrier_delay : int 56 22 265 192 398 16390 162 838 1164 423 ...
## $ weather_delay : int 0 0 0 0 0 1509 0 141 619 0 ...
## $ nas_delay : int 3 1 45 92 178 5060 182 24 223 389 ...
## $ security_delay : int 0 0 0 0 0 16 0 0 0 0 ...
## $ late_aircraft_delay: int 30 0 28 224 116 7781 92 67 0 34 ...
summary(flights)
## year month carrier carrier_name
## Min. :2019 Min. :12 Length:3351 Length:3351
## 1st Qu.:2019 1st Qu.:12 Class :character Class :character
## Median :2019 Median :12 Mode :character Mode :character
## Mean :2019 Mean :12
## 3rd Qu.:2020 3rd Qu.:12
## Max. :2020 Max. :12
##
## airport airport_name arr_flights arr_del15
## Length:3351 Length:3351 Min. : 1.0 Min. : 0
## Class :character Class :character 1st Qu.: 35.0 1st Qu.: 5
## Mode :character Mode :character Median : 83.0 Median : 12
## Mean : 298.3 Mean : 51
## 3rd Qu.: 194.5 3rd Qu.: 33
## Max. :19713.0 Max. :2289
## NA's :8 NA's :8
## carrier_ct weather_ct nas_ct security_ct
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 1.49 1st Qu.: 0.000 1st Qu.: 0.82 1st Qu.: 0.0000
## Median : 4.75 Median : 0.060 Median : 2.98 Median : 0.0000
## Mean : 16.07 Mean : 1.443 Mean : 16.18 Mean : 0.1373
## 3rd Qu.: 12.26 3rd Qu.: 1.010 3rd Qu.: 8.87 3rd Qu.: 0.0000
## Max. :697.00 Max. :89.420 Max. :1039.54 Max. :17.3100
## NA's :8 NA's :8 NA's :8 NA's :8
## late_aircraft_ct arr_cancelled arr_diverted arr_delay
## Min. : 0.00 Min. : 0.000 Min. : 0.0000 Min. : 0
## 1st Qu.: 0.90 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 230
## Median : 3.28 Median : 0.000 Median : 0.0000 Median : 746
## Mean : 17.17 Mean : 2.885 Mean : 0.5758 Mean : 3334
## 3rd Qu.: 10.24 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 2096
## Max. :819.66 Max. :224.000 Max. :42.0000 Max. :160383
## NA's :8 NA's :8 NA's :8 NA's :8
## carrier_delay weather_delay nas_delay security_delay
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.000
## 1st Qu.: 68.5 1st Qu.: 0.0 1st Qu.: 21.5 1st Qu.: 0.000
## Median : 272.0 Median : 3.0 Median : 106.0 Median : 0.000
## Mean : 1144.8 Mean : 177.6 Mean : 749.6 Mean : 5.401
## 3rd Qu.: 830.5 3rd Qu.: 82.0 3rd Qu.: 362.0 3rd Qu.: 0.000
## Max. :55215.0 Max. :14219.0 Max. :82064.0 Max. :553.000
## NA's :8 NA's :8 NA's :8 NA's :8
## late_aircraft_delay
## Min. : 0
## 1st Qu.: 31
## Median : 205
## Mean : 1257
## 3rd Qu.: 724
## Max. :75179
## NA's :8
flights1 <- flights |>
  select(carrier_ct, nas_ct)
head(flights1)
## carrier_ct nas_ct
## 1 1.63 0.12
## 2 0.96 0.04
## 3 5.75 1.60
## 4 4.17 1.83
## 5 4.78 5.22
## 6 142.89 161.37
colSums(is.na(flights1))
## carrier_ct nas_ct
## 8 8
flights1 <- na.omit(flights1)
delay_data <- flights1 |>
  mutate(delay_difference = nas_ct - carrier_ct)
head(delay_data)
## carrier_ct nas_ct delay_difference
## 1 1.63 0.12 -1.51
## 2 0.96 0.04 -0.92
## 3 5.75 1.60 -4.15
## 4 4.17 1.83 -2.34
## 5 4.78 5.22 0.44
## 6 142.89 161.37 18.48
flights2 <- flights1 |>
  rename(air_carrier_delay = carrier_ct,
         nas_delay = nas_ct)
head(flights2)
## air_carrier_delay nas_delay
## 1 1.63 0.12
## 2 0.96 0.04
## 3 5.75 1.60
## 4 4.17 1.83
## 5 4.78 5.22
## 6 142.89 161.37
flights_long <- flights2 |>
  pivot_longer(cols = c(air_carrier_delay, nas_delay),
               names_to = "delay_type",
               values_to = "delay_count")
flights_long
## # A tibble: 6,686 × 2
## delay_type delay_count
## <chr> <dbl>
## 1 air_carrier_delay 1.63
## 2 nas_delay 0.12
## 3 air_carrier_delay 0.96
## 4 nas_delay 0.04
## 5 air_carrier_delay 5.75
## 6 nas_delay 1.6
## 7 air_carrier_delay 4.17
## 8 nas_delay 1.83
## 9 air_carrier_delay 4.78
## 10 nas_delay 5.22
## # ℹ 6,676 more rows
flights_long |>
  arrange(desc(delay_count))
## # A tibble: 6,686 × 2
## delay_type delay_count
## <chr> <dbl>
## 1 nas_delay 1040.
## 2 nas_delay 916.
## 3 nas_delay 762.
## 4 nas_delay 758.
## 5 nas_delay 707.
## 6 air_carrier_delay 697
## 7 air_carrier_delay 686.
## 8 nas_delay 653.
## 9 nas_delay 650.
## 10 nas_delay 637.
## # ℹ 6,676 more rows
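Sorting shows the largest individual counts, but a per-group summary makes the comparison between the two delay types more direct. The sketch below is one possible way to do that, assuming dplyr is available; toy_long is a hypothetical stand-in built from the first twelve rows of the long table, and the summary column names are illustrative.

```r
library(dplyr)

# Toy long-format data: the first twelve rows of flights_long
toy_long <- data.frame(
  delay_type  = rep(c("air_carrier_delay", "nas_delay"), times = 6),
  delay_count = c(1.63, 0.12, 0.96, 0.04, 5.75, 1.60,
                  4.17, 1.83, 4.78, 5.22, 142.89, 161.37)
)

# One row per delay type with its size, center, and largest value,
# sorted so the type with the largest single count comes first
delay_summary <- toy_long |>
  group_by(delay_type) |>
  summarise(
    n            = n(),
    mean_count   = mean(delay_count),
    median_count = median(delay_count),
    max_count    = max(delay_count)
  ) |>
  arrange(desc(max_count))

delay_summary
```

Reporting the median next to the mean is useful here because the counts are strongly right-skewed, as the summary() output shows.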
delay_plot <- flights_long |>
  ggplot(aes(x = delay_type,
             y = delay_count,
             color = delay_type)) +
  geom_jitter(alpha = 0.7, size = 3) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "Comparison of Delay Counts by Cause",
       x = "Delay Cause",
       y = "Number of Delays",
       color = "Delay Type",
       caption = "Source: Bureau of Transportation Statistics")
delay_plot
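Since the jitter plot is meant to hint at where each group's mean lies, one option is to draw the group means on top of the points. The following is a sketch under assumptions rather than the plot used in this report: toy_long holds only the first twelve rows of the long table, and mean_plot and the marker styling are illustrative choices.

```r
library(ggplot2)

# Toy long-format data: the first twelve rows of flights_long
toy_long <- data.frame(
  delay_type  = rep(c("air_carrier_delay", "nas_delay"), times = 6),
  delay_count = c(1.63, 0.12, 0.96, 0.04, 5.75, 1.60,
                  4.17, 1.83, 4.78, 5.22, 142.89, 161.37)
)

mean_plot <- ggplot(toy_long,
                    aes(x = delay_type,
                        y = delay_count,
                        color = delay_type)) +
  # Jitter only horizontally so the plotted counts stay exact
  geom_jitter(width = 0.2, height = 0, alpha = 0.7, size = 3) +
  # Mark each group's mean with a black cross
  stat_summary(fun = mean, geom = "point",
               shape = 4, size = 5, color = "black") +
  theme_minimal() +
  labs(x = "Delay Cause",
       y = "Number of Delays",
       color = "Delay Type")

mean_plot
```

With skewed counts the cross can sit well above the dense cluster of points, which is a reminder that the mean is pulled upward by a few very large values.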
\(H_0: \mu_1 = \mu_2\)
\(H_a: \mu_1 < \mu_2\)
\(\mu_1\): average number of flights delayed due to the National Aviation System
\(\mu_2\): average number of flights delayed due to Air Carriers
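For reference, the Welch two-sample test statistic for these hypotheses can be written out as below. The symbols \(\bar{x}_i\), \(s_i^2\), and \(n_i\) (sample mean, sample variance, and sample size of group \(i\)) and \(\nu\) (approximate degrees of freedom) are notation introduced here, not taken from the dataset; after removing the 8 incomplete rows, \(n_1 = n_2 = 3343\).

\[
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
\]

Because \(H_a\) is \(\mu_1 < \mu_2\), only large negative values of \(t\) count as evidence against \(H_0\), and the p-value is the lower-tail probability \(P(T_\nu \le t)\).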
t.test(flights1$nas_ct, flights1$carrier_ct, conf.level = 0.95, alternative = "less")
##
## Welch Two Sample t-test
##
## data: flights1$nas_ct and flights1$carrier_ct
## t = 0.097233, df = 6158.3, p-value = 0.5387
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 2.115303
## sample estimates:
## mean of x mean of y
## 16.18338 16.06534
The Welch two-sample t-test gave sample means of 16.18 for the National Aviation System and 16.07 for Air Carriers, so the National Aviation System mean is actually slightly higher in this sample. Since the p-value (0.539) is greater than the significance level α = 0.05, we fail to reject the null hypothesis. There is insufficient statistical evidence to conclude that the average number of flight delays caused by the National Aviation System is lower than the average number caused by Air Carriers.
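To make the decision rule concrete, the sketch below recomputes a Welch statistic, its degrees of freedom, and the lower-tail p-value by hand and compares them with t.test(). It runs on only six rows taken from the head() output above, so its numbers are illustrative and are not the results reported for the full dataset; the names x, y, se, and reject_h0 are assumptions introduced for the example.

```r
# Toy samples: first six rows of each variable
x <- c(0.12, 0.04, 1.60, 1.83, 5.22, 161.37)   # nas_ct
y <- c(1.63, 0.96, 5.75, 4.17, 4.78, 142.89)   # carrier_ct
alpha <- 0.05

# Standard error of the difference in means (unequal variances)
se <- sqrt(var(x) / length(x) + var(y) / length(y))

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(x) - mean(y)) / se
df <- se^4 / ((var(x) / length(x))^2 / (length(x) - 1) +
              (var(y) / length(y))^2 / (length(y) - 1))

# Lower-tail p-value, matching alternative = "less"
p_value <- pt(t_stat, df)

# Decision rule: reject H_0 only when the p-value is below alpha
reject_h0 <- p_value < alpha

# Built-in test for comparison
welch <- t.test(x, y, alternative = "less", conf.level = 1 - alpha)
c(manual = t_stat, built_in = unname(welch$statistic))
```

In this toy sample the first mean is larger than the second, so the statistic is positive, the lower-tail p-value exceeds 0.5, and reject_h0 is FALSE.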
Overall, I found that the mean numbers of flights delayed by Air Carriers and by the National Aviation System were very similar. Failing to reject the null hypothesis does not prove that the two means are equal; it only means that the data do not provide evidence that the National Aviation System mean is lower than the Air Carrier mean.
For future analysis, I could expand this study by comparing the means of other types of delay causes to determine whether significant differences exist among them or if most delay types tend to have similar averages.
Bureau of Transportation Statistics. (2021). Airline on-time performance data (December 2019 and December 2020) [Data set]. U.S. Department of Transportation. https://www.transtats.bts.gov