Research Question:

Is there sufficient statistical evidence, at the 5% significance level, that the average number of flight delays attributed to the National Aviation System is lower than the average number attributed to Air Carriers?

Introduction:

The dataset I am using to answer my research question is called Airline Delays. It contains 3,351 rows and 21 variables describing flight delays across different U.S. cities and airlines in December 2019 and December 2020. Each row represents a specific airline at a specific airport. The dataset includes variables such as the year and month, the airline name, and the airport where the flights arrived. The variables I am focusing on are “carrier_ct”, which represents the number of flights delayed due to circumstances within the air carrier's control, such as maintenance or crew problems, and “nas_ct”, which represents the number of flights delayed because of the National Aviation System, often due to heavy air traffic. The dataset was obtained from the Bureau of Transportation Statistics, which collects and publishes data on airline on-time performance across U.S. airports (Bureau of Transportation Statistics, 2021).

Data Analysis

For this project, I began my data analysis by checking the structure of the dataset to see which class each variable belongs to. I then used the summary() function to obtain the five-number summary (minimum, first quartile, median, third quartile, and maximum) and the mean for my variables, which helped me gain a better understanding of the dataset.

Next, I created a new dataset containing only the variables relevant to my research question using the select() function. I used the colSums(is.na()) function to determine how many missing values (NAs) were present. I noticed that the number of NAs was the same across several columns, which led me to discover that all the NAs appeared in the same rows. Since those rows would not provide useful information for my analysis, I decided to omit them from the dataset.

After cleaning the data, I used the mutate() function to create a new variable, delay_difference, that subtracts the Air Carrier count from the National Aviation System count in each row, so I could see how the two counts differ for each airline and airport. I also renamed the variables to clearer and more descriptive names to make the dataset easier to interpret, especially since I planned to convert it into a long format with the columns delay_type and delay_count, where a value such as “nas_ct” under delay_type could confuse the audience.

I then transformed the dataset into a long format, which made it easier to create clear and effective visualizations. I used the arrange() function to sort the long dataset by delay count in descending order, which showed which delay cause accounted for the largest counts and made the two delay types easier to compare.

Finally, I created a jitter plot to visually compare individual delay counts for Air Carrier and National Aviation System delays. This plot allowed me to see how the data points were distributed for both groups, where they cluster, and how far the largest values extend. By examining the plot, I could get an informal sense of whether one type of delay tends to occur at higher counts than the other before running a formal test.


library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   4.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(RColorBrewer)

flights <- read.csv('airline_delay.csv')

str(flights)

## 'data.frame':    3351 obs. of  21 variables:
##  $ year               : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ month              : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ carrier            : chr  "9E" "9E" "9E" "9E" ...
##  $ carrier_name       : chr  "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." "Endeavor Air Inc." ...
##  $ airport            : chr  "ABE" "ABY" "AEX" "AGS" ...
##  $ airport_name       : chr  "Allentown/Bethlehem/Easton, PA: Lehigh Valley International" "Albany, GA: Southwest Georgia Regional" "Alexandria, LA: Alexandria International" "Augusta, GA: Augusta Regional at Bush Field" ...
##  $ arr_flights        : int  44 90 88 184 76 5985 142 147 84 150 ...
##  $ arr_del15          : int  3 1 8 9 11 445 14 10 14 19 ...
##  $ carrier_ct         : num  1.63 0.96 5.75 4.17 4.78 ...
##  $ weather_ct         : num  0 0 0 0 0 ...
##  $ nas_ct             : num  0.12 0.04 1.6 1.83 5.22 ...
##  $ security_ct        : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ late_aircraft_ct   : num  1.25 0 0.65 3 1 ...
##  $ arr_cancelled      : int  0 0 0 0 1 5 1 0 1 3 ...
##  $ arr_diverted       : int  1 0 1 0 0 0 0 1 1 0 ...
##  $ arr_delay          : int  89 23 338 508 692 30756 436 1070 2006 846 ...
##  $ carrier_delay      : int  56 22 265 192 398 16390 162 838 1164 423 ...
##  $ weather_delay      : int  0 0 0 0 0 1509 0 141 619 0 ...
##  $ nas_delay          : int  3 1 45 92 178 5060 182 24 223 389 ...
##  $ security_delay     : int  0 0 0 0 0 16 0 0 0 0 ...
##  $ late_aircraft_delay: int  30 0 28 224 116 7781 92 67 0 34 ...

summary(flights)

##       year          month      carrier          carrier_name      
##  Min.   :2019   Min.   :12   Length:3351        Length:3351       
##  1st Qu.:2019   1st Qu.:12   Class :character   Class :character  
##  Median :2019   Median :12   Mode  :character   Mode  :character  
##  Mean   :2019   Mean   :12                                        
##  3rd Qu.:2020   3rd Qu.:12                                        
##  Max.   :2020   Max.   :12                                        
##                                                                   
##    airport          airport_name        arr_flights        arr_del15   
##  Length:3351        Length:3351        Min.   :    1.0   Min.   :   0  
##  Class :character   Class :character   1st Qu.:   35.0   1st Qu.:   5  
##  Mode  :character   Mode  :character   Median :   83.0   Median :  12  
##                                        Mean   :  298.3   Mean   :  51  
##                                        3rd Qu.:  194.5   3rd Qu.:  33  
##                                        Max.   :19713.0   Max.   :2289  
##                                        NA's   :8         NA's   :8     
##    carrier_ct       weather_ct         nas_ct         security_ct     
##  Min.   :  0.00   Min.   : 0.000   Min.   :   0.00   Min.   : 0.0000  
##  1st Qu.:  1.49   1st Qu.: 0.000   1st Qu.:   0.82   1st Qu.: 0.0000  
##  Median :  4.75   Median : 0.060   Median :   2.98   Median : 0.0000  
##  Mean   : 16.07   Mean   : 1.443   Mean   :  16.18   Mean   : 0.1373  
##  3rd Qu.: 12.26   3rd Qu.: 1.010   3rd Qu.:   8.87   3rd Qu.: 0.0000  
##  Max.   :697.00   Max.   :89.420   Max.   :1039.54   Max.   :17.3100  
##  NA's   :8        NA's   :8        NA's   :8         NA's   :8        
##  late_aircraft_ct arr_cancelled      arr_diverted       arr_delay     
##  Min.   :  0.00   Min.   :  0.000   Min.   : 0.0000   Min.   :     0  
##  1st Qu.:  0.90   1st Qu.:  0.000   1st Qu.: 0.0000   1st Qu.:   230  
##  Median :  3.28   Median :  0.000   Median : 0.0000   Median :   746  
##  Mean   : 17.17   Mean   :  2.885   Mean   : 0.5758   Mean   :  3334  
##  3rd Qu.: 10.24   3rd Qu.:  2.000   3rd Qu.: 0.0000   3rd Qu.:  2096  
##  Max.   :819.66   Max.   :224.000   Max.   :42.0000   Max.   :160383  
##  NA's   :8        NA's   :8         NA's   :8         NA's   :8       
##  carrier_delay     weather_delay       nas_delay       security_delay   
##  Min.   :    0.0   Min.   :    0.0   Min.   :    0.0   Min.   :  0.000  
##  1st Qu.:   68.5   1st Qu.:    0.0   1st Qu.:   21.5   1st Qu.:  0.000  
##  Median :  272.0   Median :    3.0   Median :  106.0   Median :  0.000  
##  Mean   : 1144.8   Mean   :  177.6   Mean   :  749.6   Mean   :  5.401  
##  3rd Qu.:  830.5   3rd Qu.:   82.0   3rd Qu.:  362.0   3rd Qu.:  0.000  
##  Max.   :55215.0   Max.   :14219.0   Max.   :82064.0   Max.   :553.000  
##  NA's   :8         NA's   :8         NA's   :8         NA's   :8        
##  late_aircraft_delay
##  Min.   :    0      
##  1st Qu.:   31      
##  Median :  205      
##  Mean   : 1257      
##  3rd Qu.:  724      
##  Max.   :75179      
##  NA's   :8

flights1 <- flights |>
  select(carrier_ct, nas_ct)
head(flights1)

##   carrier_ct nas_ct
## 1       1.63   0.12
## 2       0.96   0.04
## 3       5.75   1.60
## 4       4.17   1.83
## 5       4.78   5.22
## 6     142.89 161.37

colSums(is.na(flights1))

## carrier_ct     nas_ct 
##          8          8

flights1 <- na.omit(flights1)
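The claim that all missing values sit in the same rows can be checked directly. The following is a minimal sketch in base R, assuming a small hypothetical data frame named toy that stands in for the two delay-count columns; the values and the NA positions are made up for illustration and are not taken from the dataset.

```r
# Hypothetical toy data standing in for the two delay-count columns
toy <- data.frame(
  carrier_ct = c(1.63, NA, 5.75, 4.17, NA),
  nas_ct     = c(0.12, NA, 1.60, 1.83, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows that contain at least one missing value
rows_with_any_na <- sum(!complete.cases(toy))

# If every column's NA count equals the number of incomplete rows,
# then the NAs all fall in the same rows
nas_share_rows <- all(na_per_column == rows_with_any_na)

# na.omit() drops exactly those incomplete rows
toy_clean <- na.omit(toy)
```

When nas_share_rows is TRUE, dropping the incomplete rows removes no observed values from either column, which is the situation described for this dataset.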

delay_data <- flights1 |>
  mutate(delay_difference = nas_ct - carrier_ct)
head(delay_data)

##   carrier_ct nas_ct delay_difference
## 1       1.63   0.12            -1.51
## 2       0.96   0.04            -0.92
## 3       5.75   1.60            -4.15
## 4       4.17   1.83            -2.34
## 5       4.78   5.22             0.44
## 6     142.89 161.37            18.48
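One way to use the difference column is to summarize its sign and its mean. The sketch below uses base R and only the six rows printed above, stored in a hypothetical data frame named toy, so the numbers it produces describe those six rows and not the full dataset.

```r
# The six rows shown in the head() output above (illustration only)
toy <- data.frame(
  carrier_ct = c(1.63, 0.96, 5.75, 4.17, 4.78, 142.89),
  nas_ct     = c(0.12, 0.04, 1.60, 1.83, 5.22, 161.37)
)

# Row-wise difference, in the same direction as in the report
toy$delay_difference <- toy$nas_ct - toy$carrier_ct

# Share of rows where the NAS count is below the Air Carrier count
share_nas_lower <- mean(toy$delay_difference < 0)

# Average row-wise difference
mean_difference <- mean(toy$delay_difference)
```

In these six rows most differences are negative, yet the mean difference is positive because one large airport dominates the average. A mean can therefore tell a different story from the typical row.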

flights2 <- flights1 |>
  rename(air_carrier_delay = carrier_ct,
         nas_delay = nas_ct)
head(flights2)

##   air_carrier_delay nas_delay
## 1              1.63      0.12
## 2              0.96      0.04
## 3              5.75      1.60
## 4              4.17      1.83
## 5              4.78      5.22
## 6            142.89    161.37

flights_long <- flights2 |>
  pivot_longer(cols = c(air_carrier_delay, nas_delay),
               names_to = "delay_type",
               values_to = "delay_count")
flights_long

## # A tibble: 6,686 × 2
##    delay_type        delay_count
##    <chr>                   <dbl>
##  1 air_carrier_delay        1.63
##  2 nas_delay                0.12
##  3 air_carrier_delay        0.96
##  4 nas_delay                0.04
##  5 air_carrier_delay        5.75
##  6 nas_delay                1.6 
##  7 air_carrier_delay        4.17
##  8 nas_delay                1.83
##  9 air_carrier_delay        4.78
## 10 nas_delay                5.22
## # ℹ 6,676 more rows

flights_long |>
  arrange(desc(delay_count))

## # A tibble: 6,686 × 2
##    delay_type        delay_count
##    <chr>                   <dbl>
##  1 nas_delay               1040.
##  2 nas_delay                916.
##  3 nas_delay                762.
##  4 nas_delay                758.
##  5 nas_delay                707.
##  6 air_carrier_delay        697 
##  7 air_carrier_delay        686.
##  8 nas_delay                653.
##  9 nas_delay                650.
## 10 nas_delay                637.
## # ℹ 6,676 more rows

plot <- flights_long |>
  ggplot(aes(x = delay_type,
             y = delay_count,
             color = delay_type)) +
  geom_jitter(alpha = 0.7, size = 3) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "Comparison of Delay Counts by Cause",
       x = "Delay Cause",
       y = "Number of Delays",
       color = "Delay Type",
       caption = "Source: Bureau of Transportation Statistics")
plot
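A jitter plot shows individual points but does not mark the center of each group, so a numeric summary by delay type can complement it. The following is a minimal base R sketch, assuming a hypothetical long-format data frame named toy_long built from four of the rows printed earlier; it is an illustration, not the full dataset.

```r
# Hypothetical long-format data built from rows 1, 2, 3, and 6 shown earlier
toy_long <- data.frame(
  delay_type  = rep(c("air_carrier_delay", "nas_delay"), times = 4),
  delay_count = c(1.63, 0.12, 0.96, 0.04, 5.75, 1.60, 142.89, 161.37)
)

# Mean and median delay count for each delay type
group_means   <- tapply(toy_long$delay_count, toy_long$delay_type, mean)
group_medians <- tapply(toy_long$delay_count, toy_long$delay_type, median)
```

With right-skewed counts like these, the mean and the median can rank the groups differently. The summary() output for the full dataset shows the same pattern: the median of nas_ct (2.98) is below the median of carrier_ct (4.75), while the means are nearly equal.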

Statistical Analysis

\(H_0: \mu_1 = \mu_2\)

\(H_a: \mu_1 < \mu_2\)

\(\mu_1\): Average number of flights delayed due to the National Aviation System

\(\mu_2\): Average number of flights delayed due to Air Carriers

t.test(flights1$nas_ct,flights1$carrier_ct, conf.level = 0.95, alternative = "less")

## 
##  Welch Two Sample t-test
## 
## data:  flights1$nas_ct and flights1$carrier_ct
## t = 0.097233, df = 6158.3, p-value = 0.5387
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 2.115303
## sample estimates:
## mean of x mean of y 
##  16.18338  16.06534
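The numbers in this output can be tied together with the Welch test statistic. The block below is a worked check using only values printed above. The standard error is not reported by t.test(), so it is back-calculated from the printed means and t value, and 1.645 is used as an approximate 95th percentile of the t distribution with about 6158 degrees of freedom.

```latex
\[
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad n_1 = n_2 = 3351 - 8 = 3343
\]
\[
\mathrm{SE} \approx \frac{16.18338 - 16.06534}{0.097233}
            = \frac{0.11804}{0.097233} \approx 1.214
\]
\[
\text{upper bound} \approx 0.11804 + 1.645 \times 1.214 \approx 2.115
\]
```

The difference in sample means (about 0.118) is roughly one tenth of its standard error, which is why the test statistic is so close to zero.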

The Welch two-sample t-test reported sample means of 16.18 for the National Aviation System and 16.07 for Air Carriers, with t = 0.097 and a p-value of 0.539. Because the sample mean for the National Aviation System is slightly higher than the one for Air Carriers, the data give no support to the alternative that it is lower. Since the p-value is greater than the significance level α = 0.05, we fail to reject the null hypothesis. There is insufficient statistical evidence to conclude that the average number of flight delays caused by the National Aviation System is lower than the average number caused by Air Carriers.
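Each row of the dataset holds both counts for the same airline at the same airport, so the two samples are matched and not independent. A paired t-test is one alternative worth considering. The sketch below is not the analysis run in this report. It uses base R and a hypothetical data frame named toy holding only the six rows printed earlier, and it shows that a paired test is the same as a one-sample test on the row-wise differences.

```r
# The six rows shown in the head() output earlier (illustration only)
toy <- data.frame(
  carrier_ct = c(1.63, 0.96, 5.75, 4.17, 4.78, 142.89),
  nas_ct     = c(0.12, 0.04, 1.60, 1.83, 5.22, 161.37)
)

# Paired one-sided test: is the mean of (nas_ct - carrier_ct) below 0?
paired_result <- t.test(toy$nas_ct, toy$carrier_ct,
                        paired = TRUE, alternative = "less")

# Equivalent one-sample test on the row-wise differences
diff_result <- t.test(toy$nas_ct - toy$carrier_ct,
                      mu = 0, alternative = "less")
```

Both calls return the same test statistic and p-value. On the full dataset, the same idea would apply to the delay_difference column, although the counts are strongly right-skewed, so the result should still be read with care.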

Conclusion

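Overall, I found that the sample means of the number of flights delayed by Air Carriers and by the National Aviation System were very similar (16.07 and 16.18). Because the t-test failed to reject the null hypothesis, I cannot conclude that the National Aviation System causes fewer delays on average than Air Carriers. Failing to reject the null hypothesis does not prove that the two averages are equal; it only means this sample does not provide enough evidence of a difference in the hypothesized direction.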

For future analysis, I could expand this study by comparing the means of other types of delay causes to determine whether significant differences exist among them or if most delay types tend to have similar averages.

Reference

Bureau of Transportation Statistics. (2021). Airline on-time performance data (December 2019 and December 2020) [Data set]. U.S. Department of Transportation. https://www.transtats.bts.gov

Project 2

Kenny Nguyen