Analysis of NYC Flight Delays and Cancellations in 2013

Introduction

This report seeks to analyze patterns in New York City flights that were either canceled or delayed by more than one hour. We will be answering the following questions:

Why did some days have a higher proportion of cancellations and delays than others? Is there a relationship between scheduled departure time and flights that were canceled or delayed by more than one hour? And on days that had a high proportion of flights canceled or delayed, what is the relationship between scheduled departure time and flights that still were able to leave on time or early?

Throughout, we will use the functionality of the tidyverse package, mainly to perform data transformations, and the DT package to make the data viewable.

library(tidyverse)
library(DT)

We will be using a data set called flights, loaded from the nycflights13 package. This data set contains 19 variables on all 336,776 flights that were scheduled to depart from JFK, LGA, or EWR airports during 2013. The variables we are most interested in are month and day, which tell us the date of the flight, and hour, which gives the hour of scheduled departure in hours since midnight. We will also use the dep_delay variable to determine how late a flight was or whether it was canceled.

library(nycflights13)

Dates with High Proportion of Cancellations and Delays

We will use 60 minutes as our threshold to determine flights that were excessively delayed. First, we will sort the flights in our original data set into 4 categories based on dep_delay: “canceled”, “delayed > 1hr”, “delayed < 1hr”, and “on time”. A missing dep_delay is treated as a cancellation, and a delay of exactly 60 minutes falls under “delayed < 1hr”. We will store these categorizations under a new variable called dep_status. Flights that left early are considered to be “on time” for our purposes.

flights_w_dep_status <- flights %>% 
  mutate(dep_status = case_when(is.na(dep_delay) ~ "canceled",
                                dep_delay > 60 ~ "delayed > 1hr",
                                dep_delay > 0 ~ "delayed < 1hr",
                                TRUE ~ "on time"))
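Because case_when() checks its conditions in order and uses the first one that matches, the order of the rules above matters. The following is a minimal sketch on a made-up vector of delays (toy_delays and toy_status are hypothetical names, not part of the flights data) that applies the same rules to a few edge cases:

```r
library(dplyr)

# Hypothetical delays in minutes: missing, over an hour, exactly an hour,
# slightly late, exactly on time, and early.
toy_delays <- data.frame(dep_delay = c(NA, 75, 60, 12, 0, -5))

# Same rules, in the same order, as in the report.
toy_status <- toy_delays %>%
  mutate(dep_status = case_when(
    is.na(dep_delay) ~ "canceled",      # checked first, so NA never reaches the comparisons
    dep_delay > 60 ~ "delayed > 1hr",
    dep_delay > 0 ~ "delayed < 1hr",    # only reached when dep_delay <= 60
    TRUE ~ "on time"                    # everything left over: zero or negative delay
  ))
toy_status
```

If the dep_delay > 0 rule came before the dep_delay > 60 rule, every late flight would be labeled “delayed < 1hr”, so the stricter condition has to be listed first.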

Now that we have this “new” data set, we can determine how many days had a high proportion of excessive delays and cancellations. We will use 35% as our threshold.

sum_table_1 <- flights_w_dep_status %>% 
  group_by(month, day) %>% 
  summarize(percentage = mean(dep_status == "canceled" | dep_status == "delayed > 1hr") * 100,
            count = n()) %>% 
  filter(percentage > 35)
datatable(sum_table_1)
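The percentage column relies on the fact that the mean of a logical vector is the proportion of TRUE values. As a small sketch of that idea, here is the same summary on a made-up table of eight flights over two dates (toy_flights and toy_summary are hypothetical names, and %in% is used here as a shorthand for the two == comparisons):

```r
library(dplyr)

# Hypothetical data: four flights on January 1 and four on February 8.
toy_flights <- data.frame(
  month = c(1, 1, 1, 1, 2, 2, 2, 2),
  day = c(1, 1, 1, 1, 8, 8, 8, 8),
  dep_status = c("on time", "on time", "delayed < 1hr", "canceled",
                 "canceled", "canceled", "delayed > 1hr", "on time")
)

toy_summary <- toy_flights %>%
  group_by(month, day) %>%
  summarize(
    # TRUE counts as 1 and FALSE as 0, so the mean is the share of flagged flights
    percentage = 100 * mean(dep_status %in% c("canceled", "delayed > 1hr")),
    count = n(),
    .groups = "drop"
  )

# Apply the same 35% threshold as in the report.
toy_flagged <- toy_summary %>%
  filter(percentage > 35)
toy_flagged
```

In this toy table, January 1 has 1 of 4 flights flagged (25%) and February 8 has 3 of 4 (75%), so only February 8 passes the threshold.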

A little research shows that the two days with the highest percentages, February 8 and 9, coincided with “Winter Storm Nemo”, a huge blizzard categorized as a Category 3 bomb cyclone. Flights had to be canceled once it arrived on the 8th, and the airports stayed closed for most of the day on the 9th. March 8, the day with the third-highest percentage, also saw heavy snowfall that caused many cancellations and delays. The remaining days all had complications from thunderstorms in the area, fog, or a combination of the two. When unsafe flying weather is detected, the FAA issues ground stops and ground delay programs, which limit the number of flights that can depart on time.

Relationship between Cancellations and Departure Hour

We shift our focus from dates with high proportions of cancellations to times of day with high percentages. By changing the variable we group by in our summary table, we can easily make this shift. A line graph will help us visualize the trend in average percentage of cancellations and excessive delays throughout the day.

sum_table_2 <- flights_w_dep_status %>% 
  group_by(hour) %>% 
  summarize(percentage = 100 * mean(dep_status == "canceled" | dep_status == "delayed > 1hr"),
            count = n())
datatable(sum_table_2)

Note that the data table above has a single entry at the 1:00 a.m. hour, and that this one flight was canceled or excessively delayed, giving a percentage of 100% for that hour. We can take a closer look at that entry here:

outlier_flight <- filter(flights, hour == 1)
datatable(outlier_flight, options = list(scrollX = TRUE))

This entry was likely a repositioning flight, as it was scheduled to go from one NYC airport to another, but it was canceled. Since it was not a regular commercial flight, it was probably added to the data set by mistake, so we can exclude it when analyzing the trend in cancellations and excessive delays throughout the day. This is why we begin our line graph at hour 5, the hour of the earliest commercial flights.

ggplot(data = sum_table_2, mapping = aes(x = hour, y = percentage)) +
  geom_point() +
  geom_line() +
  xlim(5, 23) +
  ylim(0, 25)

This plot shows that excessive delays and cancellations pile up as the day goes on. This makes sense: once flights start to get delayed, subsequent flights must wait for them to take off, which delays those flights as well. The riskiest time to have a flight is hour 21 (9:00 p.m.), when approximately 20% of flights are canceled or delayed by over an hour. The 10 p.m. and 11 p.m. hours, however, see a sudden decrease. This is likely related to the significantly lower volume of flight traffic in the night hours, as can be seen under the count variable in the data table above.

On Time Flights on Days with High Delay and Cancellation Percentages

Just because a particular date had a high proportion of flights that were canceled or excessively delayed doesn’t mean every flight was late. We will take a closer look at the flights that still managed to depart on time or early. This can most easily be done by combining the month and day variables into a single variable (which we will call monthday), and matching it against all the dates that show up in our table of dates with high percentages.

cancel_delay_date_list <- sum_table_1$month * 100 + sum_table_1$day
flights_v2 <- flights_w_dep_status %>% 
  mutate(monthday = month * 100 + day) %>% 
  filter(monthday %in% cancel_delay_date_list, dep_status == "on time") %>% 
  select(-monthday) %>% 
  arrange(month, day, dep_time)
datatable(flights_v2, options = list(scrollX = TRUE))
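The monthday encoding works because day is always below 100, so each (month, day) pair maps to a distinct number (February 8 becomes 208, while August 2 becomes 802). As an alternative sketch, not the approach used in this report, the same matching can be written with dplyr's semi_join(), which keeps the rows of one table that have a match in another. The tables below are made up for illustration (toy_bad_days, toy_flights, and toy_on_time are hypothetical names):

```r
library(dplyr)

# Hypothetical list of flagged dates: February 8 and March 8.
toy_bad_days <- data.frame(month = c(2, 3), day = c(8, 8))

# Hypothetical flights, including August 2 and August 3, which share
# digits with the flagged dates but are different days.
toy_flights <- data.frame(
  month = c(2, 2, 3, 8, 8),
  day = c(8, 8, 8, 2, 3),
  dep_status = c("on time", "canceled", "on time", "on time", "on time")
)

# Keep only flights on flagged dates, then only those that left on time.
toy_on_time <- toy_flights %>%
  semi_join(toy_bad_days, by = c("month", "day")) %>%
  filter(dep_status == "on time")
toy_on_time
```

Matching on both columns directly avoids creating and then dropping a helper variable, at the cost of needing the flagged dates as a data frame rather than a vector.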

It seems reasonable to suggest that the flights that were on time or early were able to depart before the bad weather came in. We can test this hypothesis by finding the average departure hour for each day in our list.

sum_table_3 <- flights_v2 %>% 
  group_by(month, day) %>% 
  summarize(avg_dep_hour = mean(hour), count = n())
datatable(sum_table_3)

This summary table strongly supports our hypothesis. All but one of these dates have an average scheduled departure hour of 8, 9, or 10 (corresponding to 8 a.m., 9 a.m., and 10 a.m.). We can check the average scheduled departure hour of all on time flights in 2013 for comparison:

ontime_flights <- flights_w_dep_status %>% 
  filter(dep_status == "on time")
mean(ontime_flights$hour)

## [1] 12.27121

Across all of 2013, the average scheduled departure hour of on time flights is around noon (12.27), noticeably later than on the dates in our list. This further supports our claim that on time flights on days with high proportions of excessive delays and cancellations mostly departed before the weather hit. Notice, however, that the average departure hour of on time flights on February 9 was around 17, corresponding to 5 p.m. As stated earlier, a bad blizzard hit the east coast on February 8 and affected flight patterns for two days in a row. The afternoon hour indicates that flights were able to resume later in the day on the 9th, once the storm had cleared and runways were once again safe for takeoff.
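The overall average above includes the high-percentage dates themselves. One way to sharpen the comparison, sketched here on made-up data rather than the real flights table, is to label each on time flight by whether it fell on a flagged date and average the hour within each group (toy_on_time, toy_bad_days, bad_day, and toy_comparison are hypothetical names):

```r
library(dplyr)

# Hypothetical on time flights: three on a flagged date, three on a normal date.
toy_on_time <- data.frame(
  month = c(2, 2, 2, 5, 5, 5),
  day = c(8, 8, 8, 1, 1, 1),
  hour = c(6, 8, 10, 9, 12, 18)
)

# Hypothetical table of flagged dates, with a marker column.
toy_bad_days <- data.frame(month = 2, day = 8, bad_day = TRUE)

toy_comparison <- toy_on_time %>%
  left_join(toy_bad_days, by = c("month", "day")) %>%
  # dates with no match get NA from the join; treat them as normal days
  mutate(bad_day = coalesce(bad_day, FALSE)) %>%
  group_by(bad_day) %>%
  summarize(avg_dep_hour = mean(hour), count = n())
toy_comparison
```

In this toy table the on time flights on the flagged date average hour 8, against hour 13 on the normal date. The numbers are invented and only illustrate the shape of the comparison.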

Conclusion

In summary, we can conclude that the days with high proportions of excessively delayed or canceled flights can be attributed to severe weather, such as heavy snowfall, thunderstorms, and/or fog. On those days, the flights that were still able to depart on time mostly left before the weather hit. For all flights, the likelihood of being canceled or excessively delayed increases as the day goes on, peaking at 9 p.m. and then decreasing again during the hours of lower flight traffic.