This report seeks to analyze patterns in New York City flights that were either canceled or delayed by more than one hour. We will be answering the following questions:
Why did some days have a higher proportion of cancellations and delays than others? Is there a relationship between scheduled departure time and flights that were canceled or delayed by more than one hour? And on days that had a high proportion of flights canceled or delayed, what is the relationship between scheduled departure time and flights that still were able to leave on time or early?
Throughout, we will use the functionality of the
tidyverse package, mainly to perform data transformations,
and the DT package to make the data viewable.
library(tidyverse)
library(DT)
We will be using a data set called flights loaded from
the nycflights13 package. This data set contains 19
variables on all 336,776 flights that were scheduled to depart JFK, LGA,
or EWR airports during 2013. The variables we are most interested in are
month and day, which tell us the date of each
flight, and hour, which gives the scheduled departure hour
in hours since midnight. We will also use the
dep_delay variable to determine how late a flight was; we treat
a flight with a missing dep_delay as canceled.
library(nycflights13)
We will use 60 minutes as our threshold to determine flights that
were excessively delayed. First, we will categorize our original data
set into 4 categories based on dep_delay: “canceled”, “delayed
> 1hr”, “delayed < 1hr”, and “on time”. We will store these
categorizations under a new variable called dep_status.
Flights that left early are considered to be “on time” for our
purposes.
flights_w_dep_status <- flights %>%
  mutate(dep_status = case_when(is.na(dep_delay) ~ "canceled",      # no recorded departure delay
                                dep_delay > 60 ~ "delayed > 1hr",
                                dep_delay > 0 ~ "delayed < 1hr",
                                TRUE ~ "on time"))                  # includes early departures
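The order of the conditions in case_when() matters, because each flight receives the label of the first condition it satisfies. As a minimal sketch, the same rules can be applied to a handful of made-up delay values to check the boundary cases. The toy_delays tibble is hypothetical and is not part of the flights data.

```r
library(dplyr)

# Hypothetical delay values chosen to hit each branch and the 60-minute boundary
toy_delays <- tibble(dep_delay = c(NA, -5, 0, 30, 60, 61))

toy_status <- toy_delays %>%
  mutate(dep_status = case_when(
    is.na(dep_delay) ~ "canceled",       # missing delay is treated as a cancellation
    dep_delay > 60   ~ "delayed > 1hr",  # strictly more than 60 minutes
    dep_delay > 0    ~ "delayed < 1hr",  # 1 to 60 minutes, so exactly 60 lands here
    TRUE             ~ "on time"         # zero or negative delay
  ))

toy_status
```

Under these rules, a delay of exactly 60 minutes is labeled “delayed < 1hr”, since only delays strictly greater than 60 count as excessive.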
Now that we have this “new” data set, we can determine how many days had a high proportion of excessive delays and cancellations. We will use 35% as our threshold.
sum_table_1 <- flights_w_dep_status %>%
  group_by(month, day) %>%
  summarize(percentage = mean(dep_status == "canceled" | dep_status == "delayed > 1hr") * 100,
            count = n(),
            .groups = "drop") %>%
  # keep only the days above the 35% threshold
  filter(percentage > 35)
datatable(sum_table_1)
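The percentage column above relies on the fact that taking the mean of a logical vector gives the proportion of TRUE values. The following is a small sketch of that idea on made-up data; toy_flights and its two dates are hypothetical, and %in% is used here only as a shorter way of writing the same condition.

```r
library(dplyr)

# Hypothetical departure statuses for two made-up days with four flights each
toy_flights <- tibble(
  month = c(1, 1, 1, 1, 2, 2, 2, 2),
  day   = c(1, 1, 1, 1, 8, 8, 8, 8),
  dep_status = c("on time", "on time", "delayed < 1hr", "canceled",
                 "canceled", "canceled", "delayed > 1hr", "on time")
)

toy_summary <- toy_flights %>%
  group_by(month, day) %>%
  summarize(
    # TRUE counts as 1 and FALSE as 0, so the mean is the share of flagged flights
    percentage = mean(dep_status %in% c("canceled", "delayed > 1hr")) * 100,
    count = n(),
    .groups = "drop"
  )

toy_summary
```

In this toy example the first day has one flagged flight out of four and the second has three out of four, so only the second would pass a 35% threshold.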
A little bit of research shows that the two days with the biggest percentages, February 8 and 9, were when “Winter Storm Nemo” rolled in, a huge blizzard categorized as a Category 3 bomb cyclone. Flights had to be canceled once it rolled in on the 8th, and the airports stayed closed most of the day on the 9th. March 8, the day with the third-highest percentage, also saw heavy snowfall that caused a lot of cancellations and delays. The rest of the days all had complications due to thunderstorms in the area, fog, or a combination of the two. When unsafe flying weather is detected, the FAA issues ground stops and ground delay programs, which limit the number of flights that can depart on time.
We now shift our focus from dates with high proportions of cancellations and excessive delays to times of day with high proportions. By changing the variable we group by in our summary table, we can easily make this shift. A line graph will help us visualize how the percentage of cancellations and excessive delays changes throughout the day.
sum_table_2 <- flights_w_dep_status %>%
  group_by(hour) %>%
  summarize(percentage = 100 * mean(dep_status == "canceled" | dep_status == "delayed > 1hr"),
            count = n())
datatable(sum_table_2)
Note that the data table above had 1 entry at the 1:00 a.m. hour, and that 100% of flights at that hour were canceled or excessively delayed. We can take a closer look at that entry here:
outlier_flight <- filter(flights, hour == 1)
datatable(outlier_flight, options = list(scrollX = TRUE))
This entry was likely a repositioning flight, as it was scheduled to go from one NYC airport to another, but it was canceled. Because it was not a regular commercial flight, it was probably added to the data set by mistake, so we exclude it when analyzing the trend in cancellations and excessive delays throughout the day. This is why we have chosen to begin our line graph at hour 5, the hour of the earliest commercial flights.
ggplot(data = sum_table_2, mapping = aes(x = hour, y = percentage)) +
  geom_point() +
  geom_line() +
  xlim(5, 23) +
  ylim(0, 25)
This plot shows that excessive delays and cancellations pile up as
the day goes on. This makes sense: once flights start to
get delayed, subsequent flights must wait for the delayed flights to
take off, which delays them as well. By hour 21
(9:00 p.m.), we reach the riskiest time to have a flight, with
approximately 20% of flights delayed by over an hour or canceled.
The 10 p.m. and 11 p.m. hours, however, see a sudden decrease. This is
likely due to the significantly lower volume of flight traffic in the night hours,
as can be seen under the count variable in the data table
above.
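The plotting code uses the axis limits to hide the hour-1 entry. Setting limits with xlim() treats points outside the range as missing, which ggplot2 typically reports with a warning. An alternative is to drop the outlier from the data before plotting, so the exclusion is explicit. This is only a sketch: toy_hourly is a hypothetical table shaped like sum_table_2, and its numbers are made up.

```r
library(dplyr)
library(ggplot2)

# Hypothetical hourly summary shaped like sum_table_2, including an hour-1 outlier
toy_hourly <- tibble(
  hour       = c(1, 5, 6, 12, 21),
  percentage = c(100, 1, 2, 8, 20),
  count      = c(1, 2000, 26000, 18000, 11000)
)

# Drop the outlier explicitly instead of letting the axis limits hide it
plot_data <- toy_hourly %>%
  filter(hour >= 5)

# Build the same point-and-line plot on the filtered data
hourly_plot <- ggplot(plot_data, aes(x = hour, y = percentage)) +
  geom_point() +
  geom_line()
```

Printing hourly_plot draws the figure. Because the filter makes the exclusion visible in the code, the plot no longer depends on the axis limits to remove the entry.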
Just because a particular date had a high proportion of flights that
were canceled or excessively delayed doesn’t mean every flight was late.
We will take a closer look at the flights that managed to still depart
on time or early. This can most easily be done by combining the
month and day variables into a single variable
(which we will call monthday), and matching it with all the
dates that show up in our table of dates with high percentages.
cancel_delay_date_list <- sum_table_1$month * 100 + sum_table_1$day
flights_v2 <- flights_w_dep_status %>%
  mutate(monthday = month * 100 + day) %>%
  filter(monthday %in% cancel_delay_date_list, dep_status == "on time") %>%
  select(-monthday) %>%
  arrange(month, day, dep_time)
datatable(flights_v2, options = list(scrollX = TRUE))
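Encoding each date as month * 100 + day works because every month and day pair maps to a distinct number. The same matching can also be sketched with a filtering join, which compares the month and day columns directly and needs no helper variable. In the sketch below, toy_bad_days stands in for sum_table_1 and toy_flights for the labeled flights; both are hypothetical.

```r
library(dplyr)

# Hypothetical stand-ins for the flagged dates and the labeled flights
toy_bad_days <- tibble(month = c(2, 3), day = c(8, 8))

toy_flights <- tibble(
  month      = c(2, 2, 2, 3, 3),
  day        = c(8, 8, 18, 8, 18),
  hour       = c(7, 15, 9, 6, 6),
  dep_status = c("on time", "canceled", "on time", "on time", "on time")
)

toy_on_time <- toy_flights %>%
  semi_join(toy_bad_days, by = c("month", "day")) %>%  # keep only flights on flagged dates
  filter(dep_status == "on time") %>%
  arrange(month, day, hour)

toy_on_time
```

Only the on time flights from February 8 and March 8 remain. The flights on the 18th of each month are dropped because a date must match on both month and day.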
It seems reasonable to suggest that the flights that were on time or early were scheduled to depart before the bad weather came in. We can test this hypothesis by finding the average scheduled departure hour of these flights for each day in our list.
sum_table_3 <- flights_v2 %>%
  group_by(month, day) %>%
  summarize(avg_dep_hour = mean(hour), count = n())
datatable(sum_table_3)
This summary table overwhelmingly supports our hypothesis. All but one of these dates have an average departure hour of 8, 9, or 10 (which correspond to 8 a.m., 9 a.m., and 10 a.m.). We can check the average departure hour of all on time flights in 2013 for comparison:
ontime_flights <- flights_w_dep_status %>%
  filter(dep_status == "on time")
mean(ontime_flights$hour)
## [1] 12.27121
Across all of 2013, the average scheduled departure hour of on time flights is around noon, which further supports our claim that the on time flights on days with high proportions of excessive delays and cancellations mostly left before the weather hit. Notice, however, that the average departure hour of on time flights on February 9 was around 17, corresponding to 5 p.m. As stated earlier, a bad blizzard hit the East Coast on February 8 and affected flight patterns for two days in a row. The afternoon hour suggests that flights were able to resume later in the day, once the storm had cleared and runways were once again safe for takeoff.
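The comparison value of about 12.27 is computed from every on time flight in 2013, including those on the flagged dates. A stricter baseline would remove the flagged dates first. The following sketch shows one way this could be done with anti_join() on made-up data; toy_bad_days and toy_flights are hypothetical stand-ins for sum_table_1 and flights_w_dep_status.

```r
library(dplyr)

# Hypothetical stand-ins for the flagged dates and the labeled flights
toy_bad_days <- tibble(month = c(2, 3), day = c(8, 8))

toy_flights <- tibble(
  month      = c(2, 2, 2, 3, 3),
  day        = c(8, 8, 18, 8, 18),
  hour       = c(7, 15, 13, 6, 11),
  dep_status = c("on time", "canceled", "on time", "on time", "on time")
)

baseline_hour <- toy_flights %>%
  anti_join(toy_bad_days, by = c("month", "day")) %>%  # drop flights on flagged dates
  filter(dep_status == "on time") %>%
  summarize(avg_dep_hour = mean(hour)) %>%
  pull(avg_dep_hour)

baseline_hour
```

With only a few flagged dates in a full year, such a baseline would probably differ little from the overall average, but it compares the flagged days against the remaining days only.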
In summary, we can conclude that the high proportions of late or canceled flights on the worst days were caused by severe weather, such as heavy snowfall, thunderstorms, and/or fog. On these days, the flights that were still able to depart on time mostly left before the weather hit. For all flights, the likelihood that a flight is canceled or excessively delayed increases as the day goes on, peaking at 9 p.m. and then decreasing again during the hours of lower flight traffic.