Introduction

This report analyzes a data set containing information about flights that departed NYC in 2013. The analysis covers several aspects of the flights’ departures: were they canceled, delayed, on time, or early? When did each of these outcomes occur, and why?

The data set we will be working with is called “flights.” As explained above, it contains various statistics about flights that departed from NYC airports in 2013. It has 336,776 observations and 19 variables, which provide information obtained from this website: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

Of the 19 variables, we will work with four. These are listed and described below.

month = month of departure

day = day of departure

dep_delay = departure delay in minutes

hour = hour of scheduled departure

Three packages are needed for this report: DT, tidyverse, and nycflights13. DT allows us to display data sets as interactive tables, tidyverse provides many of the functions needed to work with data sets and to create visualizations, and nycflights13 provides the flights data set.

library(DT)
library(tidyverse)
library(nycflights13)
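Before starting the analysis, it can help to confirm that the data set matches the description above. The following is a minimal sketch, assuming the packages are installed; the name flights_subset is only illustrative and is not used elsewhere in this report.

```r
library(tidyverse)
library(nycflights13)

# Number of rows (flights) and columns (variables) in the data set
dim(flights)

# Keep only the four variables used in this report and preview them
flights_subset <- flights %>%
  select(month, day, dep_delay, hour)
glimpse(flights_subset)

# Count flights with no recorded departure delay (NA)
n_missing_delay <- sum(is.na(flights_subset$dep_delay))
n_missing_delay
```

A missing dep_delay is what the later sections treat as a canceled flight, so this last count gives a rough sense of how many cancellations are in the data.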

Days with High Cancellations and Delays

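Our first step is to find days on which cancellations and long delays were exceptionally frequent. To do this, we will create a new data set from flights. A flight is treated as canceled if its dep_delay is missing (NA), and as having a long delay if its dep_delay is greater than 60 minutes.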

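# For each day, compute the share of flights that were canceled (NA dep_delay) or delayed by over 60 minutes
flights_canceled_delayed_day <- flights %>% 
  group_by(month, day) %>% 
  summarize(percent_canceled_or_delayed = mean(is.na(dep_delay) | dep_delay > 60), .groups = "drop") %>% 
  # Keep only the days on which that share is above 35% (stored as 0.35)
  filter(percent_canceled_or_delayed > 0.35)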

With this data set, we can see that there were 12 days in 2013 on which more than 35% of the flights scheduled for that day were either canceled or delayed by over an hour. On three of these days (February 8th, February 9th, and March 8th), this was the case for over half of the flights.
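The summary above relies on two behaviors of R: TRUE | NA evaluates to TRUE, and the mean of a logical vector is the proportion of TRUE values. The small example below illustrates this with made-up numbers; toy_delay and flag are hypothetical names and the values are not taken from flights.

```r
# Hypothetical departure delays in minutes for five flights;
# NA marks a flight with no recorded departure (treated as canceled)
toy_delay <- c(-3, 75, NA, 10, 120)

# TRUE when the flight is canceled or delayed by more than 60 minutes.
# For the NA entry, is.na() is TRUE, so the result is TRUE even though
# the comparison NA > 60 is itself NA.
flag <- is.na(toy_delay) | toy_delay > 60
flag

# The mean of a logical vector is the proportion of TRUE values
mean(flag)
```

Here three of the five toy flights are flagged, so the proportion is 0.6. This is why the threshold in the filter is written as 0.35 rather than 35.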

The large percentage of cancellations and delays on these days can be explained by the weather. On each of the 12 days, New York experienced some form of extreme weather, usually either a storm (snow, rain, or thunder) or abnormal temperatures (intense heat or cold, or sharp swings in temperature). March 8th had heavy snow and high winds. A major blizzard known as “Winter Storm Nemo” struck on February 8th and 9th, bringing heavy snow and extremely strong winds. These storms can account for the high percentages of cancellations and delays on those three days.

Hours with High Cancellations and Delays

Next, we will look at each scheduled departure hour and find what percentage of the flights scheduled for that hour were canceled or delayed by over an hour over the course of the year.

flights_canceled_delayed_hour <- flights %>% 
  group_by(hour) %>% 
  summarize(percent_canceled_or_delayed = mean((is.na(dep_delay)) | (dep_delay > 60)))

This code gives us the percentage of flights that were canceled or delayed by over an hour for each scheduled departure hour. One of these percentages stands out: for flights scheduled between 1:00 AM and 2:00 AM, 100% were either canceled or delayed by over an hour. A look at the flights data set tells us the reason. Only one flight was scheduled to leave during that hour, and it was canceled.
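One way to check an unusual value like this is to count the flights behind each hour and then inspect the rows for the hour in question. The sketch below assumes the packages loaded earlier; hour_counts and hour_one_flights are illustrative names.

```r
library(tidyverse)
library(nycflights13)

# Number of flights scheduled in each departure hour over the year
hour_counts <- flights %>%
  count(hour, name = "n_flights")
hour_counts

# The individual flights scheduled to depart during hour 1
hour_one_flights <- flights %>%
  filter(hour == 1) %>%
  select(month, day, sched_dep_time, dep_time, dep_delay, carrier, flight)
hour_one_flights
```

A percentage based on very few flights says little about the hour itself, so it is worth reading the counts alongside the percentages.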

The information provided in this data set can be displayed in a scatter plot:

ggplot(data = flights_canceled_delayed_hour) +
  geom_point(mapping = aes(x = hour, y = percent_canceled_or_delayed)) +
  labs(x = "scheduled departure hour", y = "percent canceled or delayed",
       title = "Flight Departure Status per Hour",
       caption = "Data obtained from transtats.bts.gov")

The previously discussed outlier at hour 1 is immediately noticeable. The rest of the plot, however, follows a fairly consistent pattern. Early flights have the smallest percentages of cancellations and delays. The percentages generally increase as the day goes on, peaking at 9:00 PM, and then decrease as the hours approach midnight. This means that between 9:00 PM and 10:00 PM is the riskiest time of day to fly, given its relatively high percentage of cancellations and delays of over an hour. More generally, any time during the evening could be considered risky. One possible cause is a buildup of delays throughout the day. Delayed arrivals or departures earlier in the day may cause subsequent flights to leave later than planned, and these delays would accumulate as the day goes on. Flights that are scheduled to leave later are more likely to be affected by an earlier flight because there are more chances for this to happen. Storms may also be more common at later hours.
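Because the outlier at hour 1 stretches the vertical axis, the pattern in the other hours can be easier to see if hours with very few flights are left out. The sketch below is one way to do that; the cutoff min_flights of 100 is an arbitrary assumption, and hourly_summary and hourly_plot are illustrative names.

```r
library(tidyverse)
library(nycflights13)

# Hypothetical cutoff: ignore hours with fewer than 100 scheduled flights in the year
min_flights <- 100

# Same hourly summary as above, with the number of flights per hour added
hourly_summary <- flights %>%
  group_by(hour) %>%
  summarize(
    n_flights = n(),
    percent_canceled_or_delayed = mean(is.na(dep_delay) | dep_delay > 60)
  ) %>%
  filter(n_flights >= min_flights)

# Scatter plot with the vertical axis labeled as percentages
hourly_plot <- ggplot(data = hourly_summary) +
  geom_point(mapping = aes(x = hour, y = percent_canceled_or_delayed)) +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = "scheduled departure hour", y = "percent canceled or delayed")

hourly_plot
```

The scales package is installed along with ggplot2, so no additional library call should be needed for the axis labels.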

On Time and Early Flights

While more than 35% of the flights scheduled to depart on the days in the earlier data set were either canceled or delayed by over an hour, plenty of flights still left on time or early on those days. Here is a data set that contains those flights:

flights_abnormally_ontime_or_early <- flights %>% 
  filter((month == 2 & day == 8) | (month == 2 & day == 9) | (month == 3 & day == 8) | (month == 5 & day == 23) | (month == 6 & day == 24) | (month == 6 & day == 28) | (month == 7 & day == 1) | (month == 7 & day == 10) | (month == 7 & day == 23) | (month == 9 & day == 2) | (month == 9 & day == 12) | (month == 12 & day == 5)) %>% 
  filter(dep_delay <= 0)
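Typing out each date works, but the same selection can also be made by joining against a table of the high-cancellation days, which avoids copying dates by hand. The following is a sketch of that alternative, not the method used above; bad_days, share, and ontime_or_early_on_bad_days are illustrative names.

```r
library(tidyverse)
library(nycflights13)

# Days on which more than 35% of flights were canceled or delayed by over an hour
bad_days <- flights %>%
  group_by(month, day) %>%
  summarize(share = mean(is.na(dep_delay) | dep_delay > 60), .groups = "drop") %>%
  filter(share > 0.35)

# Keep only flights on those days, then only the ones that left on time or early.
# semi_join() keeps rows of flights that have a match in bad_days without adding columns.
ontime_or_early_on_bad_days <- flights %>%
  semi_join(bad_days, by = c("month", "day")) %>%
  filter(dep_delay <= 0)
```

If the 35% threshold were changed, this version would pick up the new set of days automatically.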

These days had many cancellations and delays because of the weather conditions on each day (as discussed above). So, the flights that departed on time or early on these days were likely scheduled to leave in the morning, before the onset of the bad weather. To check this theory, we will use the above data set to create a new data set that tells us, for each day, the average scheduled departure hour of the flights that left on time or early.

# Average scheduled departure hour of the on-time or early flights, per day
flights_abnormally_ontime_or_early_w_avg_hour <- flights_abnormally_ontime_or_early %>% 
  group_by(month, day) %>% 
  summarize(avg_hour = mean(hour), .groups = "drop")

With this information, we can see that almost all of the days (11 out of 12) have an average scheduled departure hour before noon. Only one day, February 9th, had an average scheduled departure hour that was not in the morning. On this day, the average scheduled departure hour of flights that left on time or early was close to 5:00 PM. February 9th was one of the two days affected by the blizzard, “Winter Storm Nemo.” Since the storm began on the 8th, its effects would still have been present on the morning of the 9th. This would explain why, on average, the flights that left on time or early on the 9th departed later in the day, by which time the impacts of the storm would have started to subside.

Conclusion

Overall, the days with the most cancellations and long delays in 2013 coincided with severe weather, usually storms. Flights also tend to be canceled or delayed more often the later in the day they are scheduled to leave. On days with bad storms, flights scheduled to depart in the morning, before the weather sets in, are often still able to leave on time or early.