Introduction

This report seeks to answer the following question:

What is the relationship between the number of cancellations and delays that occurred upon departure and the weather that occurred?

We will be using the flights data set from the nycflights13 package. It contains the scheduled and actual departure and arrival times, along with the delays, for every flight that departed from a New York City airport in 2013. There are 19 variables for each flight; the relevant ones for this report are year, month, and day (the date the flight was scheduled), dep_time (when the flight departed the airport), sched_dep_time (when the flight was scheduled to take off), dep_delay (how many minutes late or early the flight took off), origin and dest (the airports the flight took off from and landed at), and hour (the scheduled hour of departure). The data set has 336,776 rows, which makes it too large to include in the report, but the data can be obtained from [https://www.transtats.bts.gov/Homepage.asp].

Throughout, we will need the tidyverse package, mainly to manipulate the data and create visualizations, as well as the DT package to display our data tables.

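library(nycflights13)  # provides the flights data set
library(tidyverse)     # data manipulation and plotting
library(DT)            # interactive data tables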

Relationship Between Cancellations/Delays and the Weather on the Day

To answer the overall question, we need to start on a smaller scale, because there is a lot of information to consider. The first thing we can look at is, for each day in 2013, what percentage of flights were cancelled or were delayed upon departure by more than one hour. Since nearly every day is likely to have at least a few such flights, we will narrow our consideration to the days on which more than 35% of flights were cancelled or delayed by more than an hour. It is reasonable to expect that this leaves far fewer days. We can see the results through a new data set:

delayed_flights_2 <- flights %>%
  mutate(
    # A flight with no recorded departure time is treated as cancelled
    cancelled = is.na(dep_time),
    # A flight counts as delayed if it departed more than 60 minutes late
    delayed = !is.na(dep_time) & dep_delay > 60
  ) %>%
  group_by(month, day) %>%
  summarize(
    cancelled_count = sum(cancelled),
    delayed_count = sum(delayed),
    total_count = n(),
    cancelled_percentage = (cancelled_count / total_count) * 100,
    delayed_percentage = (delayed_count / total_count) * 100,
    total_percentage = (cancelled_count + delayed_count) / total_count * 100,
    .groups = "drop"
  ) %>%
  # Keep only the days where more than 35% of flights were affected
  filter(total_percentage > 35)

We can transform this into a data table that would look like this:

datatable(delayed_flights_2, options = list(scrollX = TRUE))
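To make the percentage calculation concrete, here is a minimal sketch on made-up data. The toy_flights table and its values are invented for illustration and are not taken from the flights data set; only the column names and the 60-minute and 35% thresholds mirror the code above.

```r
library(dplyr)

# Made-up flights for two days; an NA dep_time marks a cancelled flight.
toy_flights <- tibble(
  month     = c(1, 1, 1, 1, 2, 2, 2, 2),
  day       = c(1, 1, 1, 1, 1, 1, 1, 1),
  dep_time  = c(600, NA, 930, 1200, 700, 815, 1000, 1300),
  dep_delay = c(75, NA, -3, 10, 0, 61, 5, -2)
)

toy_summary <- toy_flights %>%
  mutate(
    cancelled = is.na(dep_time),
    delayed   = !is.na(dep_time) & dep_delay > 60
  ) %>%
  group_by(month, day) %>%
  summarize(
    total_percentage = (sum(cancelled) + sum(delayed)) / n() * 100,
    .groups = "drop"
  )

# Keep only the days above the 35% threshold
toy_high <- filter(toy_summary, total_percentage > 35)
toy_high
```

On the first toy day, one of four flights is cancelled and one is delayed by more than an hour, so the day's rate is 50% and it passes the filter. On the second toy day only one of four flights qualifies, giving 25%, so that day is dropped.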

The delayed_flights_2 data set lists the days on which more than 35% of flights were cancelled or delayed by more than one hour. Having identified these days, it is reasonable to ask why the cancellations and delays occurred. Our hypothesis is that many of the flights were cancelled or delayed due to poor weather, including snow, rain, and dense fog. To test this hypothesis, we researched the weather in New York on each of the dates that experienced a high percentage of flight cancellations or delays.

Our research shows that most of the dates have a reasonable explanation for the high number of cancellations and delays, but for some the explanation isn’t clear. First we can look at the days with clearer explanations. On the first three dates, 2-8, 2-9, and 3-8, snow storms were moving through New York and brought stronger winds with them. On many other dates the weather was cloudy or foggy, which could explain the high percentage of cancellations and delays. These dates include 5-23, 6-28, 7-1, 7-10, 7-23, 9-2, and 12-5. Then there are a few dates where the explanation isn’t very clear: 6-24, which was sunny but hot, and 9-12, which was sunny but also foggy. The heat and fog could have caused the cancellations and delays, or they could have resulted from other factors such as heavy air traffic or mechanical issues.

This research suggests that our hypothesis was mostly correct, but for some dates we cannot find a definitive answer.

Relationship Between Cancellations/Delays and the Hour

Another consideration is how the percentage of cancellations and delays relates to the hour of the day at which flights are scheduled to take off. We expect more cancellations and delays in the afternoon, because from our prior knowledge the afternoon tends to be when the greatest number of flights take off. We can test this hypothesis by creating a data set that shows, for each scheduled departure hour, the percentage of flights that were cancelled or delayed by more than an hour.

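hourly_flights <- flights %>%
  mutate(
    # Same definitions as before: no departure time means cancelled,
    # and delayed means departing more than 60 minutes late
    cancelled = is.na(dep_time),
    delayed = !is.na(dep_time) & dep_delay > 60
  ) %>%
  # hour is the scheduled departure hour
  group_by(hour) %>%
  summarize(
    cancelled_count = sum(cancelled),
    delayed_count = sum(delayed),
    total_count = n(),
    total_percentage = (cancelled_count + delayed_count) / total_count * 100
  )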

We can also transform this into a data table that would look like this:

datatable(hourly_flights, options = list(scrollX = TRUE))

Here we can see that many of the percentages of delayed and cancelled flights are similar to one another, but they tend to rise as the day goes on before turning down again. To better understand this data table and to reach a conclusion about our hypothesis, we can plot the data set as a scatter plot:

# Drop the 1 am hour, which contains a single flight
hourly_flights_2 <- hourly_flights %>%
  filter(hour > 1)

# Scatter plot of the hourly percentages with a smoothed trend line
ggplot(data = hourly_flights_2, mapping = aes(x = hour, y = total_percentage)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(x = "Hour",
       y = "Percent of Flights Cancelled or Delayed",
       title = "Relationship Between Hour and Percentage of Flights Cancelled or Delayed")

It should be noted that we filtered out the 1 am hour. The only flight scheduled at that time was cancelled, and no other flights are recorded until 5 am. With a single cancelled flight, that hour has a rate of 100%, which makes it an outlier that would distort the scatter plot. Before we reach a conclusion about our hypothesis, we should first analyze the graph.
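The same exclusion could also be expressed in terms of group size rather than a specific hour. The sketch below is one possible alternative, not the approach used in this report: the toy_hourly values are made up, and the min_flights cutoff of 30 is a hypothetical choice.

```r
library(dplyr)

# Made-up hourly summary; hour 1 has a single scheduled flight, which was cancelled.
toy_hourly <- tibble(
  hour            = c(1, 5, 6, 7),
  cancelled_count = c(1, 2, 10, 12),
  delayed_count   = c(0, 8, 40, 60),
  total_count     = c(1, 200, 1000, 1200)
) %>%
  mutate(total_percentage = (cancelled_count + delayed_count) / total_count * 100)

# Hypothetical cutoff: ignore hours with too few flights to give a stable percentage
min_flights <- 30

toy_hourly_filtered <- toy_hourly %>%
  filter(total_count >= min_flights)

toy_hourly_filtered
```

With one flight in the group, the percentage can only be 0% or 100%, so filtering on total_count removes the unstable hour without naming it explicitly.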

One likely explanation for the shape of the graph is that most flights leave in the afternoon and early evening rather than in the early morning or late at night. Delays also tend to pile up as the day goes on: if one flight is delayed, the flights behind it are likely to be delayed as well, because they cannot go out until the earlier ones have cleared, and the effect keeps repeating. This also gives us insight into the downturn towards the end of the graph, which could occur because there are fewer departures late in the day, so the airports begin to catch up on the flights that were delayed earlier.

From this we can infer that the riskiest time of day to fly, if you want to avoid cancellations and long delays, is the afternoon and evening (roughly 2 pm to 9 or 10 pm). This is when the most flights go out, and if one is delayed the delays can begin to pile up. Therefore, if you want to avoid cancellations and long delays, you should try to find a flight that takes off in the early morning or later at night.

Now that we have interpreted the data, we can reach a conclusion about our hypothesis. Our hypothesis was correct in the sense that the rate of delays and cancellations is high in the afternoon. It did, however, omit the fact that this carries on into the evening hours, when a lot of delays and cancellations occur as well.

Relationship Between Highest Days with Cancellations/Delays & How Many Flights Left Early

To get closer to an answer to our overarching question, we should consider how many flights left on time or early on the days that had the most cancellations and delays. We expect that there were still a decent number of such flights, because if the delays were caused by weather, it is likely that the weather didn’t last all day. We can test this theory by creating a data set that shows all the flights that were able to leave on time or early on the days with the most delays.

above_flights_date <- flights %>%
  filter(((month == 2 & day == 8) | (month == 2 & day == 9) | (month == 3 & day == 8) |
          (month == 5 & day == 23) | (month == 6 & day == 24) | (month == 6 & day == 28) |
          (month == 7 & day == 1) | (month == 7 & day == 10) | (month == 7 & day == 23) |
          (month == 9 & day == 2) | (month == 9 & day == 12) | (month == 12 & day == 5)),
         dep_delay <= 0) %>%
  select(year, month, day, dep_delay)

We can transform this data set into a data table like this:

datatable(above_flights_date, options = list(scrollX = TRUE))
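The long date condition in the filter above could also be written more compactly by keeping the dates in a small lookup table and using dplyr's semi_join(), which keeps only the rows of the first table that have a match in the second. The following is a minimal sketch of that alternative on made-up data. The names high_delay_dates, toy_flights, and toy_on_time are hypothetical, and only two of the twelve dates are included for brevity.

```r
library(dplyr)

# Dates of interest stored once as a lookup table (two of the twelve, for illustration)
high_delay_dates <- tibble(month = c(2, 3), day = c(8, 8))

# Made-up flights: three fall on a listed date, one does not
toy_flights <- tibble(
  year      = 2013,
  month     = c(2, 2, 3, 4),
  day       = c(8, 8, 8, 8),
  dep_delay = c(-5, 90, 0, -2)
)

toy_on_time <- toy_flights %>%
  # Keep only flights whose (month, day) appears in the lookup table
  semi_join(high_delay_dates, by = c("month", "day")) %>%
  # Then keep only flights that left on time or early
  filter(dep_delay <= 0) %>%
  select(year, month, day, dep_delay)

toy_on_time
```

In this report the lookup table would not need to be typed by hand. The delayed_flights_2 data set already has month and day columns, so it could presumably serve as the second argument to semi_join().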

The above_flights_date data set contains the flights that were able to leave on time or early on the dates with the highest percentage of delayed or cancelled flights. It shows that our theory was correct: many flights still left early or on time on those days. We can hypothesize that this was because these flights left in the morning, before the weather hit the area. We can test this hypothesis by creating a data set that gives the average scheduled hour of departure for these flights on each date.

hour_on_time_flights <- flights %>%
  filter(((month == 2 & day == 8) | (month == 2 & day == 9) | (month == 3 & day == 8) |
          (month == 5 & day == 23) | (month == 6 & day == 24) | (month == 6 & day == 28) |
          (month == 7 & day == 1) | (month == 7 & day == 10) | (month == 7 & day == 23) |
          (month == 9 & day == 2) | (month == 9 & day == 12) | (month == 12 & day == 5)),
         dep_delay <= 0) %>%
  group_by(year, month, day) %>%
  summarize(
    # Average scheduled departure hour of the on-time or early flights
    avg_scheduled_hour = mean(hour, na.rm = TRUE),
    .groups = "drop"
  )

We can then turn this into a data table:

datatable(hour_on_time_flights, options = list(scrollX = TRUE))

The hour_on_time_flights data set gives the average scheduled hour of departure of the on-time or early flights for each of the dates with the highest percentage of cancellations and delays. The table mostly confirms our hypothesis. Of all the dates with the most cancellations and delays, only one had an average departure hour in the afternoon. On most of the dates, the flights that left on time or early had an average departure hour between 9 and 10 am.

The one date with an average departure hour in the afternoon was 2-9-2013, with an average of around 4 or 5 pm. Based on the research above, we can attribute this to the snow storm that went through New York. The storm started in the afternoon of 2-8-2013 and continued into the morning of 2-9-2013. Since the storm ended in the morning of 2-9, it is reasonable that the flights that left on time or early that day were mostly scheduled in the afternoon, with possibly a few in the morning.

Conclusion

In summary, we can conclude that in most cases the number of cancellations and delays that occurred upon departure is related to the weather on those dates. This isn’t always true, but weather is a reasonable explanation for a large share of the days with high cancellation and delay rates. We can also conclude that a large number of cancelled and delayed flights occur in the afternoon and evening rather than in the morning. Our data shows this through the percentage of scheduled flights that were cancelled or delayed by more than an hour, both on a given day and at a given hour. However, further research must be done to confirm these connections, because the flights data doesn’t tell us the weather that occurred on each day.