Introduction

This report analyzes the flights data set by examining several key relationships within it. It isolates days with large shares of cancelled or heavily delayed flights, examines the relationship between the scheduled hour of departure and delays or cancellations, and looks at which flights still left on time on the worst days.

The data comes from the flights data set, which contains 336,776 observations across 19 variables. The variables most relevant to this report are dep_delay and arr_delay, the departure and arrival delays for each flight; dep_time, the departure time in hours and minutes; month and day, the numerical values of the month and the day; sched_dep_time, the scheduled time of departure in hours and minutes; and hour, the scheduled hour of departure. The data set is too large to display here. It was sourced from the Bureau of Transportation Statistics.

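In this report I use the tidyverse to create the summaries and visualizations.

library(tidyverse)
# Assumption: the flights table comes from the nycflights13 package,
# which matches the 336,776 x 19 shape described above.
library(nycflights13)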

Notable Delays and Cancellation Statistics

An important statistic when comparing flights is how often they are cancelled or delayed. First, this report looks at the percentage of flights each day that were either cancelled or delayed by at least an hour.

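# Daily share of flights that were cancelled (missing dep_delay)
# or departed at least 60 minutes late.
flights %>%
  group_by(month, day) %>%
  summarize(canc_or_delay = mean(is.na(dep_delay) | dep_delay >= 60) * 100,
            .groups = "drop") %>%
  # Keep only the days where that share is 35% or higher.
  filter(canc_or_delay >= 35) %>%
  arrange(desc(canc_or_delay))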

This gives, for each day, the percentage of flights that were either cancelled or delayed by at least an hour, keeping only the days where that percentage is 35% or higher. Twelve days meet this threshold, all with inordinately high cancellation or delay percentages.
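A single logical expression covers both cases because a cancelled flight has a missing dep_delay. The short sketch below uses a made-up vector of delays (not values from the flights data) to show how is.na(x) | x >= 60 treats missing values and how mean() turns the result into a percentage.

```r
# Toy departure delays in minutes (made-up values, not from flights);
# NA stands for a cancelled flight.
dep_delay <- c(-5, 10, 75, NA, 60, 30, NA, 120)

# TRUE | NA evaluates to TRUE in R, so cancelled flights are counted
# even though the comparison NA >= 60 is itself NA.
flagged <- is.na(dep_delay) | dep_delay >= 60

# mean() of a logical vector is the share of TRUE values.
canc_or_delay <- mean(flagged) * 100

print(flagged)
print(canc_or_delay)
```

Because the is.na() check resolves every missing value to TRUE, no na.rm argument is needed in mean().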

A reasonable hypothesis is that the weather in New York City on those days was responsible. On 2/8 and 2/9, a long snowstorm lasted most of the day on 2/8 and continued overnight into 2/9, which would certainly have caused delays or cancellations. 3/8 was extremely foggy, with some snow and ice throughout the day, which would also have been a large contributor to the cancellation and delay percentage. 5/23 was a little rainy, but I don’t see that being a major factor on that day. 6/24 was sunny and not very windy, so I don’t think the weather was a factor. 6/28 was a little cloudy, which is not much of a reason to cancel or delay flights. 7/1 was overcast and a little foggy; that may have caused a slight bump in the percentage, but not enough to explain how high it was. 7/10 was a little cloudy, which was not a large factor. 7/23 had some early morning rain and scattered clouds for the rest of the day, and I don’t think the weather was a large factor. 9/2 had a lot of rain and fog, which could have caused some delays but probably no cancellations. 9/12 was clear for much of the day, but haze and fog, along with heavy rain in the evening, could have caused delays. 12/5 was overcast and foggy, which could have caused some delays and cancellations. Overall, I think the weather in New York City was a factor on some of these days but not all of them.
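The weather hypothesis could also be checked against recorded observations rather than by description alone. The sketch below is one possible way to do that, not the method used for the descriptions above. It assumes an hourly weather table with month, day, precip (inches) and visib (miles) columns, as the weather table in nycflights13 has. The helper daily_weather_summary() and the toy readings are hypothetical.

```r
library(dplyr)

# Hypothetical helper: worst hourly precipitation and lowest visibility
# for each of the requested dates. `weather_df` is assumed to have
# month, day, precip and visib columns.
daily_weather_summary <- function(weather_df, days) {
  weather_df %>%
    semi_join(days, by = c("month", "day")) %>%  # keep only the requested dates
    group_by(month, day) %>%
    summarize(max_precip = max(precip, na.rm = TRUE),
              min_visib = min(visib, na.rm = TRUE),
              .groups = "drop")
}

# Toy hourly readings (made-up values) for two dates.
toy_weather <- tibble(
  month  = c(2, 2, 2, 6, 6),
  day    = c(8, 8, 8, 24, 24),
  precip = c(0.1, 0.3, 0.2, 0, 0),
  visib  = c(2, 0.5, 1, 10, 10)
)
toy_days <- tibble(month = c(2, 6), day = c(8, 24))

toy_summary <- daily_weather_summary(toy_weather, toy_days)
print(toy_summary)
```

Taking the maximum and minimum rather than a sum avoids counting the same hour several times when the weather table holds readings from more than one airport.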

Next, this report will look at what hours during the day had the most cancellations and delays.

# Share of flights per scheduled hour that were cancelled (missing dep_delay)
# or delayed by at least 60 minutes, plus the number of flights in that hour.
flights_w_canc_or_delay_hour <- flights %>%
  group_by(hour) %>%
  summarize(can_del_per = mean(is.na(dep_delay) | dep_delay >= 60) * 100,
            count = n()) %>%
  arrange(hour)

ggplot(data = flights_w_canc_or_delay_hour) +
  geom_line(mapping = aes(x = hour, y = can_del_per)) +
  labs(x = "Hour",
       y = "Cancellation or Delay Percentage",
       title = "Percentage of Flights Cancelled or Delayed by at Least 60 Minutes, by Scheduled Hour") +
  scale_x_continuous(breaks = 1:23)

This shows, for each scheduled departure hour, the percentage of flights that were either cancelled or delayed by at least an hour.

We see a general upward trend in this graph. I think this could be because departure delays stack up as the day goes on. If one flight is delayed leaving a gate, it can create a cascading effect where the next flight is delayed, and so on. The percentage tapers off toward the end of the day because there are fewer flights during these hours and the airport is able to catch up.

There is a noticeable outlier at 1 in the morning. This is because there was only one flight scheduled at that time and it was cancelled.
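Since the hourly summary already carries a count column, hours with too few flights to give a meaningful percentage can be set aside before plotting. The sketch below shows one way to do this. The hourly table is a made-up stand-in with the same columns as flights_w_canc_or_delay_hour, and the cutoff of 100 flights is an assumed value, not one taken from the analysis.

```r
library(dplyr)

# Made-up stand-in with the same columns as flights_w_canc_or_delay_hour.
hourly <- tibble(
  hour        = c(1, 5, 6, 21),
  can_del_per = c(100, 2, 3, 17),
  count       = c(1, 2000, 26000, 11000)
)

# Assumed cutoff: hours with fewer flights than this are treated as sparse.
min_flights <- 100

# Hours with enough flights for the percentage to be meaningful.
hourly_reliable <- hourly %>%
  filter(count >= min_flights)

# Hours dropped by the cutoff, kept separately so they can still be reported.
hourly_sparse <- hourly %>%
  filter(count < min_flights)

print(hourly_reliable)
print(hourly_sparse)
```

Plotting hourly_reliable instead of the full summary would keep a single cancelled flight from dominating the y-axis.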

It would appear that the riskiest time of day to fly would be 9 pm. I think this is because around 9-10 pm is when the flights start tapering off for the day, so 9 pm is the time that sees the most backup before the airport can start catching up.

Next, this report will isolate flights on days with a large cancellation and delay percentage that left on time or early.

flights_on_time <- flights %>%
  # Keep only the twelve days with a high cancellation or delay percentage.
  filter((month == 2 & day %in% c(8, 9)) |
           (month == 3 & day == 8) |
           (month == 5 & day == 23) |
           (month == 6 & day %in% c(24, 28)) |
           (month == 7 & day %in% c(1, 10, 23)) |
           (month == 9 & day %in% c(2, 12)) |
           (month == 12 & day == 5)) %>%
  # Keep flights that left on time or early; cancelled flights (NA) drop out.
  filter(dep_delay <= 0)

This data set shows the flights on the days with large amounts of cancellations or delays that left either on-time or early.
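The date filter above can also be written as a lookup table of dates combined with semi_join(), which keeps the list of days in one place if it ever changes. The sketch below is an alternative formulation under that idea. The helper on_time_on_days() and the toy rows are hypothetical, and the twelve dates are the ones listed in the filter.

```r
library(dplyr)

# The twelve flagged dates, stored as a lookup table.
bad_days <- tibble(
  month = c(2, 2, 3, 5, 6, 6, 7, 7, 7, 9, 9, 12),
  day   = c(8, 9, 8, 23, 24, 28, 1, 10, 23, 2, 12, 5)
)

# Hypothetical helper: on-time or early departures on the listed days.
on_time_on_days <- function(flights_df, days) {
  flights_df %>%
    semi_join(days, by = c("month", "day")) %>%  # rows whose date is listed
    filter(dep_delay <= 0)                        # also drops NA (cancelled)
}

# Toy input (made-up rows) to show the behaviour.
toy_flights <- tibble(
  month     = c(2, 2, 2, 3, 4),
  day       = c(8, 8, 9, 8, 1),
  dep_delay = c(-3, 90, NA, 0, -10)
)

toy_on_time <- on_time_on_days(toy_flights, bad_days)
print(toy_on_time)
```

In the toy input, only the flights on 2/8 with a delay of -3 and on 3/8 with a delay of 0 remain. The late flight, the cancelled flight, and the flight on an unlisted date are all dropped.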

A reasonable explanation is that the majority of the flights that were able to leave on-time or early on those days left before any poor weather that may have caused delays.

flights_on_time %>%
  group_by(month, day) %>%
  summarize(avg_scheduled_dep_hour = mean(hour))

On most of these days, the average scheduled departure hour of the flights that departed on time falls in the morning. This makes sense because, for the most part, that was before the bad weather on those days.

2/9 has an average departure hour in the afternoon. This is against the trend, but it makes sense because there was an overnight storm that subsided in the morning. The snow and ice would have had to be cleared before departures could resume.

Conclusions

This report provided several useful insights into the flights data set. First, it isolated the dates on which a large proportion of flights were cancelled or delayed by at least an hour. Comparing those dates with the weather suggests that the weather in New York plays a role in cancellations and delays on some days, though not all. Next, the cancellation and delay percentage was computed for each scheduled hour of the day and plotted. The plot shows that the risk generally rises through the day and peaks around 9 pm before tapering off, so flights earlier in the day are less risky. Finally, the flights that left early or on time on the problematic days were isolated. Most of them can be explained as flights that departed before or after the bad weather. The average scheduled departure hour on the problematic dates supports this, and the exceptions were explained.