Introduction

This report seeks to answer the following question:

Is there a relationship between weather conditions and the status of flights (canceled, delayed, on time, or early) that left New York in 2013?

We will be using the flights data set from the nycflights13 package, alongside the tidyverse. It contains every flight that departed New York in 2013, covering all days of the year. There are 19 variables in the data set, but the most important ones for this analysis are: month (month of departure), day (day of departure), dep_time (the actual departure time of a flight, in local time, formatted HHMM or HMM), dep_delay (the departure delay in minutes, where negative values represent early departures), and hour (the hour component of the scheduled departure time). The full data set is too large to showcase in its entirety, so only the necessary variables are described above. The source for the full data set is the Bureau of Transportation Statistics: https://www.transtats.bts.gov/Homepage.asp.
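As a quick check on the description above, the following sketch loads the data and looks at the variables this report relies on. It assumes that the dplyr and nycflights13 packages are installed.

```r
library(dplyr)
library(nycflights13)  # provides the flights data frame

# Number of rows (flights) and columns (variables)
dim(flights)

# Peek at the variables used throughout this report
flights %>%
  select(month, day, dep_time, dep_delay, hour) %>%
  glimpse()
```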

Percentage of Delayed or Canceled Flights

The problem that we are trying to solve is on how many days in 2013 more than 35% of flights were either canceled or delayed by more than an hour. We suspect that there would not be many such days. A flight is usually canceled only in serious circumstances, when some factor prevents it from traveling safely. Likewise, a flight is usually delayed by more than an hour only when whatever is stopping it from taking off is expected to subside later in the day. We can test this hypothesis by creating a new data set from the flights one:

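library(tidyverse)
library(nycflights13)  # provides the flights data set
library(DT)            # provides datatable()

percentage_flights <- flights %>%
  group_by(year, month, day) %>%
  # A missing dep_time is treated as a canceled flight
  summarize(canceled_flights = sum(is.na(dep_time)),
            delayed_flights = sum(dep_delay > 60 & !is.na(dep_time)),
            percent_flights = (canceled_flights + delayed_flights) / n()) %>%
  # Keep only the days where more than 35% of flights were disrupted
  filter(percent_flights > .35)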

datatable(percentage_flights, options = list(scrollX = TRUE))
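To make the counting logic above concrete, here is a minimal sketch on a hypothetical four-flight table (the `toy` data is invented for illustration and is not part of flights). It treats a missing dep_time as a cancellation and uses `na.rm = TRUE` as an alternative way of keeping canceled flights out of the delayed count.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: the second flight is canceled,
# so both dep_time and dep_delay are NA
toy <- tibble(
  dep_time  = c(900L, NA, 1530L, 1200L),
  dep_delay = c(-5, NA, 75, 10)
)

toy_summary <- toy %>%
  summarize(
    canceled = sum(is.na(dep_time)),
    # NA > 60 evaluates to NA, so na.rm = TRUE skips the canceled flight
    delayed  = sum(dep_delay > 60, na.rm = TRUE),
    share    = (canceled + delayed) / n()
  )

toy_summary
```

One canceled flight plus one long delay out of four flights gives a share of 0.5.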

I would say my hypothesis was somewhat right. From the new data set we can see that there are only 12 days in 2013 on which more than 35% of flights were canceled or delayed by more than an hour. On most of those days, delays outnumbered cancellations. Only three days, February 8th, February 9th, and May 23rd, had more canceled flights than delayed ones. What we can infer from this is that whatever was stopping flights from departing at their scheduled times was temporary most of the time. It could be, for example, that the weather on these specific days was not favorable to fly through, so flights had to be delayed for quite some time, or canceled outright when conditions were not expected to improve in the near future.
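The comparison between cancellations and long delays can also be read off programmatically. The sketch below recomputes the daily summary and adds a flag column; the names `worst_days` and `mostly_canceled` are introduced here for illustration only.

```r
library(dplyr)
library(nycflights13)

worst_days <- flights %>%
  group_by(month, day) %>%
  summarize(
    canceled = sum(is.na(dep_time)),
    delayed  = sum(dep_delay > 60, na.rm = TRUE),
    share    = (canceled + delayed) / n(),
    .groups  = "drop"
  ) %>%
  filter(share > 0.35) %>%
  # TRUE on days where cancellations outnumber long delays
  mutate(mostly_canceled = canceled > delayed)

worst_days
```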

Weather in New York

From the previous problem we can see that there are 12 days, spread across several months, with a large share of canceled and delayed flights leaving New York. We suspect that the underlying reason for so many cancellations and delays on these specific dates is severe weather, which is generally one of the biggest reasons flights are unable to take off and reach their destinations. We can check this theory by looking up the weather in New York on each of these days in 2013 and seeing whether there is a connection:

The Resources I Used:

https://abcnews.go.com/Travel/blizzard-2013-cancels-thousands-flights/story?id=18438547.

https://www.washingtonpost.com/blogs/capital-weather-gang/post/incredible-imagery-from-the-february-8-9-2013-new-england-blizzard/2013/02/11/b51df444-73f1-11e2-aa12-e6cf1d31106b_blog.html.

https://www.nbcnews.com/news/photo/snow-storm-blankets-new-york-city-leaving-pretty-scenes-sloshy-flna1c8776255.

https://www.weather.gov/media/btv/events/2013May23_Flooding.pdf.

https://www.weather.gov/dvn/summary_062413.

https://cbs6albany.com/newsletter-daily/new-york-and-new-england-flooding-severe-weather-awareness-week-2023-capital-region-mohawk-valley-lake-george-saratoga-adirondacks-catskills-vermont-berkshires.

https://www.njweather.org/content/yet-another-hot-summer-month-july-2013-summary.

https://www.weathergamut.com/2013/08/01/nyc-monthly-summary-july-2013/.

https://lite987.com/labor-day-thunderstorm-2013/.

https://wx1box.org/2013/09/.

https://www.cbsnews.com/newyork/news/fog-blankets-area-wreaks-havoc-on-travel/.

The February dates coincided with a major blizzard that hit New York and caused most of the flights to be canceled or delayed. The region experienced 2-3 feet of snow in places, heavy winds, and ground blizzard conditions. The March date also had a severe snowstorm that blanketed New York in 4-7 inches of snow, along with high winds and rain. The May date saw heavy rainfall and thunderstorms that caused significant flash and river flooding, which damaged roads and property around the area. June 24 had heavy rain, damaging winds, widespread power outages, and 2 confirmed tornadoes in and around the New York area. June 28, on the other hand, had a heavy downpour and flash flooding in the New York area. The July dates fell in the fifth-hottest July in recent years for New York, with an average statewide temperature of around 78.2°F - 79.8°F and highs reaching 90°F or more. On the September dates, a heavy storm went through New York, bringing heavy winds, heavy rain, flash floods, and hail. Finally, the December date had a thick blanket of fog over most of New York that kept many flights from leaving. All in all, these findings suggest that the large number of delays and cancellations in New York on these days was the result of bad weather conditions.
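The news reports above could be cross-checked against recorded observations. The following is only a sketch, assuming the hourly `weather` table that ships with the nycflights13 package; it is not used elsewhere in this report. The `bad_days` table simply restates the twelve dates found earlier, and the summary columns are names chosen for illustration.

```r
library(dplyr)
library(tibble)
library(nycflights13)  # also provides an hourly weather table per airport

# The twelve high-disruption days identified earlier in this report
bad_days <- tibble(
  month = c(2, 2, 3, 5, 6, 6, 7, 7, 7, 9, 9, 12),
  day   = c(8, 9, 8, 23, 24, 28, 1, 10, 23, 2, 12, 5)
)

daily_weather <- weather %>%
  semi_join(bad_days, by = c("month", "day")) %>%
  group_by(month, day) %>%
  summarize(
    total_precip = sum(precip, na.rm = TRUE),      # inches, over all airports and hours
    min_visib    = min(visib, na.rm = TRUE),       # miles
    max_wind     = max(wind_speed, na.rm = TRUE),  # mph
    max_temp     = max(temp, na.rm = TRUE),        # degrees Fahrenheit
    .groups      = "drop"
  )

daily_weather
```

Low visibility would be expected on the fog day, high precipitation on the storm days, and high temperatures on the July days, if the reports match the recorded data.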

Canceled or Delayed Scheduled Percentage

The question that we are trying to answer here is what percentage of flights, for each scheduled departure hour of the day, were either canceled or delayed by more than an hour. We can hypothesize that the percentages will not be very large for most hours, because most days probably did not have any major weather event or other problem that could have caused a wave of cancellations or delays. A handful of hours might have higher percentages, but the rest should be roughly similar. We can test this theory by creating a new data set from the flights data set:

percentage_hour_flights <- flights %>%
  # Drop the 1 AM hour, which contains a single (canceled) flight
  filter(hour > 1) %>%
  group_by(hour) %>%
  # A missing dep_time is treated as a canceled flight
  summarize(canceled_flights = sum(is.na(dep_time)),
            delayed_flights = sum(dep_delay > 60 & !is.na(dep_time)),
            percent_flights = ((canceled_flights + delayed_flights) / n()))

datatable(percentage_hour_flights, options = list(scrollX = TRUE))

I believe that my hypothesis was about right. Only eight hours had a percentage of canceled or delayed flights above 10%, and the highest percentage was only about 20%. This means that relatively few flights were delayed or canceled in most of these hours. I removed one data point, the 1 hour mark, because it was a huge outlier in the data table. Only one flight was scheduled at that hour, and it was canceled. That gave the hour a 100% canceled or delayed percentage, which is misleading because the hour had far fewer flights than the others.
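The decision to drop the 1 hour mark can be checked by counting how many flights were scheduled in each hour. This is a small sketch; the names `flights_per_hour` and `n_flights` are chosen here for illustration.

```r
library(dplyr)
library(nycflights13)

# Number of scheduled flights in each departure hour, smallest first
flights_per_hour <- flights %>%
  count(hour, name = "n_flights") %>%
  arrange(n_flights)

head(flights_per_hour)
```

An alternative to hard-coding `hour > 1` would be to filter on a minimum number of flights per hour, which makes the reason for the exclusion explicit.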

Relationship Between the Hour and Canceled or Delayed Flights

The problem to investigate here is whether there is a relationship between the scheduled hour and the share of flights that were canceled or delayed by more than an hour. It seems reasonable to think that there would not be much of a correlation, because the leading cause of delays and cancellations is bad weather, which is hard to predict and can occur at any time of day. I believe that the graph will show no correlation and the points will look scattered. We can check this speculation with a scatter plot and a line graph:

ggplot(data = percentage_hour_flights, mapping = aes(x = hour, y = percent_flights)) +
  geom_point() +
  geom_line() +
  # percent_flights is stored as a proportion, so format the axis as a percentage
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = "time of scheduled departure broken into hours",
       y = "percentage of canceled or delayed flights",
       title = "The relationship between hour and canceled or delayed flights")

My hypothesis was incorrect. There is in fact a clear relationship between the scheduled hour and the share of canceled or delayed flights. A likely explanation is that when a flight early in the day gets canceled or delayed, it causes a ripple effect that makes flights later in the day also get canceled or delayed. This can go on for hours until it dies down at the end of the day, when the airport can finally catch up. Weather is probably a big contributor to this ripple effect, because bad conditions can last for hours and keep flights canceled or delayed until it is safe enough for them to fly again.
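The strength of the relationship seen in the plot can be summarized with a single number. The sketch below uses Spearman's rank correlation, which is an assumption on my part rather than something computed in the original analysis; `hourly`, `share_disrupted`, and `rho` are illustrative names.

```r
library(dplyr)
library(nycflights13)

hourly <- flights %>%
  filter(hour > 1) %>%  # drop the single-flight 1 AM hour, as in the report
  group_by(hour) %>%
  summarize(
    # Share of flights canceled (missing dep_time) or delayed by over an hour
    share_disrupted = mean(is.na(dep_time) | (!is.na(dep_delay) & dep_delay > 60)),
    .groups = "drop"
  )

# Rank correlation between the hour and the disrupted share
rho <- cor(hourly$hour, hourly$share_disrupted, method = "spearman")
rho
```

A value close to 1 would indicate that the disrupted share rises almost steadily with the hour. It would not by itself show that earlier delays cause later ones.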

Explaining the Relationship

There are a few questions we are trying to answer here about the previous graph: “Why does the plot have the shape that it does?”, “Are there any outliers that can be seen on the graph?”, and “Which hours are the worst for cancellations and delays?” What I believe is:

The plot has this shape because there is a relationship between the scheduling of flights and the number of flights. As the day goes on, more and more flights are scheduled. Because there are more flights, one cancellation or delay can cause a snowball effect in which the flights that come after it are delayed or canceled as well. This snowball effect continues throughout the day and finally eases late at night, when the airport can catch up. For example, if a flight set to depart at 9:00 AM is delayed by an hour, the flights following it may have to be adjusted around that delay, leading to later departure times.

Yes, there was one huge outlier, which I chose to remove from the graph. That outlier is US Airways flight 1632 from EWR to LGA at 1:00 AM. It is the only flight that was scheduled at 1 AM, and it was canceled, likely because it was such a short flight, covering only 17 miles and taking about 90 minutes in total. This outlier can safely be disregarded, because with only one flight at that hour, the hour ends up with a 100% canceled or delayed record.

The riskiest time of day to fly, if you want to avoid cancellations or long delays, would be the 21 hour mark (9:00 PM). This hour has the highest percentage of flights that were either canceled or delayed, at about 20%. The percentage is likely this high because of all the earlier flights that were canceled or delayed. As more and more flights get canceled or delayed earlier in the day, they can get pushed back 2-7 times to later points in the day, which forces the flights that follow them to change their times as well.

Flights that Left Early or On Time

The problem to investigate here is how many flights, on the twelve specific days that we found earlier, were able to leave New York early or on time. We can hypothesize that the number of early or on-time flights will be greater than the number of delayed or canceled flights, because the main factor that stops flights is bad weather, and that generally does not last all day. We can test this theory by creating a new data set from the flights one:

on_time_percentage_flights <- flights %>%
  # Keep flights on the twelve high-disruption days that left early or on time
  filter((month == 2 & day == 8) | (month == 2 & day == 9) |
           (month == 3 & day == 8) | (month == 5 & day == 23) |
           (month == 6 & day == 24) | (month == 6 & day == 28) |
           (month == 7 & day == 1) | (month == 7 & day == 10) |
           (month == 7 & day == 23) | (month == 9 & day == 2) |
           (month == 9 & day == 12) | (month == 12 & day == 5),
         dep_delay <= 0)

datatable(on_time_percentage_flights, options = list(scrollX = TRUE))

The new data set contains over 3,000 entries across the twelve days, each one a flight that left on time or early on one of the bad days. This shows that a substantial number of flights were still able to leave in a timely manner rather than getting caught in a storm and being delayed or canceled. On its own, however, this count does not confirm my hypothesis: to show that on-time flights outnumbered delayed or canceled ones, it has to be compared with the number of delayed and canceled flights on the same days.
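To put the count above next to the delayed and canceled flights, one option is to label every flight on the twelve days with a status and count each group. This is a sketch only; the `bad_days` table restates the dates from the filter above, and the status labels and cut-offs are choices made here for illustration.

```r
library(dplyr)
library(tibble)
library(nycflights13)

# The twelve high-disruption days identified earlier in this report
bad_days <- tibble(
  month = c(2, 2, 3, 5, 6, 6, 7, 7, 7, 9, 9, 12),
  day   = c(8, 9, 8, 23, 24, 28, 1, 10, 23, 2, 12, 5)
)

status_counts <- flights %>%
  semi_join(bad_days, by = c("month", "day")) %>%
  mutate(status = case_when(
    is.na(dep_time) ~ "canceled",            # missing dep_time treated as canceled
    dep_delay <= 0  ~ "early or on time",
    dep_delay <= 60 ~ "delayed up to 1 hour",
    TRUE            ~ "delayed over 1 hour"
  )) %>%
  count(status, name = "n_flights") %>%
  mutate(share = n_flights / sum(n_flights))

status_counts
```

The hypothesis holds only if the "early or on time" share is larger than the other groups combined.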

Average Scheduled Hour of Departure

The question we are trying to answer here is whether the flights that were able to leave early or on time despite bad weather were mostly the ones scheduled in the morning, before the severe weather hit. I think this guess is true. The previous problem showed that a lot of flights left early or on time, which makes me believe that many of them were scheduled to depart in the early morning and left before the bad weather arrived. We can test this guess by finding the average scheduled hour of departure:

avg_sched_hour_flights <- on_time_percentage_flights %>%
  group_by(month, day) %>%
  # Average scheduled departure hour of the early or on-time flights on each day
  summarize(avg_sched_hour_dep = mean(hour), .groups = "drop")

datatable(avg_sched_hour_flights, options = list(scrollX = TRUE))

Yes, my findings support the guess that the flights that were able to leave early or on time despite the bad weather were mostly the ones scheduled in the morning, before the severe weather hit. Almost all of the days had an average scheduled hour of departure between the 8 and 10 hour marks (8:00 AM - 10:00 AM), and only one had an average around the 17 hour mark (5:00 PM). This suggests that many of the flights were able to leave before the severe weather arrived on their day, except for flights on February 9th.
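An average of 8 to 10 is easier to interpret next to a baseline. The following sketch, which is an addition and not part of the original analysis, compares the average scheduled hour of the early or on-time flights with that of all flights on the same days. The `bad_days` table restates the twelve dates, and the column names are illustrative.

```r
library(dplyr)
library(tibble)
library(nycflights13)

# The twelve high-disruption days identified earlier in this report
bad_days <- tibble(
  month = c(2, 2, 3, 5, 6, 6, 7, 7, 7, 9, 9, 12),
  day   = c(8, 9, 8, 23, 24, 28, 1, 10, 23, 2, 12, 5)
)

avg_hour_comparison <- flights %>%
  semi_join(bad_days, by = c("month", "day")) %>%
  group_by(month, day) %>%
  summarize(
    avg_hour_all     = mean(hour),
    # Restrict to flights that departed early or on time
    avg_hour_on_time = mean(hour[!is.na(dep_delay) & dep_delay <= 0]),
    .groups = "drop"
  )

avg_hour_comparison
```

If `avg_hour_on_time` sits well below `avg_hour_all` on a given day, the on-time flights were concentrated earlier in the schedule than flights overall.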

Average Departure Hour

The question here is whether any of the problematic dates from the previous question had an average departure hour in the afternoon. I believe that there could be a couple of such dates. We can answer this question by looking at the problematic dates in the previous data table and checking their average departure hour:

Yes, but there was only one problematic date with an average departure hour in the late afternoon: February 9th. I believe the flights on February 9th were not able to leave early in the morning because the severe weather carried over from the previous day. The blizzard was still occurring that morning, so many flights had to wait for it to pass fully and were postponed to later in the day.
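The explanation for February 9th can be examined hour by hour. The sketch below is an illustrative addition; `feb9_by_hour` and its column names are chosen here and do not appear in the original analysis.

```r
library(dplyr)
library(nycflights13)

feb9_by_hour <- flights %>%
  filter(month == 2, day == 9) %>%
  group_by(hour) %>%
  summarize(
    scheduled = n(),
    canceled  = sum(is.na(dep_time)),             # missing dep_time treated as canceled
    on_time   = sum(dep_delay <= 0, na.rm = TRUE),
    .groups   = "drop"
  )

feb9_by_hour
```

If the morning rows are dominated by cancellations and the on-time departures appear mainly in the afternoon rows, that would be consistent with the blizzard carrying over from February 8th.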

Conclusion

In conclusion, the weather conditions do appear to have a relationship with the flights that left New York in 2013. The days with the most cancellations and long delays line up with reports of severe weather, which points to a connection between severe weather conditions and the status of a flight. The data sets and the plot created here all help to show that weather does have an effect on whether a flight gets canceled, is delayed, leaves on time, or leaves early. All in all, the variables used from flights in this analysis were enough to build the data sets and plot and to draw out this information.