How Weather Affected Flights in NYC

Introduction

This report seeks to answer the following question:

Did the weather conditions significantly impact the departures of flights in New York City (NYC) in 2013?

I will be using a data set called flights from the R package “nycflights13”. It contains information on all of the flights that departed New York City (NYC) in 2013: 336,776 entries and 19 variables. The variables relevant to this report are month (the month of departure), day (the day of departure), hour (the hour of the scheduled departure time), dep_delay (the departure delay, in minutes), and dep_time (the actual departure time).

The source of this data is RITA, Bureau of Transportation Statistics, https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

Throughout this report, I will need the functionality of the tidyverse package and the nycflights13 package.

library(tidyverse)
library(nycflights13)

The Amount of Cancellations and Delays

The first piece of data needed when investigating the relationship between weather and flight delays and cancellations is the number of flights that were actually canceled or delayed. Once I find the dates on which a significant share of flights were canceled or delayed by at least an hour, I can research the weather on those dates and see if there is any relationship. For the purpose of my research, I am going to use 35% as the benchmark of significance: a day qualifies if at least 35% of its scheduled departures were canceled or delayed by at least an hour. A flight with a missing departure time is counted as canceled. I will use the month, day, departure time, and departure delay variables to find the days that fit this classification. Below are the days from “flights” that meet the 35% benchmark.

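flights %>%
  group_by(month, day) %>%
  # count flights with no departure time (canceled) or a delay of 60+ minutes
  summarize(canceled_delayed = sum(is.na(dep_time) | dep_delay >= 60, na.rm = TRUE), count = n()) %>%
  # share of the day's scheduled departures that were canceled or delayed
  mutate(percentage_canceled_delayed = canceled_delayed / count) %>%
  arrange(desc(percentage_canceled_delayed)) %>%
  # keep only the days at or above the 35% benchmark
  filter(percentage_canceled_delayed >= .35)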

## # A tibble: 12 × 5
## # Groups:   month [7]
##    month   day canceled_delayed count percentage_canceled_delayed
##    <int> <int>            <int> <int>                       <dbl>
##  1     2     9              421   684                       0.615
##  2     3     8              578   979                       0.590
##  3     2     8              508   930                       0.546
##  4     5    23              436   988                       0.441
##  5     7     1              412   966                       0.427
##  6     9    12              404   992                       0.407
##  7    12     5              386   969                       0.398
##  8     6    28              372   994                       0.374
##  9     7    23              365   997                       0.366
## 10     7    10              364  1004                       0.363
## 11     6    24              352   994                       0.354
## 12     9     2              327   929                       0.352
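
The counting expression in the code above relies on how R combines missing values with the | operator: a canceled flight has a missing dep_time and a missing dep_delay, and TRUE | NA evaluates to TRUE, so the flight is still flagged. A small sketch with made-up values (not taken from the flights data) illustrates this:

```r
# Toy values, made up for illustration: one canceled flight,
# one 30-minute delay, and one 90-minute delay.
dep_time  <- c(NA, 1530, 1710)
dep_delay <- c(NA, 30, 90)

# is.na(dep_time) is TRUE for the canceled flight, and TRUE | NA is TRUE,
# so the missing dep_delay does not hide the cancellation.
flagged <- is.na(dep_time) | dep_delay >= 60

flagged                     # TRUE FALSE TRUE
sum(flagged, na.rm = TRUE)  # 2 flights canceled or delayed by 60+ minutes
```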

After researching the weather conditions on each of these dates, a pattern emerges. On February 8th, February 9th, March 8th, May 23rd, July 1st, July 23rd, and September 2nd, there was heavy fog in NYC, and visibility was low enough to make flying unsafe. On July 10th, September 12th, and December 5th, there was heavy fog again, along with strong winds, and these conditions also contributed to many cancellations and delays. There were no notable weather events on June 24th and June 28th, so weather was not necessarily a factor in the delays and cancellations on those days.
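
The nycflights13 package also includes an hourly weather table for the three NYC airports, so the conditions on these dates can be cross-checked against recorded observations. The following is a minimal sketch, assuming the twelve dates are copied by hand from the table above. The names high_disruption_days and daily_weather are hypothetical, and the three summary statistics are only one reasonable choice.

```r
library(dplyr)
library(nycflights13)

# The twelve dates from the table above, copied by hand (hypothetical name).
high_disruption_days <- data.frame(
  month = c(2L, 2L, 3L, 5L, 6L, 6L, 7L, 7L, 7L, 9L, 9L, 12L),
  day   = c(8L, 9L, 8L, 23L, 24L, 28L, 1L, 10L, 23L, 2L, 12L, 5L)
)

# Summarize the hourly observations from all three airports for each date.
daily_weather <- weather %>%
  semi_join(high_disruption_days, by = c("month", "day")) %>%
  group_by(month, day) %>%
  summarize(
    min_visib         = min(visib, na.rm = TRUE),      # lowest visibility, miles
    max_wind_speed    = max(wind_speed, na.rm = TRUE), # highest wind speed, mph
    max_hourly_precip = max(precip, na.rm = TRUE),     # wettest hour, inches
    .groups = "drop"
  )

daily_weather
```

Low values of min_visib or high values of max_wind_speed on a given date would support the description above, while unremarkable values would point toward a cause other than weather.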

Cancellations and Delays by Hour

Now that I have found the dates on which at least 35% of flights were canceled or delayed by at least an hour, I can get more specific with my research. I will now look at each scheduled departure hour and find the percentage of flights in that hour, across the whole year, that were canceled or delayed by at least an hour. This information will help me determine if any hour of the day is more or less likely to have canceled or delayed flights. To do this, I will be using the hour, departure time, and departure delay variables. The data set containing these percentages for each scheduled departure hour is below.

flights %>%
  group_by(hour) %>%
  summarize(canceled_delayed = sum(is.na(dep_time) | dep_delay >= 60, na.rm = TRUE), count = n()) %>%
  mutate(percentage_canceled_delayed = canceled_delayed / count) %>%
  arrange(desc(percentage_canceled_delayed)) %>%
  filter(hour != 1)

## # A tibble: 19 × 4
##     hour canceled_delayed count percentage_canceled_delayed
##    <dbl>            <int> <int>                       <dbl>
##  1    21             2226 10933                      0.204 
##  2    20             3181 16739                      0.190 
##  3    19             4051 21441                      0.189 
##  4    22              426  2639                      0.161 
##  5    18             3352 21783                      0.154 
##  6    17             3645 24426                      0.149 
##  7    16             3402 23002                      0.148 
##  8    15             2926 23888                      0.122 
##  9    14             2325 21706                      0.107 
## 10    23              111  1061                      0.105 
## 11    13             1817 19956                      0.0911
## 12    12             1402 18181                      0.0771
## 13    11             1066 16033                      0.0665
## 14    10             1106 16708                      0.0662
## 15     9             1062 20312                      0.0523
## 16     8             1367 27242                      0.0502
## 17     6             1007 25951                      0.0388
## 18     7              803 22821                      0.0352
## 19     5               38  1953                      0.0195

From this data, I can create a visualization to help explain my result. In order to plot the relationship between the scheduled departure hour and the percentage of flights canceled or delayed by at least an hour, I first need to save this summary as its own data set.

Canceled_Delayed_Days <- flights %>%
  group_by(hour) %>%
  summarize(canceled_delayed = sum(is.na(dep_time) | dep_delay >= 60, na.rm = TRUE), count = n()) %>%
  mutate(percentage_canceled_delayed = canceled_delayed / count) %>%
  arrange(desc(percentage_canceled_delayed)) %>%
  filter(hour != 1)

Now that I have given the data set a name, I will create a line graph displaying my data.

ggplot(data = Canceled_Delayed_Days) +
  geom_line(mapping = aes(x = hour, y = percentage_canceled_delayed)) + 
  labs(x = "Scheduled Departure Hour",
       y = "Percentage Canceled & Delayed")

From the visualization, I can see an upward trend in cancellations and delays over the course of the day. The share of flights canceled or delayed by at least an hour rises from the early morning until it peaks at 9 pm, then falls for the last departures of the night. One possible reason for this trend is that the weather gets worse as the day goes on, contributing to more cancellations and delays. Weather is unlikely to be the only cause of this pattern, but it could be one of the factors contributing to it.

There was an outlier in the data that I filtered out in order to make the visualization easier to follow. It was the scheduled departure hour of 1 am, in which 100% of flights were canceled or delayed by at least an hour. Flights typically do not leave after midnight, so this could have been an earlier flight that was given a new departure time after a previous delay and ended up recorded under this hour.

Based on the graph, the riskiest time to fly for someone who wants to avoid cancellations and delays of at least an hour is between 7 pm and 9 pm, with 9 pm being the peak.

Flights Leaving On Time or Early

Despite the great number of flights that departed late or did not depart at all, many flights still left on time or even earlier than their scheduled departure time. In order to find these flights, I will be using the departure delay variable again. I am going to filter the data to include only the twelve days in my previous data set and, on those days, only the flights with a departure delay less than or equal to zero. The data set containing these flights is below.

flights %>%
  filter(
    dep_delay <= 0,
    (month == 2 & day == 9) | (month == 3 & day == 8) | (month == 2 & day == 8) |
      (month == 5 & day == 23) | (month == 7 & day == 1) | (month == 9 & day == 12) |
      (month == 12 & day == 5) | (month == 6 & day == 28) | (month == 7 & day == 23) |
      (month == 7 & day == 10) | (month == 6 & day == 24) | (month == 9 & day == 2)
  )

## # A tibble: 3,441 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     5      457            500        -3      637            651
##  2  2013    12     5      512            515        -3      753            814
##  3  2013    12     5      527            530        -3      657            706
##  4  2013    12     5      539            540        -1      832            850
##  5  2013    12     5      540            545        -5      822            832
##  6  2013    12     5      544            550        -6      959           1027
##  7  2013    12     5      548            600       -12      738            755
##  8  2013    12     5      551            600        -9      804            810
##  9  2013    12     5      553            600        -7      919            915
## 10  2013    12     5      553            600        -7      645            701
## # ℹ 3,431 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

One possible explanation is that the flights that left on time or early were scheduled earlier in the morning, before the bad weather hit NYC. To test this idea, I am going to find the average scheduled departure hour of these flights for each day in this data set. To do so, I first need to save the filtered data as its own named data set.

Flights_OnTime_Early <- flights %>%
  filter(
    dep_delay <= 0,
    (month == 2 & day == 9) | (month == 3 & day == 8) | (month == 2 & day == 8) |
      (month == 5 & day == 23) | (month == 7 & day == 1) | (month == 9 & day == 12) |
      (month == 12 & day == 5) | (month == 6 & day == 28) | (month == 7 & day == 23) |
      (month == 7 & day == 10) | (month == 6 & day == 24) | (month == 9 & day == 2)
  )
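
The chain of month and day conditions above has to be typed by hand and kept in sync with the earlier table. One alternative is to compute the qualifying days once and keep matching flights with a semi_join. This is a sketch of that approach rather than the method used in this report; the names high_disruption_days and on_time_or_early are hypothetical.

```r
library(dplyr)
library(nycflights13)

# Days on which at least 35% of scheduled departures were canceled
# (missing dep_time) or delayed by 60 minutes or more.
high_disruption_days <- flights %>%
  group_by(month, day) %>%
  summarize(
    share = sum(is.na(dep_time) | dep_delay >= 60, na.rm = TRUE) / n(),
    .groups = "drop"
  ) %>%
  filter(share >= 0.35)

# Keep on-time or early flights whose (month, day) appears in that table.
on_time_or_early <- flights %>%
  filter(dep_delay <= 0) %>%
  semi_join(high_disruption_days, by = c("month", "day"))

nrow(on_time_or_early)
```

Because the day list is derived from the data, changing the 35% benchmark would update the filtered flights automatically.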

Now that the data set is named, I can find the average scheduled departure hour by using the month, day, and hour variables. The new data set containing the average scheduled departure hour for each day is below.

Flights_OnTime_Early %>%
  group_by(month, day) %>%
  summarize(avg_hour_dep = mean(hour), count = n())

## # A tibble: 12 × 4
## # Groups:   month [7]
##    month   day avg_hour_dep count
##    <int> <int>        <dbl> <int>
##  1     2     8         8.86   219
##  2     2     9        17.0    125
##  3     3     8        10.2    146
##  4     5    23         9.45   329
##  5     6    24         9.82   369
##  6     6    28        10.1    309
##  7     7     1         9.43   229
##  8     7    10         9.60   360
##  9     7    23        10.0    257
## 10     9     2         9.85   386
## 11     9    12         9.03   386
## 12    12     5        10.1    326

The average scheduled departure hours are consistent with my hypothesis that the flights that left on time or early tended to depart in the morning, before the bad weather hit NYC. On eleven of the twelve days, the average scheduled departure hour of these flights fell between 8.86 and 10.2, or roughly 9 am to 10 am. These are averages, so they do not mean that every on-time flight left in the morning.

There is one exception in this data. The flights that left on time or early on February 9th had an average scheduled departure hour of around 5 pm, which does not line up with my hypothesis. This could be because the weather was not as bad as on the other dates, or because the weather was worse in the morning than it was later in the day.
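
A morning average only supports the hypothesis if it is earlier than the average for all flights scheduled on the same days. The following sketch makes that comparison under the same definitions used in this report (35% benchmark, delay of at least 60 minutes, on time meaning dep_delay <= 0). The name hour_comparison and its column names are hypothetical.

```r
library(dplyr)
library(nycflights13)

hour_comparison <- flights %>%
  group_by(month, day) %>%
  # keep every flight on days that meet the 35% benchmark
  filter(sum(is.na(dep_time) | dep_delay >= 60, na.rm = TRUE) / n() >= 0.35) %>%
  summarize(
    # average scheduled hour of all flights on that day
    avg_hour_all     = mean(hour),
    # average scheduled hour of flights that left on time or early
    avg_hour_on_time = mean(hour[!is.na(dep_delay) & dep_delay <= 0]),
    .groups = "drop"
  )

hour_comparison
```

On days where avg_hour_on_time is clearly lower than avg_hour_all, the on-time flights were scheduled earlier than the day's flights overall, which is what the hypothesis predicts.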

Conclusion

Overall, the data I have collected and the research I have done suggest that weather conditions were among the main factors contributing to the cancellations and delays of flights in NYC in 2013. Poor weather coincided with ten of the twelve most disrupted days of the year, so it is reasonable to conclude that it significantly impacted flight departures on those days, although the two June dates show that weather was not the only cause.