Introduction

This report seeks to answer the following question:

Is there a relationship between the number of flights delayed or canceled and the weather conditions in New York City?

We will be using a data set called flights from the R package nycflights13, which is sourced from RITA, Bureau of Transportation Statistics (https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236). The data includes information on all flights that departed New York City (NYC) in 2013. This is a large data set, with 336,776 observations and 19 variables for each observation. Of these variables, the relevant ones are: month (the month of departure), day (the day of departure), hour (the scheduled departure hour), dep_delay (the departure delay, in minutes), and dep_time (the actual departure time). A small portion of the data can be viewed below:

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Throughout, we will need the functionality of the tidyverse package and the nycflights13 package.

library(tidyverse)
library(nycflights13)

The Number of Cancellations and Delays

The problem we are investigating deals with the relationship between the weather in NYC and the frequency with which flights were canceled or delayed. We use the group_by function to identify which days in 2013 had a high percentage of canceled or delayed flights. Once these dates are identified, we can research the weather conditions that may have impacted flight departures. For the purpose of this report, we will use 35% as the benchmark. In the code below, a flight counts as canceled if its departure time is missing and as delayed if it departed at least 60 minutes late. Grouping the data by month and day, and using departure time and departure delay as the transformation variables, we can identify which dates had at least 35% of flights delayed or canceled. Below is a data set that highlights this:

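# Count flights per day that were canceled (missing dep_time) or departed 60+ minutes late
flights_percent_35 <- flights %>%
  group_by(month, day) %>%
  summarize(exceeds_35 = sum(is.na(dep_time) | dep_delay >= 60, na.rm = TRUE),
            count = n()) %>%
  mutate(percentage_exceed_35 = exceeds_35 / count) %>%
  arrange(desc(percentage_exceed_35)) %>%
  # Keep only days where at least 35% of flights were canceled or delayed
  filter(percentage_exceed_35 >= .35)
flights_percent_35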
## # A tibble: 12 × 5
## # Groups:   month [7]
##    month   day exceeds_35 count percentage_exceed_35
##    <int> <int>      <int> <int>                <dbl>
##  1     2     9        421   684                0.615
##  2     3     8        578   979                0.590
##  3     2     8        508   930                0.546
##  4     5    23        436   988                0.441
##  5     7     1        412   966                0.427
##  6     9    12        404   992                0.407
##  7    12     5        386   969                0.398
##  8     6    28        372   994                0.374
##  9     7    23        365   997                0.366
## 10     7    10        364  1004                0.363
## 11     6    24        352   994                0.354
## 12     9     2        327   929                0.352
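
The counting expression above relies on how R treats missing values in a logical OR. The following is a minimal sketch on a made-up four-flight sample (toy_flights and its values are assumptions for illustration, not rows from flights). It assumes, as the report does, that a canceled flight has a missing dep_time and therefore a missing dep_delay.

```r
library(dplyr)

# Hypothetical sample: the second flight is canceled (both values missing).
toy_flights <- tibble(
  dep_time  = c(517L, NA, 1830L, 905L),
  dep_delay = c(2, NA, 75, -5)
)

toy_summary <- toy_flights %>%
  summarize(
    # For the canceled row, is.na(dep_time) is TRUE and dep_delay >= 60 is NA;
    # TRUE | NA evaluates to TRUE, so the row is still counted.
    canceled_or_delayed = sum(is.na(dep_time) | dep_delay >= 60, na.rm = TRUE),
    count = n()
  ) %>%
  mutate(share = canceled_or_delayed / count)

toy_summary
```

In this sample, two of the four flights are counted (one canceled, one delayed by 75 minutes), giving a share of 0.5.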

Researching the weather conditions on these dates using WeatherSpark (https://weatherspark.com/h/y/23912/2013/Historical-Weather-during-2013-in-New-York-City-New-York-United-States) reveals a pattern tied to poor weather. February 8th, February 9th, and March 8th all had large snowstorms with low visibility. May 23rd, July 1st, July 23rd, and September 2nd had heavy fog mixed with thunderstorms, which also decreased visibility. On July 10th, September 12th, and December 5th, visibility was adequate, but strong winds made it unsafe to fly. June 24th and June 28th are outliers in the pattern: they had good visibility and winds that were not unsafe to fly in, meaning weather was less of a factor in these delays and cancellations.

The Number of Cancellations and Delays by Hour

Now that the dates with cancellation or delay rates of 35% or higher have been identified, the analysis can be narrowed to the hour of departure. We again use group_by, this time grouping the observations by their scheduled departure hour. Note that the code below groups every flight that departed NYC in 2013, not only the flights on the dates above. This helps identify whether any specific hour of the day experiences delays or cancellations more frequently than others. Using hour, departure time, and departure delay, the following data set was created:

flights_delay_per_hour <- flights %>%
  group_by(hour) %>%
  summarize(canceled_delayed_per_hour = sum(is.na(dep_time) | dep_delay >= 60, na.rm = TRUE), count = n()) %>%
  mutate(percentage_canceled_delayed = canceled_delayed_per_hour / count) %>%
  arrange(desc(percentage_canceled_delayed)) %>%
  # Drop the 1 AM hour, which contains a single canceled flight
  filter(hour != 1)
flights_delay_per_hour
## # A tibble: 19 × 4
##     hour canceled_delayed_per_hour count percentage_canceled_delayed
##    <dbl>                     <int> <int>                       <dbl>
##  1    21                      2226 10933                      0.204 
##  2    20                      3181 16739                      0.190 
##  3    19                      4051 21441                      0.189 
##  4    22                       426  2639                      0.161 
##  5    18                      3352 21783                      0.154 
##  6    17                      3645 24426                      0.149 
##  7    16                      3402 23002                      0.148 
##  8    15                      2926 23888                      0.122 
##  9    14                      2325 21706                      0.107 
## 10    23                       111  1061                      0.105 
## 11    13                      1817 19956                      0.0911
## 12    12                      1402 18181                      0.0771
## 13    11                      1066 16033                      0.0665
## 14    10                      1106 16708                      0.0662
## 15     9                      1062 20312                      0.0523
## 16     8                      1367 27242                      0.0502
## 17     6                      1007 25951                      0.0388
## 18     7                       803 22821                      0.0352
## 19     5                        38  1953                      0.0195

Having created this data set, entitled flights_delay_per_hour, we can then create a visualization to help explain the results above. A line graph highlights how the percentage changes across the hours of a day in 2013.

ggplot(flights_delay_per_hour, mapping = aes(x = hour, y = percentage_canceled_delayed)) +
  geom_line() +
  labs(title = "Percentage of Delayed and Canceled Flights by Scheduled Departure Hour",
       x = "Scheduled Departure Hour",
       y = "Percentage of Canceled and Delayed")

This visualization highlights a positive trend in the percentage of cancellations and delays throughout the day. Based on the left skew of the graph, the argument can be made that as the day goes on, the share of cancellations and delays increases. One possible explanation is that weather conditions may worsen as the day goes on, making it unsafe to fly. While this could be a contributing factor, it is not a direct cause of this pattern.

One outlier was filtered out of the data to make the graph easier to interpret. It was the only flight scheduled during the 1 AM hour, and it was canceled, meaning that 100% of flights during that hour were canceled. Flights typically are not scheduled to depart during the 1 AM hour; this was perhaps a delayed flight that was later canceled and still included in the data.
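
As an alternative to hard-coding the hour to drop, hours with very few flights could be excluded by a minimum count. This is only a sketch of that idea, not the method used in this report: the rows for hours 5 and 21 copy the counts from the hourly table, the row for hour 1 reflects the single canceled flight described above, and the cutoff min_flights is an assumed value chosen purely for illustration.

```r
library(dplyr)

# Three rows based on the report's figures; hour 1 has one canceled flight.
toy_hours <- tibble(
  hour = c(1, 5, 21),
  canceled_delayed_per_hour = c(1, 38, 2226),
  count = c(1, 1953, 10933)
)

min_flights <- 30  # assumed cutoff, for illustration only

reliable_hours <- toy_hours %>%
  mutate(percentage_canceled_delayed = canceled_delayed_per_hour / count) %>%
  # An hour with a single flight can only show 0% or 100%, so drop sparse hours
  filter(count >= min_flights)

reliable_hours
```

With this cutoff, only hours 5 and 21 remain, and the 100% rate for the 1 AM hour no longer distorts the scale.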

Based on the visualization, the riskiest time to fly in terms of experiencing a cancellation or delay is between 7 PM and 9 PM, with a peak in the 9 PM hour at roughly a 20.4% chance of a cancellation or delay.
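
The evening rates can be recomputed directly from the counts in the hourly table. The short sketch below copies the three highest rows (hours 21, 20, and 19) and expresses each rate as a percentage rounded to one decimal place; the variable names are chosen here for illustration.

```r
# Counts copied from the hourly table (three highest rates).
hour <- c(21, 20, 19)
canceled_delayed <- c(2226, 3181, 4051)
count <- c(10933, 16739, 21441)

# Share of canceled or delayed flights per hour, as a percentage
rate_percent <- round(100 * canceled_delayed / count, 1)
names(rate_percent) <- paste0(hour, ":00")
rate_percent
```

This gives 20.4, 19.0, and 18.9 percent for the 9 PM, 8 PM, and 7 PM hours respectively.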

Flights Leaving on Time or Early Despite Weather Conditions

Although large numbers of flights were delayed or canceled, a considerable number of flights were still able to leave on time, or even earlier than their scheduled departure time. Using the filter function, we can select flights with a departure delay less than or equal to zero on the dates above. The data set containing these flights is below.

excption_flights<- flights %>% 
  filter((dep_delay<=0) &((month==2 & day==9)|(month==3 & day==8)|(month==2 & day==8)|
           (month==5 & day==23)|(month==7 & day==1)|(month==9 & day==12)|(month==12 & day==5)|
           (month==6 & day==28)|(month==7 & day==23)|(month==7 & day==10)|(month==6 & day==24)|
           (month==9 & day==2)))
excption_flights
## # A tibble: 3,441 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     5      457            500        -3      637            651
##  2  2013    12     5      512            515        -3      753            814
##  3  2013    12     5      527            530        -3      657            706
##  4  2013    12     5      539            540        -1      832            850
##  5  2013    12     5      540            545        -5      822            832
##  6  2013    12     5      544            550        -6      959           1027
##  7  2013    12     5      548            600       -12      738            755
##  8  2013    12     5      551            600        -9      804            810
##  9  2013    12     5      553            600        -7      919            915
## 10  2013    12     5      553            600        -7      645            701
## # ℹ 3,431 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
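
The long chain of month and day conditions could also be expressed as a join against a table of dates. The sketch below shows that approach with dplyr's semi_join on small made-up tables (toy_flights and toy_bad_days are hypothetical stand-ins, not the report's data); it is one possible alternative, not the code used to produce the output above.

```r
library(dplyr)

# Hypothetical stand-ins for flights and for the table of high-disruption dates.
toy_flights <- tibble(
  month = c(2L, 2L, 3L, 4L),
  day = c(9L, 9L, 8L, 1L),
  dep_delay = c(-3, 45, 0, -2)
)
toy_bad_days <- tibble(month = c(2L, 3L), day = c(9L, 8L))

toy_on_time <- toy_flights %>%
  # Keep flights that left on time or early
  filter(dep_delay <= 0) %>%
  # Keep only flights whose (month, day) appears in the table of dates
  semi_join(toy_bad_days, by = c("month", "day"))

toy_on_time
```

Here two flights remain: the early departure on February 9th and the on-time departure on March 8th. The April 1st flight is dropped because its date is not in toy_bad_days.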

One can argue that these flights were able to take off in the early morning before the weather hit NYC. To test this hypothesis, we can group the data set of on-time flights created above by date and find the average scheduled departure hour on each of the days with a considerable number of delays and cancellations.

excption_flights %>% 
  group_by(month,day) %>% 
  summarize(avg_hour_dep=mean(hour),count=n())
## # A tibble: 12 × 4
## # Groups:   month [7]
##    month   day avg_hour_dep count
##    <int> <int>        <dbl> <int>
##  1     2     8         8.86   219
##  2     2     9        17.0    125
##  3     3     8        10.2    146
##  4     5    23         9.45   329
##  5     6    24         9.82   369
##  6     6    28        10.1    309
##  7     7     1         9.43   229
##  8     7    10         9.60   360
##  9     7    23        10.0    257
## 10     9     2         9.85   386
## 11     9    12         9.03   386
## 12    12     5        10.1    326

As the above data transformation highlights, the data support this hypothesis: eleven of the twelve problematic dates had an average departure hour before 11 AM. This suggests that these flights were able to leave before the weather reached unsafe conditions.

The one exception is February 9th, when the average departure hour was close to 5 PM, which does not match the hypothesis that flights left in the morning before the weather got bad. This is due to a blizzard on the 8th and 9th that came to a halt in the afternoon of the 9th, allowing safe flying conditions to resume around 5 PM.
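
The count of dates with a morning average can be checked directly from the summary table. The sketch below copies the avg_hour_dep values as printed (rounded) and labels them by date; the vector name and labels are introduced here for illustration.

```r
# Average departure hours copied from the summary table, in table order.
avg_hour_dep <- c(
  "2/8" = 8.86, "2/9" = 17.0, "3/8" = 10.2, "5/23" = 9.45,
  "6/24" = 9.82, "6/28" = 10.1, "7/1" = 9.43, "7/10" = 9.60,
  "7/23" = 10.0, "9/2" = 9.85, "9/12" = 9.03, "12/5" = 10.1
)

# Number of dates whose average departure hour falls before 11 AM
n_morning <- sum(avg_hour_dep < 11)

# Dates that do not fit the morning pattern
late_dates <- names(avg_hour_dep)[avg_hour_dep >= 11]

n_morning
late_dates
```

Eleven dates fall before 11 AM, and the only date at or after 11 AM is 2/9.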

Conclusion

Overall, the data from nycflights13 highlights the importance weather plays in the number of flights that can safely fly out of NYC on any given day. The conclusion can be made that there is a clear relationship between the weather and flight delays and cancellations. Poor weather conditions were one of the main factors that airports in NYC considered when determining delays and cancellations in 2013.