It’s Not Always Sunny in New York City

Justin Fawley

This analysis explores patterns in flight cancellations and delays at New York City airports during 2013 and how they relate to the weather. Using data transformation and visualization, we identify days on which a high share of flights (more than 35%) were delayed by more than an hour or cancelled, and consider what may have caused the disruption: weather, time of day, or another factor.

For this analysis, we will use the “tidyverse” and “nycflights13” libraries.

library(tidyverse)
library(nycflights13)

First, we will look at how many flights were cancelled or delayed in 2013 and which days had a high share of delays and cancellations. To build the dataset, we first create a variable that flags each flight as either delayed by more than 60 minutes or cancelled (a missing departure time), then compute the percentage of flagged flights for each day. We set the bar for a high-disruption day at 35% of all flights.

# Flag flights that left more than 60 minutes late or never departed
flights_delays_cancelled <- flights %>%
  mutate(delay_or_cancelled = dep_delay > 60 | is.na(dep_time)) %>%
  group_by(month, day) %>%
  summarize(total = n(),
            delaycancelcount = sum(delay_or_cancelled),
            percentage_of_delayscancel = delaycancelcount / total * 100)
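
To see why this flag also catches cancelled flights, here is a minimal sketch on made-up values (these are not rows from flights). A cancelled flight has a missing dep_delay and dep_time, and in R `NA | TRUE` evaluates to `TRUE`, so the `is.na()` check is what lets it be counted.

```r
# Toy vectors with made-up values, for illustration only.
dep_delay <- c(5, 75, NA, 61, 60)       # minutes late; NA = never departed
dep_time  <- c(600, 715, NA, 901, 1000) # NA = never departed

# NA > 60 is NA, but NA | TRUE is TRUE, so the cancelled flight is flagged.
delay_or_cancelled <- dep_delay > 60 | is.na(dep_time)
print(delay_or_cancelled)

# Count and percentage, mirroring sum(...) and count / total * 100.
flagged_count <- sum(delay_or_cancelled)
flagged_percentage <- flagged_count / length(delay_or_cancelled) * 100
print(flagged_count)
print(flagged_percentage)
```

Note that a delay of exactly 60 minutes is not flagged, because the comparison is strictly greater than 60.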

Now that we have our variable, “percentage_of_delayscancel,” we can filter for days above 35% to find the days with the highest share of cancellations or delays in New York City.

filter(flights_delays_cancelled, percentage_of_delayscancel > 35) %>%
  arrange(desc(percentage_of_delayscancel))
## # A tibble: 12 × 5
## # Groups:   month [7]
##    month   day total delaycancelcount percentage_of_delayscancel
##    <int> <int> <int>            <int>                      <dbl>
##  1     2     9   684              419                       61.3
##  2     3     8   979              574                       58.6
##  3     2     8   930              506                       54.4
##  4     5    23   988              435                       44.0
##  5     7     1   966              407                       42.1
##  6     9    12   992              404                       40.7
##  7    12     5   969              385                       39.7
##  8     6    28   994              369                       37.1
##  9     7    23   997              362                       36.3
## 10     7    10  1004              364                       36.3
## 11     9     2   929              326                       35.1
## 12     6    24   994              348                       35.0

After filtering, we find that February 9, March 8, and February 8 had the highest percentages of delays or cancellations. Googling the weather in New York City on those days in 2013 shows that February 8 and 9 were the “Blizzard of 2013,” which brought heavy snow and strong winds to the entire Northeast United States; 54.4% of flights on February 8 and 61.3% on February 9 were delayed by more than an hour or cancelled. March 8 similarly brought a nor’easter and heavy snow that blanketed the New York metro area, with 58.6% of flights affected. Given these events, we can definitely understand why such a high percentage of flights was delayed or cancelled on those days! Brrrr…
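
Rather than relying only on a web search, one could also look at the weather table that ships with nycflights13, which holds hourly observations for each origin airport. The sketch below is one possible way to summarize it for the three worst days. The worst_days table and the summary column names are our own illustrative choices, and we have not run it against the results above.

```r
library(dplyr)
library(nycflights13)

# The three worst days from the table above (typed in by hand).
worst_days <- tibble(month = c(2L, 2L, 3L), day = c(8L, 9L, 8L))

# Summarize the hourly weather observations per airport and day.
worst_day_weather <- weather %>%
  semi_join(worst_days, by = c("month", "day")) %>%
  group_by(origin, month, day) %>%
  summarize(
    total_precip   = sum(precip, na.rm = TRUE),     # inches
    min_visib      = min(visib, na.rm = TRUE),      # miles
    max_wind_speed = max(wind_speed, na.rm = TRUE), # mph
    .groups = "drop"
  )

print(worst_day_weather)
```

Low visibility, high wind speed, or heavy precipitation on these rows would support the weather explanation directly from the data.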

Does time of day affect delays and cancellations?

We know that February 8-9 and March 8 had the highest percentage of flights delayed or cancelled, but delays and cancellations may also depend on the time of day. Let’s see how much the scheduled departure hour really affects them.

flights_time_of_day <- flights %>%
  mutate(delay_cancel = dep_delay > 60 | is.na(dep_delay)) %>%
  group_by(hour) %>%
  summarize(total_flights = n(),
            count = sum(delay_cancel),
            percentage_delaycancel = count / total_flights * 100)
print(flights_time_of_day)
## # A tibble: 20 × 4
##     hour total_flights count percentage_delaycancel
##    <dbl>         <int> <int>                  <dbl>
##  1     1             1     1                 100   
##  2     5          1953    37                   1.89
##  3     6         25951   999                   3.85
##  4     7         22821   792                   3.47
##  5     8         27242  1351                   4.96
##  6     9         20312  1049                   5.16
##  7    10         16708  1093                   6.54
##  8    11         16033  1052                   6.56
##  9    12         18181  1390                   7.65
## 10    13         19956  1780                   8.92
## 11    14         21706  2299                  10.6 
## 12    15         23888  2877                  12.0 
## 13    16         23002  3356                  14.6 
## 14    17         24426  3589                  14.7 
## 15    18         21783  3313                  15.2 
## 16    19         21441  4000                  18.7 
## 17    20         16739  3139                  18.8 
## 18    21         10933  2196                  20.1 
## 19    22          2639   416                  15.8 
## 20    23          1061   107                  10.1

Hours 0:00, 2:00, 3:00, and 4:00 do not appear here because no flights were scheduled to depart in those hours; an hour with flights but no long delays or cancellations would still be listed, with a count of 0. Some hours, like 6:00, 7:00, and 8:00, had a large number of flights but a much lower share of delays or cancellations than hours like 13:00. Let’s see what the relationship is between the hour of the day and delays/cancellations with a line graph.
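
One way to check why some hours are missing from the table is to compare the hours that occur in flights against the full range 0 to 23. Given the 20 rows printed above, the first result should be hours 0, 2, 3, and 4. The toy table in the second half uses made-up values to show that an hour with flights but no disruptions still gets a row.

```r
library(dplyr)
library(nycflights13)

# Hours of the day with no scheduled departures at all.
missing_hours <- setdiff(0:23, unique(flights$hour))
print(missing_hours)

# Made-up example: hour 6 has flights but no delays or cancellations.
toy <- tibble(hour = c(6, 6, 7), delay_cancel = c(FALSE, FALSE, TRUE))

toy_summary <- toy %>%
  group_by(hour) %>%
  summarize(total_flights = n(), count = sum(delay_cancel))

# Hour 6 is kept, with count = 0.
print(toy_summary)
```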

ggplot(flights_time_of_day) +
  geom_line(mapping = aes(x = hour, y = percentage_delaycancel)) +
  labs(title = "Relationship between Time of Day and Delays/Cancellations",
       x = "Hour of the Day",
       y = "Percentage of Flights Delayed or Cancelled")

As the line graph shows, the percentage of flights delayed or cancelled rises steadily as the day goes on and peaks at 21:00, when about 20% of flights were delayed by more than an hour or cancelled, so a 21:00 departure is one we would likely want to avoid. The data here do not tell us why; darkness is one guess, and delays that build up over the day and carry into later departures are another. After 21:00, both the number of flights and the percentage delayed or cancelled drop. The busiest hours are in the morning (6:00 and 8:00) and the late afternoon (15:00 to 17:00), and because each hourly figure pools the whole year, a few bad-weather days could still pull the afternoon and evening percentages up. Overall, though, the shape is what we’d expect. The one outlier, between 0:00 and 5:00, comes from the single flight scheduled at 1:00: it was cancelled, which produces the 100% value. A 1:00 departure is unusual in any case.
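
Because the 1:00 point rests on a single flight, it stretches the y-axis and hides the trend. A minimal sketch of one way to handle this is to drop hours with very few flights before plotting. The cutoff of 100 flights (min_flights) is an assumption chosen only for illustration, not a value from the analysis above.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

min_flights <- 100  # assumed cutoff, for illustration only

hourly <- flights %>%
  mutate(delay_cancel = dep_delay > 60 | is.na(dep_delay)) %>%
  group_by(hour) %>%
  summarize(total_flights = n(),
            percentage_delaycancel = mean(delay_cancel) * 100) %>%
  # Keep only hours with enough flights for the percentage to be meaningful.
  filter(total_flights >= min_flights)

# Point size shows how many flights each hourly percentage is based on.
p <- ggplot(hourly, aes(x = hour, y = percentage_delaycancel)) +
  geom_line() +
  geom_point(aes(size = total_flights)) +
  labs(x = "Hour of the Day",
       y = "Percentage of Flights Delayed or Cancelled",
       size = "Flights")
print(p)
```

With the outlier removed, the y-axis only has to cover the remaining hours, so the rise through the day is easier to read.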

Enough negativity. Let’s look at what we hope for: flights leaving on time!

Can the flights avoid bad weather?

Earlier, we found that February 8, February 9, and March 8 had the highest percentages of flights cancelled or delayed. Let’s think positively and look at the flights that left on time on those dates and on seven of the other dates above the 35% threshold. (June 24 and June 28 are not included in the filter below.)

flights_on_time <- flights %>%
  filter(month == 2 & day == 8 | month == 2 & day == 9 |
         month == 3 & day == 8 | month == 5 & day == 23 |
         month == 7 & day == 1 | month == 7 & day == 10 |
         month == 7 & day == 23 | month == 9 & day == 2 |
         month == 9 & day == 12 | month == 12 & day == 5) %>%
  filter(dep_delay <= 0)
print(flights_on_time)
## # A tibble: 2,763 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     5      457            500        -3      637            651
##  2  2013    12     5      512            515        -3      753            814
##  3  2013    12     5      527            530        -3      657            706
##  4  2013    12     5      539            540        -1      832            850
##  5  2013    12     5      540            545        -5      822            832
##  6  2013    12     5      544            550        -6      959           1027
##  7  2013    12     5      548            600       -12      738            755
##  8  2013    12     5      551            600        -9      804            810
##  9  2013    12     5      553            600        -7      919            915
## 10  2013    12     5      553            600        -7      645            701
## # ℹ 2,753 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

With this dataset we see that despite the bad weather, 2,763 flights on these days still managed to leave on time or early.
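
Typing each date by hand is easy to get wrong, so an alternative is to derive the dates from the 35% rule and keep matching flights with semi_join(). This is a sketch, and the names high_disruption_days and on_time_high_disruption are our own. Unlike the filter above, it picks up all twelve days over 35%, including June 24 and June 28, so its row count will be higher than 2,763.

```r
library(dplyr)
library(nycflights13)

# Days on which more than 35% of flights were delayed > 60 min or cancelled.
high_disruption_days <- flights %>%
  group_by(month, day) %>%
  summarize(pct = mean(dep_delay > 60 | is.na(dep_time)) * 100,
            .groups = "drop") %>%
  filter(pct > 35)

# Keep flights on those days that left on time or early.
on_time_high_disruption <- flights %>%
  semi_join(high_disruption_days, by = c("month", "day")) %>%
  filter(dep_delay <= 0)

print(nrow(high_disruption_days))
print(nrow(on_time_high_disruption))
```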

Knowing there was bad weather on these days, we might predict that the flights that left on time were mostly scheduled in the morning, before the bad weather hit. Let’s check this prediction with the average scheduled departure hour (the hour column) of the on-time flights on each day.

flights_on_time %>%
  group_by(month, day) %>%
  summarize("Average Hr of Departure" = mean(hour))
## # A tibble: 10 × 3
## # Groups:   month [6]
##    month   day `Average Hr of Departure`
##    <int> <int>                     <dbl>
##  1     2     8                      8.86
##  2     2     9                     17.0 
##  3     3     8                     10.2 
##  4     5    23                      9.45
##  5     7     1                      9.43
##  6     7    10                      9.60
##  7     7    23                     10.0 
##  8     9     2                      9.85
##  9     9    12                      9.03
## 10    12     5                     10.1

As the table shows, the average scheduled departure hour of the on-time flights falls in the morning, between about 9:00 and 10:00, on nine of the ten days. The exception is February 9, with an average of 17:00. This suggests that the on-time flights that day were mostly afternoon and evening departures that could leave on schedule once the February 8-9 blizzard had cleared New York.
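
A morning average is easier to interpret next to a baseline. The sketch below computes the average scheduled hour for all flights in 2013 and for all on-time flights in 2013; the column names avg_hour_all and avg_hour_on_time are our own. If on-time flights skew toward the morning on ordinary days too, then part of the pattern above reflects time of day rather than weather alone.

```r
library(dplyr)
library(nycflights13)

baseline <- flights %>%
  summarize(
    # Average scheduled departure hour across every flight in 2013.
    avg_hour_all = mean(hour),
    # Same average, restricted to flights that left on time or early.
    avg_hour_on_time = mean(hour[!is.na(dep_delay) & dep_delay <= 0])
  )

print(baseline)
```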

Conclusion

Overall, we found that the days with the highest share of delayed or cancelled flights, February 8-9 and March 8, coincided with winter storms in New York City that grounded flights for extended periods. We also found that on most of the high-disruption days, the flights that left early or on time were scheduled mainly in the morning, which is consistent with their departing before the bad weather hit.