This report seeks to answer the following question:
Did the weather conditions significantly impact the departures of flights in New York City (NYC) in 2013?
I will be using a data set called flights, obtained from the R package “nycflights13”. It contains information on all of the flights that departed New York City (NYC) in 2013: 336,776 entries and 19 variables. The variables relevant to this report are month (the month of departure), day (the day of departure), hour (the hour of the scheduled departure time), dep_delay (the departure delay, in minutes), and dep_time (the actual departure time).
The source of this data is RITA, Bureau of Transportation Statistics: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
Throughout this report, I will load the tidyverse, Lahman, nycflights13, and DT packages.
library(tidyverse)
library(Lahman)
library(nycflights13)
library(DT)
The first piece of data needed to investigate the relationship between weather and flight delays and cancellations is the number of flights that were actually cancelled or delayed. Once I find the dates with a large share of cancellations and delays of at least an hour, I can research the weather on those dates and see whether there is any relationship. For the purpose of my research, I will use 35% as the benchmark: a date qualifies if at least 35% of its scheduled flights were cancelled or delayed by at least an hour. I will use the month, day, departure time, and departure delay variables to find the dates that fit this classification. Below are the dates in “flights” that meet this benchmark.
flights %>%
  group_by(month, day) %>%
  # A missing dep_time marks a cancelled flight; dep_delay >= 60 marks a delay of at least an hour
  summarize(canceled_delayed = sum(is.na(dep_time) | dep_delay >= 60, na.rm = TRUE), count = n()) %>%
  mutate(percentage_canceled_delayed = canceled_delayed / count) %>%
  arrange(desc(percentage_canceled_delayed)) %>%
  filter(percentage_canceled_delayed >= .35)
## # A tibble: 12 × 5
## # Groups:   month [7]
##    month   day canceled_delayed count percentage_canceled_delayed
##    <int> <int>            <int> <int>                       <dbl>
##  1     2     9              421   684                       0.615
##  2     3     8              578   979                       0.590
##  3     2     8              508   930                       0.546
##  4     5    23              436   988                       0.441
##  5     7     1              412   966                       0.427
##  6     9    12              404   992                       0.407
##  7    12     5              386   969                       0.398
##  8     6    28              372   994                       0.374
##  9     7    23              365   997                       0.366
## 10     7    10              364  1004                       0.363
## 11     6    24              352   994                       0.354
## 12     9     2              327   929                       0.352
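The counting expression depends on how R combines missing values with `|`. A minimal sketch on a made-up four-flight table (the `toy` tibble and its values are hypothetical, not rows from flights) shows why a cancelled flight is still counted even though its delay is missing:

```r
library(dplyr)

# Hypothetical toy data: one cancelled flight (dep_time and dep_delay both NA),
# one delay of over an hour, one short delay, and one early departure.
toy <- tibble(
  dep_time  = c(NA, 1530, 905, 655),
  dep_delay = c(NA, 75, 10, -5)
)

# TRUE | NA evaluates to TRUE in R, so the cancelled flight is flagged
# even though its dep_delay comparison is NA.
toy_flags <- toy %>%
  mutate(flag = is.na(dep_time) | dep_delay >= 60)

toy_summary <- toy_flags %>%
  summarize(canceled_delayed = sum(flag, na.rm = TRUE),
            count = n()) %>%
  mutate(percentage_canceled_delayed = canceled_delayed / count)

print(toy_summary)
```

In this toy table no flag ends up missing, so `na.rm = TRUE` acts only as a safeguard.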
After researching the weather conditions on each of these dates, a pattern emerges. On February 8th, February 9th, March 8th, May 23rd, July 1st, July 23rd, and September 2nd, there was heavy fog in NYC, and visibility was low enough to make flying unsafe. On July 10th, September 12th, and December 5th, there was heavy fog again, along with strong winds. These conditions likely contributed to many of the cancellations and delays. There were no notable weather events on June 24th and June 28th, so the weather was not necessarily a factor in the delays and cancellations on those days.
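The nycflights13 package also includes a weather table with hourly observations at the NYC airports, which offers one way to cross-check these dates. The sketch below is only an assumption about how such a check could look, not the research method used above. `weather_on_dates`, `toy_weather`, and `bad_days` are hypothetical names, and the toy values are invented; only the column names (month, day, visib, wind_speed, precip) follow the weather table.

```r
library(dplyr)

# Summarise the weather on a given set of dates.
# weather_tbl needs the columns month, day, visib, wind_speed, and precip
# (the names used by nycflights13::weather).
weather_on_dates <- function(weather_tbl, dates) {
  weather_tbl %>%
    semi_join(dates, by = c("month", "day")) %>%  # keep only the dates of interest
    group_by(month, day) %>%
    summarize(
      min_visib  = min(visib, na.rm = TRUE),       # lowest visibility of the day
      max_wind   = max(wind_speed, na.rm = TRUE),  # strongest wind of the day
      total_prec = sum(precip, na.rm = TRUE),      # total precipitation
      .groups = "drop"
    )
}

# Invented hourly observations, for illustration only.
toy_weather <- tibble(
  month      = c(2, 2, 2, 6),
  day        = c(8, 8, 9, 24),
  visib      = c(0.5, 2, 10, 10),
  wind_speed = c(20, 25, 12, 5),
  precip     = c(0.1, 0.2, 0, 0)
)
bad_days <- tibble(month = c(2, 2), day = c(8, 9))

toy_result <- weather_on_dates(toy_weather, bad_days)
print(toy_result)

# With the package loaded, the same call could be made on the real table:
# weather_on_dates(nycflights13::weather, bad_days)
```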
Now that I have found the dates on which at least 35% of flights were cancelled or delayed by at least an hour, I can get more specific with my research. I will now look at each scheduled departure hour and find the percentage of flights cancelled or delayed by at least an hour for that hour, across the whole year. This information will help me determine whether any hour of the day is more or less likely to have cancelled or delayed flights. To do this, I will be using the hour, departure time, and departure delay variables. The percentage of flights that were cancelled or delayed by at least an hour for each scheduled departure hour is below.
flights %>%
  group_by(hour) %>%
  summarize(canceled_delayed = sum(is.na(dep_time) | dep_delay >= 60, na.rm = TRUE), count = n()) %>%
  mutate(percentage_canceled_delayed = canceled_delayed / count) %>%
  arrange(desc(percentage_canceled_delayed)) %>%
  filter(hour != 1)
## # A tibble: 19 × 4
##     hour canceled_delayed count percentage_canceled_delayed
##    <dbl>            <int> <int>                       <dbl>
##  1    21             2226 10933                      0.204
##  2    20             3181 16739                      0.190
##  3    19             4051 21441                      0.189
##  4    22              426  2639                      0.161
##  5    18             3352 21783                      0.154
##  6    17             3645 24426                      0.149
##  7    16             3402 23002                      0.148
##  8    15             2926 23888                      0.122
##  9    14             2325 21706                      0.107
## 10    23              111  1061                      0.105
## 11    13             1817 19956                      0.0911
## 12    12             1402 18181                      0.0771
## 13    11             1066 16033                      0.0665
## 14    10             1106 16708                      0.0662
## 15     9             1062 20312                      0.0523
## 16     8             1367 27242                      0.0502
## 17     6             1007 25951                      0.0388
## 18     7              803 22821                      0.0352
## 19     5               38  1953                      0.0195
From this data, I can create a visualization to help explain my result. To plot the relationship between the scheduled departure hour and the percentage of flights cancelled or delayed by at least an hour, I must first save this filtered data as its own data set.
Canceled_Delayed_Days <- flights %>%
  group_by(hour) %>%
  summarize(canceled_delayed = sum(is.na(dep_time) | dep_delay >= 60, na.rm = TRUE), count = n()) %>%
  mutate(percentage_canceled_delayed = canceled_delayed / count) %>%
  arrange(desc(percentage_canceled_delayed)) %>%
  filter(hour != 1)
Now that I have given the data set a name, I will create a line graph displaying my data.
ggplot(data = Canceled_Delayed_Days) +
  geom_line(mapping = aes(x = hour, y = percentage_canceled_delayed)) +
  labs(x = "Scheduled Departure Hour",
       y = "Percentage Canceled & Delayed")
From the visualization, I can see a positive trend in cancellations and delays over the hours of the day. The share of flights cancelled or delayed by at least an hour rises from about 2% for flights scheduled at 5 am to a peak of about 20% at 9 pm, before falling again for the few flights scheduled at 10 pm and 11 pm. One possible reason for this trend is that the weather gets worse as the day goes on, contributing to more cancellations and delays. The graph alone cannot show that this is the cause of the pattern, but it could be one of the factors contributing to it.
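One way to probe whether the evening rise is tied to weather is to compute the hourly rate separately for the worst dates and for all other dates. If the rise shows up on ordinary days as well, weather alone would not explain it. The following is a minimal sketch of that comparison on invented data; `hourly_rate_split`, `toy_flights`, and `bad_days` are hypothetical names introduced for illustration.

```r
library(dplyr)

# Rate of flights cancelled or delayed by at least an hour, per scheduled hour,
# computed separately for the listed dates (bad_day = TRUE) and all other dates.
hourly_rate_split <- function(flights_tbl, bad_days) {
  flights_tbl %>%
    left_join(mutate(bad_days, bad_day = TRUE), by = c("month", "day")) %>%
    mutate(bad_day = !is.na(bad_day)) %>%  # dates without a match are ordinary days
    group_by(bad_day, hour) %>%
    summarize(
      rate = sum(is.na(dep_time) | dep_delay >= 60, na.rm = TRUE) / n(),
      .groups = "drop"
    )
}

# Invented flights, for illustration only.
toy_flights <- tibble(
  month     = c(2, 2, 2, 3, 3, 3),
  day       = c(8, 8, 8, 1, 1, 1),
  hour      = c(8, 20, 20, 8, 20, 20),
  dep_time  = c(800, NA, 2130, 805, 2010, 2100),
  dep_delay = c(0, NA, 90, 5, 10, 60)
)
bad_days <- tibble(month = 2, day = 8)

toy_rates <- hourly_rate_split(toy_flights, bad_days)
print(toy_rates)
```

On the real data, the two sets of rates could be drawn as two lines by mapping `bad_day` to the colour aesthetic in the line graph.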
There was an outlier in the data that I filtered out to make the visualization easier to follow: the scheduled departure hour of 1 am, where 100% of the flights were cancelled or delayed by at least an hour. Flights are rarely scheduled to leave after midnight. This entry could have been an earlier flight whose departure was rescheduled after a delay and that ended up recorded under this hour.
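Before dropping an hour as an outlier, it can help to check how many flights it contains, because a rate computed from one or two flights can only land on extreme values. A minimal sketch on invented data follows; `toy_flights` and the `min_flights` cutoff are hypothetical and not part of the analysis above.

```r
library(dplyr)

# Invented flights: a single cancelled flight at hour 1, four flights at hour 5.
toy_flights <- tibble(
  hour      = c(1, 5, 5, 5, 5),
  dep_time  = c(NA, 455, 510, 530, 559),
  dep_delay = c(NA, -5, 10, 75, -1)
)

hour_summary <- toy_flights %>%
  group_by(hour) %>%
  summarize(canceled_delayed = sum(is.na(dep_time) | dep_delay >= 60, na.rm = TRUE),
            count = n()) %>%
  mutate(percentage_canceled_delayed = canceled_delayed / count)

# Hypothetical cutoff: keep only hours with at least this many flights,
# instead of excluding one hour by its value.
min_flights <- 2
kept <- filter(hour_summary, count >= min_flights)

print(hour_summary)
print(kept)

# On the real data, filter(flights, hour == 1) would list the rows behind the outlier.
```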
Based on the graph, the riskiest time to fly for someone who wants to avoid cancellations and delays of at least an hour is a scheduled departure between 7 pm and 9 pm, with 9 pm being the peak.
Despite the great number of flights that departed late or did not depart at all, many flights still left on time or even earlier than their scheduled departure time. To find these flights, I will be using the departure delay variable again. I am going to filter the data to include only the 12 dates found in my first data set and, on those dates, only the flights that had a departure delay less than or equal to zero. The data set containing these flights is below.
flights %>%
  filter((dep_delay <= 0) &
           ((month == 2 & day == 9) | (month == 3 & day == 8) | (month == 2 & day == 8) |
            (month == 5 & day == 23) | (month == 7 & day == 1) | (month == 9 & day == 12) |
            (month == 12 & day == 5) | (month == 6 & day == 28) | (month == 7 & day == 23) |
            (month == 7 & day == 10) | (month == 6 & day == 24) | (month == 9 & day == 2)))
## # A tibble: 3,441 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     5      457            500        -3      637            651
##  2  2013    12     5      512            515        -3      753            814
##  3  2013    12     5      527            530        -3      657            706
##  4  2013    12     5      539            540        -1      832            850
##  5  2013    12     5      540            545        -5      822            832
##  6  2013    12     5      544            550        -6      959           1027
##  7  2013    12     5      548            600       -12      738            755
##  8  2013    12     5      551            600        -9      804            810
##  9  2013    12     5      553            600        -7      919            915
## 10  2013    12     5      553            600        -7      645            701
## # ℹ 3,431 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
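The long chain of month and day conditions can alternatively be written as a join against a small table of dates. The following is a minimal sketch using dplyr's `semi_join`, which keeps the rows of one table that have a match in another. `bad_days` and `toy_flights` are hypothetical names; the dates are copied from the filter above and the toy rows are invented.

```r
library(dplyr)

# The 12 dates from the filter above, stored as data rather than as conditions.
bad_days <- tibble(
  month = c(2, 3, 2, 5, 7, 9, 12, 6, 7, 7, 6, 9),
  day   = c(9, 8, 8, 23, 1, 12, 5, 28, 23, 10, 24, 2)
)

# Invented flights, for illustration only.
toy_flights <- tibble(
  month     = c(2, 2, 4, 12, 12),
  day       = c(9, 9, 1, 5, 5),
  dep_delay = c(-3, 45, -2, 0, NA)
)

on_time_early <- toy_flights %>%
  filter(dep_delay <= 0) %>%                    # drops late and cancelled (NA) flights
  semi_join(bad_days, by = c("month", "day"))   # keeps only flights on the listed dates

print(on_time_early)
```

Keeping the dates in a table means the list only has to be written once and can be reused wherever the same dates are needed.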
My hypothesis is that the flights that left on time or early were the ones scheduled earlier in the morning, before the bad weather hit NYC. To test this idea, I am going to find the average scheduled hour of departure for each day found in this data set. To do so, I must first save this filtered data as its own named data set.
Flights_OnTime_Early <- flights %>%
  filter((dep_delay <= 0) &
           ((month == 2 & day == 9) | (month == 3 & day == 8) | (month == 2 & day == 8) |
            (month == 5 & day == 23) | (month == 7 & day == 1) | (month == 9 & day == 12) |
            (month == 12 & day == 5) | (month == 6 & day == 28) | (month == 7 & day == 23) |
            (month == 7 & day == 10) | (month == 6 & day == 24) | (month == 9 & day == 2)))
Now that the data set is named, I can find the average scheduled hour of departure by using the month, day, and hour variables. The average scheduled hours of departure by day are below.
Flights_OnTime_Early %>%
  group_by(month, day) %>%
  summarize(avg_hour_dep = mean(hour), count = n())
## # A tibble: 12 × 4
## # Groups:   month [7]
##    month   day avg_hour_dep count
##    <int> <int>        <dbl> <int>
##  1     2     8         8.86   219
##  2     2     9        17.0    125
##  3     3     8        10.2    146
##  4     5    23         9.45   329
##  5     6    24         9.82   369
##  6     6    28        10.1    309
##  7     7     1         9.43   229
##  8     7    10         9.60   360
##  9     7    23        10.0    257
## 10     9     2         9.85   386
## 11     9    12         9.03   386
## 12    12     5        10.1    326
The average hours of departure are consistent with my hypothesis that the flights that left on time or early were scheduled in the morning, before the bad weather hit NYC. On 11 of the 12 days, the average scheduled departure hour of these flights falls between roughly 9 am and 10 am. An average does not mean that every one of these flights left in the morning, only that they were concentrated earlier in the day.
There is one exception in this data, though. The flights that departed on time or early on February 9th had an average scheduled departure hour of around 5 pm, which does not line up with my hypothesis. This could be because the weather that day was worse in the morning than it was later in the day, or because the weather was not as bad as on the other dates.
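An average by itself does not show whether the on-time flights were scheduled earlier than the other flights on the same day. One possible follow-up, sketched here on invented data, compares the average scheduled hour of all flights with that of the on-time or early flights, and adds the share of on-time flights scheduled before noon. `toy_flights` and the new column names are hypothetical.

```r
library(dplyr)

# Invented flights on two dates, for illustration only.
toy_flights <- tibble(
  month     = c(2, 2, 2, 2, 2, 2, 2, 2),
  day       = c(8, 8, 8, 8, 9, 9, 9, 9),
  hour      = c(6, 8, 18, 20, 9, 16, 18, 19),
  dep_delay = c(-2, 0, 120, NA, NA, -1, -4, 30)
)

hour_comparison <- toy_flights %>%
  group_by(month, day) %>%
  summarize(
    avg_hour_all      = mean(hour),
    # which() drops flights with a missing dep_delay (cancelled flights)
    avg_hour_on_time  = mean(hour[which(dep_delay <= 0)]),
    share_before_noon = mean(hour[which(dep_delay <= 0)] < 12),
    .groups = "drop"
  )

print(hour_comparison)
```

If `avg_hour_on_time` sits well below `avg_hour_all` for a date, the on-time flights on that date were scheduled earlier than the day's flights overall; if the two are close, the morning average mostly reflects the ordinary schedule.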
Overall, the data that I have collected and the research I have done lead me to conclude that weather conditions were among the main factors that contributed to the cancellations and delays of flights in NYC in 2013. On 10 of the 12 days with the highest share of cancelled or heavily delayed flights, the disruption coincided with heavy fog, low visibility, or strong winds, so poor weather conditions significantly impacted flight departures on various days throughout the year.