This report seeks to answer the following question:
Is there a relationship between the number of flights delayed or canceled and the weather conditions in New York City?
We will be using a data set called flights, obtained from
the R package nycflights13 and sourced from RITA, Bureau of
Transportation Statistics (https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236).
The data includes information on all the flights that departed
New York City (NYC) in 2013. This is a large data set, with 336,776
observations and 19 variables for each observation. Of these variables,
the relevant ones are: month (the month of departure),
day (the day of departure), hour (the
scheduled departure hour), dep_delay (the delay of
departure, in minutes, with negative values for early departures),
and dep_time (the actual departure time, which is missing for flights
that never departed). A small portion of the data can be viewed below:
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Throughout, we will need the functionality of the tidyverse package and the nycflights13 package.
library(tidyverse)
library(nycflights13)
The problem we are investigating is the relationship between
the weather in NYC and the frequency with which flights were canceled or
delayed. We use the group_by function to identify which
days in 2013 had a high percentage of canceled or delayed flights. After
identifying these dates, we can research the weather
conditions that may have impacted flight departures. For the purpose of
this report, we use 35% as the benchmark. A flight counts as canceled
when its departure time is missing and as delayed when it left at least
60 minutes late. Grouping the data by
month and day, and summarizing departure time and
departure delay, we can identify which
dates had at least 35% of flights delayed or canceled. The resulting
data set is below:
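flights_percent_35 <- flights %>%
  group_by(month, day) %>%
  # exceeds_35 counts flights that were canceled (missing dep_time) or left 60+ minutes late
  summarize(exceeds_35 = sum(is.na(dep_time) | dep_delay >= 60, na.rm = TRUE), count = n()) %>%
  # share of that day's flights that were canceled or delayed
  mutate(percentage_exceed_35 = exceeds_35 / count) %>%
  arrange(desc(percentage_exceed_35)) %>%
  filter(percentage_exceed_35 >= .35)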
flights_percent_35
## # A tibble: 12 × 5
## # Groups: month [7]
## month day exceeds_35 count percentage_exceed_35
## <int> <int> <int> <int> <dbl>
## 1 2 9 421 684 0.615
## 2 3 8 578 979 0.590
## 3 2 8 508 930 0.546
## 4 5 23 436 988 0.441
## 5 7 1 412 966 0.427
## 6 9 12 404 992 0.407
## 7 12 5 386 969 0.398
## 8 6 28 372 994 0.374
## 9 7 23 365 997 0.366
## 10 7 10 364 1004 0.363
## 11 6 24 352 994 0.354
## 12 9 2 327 929 0.352
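The condition inside summarize relies on how R combines missing values with the | operator. The following is a minimal sketch of that behavior on a made-up vector rather than the real data; toy_dep_time and toy_dep_delay are hypothetical names, and the third flight stands in for a canceled flight whose departure time and delay are both missing.

```r
# Hypothetical toy values: flight 3 was canceled, so both fields are NA.
toy_dep_time  <- c(517, 1830, NA, 905)
toy_dep_delay <- c(2, 75, NA, -4)

# TRUE | NA evaluates to TRUE in R, so the canceled flight is flagged
# even though its delay comparison (NA >= 60) is NA.
flagged <- is.na(toy_dep_time) | toy_dep_delay >= 60
flagged

# Number and share of canceled-or-delayed flights in the toy sample.
n_flagged <- sum(flagged, na.rm = TRUE)
share_flagged <- n_flagged / length(flagged)
```

In this toy sample the second flight (75 minutes late) and the third flight (canceled) are flagged, which is the same logic the daily summary applies to each date.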
The weather conditions on these dates were researched using WeatherSpark (https://weatherspark.com/h/y/23912/2013/Historical-Weather-during-2013-in-New-York-City-New-York-United-States). A pattern of poor weather conditions emerges. February 8th, February 9th, and March 8th all had large snow storms with low visibility. May 23rd, July 1st, July 23rd, and September 2nd had heavy fog mixed with thunderstorms, which also decreased visibility. On July 10th, September 12th, and December 5th the visibility was acceptable, but strong winds made it unsafe to fly. June 24th and June 28th are outliers in the pattern: they had good visibility and winds that were not unsafe to fly in, meaning weather was less of a factor in these delays and cancellations.
Now that the dates with cancellation or delay rates of at least 35% have
been identified, the analysis can turn to the hour of
departure. We again use group_by, this time grouping all flights that
were scheduled to leave NYC in 2013 by their scheduled departure hour.
This helps identify whether any specific hour of
the day experiences delays or cancellations more frequently than others.
Using hour, departure time, and departure delay, the following data
set was created:
flights_delay_per_hour <- flights %>%
  group_by(hour) %>%
  summarize(canceled_delayed_per_hour = sum(is.na(dep_time) | dep_delay >= 60, na.rm = TRUE), count = n()) %>%
  mutate(percentage_canceled_delayed = canceled_delayed_per_hour / count) %>%
  arrange(desc(percentage_canceled_delayed)) %>%
  # drop the 1 AM hour, which holds a single canceled flight
  filter(hour != 1)
flights_delay_per_hour
## # A tibble: 19 × 4
## hour canceled_delayed_per_hour count percentage_canceled_delayed
## <dbl> <int> <int> <dbl>
## 1 21 2226 10933 0.204
## 2 20 3181 16739 0.190
## 3 19 4051 21441 0.189
## 4 22 426 2639 0.161
## 5 18 3352 21783 0.154
## 6 17 3645 24426 0.149
## 7 16 3402 23002 0.148
## 8 15 2926 23888 0.122
## 9 14 2325 21706 0.107
## 10 23 111 1061 0.105
## 11 13 1817 19956 0.0911
## 12 12 1402 18181 0.0771
## 13 11 1066 16033 0.0665
## 14 10 1106 16708 0.0662
## 15 9 1062 20312 0.0523
## 16 8 1367 27242 0.0502
## 17 6 1007 25951 0.0388
## 18 7 803 22821 0.0352
## 19 5 38 1953 0.0195
With the data set
flights_delay_per_hour created, we can then build a visualization
to help explain the results above. A line graph
shows how the percentage of canceled or delayed flights changes
across the scheduled departure hours of the day in 2013.
ggplot(flights_delay_per_hour, mapping = aes(x = hour, y = percentage_canceled_delayed)) +
  geom_line() +
  labs(title = "Percentage of Delayed and Canceled Grouped by Departure Hour",
       x = "Scheduled Departure Hour",
       y = "Percentage of Canceled and Delayed")
This visualization highlights a positive trend in the percentage of cancellations and delays throughout the day. Because the line rises toward the evening hours, the argument can be made that as the day goes on, the share of cancellations and delays increases. One possible explanation is that weather conditions worsen as the day goes on, making it unsafe to fly. While this could be a contributing factor, the data here cannot establish it as a direct cause of the pattern.
One outlier was filtered out of the data to make the graph easier to interpret. It was the only flight scheduled during the 1 AM hour, and it was canceled, meaning that 100% of the flights during that hour were canceled. Flights are typically not scheduled to depart during the 1 AM hour; this was perhaps a delayed flight that was later canceled and remained in the data.
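An hour with a single flight can only produce a rate of 0% or 100%, which is why the 1 AM hour distorts the graph. As an alternative to removing that hour by name, the sketch below drops any hour with too few flights to give a meaningful percentage. The toy_per_hour data frame is a hypothetical stand-in that reuses three rows of the hourly summary, and the cutoff of 100 flights (min_flights) is an assumption chosen only for illustration.

```r
library(dplyr)

# Hypothetical stand-in with the same columns as flights_delay_per_hour.
toy_per_hour <- data.frame(
  hour = c(1, 5, 21),
  canceled_delayed_per_hour = c(1, 38, 2226),
  count = c(1, 1953, 10933)
) %>%
  mutate(percentage_canceled_delayed = canceled_delayed_per_hour / count)

# Assumed cutoff: keep only hours with at least 100 scheduled flights.
min_flights <- 100
reliable_hours <- toy_per_hour %>%
  filter(count >= min_flights)
```

With this rule the 1 AM hour falls out because of its sample size rather than because of a hard-coded value, so the same code would still apply if another sparsely scheduled hour appeared in the data.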
Based on the visualization, the riskiest time to fly in terms of experiencing a cancellation or delay is any time from the 7 PM hour through the 9 PM hour, with a peak in the 9 PM hour at about a 20.4% chance of a cancellation or delay.
Although large numbers of flights had a delayed departure or were canceled, a considerable number of flights were still able to leave on time, if not earlier than the scheduled departure time. Using the filter function, we can keep the flights with a departure delay less than or equal to zero on the dates above. The data set containing these flights is below.
excption_flights <- flights %>%
  filter((dep_delay <= 0) & ((month == 2 & day == 9) | (month == 3 & day == 8) | (month == 2 & day == 8) |
    (month == 5 & day == 23) | (month == 7 & day == 1) | (month == 9 & day == 12) | (month == 12 & day == 5) |
    (month == 6 & day == 28) | (month == 7 & day == 23) | (month == 7 & day == 10) | (month == 6 & day == 24) |
    (month == 9 & day == 2)))
excption_flights
## # A tibble: 3,441 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 5 457 500 -3 637 651
## 2 2013 12 5 512 515 -3 753 814
## 3 2013 12 5 527 530 -3 657 706
## 4 2013 12 5 539 540 -1 832 850
## 5 2013 12 5 540 545 -5 822 832
## 6 2013 12 5 544 550 -6 959 1027
## 7 2013 12 5 548 600 -12 738 755
## 8 2013 12 5 551 600 -9 804 810
## 9 2013 12 5 553 600 -7 919 915
## 10 2013 12 5 553 600 -7 645 701
## # ℹ 3,431 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
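The twelve dates in the filter are typed out by hand, which is easy to get wrong if the benchmark changes. A possible alternative, sketched here on made-up data, is to let dplyr's semi_join pick the dates from the table of problematic days. The toy_flights and toy_bad_days data frames are hypothetical stand-ins for flights and flights_percent_35.

```r
library(dplyr)

# Hypothetical stand-ins for flights and flights_percent_35.
toy_flights <- data.frame(
  month     = c(2, 2, 3, 4, 4),
  day       = c(9, 9, 8, 1, 1),
  dep_delay = c(-3, 120, 0, -5, NA)
)
toy_bad_days <- data.frame(month = c(2, 3), day = c(9, 8))

# semi_join() keeps the rows of the left table whose (month, day) pair
# appears in toy_bad_days, without adding any columns from it.
toy_exceptions <- toy_flights %>%
  filter(dep_delay <= 0) %>%
  semi_join(toy_bad_days, by = c("month", "day"))
```

On the real data, the equivalent call would be flights %>% filter(dep_delay <= 0) %>% semi_join(flights_percent_35, by = c("month", "day")), which should select the same flights as the hand-written filter as long as flights_percent_35 holds the same twelve dates.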
One can argue that these flights were able to take off in the early
morning before the weather hit NYC. In order to test this hypothesis we
can group the data set titled excption_flights by date and find
the average scheduled hour of departure on days with a considerable
number of delays and cancellations.
excption_flights %>%
  group_by(month, day) %>%
  summarize(avg_hour_dep = mean(hour), count = n())
## # A tibble: 12 × 4
## # Groups: month [7]
## month day avg_hour_dep count
## <int> <int> <dbl> <int>
## 1 2 8 8.86 219
## 2 2 9 17.0 125
## 3 3 8 10.2 146
## 4 5 23 9.45 329
## 5 6 24 9.82 369
## 6 6 28 10.1 309
## 7 7 1 9.43 229
## 8 7 10 9.60 360
## 9 7 23 10.0 257
## 10 9 2 9.85 386
## 11 9 12 9.03 386
## 12 12 5 10.1 326
As the above data transformation highlights, the results are consistent with this hypothesis: eleven of the twelve problematic dates had an average departure hour before 11 AM, which suggests that these flights were able to leave before the weather reached unsafe conditions.
There is one exception: on February 9th the average departure time was close to 5 PM, which does not match the hypothesis that flights left in the morning before the weather got bad. This is due to the blizzard on the 8th and 9th, which came to a halt in the afternoon of the 9th and allowed safe flying conditions to resume around 5 PM.
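The morning hypothesis is easier to judge against a baseline, since on-time flights may tend to be morning flights on any day of the year. The following sketch wraps the same summary in a small helper so it can be applied to any set of flights. The function name avg_on_time_hour and the toy rows are hypothetical and are not part of the original analysis.

```r
library(dplyr)

# Average scheduled hour of flights that left on time or early.
# `df` needs the columns hour and dep_delay, as in flights.
avg_on_time_hour <- function(df) {
  df %>%
    filter(dep_delay <= 0) %>%
    summarize(avg_hour_dep = mean(hour), count = n())
}

# Hypothetical toy rows, used only to show the call.
toy_flights <- data.frame(
  hour      = c(6, 8, 10, 19, 21),
  dep_delay = c(-2, 0, -5, 90, NA)
)
toy_baseline <- avg_on_time_hour(toy_flights)
```

Calling avg_on_time_hour(flights) on the full data set would give a year-wide figure to set beside the per-date averages above. If that figure is also around 9 to 10 AM, the morning pattern may reflect when on-time flights are scheduled in general rather than the weather on these particular dates.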
Overall, the data from nycflights13 highlights the
importance weather plays in the number of flights that can safely fly
out of NYC on any given day. The evidence suggests an
association between the weather and flight delays and
cancellations: ten of the twelve worst dates coincided with poor
weather. Poor weather conditions appear to have been one of the main
factors behind delays and cancellations at NYC airports in
2013.
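The weather evidence in this report was gathered by hand for twelve dates. One way to measure the association across every day of the year would be to join daily delay rates to daily weather summaries and compute a correlation. The nycflights13 package also ships an hourly weather table with a visib (visibility) column that could supply the weather side. The sketch below shows only the joining and correlation step on made-up numbers; toy_daily_flights, toy_daily_weather, pct_canceled_delayed, and mean_visib are hypothetical names and values.

```r
library(dplyr)

# Hypothetical daily summaries; real values would be aggregated
# from the flights and weather tables.
toy_daily_flights <- data.frame(
  month = c(1, 1, 1, 1),
  day   = c(1, 2, 3, 4),
  pct_canceled_delayed = c(0.05, 0.10, 0.40, 0.60)
)
toy_daily_weather <- data.frame(
  month = c(1, 1, 1, 1),
  day   = c(1, 2, 3, 4),
  mean_visib = c(10, 9, 3, 1)
)

# Match each day's delay rate with that day's average visibility.
toy_joined <- inner_join(toy_daily_flights, toy_daily_weather,
                         by = c("month", "day"))

# Pearson correlation between delay rate and visibility.
toy_cor <- cor(toy_joined$pct_canceled_delayed, toy_joined$mean_visib)
```

In the toy numbers the correlation is negative, meaning lower visibility goes with a higher share of canceled or delayed flights. Whether the real 2013 data show the same relationship, and how strongly, would need to be checked by running the calculation on the actual tables.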