This report will analyze data surrounding the cancellation or significant delay of flights departing from New York airports (JFK, EWR, LGA) in 2013.
library(nycflights13)
library(tidyverse)
library(DT)
The tidyverse library loads functions for data transformation and visualization.
The nycflights13 library loads the data set “flights” used throughout this report.
The DT library provides the datatable() function used to display the tables in this report.
From: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
#Flights data set too large to import into data table format
#datatable(flights, options = list(scrollX = TRUE))
19 variables, 336,776 observations
year: year of departure
month: month of departure
day: day of departure
dep_time: actual departure time (stored as an integer in HHMM or HMM format, local time zone)
arr_time: actual arrival time (HHMM or HMM format, local time zone)
sched_dep_time: scheduled departure time (HHMM or HMM format, local time zone)
sched_arr_time: scheduled arrival time (HHMM or HMM format, local time zone)
dep_delay: departure delay in minutes (negative represents early departure)
arr_delay: arrival delay in minutes (negative represents early arrival)
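carrier: two-letter carrier abbreviation
flight: flight number
tailnum: plane tail number
origin: origin airport of flight
dest: destination airport of flight
air_time: time spent in the air in minutes
distance: distance between origin and destination airports in miles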
hour: hour of scheduled departure
minute: minute of scheduled departure
time_hour: scheduled hour and date of flight in POSIXct format
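Because the clock-time variables (dep_time, sched_dep_time, arr_time, sched_arr_time) are stored as integers such as 517 for 5:17 or 2359 for 23:59, they cannot be subtracted or averaged directly. The following is a minimal sketch of one way to convert them, using base R only. The helper name hhmm_to_minutes is hypothetical and is not part of nycflights13.

```r
# Hypothetical helper (not part of nycflights13):
# convert HHMM-encoded clock times to minutes since midnight.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

times <- c(517, 1545, 2359, NA)
hhmm_to_minutes(times)
# 317 945 1439 NA

# Subtracting raw HHMM values gives a misleading gap:
1005 - 955                                     # 50
hhmm_to_minutes(1005) - hhmm_to_minutes(955)   # 10 (the real gap in minutes)
```

Missing values pass through unchanged, so cancelled flights keep an NA time after conversion.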
In 2013, New York airports had several days on which at least 35% of departing flights were cancelled or delayed by an hour or more. The data set below shows only those days; days on which fewer than 35% of flights were cancelled or significantly delayed are filtered out. The variable “percent_delay_cancel” gives the proportion (between 0 and 1) of that day's flights that were cancelled or delayed by at least 60 minutes. The variable “number_delay” gives the number of flights delayed by at least 60 minutes, and the variable “number_cancel” gives the number of flights that were cancelled.
flights_cancel_delay_percent <- flights %>%
  group_by(month, day) %>%
  summarize(count = n(),
            # a missing dep_time marks a cancelled flight
            percent_delay_cancel = mean(is.na(dep_time) | (dep_delay >= 60)),
            number_delay = sum(dep_delay >= 60, na.rm = TRUE),
            number_cancel = sum(is.na(dep_time)),
            .groups = "drop") %>%
  # keep only days with at least 35% of flights cancelled or delayed 60+ minutes
  filter(percent_delay_cancel >= 0.35)
datatable(flights_cancel_delay_percent, options = list(scrollX = TRUE))
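The summary above relies on how R combines missing values in a logical "or": a cancelled flight has NA for both dep_time and dep_delay, and TRUE | NA evaluates to TRUE, so cancelled flights are still counted. A small sketch with made-up values (not taken from the flights data) shows the behavior:

```r
# Toy values for illustration only.
dep_time  <- c(600, NA, 1830, 915)   # NA marks a cancelled flight
dep_delay <- c(-3,  NA,   75,  12)   # dep_delay is also NA for that flight

flag <- is.na(dep_time) | (dep_delay >= 60)
flag
# FALSE TRUE TRUE FALSE  (TRUE | NA is TRUE, so the cancelled flight counts)

mean(flag)                          # 0.5: one cancellation plus one long delay out of four
sum(dep_delay >= 60, na.rm = TRUE)  # 1 flight delayed by at least 60 minutes
sum(is.na(dep_time))                # 1 cancelled flight
```

Without the is.na(dep_time) term, mean(dep_delay >= 60) would return NA for any day with a cancelled flight unless na.rm = TRUE were set, and cancellations would then be left out of the proportion.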
These flights were cancelled for a variety of reasons. The cancellations on 2/8 and 2/9 were both caused by a blizzard that moved through the northeastern United States. Similarly, flights on 3/8 were cancelled due to snow and fog in the New York area. Cancellations on 5/23 and 9/2 were caused by poor visibility due to fog. Cancellations on 6/28, 7/1, and 7/10 were all caused by storms. Flights on 6/24 and 7/23 were delayed or cancelled due to jet accidents on the runway.
More insight into flight cancellations and delays can be gained by investigating the time at which flights are scheduled to take off. The data set below groups all flights by the hour of their scheduled departure time. For each hour, the variable “percent_hour_delay_cancel” gives the proportion (between 0 and 1) of flights that were cancelled or delayed by at least 1 hour, and “count” gives the number of flights scheduled in that hour.
flights_can_del_perhour <- flights %>%
  group_by(hour) %>%
  summarize(percent_hour_delay_cancel =
              mean(is.na(dep_time) | (dep_delay >= 60)),
            count = n()) %>%
  # drop the single flight scheduled in hour 1 (treated as an outlier)
  filter(hour != 1)
datatable(flights_can_del_perhour, options = list(scrollX = TRUE))
The figure below shows this data as a scatter plot. The hour in which the flight was scheduled to depart is on the x-axis, and the proportion of flights cancelled or delayed by at least 1 hour is on the y-axis.
ggplot(flights_can_del_perhour,
       mapping = aes(x = hour, y = percent_hour_delay_cancel)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(title = "Percentage Flights Cancelled or Delayed per Hour",
       x = "Hour",
       y = "Percent Cancelled or Delayed")
Only 1 flight was scheduled to depart in the 1:00 hour, and it was cancelled, which gives that hour a cancelled/delayed proportion of 1.00. This data point has been removed as an outlier. No flights were scheduled to depart between 2:00 and 4:00, so the graph shows data for the hours 5:00 through 23:00.
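Dropping a specific hour works here, but the underlying issue is that a proportion based on a single flight is not meaningful. One alternative, sketched below with made-up numbers, is to filter on the number of flights per hour. The threshold name min_flights and its value are assumptions chosen for illustration, not something used in this report.

```r
library(dplyr)

# Toy per-hour summary; all values are made up for illustration.
per_hour <- tibble(
  hour = c(1, 5, 12, 19),
  percent_hour_delay_cancel = c(1.00, 0.02, 0.08, 0.15),
  count = c(1, 2000, 18000, 21000)
)

min_flights <- 30  # hypothetical threshold

# Keep only hours with enough flights for the proportion to be meaningful.
per_hour_kept <- per_hour %>%
  filter(count >= min_flights)
per_hour_kept
```

With this toy input, only the row for hour 1 is removed, and the same rule would also catch any other sparsely populated hour without naming it explicitly.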
The trend line rises until it peaks between 19:00 and 21:00, then falls from 22:00 to 23:00. Overall, the later in the day a flight is scheduled, the more likely it is to be cancelled or significantly delayed, and the riskiest departure window is between 19:00 and 21:00. This trend is likely caused by the accumulation of smaller delays: if a flight earlier in the day is slightly delayed, the flight following it is also slightly delayed, as is the next, and so on, so later flights are set back further. Far fewer flights depart after 22:00 than at other hours of the day, which likely allows the airports to “catch up” on earlier delays and reduces the number of cancellations.
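The accumulation argument can be made concrete with a toy model. This is an assumption for illustration only and is not derived from the flights data: each flight in a sequence picks up a small fresh delay, and the slack between flights absorbs only part of the delay carried over from the previous one.

```r
# Toy model with made-up numbers (an assumption, not fitted to the data).
new_delay <- c(5, 5, 5, 5, 5, 5)  # minutes of fresh delay added to each flight
buffer    <- 3                    # minutes of slack recovered between flights

carried <- numeric(length(new_delay))
late <- 0
for (i in seq_along(new_delay)) {
  # recover up to `buffer` minutes of the inherited delay, then add the new delay
  late <- max(0, late - buffer) + new_delay[i]
  carried[i] <- late
}
carried
# 5 7 9 11 13 15
```

Even though no single flight adds more than 5 minutes, the sixth flight in the sequence leaves 15 minutes late, because the slack between flights is smaller than the delay being added.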
Despite the large number of cancellations and delays on the ten high cancellation/delay days identified earlier, many flights were still able to take off on time or early. The data table below shows the flights on those days that were able to leave on time or early.
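flights_highcan_ontime <- flights %>%
  # keep only the ten high cancellation/delay days
  filter((month == 2 & day == 9) |
         (month == 2 & day == 8) |
         (month == 3 & day == 8) |
         (month == 5 & day == 23) |
         (month == 6 & day == 24) |
         (month == 6 & day == 28) |
         (month == 7 & day == 1) |
         (month == 7 & day == 10) |
         (month == 7 & day == 23) |
         (month == 9 & day == 2)) %>%
  # caution: this compares arrival clock times (HHMM integers), so it selects
  # flights that arrived at or before the scheduled arrival time;
  # dep_delay <= 0 would test departure directly
  filter(arr_time <= sched_arr_time)
datatable(flights_highcan_ontime, options = list(scrollX = TRUE))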
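The filter above compares arr_time with sched_arr_time, which is an arrival-based test. A departure-based version of the same idea can be sketched as follows, using dep_delay and a small lookup table of days joined with semi_join() instead of a long chain of conditions. The tables toy_flights and high_days are hypothetical stand-ins with made-up values; on the real data this variant may give a different count from the one reported in this report.

```r
library(dplyr)

# Toy stand-ins for `flights` and the list of high-cancellation days;
# all values are made up for illustration.
toy_flights <- tibble(
  month     = c(2, 2, 2, 3, 4),
  day       = c(8, 8, 9, 8, 1),
  dep_delay = c(-5, 90, 0, NA, -2)
)
high_days <- tibble(month = c(2, 2, 3), day = c(8, 9, 8))

on_time <- toy_flights %>%
  semi_join(high_days, by = c("month", "day")) %>%  # keep flights on the listed days
  filter(dep_delay <= 0)                            # on time or early; NA rows are dropped
on_time
```

With this toy input, two flights remain: the one on 2/8 that left 5 minutes early and the one on 2/9 that left exactly on time. The cancelled flight on 3/8 is dropped because filter() discards rows where the condition is NA.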
2,964 flights were still able to take off on time or early on these dates. It is likely that these flights departed in the mornings, before the weather had a chance to affect departure times. The table below shows, for each date with a high percentage of cancellations or significant delays, the mean scheduled departure hour of the flights that left on time or early. The variable “mean_sched_dep_time” gives that average scheduled departure hour.
flights_highcan_ontime_avg <- flights_highcan_ontime %>%
  group_by(month, day) %>%
  summarize(mean_sched_dep_time = mean(hour))
datatable(flights_highcan_ontime_avg, options = list(scrollX = TRUE))
The scheduled departure hours range from 5:00 to 23:00. Most of these days have an average scheduled departure hour of 13:00 or earlier, which suggests that most of these flights took off in the first half of the day. There are two notable exceptions to this trend. Flights on 2/9 had an average scheduled departure time in the 16:00 hour, likely because the blizzard conditions that began on 2/8 did not clear until then. Flights on 3/8 had an average scheduled departure time in the 17:00 hour, again likely because it took until then for the weather to clear.
There were 10 days on which at least 35% of flights were cancelled or delayed by 1 hour or more
3 of those days were caused by snow
2 of those days were caused by fog/poor visibility
3 of those days were caused by storms
2 of those days were caused by jet accidents
Flights later in the day are more likely to be cancelled or significantly delayed, likely due to the compounding effect of delays on earlier flights
Flights between the hours of 19:00 and 21:00 are the most likely to be cancelled or significantly delayed
Because fewer flights depart after 22:00, the number of delays and cancellations decreases between 22:00 and 23:00
2,964 flights were still able to depart on time or early on days with at least 35% of flights cancelled or significantly delayed
Flights that left on time or early on these dates left in the mornings