In this analysis, I will explore various aspects of data on flights departing NYC in 2013. I will focus on three main areas: days with high cancellation/delay rates, how disruptions vary by scheduled departure hour, and the flights that still left early or on time on days with high disruption percentages. By examining these factors, we can learn about the patterns and possible reasons behind flight disruptions.
I have loaded the packages tidyverse and nycflights13. The tidyverse provides tools for clean, effective graphics (ggplot2) and for sorting and filtering the data down to the information that is needed (dplyr), while nycflights13 is where the data set comes from. Because dplyr is already attached as part of the tidyverse, loading it separately is optional.
library(tidyverse)
library(nycflights13)
library(dplyr)
The data set that I will be using in this report is flights from the nycflights13 package. This data set is too large to display in full in this report. The flights data come from RITA, the Bureau of Transportation Statistics, and cover on-time data for all flights that departed NYC in 2013. There are 336,776 observations of 19 variables. The variables are: year, month, day, departure time, arrival time, scheduled departure time, scheduled arrival time, departure delay, arrival delay, carrier, flight number, tail number, origin, destination, air time, distance traveled, hour, minute, and the scheduled date and hour of the flight.
The following data set lists only the days in 2013 on which more than 35% of flights were canceled or delayed by more than one hour. A flight with a missing departure delay is treated as canceled.
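Since the full table is too large to print, a compact overview can stand in for it. The following is a minimal sketch, assuming only that dplyr and nycflights13 are installed; the name flights_dim is introduced here for illustration and is not part of the original analysis.

```r
library(nycflights13)
library(dplyr)

# Number of rows (flights) and columns (variables) in the data set
flights_dim <- dim(flights)
flights_dim

# One line per variable: its name, type, and first few values
glimpse(flights)
```

glimpse() prints every column without printing every row, which makes it a convenient substitute for displaying the whole table.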
flights_summary <- flights %>%
  filter(year == 2013) %>%
  # Flag flights that left more than 60 minutes late or have no departure delay (treated as canceled)
  mutate(delayed_or_canceled = (dep_delay > 60 | is.na(dep_delay))) %>%
  group_by(year, month, day) %>%
  summarize(total_flights = n(),
            delayed_or_canceled_flights = sum(delayed_or_canceled, na.rm = TRUE),
            percentage_delayed_or_canceled = (delayed_or_canceled_flights / total_flights) * 100) %>%
  # Keep only the days on which more than 35% of flights were disrupted
  filter(percentage_delayed_or_canceled > 35)
(flights_summary)
Although only 12 days in 2013 had more than 35% of flights canceled or delayed by over an hour, we can investigate the data to see possible reasons why these dates had so many late or canceled flights. The first two rows (February 8th and 9th) show the highest disruption rates of any days in 2013. I found that the reason over half of flights were canceled on these two days was a record-breaking snowstorm that hit New York. The weather on 3/8/2013 and 5/23/2013 was clear with periodic light rain or snow, which suggests that weather was not a major reason so many flights were disrupted on those days. Rainfall exceeded 4 inches on 6/24/2013 and 6/28/2013, which could be why approximately 36% of flights left late or not at all. On all three July dates the weather was extremely warm, with record highs for the year, and the extreme heat could have interfered with aircraft operations. For the two dates in September I was not able to find significant weather that would explain the level of disruption. On 12/05/2013 some snow fell, which could have caused some of the delays and cancellations. Weather appears to have had a large impact on some of these days but not on others.
The following data set shows the percentage of flights that were canceled or delayed by more than one hour, grouped by the hour at which they were scheduled to depart. This tells travelers which hours of the day have the largest share of flights that are canceled or delayed by over an hour. With this knowledge, an individual could choose to schedule flights at times with fewer delays.
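The weather explanations above can be partly cross-checked against the weather table that ships with nycflights13. The following is a rough sketch, not the method used for the findings above: the disrupted_days table and the summary column names are introduced here for illustration, and the dates are copied from the date filter used later in this report. The weather table has gaps, so missing or implausible values should not be read as clear skies.

```r
library(nycflights13)
library(dplyr)

# Hypothetical helper table: the 12 high-disruption dates used in this report
disrupted_days <- tibble(
  month = c(2L, 2L, 3L, 5L, 6L, 6L, 7L, 7L, 7L, 9L, 9L, 12L),
  day   = c(8L, 9L, 8L, 23L, 24L, 28L, 1L, 10L, 23L, 2L, 12L, 5L)
)

# Daily weather summary per airport, restricted to the disrupted dates
daily_weather <- weather %>%
  semi_join(disrupted_days, by = c("month", "day")) %>%
  group_by(origin, month, day) %>%
  summarize(
    total_precip_in = sum(precip, na.rm = TRUE),  # hourly precipitation summed, in inches
    max_temp_f      = max(temp, na.rm = TRUE),    # highest hourly temperature, in degrees F
    min_visib_mi    = min(visib, na.rm = TRUE),   # lowest hourly visibility, in miles
    .groups = "drop"
  )
daily_weather
```

Grouping by origin keeps the three airports separate, so precipitation is not counted three times for the same day.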
hourly_summary <- flights %>%
  filter(year == 2013) %>%
  mutate(delayed_or_canceled = (dep_delay > 60 | is.na(dep_delay))) %>%
  # sched_dep_time is stored as HHMM, so dividing by 100 and rounding down gives the hour
  group_by(sched_dep_time_hour = floor(sched_dep_time / 100)) %>%
  summarize(total_flights = n(),
            delayed_or_canceled_flights = sum(delayed_or_canceled, na.rm = TRUE),
            percentage_delayed_or_canceled = (delayed_or_canceled_flights / total_flights) * 100)
(hourly_summary)
The following graphic displays the relationship between the scheduled departure hour and the percentage of flights canceled or delayed by over an hour. This visualization makes the previous table easier to read and analyze.
ggplot(data = hourly_summary) +
  geom_line(mapping = aes(x = sched_dep_time_hour, y = percentage_delayed_or_canceled)) +
  labs(x = "Scheduled Departure Hour", y = "Percentage of Delayed or Canceled Flights") +
  ggtitle("Percentage of Flights Delayed or Canceled by Scheduled Departure Hour") +
  scale_x_continuous(breaks = 1:23)
Looking at the data in a plot rather than as values in a table allows us to easily see that there is one major outlier. The line starts at 100% at hour 1. Looking back at the previous table, we can see that only one flight was scheduled in hour 1 and that the next scheduled departures are not until hour 5. The combination of an outlier at the beginning of the data with the next data point coming 4 hours later is the reason for the steep drop at the start of the plot line. Setting that outlier aside, the highest percentage of delayed or canceled flights is at hour 21, making it the riskiest time to fly. This could be because flying at night is more challenging in anything other than perfect conditions. Another possibility is that delays from earlier in the day accumulate into the evening.
The following data set includes only the flights that departed on time or early on the same dates on which more than 35% of flights were canceled or delayed by over an hour.
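One way to keep the hour-1 outlier from dominating the picture is to drop hours with very few scheduled flights before plotting. The following is a minimal sketch; the cutoff of 100 flights and the name hourly_filtered are assumptions made for illustration, not values from the original analysis.

```r
library(nycflights13)
library(dplyr)

hourly_filtered <- flights %>%
  # TRUE when a flight left more than 60 minutes late or has no departure delay recorded
  mutate(delayed_or_canceled = dep_delay > 60 | is.na(dep_delay)) %>%
  # Integer division of HHMM by 100 gives the scheduled hour
  group_by(sched_dep_hour = sched_dep_time %/% 100) %>%
  summarize(
    total_flights = n(),
    percentage_delayed_or_canceled = mean(delayed_or_canceled) * 100
  ) %>%
  # Assumed cutoff: ignore hours with fewer than 100 scheduled flights
  filter(total_flights >= 100)
hourly_filtered
```

The filtered table can be passed to the same ggplot call as before, with sched_dep_hour on the x axis, to get a line that starts at hour 5.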
on_time_flights <- flights %>%
  # Keep flights that left on time or early
  filter(year == 2013, dep_delay <= 0) %>%
  # Keep only the 12 high-disruption dates identified earlier
  filter((month == 12 & day == 5) |
         (month == 9 & day == 12) |
         (month == 9 & day == 2) |
         (month == 7 & day == 23) |
         (month == 7 & day == 10) |
         (month == 7 & day == 1) |
         (month == 6 & day == 28) |
         (month == 6 & day == 24) |
         (month == 5 & day == 23) |
         (month == 3 & day == 8) |
         (month == 2 & day == 9) |
         (month == 2 & day == 8))
on_time_flights
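Typing the 12 dates by hand works, but the same selection can be derived from the data so that it stays in sync with the 35% threshold. The following is a sketch of that alternative using dplyr's semi_join(); the names high_disruption_days and on_time_flights_joined are introduced here for illustration and are not part of the original analysis.

```r
library(nycflights13)
library(dplyr)

# Days on which more than 35% of flights were canceled or over an hour late
high_disruption_days <- flights %>%
  group_by(month, day) %>%
  summarize(
    pct_disrupted = mean(dep_delay > 60 | is.na(dep_delay)) * 100,
    .groups = "drop"
  ) %>%
  filter(pct_disrupted > 35)

# On-time or early departures, restricted to those days
on_time_flights_joined <- flights %>%
  filter(dep_delay <= 0) %>%
  semi_join(high_disruption_days, by = c("month", "day"))
on_time_flights_joined
```

semi_join() keeps the rows of flights that have a matching month and day in high_disruption_days without adding any columns, so the result has the same shape as the hand-filtered table.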
Although the previous data set gives the information we are looking for, it is extremely hard to read due to the number of rows. The following data set shows the average scheduled hour of departure for the flights that left early or on time on days when more than 35% of flights left over an hour late or were canceled. We might expect that the flights that were able to leave early or on time were scheduled earlier in the day, before the weather worsened. We might also expect that they left while the sun was up, some time after 8am but before 1pm.
average_departure_hour <- on_time_flights %>%
  mutate(sched_dep_hour = floor(sched_dep_time / 100)) %>%
  group_by(year, month, day) %>%
  summarize(average_sched_dep_hour = mean(sched_dep_hour, na.rm = TRUE))
average_departure_hour
By running this table we can see that, on average, these flights were in fact scheduled to leave between 9am and 10am, when the conditions were likely at their best. There is one outlier on 2/9, where the average scheduled departure falls in the afternoon. After looking at the weather conditions for that date, we can assume that flights likely took off in the afternoon because snow had to be removed after the blizzard that took place the previous day. Since snow removal takes time, the runways would not have been cleared until the afternoon.
Throughout the course of this analysis of 2013 flights departing New York City, I identified several patterns and possible reasons why those patterns might have occurred. Weather and the time of day a flight is scheduled can both affect a traveler, and knowing the right time to fly could save them from delays and other travel inconveniences.