library(tidyverse)
library(nycflights13)
library(DT)
Is poor weather directly linked to airline cancellations? This report sets out to answer that question.
Three libraries are loaded above the introduction. These packages provide the data table and the functions needed for this report.
The data set we will focus on is ‘flights’ from the package ‘nycflights13’, which contains data on every flight that departed from a New York City airport in 2013. The data set is too big to display in full. It contains 19 variables covering a wide variety of information, which may be confusing at first glance, so each is described here.
The first three variables are ‘year’, ‘month’, and ‘day’. Together they give the date of departure from New York.
Next are ‘dep_time’, ‘sched_dep_time’, ‘arr_time’, ‘sched_arr_time’, ‘dep_delay’, and ‘arr_delay’. “Dep” refers to departure and “arr” refers to arrival. The four time variables give the actual and scheduled departure and arrival times in HHMM format, H being hour and M being minute. The two delay variables are measured in minutes, and negative values indicate an early departure or arrival.
Along with those, we are given ‘carrier’, ‘flight’, ‘tailnum’, ‘origin’, and ‘dest’. These are categorical variables that identify each flight: the carrier, the flight number, the tail number of the plane, the airport of origin, and the destination airport.
‘air_time’ is the time spent in the air, recorded in minutes, and ‘distance’ is the distance between the airport of origin and the destination airport, given in miles.
Finally, we are given ‘hour’, ‘minute’, and ‘time_hour’. ‘hour’ and ‘minute’ are the scheduled departure time broken into hours and minutes. ‘time_hour’ is the scheduled date and hour of the flight as a single date-time value; together with ‘origin’, it can be used to link flights to hourly weather records.
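Because the clock-time variables are stored as HHMM integers, they cannot be subtracted directly (1500 minus 1454 is 46, not 6 minutes). The sketch below shows one way to convert them to minutes since midnight. The `toy_times` table and its values are made up for illustration; only the column names match `flights`.

```r
library(dplyr)

# Hypothetical toy rows; the real `flights` table stores clock times the same way.
toy_times <- tibble(
  dep_time       = c(517L, 1454L, 2359L),
  sched_dep_time = c(515L, 1500L, 2359L)
)

toy_times_min <- toy_times %>%
  mutate(
    # Integer division by 100 gives the hour, the remainder gives the minutes.
    dep_min       = (dep_time %/% 100L) * 60L + dep_time %% 100L,
    sched_dep_min = (sched_dep_time %/% 100L) * 60L + sched_dep_time %% 100L,
    # Difference in minutes; this simple version ignores departures
    # that slip past midnight.
    delay_min     = dep_min - sched_dep_min
  )

toy_times_min$delay_min  # 2 -6 0
```

In the real data this conversion is rarely needed for delays, since ‘dep_delay’ and ‘arr_delay’ are already given in minutes.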
Source of data set: RITA, Bureau of Transportation Statistics, https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
When looking at delays and cancellations, we focus on flights that had a significant delay, defined here as an hour or longer. Shorter delays are left out because they are common enough that including them would flag far too many flights to isolate the unusual days.
flights %>%
  group_by(month, day) %>%
  # A missing dep_delay marks a canceled flight; 60 minutes or more counts as a significant delay
  summarize(
    percent_cancelled_delayed = mean(is.na(dep_delay) | dep_delay >= 60),
    .groups = "drop"
  ) %>%
  filter(percent_cancelled_delayed >= 0.35)
This data table shows the days on which at least 35% of flights were canceled or delayed by an hour or more. This threshold pares the data down to the days with the most disruption. The table gives us 12 days that had substantial cancellations and delays.
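The daily share relies on how R combines missing values with `|`: a canceled flight has a missing `dep_delay`, so `dep_delay >= 60` is `NA`, but `TRUE | NA` still evaluates to `TRUE`. A minimal sketch with a made-up four-flight day (`toy_day` is hypothetical) illustrates this:

```r
library(dplyr)

# Hypothetical toy day: NA marks a canceled flight, as in flights$dep_delay.
toy_day <- tibble(dep_delay = c(-3, 75, NA, 10))

toy_flag <- toy_day %>%
  # is.na() is TRUE for the canceled flight, so the NA comparison does not leak through.
  mutate(flagged = is.na(dep_delay) | dep_delay >= 60)

toy_flag$flagged        # FALSE TRUE TRUE FALSE
mean(toy_flag$flagged)  # 0.5
```

Taking the mean of a logical vector gives the share of `TRUE` values, which is why the summary column is a proportion between 0 and 1 rather than a percentage between 0 and 100.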
Having 35% of all flights in one day canceled or delayed by an hour or more is a big deal. So why did so many flights get delayed or canceled? Weather is usually the first suspect.
On February 8, February 9, March 8, May 23, July 23, and September 2, there was heavy fog and low visibility, leading to mass cancellations and delays on these days.
On July 10, September 12, and December 5, there was a mix of unpredictable wind and heavy fog. These two factors led to delayed or canceled flights.
On June 28, there was significant flooding throughout New York, as well as tension between police and protesters that day, both of which may have contributed to late or canceled flights.
On June 24, no specific weather data points to a reason for canceling flights. However, a few news headlines from that day suggest tension within the nation. This does not mean flights were canceled or delayed because of it.
Having looked at the specific days with the most canceled or late flights in 2013, we can now ask which hours of the day had the most cancellations and delays.
flights_w_percent <- flights %>%
  group_by(hour) %>%
  summarize(percent_cancelled_delayed = mean(is.na(dep_delay) | dep_delay >= 60))
arrange(flights_w_percent, hour)
This data table shows, for every scheduled departure hour in 2013, the share of flights that were canceled or delayed by an hour or more. Hours 2, 3, and 4 are not included in the table because no flights were scheduled to depart at those times; an hour with flights but no significant delays would still appear, with a value of 0.
To visualize this data table, we can create a scatter plot to show the relationship between departure hour and the share of canceled and delayed flights.
ggplot(data = flights_w_percent) +
  geom_point(mapping = aes(x = hour, y = percent_cancelled_delayed)) +
  labs(
    x = "Hour of the Day",
    y = "Percent of Flights Cancelled/Delayed",
    title = "Percent of Flights Cancelled or Delayed Per Hour"
  )
In the scatter plot above, the most obvious point is in the top left corner. This data point says that 100% of the flights at 1 am in 2013 were canceled or delayed. That is technically true, but misleading.
filter(flights, hour == 1)
This filter shows exactly how many flights were scheduled to leave at 1 am in 2013. There was only one, and it was canceled, which is why 100% of the flights at 1 am count as canceled.
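One way to catch points like this before plotting is to carry the number of flights alongside each hourly share and set aside hours with too few flights. The sketch below uses a made-up `toy_flights` table, and the cutoff `min_flights` is an arbitrary value chosen for illustration, not something taken from the report.

```r
library(dplyr)

# Hypothetical toy data: one flight at hour 1, three at hour 5, four at hour 6.
toy_flights <- tibble(
  hour      = c(1L, 5L, 5L, 5L, 6L, 6L, 6L, 6L),
  dep_delay = c(NA, -2, 0, 3, 5, 90, -1, NA)
)

min_flights <- 3  # hypothetical cutoff; pick a value suited to the real data

toy_by_hour <- toy_flights %>%
  group_by(hour) %>%
  summarize(
    n_flights = n(),  # how many flights the share is based on
    percent_cancelled_delayed = mean(is.na(dep_delay) | dep_delay >= 60)
  )

# Keep only hours with enough flights for the share to be meaningful.
reliable_hours <- toy_by_hour %>%
  filter(n_flights >= min_flights)

reliable_hours$hour  # 5 6
```

With the `n_flights` column in place, the 1 am point can be recognized as a single flight rather than a pattern.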
Looking past the outlier, the graph trends upward through the day, peaks at around hour 20, and then slowly falls back down. Why is this? One possible explanation is that harsh weather tends to build throughout the day. If flying conditions are going to be bad, the earlier flights may be able to depart before the weather intensifies, while the flights later in the afternoon aren’t as lucky. Using exact numbers from the table, hours 19 and 20 have the highest rates of cancellations and delays in 2013. This means that within the data from 2013, it is riskier to fly in the late afternoon and evening than in the morning and early afternoon.
The question of the overall best time to fly does not have a simple answer. It may vary from year to year or even from airport to airport.
From the data we have, we can look at the days that encountered mass cancellations and delays to see flights that were able to depart despite the severe weather and other factors.
flights %>%
  select(month, day, dep_delay) %>%
  filter(
    dep_delay <= 0,
    (month == 2 & day %in% c(8, 9)) | (month == 3 & day == 8) |
      (month == 5 & day == 23) | (month == 6 & day %in% c(24, 28)) |
      (month == 7 & day %in% c(1, 10, 23)) | (month == 9 & day %in% c(2, 12)) |
      (month == 12 & day == 5)
  )
This table shows that on those 12 days, despite the cancellations and delays, 3,441 flights still took off on time or early. So what time did these flights leave?
flights %>%
  select(month, day, hour, dep_delay) %>%
  # Same 12 dates as the previous table, including July 10
  filter(
    dep_delay <= 0,
    (month == 2 & day %in% c(8, 9)) | (month == 3 & day == 8) |
      (month == 5 & day == 23) | (month == 6 & day %in% c(24, 28)) |
      (month == 7 & day %in% c(1, 10, 23)) | (month == 9 & day %in% c(2, 12)) |
      (month == 12 & day == 5)
  )
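A long chain of date conditions is easy to get out of sync between chunks. As an alternative sketch, the dates can be kept in one small lookup table and matched with `semi_join()`. Both `bad_days` and `toy_flights` below are made-up stand-ins (only two of the dates are listed); with the real data, `bad_days` would hold all 12 dates and `toy_flights` would be `flights`.

```r
library(dplyr)

# Hypothetical lookup table holding two of the flagged dates.
bad_days <- tibble(month = c(2L, 2L), day = c(8L, 9L))

# Hypothetical toy flights using the same column names as `flights`.
toy_flights <- tibble(
  month     = c(2L, 2L, 2L, 3L),
  day       = c(8L, 9L, 9L, 1L),
  hour      = c(7L, 16L, 18L, 9L),
  dep_delay = c(-4, 0, 35, -2)
)

on_time_bad_days <- toy_flights %>%
  filter(dep_delay <= 0) %>%
  # semi_join() keeps rows whose (month, day) pair appears in bad_days
  # and adds no columns from the lookup table.
  semi_join(bad_days, by = c("month", "day"))

nrow(on_time_bad_days)  # 2
```

Keeping the dates in one table means every later chunk filters on exactly the same set of days.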
Since the data table has so many rows, we can take this one step further and find the average take-off hour for each of these dates.
flights %>%
  select(month, day, hour, dep_delay) %>%
  filter((dep_delay <= 0) & (month == 2 & day == 8)) %>%
  summarize(mean(hour))
On February 8, the average take-off hour was 8.86.
flights %>%
  select(month, day, hour, dep_delay) %>%
  filter((dep_delay <= 0) & (month == 2 & day == 9)) %>%
  summarize(mean(hour))
On February 9, the average take-off hour was 16.96.
flights %>%
  select(month, day, hour, dep_delay) %>%
  filter((dep_delay <= 0) & (month == 3 & day == 8)) %>%
  summarize(mean(hour))
On March 8, the average take-off hour was 10.24.
flights %>%
  select(month, day, hour, dep_delay) %>%
  filter((dep_delay <= 0) & (month == 5 & day == 23)) %>%
  summarize(mean(hour))
On May 23, the average take-off hour was 9.45.
flights %>%
  select(month, day, hour, dep_delay) %>%
  filter((dep_delay <= 0) & (month == 6 & day == 24)) %>%
  summarize(mean(hour))
On June 24, the average take-off hour was 9.82.
flights %>%
  select(month, day, hour, dep_delay) %>%
  filter((dep_delay <= 0) & (month == 6 & day == 28)) %>%
  summarize(mean(hour))
On June 28, the average take-off hour was 10.09.
flights %>%
  select(month, day, hour, dep_delay) %>%
  filter((dep_delay <= 0) & (month == 7 & day == 1)) %>%
  summarize(mean(hour))
On July 1, the average take-off hour was 9.43.
flights %>%
  select(month, day, hour, dep_delay) %>%
  filter((dep_delay <= 0) & (month == 7 & day == 10)) %>%
  summarize(mean(hour))
On July 10, the average take-off hour was 9.6.
flights %>%
  select(month, day, hour, dep_delay) %>%
  filter((dep_delay <= 0) & (month == 7 & day == 23)) %>%
  summarize(mean(hour))
On July 23, the average take-off hour was 10.02.
flights %>%
  select(month, day, hour, dep_delay) %>%
  filter((dep_delay <= 0) & (month == 9 & day == 2)) %>%
  summarize(mean(hour))
On September 2, the average take-off hour was 9.85.
flights %>%
  select(month, day, hour, dep_delay) %>%
  filter((dep_delay <= 0) & (month == 9 & day == 12)) %>%
  summarize(mean(hour))
On September 12, the average take-off hour was 9.03.
flights %>%
  select(month, day, hour, dep_delay) %>%
  filter((dep_delay <= 0) & (month == 12 & day == 5)) %>%
  summarize(mean(hour))
On December 5, the average take-off hour was 10.14.
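The per-date averages above can also be produced in a single pipeline by grouping on the date instead of filtering one date at a time. This is a sketch on made-up data: `bad_days` and `toy_flights` are hypothetical stand-ins for the 12 flagged dates and for `flights`, and the extra `n_on_time` column is an addition that shows how many flights each average rests on.

```r
library(dplyr)

# Hypothetical lookup table holding two of the flagged dates.
bad_days <- tibble(month = c(2L, 2L), day = c(8L, 9L))

# Hypothetical toy flights using the same column names as `flights`.
toy_flights <- tibble(
  month     = c(2L, 2L, 2L, 2L, 3L),
  day       = c(8L, 8L, 9L, 9L, 1L),
  hour      = c(8L, 10L, 16L, 18L, 9L),
  dep_delay = c(-1, 0, -5, 120, -3)
)

mean_hour_by_day <- toy_flights %>%
  filter(dep_delay <= 0) %>%                       # on time or early
  semi_join(bad_days, by = c("month", "day")) %>%  # flagged dates only
  group_by(month, day) %>%
  summarize(
    mean_hour = mean(hour),
    n_on_time = n(),
    .groups   = "drop"
  )

mean_hour_by_day$mean_hour  # 9 16
```

One table with a row per date makes it easier to compare the dates side by side and to spot the one that stands apart.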
All of the 12 dates except February 9 had an average take-off hour of around 8 to 10 am. This suggests that on days with bad weather in the forecast, the best time to fly is before or around 10 am; after that, the chances of delays and cancellations increase sharply. So what happened on February 9?
The most likely explanation is that the weather started to clear up around 12 pm. Once the weather clears, airports may need time to recover, which would push the average take-off hour later, to around 16:00.
The driving question of this report was whether weather is linked to airport delays and cancellations. From the information gathered, I think the answer is yes: weather patterns do have a direct correlation with delays and cancellations. When harsh weather is approaching, the best time to fly out of an airport is between 8 and 10 am, with February 9 as a special case where on-time flights left later in the afternoon. As the day progresses, the chance of a delay increases, along with the chance of a flight cancellation.