This analysis explores patterns in flight cancellations and delays during 2013 and investigates how weather relates to them. Using data transformation and visualization, we also examine the days on which more than 35% of flights were cancelled or delayed and ask what may have caused that level of disruption, whether weather, time of day, or another factor.
For this analysis, we will use the “tidyverse” and “nycflights13” libraries.
library(tidyverse)
library(nycflights13)
First, we will look at the number of flights that were cancelled or delayed in 2013 and see which days had unusually many. To build the dataset, we first need a variable to compare against our bar for a high share of delays and cancellations, which we set at 35% of all flights on a given day. A flight counts as delayed if it departed more than 60 minutes late and as cancelled if it has no recorded departure time.
flights_delays_cancelled <- flights %>%
  # TRUE when a flight left more than 60 minutes late or never departed
  mutate(delay_or_cancelled = dep_delay > 60 | is.na(dep_time)) %>%
  group_by(month, day) %>%
  summarize(total = n(),
            delaycancelcount = sum(delay_or_cancelled),
            percentage_of_delayscancel = delaycancelcount / total * 100)
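The flag above depends on how R treats missing values in logical expressions, which is easy to get wrong. Here is a minimal sketch on a made-up vector (the values are hypothetical, not taken from the flights data). A cancelled flight has `NA` for its delay, so the comparison alone returns `NA`, but `NA | TRUE` evaluates to `TRUE`, which is what lets `sum()` count it.

```r
# Hypothetical departure delays in minutes; NA stands for a cancelled flight.
dep_delay <- c(5, 75, NA, -3, 120)

# The comparison alone leaves NA for the cancelled flight.
late <- dep_delay > 60

# OR-ing with is.na() turns that NA into TRUE, so no NA reaches sum().
delay_or_cancelled <- dep_delay > 60 | is.na(dep_delay)

disrupted <- sum(delay_or_cancelled)          # two late flights + one cancelled
share <- disrupted / length(dep_delay) * 100  # percentage of all flights
print(c(disrupted = disrupted, share = share))
```

Without the `is.na()` term, `sum()` would return `NA` for any day with a cancelled flight unless `na.rm = TRUE` were set, and in that case cancellations would silently drop out of the count.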
Now that we have our variable, “percentage_of_delayscancel,” we can keep only the days above 35% to find which days had the highest share of cancellations or delays in New York City.
filter(flights_delays_cancelled, percentage_of_delayscancel > 35) %>%
arrange(desc(percentage_of_delayscancel))
## # A tibble: 12 × 5
## # Groups:   month [7]
##    month   day total delaycancelcount percentage_of_delayscancel
##    <int> <int> <int>            <int>                      <dbl>
##  1     2     9   684              419                       61.3
##  2     3     8   979              574                       58.6
##  3     2     8   930              506                       54.4
##  4     5    23   988              435                       44.0
##  5     7     1   966              407                       42.1
##  6     9    12   992              404                       40.7
##  7    12     5   969              385                       39.7
##  8     6    28   994              369                       37.1
##  9     7    23   997              362                       36.3
## 10     7    10  1004              364                       36.3
## 11     9     2   929              326                       35.1
## 12     6    24   994              348                       35.0
After filtering, we find that February 9, March 8, and February 8 had the highest percentages of delays or cancellations. If we Google the weather in New York City on those days in 2013, we find that February 8 and 9 were the “Blizzard of 2013,” which brought heavy snow and strong winds to the entire Northeast United States. As a result, 54.4% of flights on February 8 and 61.3% on February 9 were delayed by more than an hour or cancelled. March 8, similarly, brought heavy snow and a nor’easter that blanketed the New York metro area. Given these events, we can definitely understand why such a high percentage of flights were delayed or cancelled on those days! Brrrr…
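As a quick check on the percentages for the three worst days, the arithmetic can be redone by hand from the counts in the table above. This is only a sketch: the numbers are copied from the printed output, and the combined two-day figure for the blizzard is an extra calculation that does not appear in the original table.

```r
# Counts copied from the table above: delayed-or-cancelled flights and totals.
disrupted <- c(feb9 = 419, mar8 = 574, feb8 = 506)
total     <- c(feb9 = 684, mar8 = 979, feb8 = 930)

# Per-day percentage, using the same formula as the summarize() step.
pct <- round(disrupted / total * 100, 2)
print(pct)

# Combined share over both blizzard days (not shown in the original table).
blizzard_pct <- round((419 + 506) / (684 + 930) * 100, 2)
print(blizzard_pct)
```

The per-day values come out to about 61.26, 58.63, and 54.41, and the two blizzard days together to about 57.31, so the 61% figure applies to February 9 on its own rather than to both days.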
We know that February 8-9 and March 8 had the highest percentage of flights delayed or cancelled, but maybe flights are also delayed or cancelled because of the time of day. Let’s see how much the scheduled departure hour really affects delays and cancellations.
flights_time_of_day <- flights %>%
  # dep_delay is NA for flights that never departed, so this also flags cancellations
  mutate(delay_cancel = dep_delay > 60 | is.na(dep_delay)) %>%
  group_by(hour) %>%
  summarize(total_flights = n(),
            count = sum(delay_cancel),
            percentage_delaycancel = count / total_flights * 100)
print(flights_time_of_day)
## # A tibble: 20 × 4
##     hour total_flights count percentage_delaycancel
##    <dbl>         <int> <int>                  <dbl>
##  1     1             1     1                 100
##  2     5          1953    37                   1.89
##  3     6         25951   999                   3.85
##  4     7         22821   792                   3.47
##  5     8         27242  1351                   4.96
##  6     9         20312  1049                   5.16
##  7    10         16708  1093                   6.54
##  8    11         16033  1052                   6.56
##  9    12         18181  1390                   7.65
## 10    13         19956  1780                   8.92
## 11    14         21706  2299                  10.6
## 12    15         23888  2877                  12.0
## 13    16         23002  3356                  14.6
## 14    17         24426  3589                  14.7
## 15    18         21783  3313                  15.2
## 16    19         21441  4000                  18.7
## 17    20         16739  3139                  18.8
## 18    21         10933  2196                  20.1
## 19    22          2639   416                  15.8
## 20    23          1061   107                  10.1
Hours 0:00 and 2:00 through 4:00 do not appear here because no flights were scheduled to depart during them. Some hours, like 6:00, 7:00, and 8:00, had a large number of flights but a much smaller share of delays or cancellations than hours like 13:00. Let’s look at the relationship between the hour of the day and delays/cancellations with a line graph.
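To see exactly which hours are absent from the table, the hours that do appear can be compared against a full 24-hour day. The sketch below types in the observed hours by hand from the printed output; with the real data one could pass `flights_time_of_day$hour` instead. Because the summary groups every flight by its scheduled hour, an hour is missing only when no flight was scheduled in it at all.

```r
# Hours present in the summary table above (copied from the printed output).
observed_hours <- c(1, 5:23)

# Hours of the day with no scheduled departures.
missing_hours <- setdiff(0:23, observed_hours)
print(missing_hours)
```

This returns 0, 2, 3, and 4, the overnight hours in which no flight in the data is scheduled to depart.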
ggplot(flights_time_of_day) +
  geom_line(mapping = aes(x = hour, y = percentage_delaycancel)) +
  labs(title = "Relationship between Time of Day and Delays/Cancellations",
       x = "Hour of the Day",
       y = "Percentage of Flights Delayed or Cancelled")
As the line graph shows, the percentage of flights delayed or cancelled rises as the day goes on. It peaks at 21:00, when about 20% of flights were delayed or cancelled, so we would likely want to avoid a 21:00 flight. The cause may be darkness as flights taper off for the night, or delays from earlier in the day carrying over into later departures; this data alone cannot tell us which. After 21:00, both the number of flights and the percentage delayed or cancelled drop sharply. The busiest hours are in the morning (6:00 to 8:00) and the late afternoon (15:00 to 17:00), so a single bad weather day during those hours could skew the results in the line graph. However, the overall shape seems to be what we’d expect. There is an outlier between 0:00 and 5:00: looking at our data set, only one flight was scheduled at 1:00 and it was cancelled, which produces the 100% value. It is unusual to have a flight scheduled at 1 am anyway.
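One way to keep a single flight from dominating the chart is to drop hours with very few scheduled flights before plotting. The sketch below uses a few rows copied from the table above. The cutoff of 100 flights (`min_flights`) is an arbitrary assumption chosen for illustration, not something from the original analysis.

```r
# A few rows copied from the hourly summary table above.
hourly <- data.frame(
  hour          = c(1, 5, 6, 21),
  total_flights = c(1, 1953, 25951, 10933),
  count         = c(1, 37, 999, 2196)
)
hourly$percentage_delaycancel <- hourly$count / hourly$total_flights * 100

# Hypothetical cutoff: ignore hours with fewer than 100 scheduled flights.
min_flights <- 100
reliable <- hourly[hourly$total_flights >= min_flights, ]
print(reliable)
```

With the full summary, the equivalent dplyr step would be `filter(flights_time_of_day, total_flights >= min_flights)` before the call to `ggplot()`. The 1:00 row disappears and the rest of the curve is unchanged.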
Enough of the negativity. Let’s look at what we hope for: flights leaving on time!
Earlier, we found that February 8, February 9, and March 8 had the highest percentage of flights cancelled or delayed. Let’s think positively and look at the flights that left on time on those dates and on seven other days that crossed the 35% threshold (the filter below leaves out June 24 and June 28).
flights_on_time <- flights %>%
  filter((month == 2 & day == 8) | (month == 2 & day == 9) |
           (month == 3 & day == 8) | (month == 5 & day == 23) |
           (month == 7 & day == 1) | (month == 7 & day == 10) |
           (month == 7 & day == 23) | (month == 9 & day == 2) |
           (month == 9 & day == 12) | (month == 12 & day == 5)) %>%
  # Keep flights that left on time or early; rows with an NA delay are dropped
  filter(dep_delay <= 0)
print(flights_on_time)
## # A tibble: 2,763 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     5      457            500        -3      637            651
##  2  2013    12     5      512            515        -3      753            814
##  3  2013    12     5      527            530        -3      657            706
##  4  2013    12     5      539            540        -1      832            850
##  5  2013    12     5      540            545        -5      822            832
##  6  2013    12     5      544            550        -6      959           1027
##  7  2013    12     5      548            600       -12      738            755
##  8  2013    12     5      551            600        -9      804            810
##  9  2013    12     5      553            600        -7      919            915
## 10  2013    12     5      553            600        -7      645            701
## # ℹ 2,753 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
With this dataset we see that despite the bad weather, 2,763 flights on these days still managed to leave on time or early.
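The long chain of `month == ... & day == ...` conditions works, but it is easy to mistype or to leave a date out. A possible alternative, sketched here on a made-up table, is to keep the dates of interest in their own small table and use dplyr's `semi_join()` to keep only matching rows. The names `toy_flights` and `bad_days` and all of the values in them are hypothetical stand-ins, not the real flights data.

```r
library(dplyr)

# Toy stand-in for the flights table (made-up rows for illustration only).
toy_flights <- tibble(
  month     = c(2, 2, 3, 6, 12),
  day       = c(8, 9, 8, 24, 5),
  dep_delay = c(-2, 75, NA, -1, 0)
)

# Dates of interest stored as data instead of a long boolean expression.
bad_days <- tibble(month = c(2, 2, 3, 12), day = c(8, 9, 8, 5))

on_time <- toy_flights %>%
  semi_join(bad_days, by = c("month", "day")) %>%  # keep only the listed dates
  filter(dep_delay <= 0)                           # on time or early; NA is dropped
print(on_time)
```

Only the February 8 and December 5 rows survive: the June 24 row is not in `bad_days`, the February 9 row was late, and the March 8 row has a missing delay. The same pattern would let the list of days be taken directly from the filtered 35% table rather than typed in by hand.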
Knowing there was bad weather on these days, we might predict that the flights that left on time were mostly in the morning, before the bad weather hit. Let’s check that prediction by looking at the average scheduled departure hour of the on-time flights on each day.
flights_on_time %>%
  group_by(month, day) %>%
  summarize(`Average Hr of Departure` = mean(hour))
## # A tibble: 10 × 3
## # Groups:   month [6]
##    month   day `Average Hr of Departure`
##    <int> <int>                     <dbl>
##  1     2     8                      8.86
##  2     2     9                     17.0
##  3     3     8                     10.2
##  4     5    23                      9.45
##  5     7     1                      9.43
##  6     7    10                      9.60
##  7     7    23                     10.0
##  8     9     2                      9.85
##  9     9    12                      9.03
## 10    12     5                     10.1
As the table shows, the average hour of departure falls in the morning for nine of the ten days. The only exception is February 9, where the average is about 17:00; these appear to be flights scheduled for that evening that could still depart on time once the February 8-9 blizzard had cleared New York.
Overall, we found that several of the days with the most delayed or cancelled flights coincided with winter storms or other severe weather in New York City that grounded flights for a period of time. We also found that the share of delayed or cancelled flights climbs through the day, peaking around 21:00, and that on the most affected days, many of the flights that left early or on time were scheduled in the morning, before the bad weather hit.