This report explores flight delays and cancellations during 2013, identifying patterns and drawing conclusions from the ‘nycflights13’ dataset. ‘nycflights13’ is an R package rather than part of base R, so it must be installed and loaded before its ‘flights’ table can be used. That table covers flights departing from the New York City airports and contains information on departures, cancellations, and delays, along with their corresponding dates and times. Specifically, we examine days on which more than 35% of flights were canceled or delayed by more than an hour, and investigate whether weather conditions may explain the high number of delays and cancellations. We also look at the relationship between scheduled departure times and the percentage of delayed or canceled flights.
The dataset includes variables that use ‘arr’ and ‘dep’, which stand for ‘arrival’ and ‘departure’. The ‘tailnum’ variable is the tail number that identifies the plane. The ‘time_hour’ variable is the scheduled date and hour of the flight as a POSIXct date-time. The other variables are self-explanatory. In total, the dataset has 336,776 observations and 19 variables.
Below is the code that displays the dataset. Since the dataset is so large, the code selects only the first 100 rows, of which the first 10 are printed:
library(nycflights13)  # provides the flights table
dplyr::slice(flights, 1:100)
## # A tibble: 100 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 90 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
The tidyverse collection of packages is used for data manipulation and visualization throughout the report.
library(tidyverse)
We first identify the days in 2013 on which more than 35% of flights were either canceled or delayed by more than one hour. This isolates the specific days with significant flight disruptions, which were likely caused by weather.
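The flag built in the next chunk depends on how R combines missing values with ‘|’. Assuming, as appears to be the case in this dataset, that a canceled flight has NA for both ‘dep_time’ and ‘dep_delay’, the comparison ‘dep_delay > 60’ is NA for that flight, but ‘NA | TRUE’ evaluates to TRUE, so the cancellation is still counted. A minimal sketch on four made-up flights (hypothetical values, not rows from the dataset):

```r
# Hypothetical flights: on time, canceled, delayed 75 min, delayed exactly 60 min
dep_time  <- c(517L, NA, 1830L, 900L)
dep_delay <- c(2, NA, 75, 60)

# NA > 60 is NA, but NA | TRUE is TRUE, so the canceled flight is flagged.
# The comparison is strict, so a delay of exactly 60 minutes is not flagged.
dc <- dep_delay > 60 | is.na(dep_time)
dc

# Share of flagged flights, in percent
dc_percentage <- mean(dc) * 100
dc_percentage
```

Here two of the four flights are flagged, which gives 50%.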
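flights_dc <- flights %>%
  # flag flights delayed by more than 60 minutes or canceled (no departure time)
  mutate(dc = (dep_delay > 60 | is.na(dep_time))) %>%
  group_by(year, month, day) %>%
  summarize(
    total = n(),
    dc_count = sum(dc, na.rm = TRUE),
    dc_percentage = (dc_count / total) * 100) %>%
  # keep only days on which more than 35% of flights were flagged
  filter(dc_percentage > 35)
flights_dc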
## # A tibble: 12 × 6
## # Groups: year, month [7]
## year month day total dc_count dc_percentage
## <int> <int> <int> <int> <int> <dbl>
## 1 2013 2 8 930 506 54.4
## 2 2013 2 9 684 419 61.3
## 3 2013 3 8 979 574 58.6
## 4 2013 5 23 988 435 44.0
## 5 2013 6 24 994 348 35.0
## 6 2013 6 28 994 369 37.1
## 7 2013 7 1 966 407 42.1
## 8 2013 7 10 1004 364 36.3
## 9 2013 7 23 997 362 36.3
## 10 2013 9 2 929 326 35.1
## 11 2013 9 12 992 404 40.7
## 12 2013 12 5 969 385 39.7
Each of these dates saw severe weather that affected air travel. The most noteworthy is February 8-9, when a major blizzard struck the northeast and canceled over 2,700 flights. The cancellations on March 8th were also due to a winter storm in the northeast. The storms on the later dates were primarily thunderstorms, and the December 5th cancellations were due to another winter storm. The large number of delays and cancellations can be linked to the weather on each of these dates.
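One way to check this link against data is to use the ‘weather’ table that ships with the same package, which holds hourly observations for the three NYC airports. The sketch below is only an illustration: the choice of summary statistics and the names ‘disrupted_days’, ‘daily_weather’, and ‘disrupted_weather’ are assumptions introduced here, not part of the original analysis.

```r
library(nycflights13)
library(dplyr)

# Days on which more than 35% of flights were canceled or delayed over an hour
disrupted_days <- flights %>%
  group_by(year, month, day) %>%
  summarize(
    dc_percentage = sum(dep_delay > 60 | is.na(dep_time), na.rm = TRUE) / n() * 100,
    .groups = "drop") %>%
  filter(dc_percentage > 35)

# Daily weather summary over all three airports (assumed choice of statistics)
daily_weather <- weather %>%
  group_by(year, month, day) %>%
  summarize(
    total_precip = sum(precip, na.rm = TRUE),      # inches, summed over airports and hours
    max_wind     = max(wind_speed, na.rm = TRUE),  # mph
    min_visib    = min(visib, na.rm = TRUE),       # miles
    .groups = "drop")

# Attach the weather summary to each disrupted day
disrupted_weather <- left_join(disrupted_days, daily_weather,
                               by = c("year", "month", "day"))
disrupted_weather
```

Comparing these values with the same summaries for ordinary days would show whether precipitation, wind, or visibility stand out on the disrupted dates.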
Next we explore the relationship between the hour of the day a flight is scheduled to depart and the percentage of flights canceled or delayed by more than an hour. The hypothesis is that evening flights are the most likely to be delayed or canceled, as they are more likely to be subject to extreme weather.
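The chunk that follows derives the scheduled hour from ‘sched_dep_time’, which is stored as a number in HHMM form. A minimal sketch with three made-up times shows that ‘%/% 100’ keeps the hour and ‘%% 100’ keeps the minutes:

```r
# Hypothetical scheduled departure times in HHMM form
sched_dep_time <- c(515L, 1830L, 2359L)

hour   <- sched_dep_time %/% 100  # integer division drops the minutes
minute <- sched_dep_time %% 100   # remainder keeps the minutes

hour
minute
```

Because 100 is a double, the result is a double even though the input is an integer vector, which is why the hour column in the summary is reported as a double.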
flights_hourly_dc <- flights %>%
  # flag canceled or heavily delayed flights and extract the scheduled hour
  mutate(is_dc = is.na(dep_time) | dep_delay > 60,
         hour = sched_dep_time %/% 100) %>%
  group_by(hour) %>%
  summarize(
    total_flights = n(),
    dc_count = sum(is_dc),
    dc_percentage = (dc_count / total_flights) * 100)
flights_hourly_dc
## # A tibble: 20 × 4
## hour total_flights dc_count dc_percentage
## <dbl> <int> <int> <dbl>
## 1 1 1 1 100
## 2 5 1953 37 1.89
## 3 6 25951 999 3.85
## 4 7 22821 792 3.47
## 5 8 27242 1351 4.96
## 6 9 20312 1049 5.16
## 7 10 16708 1093 6.54
## 8 11 16033 1052 6.56
## 9 12 18181 1390 7.65
## 10 13 19956 1780 8.92
## 11 14 21706 2299 10.6
## 12 15 23888 2877 12.0
## 13 16 23002 3356 14.6
## 14 17 24426 3589 14.7
## 15 18 21783 3313 15.2
## 16 19 21441 4000 18.7
## 17 20 16739 3139 18.8
## 18 21 10933 2196 20.1
## 19 22 2639 416 15.8
## 20 23 1061 107 10.1
Let’s put this data into a graph so we can easily visualize the relationship between dc_percentage and hour.
ggplot(data = flights_hourly_dc, aes(x = hour, y = dc_percentage)) +
  geom_line() +
  labs(title = "Percentage of Flights Canceled or Delayed by More Than an Hour",
       x = "Scheduled Departure Hour",
       y = "Percentage of Canceled or Delayed Flights")
The hypothesis is mostly supported: the percentage rises through the day and peaks in the evening at 9 PM (hour 21). The major outlier is at 1 AM, but it rests on a single flight, which may have been scheduled past midnight by mistake and was canceled. No other flights are scheduled between midnight and 5 AM, so the graph has no data for those hours. The drop after 9 PM may be because the total number of flights decreases sharply past that hour, leaving airports and carriers with more resources for the remaining flights. The steady climb up to 9 PM is likely due to the weather worsening as the day goes on. On this evidence, the riskiest time to fly is 9 PM, which combines a moderate number of flights with worsening weather.
Now we look at the flights that departed on time on the previously identified days with a high percentage of canceled or delayed flights. The hypothesis is that the earlier in the day a flight was scheduled, the more likely it was to depart on time, because it avoided the bad weather that came later.
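Since the 1 AM value rests on one flight, it can help to drop hours with very few flights before plotting. The sketch below copies a handful of rows from the hourly summary printed earlier; the cutoff ‘min_flights’ and the names ‘hourly’ and ‘hourly_filtered’ are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

# A few rows copied from the hourly summary printed earlier in the report
hourly <- tibble(
  hour          = c(1, 5, 12, 21, 23),
  total_flights = c(1, 1953, 18181, 10933, 1061),
  dc_percentage = c(100, 1.89, 7.65, 20.1, 10.1))

# Assumed cutoff: ignore hours with fewer than 100 scheduled flights
min_flights <- 100

hourly_filtered <- hourly %>%
  filter(total_flights >= min_flights)

# Same plot as before, without the single-flight outlier
p <- ggplot(hourly_filtered, aes(x = hour, y = dc_percentage)) +
  geom_line() +
  geom_point()
p
```

Applying the same filter to the full ‘flights_hourly_dc’ table would remove only the 1 AM row, since every other hour has more than a thousand flights.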
on_time_flights <- flights %>%
  filter(dep_delay <= 0,
         (year == 2013 & month == 2 & day %in% c(8, 9)) |
         (year == 2013 & month == 3 & day == 8) |
         (year == 2013 & month == 5 & day == 23) |
         (year == 2013 & month == 6 & day %in% c(24, 28)) |
         (year == 2013 & month == 7 & day %in% c(1, 10, 23)) |
         (year == 2013 & month == 9 & day %in% c(2, 12)) |
         (year == 2013 & month == 12 & day == 5))
on_time_flights
## # A tibble: 3,441 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 5 457 500 -3 637 651
## 2 2013 12 5 512 515 -3 753 814
## 3 2013 12 5 527 530 -3 657 706
## 4 2013 12 5 539 540 -1 832 850
## 5 2013 12 5 540 545 -5 822 832
## 6 2013 12 5 544 550 -6 959 1027
## 7 2013 12 5 548 600 -12 738 755
## 8 2013 12 5 551 600 -9 804 810
## 9 2013 12 5 553 600 -7 919 915
## 10 2013 12 5 553 600 -7 645 701
## # ℹ 3,431 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Scrolling through the dataset, we can see that many of the departure times are early in the day, but a summary of the data makes this easier to see.
# on_time_flights already holds the on-time departures on the disrupted days
average_departure_hour <- on_time_flights %>%
  mutate(hour = sched_dep_time %/% 100) %>%
  group_by(year, month, day) %>%
  summarise(avg_hour = mean(hour))
average_departure_hour
## # A tibble: 12 × 4
## # Groups: year, month [7]
## year month day avg_hour
## <int> <int> <int> <dbl>
## 1 2013 2 8 8.86
## 2 2013 2 9 17.0
## 3 2013 3 8 10.2
## 4 2013 5 23 9.45
## 5 2013 6 24 9.82
## 6 2013 6 28 10.1
## 7 2013 7 1 9.43
## 8 2013 7 10 9.60
## 9 2013 7 23 10.0
## 10 2013 9 2 9.85
## 11 2013 9 12 9.03
## 12 2013 12 5 10.1
This table shows the average scheduled departure hour of the on-time flights on each of these days. It is consistent with our hypothesis: on most days the average falls in the morning, between roughly 9 and 10 AM. However, February 9th has a much later average of around 5 PM. This is likely because the storm from the 8th continued into the morning of the 9th and only cleared up that evening.
In summary, the days in 2013 on which more than 35% of flights were canceled or delayed by more than an hour can be linked to poor weather conditions. The worst scheduled departure hour was 9 PM, which had the highest percentage of delays or cancellations. Lastly, many of the days with poor weather still had a number of flights that departed on time, and on all but one of those days (February 9th) these flights were, on average, scheduled in the morning.
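An average hour does not say how many on-time flights were actually scheduled in the morning, so a direct share can complement it. The sketch below is one possible check, not part of the original analysis: it lists the disrupted days in a small lookup table, selects matching flights with ‘semi_join’ instead of hand-written date conditions, and computes the percentage of on-time flights scheduled before noon. The names ‘disrupted_days’ and ‘morning_share’ are introduced here for illustration.

```r
library(nycflights13)
library(dplyr)

# The disrupted days listed in the report, as a small lookup table
disrupted_days <- tibble(
  year  = 2013L,
  month = c(2L, 2L, 3L, 5L, 6L, 6L, 7L, 7L, 7L, 9L, 9L, 12L),
  day   = c(8L, 9L, 8L, 23L, 24L, 28L, 1L, 10L, 23L, 2L, 12L, 5L))

morning_share <- flights %>%
  # keep only flights on the disrupted days
  semi_join(disrupted_days, by = c("year", "month", "day")) %>%
  # keep flights that left on time or early
  filter(dep_delay <= 0) %>%
  group_by(year, month, day) %>%
  summarize(
    on_time     = n(),
    before_noon = mean(sched_dep_time < 1200) * 100,  # percent scheduled before 12:00
    .groups = "drop")
morning_share
```

A high ‘before_noon’ value on most days would support the hypothesis more directly than the mean does, and February 9th would be expected to stand out with a low value.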