What are the largest causes of flight delays and cancellations?

This project uses the nycflights13 data package for all data. I also use the tidyverse, which includes dplyr for data manipulation and ggplot2 for visualizations.

suppressWarnings(library(tidyverse))
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
suppressWarnings(library(nycflights13))
suppressWarnings(library(dplyr))

Because the full flights dataset is large and slow to print in its entirety, I show only its first 10 rows here.

slice(flights, 1:10)

This dataset has 336,776 observations with 19 variables.

The table below starts exploring the data’s basic trends. It lists the days on which more than 35% of flights were either canceled or delayed by more than one hour.

cancel_delay_counts <- flights %>%
  mutate(is_cancelled_or_delayed = is.na(dep_time) | dep_delay > 60) %>%
  group_by(year, month, day) %>%
  summarise(
    total_flights = n(),
    canceled_or_delayed = sum(is_cancelled_or_delayed), 
    percent_cancelled_or_delayed = (canceled_or_delayed / total_flights) * 100,
    .groups = 'drop'
    )

high_cancel_delay_days <- cancel_delay_counts %>%
  filter(percent_cancelled_or_delayed > 35)

print(high_cancel_delay_days)
## # A tibble: 12 × 6
##     year month   day total_flights canceled_or_delayed percent_cancelled_or_de…¹
##    <int> <int> <int>         <int>               <int>                     <dbl>
##  1  2013     2     8           930                 506                      54.4
##  2  2013     2     9           684                 419                      61.3
##  3  2013     3     8           979                 574                      58.6
##  4  2013     5    23           988                 435                      44.0
##  5  2013     6    24           994                 348                      35.0
##  6  2013     6    28           994                 369                      37.1
##  7  2013     7     1           966                 407                      42.1
##  8  2013     7    10          1004                 364                      36.3
##  9  2013     7    23           997                 362                      36.3
## 10  2013     9     2           929                 326                      35.1
## 11  2013     9    12           992                 404                      40.7
## 12  2013    12     5           969                 385                      39.7
## # ℹ abbreviated name: ¹​percent_cancelled_or_delayed

From this we can see that the three worst days, February 8, February 9, and March 8, fall in the winter months, when blizzards and snowstorms are common. Each had more than half of its flights canceled or delayed by over an hour. The remaining days fall mostly between late May and September. Weather reports from 2013 show bad weather and large storms on several of these days.
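The flag used in the code above relies on how R treats missing values, so a small worked example may help. This is only a sketch on made-up values: the two vectors stand in for the `dep_time` and `dep_delay` columns, and it assumes a canceled flight has `NA` in both, which is how the analysis treats them.

```r
# Toy values standing in for flights$dep_time and flights$dep_delay (not real data).
dep_time  <- c(517, NA, 1830, 2105)
dep_delay <- c(2, NA, 75, 60)

# For the canceled flight, NA > 60 is NA, but TRUE | NA evaluates to TRUE,
# so the flag is never NA here and sum() needs no na.rm argument.
is_cancelled_or_delayed <- is.na(dep_time) | dep_delay > 60
print(is_cancelled_or_delayed)

# Share of flagged flights, in percent.
percent_flagged <- sum(is_cancelled_or_delayed) / length(is_cancelled_or_delayed) * 100
print(percent_flagged)
```

Note that a delay of exactly 60 minutes is not flagged, because the comparison is strictly greater than 60.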

The code below computes, for each scheduled departure hour, the percentage of flights that were canceled or delayed by more than one hour. It then rebuilds the table of high-disruption days, this time grouped by month and day only, and prints it.

cancel_delay_counts <- flights %>%
  mutate(is_cancelled_or_delayed = is.na(dep_time) | dep_delay > 60) %>%
  group_by(hour) %>%
  summarise(
    total_flights = n(),
    canceled_or_delayed = sum(is_cancelled_or_delayed), 
    percent_cancelled_or_delayed = (canceled_or_delayed / total_flights) * 100) %>%
  filter(hour != 1)


high_cancel_delay_days <- flights %>%
  mutate(is_cancelled_or_delayed = is.na(dep_time) | dep_delay > 60) %>%
  group_by(month, day) %>%
  summarise(
    total_flights = n(),
    canceled_or_delayed = sum(is_cancelled_or_delayed),
    percent_cancelled_or_delayed = (canceled_or_delayed / total_flights) * 100,
    .groups = 'drop') %>%
  filter(percent_cancelled_or_delayed >= 35)

print(high_cancel_delay_days)
## # A tibble: 12 × 5
##    month   day total_flights canceled_or_delayed percent_cancelled_or_delayed
##    <int> <int>         <int>               <int>                        <dbl>
##  1     2     8           930                 506                         54.4
##  2     2     9           684                 419                         61.3
##  3     3     8           979                 574                         58.6
##  4     5    23           988                 435                         44.0
##  5     6    24           994                 348                         35.0
##  6     6    28           994                 369                         37.1
##  7     7     1           966                 407                         42.1
##  8     7    10          1004                 364                         36.3
##  9     7    23           997                 362                         36.3
## 10     9     2           929                 326                         35.1
## 11     9    12           992                 404                         40.7
## 12    12     5           969                 385                         39.7

This table continues to show how much severe weather affected flights on these days. Take 2/9, for example: 61% of that day’s flights were canceled or delayed by more than an hour, which is a huge share given that the three New York airports in this dataset together schedule roughly 900 to 1,000 flights on a typical day.

The worst times to fly and why.

Below is a graph showing the relationship between the scheduled departure hour and the percentage of flights canceled or delayed by more than an hour. The key thing to pay attention to is the clear upward trend: more delays occur as the day goes on.

ggplot(cancel_delay_counts, aes(x = hour, y = percent_cancelled_or_delayed)) +
  geom_line() + 
  labs(
    title = "Percentage of flights cancelled or delayed",
    x = "Hour",
    y = "Percent Cancelled or Delayed",
    caption = "Data from: nycflights13"
  )

To address the outlier right off the bat: I filtered out hour 1 (1am), which contained a single flight that was canceled. That one flight gave 1am a 100% cancellation rate and would have skewed the graph.
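An alternative to hard-coding `hour != 1` is to drop any hour with too few flights to give a meaningful percentage. The sketch below uses a hypothetical hourly summary with made-up numbers, and `min_flights` is an assumed cutoff, not a value from the analysis above.

```r
library(dplyr)

# Hypothetical hourly summary with the same columns as cancel_delay_counts
# (all numbers are made up for illustration).
hourly <- tibble(
  hour = c(1, 5, 12, 20),
  total_flights = c(1, 1953, 18181, 16739),
  canceled_or_delayed = c(1, 60, 1300, 3200)
) %>%
  mutate(percent_cancelled_or_delayed = canceled_or_delayed / total_flights * 100)

# Assumed cutoff: hours with fewer flights than this are treated as unreliable.
min_flights <- 100

reliable_hours <- hourly %>%
  filter(total_flights >= min_flights)

print(reliable_hours)
```

Filtering on the count rather than the hour makes the reason for the exclusion explicit and would also catch any other sparsely populated hour.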

In general, delays and cancellations increase as the day goes on. Several effects likely contribute, such as airports getting busier and more congested. If one plane takes too long, it can create a backup, and each later delay compounds the issue.

Overall, the riskiest time to fly is between 8pm and 10pm (hours 20 to 22 on the 24-hour clock used in the data), so avoid those hours if you want to avoid cancellations and long delays. Delays at this time could also coincide with pilots or flight staff changing shifts, although the data here cannot confirm that.

The worst days for cancellations and delays

Above you saw the days on which a significant share of flights were canceled or delayed. However, some planes did manage to leave on time on these days, and below you’ll see the planes that made it out.

filtered_flights <- flights %>%
  filter(
    (month == 2 & day == 8) |
    (month == 2 & day == 9) |
    (month == 3 & day == 8) |
    (month == 5 & day == 23) |
    (month == 6 & day == 24) |
    (month == 6 & day == 28) |
    (month == 7 & day == 1) |
    (month == 7 & day == 10) |
    (month == 7 & day == 23) |
    (month == 9 & day == 12) |
    (month == 12 & day == 5) |
    (month == 9 & day == 2)
  ) %>%
  filter(dep_delay <= 0)

print(filtered_flights)
## # A tibble: 3,441 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     5      457            500        -3      637            651
##  2  2013    12     5      512            515        -3      753            814
##  3  2013    12     5      527            530        -3      657            706
##  4  2013    12     5      539            540        -1      832            850
##  5  2013    12     5      540            545        -5      822            832
##  6  2013    12     5      544            550        -6      959           1027
##  7  2013    12     5      548            600       -12      738            755
##  8  2013    12     5      551            600        -9      804            810
##  9  2013    12     5      553            600        -7      919            915
## 10  2013    12     5      553            600        -7      645            701
## # ℹ 3,431 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

The rows shown here suggest that many of the planes that left on time or early departed early in the morning, before the worst of the weather or storms could hit.
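The long list of `month == ... & day == ...` conditions could also be expressed as a join against the table of bad days. The following is a minimal sketch on toy data: `toy_flights` and `toy_bad_days` are hypothetical stand-ins for `flights` and `high_cancel_delay_days`, with made-up rows.

```r
library(dplyr)

# Toy stand-ins for flights and high_cancel_delay_days (made-up rows).
toy_flights <- tibble(
  month = c(2, 2, 3, 4),
  day = c(8, 9, 8, 1),
  dep_delay = c(-3, 45, 0, -5)
)
toy_bad_days <- tibble(month = c(2, 2, 3), day = c(8, 9, 8))

# semi_join() keeps the rows of toy_flights whose (month, day) pair
# appears in toy_bad_days, without adding any columns.
on_time_on_bad_days <- toy_flights %>%
  semi_join(toy_bad_days, by = c("month", "day")) %>%
  filter(dep_delay <= 0)

print(on_time_on_bad_days)
```

With this approach the list of days never has to be typed by hand, so it stays in sync if the 35% threshold changes.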

The planes that got out on time seem to be the ones that left early. To check this, the table below gives the average actual departure time, in decimal hours, of all flights that departed on each of these days.

average_actual_hour <- flights %>%
  filter(
    (month == 2 & day == 8) |
    (month == 2 & day == 9) |
    (month == 3 & day == 8) |
    (month == 5 & day == 23) |
    (month == 6 & day == 24) |
    (month == 6 & day == 28) |
    (month == 7 & day == 1) |
    (month == 7 & day == 10) |
    (month == 7 & day == 23) |
    (month == 9 & day == 12) |
    (month == 9 & day == 2) |
    (month == 12 & day == 5)
  ) %>%
  filter(!is.na(dep_time)) %>% 
  mutate(
    dep_hour = dep_time %/% 100, 
    dep_min = dep_time %% 100, 
    dep_time_decimal = dep_hour + (dep_min / 60) 
  ) %>%
  group_by(month, day) %>%
  summarise(average_actual_hour = mean(dep_time_decimal, na.rm = TRUE),
    .groups = 'drop')

print(average_actual_hour)
## # A tibble: 12 × 3
##    month   day average_actual_hour
##    <int> <int>               <dbl>
##  1     2     8                10.1
##  2     2     9                17.1
##  3     3     8                14.4
##  4     5    23                13.1
##  5     6    24                13.3
##  6     6    28                13.2
##  7     7     1                13.8
##  8     7    10                13.2
##  9     7    23                14.1
## 10     9     2                14.1
## 11     9    12                12.7
## 12    12     5                14.0

This table shows the average departure hour of the flights that did leave on the high-delay days. On most days the average falls in the early afternoon (around 13 to 14), well before dark and before the storms, which seems the most likely reason most planes were able to leave. But what about 2/9? Why is its average so much higher than the others? It seems that around that time of day there was a break in the snowy weather, with wind speeds dropping significantly, which would explain why planes were able to leave then. On the other days, the afternoon averages seem to correlate with the temperature rising around 12-2pm.
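The conversion to decimal hours matters because `dep_time` is stored as an HHMM number, so averaging it directly gives misleading results. Here is a small illustration of the same arithmetic on made-up departure times:

```r
# dep_time is stored as HHMM, e.g. 517 means 5:17am. Values are illustrative.
dep_time <- c(517, 1230, 2145)

dep_hour <- dep_time %/% 100   # integer division keeps the hour part
dep_min <- dep_time %% 100     # remainder keeps the minute part
dep_time_decimal <- dep_hour + dep_min / 60
print(dep_time_decimal)

# Why convert first: two flights two minutes apart, at 5:59 and 6:01.
close_times <- c(559, 601)
raw_mean <- mean(close_times)  # 580, which is not a valid clock time
decimal_mean <- mean(close_times %/% 100 + (close_times %% 100) / 60)
print(c(raw_mean, decimal_mean))
```

The decimal mean comes out at 6, the true midpoint of the two departures, while the raw mean of 580 does not correspond to any time of day.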

In summary

Overall, weather, especially winter weather, has an extreme effect on whether flights are delayed, and the problem only grows when we add other effects such as flights getting backed up as the day goes on. This data suggests that some of the worst times to fly are late in the day, after dark, and during inclement weather such as big thunderstorms or blizzards. Flights between 8pm and 10pm are some of the worst to book if your plan is to leave on time: backups pile up through the day and peak around then, after which the airlines catch back up. So leaving on a day with good weather and clear skies between 5am and 10am will give you the best chance of leaving on time.