In-Depth Analysis of Flights Database

Introduction

This report explores flight delays and cancellations during 2013, identifying patterns and drawing conclusions from the ‘nycflights13’ dataset. ‘nycflights13’ is an R package rather than part of base R; its ‘flights’ table, which covers flights departing from New York City airports in 2013, becomes available once the package is installed and loaded. The table contains information on flight departures, cancellations, and delays, along with their corresponding dates and times. Specifically, we will examine days on which more than 35% of flights were canceled or delayed by more than an hour, and investigate whether weather conditions may explain the high number of delays and cancellations. We will also look at the relationship between scheduled departure time and the percentage of delayed or canceled flights.

The dataset includes variables whose names use ‘arr’ and ‘dep’, which stand for ‘arrival’ and ‘departure’. The ‘tailnum’ variable is the tail number that identifies the plane. The ‘time_hour’ variable is the scheduled date and hour of the flight as a POSIXct date. The other variables in the dataset are self-explanatory. The dataset has no explicit cancellation variable, so a flight with a missing ‘dep_time’ is treated as canceled throughout this report. In total there are 336,776 observations and 19 variables in the dataset.

The code below loads the package and previews the dataset. Because the dataset is so large, only the first 100 rows are selected, and the console prints the first 10 of them:

library(nycflights13)
dplyr::slice(flights, 1:100)

## # A tibble: 100 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 90 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
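The dimensions and column types quoted above can be checked directly. The following is a minimal sketch, assuming the ‘nycflights13’ and ‘dplyr’ packages are installed:

```r
library(nycflights13)  # provides the flights table
library(dplyr)         # provides glimpse()

# Number of rows (flights) and columns (variables)
dim(flights)

# One line per column: name, type, and the first few values
glimpse(flights)
```

Unlike the row preview, glimpse() lists all 19 columns, so none of them are hidden behind the "more variables" note.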

The tidyverse library, which includes dplyr and ggplot2, will be used for data manipulation and visualization throughout the report.

library(tidyverse)

Relationship between Flights Canceled or Delayed and Day

We first identify the days in 2013 on which more than 35% of flights were either canceled or delayed by more than one hour. This highlights specific days with significant flight disruptions, which were likely caused by weather.

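flights_dc <- flights %>%
  # A flight counts as disrupted if it was delayed over 60 minutes or never departed
  mutate(dc = (dep_delay > 60 | is.na(dep_time))) %>%
  group_by(year, month, day) %>%
  summarize(
    total = n(),                        # flights scheduled that day
    dc_count = sum(dc, na.rm = TRUE),   # disrupted flights that day
    dc_percentage = (dc_count / total) * 100) %>%
  filter(dc_percentage > 35)            # keep only the heavily disrupted days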

flights_dc

## # A tibble: 12 × 6
## # Groups:   year, month [7]
##     year month   day total dc_count dc_percentage
##    <int> <int> <int> <int>    <int>         <dbl>
##  1  2013     2     8   930      506          54.4
##  2  2013     2     9   684      419          61.3
##  3  2013     3     8   979      574          58.6
##  4  2013     5    23   988      435          44.0
##  5  2013     6    24   994      348          35.0
##  6  2013     6    28   994      369          37.1
##  7  2013     7     1   966      407          42.1
##  8  2013     7    10  1004      364          36.3
##  9  2013     7    23   997      362          36.3
## 10  2013     9     2   929      326          35.1
## 11  2013     9    12   992      404          40.7
## 12  2013    12     5   969      385          39.7

Each of these dates saw severe weather that affected air travel. The most noteworthy were February 8-9, when a major blizzard struck the Northeast and canceled over 2,700 flights. The cancellations on March 8th were also due to a winter storm in the Northeast. The disruptions from May through September were primarily due to thunderstorms, and the December 5th cancellations were due to another winter storm. The large number of delays and cancellations can therefore be linked to the weather on each of these dates.
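The weather explanation can also be checked against the ‘weather’ table that ships with the same package. The following is only a sketch, not part of the original analysis: the names bad_days, daily_weather, and bad_day_weather are introduced here for illustration, and total precipitation and minimum visibility are assumed to be reasonable stand-ins for "severe weather".

```r
library(nycflights13)  # provides the flights and weather tables
library(dplyr)

# Days on which more than 35% of flights were canceled or delayed over an hour
bad_days <- flights %>%
  mutate(dc = dep_delay > 60 | is.na(dep_time)) %>%
  group_by(year, month, day) %>%
  summarize(dc_percentage = sum(dc, na.rm = TRUE) / n() * 100,
            .groups = "drop") %>%
  filter(dc_percentage > 35)

# Daily weather summary over all hourly records from all airports in the table
daily_weather <- weather %>%
  group_by(year, month, day) %>%
  summarize(total_precip = sum(precip, na.rm = TRUE),  # summed hourly precipitation
            min_visib = min(visib, na.rm = TRUE),      # lowest recorded visibility
            .groups = "drop")

# Attach the weather summary to each heavily disrupted day
bad_day_weather <- left_join(bad_days, daily_weather,
                             by = c("year", "month", "day"))
bad_day_weather
```

Comparing these values with the same summary for ordinary days would show whether the disrupted days really stand out, which the table alone cannot establish.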

Relationship between Departure Hour and Flights Canceled or Delayed

Next we explore the relationship between the hour of the day a flight is scheduled to depart and the percentage of flights canceled or delayed by more than an hour. The hypothesis is that evening flights are the most likely to be delayed or canceled, as they are more likely to be subject to extreme weather.

flights_hourly_dc <- flights %>%
  mutate(is_dc = is.na(dep_time) | dep_delay > 60,  # canceled or delayed over an hour
         hour = sched_dep_time %/% 100) %>%         # hour part of the HHMM scheduled time
  group_by(hour) %>%
  summarize(
    total_flights = n(),
    dc_count = sum(is_dc),
    dc_percentage = (dc_count / total_flights) * 100)

flights_hourly_dc

## # A tibble: 20 × 4
##     hour total_flights dc_count dc_percentage
##    <dbl>         <int>    <int>         <dbl>
##  1     1             1        1        100   
##  2     5          1953       37          1.89
##  3     6         25951      999          3.85
##  4     7         22821      792          3.47
##  5     8         27242     1351          4.96
##  6     9         20312     1049          5.16
##  7    10         16708     1093          6.54
##  8    11         16033     1052          6.56
##  9    12         18181     1390          7.65
## 10    13         19956     1780          8.92
## 11    14         21706     2299         10.6 
## 12    15         23888     2877         12.0 
## 13    16         23002     3356         14.6 
## 14    17         24426     3589         14.7 
## 15    18         21783     3313         15.2 
## 16    19         21441     4000         18.7 
## 17    20         16739     3139         18.8 
## 18    21         10933     2196         20.1 
## 19    22          2639      416         15.8 
## 20    23          1061      107         10.1

Let’s put this data into a graph so we can easily visualize the relationship between dc_percentage and hour.

ggplot(data = flights_hourly_dc, aes(x = hour, y = dc_percentage)) +
  geom_line() +
  labs(title = "Percentage of Flights Canceled or Delayed by More Than an Hour",
       x = "Scheduled Departure Hour",
       y = "Percentage of Canceled or Delayed Flights")

We can see that our hypothesis is mostly correct: the percentage rises steadily from 1.89% at 5 AM to a peak of 20.1% at 9 PM. The major outlier is 1 AM, but its sample is a single flight, which might have been accidentally scheduled past midnight and canceled for that reason. No other flights are scheduled between midnight and 5 AM, which is why the graph has no data for those hours. The percentage likely falls after 9 PM because the total number of flights decreases drastically (from 10,933 at 9 PM to 2,639 at 10 PM), so the airports and carriers have more resources available for the remaining flights. The gradual climb up to 9 PM is likely due to the weather worsening as the day goes on. By this measure, the riskiest time to fly is 9 PM, when a moderate number of flights is still scheduled and the weather has likely worsened.
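Because the single flight at 1 AM stretches the y-axis to 100%, one option is to drop hours with very few flights before plotting. The sketch below assumes a cutoff of 100 flights; min_flights and hourly_dc_filtered are hypothetical names chosen for illustration, and the cutoff itself is arbitrary.

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

min_flights <- 100  # hypothetical cutoff, not taken from the original analysis

hourly_dc_filtered <- flights %>%
  mutate(is_dc = is.na(dep_time) | dep_delay > 60,
         hour = sched_dep_time %/% 100) %>%
  group_by(hour) %>%
  summarize(total_flights = n(),
            dc_percentage = mean(is_dc) * 100) %>%
  filter(total_flights >= min_flights)  # removes the 1 AM hour (one flight)

ggplot(hourly_dc_filtered, aes(x = hour, y = dc_percentage)) +
  geom_line() +
  geom_point() +
  labs(x = "Scheduled Departure Hour",
       y = "Percentage of Canceled or Delayed Flights")
```

With the outlier removed, the y-axis covers only the range of the remaining hours, so the rise toward 9 PM and the drop afterwards are easier to read.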

Relationship between Departure Hour on Bad Weather Days and On Time Departure Flights

Now we look at the flights that departed on time on the previously identified days with a high percentage of canceled or delayed flights. Here a flight counts as on time if its departure delay is zero or negative. The hypothesis is that the earlier in the day a flight was scheduled, the more likely it was to depart on time, because it avoided the bad weather that came later.

on_time_flights <- flights %>%
  filter(dep_delay <= 0,
         (year == 2013 & month == 2 & day %in% c(8, 9)) |
         (year == 2013 & month == 3 & day == 8) |
         (year == 2013 & month == 5 & day == 23) |
         (year == 2013 & month == 6 & day %in% c(24, 28)) |
         (year == 2013 & month == 7 & day %in% c(1, 10, 23)) |
         (year == 2013 & month == 9 & day %in% c(2, 12)) |
         (year == 2013 & month == 12 & day == 5))
on_time_flights

## # A tibble: 3,441 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     5      457            500        -3      637            651
##  2  2013    12     5      512            515        -3      753            814
##  3  2013    12     5      527            530        -3      657            706
##  4  2013    12     5      539            540        -1      832            850
##  5  2013    12     5      540            545        -5      822            832
##  6  2013    12     5      544            550        -6      959           1027
##  7  2013    12     5      548            600       -12      738            755
##  8  2013    12     5      551            600        -9      804            810
##  9  2013    12     5      553            600        -7      919            915
## 10  2013    12     5      553            600        -7      645            701
## # ℹ 3,431 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
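The twelve dates in the filter above are typed in by hand. An alternative, sketched here under the assumption that the same 35% threshold is wanted, is to recompute the disrupted days and keep matching flights with semi_join(). The names bad_days and on_time_bad_days are illustrative only.

```r
library(nycflights13)
library(dplyr)

# Recompute the heavily disrupted days instead of hard-coding them
bad_days <- flights %>%
  mutate(dc = dep_delay > 60 | is.na(dep_time)) %>%
  group_by(year, month, day) %>%
  summarize(dc_percentage = sum(dc, na.rm = TRUE) / n() * 100,
            .groups = "drop") %>%
  filter(dc_percentage > 35)

# Keep on-time flights whose date appears in bad_days;
# semi_join() filters rows without adding any columns
on_time_bad_days <- flights %>%
  filter(dep_delay <= 0) %>%
  semi_join(bad_days, by = c("year", "month", "day"))

nrow(on_time_bad_days)
```

This keeps the date list in sync with the threshold: if the 35% cutoff changes, the set of days updates without editing the filter.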

Looking through the output, we can see that many of the departure times are early in the day, but summarizing the data makes this easier to see.

average_departure_hour <- flights %>%
  filter(dep_delay <= 0,
         (year == 2013 & month == 2 & day %in% c(8, 9)) |
         (year == 2013 & month == 3 & day == 8) |
         (year == 2013 & month == 5 & day == 23) |
         (year == 2013 & month == 6 & day %in% c(24, 28)) |
         (year == 2013 & month == 7 & day %in% c(1, 10, 23)) |
         (year == 2013 & month == 9 & day %in% c(2, 12)) |
         (year == 2013 & month == 12 & day == 5)) %>%
  mutate(hour = sched_dep_time %/% 100) %>%
  group_by(year, month, day) %>%
  summarize(avg_hour = mean(hour))  # mean scheduled departure hour per day

average_departure_hour

## # A tibble: 12 × 4
## # Groups:   year, month [7]
##     year month   day avg_hour
##    <int> <int> <int>    <dbl>
##  1  2013     2     8     8.86
##  2  2013     2     9    17.0 
##  3  2013     3     8    10.2 
##  4  2013     5    23     9.45
##  5  2013     6    24     9.82
##  6  2013     6    28    10.1 
##  7  2013     7     1     9.43
##  8  2013     7    10     9.60
##  9  2013     7    23    10.0 
## 10  2013     9     2     9.85
## 11  2013     9    12     9.03
## 12  2013    12     5    10.1

This table shows the average scheduled departure hour of the on-time flights on each day. It supports our hypothesis: on most of these days the average falls between roughly 9 and 10 AM, so the flights that left on time were mostly scheduled in the morning or early afternoon. However, February 9th has a much later average departure hour, around 5 PM. This is likely because the storm from the 8th continued into the morning of the 9th and only cleared up that evening.
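An average hour does not say how likely a flight at a given hour was to leave on time, since mornings also have more scheduled flights. A more direct check is the on-time share per scheduled hour on the same twelve days. This is a sketch only: bad_days is rebuilt here from the dates in the table above, and on_time_by_hour is a name introduced for illustration.

```r
library(nycflights13)
library(dplyr)

# The twelve heavily disrupted days listed in the table above
bad_days <- data.frame(
  month = c(2L, 2L, 3L, 5L, 6L, 6L, 7L, 7L, 7L, 9L, 9L, 12L),
  day   = c(8L, 9L, 8L, 23L, 24L, 28L, 1L, 10L, 23L, 2L, 12L, 5L)
)

on_time_by_hour <- flights %>%
  semi_join(bad_days, by = c("month", "day")) %>%  # flights on those days only
  mutate(hour = sched_dep_time %/% 100) %>%
  group_by(hour) %>%
  summarize(
    total_flights = n(),
    # canceled flights have a missing dep_delay and count as not on time
    on_time_count = sum(dep_delay <= 0, na.rm = TRUE),
    on_time_percentage = on_time_count / total_flights * 100)

on_time_by_hour
```

If the hypothesis holds, on_time_percentage should be highest in the early hours and fall through the day.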

Conclusion

In summary, the days in 2013 on which more than 35% of flights were canceled or delayed by more than an hour all coincided with poor weather conditions. We also found that the worst time to travel was 9 PM, the scheduled departure hour with the highest percentage of delays or cancellations. Lastly, many of the days with poor weather still had a number of flights that departed on time, and most of these flights were scheduled in the morning, with February 9th as the exception.