Introduction:

This report looks at flight delays and cancellations for all flights leaving New York City in 2013 using the nycflights13 data set. The main goal is to find the days and times when flights were most delayed or canceled and see how weather might have affected them. I also check which flights were able to leave on time during those bad days to see if early departures helped avoid problems. By transforming the data, making graphs, and looking at averages, this report shows patterns in flight delays and gives a better understanding of when it’s riskier to fly.

Data Dictionary:

The flights data set from the nycflights13 package contains information on all flights that departed the three New York City airports (JFK, LGA, and EWR) in 2013. The key variables used in this analysis are dep_delay, the departure delay in minutes (negative for early departures, NA if the flight was canceled); dep_time, the actual departure time in HHMM format (also NA for canceled flights); sched_dep_time, the scheduled departure time in HHMM format; and month and day, which give the date of the flight. These variables allow us to calculate the percentage of flights that were canceled or delayed by about an hour or more, identify days with unusually high disruption, and analyze trends by time of day.
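As a small illustration of the HHMM encoding, the sketch below splits a few made-up values into hour and minute using integer division and the remainder operator. The helper `hhmm_to_minutes` is a hypothetical name introduced here for illustration; it is not part of nycflights13.

```r
# Made-up HHMM values for illustration: 5:17, 15:45, 23:59
hhmm <- c(517, 1545, 2359)

hour   <- hhmm %/% 100   # integer division keeps the leading HH digits
minute <- hhmm %% 100    # remainder keeps the trailing MM digits

# Hypothetical helper: minutes since midnight for an HHMM value
hhmm_to_minutes <- function(x) (x %/% 100) * 60 + (x %% 100)

hour                     # 5 15 23
minute                   # 17 45 59
hhmm_to_minutes(hhmm)    # 317 945 1439
```

The same `%/% 100` step is what the hourly analysis later in this report uses to bucket flights by scheduled departure hour.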

Days with High Delays and Cancellations:

To identify days with unusually high disruption, I calculated the percentage of flights each day that were either canceled or delayed on departure by 60 minutes or more. Days where this percentage exceeded 35% were flagged as problematic.

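library(nycflights13)
library(dplyr)

# Flag flights that were canceled (dep_time is NA) or left at least 60 minutes late
flights_2013 <- flights %>%
  mutate(significantly_delayed = (dep_delay >= 60 | is.na(dep_time))) %>%
  group_by(month, day) %>%
  summarise(total = n(),
    significantly_delayed_count = sum(significantly_delayed, na.rm = TRUE),
    significantly_delayed_percentage = (significantly_delayed_count / total) * 100) %>%
  # Keep only days where more than 35% of flights were disrupted
  filter(significantly_delayed_percentage > 35)
flights_2013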
## # A tibble: 12 × 5
## # Groups:   month [7]
##    month   day total significantly_delayed_count significantly_delayed_percent…¹
##    <int> <int> <int>                       <int>                           <dbl>
##  1     2     8   930                         508                            54.6
##  2     2     9   684                         421                            61.5
##  3     3     8   979                         578                            59.0
##  4     5    23   988                         436                            44.1
##  5     6    24   994                         352                            35.4
##  6     6    28   994                         372                            37.4
##  7     7     1   966                         412                            42.7
##  8     7    10  1004                         364                            36.3
##  9     7    23   997                         365                            36.6
## 10     9     2   929                         327                            35.2
## 11     9    12   992                         404                            40.7
## 12    12     5   969                         386                            39.8
## # ℹ abbreviated name: ¹​significantly_delayed_percentage
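The flag in the code above depends on how R combines NA with a logical OR, which is easy to get wrong. The toy vectors below (made up for illustration) show that a canceled flight, where both dep_time and dep_delay are NA, is still counted, and they reproduce the February 8 percentage from the counts in the table.

```r
# Toy flights: on time, delayed 75 minutes, canceled (both fields NA)
dep_time  <- c(900, 1015, NA)
dep_delay <- c(0, 75, NA)

# NA >= 60 is NA, but NA | TRUE is TRUE, so the canceled flight is flagged
flag <- dep_delay >= 60 | is.na(dep_time)
flag                        # FALSE TRUE TRUE

# Without the is.na() term the canceled flight would stay NA
dep_delay >= 60             # FALSE TRUE NA

# Percentage for February 8, using the counts reported in the table
round(508 / 930 * 100, 1)   # 54.6
```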

A quick review of historical weather records shows that many of these high-delay days corresponded to severe weather in New York. For example, February 8–9, 2013 saw a major winter storm with heavy snow, and December 5, 2013 also had cold winter conditions. Summer delays, such as those on July 23, 2013, often coincided with thunderstorms or heat that slowed operations. These observations suggest that severe weather, in both winter and summer, had a major impact on flight schedules, although only three of the twelve flagged days fall in the winter months.

Source: Time and Date. (2013). Weather history for New York, USA (2013). Retrieved from https://www.timeanddate.com/weather/usa/new-york/historic
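The weather check above could also be done in R, since nycflights13 ships an hourly weather table for the three airports. The sketch below is one possible way to summarise it for a few of the flagged days; it is not part of the original analysis, the `flagged_days` data frame is typed in by hand for illustration, and the summary column names are my own.

```r
library(nycflights13)
library(dplyr)

# Three of the flagged days, typed in by hand from the table above
flagged_days <- data.frame(month = c(2L, 2L, 12L), day = c(8L, 9L, 5L))

# Summarise the hourly observations per airport and day
daily_weather <- weather %>%
  semi_join(flagged_days, by = c("month", "day")) %>%
  group_by(origin, month, day) %>%
  summarise(
    total_precip   = sum(precip, na.rm = TRUE),      # inches
    min_visib      = min(visib, na.rm = TRUE),       # miles
    max_wind_speed = max(wind_speed, na.rm = TRUE),  # mph
    .groups = "drop"
  )
daily_weather
```

Hourly observations can have gaps, so these summaries are best read as a rough cross-check against the external weather history rather than a replacement for it.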

Hourly Trends in Delays:

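Next, I explored how the percentage of delayed or canceled flights varied by scheduled departure hour. Note that this step counts a flight as significantly delayed only when dep_delay is strictly greater than 60 minutes, whereas the daily analysis above used 60 minutes or more.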

# Flag canceled or >60-minute-late flights and bucket them by scheduled hour
scheduled_departure_hour <- flights %>%
  mutate(significantly_delayed_1 = is.na(dep_time) | dep_delay > 60,
         hour = sched_dep_time %/% 100) %>%
  group_by(hour) %>%
  summarise(
    total_flights = n(),
    significantly_delayed_count = sum(significantly_delayed_1),
    significantly_delayed_percentage = (significantly_delayed_count / total_flights) * 100)
scheduled_departure_hour
## # A tibble: 20 × 4
##     hour total_flights significantly_delayed_count significantly_delayed_perce…¹
##    <dbl>         <int>                       <int>                         <dbl>
##  1     1             1                           1                        100   
##  2     5          1953                          37                          1.89
##  3     6         25951                         999                          3.85
##  4     7         22821                         792                          3.47
##  5     8         27242                        1351                          4.96
##  6     9         20312                        1049                          5.16
##  7    10         16708                        1093                          6.54
##  8    11         16033                        1052                          6.56
##  9    12         18181                        1390                          7.65
## 10    13         19956                        1780                          8.92
## 11    14         21706                        2299                         10.6 
## 12    15         23888                        2877                         12.0 
## 13    16         23002                        3356                         14.6 
## 14    17         24426                        3589                         14.7 
## 15    18         21783                        3313                         15.2 
## 16    19         21441                        4000                         18.7 
## 17    20         16739                        3139                         18.8 
## 18    21         10933                        2196                         20.1 
## 19    22          2639                         416                         15.8 
## 20    23          1061                         107                         10.1 
## # ℹ abbreviated name: ¹​significantly_delayed_percentage

The visualization below shows the relationship between departure hour and the percentage of flights delayed or canceled.

library(ggplot2)

# Line plot of the share of disrupted flights by scheduled departure hour
ggplot(data = scheduled_departure_hour, mapping = aes(x = hour, y = significantly_delayed_percentage)) +
  geom_line(color = "red") +
  geom_point(size = 1.5) +
  labs(
    x = "Scheduled Departure Hour",
    y = "Percentage of Canceled or Delayed Flights"
  )

The plot starts with a spike at hour 1 because only one flight was scheduled in that hour, and it was canceled or delayed, which gives 100%. From 5:00 onward the share of disrupted flights is low in the morning (under 5% through the 8:00 hour) and generally rises through the day, peaking at about 20% for flights scheduled in the 21:00 hour before falling for the relatively few flights scheduled at 22:00 and 23:00. This pattern is consistent with earlier disruptions propagating through the schedule. The riskiest times to fly are late afternoon and evening, when delays have accumulated.
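Because a single flight can produce a 100% rate, one option is to drop sparsely populated hours before plotting. The sketch below applies that idea to three rows copied from the hourly table; the cutoff `min_flights` is a hypothetical value chosen only for illustration, not one used in the original analysis.

```r
library(dplyr)

# Three rows copied from the hourly table above
hourly <- data.frame(
  hour = c(1, 5, 21),
  total_flights = c(1, 1953, 10933),
  significantly_delayed_count = c(1, 37, 2196)
)

min_flights <- 100  # hypothetical cutoff, for illustration only

# Keep hours with enough flights for the percentage to be meaningful
reliable_hours <- hourly %>%
  filter(total_flights >= min_flights) %>%
  mutate(pct = significantly_delayed_count / total_flights * 100)

reliable_hours$hour   # 5 21
```

Filtering this way removes the misleading spike at hour 1 while leaving the rest of the trend unchanged.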

On-Time Flights During Problematic Days:

Even on days with severe delays, some flights managed to leave on time or early. The following data set isolates these flights for all previously identified problematic days.

on_time_flights <- flights %>%
  filter(
    (month == 2 & day == 8)  |
    (month == 2 & day == 9)  |
    (month == 3 & day == 8)  |
    (month == 5 & day == 23) |
    (month == 6 & day == 24) |
    (month == 6 & day == 28) |
    (month == 7 & day == 1)  |
    (month == 7 & day == 10) |
    (month == 7 & day == 23) |
    (month == 9 & day == 2)  |
    (month == 9 & day == 12) |
    (month == 12 & day == 5),
    dep_delay <= 0  # on time or early; canceled flights (NA) are dropped
  )
on_time_flights
## # A tibble: 3,441 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     5      457            500        -3      637            651
##  2  2013    12     5      512            515        -3      753            814
##  3  2013    12     5      527            530        -3      657            706
##  4  2013    12     5      539            540        -1      832            850
##  5  2013    12     5      540            545        -5      822            832
##  6  2013    12     5      544            550        -6      959           1027
##  7  2013    12     5      548            600       -12      738            755
##  8  2013    12     5      551            600        -9      804            810
##  9  2013    12     5      553            600        -7      919            915
## 10  2013    12     5      553            600        -7      645            701
## # ℹ 3,431 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
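The hand-written list of dates above could also be expressed as a join against the table of flagged days. The sketch below shows the idea on toy data (all values made up); in this report the equivalent call would presumably be `semi_join(flights, flights_2013, by = c("month", "day"))`, assuming `flights_2013` still holds the twelve flagged days.

```r
library(dplyr)

# Toy stand-ins for `flights` and the flagged-days table (values made up)
toy_flights <- data.frame(
  month = c(2, 2, 3, 4),
  day = c(8, 8, 8, 1),
  dep_delay = c(-3, 45, 0, -5)
)
flagged_days <- data.frame(month = c(2, 3), day = c(8, 8))

# semi_join keeps rows of toy_flights whose (month, day) appears in flagged_days
on_time_toy <- toy_flights %>%
  semi_join(flagged_days, by = c("month", "day")) %>%
  filter(dep_delay <= 0)

on_time_toy   # the 2/8 flight with -3 and the 3/8 flight with 0
```

A join avoids retyping the dates, so the on-time subset stays in sync if the 35% threshold is ever changed.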

Average Departure Hour Analysis:

To test the hypothesis that flights leaving on time during bad weather tended to depart early in the morning, I calculated the average scheduled departure hour for these flights.

avg_sched_hour <- on_time_flights %>%
  group_by(month, day) %>%
  summarise(avg_hour = mean(sched_dep_time %/% 100))
avg_sched_hour
## # A tibble: 12 × 3
## # Groups:   month [7]
##    month   day avg_hour
##    <int> <int>    <dbl>
##  1     2     8     8.86
##  2     2     9    17.0 
##  3     3     8    10.2 
##  4     5    23     9.45
##  5     6    24     9.82
##  6     6    28    10.1 
##  7     7     1     9.43
##  8     7    10     9.60
##  9     7    23    10.0 
## 10     9     2     9.85
## 11     9    12     9.03
## 12    12     5    10.1

On 11 of the 12 problematic days, the average scheduled departure hour of the on-time flights fell between about 8.9 and 10.2, that is, in the morning. This is consistent with the hypothesis, although an average alone does not show how those departures were spread across the day. The filter below isolates any day whose average falls at noon or later.

afternoon_flights <- avg_sched_hour %>%
  filter(avg_hour >= 12)
afternoon_flights
## # A tibble: 1 × 3
## # Groups:   month [1]
##   month   day avg_hour
##   <int> <int>    <dbl>
## 1     2     9     17.0

Only February 9 had an afternoon average, at about 17:00. That day had the fewest flights of the twelve flagged days (684) and fell on the second day of the winter storm, so one plausible explanation is that many morning departures were canceled and on-time departures resumed only later in the day. The data used here do not confirm this.
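Since a mean can hide the shape of a distribution, a complementary check is the median hour and the share of on-time flights scheduled before noon. The sketch below computes both on toy data (values made up); the column names `median_hour` and `share_before_noon` are my own and do not appear in the original analysis.

```r
library(dplyr)

# Toy on-time flights for two days (scheduled times made up)
toy_on_time <- data.frame(
  month = c(2, 2, 2, 2, 2, 2),
  day = c(8, 8, 8, 9, 9, 9),
  sched_dep_time = c(600, 730, 1400, 1530, 1700, 1845)
)

morning_share <- toy_on_time %>%
  mutate(hour = sched_dep_time %/% 100) %>%
  group_by(month, day) %>%
  summarise(
    avg_hour = mean(hour),
    median_hour = median(hour),
    share_before_noon = mean(hour < 12) * 100,  # percent of flights before 12:00
    .groups = "drop"
  )
morning_share
```

In the toy data, day 8 has a mean hour of 9 but a median of 7, which shows how one afternoon flight can pull the average later than the typical departure.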

Conclusion:

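Analysis of NYC flights in 2013 shows that the most disrupted days often coincided with severe weather, most clearly the February 8–9 winter storm. Early morning flights were generally the most reliable: fewer than 5% of flights scheduled between 5:00 and 8:59 were canceled or delayed by more than an hour, compared with roughly 15–20% of flights scheduled between 16:00 and 21:59. On most of the problematic days, the flights that still left on time were scheduled in the morning on average. These findings highlight the importance of scheduling and weather monitoring for maintaining on-time performance at airports.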