Introduction:
This report looks at flight delays and cancellations for all flights leaving New York City in 2013 using the nycflights13 data set. The main goal is to find the days and times when flights were most delayed or canceled and see how weather might have affected them. I also check which flights were able to leave on time during those bad days to see if early departures helped avoid problems. By transforming the data, making graphs, and looking at averages, this report shows patterns in flight delays and gives a better understanding of when it’s riskier to fly.
Data Dictionary:
The flights data set from the nycflights13 package contains information on all flights that departed New York City in 2013. The key variables used in this analysis are dep_delay, the departure delay in minutes (NA if a flight was canceled); dep_time, the actual departure time in HHMM format (also NA for canceled flights); sched_dep_time, the scheduled departure time in HHMM format; and month and day, which give the date of the flight. These variables allow us to calculate the percentage of flights canceled or delayed by an hour or more, identify days with unusually high disruptions, and analyze trends by time of day.
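As a small illustration of the HHMM encoding described above, the sketch below splits a scheduled departure time into hour and minute and marks canceled flights. The toy rows and the column names sched_hour, sched_minute, and canceled are assumptions made for this example; they are not part of the flights data set.

```r
library(dplyr)

# Toy rows that mimic three columns of `flights` (values are made up)
toy_flights <- tibble(
  sched_dep_time = c(600L, 1359L, 2145L),
  dep_time       = c(555L, NA, 2310L),
  dep_delay      = c(-5, NA, 85)
)

decoded <- toy_flights %>%
  mutate(
    sched_hour   = sched_dep_time %/% 100,  # integer division keeps the HH part
    sched_minute = sched_dep_time %% 100,   # remainder keeps the MM part
    canceled     = is.na(dep_time)          # no recorded departure time
  )

decoded
```

The same integer division is what the hourly analysis later in this report relies on when it groups flights by scheduled departure hour.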
Days with High Delays and Cancellations:
To identify days with unusually high disruption, I calculated the percentage of flights each day that were either canceled or delayed on departure by one hour or more (dep_delay >= 60). Days where this percentage exceeded 35% were flagged as problematic.
library(nycflights13)
library(dplyr)

# Flag each flight as disrupted (canceled, or delayed 60+ minutes), then compute the daily share
flights_2013 <- flights %>%
  mutate(significantly_delayed = (dep_delay >= 60 | is.na(dep_time))) %>%
  group_by(month, day) %>%
  summarise(
    total = n(),
    significantly_delayed_count = sum(significantly_delayed, na.rm = TRUE),
    significantly_delayed_percentage = (significantly_delayed_count / total) * 100
  ) %>%
  filter(significantly_delayed_percentage > 35)  # keep only the problematic days
flights_2013
## # A tibble: 12 × 5
## # Groups:   month [7]
##    month   day total significantly_delayed_count significantly_delayed_percent…¹
##    <int> <int> <int>                       <int>                           <dbl>
##  1     2     8   930                         508                            54.6
##  2     2     9   684                         421                            61.5
##  3     3     8   979                         578                            59.0
##  4     5    23   988                         436                            44.1
##  5     6    24   994                         352                            35.4
##  6     6    28   994                         372                            37.4
##  7     7     1   966                         412                            42.7
##  8     7    10  1004                         364                            36.3
##  9     7    23   997                         365                            36.6
## 10     9     2   929                         327                            35.2
## 11     9    12   992                         404                            40.7
## 12    12     5   969                         386                            39.8
## # ℹ abbreviated name: ¹significantly_delayed_percentage
A review of historical weather records shows that many of these high-delay days coincided with severe weather in New York. For example, February 8–9, 2013 saw a major winter storm with heavy snow and freezing fog, and December 5, 2013 also had cold winter conditions. Summer delay days, such as July 23, 2013, often coincided with thunderstorms or heat. The three worst days in the table (February 8, February 9, and March 8, each with more than 54% of flights disrupted) all fall in winter, which suggests that harsh winter weather had an especially large impact on flight schedules, although 8 of the 12 flagged days fall between May and September.
Source: Time and Date. (2013). Weather history for New York, USA (2013). Retrieved from https://www.timeanddate.com/weather/usa/new-york/historic
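The weather comparison above was done by hand against an external source. As a complementary check, the sketch below summarizes weather for a set of flagged days. The helper name summarise_flagged_weather and the toy values are hypothetical; the column names month, day, precip, visib, and wind_speed are assumed to match the weather table that ships with nycflights13.

```r
library(dplyr)

# Hypothetical helper: daily weather summary restricted to a set of flagged days.
# `weather_tbl` is assumed to contain month, day, precip, visib and wind_speed.
summarise_flagged_weather <- function(weather_tbl, flagged_days) {
  weather_tbl %>%
    semi_join(flagged_days, by = c("month", "day")) %>%  # keep flagged days only
    group_by(month, day) %>%
    summarise(
      total_precip = sum(precip, na.rm = TRUE),
      min_visib    = min(visib, na.rm = TRUE),
      max_wind     = max(wind_speed, na.rm = TRUE),
      .groups = "drop"
    )
}

# Toy hourly observations (made-up values) and two flagged days
toy_weather <- tibble(
  month      = c(2L, 2L, 2L, 3L),
  day        = c(8L, 8L, 10L, 8L),
  precip     = c(0.1, 0.3, 0, 0.05),
  visib      = c(2, 0.5, 10, 6),
  wind_speed = c(20, 25, 5, 15)
)
toy_flagged <- tibble(month = c(2L, 3L), day = c(8L, 8L))

flagged_weather <- summarise_flagged_weather(toy_weather, toy_flagged)
flagged_weather
```

With the objects from this report, the call would presumably be summarise_flagged_weather(weather, ungroup(flights_2013)). Because the weather table has one row per airport and hour, this summary pools all three NYC airports.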
Hourly Trends in Delays:
Next, I explored how the percentage of flights that were canceled or delayed by more than 60 minutes varied by scheduled departure hour.
# Share of flights canceled or delayed by more than 60 minutes, per scheduled departure hour
scheduled_departure_hour <- flights %>%
  mutate(
    significantly_delayed_1 = is.na(dep_time) | dep_delay > 60,
    hour = sched_dep_time %/% 100  # HHMM -> HH
  ) %>%
  group_by(hour) %>%
  summarise(
    total_flights = n(),
    significantly_delayed_count = sum(significantly_delayed_1),
    significantly_delayed_percentage = (significantly_delayed_count / total_flights) * 100
  )
scheduled_departure_hour
## # A tibble: 20 × 4
##     hour total_flights significantly_delayed_count significantly_delayed_perce…¹
##    <dbl>         <int>                       <int>                         <dbl>
##  1     1             1                           1                        100
##  2     5          1953                          37                          1.89
##  3     6         25951                         999                          3.85
##  4     7         22821                         792                          3.47
##  5     8         27242                        1351                          4.96
##  6     9         20312                        1049                          5.16
##  7    10         16708                        1093                          6.54
##  8    11         16033                        1052                          6.56
##  9    12         18181                        1390                          7.65
## 10    13         19956                        1780                          8.92
## 11    14         21706                        2299                         10.6
## 12    15         23888                        2877                         12.0
## 13    16         23002                        3356                         14.6
## 14    17         24426                        3589                         14.7
## 15    18         21783                        3313                         15.2
## 16    19         21441                        4000                         18.7
## 17    20         16739                        3139                         18.8
## 18    21         10933                        2196                         20.1
## 19    22          2639                         416                         15.8
## 20    23          1061                         107                         10.1
## # ℹ abbreviated name: ¹significantly_delayed_percentage
The visualization below shows the relationship between departure hour and the percentage of flights delayed or canceled.
library(ggplot2)

ggplot(data = scheduled_departure_hour,
       mapping = aes(x = hour, y = significantly_delayed_percentage)) +
  geom_line(color = "red") +
  geom_point(size = 1.5) +
  labs(
    x = "Scheduled Departure Hour",
    y = "Percentage of Canceled or Delayed Flights"
  )
The plot starts with a spike at hour 1: only one flight was scheduled in that hour, and it was delayed or canceled, giving 100%. From 5:00 to 8:00 the percentage stays below 5%, then rises steadily through the day, which is consistent with earlier disruptions propagating through the schedule. It peaks at about 20% for flights scheduled in the 21:00 hour before easing for the comparatively few flights scheduled at 22:00 or later. The riskiest times to fly are therefore late afternoon and evening, when delays have accumulated.
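Because a single flight produces the 100% spike at hour 1, one option is to drop hours with very few flights before plotting. The sketch below shows the idea on four rows copied from the hourly table above. The cutoff min_flights = 100 is an assumption chosen for illustration, not a value used in this report.

```r
library(dplyr)

# Four rows taken from the hourly summary shown earlier in this section
toy_hourly <- tibble(
  hour = c(1, 5, 6, 21),
  total_flights = c(1L, 1953L, 25951L, 10933L),
  significantly_delayed_count = c(1L, 37L, 999L, 2196L)
) %>%
  mutate(
    significantly_delayed_percentage =
      significantly_delayed_count / total_flights * 100
  )

min_flights <- 100  # assumed cutoff, for illustration only

# Keep only hours with enough flights for the percentage to be meaningful
reliable_hours <- toy_hourly %>%
  filter(total_flights >= min_flights)

reliable_hours
```

Applied to scheduled_departure_hour, the same filter would remove only the hour-1 row, since every other hour has more than 1,000 flights.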
On-Time Flights During Problematic Days:
Even on days with severe delays, some flights managed to leave on time or early. The following data set isolates these flights (those with a departure delay of zero or less) for all previously identified problematic days.
# Keep flights from the 12 flagged days that left on time or early (dep_delay <= 0)
on_time_flights <- flights %>%
  filter(
    (month == 2 & day == 8) |
      (month == 2 & day == 9) |
      (month == 3 & day == 8) |
      (month == 5 & day == 23) |
      (month == 6 & day == 24) |
      (month == 6 & day == 28) |
      (month == 7 & day == 1) |
      (month == 7 & day == 10) |
      (month == 7 & day == 23) |
      (month == 9 & day == 2) |
      (month == 9 & day == 12) |
      (month == 12 & day == 5),
    dep_delay <= 0
  )
on_time_flights
## # A tibble: 3,441 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     5      457            500        -3      637            651
##  2  2013    12     5      512            515        -3      753            814
##  3  2013    12     5      527            530        -3      657            706
##  4  2013    12     5      539            540        -1      832            850
##  5  2013    12     5      540            545        -5      822            832
##  6  2013    12     5      544            550        -6      959           1027
##  7  2013    12     5      548            600       -12      738            755
##  8  2013    12     5      551            600        -9      804            810
##  9  2013    12     5      553            600        -7      919            915
## 10  2013    12     5      553            600        -7      645            701
## # ℹ 3,431 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
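The long chain of month and day conditions could also be written as a filtering join against a table of flagged days, which avoids retyping the dates. The sketch below demonstrates this on made-up rows. The names toy_flights, problem_days, and on_time_toy are hypothetical.

```r
library(dplyr)

# Toy flights (made-up values) and a table of flagged days
toy_flights <- tibble(
  month     = c(2L, 2L, 2L, 4L, 12L),
  day       = c(8L, 8L, 9L, 1L, 5L),
  dep_delay = c(-3, 75, NA, -2, 0)
)
problem_days <- tibble(
  month = c(2L, 2L, 12L),
  day   = c(8L, 9L, 5L)
)

on_time_toy <- toy_flights %>%
  semi_join(problem_days, by = c("month", "day")) %>%  # flagged days only
  filter(dep_delay <= 0)                               # on time or early; NA rows drop out

on_time_toy
```

In this report, the table of flagged days already exists as flights_2013, so semi_join(flights, flights_2013, by = c("month", "day")) followed by the same filter should select the same flights as the explicit conditions.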
Average Departure Hour Analysis:
To test the hypothesis that flights leaving on time during bad weather tended to depart early in the morning, I calculated the average scheduled departure hour for these flights.
# Average scheduled departure hour of on-time flights, per flagged day
avg_sched_hour <- on_time_flights %>%
  group_by(month, day) %>%
  summarise(avg_hour = mean(sched_dep_time %/% 100))
avg_sched_hour
## # A tibble: 12 × 3
## # Groups:   month [7]
##    month   day avg_hour
##    <int> <int>    <dbl>
##  1     2     8     8.86
##  2     2     9    17.0
##  3     3     8    10.2
##  4     5    23     9.45
##  5     6    24     9.82
##  6     6    28    10.1
##  7     7     1     9.43
##  8     7    10     9.60
##  9     7    23    10.0
## 10     9     2     9.85
## 11     9    12     9.03
## 12    12     5    10.1
On 11 of the 12 problematic days, the average scheduled departure hour of on-time flights fell between about 9:00 and 10:00. On-time departures on those days were therefore concentrated in the morning, which is consistent with the hypothesis. To find the exceptions, I filtered for days whose average scheduled departure hour was at noon or later.
# Days whose on-time flights were, on average, scheduled at noon or later
afternoon_flights <- avg_sched_hour %>%
  filter(avg_hour >= 12)
afternoon_flights
## # A tibble: 1 × 3
## # Groups:   month [1]
##   month   day avg_hour
##   <int> <int>    <dbl>
## 1     2     9     17.0
The filter returns a single day. On February 9, the average scheduled departure hour of on-time flights was about 17:00. That day had far fewer flights than the other flagged days (684, against 929 or more) and the highest disruption rate (61.5%), so the late average may reflect morning flights being canceled or delayed and airport operations recovering later in the day.
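An average hour alone does not show whether most on-time flights were morning flights, because a few late departures can pull the mean upward. A more direct check is the median hour and the share of on-time flights scheduled before noon. The sketch below computes both on made-up rows. The column names n_on_time, median_hour, and share_before_noon are assumptions for this example.

```r
library(dplyr)

# Toy on-time flights for two days (made-up scheduled times in HHMM)
toy_on_time <- tibble(
  month          = c(2L, 2L, 2L, 2L, 2L, 2L),
  day            = c(8L, 8L, 8L, 9L, 9L, 9L),
  sched_dep_time = c(600L, 715L, 1830L, 1500L, 1745L, 2000L)
)

morning_share <- toy_on_time %>%
  mutate(hour = sched_dep_time %/% 100) %>%
  group_by(month, day) %>%
  summarise(
    n_on_time         = n(),
    median_hour       = median(hour),
    share_before_noon = mean(hour < 12),  # fraction of flights scheduled before 12:00
    .groups = "drop"
  )

morning_share
```

Running the same summary on on_time_flights would show, for each flagged day, how many on-time flights there were and what fraction of them were scheduled in the morning.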
Conclusion:
Analysis of NYC flights in 2013 suggests that weather had a strong influence on delays and cancellations: the most disrupted days coincided with winter storms and summer thunderstorms. Morning flights were generally the most reliable, even on bad weather days, while late afternoon and evening departures faced the highest risk. These findings highlight the importance of scheduling and weather monitoring for maintaining on-time performance at airports.