This data dive explores the nycflights13 dataset, which
contains detailed information on all flights departing from New York
City airports in 2013. The goal of this analysis is to develop an
initial understanding of flight delays and their variability, and to
identify meaningful patterns that can inform a more focused statistical
analysis in the final project.
library(nycflights13)
library(tidyverse)
data("flights")
data("weather")
data("planes")
flights |>
  filter(!is.na(dep_delay)) |>
  ggplot(aes(x = dep_delay)) +
  geom_histogram(bins = 40, color = "white") +
  coord_cartesian(xlim = c(-50, 300)) +
  labs(
    title = "Distribution of Departure Delays from NYC Airports",
    x = "Departure Delay (minutes)",
    y = "Number of Flights"
  ) +
  theme_classic()
The distribution of departure delays is highly right-skewed, with a large concentration of flights clustered near zero minutes. This indicates that most flights depart on time or with only minor delays, suggesting that severe disruptions are not the norm in daily operations at NYC airports. However, the long right tail shows that a smaller subset of flights experience very large delays, sometimes exceeding several hours. Although these events are relatively rare, they represent significant disruptions for affected passengers and likely contribute disproportionately to overall delay statistics.
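The skew described above can also be checked numerically. The following is a minimal sketch, assuming only that `nycflights13` and `dplyr` are installed; the 60-minute threshold and the `delay_summary` name are illustrative choices, not part of the original analysis.

```r
library(nycflights13)
library(dplyr)

# Summarise the shape of the departure-delay distribution.
# The 60-minute cutoff is an arbitrary, illustrative threshold.
delay_summary <- flights |>
  filter(!is.na(dep_delay)) |>
  summarise(
    mean_delay    = mean(dep_delay),
    median_delay  = median(dep_delay),
    p90_delay     = quantile(dep_delay, 0.90),
    p99_delay     = quantile(dep_delay, 0.99),
    share_over_60 = mean(dep_delay > 60)  # proportion delayed more than an hour
  )

delay_summary
```

In a right-skewed distribution the mean sits above the median, so comparing the two columns gives a quick check on what the histogram shows.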
flights |>
  filter(!is.na(arr_delay), !is.na(dep_delay)) |>
  mutate(delay_diff = arr_delay - dep_delay) |>
  ggplot(aes(x = delay_diff)) +
  geom_histogram(bins = 40, color = "white") +
  coord_cartesian(xlim = c(-100, 200)) +
  labs(
    title = "Difference Between Arrival and Departure Delays",
    x = "Arrival Delay − Departure Delay (minutes)",
    y = "Number of Flights"
  ) +
  theme_classic()
The distribution of the difference between arrival and departure delays is centered slightly below zero, indicating that many flights arrive earlier than their departure delay would predict. A negative difference means the arrival delay is smaller than the departure delay, so this suggests that flights often recover time while airborne, either by making up part of a late departure or by arriving ahead of schedule.
At the same time, the spread of the distribution shows that recovery is not guaranteed. Some flights arrive much later than their departure delay would suggest, indicating that initial delays can compound rather than be mitigated. For travelers, this highlights that while early arrivals and delay recovery are common, there remains a meaningful risk of substantial downstream delays, particularly for flights with tight connections.
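To put rough numbers on how often time is recovered versus lost in the air, the same difference can be summarised directly. This is a sketch rather than part of the original analysis: the 30-minute threshold and the names `recovery_summary`, `share_recovered`, and `share_worse_30` are assumptions chosen for illustration.

```r
library(nycflights13)
library(dplyr)

# Compare arrival delay with departure delay for flights where both are recorded.
recovery_summary <- flights |>
  filter(!is.na(arr_delay), !is.na(dep_delay)) |>
  mutate(delay_diff = arr_delay - dep_delay) |>
  summarise(
    median_diff     = median(delay_diff),
    share_recovered = mean(delay_diff < 0),   # arrival delay smaller than departure delay
    share_worse_30  = mean(delay_diff > 30)   # lost more than 30 further minutes en route
  )

recovery_summary
```

Reporting shares alongside the median separates the typical case (some time recovered) from the tail of flights whose delays grow after departure.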
flights |>
  filter(!is.na(dep_delay)) |>
  group_by(carrier) |>
  summarise(
    avg_dep_delay = mean(dep_delay),
    flights = n()
  ) |>
  # Keep only carriers with more than 1,000 recorded departures
  filter(flights > 1000) |>
  # Flag the carriers with the lowest and highest average delay
  mutate(
    extreme = case_when(
      avg_dep_delay == min(avg_dep_delay) ~ "Lowest",
      avg_dep_delay == max(avg_dep_delay) ~ "Highest",
      TRUE ~ "Other"
    )
  ) |>
  ggplot(aes(
    x = reorder(carrier, avg_dep_delay),
    y = avg_dep_delay,
    fill = extreme
  )) +
  geom_col() +
  # Green marks the lowest average delay, red the highest
  scale_fill_manual(
    values = c("Lowest" = "darkgreen", "Highest" = "red", "Other" = "grey70")
  ) +
  labs(
    title = "Average Departure Delay by Airline",
    x = "Carrier",
    y = "Average Departure Delay (minutes)"
  ) +
  theme_classic() +
  theme(legend.position = "none")
This figure shows average departure delays for carriers with more than 1,000 recorded departures from NYC airports in 2013, with the carriers that have the lowest and highest averages highlighted. The carrier with the lowest average delay stands out as more reliable at departure, while the carrier with the highest average delay shows notably worse on-time performance than its peers.
Most airlines fall between these two extremes, indicating that while airline choice does matter for departure reliability, differences among the majority of carriers are moderate. From a practical standpoint, this suggests that passengers concerned about delays may benefit from avoiding carriers with consistently high average delays, while recognizing that no airline is entirely immune to operational disruptions.
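Because a mean can be pulled upward by a handful of very long delays, it may help to view the carrier comparison alongside measures that are less sensitive to outliers. The sketch below is one way to do that, not the method used for the figure: the 15-minute threshold and the `carrier_summary` name are assumptions, and the `airlines` lookup table shipped with `nycflights13` is joined in only to show full carrier names.

```r
library(nycflights13)
library(dplyr)

# Per-carrier delay summary using the same 1,000-flight cutoff as the figure.
carrier_summary <- flights |>
  filter(!is.na(dep_delay)) |>
  group_by(carrier) |>
  summarise(
    flights          = n(),
    avg_dep_delay    = mean(dep_delay),
    median_dep_delay = median(dep_delay),
    share_over_15    = mean(dep_delay > 15)  # illustrative "late" threshold
  ) |>
  filter(flights > 1000) |>
  left_join(airlines, by = "carrier") |>     # adds the full airline name
  arrange(avg_dep_delay)

carrier_summary
```

If the ranking by median or by share of late flights differs from the ranking by mean, that points to a carrier whose average is driven by a small number of extreme delays.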
flights |>
  filter(!is.na(dep_delay), !is.na(dep_time)) |>
  mutate(hour = dep_time %/% 100) |>
  group_by(hour) |>
  summarise(avg_dep_delay = mean(dep_delay)) |>
  ggplot(aes(x = hour, y = avg_dep_delay)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average Departure Delay by Hour of Day",
    x = "Hour of Departure",
    y = "Average Delay (minutes)"
  ) +
  theme_classic()
Average departure delays vary substantially by time of day. Flights departing very early in the morning tend to have the lowest average delays, likely because these flights are less exposed to earlier disruptions in the daily flight schedule.
As the day progresses, average delays steadily increase, indicating that delays accumulate over time as aircraft, crews, and airports absorb earlier disruptions. The sharp rise in delays during the late evening hours suggests that recovery becomes increasingly difficult later in the day. From a practical standpoint, this pattern implies that passengers seeking more reliable departures should prioritize early-morning flights when possible.
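One caveat is that the hour above is derived from `dep_time`, the actual departure time, so a flight scheduled for late evening but delayed past midnight is counted in the small hours. A possible cross-check, sketched here as an alternative rather than a correction, groups by the `hour` column that `flights` already provides, which records the scheduled departure hour. The `hourly_sched` name is an assumption for illustration.

```r
library(nycflights13)
library(dplyr)

# Average departure delay by SCHEDULED departure hour.
# `hour` in flights is the scheduled departure time broken into hours.
hourly_sched <- flights |>
  filter(!is.na(dep_delay)) |>
  group_by(hour) |>
  summarise(
    avg_dep_delay = mean(dep_delay),
    flights       = n()
  ) |>
  arrange(hour)

hourly_sched
```

Grouping by scheduled hour answers the traveler's question more directly, since it describes the delay to expect when booking a flight at a given time. Keeping the flight count in the output also shows which hours rest on very few departures.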