In this data dive, we explore individual rows and groups of rows
within the nycflights13 dataset. By grouping flights
according to categorical variables, we interpret each group’s share of
all flights as the probability that a randomly selected flight belongs
to it. Smaller groups correspond to rarer events and are candidates for anomalies.
The goal is to:
- Identify low-probability (rare) groups
- Translate rarity into real-world meaning
- Form testable hypotheses explaining why some groups occur less frequently
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
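# Keep only flights with both a departure delay and an arrival delay recorded
flights_clean <- flights |>
  filter(!is.na(dep_delay), !is.na(arr_delay))
# Flights per carrier, with each carrier's share of all cleaned flights
carrier_delay <- flights_clean |>
  group_by(carrier) |>
  summarise(
    n_flights = n(),
    mean_dep_delay = mean(dep_delay),
    .groups = "drop"
  ) |>
  mutate(
    probability = n_flights / sum(n_flights),
    rare_group = probability == min(probability)  # TRUE for the smallest group
  )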
ggplot(carrier_delay, aes(x = reorder(carrier, n_flights), y = n_flights)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Number of Flights by Carrier",
    x = "Carrier",
    y = "Flight Count"
  )
This visualization shows a highly uneven distribution of flights across carriers. A small number of airlines, most notably UA, B6, EV, DL, and AA, account for the majority of departures from NYC airports, while several carriers operate only a very small number of flights.
From a probability perspective, if a single flight were selected at random from the dataset, it would be far more likely to belong to one of the dominant carriers and extremely unlikely to belong to the smallest carriers (e.g., OO, HA, YV). These low-frequency carriers therefore represent low-probability groups under random sampling.
This rarity reflects airline size, route coverage, and hub strategies; it does not indicate anomalous behavior or data errors.
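The "random selection" reading of these shares can be checked empirically. The following is a minimal sketch, not part of the original analysis: it draws 10,000 flights at random and compares each carrier's share in the sample with its exact share in the data. The sample size, the seed, and the names `carrier_prob`, `sampled`, and `comparison` are arbitrary choices made for illustration.

```r
library(dplyr)
library(nycflights13)

set.seed(42)  # arbitrary seed so the sketch is reproducible

flights_clean <- flights |>
  filter(!is.na(dep_delay), !is.na(arr_delay))

# Exact share of each carrier among the cleaned flights
carrier_prob <- flights_clean |>
  count(carrier, name = "n_flights") |>
  mutate(probability = n_flights / sum(n_flights))

# Draw 10,000 flights at random (with replacement) and tally the carriers
sampled <- flights_clean |>
  slice_sample(n = 10000, replace = TRUE) |>
  count(carrier, name = "n_sampled") |>
  mutate(sample_share = n_sampled / sum(n_sampled))

# Carriers that never appear in the sample get a sample share of 0
comparison <- carrier_prob |>
  left_join(sampled, by = "carrier") |>
  mutate(sample_share = coalesce(sample_share, 0)) |>
  arrange(probability)

comparison
```

If the interpretation holds, the sample shares should sit close to the exact probabilities for the large carriers, while the rarest carriers may appear only a few times or not at all.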
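# Bin the scheduled departure hour into four time-of-day groups.
# cut() closes intervals on the right by default, so the bins are
# [0, 6], (6, 12], (12, 18], (18, 24]: a 6:00 flight counts as "Night".
time_delay <- flights_clean |>
  mutate(
    dep_time_bin = cut(
      hour,
      breaks = c(0, 6, 12, 18, 24),
      labels = c("Night", "Morning", "Afternoon", "Evening"),
      include.lowest = TRUE
    )
  ) |>
  group_by(dep_time_bin) |>
  summarise(
    n_flights = n(),
    mean_dep_delay = mean(dep_delay),
    .groups = "drop"
  ) |>
  mutate(
    probability = n_flights / sum(n_flights),
    rare_group = probability == min(probability)
  )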
ggplot(time_delay, aes(x = dep_time_bin, y = mean_dep_delay)) +
  geom_col() +
  labs(
    title = "Average Departure Delay by Time of Day",
    x = "Time of Day",
    y = "Mean Departure Delay (minutes)"
  )
The plot shows a clear upward trend in average departure delays as the day progresses. Flights departing at night and in the morning have relatively low average delays, while afternoon and evening flights experience substantially higher delays.
Nighttime departures form the smallest group and therefore have the lowest probability of being drawn when a flight is selected at random. Despite their rarity, night flights also tend to have the lowest average delays.
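Which flights count as "Night" depends on how `cut()` treats the boundary hours. The sketch below uses a small made-up vector of hours (not taken from the dataset) to compare the default right-closed bins with a left-closed alternative. The `right = FALSE` variant and the names `bin_breaks` and `bin_labels` are illustrations only, not the binning used for the plot above.

```r
# Hypothetical scheduled departure hours chosen to sit on bin boundaries
hours <- c(5, 6, 12, 18, 23)
bin_breaks <- c(0, 6, 12, 18, 24)
bin_labels <- c("Night", "Morning", "Afternoon", "Evening")

# Default (right = TRUE): bins are [0, 6], (6, 12], (12, 18], (18, 24]
right_closed <- cut(hours, breaks = bin_breaks, labels = bin_labels,
                    include.lowest = TRUE)

# Alternative (right = FALSE): bins are [0, 6), [6, 12), [12, 18), [18, 24]
left_closed <- cut(hours, breaks = bin_breaks, labels = bin_labels,
                   include.lowest = TRUE, right = FALSE)

data.frame(hour = hours, right_closed, left_closed)
```

Under the default, a 6:00 departure lands in Night and a noon departure in Morning; with `right = FALSE`, each boundary hour moves to the later bin. Either choice is defensible, but it changes which flights make up the rare group.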
This pattern suggests that delays accumulate throughout the day, likely due to earlier disruptions propagating forward in airline schedules.
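One way to probe this hypothesis at a finer grain is to drop the four bins and look at the mean delay for each scheduled hour. The following is a rough sketch under that assumption; the name `hourly_delay` and the choice of a Spearman rank correlation are illustrative, and a high correlation would be consistent with the accumulation story without proving it.

```r
library(dplyr)
library(nycflights13)

# Mean departure delay for each scheduled departure hour
hourly_delay <- flights |>
  filter(!is.na(dep_delay), !is.na(arr_delay)) |>
  group_by(hour) |>
  summarise(
    n_flights = n(),
    mean_dep_delay = mean(dep_delay),
    .groups = "drop"
  ) |>
  arrange(hour)

hourly_delay

# Rank correlation between hour and mean delay: values near +1 would be
# consistent with delays building up over the course of the day
cor(hourly_delay$hour, hourly_delay$mean_dep_delay, method = "spearman")
```

Hours with very few flights give noisy means, so `n_flights` is kept in the table as a reminder to read those rows with caution.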
# Bin flight distance (miles) into four categories.
# The breaks stop at 3000, so any flight longer than 3000 miles
# falls outside every bin and is assigned NA by cut().
distance_delay <- flights_clean |>
  mutate(
    distance_bin = cut(
      distance,
      breaks = c(0, 500, 1000, 2000, 3000),
      labels = c("Short", "Medium", "Long", "Very Long"),
      include.lowest = TRUE
    )
  ) |>
  group_by(distance_bin) |>
  summarise(
    n_flights = n(),
    mean_arr_delay = mean(arr_delay),
    .groups = "drop"
  ) |>
  mutate(
    probability = n_flights / sum(n_flights),
    rare_group = probability == min(probability)
  )
ggplot(distance_delay, aes(x = distance_bin, y = n_flights)) +
  geom_col() +
  labs(
    title = "Flight Counts by Distance Category",
    x = "Distance Category",
    y = "Number of Flights"
  )
Most NYC flights fall into the short, medium, and long-distance categories, with medium-distance flights being the most common. Very long-distance flights form the smallest of the four labeled categories.
The NA category does not indicate missing distance values. It arises because the bin breaks stop at 3,000 miles, so flights longer than that (for example, the Honolulu routes) fall outside every bin. These flights occur far less frequently than any of the defined distance bins, which makes them the lowest-probability group under random selection.
Overall, the rarity of very long-distance flights reflects travel demand and route availability rather than unusual or anomalous behavior.
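The NA category in the plot can be traced to how `cut()` handles values beyond the last break. The sketch below uses a few made-up distances to show the effect, and one possible way to avoid it by adding an open-ended top bin. The "Ultra Long" label and the example distances are assumptions made for illustration, not part of the original analysis.

```r
# Hypothetical distances in miles; 4983 stands in for a flight
# well beyond the 3000-mile upper break
distances <- c(200, 800, 1500, 2500, 4983)

# Original breaks: anything above 3000 falls outside every bin and becomes NA
original_bins <- cut(distances,
                     breaks = c(0, 500, 1000, 2000, 3000),
                     labels = c("Short", "Medium", "Long", "Very Long"),
                     include.lowest = TRUE)

# Open-ended variant: Inf as the last break captures all longer flights.
# "Ultra Long" is a made-up label for this sketch.
open_bins <- cut(distances,
                 breaks = c(0, 500, 1000, 2000, 3000, Inf),
                 labels = c("Short", "Medium", "Long", "Very Long", "Ultra Long"),
                 include.lowest = TRUE)

data.frame(distance = distances, original_bins, open_bins)
```

With the open-ended breaks, the longest flights get their own labeled group instead of being reported as NA, which makes the rarest category easier to interpret.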
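# Origin-carrier pairs that actually occur (uses the full flights table,
# including flights with missing delay values)
origin_carrier <- flights |>
  distinct(origin, carrier)
# Every possible origin-carrier pair
all_combinations <- expand_grid(
  origin = unique(flights$origin),
  carrier = unique(flights$carrier)
)
# Pairs that never occur in the data
missing_combinations <- all_combinations |>
  anti_join(origin_carrier, by = c("origin", "carrier"))
# Number of flights for each pair that does occur
origin_carrier_counts <- flights |>
  count(origin, carrier, sort = TRUE)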
# geom_col() stacks the per-carrier counts, so each bar's height is the
# total number of flights from that airport
ggplot(origin_carrier_counts, aes(x = origin, y = n, fill = origin)) +
  geom_col() +
  scale_fill_manual(
    values = c("EWR" = "#4E79A7", "JFK" = "#F28E2B", "LGA" = "#59A14F")
  ) +
  labs(
    title = "Carrier Frequency by Origin Airport",
    x = "Origin Airport",
    y = "Number of Flights"
  )
This visualization compares the total number of flights departing from each NYC origin airport. By limiting color to origin airport, the plot emphasizes overall differences in flight volume rather than individual carrier composition.
EWR shows the highest total number of departures, followed by JFK and then LGA. From a probability perspective, a randomly selected flight from the dataset is most likely to originate from EWR and least likely to originate from LGA.
While individual carriers are no longer distinguished by color, the underlying counts still reflect airport-specific operational differences driven by airline hub strategies and airport capacity.
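The `missing_combinations` table built above is computed but never displayed. As a minimal sketch of how those never-observed pairs could be brought into the same probability framing, `tidyr::complete()` can fill in the absent origin-carrier pairs with an explicit count of zero. The names `origin_carrier_full` and `zero_pairs` are introduced here for illustration only.

```r
library(dplyr)
library(tidyr)
library(nycflights13)

# Count flights per origin-carrier pair, then add the pairs that never
# occur, giving them an explicit count of zero
origin_carrier_full <- flights |>
  count(origin, carrier) |>
  complete(origin, carrier, fill = list(n = 0L)) |>
  mutate(probability = n / sum(n))

# Pairs with no flights at all: probability zero under random selection
zero_pairs <- origin_carrier_full |>
  filter(n == 0)

zero_pairs
```

These zero-count pairs are the extreme case of a rare group: a randomly selected flight can never belong to them, which most plausibly reflects carriers that simply do not serve a given airport.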