The purpose of this data dive is to examine how conclusions drawn from data can change depending on the specific sample observed. By repeatedly sampling from the same dataset, we simulate the variability that arises when collecting data from a population and investigate how anomalies and patterns may appear or disappear across samples.
We generate multiple random sub-samples (with replacement) from the original dataset. Each sub-sample represents a possible realization of data collected from the same underlying population.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
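# Keep only flights with a recorded departure delay
df <- flights |>
  filter(!is.na(dep_delay))
sample_frac <- 0.25  # each sub-sample is 25% of the filtered data
n_samples <- 3       # number of sub-samples to draw
df_samples <- tibble()
for (sample_i in 1:n_samples) {
  # Draw one sub-sample with replacement and tag it with its index
  df_i <- df |>
    sample_n(size = sample_frac * nrow(df), replace = TRUE) |>
    mutate(sample_num = sample_i)
  df_samples <- bind_rows(df_samples, df_i)
}
df_samples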
## # A tibble: 246,390 × 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 28 1639 1612 27 1923 1908
## 2 2013 3 6 1600 1600 0 1916 1934
## 3 2013 7 29 803 800 3 1028 1036
## 4 2013 10 14 1154 1200 -6 1453 1515
## 5 2013 2 25 1646 1649 -3 1754 1816
## 6 2013 2 22 909 914 -5 1152 1210
## 7 2013 8 24 1810 1813 -3 2053 2128
## 8 2013 11 19 1924 1930 -6 2229 2249
## 9 2013 4 8 1648 1640 8 1903 1845
## 10 2013 11 20 1004 1005 -1 1210 1212
## # ℹ 246,380 more rows
## # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, sample_num <int>
df_samples |>
  ggplot(aes(x = dep_delay)) +
  geom_histogram(bins = 40, fill = "grey70", color = "white") +
  coord_cartesian(xlim = c(-50, 300)) +
  facet_wrap(~ sample_num) +
  labs(
    title = "Distribution of Departure Delays Across Sub-samples",
    x = "Departure Delay (minutes)",
    y = "Number of Flights"
  ) +
  theme_classic()
The distributions of departure delays across the three sub-samples are remarkably similar in shape. All sub-samples exhibit a strong right-skew, with a large concentration of flights clustered near zero minutes of delay and a long tail of extreme positive delays. This consistency suggests that the overall structure of departure delays is stable across repeated samples drawn from the same dataset. While individual extreme delay events may differ slightly between sub-samples, the presence of a heavy right tail is a persistent feature rather than an anomaly unique to any single sample.
As a result, what might appear to be an unusual pattern in one sub-sample (such as several large delays) is better understood as a recurring characteristic of the data-generating process rather than a true anomaly.
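The visual impression of a stable right tail can also be checked numerically. The following is a minimal sketch, not part of the original analysis: it redraws three 25% sub-samples and compares a few tail summaries for each. The seed, the `tail_check` name, and the choice of quantiles are assumptions made for illustration, so the numbers will not match the sub-samples plotted above.

```r
library(dplyr)
library(nycflights13)

set.seed(2013)  # hypothetical seed, used only so this sketch is repeatable

delays <- flights |>
  filter(!is.na(dep_delay))

# Draw three 25% sub-samples with replacement, then summarise each one
tail_check <- bind_rows(lapply(1:3, function(i) {
  delays |>
    slice_sample(prop = 0.25, replace = TRUE) |>
    mutate(sample_num = i)
})) |>
  group_by(sample_num) |>
  summarise(
    mean_delay    = mean(dep_delay),
    median_delay  = median(dep_delay),
    p95_delay     = quantile(dep_delay, 0.95, names = FALSE),
    p99_delay     = quantile(dep_delay, 0.99, names = FALSE),
    max_delay     = max(dep_delay),
    share_over_2h = mean(dep_delay > 120)  # fraction of flights delayed > 120 min
  )

tail_check
```

If the mean sits well above the median in every row and the upper quantiles are close to one another, the heavy right tail is a feature of the data rather than of any one sub-sample. The maximum is the summary most likely to differ, since it depends on whether a handful of extreme flights happen to be drawn.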
sample_carrier_summary <- df_samples |>
  group_by(sample_num, carrier) |>
  summarise(
    avg_dep_delay = mean(dep_delay),
    flights = n(),
    .groups = "drop"
  )
sample_carrier_summary
## # A tibble: 48 × 4
## sample_num carrier avg_dep_delay flights
## <int> <chr> <dbl> <int>
## 1 1 9E 15.2 4405
## 2 1 AA 8.45 8144
## 3 1 AS 7.45 164
## 4 1 B6 13.0 13506
## 5 1 DL 9.33 11981
## 6 1 EV 19.8 12784
## 7 1 F9 19.0 197
## 8 1 FL 18.9 779
## 9 1 HA -0.585 82
## 10 1 MQ 11.1 6206
## # ℹ 38 more rows
This table compares average departure delays by airline across three random sub-samples of the same dataset. Large carriers with many flights (such as AA, DL, UA, and B6) exhibit highly consistent average delays across all sub-samples, suggesting that their estimated performance is stable and not sensitive to sampling variability.
In contrast, carriers with relatively few flights (such as OO, HA, AS, and F9) show substantial variation in average departure delays across sub-samples. In some cases, these carriers appear to have unusually high or even negative average delays in one sub-sample but not in others.
This demonstrates that what might be labeled an anomaly in one sub-sample, such as an airline appearing unusually delayed, may simply be a result of small sample size rather than a true underlying pattern. Sampling variability can therefore strongly influence conclusions, particularly for groups with limited data.
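The link between group size and instability can be made explicit by measuring, for each carrier, how far its average delay moves between sub-samples. The sketch below is one possible way to do this and is not the original author's code: it redraws three sub-samples with a hypothetical seed, and the `carrier_spread` name and its columns are introduced only for illustration.

```r
library(dplyr)
library(nycflights13)

set.seed(2013)  # hypothetical seed, used only so this sketch is repeatable

delays <- flights |>
  filter(!is.na(dep_delay))

carrier_spread <- bind_rows(lapply(1:3, function(i) {
  delays |>
    slice_sample(prop = 0.25, replace = TRUE) |>
    mutate(sample_num = i)
})) |>
  # Average delay per carrier within each sub-sample
  group_by(sample_num, carrier) |>
  summarise(avg_dep_delay = mean(dep_delay), flights = n(), .groups = "drop") |>
  # How much that average moves across the sub-samples
  group_by(carrier) |>
  summarise(
    n_subsamples = n(),
    mean_flights = mean(flights),
    min_avg      = min(avg_dep_delay),
    max_avg      = max(avg_dep_delay),
    range_avg    = max_avg - min_avg
  ) |>
  arrange(mean_flights)

carrier_spread
```

Sorting by `mean_flights` puts the smallest carriers first, which makes it easy to see whether the widest ranges belong to the carriers with the fewest flights.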
sample_sizes <- c(0.10, 0.25, 0.75)
size_comparison <- map_df(sample_sizes, function(frac) {
  df |>
    sample_n(size = frac * nrow(df), replace = TRUE) |>
    summarise(
      sample_frac = frac,
      mean_dep_delay = mean(dep_delay),
      sd_dep_delay = sd(dep_delay)
    )
})
size_comparison
## # A tibble: 3 × 3
## sample_frac mean_dep_delay sd_dep_delay
## <dbl> <dbl> <dbl>
## 1 0.1 12.7 40.3
## 2 0.25 12.5 39.9
## 3 0.75 12.6 40.1
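With a single draw at each sub-sample size, the summary statistics barely move: the mean departure delay stays between 12.5 and 12.7 minutes and the standard deviation between 39.9 and 40.3 minutes, with no consistent trend as the fraction increases. This is expected. The sample standard deviation estimates the spread of individual delays, which does not shrink as more flights are sampled; what shrinks is the uncertainty of the estimates themselves, roughly in proportion to 1/√n.
Even the 10% sub-sample contains roughly 33,000 flights, which is large enough to average out most rare or extreme delays. Much smaller sub-samples would show greater fluctuation in both the mean and the standard deviation, making extreme delay events more influential in shaping perceived patterns. Larger sub-samples reduce this influence, leading to more reliable summaries of the underlying data.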
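A single draw per fraction cannot show how stable an estimate is, so one way to examine the effect of sub-sample size is to repeat the draw many times at each fraction and measure how much the estimates vary. The sketch below is an illustration under stated assumptions: the number of repetitions (`n_reps`), the seed, and the `stability` table are hypothetical choices, and `approx_se` uses the textbook approximation sd / sqrt(n) for the standard error of a mean.

```r
library(dplyr)
library(nycflights13)

set.seed(2013)  # hypothetical seed, used only so this sketch is repeatable

delay_values <- flights |>
  filter(!is.na(dep_delay)) |>
  pull(dep_delay)

n_reps <- 100                  # hypothetical number of repeated draws per fraction
fracs  <- c(0.10, 0.25, 0.75)

stability <- bind_rows(lapply(fracs, function(frac) {
  n <- floor(frac * length(delay_values))
  # Each column of `draws` holds the mean and sd of one resampled sub-sample
  draws <- replicate(n_reps, {
    x <- sample(delay_values, size = n, replace = TRUE)
    c(mean = mean(x), sd = sd(x))
  })
  tibble(
    sample_frac    = frac,
    n              = n,
    spread_of_mean = sd(draws["mean", ]),          # variability of the mean across draws
    spread_of_sd   = sd(draws["sd", ]),            # variability of the sd across draws
    approx_se      = sd(delay_values) / sqrt(n)    # textbook approximation
  )
}))

stability
```

`spread_of_mean` should fall as the fraction grows and track `approx_se` closely, while the typical value of the standard deviation itself stays near the same level at every fraction.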
The results of this analysis demonstrate that conclusions drawn from a single sample can be misleading, particularly when based on small groups or rare events. Apparent anomalies observed in one sub-sample may not persist across other sub-samples and are often driven by sampling variability rather than true underlying differences.
This highlights the importance of considering sample size and group size when interpreting exploratory findings. Conclusions based on larger samples and well-represented groups are more reliable, while findings based on small samples should be treated with caution and validated through repeated sampling or formal statistical methods.
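As one example of validating a finding through repeated sampling, a percentile bootstrap interval puts a range around a carrier's average delay. This is a minimal sketch, not a method used in the original analysis: `boot_mean_ci` is a hypothetical helper, the seed and `n_boot` are arbitrary, and HA and UA are chosen only to contrast a small carrier with a large one.

```r
library(dplyr)
library(nycflights13)

set.seed(2013)  # hypothetical seed, used only so this sketch is repeatable

delays <- flights |>
  filter(!is.na(dep_delay))

# Hypothetical helper: percentile bootstrap interval for the mean of x
boot_mean_ci <- function(x, n_boot = 1000, level = 0.95) {
  boot_means <- replicate(
    n_boot,
    mean(sample(x, size = length(x), replace = TRUE))
  )
  alpha <- (1 - level) / 2
  ci <- quantile(boot_means, c(alpha, 1 - alpha), names = FALSE)
  tibble(n = length(x), mean = mean(x), ci_low = ci[1], ci_high = ci[2])
}

carrier_ci <- delays |>
  filter(carrier %in% c("HA", "UA")) |>
  group_by(carrier) |>
  group_modify(~ boot_mean_ci(.x$dep_delay)) |>
  ungroup() |>
  mutate(ci_width = ci_high - ci_low)

carrier_ci
```

A wide interval for the small carrier signals that an unusually high or low average in any one sub-sample is weak evidence of a real difference, whereas a narrow interval for the large carrier supports treating its average as stable.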
By examining multiple random sub-samples of the same dataset, this data dive demonstrates how sampling variability influences perceived patterns and anomalies. While the overall structure of departure delays remains stable across sub-samples, estimates for small groups vary substantially, leading to apparent anomalies that do not consistently replicate.
Increasing sub-sample size reduces this variability and produces more stable and reliable summaries. These findings emphasize the need for cautious interpretation during exploratory analysis and reinforce the importance of validating observed patterns before drawing substantive conclusions.