Week 4 Data Dive: Sampling Variability and Anomalies

The purpose of this data dive is to examine how conclusions drawn from data can change depending on the specific sample observed. By repeatedly sampling from the same dataset, we simulate the variability that arises when collecting data from a population and investigate how anomalies and patterns may appear or disappear across samples.


Creating Random Sub-samples

We draw three random sub-samples from the original dataset, sampling with replacement. Each sub-sample contains 25% as many rows as the filtered data (82,130 flights) and represents one possible realization of data collected from the same underlying population.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)


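# Keep only flights with a recorded departure delay
df <- flights |>
  filter(!is.na(dep_delay))

sample_frac <- 0.25  # each sub-sample is 25% the size of df
n_samples <- 3

# No seed is set, so the rows drawn differ from run to run
df_samples <- tibble()

for (sample_i in 1:n_samples) {
  # Draw one sub-sample with replacement and tag it with its index
  df_i <- df |>
    sample_n(size = sample_frac * nrow(df), replace = TRUE) |>
    mutate(sample_num = sample_i)

  # Stack the sub-samples into one long data frame
  df_samples <- bind_rows(df_samples, df_i)
}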

df_samples
## # A tibble: 246,390 × 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    28     1639           1612        27     1923           1908
##  2  2013     3     6     1600           1600         0     1916           1934
##  3  2013     7    29      803            800         3     1028           1036
##  4  2013    10    14     1154           1200        -6     1453           1515
##  5  2013     2    25     1646           1649        -3     1754           1816
##  6  2013     2    22      909            914        -5     1152           1210
##  7  2013     8    24     1810           1813        -3     2053           2128
##  8  2013    11    19     1924           1930        -6     2229           2249
##  9  2013     4     8     1648           1640         8     1903           1845
## 10  2013    11    20     1004           1005        -1     1210           1212
## # ℹ 246,380 more rows
## # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, sample_num <int>

Comparing Delay Distributions Across Sub-samples

df_samples |>
  ggplot(aes(x = dep_delay)) +
  geom_histogram(bins = 40, fill = "grey70", color = "white") +
  coord_cartesian(xlim = c(-50, 300)) +
  facet_wrap(~ sample_num) +
  labs(
    title = "Distribution of Departure Delays Across Sub-samples",
    x = "Departure Delay (minutes)",
    y = "Number of Flights"
  ) +
  theme_classic()

Findings and Interpretation

The distributions of departure delays across the three sub-samples are remarkably similar in shape. All sub-samples exhibit a strong right-skew, with a large concentration of flights clustered near zero minutes of delay and a long tail of extreme positive delays. This consistency suggests that the overall structure of departure delays is stable across repeated samples drawn from the same dataset. While individual extreme delay events may differ slightly between sub-samples, the presence of a heavy right tail is a persistent feature rather than an anomaly unique to any single sample.

As a result, what might appear to be an unusual pattern in one sub-sample (such as several large delays) is better understood as a recurring characteristic of the data-generating process rather than a true anomaly.
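
The visual similarity of the histograms can also be checked numerically. The following is a minimal sketch, not part of the original analysis: it redraws three sub-samples and compares the median, upper quantiles, and maximum of each. The seed value and the name `quantile_summary` are assumptions introduced for illustration, and `slice_sample()` is used in place of `sample_n()`.

```r
library(dplyr)
library(nycflights13)

set.seed(4)  # hypothetical seed, used only to make the sketch reproducible

df <- flights |>
  filter(!is.na(dep_delay))

# One row of summary statistics per sub-sample
quantile_summary <- lapply(1:3, function(i) {
  df |>
    slice_sample(prop = 0.25, replace = TRUE) |>
    summarise(
      sample_num = i,
      median_delay = median(dep_delay),
      p90_delay = unname(quantile(dep_delay, 0.90)),
      p99_delay = unname(quantile(dep_delay, 0.99)),
      max_delay = max(dep_delay)
    )
}) |>
  bind_rows()

quantile_summary
```

If the heavy right tail is a stable feature, the median and the 90th and 99th percentiles should agree closely across rows, while the maximum, which depends on a single flight, can differ more.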


Airline-Level Summaries Across Sub-samples

sample_carrier_summary <- df_samples |>
  group_by(sample_num, carrier) |>
  summarise(
    avg_dep_delay = mean(dep_delay),
    flights = n(),
    .groups = "drop"
  )

sample_carrier_summary
## # A tibble: 48 × 4
##    sample_num carrier avg_dep_delay flights
##         <int> <chr>           <dbl>   <int>
##  1          1 9E             15.2      4405
##  2          1 AA              8.45     8144
##  3          1 AS              7.45      164
##  4          1 B6             13.0     13506
##  5          1 DL              9.33    11981
##  6          1 EV             19.8     12784
##  7          1 F9             19.0       197
##  8          1 FL             18.9       779
##  9          1 HA             -0.585      82
## 10          1 MQ             11.1      6206
## # ℹ 38 more rows

Findings and Interpretation

This table compares average departure delays by airline across three random sub-samples of the same dataset. Large carriers with many flights (such as AA, DL, UA, and B6) exhibit highly consistent average delays across all sub-samples, suggesting that their estimated performance is stable and not sensitive to sampling variability.

In contrast, carriers with relatively few flights (such as OO, HA, AS, and F9) show substantial variation in average departure delays across sub-samples. In some cases, these carriers appear to have unusually high or even negative average delays in one sub-sample but not in others.

This demonstrates that what might be labeled an anomaly in one sub-sample, such as an airline appearing unusually delayed, may simply be a result of small sample size rather than a true underlying pattern. Sampling variability can therefore strongly influence conclusions, particularly for groups with limited data.
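
The link between group size and instability can be made explicit by repeating the sub-sampling more than three times and measuring how much each carrier's average moves. The sketch below is an illustration under assumptions, not the original method: the seed, the number of repetitions (`n_reps`), and the names `carrier_means` and `carrier_spread` are all hypothetical.

```r
library(dplyr)
library(nycflights13)

set.seed(4)   # hypothetical seed for reproducibility
n_reps <- 20  # hypothetical number of repeated sub-samples

df <- flights |>
  filter(!is.na(dep_delay))

# Average delay and flight count per carrier within each sub-sample
carrier_means <- lapply(seq_len(n_reps), function(i) {
  df |>
    slice_sample(prop = 0.25, replace = TRUE) |>
    group_by(carrier) |>
    summarise(
      avg_dep_delay = mean(dep_delay),
      flights = n(),
      .groups = "drop"
    ) |>
    mutate(sample_num = i)
}) |>
  bind_rows()

# Spread of each carrier's average across sub-samples, smallest carriers first
carrier_spread <- carrier_means |>
  group_by(carrier) |>
  summarise(
    mean_flights = mean(flights),
    sd_of_avg_delay = sd(avg_dep_delay),
    .groups = "drop"
  ) |>
  arrange(mean_flights)

carrier_spread
```

Carriers near the top of `carrier_spread` have the fewest flights per sub-sample, so their `sd_of_avg_delay` would be expected to be the largest.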


Effect of Increasing Sub-sample Size

sample_sizes <- c(0.10, 0.25, 0.75)

size_comparison <- map_df(sample_sizes, function(frac) {
  df |>
    sample_n(size = frac * nrow(df), replace = TRUE) |>
    summarise(
      sample_frac = frac,
      mean_dep_delay = mean(dep_delay),
      sd_dep_delay = sd(dep_delay)
    )
})

size_comparison
## # A tibble: 3 × 3
##   sample_frac mean_dep_delay sd_dep_delay
##         <dbl>          <dbl>        <dbl>
## 1        0.1            12.7         40.3
## 2        0.25           12.5         39.9
## 3        0.75           12.6         40.1

Findings and Interpretation

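The mean departure delay is nearly identical across the three sub-sample sizes (12.7, 12.5, and 12.6 minutes), and the standard deviation stays close to 40 minutes (40.3, 39.9, and 40.1) with no systematic trend. This is expected: the sample standard deviation estimates the spread of individual delays in the population, which does not shrink as more flights are sampled. What shrinks with sample size is the standard error of each estimate, roughly in proportion to 1/sqrt(n). Even the 10% sub-sample contains more than 32,000 flights, so all three sets of estimates are already stable.

Because each size was drawn only once, this table cannot show how much the estimates fluctuate at a given size. In general, estimates from smaller sub-samples vary more from draw to draw, so rare or extreme delay events have more influence on perceived patterns. Larger sub-samples reduce this influence, leading to more reliable summaries of the underlying data.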
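
One way to observe the effect of sample size directly is to repeat the draw many times at each size and look at how much the sample mean varies between draws. The sketch below does this with base R's `sample()` on the delay vector. The seed, the number of repetitions (`n_reps`), and the name `se_comparison` are assumptions made for illustration. The `theoretical_se` column uses the usual approximation sd / sqrt(n).

```r
library(dplyr)
library(nycflights13)

set.seed(4)   # hypothetical seed for reproducibility
n_reps <- 30  # hypothetical number of repeated draws per size

delays <- flights |>
  filter(!is.na(dep_delay)) |>
  pull(dep_delay)

sample_sizes <- c(0.10, 0.25, 0.75)

se_comparison <- lapply(sample_sizes, function(frac) {
  n <- floor(frac * length(delays))
  # Mean delay in each of n_reps sub-samples drawn with replacement
  means <- replicate(n_reps, mean(sample(delays, n, replace = TRUE)))
  tibble(
    sample_frac = frac,
    n = n,
    sd_of_means = sd(means),                   # observed draw-to-draw spread
    theoretical_se = sd(delays) / sqrt(n)      # approximate standard error
  )
}) |>
  bind_rows()

se_comparison
```

The `sd_of_means` column should fall as `n` grows and track `theoretical_se`, whereas the standard deviation of the delays themselves stays roughly constant.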


Implication for Drawing Conclusions

The results of this analysis demonstrate that conclusions drawn from a single sample can be misleading, particularly when based on small groups or rare events. Apparent anomalies observed in one sub-sample may not persist across other sub-samples and are often driven by sampling variability rather than true underlying differences.

This highlights the importance of considering sample size and group size when interpreting exploratory findings. Conclusions based on larger samples and well-represented groups are more reliable, while findings based on small samples should be treated with caution and validated through repeated sampling or formal statistical methods.
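
As one example of validating a small-group finding, a percentile bootstrap interval can put a range around a small carrier's average delay. This is a minimal sketch under assumptions: the choice of carrier (HA), the seed, the 2,000 resamples, and the names `ha_delays`, `boot_means`, and `ci` are all illustrative and not taken from the original analysis.

```r
library(dplyr)
library(nycflights13)

set.seed(4)  # hypothetical seed for reproducibility

# Departure delays for one small carrier (HA chosen only as an example)
ha_delays <- flights |>
  filter(carrier == "HA", !is.na(dep_delay)) |>
  pull(dep_delay)

# Resample the carrier's flights with replacement and record each mean
boot_means <- replicate(2000, mean(sample(ha_delays, replace = TRUE)))

# Percentile bootstrap 95% interval for the mean departure delay
ci <- quantile(boot_means, c(0.025, 0.975))

mean(ha_delays)
ci
```

A wide interval signals that the carrier's average is poorly determined by the available flights, so a high or negative value in any single sub-sample should not be read as a real difference.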


Conclusion

By examining multiple random sub-samples of the same dataset, this data dive demonstrates how sampling variability influences perceived patterns and anomalies. While the overall structure of departure delays remains stable across sub-samples, estimates for small groups vary substantially, leading to apparent anomalies that do not consistently replicate.

Increasing sub-sample size reduces this variability and produces more stable and reliable summaries. These findings emphasize the need for cautious interpretation during exploratory analysis and reinforce the importance of validating observed patterns before drawing substantive conclusions.