library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
deaths <- read.csv('NCHS_cleaned.csv')   # cleaned NCHS leading-causes-of-death data
The goal of this week’s data dive is to critically evaluate how sampling variability can influence the conclusions drawn from data analysis. Rather than assuming that the full dataset is always observed, this analysis simulates the process of collecting data from a population by repeatedly drawing random sub‑samples from the dataset.
By comparing these sub‑samples, this analysis examines how perceived patterns, anomalies, and consistencies can change depending on the portion of the data that is observed. This exercise highlights potential pitfalls when making conclusions based on limited or partial data.
_______________________________________________________________________________________________________________
To simulate repeated data collection, multiple random samples were drawn with replacement from the full dataset. The dataset itself is treated as the population, and each sub‑sample represents a hypothetical observed dataset.
Each sub‑sample:
Includes 25% of the total dataset
Is drawn with replacement
Is drawn independently, with the sampling process repeated three times
set.seed(123)

sample_frac <- 0.25   # each sub-sample contains 25% of the rows
n_samples <- 3        # the sampling process is repeated three times

df_samples <- tibble()

for (sample_i in 1:n_samples) {
  # draw one sub-sample with replacement and tag it with its sample number
  df_i <- deaths |>
    sample_n(
      size = sample_frac * nrow(deaths),
      replace = TRUE
    ) |>
    mutate(sample_num = sample_i)

  df_samples <- bind_rows(df_samples, df_i)
}
df_samples |>
  count(sample_num)
## # A tibble: 3 × 2
##   sample_num     n
##        <int> <int>
## 1          1  2717
## 2          2  2717
## 3          3  2717
Each sub‑sample contains the same number of observations. Any differences observed across sub‑samples are therefore attributable to random sampling, not differences in sample size.
To compare how conclusions might differ across samples, total deaths were summarized by cause of death within each sub‑sample.
sample_cause_summary <- df_samples |>
  filter(Cause.Name != "All causes") |>   # drop the aggregate "All causes" rows
  group_by(sample_num, Cause.Name) |>
  summarise(
    total_deaths = sum(Deaths, na.rm = TRUE),
    .groups = "drop"
  )
sample_cause_summary
## # A tibble: 30 × 3
##    sample_num Cause.Name              total_deaths
##         <int> <chr>                          <int>
##  1          1 Alzheimer's disease           817636
##  2          1 CLRD                         1052171
##  3          1 Cancer                       6626979
##  4          1 Diabetes                      713546
##  5          1 Heart disease                6189532
##  6          1 Influenza and pneumonia       494975
##  7          1 Kidney disease                579512
##  8          1 Stroke                       1283621
##  9          1 Suicide                       217018
## 10          1 Unintentional injuries        919956
## # ℹ 20 more rows
Although each sub‑sample is drawn from the same underlying dataset, noticeable differences appear in the total deaths attributed to certain causes. Some causes appear more prominent in one sub‑sample than another, even though no actual change occurred in the population.
This demonstrates how conclusions about which causes are “most significant” can shift depending on which portion of the data is observed.
In some sub‑samples, particular causes of death appear unusually high or low relative to others. However, these apparent anomalies are not consistent across all sub‑samples.
A cause that might be flagged as unusual in one sub‑sample often appears typical in another. This suggests that what may initially appear to be an anomaly can instead be the result of random sampling variation, rather than a meaningful underlying difference.
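One way to see this concretely is to compare each sub‑sample against the full dataset. The sketch below is an extension of the analysis, not part of the original code: it scales each cause's full‑dataset total by the 25% sampling fraction to get an expected sub‑sample total, then reports the percent deviation of every observed total from that expectation (the names full_cause_totals, expected, and pct_deviation are introduced here only for illustration). If sampling variation is driving the apparent anomalies, the causes flagged by this measure should differ from one sub‑sample to the next.
full_cause_totals <- deaths |>
  filter(Cause.Name != "All causes") |>
  group_by(Cause.Name) |>
  summarise(full_total = sum(Deaths, na.rm = TRUE), .groups = "drop")

sample_cause_summary |>
  left_join(full_cause_totals, by = "Cause.Name") |>
  mutate(
    expected      = sample_frac * full_total,   # expected total for a 25% sub-sample
    pct_deviation = 100 * (total_deaths - expected) / expected
  ) |>
  arrange(desc(abs(pct_deviation)))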
Despite variability, several patterns remain consistent across all sub‑samples:
Causes with the highest overall prevalence in the full dataset remain among the most common in each sub‑sample.
The general ordering of the most prevalent causes is relatively stable.
Rare causes do not become dominant in any sub‑sample.
These consistencies suggest that strong signals in the data are robust to sampling variability, while weaker signals are more sensitive to which observations are included.
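As a quick check of that ordering claim, the sketch below ranks causes by total deaths within each sub‑sample and spreads the ranks into one column per sub‑sample, so any reshuffling between sub‑samples is easy to spot. The rank column and the sample_1, sample_2, sample_3 names are introduced here only for illustration.
sample_cause_summary |>
  group_by(sample_num) |>
  mutate(rank = min_rank(desc(total_deaths))) |>   # rank 1 = most deaths within that sub-sample
  ungroup() |>
  select(Cause.Name, sample_num, rank) |>
  pivot_wider(
    names_from   = sample_num,
    values_from  = rank,
    names_prefix = "sample_"
  ) |>
  arrange(sample_1)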
ggplot(sample_cause_summary, aes(
  x = total_deaths,
  y = reorder(Cause.Name, total_deaths)
)) +
  geom_col(fill = "steelblue") +
  facet_wrap(~ sample_num, scales = "free_x") +
  scale_x_continuous(
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Total Deaths by Cause Across Random Sub-Samples",
    x = "Total Deaths (Millions)",
    y = "Cause of Death"
  ) +
  theme_minimal()
As the relative size of the sub‑samples increases (for example, moving from 10% to 25% or larger portions of the dataset), differences between sub‑samples become less pronounced. Larger samples more closely resemble the full dataset, making conclusions more stable and reducing the likelihood of observing misleading anomalies.
Smaller sub‑samples, by contrast, show greater variability and are more susceptible to random fluctuations that can distort interpretation.
sample_sizes <- c(0.10, 0.25, 0.75)

# draw one sub-sample at each fraction and total deaths by cause
sample_size_comparison <- map_df(sample_sizes, function(frac) {
  deaths |>
    sample_n(size = frac * nrow(deaths), replace = TRUE) |>
    filter(Cause.Name != "All causes") |>
    group_by(Cause.Name) |>
    summarise(
      total_deaths = sum(Deaths),
      .groups = "drop"
    ) |>
    mutate(sample_frac = frac)
})
sample_size_comparison
## # A tibble: 30 × 3
##    Cause.Name              total_deaths sample_frac
##    <chr>                          <int>       <dbl>
##  1 Alzheimer's disease           270980         0.1
##  2 CLRD                          682840         0.1
##  3 Cancer                       2755335         0.1
##  4 Diabetes                      137501         0.1
##  5 Heart disease                1920661         0.1
##  6 Influenza and pneumonia       149665         0.1
##  7 Kidney disease                 64373         0.1
##  8 Stroke                       1171963         0.1
##  9 Suicide                       110540         0.1
## 10 Unintentional injuries        570291         0.1
## # ℹ 20 more rows
As sample size increases, variability in total deaths by cause decreases. Smaller samples exaggerate random fluctuations, while larger samples more closely resemble the full dataset. This reinforces the importance of sufficient sample size when drawing conclusions.
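To put a rough number on that claim, the following sketch (an addition, not part of the original analysis) repeats the sampling 50 times at each fraction for a single cause, heart disease, and reports the standard deviation of the sampled totals across repetitions; larger fractions should show a smaller spread. The seed, the 50 repetitions, and the focus on heart disease are arbitrary choices made only for illustration.
set.seed(456)  # arbitrary seed so this sketch is reproducible

variability_by_size <- map_df(c(0.10, 0.25, 0.75), function(frac) {
  map_df(1:50, function(rep_i) {
    deaths |>
      sample_n(size = round(frac * nrow(deaths)), replace = TRUE) |>
      filter(Cause.Name == "Heart disease") |>
      summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
      mutate(sample_frac = frac, rep = rep_i)
  })
})

# spread of the sampled totals across the 50 repetitions, by sampling fraction
variability_by_size |>
  group_by(sample_frac) |>
  summarise(sd_total = sd(total_deaths), .groups = "drop")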
This investigation demonstrates that conclusions drawn from a single dataset may be sensitive to sampling uncertainty, especially when focusing on smaller groups or subtle differences. Apparent anomalies or strong patterns observed in one sample may not persist when the data is resampled.
As a result, conclusions should be framed cautiously and supported by repeated analyses, larger samples, or population‑adjusted measures such as age‑adjusted death rates.
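As one example of that last point, if the cleaned file keeps the age‑adjusted rate column from the original NCHS release, the same kind of comparison can be run on rates rather than raw counts. The column name Age.adjusted.Death.Rate used below is an assumption about the cleaned file, so the sketch checks that it exists before using it.
# Age.adjusted.Death.Rate is an assumed column name; skip the sketch if it is absent
if ("Age.adjusted.Death.Rate" %in% names(deaths)) {
  deaths |>
    filter(Cause.Name != "All causes") |>
    group_by(Cause.Name) |>
    summarise(
      mean_adjusted_rate = mean(Age.adjusted.Death.Rate, na.rm = TRUE),
      .groups = "drop"
    ) |>
    arrange(desc(mean_adjusted_rate))
}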
This week’s data dive explored how random sampling variability can affect observed patterns, anomalies, and conclusions in data analysis. By simulating repeated data collection through random sub‑sampling, it became clear that not all observed differences reflect meaningful underlying structure.
The findings emphasize the importance of acknowledging uncertainty when interpreting results, particularly when working with observational data. Future analyses should consider robustness checks and repeated sampling to avoid over‑interpreting random variation as meaningful insight.