library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
deaths <- read.csv('NCHS_cleaned.csv')

Week 4 Data Dive

Data Dive Objective

The goal of this week’s data dive is to critically evaluate how sampling variability can influence the conclusions drawn from data analysis. Rather than assuming that the full dataset is always observed, this analysis simulates the process of collecting data from a population by repeatedly drawing random sub‑samples from the dataset.

By comparing these sub‑samples, this analysis examines how perceived patterns, anomalies, and consistencies can change depending on the portion of the data that is observed. This exercise highlights potential pitfalls when making conclusions based on limited or partial data.

_______________________________________________________________________________________________________________

Simulating Random Sub‑Samples

To simulate repeated data collection, multiple random samples were drawn with replacement from the full dataset. The dataset itself is treated as the population, and each sub‑sample represents a hypothetical observed dataset.

Each sub‑sample:

  • Contains a number of observations equal to 25% of the full dataset

  • Is drawn with replacement, so the same row can appear more than once

  • Is one of three independent draws

set.seed(123)

sample_frac <- 0.25   # each sub-sample draws 25% of the rows
n_samples <- 3        # number of sub-samples to draw

df_samples <- tibble()

# Draw each sub-sample with replacement and tag it with its sample number
for (sample_i in 1:n_samples) {
  df_i <- deaths |>
    sample_n(
      size = sample_frac * nrow(deaths),
      replace = TRUE
    ) |>
    mutate(sample_num = sample_i)

  df_samples <- bind_rows(df_samples, df_i)
}

Sub-sample Structure

df_samples |>
  count(sample_num)
## # A tibble: 3 × 2
##   sample_num     n
##        <int> <int>
## 1          1  2717
## 2          2  2717
## 3          3  2717

Each sub‑sample contains the same number of observations. Any differences observed across sub‑samples are therefore attributable to random sampling, not differences in sample size.

Scrutinizing Sub‑Samples by Cause of Death

To compare how conclusions might differ across samples, total deaths were summarized by cause of death within each sub‑sample.

# Total deaths by cause within each sub-sample (the "All causes" aggregate is excluded)
sample_cause_summary <- df_samples |>
  filter(Cause.Name != "All causes") |>
  group_by(sample_num, Cause.Name) |>
  summarise(
    total_deaths = sum(Deaths, na.rm = TRUE),
    .groups = "drop"
  )

sample_cause_summary
## # A tibble: 30 × 3
##    sample_num Cause.Name              total_deaths
##         <int> <chr>                          <int>
##  1          1 Alzheimer's disease           817636
##  2          1 CLRD                         1052171
##  3          1 Cancer                       6626979
##  4          1 Diabetes                      713546
##  5          1 Heart disease                6189532
##  6          1 Influenza and pneumonia       494975
##  7          1 Kidney disease                579512
##  8          1 Stroke                       1283621
##  9          1 Suicide                       217018
## 10          1 Unintentional injuries        919956
## # ℹ 20 more rows

Summary and Insight:

Although each sub‑sample is drawn from the same underlying dataset, noticeable differences appear in the total deaths attributed to certain causes. Some causes appear more prominent in one sub‑sample than another, even though no actual change occurred in the population.

This demonstrates how conclusions about which causes are “most significant” can shift depending on which portion of the data is observed.
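
One way to make this concrete is to line up the three totals for each cause and measure how much they swing from draw to draw. The sketch below reuses the sample_cause_summary object from above and expresses each cause's swing as its range divided by its mean across sub‑samples; it is an illustrative check rather than a formal test.

# For each cause, measure how much the total swings across the three draws
# (range expressed as a share of the mean total)
cause_spread <- sample_cause_summary |>
  group_by(Cause.Name) |>
  summarise(
    min_total  = min(total_deaths),
    max_total  = max(total_deaths),
    rel_spread = (max_total - min_total) / mean(total_deaths),
    .groups = "drop"
  ) |>
  arrange(desc(rel_spread))

cause_spread

Causes near the top of this table are the ones whose apparent importance depends most on which sub‑sample happens to be observed.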

Identifying Anomalies Across Sub‑Samples

In some sub‑samples, particular causes of death appear unusually high or low relative to others. However, these apparent anomalies are not consistent across all sub‑samples.

A cause that might be flagged as unusual in one sub‑sample often appears typical in another. This suggests that what may initially appear to be an anomaly can instead be the result of random sampling variation, rather than a meaningful underlying difference.
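
A hedged way to check this is to compare each sub‑sample's share of deaths by cause against the corresponding share in the full dataset, and flag causes that deviate by more than some threshold. The 5% relative threshold below is an arbitrary illustrative choice, not a standard cutoff.

# Share of deaths by cause in the full dataset (the benchmark "population")
population_share <- deaths |>
  filter(Cause.Name != "All causes") |>
  group_by(Cause.Name) |>
  summarise(pop_deaths = sum(Deaths, na.rm = TRUE), .groups = "drop") |>
  mutate(pop_share = pop_deaths / sum(pop_deaths))

# Flag causes whose share in a sub-sample deviates from the benchmark by more
# than 5% in relative terms -- an arbitrary threshold chosen for illustration
anomaly_flags <- sample_cause_summary |>
  group_by(sample_num) |>
  mutate(sample_share = total_deaths / sum(total_deaths)) |>
  ungroup() |>
  left_join(population_share, by = "Cause.Name") |>
  mutate(flagged = abs(sample_share - pop_share) / pop_share > 0.05)

# Causes flagged in some sub-samples but not all are candidates for
# sampling noise rather than genuine anomalies
anomaly_flags |>
  group_by(Cause.Name) |>
  summarise(times_flagged = sum(flagged), .groups = "drop") |>
  filter(times_flagged > 0, times_flagged < n_samples)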

Consistencies Across Sub‑Samples

Despite variability, several patterns remain consistent across all sub‑samples:

  • Causes with the highest overall prevalence in the full dataset remain among the most common in each sub‑sample.

  • The general ordering of the most prevalent causes is relatively stable.

  • Rare causes do not become dominant in any sub‑sample.

These consistencies suggest that strong signals in the data are robust to sampling variability, while weaker signals are more sensitive to which observations are included; the ranking check sketched below makes this comparison explicit.
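
In the sketch below, causes are ranked by total deaths within each sub‑sample, and the spread of each cause's rank across the three draws indicates how stable its position is.

# Rank causes within each sub-sample and see how much each rank moves
rank_stability <- sample_cause_summary |>
  group_by(sample_num) |>
  mutate(rank = min_rank(desc(total_deaths))) |>
  ungroup() |>
  group_by(Cause.Name) |>
  summarise(
    best_rank  = min(rank),
    worst_rank = max(rank),
    rank_range = worst_rank - best_rank,
    .groups = "drop"
  ) |>
  arrange(best_rank)

rank_stability

A rank_range of zero means a cause holds the same position in every sub‑sample; larger values indicate positions that depend on the particular draw.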

Visual Comparison of Sub‑Samples

# One panel per sub-sample; free x scales keep each panel readable
ggplot(sample_cause_summary, aes(
  x = total_deaths,
  y = reorder(Cause.Name, total_deaths)
)) +
  geom_col(fill = "steelblue") +
  facet_wrap(~ sample_num, scales = "free_x") +
  scale_x_continuous(
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Total Deaths by Cause Across Random Sub‑Samples",
    x = "Total Deaths (Millions)",
    y = "Cause of Death"
  ) +
  theme_minimal()

Effect of Sub‑Sample Size on Comparisons

As the relative size of the sub‑samples increases (for example, moving from 10% to 25% or larger portions of the dataset), differences between sub‑samples become less pronounced. Larger samples more closely resemble the full dataset, making conclusions more stable and reducing the likelihood of observing misleading anomalies.

Smaller sub‑samples, by contrast, show greater variability and are more susceptible to random fluctuations that can distort interpretation.

sample_sizes <- c(0.10, 0.25, 0.75)

# Draw one sub-sample at each fraction and summarise deaths by cause
sample_size_comparison <- map_df(sample_sizes, function(frac) {
  deaths |>
    sample_n(size = frac * nrow(deaths), replace = TRUE) |>
    filter(Cause.Name != "All causes") |>
    group_by(Cause.Name) |>
    summarise(
      total_deaths = sum(Deaths, na.rm = TRUE),
      .groups = "drop"
    ) |>
    mutate(sample_frac = frac)
})

sample_size_comparison
## # A tibble: 30 × 3
##    Cause.Name              total_deaths sample_frac
##    <chr>                          <int>       <dbl>
##  1 Alzheimer's disease           270980         0.1
##  2 CLRD                          682840         0.1
##  3 Cancer                       2755335         0.1
##  4 Diabetes                      137501         0.1
##  5 Heart disease                1920661         0.1
##  6 Influenza and pneumonia       149665         0.1
##  7 Kidney disease                 64373         0.1
##  8 Stroke                       1171963         0.1
##  9 Suicide                       110540         0.1
## 10 Unintentional injuries        570291         0.1
## # ℹ 20 more rows

Note that the raw totals grow with the sampled fraction, so the columns above are not directly comparable; the relevant question is how much a cause's total fluctuates from draw to draw at a given fraction. Smaller samples exaggerate random fluctuations, while larger samples more closely resemble the full dataset, reinforcing the importance of sufficient sample size when drawing conclusions.
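
Because the comparison above uses only a single draw at each fraction, it cannot show variability directly. The sketch below repeats the draw several times at each fraction and measures the spread of one cause's total, rescaled to the full‑data scale so the fractions are comparable. The 20 repetitions and the focus on heart disease are arbitrary illustrative choices.

# Repeat the draw at each fraction and measure how much the (rescaled)
# heart disease total moves from draw to draw
set.seed(123)

spread_by_fraction <- map_df(sample_sizes, function(frac) {
  totals <- map_dbl(1:20, function(i) {
    deaths |>
      sample_n(size = round(frac * nrow(deaths)), replace = TRUE) |>
      filter(Cause.Name == "Heart disease") |>
      summarise(total = sum(Deaths, na.rm = TRUE) / frac) |>
      pull(total)
  })

  tibble(sample_frac = frac, sd_rescaled_total = sd(totals))
})

spread_by_fraction

The standard deviation of the rescaled total should shrink as the sampled fraction grows, which is the pattern described above.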

Implications for Drawing Conclusions

This investigation demonstrates that conclusions drawn from a single dataset may be sensitive to sampling uncertainty, especially when focusing on smaller groups or subtle differences. Apparent anomalies or strong patterns observed in one sample may not persist when the data is resampled.

As a result, conclusions should be framed cautiously and supported by repeated analyses, larger samples, or population‑adjusted measures such as age‑adjusted death rates.
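
One simple way to put this caution into practice is a bootstrap‑style interval: resample the dataset many times, recompute the summary of interest, and report the range in which it usually falls rather than a single number. The sketch below does this for total heart disease deaths; the 200 replicates and the 2.5%/97.5% quantiles are conventional but arbitrary choices.

# Bootstrap-style interval for one summary statistic
set.seed(123)

boot_totals <- map_dbl(1:200, function(i) {
  deaths |>
    sample_n(size = nrow(deaths), replace = TRUE) |>
    filter(Cause.Name == "Heart disease") |>
    summarise(total = sum(Deaths, na.rm = TRUE)) |>
    pull(total)
})

quantile(boot_totals, probs = c(0.025, 0.975))

Reporting an interval of this kind communicates that the estimate carries sampling uncertainty, rather than presenting it as an exact value.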

Weekly Data Dive Summary

This week’s data dive explored how random sampling variability can affect observed patterns, anomalies, and conclusions in data analysis. By simulating repeated data collection through random sub‑sampling, it became clear that not all observed differences reflect meaningful underlying structure.

The findings emphasize the importance of acknowledging uncertainty when interpreting results, particularly when working with observational data. Future analyses should consider robustness checks and repeated sampling to avoid over‑interpreting random variation as meaningful insight.