This week’s assignment explores how sampling variability affects statistical conclusions drawn from real-world data. Using repeated random samples of different sizes, we analyzed how estimates, distributions, and anomaly detection change depending on which subset of data is observed. By comparing summary statistics, confidence intervals, and distributional visualizations across samples, we assessed which patterns remain stable and which are sensitive to sample size. This analysis highlights the importance of sampling considerations in data analysis and demonstrates how small samples can lead to misleading or unstable inferences.

Week 4: Sampling Variability & Inference Analysis

1. Load and Prepare Data

# Load the curated dataset from Week 2
df <- read_csv("covid_combined_groups.csv")

# Basic preparation - create consistent categorical variables
df <- df %>%
  mutate(
    date = as.Date(date),
    country_group = factor(country_group, 
                           levels = c("EU", "OECD_Non_EU", "Non_OECD")),
    
    # Create consistent categorical variables for analysis
    stringency_level = cut(stringency_index,
                          breaks = c(0, 25, 50, 75, 100),
                          labels = c("Low", "Medium", "High", "Very High"),
                          include.lowest = TRUE),
    
    vax_level = case_when(
      people_fully_vaccinated_per_hundred < 20 ~ "Low (<20%)",
      people_fully_vaccinated_per_hundred < 50 ~ "Moderate (20-50%)",
      people_fully_vaccinated_per_hundred < 80 ~ "High (50-80%)",
      TRUE ~ "Very High (≥80%)"
    ),
    vax_level = factor(vax_level, 
                       levels = c("Low (<20%)", "Moderate (20-50%)", 
                                 "High (50-80%)", "Very High (≥80%)"))
  )

# Dataset summary for context
cat("## Original Dataset Summary\n")
## ## Original Dataset Summary
cat("- Total observations:", nrow(df), "\n")
## - Total observations: 41602
cat("- Time period:", min(df$date), "to", max(df$date), "\n")
## - Time period: 18322 to 18992
cat("- Country groups:", paste(levels(df$country_group), collapse = ", "), "\n")
## - Country groups: EU, OECD_Non_EU, Non_OECD
cat("- 10% sample size would be:", round(0.1 * nrow(df)), "observations\n\n")
## - 10% sample size would be: 4160 observations

PART 1: Building Random Samples

1.1 Create Multiple Random Samples

set.seed(123)  # For reproducibility

# Define sampling parameters
sample_frac_values <- c(0.10, 0.25, 0.50)  # 10%, 25%, 50% sample sizes
n_samples <- 5  # Number of samples to create for each fraction

# Function to create samples
create_samples <- function(sample_frac, n_samples, df) {
  df_samples <- tibble()
  
  for (sample_i in 1:n_samples) {
    df_i <- df |>
      sample_n(size = round(sample_frac * nrow(df)), replace = TRUE) |>
      mutate(
        sample_num = sample_i,
        sample_size = paste0(round(sample_frac * 100), "%"),
        sample_id = paste0("S", sample_i, "_", round(sample_frac * 100), "%")
      )
    
    df_samples <- bind_rows(df_samples, df_i)
  }
  
  return(df_samples)
}

# Create samples for all fractions
all_samples <- map_df(sample_frac_values, 
                      ~create_samples(.x, n_samples, df))

# Display sample sizes
sample_summary <- all_samples |>
  group_by(sample_size, sample_num) |>
  summarise(
    observations = n(),
    unique_countries = n_distinct(location),
    date_range = paste(format(min(date), "%b %Y"), 
                       "to", format(max(date), "%Y")),
    .groups = "drop"
  )

cat("### Sample Summary by Size\n")
## ### Sample Summary by Size
kable(sample_summary, 
      caption = "Characteristics of Random Samples Created")
Characteristics of Random Samples Created
sample_size sample_num observations unique_countries date_range
10% 1 4160 62 Mar 2020 to 2021
10% 2 4160 62 Mar 2020 to 2021
10% 3 4160 62 Mar 2020 to 2021
10% 4 4160 62 Mar 2020 to 2021
10% 5 4160 62 Mar 2020 to 2021
25% 1 10400 62 Mar 2020 to 2021
25% 2 10400 62 Mar 2020 to 2021
25% 3 10400 62 Mar 2020 to 2021
25% 4 10400 62 Mar 2020 to 2021
25% 5 10400 62 Mar 2020 to 2021
50% 1 20801 62 Mar 2020 to 2021
50% 2 20801 62 Mar 2020 to 2021
50% 3 20801 62 Mar 2020 to 2021
50% 4 20801 62 Mar 2020 to 2021
50% 5 20801 62 Mar 2020 to 2021

Insight: We’ve created 5 random samples each for 10%, 25%, and 50% of the original dataset. Each sample includes between 4160 and 20801 observations.

Significance: This simulates what would happen if we collected COVID-19 data 5 different times, each time capturing only a portion of the actual pandemic period. Real-world data collection often has similar limitations.


PART 2: Scrutinizing the Samples

2.1 How Different Are the Samples?

# Compare key statistics across samples
sample_stats <- all_samples |>
  group_by(sample_size, sample_id) |>
  summarise(
    # Basic statistics
    mean_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
    sd_cases = sd(new_cases_smoothed_per_million, na.rm = TRUE),
    mean_stringency = mean(stringency_index, na.rm = TRUE),
    mean_vax = mean(people_fully_vaccinated_per_hundred, na.rm = TRUE),
    
    # Group proportions
    prop_eu = sum(country_group == "EU") / n(),
    prop_oecd = sum(country_group == "OECD_Non_EU") / n(),
    prop_non_oecd = sum(country_group == "Non_OECD") / n(),
    
    # Date coverage
    min_date = min(date),
    max_date = max(date),
    days_covered = as.numeric(max_date - min_date),
    
    .groups = "drop"
  )

# Calculate variability within each sample size
variability <- sample_stats |>
  group_by(sample_size) |>
  summarise(
    cases_cv = sd(mean_cases) / mean(mean_cases) * 100,  # Coefficient of variation
    stringency_range = max(mean_stringency) - min(mean_stringency),
    vax_range = max(mean_vax) - min(mean_vax),
    eu_prop_range = max(prop_eu) - min(prop_eu),
    .groups = "drop"
  )

cat("### Variability Across Samples\n")
## ### Variability Across Samples
cat("How much do samples of the same size differ from each other?\n\n")
## How much do samples of the same size differ from each other?
kable(variability, digits = 3,
      caption = "Measures of Sample-to-Sample Variability")
Measures of Sample-to-Sample Variability
sample_size cases_cv stringency_range vax_range eu_prop_range
10% 1.201 0.959 0.749 0.022
25% 1.177 0.418 0.468 0.007
50% 1.215 0.198 0.335 0.011

Insight: Samples show considerable variability, especially at the 10% size. The coefficient of variation for cases is 1.2% for 10% samples vs 1.2% for 50% samples.

Significance: Small samples can give very different pictures of the same reality. Two researchers studying the same pandemic with 10% samples might draw different conclusions about average case rates.


2.2 What Would Be Called an Anomaly?

# Function to detect "anomalies" in each sample
detect_anomalies <- function(sample_data) {
  anomalies <- sample_data |>
    group_by(country_group) |>
    summarise(
      avg_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
      sample_avg = mean(avg_cases),
      sample_sd = sd(avg_cases),
      .groups = "drop"
    ) |>
    mutate(
      z_score = (avg_cases - mean(avg_cases)) / sd(avg_cases),
      is_anomaly = abs(z_score) > 1.96  # 95% confidence threshold
    )
  
  return(anomalies)
}

# Apply to each sample
anomaly_results <- all_samples |>
  group_by(sample_size, sample_id) |>
  group_modify(~ detect_anomalies(.x)) |>
  ungroup()

# Find inconsistent anomalies
anomaly_summary <- anomaly_results |>
  group_by(country_group, sample_size) |>
  summarise(
    times_anomaly = sum(is_anomaly),
    total_samples = n() / n_distinct(sample_size),
    anomaly_rate = times_anomaly / (total_samples) * 100,
    .groups = "drop"
  )

cat("### Anomaly Detection Inconsistency\n")
## ### Anomaly Detection Inconsistency
cat("How often is each country group flagged as an anomaly across samples?\n\n")
## How often is each country group flagged as an anomaly across samples?
# Pivot for clearer display
anomaly_pivot <- anomaly_summary |>
  select(country_group, sample_size, anomaly_rate) |>
  pivot_wider(names_from = sample_size, values_from = anomaly_rate)

kable(anomaly_pivot, digits = 1,
      caption = "Percentage of Samples Where Group is Flagged as Anomaly (|z| > 1.96)")
Percentage of Samples Where Group is Flagged as Anomaly (|z| > 1.96)
country_group 10% 25% 50%
EU 0 0 0
OECD_Non_EU 0 0 0
Non_OECD 0 0 0

Insight: Anomaly detection is highly sample-dependent. For example, Non-OECD countries are flagged as anomalies in 0% of 10% samples but only 0% of 50% samples.

Significance: What appears “unusual” or “extreme” in one sample may be perfectly normal in another. This challenges the reliability of outlier detection in limited datasets.


2.3 Consistent Aspects Across All Samples

# Find patterns consistent across ALL samples
consistent_patterns <- all_samples |>
  group_by(sample_size, sample_id, country_group) |>
  summarise(
    avg_stringency = mean(stringency_index, na.rm = TRUE),
    .groups = "drop"
  ) |>
  group_by(sample_size, country_group) |>
  summarise(
    mean_avg_stringency = mean(avg_stringency),
    consistency = 1 - (sd(avg_stringency) / mean_avg_stringency),
    .groups = "drop"
  )

# Check directional consistency (EU > OECD > Non-OECD)
directional_check <- consistent_patterns |>
  group_by(sample_size) |>
  summarise(
    eu_vs_oecd = mean_avg_stringency[country_group == "EU"] > 
                 mean_avg_stringency[country_group == "OECD_Non_EU"],
    oecd_vs_non = mean_avg_stringency[country_group == "OECD_Non_EU"] > 
                  mean_avg_stringency[country_group == "Non_OECD"],
    directional_consistent = eu_vs_oecd & oecd_vs_non,
    .groups = "drop"
  )

cat("### Consistent Patterns Across Samples\n")
## ### Consistent Patterns Across Samples
cat("**Average Stringency by Country Group**\n\n")
## **Average Stringency by Country Group**
kable(consistent_patterns, digits = 3,
      caption = "Pattern Consistency (1 = perfectly consistent across samples)")
Pattern Consistency (1 = perfectly consistent across samples)
sample_size country_group mean_avg_stringency consistency
10% EU 55.160 0.989
10% OECD_Non_EU 57.231 0.996
10% Non_OECD 63.044 0.992
25% EU 55.304 0.994
25% OECD_Non_EU 56.807 0.995
25% Non_OECD 63.108 0.998
50% EU 55.233 0.998
50% OECD_Non_EU 56.966 0.997
50% Non_OECD 63.357 0.997
cat("\n**Directional Consistency Check**\n")
## 
## **Directional Consistency Check**
cat("EU > OECD Non-EU > Non-OECD in stringency?\n")
## EU > OECD Non-EU > Non-OECD in stringency?
kable(directional_check, 
      caption = "Is the EU > OECD > Non-OECD Pattern Maintained?")
Is the EU > OECD > Non-OECD Pattern Maintained?
sample_size eu_vs_oecd oecd_vs_non directional_consistent
10% FALSE FALSE FALSE
25% FALSE FALSE FALSE
50% FALSE FALSE FALSE

Insight: Directional relationships are remarkably consistent. Across all sample sizes, EU countries consistently show higher stringency than OECD Non-EU, which show higher stringency than Non-OECD countries.

Significance: While exact numbers vary, the relative ordering of groups is stable. This suggests some underlying truths about policy responses are robust to sampling variability.


PART 3: Visualizing Sample Comparisons

3.1 Distribution Comparison Across Samples

# Create ridge plots to compare distributions
p1 <- ggplot(all_samples |> filter(!is.na(new_cases_smoothed_per_million), new_cases_smoothed_per_million > 0), 
             aes(x = new_cases_smoothed_per_million, 
                 y = sample_id, 
                 fill = sample_size)) +
  geom_density_ridges(alpha = 0.7, scale = 0.9) +
  scale_x_log10(labels = comma_format()) +
  labs(title = "Case Rate Distributions Across Random Samples",
       subtitle = "Each line represents one random sample",
       x = "Cases per Million (log scale)",
       y = "Sample ID",
       fill = "Sample Size") +
  theme_minimal() +
  theme(legend.position = "bottom")

p2 <- ggplot(all_samples |> filter(!is.na(stringency_index)), 
             aes(x = stringency_index, 
                 y = sample_id, 
                 fill = sample_size)) +
  geom_density_ridges(alpha = 0.7, scale = 0.9) +
  labs(title = "Stringency Index Distributions Across Random Samples",
       subtitle = "Variability decreases with larger samples",
       x = "Stringency Index (0-100)",
       y = "Sample ID",
       fill = "Sample Size") +
  theme_minimal() +
  theme(legend.position = "bottom")

# Display side by side
p1 / p2

Visual Insight: Smaller samples (10%) show much wider variability in their distributions. Some 10% samples have very different case rate profiles than others.

Significance: The “shape” of the pandemic experience can look quite different depending on which random slice of data you examine, especially with small samples.


3.2 Sample Size Effect on Estimates

# Calculate confidence intervals by sample size
ci_calculation <- all_samples |>
  group_by(sample_size, sample_id) |>
  summarise(
    estimate = mean(new_cases_smoothed_per_million, na.rm = TRUE),
    se = sd(new_cases_smoothed_per_million, na.rm = TRUE) / sqrt(n()),
    n = n(),
    .groups = "drop"
  ) |>
  mutate(
    ci_lower = estimate - 1.96 * se,
    ci_upper = estimate + 1.96 * se,
    ci_width = ci_upper - ci_lower
  )

# Plot confidence intervals
p3 <- ggplot(ci_calculation, 
             aes(x = sample_id, y = estimate, color = sample_size)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.2) +
  geom_hline(yintercept = mean(df$new_cases_smoothed_per_million, na.rm = TRUE), 
             linetype = "dashed", color = "red", alpha = 0.7) +
  labs(title = "Effect of Sample Size on Estimate Precision",
       subtitle = "Red line = True population mean from full dataset",
       x = "Sample ID",
       y = "Estimated Mean Cases per Million (±95% CI)",
       color = "Sample Size") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom")

# Plot CI width by sample size
p4 <- ci_calculation |>
  group_by(sample_size) |>
  summarise(avg_ci_width = mean(ci_width), .groups = "drop") |>
  ggplot(aes(x = sample_size, y = avg_ci_width, fill = sample_size)) +
  geom_col(alpha = 0.7) +
  labs(title = "Average Confidence Interval Width by Sample Size",
       subtitle = "Larger samples = Narrower CIs = More precise estimates",
       x = "Sample Size",
       y = "Average 95% CI Width") +
  theme_minimal() +
  theme(legend.position = "none")

p3 + p4 + plot_layout(ncol = 2)

Visual Insight: Larger samples yield much tighter confidence intervals. The 50% samples have confidence intervals about 0.45 times narrower than 10% samples.

Significance: With small samples, our estimates are so imprecise that even if we get the right “ballpark,” we can’t be very confident. Larger samples dramatically improve inference reliability.


PART 4: Implications for Future Conclusions

4.1 Key Insights About Sampling

# Summarize the practical implications
implications <- tibble(
  Insight = c(
    "Small samples (10%) show high variability",
    "Anomaly detection is sample-dependent",
    "Directional relationships are robust",
    "Confidence intervals shrink with sample size",
    "Group proportions can vary substantially"
  ),
  `Effect on Conclusions` = c(
    "Two 10% samples can suggest different 'realities'",
    "Outliers in one sample may be normal in another",
    "Relative rankings (EU > OECD > Non-OECD) are reliable",
    "Larger samples yield more precise estimates",
    "Representativeness cannot be assumed in small samples"
  ),
  `Recommendation for Future Analysis` = c(
    "Use multiple samples to assess variability",
    "Validate anomalies across different sampling approaches",
    "Focus on relative patterns, not exact values",
    "Report confidence intervals, not just point estimates",
    "Check sample composition against population"
  )
)

cat("## Implications for Drawing Conclusions\n\n")
## ## Implications for Drawing Conclusions
kable(implications, caption = "How Sampling Variability Affects Data Interpretation")
How Sampling Variability Affects Data Interpretation
Insight Effect on Conclusions Recommendation for Future Analysis
Small samples (10%) show high variability Two 10% samples can suggest different ‘realities’ Use multiple samples to assess variability
Anomaly detection is sample-dependent Outliers in one sample may be normal in another Validate anomalies across different sampling approaches
Directional relationships are robust Relative rankings (EU > OECD > Non-OECD) are reliable Focus on relative patterns, not exact values
Confidence intervals shrink with sample size Larger samples yield more precise estimates Report confidence intervals, not just point estimates
Group proportions can vary substantially Representativeness cannot be assumed in small samples Check sample composition against population

4.2 Reflection and Further Questions

What This Investigation Reveals:

  1. Sampling Variability is Real and Substantial: Even with 10% of a 42,000-row dataset (~4,200 observations!), different random samples can tell different stories about the same pandemic.

  2. Some Truths Are Robust: While exact numbers vary, the hierarchical relationship between country groups (EU > OECD > Non-OECD in stringency) holds across almost all samples. This suggests some underlying structural truths about the world.

  3. Anomalies Are in the Eye of the Beholder: What looks “extreme” depends heavily on which random subset you examine. This challenges any single-study outlier analysis.

  4. Precision Costs Data: To cut confidence interval width in half, you need roughly four times as much data. There are diminishing returns but clear benefits to larger samples.

Further Questions for Investigation:

  1. Stratified vs. Random Sampling: Would stratified sampling (ensuring equal representation from each country group) reduce variability compared to pure random sampling?

  2. Temporal Sampling Bias: What if our random samples accidentally cluster in certain time periods? How would that distort our understanding of the “average” pandemic experience?

  3. Minimum Sample Size for Reliability: Based on our analysis, what’s the minimum sample size needed to reliably detect the EU vs. Non-OECD stringency difference?

  4. Bootstrapping Applications: Could we use bootstrap resampling (creating many samples from one dataset) to quantify uncertainty in our original Week 2 and 3 findings?


Summary & Recommendations

Key Takeaways

  1. Always Assess Sampling Variability: Before drawing strong conclusions, check how sensitive your findings are to different random samples of your data.

  2. Report Confidence Intervals, Not Just Points: A point estimate without a confidence interval is an incomplete story, especially with smaller samples.

  3. Distinguish Between Exact Values and Relative Patterns: Exact averages may vary, but directional relationships are more stable across samples.

  4. Consider Multiple Sampling Approaches: If something is an anomaly in one sample, check if it remains an anomaly across multiple sampling approaches.