This week’s assignment explores how sampling variability affects statistical conclusions drawn from real-world data. Using repeated random samples of different sizes, we analyzed how estimates, distributions, and anomaly detection change depending on which subset of data is observed. By comparing summary statistics, confidence intervals, and distributional visualizations across samples, we assessed which patterns remain stable and which are sensitive to sample size. This analysis highlights the importance of sampling considerations in data analysis and demonstrates how small samples can lead to misleading or unstable inferences.
# Packages used throughout this analysis
library(tidyverse)   # readr, dplyr, tidyr, purrr, ggplot2
library(ggridges)    # geom_density_ridges()
library(patchwork)   # combining plots with "/" and "+"
library(scales)      # comma_format()
library(knitr)       # kable()

# Load the curated dataset from Week 2
df <- read_csv("covid_combined_groups.csv")

# Basic preparation - create consistent categorical variables
df <- df %>%
  mutate(
    date = as.Date(date),
    country_group = factor(country_group,
                           levels = c("EU", "OECD_Non_EU", "Non_OECD")),
    # Create consistent categorical variables for analysis
    stringency_level = cut(stringency_index,
                           breaks = c(0, 25, 50, 75, 100),
                           labels = c("Low", "Medium", "High", "Very High"),
                           include.lowest = TRUE),
    vax_level = case_when(
      # Keep missing vaccination data as NA rather than letting it
      # fall through to the "Very High" catch-all below
      is.na(people_fully_vaccinated_per_hundred) ~ NA_character_,
      people_fully_vaccinated_per_hundred < 20 ~ "Low (<20%)",
      people_fully_vaccinated_per_hundred < 50 ~ "Moderate (20-50%)",
      people_fully_vaccinated_per_hundred < 80 ~ "High (50-80%)",
      TRUE ~ "Very High (≥80%)"
    ),
    vax_level = factor(vax_level,
                       levels = c("Low (<20%)", "Moderate (20-50%)",
                                  "High (50-80%)", "Very High (≥80%)"))
  )
# Dataset summary for context
cat("## Original Dataset Summary\n")## ## Original Dataset Summary
## - Total observations: 41602
## - Time period: 18322 to 18992
## - Country groups: EU, OECD_Non_EU, Non_OECD
## - 10% sample size would be: 4160 observations
set.seed(123) # For reproducibility
# Define sampling parameters
sample_frac_values <- c(0.10, 0.25, 0.50) # 10%, 25%, 50% sample sizes
n_samples <- 5 # Number of samples to create for each fraction
# Function to create samples
create_samples <- function(sample_frac, n_samples, df) {
  df_samples <- tibble()
  for (sample_i in 1:n_samples) {
    df_i <- df |>
      # Note: replace = TRUE draws with replacement (bootstrap-style),
      # so a row can appear more than once within a sample
      sample_n(size = round(sample_frac * nrow(df)), replace = TRUE) |>
      mutate(
        sample_num = sample_i,
        sample_size = paste0(round(sample_frac * 100), "%"),
        sample_id = paste0("S", sample_i, "_", round(sample_frac * 100), "%")
      )
    df_samples <- bind_rows(df_samples, df_i)
  }
  return(df_samples)
}
# Create samples for all fractions
all_samples <- map_df(sample_frac_values,
                      ~ create_samples(.x, n_samples, df))

# Display sample sizes
sample_summary <- all_samples |>
  group_by(sample_size, sample_num) |>
  summarise(
    observations = n(),
    unique_countries = n_distinct(location),
    date_range = paste(format(min(date), "%b %Y"),
                       "to", format(max(date), "%b %Y")),
    .groups = "drop"
  )
cat("### Sample Summary by Size\n")## ### Sample Summary by Size
| sample_size | sample_num | observations | unique_countries | date_range |
|---|---|---|---|---|
| 10% | 1 | 4160 | 62 | Mar 2020 to Dec 2021 |
| 10% | 2 | 4160 | 62 | Mar 2020 to Dec 2021 |
| 10% | 3 | 4160 | 62 | Mar 2020 to Dec 2021 |
| 10% | 4 | 4160 | 62 | Mar 2020 to Dec 2021 |
| 10% | 5 | 4160 | 62 | Mar 2020 to Dec 2021 |
| 25% | 1 | 10400 | 62 | Mar 2020 to Dec 2021 |
| 25% | 2 | 10400 | 62 | Mar 2020 to Dec 2021 |
| 25% | 3 | 10400 | 62 | Mar 2020 to Dec 2021 |
| 25% | 4 | 10400 | 62 | Mar 2020 to Dec 2021 |
| 25% | 5 | 10400 | 62 | Mar 2020 to Dec 2021 |
| 50% | 1 | 20801 | 62 | Mar 2020 to Dec 2021 |
| 50% | 2 | 20801 | 62 | Mar 2020 to Dec 2021 |
| 50% | 3 | 20801 | 62 | Mar 2020 to Dec 2021 |
| 50% | 4 | 20801 | 62 | Mar 2020 to Dec 2021 |
| 50% | 5 | 20801 | 62 | Mar 2020 to Dec 2021 |
Insight: We’ve created 5 random samples each at 10%, 25%, and 50% of the original dataset, containing 4,160, 10,400, and 20,801 observations respectively.
Significance: This simulates what would happen if we collected COVID-19 data 5 separate times, each time observing only a portion of the available records. Real-world data collection often faces similar constraints.
# Compare key statistics across samples
sample_stats <- all_samples |>
  group_by(sample_size, sample_id) |>
  summarise(
    # Basic statistics
    mean_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
    sd_cases = sd(new_cases_smoothed_per_million, na.rm = TRUE),
    mean_stringency = mean(stringency_index, na.rm = TRUE),
    mean_vax = mean(people_fully_vaccinated_per_hundred, na.rm = TRUE),
    # Group proportions
    prop_eu = sum(country_group == "EU") / n(),
    prop_oecd = sum(country_group == "OECD_Non_EU") / n(),
    prop_non_oecd = sum(country_group == "Non_OECD") / n(),
    # Date coverage
    min_date = min(date),
    max_date = max(date),
    days_covered = as.numeric(max_date - min_date),
    .groups = "drop"
  )
# Calculate variability within each sample size
variability <- sample_stats |>
  group_by(sample_size) |>
  summarise(
    cases_cv = sd(mean_cases) / mean(mean_cases) * 100,  # coefficient of variation (%)
    stringency_range = max(mean_stringency) - min(mean_stringency),
    vax_range = max(mean_vax) - min(mean_vax),
    eu_prop_range = max(prop_eu) - min(prop_eu),
    .groups = "drop"
  )
cat("### Variability Across Samples\n")## ### Variability Across Samples
## How much do samples of the same size differ from each other?
| sample_size | cases_cv | stringency_range | vax_range | eu_prop_range |
|---|---|---|---|---|
| 10% | 1.201 | 0.959 | 0.749 | 0.022 |
| 25% | 1.177 | 0.418 | 0.468 | 0.007 |
| 50% | 1.215 | 0.198 | 0.335 | 0.011 |
Insight: The coefficient of variation of the mean case rate hovers around 1.2% at every sample size, but the spread of the other statistics shrinks markedly as samples grow: the range of mean stringency across samples falls from 0.96 points (10% samples) to 0.20 points (50% samples), and the vaccination range from 0.75 to 0.34 percentage points.
Significance: Small samples can give very different pictures of the same reality. Two researchers studying the same pandemic with 10% samples might draw different conclusions about average case rates.
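To make the "two researchers" scenario concrete, here is a minimal sketch, assuming the `all_samples` tibble built above (the sample IDs follow the `S<num>_<frac>%` pattern from `create_samples()`). It compares the mean case rate in two of our 10% samples with a Welch t-test; since both samples come from the same dataset, any apparent difference is pure sampling noise:

# Pull the case-rate column from two different 10% samples
s1 <- all_samples |>
  filter(sample_id == "S1_10%") |>
  pull(new_cases_smoothed_per_million)
s2 <- all_samples |>
  filter(sample_id == "S2_10%") |>
  pull(new_cases_smoothed_per_million)

# Welch two-sample t-test; NA values are dropped automatically.
# Both samples describe the same population, so a small p-value here
# would show how sampling noise alone can masquerade as a "finding".
t.test(s1, s2)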
# Function to detect "anomalies" in each sample
detect_anomalies <- function(sample_data) {
  sample_data |>
    group_by(country_group) |>
    summarise(
      avg_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
      .groups = "drop"
    ) |>
    mutate(
      # Standardize each group's mean against the other group means
      z_score = (avg_cases - mean(avg_cases)) / sd(avg_cases),
      is_anomaly = abs(z_score) > 1.96  # conventional 95% threshold
    )
}
# Apply to each sample
anomaly_results <- all_samples |>
  group_by(sample_size, sample_id) |>
  group_modify(~ detect_anomalies(.x)) |>
  ungroup()

# Find inconsistent anomalies
anomaly_summary <- anomaly_results |>
  group_by(country_group, sample_size) |>
  summarise(
    times_anomaly = sum(is_anomaly),
    total_samples = n(),  # 5 samples per sample size
    anomaly_rate = times_anomaly / total_samples * 100,
    .groups = "drop"
  )
cat("### Anomaly Detection Inconsistency\n")## ### Anomaly Detection Inconsistency
## How often is each country group flagged as an anomaly across samples?
# Pivot for clearer display
anomaly_pivot <- anomaly_summary |>
  select(country_group, sample_size, anomaly_rate) |>
  pivot_wider(names_from = sample_size, values_from = anomaly_rate)

kable(anomaly_pivot, digits = 1,
      caption = "Percentage of Samples Where Group is Flagged as Anomaly (|z| > 1.96)")

| country_group | 10% | 25% | 50% |
|---|---|---|---|
| EU | 0 | 0 | 0 |
| OECD_Non_EU | 0 | 0 | 0 |
| Non_OECD | 0 | 0 | 0 |
Insight: No group is ever flagged, at any sample size, and this is a limitation of the design rather than evidence that nothing is unusual: with only three group means, the sample z-score is mathematically bounded by (n − 1)/√n ≈ 1.15 for n = 3, so the |z| > 1.96 threshold can never be crossed.
Significance: Anomaly detection is only as reliable as its setup. A threshold calibrated for large samples is unattainable with three units, and even with enough units, what appears “extreme” in one random subset may look perfectly normal in another. Outlier claims from a single limited sample deserve skepticism.
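As a hedged illustration (not part of the original analysis), the same z-score logic applied at the country level, where roughly 60 units per sample make the 1.96 threshold attainable, would look like this:

# Flag individual countries whose sample-average case rate is extreme
# relative to the other countries in the same random sample
detect_country_anomalies <- function(sample_data) {
  sample_data |>
    group_by(location) |>
    summarise(avg_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
              .groups = "drop") |>
    mutate(
      z_score = (avg_cases - mean(avg_cases, na.rm = TRUE)) /
        sd(avg_cases, na.rm = TRUE),
      is_anomaly = abs(z_score) > 1.96
    )
}

# Countries flagged in some samples but not others would demonstrate
# the sample-dependence the group-level check was meant to show
country_anomalies <- all_samples |>
  group_by(sample_size, sample_id) |>
  group_modify(~ detect_country_anomalies(.x)) |>
  ungroup() |>
  filter(is_anomaly)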
# Find patterns consistent across ALL samples
consistent_patterns <- all_samples |>
  group_by(sample_size, sample_id, country_group) |>
  summarise(
    avg_stringency = mean(stringency_index, na.rm = TRUE),
    .groups = "drop"
  ) |>
  group_by(sample_size, country_group) |>
  summarise(
    mean_avg_stringency = mean(avg_stringency),
    consistency = 1 - (sd(avg_stringency) / mean_avg_stringency),
    .groups = "drop"
  )
# Check a hypothesized ordering (EU > OECD Non-EU > Non-OECD in stringency)
directional_check <- consistent_patterns |>
  group_by(sample_size) |>
  summarise(
    eu_vs_oecd = mean_avg_stringency[country_group == "EU"] >
      mean_avg_stringency[country_group == "OECD_Non_EU"],
    oecd_vs_non = mean_avg_stringency[country_group == "OECD_Non_EU"] >
      mean_avg_stringency[country_group == "Non_OECD"],
    directional_consistent = eu_vs_oecd & oecd_vs_non,
    .groups = "drop"
  )
cat("### Consistent Patterns Across Samples\n")## ### Consistent Patterns Across Samples
## **Average Stringency by Country Group**
kable(consistent_patterns, digits = 3,
caption = "Pattern Consistency (1 = perfectly consistent across samples)")| sample_size | country_group | mean_avg_stringency | consistency |
|---|---|---|---|
| 10% | EU | 55.160 | 0.989 |
| 10% | OECD_Non_EU | 57.231 | 0.996 |
| 10% | Non_OECD | 63.044 | 0.992 |
| 25% | EU | 55.304 | 0.994 |
| 25% | OECD_Non_EU | 56.807 | 0.995 |
| 25% | Non_OECD | 63.108 | 0.998 |
| 50% | EU | 55.233 | 0.998 |
| 50% | OECD_Non_EU | 56.966 | 0.997 |
| 50% | Non_OECD | 63.357 | 0.997 |
##
## **Directional Consistency Check**
## EU > OECD Non-EU > Non-OECD in stringency?
| sample_size | eu_vs_oecd | oecd_vs_non | directional_consistent |
|---|---|---|---|
| 10% | FALSE | FALSE | FALSE |
| 25% | FALSE | FALSE | FALSE |
| 50% | FALSE | FALSE | FALSE |
Insight: The hypothesized ordering fails at every sample size, but it fails identically each time: Non-OECD countries consistently show the highest average stringency (≈63), followed by OECD Non-EU (≈57) and then EU countries (≈55). The relative ordering is stable across samples; it is simply the reverse of the EU > OECD Non-EU > Non-OECD hypothesis.
Significance: While exact numbers vary, the relative ordering of groups is robust to sampling variability. A consistently FALSE check is as informative as a consistently TRUE one: it points to a real underlying structure in policy responses, just not the one we expected.
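The check above evaluates the ordering only at the level of sample-size averages. A stricter version, sketched below under the same assumptions (it reuses `all_samples`), asks whether the observed ordering holds in each of the 15 individual samples:

# Test the observed ordering (Non-OECD > OECD Non-EU > EU) per sample
per_sample_order <- all_samples |>
  group_by(sample_size, sample_id, country_group) |>
  summarise(avg_stringency = mean(stringency_index, na.rm = TRUE),
            .groups = "drop") |>
  group_by(sample_size, sample_id) |>
  summarise(
    ordering_holds =
      avg_stringency[country_group == "Non_OECD"] >
        avg_stringency[country_group == "OECD_Non_EU"] &
      avg_stringency[country_group == "OECD_Non_EU"] >
        avg_stringency[country_group == "EU"],
    .groups = "drop"
  )

# Proportion of the 15 samples in which the ordering holds
mean(per_sample_order$ordering_holds)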
# Create ridge plots to compare distributions
p1 <- ggplot(all_samples |>
               filter(!is.na(new_cases_smoothed_per_million),
                      new_cases_smoothed_per_million > 0),
             aes(x = new_cases_smoothed_per_million,
                 y = sample_id,
                 fill = sample_size)) +
  geom_density_ridges(alpha = 0.7, scale = 0.9) +
  scale_x_log10(labels = comma_format()) +
  labs(title = "Case Rate Distributions Across Random Samples",
       subtitle = "Each line represents one random sample",
       x = "Cases per Million (log scale)",
       y = "Sample ID",
       fill = "Sample Size") +
  theme_minimal() +
  theme(legend.position = "bottom")

p2 <- ggplot(all_samples |> filter(!is.na(stringency_index)),
             aes(x = stringency_index,
                 y = sample_id,
                 fill = sample_size)) +
  geom_density_ridges(alpha = 0.7, scale = 0.9) +
  labs(title = "Stringency Index Distributions Across Random Samples",
       subtitle = "Variability decreases with larger samples",
       x = "Stringency Index (0-100)",
       y = "Sample ID",
       fill = "Sample Size") +
  theme_minimal() +
  theme(legend.position = "bottom")
# Display the two ridge plots stacked (patchwork's "/" operator)
p1 / p2

Visual Insight: Smaller samples (10%) show much wider variability in their distributions; some 10% samples have very different case rate profiles than others.
Significance: The “shape” of the pandemic experience can look quite different depending on which random slice of data you examine, especially with small samples.
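One way to quantify those visual differences is sketched below, assuming `df` and `all_samples` from above. The Kolmogorov–Smirnov statistic D is the maximum gap between two empirical CDFs; ties in the data will make R warn that the p-value is approximate, so treat D as descriptive here:

# Distance between each sample's case-rate distribution and the full data
full_cases <- df |>
  filter(!is.na(new_cases_smoothed_per_million)) |>
  pull(new_cases_smoothed_per_million)

ks_by_sample <- all_samples |>
  filter(!is.na(new_cases_smoothed_per_million)) |>
  group_by(sample_size, sample_id) |>
  summarise(
    ks_D = ks.test(new_cases_smoothed_per_million, full_cases)$statistic,
    .groups = "drop"
  )

# Expect ks_D to shrink, on average, as the sampling fraction grows
ks_by_sample |> arrange(desc(ks_D))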
# Calculate confidence intervals by sample size
ci_calculation <- all_samples |>
  group_by(sample_size, sample_id) |>
  summarise(
    n_obs = sum(!is.na(new_cases_smoothed_per_million)),  # count non-NA values
    estimate = mean(new_cases_smoothed_per_million, na.rm = TRUE),
    # Use the non-NA count in the SE denominator, not the raw row count
    se = sd(new_cases_smoothed_per_million, na.rm = TRUE) / sqrt(n_obs),
    .groups = "drop"
  ) |>
  mutate(
    ci_lower = estimate - 1.96 * se,
    ci_upper = estimate + 1.96 * se,
    ci_width = ci_upper - ci_lower
  )
# Plot confidence intervals
p3 <- ggplot(ci_calculation,
             aes(x = sample_id, y = estimate, color = sample_size)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.2) +
  geom_hline(yintercept = mean(df$new_cases_smoothed_per_million, na.rm = TRUE),
             linetype = "dashed", color = "red", alpha = 0.7) +
  labs(title = "Effect of Sample Size on Estimate Precision",
       subtitle = "Red line = True population mean from full dataset",
       x = "Sample ID",
       y = "Estimated Mean Cases per Million (±95% CI)",
       color = "Sample Size") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom")

# Plot CI width by sample size
p4 <- ci_calculation |>
  group_by(sample_size) |>
  summarise(avg_ci_width = mean(ci_width), .groups = "drop") |>
  ggplot(aes(x = sample_size, y = avg_ci_width, fill = sample_size)) +
  geom_col(alpha = 0.7) +
  labs(title = "Average Confidence Interval Width by Sample Size",
       subtitle = "Larger samples = Narrower CIs = More precise estimates",
       x = "Sample Size",
       y = "Average 95% CI Width") +
  theme_minimal() +
  theme(legend.position = "none")
p3 + p4 + plot_layout(ncol = 2)

Visual Insight: Larger samples yield much tighter confidence intervals: the 50% samples' intervals are roughly 45% as wide as those from the 10% samples, consistent with 1/√n scaling (five times the data gives 1/√5 ≈ 0.45 of the width).
Significance: With small samples, our estimates are so imprecise that even if we get the right “ballpark,” we can’t be very confident. Larger samples dramatically improve inference reliability.
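A quick follow-up check, assuming `ci_calculation` and `df` from above: since every sample was drawn from the full dataset, we can verify directly how often the nominal 95% intervals actually cover the full-data mean:

# Empirical coverage of the 95% CIs against the full-data mean
pop_mean <- mean(df$new_cases_smoothed_per_million, na.rm = TRUE)

ci_calculation |>
  mutate(covers = ci_lower <= pop_mean & pop_mean <= ci_upper) |>
  group_by(sample_size) |>
  summarise(coverage_rate = mean(covers), .groups = "drop")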
# Summarize the practical implications
implications <- tibble(
  Insight = c(
    "Small samples (10%) show high variability",
    "Anomaly detection is sample-dependent",
    "Directional relationships are robust",
    "Confidence intervals shrink with sample size",
    "Group proportions can vary substantially"
  ),
  `Effect on Conclusions` = c(
    "Two 10% samples can suggest different 'realities'",
    "Outliers in one sample may be normal in another",
    "Relative rankings (Non-OECD > OECD Non-EU > EU) are reliable",
    "Larger samples yield more precise estimates",
    "Representativeness cannot be assumed in small samples"
  ),
  `Recommendation for Future Analysis` = c(
    "Use multiple samples to assess variability",
    "Validate anomalies across different sampling approaches",
    "Focus on relative patterns, not exact values",
    "Report confidence intervals, not just point estimates",
    "Check sample composition against population"
  )
)
cat("## Implications for Drawing Conclusions\n\n")## ## Implications for Drawing Conclusions
| Insight | Effect on Conclusions | Recommendation for Future Analysis |
|---|---|---|
| Small samples (10%) show high variability | Two 10% samples can suggest different ‘realities’ | Use multiple samples to assess variability |
| Anomaly detection is sample-dependent | Outliers in one sample may be normal in another | Validate anomalies across different sampling approaches |
| Directional relationships are robust | Relative rankings (Non-OECD > OECD Non-EU > EU) are reliable | Focus on relative patterns, not exact values |
| Confidence intervals shrink with sample size | Larger samples yield more precise estimates | Report confidence intervals, not just point estimates |
| Group proportions can vary substantially | Representativeness cannot be assumed in small samples | Check sample composition against population |
What This Investigation Reveals:
Sampling Variability is Real and Substantial: Even with 10% of a 42,000-row dataset (~4,200 observations!), different random samples can tell different stories about the same pandemic.
Some Truths Are Robust: While exact numbers vary, the hierarchical relationship between country groups (Non-OECD > OECD Non-EU > EU in stringency) holds across all sample sizes. This suggests some underlying structural truths about the world.
Anomalies Are in the Eye of the Beholder: What looks “extreme” depends heavily on which random subset you examine. This challenges any single-study outlier analysis.
Precision Costs Data: To cut confidence interval width in half, you need roughly four times as much data. There are diminishing returns but clear benefits to larger samples.
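The "four times as much data" claim follows directly from the standard error formula; here is a minimal arithmetic check, with a hypothetical standard deviation `s`:

# 95% CI width = 2 * 1.96 * s / sqrt(n), so width scales with 1/sqrt(n)
s <- 100   # hypothetical standard deviation
n <- 4160  # roughly one 10% sample

width_n  <- 2 * 1.96 * s / sqrt(n)
width_4n <- 2 * 1.96 * s / sqrt(4 * n)
width_4n / width_n  # exactly 0.5: quadrupling n halves the CI width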
Further Questions for Investigation:
Stratified vs. Random Sampling: Would stratified sampling (ensuring representation from each country group) reduce variability compared to pure random sampling? (A sketch follows this list.)
Temporal Sampling Bias: What if our random samples accidentally cluster in certain time periods? How would that distort our understanding of the “average” pandemic experience?
Minimum Sample Size for Reliability: Based on our analysis, what’s the minimum sample size needed to reliably detect the EU vs. Non-OECD stringency difference?
Bootstrapping Applications: Could we use bootstrap resampling (creating many samples from one dataset) to quantify uncertainty in our original Week 2 and 3 findings?
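Two of these questions can be prototyped immediately. The sketch below is illustrative only and assumes `df` from the preparation step: the stratified draw preserves group proportions by construction (for equal-size strata, `slice_sample(n = ...)` per group would be used instead), and the percentile bootstrap puts an interval around the full-data mean without a normality assumption:

# Question 1: stratified sampling -- draw 10% within each country group,
# so group proportions match the population exactly
stratified_10 <- df |>
  group_by(country_group) |>
  slice_sample(prop = 0.10) |>
  ungroup()

# Question 4: percentile bootstrap for the mean case rate
set.seed(123)
boot_means <- replicate(2000, {
  resample <- sample(df$new_cases_smoothed_per_million, replace = TRUE)
  mean(resample, na.rm = TRUE)
})
quantile(boot_means, c(0.025, 0.975))  # 95% percentile bootstrap CI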
Practical Recommendations:
Always Assess Sampling Variability: Before drawing strong conclusions, check how sensitive your findings are to different random samples of your data.
Report Confidence Intervals, Not Just Points: A point estimate without a confidence interval is an incomplete story, especially with smaller samples.
Distinguish Between Exact Values and Relative Patterns: Exact averages may vary, but directional relationships are more stable across samples.
Consider Multiple Sampling Approaches: If something is an anomaly in one sample, check if it remains an anomaly across multiple sampling approaches.