In this data dive, I will explore how different random samples from the same dataset can produce varying results. This helps demonstrate how sampling variability can influence the conclusions we draw from data.
The dataset contains 4340 rows and 12 columns. To meet the 10% minimum requirement, each of my samples needs at least 10% of 4340 rows, or 434 rows.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read.csv("dataset.csv")
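As a quick sanity check (a minimal sketch, assuming the CSV loads with the dimensions stated above), I can confirm the row and column counts and compute the 10% floor:
# Confirm dimensions and the smallest allowed sample size
dim(dataset)                  # should report 4340 rows and 12 columns
ceiling(0.10 * nrow(dataset)) # 10% floor: 434 rows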
I will start by creating three samples, each containing 25% of my original dataset, so each sample will have 0.25 * 4340 = 1085 rows.
sample_frac <- 0.25 # each sample draws 25% of the rows
n_samples <- 3      # number of independent samples
Below, I create three random samples with replacement. Each sample is drawn independently, so the same observation can appear in more than one sample (and, because I sample with replacement, more than once within a single sample). The `sample_num` column identifies which sample each row belongs to.
df_samples <- tibble()
for (sample_i in 1:n_samples) {
  # Draw one sample with replacement and tag it with its sample number
  df_i <- dataset |>
    sample_n(size = sample_frac * nrow(dataset), replace = TRUE) |>
    mutate(sample_num = sample_i)
  df_samples <- bind_rows(df_samples, df_i)
}
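Because these draws are random, the exact rows selected change on every run. Adding a set.seed() call before the sampling code (a one-line sketch; the seed value is arbitrary) would make the samples reproducible:
set.seed(42) # any fixed seed makes the random draws repeatable across runs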
Each sample contains exactly 1085 observations, which is 25% of the original dataset.
df_samples |>
group_by(sample_num) |>
summarise(count = n())
## # A tibble: 3 × 2
## sample_num count
## <int> <int>
## 1 1 1085
## 2 2 1085
## 3 3 1085
In this section, I compare the random samples created earlier to understand how much results can vary due to random sampling. By comparing summary statistics, categorical distributions, and extreme values across samples, I can see how sampling variability affects patterns and conclusions.
First, I compare basic summary statistics of the overall_score variable across the three samples. This helps show whether measures like the mean and median remain stable or change depending on which random sample is collected.
How Different Are the Samples? The average and median overall scores differ slightly across the three samples. While the values are generally similar, none of the samples produce exactly the same summary statistics. This matters because if I had analyzed only one sample, I might have believed that its average score fully represented the population. In reality, the estimated average changes depending on which random sample is drawn.
df_samples |>
group_by(sample_num) |>
summarise(
mean_score = mean(overall_score, na.rm = TRUE),
median_score = median(overall_score, na.rm = TRUE),
sd_score = sd(overall_score, na.rm = TRUE),
min_score = min(overall_score, na.rm = TRUE),
max_score = max(overall_score, na.rm = TRUE),
count = n()
)
## # A tibble: 3 × 7
## sample_num mean_score median_score sd_score min_score max_score count
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 65.3 64.8 17.5 11.8 94.2 1085
## 2 2 65.2 64.7 17.2 19.5 95.3 1085
## 3 3 64.8 64.2 18.2 14.6 95.1 1085
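To put a single number on this variability, I can compute the spread of the three sample means (a small sketch reusing df_samples from above):
# Range of the sample means: one number summarizing between-sample variability
df_samples |>
  group_by(sample_num) |>
  summarise(mean_score = mean(overall_score, na.rm = TRUE)) |>
  summarise(spread = max(mean_score) - min(mean_score))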
In this section, I examine how countries from different regions are represented in each sample. Since rows are selected randomly, the regional composition can vary across samples even though they all come from the same dataset.
Regional Representation Across Samples: The number of countries from each region differs across samples. Some regions appear more frequently in one sample and less frequently in another. This variation shows that random sampling can change the composition of a dataset. If I were studying regional patterns, my conclusions could differ depending on which sample I analyzed.
df_samples |>
group_by(sample_num, region) |>
summarise(count = n(), .groups = "drop") |>
pivot_wider(
names_from = sample_num,
values_from = count,
values_fill = 0
)
## # A tibble: 7 × 4
## region `1` `2` `3`
## <chr> <int> <int> <int>
## 1 East Asia & Pacific 192 197 187
## 2 Europe & Central Asia 278 290 308
## 3 Latin America & Caribbean 206 197 193
## 4 Middle East & North Africa 107 101 105
## 5 North America 21 12 13
## 6 South Asia 32 42 43
## 7 Sub-Saharan Africa 249 246 236
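For context, the regional counts in the full dataset (a quick sketch) show what balanced representation would look like; each 25% sample should land near a quarter of each count:
# Full-dataset regional counts and the expected count per 25% sample
dataset |>
  count(region) |>
  mutate(expected_per_sample = round(n * sample_frac))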
To identify anomalies, I look at countries with very high overall scores (greater than 90). These high values can appear unevenly across samples, especially when samples are smaller.
What Anomalies Appear in One Sample but Not Others? The number of high-scoring countries varies across samples. One sample may include many high-scoring observations, while another includes very few. This is important because a single sample could make high-performing countries appear more or less common than they actually are in the full dataset. This highlights how extreme values can be sensitive to random sampling.
df_samples |>
filter(overall_score > 90) |>
group_by(sample_num) |>
summarise(
high_scorers = n(),
avg_high_score = mean(overall_score, na.rm = TRUE)
)
## # A tibble: 3 × 3
## sample_num high_scorers avg_high_score
## <int> <int> <dbl>
## 1 1 15 91.8
## 2 2 20 92.8
## 3 3 23 92.3
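To judge whether 15 to 23 is a lot or a little, it helps to know the baseline: how many scores above 90 the full dataset actually contains (a sketch; each 25% sample would be expected to capture roughly a quarter of them):
# Baseline: high scorers in the full dataset
dataset |>
  filter(overall_score > 90) |>
  summarise(
    high_scorers = n(),
    avg_high_score = mean(overall_score, na.rm = TRUE)
  )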
Although samples differ in some ways, certain patterns may appear consistently across all samples. Identifying these stable patterns helps distinguish real trends from random noise.
What Patterns Are Consistent Across All Samples? Across all samples, higher-income countries consistently have higher average overall scores than lower-income countries. Additionally, the overall range of scores is similar in every sample. These consistent patterns suggest that they are likely real characteristics of the dataset rather than results of random sampling. This increases confidence in conclusions based on these trends.
df_samples |>
group_by(sample_num, income) |>
summarise(
count = n(),
avg_score = mean(overall_score, na.rm = TRUE),
.groups = "drop"
)
## # A tibble: 15 × 4
## sample_num income count avg_score
## <int> <chr> <int> <dbl>
## 1 1 High income 436 77.4
## 2 1 Low income 138 52.1
## 3 1 Lower middle income 254 56.4
## 4 1 Not classified 8 47.7
## 5 1 Upper middle income 249 66.3
## 6 2 High income 424 76.9
## 7 2 Low income 131 51.6
## 8 2 Lower middle income 265 57.9
## 9 2 Not classified 4 48.6
## 10 2 Upper middle income 261 67.1
## 11 3 High income 430 76.9
## 12 3 Low income 127 48.5
## 13 3 Lower middle income 234 58.9
## 14 3 Not classified 4 45.7
## 15 3 Upper middle income 290 63.5
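As a check that this pattern is not a sampling artifact, the same summary on the full dataset (a sketch) should show the same ordering of income groups:
# Income-group averages in the full dataset, for comparison
dataset |>
  group_by(income) |>
  summarise(
    count = n(),
    avg_score = mean(overall_score, na.rm = TRUE)
  )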
Finally, I visualize the distribution of overall scores for each sample to see how similar their shapes and spreads are. The density curves largely overlap, indicating that the overall distribution of scores is fairly stable across samples. However, the small shifts between the curves show that random sampling can still influence the estimated distribution, even when all samples come from the same dataset.
df_samples |>
ggplot(aes(x = overall_score, fill = factor(sample_num))) +
geom_density(alpha = 0.5) +
labs(
title = "Distribution of Overall Scores Across Samples",
x = "Overall Score",
y = "Density",
fill = "Sample Number"
) +
theme_minimal()
## Warning: Removed 2228 rows containing non-finite outside the scale range
## (`stat_density()`).
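The warning corresponds to rows with missing overall_score values, which geom_density() drops as non-finite. A quick check (sketch) counts them directly:
# How many overall_score values are missing in the pooled samples?
sum(is.na(df_samples$overall_score))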
Based on these comparisons of three random samples (each 25% of the dataset), I found the following:
Variability observed:
- Sample means differed by about half a point (64.8 to 65.3)
- High-scoring countries ranged from 15 to 23 across samples
- Regional representation varied notably (e.g., Latin America & Caribbean: 193–206 observations)
Consistent patterns:
- Higher-income countries consistently scored higher across all samples
- Overall score distributions showed similar shapes and spreads
- All samples spanned a similarly wide range of scores (minimums between roughly 12 and 20, maximums around 95)
Implication: While samples show variation, core relationships (such as the income–score relationship) remain stable. This suggests that some conclusions are robust to sampling variability, while others (such as exact counts of high-performing countries) are more sensitive to random sampling.
In this section, I test how results change when the relative size of the samples increases. I compare samples that include 10%, 25%, and 75% of the dataset to see whether larger samples produce more stable results.
This directly addresses the assignment question: How does this comparison change as you increase the relative size of the sub-samples?
I generate random samples with replacement using three different sample sizes. For each size, I draw three samples so that I can compare variability within the same size and across different sizes.
# Test three different sample sizes
sample_sizes <- c(0.10, 0.25, 0.75)
df_all_sizes <- tibble()
for (size in sample_sizes) {
for (i in 1:3) {
df_temp <- dataset |>
sample_n(size = size * nrow(dataset), replace = TRUE) |>
mutate(sample_size = size, sample_num = i)
df_all_sizes <- bind_rows(df_all_sizes, df_temp)
}
}
I compare the mean overall_score across samples of different sizes. This helps show whether estimates become more consistent as sample size increases.
As sample size increases from 10% to 75%, the mean overall scores become more similar across the three samples: the 10% samples show noticeably more variation, while the 25% and 75% samples are much closer together. (With only three samples per size, the spread itself is noisy; in this run the 25% samples happen to be the tightest.) Overall, larger samples produce more consistent estimates of the mean.
df_all_sizes |>
group_by(sample_size, sample_num) |>
summarise(
mean_score = mean(overall_score, na.rm = TRUE),
.groups = "drop"
)
## # A tibble: 9 × 3
## sample_size sample_num mean_score
## <dbl> <int> <dbl>
## 1 0.1 1 63.4
## 2 0.1 2 66.3
## 3 0.1 3 65.3
## 4 0.25 1 64.6
## 5 0.25 2 64.7
## 6 0.25 3 64.9
## 7 0.75 1 64.5
## 8 0.75 2 64.4
## 9 0.75 3 65.2
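To quantify this, I can compute the within-size spread of the three means (a sketch reusing df_all_sizes); smaller spreads at larger sizes would confirm the pattern:
# Spread (max - min) of the three sample means at each sample size
df_all_sizes |>
  group_by(sample_size, sample_num) |>
  summarise(mean_score = mean(overall_score, na.rm = TRUE), .groups = "drop") |>
  group_by(sample_size) |>
  summarise(spread = max(mean_score) - min(mean_score))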
To better see how variability changes with sample size, I visualize the mean scores for each sample at each size. This plot shows that as sample size increases, the mean scores from different samples move closer together. Larger samples reduce variability caused by random sampling, while smaller samples show more spread.
df_all_sizes |>
group_by(sample_size, sample_num) |>
summarise(
mean_score = mean(overall_score, na.rm = TRUE),
.groups = "drop"
) |>
ggplot(aes(
x = factor(sample_size),
y = mean_score,
color = factor(sample_num)
)) +
geom_point(size = 3) +
labs(
title = "Mean Overall Score by Sample Size",
x = "Sample Size",
y = "Mean Overall Score",
color = "Sample Number"
) +
theme_minimal()
Key finding: Larger samples (75%) show less variation than smaller samples (10%).
Why it matters: If I only collected a small sample, my conclusions could be strongly influenced by random chance. Larger samples provide more stable and reliable results.
Further question: What is the smallest sample size that still produces reliable estimates for this dataset?
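One way to approach this question (a sketch I did not run for this write-up; the fraction grid and replicate count are arbitrary choices): repeat the sampling many times at each of several fractions and watch how the standard deviation of the sample means shrinks.
# Sketch: spread of sample means across many replicates per sample fraction
fractions <- c(0.05, 0.10, 0.25, 0.50, 0.75)
n_reps <- 50
map_dfr(fractions, function(frac) {
  means <- map_dbl(1:n_reps, function(i) {
    dataset |>
      slice_sample(n = round(frac * nrow(dataset)), replace = TRUE) |>
      summarise(m = mean(overall_score, na.rm = TRUE)) |>
      pull(m)
  })
  tibble(sample_frac = frac, sd_of_means = sd(means))
})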
This investigation shows that random sampling can produce different results even when samples come from the same dataset. Small samples are more sensitive to random variation and may exaggerate differences or extreme values.
Larger samples reduce this variability and produce more consistent results, making conclusions more reliable. However, some patterns, such as the relationship between income level and overall score, remain consistent across all samples, suggesting these patterns reflect real characteristics of the dataset.
Future approach: When drawing conclusions from data, I will consider sample size carefully, test multiple samples when possible, and place greater confidence in patterns that persist across samples rather than results from a single small sample.