Data Dive 4 - Sampling and Drawing Conclusions

Week 4 Data Dive (Sampling and Drawing Conclusions)

Introduction

In this data dive, I will explore how different random samples from the same dataset can produce varying results. This helps demonstrate how sampling variability can influence the conclusions we draw from data.

Data Preparation and Sampling Setup

Load Dataset

The dataset contains 4340 rows and 12 columns. To meet the 10% minimum requirement, each of my samples therefore needs at least 434 rows (10% of 4340).

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

dataset <- read.csv("dataset.csv")

Set Up Sampling Parameters

I will start by creating 3 samples, each containing 25% of my original dataset. This means each sample will have 0.25 * 4340 = 1085 rows.

sample_frac <- 0.25
n_samples <- 3

Generate Multiple Samples

I create 3 random samples with replacement. Because the samples are drawn independently, the same observation can appear in more than one sample, and because sampling is with replacement, it can also appear more than once within a single sample. The `sample_num` column identifies which sample each row belongs to.

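# Start with an empty tibble and append one sample per iteration
df_samples <- tibble()

for (sample_i in 1:n_samples) {
  df_i <- dataset |>
    # Draw 25% of the rows with replacement (no seed is set, so results change on every run)
    sample_n(size = round(sample_frac * nrow(dataset)), replace = TRUE) |>
    # Tag each row with the sample it belongs to
    mutate(sample_num = sample_i)

  df_samples <- bind_rows(df_samples, df_i)
}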
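Because the loop above does not set a seed, every run draws different rows. The sketch below shows one way to make the sampling reproducible. It runs on a small toy tibble rather than dataset.csv, and the seed value, the `toy` data, and the use of `slice_sample()` with `lapply()` in place of the loop are all assumptions for illustration, not the method used to produce the output in this report.

```r
library(dplyr)

# Toy stand-in for the real data (hypothetical values, not from dataset.csv)
toy <- tibble(id = 1:200, overall_score = seq(10, 95, length.out = 200))

set.seed(42)  # arbitrary seed, chosen only for illustration

sample_frac <- 0.25
n_samples <- 3
n_rows <- round(sample_frac * nrow(toy))  # 50 rows per sample

# Draw each sample with replacement, tag it, then stack the results
samples <- lapply(seq_len(n_samples), function(i) {
  toy |>
    slice_sample(n = n_rows, replace = TRUE) |>
    mutate(sample_num = i)
}) |>
  bind_rows()

# One row per sample with its size
count(samples, sample_num)
```

With a fixed seed, rerunning the script reproduces the same three samples, so the written commentary and the printed output cannot drift apart between runs.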

Verify Samples

Each sample contains exactly 1085 observations, which is 25% of the original dataset.

df_samples |>
  group_by(sample_num) |>
  summarise(count = n())

## # A tibble: 3 × 2
##   sample_num count
##        <int> <int>
## 1          1  1085
## 2          2  1085
## 3          3  1085

Comparing the Samples

In this section, I compare the random samples created earlier to understand how much results can vary due to random sampling. By comparing summary statistics, categorical distributions, and extreme values across samples, I can see how sampling variability affects patterns and conclusions.

Compare Summary Statistics

First, I compare basic summary statistics of the overall_score variable across the three samples. This helps show whether measures like the mean and median remain stable or change depending on which random sample is collected.

How Different Are the Samples? The average and median overall scores differ slightly across the three samples: the mean ranges from 64.8 to 65.3 and the median from 64.2 to 64.8. While the values are similar, no two samples produce exactly the same summary statistics. This matters because if I had analyzed only one sample, I might have believed that its average score exactly represented the population. In reality, the estimated average changes depending on which random sample is drawn.

df_samples |>
  group_by(sample_num) |>
  summarise(
    mean_score = mean(overall_score, na.rm = TRUE),
    median_score = median(overall_score, na.rm = TRUE),
    sd_score = sd(overall_score, na.rm = TRUE),
    min_score = min(overall_score, na.rm = TRUE),
    max_score = max(overall_score, na.rm = TRUE),
    count = n()
  )

## # A tibble: 3 × 7
##   sample_num mean_score median_score sd_score min_score max_score count
##        <int>      <dbl>        <dbl>    <dbl>     <dbl>     <dbl> <int>
## 1          1       65.3         64.8     17.5      11.8      94.2  1085
## 2          2       65.2         64.7     17.2      19.5      95.3  1085
## 3          3       64.8         64.2     18.2      14.6      95.1  1085
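A rough way to judge whether these differences are larger than chance alone would produce is the standard error of the mean, sd / sqrt(n). The sketch below plugs in approximate inputs: a standard deviation of 17.5 from the table above, and an assumed number of non-missing scores per sample estimated from the plot warning later in this report (2228 of the 3 × 1085 sampled rows have no finite score). Treat the result as an order-of-magnitude check, not an exact figure.

```r
# Approximate inputs taken from the report's printed output
sd_score <- 17.5            # typical sd_score across the three samples
rows_sampled <- 3 * 1085    # total rows across the three samples
rows_removed <- 2228        # rows dropped by stat_density() as non-finite

# Assumed average number of usable scores per sample
n_non_missing <- (rows_sampled - rows_removed) / 3

# Standard error of the sample mean: sd / sqrt(n)
se_mean <- sd_score / sqrt(n_non_missing)
round(se_mean, 2)
```

Under these assumptions the standard error is a little under one point, so a gap of about 0.5 points between sample means is well within what random sampling would be expected to produce.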

Compare Regional Representation Across Samples

In this section, I examine how observations from different regions are represented in each sample. Since rows are selected randomly, the regional composition can vary across samples even though they all come from the same dataset.

Regional Representation Across Samples: The number of observations from each region differs across samples. Some regions appear more frequently in one sample and less frequently in another; for example, North America contributes 21 rows to sample 1 but only 12 to sample 2. This variation shows that random sampling can change the composition of a dataset. If I were studying regional patterns, my conclusions could differ depending on which sample I analyzed.

df_samples |>
  group_by(sample_num, region) |>
  summarise(count = n(), .groups = "drop") |>
  pivot_wider(
    names_from = sample_num,
    values_from = count,
    values_fill = 0
  )

## # A tibble: 7 × 4
##   region                       `1`   `2`   `3`
##   <chr>                      <int> <int> <int>
## 1 East Asia & Pacific          192   197   187
## 2 Europe & Central Asia        278   290   308
## 3 Latin America & Caribbean    206   197   193
## 4 Middle East & North Africa   107   101   105
## 5 North America                 21    12    13
## 6 South Asia                    32    42    43
## 7 Sub-Saharan Africa           249   246   236
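Raw counts make it hard to compare a small region with a large one. As a sketch, the counts from the table above can be converted to within-sample shares and to a relative spread (range divided by mean count) per region. The matrix below is typed in by hand from the printed output, and `relative_spread` is a hypothetical helper name, not something computed elsewhere in this report.

```r
# Counts copied from the table above (rows = regions, columns = samples)
counts <- matrix(
  c(192, 197, 187,
    278, 290, 308,
    206, 197, 193,
    107, 101, 105,
     21,  12,  13,
     32,  42,  43,
    249, 246, 236),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("East Asia & Pacific", "Europe & Central Asia",
      "Latin America & Caribbean", "Middle East & North Africa",
      "North America", "South Asia", "Sub-Saharan Africa"),
    c("sample_1", "sample_2", "sample_3")
  )
)

# Percentage of each sample that comes from each region
shares <- round(100 * prop.table(counts, margin = 2), 1)

# Range of the counts relative to their average, per region
relative_spread <- round(
  (apply(counts, 1, max) - apply(counts, 1, min)) / rowMeans(counts), 2
)

shares
sort(relative_spread, decreasing = TRUE)
```

On these counts, the largest relative spread belongs to North America, the smallest region, while the large regions vary by only a few percent of their size.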

Identifying Anomalies: High Overall Scores

To identify anomalies, I look at observations with very high overall scores (greater than 90). These high values can appear unevenly across samples, especially when samples are smaller.

What Anomalies Appear in One Sample but Not Others? The number of high-scoring observations varies across samples, from 15 in sample 1 to 23 in sample 3. This is important because a single sample could make high-performing countries appear more or less common than they actually are in the full dataset. This highlights how counts of extreme values are sensitive to random sampling.

df_samples |>
  filter(overall_score > 90) |>
  group_by(sample_num) |>
  summarise(
    high_scorers = n(),
    avg_high_score = mean(overall_score, na.rm = TRUE)
  )

## # A tibble: 3 × 3
##   sample_num high_scorers avg_high_score
##        <int>        <int>          <dbl>
## 1          1           15           91.8
## 2          2           20           92.8
## 3          3           23           92.3

Identifying Consistent Patterns Across Samples

Although samples differ in some ways, certain patterns may appear consistently across all samples. Identifying these stable patterns helps distinguish real trends from random noise.

What Patterns Are Consistent Across All Samples? In every sample, average overall score rises with income group: high income scores highest, followed by upper middle, lower middle, and low income. Additionally, the overall range of scores is similar in every sample. This consistency suggests that these patterns are real characteristics of the dataset rather than artifacts of random sampling, which increases confidence in conclusions based on them.

df_samples |>
  group_by(sample_num, income) |>
  summarise(
    count = n(),
    avg_score = mean(overall_score, na.rm = TRUE),
    .groups = "drop"
  )

## # A tibble: 15 × 4
##    sample_num income              count avg_score
##         <int> <chr>               <int>     <dbl>
##  1          1 High income           436      77.4
##  2          1 Low income            138      52.1
##  3          1 Lower middle income   254      56.4
##  4          1 Not classified          8      47.7
##  5          1 Upper middle income   249      66.3
##  6          2 High income           424      76.9
##  7          2 Low income            131      51.6
##  8          2 Lower middle income   265      57.9
##  9          2 Not classified          4      48.6
## 10          2 Upper middle income   261      67.1
## 11          3 High income           430      76.9
## 12          3 Low income            127      48.5
## 13          3 Lower middle income   234      58.9
## 14          3 Not classified          4      45.7
## 15          3 Upper middle income   290      63.5

Visualizing Differences Between Samples

Finally, I visualize the distribution of overall scores for each sample to see how similar their shapes and spreads are. The density plots largely overlap and have similar shapes across all samples, indicating that the overall distribution of scores is fairly stable. However, small shifts between curves show how random sampling can still influence estimated distributions. Note that the warning printed after the plot code reports 2228 rows removed for non-finite values, out of 3 × 1085 = 3255 sampled rows, so the densities (like the summary statistics computed with na.rm = TRUE) are based only on rows with a recorded overall_score.

df_samples |>
  ggplot(aes(x = overall_score, fill = factor(sample_num))) +
  geom_density(alpha = 0.5) +
  labs(
    title = "Distribution of Overall Scores Across Samples",
    x = "Overall Score",
    y = "Density",
    fill = "Sample Number"
  ) +
  theme_minimal()

## Warning: Removed 2228 rows containing non-finite outside the scale range
## (`stat_density()`).
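The warning above points to a large number of missing scores, which affects how many rows actually contribute to each statistic. A minimal sketch of how the missingness could be counted per sample is shown below. It uses a toy tibble with made-up values in place of `df_samples`, and the column names `n_rows`, `n_missing`, and `pct_missing` are illustrative.

```r
library(dplyr)

# Toy stand-in for df_samples (hypothetical values)
toy_samples <- tibble(
  sample_num = rep(1:3, each = 4),
  overall_score = c(70, NA, 65, NA, NA, 80, NA, NA, 55, 60, NA, 90)
)

# Count total rows and missing scores within each sample
missing_summary <- toy_samples |>
  group_by(sample_num) |>
  summarise(
    n_rows = n(),
    n_missing = sum(is.na(overall_score)),
    pct_missing = round(100 * mean(is.na(overall_score)), 1)
  )

missing_summary
```

Running the same summary on `df_samples` would show how many usable scores sit behind each sample's mean, median, and density curve.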

Further Questions for Investigation

Based on these comparisons, several questions emerge:

Sample size impact: Would larger samples (50% or 75%) reduce the variability I observed in high-scorer counts and regional representation?
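Regional patterns: Why do small regions such as North America (12 to 21 observations) and South Asia (32 to 43) show the largest relative variation across samples, while Latin America & Caribbean stays fairly stable (193 to 206)? Is this simply a consequence of their small share of the dataset?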
Income classification: The “Not classified” category appears inconsistently. How would removing or handling these observations affect conclusions?
Score clustering: The density plot shows a possible bimodal distribution (two peaks). Is this a real pattern in the full dataset, or an artifact of sampling?

Summary of Key Findings

From comparing three random samples (each 25% of the dataset), I found:

Variability observed:
- Sample means differed by approximately 0.5 points (64.8 to 65.3)
- High-scoring observations (overall score above 90) ranged from 15 to 23 across samples
- Regional representation varied, most visibly for small regions (e.g., North America: 12 to 21 observations)

Consistent patterns:
- Higher-income groups consistently scored higher across all samples
- Overall score distributions showed similar shapes and ranges
- Maximum scores were nearly identical across samples (about 94 to 95), while minimum scores varied more (about 12 to 20)

Implication: While samples show variation, core relationships (such as the income–score relationship) remain stable. This suggests that some conclusions are robust to sampling variability, while others (such as exact counts of high-performing countries) are more sensitive to random sampling.

Testing Different Sample Sizes

In this section, I test how results change when the relative size of the samples increases. I compare samples that include 10%, 25%, and 75% of the dataset to see whether larger samples produce more stable results.

This directly addresses the assignment question: How does this comparison change as you increase the relative size of the sub-samples?

Create Samples with Different Sizes

I generate random samples with replacement using three different sample sizes. For each size, I draw three samples so that I can compare variability within the same size and across different sizes.

# Test three different sample sizes
sample_sizes <- c(0.10, 0.25, 0.75)
df_all_sizes <- tibble()

for (size in sample_sizes) {
  for (i in 1:3) {
    df_temp <- dataset |>
      sample_n(size = size * nrow(dataset), replace = TRUE) |>
      mutate(sample_size = size, sample_num = i)

    df_all_sizes <- bind_rows(df_all_sizes, df_temp)
  }
}

Compare Mean Overall Scores Across Sample Sizes

I compare the mean overall_score across samples of different sizes. This helps show whether estimates become more consistent as sample size increases.

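The 10% samples show the most variation: their means span 2.9 points (63.4 to 66.3), compared with 0.3 points for the 25% samples (64.6 to 64.9) and 0.8 points for the 75% samples (64.4 to 65.2). Larger samples are therefore more consistent than the smallest ones, although the trend is not perfectly monotonic in this run, since the 75% means are slightly more spread out than the 25% means. With only three samples per size, that difference is itself likely due to chance.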

df_all_sizes |>
  group_by(sample_size, sample_num) |>
  summarise(
    mean_score = mean(overall_score, na.rm = TRUE),
    .groups = "drop"
  )

## # A tibble: 9 × 3
##   sample_size sample_num mean_score
##         <dbl>      <int>      <dbl>
## 1        0.1           1       63.4
## 2        0.1           2       66.3
## 3        0.1           3       65.3
## 4        0.25          1       64.6
## 5        0.25          2       64.7
## 6        0.25          3       64.9
## 7        0.75          1       64.5
## 8        0.75          2       64.4
## 9        0.75          3       65.2
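To put a number on how spread out the means are at each size, the range (maximum minus minimum) can be computed per sample size. The sketch below retypes the nine means from the table above into a data frame; `means` and `spread` are illustrative names, and with only three samples per size the range is a crude measure.

```r
# Means copied from the table above
means <- data.frame(
  sample_size = rep(c(0.10, 0.25, 0.75), each = 3),
  mean_score = c(63.4, 66.3, 65.3,
                 64.6, 64.7, 64.9,
                 64.5, 64.4, 65.2)
)

# Range of the three sample means at each sample size
spread <- aggregate(
  mean_score ~ sample_size,
  data = means,
  FUN = function(x) max(x) - min(x)
)
names(spread)[2] <- "range_of_means"

spread
```

The 10% samples have by far the widest range, while the 25% and 75% samples are both under one point.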

Visualizing the Effect of Sample Size

To better see how variability changes with sample size, I visualize the mean scores for each sample at each size. The plot shows that the means from the 10% samples are the most spread out, while the means from the 25% and 75% samples sit much closer together. Larger samples reduce variability caused by random sampling, while smaller samples show more spread.

df_all_sizes |>
  group_by(sample_size, sample_num) |>
  summarise(
    mean_score = mean(overall_score, na.rm = TRUE),
    .groups = "drop"
  ) |>
  ggplot(aes(
    x = factor(sample_size),
    y = mean_score,
    color = factor(sample_num)
  )) +
  geom_point(size = 3) +
  labs(
    title = "Mean Overall Score by Sample Size",
    x = "Sample Size",
    y = "Mean Overall Score",
    color = "Sample Number"
  ) +
  theme_minimal()

Key Takeaway from Sample Size Comparison

Key finding: Larger samples (75%) show less variation in the mean than smaller samples (10%): a spread of 0.8 points versus 2.9 points across three samples.

Why it matters: If I only collected a small sample, my conclusions could be strongly influenced by random chance. Larger samples provide more stable and reliable results.

Further question: What is the smallest sample size that still produces reliable estimates for this dataset?
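One way to explore that further question is to draw many samples at each size instead of three and measure how much the sample means vary. The sketch below does this on simulated data: `population` is a hypothetical stand-in drawn from a normal distribution with a mean and standard deviation loosely based on the tables above, and the seed and the 500 replicates are arbitrary choices. It illustrates the approach and is not a result about the real dataset.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical population of 4340 scores (not the real overall_score column)
population <- rnorm(4340, mean = 65, sd = 17.5)

fracs <- c(0.10, 0.25, 0.75)

# For each sample fraction, draw 500 samples with replacement and
# record the standard deviation of their means
sd_of_means <- sapply(fracs, function(f) {
  n <- round(f * length(population))
  sample_means <- replicate(500, mean(sample(population, n, replace = TRUE)))
  sd(sample_means)
})
names(sd_of_means) <- paste0(100 * fracs, "%")

round(sd_of_means, 2)
```

The spread of the means shrinks roughly in proportion to 1 / sqrt(n), so the same code with a finer grid of fractions could be used to find the size at which the spread falls below whatever tolerance counts as reliable.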

Reflection on Drawing Conclusions from Data

This investigation shows that random sampling can produce different results even when samples come from the same dataset. Small samples are more sensitive to random variation and may exaggerate differences or extreme values.

Larger samples reduce this variability and produce more consistent results, making conclusions more reliable. However, some patterns, such as the relationship between income level and overall score, remain consistent across all samples, suggesting these patterns reflect real characteristics of the dataset.

Future approach: When drawing conclusions from data, I will consider sample size carefully, test multiple samples when possible, and place greater confidence in patterns that persist across samples rather than results from a single small sample.