2026-03-01

Introduction

  • While simple random sampling (SRS) is easy to understand, it has several drawbacks.

  • Simple random samples are often:

    • Costly
    • Inefficient when populations are geographically dispersed
    • Dependent on having a full population list
    • Not representative of small subgroups

Multistage Sampling

  • Method of obtaining a sample by splitting a population into successively smaller groups, randomly selecting groups at each stage, and finally sampling individuals from the smallest selected groups

Example: Instead of sampling people directly from the whole country, we first sample clusters such as states, then sample individuals within those states.

Stages of Sampling

  • Stage 1: Primary Sampling Units (PSUs)
    • Divide the population into large clusters (countries, states, universities, etc.) and randomly select some of them
  • Stage 2: Secondary Sampling Units (SSUs)
    • Divide each selected PSU into SSUs (countries -> states, states -> counties, universities -> grade levels) and randomly select some of them
  • Subsequent Stages
    • Repeat the process within each selected unit
  • Final Stage
    • Select individual elements or participants from the final-level clusters

Benefits

  • Cheaper and faster than SRS

    • Reduces travel costs
  • Requires a list of groups rather than a list of every individual

  • Flexible Design

    • Adapt to different population structures

Equations

In multistage sampling, the probability that an individual is selected is the product of the probabilities at each stage.

\[ \pi_i = P(\text{PSU}) \times P(\text{SSU} \mid \text{PSU}) \times P(\text{Individual} \mid \text{SSU}) \]

Example with three stages, assuming simple random sampling without replacement at each stage:

\[ \pi_i = \left(\frac{m}{M}\right) \left(\frac{n_i}{N_i}\right) \left(\frac{s_{ij}}{S_{ij}}\right) \]

where:

  • \(M\) = total number of primary sampling units
  • \(m\) = number of PSUs selected
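  • \(N_i\) = number of secondary units in PSU \(i\)
  • \(n_i\) = number of SSUs selected within PSU \(i\)
  • \(S_{ij}\) = number of individuals in SSU \(j\) of PSU \(i\)
  • \(s_{ij}\) = number of individuals selected within SSU \(j\) of PSU \(i\)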
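
As a worked illustration with hypothetical numbers (not taken from any real survey), suppose 5 of 20 PSUs are selected (\(m = 5\), \(M = 20\)), 2 of the 6 SSUs in a selected PSU are selected (\(n_i = 2\), \(N_i = 6\)), and 10 of the 80 individuals in a selected SSU are selected (\(s_{ij} = 10\), \(S_{ij} = 80\)). Then:

\[ \pi_i = \frac{5}{20} \times \frac{2}{6} \times \frac{10}{80} = \frac{1}{4} \times \frac{1}{3} \times \frac{1}{8} = \frac{1}{96} \]

so each individual in that SSU has roughly a 1% chance (\(1/96 \approx 0.0104\)) of ending up in the sample.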

Weighted Sampling

Different individuals may have different probabilities of being selected at each stage, so we use sampling weights, which are the inverse of the inclusion probability, to produce unbiased population estimates.

\[ w_i = \frac{1}{\pi_i} \]

If weights are used, the population mean can be estimated as:

\[ \bar{y}_w = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} \]
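
The weighted mean above can be sketched in a few lines of base R. The scores and inclusion probabilities here are hypothetical values chosen only to make the arithmetic easy to follow; they do not come from the simulation in this deck.

```r
# Hypothetical scores and inclusion probabilities for three sampled individuals
y    <- c(70, 80, 90)
pi_i <- c(0.10, 0.10, 0.05)  # the third person was half as likely to be selected

# Sampling weights are the inverse of the inclusion probabilities
w <- 1 / pi_i

# Weighted mean: sum(w * y) / sum(w)
y_bar_w <- sum(w * y) / sum(w)

# Unweighted mean, for comparison
y_bar_unweighted <- mean(y)

c(weighted = y_bar_w, unweighted = y_bar_unweighted)
```

The third individual represents twice as many people as each of the others, so the weighted mean (82.5) is pulled toward that score relative to the unweighted mean (80). Base R's `weighted.mean(y, w)` gives the same result.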

Multistage Sampling Simulation

This simulation creates a population of students nested within schools, districts, and counties. We sample counties first, then districts, then schools, and finally students, and compare the sampled students' average test score to the true population average.

First, we define how the population is organized.

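library(dplyr)  # provides %>%, as_tibble(), mutate(), slice_sample(), semi_join()

set.seed(1)  # make the simulation reproducible

# Population structure: number of units inside each parent unit
n_counties  <- 20  # counties in the population
n_districts <- 6   # districts per county
n_schools   <- 4   # schools per district
n_students  <- 80  # students per school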

These variables determine how many units will be randomly selected during the sampling process.

# Sample sizes at each stage
c_counties  <- 5
c_districts <- 2
c_schools   <- 2
c_students  <- 10

  • Stage 1: randomly sample 5 counties
  • Stage 2: from each selected county, sample 2 districts
  • Stage 3: from each selected district, sample 2 schools
  • Stage 4: from each selected school, sample 10 students
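
Multiplying the stage-level sampling fractions gives each student's inclusion probability under this design. The sketch below repeats the settings from the simulation so that it runs on its own; the names `pi_student`, `w_student`, `pop_size`, and `sample_size` are introduced here for illustration.

```r
# Population structure (units inside each parent unit)
n_counties  <- 20
n_districts <- 6
n_schools   <- 4
n_students  <- 80

# Sample sizes at each stage
c_counties  <- 5
c_districts <- 2
c_schools   <- 2
c_students  <- 10

# Inclusion probability: product of the sampling fractions at each stage
pi_student <- (c_counties / n_counties) *
  (c_districts / n_districts) *
  (c_schools / n_schools) *
  (c_students / n_students)

# Sampling weight: number of population students each sampled student represents
w_student <- 1 / pi_student

# Total population size and total sample size
pop_size    <- n_counties * n_districts * n_schools * n_students
sample_size <- c_counties * c_districts * c_schools * c_students

c(pi = pi_student, weight = w_student, N = pop_size, n = sample_size)
```

Every student has the same inclusion probability (1/192), and the weight of 192 equals 38,400 / 200, the population size divided by the sample size. A design like this is called self-weighting: because all weights are equal, the weighted mean reduces to the ordinary sample mean.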

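Here, we build every possible combination of counties, districts, schools, and students. Because district and school labels repeat across counties, we also create unique IDs for them. We set the baseline (overall average) score to 75.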

pop <- expand.grid(
  county   = paste0("County_", 1:n_counties),
  district = paste0("Dist_",   1:n_districts),
  school   = paste0("Sch_",    1:n_schools),
  student  = 1:n_students
) %>%
  as_tibble() %>%
  mutate(
    district_id = paste(county, district, sep = "_"),
    school_id   = paste(district_id, school, sep = "_")
  )

mu <- 75

We create realistic test scores for the simulated population by adding random variation at the county, district, and school levels (standard deviations of 4, 3, and 2 points, respectively).

county_eff   <- rnorm(n_counties, 0, 4)
names(county_eff) <- unique(pop$county)

district_eff <- rnorm(n_counties * n_districts, 0, 3)
names(district_eff) <- unique(pop$district_id)

school_eff   <- rnorm(n_counties * n_districts * n_schools, 0, 2)
names(school_eff) <- unique(pop$school_id)

Below, we create a test score for every student and calculate the true average score.

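pop <- pop %>%
  mutate(
    score = mu +
      # expand.grid() stores county as a factor; convert it so the lookup is by name
      county_eff[as.character(county)] +
      district_eff[district_id] +
      school_eff[school_id] +
      rnorm(n(), 0, 10)  # student-level noise
  )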

true_mean <- mean(pop$score)
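
Written as a model, the scores generated above follow a nested random-effects structure. The symbols below are our own notation, introduced only for illustration; the standard deviations are the ones passed to rnorm() in the code. For student \(k\) in school \(s\) of district \(d\) in county \(c\):

\[ y_{cdsk} = \mu + a_c + b_{cd} + g_{cds} + e_{cdsk}, \quad a_c \sim N(0, 4^2),\; b_{cd} \sim N(0, 3^2),\; g_{cds} \sim N(0, 2^2),\; e_{cdsk} \sim N(0, 10^2) \]

Because the four random terms are drawn independently, the expected variance of a score is the sum of the component variances:

\[ \operatorname{Var}(y_{cdsk}) = 4^2 + 3^2 + 2^2 + 10^2 = 16 + 9 + 4 + 100 = 129 \]

About 22% of that variance (29 of 129) comes from the county, district, and school levels combined, which means students in the same school tend to have somewhat similar scores.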

Now, we finally perform the multistage sampling process.

# Stage 1: sample counties
sampled_counties <- sample(unique(pop$county), c_counties)

# Stage 2: sample districts within counties
sampled_districts <- pop %>%
  filter(county %in% sampled_counties) %>%
  distinct(county, district_id) %>%
  group_by(county) %>%
  slice_sample(n = c_districts) %>%
  ungroup()

# Stage 3: sample schools within districts
sampled_schools <- pop %>%
  semi_join(sampled_districts, by = c("county", "district_id")) %>%
  distinct(district_id, school_id) %>%
  group_by(district_id) %>%
  slice_sample(n = c_schools) %>%
  ungroup()

# Stage 4: sample students within schools
sample_multistage <- pop %>%
  semi_join(sampled_schools, by = c("district_id", "school_id")) %>%
  group_by(school_id) %>%
  slice_sample(n = c_students) %>%
  ungroup()

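Lastly, we calculate the average score from the sample and compare it to the true population average. Because every student has the same inclusion probability in this design, the unweighted sample mean is appropriate.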

sample_mean <- mean(sample_multistage$score)

results <- tibble(
  True_Population_Mean = round(true_mean, 2),
  Multistage_Sample_Mean = round(sample_mean, 2),
  Sample_Size = nrow(sample_multistage)
)

knitr::kable(results)

| True_Population_Mean | Multistage_Sample_Mean | Sample_Size |
|---------------------:|-----------------------:|------------:|
|                75.86 |                  78.35 |         200 |

With only 200 of the 38,400 students sampled, the sample mean (78.35) lands within about 2.5 points of the true mean (75.86), demonstrating how multistage sampling can approximate population characteristics without measuring every individual.

Real-World Applications

National Health Interview Survey (NHIS)

  • Uses a multistage probability design to sample U.S. households for health statistics
  • PSU: Counties
  • SSU: Neighborhoods
  • Final Stage: Households

Source: CDC’s National Center for Health Statistics (NCHS)

American Community Survey (ACS)

  • Geographic Areas
  • Housing Units
  • Individuals

Monitoring the Future (University of Michigan)

  • Geographic Regions
  • Schools
  • Students

National Assessment of Educational Progress (NAEP)

  • States
  • School Districts
  • Schools
  • Students

United States Department of Agriculture

  • States
  • Counties
  • Farms
  • Production unit on farm

Limitations

  • Selected groups may not be the most representative

  • Portions of the population may be excluded

  • Higher sampling error than an SRS of the same size, since individuals within the same cluster tend to be similar

  • More complex to design and analyze
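
The sampling-error limitation can be illustrated by drawing many samples under both designs and comparing how much the sample means vary. This is a minimal base R sketch on a small hypothetical two-level population (50 clusters of 40 individuals), separate from the four-level simulation earlier in the deck; the cluster counts, standard deviations, seed, and function names are all assumptions made for the example.

```r
set.seed(42)  # arbitrary seed, chosen only for reproducibility

# Hypothetical population: 50 clusters of 40 individuals each
n_clusters   <- 50
cluster_size <- 40
cluster_id   <- rep(seq_len(n_clusters), each = cluster_size)

# Scores share a cluster-level effect, so members of a cluster resemble each other
cluster_eff <- rnorm(n_clusters, mean = 0, sd = 5)
y <- 75 + cluster_eff[cluster_id] + rnorm(length(cluster_id), mean = 0, sd = 10)

# Two-stage draw: 5 clusters, then 10 individuals per cluster (n = 50)
draw_two_stage <- function() {
  chosen <- sample(n_clusters, 5)
  idx <- unlist(lapply(chosen, function(cl) sample(which(cluster_id == cl), 10)))
  mean(y[idx])
}

# SRS draw with the same total sample size (n = 50)
draw_srs <- function() mean(y[sample(length(y), 50)])

# Repeat each design many times; the SD of the sample means
# approximates the standard error of each design
reps <- 2000
se_two_stage <- sd(replicate(reps, draw_two_stage()))
se_srs       <- sd(replicate(reps, draw_srs()))

c(two_stage = se_two_stage, srs = se_srs)
```

Under these assumptions the two-stage standard error comes out larger than the SRS standard error, because a two-stage sample of 50 contains information from only 5 clusters. The gap shrinks as the cluster-level variation gets smaller or as more clusters are sampled.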

Works Cited