class: center, middle, inverse, title-slide

.title[
# Sampling Methods
]
.subtitle[
## Unit Project · NOVA IMS · 2025–2026
]
.author[
### Pedro Fernandes · Ricardo Branco
]
.date[
### Professor Pedro Simões Coelho
]

---
class: cover-slide, middle

# Sampling Methods
## Three Designs Applied to a Population of N = 50,000

<br>

**Pedro Fernandes · Ricardo Branco**

.muted[NOVA IMS · Master in Statistics · 2025–2026  
Professor Pedro Simões Coelho]

---
class: section-slide, middle

.label[01]
# **Motivation**

---

# Motivation

.pull-left[

### The problem

Surveying an entire population is rarely feasible.  
Sampling allows us to estimate population parameters from a fraction of the data — but **the design matters**.

### Three fundamental questions

- How many observations do we need?
- How do we select them?
- How do we compare competing designs?

### Why it matters

Sample size is directly associated with **cost**.  
A more efficient design achieves the same precision with fewer observations — reducing both financial and operational burden.

]

.pull-right[

### This project

We apply and compare **three sampling designs** to a finite population of N = 50,000 individuals with known socioeconomic characteristics.

Because the full population is available, true parameter values are known — allowing a rigorous, controlled comparison.

> The goal is not just to estimate. It is to understand **which design does it best**, and at what cost.

]

---
class: section-slide, middle

.label[02]
# **Study Design**

---

# Data, Parameters and Precision Targets

.col-left[

### Population

- **N = 50,000** individuals
- 13 variables per person
- Treated as the complete target population

### Three parameters of interest

| Parameter | Estimator |
|-----------|-----------|
| Mean income | `$\hat{\mu}$` |
| Poverty rate | `$\hat{p}$` |
| Total expenditure | `$\hat{\tau}$` |

### Common framework

- Confidence: **95%** (`$Z = 1.96$`)
- All designs evaluated on equal footing

]

.col-right[

### Precision targets

``` r
tgt <- data.frame(
  Parameter = c("Mean income", "Poverty rate", "Total expenditure"),
  Target    = c("$d = 3\\%$ of mean",
                "$d = 3$ pp",
                "$d = 3\\%$ of total"),
  Value     = c(paste0("±€", fmt_num(d_income)),
                "±0.03",
                paste0("±€", fmt_num(d_exp)))
)
kable(tgt, escape = FALSE, booktabs = TRUE,
      col.names = c("Parameter", "Target", "Value"),
      align = c("l","l","r")) |>
  kable_styling(font_size = 13, full_width = TRUE) |>
  row_spec(0, background = "#000000", color = "white")
```

### Three designs compared

1. **Design 1** — Simple Random Sampling (SRS)
2. **Design 2** — Stratified Sampling (STRS)
3. **Design 3** — Cluster PPS

]

---
class: section-slide, middle

.label[03]
# **Design 1**
# Simple Random Sampling

---

# Design 1 — Simple Random Sampling

.pull-left[

### What it is

Every individual has the **same probability** of selection. No structure, no grouping. The natural baseline.

### Sample size formulas

`$$n_\mu = \frac{Z^2 S^2 N}{Z^2 S^2 + d_\mu^2 N}$$`

`$$n_p = \frac{Z^2 p(1-p) N}{Z^2 p(1-p) + d_p^2 N}$$`

`$$n_\tau = \frac{Z^2 S^2 N^2}{d_\tau^2 + Z^2 N S^2}$$`

### Three candidate sizes

`$$n = \max(1026,\ 821,\ 1128) = 1128$$`

Binding constraint: **total expenditure**

]

.pull-right[

### Results

``` r
srs_s <- data.frame(
  Parameter = c("Mean income (EUR)",
                "Poverty rate",
                "Total exp. (EUR)"),
  Estimate  = c(fmt_num(est_income, 1),
                fmt_pct(est_pov),
                fmt_num(est_exp)),
  D         = c(fmt_num(abs_income, 1),
                fmt_pct(abs_pov),
                fmt_num(abs_exp)),
  r         = c(fmt_rel(rel_income),
                fmt_rel(rel_pov),
                fmt_rel(rel_exp))
)
kable(srs_s, escape = TRUE, booktabs = TRUE,
      col.names = c("Parameter","Estimate","D","r (%)"),
      align = c("l","r","r","r")) |>
  kable_styling(font_size = 12, full_width = TRUE) |>
  row_spec(0, background = "#000000", color = "white")
```

<br>

.box-black[
**n = 1128** · Sampling fraction f = 2.26%  
Income and poverty targets met ✓ · expenditure marginally above target (r = 3.01% vs 3%)
]

]
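
---

# SRS Sample Sizes in R

The three sample size formulas for Design 1 can be sketched as small R functions. This is a minimal illustration and not the project's own code: the function names `n_srs_mean`, `n_srs_prop` and `n_srs_total` are hypothetical. The only inputs taken from the deck are N = 50,000, Z = 1.96, the 3 pp poverty target and the population value p(1 − p) = 0.195472 listed in the design-effect table.

``` r
# Sample size for a mean under SRS, with finite population correction
n_srs_mean <- function(S2, d, N, Z = 1.96) {
  Z^2 * S2 * N / (Z^2 * S2 + d^2 * N)
}

# Sample size for a proportion: S2 is replaced by p(1 - p)
n_srs_prop <- function(pq, d, N, Z = 1.96) {
  n_srs_mean(pq, d, N, Z)
}

# Sample size for a total, with d_tau expressed on the scale of the total
n_srs_total <- function(S2, d_tau, N, Z = 1.96) {
  Z^2 * S2 * N^2 / (d_tau^2 + Z^2 * N * S2)
}

N <- 50000

# Poverty rate: p(1 - p) = 0.195472, d = 0.03
n_pov <- ceiling(n_srs_prop(0.195472, 0.03, N))
n_pov

# A margin d on the mean equals a margin N * d on the total,
# so both formulas return the same n (S2 and d here are arbitrary)
same_n <- all.equal(n_srs_mean(100, 0.5, N), n_srs_total(100, N * 0.5, N))
same_n
```

With these inputs `n_pov` rounds up to 821, the poverty candidate size shown for Design 1. The income and expenditure sizes need the population mean and total, which are not printed in the slides, so they are not recomputed here.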

---
class: section-slide, middle

.label[04]
# **Design 2**
# Stratified Sampling

---

# Design 2 — Stratification Variable

.pull-left[

### Why education level?

Education is strongly correlated with **income** and **poverty** — it creates strata that are homogeneous within and heterogeneous between.

The within-stratum pooled variance for income:

`$$S^2_{\text{intra}} = 48{,}041{,}159 \ll S^2 = 187{,}739{,}928$$`

A reduction of **74.4%** — the source of the efficiency gain.

### Four strata

]

.pull-right[

``` r
# Display order of the education strata and their row positions in strata_info
strata_ord <- c("Primary", "Secondary", "Bachelor", "Master+")
idx <- match(strata_ord, strata_info$Education)

st <- data.frame(
  Stratum = strata_ord,
  N_h     = fmt_num(strata_info$N_h[idx]),
  W_h     = paste0(round(strata_info$W_h[idx] * 100, 1), "%"),
  pov     = paste0(round(strata_info$p_h_pov[idx] * 100, 1), "%"),
  S_exp   = fmt_num(strata_info$S_h_exp[idx], 1)
)

kable(st,
      escape    = FALSE,
      booktabs  = TRUE,
      col.names = c("Stratum", "$N_h$", "$W_h$", "Poverty", "$S_h$ exp."),
      align     = c("l", "r", "r", "r", "r")) |>
  kable_styling(font_size = 12, full_width = TRUE) |>
  row_spec(0, background = "#000000", color = "white") |>
  row_spec(4, bold = TRUE)
```

.box-gray[
**Master+:** `$p_h = 0$` → `$S_{h,\text{pov}} = 0$`  
Zero poverty variance — structural, not a sampling artefact.
]

]

---

# Design 2 — Allocation Methods

.pull-left[

### Proportional allocation &nbsp; n = 538

`$$n_h = \frac{N_h}{N} \cdot n$$`

Sample mirrors population structure.  
Driven by **poverty** (`$n_{\text{prop,pov}} = 538$`).

<hr>

### Optimal (Neyman) &nbsp; n = 377

`$$n_h = \frac{N_h S_h}{\sum_j N_j S_j} \cdot n$$`

**Reference variable: total expenditure**, the second-largest sample size requirement, so every stratum receives units.  
Poverty cannot be the reference: with `$S_{h,\text{pov}} = 0$`, Master+ would receive zero units.

<hr>

### Compromise &nbsp; n = 377

Averages the income-based and expenditure-based allocation shares.  
Both are scaled to the same total, `$n = 377$`.

]

.pull-right[

``` r
strata_ord <- c("Primary","Secondary","Bachelor","Master+")
al <- data.frame(
  Stratum = strata_ord,
  N_h     = fmt_num(strata_info$N_h[match(strata_ord,
                    strata_info$Education)]),
  n_prop  = as.integer(alloc_prop[strata_ord]),
  n_opt   = as.integer(alloc_opt[strata_ord])
)
al <- rbind(al, data.frame(
  Stratum = "Total",
  N_h     = fmt_num(N),
  n_prop  = sum(alloc_prop),
  n_opt   = sum(alloc_opt)
))
kable(al,
      escape    = FALSE,
      booktabs  = TRUE,
      col.names = c("Stratum", "$N_h$", "$n_h$ Prop.", "$n_h$ Opt."),
      align     = c("l", "r", "r", "r")) |>
  kable_styling(font_size = 12, full_width = TRUE) |>
  row_spec(0, background = "#000000", color = "white") |>
  row_spec(5, bold = TRUE)
```

<br>

``` r
data.frame(
  Design = factor(c("SRS","Proportional","Optimal"),
                  levels=c("SRS","Proportional","Optimal")),
  n = c(n_srs, n_prop, n_opt)
) |>
  ggplot(aes(x=Design, y=n, fill=Design)) +
  geom_col(width=0.55, show.legend=FALSE) +
  geom_text(aes(label=n), vjust=-0.4, fontface="bold", size=4) +
  scale_fill_manual(values=c("#aaaaaa","#555555","#000000")) +
  scale_y_continuous(limits=c(0,1350), expand=c(0,0)) +
  labs(x=NULL, y="n") +
  theme_minimal(base_size=12) +
  theme(panel.grid.major.x=element_blank(),
        panel.grid.minor=element_blank())
```

]
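
---

# Allocation Sketch in R

The proportional and Neyman formulas differ only in the weights used to split `$n$` across strata: `$N_h$` in the first case, `$N_h S_h$` in the second. The sketch below illustrates this under stated assumptions: the helper `allocate` is hypothetical, the largest-remainder rounding is one possible choice (the deck does not say how fractional `$n_h$` were rounded), and the stratum sizes and standard deviations are invented for illustration, not the project's values.

``` r
# Split n across strata in proportion to `weights`,
# rounding with the largest-remainder rule so the parts sum to n
allocate <- function(n, weights) {
  raw  <- n * weights / sum(weights)
  n_h  <- floor(raw)
  left <- n - sum(n_h)
  if (left > 0) {
    bump <- order(raw - n_h, decreasing = TRUE)[seq_len(left)]
    n_h[bump] <- n_h[bump] + 1
  }
  n_h
}

# Hypothetical strata (NOT the project's values)
N_h <- c(Primary = 20000, Secondary = 11000, Bachelor = 12000, `Master+` = 7000)
S_h <- c(Primary = 0.48,  Secondary = 0.40,  Bachelor = 0.15,  `Master+` = 0)

n_prop_h <- allocate(538, N_h)        # proportional: weights N_h
n_ney_h  <- allocate(377, N_h * S_h)  # Neyman: weights N_h * S_h

rbind(proportional = n_prop_h, neyman = n_ney_h)
```

Because the last stratum has `S_h = 0` in this toy example, the Neyman row assigns it no units, which is the degenerate case described for poverty in Master+.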

---

# Design 2 — Results

``` r
sr <- data.frame(
  Allocation = rep(c("Proportional","Optimal"), each=3),
  Parameter  = rep(c("Mean income (EUR)","Poverty rate","Total exp. (EUR)"),2),
  n = c(rep(fmt_num(nrow(samp_prop)),3),
        rep(fmt_num(nrow(samp_opt)),3)),
  Estimate = c(fmt_num(p_inc$est,1), fmt_pct(p_pov$est),
               fmt_num(p_exp$est),
               fmt_num(o_inc$est,1), fmt_pct(o_pov$est),
               fmt_num(o_exp$est)),
  D = c(fmt_num(p_inc$abs_prec,1), fmt_pct(p_pov$abs_prec),
        fmt_num(p_exp$abs_prec),
        fmt_num(o_inc$abs_prec,1), fmt_pct(o_pov$abs_prec),
        fmt_num(o_exp$abs_prec)),
  r = c(fmt_rel(p_inc$rel_prec), fmt_rel(p_pov$rel_prec),
        fmt_rel(p_exp$rel_prec),
        fmt_rel(o_inc$rel_prec), fmt_rel(o_pov$rel_prec),
        fmt_rel(o_exp$rel_prec))
)
kable(sr, escape=TRUE, booktabs=TRUE,
      col.names=c("Allocation","Parameter","n","Estimate","D","r (%)"),
      align=c("l","l","r","r","r","r")) |>
  kable_styling(font_size=12, full_width=TRUE,
                latex_options="scale_down") |>
  row_spec(0, background="#000000", color="white") |>
  row_spec(3, extra_css="border-bottom: 2px solid #000;")
```

<table class="table" style="font-size: 12px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Allocation </th>
   <th style="text-align:left;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Parameter </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> n </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Estimate </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> D </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> r (%) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Proportional </td>
   <td style="text-align:left;"> Mean income (EUR) </td>
   <td style="text-align:right;"> 377 </td>
   <td style="text-align:right;"> 28,145 </td>
   <td style="text-align:right;"> 668.7 </td>
   <td style="text-align:right;"> 2.38% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Proportional </td>
   <td style="text-align:left;"> Poverty rate </td>
   <td style="text-align:right;"> 377 </td>
   <td style="text-align:right;"> 24.35% </td>
   <td style="text-align:right;"> 3.45% </td>
   <td style="text-align:right;"> 14.15% </td>
  </tr>
  <tr>
   <td style="text-align:left;border-bottom: 2px solid #000;"> Proportional </td>
   <td style="text-align:left;border-bottom: 2px solid #000;"> Total exp. (EUR) </td>
   <td style="text-align:right;border-bottom: 2px solid #000;"> 377 </td>
   <td style="text-align:right;border-bottom: 2px solid #000;"> 1,031,690,866 </td>
   <td style="text-align:right;border-bottom: 2px solid #000;"> 28,083,369 </td>
   <td style="text-align:right;border-bottom: 2px solid #000;"> 2.72% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Optimal </td>
   <td style="text-align:left;"> Mean income (EUR) </td>
   <td style="text-align:right;"> 377 </td>
   <td style="text-align:right;"> 27,794.9 </td>
   <td style="text-align:right;"> 1,469.8 </td>
   <td style="text-align:right;"> 5.29% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Optimal </td>
   <td style="text-align:left;"> Poverty rate </td>
   <td style="text-align:right;"> 377 </td>
   <td style="text-align:right;"> 28.79% </td>
   <td style="text-align:right;"> 2.95% </td>
   <td style="text-align:right;"> 10.23% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Optimal </td>
   <td style="text-align:left;"> Total exp. (EUR) </td>
   <td style="text-align:right;"> 377 </td>
   <td style="text-align:right;"> 1,021,598,357 </td>
   <td style="text-align:right;"> 35,810,491 </td>
   <td style="text-align:right;"> 3.51% </td>
  </tr>
</tbody>
</table>

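.small[D: absolute precision · r: relative precision · Targets exceeded for Proportional poverty (D = 3.45 pp), Optimal income (r = 5.29%) and Optimal expenditure (r = 3.51%); the other three cells meet their targets ✓]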
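
---

# Checking Precision Against Targets

Relative precision is the absolute precision divided by the estimate, `$r = D / \hat{\theta}$`. As a rough check, the sketch below recomputes `$r$` from the estimates and `$D$` values printed in the Design 2 results table and compares each cell with the targets from the study design (3% for income and expenditure, 3 pp for poverty). The data frame `res` and its column names are illustrative, not objects from the project code.

``` r
# Estimates and absolute precision D copied from the results table
res <- data.frame(
  allocation = rep(c("Proportional", "Optimal"), each = 3),
  parameter  = rep(c("income", "poverty", "expenditure"), 2),
  estimate   = c(28145, 0.2435, 1031690866, 27794.9, 0.2879, 1021598357),
  D          = c(668.7, 0.0345, 28083369, 1469.8, 0.0295, 35810491)
)

# Relative precision r = D / estimate
res$r <- res$D / res$estimate

# Targets: r <= 3% for income and expenditure, D <= 0.03 for poverty
res$met <- ifelse(res$parameter == "poverty", res$D <= 0.03, res$r <= 0.03)

res[, c("allocation", "parameter", "r", "met")]
```

The recomputed `r` values use rounded table entries, so they can differ from the printed column in the last digit.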

---
class: section-slide, middle

.label[05]
# **Design 3**
# Cluster PPS

---

# Design 3 — Cluster PPS

.pull-left[

### Structure

- **M = 599** PSUs (neighbourhoods)
- PSU sizes: 57–110 individuals
- Mean size `$\bar{N}_g = 83.5$` · CV = 10.9%

### Why PPS?

Cluster `$g$` selected with probability:

`$$\pi_g = m \times \frac{N_g}{N}$$`

Larger clusters selected more often → size correction absorbed into design → simple mean of cluster means is unbiased.

### Sample size formula

`$$m = \frac{Z^2 \sum_g \frac{N_g}{N}(\mu_g - \mu)^2}{d^2}$$`

`$$m = \max(12,\ 10,\ 13) = 13$$`

Binding constraint: **total expenditure**

]

.pull-right[

### Results

``` r
pps_s <- data.frame(
  Parameter = c("Mean income (EUR)",
                "Poverty rate",
                "Total exp. (EUR)"),
  n    = rep(fmt_num(nrow(samp_pps)), 3),
  Est  = c(fmt_num(est_pps_inc,1),
           fmt_pct(est_pps_pov),
           fmt_num(est_pps_exp)),
  D    = c(fmt_num(abs_pps_inc,1),
           fmt_pct(abs_pps_pov),
           fmt_num(abs_pps_exp)),
  r    = c(fmt_rel(rel_pps_inc),
           fmt_rel(rel_pps_pov),
           fmt_rel(rel_pps_exp))
)
kable(pps_s, escape=TRUE, booktabs=TRUE,
      col.names=c("Parameter","n","Estimate","D","r (%)"),
      align=c("l","r","r","r","r")) |>
  kable_styling(font_size=12, full_width=TRUE) |>
  row_spec(0, background="#000000", color="white")
```

<br>

.box-gray[
**m = 13** clusters · **n = 1,111** individuals  
Variance estimator: **12 degrees of freedom**  
A single atypical cluster can destabilise the estimate.
]

]
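
---

# PPS Selection Sketch in R

The selection rule and the formula for `$m$` can be illustrated on a simulated frame. This is a sketch under assumptions the deck does not confirm: clusters are drawn with replacement, the PSU sizes and cluster means are simulated (only M = 599 and the 57 to 110 size range come from the slides), and all object names are hypothetical.

``` r
set.seed(2025)

# Hypothetical PSU frame: sizes in the 57-110 range, made-up cluster means
M    <- 599
N_g  <- sample(57:110, M, replace = TRUE)
mu_g <- rnorm(M, mean = 27000, sd = 1500)

N  <- sum(N_g)
mu <- sum(N_g * mu_g) / N   # population mean implied by the frame
d  <- 0.03 * mu             # 3% precision target

# Number of clusters from the size-weighted between-cluster variance
between <- sum(N_g / N * (mu_g - mu)^2)
m <- ceiling(1.96^2 * between / d^2)

# Expected number of selections per cluster: m * N_g / N (sums to m)
pi_g <- m * N_g / N

# Draw m clusters with probability proportional to size, with replacement
sel <- sample(M, m, replace = TRUE, prob = N_g / N)

# Simple mean of the sampled cluster means and its precision (m - 1 df)
est <- mean(mu_g[sel])
se  <- sd(mu_g[sel]) / sqrt(m)
c(m = m, estimate = est, D = 1.96 * se)
```

With so few clusters the standard error rests on `m - 1` degrees of freedom, so rerunning the draw with a different seed can move `D` noticeably.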

---
class: section-slide, middle

.label[06]
# **Comparison**

---

# All Designs — Summary

``` r
concl <- data.frame(
  Design = c("SRS",
             "Stratified — Proportional",
             "Stratified — Optimal",
             "Cluster PPS"),
  n = c(fmt_num(n_srs), fmt_num(n_prop),
        fmt_num(n_opt), fmt_num(nrow(samp_pps))),
  r_inc = c(fmt_rel(rel_income),    fmt_rel(p_inc$rel_prec),
            fmt_rel(o_inc$rel_prec), fmt_rel(rel_pps_inc)),
  d_pov = c(fmt_pct(abs_pov),       fmt_pct(p_pov$abs_prec),
            fmt_pct(o_pov$abs_prec), fmt_pct(abs_pps_pov)),
  r_exp = c(fmt_rel(rel_exp),       fmt_rel(p_exp$rel_prec),
            fmt_rel(o_exp$rel_prec), fmt_rel(rel_pps_exp))
)
kable(concl, escape=TRUE, booktabs=TRUE,
      col.names=c("Design","n","r income","D poverty","r exp."),
      align=c("l","r","r","r","r")) |>
  kable_styling(font_size=13, full_width=TRUE) |>
  row_spec(0, background="#000000", color="white") |>
  row_spec(3, bold=TRUE, background="#f0f0f0")
```

<table class="table" style="font-size: 13px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Design </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> n </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> r income </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> D poverty </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> r exp. </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> SRS </td>
   <td style="text-align:right;"> 1,128 </td>
   <td style="text-align:right;"> 2.87% </td>
   <td style="text-align:right;"> 2.52% </td>
   <td style="text-align:right;"> 3.01% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Stratified — Proportional </td>
   <td style="text-align:right;"> 538 </td>
   <td style="text-align:right;"> 2.38% </td>
   <td style="text-align:right;"> 3.45% </td>
   <td style="text-align:right;"> 2.72% </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;background-color: rgba(240, 240, 240, 255) !important;"> Stratified — Optimal </td>
   <td style="text-align:right;font-weight: bold;background-color: rgba(240, 240, 240, 255) !important;"> 377 </td>
   <td style="text-align:right;font-weight: bold;background-color: rgba(240, 240, 240, 255) !important;"> 5.29% </td>
   <td style="text-align:right;font-weight: bold;background-color: rgba(240, 240, 240, 255) !important;"> 2.95% </td>
   <td style="text-align:right;font-weight: bold;background-color: rgba(240, 240, 240, 255) !important;"> 3.51% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Cluster PPS </td>
   <td style="text-align:right;"> 1,111 </td>
   <td style="text-align:right;"> 2.24% </td>
   <td style="text-align:right;"> 2.09% </td>
   <td style="text-align:right;"> 2.33% </td>
  </tr>
</tbody>
</table>

<br>

### Design effect — Stratified Proportional vs SRS

`$$DEFF = \frac{S^2_{\text{intra}}}{S^2}$$`

| Parameter | `$S^2_{\text{intra}}$` | `$S^2$` | DEFF |
|-----------|---------------------|-------|------|
| Income | 48,041,159 | 187,739,928 | **0.256** |
| Poverty | 0.127336 | 0.195472 | **0.651** |
| Expenditure | 34,845,369 | 108,681,847 | **0.321** |

.small[DEFF < 1: stratification more efficient than SRS at the same n · Exact population-level ratio, no sampling randomness involved]
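
---

# Design Effect in R

The DEFF column is a plain ratio of two population variances, so it can be reproduced directly from the figures in the table. The vector names below are illustrative; the numbers are copied from the slide.

``` r
# Population variances copied from the design-effect table
S2_intra <- c(income = 48041159,  poverty = 0.127336, expenditure = 34845369)
S2       <- c(income = 187739928, poverty = 0.195472, expenditure = 108681847)

# Design effect of proportional stratification relative to SRS
deff <- S2_intra / S2
round(deff, 3)

# Variance reduction relative to SRS, in percent
reduction <- round(100 * (1 - deff), 1)
reduction
```

The income entry of `reduction` corresponds to the 74.4% reduction quoted when education was introduced as the stratification variable.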

---
class: section-slide, middle

.label[07]
# **Conclusions**

---

# Main Findings

.pull-left[

### 1 · Sample size is the efficiency metric

All designs target the same precision.  
Variance comparisons across different `$n$` are not meaningful.  
**Smaller sample = lower cost.**

### 2 · Stratification dominates

| Design | n | Saving vs SRS |
|--------|---|--------------|
| SRS | 1128 | — |
| Proportional | 538 | 52.3% |
| **Optimal** | **377** | **66.6%** |

Education is a highly effective stratification variable.

### 3 · Poverty is a two-stratum problem

Bachelor and Master+ together account for 38% of the population but contribute negligible poverty variance.  
Estimation uncertainty is concentrated in **Primary and Secondary**.

]

.pull-right[

### 4 · Master+ structural finding

Zero poverty incidence in Master+ (`$S_{h,\text{pov}} = 0$`) makes poverty-driven Neyman allocation degenerate, so expenditure is used as the reference variable instead.

Master+ also has the **highest expenditure variability** (`$S_h = 7349.1$`) with the **smallest sample** — producing the weakest stratum-level precision, as expected under a top-down approach.

### 5 · Cluster PPS — operational vs statistical

.box-black[
**Statistical:** `$m = 13$` clusters, 12 df — instability risk from a single atypical PSU.

**Operational:** only 13 of 599 locations visited — substantial fieldwork cost reduction.
]

The choice depends on survey budget and objectives.

]
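
---

# Sample Size Savings in R

The savings in finding 2 are one minus the ratio of each design's sample size to the SRS size. A short check, using the three sizes from the table (the vector name `n_design` is illustrative):

``` r
# Sample sizes from the findings table
n_design <- c(SRS = 1128, Proportional = 538, Optimal = 377)

# Saving relative to SRS, in percent
saving <- round(100 * (1 - n_design / n_design[["SRS"]]), 1)
saving
```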

---
class: section-slide, middle, center

# Thank you

.muted[
NOVA IMS · Master in Statistics · 2025–2026  
Professor Pedro Simões Coelho

*Press **O** for tile view · **F** for fullscreen*
]