class: center, middle, inverse, title-slide

.title[
# Sampling Methods
]
.subtitle[
## Unit Project · NOVA IMS · 2025–2026
]
.author[
### Pedro Fernandes · Ricardo Branco
]
.date[
### Professor Pedro Simões Coelho
]

---
class: cover-slide, middle

# Sampling Methods
## Three Designs Applied to a Population of N = 50,000

<br>

**Pedro Fernandes · Ricardo Branco**

.muted[NOVA IMS · Master in Statistics · 2025–2026
Professor Pedro Simões Coelho]

---
class: section-slide, middle

.label[01]
# **Motivation**

---
# Motivation

.pull-left[
### The problem
Surveying an entire population is rarely feasible. Sampling allows us to estimate population parameters from a fraction of the data, but **the design matters**.

### Three fundamental questions
- How many observations do we need?
- How do we select them?
- How do we compare competing designs?

### Why it matters
Sample size is directly associated with **cost**. A more efficient design achieves the same precision with fewer observations, reducing both financial and operational burden.
]

.pull-right[
### This project
We apply and compare **three sampling designs** to a finite population of N = 50,000 individuals with known socioeconomic characteristics.

Because the full population is available, true parameter values are known, allowing a rigorous, controlled comparison.

> The goal is not just to estimate. It is to understand **which design does it best**, and at what cost.
]

---
class: section-slide, middle

.label[02]
# **Study Design**

---
# Data, Parameters and Precision Targets

.col-left[
### Population
- **N = 50,000** individuals
- 13 variables per person
- Treated as the complete target population

### Three parameters of interest

| Parameter | Estimator |
|-----------|-----------|
| Mean income | `\(\hat{\mu}\)` |
| Poverty rate | `\(\hat{p}\)` |
| Total expenditure | `\(\hat{\tau}\)` |

### Common framework
- Confidence: **95%** (`\(Z = 1.96\)`)
- All designs evaluated on equal footing
]

.col-right[
### Precision targets

``` r
# Precision target d for each parameter.
# fmt_num(), d_income and d_exp are assumed to be defined in an earlier chunk.
tgt <- data.frame(
  Parameter = c("Mean income", "Poverty rate", "Total expenditure"),
  Target    = c("$d = 3\\%$ of mean", "$d = 3$ pp", "$d = 3\\%$ of total"),
  Value     = c(paste0("±€", fmt_num(d_income)), "±0.03", paste0("±€", fmt_num(d_exp)))
)
kable(tgt, escape = FALSE, booktabs = TRUE,
      col.names = c("Parameter", "Target", "Value"),
      align = c("l","l","r")) |>
  kable_styling(font_size = 13, full_width = TRUE) |>
  row_spec(0, background = "#000000", color = "white")
```

<table class="table" style="font-size: 13px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Parameter </th>
   <th style="text-align:left;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Target </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Mean income </td>
   <td style="text-align:left;"> `\(d = 3\%\)` of mean </td>
   <td style="text-align:right;"> ±€830 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Poverty rate </td>
   <td style="text-align:left;"> `\(d = 3\)` pp </td>
   <td style="text-align:right;"> ±0.03 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Total expenditure </td>
   <td style="text-align:left;"> `\(d = 3\%\)` of total </td>
   <td style="text-align:right;"> ±€30,078,148
</td>
  </tr>
</tbody>
</table>

### Three designs compared
1. **Design 1** – Simple Random Sampling (SRS)
2. **Design 2** – Stratified Sampling (STRS)
3. **Design 3** – Cluster PPS
]

---
class: section-slide, middle

.label[03]
# **Design 1**
# Simple Random Sampling

---
# Design 1 – Simple Random Sampling

.pull-left[
### What it is
Every individual has the **same probability** of selection. No structure, no grouping. The natural baseline.

### Sample size formulas
`$$n_\mu = \frac{Z^2 S^2 N}{Z^2 S^2 + d_\mu^2 N}$$`
`$$n_p = \frac{Z^2 p(1-p) N}{Z^2 p(1-p) + d_p^2 N}$$`
`$$n_\tau = \frac{Z^2 S^2 N^2}{Z^2 S^2 N + d_\tau^2}$$`

Here `\(S^2\)` is the population variance of the variable concerned (income for `\(n_\mu\)`, expenditure for `\(n_\tau\)`).

### Three candidate sizes
`$$n = \max(n_\mu,\ n_p,\ n_\tau) = \max(1026,\ 821,\ 1128) = 1128$$`
Binding constraint: **total expenditure**
]

.pull-right[
### Results

``` r
# SRS estimates with absolute (D) and relative (r) precision.
# The est_*, abs_* and rel_* objects are assumed to come from an earlier chunk.
srs_s <- data.frame(
  Parameter = c("Mean income (EUR)", "Poverty rate", "Total exp. (EUR)"),
  Estimate  = c(fmt_num(est_income, 1), fmt_pct(est_pov), fmt_num(est_exp)),
  D         = c(fmt_num(abs_income, 1), fmt_pct(abs_pov), fmt_num(abs_exp)),
  r         = c(fmt_rel(rel_income), fmt_rel(rel_pov), fmt_rel(rel_exp))
)
kable(srs_s, escape = TRUE, booktabs = TRUE,
      col.names = c("Parameter","Estimate","D","r (%)"),
      align = c("l","r","r","r")) |>
  kable_styling(font_size = 12, full_width = TRUE) |>
  row_spec(0, background = "#000000", color = "white")
```

<table class="table" style="font-size: 12px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Parameter </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Estimate </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> D </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> r (%) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Mean income (EUR) </td>
   <td style="text-align:right;"> 27,944.6 </td>
   <td
style="text-align:right;"> 801.5 </td>
   <td style="text-align:right;"> 2.87% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Poverty rate </td>
   <td style="text-align:right;"> 25.71% </td>
   <td style="text-align:right;"> 2.52% </td>
   <td style="text-align:right;"> 9.81% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Total exp. (EUR) </td>
   <td style="text-align:right;"> 1,010,694,105 </td>
   <td style="text-align:right;"> 30,421,719 </td>
   <td style="text-align:right;"> 3.01% </td>
  </tr>
</tbody>
</table>

<br>

.box-black[
**n = 1128** · Sampling fraction f = 2.26%

Income and poverty targets met ✓ · total expenditure marginally above target in this sample (r = 3.01% vs 3%)
]
]

---
class: section-slide, middle

.label[04]
# **Design 2**
# Stratified Sampling

---
# Design 2 – Stratification Variable

.pull-left[
### Why education level?
Education is strongly correlated with **income** and **poverty**: it creates strata that are homogeneous within and heterogeneous between.

The within-stratum pooled variance for income:

`$$S^2_{\text{intra}} = 48,041,159 \ll S^2 = 187,739,928$$`

A reduction of **74.4%**, the source of the efficiency gain.
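The reduction quoted above can be checked directly from the two variances. The following is a minimal sketch rather than the deck's own chunk: `n_for_mean()` is a hypothetical helper, and `d = 830` is the rounded income target from the precision table, so the sizes it returns are approximate.

```r
# Population variances of income quoted on this slide
S2_total <- 187739928   # overall variance S^2
S2_intra <- 48041159    # pooled within-stratum variance S^2_intra

# Relative variance reduction from stratifying on education
reduction <- 1 - S2_intra / S2_total
round(100 * reduction, 1)   # 74.4

# Hypothetical helper: finite-population sample size for a mean,
# n = Z^2 S^2 N / (Z^2 S^2 + d^2 N), as in the Design 1 formula
n_for_mean <- function(S2, d, N = 50000, Z = 1.96) {
  ceiling(Z^2 * S2 * N / (Z^2 * S2 + d^2 * N))
}

# Assumption: d = 830 is the rounded income target (3% of the mean)
n_srs_income  <- n_for_mean(S2_total, d = 830)  # SRS requirement for income
n_prop_income <- n_for_mean(S2_intra, d = 830)  # same formula with S^2_intra
c(SRS = n_srs_income, Proportional = n_prop_income)
```

Substituting `\(S^2_{\text{intra}}\)` for `\(S^2\)` is the usual approximation for proportional allocation, so the gap between the two sizes is the efficiency gain expressed in observations.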
### Four strata
]

.pull-right[

``` r
# Stratum summary in a fixed display order.
# strata_info and fmt_num() are assumed to come from an earlier chunk.
strata_ord <- c("Primary", "Secondary", "Bachelor", "Master+")
idx <- match(strata_ord, strata_info$Education)  # row of each stratum in strata_info
st <- data.frame(
  Stratum = strata_ord,
  N_h     = fmt_num(strata_info$N_h[idx]),
  W_h     = paste0(round(strata_info$W_h[idx] * 100, 1), "%"),
  pov     = paste0(round(strata_info$p_h_pov[idx] * 100, 1), "%"),
  S_exp   = fmt_num(strata_info$S_h_exp[idx], 1)
)
kable(st, escape = FALSE, booktabs = TRUE,
      col.names = c("Stratum", "$N_h$", "$W_h$", "Poverty", "$S_h$ exp."),
      align = c("l", "r", "r", "r", "r")) |>
  kable_styling(font_size = 12, full_width = TRUE) |>
  row_spec(0, background = "#000000", color = "white") |>
  row_spec(4, bold = TRUE)
```

<table class="table" style="font-size: 12px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Stratum </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> `\(N_h\)` </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> `\(W_h\)` </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Poverty </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> `\(S_h\)` exp.
</th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Primary </td>
   <td style="text-align:right;"> 10,177 </td>
   <td style="text-align:right;"> 20.4% </td>
   <td style="text-align:right;"> 71.9% </td>
   <td style="text-align:right;"> 5,050.6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Secondary </td>
   <td style="text-align:right;"> 20,814 </td>
   <td style="text-align:right;"> 41.6% </td>
   <td style="text-align:right;"> 28.5% </td>
   <td style="text-align:right;"> 5,570.1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bachelor </td>
   <td style="text-align:right;"> 13,021 </td>
   <td style="text-align:right;"> 26% </td>
   <td style="text-align:right;"> 0.5% </td>
   <td style="text-align:right;"> 6,279.7 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> Master+ </td>
   <td style="text-align:right;font-weight: bold;"> 5,988 </td>
   <td style="text-align:right;font-weight: bold;"> 12% </td>
   <td style="text-align:right;font-weight: bold;"> 0% </td>
   <td style="text-align:right;font-weight: bold;"> 7,349.1 </td>
  </tr>
</tbody>
</table>

.box-gray[
**Master+:** `\(p_h = 0\)` → `\(S_{h,\text{pov}} = 0\)`

Zero poverty variance is structural, not a sampling artefact.
]
]

---
# Design 2 – Allocation Methods

.pull-left[
### Proportional allocation, n = 538
`$$n_h = \frac{N_h}{N} \cdot n$$`
Sample mirrors population structure. Driven by **poverty** (`\(n_{\text{prop,pov}} = 538\)`).

---
### Optimal (Neyman), n = 377
`$$n_h = \frac{N_h S_h}{\sum_j N_j S_j} \cdot n$$`
**Reference variable: total expenditure**, the second-largest requirement, which ensures all strata are represented. Master+ would receive zero units under poverty-driven allocation, because `\(S_{h,\text{pov}} = 0\)` there.

---
### Compromise, n = 377
Average of income and expenditure allocation shapes. Both anchored to `\(n = 377\)`.
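The proportional and Neyman formulas on this slide can be sketched as follows. This is an illustration only: the stratum sizes and expenditure standard deviations are copied from the strata table, `allocate()` is a hypothetical helper, and plain rounding is assumed. The deck's own optimal column may rest on further steps that are not shown here (for instance the compromise averaging), so the Neyman figures from this sketch need not reproduce the allocation table.

```r
# Stratum sizes and expenditure standard deviations from the strata table
N_h <- c(Primary = 10177, Secondary = 20814, Bachelor = 13021, `Master+` = 5988)
S_h <- c(Primary = 5050.6, Secondary = 5570.1, Bachelor = 6279.7, `Master+` = 7349.1)
n   <- 377

# Hypothetical helper: split n across strata in proportion to `weights`
allocate <- function(weights, n) round(n * weights / sum(weights))

n_prop   <- allocate(N_h, n)         # proportional: n_h = n * N_h / N
n_neyman <- allocate(N_h * S_h, n)   # Neyman: n_h = n * N_h S_h / sum(N_j S_j)

rbind(Proportional = n_prop, Neyman = n_neyman)
```

Relative to proportional allocation, the Neyman rule moves units towards strata with larger `\(S_h\)`, which is also why a stratum with `\(S_h = 0\)` for the reference variable would receive none. Independent rounding does not guarantee that the pieces sum to `\(n\)` in general, although they do for these inputs.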
]

.pull-right[

``` r
# Per-stratum sample sizes under proportional and optimal allocation, plus a total row.
# alloc_prop and alloc_opt are assumed to be named vectors from an earlier chunk.
strata_ord <- c("Primary","Secondary","Bachelor","Master+")
al <- data.frame(
  Stratum = strata_ord,
  N_h     = fmt_num(strata_info$N_h[match(strata_ord, strata_info$Education)]),
  n_prop  = as.integer(alloc_prop[strata_ord]),
  n_opt   = as.integer(alloc_opt[strata_ord])
)
al <- rbind(al, data.frame(
  Stratum = "Total",
  N_h     = fmt_num(N),
  n_prop  = sum(alloc_prop),
  n_opt   = sum(alloc_opt)
))
kable(al, escape = FALSE, booktabs = TRUE,
      col.names = c("Stratum", "$N_h$", "$n_h$ Prop.", "$n_h$ Opt."),
      align = c("l", "r", "r", "r")) |>
  kable_styling(font_size = 12, full_width = TRUE) |>
  row_spec(0, background = "#000000", color = "white") |>
  row_spec(5, bold = TRUE)
```

<table class="table" style="font-size: 12px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Stratum </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> `\(N_h\)` </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> `\(n_h\)` Prop. </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> `\(n_h\)` Opt.
</th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Primary </td>
   <td style="text-align:right;"> 10,177 </td>
   <td style="text-align:right;"> 77 </td>
   <td style="text-align:right;"> 74 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Secondary </td>
   <td style="text-align:right;"> 20,814 </td>
   <td style="text-align:right;"> 157 </td>
   <td style="text-align:right;"> 159 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bachelor </td>
   <td style="text-align:right;"> 13,021 </td>
   <td style="text-align:right;"> 98 </td>
   <td style="text-align:right;"> 99 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Master+ </td>
   <td style="text-align:right;"> 5,988 </td>
   <td style="text-align:right;"> 45 </td>
   <td style="text-align:right;"> 45 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> Total </td>
   <td style="text-align:right;font-weight: bold;"> 50,000 </td>
   <td style="text-align:right;font-weight: bold;"> 377 </td>
   <td style="text-align:right;font-weight: bold;"> 377 </td>
  </tr>
</tbody>
</table>

<br>

``` r
# Bar chart of the required sample size per design (n_srs, n_prop, n_opt from earlier chunks)
data.frame(
  Design = factor(c("SRS","Proportional","Optimal"),
                  levels=c("SRS","Proportional","Optimal")),
  n = c(n_srs, n_prop, n_opt)
) |>
  ggplot(aes(x=Design, y=n, fill=Design)) +
  geom_col(width=0.55, show.legend=FALSE) +
  geom_text(aes(label=n), vjust=-0.4, fontface="bold", size=4) +
  scale_fill_manual(values=c("#aaaaaa","#555555","#000000")) +
  scale_y_continuous(limits=c(0,1350), expand=c(0,0)) +
  labs(x=NULL, y="n") +
  theme_minimal(base_size=12) +
  theme(panel.grid.major.x=element_blank(), panel.grid.minor=element_blank())
```

<img src="presentation_files/figure-html/alloc-bar-1.png" width="360" style="display: block; margin: auto;" />
]

---
# Design 2 – Results

``` r
# Stratified estimates under both allocations; p_* and o_* are assumed to be
# lists with est, abs_prec and rel_prec computed in an earlier chunk.
sr <- data.frame(
  Allocation = rep(c("Proportional","Optimal"), each=3),
  Parameter  = rep(c("Mean income (EUR)","Poverty rate","Total exp. (EUR)"),2),
  n          = c(rep(fmt_num(nrow(samp_prop)),3), rep(fmt_num(nrow(samp_opt)),3)),
  Estimate   = c(fmt_num(p_inc$est,1), fmt_pct(p_pov$est), fmt_num(p_exp$est),
                 fmt_num(o_inc$est,1), fmt_pct(o_pov$est), fmt_num(o_exp$est)),
  D          = c(fmt_num(p_inc$abs_prec,1), fmt_pct(p_pov$abs_prec), fmt_num(p_exp$abs_prec),
                 fmt_num(o_inc$abs_prec,1), fmt_pct(o_pov$abs_prec), fmt_num(o_exp$abs_prec)),
  r          = c(fmt_rel(p_inc$rel_prec), fmt_rel(p_pov$rel_prec), fmt_rel(p_exp$rel_prec),
                 fmt_rel(o_inc$rel_prec), fmt_rel(o_pov$rel_prec), fmt_rel(o_exp$rel_prec))
)
kable(sr, escape=TRUE, booktabs=TRUE,
      col.names=c("Allocation","Parameter","n","Estimate","D","r (%)"),
      align=c("l","l","r","r","r","r")) |>
  kable_styling(font_size=12, full_width=TRUE, latex_options="scale_down") |>
  row_spec(0, background="#000000", color="white") |>
  row_spec(3, extra_css="border-bottom: 2px solid #000;")
```

<table class="table" style="font-size: 12px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Allocation </th>
   <th style="text-align:left;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Parameter </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> n </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Estimate </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> D </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> r (%) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Proportional </td>
   <td style="text-align:left;"> Mean income (EUR) </td>
   <td style="text-align:right;"> 377 </td>
   <td style="text-align:right;"> 28,145 </td>
   <td style="text-align:right;"> 668.7 </td>
   <td style="text-align:right;"> 2.38% </td>
  </tr>
  <tr>
   <td
style="text-align:left;"> Proportional </td>
   <td style="text-align:left;"> Poverty rate </td>
   <td style="text-align:right;"> 377 </td>
   <td style="text-align:right;"> 24.35% </td>
   <td style="text-align:right;"> 3.45% </td>
   <td style="text-align:right;"> 14.15% </td>
  </tr>
  <tr>
   <td style="text-align:left;border-bottom: 2px solid #000;"> Proportional </td>
   <td style="text-align:left;border-bottom: 2px solid #000;"> Total exp. (EUR) </td>
   <td style="text-align:right;border-bottom: 2px solid #000;"> 377 </td>
   <td style="text-align:right;border-bottom: 2px solid #000;"> 1,031,690,866 </td>
   <td style="text-align:right;border-bottom: 2px solid #000;"> 28,083,369 </td>
   <td style="text-align:right;border-bottom: 2px solid #000;"> 2.72% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Optimal </td>
   <td style="text-align:left;"> Mean income (EUR) </td>
   <td style="text-align:right;"> 377 </td>
   <td style="text-align:right;"> 27,794.9 </td>
   <td style="text-align:right;"> 1,469.8 </td>
   <td style="text-align:right;"> 5.29% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Optimal </td>
   <td style="text-align:left;"> Poverty rate </td>
   <td style="text-align:right;"> 377 </td>
   <td style="text-align:right;"> 28.79% </td>
   <td style="text-align:right;"> 2.95% </td>
   <td style="text-align:right;"> 10.23% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Optimal </td>
   <td style="text-align:left;"> Total exp. (EUR) </td>
   <td style="text-align:right;"> 377 </td>
   <td style="text-align:right;"> 1,021,598,357 </td>
   <td style="text-align:right;"> 35,810,491 </td>
   <td style="text-align:right;"> 3.51% </td>
  </tr>
</tbody>
</table>

.small[D: absolute precision · r: relative precision · Targets missed in this sample: poverty under proportional allocation (D = 3.45 pp); income (r = 5.29%) and total expenditure (r = 3.51%) under optimal allocation]

---
class: section-slide, middle

.label[05]
# **Design 3**
# Cluster PPS

---
# Design 3 – Cluster PPS

.pull-left[
### Structure
- **M = 599** PSUs (neighbourhoods)
- PSU sizes: 57–110 individuals
- Mean size `\(\bar{N}_g = 83.5\)` · CV = 10.9%

### Why PPS?
Cluster `\(g\)` is selected with probability proportional to its size:

`$$\pi_g = m \times \frac{N_g}{N}$$`

Larger clusters are selected more often, so the size correction is absorbed into the design and the simple mean of the sampled cluster means is unbiased.

### Sample size formula
`$$m = \frac{Z^2 \sum_g \frac{N_g}{N}(\mu_g - \mu)^2}{d^2}$$`

where `\(\mu_g\)` is the mean of cluster `\(g\)`.

`$$m = \max(12,\ 10,\ 13) = 13$$`

Binding constraint: **total expenditure**
]

.pull-right[
### Results

``` r
# Cluster PPS estimates; samp_pps and the *_pps_* objects are assumed to come from an earlier chunk
pps_s <- data.frame(
  Parameter = c("Mean income (EUR)", "Poverty rate", "Total exp. (EUR)"),
  n   = rep(fmt_num(nrow(samp_pps)), 3),
  Est = c(fmt_num(est_pps_inc,1), fmt_pct(est_pps_pov), fmt_num(est_pps_exp)),
  D   = c(fmt_num(abs_pps_inc,1), fmt_pct(abs_pps_pov), fmt_num(abs_pps_exp)),
  r   = c(fmt_rel(rel_pps_inc), fmt_rel(rel_pps_pov), fmt_rel(rel_pps_exp))
)
kable(pps_s, escape=TRUE, booktabs=TRUE,
      col.names=c("Parameter","n","Estimate","D","r (%)"),
      align=c("l","r","r","r","r")) |>
  kable_styling(font_size=12, full_width=TRUE) |>
  row_spec(0, background="#000000", color="white")
```

<table class="table" style="font-size: 12px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Parameter </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> n </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Estimate </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> D </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> r (%) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Mean income (EUR) </td>
   <td style="text-align:right;"> 1,111 </td>
   <td style="text-align:right;"> 27,404.8 </td>
   <td style="text-align:right;"> 613.3 </td>
   <td style="text-align:right;"> 2.24% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Poverty rate </td>
   <td style="text-align:right;">
1,111 </td>
   <td style="text-align:right;"> 28.41% </td>
   <td style="text-align:right;"> 2.09% </td>
   <td style="text-align:right;"> 7.35% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Total exp. (EUR) </td>
   <td style="text-align:right;"> 1,111 </td>
   <td style="text-align:right;"> 998,465,835 </td>
   <td style="text-align:right;"> 23,244,700 </td>
   <td style="text-align:right;"> 2.33% </td>
  </tr>
</tbody>
</table>

<br>

.box-gray[
**m = 13** clusters · **n = 1,111** individuals

Variance estimator: **12 degrees of freedom**

A single atypical cluster can destabilise the estimate.
]
]

---
class: section-slide, middle

.label[06]
# **Comparison**

---
# All Designs – Summary

``` r
# One row per design: sample size and achieved precision for each parameter
concl <- data.frame(
  Design = c("SRS", "Stratified – Proportional", "Stratified – Optimal", "Cluster PPS"),
  n      = c(fmt_num(n_srs), fmt_num(n_prop), fmt_num(n_opt), fmt_num(nrow(samp_pps))),
  r_inc  = c(fmt_rel(rel_income), fmt_rel(p_inc$rel_prec),
             fmt_rel(o_inc$rel_prec), fmt_rel(rel_pps_inc)),
  d_pov  = c(fmt_pct(abs_pov), fmt_pct(p_pov$abs_prec),
             fmt_pct(o_pov$abs_prec), fmt_pct(abs_pps_pov)),
  r_exp  = c(fmt_rel(rel_exp), fmt_rel(p_exp$rel_prec),
             fmt_rel(o_exp$rel_prec), fmt_rel(rel_pps_exp))
)
kable(concl, escape=TRUE, booktabs=TRUE,
      col.names=c("Design","n","r income","D poverty","r exp."),
      align=c("l","r","r","r","r")) |>
  kable_styling(font_size=13, full_width=TRUE) |>
  row_spec(0, background="#000000", color="white") |>
  row_spec(3, bold=TRUE, background="#f0f0f0")
```

<table class="table" style="font-size: 13px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> Design </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> n </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> r income </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255)
!important;"> D poverty </th>
   <th style="text-align:right;color: white !important;background-color: rgba(0, 0, 0, 255) !important;"> r exp. </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> SRS </td>
   <td style="text-align:right;"> 1,128 </td>
   <td style="text-align:right;"> 2.87% </td>
   <td style="text-align:right;"> 2.52% </td>
   <td style="text-align:right;"> 3.01% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Stratified – Proportional </td>
   <td style="text-align:right;"> 538 </td>
   <td style="text-align:right;"> 2.38% </td>
   <td style="text-align:right;"> 3.45% </td>
   <td style="text-align:right;"> 2.72% </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;background-color: rgba(240, 240, 240, 255) !important;"> Stratified – Optimal </td>
   <td style="text-align:right;font-weight: bold;background-color: rgba(240, 240, 240, 255) !important;"> 377 </td>
   <td style="text-align:right;font-weight: bold;background-color: rgba(240, 240, 240, 255) !important;"> 5.29% </td>
   <td style="text-align:right;font-weight: bold;background-color: rgba(240, 240, 240, 255) !important;"> 2.95% </td>
   <td style="text-align:right;font-weight: bold;background-color: rgba(240, 240, 240, 255) !important;"> 3.51% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Cluster PPS </td>
   <td style="text-align:right;"> 1,111 </td>
   <td style="text-align:right;"> 2.24% </td>
   <td style="text-align:right;"> 2.09% </td>
   <td style="text-align:right;"> 2.33% </td>
  </tr>
</tbody>
</table>

<br>

### Design effect – Stratified Proportional vs SRS

`$$DEFF = \frac{S^2_{\text{intra}}}{S^2}$$`

| Parameter | `\(S^2_{\text{intra}}\)` | `\(S^2\)` | DEFF |
|-----------|---------------------|-------|------|
| Income | 48,041,159 | 187,739,928 | **0.256** |
| Poverty | 0.127336 | 0.195472 | **0.651** |
| Expenditure | 34,845,369 | 108,681,847 | **0.321** |

.small[DEFF < 1: stratification more efficient than SRS at the same n · Exact population-level ratio, no sampling randomness involved]

---
class: section-slide, middle

.label[07]
# **Conclusions**

---
# Main Findings

.pull-left[
### 1 · Sample size is the efficiency metric
All designs target the same precision. Variance comparisons across different `\(n\)` are not meaningful. **Smaller sample = lower cost.**

### 2 · Stratification dominates

| Design | n | Saving vs SRS |
|--------|---|--------------|
| SRS | 1128 | – |
| Proportional | 538 | 52.3% |
| **Optimal** | **377** | **66.6%** |

Education is a highly effective stratification variable.

### 3 · Poverty is a two-stratum problem
Bachelor and Master+ together account for 38% of the population but contribute negligible poverty variance. Estimation uncertainty is concentrated in **Primary and Secondary**.
]

.pull-right[
### 4 · Master+ structural finding
Zero poverty incidence in Master+ (`\(S_{h,\text{pov}} = 0\)`) makes poverty-driven Neyman allocation degenerate. Expenditure is used as the reference variable instead.

Master+ also has the **highest expenditure variability** (`\(S_h = 7349.1\)`) with the **smallest sample**, producing the weakest stratum-level precision, as expected under a top-down approach.

### 5 · Cluster PPS – operational vs statistical

.box-black[
**Statistical:** `\(m = 13\)` clusters, 12 df → instability risk from a single atypical PSU.

**Operational:** only 13 of 599 locations visited → substantial fieldwork cost reduction.
]

The choice depends on survey budget and objectives.
]

---
class: section-slide, middle, center

# Thank you

.muted[
NOVA IMS · Master in Statistics · 2025–2026
Professor Pedro Simões Coelho

*Press **O** for tile view · **F** for fullscreen*
]