Definition: A stratified random sample is obtained by separating the population units into non-overlapping groups, called strata, and then selecting a random sample from each stratum
Three components:
Note that the total population size can be written as
\[ N = N_1 + N_2 + \cdots + N_H \]
The sampling frame must include a variable that indicates the stratum assignment (here, region) for each county
agpop <- read.table("agpop.dat", quote="\"", comment.char="")
head(agpop[c("V1", "V2", "V15")], n=10)
## V1 V2 V15
## 1 ALEUTIAN_ISLANDS_AREA AK W
## 2 ANCHORAGE_AREA AK W
## 3 FAIRBANKS_AREA AK W
## 4 JUNEAU_AREA AK W
## 5 KENAI_PENINSULA_AREA AK W
## 6 AUTAUGA_COUNTY AL S
## 7 BALDWIN_COUNTY AL S
## 8 BARBOUR_COUNTY AL S
## 9 BIBB_COUNTY AL S
## 10 BLOUNT_COUNTY AL S
agpop1 <- subset(agpop, V15 == "NE")
agpop1$StratID <- 1
agpop1$WithinID <- seq_len(nrow(agpop1))
head(agpop1[c("V1", "V2", "StratID", "WithinID")], n=10)
## V1 V2 StratID WithinID
## 284 FAIRFIELD_COUNTY CT 1 1
## 285 HARTFORD_COUNTY CT 1 2
## 286 LITCHFIELD_COUNTY CT 1 3
## 287 MIDDLESEX_COUNTY CT 1 4
## 288 NEW_HAVEN_COUNTY CT 1 5
## 289 NEW_LONDON_COUNTY CT 1 6
## 290 TOLLAND_COUNTY CT 1 7
## 291 WINDHAM_COUNTY CT 1 8
## 292 KENT_COUNTY DE 1 9
## 293 NEW_CASTLE_COUNTY DE 1 10
agpop2 <- subset(agpop, V15 == "NC")
agpop2$StratID <- 2
agpop2$WithinID <- seq_len(nrow(agpop2))
head(agpop2[c("V1", "V2", "StratID", "WithinID")], n=10)
## V1 V2 StratID WithinID
## 525 ADAIR_COUNTY IA 2 1
## 526 ADAMS_COUNTY IA 2 2
## 527 ALLAMAKEE_COUNTY IA 2 3
## 528 APPANOOSE_COUNTY IA 2 4
## 529 AUDUBON_COUNTY IA 2 5
## 530 BENTON_COUNTY IA 2 6
## 531 BLACK_HAWK_COUNTY IA 2 7
## 532 BOONE_COUNTY IA 2 8
## 533 BREMER_COUNTY IA 2 9
## 534 BUCHANAN_COUNTY IA 2 10
agpop3 <- subset(agpop, V15 == "S")
agpop3$StratID <- 3
agpop3$WithinID <- seq_len(nrow(agpop3))
head(agpop3[c("V1", "V2", "StratID", "WithinID")], n=10)
## V1 V2 StratID WithinID
## 6 AUTAUGA_COUNTY AL 3 1
## 7 BALDWIN_COUNTY AL 3 2
## 8 BARBOUR_COUNTY AL 3 3
## 9 BIBB_COUNTY AL 3 4
## 10 BLOUNT_COUNTY AL 3 5
## 11 BULLOCK_COUNTY AL 3 6
## 12 BUTLER_COUNTY AL 3 7
## 13 CALHOUN_COUNTY AL 3 8
## 14 CHAMBERS_COUNTY AL 3 9
## 15 CHEROKEE_COUNTY AL 3 10
agpop4 <- subset(agpop, V15 == "W")
agpop4$StratID <- 4
agpop4$WithinID <- seq_len(nrow(agpop4))
head(agpop4[c("V1", "V2", "StratID", "WithinID")], n=10)
## V1 V2 StratID WithinID
## 1 ALEUTIAN_ISLANDS_AREA AK 4 1
## 2 ANCHORAGE_AREA AK 4 2
## 3 FAIRBANKS_AREA AK 4 3
## 4 JUNEAU_AREA AK 4 4
## 5 KENAI_PENINSULA_AREA AK 4 5
## 148 APACHE_COUNTY AZ 4 6
## 149 COCHISE_COUNTY AZ 4 7
## 150 COCONINO_COUNTY AZ 4 8
## 151 GILA_COUNTY AZ 4 9
## 152 GRAHAM_COUNTY AZ 4 10
agpopc <- rbind(agpop1, agpop2, agpop3, agpop4)
head(agpopc[c("V1", "V2", "StratID", "WithinID")], n=5)
## V1 V2 StratID WithinID
## 284 FAIRFIELD_COUNTY CT 1 1
## 285 HARTFORD_COUNTY CT 1 2
## 286 LITCHFIELD_COUNTY CT 1 3
## 287 MIDDLESEX_COUNTY CT 1 4
## 288 NEW_HAVEN_COUNTY CT 1 5
tail(agpopc[c("V1", "V2", "StratID", "WithinID")], n=5)
## V1 V2 StratID WithinID
## 3074 SWEETWATER_COUNTY WY 4 418
## 3075 TETON_COUNTY WY 4 419
## 3076 UINTA_COUNTY WY 4 420
## 3077 WASHAKIE_COUNTY WY 4 421
## 3078 WESTON_COUNTY WY 4 422
Sample size of \(n=300\) from \(N=3078\) counties in the US
Sample allocation using proportional allocation
| Stratum | Stratum Size \((N_h)\) | Sample size \((n_h)\) |
|---|---|---|
| 1 (NE) | 220 | 21 |
| 2 (NC) | 1054 | 103 |
| 3 (S) | 1382 | 135 |
| 4 (W) | 422 | 41 |
| Total | 3078 | 300 |
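As a rough check, the allocation in the table can be reproduced by rounding \(n N_h / N\) for each stratum. The sketch below is only illustrative: the variable names are our own, and simple rounding is an assumption that happens to sum to 300 here but need not do so in general.

```r
# Stratum sizes taken from the table above (NE, NC, S, W)
N_h <- c(NE = 220, NC = 1054, S = 1382, W = 422)
N <- sum(N_h)   # total number of counties
n <- 300        # overall sample size

# Proportional allocation: n_h is proportional to N_h, rounded to an integer
n_h <- round(n * N_h / N)
n_h
sum(n_h)
```

When the rounded sizes do not add up to \(n\), one or two strata have to be adjusted by hand.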
Once the sample sizes \(n_h\) are known, we select an independent sample from each stratum
Let \(\mathcal{S}_h \subset \mathcal{U}_h\) be the set of units sampled from stratum \(h\), where \(\mathcal{U}_h\) is the set of population units in stratum \(h\)
The final sample is \[ \mathcal{S} = \mathcal{S}_1 \cup \cdots \cup \mathcal{S}_H \]
We will rely on the fact that each stratum sample is independent of the other stratum samples
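A minimal sketch of this selection step, assuming simple random sampling without replacement within each stratum. It draws the within-stratum IDs (the `WithinID` values) rather than whole records, so it runs without the data file. The seed is arbitrary and the object names are our own.

```r
# Stratum sizes and sample sizes from the allocation table
N_h <- c(NE = 220, NC = 1054, S = 1382, W = 422)
n_h <- c(NE = 21, NC = 103, S = 135, W = 41)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# One independent SRS (without replacement) of within-stratum IDs per stratum
samples <- Map(function(N, n) sort(sample.int(N, n)), N_h, n_h)

lengths(samples)   # number of units drawn in each stratum
```

Each element of `samples` plays the role of \(\mathcal{S}_h\), and the full sample \(\mathcal{S}\) is their union.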
For each stratum, estimate stratum total and its variance
Combine estimates across strata to estimate population total
Stratum population total for stratum \(h\): \[ T_h = \sum_{i=1}^{N_h} y_{hi} \]
Overall population total \[ T = \sum_{h=1}^H T_h = \sum_{h=1}^H \sum_{i=1}^{N_h} y_{hi} \]
Stratum population mean for stratum \(h\): \[ \bar{Y}_{hU} = \frac{1}{N_h} \sum_{i=1}^{N_h} y_{hi} \]
Overall population mean \[ \bar{Y}_U = \frac{1}{N} T = \sum_{h=1}^H \left( \frac{N_h}{N} \right)\bar{Y}_{hU} \]
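The identity for \(\bar{Y}_U\) can be checked numerically. The toy population below is invented purely for illustration (three strata with made-up values), and the object names are our own.

```r
# Hypothetical toy population with H = 3 strata (values invented for illustration)
y <- list(c(2, 4, 6), c(10, 20), c(1, 1, 1, 5))

N_h <- lengths(y)            # stratum sizes
N <- sum(N_h)                # population size
T_h <- sapply(y, sum)        # stratum totals
Ybar_h <- sapply(y, mean)    # stratum means

T_total <- sum(T_h)          # overall population total
mean_from_total <- T_total / N
mean_from_strata <- sum(N_h / N * Ybar_h)

c(mean_from_total, mean_from_strata)
```

Both entries agree, since weighting each stratum mean by \(N_h / N\) recovers the stratum total divided by \(N\).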
# Stratum 1 (NE): SRS without replacement, then estimate the stratum total and its variance
pN1 <- nrow(agpop1); n1 <- 21
sam1 <- sample(agpop1$V3, n1, replace = FALSE)
m1 <- mean(sam1); t1 <- pN1*m1
v1 <- pN1*pN1*(1-n1/pN1)*var(sam1)/n1
# Stratum 2 (NC)
pN2 <- nrow(agpop2); n2 <- 103
sam2 <- sample(agpop2$V3, n2, replace = FALSE)
m2 <- mean(sam2); t2 <- pN2*m2
v2 <- pN2*pN2*(1-n2/pN2)*var(sam2)/n2
# Stratum 3 (S)
pN3 <- nrow(agpop3); n3 <- 135
sam3 <- sample(agpop3$V3, n3, replace = FALSE)
m3 <- mean(sam3); t3 <- pN3*m3
v3 <- pN3*pN3*(1-n3/pN3)*var(sam3)/n3
# Stratum 4 (W)
pN4 <- nrow(agpop4); n4 <- 41
sam4 <- sample(agpop4$V3, n4, replace = FALSE)
m4 <- mean(sam4); t4 <- pN4*m4
v4 <- pN4*pN4*(1-n4/pN4)*var(sam4)/n4
# Combine across strata: the stratum samples are independent, so totals and variances add
pN <- pN1 + pN2 + pN3 + pN4
nn <- n1 + n2 + n3 + n4
te <- t1 + t2 + t3 + t4
ve <- v1 + v2 + v3 + v4
| Stratum | Stratum Size | Sample Size | Sample Mean | Total Estimate |
|---|---|---|---|---|
| 1 | 220 | 21 | 82886.14 | 18234951.43 |
| 2 | 1054 | 103 | 319901.83 | 337176534.04 |
| 3 | 1382 | 135 | 207082.41 | 286187887.04 |
| 4 | 422 | 41 | 626791.80 | 264506141.66 |
| Total | 3078 | 300 | | 906105514.16 |
Recall that \[ \hat{V} ( \hat{T}_h ) = \frac{N_h^2}{n_h} \left( 1- \frac{n_h}{N_h} \right) s_h^2 \]
From the above sample, we can obtain
| Stratum | Stratum Size | Sample Size | Total Estimate | Variance Estimate |
|---|---|---|---|---|
| 1 | 220 | 21 | 1.823495e+07 | 9.552438e+12 |
| 2 | 1054 | 103 | 3.371765e+08 | 6.705221e+14 |
| 3 | 1382 | 135 | 2.861879e+08 | 9.328065e+14 |
| 4 | 422 | 41 | 2.645061e+08 | 1.133078e+15 |
| Total | 3078 | 300 | 9.061055e+08 | 2.745959e+15 |
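The per-stratum computation behind this table can be wrapped in a small helper. This is a sketch, not the code used to produce the table: `est_stratum_total` is a hypothetical name, and the worked call uses the three stratum-1 observations (\(N_h = 6\), \(n_h = 3\)) from the small weights example later in this section, because the sampled county values themselves are not listed.

```r
# Hypothetical helper: estimated stratum total and its estimated variance
#   y   = sampled values from one stratum (SRS without replacement)
#   N_h = number of population units in that stratum
est_stratum_total <- function(y, N_h) {
  n_h <- length(y)
  t_hat <- N_h * mean(y)                             # N_h times the sample mean
  v_hat <- N_h^2 / n_h * (1 - n_h / N_h) * var(y)    # variance formula recalled above
  c(total = t_hat, variance = v_hat)
}

# Stratum 1 of the small weights example: y = 53, 107, 83 with N_h = 6
res <- est_stratum_total(c(53, 107, 83), N_h = 6)
res
```

Here the sample mean is 81 and \(s_h^2 = 732\), so the estimated total is 486 with estimated variance 4392. Summing these two quantities over strata gives the bottom row of a table like the one above.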
We can use a normal-based approximation to obtain 95% confidence interval for \(T\): \[ \left( {\hat{T}}_{st} -1.96 \sqrt{\hat{V} ( {\hat{T}}_{st})}, {\hat{T}}_{st} +1.96 \sqrt{\hat{V} ( {\hat{T}}_{st})} \right) \]
For this sample, the 95% CI is (8.033978e+08, 1.008813e+09), which contains the true population total, 9.439517e+08
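The interval can be recomputed directly from the reported totals. The sketch below plugs in the rounded values from the "Total" row of the variance table, so the result agrees with the reported interval only up to rounding. The names `se`, `ci` and `true_total` are our own.

```r
te <- 9.061055e+08   # estimated population total (from the table)
ve <- 2.745959e+15   # estimated variance of the total (from the table)

se <- sqrt(ve)                     # standard error of the estimated total
ci <- te + c(-1, 1) * 1.96 * se    # normal-based 95% confidence interval
ci

true_total <- 9.439517e+08         # true population total reported in the text
covered <- ci[1] < true_total && true_total < ci[2]
covered
```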
| Stratum | \(N_h\) | \(n_h\) | \(w_{hi}\) | \(y_{hi}\) |
|---|---|---|---|---|
| 1 | 6 | 3 | 2 | 53 |
| 1 | 6 | 3 | 2 | 107 |
| 1 | 6 | 3 | 2 | 83 |
| 2 | 2 | 2 | 1 | 34 |
| 2 | 2 | 2 | 1 | 22 |
| 3 | 4 | 1 | 4 | 90 |
| 4 | 5 | 3 | 1.67 | 12 |
| 4 | 5 | 3 | 1.67 | 34 |
| 4 | 5 | 3 | 1.67 | 15 |
Sum of the weights for sampled units in a stratum is equal to the stratum population size \[ N_h = \sum_{i \in \mathcal{S}_h } w_{hi} \]
Sum of all weights for all sampled units is equal to the total population size \[ N = \sum_{h=1}^H \sum_{i \in \mathcal{S}_h } w_{hi} \]
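Both properties can be verified on the small example table. The sketch below rebuilds that table as a data frame, computes each weight as \(w_{hi} = N_h / n_h\) (which matches the tabulated weights up to rounding), and sums the weights. The data frame and column names are our own.

```r
# Sampled units from the small example table
samp <- data.frame(
  stratum = c(1, 1, 1, 2, 2, 3, 4, 4, 4),
  N_h     = c(6, 6, 6, 2, 2, 4, 5, 5, 5),
  n_h     = c(3, 3, 3, 2, 2, 1, 3, 3, 3),
  y       = c(53, 107, 83, 34, 22, 90, 12, 34, 15)
)

# Sampling weight: each sampled unit represents N_h / n_h population units
samp$w <- samp$N_h / samp$n_h

# Sum of weights within each stratum recovers N_h
w_sum <- tapply(samp$w, samp$stratum, sum)
w_sum

# Sum of all weights recovers N = 6 + 2 + 4 + 5
sum(samp$w)
```

Using the unrounded weight 5/3 in stratum 4, rather than 1.67, is what makes the sums come out exactly.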
Improve the precision of estimators of totals, means, and proportions
The objective may be to estimate parameters for particular subgroups of the population (strata) or to make comparisons between different strata.