Stat 421: Survey Sampling

Week 8

Jae-kwang Kim

3/3/2020

Stratified Sampling

This week
- Stratified sampling overview
- Estimation under stratified sampling
Next week
- Sample size allocation
- Optimal allocation

Stratified Sampling Overview

Stratified Random Sampling (STS)

Definition: A stratified random sample is obtained by separating the population units into non-overlapping groups, called strata, and then selecting a random sample from each stratum.
Three components:
1. Stratification: Partition the population into H subpopulations (called strata)
2. Sample size allocation: Determine the sample size (\(n_h\)) for each stratum \(h=1,\cdots, H\)
3. Sampling: Select an SRS of size \(n_h\) from each stratum \(h\), independently across strata

STS overview: Stratification

Stratification:
- Divide the finite population into \(H\) mutually exclusive and exhaustive subpopulations, called strata.
- Each sampling unit in the finite population belongs to one and only one stratum.
In mathematical notation, we write \[ \mathcal{U} = \mathcal{U}_1 \cup \mathcal{U}_2 \cup \cdots \cup \mathcal{U}_H \] where \(\mathcal{U}\) is the index set of the finite population and \(\mathcal{U}_h\) is the index set of the population in stratum \(h\). Because the union covers all of \(\mathcal{U}\), the partition is exhaustive. Mutually exclusive means that, for \(h \neq g\),
\[ \mathcal{U}_h \cap \mathcal{U}_g = \emptyset .\]

STS overview: Stratification

Population sizes
- \(N_h= | \mathcal{U}_h |\): number of sampling units in stratum \(h\) in the population
Note that the total population size can be written as
\[ N = N_1 + N_2 + \cdots + N_H \]
To form the strata, the sampling frame must contain the stratum variable for every sampling unit
- Stratum variable is a type of auxiliary information
- We must know the stratum variable for every sampling unit in the population
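
The partition notation can be checked on a small frame. The following is a minimal sketch on a hypothetical 8-unit frame (the labels A, B, C and the object names are made up for illustration, not taken from the Ag data). Because split() builds the index sets from a single stratum variable, both checks hold by construction; the point is only to connect the notation to R objects.

```r
# Hypothetical toy frame: 8 units with a stratum label (not the Ag data)
stratum <- c("A", "A", "B", "C", "B", "A", "C", "C")
U <- seq_along(stratum)          # index set of the finite population

# split() returns one index set U_h per stratum label
U_h <- split(U, stratum)

# Exhaustive: the union of the U_h recovers U
exhaustive <- setequal(unlist(U_h), U)

# Mutually exclusive: no index appears in more than one U_h
exclusive <- !any(duplicated(unlist(U_h)))

# Stratum sizes N_h, which add up to N
N_h <- lengths(U_h)
sum(N_h) == length(U)
```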

Ag Example

Population is 3078 counties from Ag Census (lab)
Divide 3078 counties into 4 regional strata (H = 4)
Strata: regions of the US (indexed by h )
- Northeast (h = 1)
- North central (h = 2)
- South (h = 3)
- West (h = 4)

Ag Example

In the sampling frame, must have a variable that indicates the stratum assignment (region) for each county
Population size
- NE stratum: \(N_1=220\)
- NC stratum: \(N_2=1054\)
- South stratum: \(N_3=1382\)
- West stratum: \(N_4=422\)

Ag Example

agpop <- read.table("agpop.dat", quote="\"", comment.char="")
head(agpop[c("V1", "V2", "V15")], n=10)

##                       V1 V2 V15
## 1  ALEUTIAN_ISLANDS_AREA AK   W
## 2         ANCHORAGE_AREA AK   W
## 3         FAIRBANKS_AREA AK   W
## 4            JUNEAU_AREA AK   W
## 5   KENAI_PENINSULA_AREA AK   W
## 6         AUTAUGA_COUNTY AL   S
## 7         BALDWIN_COUNTY AL   S
## 8         BARBOUR_COUNTY AL   S
## 9            BIBB_COUNTY AL   S
## 10         BLOUNT_COUNTY AL   S

Ag Example (Creating Stratum=1)

agpop1 <- subset(agpop, V15 == "NE")
agpop1$StratID <- 1  
agpop1$WithinID <- seq_len(nrow(agpop1))

head(agpop1[c("V1", "V2",  "StratID", "WithinID")], n=10)

##                    V1 V2 StratID WithinID
## 284  FAIRFIELD_COUNTY CT       1        1
## 285   HARTFORD_COUNTY CT       1        2
## 286 LITCHFIELD_COUNTY CT       1        3
## 287  MIDDLESEX_COUNTY CT       1        4
## 288  NEW_HAVEN_COUNTY CT       1        5
## 289 NEW_LONDON_COUNTY CT       1        6
## 290    TOLLAND_COUNTY CT       1        7
## 291    WINDHAM_COUNTY CT       1        8
## 292       KENT_COUNTY DE       1        9
## 293 NEW_CASTLE_COUNTY DE       1       10

Ag Example (Creating Stratum=2)

agpop2 <- subset(agpop, V15 == "NC")
agpop2$StratID <- 2  
agpop2$WithinID <- seq_len(nrow(agpop2))

head(agpop2[c("V1", "V2",  "StratID", "WithinID")], n=10)

##                    V1 V2 StratID WithinID
## 525      ADAIR_COUNTY IA       2        1
## 526      ADAMS_COUNTY IA       2        2
## 527  ALLAMAKEE_COUNTY IA       2        3
## 528  APPANOOSE_COUNTY IA       2        4
## 529    AUDUBON_COUNTY IA       2        5
## 530     BENTON_COUNTY IA       2        6
## 531 BLACK_HAWK_COUNTY IA       2        7
## 532      BOONE_COUNTY IA       2        8
## 533     BREMER_COUNTY IA       2        9
## 534   BUCHANAN_COUNTY IA       2       10

Ag Example (Creating Stratum=3)

agpop3 <- subset(agpop, V15 == "S")
agpop3$StratID <- 3  
agpop3$WithinID <- seq_len(nrow(agpop3))

head(agpop3[c("V1", "V2",  "StratID", "WithinID")], n=10)

##                 V1 V2 StratID WithinID
## 6   AUTAUGA_COUNTY AL       3        1
## 7   BALDWIN_COUNTY AL       3        2
## 8   BARBOUR_COUNTY AL       3        3
## 9      BIBB_COUNTY AL       3        4
## 10   BLOUNT_COUNTY AL       3        5
## 11  BULLOCK_COUNTY AL       3        6
## 12   BUTLER_COUNTY AL       3        7
## 13  CALHOUN_COUNTY AL       3        8
## 14 CHAMBERS_COUNTY AL       3        9
## 15 CHEROKEE_COUNTY AL       3       10

Ag Example (Creating Stratum=4)

agpop4 <- subset(agpop, V15 == "W")
agpop4$StratID <- 4 
agpop4$WithinID <- seq_len(nrow(agpop4))

head(agpop4[c("V1", "V2",  "StratID", "WithinID")], n=10)

##                        V1 V2 StratID WithinID
## 1   ALEUTIAN_ISLANDS_AREA AK       4        1
## 2          ANCHORAGE_AREA AK       4        2
## 3          FAIRBANKS_AREA AK       4        3
## 4             JUNEAU_AREA AK       4        4
## 5    KENAI_PENINSULA_AREA AK       4        5
## 148         APACHE_COUNTY AZ       4        6
## 149        COCHISE_COUNTY AZ       4        7
## 150       COCONINO_COUNTY AZ       4        8
## 151           GILA_COUNTY AZ       4        9
## 152         GRAHAM_COUNTY AZ       4       10

Ag Example (Combine 4 strata)

agpopc <- rbind(agpop1, agpop2, agpop3, agpop4)

head(agpopc[c("V1", "V2",  "StratID", "WithinID")], n=5)

##                    V1 V2 StratID WithinID
## 284  FAIRFIELD_COUNTY CT       1        1
## 285   HARTFORD_COUNTY CT       1        2
## 286 LITCHFIELD_COUNTY CT       1        3
## 287  MIDDLESEX_COUNTY CT       1        4
## 288  NEW_HAVEN_COUNTY CT       1        5

tail(agpopc[c("V1", "V2",  "StratID", "WithinID")], n=5)

##                     V1 V2 StratID WithinID
## 3074 SWEETWATER_COUNTY WY       4      418
## 3075      TETON_COUNTY WY       4      419
## 3076      UINTA_COUNTY WY       4      420
## 3077   WASHAKIE_COUNTY WY       4      421
## 3078     WESTON_COUNTY WY       4      422
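
The same StratID and WithinID columns can also be built in one pass rather than by four subset() calls. This is a sketch on a hypothetical six-row frame that reuses the region codes of V15; the data frame toy and its county labels are invented for illustration.

```r
# Hypothetical miniature frame with the same region codes as V15
toy <- data.frame(
  V1  = c("c1", "c2", "c3", "c4", "c5", "c6"),
  V15 = c("W", "S", "NE", "NC", "S", "NE"),
  stringsAsFactors = FALSE
)

# Map region code to stratum number in the order NE, NC, S, W
toy$StratID <- match(toy$V15, c("NE", "NC", "S", "W"))

# Running index within each stratum, in the original row order
toy$WithinID <- ave(seq_len(nrow(toy)), toy$StratID, FUN = seq_along)

# Sort by stratum to mimic rbind(agpop1, agpop2, agpop3, agpop4)
toyc <- toy[order(toy$StratID, toy$WithinID), ]
toyc
```

Either approach gives the same frame; the one-pass version avoids repeating the block once per stratum when \(H\) is large.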

STS overview: sampling framework

Partition the sample of size \(n\) across strata
- \(n_h\): number of sample units selected from stratum \(h\)
- \(n=n_1+n_2+ \cdots + n_H\)
- Determining the stratum sample sizes \(n_h\) for a given total sample size \(n\) is called sample allocation
Methods for sample size allocation
1. Proportional allocation
2. Neyman allocation

Ag example

Sample size of \(n=300\) from \(N=3078\) counties in the US
Proportional allocation
- \(n=300\) is 9.75% of \(N=3078\) counties
- Set each stratum sample size to 9.75% of the stratum population size (rounded to an integer). That is, \[ n_h \approx 0.0975 \times N_h\]

Ag example

Sample allocation using proportional allocation

Stratum	Stratum Size \((N_h)\)	Sample size \((n_h)\)
1 (NE)	220	21
2 (NC)	1054	103
3 (S)	1382	135
4 (W)	422	41
Total	3078	300

Note that \(n_h/N_h \approx 0.0975\) for all \(h=1,2,3,4\); the ratios are not exactly equal because each \(n_h\) is rounded to an integer.
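
The allocation in the table can be reproduced with a few lines of R. This is a minimal sketch; the stratum sizes come from the slides, and the object names are chosen here for illustration. Simple rounding happens to sum to 300 in this example, but that is not guaranteed in general.

```r
# Stratum population sizes from the Ag example
N_h <- c(NE = 220, NC = 1054, S = 1382, W = 422)
N <- sum(N_h)          # 3078
n <- 300               # overall sample size

f <- n / N             # overall sampling fraction, about 0.0975
n_h <- round(f * N_h)  # proportional allocation, rounded to integers

n_h                    # NE 21, NC 103, S 135, W 41
sum(n_h)               # 300
round(n_h / N_h, 4)    # realized fractions, each close to f
```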

STS overview: sampling framework

Once the sample sizes \(n_h\) are known, we select an independent sample from each stratum
Let \(\mathcal{S}_h \subset \mathcal{U}_h\) be the set of units sampled from \(\mathcal{U}_h\)
Each stratum sample is selected independently of others
- Select a new set of random numbers for each stratum
The final sample is \[ \mathcal{S} = \mathcal{S}_1 \cup \cdots \cup \mathcal{S}_H \]
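
The selection step can be sketched as follows, assuming SRSWOR within each stratum. The frame, the stratum sample sizes, and the seed are hypothetical and serve only to make the example self-contained.

```r
set.seed(421)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical frame: 20 units in 3 strata (not the Ag data)
frame <- data.frame(
  id      = 1:20,
  stratum = rep(c(1, 2, 3), times = c(10, 6, 4))
)
n_h <- c("1" = 3, "2" = 2, "3" = 2)   # assumed stratum sample sizes

# Row indices U_h for each stratum
U_h <- split(seq_len(nrow(frame)), frame$stratum)

# Draw an SRSWOR of size n_h from each U_h; every call draws new
# random numbers, so the stratum samples are independent
S_h <- lapply(names(U_h), function(h) {
  idx <- U_h[[h]]
  idx[sample.int(length(idx), n_h[[h]])]
})

# The final sample S is the union of the stratum samples
S <- unlist(S_h)
frame[S, ]
```

Indexing with sample.int() rather than calling sample(idx, ...) directly avoids the surprising behavior of sample() when a stratum contains a single unit.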

STS overview: sampling framework

Design within a stratum
- We can use any probability sampling design within a stratum
- Sample designs do not need to be the same across strata
- In this lecture, we will only consider the case when a SRSWOR is selected within each stratum

STS overview: estimate parameters

We will rely on the fact that each stratum sample is independent of the other stratum samples
For each stratum, estimate stratum total and its variance
Combine estimates across strata to estimate population total

STS overview: summary

Population framework
- Partition the population into H mutually exclusive and exhaustive categories called strata.
Sampling framework:
- Determine a sample size and sample design for each stratum.
- Select an independent probability sample in each stratum.

STS overview: summary

Estimation framework
- Construct estimates and variance estimates for the population parameter for each stratum.
- Combine the estimates and variance estimates appropriately across strata to obtain an estimate of the population total.

Estimation under STS

STS: population framework

There are \(H\) mutually exclusive and exhaustive strata
- Assign each SU to one and only one stratum
- Index set for strata: \(h=1,2, \cdots, H\)
For STS, we use two indices:
- Stratum \(h\): \(h=1, \cdots, H\)
- SU \(i\) within stratum \(h\): \(i=1, \cdots, N_h\)
Characteristic of interest for SU \(i\) in stratum \(h\) is \(y_{hi}\)

Population parameter: Population Total

Stratum population total for stratum \(h\): \[ T_h = \sum_{i=1}^{N_h} y_{hi} \]
Overall population total \[ T = \sum_{h=1}^H T_h = \sum_{h=1}^H \sum_{i=1}^{N_h} y_{hi} \]

Population parameter: Population Mean

Stratum population mean for stratum \(h\): \[ \bar{Y}_{hU} = \frac{1}{N_h} \sum_{i=1}^{N_h} y_{hi} \]
Overall population mean \[ \bar{Y}_U = \frac{1}{N} T = \sum_{h=1}^H \left( \frac{N_h}{N} \right)\bar{Y}_{hU} \]
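
The second equality in the overall mean follows from writing the total as a sum of stratum totals and using \(T_h = N_h \bar{Y}_{hU}\): \[ \bar{Y}_U = \frac{1}{N} \sum_{h=1}^H T_h = \frac{1}{N} \sum_{h=1}^H N_h \bar{Y}_{hU} = \sum_{h=1}^H \left( \frac{N_h}{N} \right) \bar{Y}_{hU} \] So the overall mean is a weighted average of the stratum means, with weights \(N_h/N\) that sum to one because \(N = N_1 + \cdots + N_H\).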

Estimation of the population total under STS

Basic Idea
1. Estimate \(T_h\) from the sample observations in \(\mathcal{S}_h\), the sample in stratum \(h\)
2. Combine them
Estimation
- Under SRS of size \(n_h\) from \(N_h\), we use \[ \hat{T}_h = N_h \bar{y}_h = N_h n_h^{-1} \sum_{i \in \mathcal{S}_h} y_{hi} \]
- Note that \(\hat{T}_h\) is unbiased for \(T_h\).
- The final estimator of \(T\) is \[ {\hat{T}}_{st} = \sum_{h=1}^H \hat{T}_h \]

Recall that the samples in each stratum are independently sampled. Thus, \(\hat{T}_h\) are mutually independent.
Variance
- Because the \(\hat{T}_h\) are independent, the variance of the sum is the sum of the variances: \[V \left( {\hat{T}}_{st} \right) = \sum_{h=1}^H V \left(\hat{T}_h \right)\]
- Using the formula for SRS, we can obtain \[ V \left( {\hat{T}}_{st} \right) = \sum_{h=1}^H \frac{N_h^2}{n_h} \left( 1- \frac{n_h}{ N_h} \right) S_h^2 ,\] where \[S_h^2 = \frac{1}{N_h-1} \sum_{i=1}^{N_h} ( y_{hi} - \bar{Y}_{hU})^2.\]

Variance Estimation \[ \hat{V} \left( {\hat{T}}_{st} \right) = \sum_{h=1}^H \hat{V} ( \hat{T}_h ) ,\] where \[ \hat{V} ( \hat{T}_h ) = \frac{N_h^2}{n_h} \left( 1- \frac{n_h}{ N_h} \right) s_h^2 \] and \[ s_h^2= \frac{1}{n_h-1} \sum_{i \in \mathcal{S}_h} \left( y_{hi} - \bar{y}_h\right)^2 . \]
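
The point estimator and its variance estimator can be wrapped in one function. The helper st_total() below is hypothetical (it is not part of the lecture code) and assumes SRSWOR within each stratum; the toy data are made up so that the result can be checked by hand.

```r
# Hypothetical helper: stratified estimate of a population total
# y       : observed values for the sampled units
# stratum : stratum label of each sampled unit
# N_h     : named vector of stratum population sizes
st_total <- function(y, stratum, N_h) {
  h    <- names(N_h)
  n_h  <- tapply(y, stratum, length)[h]   # stratum sample sizes
  ybar <- tapply(y, stratum, mean)[h]     # stratum sample means
  s2   <- tapply(y, stratum, var)[h]      # stratum sample variances s_h^2

  T_h <- N_h * ybar                           # estimated stratum totals
  V_h <- N_h^2 / n_h * (1 - n_h / N_h) * s2   # estimated variances

  list(total = sum(T_h), var = sum(V_h), se = sqrt(sum(V_h)))
}

# Toy data: two strata with N_a = 10 and N_b = 20
y       <- c(2, 4, 6, 10, 20)
stratum <- c("a", "a", "a", "b", "b")
est <- st_total(y, stratum, N_h = c(a = 10, b = 20))
est$total   # 10 * 4 + 20 * 15 = 340
```

Note that \(s_h^2\) is undefined when \(n_h = 1\), so this sketch returns NA for the variance if any stratum contributes a single sampled unit.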

Ag example

Y: Acres devoted to farms in 1992

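# Stratum 1 (NE): SRSWOR of n1 counties; V3 is the variable Y
pN1 <- nrow(agpop1); n1 <- 21
sam1 <- sample(agpop1$V3, n1, replace = FALSE)
m1 <- mean(sam1); t1 <- pN1*m1           # sample mean and estimated stratum total
v1 <- pN1*pN1*(1-n1/pN1)*var(sam1)/n1    # estimated variance of t1

# Stratum 2 (NC)
pN2 <- nrow(agpop2); n2 <- 103
sam2 <- sample(agpop2$V3, n2, replace = FALSE)
m2 <- mean(sam2); t2 <- pN2*m2
v2 <- pN2*pN2*(1-n2/pN2)*var(sam2)/n2

# Stratum 3 (S)
pN3 <- nrow(agpop3); n3 <- 135
sam3 <- sample(agpop3$V3, n3, replace = FALSE)
m3 <- mean(sam3); t3 <- pN3*m3
v3 <- pN3*pN3*(1-n3/pN3)*var(sam3)/n3

# Stratum 4 (W)
pN4 <- nrow(agpop4); n4 <- 41
sam4 <- sample(agpop4$V3, n4, replace = FALSE)
m4 <- mean(sam4); t4 <- pN4*m4
v4 <- pN4*pN4*(1-n4/pN4)*var(sam4)/n4

# Combine across strata
pN <- pN1 + pN2 + pN3 + pN4   # population size N
nn <- n1 + n2 + n3 + n4       # total sample size n
te <- t1 + t2 + t3 + t4       # estimated population total
ve <- v1 + v2 + v3 + v4       # estimated variance of te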

One possible realization of the stratified sampling result

Stratum	Stratum Size	Sample Size	Sample Mean	Total Estimate
1	220	21	82886.14	18234951.43
2	1054	103	319901.83	337176534.04
3	1382	135	207082.41	286187887.04
4	422	41	626791.80	264506141.66
Total	3078	300		906105514.16

The (true) population total is 9.439517e+08, the total acres devoted to farms in 1992.

From the sample, we can also estimate the variances

Recall that \[ \hat{V} ( \hat{T}_h ) = \frac{N_h^2}{n_h} \left( 1- \frac{n_h}{N_h} \right) s_h^2 \]
From the above sample, we can obtain

Stratum	Stratum Size	Sample Size	Total Estimate	Variance Estimates
1	220	21	1.823495e+07	9.552438e+12
2	1054	103	3.371765e+08	6.705221e+14
3	1382	135	2.861879e+08	9.328065e+14
4	422	41	2.645061e+08	1.133078e+15
Total	3078	300	9.061055e+08	2.745959e+15

Confidence Interval

We can use a normal-based approximation to obtain 95% confidence interval for \(T\): \[ \left( {\hat{T}}_{st} -1.96 \sqrt{\hat{V} ( {\hat{T}}_{st})}, {\hat{T}}_{st} +1.96 \sqrt{\hat{V} ( {\hat{T}}_{st})} \right) \]
From the sample, the 95% CI that we obtain is (8.033978e+08, 1.008813e+09), which contains the true parameter value 9.439517e+08.
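
The interval can be recomputed from the reported total and variance estimates. This is a small sketch that copies the two values from the tables above; the object names are chosen here for illustration.

```r
# Values copied from the tables above
T_hat <- 906105514.16   # estimated population total
V_hat <- 2.745959e15    # estimated variance of T_hat

se <- sqrt(V_hat)                     # standard error, about 5.24e+07
ci <- T_hat + c(-1, 1) * 1.96 * se    # normal-based 95% interval
ci

# Does the interval cover the true total?
T_true <- 9.439517e8
ci[1] < T_true && T_true < ci[2]
```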

Estimation of other parameters

Estimation of \(\bar{Y}_U = N^{-1} T\): \[ {\bar{y}}_{st} = \frac{1}{N} {\hat{T}}_{st} \]
Variance estimation of \(\bar{y}_{st}\): \[ \hat{V} ( \bar{y}_{st}) = \sum_{h=1}^H \left( \frac{N_h}{N} \right)^2 \frac{1}{n_h} \left( 1- \frac{n_h}{N_h} \right) s_h^2 \]
Proportion estimation is a special case of mean estimation (with binary \(y\))
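
Since \(\bar{y}_{st} = {\hat{T}}_{st}/N\), the variance estimator above is algebraically \(\hat{V}({\hat{T}}_{st})/N^2\). As a sketch, the mean acreage per county and its standard error can be obtained from the totals reported earlier for the Ag example (the object names are illustrative):

```r
N     <- 3078
T_hat <- 906105514.16   # estimated total from the Ag example
V_hat <- 2.745959e15    # estimated variance of T_hat

ybar_st   <- T_hat / N       # estimated mean acres per county
V_ybar_st <- V_hat / N^2     # estimated variance of ybar_st
se_ybar   <- sqrt(V_ybar_st)

c(mean = ybar_st, se = se_ybar)
```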

Sampling weight for STS estimator

STS estimator of the population total \(T\) \[ {\hat{T}}_{st} = \sum_{h=1}^H {\hat{T}}_h = \sum_{h=1}^H \sum_{ i \in \mathcal{S}_h } \frac{N_h}{n_h} y_{hi} = \sum_{h=1}^H \sum_{ i \in \mathcal{S}_h } w_{hi} y_{hi} \]
Sampling weight for SU \(i\) in stratum \(h\) using SRS is \[ w_{hi} = \frac{N_h}{n_h} \]
A sampling weight is a measure of the number of units in the population represented by the sampled unit

Example

Each SU has a weight in the analysis dataset

Stratum	\(N_h\)	\(n_h\)	\(w_{hi}\)	\(y_{hi}\)
1	6	3	2	53
1	6	3	2	107
1	6	3	2	83
2	2	2	1	34
2	2	2	1	22
3	4	1	4	90
4	5	3	1.67	12
4	5	3	1.67	34
4	5	3	1.67	15

Weights

Sum of the weights for sampled units in a stratum is equal to the stratum population size \[ N_h = \sum_{i \in \mathcal{S}_h } w_{hi} \]
Sum of all weights for all sampled units is equal to the total population size \[ N = \sum_{h=1}^H \sum_{i \in \mathcal{S}_h } w_{hi} \]
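
Both identities can be verified on the example dataset above. The sketch below re-enters the table as a data frame (the name dat is chosen here) and computes the weights exactly as \(N_h/n_h\), so stratum 4 uses 5/3 rather than the rounded 1.67.

```r
# Analysis dataset from the example table
dat <- data.frame(
  stratum = c(1, 1, 1, 2, 2, 3, 4, 4, 4),
  N_h     = c(6, 6, 6, 2, 2, 4, 5, 5, 5),
  n_h     = c(3, 3, 3, 2, 2, 1, 3, 3, 3),
  y       = c(53, 107, 83, 34, 22, 90, 12, 34, 15)
)
dat$w <- dat$N_h / dat$n_h           # sampling weight w_hi = N_h / n_h

w_by_stratum <- tapply(dat$w, dat$stratum, sum)
w_by_stratum                         # 6, 2, 4, 5: the stratum sizes N_h
sum(dat$w)                           # 17, the population size N

T_hat <- sum(dat$w * dat$y)          # weighted estimate of the total T
T_hat
```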

Reasons for stratification

Improve the precision of estimators of totals, means, and proportions
The objective may be to estimate parameters for particular sub-groups of the population (strata) or to make comparisons between different strata.
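
The precision gain can be illustrated by comparing the design variances of the SRS and stratified estimators of the total on a known population. This is a sketch on a synthetic population, not the Ag data: the stratum sizes, means, standard deviation, and seed are all assumptions chosen so that the strata differ strongly in their means.

```r
set.seed(1)  # arbitrary seed for the synthetic population

# Hypothetical population: 3 strata with very different means
N_h <- c(500, 300, 200)
mu  <- c(10, 50, 100)
pop <- data.frame(
  stratum = rep(1:3, times = N_h),
  y       = rnorm(sum(N_h), mean = rep(mu, times = N_h), sd = 5)
)
N   <- sum(N_h)
n   <- 100
n_h <- n * N_h / N                 # proportional allocation: 50, 30, 20

# Design variance of the SRS estimator N * ybar
S2    <- var(pop$y)                # population variance with divisor N - 1
V_srs <- N^2 / n * (1 - n / N) * S2

# Design variance of the stratified estimator under SRSWOR in each stratum
S2_h <- tapply(pop$y, pop$stratum, var)
V_st <- sum(N_h^2 / n_h * (1 - n_h / N_h) * S2_h)

c(srs = V_srs, stratified = V_st)
```

When units within a stratum are similar and the stratum means differ, as in this setup, the stratified variance is much smaller because the between-stratum variation no longer contributes to it.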