Stat 421: Survey Sampling

Week 8

Jae-kwang Kim

3/3/2020

Stratified Sampling

  • This week
    • Stratified sampling overview
    • Estimation under stratified sampling
  • Next week
    • Sample size allocation
    • Optimal allocation

Stratified Sampling Overview

Stratified Random Sampling (STS)

  • Definition: A stratified random sample is obtained by separating the population units into non-overlapping groups, called strata, and then selecting a random sample from each stratum

  • Three components:

    1. Stratification: Partition the population into H subpopulations (called strata)
    2. Sample size allocation: Determine the sample size (\(n_h\)) for each stratum \(h=1,\cdots, H\)
    3. Sampling: Select a simple random sample (SRS) of size \(n_h\) from each stratum \(h\)

STS overview: Stratification

  • Stratification:
    • Divide the finite population into \(H\) mutually exclusive and exhaustive subpopulations, called strata.
    • Each sampling unit in the finite population belongs to one and only one stratum.
  • In mathematical notation, \[ \mathcal{U} = \mathcal{U}_1 \cup \mathcal{U}_2 \cup \cdots \cup \mathcal{U}_H \] where \(\mathcal{U}\) is the index set of the finite population and \(\mathcal{U}_h\) is the index set of the population in stratum \(h\). Because the union covers all of \(\mathcal{U}\), the partition is exhaustive. Mutually exclusive means that, for \(h \neq g\),
    \[ \mathcal{U}_h \cap \mathcal{U}_g = \emptyset .\]

STS overview: Stratification

  • Population sizes
    • \(N_h= | \mathcal{U}_h |\): number of sampling units in stratum \(h\) in the population
  • Note that the total population size can be written as
    \[ N = N_1 + N_2 + \cdots + N_H \]

  • Thus, the sampling frame contains the stratum variable for every sampling unit
    • Stratum variable is a type of auxiliary information
    • We must know the stratum variable for every sampling unit in the population

Ag Example

  • Population is 3078 counties from Ag Census (lab)
  • Divide 3078 counties into 4 regional strata (H = 4)
  • Strata: regions of the US (indexed by h)
    • Northeast (h = 1)
    • North central (h = 2)
    • South (h = 3)
    • West (h = 4)

Ag Example

  • The sampling frame must include a variable that indicates the stratum assignment (region) for each county

  • Population size
    • NE stratum: \(N_1=220\)
    • NC stratum: \(N_2=1054\)
    • South stratum: \(N_3=1382\)
    • West stratum: \(N_4=422\)
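As a quick check of the stratum sizes above, the following R sketch tabulates a stratum variable and confirms that the sizes add up to \(N\). The vector `region` is a stand-in rebuilt from the four counts on this slide; with the real frame one would presumably tabulate the region column of `agpop` instead.

```r
# Hypothetical stand-in for the region column of the sampling frame,
# rebuilt from the stratum sizes quoted on the slide.
region <- rep(c("NE", "NC", "S", "W"), times = c(220, 1054, 1382, 422))

# N_h = number of sampling units in each stratum
N_h <- table(factor(region, levels = c("NE", "NC", "S", "W")))
N_h

# The strata partition the population, so the sizes sum to N
N <- sum(N_h)
N
```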

Ag Example

agpop <- read.table("agpop.dat", quote="\"", comment.char="")
head(agpop[c("V1", "V2", "V15")], n=10)
##                       V1 V2 V15
## 1  ALEUTIAN_ISLANDS_AREA AK   W
## 2         ANCHORAGE_AREA AK   W
## 3         FAIRBANKS_AREA AK   W
## 4            JUNEAU_AREA AK   W
## 5   KENAI_PENINSULA_AREA AK   W
## 6         AUTAUGA_COUNTY AL   S
## 7         BALDWIN_COUNTY AL   S
## 8         BARBOUR_COUNTY AL   S
## 9            BIBB_COUNTY AL   S
## 10         BLOUNT_COUNTY AL   S

Ag Example (Creating Stratum=1)

agpop1 <- subset(agpop, V15 == "NE")
agpop1$StratID <- 1  
agpop1$WithinID <- seq_len(nrow(agpop1))

head(agpop1[c("V1", "V2",  "StratID", "WithinID")], n=10)
##                    V1 V2 StratID WithinID
## 284  FAIRFIELD_COUNTY CT       1        1
## 285   HARTFORD_COUNTY CT       1        2
## 286 LITCHFIELD_COUNTY CT       1        3
## 287  MIDDLESEX_COUNTY CT       1        4
## 288  NEW_HAVEN_COUNTY CT       1        5
## 289 NEW_LONDON_COUNTY CT       1        6
## 290    TOLLAND_COUNTY CT       1        7
## 291    WINDHAM_COUNTY CT       1        8
## 292       KENT_COUNTY DE       1        9
## 293 NEW_CASTLE_COUNTY DE       1       10

Ag Example (Creating Stratum=2)

agpop2 <- subset(agpop, V15 == "NC")
agpop2$StratID <- 2  
agpop2$WithinID <- seq_len(nrow(agpop2))

head(agpop2[c("V1", "V2",  "StratID", "WithinID")], n=10)
##                    V1 V2 StratID WithinID
## 525      ADAIR_COUNTY IA       2        1
## 526      ADAMS_COUNTY IA       2        2
## 527  ALLAMAKEE_COUNTY IA       2        3
## 528  APPANOOSE_COUNTY IA       2        4
## 529    AUDUBON_COUNTY IA       2        5
## 530     BENTON_COUNTY IA       2        6
## 531 BLACK_HAWK_COUNTY IA       2        7
## 532      BOONE_COUNTY IA       2        8
## 533     BREMER_COUNTY IA       2        9
## 534   BUCHANAN_COUNTY IA       2       10

Ag Example (Creating Stratum=3)

agpop3 <- subset(agpop, V15 == "S")
agpop3$StratID <- 3  
agpop3$WithinID <- seq_len(nrow(agpop3))

head(agpop3[c("V1", "V2",  "StratID", "WithinID")], n=10)
##                 V1 V2 StratID WithinID
## 6   AUTAUGA_COUNTY AL       3        1
## 7   BALDWIN_COUNTY AL       3        2
## 8   BARBOUR_COUNTY AL       3        3
## 9      BIBB_COUNTY AL       3        4
## 10   BLOUNT_COUNTY AL       3        5
## 11  BULLOCK_COUNTY AL       3        6
## 12   BUTLER_COUNTY AL       3        7
## 13  CALHOUN_COUNTY AL       3        8
## 14 CHAMBERS_COUNTY AL       3        9
## 15 CHEROKEE_COUNTY AL       3       10

Ag Example (Creating Stratum=4)

agpop4 <- subset(agpop, V15 == "W")
agpop4$StratID <- 4 
agpop4$WithinID <- seq_len(nrow(agpop4))

head(agpop4[c("V1", "V2",  "StratID", "WithinID")], n=10)
##                        V1 V2 StratID WithinID
## 1   ALEUTIAN_ISLANDS_AREA AK       4        1
## 2          ANCHORAGE_AREA AK       4        2
## 3          FAIRBANKS_AREA AK       4        3
## 4             JUNEAU_AREA AK       4        4
## 5    KENAI_PENINSULA_AREA AK       4        5
## 148         APACHE_COUNTY AZ       4        6
## 149        COCHISE_COUNTY AZ       4        7
## 150       COCONINO_COUNTY AZ       4        8
## 151           GILA_COUNTY AZ       4        9
## 152         GRAHAM_COUNTY AZ       4       10

Ag Example (Combine 4 strata)

agpopc <- rbind(agpop1, agpop2, agpop3, agpop4)

head(agpopc[c("V1", "V2",  "StratID", "WithinID")], n=5)
##                    V1 V2 StratID WithinID
## 284  FAIRFIELD_COUNTY CT       1        1
## 285   HARTFORD_COUNTY CT       1        2
## 286 LITCHFIELD_COUNTY CT       1        3
## 287  MIDDLESEX_COUNTY CT       1        4
## 288  NEW_HAVEN_COUNTY CT       1        5
tail(agpopc[c("V1", "V2",  "StratID", "WithinID")], n=5)
##                     V1 V2 StratID WithinID
## 3074 SWEETWATER_COUNTY WY       4      418
## 3075      TETON_COUNTY WY       4      419
## 3076      UINTA_COUNTY WY       4      420
## 3077   WASHAKIE_COUNTY WY       4      421
## 3078     WESTON_COUNTY WY       4      422
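The four subset() steps can also be written in one pass. The sketch below is an alternative, not the lab's code: it uses a small made-up data frame `toy` (the county labels and column names are placeholders) and base R's factor() and ave() to build the same kind of StratID and WithinID columns.

```r
# Made-up frame with a region column coded like the one in the lab data
toy <- data.frame(
  county = c("A", "B", "C", "D", "E", "F"),
  region = c("W", "S", "NE", "S", "NC", "W"),
  stringsAsFactors = FALSE
)

# Stratum index h = 1,...,4 in the order NE, NC, S, W
toy$StratID <- as.integer(factor(toy$region, levels = c("NE", "NC", "S", "W")))

# Running index i = 1,...,N_h within each stratum
toy$WithinID <- ave(seq_len(nrow(toy)), toy$StratID, FUN = seq_along)

# Sort so the strata are stacked as with rbind(agpop1, ..., agpop4)
toy <- toy[order(toy$StratID, toy$WithinID), ]
toy
```

Setting the factor levels explicitly fixes the stratum numbering, so it does not depend on the alphabetical order of the region codes.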

STS overview: sampling framework

  • Partition the sample of size \(n\) across strata
    • \(n_h\): number of sample units selected from stratum \(h\)
    • \(n=n_1+n_2+ \cdots + n_H\)
    • Determining the stratum sample sizes \(n_h\) for a given total sample size \(n\) is called sample allocation
  • Methods for sample size allocation
    1. Proportional allocation
    2. Neyman allocation

Ag example

  • Sample size of \(n=300\) from \(N=3078\) counties in the US

  • Proportional allocation
    • \(n=300\) is 9.75% of \(N=3078\) counties
    • Set each stratum sample size to 9.75% of the stratum population size. That is, \[ n_h = 0.0975 \times N_h ,\] rounded to the nearest integer.

Ag example

Sample allocation using proportional allocation

Stratum   Stratum size \((N_h)\)   Sample size \((n_h)\)
1 (NE)    220                      21
2 (NC)    1054                     103
3 (S)     1382                     135
4 (W)     422                      41
Total     3078                     300
  • Note that \(n_h/N_h \approx 0.0975\) for all \(h=1,2,3,4\); the ratios are not exactly equal because each \(n_h\) is rounded to an integer.
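The allocation in the table can be reproduced with a few lines of R. This is a minimal sketch using the stratum sizes from the slides; the variable names are illustrative. Rounding each \(n_h\) separately does not guarantee that the sizes sum to \(n\) in general, although it happens to work out here.

```r
# Stratum population sizes from the Ag example
N_h <- c(NE = 220, NC = 1054, S = 1382, W = 422)
n <- 300

# Common sampling fraction n / N
f <- n / sum(N_h)

# Proportional allocation, rounded to whole counties
n_h <- round(f * N_h)
n_h

# Check that the rounded sizes still add up to n
sum(n_h)
```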

STS overview: sampling framework

  • Once the sample sizes \(n_h\) are known, we select an independent sample from each stratum

  • Let \(\mathcal{S}_h \subset \mathcal{U}_h\) be the set of units sampled from \(\mathcal{U}_h\)

  • Each stratum sample is selected independently of others
    • Select a new set of random numbers for each stratum
  • The final sample is \[ \mathcal{S} = \mathcal{S}_1 \cup \cdots \cup \mathcal{S}_H \]
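The selection step can be sketched in R as follows. The frame here is a hypothetical stand-in that carries only a unit id and a stratum id (sizes taken from the Ag example), and the seed is arbitrary; each call to sample() draws fresh random numbers, which is what makes the stratum samples independent.

```r
# Hypothetical frame: unit id plus stratum id, sizes from the Ag example
frame <- data.frame(
  id = seq_len(3078),
  StratID = rep(1:4, times = c(220, 1054, 1382, 422))
)
n_h <- c(21, 103, 135, 41)

set.seed(421)  # arbitrary seed, only for reproducibility

# Draw an SRS without replacement separately within each stratum
by_stratum <- split(frame, frame$StratID)
samples <- Map(function(d, n) d[sample(nrow(d), n), ], by_stratum, n_h)

# The final sample S is the union of the stratum samples S_1, ..., S_H
S <- do.call(rbind, samples)
table(S$StratID)
```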

STS overview: sampling framework

  • Design within a stratum
    • We can use any probability sampling design within a stratum
    • Sample designs do not need to be the same across strata
    • In this lecture, we will only consider the case where a simple random sample without replacement (SRSWOR) is selected within each stratum

STS overview: estimate parameters

  • We will rely on the fact that each stratum sample is independent of the other stratum samples

  • For each stratum, estimate stratum total and its variance

  • Combine estimates across strata to estimate population total

STS overview: summary

  • Population framework
    • Partition the population into H mutually exclusive and exhaustive categories called strata.
  • Sampling framework:
    • Determine a sample size and sample design for each stratum.
    • Select an independent probability sample in each stratum.

STS overview: summary

  • Estimation framework
    • Construct estimates and variance estimates for the population parameter for each stratum.
    • Combine the estimates and variance estimates appropriately across strata to obtain an estimate of the population total.

Estimation under STS

STS: population framework

  • There are \(H\) mutually exclusive and exhaustive strata
    • Assign each sampling unit (SU) to one and only one stratum
    • Index for strata: \(h=1,2, \cdots, H\)
  • For STS, we use two indices:
    • Stratum \(h\): \(h=1, \cdots, H\)
    • SU within stratum \(i\): \(i=1, \cdots, N_h\)
  • Characteristic of interest for SU \(i\) in stratum \(h\) is \(y_{hi}\)

Population parameter: Population Total

  • Stratum population total for stratum \(h\): \[ T_h = \sum_{i=1}^{N_h} y_{hi} \]

  • Overall population total \[ T = \sum_{h=1}^H T_h = \sum_{h=1}^H \sum_{i=1}^{N_h} y_{hi} \]

Population parameter: Population Mean

  • Stratum population mean for stratum \(h\): \[ \bar{Y}_{hU} = \frac{1}{N_h} \sum_{i=1}^{N_h} y_{hi} \]

  • Overall population mean \[ \bar{Y}_U = \frac{1}{N} T = \sum_{h=1}^H \left( \frac{N_h}{N} \right)\bar{Y}_{hU} \]
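The identity between the overall mean and the \(N_h/N\)-weighted average of the stratum means can be checked numerically. The population below is made up purely for illustration.

```r
# Made-up population with 3 strata of sizes 4, 3 and 5
y <- c(2, 4, 6, 8, 10, 20, 30, 1, 1, 2, 3, 3)
stratum <- rep(1:3, times = c(4, 3, 5))

N_h    <- tapply(y, stratum, length)  # stratum sizes
T_h    <- tapply(y, stratum, sum)     # stratum totals
Ybar_h <- tapply(y, stratum, mean)    # stratum means

N <- length(y)
T_total <- sum(T_h)                   # overall total (T is avoided as a name in R)

# Overall mean two ways: T / N and the weighted average of stratum means
Ybar_U   <- T_total / N
weighted <- sum((N_h / N) * Ybar_h)
all.equal(Ybar_U, weighted)
```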

Estimation of the population total under STS

  • Basic Idea
    1. Estimate \(T_h\) from the sample observations in \(\mathcal{S}_h\), the sample in stratum \(h\)
    2. Combine them
  • Estimation
    • Under SRS of size \(n_h\) from the \(N_h\) units in stratum \(h\), we use \[ \hat{T}_h = N_h \bar{y}_h = N_h n_h^{-1} \sum_{i \in \mathcal{S}_h} y_{hi} \]
    • Note that \(\hat{T}_h\) is unbiased for \(T_h\).
    • The final estimator of \(T\) is \[ {\hat{T}}_{st} = \sum_{h=1}^H \hat{T}_h \]

  • Recall that the samples in each stratum are independently sampled. Thus, \(\hat{T}_h\) are mutually independent.
  • Variance
    • The total variance is \[V \left( {\hat{T}}_{st} \right) = \sum_{h=1}^H V \left(\hat{T}_h \right)\]
    • Using the formula for SRS, we can obtain \[ V \left( {\hat{T}}_{st} \right) = \sum_{h=1}^H \frac{N_h^2}{n_h} \left( 1- \frac{n_h}{ N_h} \right) S_h^2 ,\] where \[S_h^2 = \frac{1}{N_h-1} \sum_{i=1}^{N_h} ( y_{hi} - \bar{Y}_{hU})^2.\]

  • Variance Estimation \[ \hat{V} \left( {\hat{T}}_{st} \right) = \sum_{h=1}^H \hat{V} ( \hat{T}_h ) ,\] where \[ \hat{V} ( \hat{T}_h ) = \frac{N_h^2}{n_h} \left( 1- \frac{n_h}{ N_h} \right) s_h^2 \] and \[ s_h^2= \frac{1}{n_h-1} \sum_{i \in \mathcal{S}_h} \left( y_{hi} - \bar{y}_h\right)^2 . \]
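The estimator and its variance estimator can be wrapped in a small helper. This is a sketch under the SRSWOR-within-stratum assumption of this lecture; the function name `st_total` and the toy data are illustrative, not part of the course code. A stratum with \(n_h = 1\) would give `NA` for \(s_h^2\), so the sketch assumes \(n_h \ge 2\) in every stratum.

```r
# y:       sampled values
# stratum: stratum label of each sampled unit
# N_h:     named vector of stratum population sizes (names match the labels)
st_total <- function(y, stratum, N_h) {
  stratum <- as.character(stratum)
  n_h  <- tapply(y, stratum, length)   # stratum sample sizes
  ybar <- tapply(y, stratum, mean)     # stratum sample means
  s2   <- tapply(y, stratum, var)      # stratum sample variances s_h^2
  N_h  <- N_h[names(n_h)]              # align order with the sample summaries
  T_h  <- N_h * ybar                   # estimated stratum totals
  V_h  <- N_h^2 / n_h * (1 - n_h / N_h) * s2
  list(total = sum(T_h), var = sum(V_h), se = sqrt(sum(V_h)))
}

# Toy sample: two strata with N_a = 6 and N_b = 5
y <- c(53, 107, 83, 12, 34, 15)
stratum <- c("a", "a", "a", "b", "b", "b")
est <- st_total(y, stratum, N_h = c(a = 6, b = 5))
est$total
est$se
```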

Ag example

Y: Acres devoted to farms in 1992

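# Stratum 1 (NE): SRSWOR of n1 counties; V3 is y (acres devoted to farms in 1992)
pN1 <- nrow(agpop1); n1 <- 21
sam1 <- sample(agpop1$V3, n1, replace = FALSE)
m1 <- mean(sam1); t1 <- pN1*m1          # sample mean and estimated stratum total
v1 <- pN1*pN1*(1-n1/pN1)*var(sam1)/n1   # estimated variance of t1

# Stratum 2 (NC)
pN2 <- nrow(agpop2); n2 <- 103
sam2 <- sample(agpop2$V3, n2, replace = FALSE)
m2 <- mean(sam2); t2 <- pN2*m2
v2 <- pN2*pN2*(1-n2/pN2)*var(sam2)/n2

# Stratum 3 (S)
pN3 <- nrow(agpop3); n3 <- 135
sam3 <- sample(agpop3$V3, n3, replace = FALSE)
m3 <- mean(sam3); t3 <- pN3*m3
v3 <- pN3*pN3*(1-n3/pN3)*var(sam3)/n3

# Stratum 4 (W)
pN4 <- nrow(agpop4); n4 <- 41
sam4 <- sample(agpop4$V3, n4, replace = FALSE)
m4 <- mean(sam4); t4 <- pN4*m4
v4 <- pN4*pN4*(1-n4/pN4)*var(sam4)/n4

# Combine across strata
pN <- pN1 + pN2 + pN3 + pN4   # population size N
nn <- n1 + n2 + n3 + n4       # total sample size n
te <- t1 + t2 + t3 + t4       # stratified estimate of the population total
ve <- v1 + v2 + v3 + v4       # variance estimate (strata are sampled independently)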
  • One possible realization of the stratified sampling result
Stratum   Stratum Size   Sample Size   Sample Mean   Total Estimate
1         220            21            82886.14      18234951.43
2         1054           103           319901.83     337176534.04
3         1382           135           207082.41     286187887.04
4         422            41            626791.80     264506141.66
Total     3078           300                         906105514.16
  • The (true) population total is 9.439517e+08, the total acres devoted to farms in 1992.

From the sample, we can also estimate the variances

  • Recall that \[ \hat{V} ( \hat{T}_h ) = \frac{N_h^2}{n_h} \left( 1- \frac{n_h}{N_h} \right) s_h^2 \]

  • From the above sample, we can obtain

Stratum   Stratum Size   Sample Size   Total Estimate   Variance Estimate
1         220            21            1.823495e+07     9.552438e+12
2         1054           103           3.371765e+08     6.705221e+14
3         1382           135           2.861879e+08     9.328065e+14
4         422            41            2.645061e+08     1.133078e+15
Total     3078           300           9.061055e+08     2.745959e+15

Confidence Interval

  • We can use a normal-based approximation to obtain a 95% confidence interval for \(T\): \[ \left( {\hat{T}}_{st} -1.96 \sqrt{\hat{V} ( {\hat{T}}_{st})}, {\hat{T}}_{st} +1.96 \sqrt{\hat{V} ( {\hat{T}}_{st})} \right) \]

  • From the sample, the 95% CI that we obtain is (8.033978e+08, 1.008813e+09) and the true parameter value is 9.439517e+08
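The interval can be recomputed from the totals reported in the variance table. The sketch below plugs in the rounded values shown on the slides, so the endpoints agree with the reported interval only up to that rounding.

```r
# Values reported for this realization of the stratified sample
te <- 9.061055e+08   # estimated population total
ve <- 2.745959e+15   # estimated variance of the total

# Normal-based 95% interval; 1.96 is the critical value used on the slide
z <- 1.96
ci <- te + c(-1, 1) * z * sqrt(ve)
ci

# Does the interval cover the true total quoted on the slide?
truth <- 9.439517e+08
ci[1] <= truth && truth <= ci[2]
```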

Estimation of other parameters

  • Estimation of \(\bar{Y}_U = N^{-1} T\): \[ {\bar{y}}_{st} = \frac{1}{N} {\hat{T}}_{st} \]
  • Variance estimation of \(\bar{y}_{st}\): \[ \hat{V} ( \bar{y}_{st}) = \sum_{h=1}^H \left( \frac{N_h}{N} \right)^2 \frac{1}{n_h} \left( 1- \frac{n_h}{N_h} \right) s_h^2 \]
  • Proportion estimation is a special case of mean estimation (with binary \(y\))
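Because \(\bar{y}_{st} = \hat{T}_{st}/N\), the variance estimate for the mean is the variance estimate for the total divided by \(N^2\), which is what the factor \((N_h/N)^2\) in the formula expresses. A short sketch using the rounded Ag example values reported earlier:

```r
N  <- 3078           # population size
te <- 9.061055e+08   # estimated total from the Ag example
ve <- 2.745959e+15   # estimated variance of the total

# Estimated mean acres per county and its standard error
ybar_st <- te / N
v_mean  <- ve / N^2
se_mean <- sqrt(v_mean)

ybar_st
se_mean
```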

Sampling weight for STS estimator

  • STS estimator of the population total \(T\) \[ {\hat{T}}_{st} = \sum_{h=1}^H {\hat{T}}_h = \sum_{h=1}^H \sum_{ i \in \mathcal{S}_h } \frac{N_h}{n_h} y_{hi} = \sum_{h=1}^H \sum_{ i \in \mathcal{S}_h } w_{hi} y_{hi} \]
  • Sampling weight for SU \(i\) in stratum \(h\) using SRS is \[ w_{hi} = \frac{N_h}{n_h} \]
  • A sampling weight is a measure of the number of units in the population represented by the sampled unit

Example

  • Each SU has a weight in the analysis dataset
Stratum   \(N_h\)   \(n_h\)   \(w_{hi}\)   \(y_{hi}\)
1         6         3         2            53
1         6         3         2            107
1         6         3         2            83
2         2         2         1            34
2         2         2         1            22
3         4         1         4            90
4         5         3         1.67         12
4         5         3         1.67         34
4         5         3         1.67         15

Weights

  • Sum of the weights for sampled units in a stratum is equal to the stratum population size \[ N_h = \sum_{i \in \mathcal{S}_h } w_{hi} \]

  • Sum of all weights for all sampled units is equal to the total population size \[ N = \sum_{h=1}^H \sum_{i \in \mathcal{S}_h } w_{hi} \]
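Both identities can be verified on the small example dataset above. The sketch below rebuilds that table as a data frame (the column names are illustrative) and also forms the weighted estimate of the total.

```r
# Analysis dataset from the example slide
dat <- data.frame(
  stratum = c(1, 1, 1, 2, 2, 3, 4, 4, 4),
  N_h     = c(6, 6, 6, 2, 2, 4, 5, 5, 5),
  n_h     = c(3, 3, 3, 2, 2, 1, 3, 3, 3),
  y       = c(53, 107, 83, 34, 22, 90, 12, 34, 15)
)

# Sampling weight under SRS within each stratum
dat$w <- dat$N_h / dat$n_h

# Weights sum to N_h within each stratum and to N overall
w_sum_h <- tapply(dat$w, dat$stratum, sum)
w_sum_h
sum(dat$w)

# Weighted sum of y gives the stratified estimate of the total
T_hat <- sum(dat$w * dat$y)
T_hat
```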

Reasons for stratification

  1. Improve the precision of estimators of totals, means, and proportions

  2. The objective may be to estimate parameters for particular subgroups of the population (strata) or to make comparisons between different strata.
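The precision gain in the first reason can be illustrated by comparing design variances of the estimated total. The sketch below uses a synthetic population (the sizes, means, and seed are invented) in which stratum means differ a lot, and compares SRS of size \(n\) with STS under proportional allocation using the variance formulas from this lecture. It illustrates one favorable case and is not a general proof.

```r
set.seed(1)  # arbitrary seed for the synthetic population

# Synthetic population: three strata with very different means
N_h <- c(200, 300, 500)
mu  <- c(10, 50, 100)
stratum <- rep(1:3, times = N_h)
y <- rnorm(sum(N_h), mean = rep(mu, times = N_h), sd = 5)

N <- sum(N_h)
n <- 100
n_h <- n * N_h / N   # proportional allocation: 20, 30, 50

# Variance of the estimated total under SRS of size n
V_srs <- N^2 / n * (1 - n / N) * var(y)

# Variance under STS: sum of within-stratum SRS variances
S2_h <- tapply(y, stratum, var)
V_st <- sum(N_h^2 / n_h * (1 - n_h / N_h) * S2_h)

c(SRS = V_srs, STS = V_st)
```

Here the stratified variance depends only on the within-stratum variances \(S_h^2\), so removing the large between-stratum differences is what drives the gain.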