Cluster Sampling Analysis with R

This document introduces the use of the survey package for R for making inferences using survey data collected using a cluster sampling design. It demonstrates several common “textbook” problems such as the estimation of the population means and totals based on data collected using one-stage and two-stage cluster sampling designs, one-stage or multi-stage sampling where there first stage units are sampled with replacement with probabilities proportional to size (PPS), and cluster sampling where the first stage units are sampled using stratified random sampling.¹

One-Stage Cluster Sampling Specification

In a standard one-stage cluster sampling design, clusters of elements are selected using a probability sampling design and all elements in each selected cluster are observed. That data that will be used for illustration are in the data frame apiclus1 from the survey package. The population is all schools in the state of California with at least 100 students. The clusters are school districts (dname or dnum) and the elements are schools (sname or snum). The $n$ = 15 selected districts in apiclus1 were selected using simple random sampling, and apiclus1 includes observations from all schools from the population that are in the selected districts. The data can be loaded into a R session using the following commands.

library(survey)
data(api)

The command data(api) actually loads several data frames that are samples from the data frame apipop.

A one-stage cluster sampling design is specified similarly to a simple random sampling design except that the id argument must be specified using a variable that uniquely identifies each cluster, and the fpc argument is the number of clusters instead of the number of elements in the population.

api.onestage <- svydesign(id = ~dnum, data = apiclus1, fpc = ~fpc)
summary(api.onestage)

1 - level Cluster Sampling design
With (15) clusters.
svydesign(id = ~dnum, data = apiclus1, fpc = ~fpc)
Probabilities:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01982 0.01982 0.01982 0.01982 0.01982 0.01982 
Population size (PSUs): 757 
Data variables:
 [1] "cds"      "stype"    "name"     "sname"    "snum"     "dname"   
 [7] "dnum"     "cname"    "cnum"     "flag"     "pcttest"  "api00"   
[13] "api99"    "target"   "growth"   "sch.wide" "comp.imp" "both"    
[19] "awards"   "meals"    "ell"      "yr.rnd"   "mobility" "acs.k3"  
[25] "acs.46"   "acs.core" "pct.resp" "not.hsg"  "hsg"      "some.col"
[31] "col.grad" "grad.sch" "avg.ed"   "full"     "emer"     "enroll"  
[37] "api.stu"  "fpc"      "pw"

The variable dnum is a number that identifies the school district, and the variable fpc is the number of sampling units (i.e., clusters) in the population which is $N$ = 757. Without any additional information this specification implies that (a) the clusters were selected using simple random sampling and (b) all elements within each cluster were observed (i.e., a one-stage cluster sampling design). The summary verifies the number of selected clusters as well as the total number of clusters in the population.

Two-Stage Cluster Sampling Specification

A typical specification of a two-stage cluster sampling design requires variables to identify both the primary and secondary sampling units, as well as the total number of clusters in the population ($N$) and the total number of elements in each selected cluster ($M_i$). The data frame apiclus2 is a sample obtained using a two-stage cluster sampling design using a simple random sample of $n$ = 40 districts, where within selected district $i$ one or more of the $M_i$ schools in the district were selected using simple random sampling. This two-stage cluster sampling design can be specified as follows.

api.twostage <- svydesign(id = ~dnum + snum, data = apiclus2, fpc = ~fpc1 + 
    fpc2)
summary(api.twostage)

2 - level Cluster Sampling design
With (40, 126) clusters.
svydesign(id = ~dnum + snum, data = apiclus2, fpc = ~fpc1 + fpc2)
Probabilities:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.003669 0.037740 0.052840 0.042390 0.052840 0.052840 
Population size (PSUs): 757 
Data variables:
 [1] "cds"      "stype"    "name"     "sname"    "snum"     "dname"   
 [7] "dnum"     "cname"    "cnum"     "flag"     "pcttest"  "api00"   
[13] "api99"    "target"   "growth"   "sch.wide" "comp.imp" "both"    
[19] "awards"   "meals"    "ell"      "yr.rnd"   "mobility" "acs.k3"  
[25] "acs.46"   "acs.core" "pct.resp" "not.hsg"  "hsg"      "some.col"
[31] "col.grad" "grad.sch" "avg.ed"   "full"     "emer"     "enroll"  
[37] "api.stu"  "pw"       "fpc1"     "fpc2"

Here the argument id is given a “formula” of variables that uniquely identify each primary sampling unit (dnum for district) and secondary sampling unit (snum for school). The variables fpc1 and fpc2 indicate for each observation the number of clusters (i.e., fpc1 = $N$) in the population, which is $N$ = 757 as in the one-stage cluster sampling design, and the total number of schools in each sampled district (i.e., fpc2 = $M_i$). The summary verifies the number of sampled primary and sampling units and secondary sampling units, as well as the total number primary sampling units in the population.

Estimation of Means and Totals

For one-stage or two-stage cluster sampling, a population mean $\mu$ can be written as \[ \mu = \frac{1}{M} \sum_{i=1}^N \sum_{j=1}^{M_i} y_{ij}, \] where $N$ is the number of clusters in the population, $M_i$ is the number of elements in the $i$-th cluster, $M = \sum_{i=1}^N M_i$ is the number of elements in the population, and $y_{ij}$ is the value of the target variable for the $j$-th element in the $i$-th cluster. Suppose that the clusters and elements are ordered such that the selected clusters are the first $n$ clusters, and the selected elements within the $i$-th cluster are the first $m_i$ elements. A general estimator of $\mu$ using these data is \[ \hat\mu = \frac{\sum_{i=1}^n \sum_{j=1}^{m_i} w_{ij} y_{ij}} {\sum_{i=1}^n \sum_{j=1}^{m_i} w_{ij}}, \] where $w_{ij}$ is the sampling weight of the $j$-th element in the $i$-th cluster. Note that in one-stage cluster sampling $m_i = M_i$ since all of the elements in each selected cluster are observed, whereas in two-stage cluster sampling $m_i \le M_i$, and $m_i < M_i$ for at least one cluster.

The sampling weights implied by a given design are the reciprocals of the inclusion probabilities so that $w_{ij} = 1/\pi_{ij}$, where $\pi_{ij}$ is the probability of including the $j$-th element within the $i$-th cluster in the sample. For one-stage cluster sampling, if the clusters are sampled using simple random sampling then $w_{ij} = N/n$ since $\pi_{ij} = n/N$ is the inclusion probability of the $j$-th element in the $i$-th cluster. For the example given earlier, $n = 15$ and $N = 757$, so $w_{ij} = 757/15 \approx 50.47$. This can be verified using the weights function.

unique(weights(api.onestage))

[1] 50.46667

When the weights are $w_{ij} = N/n$ then the estimator of $\mu$ simplifies to forms commonly found in textbooks such as \[ \hat\mu = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n m_i} \ \ \text{or} \ \ \hat\mu = \frac{\sum_{i=1}^n m_i\bar{y}_i}{\sum_{i=1}^n m_i}, \] where $y_i = \sum_{j=1}^{m_i} y_{ij}$ is the total for the $i$-th cluster and $\bar{y}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij}$ is the mean for the $i$-th cluster. For two-stage cluster sampling where clusters and elements within clusters are sampled using simple random sampling, the inclusion probabilities are $\pi_{ij} = \frac{nm_i}{NM_i}$ (the product of the inclusion probabilities of the $i$-th cluster and the $j$-th element with the $i$-th cluster) so the weights are $w_{ij} = \frac{NM_i}{nm_i}$. The estimator of $\mu$ for a two-stage cluster sampling design can be simplified to \[ \hat\mu = \frac{\sum_{i=1}^n M_i\bar{y}_i}{\sum_{i=1}^n M_i}. \]

The estimators of $\mu$ for one-stage and two-stage cluster sampling discussed above can be implemented using the svymean function, which will automatically compute the appropriate weights for a given design. The estimate and the estimated standard error for the one-stage cluster sampling design are obtained as follows.

svymean(~enroll, design = api.onestage)

         mean     SE
enroll 549.72 45.191

Here the target variable ($y_{ij}$) is enroll which is the enrollment of each school. It can similarly be applied to the sample from the two-stage design, but since there are some schools with missing values for the enroll variable it is necessary to specify the na.rm = TRUE option to drop these observations from the sample prior to the calculations.

svymean(~enroll, design = api.twostage, na.rm = TRUE)

         mean     SE
enroll 526.26 80.341

The schools with missing values can easily be identified.

with(apiclus2, sname[is.na(enroll)])

[1] "Virginia Avenue Elementary"    "Fairfax Elementary"           
[3] "California City Middle"        "Mojave Elementary"            
[5] "Joshua Middle"                 "Ulrich (Robert P.) Elementary"

An implicit assumption of course is that these are missing at random so as not to bias inferences.

The population total $\tau$ can be written as \[ \tau = \sum_{i=1}^N \sum_{j=1}^{M_i} y_{ij}, \] or simply $\tau = M\mu$. A general estimator of $\tau$ is then \[ \hat\tau = \sum_{i=1}^n \sum_{j=1}^{m_i} w_{ij} y_{ij} \] For one-stage sampling where the clusters are sampled using simple random sampling, $w_{ij} = N/n$ and the estimator can be simplified to \[ \hat\tau = \frac{N}{n}\sum_{i=1}^n y_i \ \ \text{or} \ \ \hat\tau = \frac{N}{n}\sum_{i=1}^n M_i\bar{y}_i \] where $y_i = \sum_{j=1}^{M_i} y_{ij}$ and $\bar{y}_i = \frac{1}{M_i}\sum_{j=1}^{M_i} y_{ij}$. For two-stage sampling with simple random sampling at both stages, $w_{ij} = \frac{NM_i}{nm_i}$ and the estimator of $\tau$ can be simplified to \[ \hat\tau = \frac{N}{n}\sum_{i=1}^n M_i\bar{y}_i, \] where $\bar{y}_i = \frac{1}{m_i}\sum_{i=1}^{m_i} y_{ij}$. These estimators can be implemented using the svytotal function. For the one-stage and two-stage designs specified earlier, the estimated total enrollment of all schools can be computed as follows.

svytotal(~enroll, design = api.onestage)

         total      SE
enroll 5076846 1389984

svytotal(~enroll, design = api.twostage, na.rm = TRUE)

         total     SE
enroll 2639273 799638

Again the option na.rm = TRUE is used for the two-stage design to remove the schools with missing values for enroll.

An alternative estimator for $\tau$ is $M\hat\mu$ where $\hat\mu$ is an estimator of $\mu$ and $M$ is the number of elements in the population. Because this estimator is proportional to a ratio estimator for $\mu$ it may have lower variance, but can only be used when $M$ is known. Also unlike the estimator for $\tau$ given earlier it is not generally unbiased, although the bias will often be small unless the number of clusters is very small. Estimates based on this estimator, the estimated standard error, and the confidence interval based on this estimator can be computed by multiplying the estimate, standard error, and confidence interval endpoints for $\mu$ obtained using svymean by $M$ as follows.

tmp <- svymean(~enroll, design = api.onestage)
tmp[1] * 6194  # estimate

 enroll 
3404940

SE(tmp) * 6194  # standard error

         enroll
enroll 279915.4

confint(tmp) * 6194  # confidence interval

         2.5 %  97.5 %
enroll 2856316 3953564

Sampling With Probabilities Proportional to Size (PPS)

If clusters are sampled with replacement with probabilities proportional to their size (i.e., $M_i/M$), then the design can be specified using svydesign by specifying sampling weights of $w_{ij} = \frac{M}{nm_i}$. When the first stage of sampling uses PPS, the estimators of $\mu$ and $\tau$ and their estimated variances are the same for any number of sampling stages, so it is usually sufficient to only specify a PPS design as a one-stage design. The only extra step is the creation of the sampling weights. Assume (incorrectly) for illustration that the data frame apiclus1 is a sample obtained using PPS of the districts. We know that there are $M$ = 6194 schools in the population, and that the sample includes $n$ = 15 districts. We can compute and add the sampling weights to the apiclus1 data frame as follows.

apiclus1$w <- NA
for (i in unique(apiclus1$dnum)) {
    apiclus1$w[apiclus1$dnum == i] <- 6194/(15 * sum(apiclus1$dnum == i))
}

Note that sum(apiclus1$dnum == i) simply computes the number of schools in a given district.² Now the design can be specified using svydesign.

api.pps <- svydesign(id = ~dnum, data = apiclus1, weights = ~w)

Note that because sampling is with replacement no fpc argument is given.

Assuming PPS sampling with replacement with cluster selection probabilities of $M_i/M$, the Hansen-Hurwitz estimator of $\tau$ is \[ \hat\tau = \frac{M}{n}\sum_{i=1}^n \bar{y}_i, \] where $\bar{y}_i = \frac{1}{M_i}\sum_{j=1}^{M_i} y_{ij}$ for one-stage sampling, and $\bar{y}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij}$ for two-stage sampling. The corresponding estimator of $\mu$ is then simply $\hat\mu = \hat\tau/M$. These estimators are special cases of the general estimators using weights as described earlier, except here the weights are $w_{ij} = \frac{M}{nm_i}$. The estimators can be implemented as usual via the svytotal and svymean functions.

svytotal(~enroll, design = api.pps)

         total     SE
enroll 2900735 233096

svymean(~enroll, design = api.pps)

         mean     SE
enroll 468.31 37.633

Stratified Cluster Sampling

The data frame teachers from the SDaA package is a sample of observations of teachers using a stratified two-stage cluster sampling design where the primary sampling units (schools) were selected using stratified random sampling, and the secondary sampling units (teachers) were selected using simple random sampling. The number of teachers in each selected school ($M_i$) are in a separate data frame teachmi. These data frames can be merged together as follows.

library(SDaA)
teacher.sample <- merge(teachers, teachmi, by = c("dist", "school"))

The data frame also lacks a variable to identify individual teachers, or the stratum sizes, but these can easily be added to the new data frame.

teacher.sample$teacher <- 1:nrow(teacher.sample)
teacher.sample$N <- NA
teacher.sample$N[teacher.sample$dist == "sm/me"] <- 66
teacher.sample$N[teacher.sample$dist == "large"] <- 245

Now the design can be specified using svydesign.

teacher.design <- svydesign(id = ~school + teacher, data = teacher.sample, strata = ~dist + 
    NULL, fpc = ~N + popteach)
summary(teacher.design)

Stratified 2 - level Cluster Sampling design
With (31, 310) clusters.
svydesign(id = ~school + teacher, data = teacher.sample, strata = ~dist + 
    NULL, fpc = ~N + popteach)
Probabilities:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.006259 0.031880 0.062590 0.056240 0.079430 0.093880 
First-level Stratum Sizes: 
           large sm/me
obs          250    60
design.PSU    23     8
actual.PSU    23     8
Population stratum sizes (PSUs): 
large sm/me 
  245    66 
Data variables:
 [1] "dist"     "school"   "hrwork"   "size"     "preprmin" "assist"  
 [7] "popteach" "ssteach"  "teacher"  "N"

Note that the arguments to id, strata, and fpc give the appropriate values for each sampling stage. Here school and teacher are the primary and secondary sampling units, respectively. Only the primary sampling units (i.e., schools) were stratified so the stratification variable for the secondary sampling units (i.e., teachers) is specified as NULL (or could be omitted entirely). The variables N and popteach give the number of schools in the stratum and the number of teachers in the school for a given teacher in a given school.

Estimates of $\tau$ and $\mu$, as well as estimates for each stratum, can be obtained using svytotal, svymean, and svyby. The estimators are like those for two-stage cluster sampling designs, but the weights are computed separately for each stratum.

options(survey.lonely.psu = "remove")
svytotal(~hrwork, design = teacher.design, na.rm = TRUE)

        total    SE
hrwork 260472 16613

svymean(~hrwork, design = teacher.design, na.rm = TRUE)

         mean    SE
hrwork 34.094 0.626

svyby(~hrwork, by = ~dist, design = teacher.design, FUN = svytotal, na.rm = TRUE)

       dist    hrwork        se
large large 223228.60 15106.897
sm/me sm/me  37243.51  6910.767

svyby(~hrwork, by = ~dist, design = teacher.design, FUN = svymean, na.rm = TRUE)

       dist   hrwork        se
large large 33.82060 0.7096674
sm/me sm/me 35.82829 0.5190823

Note that there are a couple of complications. One is that there are a few teachers in the sample with missing values for the target variable hrwork (reported number of hours worked per week). Assuming these are missing at random this can be dealt with by using the na.rm = TRUE option. The other problem is that one of the primary sampling units includes on a single secondary sampling unit.

with(teacher.sample, table(school))

school
 1  2  3  4  6  7  8  9 11 12 13 15 16 18 19 20 21 22 23 24 25 28 29 30 31 
 1  4  7  7 11  4  5 21 10 13  3 24 24  2  3  8  7 13 16  9  8 18  8 22 18 
32 33 34 36 38 41 
16  5  7  4 10  2

This makes the usual calculation of the estimated variance of estimators of $\tau$ or $\mu$ impossible. If there are not too many such primary sampling units, one simple approach is to drop them from calculations (i.e., treat them as missing data) using options(survey.lonely.psu = "remove").