This document introduces the use of the survey package for R for making inferences using survey data collected using a cluster sampling design. It demonstrates several common “textbook” problems such as the estimation of the population means and totals based on data collected using one-stage and two-stage cluster sampling designs, one-stage or multi-stage sampling where there first stage units are sampled with replacement with probabilities proportional to size (PPS), and cluster sampling where the first stage units are sampled using stratified random sampling.1
In a standard one-stage cluster sampling design, clusters of elements are selected using a probability sampling design and all elements in each selected cluster are observed. That data that will be used for illustration are in the data frame apiclus1 from the survey package. The population is all schools in the state of California with at least 100 students. The clusters are school districts (dname or dnum) and the elements are schools (sname or snum). The \(n\) = 15 selected districts in apiclus1 were selected using simple random sampling, and apiclus1 includes observations from all schools from the population that are in the selected districts. The data can be loaded into a R session using the following commands.
library(survey)
data(api)
The command data(api) actually loads several data frames that are samples from the data frame apipop.
A one-stage cluster sampling design is specified similarly to a simple random sampling design except that the id argument must be specified using a variable that uniquely identifies each cluster, and the fpc argument is the number of clusters instead of the number of elements in the population.
api.onestage <- svydesign(id = ~dnum, data = apiclus1, fpc = ~fpc)
summary(api.onestage)
1 - level Cluster Sampling design
With (15) clusters.
svydesign(id = ~dnum, data = apiclus1, fpc = ~fpc)
Probabilities:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01982 0.01982 0.01982 0.01982 0.01982 0.01982
Population size (PSUs): 757
Data variables:
[1] "cds" "stype" "name" "sname" "snum" "dname"
[7] "dnum" "cname" "cnum" "flag" "pcttest" "api00"
[13] "api99" "target" "growth" "sch.wide" "comp.imp" "both"
[19] "awards" "meals" "ell" "yr.rnd" "mobility" "acs.k3"
[25] "acs.46" "acs.core" "pct.resp" "not.hsg" "hsg" "some.col"
[31] "col.grad" "grad.sch" "avg.ed" "full" "emer" "enroll"
[37] "api.stu" "fpc" "pw"
The variable dnum is a number that identifies the school district, and the variable fpc is the number of sampling units (i.e., clusters) in the population which is \(N\) = 757. Without any additional information this specification implies that (a) the clusters were selected using simple random sampling and (b) all elements within each cluster were observed (i.e., a one-stage cluster sampling design). The summary verifies the number of selected clusters as well as the total number of clusters in the population.
A typical specification of a two-stage cluster sampling design requires variables to identify both the primary and secondary sampling units, as well as the total number of clusters in the population (\(N\)) and the total number of elements in each selected cluster (\(M_i\)). The data frame apiclus2 is a sample obtained using a two-stage cluster sampling design using a simple random sample of \(n\) = 40 districts, where within selected district \(i\) one or more of the \(M_i\) schools in the district were selected using simple random sampling. This two-stage cluster sampling design can be specified as follows.
api.twostage <- svydesign(id = ~dnum + snum, data = apiclus2, fpc = ~fpc1 +
fpc2)
summary(api.twostage)
2 - level Cluster Sampling design
With (40, 126) clusters.
svydesign(id = ~dnum + snum, data = apiclus2, fpc = ~fpc1 + fpc2)
Probabilities:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.003669 0.037740 0.052840 0.042390 0.052840 0.052840
Population size (PSUs): 757
Data variables:
[1] "cds" "stype" "name" "sname" "snum" "dname"
[7] "dnum" "cname" "cnum" "flag" "pcttest" "api00"
[13] "api99" "target" "growth" "sch.wide" "comp.imp" "both"
[19] "awards" "meals" "ell" "yr.rnd" "mobility" "acs.k3"
[25] "acs.46" "acs.core" "pct.resp" "not.hsg" "hsg" "some.col"
[31] "col.grad" "grad.sch" "avg.ed" "full" "emer" "enroll"
[37] "api.stu" "pw" "fpc1" "fpc2"
Here the argument id is given a “formula” of variables that uniquely identify each primary sampling unit (dnum for district) and secondary sampling unit (snum for school). The variables fpc1 and fpc2 indicate for each observation the number of clusters (i.e., fpc1 = \(N\)) in the population, which is \(N\) = 757 as in the one-stage cluster sampling design, and the total number of schools in each sampled district (i.e., fpc2 = \(M_i\)). The summary verifies the number of sampled primary and sampling units and secondary sampling units, as well as the total number primary sampling units in the population.
For one-stage or two-stage cluster sampling, a population mean \(\mu\) can be written as \[ \mu = \frac{1}{M} \sum_{i=1}^N \sum_{j=1}^{M_i} y_{ij}, \] where \(N\) is the number of clusters in the population, \(M_i\) is the number of elements in the \(i\)-th cluster, \(M = \sum_{i=1}^N M_i\) is the number of elements in the population, and \(y_{ij}\) is the value of the target variable for the \(j\)-th element in the \(i\)-th cluster. Suppose that the clusters and elements are ordered such that the selected clusters are the first \(n\) clusters, and the selected elements within the \(i\)-th cluster are the first \(m_i\) elements. A general estimator of \(\mu\) using these data is \[ \hat\mu = \frac{\sum_{i=1}^n \sum_{j=1}^{m_i} w_{ij} y_{ij}} {\sum_{i=1}^n \sum_{j=1}^{m_i} w_{ij}}, \] where \(w_{ij}\) is the sampling weight of the \(j\)-th element in the \(i\)-th cluster. Note that in one-stage cluster sampling \(m_i = M_i\) since all of the elements in each selected cluster are observed, whereas in two-stage cluster sampling \(m_i \le M_i\), and \(m_i < M_i\) for at least one cluster.
The sampling weights implied by a given design are the reciprocals of the inclusion probabilities so that \(w_{ij} = 1/\pi_{ij}\), where \(\pi_{ij}\) is the probability of including the \(j\)-th element within the \(i\)-th cluster in the sample. For one-stage cluster sampling, if the clusters are sampled using simple random sampling then \(w_{ij} = N/n\) since \(\pi_{ij} = n/N\) is the inclusion probability of the \(j\)-th element in the \(i\)-th cluster. For the example given earlier, \(n = 15\) and \(N = 757\), so \(w_{ij} = 757/15 \approx 50.47\). This can be verified using the weights function.
unique(weights(api.onestage))
[1] 50.46667
When the weights are \(w_{ij} = N/n\) then the estimator of \(\mu\) simplifies to forms commonly found in textbooks such as \[ \hat\mu = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n m_i} \ \ \text{or} \ \ \hat\mu = \frac{\sum_{i=1}^n m_i\bar{y}_i}{\sum_{i=1}^n m_i}, \] where \(y_i = \sum_{j=1}^{m_i} y_{ij}\) is the total for the \(i\)-th cluster and \(\bar{y}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij}\) is the mean for the \(i\)-th cluster. For two-stage cluster sampling where clusters and elements within clusters are sampled using simple random sampling, the inclusion probabilities are \(\pi_{ij} = \frac{nm_i}{NM_i}\) (the product of the inclusion probabilities of the \(i\)-th cluster and the \(j\)-th element with the \(i\)-th cluster) so the weights are \(w_{ij} = \frac{NM_i}{nm_i}\). The estimator of \(\mu\) for a two-stage cluster sampling design can be simplified to \[ \hat\mu = \frac{\sum_{i=1}^n M_i\bar{y}_i}{\sum_{i=1}^n M_i}. \]
The estimators of \(\mu\) for one-stage and two-stage cluster sampling discussed above can be implemented using the svymean function, which will automatically compute the appropriate weights for a given design. The estimate and the estimated standard error for the one-stage cluster sampling design are obtained as follows.
svymean(~enroll, design = api.onestage)
mean SE
enroll 549.72 45.191
Here the target variable (\(y_{ij}\)) is enroll which is the enrollment of each school. It can similarly be applied to the sample from the two-stage design, but since there are some schools with missing values for the enroll variable it is necessary to specify the na.rm = TRUE option to drop these observations from the sample prior to the calculations.
svymean(~enroll, design = api.twostage, na.rm = TRUE)
mean SE
enroll 526.26 80.341
The schools with missing values can easily be identified.
with(apiclus2, sname[is.na(enroll)])
[1] "Virginia Avenue Elementary" "Fairfax Elementary"
[3] "California City Middle" "Mojave Elementary"
[5] "Joshua Middle" "Ulrich (Robert P.) Elementary"
An implicit assumption of course is that these are missing at random so as not to bias inferences.
The population total \(\tau\) can be written as \[
\tau = \sum_{i=1}^N \sum_{j=1}^{M_i} y_{ij},
\] or simply \(\tau = M\mu\). A general estimator of \(\tau\) is then \[
\hat\tau = \sum_{i=1}^n \sum_{j=1}^{m_i} w_{ij} y_{ij}
\] For one-stage sampling where the clusters are sampled using simple random sampling, \(w_{ij} = N/n\) and the estimator can be simplified to \[
\hat\tau = \frac{N}{n}\sum_{i=1}^n y_i \ \ \text{or} \ \
\hat\tau = \frac{N}{n}\sum_{i=1}^n M_i\bar{y}_i
\] where \(y_i = \sum_{j=1}^{M_i} y_{ij}\) and \(\bar{y}_i = \frac{1}{M_i}\sum_{j=1}^{M_i} y_{ij}\). For two-stage sampling with simple random sampling at both stages, \(w_{ij} = \frac{NM_i}{nm_i}\) and the estimator of \(\tau\) can be simplified to \[
\hat\tau = \frac{N}{n}\sum_{i=1}^n M_i\bar{y}_i,
\] where \(\bar{y}_i = \frac{1}{m_i}\sum_{i=1}^{m_i} y_{ij}\). These estimators can be implemented using the svytotal function. For the one-stage and two-stage designs specified earlier, the estimated total enrollment of all schools can be computed as follows.
svytotal(~enroll, design = api.onestage)
total SE
enroll 5076846 1389984
svytotal(~enroll, design = api.twostage, na.rm = TRUE)
total SE
enroll 2639273 799638
Again the option na.rm = TRUE is used for the two-stage design to remove the schools with missing values for enroll.
An alternative estimator for \(\tau\) is \(M\hat\mu\) where \(\hat\mu\) is an estimator of \(\mu\) and \(M\) is the number of elements in the population. Because this estimator is proportional to a ratio estimator for \(\mu\) it may have lower variance, but can only be used when \(M\) is known. Also unlike the estimator for \(\tau\) given earlier it is not generally unbiased, although the bias will often be small unless the number of clusters is very small. Estimates based on this estimator, the estimated standard error, and the confidence interval based on this estimator can be computed by multiplying the estimate, standard error, and confidence interval endpoints for \(\mu\) obtained using svymean by \(M\) as follows.
tmp <- svymean(~enroll, design = api.onestage)
tmp[1] * 6194 # estimate
enroll
3404940
SE(tmp) * 6194 # standard error
enroll
enroll 279915.4
confint(tmp) * 6194 # confidence interval
2.5 % 97.5 %
enroll 2856316 3953564
If clusters are sampled with replacement with probabilities proportional to their size (i.e., \(M_i/M\)), then the design can be specified using svydesign by specifying sampling weights of \(w_{ij} = \frac{M}{nm_i}\). When the first stage of sampling uses PPS, the estimators of \(\mu\) and \(\tau\) and their estimated variances are the same for any number of sampling stages, so it is usually sufficient to only specify a PPS design as a one-stage design. The only extra step is the creation of the sampling weights. Assume (incorrectly) for illustration that the data frame apiclus1 is a sample obtained using PPS of the districts. We know that there are \(M\) = 6194 schools in the population, and that the sample includes \(n\) = 15 districts. We can compute and add the sampling weights to the apiclus1 data frame as follows.
apiclus1$w <- NA
for (i in unique(apiclus1$dnum)) {
apiclus1$w[apiclus1$dnum == i] <- 6194/(15 * sum(apiclus1$dnum == i))
}
Note that sum(apiclus1$dnum == i) simply computes the number of schools in a given district.2 Now the design can be specified using svydesign.
api.pps <- svydesign(id = ~dnum, data = apiclus1, weights = ~w)
Note that because sampling is with replacement no fpc argument is given.
Assuming PPS sampling with replacement with cluster selection probabilities of \(M_i/M\), the Hansen-Hurwitz estimator of \(\tau\) is \[
\hat\tau = \frac{M}{n}\sum_{i=1}^n \bar{y}_i,
\] where \(\bar{y}_i = \frac{1}{M_i}\sum_{j=1}^{M_i} y_{ij}\) for one-stage sampling, and \(\bar{y}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij}\) for two-stage sampling. The corresponding estimator of \(\mu\) is then simply \(\hat\mu = \hat\tau/M\). These estimators are special cases of the general estimators using weights as described earlier, except here the weights are \(w_{ij} = \frac{M}{nm_i}\). The estimators can be implemented as usual via the svytotal and svymean functions.
svytotal(~enroll, design = api.pps)
total SE
enroll 2900735 233096
svymean(~enroll, design = api.pps)
mean SE
enroll 468.31 37.633
The data frame teachers from the SDaA package is a sample of observations of teachers using a stratified two-stage cluster sampling design where the primary sampling units (schools) were selected using stratified random sampling, and the secondary sampling units (teachers) were selected using simple random sampling. The number of teachers in each selected school (\(M_i\)) are in a separate data frame teachmi. These data frames can be merged together as follows.
library(SDaA)
teacher.sample <- merge(teachers, teachmi, by = c("dist", "school"))
The data frame also lacks a variable to identify individual teachers, or the stratum sizes, but these can easily be added to the new data frame.
teacher.sample$teacher <- 1:nrow(teacher.sample)
teacher.sample$N <- NA
teacher.sample$N[teacher.sample$dist == "sm/me"] <- 66
teacher.sample$N[teacher.sample$dist == "large"] <- 245
Now the design can be specified using svydesign.
teacher.design <- svydesign(id = ~school + teacher, data = teacher.sample, strata = ~dist +
NULL, fpc = ~N + popteach)
summary(teacher.design)
Stratified 2 - level Cluster Sampling design
With (31, 310) clusters.
svydesign(id = ~school + teacher, data = teacher.sample, strata = ~dist +
NULL, fpc = ~N + popteach)
Probabilities:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.006259 0.031880 0.062590 0.056240 0.079430 0.093880
First-level Stratum Sizes:
large sm/me
obs 250 60
design.PSU 23 8
actual.PSU 23 8
Population stratum sizes (PSUs):
large sm/me
245 66
Data variables:
[1] "dist" "school" "hrwork" "size" "preprmin" "assist"
[7] "popteach" "ssteach" "teacher" "N"
Note that the arguments to id, strata, and fpc give the appropriate values for each sampling stage. Here school and teacher are the primary and secondary sampling units, respectively. Only the primary sampling units (i.e., schools) were stratified so the stratification variable for the secondary sampling units (i.e., teachers) is specified as NULL (or could be omitted entirely). The variables N and popteach give the number of schools in the stratum and the number of teachers in the school for a given teacher in a given school.
Estimates of \(\tau\) and \(\mu\), as well as estimates for each stratum, can be obtained using svytotal, svymean, and svyby. The estimators are like those for two-stage cluster sampling designs, but the weights are computed separately for each stratum.
options(survey.lonely.psu = "remove")
svytotal(~hrwork, design = teacher.design, na.rm = TRUE)
total SE
hrwork 260472 16613
svymean(~hrwork, design = teacher.design, na.rm = TRUE)
mean SE
hrwork 34.094 0.626
svyby(~hrwork, by = ~dist, design = teacher.design, FUN = svytotal, na.rm = TRUE)
dist hrwork se
large large 223228.60 15106.897
sm/me sm/me 37243.51 6910.767
svyby(~hrwork, by = ~dist, design = teacher.design, FUN = svymean, na.rm = TRUE)
dist hrwork se
large large 33.82060 0.7096674
sm/me sm/me 35.82829 0.5190823
Note that there are a couple of complications. One is that there are a few teachers in the sample with missing values for the target variable hrwork (reported number of hours worked per week). Assuming these are missing at random this can be dealt with by using the na.rm = TRUE option. The other problem is that one of the primary sampling units includes on a single secondary sampling unit.
with(teacher.sample, table(school))
school
1 2 3 4 6 7 8 9 11 12 13 15 16 18 19 20 21 22 23 24 25 28 29 30 31
1 4 7 7 11 4 5 21 10 13 3 24 24 2 3 8 7 13 16 9 8 18 8 22 18
32 33 34 36 38 41
16 5 7 4 10 2
This makes the usual calculation of the estimated variance of estimators of \(\tau\) or \(\mu\) impossible. If there are not too many such primary sampling units, one simple approach is to drop them from calculations (i.e., treat them as missing data) using options(survey.lonely.psu = "remove").
This document is a work in progress. It is based on version 3.30.3 of the survey package.↩
Also note that if there were any schools with missing observations that were going to be dropped from the sample that this calculation would need to be modified. If the target variable is enroll then this could be done by using something like sum(!is.na(apiclus1$enroll[apiclus1$dnum == i])) instead of sum(apiclus1$dnum == i) to compute the number of (non-missing) schools in each district.↩