This document introduces the use of the survey
package for R for making inferences using data collected using a stratified random sampling design. It demonstrates estimation of the population mean, total, strata means and totals, post-stratification, and double-sampling when the number of sampling units per strata are unknown.1
Kruuk et al. (1990) used a stratified random sampling design to estimate the number of otter (Lutra lutra) dens or holts along a 1400km coastline of the Shetland Islands.2 The coastline was divided into 237 5km habitable sections, with each section classified as one of four types: cliffs over 10m (89 sections), agricultural (61 sections), peat (40 sections), and non-peat (47 sections). The sample featured here was collected using stratified random sampling. The data are available in the SDaA
package. If you have not done so already you will need to install this package using install.packages("SDaA")
. To access the data load the package.
library(SDaA)
The data frame is called otters
. Here are the first six observations:
head(otters)
section habitat holts
1 1 4 6
2 3 2 0
3 4 1 8
4 8 1 0
5 11 1 0
6 19 2 0
Each observation includes the section number, the type of habitat, and the number of holts. Although not necessary for the analysis, it would be helpful to have the stratum names be more descriptive. This can be done by changing habitat
into a factor using the factor
function and assigning labels to the strata.
otters$habitat <- factor(otters$habitat, labels = c("cliffs", "agricultural",
"peat", "non-peat"))
head(otters)
section habitat holts
1 1 non-peat 6
2 3 agricultural 0
3 4 cliffs 8
4 8 cliffs 0
5 11 cliffs 0
6 19 agricultural 0
It might be interesting to visualize the survey data. We could do that using a boxplot.
boxplot(holts ~ habitat, data = otters, ylab = "Number of Holts", xlab = "Habitat")
Interestingly the distribution of the number of holts is very similar for the cliffs, agricultural, and non-peat sections, but markedly different for the peat sections. This suggests that stratification for sampling from this population may be worthwhile, particularly if optimal or Neyman allocation is used.
Before specifying the design it is necessary to include the stratum sizes (i.e., \(N_1\), \(N_2\), \(N_3\), and \(N_4\)) in the data frame. This can be done by creating a new variable in the data frame called N
. First we will initialize the variable.
otters$N <- NA
This just creates an “empty” variable with no values (in R, NA
is what is used for a missing value). Now we need to input the stratum sizes, but do so such that the correct stratum size is assigned to the correct sampling unit. This can be done several ways, but perhaps the easiest is to use what is sometimes called subscripting.
otters$N[otters$habitat == "cliffs"] <- 89
otters$N[otters$habitat == "agricultural"] <- 61
otters$N[otters$habitat == "peat"] <- 40
otters$N[otters$habitat == "non-peat"] <- 47
The expression otters$N[otters$habitat == "cliffs"]
refers to all values of otters$N
where the corresponding value of otters$habitat
is equal to "cliffs"
. Here we assign the value 89
to those values of otters$N
. Now we can see the result and check that the subscripting worked correctly.
head(otters)
section habitat holts N
1 1 non-peat 6 47
2 3 agricultural 0 61
3 4 cliffs 8 89
4 8 cliffs 0 89
5 11 cliffs 0 89
6 19 agricultural 0 61
Now the design can be specified. First load the survey
package, installing it first if necessary using the command install.packages("survey").
library(survey)
The svydesign
function in the survey
package is used to create a survey design object that includes information about the design and the data. A stratified random sampling design can be specified as follows.
mydesign <- svydesign(id = ~1, strata = ~habitat, data = otters, fpc = ~N)
The argument ids = ~1
has to do with if or how elements are grouped within sampling units, such as in cluster sampling. But for designs without clustering such as this one simply specify ids = ~1
to indicate one element per sampling unit. The argument strata = ~habitat
identifies the stratification variable. The argument data = otters
indicates the data frame. The argument fpc = ~N
indicates the variable that size of the population from which the unit was sampled. Notice that the variables are always proceeded by a tilde (~
). This is a convention of the survey
package but is not typical of other packages in R, so it is easy to forget to use it. The object mydesign
now includes the data as well as information about the design.
Inferences can be made by applying special functions to the object created by svydesign
. An estimate for the population mean \(\mu\) and its associated standard error can be computed using the svymean
function. Recall that the estimator of \(\mu\) is \[
\bar{y}_{\text{st}} = \frac{1}{N}\sum_{i=1}^L N_i\bar{y}_i,
\] and the estimated variance of this estimator is \[
\hat{V}(\bar{y}_{\text{st}}) =
\frac{1}{N^2}\sum_{i=1}^L N_i^2\left(1 - \frac{n_i}{N_i}\right)\frac{s_i^2}{n_i}.
\] The estimator of \(\tau\) is \(\hat{\tau}_{\text{st}} = N\bar{y}_{\text{st}}\) with estimated variance \(\hat{V}(\hat{\tau}_{\text{st}}) = N^2\hat{V}(\bar{y}_{\text{st}})\). To compute \(\bar{y}_{\text{st}}\) and its standard error \(\sqrt{\hat{V}(\bar{y}_{\text{st}})}\) use svymean
and indicate the variable of interest and the design object.
svymean(~holts, design = mydesign)
mean SE
holts 4.1549 0.3119
Thus we estimate that the mean number of holts per section is about 4.15 holts with a bound on the error of estimation of about \(2 \times 0.3119 \approx 0.62\) holts. The svymean
function does not report the bound on the error of estimation \(B = z\sqrt{\hat{V}(\bar{y})}\), but it can be computed easily “by hand” if we multiply the standard error by \(z\). The confint
function can be applied to the result of svymean
to compute the confidence interval \(\bar{y} \pm z\sqrt{\hat{V}(\bar{y})}\).
confint(svymean(~holts, design = mydesign))
2.5 % 97.5 %
holts 3.543594 4.766231
Thus we estimate that the mean number of holts per section is between 3.54 and 4.77 holts. Note that while many textbooks define \(z = 2\) for a 95% confidence level, the survey
package uses \(z \approx 1.96\). Inferences concerning the population total \(\tau\) can be done by using the svytotal
function instead of svymean
.
svytotal(~holts, design = mydesign)
total SE
holts 984.71 73.921
Thus the estimate for the total number of holts is about 985 with a bound on the error of estimation of about 74. As with svymean
, the confint
function can be used in conjunction with svytotal
to produce a confidence interval.
The svymean
and svytotal
functions can also be used to estimate the design effect for the survey by specifying the option deff = TRUE
.
svymean(~holts, design = mydesign, deff = TRUE)
mean SE DEff
holts 4.1549 0.3119 0.3572
The design effect of about 0.3572 implies an effective sample size of about \(82/0.3572 \approx 230\), meaning that a survey from the same population using a simple random sampling design would need a sample size of approximately 230 to achieve the same variance of the estimator for \(\mu\).
Domain means and totals for each habitat can be computed using the svyby
function.
svyby(~holts, by = ~habitat, design = mydesign, FUN = svymean)
habitat holts se
cliffs cliffs 1.736842 0.4739725
agricultural agricultural 1.750000 0.4790589
peat peat 13.272727 1.0964954
non-peat non-peat 4.095238 0.6408521
svyby(~holts, by = ~habitat, design = mydesign, FUN = svytotal)
habitat holts se
cliffs cliffs 154.5789 42.18355
agricultural agricultural 106.7500 29.22259
peat peat 530.9091 43.85982
non-peat non-peat 192.4762 30.12005
Here the by
argument specifies the variable that defines the domains. The FUN
argument determines what type of parameters should be estimated. In stratified random sampling domains often correspond to strata, although this is not necessary.
The survey data in the otter
data frame were collected using a stratified random sampling design, so an analysis using post-stratification would not be consistent with the design. However we can use these data to illustrate how to do post-stratification with the survey
package if we pretend that the data were collected using simple random sampling. First we need to re-specify the design as a simple random sampling design.
otters$N <- 237
mydesign <- svydesign(id = ~1, data = otters, fpc = ~N)
The otter
data can then be post-stratified as follows.
habitat.freq <- data.frame(habitat = c("cliffs", "agricultural", "peat", "non-peat"),
Freq = c(89, 61, 40, 47))
mydesign <- postStratify(design = mydesign, strata = ~habitat, population = habitat.freq)
The population
argument is a data frame, created earlier as habitat.freq
. This data frame must contain (a) the exact names of the strata and (b) the total number of sampling units (Freq
) in each stratum (i.e., \(N_1, N_2, \dots, N_4\)). Now functions such as svymean
and svytotal
can be applied to the post-stratified design.
svymean(~holts, design = mydesign)
mean SE
holts 4.1549 0.3256
svytotal(~holts, design = mydesign)
total SE
holts 984.71 77.162
Note that the estimates are the same as those obtained when the design was specified as a stratified random sampling design. The standard errors, however, are different because the (estimated) variance of the estimators is different for post-stratification than for stratified random sampling.
Recall that with simple random sampling the weights in the estimator for \(\tau\), \[
\hat{\tau} = \sum_{i=1}^n w_i y_i,
\] are \(w_i = N/n\), which is \(w_i = 237/82 \approx 2.89\) for all observations. You can verify this if you use the command weights(mydesign)
before applying the post-stratification. Also you can verify that the weights sum to \(N=237\) with the command sum(weights(mydesign))
, again before applying the post-stratification. However before post-stratification the sums of the weights for the observations from each startum do not match the known values of \(N_1\), \(N_2\), \(N_3\), and \(N_4\).
mydesign <- svydesign(id = ~1, data = otters, fpc = ~N)
tapply(weights(mydesign), otters$habitat, sum)
cliffs agricultural peat non-peat
54.91463 57.80488 63.58537 60.69512
Now observe what happens after applying post-stratification.
mydesign <- postStratify(design = mydesign, strata = ~habitat, population = habitat.freq)
tapply(weights(mydesign), otters$habitat, sum)
cliffs agricultural peat non-peat
89 61 40 47
You can also see the individual weights as before using the command weights(mydesign)
to confirm that the weights for all the observations within stratum \(k\) are equal to \(N_k/n_k\).
Domain/stratum means and totals can also be estimated from the post-stratified design.3
svyby(~holts, by = ~habitat, design = mydesign, FUN = svymean)
habitat holts se
cliffs cliffs 1.736842 0.4232673
agricultural agricultural 1.750000 0.4634253
peat peat 13.272727 1.2994362
non-peat non-peat 4.095238 0.6841975
svyby(~holts, by = ~habitat, design = mydesign, FUN = svytotal)
habitat holts se
cliffs cliffs 154.5789 37.67079
agricultural agricultural 106.7500 28.26895
peat peat 530.9091 51.97745
non-peat non-peat 192.4762 32.15728
Note that while the estimates are the same as those that were obtained when the design was (correctly) specified as a stratified random sampling design, the standard errors are different. This is because the domain means are ratios of the form \[
\bar{y}_d = \frac{\sum_{i=1}^n d_i y_i}{\sum_{i=1}^n d_i},
\] where \(\bar{y}_d\) denotes the estimated mean for a particular domain, \(d_i = 1\) if the \(i\)-th unit is in the domain, and \(d_i = 0\) if the \(i\)-th unit is not in the domain. To see this define the numerator variable as \(y_i^* = y_i d_i\) and the denominator variable as \(x_i = d_i\) so that \[
\bar{y}_d = \frac{\sum_{i=1}^n y_i^*}{\sum_{i=1}^n x_i}
\] is the ratio of the sample totals or means of \(\bar{y}_i^*\) and \(x_i\). The standard errors can actually be made somewhat more accurate when the total number of sampling units per domain is known. To understand why note that the estimator for the variance of \(\bar{y}_d\) is \[
\hat{V}(\bar{y}_d) =
\left(1 - \frac{n}{N}\right)\frac{1}{\mu_x^2}\frac{s_r^2}{n},
\] where \[
s_r^2 = \frac{1}{n-1}\sum_{i=1}^n(y_i^* - rx_i)^2
\] and where \(r = \bar{y}^*/\bar{x}\), recalling that \(y_i^* = d_i y_i\) and \(x_i = d_i\) where \(d_i\) is an indicator variable for if the \(i\)-th unit is in the domain. An estimator of the variance of \(\tau_d\) is \(N_d^2 \hat{V}(\bar{y}_d)\), and the standard error of \(\tau_d\) is \(N_d\) times the standard error of \(\bar{y}_d\). But svyby
uses \(\bar{x}\) in place of \(\mu_x\), even though \(\mu_x\) is known since \[
\mu_x = \frac{1}{N}\sum_{i=1}^N x_i = \frac{N_d}{N},
\] and \(N_d\) and \(N\) are often known when using post-stratification. To adjust the standard errors reported by svyby
to use the known value of \(\mu_x = N_d/N\) they can be simply multiplied by \(\bar{d}N/N_d\) where \(\bar{d} = \bar{x}\) is the proportion of units in the sample that are in the domain.
Recall that double sampling can be used in cases where the stratum sizes (i.e., \(N_1, N_2, \dots, N_L\)) are unknown before sampling. The most basic approach is take a larger simple random sample, divide the units in that sample into their respective strata, and then take a smaller stratified random sample from that sample. As an example consider a fictional data set from a double sampling design to estimate the mean foot hair density (fhd) of Hobbits from a population of \(N=10,000\).
library(downloader)
source_url("https://db.tt/SYVQBc90", prompt = FALSE)
hobbits$N <- 10000
Take a look at the first 12 observations in the data frame hobbit
.
head(hobbits, 12)
region fhd subsample N
1 South NA FALSE 10000
2 North NA FALSE 10000
3 East NA FALSE 10000
4 East NA FALSE 10000
5 North NA FALSE 10000
6 South NA FALSE 10000
7 East NA FALSE 10000
8 South 4.1 TRUE 10000
9 East NA FALSE 10000
10 East NA FALSE 10000
11 West 9.2 TRUE 10000
12 East 13.9 TRUE 10000
The stratification variable is region
and the variable of interest is fhd
. The variable subsample
indicates whether an observation appeared in the second phase sample obtained using stratified random sampling applied to the first simple random sample of observations. If subsample
is TRUE
then the observations appeared in the second phase sample, but if subsample
is FALSE
then it did not appear in the second phase sample. Note that the stratification variable region
is observed for all units, but the target variable fhd
is observed only in the second phase sample. The two phase design is specified using the twophase
function.
mydesign <- twophase(id = list(~1, ~1), data = hobbits, strata = list(NULL,
~region), fpc = list(~N, NULL), subset = ~subsample)
Note that several of the usual arguments (i.e., id
, strata
, and fpc
) have two components corresponding to the first and second phase samples. Note that because the first sample ignores the strata we specify NULL
for strata
, and since the stratum sizes are unknown (but estimated using the information in the first sample) we specify NULL
for fpc
. The subset
argument indicates the variable that identifies those observations that are included in the second sample — i.e., a subset of the first sample. Once the two-phase design object is created, then we can make inferences as usual. The functions will recognize the two-phase design and use the appropriate formulas for estimators and standard errors.
svymean(~fhd, design = mydesign)
mean SE
fhd 11.209 0.4386
This document is a work in progress. It is based on version 3.30.3 of the survey
package.↩
Kruuk, H., Moorhouse, A., Conroy, J. W. H., Durbin, L., & Frears, S. (1989). An estimate of numbers and habitat preferences of otters lutra lutra in Shetland, UK. Biological Conservation, 49, 241-254.↩
The relatively large difference between the estimates of the domain totals with and without post-stratification may be due to the fact that the original sample was not collected using simple random sampling. Simple random sampling would have resulted in approximately proportional allocation, but the allocation is nearly uniform across the strata.↩