This document introduces the use of the survey package for R for making inferences using data collected using a simple random sampling design. It demonstrates estimation of the population mean, total, domain means and totals, and estimation with post-stratification.1 Other estimators that use auxiliary variables (e.g., ratio and regression estimators, and calibration) are discussed in another (forthcoming) document.
Kruuk et al. (1990) used a stratified random sampling design to estimate the number of otter (Lutra lutra) dens or holts along a 1400km coastline of the Shetland Islands.2 The coastline was divided into 237 5km habitable sections, with each section classified as one of four types: cliffs over 10m (89 sections), agricultural (61 sections), peat (40 sections), and non-peat (47 sections). The sample featured here was collected using stratified random sampling, but for this example it will be treated as if it had been collected using simple random sampling. The data are available in the SDaA
package. If you have not done so already you will need to install this package using install.packages("SDaA")
. To access the data load the package.
library(SDaA)
The data frame is called otters
. Here are the first six observations:
head(otters)
section habitat holts
1 1 4 6
2 3 2 0
3 4 1 8
4 8 1 0
5 11 1 0
6 19 2 0
Each observation includes the section number, the type of habitat, and the number of holts. Although not necessary for the analysis, it would be helpful to have the stratum names be more descriptive. This can be done by changing habitat
into a factor using the factor
function and assigning labels to the strata.
otters$habitat <- factor(otters$habitat, labels = c("cliffs", "agricultural",
"peat", "non-peat"))
head(otters)
section habitat holts
1 1 non-peat 6
2 3 agricultural 0
3 4 cliffs 8
4 8 cliffs 0
5 11 cliffs 0
6 19 agricultural 0
Before specifying the design it is necessary to include within the data frame the population size for the finite population correction. This can be done by creating a variable N
.
otters$N <- 237
head(otters)
section habitat holts N
1 1 non-peat 6 237
2 3 agricultural 0 237
3 4 cliffs 8 237
4 8 cliffs 0 237
5 11 cliffs 0 237
6 19 agricultural 0 237
Now the design can be specified. First load the survey
package, installing it first if necessary using the command install.packages("survey").
library(survey)
The svydesign
function in the survey
package is used to create a survey design object that includes information about the design and the data. A simple random sampling design can be specified as follows.
mydesign <- svydesign(id = ~1, data = otters, fpc = ~N)
The argument ids = ~1
has to do with if or how elements are grouped within sampling units, such as in cluster sampling. But for simple random sampling simply specify ids = ~1
to indicate one element per sampling unit. The argument data = otters
indicates the data frame. The argument fpc = ~N
indicates the variable that size of the population from which the unit was sampled. In cases where the finite population correction can be safely ignored such as when \(1-n/N \approx 1\), or when sampling is with replacement, this argument can be omitted. Notice that the variables are always proceeded by a tilde (~
). This is a convention of the survey
package but is not typical of other packages in R, so it is easy to forget to use it. The object mydesign
now includes the data as well as information about the design.
Inferences can be made by applying special functions to the object created by svydesign
. An estimate for the population mean \(\mu\) and its associated standard error can be computed using the svymean
function. Recall that the estimator of \(\mu\) is simply the sample mean \(\bar{y}\), and its estimated variance is \[
\hat{V}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{s^2}{n}.
\] To compute \(\bar{y}\) and its standard error \(\sqrt{\hat{V}(\bar{y})}\), use svymean
and indicate the variable of interest and the design object.
svymean(~holts, design = mydesign)
mean SE
holts 5.439 0.6031
Thus we estimate that the mean number of holts per section is about 5.44 holts with a bound on the error of estimation of about \(2 \times 0.6031 \approx 1.21\) holts. The svymean
function does not report the bound on the error of estimation \(B = z\sqrt{\hat{V}(\bar{y})}\), but it can be computed easily “by hand” if we multiply the standard error by \(z\). The confint
function can be applied to the result of svymean
to compute the confidence interval \(\bar{y} \pm z\sqrt{\hat{V}(\bar{y})}\).
confint(svymean(~holts, design = mydesign))
2.5 % 97.5 %
holts 4.256935 6.621114
Thus we estimate that the mean number of holts per section is between 4.26 and 6.62 holts. Note that while many textbooks define \(z = 2\) for a 95% confidence level, the survey
package uses \(z \approx 1.96\). Inferences concerning the population total \(\tau\) can be done by using the svytotal
function instead of svymean
, which will produce the estimate \(\hat{\tau} = N\bar{y}\) with its estimated standard error \(\sqrt{\hat{V}(\hat{\tau})} = N\sqrt{\hat{V}(\bar{y})}\).
svytotal(~holts, design = mydesign)
total SE
holts 1289 142.94
Thus the estimate for the total number of holts is about 1289 with a bound on the error of estimation of about 143. As with svymean
, the confint
function can be used in conjunction with svytotal
to produce a confidence interval.
Domain means and totals can be computed using the svyby
function.
svyby(~holts, by = ~habitat, design = mydesign, FUN = svymean)
habitat holts se
cliffs cliffs 1.736842 0.4232673
agricultural agricultural 1.750000 0.4634253
peat peat 13.272727 1.2994362
non-peat non-peat 4.095238 0.6841975
svyby(~holts, by = ~habitat, design = mydesign, FUN = svytotal)
habitat holts se
cliffs cliffs 95.37805 27.99661
agricultural agricultural 101.15854 31.20485
peat peat 843.95122 150.03694
non-peat non-peat 248.56098 56.33427
Here the by
argument specifies the variable that defines the domains. The FUN
argument determines what type of parameters should be estimated. The domain mean estimator \(\bar{y}_d\) is a ratio since it can be written as \[
\bar{y}_d = \frac{\sum_{i=1}^n d_i y_i}{\sum_{i=1}^n d_i},
\] where \(d_i = 1\) if the \(i\)-th unit is in the domain, and \(d_i = 0\) if the \(i\)-th unit is not in the domain. To see this define the numerator variable as \(y_i^* = y_i d_i\) and the denominator variable as \(x_i = d_i\) so that \[
\bar{y}_d = \frac{\sum_{i=1}^n y_i^*}{\sum_{i=1}^n x_i}
\] is the ratio of the sample totals or means of \(\bar{y}_i^*\) and \(x_i\). Because of this the results above from svyby
with FUN = svymean
can be replicated using the svyratio
function which computes the estimate and standard error for a ratio of means or totals. For example, we can compute the estimate and its standard error for the mean number of holts in the cliffs habitat.
otters$d <- ifelse(otters$habitat == "cliffs", 1, 0)
otters$dholts <- otters$holts * otters$d
mydesign <- svydesign(id = ~1, data = otters, fpc = ~N)
svyratio(numerator = ~dholts, denominator = ~d, design = mydesign)
Ratio estimator: svyratio.survey.design2(numerator = ~dholts, denominator = ~d,
design = mydesign)
Ratios=
d
dholts 1.736842
SEs=
d
dholts 0.4232673
To compute an estimator for a domain total, we could use \[
\hat{\tau}_i = N_d \bar{y}_d,
\] where \(N_d\) is the total number of sampling units in the domain. If \(N_d\) is known then computing \(\hat{\tau}_d\) can be done by simply multiplying \(\bar{y}_d\) and its standard error by \(N_d\). But if \(N_d\) is unknown then it can be estimated as \[
\hat{N}_d = N\frac{\sum_{i=1}^n d_i}{n},
\] because \(\sum_{i=1}^n d_i/n\) estimates \(N_d/N\), the proportion of units in the population that are in the domain. Multiplying \(\hat{N}_d\) and \(\bar{y}_d\) and simplifying gives \[
\hat{\tau}_d = \frac{N}{n}\sum_{i=1}^n d_i y_i,
\] so that the estimator for \(\tau_d\) when \(N_d\) is unknown is equivalent to the estimate of the population total of \(y_i^* = d_i y_i\). This is the estimator that is reported by svyby
with the FUN = svytotal
argument, but it can also be computed as follows.
svytotal(~dholts, mydesign)
total SE
dholts 95.378 27.997
Post-stratification is usually discussed in textbooks in the context of stratified random sampling. But since it is applied, by definition, to data not collected using a stratified random sampling design, such as simple random sampling, it is natural to discuss it here. The otter
data can be post-stratified as follows.
pop <- data.frame(habitat = c("cliffs", "agricultural", "peat", "non-peat"),
Freq = c(89, 61, 40, 47))
mydesign <- postStratify(design = mydesign, strata = ~habitat, population = pop)
The population
argument is specified as a data frame pop
that must contain (a) the exact names of the strata and (b) the total number of sampling units (Freq
) in each stratum (i.e., \(N_1, N_2, \dots, N_4\)). The function postStatify
re-weights the observations to match the given frequencies as can be seen below.
aggregate(weights(mydesign) ~ habitat, data = otters, sum)
habitat weights(mydesign)
1 cliffs 89
2 agricultural 61
3 peat 40
4 non-peat 47
Now functions such as svymean
and svytotal
can be applied to the post-stratified design.
svymean(~holts, design = mydesign)
mean SE
holts 4.1549 0.3256
svytotal(~holts, design = mydesign)
total SE
holts 984.71 77.162
Domain/stratum means and totals can also be estimated from the post-stratified design. The estimates and standard errors of the domain means are the same as those obtained earlier, but the estimates and standard errors for the domain totals are different.
svyby(~holts, by = ~habitat, design = mydesign, FUN = svymean)
habitat holts se
cliffs cliffs 1.736842 0.4232673
agricultural agricultural 1.750000 0.4634253
peat peat 13.272727 1.2994362
non-peat non-peat 4.095238 0.6841975
svyby(~holts, by = ~habitat, design = mydesign, FUN = svytotal)
habitat holts se
cliffs cliffs 154.5789 37.67079
agricultural agricultural 106.7500 28.26895
peat peat 530.9091 51.97745
non-peat non-peat 192.4762 32.15728
The reason for this is that svyby
has used the information about the total number of sampling units in each domain/stratum (\(N_d\)) that was passed to postStratify
with the population
argument. So the estimator for the domain/stratum total is \(\hat{\tau}_d = N_d\bar{y}_d\) where \(\bar{y}_d\) is the domain/stratum mean estimator for simple random sampling described earlier.3
When the proportion of sampling units in the domain/stratum is known for the population, the standard errors for \(\bar{y}_d\) and \(\tau_d\) reported by svyby
can be adjusted to use this information. Because \(\bar{y}_d\) is a ratio as shown earlier, an estimator for the variance of \(\bar{y}_d\) is \[
\hat{V}(\bar{y}_d) =
\left(1 - \frac{n}{N}\right)\frac{1}{\mu_x^2}\frac{s_r^2}{n},
\] where \[
s_r^2 = \frac{1}{n-1}\sum_{i=1}^n(y_i^* - rx_i)^2
\] and where \(r = \bar{y}^*/\bar{x}\), recalling that \(y_i^* = d_i y_i\) and \(x_i = d_i\) where \(d_i\) is an indicator variable for if the \(i\)-th unit is in the domain. An estimator of the variance of \(\tau_d\) is \(N_d^2 \hat{V}(\bar{y}_d)\), and the standard error of \(\tau_d\) is \(N_d\) times the standard error of \(\bar{y}_d\). But svyby
uses \(\bar{x}\) in place of \(\mu_x\), even though \(\mu_x\) is known since \[
\mu_x = \frac{1}{N}\sum_{i=1}^N x_i = \frac{N_d}{N},
\] and \(N_d\) and \(N\) are often known when using post-stratification. To adjust the standard errors reported by svyby
to use the known value of \(\mu_x = N_d/N\) they can be simply multiplied by \(\bar{d}N/N_d\) where \(\bar{d} = \bar{x}\) is the proportion of units in the sample that are in the domain.
This document is a work in progress. It is based on version 3.30.3 of the survey
package.↩
Kruuk, H., Moorhouse, A., Conroy, J. W. H., Durbin, L., & Frears, S. (1989). An estimate of numbers and habitat preferences of otters lutra lutra in Shetland, UK. Biological Conservation, 49, 241-254.↩
The relatively large difference between the estimates of the domain totals with and without post-stratification may be due to the fact that the original sample was not collected using simple random sampling. Simple random sampling would have resulted in approximately proportional allocation, but the allocation is nearly uniform across the strata.↩