Stratified Random Sampling Analysis with R

This document introduces the use of the survey package for R for making inferences using data collected using a stratified random sampling design. It demonstrates estimation of the population mean, total, strata means and totals, post-stratification, and double-sampling when the number of sampling units per strata are unknown.¹

Survey of Otter Dens

Kruuk et al. (1990) used a stratified random sampling design to estimate the number of otter (Lutra lutra) dens or holts along a 1400km coastline of the Shetland Islands.² The coastline was divided into 237 5km habitable sections, with each section classified as one of four types: cliffs over 10m (89 sections), agricultural (61 sections), peat (40 sections), and non-peat (47 sections). The sample featured here was collected using stratified random sampling. The data are available in the SDaA package. If you have not done so already you will need to install this package using install.packages("SDaA"). To access the data load the package.

library(SDaA)

The data frame is called otters. Here are the first six observations:

head(otters)

  section habitat holts
1       1       4     6
2       3       2     0
3       4       1     8
4       8       1     0
5      11       1     0
6      19       2     0

Each observation includes the section number, the type of habitat, and the number of holts. Although not necessary for the analysis, it would be helpful to have the stratum names be more descriptive. This can be done by changing habitat into a factor using the factor function and assigning labels to the strata.

otters$habitat <- factor(otters$habitat, labels = c("cliffs", "agricultural", 
    "peat", "non-peat"))
head(otters)

  section      habitat holts
1       1     non-peat     6
2       3 agricultural     0
3       4       cliffs     8
4       8       cliffs     0
5      11       cliffs     0
6      19 agricultural     0

It might be interesting to visualize the survey data. We could do that using a boxplot.

boxplot(holts ~ habitat, data = otters, ylab = "Number of Holts", xlab = "Habitat")

Interestingly the distribution of the number of holts is very similar for the cliffs, agricultural, and non-peat sections, but markedly different for the peat sections. This suggests that stratification for sampling from this population may be worthwhile, particularly if optimal or Neyman allocation is used.

Specification of the Design

Before specifying the design it is necessary to include the stratum sizes (i.e., $N_1$, $N_2$, $N_3$, and $N_4$) in the data frame. This can be done by creating a new variable in the data frame called N. First we will initialize the variable.

otters$N <- NA

This just creates an “empty” variable with no values (in R, NA is what is used for a missing value). Now we need to input the stratum sizes, but do so such that the correct stratum size is assigned to the correct sampling unit. This can be done several ways, but perhaps the easiest is to use what is sometimes called subscripting.

otters$N[otters$habitat == "cliffs"] <- 89
otters$N[otters$habitat == "agricultural"] <- 61
otters$N[otters$habitat == "peat"] <- 40
otters$N[otters$habitat == "non-peat"] <- 47

The expression otters$N[otters$habitat == "cliffs"] refers to all values of otters$N where the corresponding value of otters$habitat is equal to "cliffs". Here we assign the value 89 to those values of otters$N. Now we can see the result and check that the subscripting worked correctly.

head(otters)

  section      habitat holts  N
1       1     non-peat     6 47
2       3 agricultural     0 61
3       4       cliffs     8 89
4       8       cliffs     0 89
5      11       cliffs     0 89
6      19 agricultural     0 61

Now the design can be specified. First load the survey package, installing it first if necessary using the command install.packages("survey").

library(survey)

The svydesign function in the survey package is used to create a survey design object that includes information about the design and the data. A stratified random sampling design can be specified as follows.

mydesign <- svydesign(id = ~1, strata = ~habitat, data = otters, fpc = ~N)

The argument ids = ~1 has to do with if or how elements are grouped within sampling units, such as in cluster sampling. But for designs without clustering such as this one simply specify ids = ~1 to indicate one element per sampling unit. The argument strata = ~habitat identifies the stratification variable. The argument data = otters indicates the data frame. The argument fpc = ~N indicates the variable that size of the population from which the unit was sampled. Notice that the variables are always proceeded by a tilde (~). This is a convention of the survey package but is not typical of other packages in R, so it is easy to forget to use it. The object mydesign now includes the data as well as information about the design.

Inferences for $\mu$ and $\tau$

Inferences can be made by applying special functions to the object created by svydesign. An estimate for the population mean $\mu$ and its associated standard error can be computed using the svymean function. Recall that the estimator of $\mu$ is \[ \bar{y}_{\text{st}} = \frac{1}{N}\sum_{i=1}^L N_i\bar{y}_i, \] and the estimated variance of this estimator is \[ \hat{V}(\bar{y}_{\text{st}}) = \frac{1}{N^2}\sum_{i=1}^L N_i^2\left(1 - \frac{n_i}{N_i}\right)\frac{s_i^2}{n_i}. \] The estimator of $\tau$ is $\hat{\tau}_{\text{st}} = N\bar{y}_{\text{st}}$ with estimated variance $\hat{V}(\hat{\tau}_{\text{st}}) = N^2\hat{V}(\bar{y}_{\text{st}})$. To compute $\bar{y}_{\text{st}}$ and its standard error $\sqrt{\hat{V}(\bar{y}_{\text{st}})}$ use svymean and indicate the variable of interest and the design object.

svymean(~holts, design = mydesign)

        mean     SE
holts 4.1549 0.3119

Thus we estimate that the mean number of holts per section is about 4.15 holts with a bound on the error of estimation of about $2 \times 0.3119 \approx 0.62$ holts. The svymean function does not report the bound on the error of estimation $B = z\sqrt{\hat{V}(\bar{y})}$, but it can be computed easily “by hand” if we multiply the standard error by $z$. The confint function can be applied to the result of svymean to compute the confidence interval $\bar{y} \pm z\sqrt{\hat{V}(\bar{y})}$.

confint(svymean(~holts, design = mydesign))

         2.5 %   97.5 %
holts 3.543594 4.766231

Thus we estimate that the mean number of holts per section is between 3.54 and 4.77 holts. Note that while many textbooks define $z = 2$ for a 95% confidence level, the survey package uses $z \approx 1.96$. Inferences concerning the population total $\tau$ can be done by using the svytotal function instead of svymean.

svytotal(~holts, design = mydesign)

       total     SE
holts 984.71 73.921

Thus the estimate for the total number of holts is about 985 with a bound on the error of estimation of about 74. As with svymean, the confint function can be used in conjunction with svytotal to produce a confidence interval.

The svymean and svytotal functions can also be used to estimate the design effect for the survey by specifying the option deff = TRUE.

svymean(~holts, design = mydesign, deff = TRUE)

        mean     SE   DEff
holts 4.1549 0.3119 0.3572

The design effect of about 0.3572 implies an effective sample size of about $82/0.3572 \approx 230$, meaning that a survey from the same population using a simple random sampling design would need a sample size of approximately 230 to achieve the same variance of the estimator for $\mu$.

Domain Means and Totals

Domain means and totals for each habitat can be computed using the svyby function.

svyby(~holts, by = ~habitat, design = mydesign, FUN = svymean)

                  habitat     holts        se
cliffs             cliffs  1.736842 0.4739725
agricultural agricultural  1.750000 0.4790589
peat                 peat 13.272727 1.0964954
non-peat         non-peat  4.095238 0.6408521

svyby(~holts, by = ~habitat, design = mydesign, FUN = svytotal)

                  habitat    holts       se
cliffs             cliffs 154.5789 42.18355
agricultural agricultural 106.7500 29.22259
peat                 peat 530.9091 43.85982
non-peat         non-peat 192.4762 30.12005

Here the by argument specifies the variable that defines the domains. The FUN argument determines what type of parameters should be estimated. In stratified random sampling domains often correspond to strata, although this is not necessary.

Post-stratification

The survey data in the otter data frame were collected using a stratified random sampling design, so an analysis using post-stratification would not be consistent with the design. However we can use these data to illustrate how to do post-stratification with the survey package if we pretend that the data were collected using simple random sampling. First we need to re-specify the design as a simple random sampling design.

otters$N <- 237
mydesign <- svydesign(id = ~1, data = otters, fpc = ~N)

The otter data can then be post-stratified as follows.

habitat.freq <- data.frame(habitat = c("cliffs", "agricultural", "peat", "non-peat"), 
    Freq = c(89, 61, 40, 47))
mydesign <- postStratify(design = mydesign, strata = ~habitat, population = habitat.freq)

The population argument is a data frame, created earlier as habitat.freq. This data frame must contain (a) the exact names of the strata and (b) the total number of sampling units (Freq) in each stratum (i.e., $N_1, N_2, \dots, N_4$). Now functions such as svymean and svytotal can be applied to the post-stratified design.

svymean(~holts, design = mydesign)

        mean     SE
holts 4.1549 0.3256

svytotal(~holts, design = mydesign)

       total     SE
holts 984.71 77.162

Note that the estimates are the same as those obtained when the design was specified as a stratified random sampling design. The standard errors, however, are different because the (estimated) variance of the estimators is different for post-stratification than for stratified random sampling.

Recall that with simple random sampling the weights in the estimator for $\tau$, \[ \hat{\tau} = \sum_{i=1}^n w_i y_i, \] are $w_i = N/n$, which is $w_i = 237/82 \approx 2.89$ for all observations. You can verify this if you use the command weights(mydesign) before applying the post-stratification. Also you can verify that the weights sum to $N=237$ with the command sum(weights(mydesign)), again before applying the post-stratification. However before post-stratification the sums of the weights for the observations from each startum do not match the known values of $N_1$, $N_2$, $N_3$, and $N_4$.

mydesign <- svydesign(id = ~1, data = otters, fpc = ~N)
tapply(weights(mydesign), otters$habitat, sum)

      cliffs agricultural         peat     non-peat 
    54.91463     57.80488     63.58537     60.69512

Now observe what happens after applying post-stratification.

mydesign <- postStratify(design = mydesign, strata = ~habitat, population = habitat.freq)
tapply(weights(mydesign), otters$habitat, sum)

      cliffs agricultural         peat     non-peat 
          89           61           40           47

You can also see the individual weights as before using the command weights(mydesign) to confirm that the weights for all the observations within stratum $k$ are equal to $N_k/n_k$.

Domain/stratum means and totals can also be estimated from the post-stratified design.³

svyby(~holts, by = ~habitat, design = mydesign, FUN = svymean)

                  habitat     holts        se
cliffs             cliffs  1.736842 0.4232673
agricultural agricultural  1.750000 0.4634253
peat                 peat 13.272727 1.2994362
non-peat         non-peat  4.095238 0.6841975

svyby(~holts, by = ~habitat, design = mydesign, FUN = svytotal)

                  habitat    holts       se
cliffs             cliffs 154.5789 37.67079
agricultural agricultural 106.7500 28.26895
peat                 peat 530.9091 51.97745
non-peat         non-peat 192.4762 32.15728

Note that while the estimates are the same as those that were obtained when the design was (correctly) specified as a stratified random sampling design, the standard errors are different. This is because the domain means are ratios of the form \[ \bar{y}_d = \frac{\sum_{i=1}^n d_i y_i}{\sum_{i=1}^n d_i}, \] where $\bar{y}_d$ denotes the estimated mean for a particular domain, $d_i = 1$ if the $i$-th unit is in the domain, and $d_i = 0$ if the $i$-th unit is not in the domain. To see this define the numerator variable as $y_i^* = y_i d_i$ and the denominator variable as $x_i = d_i$ so that \[ \bar{y}_d = \frac{\sum_{i=1}^n y_i^*}{\sum_{i=1}^n x_i} \] is the ratio of the sample totals or means of $\bar{y}_i^*$ and $x_i$. The standard errors can actually be made somewhat more accurate when the total number of sampling units per domain is known. To understand why note that the estimator for the variance of $\bar{y}_d$ is \[ \hat{V}(\bar{y}_d) = \left(1 - \frac{n}{N}\right)\frac{1}{\mu_x^2}\frac{s_r^2}{n}, \] where \[ s_r^2 = \frac{1}{n-1}\sum_{i=1}^n(y_i^* - rx_i)^2 \] and where $r = \bar{y}^*/\bar{x}$, recalling that $y_i^* = d_i y_i$ and $x_i = d_i$ where $d_i$ is an indicator variable for if the $i$-th unit is in the domain. An estimator of the variance of $\tau_d$ is $N_d^2 \hat{V}(\bar{y}_d)$, and the standard error of $\tau_d$ is $N_d$ times the standard error of $\bar{y}_d$. But svyby uses $\bar{x}$ in place of $\mu_x$, even though $\mu_x$ is known since \[ \mu_x = \frac{1}{N}\sum_{i=1}^N x_i = \frac{N_d}{N}, \] and $N_d$ and $N$ are often known when using post-stratification. To adjust the standard errors reported by svyby to use the known value of $\mu_x = N_d/N$ they can be simply multiplied by $\bar{d}N/N_d$ where $\bar{d} = \bar{x}$ is the proportion of units in the sample that are in the domain.

Double (i.e., Two-Phase) Sampling

Recall that double sampling can be used in cases where the stratum sizes (i.e., $N_1, N_2, \dots, N_L$) are unknown before sampling. The most basic approach is take a larger simple random sample, divide the units in that sample into their respective strata, and then take a smaller stratified random sample from that sample. As an example consider a fictional data set from a double sampling design to estimate the mean foot hair density (fhd) of Hobbits from a population of $N=10,000$.

library(downloader)
source_url("https://db.tt/SYVQBc90", prompt = FALSE)
hobbits$N <- 10000

Take a look at the first 12 observations in the data frame hobbit.

head(hobbits, 12)

   region  fhd subsample     N
1   South   NA     FALSE 10000
2   North   NA     FALSE 10000
3    East   NA     FALSE 10000
4    East   NA     FALSE 10000
5   North   NA     FALSE 10000
6   South   NA     FALSE 10000
7    East   NA     FALSE 10000
8   South  4.1      TRUE 10000
9    East   NA     FALSE 10000
10   East   NA     FALSE 10000
11   West  9.2      TRUE 10000
12   East 13.9      TRUE 10000

The stratification variable is region and the variable of interest is fhd. The variable subsample indicates whether an observation appeared in the second phase sample obtained using stratified random sampling applied to the first simple random sample of observations. If subsample is TRUE then the observations appeared in the second phase sample, but if subsample is FALSE then it did not appear in the second phase sample. Note that the stratification variable region is observed for all units, but the target variable fhd is observed only in the second phase sample. The two phase design is specified using the twophase function.

mydesign <- twophase(id = list(~1, ~1), data = hobbits, strata = list(NULL, 
    ~region), fpc = list(~N, NULL), subset = ~subsample)

Note that several of the usual arguments (i.e., id, strata, and fpc) have two components corresponding to the first and second phase samples. Note that because the first sample ignores the strata we specify NULL for strata, and since the stratum sizes are unknown (but estimated using the information in the first sample) we specify NULL for fpc. The subset argument indicates the variable that identifies those observations that are included in the second sample — i.e., a subset of the first sample. Once the two-phase design object is created, then we can make inferences as usual. The functions will recognize the two-phase design and use the appropriate formulas for estimators and standard errors.

svymean(~fhd, design = mydesign)

      mean     SE
fhd 11.209 0.4386

This document is a work in progress. It is based on version 3.30.3 of the survey package.↩
Kruuk, H., Moorhouse, A., Conroy, J. W. H., Durbin, L., & Frears, S. (1989). An estimate of numbers and habitat preferences of otters lutra lutra in Shetland, UK. Biological Conservation, 49, 241-254.↩
The relatively large difference between the estimates of the domain totals with and without post-stratification may be due to the fact that the original sample was not collected using simple random sampling. Simple random sampling would have resulted in approximately proportional allocation, but the allocation is nearly uniform across the strata.↩