Simple Random Sampling Analysis in R

This document introduces the use of the survey package for R for making inferences using data collected using a simple random sampling design. It demonstrates estimation of the population mean, total, domain means and totals, and estimation with post-stratification.¹ Other estimators that use auxiliary variables (e.g., ratio and regression estimators, and calibration) are discussed in another (forthcoming) document.

Survey of Otter Dens

Kruuk et al. (1990) used a stratified random sampling design to estimate the number of otter (Lutra lutra) dens or holts along a 1400km coastline of the Shetland Islands.² The coastline was divided into 237 5km habitable sections, with each section classified as one of four types: cliffs over 10m (89 sections), agricultural (61 sections), peat (40 sections), and non-peat (47 sections). The sample featured here was collected using stratified random sampling, but for this example it will be treated as if it had been collected using simple random sampling. The data are available in the SDaA package. If you have not done so already you will need to install this package using install.packages("SDaA"). To access the data load the package.

library(SDaA)

The data frame is called otters. Here are the first six observations:

head(otters)

  section habitat holts
1       1       4     6
2       3       2     0
3       4       1     8
4       8       1     0
5      11       1     0
6      19       2     0

Each observation includes the section number, the type of habitat, and the number of holts. Although not necessary for the analysis, it would be helpful to have the stratum names be more descriptive. This can be done by changing habitat into a factor using the factor function and assigning labels to the strata.

otters$habitat <- factor(otters$habitat, labels = c("cliffs", "agricultural", 
    "peat", "non-peat"))
head(otters)

  section      habitat holts
1       1     non-peat     6
2       3 agricultural     0
3       4       cliffs     8
4       8       cliffs     0
5      11       cliffs     0
6      19 agricultural     0

Specification of the Design

Before specifying the design it is necessary to include within the data frame the population size for the finite population correction. This can be done by creating a variable N.

otters$N <- 237
head(otters)

  section      habitat holts   N
1       1     non-peat     6 237
2       3 agricultural     0 237
3       4       cliffs     8 237
4       8       cliffs     0 237
5      11       cliffs     0 237
6      19 agricultural     0 237

Now the design can be specified. First load the survey package, installing it first if necessary using the command install.packages("survey").

library(survey)

The svydesign function in the survey package is used to create a survey design object that includes information about the design and the data. A simple random sampling design can be specified as follows.

mydesign <- svydesign(id = ~1, data = otters, fpc = ~N)

The argument ids = ~1 has to do with if or how elements are grouped within sampling units, such as in cluster sampling. But for simple random sampling simply specify ids = ~1 to indicate one element per sampling unit. The argument data = otters indicates the data frame. The argument fpc = ~N indicates the variable that size of the population from which the unit was sampled. In cases where the finite population correction can be safely ignored such as when \(1-n/N \approx 1\), or when sampling is with replacement, this argument can be omitted. Notice that the variables are always proceeded by a tilde (~). This is a convention of the survey package but is not typical of other packages in R, so it is easy to forget to use it. The object mydesign now includes the data as well as information about the design.

Inferences for \(\mu\) and \(\tau\)

Inferences can be made by applying special functions to the object created by svydesign. An estimate for the population mean \(\mu\) and its associated standard error can be computed using the svymean function. Recall that the estimator of \(\mu\) is simply the sample mean \(\bar{y}\), and its estimated variance is \[ \hat{V}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{s^2}{n}. \] To compute \(\bar{y}\) and its standard error \(\sqrt{\hat{V}(\bar{y})}\), use svymean and indicate the variable of interest and the design object.

svymean(~holts, design = mydesign)

       mean     SE
holts 5.439 0.6031

Thus we estimate that the mean number of holts per section is about 5.44 holts with a bound on the error of estimation of about \(2 \times 0.6031 \approx 1.21\) holts. The svymean function does not report the bound on the error of estimation \(B = z\sqrt{\hat{V}(\bar{y})}\), but it can be computed easily “by hand” if we multiply the standard error by \(z\). The confint function can be applied to the result of svymean to compute the confidence interval \(\bar{y} \pm z\sqrt{\hat{V}(\bar{y})}\).

confint(svymean(~holts, design = mydesign))

         2.5 %   97.5 %
holts 4.256935 6.621114

Thus we estimate that the mean number of holts per section is between 4.26 and 6.62 holts. Note that while many textbooks define \(z = 2\) for a 95% confidence level, the survey package uses \(z \approx 1.96\). Inferences concerning the population total \(\tau\) can be done by using the svytotal function instead of svymean, which will produce the estimate \(\hat{\tau} = N\bar{y}\) with its estimated standard error \(\sqrt{\hat{V}(\hat{\tau})} = N\sqrt{\hat{V}(\bar{y})}\).

svytotal(~holts, design = mydesign)

      total     SE
holts  1289 142.94

Thus the estimate for the total number of holts is about 1289 with a bound on the error of estimation of about 143. As with svymean, the confint function can be used in conjunction with svytotal to produce a confidence interval.

Inferences for Domain Means

Domain means and totals can be computed using the svyby function.

svyby(~holts, by = ~habitat, design = mydesign, FUN = svymean)

                  habitat     holts        se
cliffs             cliffs  1.736842 0.4232673
agricultural agricultural  1.750000 0.4634253
peat                 peat 13.272727 1.2994362
non-peat         non-peat  4.095238 0.6841975

svyby(~holts, by = ~habitat, design = mydesign, FUN = svytotal)

                  habitat     holts        se
cliffs             cliffs  95.37805  27.99661
agricultural agricultural 101.15854  31.20485
peat                 peat 843.95122 150.03694
non-peat         non-peat 248.56098  56.33427

Here the by argument specifies the variable that defines the domains. The FUN argument determines what type of parameters should be estimated. The domain mean estimator \(\bar{y}_d\) is a ratio since it can be written as \[ \bar{y}_d = \frac{\sum_{i=1}^n d_i y_i}{\sum_{i=1}^n d_i}, \] where \(d_i = 1\) if the \(i\)-th unit is in the domain, and \(d_i = 0\) if the \(i\)-th unit is not in the domain. To see this define the numerator variable as \(y_i^* = y_i d_i\) and the denominator variable as \(x_i = d_i\) so that \[ \bar{y}_d = \frac{\sum_{i=1}^n y_i^*}{\sum_{i=1}^n x_i} \] is the ratio of the sample totals or means of \(\bar{y}_i^*\) and \(x_i\). Because of this the results above from svyby with FUN = svymean can be replicated using the svyratio function which computes the estimate and standard error for a ratio of means or totals. For example, we can compute the estimate and its standard error for the mean number of holts in the cliffs habitat.

otters$d <- ifelse(otters$habitat == "cliffs", 1, 0)
otters$dholts <- otters$holts * otters$d
mydesign <- svydesign(id = ~1, data = otters, fpc = ~N)
svyratio(numerator = ~dholts, denominator = ~d, design = mydesign)

Ratio estimator: svyratio.survey.design2(numerator = ~dholts, denominator = ~d, 
    design = mydesign)
Ratios=
              d
dholts 1.736842
SEs=
               d
dholts 0.4232673

To compute an estimator for a domain total, we could use \[ \hat{\tau}_i = N_d \bar{y}_d, \] where \(N_d\) is the total number of sampling units in the domain. If \(N_d\) is known then computing \(\hat{\tau}_d\) can be done by simply multiplying \(\bar{y}_d\) and its standard error by \(N_d\). But if \(N_d\) is unknown then it can be estimated as \[ \hat{N}_d = N\frac{\sum_{i=1}^n d_i}{n}, \] because \(\sum_{i=1}^n d_i/n\) estimates \(N_d/N\), the proportion of units in the population that are in the domain. Multiplying \(\hat{N}_d\) and \(\bar{y}_d\) and simplifying gives \[ \hat{\tau}_d = \frac{N}{n}\sum_{i=1}^n d_i y_i, \] so that the estimator for \(\tau_d\) when \(N_d\) is unknown is equivalent to the estimate of the population total of \(y_i^* = d_i y_i\). This is the estimator that is reported by svyby with the FUN = svytotal argument, but it can also be computed as follows.

svytotal(~dholts, mydesign)

        total     SE
dholts 95.378 27.997

Post-Stratification

Post-stratification is usually discussed in textbooks in the context of stratified random sampling. But since it is applied, by definition, to data not collected using a stratified random sampling design, such as simple random sampling, it is natural to discuss it here. The otter data can be post-stratified as follows.

pop <- data.frame(habitat = c("cliffs", "agricultural", "peat", "non-peat"), 
    Freq = c(89, 61, 40, 47))
mydesign <- postStratify(design = mydesign, strata = ~habitat, population = pop)

The population argument is specified as a data frame pop that must contain (a) the exact names of the strata and (b) the total number of sampling units (Freq) in each stratum (i.e., \(N_1, N_2, \dots, N_4\)). The function postStatify re-weights the observations to match the given frequencies as can be seen below.

aggregate(weights(mydesign) ~ habitat, data = otters, sum)

       habitat weights(mydesign)
1       cliffs                89
2 agricultural                61
3         peat                40
4     non-peat                47

Now functions such as svymean and svytotal can be applied to the post-stratified design.

svymean(~holts, design = mydesign)

        mean     SE
holts 4.1549 0.3256

svytotal(~holts, design = mydesign)

       total     SE
holts 984.71 77.162

Domain/stratum means and totals can also be estimated from the post-stratified design. The estimates and standard errors of the domain means are the same as those obtained earlier, but the estimates and standard errors for the domain totals are different.

svyby(~holts, by = ~habitat, design = mydesign, FUN = svymean)

                  habitat     holts        se
cliffs             cliffs  1.736842 0.4232673
agricultural agricultural  1.750000 0.4634253
peat                 peat 13.272727 1.2994362
non-peat         non-peat  4.095238 0.6841975

svyby(~holts, by = ~habitat, design = mydesign, FUN = svytotal)

                  habitat    holts       se
cliffs             cliffs 154.5789 37.67079
agricultural agricultural 106.7500 28.26895
peat                 peat 530.9091 51.97745
non-peat         non-peat 192.4762 32.15728

The reason for this is that svyby has used the information about the total number of sampling units in each domain/stratum (\(N_d\)) that was passed to postStratify with the population argument. So the estimator for the domain/stratum total is \(\hat{\tau}_d = N_d\bar{y}_d\) where \(\bar{y}_d\) is the domain/stratum mean estimator for simple random sampling described earlier.³

When the proportion of sampling units in the domain/stratum is known for the population, the standard errors for \(\bar{y}_d\) and \(\tau_d\) reported by svyby can be adjusted to use this information. Because \(\bar{y}_d\) is a ratio as shown earlier, an estimator for the variance of \(\bar{y}_d\) is \[ \hat{V}(\bar{y}_d) = \left(1 - \frac{n}{N}\right)\frac{1}{\mu_x^2}\frac{s_r^2}{n}, \] where \[ s_r^2 = \frac{1}{n-1}\sum_{i=1}^n(y_i^* - rx_i)^2 \] and where \(r = \bar{y}^*/\bar{x}\), recalling that \(y_i^* = d_i y_i\) and \(x_i = d_i\) where \(d_i\) is an indicator variable for if the \(i\)-th unit is in the domain. An estimator of the variance of \(\tau_d\) is \(N_d^2 \hat{V}(\bar{y}_d)\), and the standard error of \(\tau_d\) is \(N_d\) times the standard error of \(\bar{y}_d\). But svyby uses \(\bar{x}\) in place of \(\mu_x\), even though \(\mu_x\) is known since \[ \mu_x = \frac{1}{N}\sum_{i=1}^N x_i = \frac{N_d}{N}, \] and \(N_d\) and \(N\) are often known when using post-stratification. To adjust the standard errors reported by svyby to use the known value of \(\mu_x = N_d/N\) they can be simply multiplied by \(\bar{d}N/N_d\) where \(\bar{d} = \bar{x}\) is the proportion of units in the sample that are in the domain.