Background

This document describes a R function called allocate for computing optimal allocation for stratified random sampling.1 The allocate function assumes a stratified random sampling design where simple random samples are taken from each of \(L\) strata, and the use of the population mean estimator \[ \bar{y}_{\text{st}} = \frac{1}{N}\sum_{i=1}^L N_i \bar{y}_i, \] where \(N_i\) is the number of available sampling units in the \(i\)-th stratum (i.e., the population size for the \(i\)-th stratum), \(N = \sum_{i=1}^L N_i\), and \(\bar{y}_i\) is the sample mean for the simple random sample from the \(i\)-th stratum. The variance of this estimator is \[ V(\bar{y}_{\text{st}}) = \frac{1}{N^2}\sum_{i=1}^L N_i^2\left(1-\frac{n_i}{N_i}\right)\frac{\sigma_i^2}{n_i}, \] where \(n_i\) is the number of sampling units sampled from the \(i\)-th stratum and \(\sigma_i^2\) is the population variance for the \(i\)-th stratum defined as \[ \sigma_i^2 = \frac{1}{N_i-1}\sum_{j=1}^{N_i} (y_{ij} - \mu_i)^2, \] where \(\mu_i\) is the population mean for the \(i\)-th stratum and \(y_{ij}\) is the observation of the \(j\)-th unit in the \(i\)-th stratum (note that the denominator in the expression for the variance is \(N_i-1\) rather than \(N_i\)). A linear cost function is assumed by allocate where the cost of the survey (\(C\)) is computed as \[ C = c_0 + \sum_{i=1}^L n_i c_i, \] where \(c_0\) is the “overhead” cost and \(c_i\) is the cost per sampling unit in the \(i\)-th stratum. The optimum allocation that minimizes the cost for a given variance or minimizes the variance for a given cost results in the sampling fractions, \[ n_i/n = \frac{N_i \sigma_i / \sqrt{c_i}} {\sum_{k=1}^L N_k \sigma_k / \sqrt{c_k}} \] To compute an optimum value of \(n\) one must decide if they want to minimize cost for a given variance, or minimize the variance for a given cost. To minimize the survey cost for a given variance \(V(\bar{y}_{\text{st}}) = V\) the total sample size is \[ n = \frac{\left(\sum_{i=1}^L N_i\sigma_i/\sqrt{c_i}\right)\left(\sum_{i=1}^L N_i\sigma_i\sqrt{c_i}\right)}{N^2V + \sum_{i=1}^L N_i\sigma_i^2}. \] or minimize variance for a given cost the total sample size is \[ n = \frac{(C-c_0)\sum_{i=1}^L N_i\sigma_i/\sqrt{c_i}}{\sum_{i=1}^L N_i\sigma_i\sqrt{c}_i}. \] Once \(n\) is determined the sample sizes for each stratum can be computed as \[ n_i = n\left(\frac{N_i \sigma_i / \sqrt{c_i}} {\sum_{k=1}^L N_k \sigma_k / \sqrt{c_k}}\right). \] The allocate function simply makes these calculations when given sufficient information. More discussion of these results including proofs can be found in Cochran’s Sampling Techniques.

Function Usage

The allocate function can be downloaded into a R session from my public Dropbox directory using the source function.

source("https://db.tt/dXuYetYh")

Alternatively the function can be linked directly. The code is also given at the end of this document.

The two required arguments of the allocate function are the stratum sizes (Ni) and the stratum standard deviations (si). The optional arguments are the sampling units costs for each stratum (ci, default: 1), the overhead cost (c0, default: 0), total cost for the survey (ct, default: NA), and the desired “total” variance for the estimator for the population mean (vt, default: NA). If the total cost is given then the function will compute the \(n\) and \(n_i\) to minimize variance for the given cost. If the variance is given then the function will compute the \(n\) and \(n_i\) to minimize the cost for the given variance. If neither are given then only the sampling fractions will be computed. In some cases it is impossible to achieve a specified variance or cost because the number of available strata is too small (i.e., \(n_i > N_i\) for some \(i\)). In these cases a warning will be given.

Examples

Consider the following example from Elementary Survey Sampling (7th edition) by Scheaffer, Mendenhall, Ott, and Gerow. The example features a stratified random sampling design to estimate the average number of hours per week of watched television for the households within a county. The households were assigned to \(L=3\) strata: a town built near a factor with many children (\(N_1 = 155\) households), a town that is a suburb of a city in a neighboring county that has more older residents and fewer children (\(N_2 = 62\) households), and a rural area (\(N_3 = 93\) households). Cost per household in the towns is $9 per unit (\(c_1 = c_2 = 9\)) and $16 per unit in the rural area (\(c_3 = 16\)). Using a prior survey the stratum standard deviations are estimated as \(\sigma_1 \approx 5\), \(\sigma_2 = 15\), and \(\sigma_3 \approx 10\). Based on this information alone we can compute the allocation fractions \(n_1/n\), \(n_2/n\), and \(n_3/n\).

allocate(Ni = c(155,62,93), si = c(5,15,10), ci = c(9,9,16))
$fractions
[1] 0.3225806 0.3870968 0.2903226

$ni
[1] NA

$variance
[1] NA

$cost
[1] NA

Note that without providing a fixed cost or variance the optimum total sample sizes as well as the cost and variance are not defined, but we know that for any given cost or variance approximately 32% of the sampled units should be from the first stratum, approximately 39% of the sampled units should be from the second stratum, and approximately 29% of the sampled units should be from the third stratum. In the example the authors consider the optimum allocation for a bound on the error of estimation of \(B=2\). Since \(B = 2\sqrt{V(\bar{y}_{\text{st}})}\), this implies a variance of \(V(\bar{y}_{\text{st}}) = B^2/4 = 1\). We can compute the allocation that minimizes cost for a given variance of \(V(\bar{y}_{\text{st}}) = 1\) as follows.

allocate(Ni = c(155,62,93), si = c(5,15,10), ci = c(9,9,16), vt = 1)
$fractions
[1] 0.3225806 0.3870968 0.2903226

$ni
[1] 18.52201 22.22642 16.66981

$n
[1] 57.41824

$variance
[1] 1

$cost
[1] 633.4528

Rounding to the nearest integer, the optimum sample sizes for the three strata are \(n_1 = 19\), \(n_2 = 22\), and \(n_3 = 17\) which yields a minimized cost of \(C \approx 633\) and the desired variance of \(V(\bar{y}_{\text{st}}) = 1\) (note that overhead cost \(c_0\) is assumed to be zero since it was not specified). Now suppose we want the optimum allocation for a total cost of \(C = 500\).

allocate(Ni = c(155,62,93), si = c(5,15,10), ci = c(9,9,16), ct = 500)
$fractions
[1] 0.3225806 0.3870968 0.2903226

$ni
[1] 14.61988 17.54386 13.15789

$n
[1] 45.32164

$variance
[1] 1.342242

$cost
[1] 500

Rounding to the nearest integer, the optimum sample sizes for the three strata are \(n_1 = 15\), \(n_2 = 18\), and \(n_3 = 13\) which yields a minimized variance of \(V(\bar{y}_{\text{st}}) \approx 1.34\) and the desired cost of \(C = 500\).

Optimum Allocation for Estimating \(\tau\)

The allocate function can also be used to determine an optimum allocation to estimate the population total \(\tau\) using the estimator \[ \hat{\tau}_{\text{st}} = \sum_{i=1}^L N_i \bar{y}_i. \] The use of the function is the same except that if the variance is to be specified then it must be converted from \(V(\hat{\tau}_{\text{st}})\) to \(V(\bar{y}_{\text{st}})\) through the relationship \(V(\bar{y}_{\text{st}}) = V(\hat{\tau}_{\text{st}})/N^2\). For example, the optimum allocation for a fixed variance of \(V(\hat{\tau}_{\text{st}}) = 50000\) with \(N = 100\) is equivalent to that for \(V(\bar{y}_{\text{st}}) = 50000/100^2 = 5\).

Function Code

function (Ni, si, ci = rep(1, length(Ni)), c0 = 0, ct = NA, vt = NA) 
{
    f <- Ni * si/sqrt(ci)/sum(Ni * si/sqrt(ci))
    N <- sum(Ni)
    if (!is.na(ct) & !is.na(vt)) {
        stop("both survey cost and variance cannot be fixed")
    }
    else if (is.na(ct) & is.na(vt)) {
        return(list(fractions = f, ni = NA, variance = NA, cost = NA))
    }
    else {
        t1 <- sum(Ni * si/sqrt(ci))
        t2 <- sum(Ni * si * sqrt(ci))
        if (!is.na(vt)) {
            n <- t1 * t2/(vt * N^2 + sum(Ni * si^2))
            ni <- n * f
            if (any(ni > Ni)) 
                warning("optimum sample size exceeds available units")
            return(list(fractions = f, ni = n * f, n = n, variance = ifelse(all(ni <= 
                Ni), sum(Ni^2 * (Ni - ni)/Ni * (si^2/ni))/N^2, 
                NA), cost = ifelse(all(ni <= Ni), c0 + sum(ni * 
                ci), NA)))
        }
        if (!is.na(ct)) {
            n <- (ct - c0) * t1/t2
            ni <- n * f
            if (any(ni > Ni)) 
                warning("optimum sample size exceeds available units")
            return(list(fractions = f, ni = n * f, n = n, variance = ifelse(all(ni <= 
                Ni), sum(Ni^2 * (Ni - ni)/Ni * (si^2/ni))/N^2, 
                NA), cost = ifelse(all(ni <= Ni), c0 + sum(ni * 
                ci), NA)))
        }
    }
}

  1. This function has not been thoroughly tested. Use with caution.