Background

This document describes a R function called stratify that implements the “cumulative square root frequency method” for computing approximately optimal stratification of sampling units assuming a stratified random sampling design with Neyman allocation.1 This method was originally proposed by Dalenius and Hodges (1959).2 It involves the following steps:

  1. Discretize a stratification variable that is strongly correlated with the target variable into sequential intervals.
  2. Compute the cumulative square root of the frequencies of the intervals.
  3. Divide the range from zero to the maximum cumulative square root frequency by \(L\), the desired number of strata, to create \(L\) intervals of equal length on the scale of the cumulative square root frequency.
  4. Assign the intervals for the auxiliary variable to strata based on the closest interval found in the previous step.

Motivation and further discussion of this method can also be found in Cochran’s Sampling Techniques.

Function Usage

The stratify function is included in the trtools package which can be installed using the command

devtools::install_github("trobinj/trtools")

This assumes you have the devtools package installed. The function code is also given at the end of this document.

The required arguments of stratify are the stratification variable (x), the number of strata (strata), and the breaks (breaks). There is an optional argument for whether to plot a histogram of the stratification variable showing the distribution of the stratification variable and the strata (plot, default: FALSE). The breaks argument determines how the stratification variable is discretized into intervals, and can take several forms consistent with the argument of same name for the hist function. However it is important to note that the stratify function is designed only for cases where the intervals are of equal width.

Examples

Consider the following example from Cochran’s Sampling Techniques. In that example the auxiliary variable has already been discretized, but “raw” data can be recreated to replicate the example using the interval midpoints and frequencies.

x <- rep(seq(2.5, 97.5, by = 5), c(3464, 2516, 2157, 1581, 1142, 746, 512, 376, 265, 207,
  126, 107, 82, 50, 39, 25, 16, 19, 2, 3))
stratify(x, strata = 5, breaks = seq(0, 100, by = 5))
$output
   lower upper freq     sqrtf    csqrtf stratum
1      0     5 3464 58.855756  58.85576       1
2      5    10 2516 50.159745 109.01550       2
3     10    15 2157 46.443514 155.45901       2
4     15    20 1581 39.761791 195.22081       3
5     20    25 1142 33.793490 229.01430       3
6     25    30  746 27.313001 256.32730       4
7     30    35  512 22.627417 278.95471       4
8     35    40  376 19.390719 298.34543       4
9     40    45  265 16.278821 314.62425       4
10    45    50  207 14.387495 329.01175       5
11    50    55  126 11.224972 340.23672       5
12    55    60  107 10.344080 350.58080       5
13    60    65   82  9.055385 359.63619       5
14    65    70   50  7.071068 366.70725       5
15    70    75   39  6.244998 372.95225       5
16    75    80   25  5.000000 377.95225       5
17    80    85   16  4.000000 381.95225       5
18    85    90   19  4.358899 386.31115       5
19    90    95    2  1.414214 387.72536       5
20    95   100    3  1.732051 389.45741       5

$cutpoints
[1]  77.89148 155.78297 233.67445 311.56593

The function returns output which is a data frame of interval boundaries, frequencies, square root frequencies, cumulative square root frequencies, and stratum assignments. The strata boundaries (cutpoints) from the third step described above are also returned. The results replicate those found in Sampling Techniques.

For another example consider simulated data from a normal distribution.

set.seed(101)
x <- rnorm(10000, 20, 3)
stratify(x, strata = 4, breaks = 25)
$output
   lower upper freq     sqrtf     csqrtf stratum
1      9    10    2  1.414214   1.414214       1
2     10    11   10  3.162278   4.576491       1
3     11    12   22  4.690416   9.266907       1
4     12    13   53  7.280110  16.547017       1
5     13    14  132 11.489125  28.036142       1
6     14    15  250 15.811388  43.847530       1
7     15    16  409 20.223748  64.071279       1
8     16    17  681 26.095977  90.167256       1
9     17    18  929 30.479501 120.646757       2
10    18    19 1167 34.161382 154.808139       2
11    19    20 1408 37.523326 192.331465       2
12    20    21 1250 35.355339 227.686804       3
13    21    22 1133 33.660065 261.346869       3
14    22    23  950 30.822070 292.168939       3
15    23    24  694 26.343880 318.512819       4
16    24    25  444 21.071308 339.584126       4
17    25    26  252 15.874508 355.458634       4
18    26    27  113 10.630146 366.088780       4
19    27    28   61  7.810250 373.899030       4
20    28    29   29  5.385165 379.284195       4
21    29    30    8  2.828427 382.112622       4
22    30    31    3  1.732051 383.844672       4

$cutpoints
[1]  95.96117 191.92234 287.88350

Function Code

function (x, strata, breaks) 
{
    h <- hist(x, plot = FALSE, breaks = breaks)
    g <- length(h$counts)
    z <- data.frame(lower = rep(NA, g), upper = rep(NA, g), freq = h$counts, 
        sqrtf = sqrt(h$counts), csqrtf = cumsum(sqrt(h$counts)), 
        stratum = NA)
    k <- 1:(strata - 1) * max(z$csqrtf)/strata
    for (i in 1:g) {
        z$lower[i] <- h$breaks[i]
        z$upper[i] <- h$breaks[i + 1]
    }
    for (i in 1:(strata - 1)) {
        tmp <- which(abs(z$csqrtf - k[i]) == min(abs(z$csqrtf - 
            k[i])))
        z$stratum[c(1:g) <= tmp & is.na(z$stratum)] <- i
    }
    z$stratum[is.na(z$stratum)] <- strata
    return(list(output = z, cutpoints = k))
}
<bytecode: 0x0000000023f44620>
<environment: namespace:trtools>

  1. This function has not been optimized or heavily tested. Use with caution.

  2. Dalenius, T. & Hodges, J. L., Jr. (1959). Minimum variance stratification. Journal of the American Statistical Association, 54, 88–101.