This document describes a R function called stratify
that implements the “cumulative square root frequency method” for computing approximately optimal stratification of sampling units assuming a stratified random sampling design with Neyman allocation.1 This method was originally proposed by Dalenius and Hodges (1959).2 It involves the following steps:
Motivation and further discussion of this method can also be found in Cochran’s Sampling Techniques.
The stratify
function is included in the trtools package which can be installed using the command
devtools::install_github("trobinj/trtools")
This assumes you have the devtools package installed. The function code is also given at the end of this document.
The required arguments of stratify
are the stratification variable (x
), the number of strata (strata
), and the breaks (breaks
). There is an optional argument for whether to plot a histogram of the stratification variable showing the distribution of the stratification variable and the strata (plot
, default: FALSE
). The breaks
argument determines how the stratification variable is discretized into intervals, and can take several forms consistent with the argument of same name for the hist
function. However it is important to note that the stratify
function is designed only for cases where the intervals are of equal width.
Consider the following example from Cochran’s Sampling Techniques. In that example the auxiliary variable has already been discretized, but “raw” data can be recreated to replicate the example using the interval midpoints and frequencies.
x <- rep(seq(2.5, 97.5, by = 5), c(3464, 2516, 2157, 1581, 1142, 746, 512, 376, 265, 207,
126, 107, 82, 50, 39, 25, 16, 19, 2, 3))
stratify(x, strata = 5, breaks = seq(0, 100, by = 5))
$output
lower upper freq sqrtf csqrtf stratum
1 0 5 3464 58.855756 58.85576 1
2 5 10 2516 50.159745 109.01550 2
3 10 15 2157 46.443514 155.45901 2
4 15 20 1581 39.761791 195.22081 3
5 20 25 1142 33.793490 229.01430 3
6 25 30 746 27.313001 256.32730 4
7 30 35 512 22.627417 278.95471 4
8 35 40 376 19.390719 298.34543 4
9 40 45 265 16.278821 314.62425 4
10 45 50 207 14.387495 329.01175 5
11 50 55 126 11.224972 340.23672 5
12 55 60 107 10.344080 350.58080 5
13 60 65 82 9.055385 359.63619 5
14 65 70 50 7.071068 366.70725 5
15 70 75 39 6.244998 372.95225 5
16 75 80 25 5.000000 377.95225 5
17 80 85 16 4.000000 381.95225 5
18 85 90 19 4.358899 386.31115 5
19 90 95 2 1.414214 387.72536 5
20 95 100 3 1.732051 389.45741 5
$cutpoints
[1] 77.89148 155.78297 233.67445 311.56593
The function returns output
which is a data frame of interval boundaries, frequencies, square root frequencies, cumulative square root frequencies, and stratum assignments. The strata boundaries (cutpoints
) from the third step described above are also returned. The results replicate those found in Sampling Techniques.
For another example consider simulated data from a normal distribution.
set.seed(101)
x <- rnorm(10000, 20, 3)
stratify(x, strata = 4, breaks = 25)
$output
lower upper freq sqrtf csqrtf stratum
1 9 10 2 1.414214 1.414214 1
2 10 11 10 3.162278 4.576491 1
3 11 12 22 4.690416 9.266907 1
4 12 13 53 7.280110 16.547017 1
5 13 14 132 11.489125 28.036142 1
6 14 15 250 15.811388 43.847530 1
7 15 16 409 20.223748 64.071279 1
8 16 17 681 26.095977 90.167256 1
9 17 18 929 30.479501 120.646757 2
10 18 19 1167 34.161382 154.808139 2
11 19 20 1408 37.523326 192.331465 2
12 20 21 1250 35.355339 227.686804 3
13 21 22 1133 33.660065 261.346869 3
14 22 23 950 30.822070 292.168939 3
15 23 24 694 26.343880 318.512819 4
16 24 25 444 21.071308 339.584126 4
17 25 26 252 15.874508 355.458634 4
18 26 27 113 10.630146 366.088780 4
19 27 28 61 7.810250 373.899030 4
20 28 29 29 5.385165 379.284195 4
21 29 30 8 2.828427 382.112622 4
22 30 31 3 1.732051 383.844672 4
$cutpoints
[1] 95.96117 191.92234 287.88350
function (x, strata, breaks)
{
h <- hist(x, plot = FALSE, breaks = breaks)
g <- length(h$counts)
z <- data.frame(lower = rep(NA, g), upper = rep(NA, g), freq = h$counts,
sqrtf = sqrt(h$counts), csqrtf = cumsum(sqrt(h$counts)),
stratum = NA)
k <- 1:(strata - 1) * max(z$csqrtf)/strata
for (i in 1:g) {
z$lower[i] <- h$breaks[i]
z$upper[i] <- h$breaks[i + 1]
}
for (i in 1:(strata - 1)) {
tmp <- which(abs(z$csqrtf - k[i]) == min(abs(z$csqrtf -
k[i])))
z$stratum[c(1:g) <= tmp & is.na(z$stratum)] <- i
}
z$stratum[is.na(z$stratum)] <- strata
return(list(output = z, cutpoints = k))
}
<bytecode: 0x0000000023f44620>
<environment: namespace:trtools>