Examples and theory in this document are from: Complete Business Statistics, Amir D. Aczel & Jayavel Sounderpandian, 2002.
The population can be subdivided into groups (strata) whose elements are similar to each other. Every stratum is sampled and is therefore represented in the sample. The number of strata should not exceed 6.
The population has N units and can be subdivided into m groups (strata).
Weight of stratum i:
\[W_i = N_i / N\]
Sampling fraction in stratum i:
\[f_i = n_i / N_i\]
where \(N_i\) is the number of population units and \(n_i\) the number of sampled units in stratum i.
Estimator of the population mean:
\[\bar{X}_{st} = \sum_{i=1}^m W_i \bar{X}_i\]
where \(\bar{X}_i\) is the sample mean in stratum i.
If \(n_i / n = N_i / N\) then this is called stratification with proportional allocation.
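For illustration, a proportionally allocated stratified sample can be drawn directly from a population frame. This is a minimal sketch: the frame pop and its columns are made up and not part of the source.
library(data.table)
# hypothetical population frame: 300 units in 3 strata
pop = data.table(stratum = rep(c("A", "B", "C"), c(100, 150, 50)), value = rnorm(300))
n.total = 60
# proportional allocation: n_i = n * N_i / N
smp = pop[, .SD[sample(.N, round(n.total * .N / nrow(pop)))], by = stratum]
smp[, .N, by = stratum] # 20, 30 and 10 units respectively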
The stratum variances are usually not known and have to be estimated from the sample. If sampling within each stratum is random, the variance of the estimator is:
\[V(\bar{X}_{st}) = \sum_{i=1}^m W_i^2 \frac{\sigma_i^2}{n_i} (1 - f_i)\]
Example: a sample of 100 out of the FT500 companies is used to estimate net income. The beliefs are that (i) companies within each group share similar characteristics related to net income, (ii) net income is normally distributed, and (iii) the estimated strata variances equal the true strata variances.
| group | N | W | n | mean | variance |
|---|---|---|---|---|---|
| Service | 100 | 0.2 | 20 | 52.7 | 97650 |
| Banking | 100 | 0.2 | 20 | 112.6 | 64300 |
| Financial Services | 150 | 0.3 | 30 | 85.6 | 76990 |
| Retail | 50 | 0.1 | 10 | 12.6 | 18320 |
| Transport | 50 | 0.1 | 10 | 8.9 | 9037 |
| Utilities | 50 | 0.1 | 10 | 52.3 | 83500 |
The calculation is shown below.
library(data.table)
library(magrittr)
# df is the data.table holding the strata table above
sample.size = 100
# proportional allocation of the sample over the strata
df[, n := floor(sample.size * W)]
# stratified estimate of the mean
x_st = df[, sum(W * mean)]
# standard deviation of the estimator, with finite-population correction
s_st = df[, W^2 * (variance / n) * (1 - n / N)] %>% sum %>% sqrt
# 95% confidence interval
lo = x_st - 1.96 * s_st
hi = x_st + 1.96 * s_st
The 95% confidence interval is 20.888 to 111.352.
Sample proportion in stratum i:
\[\hat{P}_i = X_i / n_i\]
where \(X_i\) is the number of successes in sample size \(n_i\).
Estimator of population proportion:
\[\hat{P}_{st} = \sum_{i=1}^m W_i \hat{P}_i\]
Approximate variance:
\[V(\hat{P}_{st}) = \sum_{i=1}^m W_i^2 \frac{\hat{P}_i \hat{Q}_i}{n_i}\]
where \(\hat{Q}_i = 1 - \hat{P}_i\).
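When the sampling fractions \(f_i\) are not negligible, a finite-population correction enters the variance. A standard form (an addition here, not taken from the source) is:
\[V(\hat{P}_{st}) = \sum_{i=1}^m W_i^2 (1 - f_i) \frac{\hat{P}_i \hat{Q}_i}{n_i - 1}\]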
Example: a sample for wine preference, with an assumed difference between rural and urban areas. The sample is a small percentage of the population, therefore no finite-population correction is required. The sample is proportionally allocated over the population, which is 35% rural and 65% urban. In this case, “success” is an expressed preference for one wine over another.
| stratum | W | n | successes | p | q |
|---|---|---|---|---|---|
| rural | 0.35 | 70 | 18 | 0.257 | 0.743 |
| urban | 0.65 | 130 | 28 | 0.215 | 0.785 |
The calculation is shown below.
# dfx is the data.table holding the strata table above
# stratified estimate of the proportion
avg = dfx[, sum(W * p)]
# standard deviation of the estimator (no finite-population correction)
SD = dfx[, W^2 * p * q / n] %>% sum %>% sqrt
# 90% confidence interval
CI_lo = round(avg - 1.645 * SD, 3)
CI_hi = round(avg + 1.645 * SD, 3)
The 90% confidence interval is 0.181 to 0.279.
It is also possible to break the data into categories after sampling, even when the sampling itself was not stratified (post-stratification). The allocation of the sample over the strata can follow two criteria: minimize the variance or minimize the costs. Both are handled by optimal allocation; the Neyman allocation applies when the cost per sampled unit is the same across strata.
The cost function:
\[C = C_0 + \sum_{i=1}^m C_i n_i\]
where \(C_0\) is the fixed cost of setting up the survey and \(C_i\) is the cost per sampled unit in stratum i.
The optimal allocation is then:
\[\frac{n_i}{n} = \frac{W_i \sigma_i / \sqrt{C_i}}{\sum_{i=1}^m W_i \sigma_i / \sqrt{C_i}}\]
The formula says that the sample in stratum i should be larger (i) if the stratum is larger (larger \(W_i\)), (ii) if the variance in the stratum is larger (larger \(\sigma_i\)), and (iii) if the cost per sampled unit \(C_i\) is smaller.
With equal costs per unit, the \(\sqrt{C_i}\) terms cancel and this reduces to the Neyman allocation:
\[\frac{n_i}{n} = \frac{W_i \sigma_i}{\sum_{i=1}^m W_i \sigma_i}\]
Example: allocate a sample of 1000 over the 3 strata shown in the table further below.
The calculation for the optimal allocation is as follows:
# dfa is the data.table holding the strata, with columns W, SD and cost_unit
sample.size = 1000
# normalizing constant of the optimal-allocation formula
y = dfa[, W * SD / sqrt(cost_unit)] %>% sum
dfa[, optimal.n := round(W * SD / sqrt(cost_unit) / y, 3) * sample.size]
The proportional allocation can be calculated as follows:
dfa[, prop.n := W * sample.size]
The Neyman allocation is calculated as follows (costs are ignored, i.e. assumed equal across strata):
z = dfa[, sum(W * SD)]
neyman = dfa[, (W * SD) / z]
dfa[, Neyman.n := ceiling(neyman * sample.size)]
This results in the following table:
| stratum | W | SD | cost_unit | optimal.n | prop.n | Neyman.n |
|---|---|---|---|---|---|---|
| 1 | 0.4 | 1 | 4 | 329 | 400 | 236 |
| 2 | 0.5 | 2 | 9 | 548 | 500 | 589 |
| 3 | 0.1 | 3 | 16 | 123 | 100 | 177 |
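To see why the Neyman allocation minimizes the variance for a fixed sample size, the variance term \(\sum_{i=1}^m W_i^2 \sigma_i^2 / n_i\) (ignoring the finite-population correction) can be compared across the three allocations. A minimal sketch using the table above:
# variance of the stratified mean for a given allocation n
V = function(n) sum(c(0.4, 0.5, 0.1)^2 * c(1, 2, 3)^2 / n)
V(c(400, 500, 100)) # proportional: 0.0033
V(c(329, 548, 123)) # optimal: 0.0030 (but cheaper to collect)
V(c(236, 589, 177)) # Neyman: 0.0029 (smallest)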
In cluster sampling, the elements of the population are grouped in larger units (clusters). It is used when there is no frame (a list of all the elements in the population), e.g. when the population is large and spread out over a wide region, while subregions are easily sampled.
Suppose there are M clusters, of which a random sample of m clusters is selected. There are two possibilities: observe every element in each selected cluster (single-stage cluster sampling), or subsample within each selected cluster (two-stage cluster sampling, treated further below).
The mean is estimated as:
\[\bar{X}_{cl} = \frac{\sum_{i=1}^m n_i \bar{X}_i}{\sum_{i=1}^m n_i}\]
where \(n_i\) is the number of elements in sampled cluster i and \(\bar{X}_i\) its mean.
The variance is:
\[s^2(\bar{X}_{cl}) = \frac{M - m}{M m \bar{n}^2} \cdot \frac{\sum_{i=1}^m n_i^2 (\bar{X}_i - \bar{X}_{cl})^2}{m - 1}\]
where \(\bar{n}\) is the average cluster size in the sample.
## Example
A trucking company wants to reduce fuel costs. It has 110 centers; below are the results of a cluster sample of 20 centers, showing the average fuel saved per truck (fuel.saved) and the number of trucks (n.trucks) at each sampled center.
| cluster.no | fuel.saved | n.trucks |
|---|---|---|
| 1 | 21 | 8 |
| 2 | 22 | 8 |
| 3 | 11 | 9 |
| 4 | 34 | 10 |
| 5 | 28 | 7 |
| 6 | 25 | 8 |
| 7 | 18 | 10 |
| 8 | 24 | 12 |
| 9 | 19 | 11 |
| 10 | 20 | 6 |
| 11 | 30 | 8 |
| 12 | 26 | 9 |
| 13 | 12 | 9 |
| 14 | 17 | 8 |
| 15 | 13 | 10 |
| 16 | 29 | 8 |
| 17 | 24 | 8 |
| 18 | 26 | 10 |
| 19 | 18 | 10 |
| 20 | 22 | 11 |
The calculation is as follows:
# dfc is the data.table holding the cluster sample above
# general data
M = 110 # total number of centers (clusters)
m = 20 # number of sampled clusters
n = dfc[, mean(n.trucks)] # average cluster size in the sample
# estimate of the mean
x = dfc[, sum(fuel.saved * n.trucks)]
y = dfc[, sum(n.trucks)]
x_cl = x / y
# variance: x1 is the finite-population factor, x2 the between-cluster sum of squares
x1 = (M - m) / (M * m * n^2)
x2 = dfc[, sum(n.trucks^2 * (fuel.saved - x_cl)^2)] / (m - 1)
v = x1 * x2
# confidence interval
CI_lo = round(x_cl - 1.96 * sqrt(v), 3)
CI_hi = round(x_cl + 1.96 * sqrt(v), 3)
The 95% confidence interval is 19.364 to 24.302.
The following treats two-stage cluster sampling, in which elements are subsampled within each selected cluster. The theory and example are from here.
Notation:
\(\hat{\mu}\) = estimator of the population mean
\(N\) = number of clusters in the population
\(n\) = number of clusters selected (stage 1)
\(M_i\) = number of elements in cluster i
\(m_i\) = number of elements selected from cluster i
\(M\) = number of elements in population
\(\overline{M} = \frac{M}{N}\) = average cluster size for population
\(y_{ij}\) = j-th observation from the i-th cluster
\(\overline{y}_i\) = sample mean for the i-th cluster
Estimator of the mean:
\[\hat{\mu} = \frac{N}{M n} \sum_{i=1}^n M_i \overline{y}_i\]
Estimator of the variance:
\[\hat{V}(\hat{\mu}) = \frac{N - n}{N} \left(\frac{1}{n \overline{M}^2}\right) S_b^2 + \frac{1}{n N \overline{M}^2} \sum_{i=1}^n M_i^2 \left(\frac{M_i - m_i}{M_i}\right) \left(\frac{S_i^2}{m_i}\right)\]
where:
\[S_b^2 = \sum_{i=1}^n \frac{(M_i \overline{y}_i - \overline{M} \hat{\mu})^2}{n-1}\] and
\[S_i^2 = \sum_{j=1}^{m_i} \frac{(y_{ij} - \overline{y}_i)^2}{m_i - 1}\]
The variance \(S_i^2\) is calculated per sampled cluster.
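In the example below, the per-cluster means and variances are given. If raw observations are available instead, they can be computed directly; a minimal sketch, assuming a hypothetical long-format table raw with one row per sampled element and columns plant and downtime:
library(data.table)
# raw is a hypothetical table: one row per sampled machine
# per-cluster sample size mi, mean and variance
per.cluster = raw[, .(n = .N, y = mean(downtime), s2 = var(downtime)), by = plant]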
Consider a manufacturer with 90 plants who wants to measure the downtime of its machines. Each plant is considered a cluster of machines. The sample consists of 10 clusters and 20% of the machines in each sampled plant. The total number of machines is 4500. In the table, M is the number of machines in the plant, y is the average downtime in hours, s2 is the variance, and n is the number of machines sampled.
| plant | M | y | s2 | n |
|---|---|---|---|---|
| 1 | 50 | 5.40 | 11.38 | 10 |
| 2 | 65 | 4.00 | 10.67 | 13 |
| 3 | 45 | 5.67 | 16.75 | 9 |
| 4 | 48 | 4.80 | 13.29 | 10 |
| 5 | 52 | 4.30 | 11.12 | 10 |
| 6 | 58 | 3.83 | 14.88 | 12 |
| 7 | 42 | 5.00 | 5.14 | 8 |
| 8 | 66 | 3.85 | 4.31 | 13 |
| 9 | 40 | 4.88 | 6.13 | 8 |
| 10 | 56 | 5.00 | 11.80 | 11 |
The calculation is as follows:
# df2 is the data.table holding the plant table above
# general data
n.total = df2[, sum(n)] # total sample size (Mi * 20%, rounded)
M = 4500 # total number of machines in the population
N = 90 # total number of clusters (plants)
Mavg = M / N # average cluster size for the population
n = 10 # number of sampled clusters
# the original example sums to 100 instead of 104 for the total sample, therefore results differ
# mean
mu = N / (M * n) * df2[, sum(M * y)]
# variance
# variance, term Sb^2 (between clusters); n here is the number of sampled clusters
Sb = sum((df2[, (M * y)] - (Mavg * mu))^2 / (n - 1))
# variance, second term (within clusters); inside df2, M and n refer to the columns Mi and mi
term2 = df2[, M^2 * (M - n) / M * s2 / n] %>% sum
# variance, final calculation
part1 = ((N - n) / N) * (1 / (n * Mavg^2)) * Sb
part2 = 1 / (n * N * Mavg^2) * term2
SD = sqrt(part1 + part2)
# confidence interval
CI_lo = round(mu - 1.96 * SD, 3)
CI_hi = round(mu + 1.96 * SD, 3)
# note that original example uses 2 instead of 1.96 for 95% z.
The 95% confidence interval is 4.424 to 5.179 hours.
The total downtime can now be calculated as follows:
Total:
\[\hat{\tau} = M \hat{\mu} = \frac{N}{n} \sum_{i=1}^n M_i \overline{y}_i\]
Variance:
\[\hat{V}(\hat{\tau}) = M^2 \hat{V}(\hat{\mu})\]
The calculation is as follows:
# estimate of the total
total = (N / n) * df2[, sum(M * y)]
# standard deviation of the total
sd.total = M * SD
# error term (half-width of the 95% interval)
B = 1.96 * sd.total
The total downtime was estimated at 21605.31 hours. The error term is 1698.68 hours.
Elements within a cluster tend to be somewhat similar. Selecting an additional element within a cluster therefore adds less new information than it would in an independent random sample. Thus, the variance of a cluster sample is higher than that of an independent sample of the same size. This loss of effectiveness is the design effect.
The design effect is essentially the ratio of the actual variance to the variance you would have had in a simple random sample of the same size. A design effect of, say, 3 means that the variance is 3 times higher than it would have been in a simple random sample.
The design effect (or DEFF) is therefore used to adjust the sample size; it plays no role in the calculations above. It is calculated as:
\[DEFF = 1 + \delta (n - 1)\]
in which \(\delta\) is the intraclass correlation for the statistic in question, and n the average size of a cluster.
So, the DEFF increases with both cluster size and intraclass correlation. The DEFF is often in the 1-3 range, but of course “everything depends” and the DEFF may be much higher for your study.
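As an illustration of the sample-size adjustment (a minimal sketch; the values of delta, the cluster size and the target sample size are made up):
# intraclass correlation and average cluster size, both hypothetical
delta = 0.05
n.cluster = 30
DEFF = 1 + delta * (n.cluster - 1) # 2.45
# a study needing n = 400 under simple random sampling needs this many units under clustering
n.srs = 400
n.needed = ceiling(n.srs * DEFF) # 980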