1 Stratified Sampling

Examples and theory in this document are from: Complete Business Statistics, Amir D. Aczel & Jayavel Sounderpandian, 2002.

The population can be subdivided into groups (strata) whose elements are similar to each other. Every stratum is sampled and is therefore represented. The number of strata should be at most six.

1.1 Absolute numbers

1.1.1 Background

The population has N units and can be subdivided into m groups (strata), where stratum i has \(N_i\) units, of which \(n_i\) are sampled.

Weight:

\[W_i = N_i / N\]

Sampling fraction:

\[f_i = n_i / N_i\]

Estimator of population mean:

\[\bar{X}_{st} = \sum_{i=1}^m W_i \bar{X}_i\]

In which:

  • \(\bar{X}_{st}\) = sample mean in stratified sampling
  • \(W_i\) = weight of stratum i
  • \(\bar{X}_i\) = sample mean in stratum i

If \(n_i / n = N_i / N\) for every stratum, this is called stratification with proportional allocation, as sketched below.
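A minimal sketch of proportional allocation (the stratum sizes below are hypothetical; data.table is assumed, as in the rest of this document):

library(data.table)

# hypothetical stratum sizes N_i
strata = data.table(N_i = c(100, 100, 150, 50, 50, 50))
n.total = 100  # desired total sample size

strata[, W := N_i / sum(N_i)]        # weights W_i = N_i / N
strata[, n_i := round(n.total * W)]  # proportional allocation: n_i / n = N_i / N
strata[, f := n_i / N_i]             # sampling fractions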

The stratum variances are usually not known and have to be estimated from the sample. If sampling within each stratum is random, then:

Variance:

\[V(\bar{X}_{st}) = \sum_{i=1}^m W_i^2 \left(\frac{S_i^2}{n_i}\right)(1 - f_i)\]

in which \(S_i^2\) is the estimated variance in stratum i and \(1 - f_i\) is the finite-population correction.

1.1.2 Example

A sample of 100 out of the FT500 companies is taken to estimate net income. The assumptions are that (i) companies within each group share similar characteristics related to net income, (ii) net income is normally distributed, and (iii) the estimated strata variances are the true strata variances.

group                 N    W   n   mean  variance
Service             100  0.2  20   52.7     97650
Banking             100  0.2  20  112.6     64300
Financial Services  150  0.3  30   85.6     76990
Retail               50  0.1  10   12.6     18320
Transport            50  0.1  10    8.9      9037
Utilities            50  0.1  10   52.3     83500
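To make the calculation reproducible, the table can be entered as a data.table (a sketch; the name df and the column names match the code below):

library(data.table)
library(magrittr)

df = data.table(
  group    = c("Service", "Banking", "Financial Services",
               "Retail", "Transport", "Utilities"),
  N        = c(100, 100, 150, 50, 50, 50),
  W        = c(0.2, 0.2, 0.3, 0.1, 0.1, 0.1),
  mean     = c(52.7, 112.6, 85.6, 12.6, 8.9, 52.3),
  variance = c(97650, 64300, 76990, 18320, 9037, 83500)
)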

The calculation is shown below.

# df is the name of the table

# proportional allocation of the sample
sample.size = 100
df[, n := floor(sample.size * W)]

# weighted mean
x_st = sum(df[, W * mean])

# weighted standard deviation; (1 - n/N) is the finite-population correction
s_st = df[, W^2 * (variance / n) * (1 - n/N)] %>% sum %>% sqrt

# confidence interval
lo = x_st - 1.96 * s_st  # 95% low
hi = x_st + 1.96 * s_st  # 95% high

The 95% confidence interval is 20.888 - 111.352.

1.2 Stratified sampling for proportions

1.2.1 Background

Sample proportion in stratum i:

\[\hat{P}_i = X_i / n_i\]

where \(X_i\) is the number of successes in sample size \(n_i\).

Estimator of population proportion:

\[\hat{P}_{st} = \sum_{i=1}^m W_i \hat{P}_i\]

Approximate variance:

\[V(\hat{P}_{st}) = \sum_{i=1}^m W_i^2 \frac{\hat{P}_i \hat{Q}_i}{n_i}\]

where \(\hat{Q}_i = 1 - \hat{P}_i\).

1.2.2 Example

A sample is taken for wine preference, where a difference between rural and urban areas is assumed. The sample is a small percentage of the population, therefore no finite-population correction is required. The sample size is proportionally allocated over the population, which is 35% rural and 65% urban. In this case, a “success” is an expressed preference for one wine over another.

strata     W    n  successes      p      q
rural   0.35   70         18  0.257  0.743
urban   0.65  130         28  0.215  0.785
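The table can be entered as follows (a sketch; dfx and its column names match the code below):

library(data.table)
library(magrittr)

dfx = data.table(
  strata    = c("rural", "urban"),
  W         = c(0.35, 0.65),
  n         = c(70, 130),
  successes = c(18, 28)
)
dfx[, p := successes / n]  # sample proportion per stratum (rounded in the table above)
dfx[, q := 1 - p]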

The calculation is shown below.

# dfx is the name of the table

# weighted proportion
avg = dfx[, sum(W * p)]

# standard deviation of the estimator
SD = dfx[, W^2 * p * q / n] %>% sum %>% sqrt

# 90% confidence interval
CI_lo = round(avg - 1.645 * SD, 3)
CI_hi = round(avg + 1.645 * SD, 3)

The 90% confidence interval is 0.181 - 0.279.

1.3 Postsampling stratification

It is possible to break the data into strata after a sample has been taken without stratification (see the sketch after this list). Criteria:

  • The subsample in each stratum should have at least 20 units (\(n_i \geq 20\)).
  • The observed weights \(w_i = n_i / n\) should be close to the true population stratum weights \(W_i\).
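A minimal sketch of poststratification (the sample smp, the stratum labels, and the population weights are all hypothetical):

library(data.table)

# hypothetical non-stratified sample with a stratum label per observation
set.seed(1)
smp = data.table(
  stratum = sample(c("A", "B"), 200, replace = TRUE, prob = c(0.3, 0.7)),
  value   = rnorm(200, mean = 10)
)

# known population weights W_i (assumed)
pop = data.table(stratum = c("A", "B"), W = c(0.3, 0.7))

# per-stratum means, subsample sizes, and observed weights w_i = n_i / n
est = smp[, .(mean_i = mean(value), n_i = .N), by = stratum]
est[, w_i := n_i / sum(n_i)]

# the poststratified mean uses the known W_i, not the observed w_i
est = merge(est, pop, by = "stratum")
x_post = est[, sum(W * mean_i)]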

1.4 Allocation

Allocation can minimize variance for a given cost, or minimize cost for a given variance; this is done with optimum allocation. The Neyman allocation is the special case that can be used when the cost per sampled unit is the same throughout the strata.

1.4.1 Optimum allocation

The cost function:

\[C = C_0 + \sum_{i=1}^m C_i n_i\]

in which \(C_0\) is the cost of setting up the survey and \(C_i\) is the cost per sampled unit in stratum i.

The optimum allocation is then:

\[\frac{n_i}{n} = \frac{W_i \sigma_i / \sqrt{C_i}}{\sum_{i=1}^m W_i \sigma_i / \sqrt{C_i}}\]

The formula says that the sample in stratum i should be larger if (i) the stratum is larger (larger \(W_i\)), (ii) the variance in the stratum is larger (larger \(\sigma_i\)), and (iii) the cost per sampled unit \(C_i\) is smaller.

1.4.2 Neyman allocation

\[\frac{n_i}{n} = \frac{W_i \sigma_i}{\sum_{i=1}^m W_i \sigma_i}\]

1.4.3 Example

Allocate a sample of 1000 over 3 strata, with the weights, standard deviations, and unit costs shown in the table at the end of this example.
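The input table can be entered as follows (a sketch; dfa and its column names match the code below):

library(data.table)
library(magrittr)

dfa = data.table(
  stratum   = 1:3,
  W         = c(0.4, 0.5, 0.1),
  SD        = c(1, 2, 3),
  cost_unit = c(4, 9, 16)
)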

The calculation for the optimal allocation is as follows:

sample.size = 1000

# optimum allocation: n_i proportional to W_i * SD_i / sqrt(C_i)
y = dfa[, W * SD / sqrt(cost_unit)] %>% sum
dfa[, optimal.n := round(W * SD / sqrt(cost_unit) / y * sample.size)]

The proportional allocation can be calculated as follows:

dfa[, prop.n := W * sample.size]

The Neyman allocation is calculated as follows (costs are ignored because they are assumed to be equal):

z = dfa[, sum(W * SD)]
neyman = dfa[, (W * SD) / z]
dfa[, Neyman.n := ceiling(neyman * sample.size)]

This results in the following table:

stratum    W  SD  cost_unit  optimal.n  prop.n  Neyman.n
      1  0.4   1          4        329     400       236
      2  0.5   2          9        548     500       589
      3  0.1   3         16        123     100       177

2 Cluster sampling

Elements are clustered in larger units, and there is no frame (a list of all the elements in the population). E.g. the population is large and spread out over a large region, while subregions are easily sampled.

Suppose there are M clusters in the population. A random sample of m clusters is selected. Possibilities (both sketched below):

  • Single stage cluster sampling: every element in the m selected clusters is sampled.
  • Two-stage cluster sampling: a random sample of elements is selected from each of the m selected clusters.
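A minimal sketch of both selection schemes (all numbers hypothetical; only the first stage needs a frame, and it is a frame of clusters, not of elements):

# M clusters in the population, m sampled clusters
M = 110
m = 20

# stage 1: a random sample of m clusters
clusters = sample(1:M, m)

# single stage: every element in the selected clusters is measured.
# two stage: within each selected cluster, a further random subsample of
# elements is taken, e.g. 20% of each (hypothetical) cluster size:
cluster.size = sample(5:15, m, replace = TRUE)
n.sampled = ceiling(0.2 * cluster.size)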

2.1 Single stage cluster sampling

The average is estimated as:

\[\bar{X}_{cl} = \frac{\sum_{i=1}^m n_i \bar{X}_i}{\sum_{i=1}^m n_i}\]

in which \(n_i\) is the number of elements in cluster i and \(\bar{X}_i\) is the sample mean in cluster i. The estimated variance is:

\[s^2(\bar{X}_{cl}) = \frac{M - m}{M m \bar{n}^2} \cdot \frac{\sum_{i=1}^m n_i^2 (\bar{X}_i - \bar{X}_{cl})^2}{m - 1}\]

in which \(\bar{n}\) is the average cluster size in the sample.

2.1.1 Example

A truck company wants to reduce fuel costs. There are 110 centers (clusters). Below are the results of a cluster sample of size m = 20.

cluster.no  fuel.saved  n.trucks
         1          21         8
         2          22         8
         3          11         9
         4          34        10
         5          28         7
         6          25         8
         7          18        10
         8          24        12
         9          19        11
        10          20         6
        11          30         8
        12          26         9
        13          12         9
        14          17         8
        15          13        10
        16          29         8
        17          24         8
        18          26        10
        19          18        10
        20          22        11
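The table can be entered as follows (a sketch; dfc and its column names match the code below):

library(data.table)

dfc = data.table(
  cluster.no = 1:20,
  fuel.saved = c(21, 22, 11, 34, 28, 25, 18, 24, 19, 20,
                 30, 26, 12, 17, 13, 29, 24, 26, 18, 22),
  n.trucks   = c(8, 8, 9, 10, 7, 8, 10, 12, 11, 6,
                 8, 9, 9, 8, 10, 8, 8, 10, 10, 11)
)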

The calculation is as follows:

# general data
M = 110  # number of clusters (centers) in the population
m = 20   # number of sampled clusters
n = dfc[, mean(n.trucks)]  # average cluster size in the sample

# average
x = dfc[, sum(fuel.saved * n.trucks)]
y = dfc[, sum(n.trucks)]
x_cl = x / y

# variance
x1 = (M - m) / (M * m * n^2)
x2 = dfc[, sum(n.trucks^2 * (fuel.saved - x_cl)^2)] / (m - 1)
v = x1 * x2

# 95% confidence interval
CI_lo = round(x_cl - 1.96 * sqrt(v), 3)
CI_hi = round(x_cl + 1.96 * sqrt(v), 3)

The 95% confidence interval is 19.364 - 24.302.

2.2 Two stage cluster sampling

Theory and example are from here.

2.2.1 Background

Notation:

\(\hat{\mu}\) = estimation of population average

\(N\) = number of clusters in the population

\(n\) = number of clusters selected (phase 1)

\(M_i\) = number of elements in cluster i

\(m_i\) = number of elements selected from cluster i

\(M\) = number of elements in population

\(\overline{M} = \frac{M}{N}\) = average cluster size for population

\(y_{ij}\) = j-th observation from the i-th cluster

\(\overline{y}_i\) = sample mean for the i-th cluster

Estimation of mean:

\[\hat{\mu} = \frac{N}{M n} \sum_{i=1}^n M_i \overline{y}_i\]

Estimation of variance:

\[\hat{V}(\hat{\mu}) = \frac{N - n}{N} \left(\frac{1}{n \overline{M}^2}\right) S_b^2 + \frac{1}{n N \overline{M}^2} \sum_{i=1}^n M_i^2 \left(\frac{M_i - m_i}{M_i}\right) \left(\frac{S_i^2}{m_i}\right)\]

where:

\[S_b^2 = \sum_{i=1}^n \frac{(M_i \overline{y}_i - \overline{M} \hat{\mu})^2}{n-1}\]

and

\[S_i^2 = \sum_{j=1}^{m_i} \frac{(y_{ij} - \overline{y}_i)^2}{m_i - 1}\]

\(S_i^2\) is the within-cluster sample variance; in the example below it is given per sampled cluster.

2.2.2 Example

2.2.2.1 Data

Consider a manufacturer with 90 plants that wants to measure machine downtime. Each plant is considered to be a cluster of machines. The sample consists of 10 clusters and 20% of the machines in each sampled plant. The total number of machines is 4500. In the table, M is the number of machines in the plant (\(M_i\)), y is the average downtime in hours (\(\overline{y}_i\)), s2 is the within-plant sample variance (\(S_i^2\)), and n is the number of machines sampled (\(m_i\)).

plant   M     y     s2   n
    1  50  5.40  11.38  10
    2  65  4.00  10.67  13
    3  45  5.67  16.75   9
    4  48  4.80  13.29  10
    5  52  4.30  11.12  10
    6  58  3.83  14.88  12
    7  42  5.00   5.14   8
    8  66  3.85   4.31  13
    9  40  4.88   6.13   8
   10  56  5.00  11.80  11
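The table can be entered as follows (a sketch; df2 and its column names match the code below):

library(data.table)
library(magrittr)

df2 = data.table(
  plant = 1:10,
  M     = c(50, 65, 45, 48, 52, 58, 42, 66, 40, 56),
  y     = c(5.40, 4.00, 5.67, 4.80, 4.30, 3.83, 5.00, 3.85, 4.88, 5.00),
  s2    = c(11.38, 10.67, 16.75, 13.29, 11.12, 14.88, 5.14, 4.31, 6.13, 11.80),
  n     = c(10, 13, 9, 10, 10, 12, 8, 13, 8, 11)
)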

2.2.2.2 Average and confidence interval

The calculation is as follows:

# general data

n.total = df2[, sum(n)]  # total sample size (Mi * 20%, rounded)
M = 4500                 # total number of machines in the population
N = 90                   # total number of clusters (plants)
Mavg = M / N             # average cluster size for the population
n = 10                   # number of sampled clusters
# the original example sums to 100 instead of 104 for the total sample,
# therefore results differ.

# mean
mu = N / (M * n) * df2[, sum(M * y)]

# variance
# variance, term Sb^2
Sb = sum((df2[, (M * y)] - (Mavg * mu))^2 / (n - 1))

# variance, term2 of equation
term2 = df2[, M^2 * (M - n) / M * s2 / n] %>% sum

# variance, final calculation
part1 = ((N - n) / N) * (1 / (n * Mavg^2)) * Sb
part2 = 1 / (n * N * Mavg^2) * term2
SD = sqrt(part1 + part2)

# confidence interval
CI_lo = round(mu - 1.96 * SD, 3)
CI_hi = round(mu + 1.96 * SD, 3)

# note that the original example uses 2 instead of 1.96 for the 95% z.

The 95% confidence interval is 4.424 - 5.179.

2.2.2.3 Total and confidence interval

The total downtime can now be calculated as follows:

Total:

\[\hat{\tau} = M \hat{\mu} = \frac{N}{n} \sum_{i=1}^n M_i \overline{y}_i\]

Variance:

\[\hat{V}_t = M^2 \hat{V} (\hat{\mu})\]

The calculation is as follows:

# total
total = (N / n) * df2[, sum(M * y)]

# standard deviation of the total
sd = M * SD

# error term (95%)
B = 1.96 * sd

The total downtime was estimated at 21605.31 hours. The error term is 1698.68 hours.

2.3 Design effect

Samples within a cluster tend to be somewhat similar. Selecting an additional sample element within a cluster therefore adds less new information than it would in an independent random sample. Thus, for the same number of observations, the variance in a cluster sample is larger than it would be in an independent random sample. This loss of effectiveness is the design effect.

The design effect is essentially the ratio of the actual variance to the variance you would have had in a simple random sample. A design effect of, say, 3 means that the variance is 3 times higher than it would have been in a simple random sample.

The design effect (or DEFF) is therefore used to adjust the sample size; it plays no role in the calculations above. It is calculated as:

\[DEFF = 1 + \delta (\bar{n} - 1)\]

in which \(\delta\) is the intraclass correlation for the statistic in question, and \(\bar{n}\) the average cluster size. A sketch of the adjustment is shown below.
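A minimal sketch of how the DEFF adjusts a sample size (the intraclass correlation and the average cluster size are hypothetical values):

# hypothetical intraclass correlation and average cluster size
delta = 0.05
n.bar = 30
DEFF = 1 + delta * (n.bar - 1)  # 2.45

# a cluster design needs DEFF times as many observations as a simple
# random sample to reach the same variance
n.srs = 400                        # required n under simple random sampling
n.cluster = ceiling(n.srs * DEFF)  # 980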

So, the DEFF increases with both cluster size and intraclass correlation. The DEFF is often in the 1-3 range, but of course “everything depends” and the DEFF may be much higher for your study.