The second step is called the sample size allocation.
We want to sample \(n\) units from the population.
An allocation rule describes how the \(n\) SUs will be spread across the \(H\) strata.
In particular, the allocation rule is an equation for \(n_h\), the sample size for stratum \(h\) for given \(n\) and other population information.
Proportional allocation means choosing \(n_h\) in such a way that \[ \frac{n_h}{N_h} = \frac{n}{N} \] for all \(h=1, \cdots, H\).
Recall that the unbiased estimator of \(T=\sum_{h=1}^H \sum_{i=1}^{N_h} y_{hi}\) under stratified random sampling is
\[ \hat{T}_{str} = \sum_{h=1}^H \sum_{i \in S_h} w_{hi} y_{hi}, \] where \(w_{hi}=N_h/n_h\).
Under proportional allocation, \(N_h/n_h\) are all equal and so we can write \[ \hat{T}_{str} = \frac{N}{n} \sum_{h=1}^H \sum_{i \in S_h} y_{hi} \]
If every SU in the sample has the same weight, the sample is said to be self-weighting.
Since the weights are all equal, each sampled unit represents the same number of units in the population.
| Stratum | Stratum Size (\(N_h\)) | Sample Size (\(n_h\)) | Sample weight (\(w_{hi}\)) |
|---|---|---|---|
| 1 (NE) | 220 | 21 | 220/21=10.5 |
| 2 (NC) | 1054 | 103 | 1054/103=10.2 |
| 3 (S) | 1382 | 135 | 1382/135 = 10.2 |
| 4 (W) | 422 | 41 | 422/41 = 10.3 |
| Total | \(N=3078\) | \(n=300\) | |
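The table can be reproduced with a few lines of Python. This is a minimal sketch: the variable names are illustrative, and rounding each \(n_h\) to the nearest integer is an assumption that happens to sum to \(n\) here but need not in general.

```python
# Stratum sizes N_h from the table above; n is the total sample size.
stratum_sizes = {"NE": 220, "NC": 1054, "S": 1382, "W": 422}
n = 300
N = sum(stratum_sizes.values())

# Proportional allocation: n_h = n * N_h / N, rounded to the nearest integer
# (rounding rule is an assumption; the rounded values may not always sum to n).
allocation = {h: round(n * N_h / N) for h, N_h in stratum_sizes.items()}

# Sampling weight w_h = N_h / n_h for every unit in stratum h.
weights = {h: stratum_sizes[h] / allocation[h] for h in stratum_sizes}

for h in stratum_sizes:
    print(f"{h}: n_h = {allocation[h]}, w_h = {weights[h]:.1f}")
```

The weights differ only because of integer rounding, so the sample is approximately self-weighting.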
If we use SRS to select a sample of size \(n\), then, approximating \(N-1\) by \(N\), the variance of \(\hat{T}_{SRS}\) can be written as \[ V( \hat{T}_{SRS}) \cong \frac{N}{n} \left( 1- \frac{n}{N} \right) \sum_{h=1}^H \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_{U})^2. \]
Recall that, under proportional allocation (and approximating \(N_h-1\) by \(N_h\)), \[ V( \hat{T}_{str}) \cong \frac{N}{n} \left( 1- \frac{n}{N} \right) \sum_{h=1}^H \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_{hU})^2. \]
Thus, we have only to show that \[ \sum_{h=1}^H \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_{hU})^2 \le \sum_{h=1}^H \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_{U})^2. \]
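Define the total sum of squares (SST) as \[ SST = \sum_{h=1}^H \sum_{i =1}^{N_h} \left( y_{hi} - \bar{Y}_U \right)^2. \]
It decomposes as \(SST = SSW + SSB\), where \[ SSW = \sum_{h=1}^H \sum_{i=1}^{N_h} \left( y_{hi} - \bar{Y}_{hU} \right)^2, \qquad SSB = \sum_{h=1}^H N_h \left( \bar{Y}_{hU} - \bar{Y}_U \right)^2 \] are the within-stratum and between-stratum sums of squares.
Since \(SSB \ge 0\), we have \(SST \ge SSW\), and therefore \(V( \hat{T}_{SRS}) \ge V( \hat{T}_{str})\).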
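The inequality can be checked numerically. The sketch below uses a small hypothetical population (the values are invented purely for illustration) and writes SSW for the within-stratum sum of squares on the left side of the inequality and SSB for the between-stratum sum of squares \(\sum_h N_h (\bar{Y}_{hU} - \bar{Y}_U)^2\).

```python
# Hypothetical toy population with H = 2 strata (values invented for illustration).
strata = [[1.0, 2.0, 3.0], [10.0, 12.0, 14.0, 16.0]]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


all_y = [y for stratum in strata for y in stratum]
grand_mean = mean(all_y)

# Total, within-stratum, and between-stratum sums of squares.
sst = sum((y - grand_mean) ** 2 for y in all_y)
ssw = sum((y - mean(s)) ** 2 for s in strata for y in s)
ssb = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in strata)

print(f"SST = {sst:.2f}, SSW = {ssw:.2f}, SSB = {ssb:.2f}")
```

Because the two strata have very different means, most of SST is between-stratum variation, which is the situation in which stratification helps most.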
The gain in efficiency from stratification is greatest when the within stratum variance is small relative to the between stratum variance.
Under proportional allocation, we can write \[ V_{prop}( \hat{T}_{str}) \cong \frac{N^2}{n} \left( 1- \frac{n}{N} \right) S_w^2, \] where \[ S_w^2 = \sum_{h=1}^H W_h S_h^2, \] with \(W_h=N_h/N\), is the within-stratum variance: a weighted average of the stratum variances.
On the other hand, the variance under SRS uses \(S^2\) instead of \(S_w^2\) in the above formula.
Sample allocation: Given the total sample size \(n\), what should we use for the individual stratum sample sizes, \(n_1, \cdots, n_H\)?
Sample size determination: What should we use for the sample size \(n=n_1+ \cdots + n_H\)?
We'll focus on strategies for when the objective is to estimate a parameter for the entire population (e.g. the population total or the population mean).
Total budget is given: Given the budget, we wish to find the best strategy for sample size determination and sample allocation.
Given the budget \(C\) and known population parameters, we wish to find the sample allocation that minimizes \[ V( \hat{T}_{str}) \] subject to the constraint that the total cost is \(\le C\).
Mathematically, this is a constrained optimization problem.
We wish to find \((n_1, \cdots, n_H)\) that minimizes \[ V( \hat{T}_{str} ) = \sum_{h=1}^H \frac{N_h^2}{n_h} S_h^2 - \sum_{h=1}^H N_h S_h^2 \] subject to \[ c_0 + \sum_{h=1}^H n_h c_h = C. \]
The solution to the above optimization problem is \[ n_h \propto \frac{N_hS_h}{\sqrt{c_h} } \]
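A sketch of why this holds, treating the \(n_h\) as continuous and using a Lagrange multiplier \(\lambda\) (a symbol introduced here for the derivation). The second term of the variance does not depend on \(n_h\), so it suffices to consider \[ L(n_1, \cdots, n_H, \lambda) = \sum_{h=1}^H \frac{N_h^2 S_h^2}{n_h} + \lambda \left( c_0 + \sum_{h=1}^H n_h c_h - C \right). \]
Setting \(\partial L / \partial n_h = 0\) gives \[ -\frac{N_h^2 S_h^2}{n_h^2} + \lambda c_h = 0 \quad \Longrightarrow \quad n_h = \frac{1}{\sqrt{\lambda}} \cdot \frac{N_h S_h}{\sqrt{c_h}}, \] and substituting this into the cost constraint fixes the constant: \[ n_h = \frac{N_h S_h / \sqrt{c_h}}{\sum_{k=1}^H N_k S_k \sqrt{c_k}} \, (C - c_0). \]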
Interpretation
If \(c_h\) are all equal, the optimal allocation using \[ n_h =\frac{ N_h S_h}{ \sum_{h=1}^H N_h S_h} n \] leads to
\[ V_{opt}( \hat{T}_{str}) = \frac{N^2}{n} \left( \sum_{h=1}^H W_h S_h \right)^2 - N \sum_{h=1}^H W_h S_h^2
\] where \(W_h = N_h/N\).
Note that, under proportional allocation, we have
\[ V_{prop}( \hat{T}_{str}) = \frac{N^2}{n} \sum_{h=1}^H W_h S_h^2 - N \sum_{h=1}^H W_h S_h^2
\]
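Subtracting the two expressions shows the gain from optimal over proportional allocation. Writing \(\bar{S} = \sum_{h=1}^H W_h S_h\) (a shorthand introduced here), \[ V_{prop}( \hat{T}_{str}) - V_{opt}( \hat{T}_{str}) = \frac{N^2}{n} \left\{ \sum_{h=1}^H W_h S_h^2 - \bar{S}^2 \right\} = \frac{N^2}{n} \sum_{h=1}^H W_h \left( S_h - \bar{S} \right)^2 \ge 0. \]
The two allocations coincide when all \(S_h\) are equal, and the gain grows with the spread of the stratum standard deviations.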
Let us suppose that a corporation has 260,000 accident reports available over a period of time and that a sample survey is being contemplated for the purpose of estimating the average number of days of work lost per accident.
Of the 260,000 accident reports, 150,000 are coded and 110,000 are uncoded. The coded forms could be processed on the computer directly, whereas the uncoded forms must first be coded before processing.
Approximately $10,000 is available for selecting the sample and coding and processing the data. It is assumed that the cost of sampling and processing sample forms is equal to $0.32 for a coded form and $0.98 for an uncoded form.
With this in mind, it is desired to find the best way of allocating the sample elements among coded and uncoded forms.
We are going to use stratified sampling with \(H=2\), where \(N_1=150,000\) and \(N_2=110,000\).
Also, the cost function is \(C_T = c_1 n_1+ c_2 n_2\), where \(c_1 =0.32\) and \(c_2=0.98\). The cost restriction is \(C_T \le C=10,000\).
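Thus, with no fixed cost (\(c_0=0\)), we can use \[ n_h = \frac{ N_h S_h/\sqrt{c_h}}{ \sum_{k=1}^2 N_k S_k\sqrt{c_k}} C. \] The calculation below takes \(S_1 = S_2/2\), so that \(S_2\) cancels.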
The answer is \[ n_1 = \tiny{ \frac{ 150,000 \times (S_2/2)\times \frac{1}{\sqrt{0.32}} }{150,000 \times (S_2/2) \times \sqrt{0.32} + 110,000 \times S_2 \times \sqrt{0.98}} \times 10,000 } \doteq 8,762 \] \[ n_2 = \tiny{ \frac{ 110,000 \times (S_2)\times \frac{1}{\sqrt{0.98}} }{150,000 \times (S_2/2) \times \sqrt{0.32} + 110,000 \times S_2 \times \sqrt{0.98}} \times 10,000 } \doteq 7,343 \]
Check:
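The allocation and its cost can be verified numerically. This is a minimal sketch, assuming \(S_1 = S_2/2\) as in the formula above and no fixed cost; since \(S_2\) cancels from the ratio, it is set to 1 here. Variable names are illustrative.

```python
from math import sqrt

N = [150_000, 110_000]  # stratum sizes: coded, uncoded
c = [0.32, 0.98]        # cost per sampled form in each stratum
S = [0.5, 1.0]          # assumes S_1 = S_2 / 2; S_2 cancels, so S_2 = 1 is arbitrary
C = 10_000              # total budget, with no fixed cost c_0

# n_h = C * (N_h S_h / sqrt(c_h)) / sum_k N_k S_k sqrt(c_k)
denominator = sum(N_h * S_h * sqrt(c_h) for N_h, S_h, c_h in zip(N, S, c))
n_alloc = [C * (N_h * S_h / sqrt(c_h)) / denominator
           for N_h, S_h, c_h in zip(N, S, c)]

# Round to whole forms and confirm that the budget is respected.
n_rounded = [round(x) for x in n_alloc]
total_cost = sum(c_h * n_h for c_h, n_h in zip(c, n_rounded))

print(n_rounded, round(total_cost, 2))
```

The rounded allocation uses essentially the whole budget, as expected when the cost constraint is binding.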
| Stratum | Population Size \(N_h\) | \(S_h\) |
|---|---|---|
| 1 | 100 | 50 |
| 2 | 110 | 10 |
| 3 | 120 | 5 |
Consider the population with \(H=3\) strata with the summary data above. (The costs are all equal.)
Suppose that we wish to allocate a sample of size \(n=140\) across the strata by using optimal allocation.
Using the optimal allocation formula with equal costs, \(n_h = n N_h S_h / \sum_{k=1}^H N_k S_k\), we obtain \[ n_1 = 104 \ \ n_2 = 23 \ \ n_3 = 13 \]
However, \(n_1=104\) is greater than \(N_1=100\). Thus, we would take \(n_1=N_1=100\) and allocate the remaining 40 elements to strata 2 and 3 to get \[ \tiny{n_2 = \frac{ 110 \times 10 }{ 110 \times 10 + 120 \times 5} \cdot 40 \doteq 26} \] and \(n_3= 40-26=14\).
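The cap-and-reallocate step can be sketched as a short loop. The function name `neyman_allocation_capped` is hypothetical, the rounding rule is an assumption (rounded values need not sum to \(n\) in general), and all \(S_h\) are assumed to be positive.

```python
def neyman_allocation_capped(N, S, n):
    """Equal-cost optimal allocation, capping n_h at N_h and reallocating the rest."""
    alloc = [0] * len(N)
    free = set(range(len(N)))  # strata not yet fixed at n_h = N_h
    remaining = n              # sample size still to be allocated
    while free:
        total = sum(N[h] * S[h] for h in free)
        raw = {h: remaining * N[h] * S[h] / total for h in free}
        over = [h for h in free if raw[h] > N[h]]
        if not over:
            # No stratum exceeds its population size: round and finish.
            for h in free:
                alloc[h] = round(raw[h])
            break
        # Take every unit in oversized strata, then reallocate what is left.
        for h in over:
            alloc[h] = N[h]
            remaining -= N[h]
            free.remove(h)
    return alloc


result = neyman_allocation_capped([100, 110, 120], [50, 10, 5], 140)
print(result)
```

Looping until no stratum is oversized matters because capping one stratum frees sample size that can push another stratum over its own population size.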