Stat 421: Survey Sampling

Week 9

Jae-kwang Kim

3/10/2020

Review

  • Last week, we studied
    • Stratified sampling overview
    • Estimation under stratified sampling
  • This week, we will study
    • Sample size allocation
    • Optimal allocation
    • Total sample size determination

1. Sample size allocation

Basic setup

  • Stratified random sampling (or stratified SRS):
    1. Stratify the population into \(H\) strata
    2. Determine \(n_h\) satisfying \(\sum_{h=1}^H n_h = n\) and \(n_h \le N_h\) for all \(h=1,\cdots, H\).
    3. Apply SRS of size \(n_h\) independently for each stratum \(h\).
    4. Estimate \(T=\sum_{h=1}^H \sum_{i=1}^{N_h} y_{hi}\) by \[\hat{T}_{str} = \sum_{h=1}^H \frac{N_h}{n_h} \sum_{i \in \mathcal{S}_h} y_{hi} \]
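
The four steps above can be sketched in Python as follows. This is a minimal illustration, not part of the original notes: the function name `stratified_total_estimate` and the representation of the population as one list of \(y\)-values per stratum are assumptions made for the example.

```python
import random

def stratified_total_estimate(strata, sizes, seed=0):
    """Estimate the population total under stratified SRS.

    strata: list of lists; strata[h] holds all y-values of stratum h (hypothetical input layout)
    sizes:  list of stratum sample sizes n_h, with 1 <= n_h <= N_h
    """
    rng = random.Random(seed)
    estimate = 0.0
    for y_h, n_h in zip(strata, sizes):
        N_h = len(y_h)
        if not 1 <= n_h <= N_h:
            raise ValueError("each n_h must satisfy 1 <= n_h <= N_h")
        sample = rng.sample(y_h, n_h)          # SRS without replacement within the stratum
        estimate += N_h / n_h * sum(sample)    # weight N_h / n_h for every sampled unit
    return estimate
```

When every stratum is fully enumerated (\(n_h = N_h\)), the weights are all 1 and the estimate equals the true total.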

Basic Setup (Cont’d)

  • The second step is called the sample size allocation.

    • Note that \(\hat{T}_{str}\) is unbiased regardless of the choice of \(n_h\)
    • However, its variance can be different for different choices of \(n_h\) \[\begin{eqnarray*} V( \hat{T}_{str} ) &=&\sum_{h=1}^H \frac{N_h^2}{n_h} \left( 1- \frac{n_h}{N_h} \right)S_h^2 \\ &=& \sum_{h=1}^H \left( \frac{N_h}{n_h} - 1\right) N_h S_h^2 \end{eqnarray*}\]
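
To see how the variance depends on the allocation, the formula above can be evaluated directly. The sketch below assumes the stratum sizes and standard deviations are known; the function name and the numbers in the usage lines are hypothetical and chosen only for illustration.

```python
def stratified_variance(N, S, n):
    """V(T_str) = sum_h (N_h / n_h - 1) * N_h * S_h^2 for a given allocation n."""
    return sum((Nh / nh - 1.0) * Nh * Sh ** 2 for Nh, Sh, nh in zip(N, S, n))

# Hypothetical two-stratum population: same total sample size, two allocations.
N = [100, 200]
S = [10.0, 2.0]
v_equal = stratified_variance(N, S, [30, 30])    # equal allocation
v_skewed = stratified_variance(N, S, [50, 10])   # more units in the high-variance stratum
```

Both allocations use \(n=60\), yet they give different variances, which is why the allocation step matters.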

Sample size allocation

  • Want to sample \(n\) units from the population

  • An allocation rule describes how \(n\) SUs will be spread across the \(H\) strata

  • In particular, the allocation rule is an equation for \(n_h\), the sample size for stratum \(h\) for given \(n\) and other population information.

Proportional allocation

  • Meaning: choose \(n_h\) in such a way that \[ \frac{n_h}{N_h} = \frac{n}{N} \] for all \(h=1, \cdots, H\).

  • Properties
    1. Sampling weights are all equal
    2. Variance is (approximately) no larger than the variance under SRS with the same \(n\)

Estimation under proportional allocation

  • Recall that the unbiased estimator of \(T=\sum_{h=1}^H \sum_{i=1}^{N_h} y_{hi}\) under stratified random sampling is
    \[ \hat{T}_{str} = \sum_{h=1}^H \sum_{i \in S_h} w_{hi} y_{hi} \] where \(w_{hi}=N_h/n_h\)

  • Under proportional allocation, \(N_h/n_h\) are all equal and so we can write \[ \hat{T}_{str} = \frac{N}{n} \sum_{h=1}^H \sum_{i \in S_h} y_{hi} \]

Self-weighting samples

  • If the weight for each SU in the sample has the same value, a sample is said to be self-weighting

  • Since each weight is the same, each sample unit represents the same number of units in the population

  • Self-weighting sample designs so far
    • SRS
    • SYS
    • STS with proportional allocation

Ag example

Stratum | Stratum size (\(N_h\)) | Sample size (\(n_h\)) | Sample weight (\(w_{hi}\))
1 (NE) | 220 | 21 | 220/21 = 10.5
2 (NC) | 1054 | 103 | 1054/103 = 10.2
3 (S) | 1382 | 135 | 1382/135 = 10.2
4 (W) | 422 | 41 | 422/41 = 10.3
Total | \(N=3078\) | \(n=300\) |
  • Even though we have used proportional allocation, rounding used to establish sample sizes can lead to unequal (but approximately equal) weights
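
The Ag example allocation can be reproduced with a few lines of Python. This is a sketch using the stratum sizes from the table; simple rounding happens to sum to \(n=300\) here, but that is not guaranteed in general, so the sum should always be checked.

```python
# Stratum sizes from the Ag example table
N_h = {"NE": 220, "NC": 1054, "S": 1382, "W": 422}
n = 300
N = sum(N_h.values())

# Proportional allocation n_h = n * N_h / N, rounded to whole units
n_h = {name: round(n * size / N) for name, size in N_h.items()}

# Sampling weights N_h / n_h: nearly, but not exactly, equal after rounding
weights = {name: N_h[name] / n_h[name] for name in N_h}
```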

Variance under prop’l allocation

  • The variance formula under stratified random sampling \[ V( \hat{T}_{str}) = \sum_{h=1}^H \frac{N_h^2}{n_h} \left( 1- \frac{n_h}{N_h} \right) S_h^2 \] where \[ S_h^2 = \frac{1}{N_h-1} \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_{hU})^2 \]
  • Under proportional allocation with \(N_h/ (N_h-1) \doteq 1\), we can write \[ V( \hat{T}_{str}) = \frac{N}{n} \left( 1- \frac{n}{N} \right) \sum_{h=1}^H \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_{hU})^2 \]

Justification

Proposition

  • Under proportional allocation, we have (approximately) \[ V( \hat{T}_{SRS}) \ge V( \hat{T}_{str}) \] where \(\hat{T}_{SRS}\) is the estimator of the total under SRS with the same sample size \(n\).

Justification

  • If we use SRS to select a sample of size \(n\), then the variance of \(\hat{T}_{SRS}\) can be written (using \(N/(N-1) \doteq 1\)) as \[ V( \hat{T}_{SRS}) \doteq \frac{N}{n} \left( 1- \frac{n}{N} \right) \sum_{h=1}^H \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_{U})^2. \]

  • Recall that, under proportional allocation, \[ V( \hat{T}_{str}) = \frac{N}{n} \left( 1- \frac{n}{N} \right) \sum_{h=1}^H \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_{hU})^2 \]

  • Thus, we have only to show that \[ \sum_{h=1}^H \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_{hU})^2 \le \sum_{h=1}^H \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_{U})^2. \]

  • Define the total sum of squares (SST) as \[ SST = \sum_{h=1}^H \sum_{i =1}^{N_h} \left( y_{hi} - \bar{Y}_U \right)^2 \]

  • We can show that \[ SST = SSB + SSW \] where \[\begin{eqnarray*} SSW &=& \sum_{h=1}^H \sum_{i=1}^{N_h} \left( y_{hi} - \bar{Y}_{hU} \right)^2 \\ SSB &=& \sum_{h=1}^H \sum_{i=1}^{N_h} \left( \bar{Y}_{hU} - \bar{Y}_U \right)^2 \end{eqnarray*}\]

  • Note that \[\begin{eqnarray*} V( \hat{T}_{SRS}) &=& \frac{N}{n} \left( 1 - \frac{n}{N} \right) SST \\ V( \hat{T}_{str}) &=& \frac{N}{n} \left( 1 - \frac{n}{N} \right) SSW \end{eqnarray*}\]
  • Since \(SSB \ge 0\), we have \(SST \ge SSW\), and hence \(V( \hat{T}_{SRS}) \ge V( \hat{T}_{str})\).

  • The gain in efficiency from stratification is greatest when the within stratum variance is small relative to the between stratum variance.
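
  • A brief sketch of why \(SST = SSB + SSW\) holds: write \(y_{hi} - \bar{Y}_U = (y_{hi} - \bar{Y}_{hU}) + (\bar{Y}_{hU} - \bar{Y}_U)\) and expand the square to get \[\begin{eqnarray*} SST &=& SSW + SSB + 2 \sum_{h=1}^H \left( \bar{Y}_{hU} - \bar{Y}_U \right) \sum_{i=1}^{N_h} \left( y_{hi} - \bar{Y}_{hU} \right) \\ &=& SSW + SSB \end{eqnarray*}\] The cross term vanishes because \(\sum_{i=1}^{N_h} ( y_{hi} - \bar{Y}_{hU}) = 0\) within every stratum.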

Remark

  • Under proportional allocation, we can write \[ V_{prop}( \hat{T}_{str}) \cong \frac{N^2}{n} \left( 1- \frac{n}{N} \right) S_w^2 \] where \[ S_w^2 = \sum_{h=1}^H W_h S_h^2, \quad W_h=N_h/N, \] is the within-stratum variance, a weighted average of the stratum variances.

  • On the other hand, the variance under SRS uses \(S^2\) instead of \(S_w^2\) in the above formula.

2. Optimal allocation

Two different problems in STS

  1. Sample allocation: Given the total sample size \(n\), what should we use for the individual stratum sample sizes, \(n_1, \cdots, n_H\)?

  2. Sample size determination: What should we use for the total sample size \(n=n_1+ \cdots + n_H\)?

Allocation Strategies for optimal estimation of the population total

  • We’ll focus on strategies for the case where the objective is to estimate a parameter for the entire population (e.g. population total or population mean)

  • Factors affecting allocation rules
    • Number of SUs in a stratum \(h\): \(N_h\)
    • Data collection costs per element within stratum \(h\): \(c_h\)
    • Within-stratum variance: \(S_h^2\)

Two different scenarios

  1. Total budget is given: Given the budget, we wish to find the best strategy for sample size determination and sample allocation

  2. Margin of error is given: Given the desired accuracy of the estimate, we wish to find the best strategy for the sample size determination and sample allocation
    • First find \(n\) under the given margin of error
    • Given \(n\), find optimal sample allocation

Problem formulation (under Scenario 1)

  • Given the budget \(C\) and known population parameters, we wish to find the sample allocation that minimizes \[ V( \hat{T}_{str}) \] subject to the constraint that the total cost is \(\le C\).

  • Mathematically, this is a constrained optimization problem.

Total cost

  • Three factors
    1. \(c_0\): Initial cost (for creating survey questions, etc.)
    2. \(c_h\): Observation (e.g. survey) cost per element in stratum \(h\)
    3. \(n_h\): sample size in stratum \(h\)
  • It is reasonable to assume that the total cost is \[ C_T = c_0 + \sum_{h=1}^H n_h c_h \]

Neyman allocation (=Optimal allocation)

  • We wish to find \((n_1, \cdots, n_H)\) that minimizes \[ V( \hat{T}_{str} ) = \sum_{h=1}^H \frac{N_h^2}{n_h} S_h^2 - \sum_{h=1}^H N_h S_h^2 \] subject to \[ c_0 + \sum_{h=1}^H n_h c_h = C. \]

  • The solution to the above optimization problem is \[ n_h \propto \frac{N_hS_h}{\sqrt{c_h} } \]

Justification

  • We may use the Cauchy-Schwarz inequality: \[ (\sum_{h=1}^H a_h^2 ) ( \sum_{h=1}^H b_h^2) \ge (\sum_{h=1}^H a_h b_h)^2 \] where the equality holds if and only if \[ a_h \propto b_h \]
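
  • A sketch of how the inequality applies here: take \(a_h = N_h S_h/\sqrt{n_h}\) and \(b_h = \sqrt{n_h c_h}\), so that \[ \left( \sum_{h=1}^H \frac{N_h^2 S_h^2}{n_h} \right) \left( \sum_{h=1}^H n_h c_h \right) \ge \left( \sum_{h=1}^H N_h S_h \sqrt{c_h} \right)^2 \] The second factor on the left is fixed at \(C - c_0\) by the cost constraint, and the right-hand side does not depend on \(n_h\). Hence \(\sum_{h} N_h^2 S_h^2/n_h\), the only part of \(V( \hat{T}_{str})\) that depends on the allocation, attains its lower bound when \(a_h \propto b_h\), that is, \[ \frac{N_h S_h}{\sqrt{n_h}} \propto \sqrt{n_h c_h} \quad \Longleftrightarrow \quad n_h \propto \frac{N_h S_h}{\sqrt{c_h}} \]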

Remark 1

  • For a given sample size \(n\), the optimal allocation is \[ n_h = \frac{ N_h S_h/\sqrt{c_h}}{ \sum_{h=1}^H N_h S_h/\sqrt{c_h}} n \]
  • Interpretation

    • If \(N_h \uparrow\) then \(n_h \uparrow\)
    • If \(S_h \uparrow\) then \(n_h \uparrow\)
    • If \(c_h \uparrow\) then \(n_h \downarrow\)

  • If \(c_h\) are all equal, then the optimal allocation is \[ n_h =\frac{ N_h S_h}{ \sum_{h=1}^H N_h S_h} n \]
  • Furthermore, if \(S_h\) are equal, then the optimal allocation reduces to the proportional allocation.
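
The allocation rule in Remark 1 is easy to compute. The sketch below is an illustration only: the function name `optimal_allocation` is hypothetical, unit costs default to equal costs, and the result is left unrounded so that the user can decide how to round.

```python
from math import sqrt

def optimal_allocation(N, S, n, c=None):
    """Return n_h = n * (N_h S_h / sqrt(c_h)) / sum_h (N_h S_h / sqrt(c_h)), unrounded."""
    if c is None:
        c = [1.0] * len(N)                      # equal costs: Neyman allocation
    score = [Nh * Sh / sqrt(ch) for Nh, Sh, ch in zip(N, S, c)]
    total = sum(score)
    return [n * s / total for s in score]

# With equal S_h and equal costs the rule reduces to proportional allocation.
alloc_equal_s = optimal_allocation([100, 300], [2.0, 2.0], 40)
```

The rule can return \(n_h > N_h\) for a stratum with a very large \(S_h\); the function above does not handle that case.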

Remark 2

  • If the total cost of taking the sample is fixed at \(C\), the optimal allocation is \[ n_h = \frac{ N_h S_h/\sqrt{c_h}}{ \sum_{h=1}^H N_h S_h\sqrt{c_h}} (C-c_0) \]

Result

  • If \(c_h\) are all equal, the optimal allocation using \[ n_h =\frac{ N_h S_h}{ \sum_{h=1}^H N_h S_h} n \] leads to
    \[ V_{opt}( \hat{T}_{str}) = \frac{N^2}{n} \left( \sum_{h=1}^H W_h S_h \right)^2 - N \sum_{h=1}^H W_h S_h^2 \] where \(W_h = N_h/N\).

  • Note that, under proportional allocation, we have
    \[ V_{prop}( \hat{T}_{str}) = \frac{N^2}{n} \sum_{h=1}^H W_h S_h^2 - N \sum_{h=1}^H W_h S_h^2 \]

  • Thus, comparing the two variance formulas, we can show that \[ V_{prop}( \hat{T}_{str}) \ge V_{opt}( \hat{T}_{str}) \]
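
  • A short worked step for the comparison: writing \(\bar{S} = \sum_{h=1}^H W_h S_h\), the two formulas differ only in their first terms, so \[ V_{prop}( \hat{T}_{str}) - V_{opt}( \hat{T}_{str}) = \frac{N^2}{n} \left\{ \sum_{h=1}^H W_h S_h^2 - \left( \sum_{h=1}^H W_h S_h \right)^2 \right\} = \frac{N^2}{n} \sum_{h=1}^H W_h \left( S_h - \bar{S} \right)^2 \ge 0 \] The difference is zero only when all \(S_h\) are equal, so the gain from optimal allocation grows with the spread of the stratum standard deviations.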

Example

  • Let us suppose that a corporation has 260,000 accident reports available over a period of time and that a sample survey is being contemplated for the purpose of estimating the average number of days of work lost per accident.

  • Of the 260,000 accident reports, 150,000 are coded and 110,000 are uncoded. The coded forms could be processed on the computer directly, whereas the uncoded forms must first be coded before processing.

Example (Cont’d)

  • Approximately $10,000 is available for selecting the sample and for coding and processing the data. It is assumed that the cost of sampling and processing is $0.32 for a coded form and $0.98 for an uncoded form.

  • Also, we assume that the standard deviation of the distribution of days lost from work is twice as large among uncoded reports as among coded reports. That is, \(S_1 = S_2/2\).
  • With this in mind, it is desired to find the best way of allocating the sample elements among coded and uncoded forms.

Solution

  • We are going to use stratified sampling with \(H=2\), where \(N_1=150,000\) (coded) and \(N_2=110,000\) (uncoded).

  • Also, the cost function is \(C_T = c_1 n_1+ c_2 n_2\), where \(c_1 =0.32\) and \(c_2=0.98\). The cost restriction is \(C_T \le C=10,000\).

  • Thus, we can use \[ n_h = \frac{ N_h S_h/\sqrt{c_h}}{ \sum_{h=1}^2 N_h S_h\sqrt{c_h}} C \]

  • The answer is \[ n_1 = \tiny{ \frac{ 150,000 \times (S_2/2)\times \frac{1}{\sqrt{0.32}} }{150,000 \times (S_2/2) \times \sqrt{0.32} + 110,000 \times S_2 \times \sqrt{0.98}} \times 10,000 } \doteq 8,762 \] \[ n_2 = \tiny{ \frac{ 110,000 \times (S_2)\times \frac{1}{\sqrt{0.98}} }{150,000 \times (S_2/2) \times \sqrt{0.32} + 110,000 \times S_2 \times \sqrt{0.98}} \times 10,000 } \doteq 7,343 \]

  • Check:

    1. The total cost should be within the budget:
      \[ 8,762 \times 0.32 + 7,343 \times 0.98 = 9,999.98 \le 10,000. \]
    2. The sample size should be less than the population size: \[ n_1 \le N_1 \mbox{ and } n_2 \le N_2 \]
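
The arithmetic of this example can be checked numerically. In the sketch below, \(S_2\) is set to 1 and \(S_1\) to 0.5; this is an assumption made only for convenience, since the common factor \(S_2\) cancels in the allocation formula. The initial cost \(c_0\) is taken to be zero, as in the cost function above.

```python
from math import sqrt

N = [150_000, 110_000]      # coded, uncoded
S = [0.5, 1.0]              # relative values: S_1 = S_2 / 2, with S_2 set to 1
c = [0.32, 0.98]            # cost per form
C = 10_000                  # budget

# n_h = C * (N_h S_h / sqrt(c_h)) / sum_h (N_h S_h sqrt(c_h))
denom = sum(Nh * Sh * sqrt(ch) for Nh, Sh, ch in zip(N, S, c))
n = [C * Nh * Sh / sqrt(ch) / denom for Nh, Sh, ch in zip(N, S, c)]
n_rounded = [round(x) for x in n]

# Checks: total cost within budget, and n_h <= N_h in each stratum
cost = sum(nh * ch for nh, ch in zip(n_rounded, c))
feasible = all(nh <= Nh for nh, Nh in zip(n_rounded, N))
```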

Example 2

Stratum | Population size \(N_h\) | \(S_h\)
1 | 100 | 50
2 | 110 | 10
3 | 120 | 5
  • Consider the population with \(H=3\) strata with the summary data above. (The costs are all equal.)

  • Suppose that we wish to allocate a total sample size of \(n=140\) among the strata by using optimal allocation.

Solution to Example 2

  • Using the formula with known \(n\), we may use \[ n_h =\frac{ N_h S_h}{ \sum_{h=1}^H N_h S_h} n \]
  • Using this formula, we obtain \[ n_1 = 104 \ \ n_2 = 23 \ \ n_3 = 13 \]

  • However, \(n_1=104\) is greater than \(N_1=100\). Thus, we take \(n_1=N_1=100\) and allocate the remaining 40 elements to strata 2 and 3 to get \[ \tiny{n_2 = \frac{ 110 \times 10 }{ 110 \times 10 + 120 \times 5} \cdot 40 \doteq 26} \] and \(n_3= 40-26=14\).
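
The capping step used in this solution can be written as a small loop: allocate, cap any stratum with \(n_h > N_h\) at \(N_h\), and reallocate the remainder among the other strata. The sketch below assumes equal costs, \(S_h > 0\) and \(n \le N\); the function name `neyman_with_cap` is hypothetical.

```python
def neyman_with_cap(N, S, n):
    """Neyman allocation (equal costs) with n_h capped at N_h; returns unrounded n_h."""
    if n > sum(N):
        raise ValueError("n cannot exceed the population size")
    alloc = [0.0] * len(N)
    free = set(range(len(N)))       # strata not yet capped
    remaining = n                   # sample size left for the uncapped strata
    while True:
        total = sum(N[h] * S[h] for h in free)
        for h in free:
            alloc[h] = remaining * N[h] * S[h] / total
        over = [h for h in free if alloc[h] > N[h]]
        if not over:
            return alloc
        for h in over:              # take a census in these strata
            alloc[h] = N[h]
            free.remove(h)
            remaining -= N[h]

# Example 2 data
alloc = neyman_with_cap([100, 110, 120], [50, 10, 5], 140)
alloc_rounded = [round(a) for a in alloc]
```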

3. Sample size determination

Basic Setup

  • Given the margin of error \(e\), we wish to find the sample size \(n\) for stratified sampling. That is, we wish to find \(n\) such that \[ P\left\{ \left| \hat{\theta} - \theta \right| \le e \right\} = 1- \alpha \]
  • Under SRS, the solution was \[ e = z_{\alpha/2} \sqrt{ \frac{S^2}{n} \left( 1- \frac{n}{N} \right)} \Rightarrow n= \frac{z_{\alpha/2}^2 S^2 }{ e^2 + z_{\alpha/2}^2 S^2/N} \] for mean estimation and \[ e = z_{\alpha/2} \sqrt{ \frac{N^2S^2}{n} \left( 1- \frac{n}{N} \right)} \Rightarrow n= \frac{z_{\alpha/2}^2 N^2 S^2 }{ e^2 + z_{\alpha/2}^2 NS^2} \] for total estimation

Sample Size determination for STS (under proportional allocation)

  • Recall that, under proportional allocation, we have \[ V_{prop} ( \hat{T}_{str}) = \frac{N^2}{n} \left( 1- \frac{n}{N} \right) S_w^2 \] where \[ S_w^2 = \sum_{h=1}^H W_h S_h^2 \]
  • Thus, assuming proportional allocation, you can use \(S_w^2\) instead of \(S^2\) in the sample size determination formula under SRS.

  • That is, use \[ e = z_{\alpha/2} \sqrt{ \frac{S_w^2}{n} \left( 1- \frac{n}{N} \right)} \Rightarrow n= \frac{z_{\alpha/2}^2 S_w^2 }{ e^2 + z_{\alpha/2}^2 S_w^2/N} \] for mean estimation and \[ e = z_{\alpha/2} \sqrt{ \frac{N^2S_w^2}{n} \left( 1- \frac{n}{N} \right)} \Rightarrow n= \frac{z_{\alpha/2}^2 N^2 S_w^2 }{ e^2 + z_{\alpha/2}^2 NS_w^2} \] for total estimation
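
The sample size formulas above can be sketched in Python. The function names are hypothetical, and the margin of error \(e=5\), the level \(\alpha=0.05\) and the reuse of the three-stratum summary data from Example 2 are assumptions made purely for illustration. The result is left unrounded; in practice it would be rounded up.

```python
from statistics import NormalDist

def sample_size(e, S2, N, alpha=0.05, target="mean"):
    """Solve the margin-of-error equation for n; pass S_w^2 as S2 under proportional allocation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)      # z_{alpha/2}
    if target == "mean":
        return z ** 2 * S2 / (e ** 2 + z ** 2 * S2 / N)
    if target == "total":
        return z ** 2 * N ** 2 * S2 / (e ** 2 + z ** 2 * N * S2)
    raise ValueError("target must be 'mean' or 'total'")

def within_stratum_variance(N_h, S_h):
    """S_w^2 = sum_h W_h S_h^2 with W_h = N_h / N."""
    N = sum(N_h)
    return sum(Nh / N * Sh ** 2 for Nh, Sh in zip(N_h, S_h))

N_h = [100, 110, 120]
S_h = [50, 10, 5]
S2_w = within_stratum_variance(N_h, S_h)
n = sample_size(e=5.0, S2=S2_w, N=sum(N_h))      # required n for the mean
```

A margin of error \(e\) for the mean corresponds to \(Ne\) for the total, so both versions of the formula give the same \(n\).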

Advanced topic

Sample Size determination for STS (under optimal allocation)

  • Under optimal allocation (with equal cost), we have \[ n_h = \frac{N_hS_h}{ \sum_{h=1}^H N_h S_h } n \]
  • In this case, the variance of mean estimator is \[ V_{opt} (\bar{y}_{str}) = \frac{1}{n} \left( \sum_{h=1}^H W_h S_h\right)^2 - \frac{1}{N} S_w^2 \]

  • Thus, we can solve \[ {e = z_{\alpha/2} \sqrt{ \frac{1}{n} \left( \sum_{h=1}^H W_h S_h\right)^2 - \frac{1}{N} S_w^2 } } \] for \(n\), for mean estimation.
  • Ignoring the second term (\(N^{-1} S_w^2\)), the solution is \[ n \cong \frac{ z_{\alpha/2}^2 \left( \sum_{h=1}^H W_h S_h\right)^2 }{e^2 } \]
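
  • If the second term is not ignored, the same equation can still be solved exactly. Squaring both sides gives \[ \frac{e^2}{z_{\alpha/2}^2} = \frac{1}{n} \left( \sum_{h=1}^H W_h S_h\right)^2 - \frac{1}{N} S_w^2 \quad \Rightarrow \quad n = \frac{ z_{\alpha/2}^2 \left( \sum_{h=1}^H W_h S_h\right)^2 }{ e^2 + z_{\alpha/2}^2 S_w^2 /N } \] which has the same form as the SRS formula and reduces to the approximation above when \(N\) is large.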