Stat 421: Survey Sampling

Week 7

Jae-kwang Kim

2/25/2020

Review

  • Last week, we studied
    • Systematic sampling
    • Statistical Property
    • Efficiency issues
  • This week, we will study PPS sampling
    • Motivation with \(n=1\)
    • General theory for PPS sampling
    • Examples

PPS sampling: Introduction

Introduction

  • Basic designs we have already covered: SRS, SYS

  • We now select sampling units (SUs) with probability proportional to a size measure, with replacement (PPSWR)

  • Uses auxiliary information in the design and estimator
    • Auxiliary information determines size measure

Grocery Store Example

  • Population
    • \(N=4\) grocery stores in town
  • Survey Objective
    • Estimate total sales last month, \(T_y=\sum_{i=1}^N y_i\)
    • Study variable (characteristic of interest): \(y_i\) = total sales (in $1,000) during the prior month for store \(i\)
    • Size measure for each store: \(x_i\) = floor area (in units of \(100\, m^2\)) for store \(i\)

Grocery Store Example (Continued)

  • Expect larger stores (stores with larger floor area) to have higher sales and their sales to be more variable

  • Population

ID      Size (\(100\, m^2\))   Total sales (in $1000)
1       1                      11
2       2                      20
3       3                      24
4       10                     245
Total   16                     300

Sampling design (\(n=1\))

  • Total sales variable (\(y_i\)) is not available from the sampling frame. Only the size is available.

  • Two sampling designs
    • SRS: select one store at random
    • PPS: select one store with the selection probability proportional to size.

SRS of size \(n=1\)

  • Sampling distribution of \(\hat{T}\) under SRS of size \(n=1\)

Sample ID   Estimator of \(T\)   Probability
1           11*4 = 44            1/4
2           20*4 = 80            1/4
3           24*4 = 96            1/4
4           245*4 = 980          1/4

  • Expectation \[\begin{eqnarray*} E( \hat{T}) &=& \frac{1}{4} \left( 11 \times 4 + 20 \times 4 + 24 \times 4 + 245 \times 4 \right)\\ &=& 11 + 20 + 24 + 245 = 300 \end{eqnarray*}\]

SRS of size \(n=1\) (Continued)

  • Variance is \[ V( \hat{T}) = \frac{1}{4} \sum_{i=1}^4 ( \hat{T}_i - T )^2 = 154{,}488 \] where \(\hat{T}_i\) is the estimator of \(T\) under the \(i\)-th possible sample.

  • Alternatively, we can use the variance formula for SRS: \[\begin{eqnarray*} V ( \hat{T}) &=& \frac{N^2}{n} \left( 1-\frac{n}{N} \right) S^2\\ &=& \frac{4^2}{1} \left( 1- \frac{1}{4}\right) \times 12{,}874 = 154{,}488 \end{eqnarray*}\] where \(S^2 = \frac{1}{N-1} \sum_{i=1}^N ( y_i - \bar{Y} )^2 = 12{,}874\) is the population variance of \(y\) and \(\bar{Y} = T/N = 75\).
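
The two variance calculations above are small enough to check numerically. The following is a minimal sketch in R using the four-store population from the table; the variable names are illustrative and not part of the original slides.

```r
# Grocery store population: total sales (in $1000) for the N = 4 stores
y <- c(11, 20, 24, 245)
N <- length(y)
T_true <- sum(y)                          # population total

# Under SRS with n = 1, each store is selected with probability 1/N
# and the estimator of T is N * y_i
T_hat <- N * y
prob  <- rep(1 / N, N)

E_srs <- sum(prob * T_hat)                # expectation of the estimator
V_srs <- sum(prob * (T_hat - T_true)^2)   # variance from the sampling distribution

# Same variance from the SRS formula N^2/n * (1 - n/N) * S^2 with n = 1;
# var() uses the divisor N - 1, which matches the definition of S^2
V_formula <- N^2 / 1 * (1 - 1 / N) * var(y)

print(c(E_srs = E_srs, V_srs = V_srs, V_formula = V_formula))
```

Both routes should give the same variance, since the formula is just the enumeration written in closed form.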

PPS sampling of size \(n=1\)

  • Idea
    • Select unit \(i\) with probability equal to \[ p_i = \frac{ x_i}{ \sum_{j=1}^N x_j } \]
    • If unit \(k\) is selected for sample, then use \[ \hat{T} = y_k / p_k \] as an estimator for the PPS sample with \(n=1\).
  • The probability \(p_i\) is called the draw probability.

  • PPS stands for Probability Proportional to Size

Back to Grocery Store Example

  • First compute the draw probability \(p_i\) using the size measure \(x_i\)

ID   \(x_i\)   \(y_i\)   \(p_i\)
1    1         11        1/16
2    2         20        2/16
3    3         24        3/16
4    10        245       10/16

  • Sampling distribution of \(\hat{T}\) under PPS sampling (\(n=1\))

Sample ID   Estimator of \(T\)     Probability
1           11*(16/1) = 176        1/16
2           20*(16/2) = 160        2/16
3           24*(16/3) = 128        3/16
4           245*(16/10) = 392      10/16

Grocery Store Example (Continued)

  • We can check that \(\hat{T}\) is unbiased. \[\begin{eqnarray*} E( \hat{T}) &=& \sum_{i=1}^4 p_i \hat{T}_i =\sum_{i=1}^4 p_i \left( \frac{y_i}{p_i} \right) = \sum_{i=1}^N y_i = 300. \end{eqnarray*}\]
  • What is the variance of \(\hat{T}\)?
\[\begin{eqnarray*} V( \hat{T}) &=& \sum_{i=1}^4 p_i \left( \hat{T}_i - T \right)^2 \\ &=& \frac{1}{16} (176-300)^2 + \cdots + \frac{10}{16} ( 392-300)^2 \\ &=& 14248 \end{eqnarray*}\]
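
The expectation and variance under PPS with \(n=1\) can be verified the same way. This is a sketch under the assumption that the size and sales values are exactly those in the table; the variable names are illustrative.

```r
x <- c(1, 2, 3, 10)        # size measure (floor area, in 100 m^2)
y <- c(11, 20, 24, 245)    # total sales (in $1000)
T_true <- sum(y)           # population total

p     <- x / sum(x)        # draw probabilities p_i = x_i / sum(x)
T_hat <- y / p             # value of the estimator for each possible sample

E_pps <- sum(p * T_hat)                # expectation over the sampling distribution
V_pps <- sum(p * (T_hat - T_true)^2)   # variance over the sampling distribution

print(T_hat)
print(c(E_pps = E_pps, V_pps = V_pps))
```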

Comparison with SRS

  • Both sampling designs provide unbiased estimation.

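  • However, the PPS design is much more efficient in this example: \(V(\hat{T}) = 14{,}248\) under PPS versus \(154{,}488\) under SRS, less than one tenth of the SRS variance.

  • In general, if the size measure (\(x_i\)) is positively correlated with \(y_i\), ideally with \(y_i\) roughly proportional to \(x_i\), then PPS sampling is preferable.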

General theory for PPS sampling

Basic Concepts (for PPSWR)

  • With-replacement sampling procedure: repeat the following steps \(n\) times:
  1. Select one element from the frame, choosing element \(i\) with probability \(p_i\)
  2. Return the selected element to the frame
  • Key Ideas
    • With-replacement sample: an element can be selected into the sample more than once
    • Draw probability \(p_i\): unequal probability sampling can improve efficiency

With replacement sampling

  • WR: always start with a full frame at the beginning of each draw
    • Randomly select a SU from all SUs in the frame
    • Put it back in frame
    • Randomly select a SU from all SUs in the frame
    • Put it back…
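
The draw-and-return procedure can be sketched with base R's `sample()`, which supports both `replace = TRUE` and unequal `prob`. The seed, the sample size `n <- 5`, and the use of the grocery store sizes are assumptions made only for illustration.

```r
set.seed(421)                     # arbitrary seed so the draws are reproducible

x <- c(1, 2, 3, 10)               # size measure for the four stores
p <- x / sum(x)                   # draw probabilities, identical on every draw
n <- 5                            # number of draws (illustrative)

# Each draw selects one unit from the full frame with probability p_i
draws <- sample(seq_along(p), size = n, replace = TRUE, prob = p)

# Number of times each unit was selected; values above 1 are possible
Q <- tabulate(draws, nbins = length(p))

print(draws)
print(Q)
```

Because the unit is returned to the frame after every draw, the same store can appear several times in `draws`.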

Draw probability \(p_i\)

  • \(p_i\)= probability of selecting SU \(i\) for a single draw

  • Draw probability is the same for each draw
    • Set of SUs eligible for selection is same after each draw because the selected unit is always returned to the frame
    • Same \(p_i\) for SU \(i\) for all draws
  • Exactly one unit is selected on every draw, so \[ \sum_{i=1}^N p_i = 1\]

Draw probability \(p_i\) (Continued)

  • How do we choose \(p_i\)?
    • For estimating the population total (or mean, with known \(N\)), the best choice of \(p_i\) is \(p_i \propto y_i\)
    • That is, select units with larger \(y_i\) with higher probability
  • In practice, we do not know \(y_i\)

  • We may select \(p_i \propto x_i\)
    • \(x_i\) is an auxiliary variable known on the frame for all population elements
    • \(x_i\) has a positive correlation with \(y_i\)
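
To see why \(p_i \propto y_i\) would be the best choice: in that case \(y_i/p_i = T\) for every unit, so the estimator has zero variance. The sketch below compares three choices of \(p_i\) on the grocery store population; `draw_var` is a hypothetical helper introduced only for this illustration.

```r
x <- c(1, 2, 3, 10)        # size measure
y <- c(11, 20, 24, 245)    # study variable
T_true <- sum(y)

# Single-draw variance of y_i / p_i for a given vector of draw probabilities
draw_var <- function(p) sum(p * (y / p - T_true)^2)

V_ideal <- draw_var(y / sum(y))                      # p_i proportional to y_i (infeasible: y is unknown)
V_size  <- draw_var(x / sum(x))                      # p_i proportional to x_i
V_equal <- draw_var(rep(1 / length(y), length(y)))   # equal draw probabilities

print(c(V_ideal = V_ideal, V_size = V_size, V_equal = V_equal))
```

The closer \(x_i\) is to being proportional to \(y_i\), the closer the feasible design gets to the zero-variance ideal.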

Draw probability is NOT the inclusion probability

  • Inclusion probability \(\pi_i\) is the probability of being included in the sample

  • The draw probability \(p_i\) is the probability of being selected on a particular draw, but there are \(n\) draws in total.

  • For PPSWR, the inclusion probability is the probability that the element is selected on at least one draw. Thus, \[\begin{eqnarray*} \pi_i &=& P ( i \mbox{ is selected at least once}) \\ &=& 1- P( i \mbox{ is never selected in $n$ draws}) \\ &=& 1- (1- p_i)^n \end{eqnarray*}\]
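
As a quick numerical illustration of the formula above, the following sketch computes \(\pi_i\) for the grocery store population, assuming \(n=2\) draws (the value of `n` is chosen here only for illustration).

```r
x <- c(1, 2, 3, 10)
p <- x / sum(x)            # draw probabilities
n <- 2                     # number of draws (illustrative)

# Inclusion probability: selected at least once in n independent draws
pi_incl <- 1 - (1 - p)^n

print(round(pi_incl, 4))
# For n > 1 the inclusion probability is smaller than n * p_i,
# because a unit selected twice is still included only once
print(rbind(pi_incl = pi_incl, n_times_p = n * p))
```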

Estimation under PPS of size \(n=1\)

  • Recall that, for a PPS sample of size \(n=1\), we have the following estimator of \(T\)

\[ \hat{T} = \left\{ \begin{array}{ll} y_1/p_1 & \mbox{ with prob. } p_1 \\ y_2/ p_2 & \mbox{ with prob. } p_2 \\ \vdots & \vdots \\ y_N/ p_N & \mbox{ with prob. } p_N \end{array} \right. \]

  • From the above sampling distribution, we can check that the above estimator is unbiased
    \[\begin{eqnarray*} E( \hat{T}) &=& \sum_{i=1}^N p_i( y_i/p_i) = \sum_{i=1}^N y_i. \end{eqnarray*}\]

Estimation under PPS of size \(n>1\)

  • Now, we have \(n\) independent draws from the same population. Because sampling is with replacement, the population is unchanged after each draw.

  • For each draw, we obtain one \(\hat{T}\) from the probability distribution on the previous slide.

  • Let \(\hat{T}_k\) be the value of \(\hat{T}\) from the \(k\)-th draw. We can write \(\hat{T}_k= y_i/p_i\) if unit \(i\) is selected at the \(k\)-th draw.

  • Note that \(\hat{T}_1, \cdots, \hat{T}_n\) are independently and identically distributed with distribution \[ \hat{T}_k = \left\{ \begin{array}{ll} y_1/p_1 & \mbox{ with prob. } p_1 \\ y_2/ p_2 & \mbox{ with prob. } p_2 \\ \vdots & \vdots \\ y_N/ p_N & \mbox{ with prob. } p_N \end{array} \right. \]
  • Thus, we have \[ E( \hat{T}_k )= T \] and \[ V( \hat{T}_k)= \sum_{i=1}^N p_i \left( \frac{y_i}{p_i} - T\right)^2 . \]

Estimation under PPSWR of size \(n>1\)

  • The final estimator of \(T\) based on the PPSWR sample of size \(n\) is \[ \hat{T}_{PPS} = \frac{1}{n} \sum_{k=1}^n \hat{T}_k. \] This is the simple mean of \(n\) IID (Independently and Identically Distributed) realizations of a random variable \(\hat{T}\).

  • Since each of \(\hat{T}_k\) is unbiased for \(T\), \(\hat{T}_{PPS}\) is unbiased.

  • The variance of \(\hat{T}_{PPS}\) is \[\begin{eqnarray*} V( \hat{T}_{PPS}) &=& \frac{1}{n^2} \sum_{k=1}^n V( \hat{T}_k) = \frac{1}{n} \sum_{i=1}^N p_i \left( \frac{y_i}{p_i} - T\right)^2 \end{eqnarray*}\]
  • Since we have \(n\) IID realized values, we can estimate the variance of \(\hat{T}_{PPS}\) by \[ \hat{V} ( \hat{T}_{PPS}) = \frac{1}{n} s_T^2 \] where \[ s_T^2 = \frac{1}{n-1} \sum_{k=1}^n \left( \hat{T}_k - \hat{T}_{PPS} \right)^2 \]
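
The point estimator and its variance estimator translate directly into a few lines of R. This is a minimal sketch, not a definitive implementation: `pps_wr_estimate` is a hypothetical helper, and the three draws used in the call (units 2, 4, 4 of the grocery store population) are made up for illustration.

```r
# y_s and p_s hold the y value and draw probability for each of the n draws,
# in draw order; a unit drawn twice appears twice
pps_wr_estimate <- function(y_s, p_s) {
  n <- length(y_s)
  stopifnot(n >= 2, length(p_s) == n)
  t_k   <- y_s / p_s          # per-draw estimates T-hat_k = y_i / p_i
  est   <- mean(t_k)          # T-hat_PPS
  v_hat <- var(t_k) / n       # s_T^2 / n, with s_T^2 using the divisor n - 1
  list(estimate = est, variance = v_hat, se = sqrt(v_hat))
}

# Hypothetical sample: unit 2 once and unit 4 twice
out <- pps_wr_estimate(y_s = c(20, 245, 245), p_s = c(2, 10, 10) / 16)
print(out)
```

The variance estimator needs at least two draws, which is why the helper stops when `n < 2`.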

Remark

  • For each draw \(k\), we can express \[ \hat{T}_k = \sum_{i=1}^N I_i^{(k)} (y_i/p_i) \] where \(I_i^{(k)}=1\) if \(i\) is selected at the \(k\)-th draw and \(I_i^{(k)}=0\) otherwise.

  • The PPS estimator can be written as a weighted total of the sample observations
    \[ \hat{T}_{PPS}= \frac{1}{ n} \sum_{k=1}^n \hat{T}_k = \frac{1}{n} \sum_{i=1}^N Q_i \frac{ y_i}{ p_i } \] where \(Q_i= \sum_{k=1}^n I_i^{(k)}\) is the number of times that unit \(i\) is selected in the sample.

  • Thus, we can express \[ \hat{T}_{PPS} = \sum_{i \in S} w_i y_i \] where \[ w_i = \frac{Q_i}{ n p_i} . \]

Remark 2

  • Using \(Q_i\) notation, we can express the variance estimation formula as \[ \hat{V} ( \hat{T}_{PPS})= \frac{1}{n} \left\{\frac{1}{n-1} \sum_{i \in S} Q_i \left( \frac{ y_i}{ p_i} - \hat{T}_{PPS}\right)^2 \right\} \]

  • If \(n/N\) is very small, then the \(Q_i\) are typically either 0 or 1. In this case \(y_i/p_i = n w_i y_i\) for each sampled unit, so \[ \hat{V} ( \hat{T}_{PPS})= \frac{n}{n-1} \sum_{i \in S} \left( w_i y_i - \frac{1}{n} \hat{T}_{PPS}\right)^2 \]
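
The equivalence between the draw-by-draw form and the \(Q_i\) (weighted) form can be checked numerically. The sketch below uses the grocery store population with a hypothetical set of draws in which unit 4 is selected twice; all variable names are illustrative.

```r
x <- c(1, 2, 3, 10)
y <- c(11, 20, 24, 245)
p <- x / sum(x)

draws <- c(2, 4, 4)                       # hypothetical draws; unit 4 appears twice
n <- length(draws)

# Draw-by-draw form
T_k    <- y[draws] / p[draws]
T_mean <- mean(T_k)
V_k    <- var(T_k) / n

# Q_i form: Q_i counts how often unit i was selected, w_i = Q_i / (n p_i)
Q   <- tabulate(draws, nbins = length(y))
w   <- Q / (n * p)
T_w <- sum(w * y)                         # units with Q_i = 0 contribute nothing
V_Q <- sum(Q * (y / p - T_w)^2) / (n * (n - 1))

print(c(T_mean = T_mean, T_w = T_w, V_k = V_k, V_Q = V_Q))
```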

Examples

Grocery Store Example

  • Suppose that we apply PPSWR of size \(n=2\).
  • Suppose that we select unit 3 in the first draw and select unit 1 in the second draw

  • Recall that

ID   \(x_i\)   \(y_i\)   \(p_i\)
1    1         11        1/16
2    2         20        2/16
3    3         24        3/16
4    10        245       10/16

Problem

  • What is the estimated total from the sample? (Answer: 152)

  • What is the estimated standard error? (Answer: 24)

Solution
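
One way to work through the problem is to apply the PPSWR formulas directly to the two selected units (unit 3 on the first draw, unit 1 on the second). The following R sketch is one possible worked solution; the variable names are illustrative.

```r
x <- c(1, 2, 3, 10)
y <- c(11, 20, 24, 245)
p <- x / sum(x)

draws <- c(3, 1)                 # unit 3 on the first draw, unit 1 on the second
n <- length(draws)

T_k   <- y[draws] / p[draws]     # per-draw estimates: 24/(3/16) and 11/(1/16)
T_pps <- mean(T_k)               # estimated total
s2_T  <- var(T_k)                # s_T^2 with divisor n - 1
se    <- sqrt(s2_T / n)          # estimated standard error

print(T_k)
print(c(T_pps = T_pps, se = se))
```

The per-draw estimates are 128 and 176, so the estimated total is 152; \(s_T^2 = 1152\), and the estimated standard error is \(\sqrt{1152/2} = 24\), matching the stated answers.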

Confidence Intervals

  • Using the asymptotic normality of \(\hat{T}_{PPS}\), we can construct a 95% confidence interval for \(T\) as follows: \[ \left[ \hat{T}_{PPS} - 1.96 \cdot \widehat{SE}, \hat{T}_{PPS} + 1.96 \cdot \widehat{SE} \right] \] where \[ \widehat{SE} = \sqrt{\hat{V} ( \hat{T}_{PPS} ) }. \]
  • In the grocery store example, a 95% confidence interval for the total sales for the population of 4 grocery stores is \[ [ 152-1.96 (24), 152 +1.96(24)] =[104.96, 199.04] \]
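
The interval arithmetic can be reproduced in one line of R. This sketch takes the point estimate and standard error from the example as given and uses the rounded critical value 1.96 from the slide.

```r
T_pps <- 152     # estimated total from the grocery store example
se    <- 24      # estimated standard error from the same example
z     <- 1.96    # rounded normal critical value; qnorm(0.975) is slightly smaller

ci <- T_pps + c(-1, 1) * z * se   # lower and upper 95% limits
print(ci)
```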

Example: Canadian Workplace Data

Recall that

  • Population: \(N=2,025\) workplaces in Canada
  • Characteristic of interest for workplace \(i\)
    • \(y_i\): total payroll of workplace \(i\)
  • Auxiliary variable for workplace \(i\)
    • \(x_i\): employment of workplace \(i\)
  • Population parameter: total payroll
    • \(T=\sum_{i=1}^N y_i\) dollars

Canadian Workplace Data

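# Read the workplace frame (no header row, so columns are named V1, V2, ...)
cd <- read.csv("CanadianData.csv", header = FALSE)
# Keep workplaces whose payroll (V3, in $1000) is below 5000
cd <- subset(cd, V3 < 5000)

# Histogram of payroll on the density scale
hist(cd$V3, breaks = 500, xlab = "1000 Dollars", 
     main = "Histogram of Payroll at \n Canadian Workplaces", 
     freq = FALSE)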

  • Previously, we considered two sample designs
    1. Simple random sampling
    2. Systematic sampling: Sort the frame in increasing order by employment
  • Now, also consider PPSWR

    • Size measure \(x_i\)= employment for workplace \(i\)
    • Draw probability \[ p_i = \frac{x_i}{\sum_{j=1}^N x_j} \]

PPS sampling simulation

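nsim <- 10000                  # number of simulation replicates (B = 10,000)
result3 <- double(nsim)

# Draw one PPSWR sample of size n and return the estimated mean payroll, T-hat / N
ppst <- function(n) {
  pik <- cd$V1 / sum(cd$V1)    # draw probabilities proportional to size (V1)
  sout <- sample(1:nrow(cd), size = n, replace = TRUE, prob = pik)
  ys <- cd$V3[sout]            # payroll (V3) of the selected workplaces
  return(mean(ys / pik[sout]) / nrow(cd))
}

for (i in 1:nsim) {
  result3[i] <- ppst(100)
}
hist(result3, breaks = 50, main = "Histogram of sample means under PPS")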
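
The same simulation logic can be tried on a population where the theoretical answer is known. The sketch below is an illustration on the four-store grocery population rather than the Canadian data; the seed, `n`, and `nsim` are arbitrary choices. Theory gives \(E(\hat{T}_{PPS}) = 300\) and \(V(\hat{T}_{PPS}) = 14248/n\).

```r
set.seed(2020)                 # arbitrary seed for reproducibility

x <- c(1, 2, 3, 10)
y <- c(11, 20, 24, 245)
p <- x / sum(x)

n    <- 2                      # PPSWR sample size (illustrative)
nsim <- 20000                  # number of simulated samples (illustrative)

# Each replicate draws one PPSWR sample and computes T-hat_PPS
sim <- replicate(nsim, {
  s <- sample(seq_along(y), size = n, replace = TRUE, prob = p)
  mean(y[s] / p[s])
})

# Simulated mean and variance next to the theoretical variance 14248 / n
print(c(mean = mean(sim), variance = var(sim), theory_var = 14248 / n))
```

The simulated mean and variance should be close to the theoretical values, up to Monte Carlo error.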

Comparison

  • Simulation results (based on \(B=10,000\) simulations)

Design   Mean     Variance
SRS      142.53   968.21
SYS      143.05   139.03
PPS      143.01   44.29

  • Point estimators are unbiased under all three designs, and the simulated means agree closely

  • PPS sampling has the smallest variance: about one third of the SYS variance and less than one twentieth of the SRS variance