Stat 421: Survey Sampling

Week 7

Jae-kwang Kim

2/25/2020

Review

Last week, we studied
- Systematic sampling
- Statistical Property
- Efficiency issues
This week, we will study PPS sampling
- Motivation with $n=1$
- General theory for PPS sampling
- Examples

PPS sampling: Introduction

Introduction

Basic designs we have already covered: SRS, SYS
We will now select SUs with replacement, with probability proportional to a size measure (PPSWR)
Uses auxiliary information in the design and estimator
- Auxiliary information determines size measure

Grocery Store Example

Population
- $N=4$ grocery stores in town
Survey Objective
- Estimate total sales last month, $T=\sum_{i=1}^N y_i$
- Study variable (characteristic of interest): $y_i$ = total sales (in $1,000) during the prior month for store $i$
- Size measure on each store: $x_i$ = floor area (in 100 $m^2$) for store $i$

Grocery Store Example

Expect larger stores (stores with larger floor area) to have higher sales and their sales to be more variable
Population

ID	Size ($100 m^2$)	Total sales (in $1000)
1	1	11
2	2	20
3	3	24
4	10	245
Total	16	300

Sampling design ($n=1$)

Total sales variable ($y_i$) is not available from the sampling frame. Only the size is available.
Two sampling designs
- SRS: select one store at random
- PPS: select one store with the selection probability proportional to size.

SRS of size $n=1$

Sampling distribution of $\hat{T}$ under SRS of size $n=1$

Sample ID	Estimator of $T$	Probability
1	11*4	1/4
2	20*4	1/4
3	24*4	1/4
4	245*4	1/4

Expectation \[\begin{eqnarray*} E( \hat{T}) &=& \frac{1}{4} \left( 11*4 + 20*4 + 24 *4 + 245 * 4 \right)\\ &=& 11 + 20 + 24 + 245 = 300 \end{eqnarray*}\]

SRS of size $n=1$

Variance is \[ V( \hat{T}) = \frac{1}{4} \sum_{i=1}^4 ( \hat{T}_i - T )^2 = 154,488 \] where $\hat{T}_i$ is the estimator of $T$ under $i$-th possible sample.
Alternatively, we can use the variance formula for SRS: \[\begin{eqnarray*} V ( \hat{T}) &=& \frac{N^2}{n} \left( 1-\frac{n}{N} \right) S^2\\ &=& \frac{4^2}{1} \left( 1- \frac{1}{4}\right)* 12874 = 154488 \end{eqnarray*}\]
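The SRS calculation above can be checked numerically. The following is a minimal R sketch using the four stores from the population table; the variable names (`T_true`, `T_hat`, and so on) are illustrative choices, not part of the original slides.

```r
# Grocery store population: total sales (in $1,000) for N = 4 stores
y <- c(11, 20, 24, 245)
N <- length(y)
T_true <- sum(y)                          # population total

# Under SRS with n = 1, each store is selected with probability 1/N
# and the estimator of the total is N * y_i
T_hat <- N * y
prob  <- rep(1 / N, N)

E_srs <- sum(prob * T_hat)                # expectation of the estimator
V_srs <- sum(prob * (T_hat - T_true)^2)   # variance from the sampling distribution

# Same variance from the SRS formula (N^2 / n) * (1 - n / N) * S^2,
# where var() uses the divisor N - 1, matching S^2
n <- 1
V_formula <- (N^2 / n) * (1 - n / N) * var(y)

print(c(E_srs = E_srs, V_srs = V_srs, V_formula = V_formula))
```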

PPS sampling of size $n=1$

Idea
- Select unit $i$ with probability equal to \[ p_i = \frac{ x_i}{ \sum_{j=1}^N x_j } \]
- If unit $k$ is selected for sample, then use \[ \hat{T} = y_k / p_k \] as an estimator for the PPS sample with $n=1$.
The probability $p_i$ is called the draw probability.
PPS meaning: Probability Proportional to Size

Back to Grocery Store Example

First compute the draw probability $p_i$ using the size measure $x_i$

ID	$x_i$	$y_i$	$p_i$
1	1	11	1/16
2	2	20	2/16
3	3	24	3/16
4	10	245	10/16

Sampling distribution of $\hat{T}$ under PPS sampling ($n=1$)

Sample ID	Estimator of $T$	Probability
1	11*(16/1)= 176	1/16
2	20*(16/2)= 160	2/16
3	24*(16/3)= 128	3/16
4	245*(16/10)= 392	10/16

Grocery Store Example (Continued)

We can check that $\hat{T}$ is unbiased. \[\begin{eqnarray*} E( \hat{T}) &=& \sum_{i=1}^4 p_i \hat{T}_i =\sum_{i=1}^4 p_i \left( \frac{y_i}{p_i} \right) = \sum_{i=1}^4 y_i = 300. \end{eqnarray*}\]
What is the variance of $\hat{T}$?

\[\begin{eqnarray*} V( \hat{T}) &=& \sum_{i=1}^4 p_i \left( \hat{T}_i - T \right)^2 \\ &=& \frac{1}{16} (176-300)^2 + \cdots + \frac{10}{16} ( 392-300)^2 \\ &=& 14248 \end{eqnarray*}\]
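The PPS sampling distribution for $n=1$ can be reproduced in a few lines. This is a minimal R sketch based on the grocery store table; the variable names are illustrative assumptions.

```r
x <- c(1, 2, 3, 10)        # size measure (floor area)
y <- c(11, 20, 24, 245)    # total sales (in $1,000)
T_true <- sum(y)           # population total

p <- x / sum(x)            # draw probabilities p_i = x_i / sum(x)
T_hat <- y / p             # value of the estimator if store i is selected

E_pps <- sum(p * T_hat)                 # expectation under PPS
V_pps <- sum(p * (T_hat - T_true)^2)    # variance under PPS

print(data.frame(id = 1:4, p = p, T_hat = T_hat))
print(c(E_pps = E_pps, V_pps = V_pps))
```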

Comparison with SRS

Both sampling designs provide unbiased estimation.
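However, PPS sampling is much more efficient than SRS in this example: the variance is 14,248 under PPS versus 154,488 under SRS.
In general, if the size measure ($x_i$) is positively correlated with $y_i$ (ideally, roughly proportional to it), then PPS sampling tends to be preferable.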

General theory for PPS sampling

Basic Concepts (for PPSWR)

With-replacement sampling procedure: repeat the following steps $n$ times:

1. Select element $i$ with probability $p_i$
2. Return the selected element to the frame

Key Ideas
- With-replacement sample: An element can be selected into the sample more than once
- Draw probability $p_i$: Unequal probability sampling can improve the efficiency

With replacement sampling

WR: always start with a full frame at the beginning of each draw
- Randomly select a SU from all SUs in the frame
- Put it back in frame
- Randomly select a SU from all SUs in the frame
- Put it back…

Draw probability $p_i$

$p_i$= probability of selecting SU $i$ for a single draw
Draw probability is the same for each draw
- Set of SUs eligible for selection is same after each draw because the selected unit is always returned to the frame
- Same $p_i$ for SU $i$ for all draws
Exactly one unit is selected on every draw, since \[ \sum_{i=1}^N p_i = 1\]
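In R, with-replacement draws with unequal probabilities can be sketched with `sample()`. The example below assumes the grocery store size measure; the seed and the number of draws are arbitrary illustrative choices.

```r
set.seed(421)              # arbitrary seed, only for reproducibility
x <- c(1, 2, 3, 10)        # size measure from the grocery store example
p <- x / sum(x)            # draw probabilities; they sum to 1
n <- 5                     # number of draws (illustrative)

# replace = TRUE returns each selected unit to the frame, so the same
# probabilities p apply on every draw and a unit can appear more than once
s <- sample(seq_along(x), size = n, replace = TRUE, prob = p)

print(s)
print(table(factor(s, levels = seq_along(x))))   # times each unit was selected
```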

Draw probability $p_i$

How do we choose $p_i$?
- For estimating the population total (or mean, with known $N$), the best choice of $p_i$ is $p_i \propto y_i$
- That is, select units with larger $y_i$ with higher probability
In practice, we do not know $y_i$
We may select $p_i \propto x_i$
- $x_i$ is an auxiliary variable known on the frame for all population elements
- $x_i$ should have a positive correlation with $y_i$
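To see why $p_i \propto y_i$ is the ideal choice, note that $y_i/p_i = T$ for every unit in that case, so the single-draw estimator has zero variance. A minimal R sketch with the grocery store data (the helper `var_one_draw` is a hypothetical name introduced only for this illustration):

```r
y <- c(11, 20, 24, 245)    # total sales (unknown in practice)
x <- c(1, 2, 3, 10)        # size measure (known on the frame)
T_true <- sum(y)

# Variance of the single-draw estimator y_i / p_i for given draw probabilities
var_one_draw <- function(p) sum(p * (y / p - T_true)^2)

V_ideal <- var_one_draw(y / sum(y))       # p_i proportional to y_i (not feasible)
V_size  <- var_one_draw(x / sum(x))       # p_i proportional to x_i
V_equal <- var_one_draw(rep(1 / 4, 4))    # equal draw probabilities

print(c(V_ideal = V_ideal, V_size = V_size, V_equal = V_equal))
```

The closer $x_i$ is to being proportional to $y_i$, the closer the PPS variance gets to the ideal value.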

Draw probability is NOT the inclusion probability

Inclusion probability $\pi_i$ is the probability of being included in the sample
The draw probability $p_i$ is the probability of being selected on a particular draw, but there are $n$ draws in total.
For PPSWR, the inclusion probability is the probability that the element is selected on at least one draw. Thus, \[\begin{eqnarray*} \pi_i &=& P ( i \mbox{ is selected at least once}) \\ &=& 1- P( i \mbox{ is never selected in $n$ draws}) \\ &=& 1- (1- p_i)^n \end{eqnarray*}\]
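As an illustration, the inclusion probabilities for the grocery store example with $n=2$ draws can be computed directly from the formula. This is a sketch; the choice $n=2$ is an assumption made for the example. The column `n_times_p` is the expected number of times a unit is selected, which is not a probability and can exceed 1.

```r
x <- c(1, 2, 3, 10)
p <- x / sum(x)              # draw probabilities
n <- 2                       # number of draws

pi_incl <- 1 - (1 - p)^n     # inclusion probabilities under PPSWR

print(round(cbind(p = p, pi = pi_incl, n_times_p = n * p), 4))
```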

Estimation under PPS of size $n=1$

Recall that, for PPS sample of size $n=1$, we have the following estimator of $T$

\[ \hat{T} = \left\{ \begin{array}{ll} y_1/p_1 & \mbox{ with prob. } p_1 \\ y_2/ p_2 & \mbox{ with prob. } p_2 \\ \vdots & \vdots \\ y_N/ p_N & \mbox{ with prob. } p_N \end{array} \right. \]

From the above sampling distribution, we can check that the above estimator is unbiased
\[\begin{eqnarray*} E( \hat{T}) &=& \sum_{i=1}^N p_i( y_i/p_i) = \sum_{i=1}^N y_i. \end{eqnarray*}\]

Estimation under PPS of size $n>1$

Now, we have $n$ independent draws from the same population. With-replacement sampling leaves the population unchanged after each draw.
For each draw, we obtain one $\hat{T}$ from the probability distribution on the previous slide.
Let $\hat{T}_k$ be the value of $\hat{T}$ from the $k$-th draw. We can write $\hat{T}_k= y_i/p_i$ if unit $i$ is selected at the $k$-th draw.

Note that $\hat{T}_1, \cdots, \hat{T}_n$ are independently and identically distributed with distribution \[ \hat{T}_k = \left\{ \begin{array}{ll} y_1/p_1 & \mbox{ with prob. } p_1 \\ y_2/ p_2 & \mbox{ with prob. } p_2 \\ \vdots & \vdots \\ y_N/ p_N & \mbox{ with prob. } p_N \end{array} \right. \]
Thus, we have \[ E( \hat{T}_k )= T \] and \[ V( \hat{T}_k)= \sum_{i=1}^N p_i \left( \frac{y_i}{p_i} - T\right)^2 . \]

Estimation under PPSWR of size $n>1$

The final estimator of $T$ based on the PPSWR sample of size $n$ is \[ \hat{T}_{PPS} = \frac{1}{n} \sum_{k=1}^n \hat{T}_k. \] This is the simple mean of $n$ IID (Independently and Identically Distributed) realizations of a random variable $\hat{T}$.
Since each of $\hat{T}_k$ is unbiased for $T$, $\hat{T}_{PPS}$ is unbiased.

The variance of $\hat{T}_{PPS}$ is \[\begin{eqnarray*} V( \hat{T}_{PPS}) &=& \frac{1}{n^2} \sum_{k=1}^n V( \hat{T}_k) = \frac{1}{n} \sum_{i=1}^N p_i \left( \frac{y_i}{p_i} - T\right)^2 \end{eqnarray*}\]
Since we have $n$ IID realized values, we can estimate the variance of $\hat{T}_{PPS}$ by \[ \hat{V} ( \hat{T}_{PPS}) = \frac{1}{n} s_T^2 \] where \[ s_T^2 = \frac{1}{n-1} \sum_{k=1}^n \left( \hat{T}_k - \hat{T}_{PPS} \right)^2. \]
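For a small population these properties can be verified exactly by enumerating every possible ordered sample. The R sketch below does this for the grocery store population with $n=2$ (16 ordered samples); it is an illustrative check under that assumed sample size, and the variable names are not from the original slides.

```r
x <- c(1, 2, 3, 10)
y <- c(11, 20, 24, 245)
p <- x / sum(x)
T_true <- sum(y)
n <- 2

# All N^n ordered samples (first draw, second draw) and their probabilities
draws <- expand.grid(d1 = seq_along(y), d2 = seq_along(y))
prob  <- p[draws$d1] * p[draws$d2]           # the draws are independent

# Per-draw estimates, the PPS estimator, and its variance estimator
T1 <- y[draws$d1] / p[draws$d1]
T2 <- y[draws$d2] / p[draws$d2]
T_pps <- (T1 + T2) / n
s2_T  <- ((T1 - T_pps)^2 + (T2 - T_pps)^2) / (n - 1)
V_hat <- s2_T / n

E_T_pps <- sum(prob * T_pps)                 # expectation of the PPS estimator
V_T_pps <- sum(prob * (T_pps - T_true)^2)    # its variance
E_V_hat <- sum(prob * V_hat)                 # expectation of the variance estimator

print(c(E_T_pps = E_T_pps, V_T_pps = V_T_pps, E_V_hat = E_V_hat))
```

The expectation equals $T$, the variance equals the single-draw variance divided by $n$, and the variance estimator has expectation equal to that variance.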

Remark

For each draw $k$, we can express \[ \hat{T}_k = \sum_{i=1}^N I_i^{(k)} (y_i/p_i) \] where $I_i^{(k)}=1$ if $i$ is selected at the $k$-th draw and $I_i^{(k)}=0$ otherwise.

The PPS estimator can be written as a weighted total of the sample observations
\[ \hat{T}_{PPS}= \frac{1}{ n} \sum_{k=1}^n \hat{T}_k = \frac{1}{n} \sum_{i=1}^N Q_i \frac{ y_i}{ p_i } \] where $Q_i= \sum_{k=1}^n I_i^{(k)}$ is the number of times that unit $i$ is selected in the sample.
Thus, we can express \[ \hat{T}_{PPS} = \sum_{i \in S} w_i y_i \] where \[ w_i = \frac{Q_i}{ n p_i} . \]

Remark 2

Using $Q_i$ notation, we can express the variance estimation formula as \[ \hat{V} ( \hat{T}_{PPS})= \frac{1}{n} \left\{\frac{1}{n-1} \sum_{i \in S} Q_i \left( \frac{ y_i}{ p_i} - \hat{T}_{PPS}\right)^2 \right\} \]
If $n/N$ is very small, then each $Q_i$ is typically either 0 or 1. In this case, $w_i = 1/(n p_i)$ for $i \in S$, so that $y_i/p_i = n w_i y_i$ and \[ \hat{V} ( \hat{T}_{PPS})= \frac{n}{n-1} \sum_{i \in S} \left( w_i y_i - \frac{1}{n} \hat{T}_{PPS}\right)^2 \]
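The equivalence between the draw-by-draw form and the $Q_i$ form can be illustrated numerically. The R sketch below uses a hypothetical PPSWR sample of size $n=3$ from the grocery store population in which unit 4 is selected twice; the sample itself is an assumption made up for the illustration.

```r
x <- c(1, 2, 3, 10)
y <- c(11, 20, 24, 245)
p <- x / sum(x)

s <- c(4, 2, 4)                       # hypothetical draws: unit 4 is selected twice
n <- length(s)

# Draw-by-draw form
T_k   <- y[s] / p[s]
T_pps <- mean(T_k)
V_hat <- var(T_k) / n

# Weighted form over the distinct sampled units
Q <- tabulate(s, nbins = length(y))   # Q_i = number of times unit i is selected
S <- which(Q > 0)                     # distinct units in the sample
w <- Q[S] / (n * p[S])                # weights w_i = Q_i / (n p_i)
T_pps_w <- sum(w * y[S])
V_hat_Q <- sum(Q[S] * (y[S] / p[S] - T_pps_w)^2) / (n * (n - 1))

print(c(T_pps = T_pps, T_pps_w = T_pps_w, V_hat = V_hat, V_hat_Q = V_hat_Q))
```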

Examples

Grocery Store Example

Suppose that we apply PPSWR of size $n=2$.
Suppose that we select unit 3 in the first draw and select unit 1 in the second draw
Recall that

ID	$x_i$	$y_i$	$p_i$
1	1	11	1/16
2	2	20	2/16
3	3	24	3/16
4	10	245	10/16

Problem

What is the estimated total from the sample? (Answer: 152)
What is the estimated standard error? (Answer: 24)

Solution
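One way to arrive at the stated answers, sketched from the formulas for $\hat{T}_{PPS}$ and $\hat{V}(\hat{T}_{PPS})$ with $n=2$ (a worked outline, not necessarily the derivation presented in lecture):

\[\begin{eqnarray*} \hat{T}_1 &=& y_3/p_3 = 24 (16/3) = 128, \qquad \hat{T}_2 = y_1/p_1 = 11 (16/1) = 176 \\ \hat{T}_{PPS} &=& \frac{1}{2} \left( 128 + 176 \right) = 152 \\ s_T^2 &=& \frac{1}{2-1} \left\{ (128-152)^2 + (176-152)^2 \right\} = 1152 \\ \hat{V}( \hat{T}_{PPS}) &=& \frac{1}{2} (1152) = 576, \qquad \widehat{SE} = \sqrt{576} = 24 \end{eqnarray*}\]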

Confidence Intervals

Using the asymptotic normality of $\hat{T}_{PPS}$, we can construct a 95% confidence interval for $T$ as follows: \[ \left[ \hat{T}_{PPS} - 1.96 \cdot \widehat{SE}, \hat{T}_{PPS} + 1.96 \cdot \widehat{SE} \right] \] where \[ \widehat{SE} = \sqrt{\hat{V} ( \hat{T}_{PPS} ) }. \]
In the grocery store example, a 95% confidence interval for the total sales for the population of 4 grocery stores is \[ [ 152-1.96 (24), 152 +1.96(24)] =[104.96, 199.04] \]
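The estimate, standard error, and interval for the grocery store sample can be reproduced in R. This is a minimal sketch assuming the sample described in the example (unit 3, then unit 1); the variable names are illustrative.

```r
x <- c(1, 2, 3, 10)
y <- c(11, 20, 24, 245)
p <- x / sum(x)

s <- c(3, 1)                        # unit 3 on the first draw, unit 1 on the second
n <- length(s)

T_k   <- y[s] / p[s]                # per-draw estimates of the total
T_pps <- mean(T_k)                  # estimated total
se    <- sqrt(var(T_k) / n)         # estimated standard error
ci    <- T_pps + c(-1, 1) * 1.96 * se   # normal-approximation 95% interval

print(c(estimate = T_pps, se = se, lower = ci[1], upper = ci[2]))
```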

Example: Canadian Workplace Data

Recall that

Population: $N=2,025$ workplaces in Canada
Characteristic of interest for workplace $i$
- $y_i$: total payroll of workplace $i$
Auxiliary variable for workplace $i$
- $x_i$: employment of workplace $i$
Population parameter: total payroll
- $T=\sum_{i=1}^N y_i$ dollars

Canadian Workplace Data

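# Columns, as used in the code in this section: V1 = employment, V3 = payroll
cd <- read.csv("CanadianData.csv", header = FALSE)
cd <- subset(cd, V3 < 5000)   # keep workplaces with payroll below 5000

# Histogram of payroll (in $1,000), shown on the density scale
hist(cd$V3, breaks = 500, xlab = "1000 Dollars",
     main = "Histogram of Payroll at \n Canadian Workplaces",
     freq = FALSE)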

Previously, we considered two sample designs
1. Simple random sampling
2. Systematic sampling: Sort the frame in increasing order by employment
Now, also consider PPSWR
- Size measure $x_i$= employment for workplace $i$
- Draw probability \[ p_i = \frac{x_i}{\sum_{j=1}^N x_j} \]

PPS sampling simulation

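nsim <- 10000             # number of simulated samples (B = 10,000 in the comparison)
result3 <- double(nsim)   # PPS estimates of the mean payroll

# Draw one PPSWR sample of size n and return the estimated population mean
ppst <- function(n) {
  pik <- cd$V1 / sum(cd$V1)   # draw probabilities proportional to employment
  sout <- sample(1:nrow(cd), size = n, replace = TRUE, prob = pik)
  ys <- cd$V3[sout]           # payroll of the selected workplaces
  return(mean(ys / pik[sout]) / nrow(cd))   # estimated total divided by N
}

for (i in 1:nsim) result3[i] <- ppst(100)
hist(result3, breaks = 50, main = "Histogram of sample means under PPS")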

Comparison

Simulation Results (based on $B=10,000$ simulations)

Design	Mean	Variance
SRS	142.53	968.21
SYS	143.05	139.03
PPS	143.01	44.29

Point estimates are unbiased for all designs
In terms of variance, PPS sampling has the smallest value (44.29), followed by SYS (139.03) and SRS (968.21)
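Since `CanadianData.csv` is not included here, the same kind of comparison can be sketched on a synthetic population. Everything in the block below (the population, its size, the seed, and the number of replicates) is an assumption made for illustration and is NOT the Canadian workplace data; it only shows how SRS and PPSWR estimates of a mean can be simulated side by side when $y_i$ is roughly proportional to $x_i$.

```r
set.seed(2020)                        # arbitrary seed for reproducibility

# Synthetic population (NOT the Canadian workplace data): y is roughly
# proportional to the size measure x, with noise that grows with x
N <- 2000
x <- round(rlnorm(N, meanlog = 3, sdlog = 1)) + 1
y <- 5 * x + rnorm(N, sd = sqrt(x))
p <- x / sum(x)                       # PPS draw probabilities

n <- 100                              # sample size
nsim <- 2000                          # number of simulated samples

# Estimated population mean from one SRS (without replacement) sample
srs_mean <- function() mean(y[sample(N, n)])

# Estimated population mean from one PPSWR sample
pps_mean <- function() {
  s <- sample(N, n, replace = TRUE, prob = p)
  mean(y[s] / p[s]) / N
}

res_srs <- replicate(nsim, srs_mean())
res_pps <- replicate(nsim, pps_mean())

print(rbind(SRS   = c(mean = mean(res_srs), var = var(res_srs)),
            PPS   = c(mean = mean(res_pps), var = var(res_pps)),
            truth = c(mean = mean(y), var = NA)))
```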
