Basic designs we have already covered: SRS, SYS
We will now select sampling units (SUs) with probability proportional to a size measure, with replacement (PPSWR)
Expect larger stores (stores with larger floor area) to have higher sales and their sales to be more variable
Population
| ID | Size (\(100 m^2\)) | Total sales (in $1000) |
|---|---|---|
| 1 | 1 | 11 |
| 2 | 2 | 20 |
| 3 | 3 | 24 |
| 4 | 10 | 245 |
| Total | 16 | 300 |
Total sales variable (\(y_i\)) is not available from the sampling frame. Only the size is available.
Under SRS with \(n=1\), each store is selected with probability \(1/4\) and the estimator of the total \(T\) is \(N y_i = 4 y_i\):
| Sample ID | Estimator of \(T\) | Probability |
|---|---|---|
| 1 | 11*4 | 1/4 |
| 2 | 20*4 | 1/4 |
| 3 | 24*4 | 1/4 |
| 4 | 245*4 | 1/4 |
Variance is \[ V( \hat{T}) = \frac{1}{4} \sum_{i=1}^4 ( \hat{T}_i - T )^2 = 154,488 \] where \(\hat{T}_i\) is the estimator of \(T\) under \(i\)-th possible sample.
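The variance above can be reproduced by enumerating the four possible samples. This is a minimal sketch using the population table; the variable names are illustrative.

```r
# Sales for the four stores in the population table
y <- c(11, 20, 24, 245)
N <- length(y)
T_true <- sum(y)                      # true total, 300

# SRS with n = 1: each store is selected with probability 1/N
T_hat_srs <- N * y                    # estimator under each possible sample
p_srs <- rep(1 / N, N)

E_srs <- sum(p_srs * T_hat_srs)                 # expectation of the estimator
V_srs <- sum(p_srs * (T_hat_srs - T_true)^2)    # variance of the estimator
print(c(expectation = E_srs, variance = V_srs))
```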
The probability \(p_i\) is called the draw probability.
PPS stands for Probability Proportional to Size: \(p_i = x_i / \sum_{j=1}^N x_j\)
| ID | \(x_i\) | \(y_i\) | \(p_i\) |
|---|---|---|---|
| 1 | 1 | 11 | 1/16 |
| 2 | 2 | 20 | 2/16 |
| 3 | 3 | 24 | 3/16 |
| 4 | 10 | 245 | 10/16 |
Under PPS sampling with \(n=1\), store \(i\) is selected with probability \(p_i\) and the estimator of \(T\) is \(y_i/p_i\):
| Sample ID | Estimator of \(T\) | Probability |
|---|---|---|
| 1 | 11*(16/1)= 176 | 1/16 |
| 2 | 20*(16/2)= 160 | 2/16 |
| 3 | 24*(16/3)= 128 | 3/16 |
| 4 | 245*(16/10)= 392 | 10/16 |
Both sampling designs provide unbiased estimation.
However, the PPS sample is more efficient than the SRS sample in this example.
In general, if the size measure (\(x_i\)) is correlated with \(y_i\), then the PPS sampling is preferable.
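The efficiency claim can be checked by enumerating the PPS design in the same way. This is a sketch using the values in the tables above; the variable names are illustrative.

```r
# Size measure and sales for the four stores
x <- c(1, 2, 3, 10)
y <- c(11, 20, 24, 245)
p <- x / sum(x)                       # draw probabilities 1/16, 2/16, 3/16, 10/16

# PPS with n = 1: the estimator is y_i / p_i when store i is drawn
T_hat_pps <- y / p                    # 176, 160, 128, 392

E_pps <- sum(p * T_hat_pps)                    # expectation, equals the true total
V_pps <- sum(p * (T_hat_pps - sum(y))^2)       # variance under PPS
print(c(expectation = E_pps, variance = V_pps))
```

The PPS variance comes out to 14,248, against 154,488 for SRS, so in this population PPS reduces the variance by roughly a factor of ten.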
\(p_i\)= probability of selecting SU \(i\) for a single draw
Exactly one unit is selected on each draw, so \[ \sum_{i=1}^N p_i = 1\]
In practice, we do not know \(y_i\)
Inclusion probability \(\pi_i\) is the probability of being included in the sample
The draw probability \(p_i\) is the probability of being included on a particular draw, but there are \(n\) draws in total.
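To make the distinction concrete: with \(n\) independent draws with replacement, unit \(i\) is missed on every draw with probability \((1-p_i)^n\), so \(\pi_i = 1 - (1-p_i)^n\). This relation is a standard fact that the notes do not state explicitly; the sketch below assumes \(n = 2\) and the store example.

```r
# Draw probabilities from the store example
p <- c(1, 2, 3, 10) / 16
n <- 2                                # assumed number of draws, for illustration

# Probability that unit i appears at least once in n with-replacement draws
pi_incl <- 1 - (1 - p)^n
print(round(pi_incl, 4))
```

Note that \(\pi_i\) is not \(n p_i\) under with-replacement sampling, because the same unit can be drawn more than once.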
\[ \hat{T} = \left\{ \begin{array}{ll} y_1/p_1 & \mbox{ with prob. } p_1 \\ y_2/ p_2 & \mbox{ with prob. } p_2 \\ \vdots & \vdots \\ y_N/ p_N & \mbox{ with prob. } p_N \end{array} \right. \]
Now, we have \(n\) independent draws from the same population. Because sampling is with replacement, the population is unchanged after each draw.
For each draw, we obtain one \(\hat{T}\) from the probability distribution given above.
Let \(\hat{T}_k\) be the value of \(\hat{T}\) from the \(k\)-th draw. We can write \(\hat{T}_k= y_i/p_i\) if unit \(i\) is selected at the \(k\)-th draw.
The final estimator of \(T\) based on the PPSWR sample of size \(n\) is \[ \hat{T}_{PPS} = \frac{1}{n} \sum_{k=1}^n \hat{T}_k. \] This is the simple mean of \(n\) IID (Independently and Identically Distributed) realizations of a random variable \(\hat{T}\).
Since each of \(\hat{T}_k\) is unbiased for \(T\), \(\hat{T}_{PPS}\) is unbiased.
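The unbiasedness of a single draw follows directly from the distribution of \(\hat{T}\). The variance expression is the standard with-replacement result, added here as a supplement to the notes:

\[ E(\hat{T}_k) = \sum_{i=1}^N p_i \frac{y_i}{p_i} = \sum_{i=1}^N y_i = T, \qquad V(\hat{T}_k) = \sum_{i=1}^N p_i \left( \frac{y_i}{p_i} - T \right)^2 \]

Because the \(n\) draws are independent and identically distributed, \[ V(\hat{T}_{PPS}) = \frac{1}{n} V(\hat{T}_k) = \frac{1}{n} \sum_{i=1}^N p_i \left( \frac{y_i}{p_i} - T \right)^2 . \]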
The PPS estimator can be written as a weighted total of the sample observations
\[ \hat{T}_{PPS}= \frac{1}{ n} \sum_{k=1}^n \hat{T}_k = \frac{1}{n} \sum_{i=1}^N Q_i \frac{ y_i}{ p_i } \] where \(Q_i= \sum_{k=1}^n I_i^{(k)}\) is the number of times that unit \(i\) is selected in the sample.
Thus, we can express \[ \hat{T}_{PPS} = \sum_{i \in S} w_i y_i \] where \[ w_i = \frac{Q_i}{ n p_i} . \]
Using \(Q_i\) notation, we can express the variance estimation formula as \[ \hat{V} ( \hat{T}_{PPS})= \frac{1}{n} \left\{\frac{1}{n-1} \sum_{i \in S} Q_i \left( \frac{ y_i}{ p_i} - \hat{T}_{PPS}\right)^2 \right\} \]
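If \(n/N\) is very small, a unit is rarely drawn more than once, so each \(Q_i\) is essentially either 0 or 1 and \(y_i/p_i = n w_i y_i\) for the sampled units. In this case, \[ \hat{V} ( \hat{T}_{PPS})= \frac{n}{n-1} \sum_{i \in S} \left( w_i y_i - \frac{\hat{T}_{PPS}}{n}\right)^2 \]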
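The draw-based and \(Q_i\)-based forms of the estimator and its variance estimator can be compared numerically. This is a sketch on the store population; the draws `c(4, 4, 2)` are hypothetical, chosen only so that one unit is selected twice.

```r
# Store population; p_i proportional to floor area x_i
x <- c(1, 2, 3, 10)
y <- c(11, 20, 24, 245)
p <- x / sum(x)

# Hypothetical draws (unit IDs), chosen only to show a repeated unit
draws <- c(4, 4, 2)
n <- length(draws)

# Draw-based form: mean and variance of the n values y_i / p_i
T_k <- y[draws] / p[draws]
T_pps <- mean(T_k)
V_hat <- var(T_k) / n                 # var() uses the n - 1 denominator

# Q_i-based form: Q_i counts how many times unit i was drawn
Q <- tabulate(draws, nbins = length(y))
T_pps_Q <- sum(Q * y / p) / n
V_hat_Q <- sum(Q * (y / p - T_pps_Q)^2) / (n * (n - 1))

print(c(T_pps = T_pps, T_pps_Q = T_pps_Q, V_hat = V_hat, V_hat_Q = V_hat_Q))
```

Both forms give the same numbers; the \(Q_i\) form simply groups repeated draws of the same unit.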
Suppose that we select unit 3 in the first draw and select unit 1 in the second draw
Recall that
| ID | \(x_i\) | \(y_i\) | \(p_i\) |
|---|---|---|---|
| 1 | 1 | 11 | 1/16 |
| 2 | 2 | 20 | 2/16 |
| 3 | 3 | 24 | 3/16 |
| 4 | 10 | 245 | 10/16 |
What is the estimated total from the sample? (Answer: 152)
What is the estimated standard error? (Answer: 24)
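A worked check of the two answers, using \(p_3 = 3/16\), \(p_1 = 1/16\) and \(n = 2\):

\[ \hat{T}_1 = \frac{24}{3/16} = 128, \qquad \hat{T}_2 = \frac{11}{1/16} = 176, \qquad \hat{T}_{PPS} = \frac{128 + 176}{2} = 152 \]

\[ \hat{V}(\hat{T}_{PPS}) = \frac{1}{2} \cdot \frac{1}{2-1} \left\{ (128-152)^2 + (176-152)^2 \right\} = \frac{1}{2} (576 + 576) = 576, \qquad \sqrt{576} = 24 \]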
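# Payroll data for Canadian workplaces; the file has no header row
cd <- read.csv("CanadianData.csv", header = FALSE)
# Keep workplaces with payroll (V3, in $1000) below 5000
cd <- subset(cd, V3 < 5000)
hist(cd$V3, breaks = 500, xlab = "1000 Dollars",
     main = "Histogram of Payroll at \n Canadian Workplaces",
     freq = FALSE)

Now, also consider PPSWR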
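# nsim (the number of simulation replicates) is assumed to be defined earlier
result3 <- double(nsim)
# Draw one PPSWR sample of size n and return the estimated population mean of V3
ppst <- function(n) {
  pik <- cd$V1 / sum(cd$V1)  # draw probabilities proportional to the size measure V1
  sout <- sample(1:nrow(cd), size = n, replace = TRUE, prob = pik)
  ys <- cd$V3[sout]
  return(mean(ys / pik[sout]) / nrow(cd))  # estimated total divided by N
}
for (i in 1:nsim) { result3[i] <- ppst(100) }
hist(result3, breaks = 50, main = "Histogram of sample means under PPS")

| Design | Mean | Variance |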
|---|---|---|
| SRS | 142.53 | 968.21 |
| SYS | 143.05 | 139.03 |
| PPS | 143.01 | 44.29 |
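The simulation means are nearly identical across designs, consistent with unbiased point estimation under all three
In terms of variance, PPS sampling has the smallest (44.29, versus 139.03 for SYS and 968.21 for SRS)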