In general, this is more practical
We will be only discussing estimation of parameters for which an unbiased estimator exists.
For the N clusters that make up the population, they have \(M_1,M_2,...M_N\) SSU’s respectively. (M being the number of people per cluster) (SSU = # of people in said population)
Let \[M_0 = \sum_{i=1}^NM_i\] then \(M_0\) is the number of SSU’s in the population. (aka sum of the population).
The probability that a PSU is in our sample is \[\frac{n}{N}\] Since we are still discussing one-stage cluster sampling, we sample every unit in every selected cluster.
Here, where i represents the cluster(PSU)
\[t_i = \sum_{j=1}^{M_i}y_{ij}\]
Where N = Number of clusters in the population n = Number of chosen samples in the population
\[\hat{t}_{unb} = \frac{N}{n}\sum_{i=1}^nt_i \]
Aside: A potential alternative would be to look at the average of all the SSU’s and then divide them
\[M_0 * \frac{t_1 + t_2 + ... t_n}{M_1 + M_2 + ... M_n}\]
denominator = average for all \(y_{ij}\)’s
It can be shown that the estimated standard error for \(\hat{t}_{unb}\) is \[SE(\hat{t}_{unb}) = N* \sqrt{(1-\frac{n}{N})*\frac{S_t^2}{n}}\]
Where \[s_t^2 = \frac{1}{n-1}*\sum(t_i - \frac{\hat{t}}{N})^2\]
Although this is the same formula as the case of equal cluster sizes, the above SE tends to be larger.
This is because there is more variation between the cluster totals.
Thus, a (1-x)*100% CI for Tao, (the population total), is
\[\hat{t}_{unb} +- Z_{\alpha/2}*SE(\hat{t}_{unb})\]
The estimator is,
\[\hat{\bar{y}}_{unb} = \frac{1}{M_0}*\hat{t}_{unb}\]
with standard error
\[SE(\hat{\bar{y}}_{unb}) = \frac{1}{M_0}*SE(\hat{t}_{unb})\]
Thus resulting in the CI
\[\hat{\bar{y}}_{unb} +- Z_{\alpha/2}*SE(\hat{\bar{y}}_{unb})\]
Let \(a_i\) be the number of individuals in cluster i having the characteristic of interest.
Across all SSU’s sampled, the sample proportion is
\[\hat{p} = \frac{\sum{a_n}}{\sum{m_n}}\]
It can be shown that
\[\hat{V(\hat{p})} = (\frac{N-n}{N_n \bar{M}^2})*S_p^2\]
where \[S_p^2 = \sum_{i=1}^n(\frac{(a_i - \hat{p}m_i)^2}{n-1})\] (Variances among the clusters)
and \(\bar{M} = \frac{M_0}{N}\) (Average cluster size, where N = number of clusters, \(M_0\) = number of people in each cluster
If we don’t know \(\bar{M}\), we can estimate it with
\[\bar{m} = \frac{1}{n}\sum_{i=1}^nM_i = \hat{\bar{M}}\]
The 95% confidence interval for p is
\[\hat{p} \pm z_{\alpha/2}*\sqrt{\hat{V(\hat{p})}}\]
Two-Stage cluster Sampling - Recall, we take a random sample of clusters (psu’s) and then, from each, take a random sample of SSU’s
Where i is an index for each cluster, and j being the index for an observation within a cluster
\[\hat{t}_i= M_i*\frac{1}{m_i}\sum_{j=1}^1y_{ij}\]
Similar to before
\[\hat{t}_{unb}=\frac{N}{n}\sum_{i=1}^n\hat{t}_i\] So \[\hat{\bar{y}}=\frac{1}{M_0}*\hat{t}_{unb}\]
Aside: Q: What if \(M_0\) is not given -Estimate M_0 \[\hat{M}_0 = N/n * \sum_{i=1}^nm_i\]
It can be shown that the:
\[\hat{V(\hat{t_{unb}})} = N^2 (1-\frac{n}{N})* \frac{S_t^2}{n} +\frac{N}{n} * (\sum_{i=1}^n (1-\frac{m_i}{M_i}) * M_i^2 * \frac{S_i^2}{m_i})\]
where before
\[S_t^2 = \frac{1}{n-1}\sum_{i=1}^n (\hat{t_i}-\frac{\hat{t_{unb}}}{N})^2\]
Where \(S_i^2\) is the sample variance within the ith cluster.
\[S_i^2 = \sum \frac{(y_{ij}-\bar{y}_U)^2}{M_i-1}\] = Population variance within psu i
Note: if N isn’t given, then we must assume it is very large.
If N is very large, but known
\[\hat{V(\hat{t}_{unb})} =(approx)= N^2 \frac{S_t^2}{n} = \hat{V}_{wr}(\hat{t}_{unb})_{\ wr=\ with\ replacement}\]
We take the square root to get the stimated standard error.
\[SE(\hat{t}_{unb})=\sqrt(\hat{V}(\hat{t}_{unb}))\]
Similarly,
\[SE(\hat{\bar{y}}_{unb}) = \frac{1}{M_0}SE(\hat{t}_{unb})\]
The CLT says both \(\hat{\bar{y}}_{unb}\) and \(\hat{t}_{unb}\) are approximately normal.
and to get the CI we use
\[\hat{t}_{unb} \pm Z_{alpha/2} SE*(\hat{t}_{unb})\] and \[\hat{\bar{y}}_{unb} \pm Z_{alpha/2} SE*(\hat{\bar{y}}_{unb})\]
In class example.
Example: A garment manufacturer has 90 plants located throughout the U.S. to estimate the total number of hours that the sewing machines were down for repairs in the past month (clustering by plant). Estimate the total amount of downtime during the past month for all machines owned by the manufacturer with 95% confidence.
library(tidyverse)
N <- 90
n <- 10
Plant <- 1:10
Mi <- c(50,65,45,48,52,58,42,66,40,56)
#number of machines per plant
mi <- c(10,13,9,10,10,12,8,13,8,11)
#number of sampled machines per plant
ybar_i <- c(5.4,4,5.67,4.8,4.3,3.83,5,3.85,4.88,5)
#average downtime for machines in that plant
si2 <- c(11.38,10.67,16.75,13.29,11.12,14.88,5.14,4.31,6.13,11.8)
#variance of each sample per plant
plant_example <- tibble(Plant,Mi,mi,ybar_i,si2)
plant_example
## # A tibble: 10 x 5
## Plant Mi mi ybar_i si2
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 50 10 5.4 11.4
## 2 2 65 13 4 10.7
## 3 3 45 9 5.67 16.8
## 4 4 48 10 4.8 13.3
## 5 5 52 10 4.3 11.1
## 6 6 58 12 3.83 14.9
## 7 7 42 8 5 5.14
## 8 8 66 13 3.85 4.31
## 9 9 40 8 4.88 6.13
## 10 10 56 11 5 11.8
Estimate the total amount of downtime for all the machines owned by the manufacturer
plant_example2 <- plant_example %>% mutate(t_i_hat = ybar_i*Mi)
plant_example2
## # A tibble: 10 x 6
## Plant Mi mi ybar_i si2 t_i_hat
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 50 10 5.4 11.4 270
## 2 2 65 13 4 10.7 260
## 3 3 45 9 5.67 16.8 255.
## 4 4 48 10 4.8 13.3 230.
## 5 5 52 10 4.3 11.1 224.
## 6 6 58 12 3.83 14.9 222.
## 7 7 42 8 5 5.14 210
## 8 8 66 13 3.85 4.31 254.
## 9 9 40 8 4.88 6.13 195.
## 10 10 56 11 5 11.8 280
plant_example3 <- plant_example2 %>% mutate(t_hat_unb = N*mean(t_i_hat))
plant_example3
## # A tibble: 10 x 7
## Plant Mi mi ybar_i si2 t_i_hat t_hat_unb
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 50 10 5.4 11.4 270 21605.
## 2 2 65 13 4 10.7 260 21605.
## 3 3 45 9 5.67 16.8 255. 21605.
## 4 4 48 10 4.8 13.3 230. 21605.
## 5 5 52 10 4.3 11.1 224. 21605.
## 6 6 58 12 3.83 14.9 222. 21605.
## 7 7 42 8 5 5.14 210 21605.
## 8 8 66 13 3.85 4.31 254. 21605.
## 9 9 40 8 4.88 6.13 195. 21605.
## 10 10 56 11 5 11.8 280 21605.
We then calculate \(S_t^2\)
s_t <- var(plant_example3$t_i_hat)
Then we get variance of T_hat_unb ; that long ass equation for V(t_unb)
var_t_hatunb <- N^2*(1-n/N)*s_t/n + N/n * sum((1 - mi/Mi)*Mi^2*si2/mi)
Then to get the Standard Error, just take the Square root
se <- sqrt(var_t_hatunb)
The 95% ci is then
t_hat_unb <- mean(plant_example3$t_hat_unb)
low <- t_hat_unb - 1.96*se
high <- t_hat_unb + 1.96*se
print(c(low,high))
## [1] 19906.63 23303.99