Clusters of Unequal Size

In general, this is more practical
We will be only discussing estimation of parameters for which an unbiased estimator exists.
For the N clusters that make up the population, they have \(M_1,M_2,...M_N\) SSU’s respectively. (M being the number of people per cluster) (SSU = # of people in said population)
Let \[M_0 = \sum_{i=1}^NM_i\] then \(M_0\) is the number of SSU’s in the population. (aka sum of the population).
The probability that a PSU is in our sample is \[\frac{n}{N}\] Since we are still discussing one-stage cluster sampling, we sample every unit in every selected cluster.

Here, where i represents the cluster(PSU)

\[t_i = \sum_{j=1}^{M_i}y_{ij}\]

An estimator for the population total is then,

Where N = Number of clusters in the population n = Number of chosen samples in the population

\[\hat{t}_{unb} = \frac{N}{n}\sum_{i=1}^nt_i \]

Aside: A potential alternative would be to look at the average of all the SSU’s and then divide them

\[M_0 * \frac{t_1 + t_2 + ... t_n}{M_1 + M_2 + ... M_n}\]

denominator = average for all \(y_{ij}\)’s

It can be shown that the estimated standard error for \(\hat{t}_{unb}\) is \[SE(\hat{t}_{unb}) = N* \sqrt{(1-\frac{n}{N})*\frac{S_t^2}{n}}\]

Where \[s_t^2 = \frac{1}{n-1}*\sum(t_i - \frac{\hat{t}}{N})^2\]

Although this is the same formula as the case of equal cluster sizes, the above SE tends to be larger.

This is because there is more variation between the cluster totals.

Confidence Interval

Thus, a (1-x)*100% CI for Tao, (the population total), is

\[\hat{t}_{unb} +- Z_{\alpha/2}*SE(\hat{t}_{unb})\]

For the mean

The estimator is,

\[\hat{\bar{y}}_{unb} = \frac{1}{M_0}*\hat{t}_{unb}\]

with standard error

\[SE(\hat{\bar{y}}_{unb}) = \frac{1}{M_0}*SE(\hat{t}_{unb})\]

Thus resulting in the CI

\[\hat{\bar{y}}_{unb} +- Z_{\alpha/2}*SE(\hat{\bar{y}}_{unb})\]

Estimating the Population Proportion

Let \(a_i\) be the number of individuals in cluster i having the characteristic of interest.

Across all SSU’s sampled, the sample proportion is

\[\hat{p} = \frac{\sum{a_n}}{\sum{m_n}}\]

It can be shown that

\[\hat{V(\hat{p})} = (\frac{N-n}{N_n \bar{M}^2})*S_p^2\]

where \[S_p^2 = \sum_{i=1}^n(\frac{(a_i - \hat{p}m_i)^2}{n-1})\] (Variances among the clusters)

and \(\bar{M} = \frac{M_0}{N}\) (Average cluster size, where N = number of clusters, \(M_0\) = number of people in each cluster

If we don’t know \(\bar{M}\), we can estimate it with

\[\bar{m} = \frac{1}{n}\sum_{i=1}^nM_i = \hat{\bar{M}}\]

The 95% confidence interval for p is

\[\hat{p} \pm z_{\alpha/2}*\sqrt{\hat{V(\hat{p})}}\]

Week 10 Day 1

Two-Stage cluster Sampling - Recall, we take a random sample of clusters (psu’s) and then, from each, take a random sample of SSU’s

We will take a SRS of \(m_i\) observations from the ith cluster, containing \(M_i\) individuals.

Goal: CIs for Tao, Mu, P

Where as in one-stage clustering, ti is calculated in two-stage clustering, it must be estimated with

Where i is an index for each cluster, and j being the index for an observation within a cluster

\[\hat{t}_i= M_i*\frac{1}{m_i}\sum_{j=1}^1y_{ij}\]

Similar to before

\[\hat{t}_{unb}=\frac{N}{n}\sum_{i=1}^n\hat{t}_i\] So \[\hat{\bar{y}}=\frac{1}{M_0}*\hat{t}_{unb}\]

Aside: Q: What if \(M_0\) is not given -Estimate M_0 \[\hat{M}_0 = N/n * \sum_{i=1}^nm_i\]

It can be shown that the:

\[\hat{V(\hat{t_{unb}})} = N^2 (1-\frac{n}{N})* \frac{S_t^2}{n} +\frac{N}{n} * (\sum_{i=1}^n (1-\frac{m_i}{M_i}) * M_i^2 * \frac{S_i^2}{m_i})\]

where before

\[S_t^2 = \frac{1}{n-1}\sum_{i=1}^n (\hat{t_i}-\frac{\hat{t_{unb}}}{N})^2\]

Where \(S_i^2\) is the sample variance within the ith cluster.

\[S_i^2 = \sum \frac{(y_{ij}-\bar{y}_U)^2}{M_i-1}\] = Population variance within psu i

Note: if N isn’t given, then we must assume it is very large.

If N is very large, but known

\[\hat{V(\hat{t}_{unb})} =(approx)= N^2 \frac{S_t^2}{n} = \hat{V}_{wr}(\hat{t}_{unb})_{\ wr=\ with\ replacement}\]

We take the square root to get the stimated standard error.

\[SE(\hat{t}_{unb})=\sqrt(\hat{V}(\hat{t}_{unb}))\]

Similarly,

\[SE(\hat{\bar{y}}_{unb}) = \frac{1}{M_0}SE(\hat{t}_{unb})\]

The CLT says both \(\hat{\bar{y}}_{unb}\) and \(\hat{t}_{unb}\) are approximately normal.

and to get the CI we use

\[\hat{t}_{unb} \pm Z_{alpha/2} SE*(\hat{t}_{unb})\] and \[\hat{\bar{y}}_{unb} \pm Z_{alpha/2} SE*(\hat{\bar{y}}_{unb})\]

In class example.

Example: A garment manufacturer has 90 plants located throughout the U.S. to estimate the total number of hours that the sewing machines were down for repairs in the past month (clustering by plant). Estimate the total amount of downtime during the past month for all machines owned by the manufacturer with 95% confidence.

library(tidyverse)

N <- 90
n <- 10
Plant <- 1:10
Mi <- c(50,65,45,48,52,58,42,66,40,56)
#number of machines per plant
mi <- c(10,13,9,10,10,12,8,13,8,11)
#number of sampled machines per plant
ybar_i <- c(5.4,4,5.67,4.8,4.3,3.83,5,3.85,4.88,5)
#average downtime for machines in that plant
si2 <- c(11.38,10.67,16.75,13.29,11.12,14.88,5.14,4.31,6.13,11.8)
#variance of each sample per plant

plant_example <- tibble(Plant,Mi,mi,ybar_i,si2)
plant_example

## # A tibble: 10 x 5
##    Plant    Mi    mi ybar_i   si2
##    <int> <dbl> <dbl>  <dbl> <dbl>
##  1     1    50    10   5.4  11.4 
##  2     2    65    13   4    10.7 
##  3     3    45     9   5.67 16.8 
##  4     4    48    10   4.8  13.3 
##  5     5    52    10   4.3  11.1 
##  6     6    58    12   3.83 14.9 
##  7     7    42     8   5     5.14
##  8     8    66    13   3.85  4.31
##  9     9    40     8   4.88  6.13
## 10    10    56    11   5    11.8

Estimate the total amount of downtime for all the machines owned by the manufacturer

plant_example2 <- plant_example %>% mutate(t_i_hat = ybar_i*Mi)
plant_example2

## # A tibble: 10 x 6
##    Plant    Mi    mi ybar_i   si2 t_i_hat
##    <int> <dbl> <dbl>  <dbl> <dbl>   <dbl>
##  1     1    50    10   5.4  11.4     270 
##  2     2    65    13   4    10.7     260 
##  3     3    45     9   5.67 16.8     255.
##  4     4    48    10   4.8  13.3     230.
##  5     5    52    10   4.3  11.1     224.
##  6     6    58    12   3.83 14.9     222.
##  7     7    42     8   5     5.14    210 
##  8     8    66    13   3.85  4.31    254.
##  9     9    40     8   4.88  6.13    195.
## 10    10    56    11   5    11.8     280

plant_example3 <- plant_example2 %>% mutate(t_hat_unb = N*mean(t_i_hat))
plant_example3

## # A tibble: 10 x 7
##    Plant    Mi    mi ybar_i   si2 t_i_hat t_hat_unb
##    <int> <dbl> <dbl>  <dbl> <dbl>   <dbl>     <dbl>
##  1     1    50    10   5.4  11.4     270     21605.
##  2     2    65    13   4    10.7     260     21605.
##  3     3    45     9   5.67 16.8     255.    21605.
##  4     4    48    10   4.8  13.3     230.    21605.
##  5     5    52    10   4.3  11.1     224.    21605.
##  6     6    58    12   3.83 14.9     222.    21605.
##  7     7    42     8   5     5.14    210     21605.
##  8     8    66    13   3.85  4.31    254.    21605.
##  9     9    40     8   4.88  6.13    195.    21605.
## 10    10    56    11   5    11.8     280     21605.

We then calculate \(S_t^2\)

s_t <- var(plant_example3$t_i_hat)

Then we get variance of T_hat_unb ; that long ass equation for V(t_unb)

var_t_hatunb <- N^2*(1-n/N)*s_t/n + N/n * sum((1 - mi/Mi)*Mi^2*si2/mi)

Then to get the Standard Error, just take the Square root

se <- sqrt(var_t_hatunb)

The 95% ci is then

t_hat_unb <- mean(plant_example3$t_hat_unb)
low <- t_hat_unb - 1.96*se
high <- t_hat_unb + 1.96*se
print(c(low,high))

## [1] 19906.63 23303.99

week9day1

Gene Dalusong

March 19, 2019