Personal note: This method is usually applied when there is diversity within the clusters but not between them.
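Recall that cluster sampling is a two-stage sampling procedure, wherein we first select a sample of clusters (the PSUs) and then select units (the SSUs) within each selected cluster.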
The Scottish government wishes to find out about the sleeping patterns of all primary school pupils. The survey will require one-on-one interviews with the pupils and their parents/guardians. It is not feasible to survey all pupils in all schools, nor even all pupils within the selected schools.
We randomly select a number of schools and then randomly select some number of pupils within each of the selected schools.
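The two stages in this example can be sketched in R as follows. This is only an illustration: the number of schools (200), the pupils per school (30), and the sample sizes (10 schools, 5 pupils each) are made-up values, not figures from the actual survey.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical sampling frame: 200 schools with 30 pupils each (assumed values)
n_schools <- 200
pupils_per_school <- 30
frame <- data.frame(
  school = rep(seq_len(n_schools), each = pupils_per_school),
  pupil  = rep(seq_len(pupils_per_school), times = n_schools)
)

# Stage 1: simple random sample of schools (the PSUs)
sampled_schools <- sample(seq_len(n_schools), size = 10)

# Stage 2: simple random sample of 5 pupils (the SSUs) within each sampled school
stage2 <- lapply(sampled_schools, function(s) {
  pupils <- frame[frame$school == s, ]
  pupils[sample(nrow(pupils), size = 5), ]
})
two_stage_sample <- do.call(rbind, stage2)

nrow(two_stage_sample)  # 10 schools x 5 pupils each
```

In one-stage cluster sampling, stage 2 would simply keep every pupil in each sampled school.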
Notation (page 169): we need one index for the PSU and one for the SSU. Here \(y_{ij}\) represents the \(j^{th}\) SSU (observational unit) within the \(i^{th}\) PSU (cluster).
If a cluster is among the \(n\) clusters chosen, then all of the SSUs within that cluster are measured, so the \(i^{th}\) cluster size \(M_i\) equals the sample size \(m_i\) taken from the \(i^{th}\) cluster: \(M_i = m_i\).
To ease our way into this, suppose every cluster is the same size. That means \[M_i = m_i = M \quad \text{for all } i.\]
One-Stage Clustering with Equal Cluster Sizes (which implies \(M_i = m_i\))
We proceed as in stratified sampling, estimating the population total and the population mean.
For the Total, start with calculating the total for each cluster sample.
(Remember: the total is the sum of the variable being measured over every unit in the cluster.) We will use \(t_1, t_2, \ldots, t_n\).
We don’t need to write \(\hat{t}_i\) because we’re calculating the actual total of each sampled cluster, not estimating it.
The estimated total for the actual population is
\[\hat{t} = \frac{N}{n}\sum_{i=1}^{n} t_i\]
Since we don’t know which clusters will actually be selected, \(t_i\) is still considered a random variable, and each cluster should be independent.
The CLT gives us approximate normality.
The standard error of \(\hat{t}\) is
\[SE(\hat{t}) = N \sqrt{\left(1-\frac{n}{N}\right)\frac{s_t^2}{n}}\]
where \[s_t^2 = \frac{1}{n-1} \sum_{i=1}^{n}\left(t_i - \frac{\hat{t}}{N}\right)^2\]
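It may help to note why \(\hat{t}/N\) appears inside the sum. From the definition of \(\hat{t}\), it is simply the sample mean of the cluster totals, so \(s_t^2\) is the ordinary sample variance of \(t_1, \ldots, t_n\). (The symbol \(\bar{t}\) is introduced here only for illustration.)
\[\frac{\hat{t}}{N} = \frac{1}{n}\sum_{i=1}^{n} t_i = \bar{t}, \qquad s_t^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(t_i - \bar{t}\right)^2\]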
The approximate \(100(1-\alpha)\%\) CI is
\[\hat{t} \pm z_{\alpha/2}\, SE(\hat{t})\]
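As a rough sketch, the estimator, its standard error, and the confidence interval can be wrapped in one small R function. The function name `cluster_total_ci` and the toy cluster totals below are hypothetical, chosen only to illustrate the formulas above.

```r
# Sketch: one-stage cluster sampling estimate of a population total.
# t    : vector of observed cluster totals t_1, ..., t_n
# N    : number of clusters in the population
# conf : confidence level for the normal-approximation interval
cluster_total_ci <- function(t, N, conf = 0.95) {
  n     <- length(t)
  t_hat <- N / n * sum(t)                     # estimated population total
  s2_t  <- sum((t - t_hat / N)^2) / (n - 1)   # sample variance of the cluster totals
  se    <- N * sqrt((1 - n / N) * s2_t / n)   # SE with finite population correction
  z     <- qnorm(1 - (1 - conf) / 2)          # z_{alpha/2}
  c(estimate = t_hat, se = se,
    lower = t_hat - z * se, upper = t_hat + z * se)
}

# Toy data (made up): totals from n = 4 sampled clusters out of N = 20
res <- cluster_total_ci(t = c(10, 12, 8, 14), N = 20)
res
```

Using `qnorm()` instead of a hard-coded 1.96 lets the same function produce intervals at other confidence levels.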
For estimating the population mean, we use
\[\hat{\bar{y}} = \frac{\hat{t}}{MN}\]
Every cluster (in this oversimplified example) is of size \(M\).
In other words, \(MN\) is the population size, so you divide the estimated total by the population size to get the mean.
The three recurring things we’ve solved for in this class are the distribution, the standard error, and then the confidence interval.
Calculating Variance
\[V(\hat{t}) = N^2 \left(1-\frac{n}{N}\right) \frac{1}{n} \cdot \frac{1}{n-1} \sum_{i=1}^{n}\left(t_i-\frac{\hat{t}}{N}\right)^2\]
\[ = N^2\left(1-\frac{n}{N}\right)\frac{s_t^2}{n}\]
\[V(\hat{\bar{y}}) = V\left(\frac{1}{MN}\hat{t}\right) = \frac{1}{M^2N^2} V(\hat{t}) = \frac{1}{M^2}\left(1-\frac{n}{N}\right)\frac{s_t^2}{n}\]
\[SE(\hat{\bar{y}}) = \frac{1}{M}\sqrt{\left(1-\frac{n}{N}\right)\frac{s_t^2}{n}}\]
Therefore, an approximate \(100(1-\alpha)\%\) CI for \(\mu\) is
\[\hat{\bar{y}} \pm z_{\alpha/2}\, SE(\hat{\bar{y}})\]
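Because the estimated mean and its standard error are both obtained from the corresponding quantities for the total by dividing by \(MN\), the interval for the mean is just the interval for the total rescaled by that factor:
\[\hat{\bar{y}} \pm z_{\alpha/2}\, SE(\hat{\bar{y}}) = \frac{1}{MN}\left[\hat{t} \pm z_{\alpha/2}\, SE(\hat{t})\right]\]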
A sociologist wants to estimate the total income of all residents in a certain small city. No list of resident adults is available. [If one were available, an SRS would be appropriate.]
The city is marked off into \(N = 415\) rectangular blocks, and she will sample \(n = 25\) of these blocks.
Suppose there are \(M = 6\) adult residents within each of these blocks.
library(tidyverse)
# Simulated incomes for 25 sampled blocks x 6 adults each
# (no seed is set, so the values differ from run to run)
income <- rnorm(n = 150, mean = 35000, sd = 8000)
# Block labels 1..25 repeated 6 times, so each block gets 6 adults
block <- rep(1:25, times = 6)
N <- 415  # number of blocks in the city
n <- 25   # number of blocks sampled
example_table <- tibble(block,income)
example_table
## # A tibble: 150 x 2
## block income
## <int> <dbl>
## 1 1 32911.
## 2 2 44377.
## 3 3 45711.
## 4 4 37343.
## 5 5 33574.
## 6 6 26355.
## 7 7 40809.
## 8 8 41689.
## 9 9 46379.
## 10 10 41547.
## # ... with 140 more rows
new_table <- example_table %>% group_by(block)
new_table
## # A tibble: 150 x 2
## # Groups: block [25]
## block income
## <int> <dbl>
## 1 1 32911.
## 2 2 44377.
## 3 3 45711.
## 4 4 37343.
## 5 5 33574.
## 6 6 26355.
## 7 7 40809.
## 8 8 41689.
## 9 9 46379.
## 10 10 41547.
## # ... with 140 more rows
next_table <- summarise(new_table, t_i = sum(income))
t_hat <- N * 1/n * sum(next_table$t_i)
t_hat
## [1] 87606053
# s2_t is the sample variance of the block totals (s_t^2 in the notes)
s2_t <- 1/(n-1)*sum((next_table$t_i - t_hat/N)^2)
SE <- N * sqrt((1-n/N)*s2_t/n)
SE
## [1] 1394938
#CI calculation
low <- t_hat - 1.96*SE
high <- t_hat + 1.96*SE
print(c(low,high))
## [1] 84871974 90340132
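The same output can be converted into an estimate of the mean income per adult by dividing by the population size \(MN\). The sketch below is self-contained: it plugs in the values of `t_hat` and `SE` printed above rather than re-simulating the data, and the names `y_bar_hat`, `SE_mean`, and `ci_mean` are introduced here only for illustration.

```r
N <- 415          # number of blocks in the city
M <- 6            # adults per block (equal cluster sizes)
t_hat <- 87606053 # estimated total income, copied from the output above
SE <- 1394938     # standard error of t_hat, copied from the output above

# Estimated mean income per adult and its standard error
y_bar_hat <- t_hat / (M * N)
SE_mean <- SE / (M * N)

# Approximate 95% CI for the mean
ci_mean <- c(y_bar_hat - 1.96 * SE_mean, y_bar_hat + 1.96 * SE_mean)
print(c(y_bar_hat, SE_mean))
print(ci_mean)
```

The resulting interval contains 35000, the mean used to simulate the incomes, as one would hope.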