Cluster Sampling

Personal note: This method is usually applied when there is diversity within each cluster and little difference between clusters.

Introduction

Recall that cluster sampling is a two-stage sampling procedure, wherein we:

    1. Break the population into clusters (blocks).
    2. Take a Simple Random Sample (SRS) of these clusters.
    • Note: we take an SRS of the clusters themselves, in contrast to a Stratified Random Sample, where we took an SRS within each stratum.
    3. Then either:
      a. Measure all units in the selected clusters (one-stage cluster sample), or
      b. Take a simple random sample from each of the selected clusters (two-stage cluster sample).
      • The units measured are the secondary sampling units (SSUs).
  • Definitions
    • PSU: Primary Sampling Unit (the cluster)
    • SSU: Secondary Sampling Unit (the observational unit within a cluster)

Example (2-stage cluster sample)

The Scottish government wishes to find out about the sleeping patterns of all primary school pupils. The survey will require one-on-one interviews with the pupils and their parents/guardians. It is not feasible to survey all pupils in all schools, nor even all pupils within the selected schools.

We randomly select a number of schools and then randomly select some number of pupils within each of the selected schools.
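A minimal sketch of this two-stage selection in base R. The sampling frame is invented for illustration: 40 hypothetical schools with 30 pupils each, and the stage sizes (5 schools, 10 pupils per school) are arbitrary assumptions, not figures from the survey.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical sampling frame: 40 schools (PSUs), 30 pupils (SSUs) in each.
pupils <- data.frame(
  school = rep(1:40, each = 30),
  pupil  = rep(1:30, times = 40)
)

n_schools <- 5   # stage 1: number of schools to select (assumed)
m_pupils  <- 10  # stage 2: pupils to select within each chosen school (assumed)

# Stage 1: simple random sample of schools.
chosen_schools <- sample(unique(pupils$school), size = n_schools)

# Stage 2: simple random sample of pupils within each chosen school.
two_stage <- do.call(rbind, lapply(chosen_schools, function(s) {
  in_school <- pupils[pupils$school == s, ]
  in_school[sample(nrow(in_school), size = m_pupils), ]
}))

# A one-stage sample would instead keep every pupil in the chosen schools.
one_stage <- pupils[pupils$school %in% chosen_schools, ]

table(two_stage$school)  # number of sampled pupils per chosen school
nrow(one_stage)          # all pupils in the chosen schools
```

Only the second stage differs between the two designs: the same schools are chosen, but the two-stage sample interviews a subset of pupils in each.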

  • Advantages: The two primary reasons we employ cluster sampling (particularly for populations spread over a large geographical area) are:
    • Feasibility
      • Oftentimes we don’t have a comprehensive list of observational units.
    • Economics
      • Two major sources of cost are almost always reduced with cluster sampling:
        1. Listing costs
        2. Travel costs
  • Disadvantages: The standard errors of estimators are typically larger than those from other sampling methods.
    • This tends to be the case because observational units within the same cluster tend to be rather homogeneous, so they carry redundant information. For example, 100 observations from a cluster sample may give only as much information as a simple random sample of 40.
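The loss of precision can be seen in a small simulation. This is only a sketch under invented assumptions: a hypothetical population of 200 clusters of 10 units, where each unit's value is a shared cluster effect plus a small amount of individual noise, so units in the same cluster look alike. It compares the spread of the sample mean under a one-stage cluster sample with that under a simple random sample of the same number of units.

```r
set.seed(2)  # arbitrary seed

# Hypothetical population: N clusters of M units each. Units share a cluster
# effect (sd = 10) plus small individual noise (sd = 3), so values within a
# cluster are similar.
N <- 200
M <- 10
cluster <- rep(1:N, each = M)
y <- rep(rnorm(N, mean = 50, sd = 10), each = M) + rnorm(N * M, sd = 3)

n    <- 10    # clusters per one-stage cluster sample, giving n * M = 100 units
reps <- 2000  # number of repeated samples

# Sample mean from a one-stage cluster sample: all units in n random clusters.
cluster_means <- replicate(reps, mean(y[cluster %in% sample(N, n)]))

# Sample mean from a simple random sample with the same number of units.
srs_means <- replicate(reps, mean(y[sample(N * M, n * M)]))

# Empirical standard errors of the two estimators.
c(cluster = sd(cluster_means), srs = sd(srs_means))
```

With this setup the cluster-sample mean varies noticeably more than the SRS mean, even though both use 100 observations. The size of the gap depends entirely on how homogeneous the clusters are assumed to be.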

Notation (page 169): we need an index for both the PSU and the SSU. Here \(y_{ij}\) represents the \(j^{th}\) SSU (observational unit) within the \(i^{th}\) PSU (cluster).

One-Stage Cluster Sampling

If a cluster is among the \(n\) clusters chosen, then all the SSUs within that cluster are measured, so \(M_i = m_i\), where \(M_i\) is the size of the \(i^{th}\) cluster and \(m_i\) is the sample size taken from the \(i^{th}\) cluster.

To ease our way into this, suppose every cluster is of the same size. That means \[M_i = m_i = M \quad \text{for all } i.\]

Week 8 day 2

One-stage cluster sampling with equal cluster sizes (which implies \(M_i = m_i = M\))

We do the same as Stratified Sampling, where we find the mean, total, and

For the Total, start with calculating the total for each cluster sample.

(Remember: a cluster's total is the sum of the variable I’m measuring over every unit in that cluster.) We denote the totals of the sampled clusters by \(t_1, t_2, \ldots, t_n\).

We don’t need to use \(\hat{t}\) because we’re calculating actual totals and not estimating per cluster.

The estimated total for the actual population is

\[\hat{t} = N \cdot \frac{1}{n} \sum_{i=1}^n t_i\]

Since we don’t know which clusters will actually be selected, \(t_i\) is still considered a random variable, and each cluster should be independent.

The CLT gives us approximate normality.

The standard error of \(\hat{t}\) is

\[SE(\hat{t}) = N \sqrt{\left(1-\frac{n}{N}\right)\frac{s_t^2}{n}}\]

where \[s_t^2 = \frac{1}{n-1} \sum_{i=1}^n \left(t_i - \frac{\hat{t}}{N}\right)^2\]

An approximate \((1-\alpha) \cdot 100\%\) confidence interval for the population total is

\[\hat{t} \pm z_{\alpha/2} \, SE(\hat{t})\]
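The estimate, standard error, and interval for the total can be collected into one small function. This is a sketch, not code from the course: the function name `cluster_total_ci` and the five cluster totals are made up for illustration.

```r
# Estimate a population total from one-stage cluster totals.
#   t_i   : totals of the n sampled clusters
#   N     : number of clusters in the population
#   level : confidence level (0.95 by default)
cluster_total_ci <- function(t_i, N, level = 0.95) {
  n     <- length(t_i)
  t_hat <- N * mean(t_i)                     # N * (1/n) * sum(t_i)
  s2_t  <- var(t_i)                          # (1/(n-1)) * sum((t_i - t_hat/N)^2)
  se    <- N * sqrt((1 - n / N) * s2_t / n)  # includes the (1 - n/N) correction
  z     <- qnorm(1 - (1 - level) / 2)        # z_{alpha/2}
  c(t_hat = t_hat, se = se, lower = t_hat - z * se, upper = t_hat + z * se)
}

# Made-up totals for n = 5 sampled clusters out of N = 50.
res_total <- cluster_total_ci(t_i = c(12, 15, 9, 14, 10), N = 50)
res_total
```

Note that `var(t_i)` is exactly \(s_t^2\), because \(\hat{t}/N\) is the mean of the sampled cluster totals.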

For estimating the population mean, we use

\[\hat{\bar{y}} = \frac{1}{M \cdot N} \, \hat{t}\]

Every cluster (in this oversimplified example) is of size \(M\).

In other words, \(M \cdot N\) is the population size, so dividing the estimated total by the population size gives the mean.

The three recurring things we’ve solved for in this class are the distribution and the standard error, from which we then get the confidence interval.

Calculating Variance

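\[V(\hat{t}) = N^2 \cdot \frac{\frac{1}{n-1} \sum_{i=1}^n \left(t_i-\frac{1}{N}\hat{t}\right)^2}{n} \cdot \left(1-\frac{n}{N}\right)\]

\[ = N^2\left(1-\frac{n}{N}\right)\frac{s_t^2}{n}\]

For the mean, the constant \(\frac{1}{M \cdot N}\) comes out of the variance squared:

\[V(\hat{\bar{y}}) = V\left(\frac{1}{M \cdot N} \, \hat{t}\right) = \frac{1}{M^2 N^2} V(\hat{t}) = \frac{1}{M^2}\left(1-\frac{n}{N}\right)\frac{s_t^2}{n}\]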

\[SE(\hat{\bar{y}}) = \frac{1}{M}\sqrt{\left(1-\frac{n}{N}\right)\frac{s_t^2}{n}}\]

Therefore, an approximate \((1-\alpha) \cdot 100\%\) CI for the population mean \(\mu\) is

\[\hat{\bar{y}} \pm z_{\alpha/2} \, SE(\hat{\bar{y}})\]
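The mean estimator and its interval can be sketched the same way for equal cluster sizes. The function name `cluster_mean_ci`, the cluster totals, and the cluster size \(M = 4\) below are all made up for illustration.

```r
# Estimate the population mean from one-stage cluster totals,
# assuming every cluster has the same size M.
cluster_mean_ci <- function(t_i, N, M, level = 0.95) {
  n     <- length(t_i)
  y_hat <- N * mean(t_i) / (M * N)                      # t_hat / (M * N)
  se    <- (1 / M) * sqrt((1 - n / N) * var(t_i) / n)   # SE of the mean estimate
  z     <- qnorm(1 - (1 - level) / 2)                   # z_{alpha/2}
  c(y_hat = y_hat, se = se, lower = y_hat - z * se, upper = y_hat + z * se)
}

# Made-up totals for n = 5 sampled clusters out of N = 50, each of size M = 4.
res_mean <- cluster_mean_ci(t_i = c(12, 15, 9, 14, 10), N = 50, M = 4)
res_mean
```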

Example:

A sociologist wants to estimate the total income of all adult residents in a certain small city. No list of resident adults is available. [If one were, an SRS would be appropriate.]

The city is marked off into N=415 rectangular blocks, and she will sample n=25 of these blocks.

Suppose there are M=6 adult residents within each of these blocks.

library(tidyverse)

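# Simulated incomes for the 25 sampled blocks, M = 6 adults each (150 adults).
# No seed is set, so the output shown below comes from one particular run.
income <- rnorm(n = 150, mean = 35000, sd = 8000)

# Block label for each adult: 1..25 repeated 6 times.
block <- rep(1:25, times = 6)

N <- 415  # number of blocks (clusters) in the city
n <- 25   # number of blocks sampled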

example_table <- tibble(block,income)
example_table
## # A tibble: 150 x 2
##    block income
##    <int>  <dbl>
##  1     1 32911.
##  2     2 44377.
##  3     3 45711.
##  4     4 37343.
##  5     5 33574.
##  6     6 26355.
##  7     7 40809.
##  8     8 41689.
##  9     9 46379.
## 10    10 41547.
## # ... with 140 more rows
new_table <- example_table %>% group_by(block)
new_table
## # A tibble: 150 x 2
## # Groups:   block [25]
##    block income
##    <int>  <dbl>
##  1     1 32911.
##  2     2 44377.
##  3     3 45711.
##  4     4 37343.
##  5     5 33574.
##  6     6 26355.
##  7     7 40809.
##  8     8 41689.
##  9     9 46379.
## 10    10 41547.
## # ... with 140 more rows
# Total income within each sampled block: t_i
next_table <- summarise(new_table, t_i = sum(income))

# Estimated city-wide total: N * (1/n) * sum(t_i)
t_hat <- N * 1/n * sum(next_table$t_i)
t_hat
## [1] 87606053
# Sample variance of the block totals, s_t^2 (a variance, not a standard deviation)
s2_t <- 1/(n-1)*sum((next_table$t_i - t_hat/N)^2)
SE <- N * sqrt((1-n/N)*s2_t/n)
SE
## [1] 1394938
# Approximate 95% CI: t_hat +/- 1.96 * SE

low <- t_hat - 1.96*SE
high <- t_hat + 1.96*SE

print(c(low,high))
## [1] 84871974 90340132
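The same run can be turned into an estimate of the mean income per adult by dividing by the population size \(M \cdot N\), following the mean formulas earlier. The sketch below plugs in the values printed above, which come from that one simulated run, so the result is only as meaningful as the simulation itself.

```r
# Values copied from the printed output of the run above.
t_hat <- 87606053  # estimated total income
SE    <- 1394938   # standard error of the estimated total
N     <- 415       # blocks in the city
M     <- 6         # adults per block (assumed equal for every block)

pop_size <- M * N             # number of adults in the city
y_hat    <- t_hat / pop_size  # estimated mean income per adult
SE_mean  <- SE / pop_size     # SE(y_hat) = SE(t_hat) / (M * N)

# Approximate 95% confidence interval for the mean income.
ci_mean <- c(low = y_hat - 1.96 * SE_mean, high = y_hat + 1.96 * SE_mean)
round(c(y_hat = y_hat, SE_mean = SE_mean, ci_mean))
```

The estimated mean should land near the 35000 used to simulate the incomes, which is a quick sanity check on the total computed above.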