Stat 421: Survey Sampling

Week 6

Jae-kwang Kim

2/18/2020

Review

So far, we have studied
- Simple random sampling design
- Large-sample inference under SRS
- Sample size determination
This week, we will study
- Systematic sampling
- Statistical properties
- Efficiency issues

Systematic sampling

Introduction

Like SRS, systematic sampling (SYS) is a basic sampling design for selecting a sample of size \(n\) from a population of size \(N\)
A 1-in-\(k\) systematic sample is obtained by randomly selecting one sampling unit from the first \(k\) sampling units in the sampling frame, and then taking every \(k\)-th sampling unit thereafter.
SRS estimators are used as approximate estimators under the SYS sampling distribution

Sampling Procedures for SYS

Have a frame, or list of \(N\) sampling units
Determine sampling interval, \(k\)
- If \(N/n\) is an integer, \(k=N/n\)
- If \(N/n\) is not an integer, \(k\) is the smallest integer greater than \(N/n\)
Select a random number \(R\) from the set \(\{1,2, \cdots, k \}\)
Take \(R\)-th unit, then every \(k\)-th unit thereafter

SYS example: Stat 421 Class

There are 51 students enrolled in Stat 421
List of last names of the students

##  [1] "Anderson"     "Arias"        "Banta"        "Bao"         
##  [5] "Bretoi"       "Buske"        "Collins"      "Deblois"     
##  [9] "Dickey"       "Duran"        "Ervin"        "Fee"         
## [13] "Gingrich"     "Hassan"       "Heron"        "Hu"          
## [17] "Hu"           "Humphries"    "Huxford"      "Kallis"      
## [21] "Kassmeyer"    "Kolars"       "Liang"        "Lu"          
## [25] "Mendoza"      "Miranda"      "Ou"           "Palmer"      
## [29] "Pei"          "Ray"          "Ruedy"        "Saaranen"    
## [33] "Saathoff"     "Scheideman"   "Schwenneker"  "Shi"         
## [37] "Steyer"       "Stocker"      "Sundberg"     "Thada"       
## [41] "Tian"         "Tittler"      "Vander Werff" "Von Behren"  
## [45] "Voss"         "Weaver"       "Williams"     "Won"         
## [49] "Yang"         "Zaino"        "Zhu"

Class Example 1

From \(N=51\) students, wish to select a sample of size \(n=3\) by systematic sampling
Sampling interval \(k\): the smallest integer \(\ge N/n\)
Select a random number \(R\) from \(\{1, \cdots, k\}\) and take every \(k\)th unit thereafter.

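# pop is the vector of the 51 last names listed above
k <- 51/3                          # sampling interval
r <- sample(1:k, 1)                # random start in 1..k
sys.samp <- seq(r, r+k*(3-1), k)   # units r, r+k, r+2k
sys.samp;  pop[sys.samp]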

## [1] 17 34 51

## [1] "Hu"         "Scheideman" "Zhu"

Class Example 2

Now, suppose that we wish to select a sample of size \(n=6\) by systematic sampling
In this case, \(51/6 = 8.5\) is not an integer, so \(k=\lceil 51/6 \rceil = 9\).

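pn <- 51                       # population size N
sn <- 6                        # planned sample size n
k <- ceiling(pn/sn)            # sampling interval
r <- sample(1:k, 1)            # random start in 1..k
sys.samp <- seq(r, pn, k)      # stop at N, so the index never runs past the frame
sys.samp; pop[sys.samp]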

## [1]  1 10 19 28 37 46

## [1] "Anderson" "Duran"    "Huxford"  "Palmer"   "Steyer"   "Weaver"

Class Example 2 (Cont’d)

Arrange population in rows of length \(k=9\).
Each column is a different possible sample. The zeros pad the last row, so the samples in columns 7 to 9 contain only 5 units.

pop2 <- matrix(c(1:51, 0, 0,0), nrow = 9, ncol = 6) 
t(pop2)

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,]    1    2    3    4    5    6    7    8    9
## [2,]   10   11   12   13   14   15   16   17   18
## [3,]   19   20   21   22   23   24   25   26   27
## [4,]   28   29   30   31   32   33   34   35   36
## [5,]   37   38   39   40   41   42   43   44   45
## [6,]   46   47   48   49   50   51    0    0    0

SYS property (general)

Unlike SRS, some subsets of size \(n\) from the population of \(N\) elements have no chance of being selected given a frame
This is legitimate (and will happen with many other designs discussed in this class)
It can help us improve precision if we do a good job of sorting the list (more later on how)

SYS property (general)

Only 1 random act: selecting R
- After selecting the first SU, all other SUs to be included in the sample are predetermined given the ordering of the sampling frame
- This amounts to selecting a cluster sample of 1 cluster. (Here, cluster = group)
Plan for sample size of \(n\), but actual sample size may be less than \(n\)

Advanced topic

Selecting a SYS sample with exact sample size \(n\)

Idea: Instead of using integer \(k\), use fractional sampling interval \(k\) (i.e. \(k\) can be non-integer valued)
Steps:

Compute \(k=N/n\)
Generate \(R_0\) from Uniform \((0,k)\)
Compute \(R_{i+1} = R_i + k\) for each \(i=0, 1, \cdots, n-2\), which gives \(R_0, R_1, \cdots, R_{n-1}\).
For each \(i=0, 1, \cdots, n-1\), select the unit whose index is the smallest integer \(\ge R_i\)

Back to Class Example

SYS of size \(n=6\) from \(N=51\). In this case, \(k = 51/6 = 8.5\)

pn <- 51 ; sn <- 6; k <- pn/sn
r <- runif(1, min=0, max = k)
sys.samp1 <- seq(r, r+k*(sn-1), k)
sys.samp2 <- ceiling(sys.samp1) 
sys.samp1; sys.samp2

## [1]  4.47316 12.97316 21.47316 29.97316 38.47316 46.97316

## [1]  5 13 22 30 39 47

pop[sys.samp2]

## [1] "Bretoi"   "Gingrich" "Kolars"   "Ray"      "Sundberg" "Williams"
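
The fractional-interval steps above can be checked by repetition. The sketch below is only an illustration: the helper name `frac_sys` and the seed are assumptions, not part of the course code. It repeats the selection many times and confirms that each sample has exactly \(n\) distinct units, all inside the frame.

```r
# Check that the fractional-interval method always returns n distinct units in 1..N.
set.seed(421)                              # arbitrary seed, for reproducibility only
N <- 51; n <- 6

# frac_sys is a hypothetical helper that wraps the steps listed above
frac_sys <- function(N, n) {
  k <- N / n                               # fractional sampling interval (8.5 here)
  r <- runif(1, min = 0, max = k)          # random start R_0 in (0, k)
  ceiling(r + k * (0:(n - 1)))             # smallest integer >= R_i, i = 0, ..., n-1
}

draws <- replicate(2000, frac_sys(N, n))   # n x 2000 matrix of selected indexes
all(apply(draws, 2, function(s) length(unique(s)) == n))  # exactly n distinct units?
range(draws)                               # smallest and largest selected index
```

Because \(R_0 < k\), the last value \(R_0 + k(n-1)\) is below \(kn = N\), so no selected index can exceed \(N\).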

Statistical Property of SYS

SYS property

Number of possible SYS samples of size \(n\) is \(k\)
The population is decomposed into \(k\) disjoint samples \[ U = U_1 \cup \cdots \cup U_k \] where \(\left| U_g \right| = n\) if \(g + k(n-1) \le N\) and \(\left| U_g \right| = n-1\) otherwise.
Sample selection \[ \mathcal{S} = U_g, \ \ \ \ g \in \{ 1, \cdots, k\} \] with probability \(1/k\)
Inclusion probability for unit \(i \in U_g\): \[\pi_i = P( i \in \mathcal{S}) = P(\mathcal{S} = U_g) = 1/k \]
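
As a quick check of the decomposition above, the sketch below lists the \(k\) possible samples for the class example (\(N=51\), \(n=6\), integer \(k=9\)). The object names `groups` and `sizes` are illustrative.

```r
# Enumerate the k possible 1-in-k systematic samples for N = 51, n = 6.
N <- 51; n <- 6
k <- ceiling(N / n)                        # sampling interval, 9

# U_g = {g, g + k, g + 2k, ...}, truncated at N
groups <- lapply(1:k, function(g) seq(g, N, by = k))
sizes  <- sapply(groups, length)           # |U_g| for g = 1, ..., k

sizes                                      # n for g <= 6, n - 1 for g = 7, 8, 9
all(sort(unlist(groups)) == 1:N)           # the U_g are disjoint and cover U
```

Each \(U_g\) is selected with probability \(1/k = 1/9\), so every unit has inclusion probability \(1/9\) even though the realized sample size is 6 for some starts and 5 for others.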

SYS property

Because only the starting SU of a SYS sample is randomized, the variance of the sampling distribution cannot be estimated directly from a single SYS sample
- Need multiple systematic samples to estimate variance

Estimation for SYS

Without replicate SYS samples, an estimation approach for SYS is to use SRS estimators for population parameters
That is, use
- Estimate \(\bar{Y}_U\) with \(\bar{y}\) and \(\hat{V} ( \bar{y}) = \frac{1}{n} \left( 1- \frac{n}{N}\right) s^2\)
- Estimate \(T=N \bar{Y}_U\) with \(\hat{T} = N \bar{y}\) and \(\hat{V} ( \hat{T}) = N^2\hat{V} ( \bar{y})\)
- Estimate \(P\) with \(\hat{P}\) and \(\hat{V} ( \hat{P}) = \frac{1}{n-1} \left( 1- \frac{n}{N}\right)\hat{P} (1- \hat{P})\)
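
The SRS formulas above can be collected in one small function. This is a minimal sketch: the name `srs_estimates` and the data values are made up for illustration and do not come from the course files.

```r
# SRS-formula estimates applied to a sample y drawn from a population of size N.
srs_estimates <- function(y, N) {
  n     <- length(y)
  ybar  <- mean(y)
  v_bar <- (1 / n) * (1 - n / N) * var(y)  # var() is s^2 with divisor n - 1
  list(mean = ybar, v_mean = v_bar,
       total = N * ybar, v_total = N^2 * v_bar)
}

# Made-up values for illustration only
y   <- c(12, 15, 9, 20, 14, 10)
est <- srs_estimates(y, N = 51)
est$mean;  sqrt(est$v_mean)                # estimate of the mean and its SE
est$total; sqrt(est$v_total)               # estimate of the total and its SE

# For a 0/1 variable, the same function reproduces the proportion formula,
# because s^2 = n / (n - 1) * p_hat * (1 - p_hat)
y01 <- c(1, 0, 1, 1, 0, 0)
srs_estimates(y01, N = 51)$v_mean
```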

Sampling distribution of the estimator under a SYS design

The sampling distribution is the distribution of [estimator] over all possible samples from [specify design]
Design
- SYS samples of size n = [sample size] from N = [pop size] SUs using a frame with [specify ordering]

Systematic sampling summary

What do we determine (under the control of the investigator)?
- Ordering of the elements
- Sampling interval: determined by the sample size \(n\)
What is random?
- The random start R
- Which of the k possible samples we observe
- Observe 1 of k possible samples with probability 1/k

Systematic sampling summary

Variance of an estimator based on SYS
- Variability among the estimators for the k possible samples
Variance estimation
- Because we only observe one of the k samples, we cannot estimate the sampling variance unbiasedly in general
- Use SRS formula as an approximation

Systematic sampling summary

SRS Approximation
- Parameter estimates
- In general, SRS estimators used with a SYS sample have little bias (when \(N/n\) is not an integer) or no bias (when \(N/n\) is an integer) under SYS
Variance estimates
- The properties of SRS variance estimators used with a SYS sample depend heavily on the ordering of the sampling frame
- In some cases, SRS variance formula offers a very poor approximation

Ordering and the efficiency of SYS

Ordering the frame for SYS

Impacts of frame ordering for SYS compared to SRS of the same size \(n\)

For a randomly ordered frame, the variance of the sample mean under SYS is approximately equal to its variance under SRS
If the frame is ordered with a periodicity in \(y\) that matches the sampling interval \(k\), then the variance of the sample mean under SYS is higher than under SRS
If the frame is ordered in relation to a variable \(x\) that is correlated with \(y\), then the variance of the sample mean under SYS is smaller than under SRS
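
The three cases can be illustrated by computing the exact SYS variance, that is, the variability of the sample mean across the \(k\) possible samples. The population below is simulated and entirely hypothetical (\(N=1000\), \(k=10\)), and the helper names `sys_var` and `srs_var` are assumptions made for this sketch.

```r
set.seed(1)                                # arbitrary seed
N <- 1000; k <- 10; n <- N / k

# Exact design variance of the SYS sample mean: spread of the k possible sample means
sys_var <- function(y, k) {
  means <- sapply(1:k, function(g) mean(y[seq(g, length(y), by = k)]))
  mean((means - mean(y))^2)
}
# Exact design variance of the SRS sample mean (without replacement)
srs_var <- function(y, n) (1 - n / length(y)) * var(y) / n

x <- runif(N)                              # hypothetical auxiliary variable
y <- 10 * x + rnorm(N)                     # hypothetical y, correlated with x

y_random   <- sample(y)                    # frame in random order
y_sorted   <- y[order(x)]                  # frame sorted by x
y_periodic <- rep(1:k, times = n) + rnorm(N, sd = 0.1)  # period equal to k

c(srs = srs_var(y, n),
  sys_random = sys_var(y_random, k),
  sys_sorted = sys_var(y_sorted, k))
c(srs = srs_var(y_periodic, n),
  sys_periodic = sys_var(y_periodic, k))
```

For a single random ordering the SYS variance fluctuates around the SRS value; the two agree on average over random orderings. The sorted and periodic frames show the two departures described above.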

Order of sampling frame #1

Frame is in random order

SYS acts very much like SRS
SRS variance formula is a good approximation to the true sampling distribution variance under SYS

Order of sampling frame #2

Frame is ordered with a periodicity in \(y\)

May result in very poor estimates, especially when the sampling interval \(k\) matches the periodicity of the frame.
SRS formula underestimates sampling variance. That is, the estimate is actually more variable than indicated by the calculated SE.

Example for ordering with periodicity in y

We are interested in estimating the proportion of Iowa land in roads
Iowa was laid out using the Public Land Survey system with 1 mi × 1 mi “sections” (squares on the graphic on the next page)
North-south and east-west roads typically run along section boundaries

Public Land Survey system

Example for ordering with periodicity in y

Suppose we have a grid of roads at 1 mi intervals
Take a systematic sample of points on the land with 5 mi sampling interval
Grid of points with 5 mi to east/west and north/south between points

Iowa Roads Example

Grid of points with 5 mi intervals
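Because the 5 mi interval is a multiple of the 1 mi road spacing, every sample point sits in the same position relative to the road grid
If the random start \(R\) falls on a road, every point in the sample lands on a road
If \(R\) does not fall on a road, no point in the sample lands on a road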
Sampling variance (of estimator across samples) is larger than calculated SE indicates
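
A small simulation makes the point concrete. Everything in this sketch is an assumption chosen for illustration: a 100 mi × 100 mi square, roads along every whole-mile line, and a road width `w` of 0.01 mi.

```r
set.seed(2020)                             # arbitrary seed
w <- 0.01; side <- 100; step <- 5          # assumed road width, region size, interval

# A point is on a road if it lies within w of a whole-mile line (either direction)
on_road <- function(x, y, w) (x %% 1 < w) | (y %% 1 < w)

one_sys_estimate <- function() {
  start <- runif(2, min = 0, max = step)   # random start (east, north)
  grid  <- expand.grid(x = seq(start[1], side, by = step),
                       y = seq(start[2], side, by = step))
  hits  <- as.numeric(on_road(grid$x, grid$y, w))
  c(p_hat = mean(hits), s2 = var(hits))    # estimate and sample variance s^2
}

est <- replicate(5000, one_sys_estimate())
table(est["p_hat", ])                      # estimates take only the values 0 and 1
mean(est["p_hat", ])                       # compare with 1 - (1 - w)^2
max(est["s2", ])                           # s^2, hence the SRS-formula SE, per sample
```

Every sample is all-road or no-road, so \(s^2 = 0\) and the SRS formula reports an SE of zero, while the estimates themselves jump between 0 and 1 across samples.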

Order of sampling frame #3

Frame is ordered in relation to a variable x that is correlated with y

Improves representativeness of sample
- Each sample tends to have the same composition, so variation in estimates among samples is small
SRS formula overestimates sampling distribution variance
- Actual parameter estimator is more precise (smaller variance of the sampling distribution) than indicated by SE calculated using the SRS formula

Example: Canadian Workplace Data

Population: \(N=2,025\) workplaces in Canada
Characteristic of interest for workplace \(i\)
- \(y_i\): total payroll of workplace \(i\)
Auxiliary variable for workplace \(i\)
- \(x_i\): employment of workplace \(i\)
Population parameter: total payroll
- \(T=\sum_{i=1}^N y_i\) dollars

Canadian Workplace Data

cd <- read.csv("CanadianData.csv", header = FALSE)
cd <- subset(cd, V3 < 5000)

hist(cd$V3, breaks = 500, xlab = "1000 Dollars", 
     main = "Histogram of Payroll at \n Canadian Workplaces", 
     freq = FALSE)

Canadian Workplace Data

The population mean payroll is 143.033 thousand dollars.

with(cd, plot(V1, V3, xlab = "Number Employees", ylab = "Payroll ($1000)"))

Comparison of SYS with SRS

nsim <- 10000
result1 <- double(nsim)
# Draw nsim SRS samples of size 100 (without replacement) and store the sample means
for (i in 1:nsim) {
  result1[i] <- mean(sample(cd$V3, 100, replace = FALSE))
}
hist(result1, breaks = 50, main = "Histogram of sample means under SRS")

Comparison of SYS with SRS

pn <- 2025; sn <- 100; k <- pn/sn
result2 <- double(nsim)
for (i in 1:nsim) {
  r <- runif(1, min = 0, max = k)                 # fractional random start
  sys.samp <- ceiling(seq(r, r + k*(sn - 1), k))  # SYS with fractional interval
  result2[i] <- mean(cd$V3[sys.samp])
}
hist(result2, breaks = 50, main = "Histogram of sample means under SYS")

Comparison

Mean and variance of the sample mean under SRS

mean(result1); var(result1)

## [1] 143.5832

## [1] 984.8151

Mean and variance of the sample mean under SYS

mean(result2); var(result2)

## [1] 143.1152

## [1] 137.9568

Discussion

Why is SYS more efficient (i.e., has smaller variance) than SRS?
- The sampling frame is sorted by \(x\) (=employment)
- The study variable \(y\) (=payroll) is correlated with \(x\)
By applying SYS to a sampling frame sorted by an auxiliary variable that is correlated with \(y\), we can obtain a more representative sample (and thus more efficient estimation) than under SRS.

Discussion

To explain why, suppose that we have a finite population of size \(N=4\). If we wish to select a sample of size \(n=2\) by SYS, we have the following sampling design.

Case	Units in sample	Selection Prob. (SYS)
1	1,2	0
2	1,3	1/2
3	1,4	0
4	2,3	0
5	2,4	1/2
6	3,4	0

Thus, if the \(y_i\) are sorted in increasing (or decreasing) order, SYS never selects the two most extreme samples, \(\{1,2\}\) and \(\{3,4\}\), so the sample mean under SYS is less variable than the sample mean under SRS.
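
The \(N=4\), \(n=2\) comparison can be worked out exactly. The values \(y = (1, 2, 3, 4)\) are hypothetical, chosen only because they are sorted in increasing order.

```r
y <- c(1, 2, 3, 4)                         # hypothetical values, sorted increasing

srs_samples <- combn(4, 2)                 # the 6 possible SRS samples, as columns
srs_means   <- apply(srs_samples, 2, function(s) mean(y[s]))
sys_means   <- c(mean(y[c(1, 3)]), mean(y[c(2, 4)]))   # the 2 possible SYS samples

# Design variance: average squared deviation of the sample mean from the
# population mean, with equally likely samples under each design
var_srs <- mean((srs_means - mean(y))^2)
var_sys <- mean((sys_means - mean(y))^2)
c(SRS = var_srs, SYS = var_sys)
```

Both designs are unbiased for the population mean 2.5, but the SYS variance (1/4) is smaller than the SRS variance (5/12).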

Further Discussion

Introduction

For SRS, how can we reduce the variance of the estimator?
- Increase the sample size, n
We can also change the sampling distribution by choosing a different sample design
- By choosing a different sample design, we may be able to keep the sample size the same and obtain an estimator with a smaller variance than the variance of the estimator for a SRS

Auxiliary Information in Design

Before selecting the sample and collecting data, we do not know the value of the characteristic of interest
We may have additional information about the population on the sampling frame
- Auxiliary information
- Call the auxiliary variable x

Auxiliary Information in Design

Canadian Workplace Data

Sampling frame may contain information about the population units, other than the characteristics of interest
Canadian Workplace Data

Workplace ID	Province	Employment	Payroll
1	Ontario	243	?
2	Quebec	60	?
…	…	…	…
N	Manitoba	2	?

We can use the auxiliary information in the frame to develop different sample designs

Systematic Sampling

Sort the elements in the frame
- May be sorted arbitrarily (e.g., by phone number)
- May be sorted by a meaningful auxiliary variable in the frame (e.g., employment in the workplace data example)
Determine a sampling interval, \(k\)
Select a random start \(r\) in \(\{1, \cdots, k\}\).
Select every \(k\)-th element in the frame, starting with element \(r\)

Probability Proportional to Size sampling

Define a size measure based on an auxiliary variable
- Example: employment in the Canadian workplace data example
On each draw, select element \(i\) with probability proportional to the size measure
To be covered in week 7
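
As a preview, one reading of the description above is draw-by-draw selection with replacement, which can be sketched with base R's `sample()`. The employment values and the choice of three draws are made up for illustration, and sampling with replacement is an assumption of this sketch.

```r
set.seed(7)                                # arbitrary seed
employment <- c(243, 60, 15, 2, 120)       # hypothetical size measures for 5 workplaces
p <- employment / sum(employment)          # draw probabilities, proportional to size

# Three independent draws; on each draw, unit i is selected with probability p[i]
draws <- sample(seq_along(employment), size = 3, replace = TRUE, prob = p)
draws
```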

Stratified Sampling

Split the frame into \(H\) strata(=groups)
- Example: groups may be Canadian provinces in the workplace data example
Select a SRS of size \(n_h\) from stratum \(h\) independently for each \(h=1, \cdots, H\). The sample size \(n_h\) for each stratum \(h\) needs to be determined.
To be covered in weeks 8-9
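
As a preview, the selection step can be sketched as an independent SRS within each stratum. The toy frame of 12 workplaces and the stratum sample sizes `n_h` below are assumptions made for illustration.

```r
set.seed(8)                                # arbitrary seed
# Hypothetical frame: 12 workplaces in 3 provinces (the strata)
frame <- data.frame(
  id = 1:12,
  province = rep(c("Ontario", "Quebec", "Manitoba"), times = c(5, 4, 3))
)
n_h <- c(Ontario = 2, Quebec = 2, Manitoba = 1)   # assumed stratum sample sizes

strata <- split(frame$id, frame$province)  # unit ids grouped by stratum
samp <- lapply(names(strata), function(h) {
  ids <- strata[[h]]
  ids[sample.int(length(ids), n_h[[h]])]   # SRS without replacement within stratum h
})
names(samp) <- names(strata)
samp
```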