Like SRS, systematic sampling (SYS) is a basic sampling design for selecting a sample of size \(n\) from a population of size \(N\).
A 1-in-\(k\) systematic sample is obtained by randomly selecting one sampling unit from the first \(k\) sampling units in the sampling frame, and then every \(k\)-th sampling unit thereafter.
Use SRS estimators as approximate estimators of the SYS sampling distribution
## [1] "Anderson" "Arias" "Banta" "Bao"
## [5] "Bretoi" "Buske" "Collins" "Deblois"
## [9] "Dickey" "Duran" "Ervin" "Fee"
## [13] "Gingrich" "Hassan" "Heron" "Hu"
## [17] "Hu" "Humphries" "Huxford" "Kallis"
## [21] "Kassmeyer" "Kolars" "Liang" "Lu"
## [25] "Mendoza" "Miranda" "Ou" "Palmer"
## [29] "Pei" "Ray" "Ruedy" "Saaranen"
## [33] "Saathoff" "Scheideman" "Schwenneker" "Shi"
## [37] "Steyer" "Stocker" "Sundberg" "Thada"
## [41] "Tian" "Tittler" "Vander Werff" "Von Behren"
## [45] "Voss" "Weaver" "Williams" "Won"
## [49] "Yang" "Zaino" "Zhu"
## [1] 17 34 51
## [1] "Hu" "Scheideman" "Zhu"
Now suppose that we wish to select a sample of size \(n=6\) by systematic sampling.
In this case, \(k=[51/6]+1=9\).
pn <- 51
sn <- 6
k <- ceiling(pn/sn)
r <- sample(1:k, 1)
sys.samp <- seq(r, pn, by = k)  # stop at pn so no index falls outside the frame
sys.samp; pop[sys.samp]
## [1] 1 10 19 28 37 46
## [1] "Anderson" "Duran" "Huxford" "Palmer" "Steyer" "Weaver"
Arrange population in rows of length \(k=9\).
Each column is a different sample.
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 1 2 3 4 5 6 7 8 9
## [2,] 10 11 12 13 14 15 16 17 18
## [3,] 19 20 21 22 23 24 25 26 27
## [4,] 28 29 30 31 32 33 34 35 36
## [5,] 37 38 39 40 41 42 43 44 45
## [6,] 46 47 48 49 50 51 0 0 0
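The layout above can be reproduced with a short sketch. The padding value 0 for the three empty cells is taken from the printed matrix. The names `N`, `k`, `layout`, and `samples` are illustrative and do not come from the original notes.

```r
N <- 51                      # population size
k <- 9                       # sampling interval
n_rows <- ceiling(N / k)     # rows needed to hold all N units

# Fill row by row; pad the last row with 0 so the matrix is rectangular
layout <- matrix(c(seq_len(N), rep(0, n_rows * k - N)),
                 ncol = k, byrow = TRUE)

# Column g holds the units of the g-th possible systematic sample
samples <- lapply(seq_len(k), function(g) layout[layout[, g] > 0, g])
samples[[1]]   # units selected when the random start is 1
```

Dropping the padded zeros shows that the samples with starts 7, 8, and 9 contain one unit fewer than the others.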
Unlike SRS, some subsets of size \(n\) from the population of \(N\) elements have no chance of being selected given a frame.
This is legitimate (and will happen with many other designs discussed in this class).
It can help us improve precision if we do a good job of sorting the list (more on how later).
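To get a sense of how few samples SYS can actually produce, the following sketch compares the number of subsets available to SRS with the number available to 1-in-\(k\) SYS for the running example (\(N=51\), \(n=6\), \(k=9\)). The variable names are illustrative.

```r
N <- 51; n <- 6; k <- 9

n_srs_samples <- choose(N, n)   # subsets of size n that SRS can select
n_sys_samples <- k              # one SYS sample per random start 1, ..., k

c(SRS = n_srs_samples, SYS = n_sys_samples)
```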
Idea: Instead of using an integer \(k\), use a fractional sampling interval \(k = N/n\) (i.e. \(k\) can be non-integer valued)
Steps:
pn <- 51 ; sn <- 6; k <- pn/sn
r <- runif(1, min=0, max = k)
sys.samp1 <- seq(r, r+k*(sn-1), k)
sys.samp2 <- ceiling(sys.samp1)
sys.samp1; sys.samp2
## [1] 4.47316 12.97316 21.47316 29.97316 38.47316 46.97316
## [1] 5 13 22 30 39 47
pop[sys.samp2]
## [1] "Bretoi" "Gingrich" "Kolars" "Ray" "Sundberg" "Williams"
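The fractional-interval steps can be wrapped in a small function. This is a minimal sketch: `frac_sys_sample` is a hypothetical helper name, not part of the original notes, and the seed is set only so that the example is reproducible.

```r
# Hypothetical helper (not from the original notes): draw a systematic
# sample of exactly n indices from 1..N with fractional interval k = N/n.
frac_sys_sample <- function(N, n) {
  k <- N / n                          # fractional sampling interval
  r <- runif(1, min = 0, max = k)     # random start in (0, k)
  ceiling(r + k * (0:(n - 1)))        # round each point up to a unit index
}

set.seed(1)                           # seed chosen only for reproducibility
idx <- frac_sys_sample(51, 6)
idx
```

Because the start lies strictly between 0 and \(k\), every selected index falls in \(1, \ldots, N\) and the sample always has exactly \(n\) units, which the integer-\(k\) version does not guarantee.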
Number of possible SYS samples of size \(n\) is \(k\)
The population is decomposed into \(k\) disjoint samples \[ U = U_1 \cup \cdots \cup U_k \] where \(\left| U_g \right| = n\) if \(g + k(n-1) \le N\) and \(\left| U_g \right| = n-1\) otherwise.
Sample selection \[ \mathcal{S} = U_g, \ \ \ \ g \in \{ 1, \cdots, k\} \] with probability \(1/k\)
Inclusion probability for unit \(i \in U_g\): \[\pi_i = P( i \in \mathcal{S}) = P(\mathcal{S} = U_g) = 1/k \]
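The decomposition and the inclusion probability can be checked numerically for the running example. The sketch below assumes \(N=51\) and \(k=9\); the names `g`, `U`, `appearances`, and `pi_i` are illustrative.

```r
N <- 51; k <- 9

# Group label g = 1, ..., k for each unit: unit i belongs to U_g when
# (i - 1) mod k + 1 equals g
g <- (seq_len(N) - 1) %% k + 1
U <- split(seq_len(N), g)          # list of the k disjoint samples

lengths(U)                         # sample sizes: n or n - 1

# Each U_g is selected with probability 1/k, and each unit lies in
# exactly one U_g, so its inclusion probability is 1/k
appearances <- table(unlist(U))    # number of samples containing each unit
pi_i <- as.numeric(appearances) / k
unique(pi_i)
```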
Without replicate SYS samples, an estimation approach for SYS is to use SRS estimators for population parameters
That is, use
The sampling distribution is the distribution of [estimator] over all possible samples from [specify design]
Design
For random ordering, the variance of the sample mean under SYS is approximately equal to its variance under SRS
If the frame is ordered with a periodicity in \(y\) that matches the sampling interval \(k\), then the variance of the sample mean under SYS is higher than under SRS
If the frame is ordered in relation to a variable \(x\) that is correlated with \(y\), then the variance of the sample mean under SYS is smaller than under SRS
May result in very poor estimates, especially when the sampling interval \(k\) matches the periodicity of the frame.
SRS formula underestimates sampling variance. That is, the estimate is actually more variable than indicated by the calculated SE.
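The effect of frame ordering can be illustrated with two toy populations, computing the variance of the sample mean exactly over all \(k\) possible SYS samples. This is a sketch under stated assumptions: the populations `y_periodic` and `y_sorted` are made up for illustration, \(N\) is a multiple of \(k\), and the SRS variance is taken to be the usual \((1 - n/N) S^2 / n\).

```r
# Exact variance of the SYS sample mean over all k possible samples,
# assuming N is a multiple of k so every sample has size n = N / k
var_sys_mean <- function(y, k) {
  means <- tapply(y, (seq_along(y) - 1) %% k + 1, mean)
  mean((means - mean(y))^2)
}

# Variance of the SRS sample mean: (1 - n/N) * S^2 / n
var_srs_mean <- function(y, n) {
  (1 - n / length(y)) * var(y) / n
}

k <- 10; n <- 10                      # N = 100 in both toy populations
y_periodic <- rep(1:10, times = 10)   # period 10 matches the interval k
y_sorted   <- 1:100                   # frame sorted by y itself

c(SYS = var_sys_mean(y_periodic, k), SRS = var_srs_mean(y_periodic, n))
c(SYS = var_sys_mean(y_sorted, k),   SRS = var_srs_mean(y_sorted, n))
```

In the periodic population every SYS sample consists of identical values, so the within-sample variance is zero and the SRS formula would report a standard error of zero even though the estimate varies a great deal from sample to sample.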
cd <- read.csv("CanadianData.csv", header = FALSE)
cd <- subset(cd, V3 < 5000)
hist(cd$V3, breaks = 500, xlab = "1000 Dollars",
main = "Histogram of Payroll at \n Canadian Workplaces",
freq = FALSE)
nsim <- 10000
result1 <-double(nsim)
for (i in 1:nsim){ result1[i] <- mean(sample(cd$V3, 100, replace = FALSE))}
hist(result1, breaks=50, main="Histogram of sample means under SRS")
pn <- 2025; sn <- 100; k <- pn/sn
result2 <-double(nsim)
for (i in 1:nsim){
r <- runif(1, min=0, max = k)
sys.samp <- ceiling(seq(r, r+k*(sn-1), k))
result2[i] <- mean(cd$V3[sys.samp])}
hist(result2, breaks=50, main="Histogram of sample means under SYS")
## [1] 143.5832
## [1] 984.8151
## [1] 143.1152
## [1] 137.9568
| Case | Sample ID | Selection Prob. |
|---|---|---|
| 1 | 1,2 | 0 |
| 2 | 1,3 | 1/2 |
| 3 | 1,4 | 0 |
| 4 | 2,3 | 0 |
| 5 | 2,4 | 1/2 |
| 6 | 3,4 | 0 |
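The table can be reproduced by enumerating every subset of size 2 and checking whether it is one of the systematic samples. The sketch assumes \(N=4\), \(n=2\), and \(k=2\), as implied by the table; the variable names are illustrative.

```r
N <- 4; n <- 2; k <- N / n

# All subsets of size n, one per row, in the same order as the table
subsets <- t(combn(N, n))

# The k possible 1-in-k systematic samples: {1, 3} and {2, 4}
sys_samples <- lapply(seq_len(k), function(r) seq(r, N, by = k))

# A subset has selection probability 1/k if it is a SYS sample, else 0
is_sys <- apply(subsets, 1, function(s)
  any(vapply(sys_samples, function(u) all(u == s), logical(1))))
prob <- ifelse(is_sys, 1 / k, 0)

data.frame(sample = apply(subsets, 1, paste, collapse = ","), prob = prob)
```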
Before selecting the sample and collecting data, we do not know the value of the characteristic of interest.
We may have additional information about the population on the sampling frame.
The sampling frame may contain information about the population units other than the characteristic of interest.
Canadian Workplace Data
| Workplace ID | Province | Employment | Payroll |
|---|---|---|---|
| 1 | Ontario | 243 | ? |
| 2 | Quebec | 60 | ? |
| … | … | … | … |
| N | Manitoba | 2 | ? |
Define a size measure based on an auxiliary variable
On each draw, select element \(i\) with probability proportional to the size measure
To be covered in week 7
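As a preview, selection proportional to size can be sketched with base R's `sample()`. The employment counts are the three values visible in the table above, used here as a hypothetical size measure; drawing with replacement is an assumption made for this illustration.

```r
# Employment counts from the three workplaces shown in the table above,
# used here as a hypothetical size measure
employment <- c(243, 60, 2)
p <- employment / sum(employment)     # per-draw selection probabilities

set.seed(1)                           # seed chosen only for reproducibility
# Three independent draws with replacement (an assumption for this sketch),
# each proportional to the size measure
draws <- sample(seq_along(employment), size = 3, replace = TRUE, prob = p)

round(p, 3)
draws
```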
Split the frame into \(H\) strata (= groups)
Select a SRS of size \(n_h\) from stratum \(h\) independently for each \(h=1, \cdots, H\). The sample size \(n_h\) for each stratum \(h\) needs to be determined.
To be covered in weeks 8-9
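As a preview, the two steps can be sketched on a made-up frame. Everything in this example is hypothetical: the 12-unit frame, the use of province as the stratum, and the allocation `n_h` are chosen only for illustration.

```r
# Hypothetical frame: 12 workplaces in H = 3 strata (provinces)
frame <- data.frame(
  id = 1:12,
  stratum = rep(c("Ontario", "Quebec", "Manitoba"), times = c(6, 4, 2))
)
n_h <- c(Ontario = 3, Quebec = 2, Manitoba = 1)   # assumed allocation

set.seed(1)                                       # for reproducibility only
# Independent SRS without replacement of size n_h within each stratum
picked <- unlist(lapply(names(n_h), function(h) {
  ids <- frame$id[frame$stratum == h]
  ids[sample.int(length(ids), n_h[[h]])]
}))
strat_sample <- frame[frame$id %in% picked, ]
tab <- table(strat_sample$stratum)
tab
```

Indexing with `sample.int()` rather than calling `sample()` on the ids directly avoids the surprising behavior of `sample()` when a stratum contains a single unit.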