Stat 421: Survey Sampling

Week 6

Jae-kwang Kim

2/18/2020

Review

  • So far, we have studied
    • Simple random sampling design
    • Large-sample inference under SRS
    • Sample size determination
  • This week, we will study
    • Systematic sampling
    • Statistical properties
    • Efficiency issues

Systematic sampling

Introduction

  • Like SRS, systematic sampling (SYS) is a basic sampling design for selecting a sample of size \(n\) from a population of size \(N\).

  • A 1-in-\(k\) systematic sample is obtained by randomly selecting one sampling unit from the first \(k\) sampling units in the sampling frame, and then taking every \(k\)-th sampling unit thereafter.

  • In practice, SRS estimators are used as approximate estimators under the SYS design.

Sampling Procedures for SYS

  • Have a frame, or list of \(N\) sampling units
  • Determine sampling interval, \(k\)
    • If \(N/n\) is an integer, \(k=N/n\)
    • If \(N/n\) is not an integer, \(k\) is the smallest integer greater than \(N/n\)
  • Select a random number \(R\) from the set \(\{1,2, \cdots, k \}\)
  • Take \(R\)-th unit, then every \(k\)-th unit thereafter
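
The procedure above can be sketched in Python. The slides' own output was produced in R, so this is an illustration under stated assumptions, not the course code. The function name `sys_sample` and its `start` argument are introduced here for illustration; passing `start` fixes \(R\) so that a particular selection can be reproduced.

```python
import random


def sys_sample(N, n, start=None):
    """Return the 1-based frame positions of a 1-in-k systematic sample.

    k is the smallest integer >= N/n. `start` plays the role of R and is
    drawn uniformly from {1, ..., k} when it is not supplied.
    """
    k = -(-N // n)  # integer ceiling of N/n, avoids floating-point rounding
    R = random.randint(1, k) if start is None else start
    if not 1 <= R <= k:
        raise ValueError("start must lie in {1, ..., k}")
    # Take the R-th unit, then every k-th unit until the frame runs out.
    return list(range(R, N + 1, k))


print(sys_sample(51, 3, start=17))  # [17, 34, 51]
print(sys_sample(51, 6, start=1))   # [1, 10, 19, 28, 37, 46]
```

With \(N=51\) and \(n=6\), a start of \(R=7\) gives only five units, which illustrates that the realized sample size can fall below the planned \(n\) when \(N/n\) is not an integer.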

SYS example: Stat 421 Class

  • There are 51 students enrolled in Stat 421
  • List of last names of the students
##  [1] "Anderson"     "Arias"        "Banta"        "Bao"         
##  [5] "Bretoi"       "Buske"        "Collins"      "Deblois"     
##  [9] "Dickey"       "Duran"        "Ervin"        "Fee"         
## [13] "Gingrich"     "Hassan"       "Heron"        "Hu"          
## [17] "Hu"           "Humphries"    "Huxford"      "Kallis"      
## [21] "Kassmeyer"    "Kolars"       "Liang"        "Lu"          
## [25] "Mendoza"      "Miranda"      "Ou"           "Palmer"      
## [29] "Pei"          "Ray"          "Ruedy"        "Saaranen"    
## [33] "Saathoff"     "Scheideman"   "Schwenneker"  "Shi"         
## [37] "Steyer"       "Stocker"      "Sundberg"     "Thada"       
## [41] "Tian"         "Tittler"      "Vander Werff" "Von Behren"  
## [45] "Voss"         "Weaver"       "Williams"     "Won"         
## [49] "Yang"         "Zaino"        "Zhu"

Class Example 1

  • From \(N=51\) students, wish to select a sample of size \(n=3\) by systematic sampling
  • Sampling interval \(k\): the smallest integer \(\ge N/n\)
  • Select a random number \(R\) from \(\{1, \cdots, k\}\) and take every \(k\)th unit thereafter.
## [1] 17 34 51
## [1] "Hu"         "Scheideman" "Zhu"

Class Example 2

  • Now, suppose that we wish to select a sample of size \(n=6\) by systematic sampling

  • In this case, \(k=[51/6]+1=9\).

## [1]  1 10 19 28 37 46
## [1] "Anderson" "Duran"    "Huxford"  "Palmer"   "Steyer"   "Weaver"

Class Example 2 (Cont’d)

  • Arrange population in rows of length \(k=9\).

  • Each column is a different possible sample (0 marks an empty cell, so columns 7 to 9 give samples of size 5).

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,]    1    2    3    4    5    6    7    8    9
## [2,]   10   11   12   13   14   15   16   17   18
## [3,]   19   20   21   22   23   24   25   26   27
## [4,]   28   29   30   31   32   33   34   35   36
## [5,]   37   38   39   40   41   42   43   44   45
## [6,]   46   47   48   49   50   51    0    0    0
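
A minimal Python sketch of the same layout: it lists the \(k=9\) possible samples for \(N=51\), which are the columns of the table above without the 0 padding. The helper name `all_sys_samples` is an assumption made for illustration.

```python
def all_sys_samples(N, k):
    """List the k possible systematic samples as 1-based frame positions."""
    return [list(range(g, N + 1, k)) for g in range(1, k + 1)]


samples = all_sys_samples(51, 9)
for g, s in enumerate(samples, start=1):
    print(g, len(s), s)

# Every frame position appears in exactly one sample, so the k samples
# partition the population.
assert sorted(i for s in samples for i in s) == list(range(1, 52))
```

Starts 1 to 6 give six units and starts 7 to 9 give five.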

SYS property (general)

  • Unlike SRS, some subsets of size \(n\) from the population of \(N\) elements have no chance of being selected given a frame

  • This is legitimate (and will happen with many other designs discussed in this class)

  • It can help us improve precision if we do a good job of sorting the list (more later on how)

SYS property (general)

  • Only 1 random act: selecting R
    • After selecting the first SU, all other SUs to be included in the sample are predetermined given the ordering of the sampling frame
    • This amounts to selecting a cluster sample of 1 cluster. (Here, cluster = group)
  • Plan for sample size of \(n\), but actual sample size may be less than \(n\)

Advanced topic

Selecting a SYS sample with exact sample size \(n\)

  • Idea: Instead of using integer \(k\), use fractional sampling interval \(k\) (i.e. \(k\) can be non-integer valued)

  • Steps:

  1. Compute \(k=N/n\)
  2. Generate \(R_0\) from Uniform \((0,k)\)
  3. Compute \(R_{i+1} = R_i + k\) for each \(i=0, 1, \cdots, n-2\), giving \(R_0, R_1, \cdots, R_{n-1}\).
  4. For each \(i=0, 1, \cdots, n-1\), select the unit whose index is the smallest integer \(\ge R_i\).

Back to Class Example

  • SYS of size \(n=6\) from \(N=51\). In this case, \(k = 51/6 = 8.50\)
## [1]  4.47316 12.97316 21.47316 29.97316 38.47316 46.97316
## [1]  5 13 22 30 39 47
## [1] "Bretoi"   "Gingrich" "Kolars"   "Ray"      "Sundberg" "Williams"
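
The fractional-interval steps can be sketched in Python as follows. The function name `fractional_sys_sample` and the `r0` argument are assumptions for illustration; supplying `r0 = 4.47316` reproduces the selection shown above.

```python
import math
import random


def fractional_sys_sample(N, n, r0=None):
    """Systematic sample of exactly n units using the interval k = N/n."""
    k = N / n
    if r0 is None:
        # random.random() lies in [0, 1), so 1 - random.random() lies in
        # (0, 1] and the start lies in (0, k]; this avoids a start of 0.
        r0 = k * (1.0 - random.random())
    # R_i = R_0 + i*k for i = 0, ..., n-1; select the smallest integer >= R_i.
    # min() guards against floating-point overshoot past N at the last step.
    return [min(N, math.ceil(r0 + i * k)) for i in range(n)]


print(fractional_sys_sample(51, 6, r0=4.47316))  # [5, 13, 22, 30, 39, 47]
```

Because \(k = 8.5 > 1\), consecutive \(R_i\) always round up to different units, so the sample has exactly \(n\) distinct units.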

Statistical Property of SYS

SYS property

  • Number of possible SYS samples of size \(n\) is \(k\)

  • The population is decomposed into \(k\) disjoint samples \[ U = U_1 \cup \cdots \cup U_k \] where \(\left| U_g \right| = n\) if \(g + k(n-1) \le N\) and \(\left| U_g \right| = n-1\) otherwise.

  • Sample selection \[ \mathcal{S} = U_g, \ \ \ \ g \in \{ 1, \cdots, k\} \] with probability \(1/k\)

  • Inclusion probability for unit \(i \in U_g\): \[\pi_i = P( i \in \mathcal{S}) = P(\mathcal{S} = U_g) = 1/k \]

SYS property

  • Because only the starting SU of a SYS sample is randomized, the variance of the sampling distribution cannot be estimated directly from a single SYS sample
    • Need multiple systematic samples to estimate variance

Estimation for SYS

  • Without replicate SYS samples, an estimation approach for SYS is to use SRS estimators for population parameters

  • That is, use

    • Estimate \(\bar{Y}_U\) with \(\bar{y}\) and \(\hat{V} ( \bar{y}) = \frac{1}{n} \left( 1- \frac{n}{N}\right) s^2\)
    • Estimate \(T=N \bar{Y}_U\) with \(\hat{T} = N \bar{y}\) and \(\hat{V} ( \hat{T}) = N^2\hat{V} ( \bar{y})\)
    • Estimate \(P\) with \(\hat{P}\) and \(\hat{V} ( \hat{P}) = \frac{1}{n-1} \left( 1- \frac{n}{N}\right)\hat{P} (1- \hat{P})\)
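
As a sketch of these formulas, the Python function below computes \(\bar{y}\), \(\hat{V}(\bar{y})\), \(\hat{T}\) and \(\hat{V}(\hat{T})\) from one sample. The function name and the data values are made up for illustration and do not come from the course data.

```python
from statistics import mean, variance


def srs_estimates(y, N):
    """SRS-formula estimates of the population mean and total from sample y."""
    n = len(y)
    ybar = mean(y)
    s2 = variance(y)                 # sample variance with divisor n - 1
    v_mean = (1 - n / N) * s2 / n    # includes the finite population correction
    return {
        "mean": ybar,
        "var_mean": v_mean,
        "total": N * ybar,
        "var_total": N ** 2 * v_mean,
    }


# Hypothetical sample of n = 5 values from a frame of N = 50 units.
est = srs_estimates([12, 15, 11, 14, 18], N=50)
print(est)
```

For a 0/1 variable the same function returns \(\hat{P}\) and \(\hat{V}(\hat{P})\), because in that case \(s^2 = \frac{n}{n-1}\hat{P}(1-\hat{P})\).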

Sampling distribution of the estimator under a SYS design

  • The sampling distribution is the distribution of [estimator] over all possible samples from [specify design]

  • Design

    • SYS samples of size n = [sample size] from N = [pop size] SUs using a frame with [specify ordering]

Systematic sampling summary

  • What do we determine (under the control of the investigator)?
    • Ordering of the elements
    • Sampling interval: determined by the sample size \(n\)
  • What is random?
    • The random start R
    • Which of the k possible samples we observe
    • Observe 1 of k possible samples with probability 1/k

Systematic sampling summary

  • Variance of an estimator based on SYS
    • Variability among the estimators for the k possible samples
  • Variance estimation
    • Because we only observe one of the k samples, we cannot estimate the sampling variance unbiasedly in general
    • Use SRS formula as an approximation

Systematic sampling summary

  • SRS approximation: parameter estimates
    • In general, SRS estimators used with a SYS sample have no bias when \(N/n\) is an integer (all \(k\) possible samples have size \(n\)) and only a small bias otherwise, when the possible samples differ in size
  • SRS approximation: variance estimates
    • The properties of SRS variance estimators used with a SYS sample depend heavily on the ordering of the sampling frame
    • In some cases, the SRS variance formula offers a very poor approximation

Ordering and the efficiency of SYS

Ordering the frame for SYS

  • Impact of frame ordering on SYS, compared with SRS of the same size \(n\)
  1. For random ordering, the variance of an estimator under SYS is approximately equal to its variance under SRS

  2. If the frame is ordered with a periodicity in \(y\) that matches the sampling interval \(k\), then the variance under SYS is higher than the variance under SRS

  3. If the frame is ordered in relation to a variable \(x\) that is correlated with \(y\), then the variance under SYS is smaller than the variance under SRS
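
A small numerical sketch of cases 2 and 3, in Python. The two populations of \(N=20\) values are hypothetical, chosen so the arithmetic is easy to follow, and the function names are assumptions as well. The variance under SYS is computed exactly by averaging over the \(k\) possible samples, and compared with the SRS variance \(\frac{1}{n}\left(1-\frac{n}{N}\right)S^2\).

```python
from statistics import mean, pvariance, variance


def sys_variance(y, k):
    """Exact variance of the sample mean over the k possible SYS samples."""
    means = [mean(y[g::k]) for g in range(k)]  # one mean per random start
    return pvariance(means)                    # each start has probability 1/k


def srs_variance(y, n):
    """Variance of the sample mean under SRS without replacement."""
    N = len(y)
    return (1 - n / N) * variance(y) / n       # variance() uses divisor N - 1


k, n = 5, 4
sorted_y = list(range(1, 21))       # frame sorted by y (case 3)
periodic_y = [1, 2, 3, 4, 5] * 4    # period equals the interval k (case 2)

print(sys_variance(sorted_y, k), srs_variance(sorted_y, n))      # SYS smaller
print(sys_variance(periodic_y, k), srs_variance(periodic_y, n))  # SYS larger
```

With the sorted frame the SYS variance is 2 against 7 for SRS; with the periodic frame it is 2 against about 0.42.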

Order of sampling frame #1

Frame is in random order

  • SYS acts very much like SRS
  • SRS variance formula is a good approximation to the true sampling distribution variance under SYS

Order of sampling frame #2

Frame is ordered with a periodicity in \(y\)

  • May result in very poor estimates, especially when the sampling interval \(k\) matches the periodicity of the frame.

  • SRS formula underestimates sampling variance. That is, the estimate is actually more variable than indicated by the calculated SE.

Example for ordering with periodicity in y

  • We are interested in estimating the proportion of Iowa land in roads
  • Iowa was laid out using the Public Land Survey system with 1 mi × 1 mi “sections” (squares on the graphic on the next page)
  • North-south and east-west roads typically run along section boundaries

Public Land Survey system

Example for ordering with periodicity in y

  • Suppose we have a grid of roads at 1 mi intervals
  • Take a systematic sample of points on the land with 5 mi sampling interval
  • Grid of points with 5 mi to east/west and north/south between points


Iowa Roads Example

  • Grid of points with 5 mi intervals
  • If the random start \(R\) falls on a road, every point in the sample lands on a road
  • If \(R\) does not fall on a road, no point in the sample lands on a road
  • Sampling variance (of estimator across samples) is larger than calculated SE indicates
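
A one-dimensional Python sketch of this effect, under simplifying assumptions that are not in the slides: a 100 mi transect is cut into quarter-mile cells, a cell counts as road when it sits on a 1 mi grid line, and points are sampled every 5 mi (every 20 cells).

```python
from statistics import mean, pvariance, variance

CELLS_PER_MILE = 4
N = 100 * CELLS_PER_MILE    # hypothetical transect of 100 miles
k = 5 * CELLS_PER_MILE      # 5 mi sampling interval, expressed in cells
n = N // k

# y = 1 if the cell lies on a road (one road every mile), else 0.
y = [1 if i % CELLS_PER_MILE == 0 else 0 for i in range(N)]

p_hats = []   # estimated proportion for each possible random start
v_hats = []   # SRS-formula variance estimate for each possible start
for g in range(k):
    sample = y[g::k]
    p_hats.append(mean(sample))
    v_hats.append((1 - n / N) * variance(sample) / n)

print("true proportion:", mean(y))
print("distinct estimates:", sorted(set(p_hats)))
print("true SYS variance:", pvariance(p_hats))
print("largest SRS-formula estimate:", max(v_hats))
```

Every possible sample is either all road or all non-road, so the SRS formula reports a variance of 0 for each sample while the true variance of the estimator across samples is 0.1875.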

Order of sampling frame #3

Frame is ordered in relation to a variable x that is correlated with y

  • Improves representativeness of sample
    • Each sample tends to have the same composition, so variation in estimates among samples is small
  • SRS formula overestimates sampling distribution variance
    • Actual parameter estimator is more precise (smaller variance of the sampling distribution) than indicated by SE calculated using the SRS formula

Example: Canadian Workplace Data

  • Population: \(N=2,025\) workplaces in Canada
  • Characteristic of interest for workplace \(i\)
    • \(y_i\): total payroll of workplace \(i\)
  • Auxiliary variable for workplace \(i\)
    • \(x_i\): employment of workplace \(i\)
  • Population parameter: total payroll
    • \(T=\sum_{i=1}^N y_i\) dollars

Canadian Workplace Data

  • The population mean is 143.033 (in thousands of dollars).

Comparison of SYS with SRS

Comparison

  • Mean and variance of the sample mean under SRS
## [1] 143.5832
## [1] 984.8151
  • Mean and variance of the sample mean under SYS
## [1] 143.1152
## [1] 137.9568

Discussion

  • Why is SYS more efficient (i.e., has smaller variance) than SRS here?
    • The sampling frame is sorted by \(x\) (= employment)
    • The study variable \(y\) (= payroll) is correlated with \(x\)
  • By applying SYS to a sampling frame sorted by an auxiliary variable that is correlated with \(y\), we can obtain a more representative sample (and thus more efficient estimation) than under SRS.

Discussion

  • To explain why, suppose that we have a finite population of size \(N=4\). If we wish to select a sample of size \(n=2\) by SYS, we have the following sampling design.
Case   Sample (unit IDs)   Selection prob.
1      1, 2                0
2      1, 3                1/2
3      1, 4                0
4      2, 3                0
5      2, 4                1/2
6      3, 4                0
  • Thus, if the \(y_i\) are sorted in increasing (or decreasing) order, SYS never selects the most extreme samples \(\{1,2\}\) and \(\{3,4\}\), so the sample mean under SYS is less variable than the sample mean under SRS.
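
This can be checked by enumeration. The Python sketch below assumes the hypothetical sorted values \(y = (1, 2, 3, 4)\) and compares the variance of the sample mean over the six equally likely SRS samples with the variance over the two SYS samples.

```python
from itertools import combinations
from statistics import mean, pvariance

y = {1: 1, 2: 2, 3: 3, 4: 4}   # unit ID -> hypothetical y value, sorted

# SRS: all 6 subsets of size 2, each with probability 1/6.
srs_means = [mean(y[i] for i in s) for s in combinations(y, 2)]
# SYS with k = 2: only {1, 3} and {2, 4}, each with probability 1/2.
sys_means = [mean(y[i] for i in s) for s in [(1, 3), (2, 4)]]

print(pvariance(srs_means))   # variance of the sample mean under SRS
print(pvariance(sys_means))   # variance of the sample mean under SYS
```

Both designs give an expected sample mean of 2.5, but the variance is 5/12 under SRS and 1/4 under SYS.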

Further Discussion

Introduction

  • For SRS, how can we reduce the variance of the estimator?
    • Increase the sample size, n
  • We can also change the sampling distribution by choosing a different sample design
    • By choosing a different sample design, we may be able to keep the sample size the same and obtain an estimator with a smaller variance than the variance of the estimator for a SRS

Auxiliary Information in Design

  • Before selecting the sample and collecting data, we do not know the value of the characteristic of interest

  • We may have additional information about the population on the sampling frame

    • Auxiliary information
    • Call the auxiliary variable x

Auxiliary Information in Design

Canadian Workplace Data

  • Sampling frame may contain information about the population units, other than the characteristics of interest

  • Canadian Workplace Data

Workplace ID   Province   Employment   Payroll
1              Ontario    243          ?
2              Quebec     60           ?
N              Manitoba   2            ?
  • We can use the auxiliary information in the frame to develop different sample designs

Systematic Sampling

  • Sort the elements in the frame
    • May be sorted arbitrarily (e.g., by phone number)
    • May be sorted by a meaningful auxiliary variable in the frame (e.g., employment in the workplace data example)
  • Determine a sampling interval, \(k\)
  • Select a random start \(r\) in \(\{1, \cdots, k\}\).
  • Select every \(k\)-th element in the frame, starting with element \(r\)

Probability Proportional to Size sampling

  • Define a size measure based on an auxiliary variable

    • Example: employment in the Canadian workplace data example
  • On each draw, select element \(i\) with probability proportional to the size measure

  • To be covered in week 7
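
A minimal Python sketch of drawing with probability proportional to size, using the standard library's `random.choices`. Treating the draws as with-replacement is an assumption made here, since the slides leave the details to week 7; the function name `pps_draws`, the unit labels and the three-unit frame are hypothetical.

```python
import random


def pps_draws(sizes, n, seed=None):
    """Draw n units with replacement, each draw proportional to its size."""
    rng = random.Random(seed)
    units = list(sizes)
    weights = [sizes[u] for u in units]
    return rng.choices(units, weights=weights, k=n)


# Hypothetical three-unit frame: unit label -> employment (size measure).
employment = {"w1": 243, "w2": 60, "w3": 2}
print(pps_draws(employment, n=5, seed=421))
```

Under this scheme a unit with employment 243 is selected on any given draw with probability \(243/305\), and a unit with size 0 is never selected.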

Stratified Sampling

  • Split the frame into \(H\) strata (= groups)

    • Example: groups may be Canadian provinces in the workplace data example
  • Select a SRS of size \(n_h\) from stratum \(h\) independently for each \(h=1, \cdots, H\). The sample size \(n_h\) for each stratum \(h\) needs to be determined.

  • To be covered in weeks 8-9
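
A minimal Python sketch of the selection step: an independent SRS without replacement within each stratum, using `random.sample`. The function name `stratified_srs`, the unit IDs, the stratum sizes and the allocation \(n_h\) are all made up for illustration; how to choose \(n_h\) is left to the later lectures.

```python
import random


def stratified_srs(frame, n_h, seed=None):
    """Select an SRS of size n_h[h] independently within each stratum h.

    frame maps a stratum label to the list of unit IDs in that stratum.
    """
    rng = random.Random(seed)
    return {h: rng.sample(units, n_h[h]) for h, units in frame.items()}


# Hypothetical frame with H = 3 strata.
frame = {
    "Ontario": list(range(1, 11)),
    "Quebec": list(range(11, 17)),
    "Manitoba": list(range(17, 21)),
}
allocation = {"Ontario": 3, "Quebec": 2, "Manitoba": 1}
sample = stratified_srs(frame, allocation, seed=421)
print(sample)
```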