Stat 421: Survey Sampling

Week 2

Jae-kwang Kim @ ISU

1/21/2020

Review

  • In Week 1, we studied
    • Basic concepts of survey sampling
    • Survey Process
    • Target population vs Survey Population
    • Observation Unit vs Sampling Unit
  • In Week 2, we will study
    • Selection bias and Measurement bias
    • Survey Errors
    • Basic concepts of probability sampling

Selection bias & Measurement bias

Selection bias

  • Selection bias occurs when some part of the target population is not in the sampled population
  • May be due to
    • Sampling process
    • Data collection process

Types of Selection Bias
(Things you should avoid)

  • Convenience, volunteer samples

    • Take whoever is willing (e.g., volunteer web surveys)

  • Judgment, purposive, quota samples

    • Select OUs without a probability mechanism
    • Pick sample using your judgment to reflect the target population composition
  • May be useful for initial studies to probe a topic

  • CANNOT make inferences about a population from such studies

Types of Selection Bias - 2
(Things you should avoid)

  • Ad hoc substitution of observation unit
    • If respondent not home, go to (unselected) neighbor
    • Characteristics of substitute are likely to vary, may alter sample composition

Types of Selection Bias - 3
(Things you can partially control)

  • Undercoverage: sampling frame omits portion of target population (e.g. Homeless in telephone survey of U.S. residents)

  • Remedies

    • Select / construct sampling frame carefully
      • Cover as much of the target population as possible
      • Better if portion not covered by frame is small, or if it differs in a way that minimizes impact on inferences
    • Once you have a frame, use probability sampling

Types of Selection Bias - 4
(Things you can partially control)

  • Nonresponse during measurement process
    • Refusals
    • Not reachable
    • Incompetent
  • Remedies
    • Use multiple and persistent methods to find / reach OU
    • Use rigorous methods to encourage OU to participate

Example: Literary Digest Survey

  • Correctly predicted presidential election outcomes, 1912-1932
    • 1932: Predicted Roosevelt w/ 56%, got 58% in election
    • Used “commercial sampling methods” developed to market books
      • Telephone books, club rosters, city directories, registered voter lists, mail-order lists, auto registrations
      • Mailed out 10 million questionnaires, received 2.3 million
  • 1936 election
    • Predicted Roosevelt loss (41% to Landon’s 55%)
    • Roosevelt won, 61% to 37%

  • What happened?
    • Undercoverage in sampling frame: Heavy reliance on auto and phone lists. Those with cars and/or phones voted in favor of Roosevelt, but not to the extent that those without cars and phones did.
    • Low response rate: Those responding preferred Landon relative to those who didn’t. Many Roosevelt supporters didn’t remember receiving the survey.
  • On the other hand, George Gallup achieved national recognition by correctly predicting, from the replies of only 50,000 respondents, that Franklin Roosevelt would defeat Alf Landon in the 1936 presidential election. (https://en.wikipedia.org/wiki/George_Gallup)

Discussion

  • Want sample and resulting survey data to be “representative” of the target population, or free of “selection bias”.
    • Good survey design and proper implementation of protocols are key to minimizing selection bias
    • Methods should be described in documentation and published articles, to enable user/reader to make judgments about the nature of selection bias and its effects on the interpretation of results
  • Ideally, probability sampling from a sampling frame without coverage error can remove the selection bias (if the sampled units participate in the survey).

Measurement Bias

  • Recall that survey sampling has two important components

    • Representativeness
    • Accurate Measurement
  • Selection bias occurs when the representativeness is impaired in the sample.

  • Measurement bias occurs when measurement process produces observations on an OU that differ from the true value for the OU in a systematic manner

    • Calibration error in scale adds 5 lb to weight for each person in a health survey
    • Fail to present a valid option in a response list

Measurement Bias in people

  • Respondent may provide false information
    • More likely with sensitive subject matter
    • Socially unacceptable behavior (drug use): “social desirability” bias
    • Desire to influence outcome of survey to reap benefit (ag yields)
    • Memory
      • Recall bias - distant memory more prone to error
      • Telescoping - recall events that occurred before reference period

Measurement bias in people - 2

  • Impact of interviewer
    • Respondent reactions: white respondents may give different answers to white and black interviewers, and vice versa
    • Interviewer interaction with respondent
      • Misleading questions
      • Poor rapport

Measurement bias in people - 3

  • Impact of questionnaire
    • Respondent fails to understand question
    • Variation in interpretation of words or phrases
      • Even simple questions may not be explicitly clear
    • Question order
      • Context effects - previous question impacts answer
      • Poorly organized questionnaire can make it difficult for respondent to understand questions

Questionnaire Design

  • Clearly and specifically define study objectives
  • Evaluate proposed questions as to whether they clearly support objectives and analysis methods
  • Pre-test the survey instrument (=questionnaire)

Writing Questions

  • Use clear, simple, precise language
  • Focus on one well-defined item in a question
    • Avoid referring to multiple concepts in a single question
    • Divide lengthy questions into a contextual statement plus a simple question
    • Specify a time frame, area, or other form of scope
    • Define critical terms
  • State question neutrally
    • Avoid leading questions that might induce bias

Writing Questions - 2

  • Response formats
    • Use mutually-exclusive categories in closed-ended questions
    • Reduce post-hoc coding by minimizing use of open-ended questions
  • Organization
    • Group questions to improve ability of respondent to follow content and understand questions
    • Put key questions first while the respondent is fresh (but start easy)

Impact of measurement bias

  • Bias at the observation level impacts estimates in two ways
    • Systematic bias over OUs in sample in same direction results in a biased estimate of a population characteristic

    • Measurement error often results in increased variance in estimates (with or without bias) as well
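
A minimal Python sketch of the first point, reusing the 5 lb calibration error from the earlier health-survey example. The true weights below are hypothetical values chosen only for illustration. Because the error is the same for every unit, the mean is shifted by exactly 5 lb even when every unit is measured (\(n=N\)).

```python
from statistics import mean

# Hypothetical true weights (lb) for a small population; illustrative only.
true_weights = [120, 150, 165, 180, 210]

# Systematic measurement error: the scale adds 5 lb to every reading.
CALIBRATION_ERROR = 5
measured_weights = [w + CALIBRATION_ERROR for w in true_weights]

true_mean = mean(true_weights)
measured_mean = mean(measured_weights)

# Even a full census (n = N) inherits the 5 lb bias.
bias = measured_mean - true_mean
print(true_mean, measured_mean, bias)
```

Taking more measurements with the same scale would not remove this bias, since it does not depend on which units are observed.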

Survey Errors

Source of Errors

  • Errors of nonobservation
    • Coverage Error: Target population \(\neq\) Frame
    • Sampling Error: Frame \(\neq\) sample
    • Non-response Error: sample \(\neq\) respondents
  • Errors of observation
    • Measurement error
    • Processing Error

Total survey errors

  • Sampling error:

    • If \(n=N\), no sampling error
    • In fact, if \(n \uparrow\) then sampling error \(\downarrow\)
  • Nonsampling error: Everything else

    • Even if \(n=N\), we have nonsampling error.
  • In Stat 421, we will mostly focus on sampling error.

  • Sampling error can be controlled and understood better under probability sampling.

Probability Sampling

Parameter

  • Study Variable (=characteristic of interest): The data item(s) measured/collected for each observation unit
    • Example: “Yes” if the adult read the Des Moines Register during Jan 8-13 of 2020, and “No” otherwise
  • Parameter: A numerical summary of the study variable over all units in the finite population.
    • Example: Proportion of adults in the target population who read the Des Moines Register during the same period.

Example

  • Let \(y_i\) be the (numerical) value of study variable \(Y\) for unit \(i\) in the finite population \(\mathcal{U}=\{1, \cdots, N\}\)
  • Population mean of \(Y\): \[ \bar{Y}_U = \frac{1}{N} \sum_{i=1}^N y_i \]
  • Population Proportion of \(Y \le 2\): \[ P( Y \le 2) = \frac{1}{N} \sum_{i=1}^N I ( y_i\le 2) \] where \(I(A )=1\) if \(A\) is true and \(I(A) = 0\) otherwise.
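
As a small sketch of the two formulas above, the following Python snippet computes the population mean and the population proportion of \(Y \le 2\). The values in `y` are hypothetical and chosen only for illustration; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Hypothetical study-variable values y_1, ..., y_N (illustrative only).
y = [1, 3, 5, 15]
N = len(y)

# Population mean: (1/N) * sum of y_i.
pop_mean = Fraction(sum(y), N)

# Population proportion of units with y_i <= 2, using the indicator I(y_i <= 2).
indicator = [1 if yi <= 2 else 0 for yi in y]
pop_prop = Fraction(sum(indicator), N)

print(pop_mean, pop_prop)
```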

Terminology

  • Statistic : a function of observations (\(y_i\)’s) in the sample

  • Estimator : If a statistic is used to estimate a parameter, it is called an estimator for the parameter

  • Estimate : A numerical value obtained from the estimator by applying the real observations in the sample.

  • Example: Population mean is a parameter. Mean of the sample observations (i.e. sample mean) is a statistic. If the sample mean is used to estimate the population mean, it is an estimator of the population mean.

Probability sampling

  • Setup: Suppose that we wish to draw a sample of size \(n\) from a finite population of size \(N\).

  • There are \(N \choose n\) possible samples.

  • Recall that \[ {N \choose n} = \frac{ N!}{ n! (N-n)!} = \frac{ N \cdot (N-1) \cdots (N-n+1) }{ n \cdot (n-1) \cdots 1} \]
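
The count can be checked numerically. The sketch below, assuming \(N=4\) and \(n=2\) purely as an illustration, compares the factorial formula, Python's built-in `math.comb` (available from Python 3.8), and an explicit enumeration with `itertools.combinations`.

```python
import math
from itertools import combinations

N, n = 4, 2  # illustrative population and sample sizes

# Number of possible samples from the closed-form formula N! / (n! (N-n)!).
count_formula = math.factorial(N) // (math.factorial(n) * math.factorial(N - n))

# Same number from the standard-library helper.
count_comb = math.comb(N, n)

# Explicit enumeration of the samples as tuples of unit IDs.
samples = list(combinations(range(1, N + 1), n))

print(count_formula, count_comb, len(samples))
print(samples)
```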

Toy Example

  • Let’s consider an artificial population (of size \(N=4\)).
ID    Size of farm (acres)    Yield
1     4                       1
2     6                       3
3     6                       5
4     20                      15
  • Parameter of interest: Mean yield of the farms in the population \[ \bar{Y}_U = (y_1 + y_2 + y_3 + y_4)/ 4\]

  • Instead of observing \(N=4\) farms, we want to select a sample of size \(n=2\).

  • There are 6 possible samples

Case    Sample ID    Sample mean \((=\bar{y})\)
1       1,2          2
2       1,3          3
3       1,4          8
4       2,3          4
5       2,4          9
6       3,4          10
  • We will select only one sample, and each possible sample mean differs from the population mean \(\bar{Y}_U = 6\); that is, every sample carries some error.
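
The six sample means can be reproduced with a short Python sketch. The yields are taken from the toy population; the variable names are illustrative, and the error is defined here as the sample mean minus the population mean.

```python
from fractions import Fraction
from itertools import combinations

# Yields of the four farms in the toy population, keyed by ID.
yields = {1: 1, 2: 3, 3: 5, 4: 15}
n = 2

pop_mean = Fraction(sum(yields.values()), len(yields))

# Sample mean for each of the possible samples of size n.
sample_means = {
    ids: Fraction(sum(yields[i] for i in ids), n)
    for ids in combinations(sorted(yields), n)
}

# Error of each sample: sample mean minus population mean.
errors = {ids: m - pop_mean for ids, m in sample_means.items()}

for ids, m in sample_means.items():
    print(ids, m, errors[ids])
```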

How to choose a sample?

  1. Non-probability sampling approach: select a sample subjectively (e.g., using the size of farms).

  2. Probability sampling approach: select a sample by a probability rule.

  • Definition: Probability sampling

    • The selection probability of each possible sample should be known.
    • Every element in the population should have a positive probability of being selected.

Sampling Design 1
A non-probability sampling

Case    Sample ID    Selection prob.
1       1,2          0
2       1,3          0
3       1,4          0
4       2,3          1
5       2,4          0
6       3,4          0

Sampling Design 2
A probability sampling

  • Simple random sampling (SRS): Assign the same selection probability to all possible samples
Case    Sample ID    Selection prob.
1       1,2          1/6
2       1,3          1/6
3       1,4          1/6
4       2,3          1/6
5       2,4          1/6
6       3,4          1/6
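
One way to carry out SRS in practice is sketched below in Python. It shows two equivalent options for the toy population: drawing \(n\) units without replacement using the standard-library `random.sample`, or listing all possible samples and picking one uniformly at random. The seed value is arbitrary and only makes the run reproducible.

```python
import random
from itertools import combinations

population_ids = [1, 2, 3, 4]
n = 2

rng = random.Random(421)  # arbitrary seed, for reproducibility only

# Option 1: draw n units without replacement; every subset of size n
# is equally likely, which is what SRS requires.
srs_sample = sorted(rng.sample(population_ids, n))

# Option 2: list all possible samples and pick one uniformly at random.
all_samples = list(combinations(population_ids, n))
srs_sample_alt = rng.choice(all_samples)

print(srs_sample, srs_sample_alt)
```

Option 2 mirrors the table directly but is only feasible when the number of possible samples is small; Option 1 scales to large populations.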

Example (Cont’d)

  • From the sampling design, we can derive the sampling distribution of a statistic.

  • Under SRS, each possible sample, and hence each value of the statistic listed below, has selection probability 1/6.

Case    Sample ID    Statistic (sample mean)    Selection prob.
1       1,2          \((y_1 + y_2)/2\)          1/6
2       1,3          \((y_1 + y_3)/2\)          1/6
3       1,4          \((y_1 + y_4)/2\)          1/6
4       2,3          \((y_2 + y_3)/2\)          1/6
5       2,4          \((y_2 + y_4)/2\)          1/6
6       3,4          \((y_3 + y_4)/2\)          1/6

Sampling distribution

  • In survey sampling from a finite population, the statistic (e.g. sample mean) can be treated as a discrete random variable whose probability distribution is completely determined by the sampling design.

  • In the above example, the sampling distribution of the sample mean (\(\bar{y}\)) is derived as follows:

Case    Sample mean    Selection prob.
1       2              1/6
2       3              1/6
3       8              1/6
4       4              1/6
5       9              1/6
6       10             1/6

Example (Cont’d)

  • Since the sampling distribution of \(\bar{y}\) is known, we can compute its mean and variance.

  • Mean of \(\bar{y}\) in this example is \[ E( \bar{y} ) = \frac{1}{6} \left( 2 + 3 + 8 + 4 + 9 + 10 \right) = 6, \] which equals the population mean \(\bar{Y}_U\).

  • Here, the expectation is over the sampling distribution, the distribution obtained by repeatedly applying the sample selection from the sampling design.

  • The individual values of \(y_i\) are treated as fixed; only the selected sample (i.e. which elements are selected) is random.
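
The expectation over the sampling design can be computed directly. In this Python sketch the yields are fixed numbers from the toy population, and the only source of randomness is the selection probability attached to each sample (1/6 under SRS). Variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

yields = {1: 1, 2: 3, 3: 5, 4: 15}  # fixed values y_i from the toy population
n = 2

samples = list(combinations(sorted(yields), n))
prob = Fraction(1, len(samples))  # SRS: each sample has probability 1/6

# Value of the sample mean for each possible sample.
ybar = {s: Fraction(sum(yields[i] for i in s), n) for s in samples}

# Expectation over the design: sum of (selection probability * value).
expected_ybar = sum(prob * ybar[s] for s in samples)
pop_mean = Fraction(sum(yields.values()), len(yields))

print(expected_ybar, pop_mean)
```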

Remark

  • No model assumption about \(y_i\) in the example: totally different framework!
  • Design-based approach: the reference distribution is the sampling distribution generated by the repeated application of the given sampling design.

Definition

  • Let \(\mathcal{U}=\{1, 2, \cdots, N\}\) be the index set of the finite population. We are interested in estimating parameter \(\theta\) for the finite population.
  • Let \(\mathcal{S} \subset \mathcal{U}\) be the index set of the sample selected from \(\mathcal{U}\). Let \(\hat{\theta}\) be an estimator of population parameter \(\theta\) computed from the observations in sample \(\mathcal{S}\).

  • The bias of \(\hat{\theta}\) as an estimator of \(\theta\) is defined as \[ Bias ( \hat{\theta}) = E ( \hat{\theta}) - \theta \] where the expectation is with respect to the sampling mechanism, or the sampling distribution of \(\hat{\theta}\).

  • If the bias of an estimator is zero, the estimator is called unbiased, or design-unbiased (to emphasize that the expectation is obtained from the sampling design.)

  • Under SRS, the sample mean is unbiased for the population mean. (It is not necessarily true for other sampling designs.)

  • The variance of \(\hat{\theta}\) is defined as \[ V( \hat{\theta} ) = E\left\{ \left( \hat{\theta} - E ( \hat{\theta}) \right)^2 \right\} \]

  • The MSE of \(\hat{\theta}\) as an estimator of \(\theta\) is defined as \[ MSE ( \hat{\theta}) = E\left\{ \left( \hat{\theta} - \theta\right)^2 \right\} \]

  • In general, we have \[ MSE ( \hat{\theta}) = \{Bias ( \hat{\theta}) \}^2 + V( \hat{\theta} ) . \]
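
  • A short derivation sketch of this decomposition: add and subtract \(E(\hat{\theta})\) inside the square and expand, \[ MSE ( \hat{\theta}) = E\left\{ \left( \hat{\theta} - E ( \hat{\theta}) + E ( \hat{\theta}) - \theta \right)^2 \right\} = E\left\{ \left( \hat{\theta} - E ( \hat{\theta}) \right)^2 \right\} + 2 \left\{ E ( \hat{\theta}) - \theta \right\} E\left\{ \hat{\theta} - E ( \hat{\theta}) \right\} + \left\{ E ( \hat{\theta}) - \theta \right\}^2 . \] The cross term vanishes because \(E ( \hat{\theta}) - \theta\) is a constant and \(E\{ \hat{\theta} - E ( \hat{\theta}) \} = 0\), which leaves \(V( \hat{\theta} ) + \{Bias ( \hat{\theta}) \}^2\).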

Precision vs Accuracy

  • An estimator has high precision if the variance is small.
  • An estimator is accurate if the MSE is small.

Probability sampling: Meaning

  • The selection probability for each sample is known.
  • Every element in the population should have a positive probability of being selected.
    That is, \(P( i \in \mathcal{S})>0\) for all \(i \in \mathcal{U}\).

  • Note : \(P( i \in \mathcal{S})\) is called the first-order inclusion probability of unit \(i\).
  • The first-order inclusion probability can be derived from the selection probability for each sample.

Probability sampling: Example

  • Simple random sampling is one example of probability sampling with the first-order inclusion probability \(P ( i \in \mathcal{S})= n/N>0\) for all \(i \in \mathcal{U}\).

Q: Is this a probability sampling?

Case    Sample ID    Selection prob.
1       1,2          1/3
2       1,3          0
3       1,4          1/3
4       2,3          0
5       2,4          1/3
6       3,4          0
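
One way to approach the question is to compute the first-order inclusion probabilities from the selection probabilities and check whether all of them are positive. The Python sketch below does this for the design in the table above and, for comparison, for SRS with \(n=2\). The helper `inclusion_probabilities` is a hypothetical name introduced here for illustration.

```python
from fractions import Fraction

# Sampling design from the table: sample (tuple of IDs) -> selection probability.
design = {
    (1, 2): Fraction(1, 3),
    (1, 3): Fraction(0),
    (1, 4): Fraction(1, 3),
    (2, 3): Fraction(0),
    (2, 4): Fraction(1, 3),
    (3, 4): Fraction(0),
}
units = [1, 2, 3, 4]


def inclusion_probabilities(design, units):
    """P(i in S): sum of selection probabilities over samples containing unit i."""
    return {
        i: sum((p for s, p in design.items() if i in s), Fraction(0))
        for i in units
    }


pi = inclusion_probabilities(design, units)
is_probability_sampling = all(p > 0 for p in pi.values())

# Comparison: SRS assigns probability 1/6 to every sample.
srs = {s: Fraction(1, 6) for s in design}
pi_srs = inclusion_probabilities(srs, units)

print(pi, is_probability_sampling)
print(pi_srs)
```

Under this check, a design qualifies only if no unit has inclusion probability zero; under SRS every unit has \(P( i \in \mathcal{S}) = n/N = 1/2\).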

Probability sampling: Advantages

  • Does not use model assumption on \(Y\)
    (i.e. \(y_i\) are treated as fixed)
  • Removes the subjectivity in the sample selection
  • Can implement unbiased estimation (i.e. free of selection bias)

Remark

  • Probability sampling does not guarantee accuracy.
  • For the toy example, the MSE of the sample mean under the non-probability sampling design (selecting \(\{2,3\}\) with certainty) is \[ MSE ( \bar{y}) = (4-6)^2 = 4. \]
  • On the other hand, the MSE of the sample mean under SRS is \[ MSE ( \bar{y}) = \frac{1}{6} \{ (2-6)^2 + \cdots + (10-6)^2 \} = \frac{58}{6} \approx 9.67 \]
  • Thus, in this toy example, SRS is less efficient (has a larger MSE) than the purposive sampling that always selects units 2 and 3.
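
The two MSE values can be verified with a short Python sketch that computes the bias, variance, and MSE of the sample mean under each design for the toy population. The function name `design_summary` is a hypothetical helper used only for this illustration.

```python
from fractions import Fraction
from itertools import combinations

yields = {1: 1, 2: 3, 3: 5, 4: 15}
n = 2
samples = list(combinations(sorted(yields), n))
pop_mean = Fraction(sum(yields.values()), len(yields))


def design_summary(design):
    """Return (bias, variance, MSE) of the sample mean under a sampling design."""
    ybar = {s: Fraction(sum(yields[i] for i in s), n) for s in design}
    expectation = sum(p * ybar[s] for s, p in design.items())
    variance = sum(p * (ybar[s] - expectation) ** 2 for s, p in design.items())
    mse = sum(p * (ybar[s] - pop_mean) ** 2 for s, p in design.items())
    return expectation - pop_mean, variance, mse


# Purposive design: always select units 2 and 3.
purposive = {s: Fraction(1 if s == (2, 3) else 0) for s in samples}
# SRS: every sample has the same selection probability.
srs = {s: Fraction(1, len(samples)) for s in samples}

bias_p, var_p, mse_p = design_summary(purposive)
bias_s, var_s, mse_s = design_summary(srs)

print("purposive:", bias_p, var_p, mse_p)
print("SRS:      ", bias_s, var_s, mse_s)
```

The purposive design has all of its MSE in the squared bias, while SRS has all of its MSE in the variance.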

How to improve the accuracy?

  • Recall that \[ MSE( \hat{\theta})= \{ Bias (\hat{\theta})\}^2 + V ( \hat{\theta}) \]

    • In non-probability sampling, variance is zero but there is some bias.
    • In probability sampling, it is exactly opposite: bias is zero but variance is positive.
  • In general, the variance will be reduced if we increase the sample size. However, the bias is independent of the sample size.

  • Thus, in probability sampling, we can make \(MSE ( \hat{\theta}) \downarrow 0\) as \(n \rightarrow \infty\).

  • In non-probability sampling, the MSE does not necessarily decrease with the sample size.
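
The effect of the sample size under SRS can be illustrated by exact enumeration on the toy population. In a finite population \(n\) can grow only up to \(N=4\), so this Python sketch checks that the MSE of the sample mean shrinks as \(n\) increases and reaches zero at \(n=N\). The helper name `srs_mse` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

yields = [1, 3, 5, 15]  # toy population yields
N = len(yields)
pop_mean = Fraction(sum(yields), N)


def srs_mse(n):
    """Exact MSE of the sample mean under SRS of size n, by enumeration."""
    means = [Fraction(sum(s), n) for s in combinations(yields, n)]
    return sum((m - pop_mean) ** 2 for m in means) / len(means)


mse_by_n = {n: srs_mse(n) for n in range(1, N + 1)}
for n, mse in mse_by_n.items():
    print(n, mse)
```

Because the sample mean is unbiased under SRS, the MSE here equals the variance, so the decrease reflects the variance reduction described above.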

Probability sampling

  • Statistical theories available for probability sampling
    • Law of large numbers: \(\hat{\theta}\) is close to \(E( \hat{\theta})\) for sufficiently large sample sizes.
    • Central Limit theorem: \(\hat{\theta}\) approximately follows a normal distribution for sufficiently large sample sizes.
  • Additional advantages of probability sampling with large sample:
    • Can control the accuracy
    • Can perform statistical inferences (e.g. confidence intervals)

Take-home message

  • Probability sampling is a gold standard for obtaining a representative sample as it can provide unbiased estimation.
  • In probability sampling, accuracy is achieved by increasing the sample size. However, this also increases the cost, so there is a trade-off between accuracy and cost.
  • Statistical inference can be made only under probability sampling.
  • Probability sampling does not make any model assumptions. The sampling distribution is induced from the sampling design, which is under the control of the survey sampler. Because of its objectivity, it is very popular in the official statistics produced by government agencies.