1/14/2020

Example

Women and Love: A Cultural Revolution in Progress, by Shere Hite (1987)

  • 84% of women not satisfied with their relationships
  • 70% of all women married \(\ge 5\) years have extramarital affairs

  • 95% of women report psychological and physical harassment from men with whom they are in love relationships

  • Widely criticized by the media: called “dubious” and “of limited value” by Time magazine (October 12, 1987).

  • Why?

    • Survey design (sampling methods, questionnaire) inadequate
    • Did not lead to a survey data set that supports inference to entire population of women in US

Sample Surveys

  • Characteristics of a target population are of interest

  • Impractical or impossible to observe the whole population

  • Select a sample of units in the population

  • Use data from sampled units to estimate characteristics of the entire population

Hite’s survey design

  • Sample
    • The sample was self-selected: 100,000 questionnaires were mailed \(\rightarrow\) only 4.5% were returned (i.e., a very low response rate)
    • Addresses came from a broad range of special groups \(\rightarrow\) excludes many women in the population
  • Questionnaire
    • 127 essay questions \(\rightarrow\) high respondent burden, nonresponse bias (who completes?)
    • Question wording vague (“in love” has many different interpretations)
    • Leading questions
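The effect of a self-selected sample with a low response rate can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the population size, the 30% true dissatisfaction rate, and the two response probabilities are hypothetical values chosen for illustration, not figures from Hite's study.

```python
import random

rng = random.Random(421)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1 = dissatisfied, 0 = satisfied (about 30% dissatisfied).
N = 100_000
population = [1 if rng.random() < 0.30 else 0 for _ in range(N)]

def responds(y):
    """Assumed response behaviour: dissatisfied people are much more
    likely (12% vs 1.5%) to return a long questionnaire."""
    return rng.random() < (0.12 if y == 1 else 0.015)

# Self-selected respondents: nobody controls who ends up in the data set.
respondents = [y for y in population if responds(y)]

true_rate = sum(population) / N
response_rate = len(respondents) / N
naive_estimate = sum(respondents) / len(respondents)

print(f"true dissatisfaction rate : {true_rate:.3f}")
print(f"response rate             : {response_rate:.3f}")
print(f"estimate from respondents : {naive_estimate:.3f}")
```

Under these assumptions, the response rate is close to the 4.5% seen in the example, and the respondent-only estimate lands far above the true rate, even though thousands of questionnaires come back. A large number of returns does not repair a biased selection mechanism.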

Example - Continued

  • What is a good sample? - “Representativeness”

A good sample should reproduce the characteristics of interest in the population, as closely as possible.

  • What else? - accurate measurement

We should get answers as accurately as possible

Survey Sampling

  • Survey: Measurement

  • Sampling: Representation

  • Survey Methodology and Sampling Statistics

| Survey Methodology | Sampling Statistics |
|---|---|
| Psychology, Cognitive Science | Statistics |
| Studies nonsampling error | Studies sampling error |
| Questionnaire design | Sampling design, estimation |

Sir Francis Galton (1822-1911)

  • Galton was a polymath who made important contributions in many fields of science, including meteorology (the anti-cyclone and the first popular weather maps), statistics (regression and correlation), psychology (synesthesia), biology (the nature and mechanism of heredity), and criminology (fingerprints)

  • He first introduced the use of questionnaires and surveys for collecting data on human communities.

Jerzy Neyman (1894-1981)

  • Showed that random sampling can provide a representative sample.

  • Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection, Journal of the Royal Statistical Society, 97 (4), 558–625

Morris Hansen (1910-1990)

  • Developed sampling designs for official statistics in Census Bureau and Bureau of Labor Statistics
  • A pioneer in making Survey Sampling a gold standard for data collection in government agencies.
  • Olkin, I. (1987). A conversation with Morris Hansen. Statistical Science 2, 162-179

Why sampling?

  • To reduce cost (efficiency)
  • To obtain information faster (timeliness)
  • Sometimes, sampling is the only way to obtain the information about the population. (e.g. quality inspection in automobile company)

Introduction

  • Surveys vary greatly along several dimensions
    • Scope of study objectives
    • Scale of the target population
    • Data collection methods
  • Illustrate variety in surveys with a few examples

1. Study Objectives

  • What inferences do we want to make?

    • What characteristics are of interest?
    • What estimates are of interest?
  • Many surveys are multi-purpose: Analyst is interested in many characteristics of the population

    • Example: National Health and Nutrition Examination Survey (NHANES)

  • Others focus on relatively few parameters
    • Crop Yield Survey: Estimate crop yield through laboratory analysis of plant measurements

2. Target population

  • The extent of the target population needs to be defined precisely
  • Some surveys study relatively broad domains
  • Others are narrower
    • National Agricultural Workers Survey is restricted to the agricultural sector of the labor force.

3. Data Collection

  • Interviewer-mediated: face-to-face survey, telephone survey
  • Self-administered: Paper, Web-based survey
  • Field observation: Data observed at sample sites (e.g. ground, tree, water, air)
  • A single survey may use multiple data collection modes
  • For example, National Health and Nutrition Examination Survey uses face-to-face interview and physical examination.

Survey Process

  • Multiple components: Planning & design, sample preparation, data collection, data analysis

  • Collaboration between the statisticians and the investigators is crucial throughout the survey process

  • Components can occur in parallel, and the process is often iterative

Survey Process: Define precise objectives

  • Parameter of interest
    • What estimates are of interest?
    • What characteristics of the target population do we want to measure?
    • Translate general concepts into specific, measurable quantity
  • Target population
    • Precisely, what population do we want to study?
    • Eligibility criteria: Age groups, Geographic scope
    • Reference time points

Example

  • An investigator wants to study “attitudes of Iowa State students about doing volunteer work.”
  • Measurable and quantifiable concepts
    • Proportion of students who are “very likely” to volunteer in 2020
    • Total hours of volunteer work completed in 2019
  • Target population: Iowa State students
    • Include graduate students?
    • Include part-time students?

Survey Process: Develop Data Collection Protocols

  • Questions or measurement techniques that capture the quantities of interest
    1. Construct questionnaire or other data collection forms (Ex: “How many hours of volunteer work did you complete in 2019?”)
    2. Pre-test and revise the questionnaire/form (Ex: Expand the question to specify different types of volunteer work - define volunteer work precisely)
    3. Train interviewers, data collectors

Survey Process: Represent population in a frame

  • To draw the sample, we need a representation of the population in a physical format that enables us to identify and select units
  • Frame: list of units from which we select the sample
  • Examples
    • List: such as telephone directory
    • Map: geographic representation of locations of interest
  • More details on frames to come

Survey Process: Sample Design and Selection

  • Determine a sample size
    • Precision of estimates
    • Costs / budget
  • Choose a sample design
    • Distribute the sample to obtain precise estimates of characteristics of interest
    • Practical factors often impact design
  • Select the sample

Survey Process: Collect and prepare data

  • Collect data (interview, observe, self-administer)
  • Edit and code data
    • Correct errors if possible
    • Translate responses into numeric codes for analysis

Survey Process: Data Analysis

  1. Exploratory analysis
    • Check for missing values, outliers, potential errors
    • Examine relationships between survey responses and auxiliary information from external sources
  2. Estimation
    • Compute a “survey weight” that projects the sample onto the larger population
    • Estimation methods are tied to the survey design
  3. Variance estimation
    • Quantify the uncertainty in the estimator
    • Standard error, confidence interval, coefficient of variation
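The estimation and variance estimation steps can be sketched for the simplest design. The example assumes a simple random sample without replacement; the population of volunteer hours, the sizes N and n, and all variable names are hypothetical and serve only to illustrate the computations.

```python
import math
import random

rng = random.Random(2020)

# Hypothetical finite population: hours of volunteer work for N students.
N = 5_000
population = [rng.choice([0, 0, 0, 5, 10, 20, 40]) for _ in range(N)]

# Simple random sample without replacement of size n.
n = 200
sample = rng.sample(population, n)

# Survey weight under SRS: each sampled unit represents N / n population units.
weight = N / n
total_hat = sum(weight * y for y in sample)   # estimated population total
mean_hat = total_hat / N                      # estimated population mean

# Variance estimation for the estimated mean under SRS without replacement.
s2 = sum((y - mean_hat) ** 2 for y in sample) / (n - 1)   # sample variance
se_mean = math.sqrt((1 - n / N) * s2 / n)                 # standard error with fpc
ci = (mean_hat - 1.96 * se_mean, mean_hat + 1.96 * se_mean)  # approx. 95% interval
cv = se_mean / mean_hat                                   # coefficient of variation

print(f"mean = {mean_hat:.2f}, SE = {se_mean:.2f}, CV = {cv:.3f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

For other designs the weights and the variance formula change, which is the sense in which estimation methods are tied to the survey design.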

Survey Process and Stat 421 emphasis

  • Step 1: Study Planning & Survey Design
    1. Define objectives, target population & parameters of interest
    2. Choose sampling design
    3. Choose data collection method
  • Step 2: Preparation
    1. Create sampling frame
    2. Select sample
    3. Develop questions or measurements
    4. Pre-test & revise questionnaire/form

  • Step 3: Collect and Prepare data
    1. Collect data
    2. Code data
    3. Edit data file
  • Step 4: Data Analysis
    1. Calculate estimates of parameters
    2. Make inference about the population

Stat 421: Sample Design and Estimation

  • Each sample design has its own estimators of population parameters (e.g., estimators of the population mean)
    • Estimation and variance estimation depend on the properties of the sample design.
    • Estimation in surveys involves computing a survey weight. This weight projects the sample onto the larger population.
  • Objectives and survey designs are integrally related.
  • Many different survey sample designs.
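The weighting idea can be written compactly. The notation here is an assumption introduced for illustration and is not used elsewhere in these slides: \(S\) is the selected sample, \(y_i\) is the value measured on unit \(i\), and \(\pi_i\) is the probability that unit \(i\) is included in the sample. A common choice of survey weight is the inverse of the inclusion probability:

\[
w_i = \frac{1}{\pi_i}, \qquad \hat{t} = \sum_{i \in S} w_i \, y_i, \qquad \hat{\bar{y}} = \frac{\hat{t}}{N}.
\]

Under simple random sampling of \(n\) units from \(N\), \(\pi_i = n/N\), so \(w_i = N/n\) and \(\hat{\bar{y}}\) reduces to the ordinary sample mean. Designs with unequal inclusion probabilities give unequal weights.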

Probability Sampling Designs

  • Basic Selection Methods
    • Simple random sampling (Ch 2.3): Randomly select units from a list using an equal-probability selection method (e.g., draw chips from a bowl)
    • Systematic sampling (Ch 2.7, 5.5): Sort units in the frame, choose a random start, then take every k-th unit
    • Sampling with probability proportional to a size or importance measure (Ch 6.2.3): Uses extra information on units; larger or more important units have a higher chance of being included in the sample
    • Stratified sampling (Ch 3): Divide the population into groups (strata); select an independent sample from each stratum
    • Cluster sampling (Ch 5 & 6): Population units are aggregated into larger units called clusters; clusters, rather than individual units, are sampled in the selection process
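The five selection methods can be sketched on a toy frame. Everything below is hypothetical: the 20-unit frame, the size measure, the stratum and cluster labels, and the function names. The proportional-to-size draw is done with replacement to keep the sketch short, which is only one of several ways to carry out such a selection.

```python
import random

rng = random.Random(7)

# Hypothetical frame: 20 units with a size measure, a stratum, and a cluster label.
frame = [
    {"id": i, "size": 1 + (i % 5), "stratum": "A" if i < 12 else "B", "cluster": i // 4}
    for i in range(20)
]

def simple_random_sample(units, n):
    """Equal-probability selection without replacement."""
    return rng.sample(units, n)

def systematic_sample(units, k):
    """Random start among the first k units, then every k-th unit."""
    start = rng.randrange(k)
    return units[start::k]

def pps_sample_with_replacement(units, n):
    """Draw n units with probability proportional to the 'size' measure."""
    return rng.choices(units, weights=[u["size"] for u in units], k=n)

def stratified_sample(units, n_per_stratum):
    """Independent simple random sample within each stratum."""
    chosen = []
    for label, n_h in n_per_stratum.items():
        stratum_units = [u for u in units if u["stratum"] == label]
        chosen.extend(rng.sample(stratum_units, n_h))
    return chosen

def cluster_sample(units, n_clusters):
    """Select whole clusters, then keep every unit in the selected clusters."""
    labels = sorted({u["cluster"] for u in units})
    picked = set(rng.sample(labels, n_clusters))
    return [u for u in units if u["cluster"] in picked]

srs = simple_random_sample(frame, 5)
sys_s = systematic_sample(frame, 4)
pps = pps_sample_with_replacement(frame, 5)
strat = stratified_sample(frame, {"A": 3, "B": 2})
clus = cluster_sample(frame, 2)

print([u["id"] for u in sys_s])
```

Note that the cluster sample ends up with 8 units because each selected cluster brings all 4 of its units with it, while the stratified sample is guaranteed to contain units from both strata.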

Choosing the probability sampling design

  • Statistical factors
    • Consider survey objectives
    • Most precise estimate
    • Likelihood of generating a representative sample
    • EX: a stratified design uses strata to legitimately exclude some samples that are unlikely to be representative
  • Practical factors
    • A list (or sampling frame) may not exist for elements of the population
    • EX: A cluster design is needed when we want to sample elementary school children, but we have a list of schools - not a list of children.
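The statistical argument for stratification can be checked with a small simulation. This is a sketch under made-up assumptions: the two strata, their sizes and value ranges, the sample size of 50, and the 500 repetitions are all hypothetical. It compares how much the estimated mean varies from sample to sample under simple random sampling and under proportionally allocated stratified sampling.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical population with two very different strata.
stratum_a = [10 + rng.uniform(-5, 5) for _ in range(600)]
stratum_b = [50 + rng.uniform(-5, 5) for _ in range(400)]
population = stratum_a + stratum_b
N, n = len(population), 50

def srs_mean():
    """Mean of one simple random sample of n units from the whole population."""
    return statistics.mean(rng.sample(population, n))

def stratified_mean():
    """Mean estimate from one stratified sample with proportional allocation
    (30 units from stratum A, 20 from stratum B)."""
    mean_a = statistics.mean(rng.sample(stratum_a, 30))
    mean_b = statistics.mean(rng.sample(stratum_b, 20))
    return (len(stratum_a) * mean_a + len(stratum_b) * mean_b) / N

# Repeat each design many times and compare the spread of the estimates.
reps = 500
var_srs = statistics.variance([srs_mean() for _ in range(reps)])
var_strat = statistics.variance([stratified_mean() for _ in range(reps)])

print(f"variance of SRS estimates        : {var_srs:.3f}")
print(f"variance of stratified estimates : {var_strat:.3f}")
```

A simple random sample can, by chance, contain far too many units from one stratum; the stratified design rules those samples out, so its estimates vary much less when the strata differ as sharply as they do here.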

Survey Estimation

  • Inference in surveys should be compatible with the survey design
    • Weights
    • Estimators
    • Variance estimators (confidence intervals)

Broad Syllabus

  • Basic concepts in probability sampling and the survey process (2 weeks)
  • Simple random sampling (2 weeks)
  • Using extra information in sample selection (4 weeks)
    • Systematic sampling
    • Probability proportional to size
    • Stratified sampling

  • Using extra information in estimation (in context of simple random sampling) (2 weeks)
    • Ratio estimation
    • Regression estimation
  • More complex estimation problems (2 weeks)
    • Domain estimation
    • Nonresponse
  • Cluster sampling (2 weeks)
    • Single stage
    • Two-stage cluster sampling

Part 2: Foundations of Survey Sampling

Survey Design

  • Survey design involves selecting methods to address all phases of the survey process
    • Objectives
    • Sample Design
    • Data Collection
    • Analysis approach
      • Weights
      • Estimation
      • Variance estimation

Population and Sample

Definition

  • Target population
    • The entire set of units for which the survey data are to be used to make inferences.
    • Thus, the target population defines those units for which the findings of the survey are meant to generalize.
  • Survey Population
    • The population from which the sample can be taken.
  • Sampling frame
    • A realized list of the survey population
  • Observational Units (elements)
    • An object on which a measurement is taken; the members of the population

Finite Population

  • The target population contains a FINITE NUMBER of units
    • \(N\) = Total number of elements in the population
  • Differs from notions of a population in other statistics courses
    • Infinite population defined by all possible realizations from a distribution, such as a normal distribution
    • For analysis, we act as if the population is infinite
  • In Stat 421, we only consider finite populations
    • Population is a finite collection of \(N\) units.

Example

  • Suppose that we are interested in the readership of the Des Moines Register among Iowa adults
  • We decide to estimate the percent of adults (ages 18 or older) residing in Iowa who read the Des Moines Register during the week of Jan 13th-18th, 2020
    • Target population: All adults ages 18 or older residing in Iowa during the week of Jan 13th-18th, 2020.
    • Element: Adult (individual 18 or older)
    • Population size: N = 3.16 million (Census Bureau 2019 estimate)

Target population: Complexities

  • Target populations are often difficult to define
  • Example: Political poll for an election – What population should we target?
    • Registered voters?
    • Voters in the last election?
    • Those “likely to vote” in the next election?

  • Example: 1994 Democratic gubernatorial primary election in Arizona

    • Target population was defined as registered voters who voted in the last election
    • Poll prediction: Eddie Basha would lose by at least 9 percentage points
    • Election outcome: Basha won 37% of the vote; the other candidates won 35% and 28%, respectively.
    • What happened? Misspecification of the target population!
    • Basha had strong support from demographic groups who had not voted before

Element/Observation Unit: Complexities

  • Some surveys have multiple levels of observation units

  • Example: Survey of Namibian households

    • Some measurements are taken at the household level, while other measurements are taken for individuals living in the household.
    • Household level measurement: Does this household have access to clean drinking water?
    • Individual level measurement: For each individual in the household, what is the highest level of education attained?

Sampling Frame

  • Telephone survey: sampling frame may be a list of telephone numbers
  • Face-to-face interview survey: sampling frame may be a list of addresses
  • Agricultural survey: sampling frame may be a map of areas containing farms

Sampling Frame: Complexities

  • Constructing a sampling frame that accurately reflects the target population can be a challenge.

    • Units in the population may be excluded from the frame (this is called the undercoverage problem)
    • Units in the frame may not be in the target population (overcoverage)
  • If the frame does not match the target population, the discrepancy is called coverage error.

  • Example: What is the average payroll among Iowa businesses with more than 5 employees in 2020?
    • Frame = list of businesses with more than 5 employees from 2019 tax records
    • New businesses in 2020: In the population but not the frame
    • Businesses that closed in 2020: In the frame but not the population
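The payroll example can be made concrete with set operations. The business identifiers below are hypothetical placeholders, not real records; the point is only to show how the frame and the target population can each contain units that the other lacks.

```python
# Hypothetical business identifiers; none of these come from real records.
target_population = {"B01", "B02", "B03", "B04", "B05", "B06"}  # operating in 2020
frame = {"B01", "B02", "B03", "B04", "B07", "B08"}              # from 2019 tax records

# In the population but not the frame (e.g., new in 2020): can never be sampled.
undercoverage = target_population - frame

# In the frame but not the population (e.g., closed in 2020): ineligible if sampled.
overcoverage = frame - target_population

# Units that the frame covers correctly.
covered = target_population & frame
coverage_rate = len(covered) / len(target_population)

print(sorted(undercoverage), sorted(overcoverage), round(coverage_rate, 2))
```

Ineligible units in the frame can be screened out once they are contacted, but units missing from the frame have no chance of selection, so undercoverage is usually the harder problem to correct.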

Sampling Frame Types: List and Area Frames

  • List frames
    • Examples: telephone numbers, addresses
    • Strength: may contain good auxiliary information about the population
    • Weakness: may exclude members of the population
  • Area frames: geographic representation
    • Examples: Map, area divided into parcels or tracts
    • Strength: may completely cover the population
    • Weakness: may have little auxiliary information; may contain ineligible units

List and Area Frame Examples

  • National Crime Victimization Survey
    • What percent of US households were victimized by crime in 2019?
    • Frame: list of households from US Census information and building permits
  • Census Bureau area frame
    • Divides US area into tracts, block groups, and blocks
    • Blocks are clusters of households
    • Block groups are clusters of blocks
    • Tracts are clusters of block groups (and blocks)

Census Tract and Block Groups

Sample

  • Sampling unit (SU): The unit that we actually sample

  • Observational Unit (OU): An object on which a measurement is taken

  • The sampling unit and the observational unit are not necessarily the same (SU = OU need not hold).

  • Example: survey of students at public schools

    • We have a frame of schools, not students

    • Select a sample of schools and interview students in selected schools

      • Sampling unit: school
      • Observation unit: student
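The distinction between sampling units and observation units can be sketched in a few lines. The school names, student labels, and the choice of sampling 2 schools are hypothetical; the sketch assumes every student in a selected school is interviewed.

```python
import random

rng = random.Random(11)

# Hypothetical frame of sampling units (schools). Students are only
# reachable through the school they attend.
school_frame = {
    "School A": ["a1", "a2", "a3"],
    "School B": ["b1", "b2"],
    "School C": ["c1", "c2", "c3", "c4"],
    "School D": ["d1", "d2", "d3"],
}

# Selection happens at the level of the sampling unit: the school.
sampled_schools = rng.sample(sorted(school_frame), 2)

# Measurement happens at the level of the observation unit: the student.
observed_students = [
    student for school in sampled_schools for student in school_frame[school]
]

print(sampled_schools, observed_students)
```

The number of students observed is not fixed in advance here: it depends on which schools happen to be selected.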

Sample

  • Sample: A subset of the survey population
  • Sampled population: Collection of all possible observation units that might have been chosen in a sample
    • Ideally, the sampled population is equal to the target population
    • Why might the sampled population differ from the target population?

Population