Spring 2018

Data Collection

The first step in conducting research is to identify topics or questions that are to be investigated. A clearly laid out research question is helpful in identifying what subjects or cases should be studied and what variables are important. It is also important to consider how data are collected so that they are reliable and help achieve the research goals.

1. The Goal

Learn about the entire group of individuals

2. The Problem

It is usually impossible to collect data on the entire population

  • expensive
  • time consuming
  • impossible to find everyone
  • not everyone willing to participate
  • population changing constantly - births and deaths

3. The Compromise

Collect data on a smaller group of individuals selected from the population

4. The Challenge

Bias (double-counting and under-counting)

Sampling from a Population

Population - the group that we are interested in making conclusions about
Sample - a subset of the population

Statistical Inference

It is the process of making judgments about the parameters of a population and the reliability of statistical relationships, typically on the basis of random sampling.
\(\require{AMScd}\) \[ \begin{CD} Sample @> {\text {statistical inference}} >> Population \end{CD} \]

\[\underbrace{\text {sample statistics}}_{\text{investigator knows}} = \underbrace{\text {population parameter}}_{\text{investigator wants to know}} + \text {bias} + \text {chance variation} \]

  • Selection bias - occurs when the sample is selected in such a way that it systematically excludes or underrepresents part of the population.
  • Nonresponse bias - occurs when responses are not obtained from all individuals selected for inclusion in a sample.
  • Measurement or Response bias - occurs when the data are collected in such a way that it tends to result in observed values that are different from the actual value in some systematic way. Contributing factors: question wording and order, mode of survey, and influence of the interviewer, etc.
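The decomposition above can be illustrated with a small simulation. The sketch below is not from the original slides: it assumes a made-up population of 10,000 incomes, a hypothetical helper `sample_mean`, and an invented rule that only individuals above 40,000 can be reached. It compares the average error of a simple random sample with that of a sample that systematically excludes part of the population.

```python
import random
import statistics

random.seed(1)

# Hypothetical population: 10,000 made-up incomes (illustrative only).
population = [random.gauss(50_000, 15_000) for _ in range(10_000)]
mu = statistics.fmean(population)  # the population parameter

def sample_mean(pool, n):
    """Mean of a simple random sample of size n drawn from pool."""
    return statistics.fmean(random.sample(pool, n))

# Unbiased design: every individual can be selected.
srs_means = [sample_mean(population, 100) for _ in range(2_000)]

# Selection bias (assumed rule): only individuals above 40,000 are reachable.
reachable = [x for x in population if x > 40_000]
biased_means = [sample_mean(reachable, 100) for _ in range(2_000)]

# Bias = average of (sample statistic - population parameter).
srs_bias = statistics.fmean(srs_means) - mu
selection_bias = statistics.fmean(biased_means) - mu

# Chance variation = spread of the statistic from sample to sample.
chance_variation = statistics.stdev(srs_means)

print(f"bias under SRS:              {srs_bias:9.1f}")
print(f"bias under restricted frame: {selection_bias:9.1f}")
print(f"chance variation (SD):       {chance_variation:9.1f}")
```

Under these assumptions the SRS errors average out close to zero, while the restricted frame overshoots by a roughly constant amount however many samples are drawn. Taking more samples reduces chance variation but does not remove bias.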

Bias and Precision


Bias - the average difference between the estimator and the true value.

Precision - the standard deviation of the estimator.

\[ \begin{aligned} \text{Mean Squared Error, MSE} &= \text{precision}^2 + \text{bias}^2 \\ \text{Root Mean Squared Error, RMSE} &= \sqrt{\text{MSE}} \end{aligned} \]
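The MSE identity can be checked numerically. The following is a minimal sketch with invented numbers (a true value of 10, measurement noise with SD 2, and a constant offset of 0.5 that makes the estimator biased). It computes bias, precision, and MSE separately from simulated estimates and confirms that they satisfy the formula.

```python
import random
import statistics

random.seed(2)

true_value = 10.0  # assumed true parameter (illustrative)

# A deliberately biased estimator: the mean of 25 noisy measurements
# plus a constant offset of 0.5.
estimates = [
    statistics.fmean([random.gauss(true_value, 2.0) for _ in range(25)]) + 0.5
    for _ in range(5_000)
]

# Bias: average difference between the estimator and the true value.
bias = statistics.fmean(estimates) - true_value

# Precision: standard deviation of the estimator.
precision = statistics.pstdev(estimates)

# MSE computed directly as the average squared error.
mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
rmse = mse ** 0.5

print(f"bias                  = {bias:.4f}")
print(f"precision             = {precision:.4f}")
print(f"MSE (direct)          = {mse:.4f}")
print(f"precision^2 + bias^2  = {precision**2 + bias**2:.4f}")
print(f"RMSE                  = {rmse:.4f}")
```

The population form of the standard deviation (`pstdev`) is used so that the two sides agree up to floating-point rounding.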

Parameter and Statistics

A statistic is a value from our observed data.

A parameter is a value that describes the population.

\[ \begin{array}{l|cc} \text{Name} & \text{Statistic} & \text{Parameter} \\ \hline \text{Mean} & \bar y & \mu \\ \text{Std. Deviation} & s & \sigma \\ \text{Correlation} & r & \rho \\ \text{Regression Coefficient} & b & \beta \\ \text{Proportion} & \hat p & p \end{array} \]

Sampling Methods

Simple Random Sampling (SRS)

Each case in the population has an equal chance of being included in the sample, and every possible sample of the chosen size is equally likely to be selected.
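As a minimal sketch, an SRS can be drawn with Python's standard `random.sample`, which selects without replacement. The sampling frame of 1,000 ID numbers and the sample size of 50 are assumptions made for illustration.

```python
import random

random.seed(3)

# Hypothetical sampling frame: ID numbers for 1,000 individuals.
population_ids = list(range(1, 1001))

# random.sample draws without replacement, so each individual has the
# same 50/1000 chance of ending up in the sample.
srs = random.sample(population_ids, k=50)

print(sorted(srs)[:10])
```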

Stratified Sampling

  1. The population is divided into non-overlapping, homogeneous subgroups called strata.
  2. Then, SRS is employed to select a certain number or a certain proportion of the whole within each stratum.
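The two steps above can be sketched as follows. The strata (class years), their sizes, the 10% sampling fraction, and the helper name `stratified_sample` are all hypothetical choices for illustration.

```python
import random
from collections import defaultdict

random.seed(4)

# Hypothetical population: (unit id, stratum) pairs, with class year as stratum.
strata_sizes = {"freshman": 400, "sophomore": 300, "junior": 200, "senior": 100}
population = [
    (f"{stratum}-{i}", stratum)
    for stratum, size in strata_sizes.items()
    for i in range(size)
]

def stratified_sample(units, fraction):
    """Take an SRS of the given fraction within every stratum."""
    # Step 1: divide the population into non-overlapping strata.
    by_stratum = defaultdict(list)
    for unit_id, stratum in units:
        by_stratum[stratum].append(unit_id)
    # Step 2: apply SRS separately inside each stratum.
    return {
        stratum: random.sample(members, k=round(fraction * len(members)))
        for stratum, members in by_stratum.items()
    }

sample = stratified_sample(population, fraction=0.10)
for stratum, chosen in sample.items():
    print(stratum, len(chosen))
```

Because every stratum contributes to the sample, no subgroup can be missed by chance, which is the main difference from cluster sampling.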

Cluster Sampling

  1. The population is often divided into non-overlapping mutually homogeneous yet internally heterogeneous subgroups called clusters. Cluster sampling is much like SRS, but instead of randomly selecting individuals, SRS is applied to select clusters.
    • In other words, unlike stratified sampling, cluster sampling is most helpful when there is a lot of case-to-case variability within a cluster but the clusters themselves don't look very different from one another. That is, we expect strata to be self-similar (homogeneous), while we expect clusters to be diverse (heterogeneous).
  2. The elements in each cluster are then sampled. If all elements in each sampled cluster are sampled, then this is referred to as a "one-stage" cluster sampling plan.
  3. Sometimes cluster sampling can be a more economical random sampling technique than the alternatives. For example, if neighborhoods represented clusters, this sampling method works best when each neighborhood is very diverse. Because each neighborhood itself encompasses diversity, a cluster sample can reduce the time and cost associated with data collection, because the interviewer would need only go to some of the neighborhoods rather than to all parts of a city, in order to collect a useful sample.
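A one-stage cluster sample for the neighborhood example might look like the sketch below. The city layout (20 neighborhoods of 30 households each) and the choice of 4 clusters are invented for illustration.

```python
import random

random.seed(5)

# Hypothetical city: 20 neighborhoods (clusters) of 30 households each.
neighborhoods = {
    f"N{c:02d}": [f"N{c:02d}-H{h:02d}" for h in range(30)]
    for c in range(20)
}

# SRS is applied to the clusters, not to individual households.
chosen_clusters = random.sample(sorted(neighborhoods), k=4)

# One-stage plan: keep every household in each selected cluster.
cluster_sample = [h for c in chosen_clusters for h in neighborhoods[c]]

print(chosen_clusters)
print(len(cluster_sample), "households sampled")
```

The interviewer only visits the 4 selected neighborhoods, which is where the cost saving comes from.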

One-Stage Cluster Sampling

Multistage Sampling

"Multistage" or "multistage cluster" sampling is an extension of cluster sampling and involves two (or more) steps.

  1. The first step is to take a cluster sample.
  2. Then, instead of including all of the individuals in these clusters in the sample, a second sampling method, usually SRS, is employed within each of the selected clusters.

In the neighborhood example, we could first randomly select some number of neighborhoods and then take an SRS from just those selected neighborhoods. As seen in Figure, stratified sampling requires observations to be sampled from every stratum. Multistage sampling selects observations only from those clusters that were randomly selected in the first step.

It is also possible to have more than two steps in multistage sampling. Each cluster may be naturally divided into subclusters. For example, each neighborhood could be divided into streets. To take a three-stage sample, we could first select some number of clusters (neighborhoods), and then, within the selected clusters, select some number of subclusters (streets). Finally, we could select some number of individuals from each of the selected streets.
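A two-stage version of the neighborhood example can be sketched as below. The layout (20 neighborhoods of 30 households), the 4 clusters chosen in the first stage, and the 5 households per cluster in the second stage are assumptions for illustration. A third stage would repeat the same pattern on subclusters such as streets.

```python
import random

random.seed(6)

# Hypothetical city: 20 neighborhoods (clusters) of 30 households each.
neighborhoods = {
    f"N{c:02d}": [f"N{c:02d}-H{h:02d}" for h in range(30)]
    for c in range(20)
}

# Stage 1: SRS of clusters.
chosen_clusters = random.sample(sorted(neighborhoods), k=4)

# Stage 2: SRS of households within each selected cluster only.
two_stage_sample = {
    c: random.sample(neighborhoods[c], k=5) for c in chosen_clusters
}

for cluster, households in two_stage_sample.items():
    print(cluster, households)
```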


Non-Random Sampling

Systematic Sampling

Select every \(k^{th}\) individual from a list of the population, where the position of the first person chosen is randomly selected from the first \(k\) individuals. This will give a non-representative sample if there is a structure to the list.
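A minimal sketch of this procedure, together with the failure case mentioned above, is shown below. The helper `systematic_sample`, the list of 100 individuals, and the "special"/"ordinary" labels are hypothetical.

```python
import random

random.seed(7)

def systematic_sample(frame, k):
    """Pick a random start among the first k positions, then every k-th unit."""
    start = random.randrange(k)  # 0, 1, ..., k-1
    return frame[start::k]

# Hypothetical list of 100 individuals, sampled with k = 10.
frame = list(range(1, 101))
sample = systematic_sample(frame, k=10)
print(sample)

# Structured list: every 10th entry is "special". With k = 10 the sample
# contains either only "special" entries or none at all.
structured = ["special" if i % 10 == 0 else "ordinary" for i in range(100)]
structured_sample = systematic_sample(structured, k=10)
print(set(structured_sample))
```

When the period of the list matches \(k\), the sample contains a single type of entry regardless of the random start, so it cannot reflect the 10% share of "special" entries in the list.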


Convenience or Volunteer Sampling

Use the first \(n\) individuals that are available or the individuals who volunteer to participate. This is almost certain to give a non-representative sample, so the results cannot be generalized to the population.

Example: Non-Representative Sample


  • Surveyed 10 million people who were subscribers or had telephones.
  • 2.4 million people responded (i.e., a 24% response rate).

Poll Prediction
Landslide victory of Landon.

Election Result
Landslide victory of Roosevelt.

What went wrong with the poll?

  • The sample was drawn from telephone directories, club membership lists, magazine subscriber lists, etc. These sources skewed toward upper-middle-class people and largely excluded the poor and unemployed.
  • The sample suffered from both selection and nonresponse bias.

Observational Studies


Generally, data in observational studies are collected only by passively monitoring study participants. They are inexpensive and good for discovering relationships related to rare outcomes. They are generally sufficient only to show associations, not cause-and-effect relationships.

Types of observational studies:

1. Retrospective Studies

  • Collect data on something that has already occurred.

2. Prospective Studies

  • Identify subjects in advance and collect data as events unfold.


Case: why did Fido and Fluffy die?

In early 2007, many dogs and cats died of kidney failure. Should you conduct a retrospective or prospective study to find out why?

A retrospective study could be appropriate since

  • The event happened in the past.
  • It may have been a rare event.
  • The retrospective study may provide clues on the cause.

Once possible causes are identified, try a prospective study to verify them, provided it does not kill any pets.


Case: drinking coffee and longevity

Coffee drinkers may live longer - nytimes.com, May 16, 2012.

Coffee may help you live longer, study suggests - thestar.com, May 17, 2012.

No, drinking coffee probably won't make you live longer - washingtonpost.com, May 17, 2012.


Case: drinking coffee and longevity

"Association of coffee drinking with total and cause-specific mortality," New England Journal of Medicine, May 2012.

Sample Size


Age Range

50-71 years


Study Period

1995 - 2008



How would you interpret the result?

Confounding Variable

A confounding variable is a variable that is associated with both the explanatory and response variables. Because of the confounding variable's association with both variables, we do not know if the response is due to the explanatory variable or due to the confounding variable.

Sun exposure is a confounding factor because it is associated with both the use of sunscreen and the development of skin cancer. People who are out in the sun all day are more likely to use sunscreen, and people who are out in the sun all day are more likely to get skin cancer.
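The sunscreen example can be made concrete with a toy simulation. Every probability below is invented, and the model deliberately assumes that cancer risk depends only on sun exposure, so sunscreen has no effect at all. The sketch shows that sunscreen users still appear to have a higher cancer rate until the comparison is made within each level of sun exposure.

```python
import random

random.seed(8)

def simulate_person():
    """Return (high_sun, uses_sunscreen, cancer) for one made-up person."""
    high_sun = random.random() < 0.5
    # The confounder drives the explanatory variable ...
    uses_sunscreen = random.random() < (0.8 if high_sun else 0.2)
    # ... and the response. Sunscreen itself plays no role in this toy model.
    cancer = random.random() < (0.10 if high_sun else 0.02)
    return high_sun, uses_sunscreen, cancer

people = [simulate_person() for _ in range(200_000)]

def cancer_rate(rows):
    """Share of rows in which cancer is True."""
    return sum(cancer for _, _, cancer in rows) / len(rows)

# Naive comparison that ignores sun exposure.
overall_users = cancer_rate([p for p in people if p[1]])
overall_non_users = cancer_rate([p for p in people if not p[1]])
print(f"overall: users {overall_users:.3f}, non-users {overall_non_users:.3f}")

# Comparison within each level of the confounding variable.
within = {}
for high_sun in (True, False):
    stratum = [p for p in people if p[0] == high_sun]
    within[high_sun] = (
        cancer_rate([p for p in stratum if p[1]]),
        cancer_rate([p for p in stratum if not p[1]]),
    )
    users, non_users = within[high_sun]
    print(f"high_sun={high_sun}: users {users:.3f}, non-users {non_users:.3f}")
```

In the overall comparison the association between sunscreen and cancer comes entirely from sun exposure. Holding exposure fixed makes the gap shrink to chance variation.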

Lurking Variable

Lurking variables are variables that are not considered in the analysis, but may affect the nature of the relationship between the explanatory variable and the outcome.

20-year survival status of women by smoking status

\[ \begin{array}{l|cc} & \text{Smoker: Yes} & \text{Smoker: No} \\ \hline \text{Dead} & 0.239 & 0.314 \\ \text{Alive} & 0.761 & 0.686 \end{array} \]

Are smokers less likely to die?
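One way a lurking variable could produce a table like the one above is sketched below. The source does not say which lurking variable is at work, so age group is assumed here purely for illustration, and all counts are invented. They are not the data behind the table. The point is only that a group can look better overall while doing worse within every subgroup.

```python
# Invented counts for illustration only: (deaths, total) by age group and status.
counts = {
    "younger": {"smoker": (40, 400), "non-smoker": (21, 300)},
    "older": {"smoker": (60, 100), "non-smoker": (200, 400)},
}

def death_rate(deaths, total):
    """Proportion who died."""
    return deaths / total

# Overall rates, ignoring the assumed lurking variable (age group).
overall = {}
for status in ("smoker", "non-smoker"):
    deaths = sum(counts[age][status][0] for age in counts)
    total = sum(counts[age][status][1] for age in counts)
    overall[status] = death_rate(deaths, total)
print("overall:", {s: round(r, 3) for s, r in overall.items()})

# Rates within each age group.
by_age = {
    age: {status: death_rate(*cell) for status, cell in groups.items()}
    for age, groups in counts.items()
}
for age, rates in by_age.items():
    print(age, {s: round(r, 3) for s, r in rates.items()})
```

In these made-up counts most smokers are in the younger group, where few people die whatever their smoking status. That imbalance alone is enough to reverse the overall comparison.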

Lurking Variable