Part 3: Collecting Data

Aimee Schwab
June 13, 2013

Where We're Headed…

Observational studies
Experimental studies

Disclaimer: most of the examples in this section are geared toward studies using humans as subjects. Why? As a whole, we tend to be most familiar with this type of research, and it mirrors what you'll be asked to do for your project. These ideas (especially experimental studies) can be extended to any desired subjects: cars, crops, animals.

When we design a study, we need to decide which variables we're most interested in. There are two major types of variables:

Response variable: the outcome variable on which comparisons are made
Explanatory variable: the variable that defines the groups (categorical) or changes in numerical values (quantitative) to be compared with respect to values for the response variable

The response variables are what we want to make a statement about. For example, I might want to design a survey to investigate what makes a person more likely to vote Republican or Democrat in the upcoming election. My response variable would be their voting preference - Republican or Democrat. You can sometimes think of the response as the final outcome.

There are lots of explanatory variables I could use. I could record a person's age, gender, income, education level, or marital status. Any of these might affect which party my subject is more likely to vote for. Changes in the explanatory variable help explain changes in the response variable!

The first part of any study or experiment is the design stage. Before we can collect our data, we have to consider what methods to use. Some methods will be more appropriate than others, depending on the variable we are ultimately interested in.

Experimental study: a researcher assigns subjects to certain experimental conditions, then observes outcomes of the response variable
- The experimental conditions are the explanatory variables!
Observational study: a researcher observes values of the response variable and explanatory variables for the sample subjects without manipulating those subjects in any way

Example: For each scenario below, determine whether it is an observational or experimental study.

1. Patients are given Lipitor to determine whether this drug has the effect of lowering high levels of cholesterol.
2. Patients with West Nile virus were not given a treatment that could have cured them. Their health was followed for years after they were found to have West Nile.
3. The Lancaster County Bureau of Weights and Measures randomly selects gas stations and obtains 1 gallon of gas from each pump. The amount pumped is measured for accuracy.
4. Cruise ship passengers are given magnetic bracelets, which they agree to wear in an attempt to eliminate or diminish the effects of motion sickness.

In the example above, scenarios 1 and 4 are experimental studies. Scenarios 2 and 3 are observational studies.

Caution: It is very difficult to determine causation from observational studies because of possible confounding between variables. Two variables are confounded when their effects on a response variable cannot be distinguished from each other. Unfortunately, it is sometimes unethical or very difficult to conduct an experiment. Even so, observational studies can provide good data when done properly.

Sampling Methods for Observational Studies

Biased samples systematically favor certain outcomes over others. There are two relatively common ways researchers sample a population that are actually biased. If possible, you should avoid using these types of samples.

Volunteer sample: subjects volunteer to participate in the study
- Typically one segment of the population may be more likely to volunteer than others, so these samples may be biased.
Convenience sample: the researcher includes subjects who are most convenient in the study instead of choosing a random sample
- Because of the time and place of the data collection, results may be biased.

Example: Identify the following scenarios as a volunteer sample or a convenience sample. Explain your choice.

1. An NBC television news reporter gets a reaction to a breaking story by polling people as they pass in front of his studio.
2. The BBC requested viewers to call the network and indicate their favorite poem. Of more than 7500 callers, more than twice as many voted for Rudyard Kipling's If than for any other poem.

1 is convenience, 2 is volunteer.

Random sample: subjects are chosen at random to participate in the study or experiment.

Random samples have some huge advantages! Random sampling lets us avoid selection bias - subjects aren't chosen to be in the sample based on convenience alone, nor is one group more likely to be chosen.

Random samples also allow us to use probability to analyze our results and make inferences about the data. All statistical methods we'll use in this class assume that data comes from a random sample.

Simple random sample: a sample of \( n \) subjects from a population in which each possible sample of that size has the same chance of being selected

Simplest type of sample, like drawing names from a hat
All the statistical inference procedures we use in this class require an SRS

Sampling frame: a list of subjects in the population from which the sample is taken

For example, if I wanted to randomly sample 10 students from Stat 218, I could use the class roster as my sampling frame.
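The roster example can be sketched in a few lines of Python. This is only an illustration: the roster below is a made-up list of 40 placeholder names, and `simple_random_sample` is a hypothetical helper, not something from the course materials. It relies on the standard library's `random.sample`, which draws without replacement so that every possible group of 10 is equally likely.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n subjects from the sampling frame without replacement."""
    rng = random.Random(seed)  # a seed is optional; it only makes the draw repeatable
    return rng.sample(frame, n)

# Hypothetical roster standing in for the real Stat 218 class list.
roster = [f"Student {i}" for i in range(1, 41)]

chosen = simple_random_sample(roster, 10, seed=218)
print(chosen)
```

This is the computer version of drawing names from a hat: the sampling frame is the hat, and the function pulls out 10 names.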

For the social sciences, most often we will use sample surveys to collect the data. Data for sample surveys is collected in one of three ways, depending on the goals of the study and the budget of the researcher.

Personal interview: an interviewer asks prepared questions and records the subject's responses.

Pros:
Cons:

Subjects are most likely to agree to participate in a personal interview, but these are the most costly to administer. If the research topic is sensitive (such as alcohol and drug use), a subject may be less likely to answer truthfully.

Telephone interview: an interviewer asks prepared questions over the phone (like a Gallup survey), and records the subject's spoken responses.

Pros:
Cons:

These are cheaper since there is no travel involved, but subjects are less likely to participate. Phone interviews have to be short, or the subject may decide to end the call early.

Questionnaire: subjects are requested to fill out a questionnaire that's sent to them by email, traditional mail, or by some other means.

Pros:
Cons:

These are the cheapest surveys to conduct, but subjects are the least likely to participate.

Telephone interviews are the method most commonly used in major national polls, and are probably what you're most familiar with.

While sample surveys are the most efficient method to collect some types of data, there are many sources of potential bias from survey sampling.

Undercoverage: the sample lacks representation from part of the population

For example, if a telephone interview is conducted and the phone book is used as a sampling frame, people with unlisted numbers or without home phone lines will be undercovered.

Nonresponse bias: some sampled subjects cannot be reached or refuse to participate

Response bias: subjects may give the response that they think is socially acceptable, not necessarily the response that is true for them

Response bias also occurs if the interviewer asks a question in a leading way, such that subjects are more likely to give a certain response.

Example 1: A highly conservative website asks its readers, “Do you support gay marriage?” 99% of the website's respondents said that they do NOT support gay marriage. Do you believe that 99% of Americans do not support gay marriage? How might this survey be biased? What could you change to eliminate this bias?

Choose another website.

Example 2: Read the two versions of the question below. Which do you think will get more “yes” answers and why? What would be a better question to include on a survey?

“Do you believe guns should be banned, given the fact that last year 1,134 people in America - many of them children - were killed accidentally or unintentionally by firearms?”
“Do you believe guns should be banned, given the fact that Americans use firearms to prevent crimes approximately 1 to 1.5 million times per year?”

“Do you believe guns should be banned?”

Example 3: In Part 1, we identified the population and sample for this scenario. What type of sample survey is this? How might the survey be biased? Are the results believable?

A graduate student at UNL conducts a research project about how adult Americans communicate. She mails a survey to 500 adults she knows. She asks them to mail back a response to the question: “Do you prefer to use e-mail or snail mail?” She gets back 65 responses, 42 indicating a preference for snail mail.

Example 4: We are interested in the percentage of households that still bake bread the old-fashioned way. To answer this question, a researcher makes random phone calls between 9 am and 5 pm. How might this survey be biased?

Example 5: A researcher wants to conduct a survey concerning students' sexual habits. How could each of the following influence student responses?

Age of the interviewer
Gender of the interviewer
Location of the interview

There are also other types of random sampling that we can use. Most of these are used in survey sampling.

Cluster sample: the population is divided into a large number of clusters, such as counties or city blocks. A simple random sample of the clusters is selected, and all subjects in those particular clusters are sampled.

We use cluster sampling if a reliable sampling frame is not available, or the cost of a simple random sample is excessive. Usually we need a larger sample size with a cluster sample to achieve the same accuracy as a simple random sample.
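A minimal sketch of cluster sampling in Python, assuming a made-up population of six city blocks with four households each. The `cluster_sample` helper and the block names are hypothetical and chosen only for illustration. The randomness is applied to the clusters, not to the individual subjects.

```python
import random

def cluster_sample(clusters, k, seed=None):
    """Pick k clusters by simple random sampling, then keep every subject in them."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), k)  # random choice among cluster names
    subjects = [s for name in picked for s in clusters[name]]
    return picked, subjects

# Hypothetical population: 6 city blocks, 4 households on each block.
blocks = {
    f"Block {b}": [f"Block {b} / House {h}" for h in range(1, 5)]
    for b in "ABCDEF"
}

picked_blocks, households = cluster_sample(blocks, 2, seed=1)
print(picked_blocks)
print(households)
```

Only a list of blocks is needed to draw the sample, which is why this design works when no list of individual households exists.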

Stratified random sample: the population is divided into separate groups called strata. A simple random sample is taken from each of the strata.

Stratified random sampling allows you to include enough subjects in each group that you want to evaluate. However, you must have a sampling frame and know which stratum each subject belongs to.
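Stratified random sampling can be sketched the same way. The frame below is invented (40 placeholder students tagged with a class year), and `stratified_sample` is a hypothetical helper. The function needs to know each subject's stratum up front, which mirrors the requirement that the sampling frame record stratum membership.

```python
import random

def stratified_sample(frame, stratum_of, n_per_stratum, seed=None):
    """Split the frame into strata, then take a simple random sample from each one."""
    rng = random.Random(seed)
    strata = {}
    for subject in frame:
        strata.setdefault(stratum_of(subject), []).append(subject)
    return {
        name: rng.sample(members, n_per_stratum)
        for name, members in strata.items()
    }

# Hypothetical sampling frame: (name, class year) for 40 students.
years = ["Freshman", "Sophomore", "Junior", "Senior"]
frame = [(f"Student {i}", years[i % 4]) for i in range(40)]

sample = stratified_sample(frame, lambda s: s[1], 3, seed=7)
for year, members in sample.items():
    print(year, [name for name, _ in members])
```

Every stratum is guaranteed to show up in the sample, which a simple random sample of the same total size cannot promise.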

Example: For each sample, identify whether the researcher used cluster sampling or stratified random sampling. Can you explain why a specialized sampling design is better than a simple random sample for these cases?

CNN is planning an exit poll in which 100 polling stations will be randomly selected and all voters will be interviewed as they leave the building.
An economist is studying the effect of education on salary and conducts a survey of 150 randomly selected workers from each of these categories: less than a high school degree; high school degree; more than a high school degree.

A Johns Hopkins University researcher surveys all cardiac patients in each of 30 randomly selected hospitals.
A marketing expert for MTV is planning a survey in which 500 people will be randomly selected from each age group of 10-19, 20-29, and so on.

Experimental Methods

From before, remember that an experimental study is when we actually do something to people, animals, or objects in order to observe the response. When we do this, we are interested in:

Treatments: experimental conditions that are randomly assigned to each experimental unit (subject)

Explanatory variable: the treatment that was assigned to that particular experimental unit (subject)

Response variable: the outcome variable of interest. We want to compare the effects of each treatment on the response variable

Suppose that I wanted to do a study on whether antidepressants help people quit smoking. My experimental units (subjects) could be adults who were 18 or over and had smoked 5 or more cigarettes per day for the previous year. My treatments could be two different antidepressants, A and B. The explanatory variable would then be whether the subject was on antidepressant A or antidepressant B. The response variable would be whether the subject had successfully quit smoking at the end of the experiment.

What specifically makes this an experiment, and not an observational study? I've manipulated my subjects, and assigned them to one drug or the other! They did not choose the treatment that they received.

In an experimental study, treatments should be randomly assigned to each experimental unit. Random assignment has several benefits:

It eliminates any potential bias that might exist if the researchers assign the treatments.
It tends to balance the groups on other variables that may affect the response (age, gender, etc.).
It tends to balance the groups on other variables that you may not have thought of before the study!
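Random assignment can be sketched in Python as a shuffle followed by dealing subjects out to the treatments. The 20 placeholder subjects and the `randomly_assign` helper are assumptions made for illustration; the two treatment labels come from the antidepressant example above.

```python
import random

def randomly_assign(units, treatments, seed=None):
    """Shuffle the units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)  # copy so the original list is left untouched
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical subjects; in a real study these would be the enrolled smokers.
subjects = [f"Subject {i}" for i in range(1, 21)]

groups = randomly_assign(subjects, ["Antidepressant A", "Antidepressant B"], seed=3)
for treatment, members in groups.items():
    print(treatment, members)
```

Because the shuffle decides who lands in which group, neither the researcher nor the subject has any say in the assignment.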

Example: We are interested in studying the effects of diet on weight loss. What are the response and explanatory variables? What are the experimental units? What treatments could we apply?

Write a short explanation of your study, then share it with a student whom you haven't worked with yet. Compare and contrast your studies. When you've finished, turn in your answer sheets.

To avoid confounding effects in our experiments, we try to make sure the conditions are as similar as possible for all variables except the factors we are studying. In a laboratory setting we can control nearly everything, but in many real-world situations that level of control is not realistic.

In some experiments, especially medical studies, we add a placebo (dummy treatment) to the experiment. This is because of the psychological effect that a placebo can have, called the placebo effect. Some people really do get better with a dummy treatment, because they believe they are getting an active treatment.

Control group: the group using the placebo treatment. It helps determine the true effect of the treatment by giving the researchers a baseline response for comparison.

Blind study: subjects are unaware which treatment they are receiving.

Double blind study: neither the subjects nor those interacting with them know which treatment each subject is receiving.

Treatments are often assigned a code identifier (treatment A, B, etc.), but the researchers won't find out until the conclusion of the study which treatment is which.

When we're setting up an experiment, it's important to make sure that we are using a valid randomization method. We can create random samples by using a mechanical method, such as drawing names from a hat or using a random number generator, to select subjects and assign them to treatments.

Replication: when several experimental units are assigned to each treatment.

Why is replication desirable? Each experimental unit will react a little differently to the treatments. By using as many experimental units as we can, we reduce the chance of any treatment getting “lucky”. The effects of any particular experimental unit will be averaged out by the rest.
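A small simulation can illustrate the "averaged out" idea. Everything numeric here is an assumption made for illustration: responses are simulated as normally distributed with a mean of 30 and a standard deviation of 10, and `spread_of_group_means` is a hypothetical helper. The sketch repeats an imaginary experiment many times and measures how much the average response for one treatment bounces around from run to run.

```python
import random
import statistics

def spread_of_group_means(units_per_treatment, n_experiments=2000, seed=0):
    """Standard deviation of a treatment's average response across repeated experiments."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_experiments):
        # Assumed response model: mean 30, unit-to-unit standard deviation 10.
        responses = [rng.gauss(30, 10) for _ in range(units_per_treatment)]
        means.append(statistics.mean(responses))
    return statistics.stdev(means)

for r in (1, 2, 10, 50):
    print(f"{r:>2} units per treatment: spread of the average = "
          f"{spread_of_group_means(r):.2f}")
```

Under these assumptions the spread shrinks as more units are assigned to the treatment, so one unusual unit has less chance of making a treatment look better or worse than it is.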

Example: Eight people with headaches are each given one of Advil, Tylenol, Excedrin, or a placebo. The time until they report the pain is gone is recorded for each person.

Identify the response variable, experimental unit, and treatments.
Is there a control in this study?
How could we randomize this experiment?
Is there replication?

Statistically Significant Differences

If our samples show differences between the treatments, does that automatically mean that there will also be differences between the treatments for the entire population?

Differences between two or more treatments are called statistically significant if they are too large to be reasonably attributed to chance alone.

When a study reports a statistically significant result, it means the researchers found good evidence that the difference they observed reflects a real effect rather than chance.

How large does a difference have to be for statistical significance?
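One way to build intuition for that question is a shuffling (permutation) check, sketched below. This is not necessarily the method used later in the course. The data are made up for illustration, and `permutation_p_value` is a hypothetical helper. The idea: if the treatment labels did not matter, then randomly reshuffling them should produce differences as large as the observed one fairly often.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Fraction of random relabelings whose mean difference is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the treatment labels were handed out at random
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles

# Made-up minutes until headache relief for two groups of eight people.
drug = [22, 25, 19, 24, 21, 20, 23, 18]
placebo = [31, 28, 35, 30, 27, 33, 29, 32]

p = permutation_p_value(drug, placebo)
print(f"Share of shuffles with a difference this large: {p:.4f}")
```

A small share means chance alone rarely produces a difference this big, which is the informal meaning of "too large to be attributed to chance".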