Introduction to Statistics

Spring 2018

Data Collection

The first step in conducting research is to identify topics or questions that are to be investigated. A clearly laid out research question is helpful in identifying what subjects or cases should be studied and what variables are important. It is also important to consider how data are collected so that they are reliable and help achieve the research goals.

1. The Goal

Learn about the entire group of individuals

2. The Problem

It is usually impossible to collect data on the entire population

expensive
time consuming
impossible to find everyone
not everyone willing to participate
population changing constantly - births and deaths

3. The Compromise

Collect data on a smaller group of individuals selected from the population

4. The Challenge

Bias (double-counting and under-counting)

Sampling from a Population

Population - the group that we are interested in making conclusions about
Sample - A subset of the population

Statistical Inference

It is the process of making judgments about the parameters of a population and the reliability of statistical relationships, typically on the basis of random sampling.
\(\require{AMScd}\) \[ \begin{CD} Sample @> {\text {statistical inference}} >> Population \end{CD} \]

\[\underbrace{\text {sample statistics}}_{\text{investigator knows}} = \underbrace{\text {population parameter}}_{\text{investigator wants to know}} + \text {bias} + \text {chance variation} \]

Selection bias - occurs when the sample is selected in such a way that it systematically excludes or underrepresents part of the population.
Nonresponse bias - occurs when responses are not obtained from all individuals selected for inclusion in a sample.
Measurement or Response bias - occurs when the data are collected in such a way that it tends to result in observed values that are different from the actual value in some systematic way. Contributing factors: question wording and order, mode of survey, and influence of the interviewer, etc.

Bias and Precision

Bias

The average difference between the estimator and the true value.

Precision

The standard deviation of the estimator.

\[ \begin{aligned} \text{Mean Squared Error, MSE} &= \text{precision}^2 + \text{bias}^2 \\ \text{Root Mean Squared Error, RMSE} &= \sqrt{\text{MSE}} \end{aligned} \]
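
The decomposition above can be checked numerically. The following is a minimal sketch, assuming a normal population and a deliberately biased estimator (the sample mean plus a fixed offset). The function name, the offset, and every numeric setting are illustrative assumptions, not part of the lecture.

```python
import random
import statistics

def simulate_estimator(true_mean=50.0, sd=10.0, n=25, offset=2.0, reps=5000, seed=1):
    """Repeatedly draw samples and apply a deliberately biased estimator."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        # Hypothetical biased estimator: the sample mean plus a fixed offset.
        estimates.append(statistics.fmean(sample) + offset)
    # Bias: average difference between the estimator and the true value.
    bias = statistics.fmean(estimates) - true_mean
    # Precision: standard deviation of the estimator across repetitions.
    precision = statistics.pstdev(estimates)
    # MSE: average squared distance between the estimator and the true value.
    mse = statistics.fmean((e - true_mean) ** 2 for e in estimates)
    return bias, precision, mse

bias, precision, mse = simulate_estimator()
print(f"bias = {bias:.3f}, precision = {precision:.3f}")
print(f"MSE = {mse:.3f}, precision^2 + bias^2 = {precision**2 + bias**2:.3f}")
```

The two quantities on the last line agree up to floating-point rounding, because the spread here is computed as the population standard deviation of the simulated estimates.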

Parameter and Statistics

A statistic is a value from our observed data.

A parameter is a value that describes the population.

\[ \begin{array}{l|c|c} \text{Name} & \text{Statistic} & \text{Parameter} \\ \hline \text{Mean} & \bar y & \mu \\ \text{Std. Deviation} & s & \sigma \\ \text{Correlation} & r & \rho \\ \text{Regression Coefficient} & b & \beta \\ \text{Proportion} & \hat p & p \end{array} \]

Sampling Methods

Simple Random Sampling (SRS)

Each case in the population has an equal chance of being included in the sample, and every possible sample of a given size is equally likely to be selected.
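
As a small illustration, an SRS can be drawn with Python's standard `random.sample`, which selects without replacement. The population of 100 numbered cases and the helper name are assumptions made for this sketch.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n cases without replacement; each case has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical population: 100 cases identified by the numbers 1 to 100.
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=42)
print(sample)
```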

Stratified Sampling

The population is divided into non-overlapping, homogeneous subgroups called strata.
Then, SRS is employed to select a certain number or a certain proportion of the whole within each stratum.
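
The two steps above might be sketched as follows, assuming a proportional allocation (the same fraction is drawn from every stratum). The student roster, the class-year strata, and the function name are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(cases, stratum_of, fraction, seed=None):
    """Split cases into strata, then take an SRS of the same fraction in each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        strata[stratum_of(case)].append(case)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = max(1, round(fraction * len(members)))  # at least one case per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical roster: (class year, student number).
students = (
    [("freshman", i) for i in range(60)]
    + [("sophomore", i) for i in range(30)]
    + [("junior", i) for i in range(10)]
)
chosen = stratified_sample(students, stratum_of=lambda s: s[0], fraction=0.1, seed=7)
per_stratum = Counter(year for year, _ in chosen)
print(per_stratum)
```

Every stratum is represented in the sample, which is the point of the design.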

Cluster Sampling

The population is often divided into non-overlapping mutually homogeneous yet internally heterogeneous subgroups called clusters. Cluster sampling is much like SRS, but instead of randomly selecting individuals, SRS is applied to select clusters.
- In other words, unlike stratified sampling, cluster sampling is most helpful when there is a lot of case-to-case variability within a cluster but the clusters themselves don't look very different from one another. That is, we expect strata to be self-similar (homogeneous), while we expect clusters to be diverse (heterogeneous).
The elements in each cluster are then sampled. If all elements in each sampled cluster are sampled, then this is referred to as a "one-stage" cluster sampling plan.
Sometimes cluster sampling can be a more economical random sampling technique than the alternatives. For example, if neighborhoods represented clusters, this sampling method works best when each neighborhood is very diverse. Because each neighborhood itself encompasses diversity, a cluster sample can reduce the time and cost associated with data collection, because the interviewer would need only go to some of the neighborhoods rather than to all parts of a city, in order to collect a useful sample.
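
A one-stage cluster sample of neighborhoods could look like the sketch below. The eight neighborhoods, their residents, and the function name are made up for illustration.

```python
import random

def one_stage_cluster_sample(clusters, n_clusters, seed=None):
    """Select whole clusters by SRS, then include every element of each selected cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [element for name in chosen for element in clusters[name]]
    return chosen, sample

# Hypothetical city: 8 neighborhoods (clusters) with 5 residents each.
neighborhoods = {
    f"N{i}": [f"N{i}-resident{j}" for j in range(5)] for i in range(1, 9)
}
chosen, sample = one_stage_cluster_sample(neighborhoods, n_clusters=3, seed=11)
print(chosen)
print(len(sample), "residents sampled")
```

Only the three selected neighborhoods need to be visited, which is where the savings in time and cost come from.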

One-Stage Cluster Sampling

Multistage Sampling

"Multistage" or "multistage cluster" sampling is an extension of cluster sampling and involves two (or more) steps.

The first step is to take a cluster sample.
Then, instead of including all of the individuals in these clusters in the sample, a second sampling method, usually SRS, is employed within each of the selected clusters.

In the neighborhood example, we could first randomly select some number of neighborhoods and then take an SRS from just those selected neighborhoods. As seen in the figure, stratified sampling requires observations to be sampled from every stratum. Multistage sampling selects observations only from those clusters that were randomly selected in the first step.

It is also possible to have more than two steps in multistage sampling. Each cluster may be naturally divided into subclusters. For example, each neighborhood could be divided into streets. To take a three-stage sample, we could first select some number of clusters (neighborhoods), and then, within the selected clusters, select some number of subclusters (streets). Finally, we could select some number of individuals from each of the selected streets.
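
A two-stage version of the neighborhood example is sketched below: an SRS of clusters, followed by an SRS of individuals within each selected cluster. The neighborhood data and the function name are hypothetical.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: SRS of clusters. Stage 2: SRS of individuals within each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        # Take everyone if the cluster is smaller than the requested size.
        sample.extend(rng.sample(members, min(n_per_cluster, len(members))))
    return chosen, sample

# Hypothetical city: 8 neighborhoods with 20 residents each.
neighborhoods = {
    f"N{i}": [f"N{i}-resident{j}" for j in range(20)] for i in range(1, 9)
}
chosen, sample = two_stage_sample(neighborhoods, n_clusters=3, n_per_cluster=4, seed=11)
print(chosen)
print(sample)
```

A three-stage sample would add one more round of selection (for example, streets) between the two stages shown here.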

Non-Random Sampling

Systematic Sampling

Select every \(k^{th}\) individual from a list of the population, where the position of the first person chosen is randomly selected from the first \(k\) individuals. This can give a non-representative sample if the list has a structure (for example, a periodic pattern) that lines up with \(k\).
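
The sketch below shows the procedure and how a structured list can defeat it. The list of daily records is a made-up example in which the day of the week repeats every 7 entries, so choosing \(k = 7\) returns records from a single weekday only.

```python
import random

def systematic_sample(listing, k, seed=None):
    """Pick a random start among the first k entries, then take every k-th entry."""
    rng = random.Random(seed)
    start = rng.randrange(k)
    return listing[start::k]

# Hypothetical list with periodic structure: 8 weeks of daily records.
week = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
daily_records = week * 8

aligned = systematic_sample(daily_records, k=7, seed=5)
print(aligned)      # every entry falls on the same weekday

not_aligned = systematic_sample(daily_records, k=5, seed=5)
print(not_aligned)  # a step of 5 cycles through the weekdays
```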

Non-Random Sampling

Convenience or Volunteer Sampling

Use the first \(n\) individuals that are available or the individuals who volunteer to participate. This is almost sure to give a non-representative sample which cannot be generalized to the population.

Example: Non-Representative Sample

Survey

The Literary Digest surveyed 10 million people who were magazine subscribers or had telephones before the 1936 U.S. presidential election.
2.4 million people responded (i.e. 24% response rate).

Prediction
Landslide victory of Landon.

Election Result
Landslide victory of Roosevelt.

What went wrong with the poll?

The sample was drawn from telephone directories, club memberships, magazine subscriber lists, etc., which consisted mostly of upper-middle-class people and largely excluded poor and unemployed people.
The sample suffered from both selection and nonresponse bias.
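
A small simulation can show how selection bias of this kind distorts an estimate. All counts below are invented for illustration and are not the 1936 figures. The hypothetical population has 100,000 voters, and candidate support is lower among the 30% who own telephones than among the rest.

```python
import random

rng = random.Random(1936)  # arbitrary seed for reproducibility

# Hypothetical population of (has_phone, supports_candidate) pairs.
population = (
    [(True, True)] * 12_000 + [(True, False)] * 18_000
    + [(False, True)] * 43_400 + [(False, False)] * 26_600
)

def support_rate(people):
    """Proportion of people who support the candidate."""
    return sum(supports for _, supports in people) / len(people)

true_support = support_rate(population)

# Sampling frame 1: the whole population (simple random sample).
srs = rng.sample(population, 2_000)

# Sampling frame 2: telephone owners only (selection bias).
phone_owners = [person for person in population if person[0]]
phone_only = rng.sample(phone_owners, 2_000)

print(f"true support:        {true_support:.3f}")
print(f"SRS estimate:        {support_rate(srs):.3f}")
print(f"phone-only estimate: {support_rate(phone_only):.3f}")
```

The SRS estimate differs from the true value only by chance variation, while the phone-only estimate is systematically too low no matter how large the sample is.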

Observational Studies

Generally, data in observational studies are collected only by passively monitoring study participants. Such studies are inexpensive and well suited to discovering relationships involving rare outcomes, but they are generally sufficient only to show associations, not causation.

Types of observational studies:

1. Retrospective Studies

Collect data on something that has already occurred.

2. Prospective Studies

Identify subjects in advance and collect data as events unfold.

Observational Studies

Case: why did Fido and Fluffy die?

In early 2007, many dogs and cats died of kidney failure. Should you conduct a retrospective or prospective study to find out why?

A retrospective study could be appropriate since:

The event happened in the past.
It may have been a rare event.
The retrospective study may provide clues on the cause.

Once possible causes are identified, try a prospective study to verify the causes, provided it doesn't kill any pets.

Observational Studies

Case: drinking coffee and longevity

Coffee drinkers may live longer - nytimes.com, May 16, 2012.

Coffee may help you live longer, study suggests - thestar.com, May 17, 2012.

No, drinking coffee probably won't make you live longer - washingtonpost.com, May 17, 2012.

Observational Study

Case: drinking coffee and longevity

"Association of coffee drinking with total and cause-specific mortality," New England Journal of Medicine, May 2012.

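\[ \begin{array}{l|l} \text{Sample Size} & \text{400,000} \\ \text{Age Range} & \text{50-71 years} \\ \text{Period} & \text{1995 - 2008} \\ \text{Deaths} & \text{52,000} \end{array} \]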

How would you interpret the result?

Confounding Variable

A confounding variable is a variable that is associated with both the explanatory and response variables. Because of the confounding variable's association with both variables, we do not know if the response is due to the explanatory variable or due to the confounding variable.

Sun exposure is a confounding factor because it is associated with both the use of sunscreen and the development of skin cancer. People who are out in the sun all day are more likely to use sunscreen, and people who are out in the sun all day are more likely to get skin cancer.

Lurking Variable

Lurking variables are variables that are not considered in the analysis, but may affect the nature of the relationship between the explanatory variable and the outcome.

20-year survival status of women by smoking status

\[ \begin{array}{l|cc} & \text{Smoker: Yes} & \text{Smoker: No} \\ \hline \text{Dead} & 0.239 & 0.314 \\ \text{Alive} & 0.761 & 0.686 \end{array} \]

Are smokers less likely to die?
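
One way a lurking variable can produce such a pattern is sketched below. The counts are hypothetical (they are not the data behind the table above), and age group is assumed to be the lurking variable purely for illustration. Within each age group smokers die at a higher rate, yet the pooled rates point the other way because most smokers in this made-up sample are younger.

```python
# Hypothetical (deaths, total) counts by age group and smoking status.
counts = {
    "younger": {"smoker": (40, 400), "non-smoker": (10, 200)},
    "older": {"smoker": (60, 100), "non-smoker": (200, 400)},
}

def death_rate(deaths, total):
    return deaths / total

def pooled_rate(status):
    """Death rate for one smoking status with the age groups combined."""
    deaths = sum(counts[age][status][0] for age in counts)
    total = sum(counts[age][status][1] for age in counts)
    return death_rate(deaths, total)

for age, by_status in counts.items():
    for status, (deaths, total) in by_status.items():
        print(f"{age:8s} {status:11s} death rate = {death_rate(deaths, total):.2f}")

for status in ("smoker", "non-smoker"):
    print(f"pooled   {status:11s} death rate = {pooled_rate(status):.2f}")
```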

Experiments

While observational studies are effective tools for answering certain research questions, experiments are essential to measure the effect of a treatment.

Subject

Entity who is participating in the study.

Treatment Group

The group of subjects that receives treatments.

Control Group

The group of subjects that receives no treatment.

Response Variable

The outcome of interest, measured on each subject.

Factor

The categorical variable that explains the outcome of the experiment. Each category is called a level.

Experiments

Blinding

When researchers keep the subjects uninformed about their treatment, the study is said to be blind. Its purpose is to reduce the potential for both researchers' and subjects' emotional bias.

Subjects would not know which experimental group they are assigned to.
The researcher (i.e. the person who is measuring the outcome) would not know which treatment is assigned to which experimental unit.

Single-blind: only one type of blinding is applied.
Double-blind: both types of blinding are applied.

Placebo

A substance or treatment with no active ingredients. The control group receives the placebo treatment.

The placebo effect is the phenomenon in which the recipient perceives an improvement in condition due to personal expectations rather than the treatment itself.

Principles of Experimental Design

Well-conducted experiments are built on three main principles.

Direct Control

Researchers assign treatments to cases, and they do their best to control any other differences in the groups. They want the groups to be as identical as possible except for the treatment, so that at the end of the experiment any difference in response between the groups can be attributed to the treatment and not to some other confounding or lurking variable. Direct control refers to variables that the researcher can control, or make the same.

Randomization

Researchers randomize patients into treatment groups to account for variables that cannot be controlled. Randomizing patients into the treatment or control group helps even out the effects of such differences, and it also prevents accidental bias from entering the study.

Replication

In a single study, replication is done by imposing the treatment on a sufficiently large number of subjects or experimental units. Scientists may also replicate the entire experiment on an entirely different population of experimental units to verify earlier findings.

Randomized and Blocked Design

A completely randomized experiment is one in which the subjects or experimental units are randomly assigned to each group in the experiment.

Researchers sometimes know or suspect that another variable, other than the treatment, influences the response. Under these circumstances, they may carry out a blocked experiment. In this design, they first group individuals into blocks based on the identified variable and then randomize subjects within each block to the treatment groups. This strategy is referred to as blocking.
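
A blocked randomization might be sketched as follows. The 40 patients, the "low risk" and "high risk" blocking variable, and the function name are hypothetical. Subjects are shuffled within each block and then dealt alternately to the groups, so every block is split evenly.

```python
import random
from collections import Counter, defaultdict

def blocked_assignment(subjects, block_of, groups=("treatment", "control"), seed=None):
    """Group subjects into blocks, then randomize to groups within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for subject in subjects:
        blocks[block_of(subject)].append(subject)
    assignment = {}
    for label in sorted(blocks):
        members = list(blocks[label])
        rng.shuffle(members)  # random order within the block
        for position, subject in enumerate(members):
            assignment[subject] = groups[position % len(groups)]
    return assignment

# Hypothetical patients: (identifier, risk level), 20 per risk level.
patients = [(f"P{i:02d}", "low risk" if i < 20 else "high risk") for i in range(40)]
assignment = blocked_assignment(patients, block_of=lambda p: p[1], seed=3)

tally = Counter((patient[1], group) for patient, group in assignment.items())
print(tally)
```

A completely randomized design would shuffle all 40 patients together, which balances the risk levels across groups only on average.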

Source: OpenIntroOrg

PATRICIA Study

PApilloma TRIal against Cancer In young Adults

The Lancet, Volume 374, Issue 9686, Pages 301 - 314, 25 July 2009

Efficacy of human papillomavirus (HPV)-16/18 AS04-adjuvanted vaccine against cervical infection and precancer caused by oncogenic HPV types (PATRICIA): final analysis of a double-blind, randomised study in young women.

Paavonen et al.

\[ \begin{array}{c|c} \text{Response Variable} & \text{Explanatory Variable} \\ \text{(Acquired an infection)} & \text{(Given the HPV vaccine)} \\ \hline \text{Yes} & \text{Yes} \\ \text{No} & \text{No} \end{array} \]

PATRICIA Study

\[ \bbox[yellow,5px] { \color{black} { \begin{array}{c} \text{Factor 1} \\ \text{(2 Levels)} \\ \hline \text{Drug A} \\ \text{Drug B} \end{array} } } \]

\[ \bbox[silver,5px] { \color{black} { \begin{array}{c} \text{Factor 2} \\ \text{(2 Levels)} \\ \hline \text{Dose A} \\ \text{Dose B} \end{array} } } \]

\[ \bbox[5px,border:2px solid red] { \begin{array}{c} \text{4 Treatments} \\ \hline \text{Drug A / Dose A} \\ \text{Drug A / Dose B} \\ \text{Drug B / Dose A} \\ \text{Drug B / Dose B} \end{array} } \]

Ischemic Preconditioning

Effect on Muscular Endurance

Can Ischemic Preconditioning improve athletic performance?

1. Experimental units:

40 male teenagers

2. Response Variable:

length of time a wall squat position can be held

3. Control Groups:

2 groups who received 0 lb pressure

4. Control of Extraneous Factors:

Age, sex, athletic ability

5. Randomization

Randomly assigned 10 experimental units to each of 4 treatment groups

Ischemic Preconditioning

Effect on Muscular Endurance

\[ \bbox[yellow,5px] { \color{black} { \begin{array}{c|c} \text{Factor 1} & \text{Amount of pressure applied by the blood pressure cuff} \\ \hline \text{Level 1} & \text{20 lb} \\ \text{Level 2} & \text{0 lb} \end{array} } } \]

\[ \bbox[silver,5px] { \color{black} { \begin{array}{c|c} \text{Factor 2} & \text{Length of time pressure was applied} \\ \hline \text{Level 1} & \text{10 min} \\ \text{Level 2} & \text{20 min} \end{array} } } \]

\[ \bbox[5px,border:2px solid red] { \begin{array}{c} \text{4 Treatments} \\ \hline \text{20 lb/10 min} \\ \text{20 lb/20 min} \\ \text{0 lb/10 min} \\ \text{0 lb/20 min} \end{array} } \]
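
The four treatments are all combinations of the two factor levels, and the randomization step deals 40 experimental units into four groups of 10. A minimal sketch, assuming the units are simply numbered 1 to 40 and using an arbitrary seed:

```python
import itertools
import random

pressures = ["20 lb", "0 lb"]     # Factor 1 levels
durations = ["10 min", "20 min"]  # Factor 2 levels

# Each treatment is one combination of a pressure level and a duration level.
treatments = [f"{p}/{d}" for p, d in itertools.product(pressures, durations)]

# Hypothetical identifiers for the 40 experimental units.
units = list(range(1, 41))
rng = random.Random(2018)  # arbitrary seed for reproducibility
rng.shuffle(units)

# Deal the shuffled units into 4 groups of 10, one group per treatment.
assignment = {
    treatment: sorted(units[i * 10:(i + 1) * 10])
    for i, treatment in enumerate(treatments)
}

for treatment, group in assignment.items():
    print(treatment, group)
```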

Next Week

Chapter 12-13: Probability