Intro to Statistics, Data & Experimental Design

M. Drew LaMar
September 2, 2022

“You can't fix by analysis what you bungled by design.”

- Light, Singer and Willett

Arranged and designed by John Manoogian III (jm3). Categories and descriptions originally by Buster Benson.

Calling Bullshit: The Art of Skepticism in a Data-Driven World

Perspective

Process of Science

What is Reality?

  • Can we know reality with certainty?
  • Is anything certain or known?
  • What is evidence?
  • What is data?
  • What is statistics?

Data Process

R for Data Science, by Garrett Grolemund and Hadley Wickham

Discuss: What’s missing?

Answer: Experimental Design and Data Collection!

Data Process

  1. Data Planning (Experimental Design)
    • Pilot Studies (miniature versions of steps 2-4 below)
  2. Data Collection (Experiment/Field Study)
  3. Data Cleaning/Curation (e.g. remove missing values, standardize units)
  4. Data Exploration & Analysis
    • Data Validation (sanity checks, e.g. do the values make biological sense?)
    • Data Munging/Wrangling (raw -> processed)
    • Data Analysis (Statistics)
    • Data Visualization
  5. Data Dissemination (Data Communication)
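
A minimal sketch of what steps 3 and 4 (cleaning, then validation) might look like in code. The records, the field names (`id`, `mass`, `unit`), and the plausibility range are all hypothetical, chosen only for illustration.

```python
# Hypothetical raw records: body mass in grams or kilograms, some values missing.
raw_records = [
    {"id": "a1", "mass": 31.2, "unit": "g"},
    {"id": "a2", "mass": None, "unit": "g"},
    {"id": "a3", "mass": 0.0295, "unit": "kg"},
    {"id": "a4", "mass": -4.0, "unit": "g"},
    {"id": "a5", "mass": 28.7, "unit": "g"},
]

TO_GRAMS = {"g": 1.0, "kg": 1000.0}


def clean(records):
    """Cleaning: drop records with missing mass and convert every mass to grams."""
    cleaned = []
    for rec in records:
        if rec["mass"] is None:
            continue
        cleaned.append({"id": rec["id"], "mass_g": rec["mass"] * TO_GRAMS[rec["unit"]]})
    return cleaned


def validate(records, low=0.0, high=100.0):
    """Validation: separate plausible values from ones that need a second look."""
    plausible = [r for r in records if low < r["mass_g"] <= high]
    flagged = [r for r in records if not (low < r["mass_g"] <= high)]
    return plausible, flagged


cleaned = clean(raw_records)
plausible, flagged = validate(cleaned)
print("kept:", [r["id"] for r in plausible])
print("flagged for review:", [r["id"] for r in flagged])
```

Flagged records are reported rather than silently deleted, so the decision about what to do with them stays with the researcher.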

Course Triad: Content + Skills

Reading Quiz (Class Discussion)

Q1. What feature of an estimate—precision or accuracy—is most strongly affected when individuals differing in the variable of interest do not have an equal chance of being selected?

Answer: Accuracy

Reading Quiz (Class Discussion)

Q2. Is the following study observational or experimental?

Psychologists tested whether the frequency of illegal drug use differs between people suffering from schizophrenia and those not having the disease. They measured drug use in a group of schizophrenia patients and compared it with that in a similar sized group of randomly chosen people.

Sub-questions:

  • What are the explanatory and response variables?
  • Are they categorical or numerical?

Answer: Observational

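The explanatory variable is schizophrenia status, with values “schizophrenic” and “not schizophrenic”, so it is a categorical variable. The response variable is the frequency of illegal drug use.

The study is observational because the treatment groups (the values of the explanatory variable) were not assigned randomly by the scientists.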

Experimental vs Observational

Definition: A study is experimental if the researcher assigns treatments randomly to individuals, whereas a study is observational if the assignment of treatments is not made by the researcher.
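
A minimal sketch of random assignment, the step that makes a study experimental. The subject IDs, the treatment labels, and the helper name `assign_treatments` are hypothetical.

```python
import random


def assign_treatments(subjects, treatments, seed=None):
    """Randomly assign treatments to subjects in (nearly) equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    # Deal the shuffled subjects out to the treatments in round-robin order.
    return {subject: treatments[i % len(treatments)]
            for i, subject in enumerate(shuffled)}


# Hypothetical subjects and treatments, for illustration only.
subjects = [f"plant_{i}" for i in range(1, 9)]
assignment = assign_treatments(subjects, ["control", "fertilizer"], seed=1)
for subject in subjects:
    print(subject, assignment[subject])
```

Because chance alone decides which subject gets which treatment, no property of the subjects can systematically line up with the treatment groups.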

Populations vs Samples

Definition: A parameter is a quantity describing a population, whereas an estimate or statistic is a related quantity calculated from a sample.

Parameter examples: Averages, proportions, measures of variation, and measures of relationship
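
A small simulation can make the distinction concrete. The population below is hypothetical (simulated body lengths). In real studies the whole population is rarely available, which is why we rely on estimates.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: body lengths (mm) of 10,000 individuals.
population = [rng.gauss(50.0, 5.0) for _ in range(10_000)]

# Parameter: a quantity computed from the entire population.
parameter = statistics.mean(population)

# Estimate: the same quantity computed from a random sample of 25 individuals.
sample = rng.sample(population, 25)
estimate = statistics.mean(sample)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (estimate):      {estimate:.2f}")
```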

What is statistics?

Statistics is a technology that describes and measures aspects of nature from samples.

Statistics lets us quantify the uncertainty of these measures.

Statistics makes it possible to determine the likely magnitude of a measurement's departure from the “truth”.

Statistics is about estimation, the process of inferring an unknown quantity of a target population using sample data.
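
As a rough sketch of what "quantifying uncertainty" can look like, the snippet below computes an estimate of a mean, its standard error, and an approximate 95% interval using the common "estimate plus or minus 2 standard errors" rule of thumb. The sample values are hypothetical, and the rule of thumb is only an approximation.

```python
import math
import statistics

# Hypothetical sample of 10 measurements, for illustration only.
sample = [4.1, 5.3, 3.8, 4.9, 5.6, 4.4, 5.0, 4.7, 5.2, 4.0]

n = len(sample)
estimate = statistics.mean(sample)                       # estimate of the population mean
standard_error = statistics.stdev(sample) / math.sqrt(n)  # uncertainty of that estimate

# Rule-of-thumb interval: estimate +/- 2 standard errors.
lower = estimate - 2 * standard_error
upper = estimate + 2 * standard_error

print(f"estimate: {estimate:.2f}")
print(f"standard error: {standard_error:.2f}")
print(f"approximate 95% interval: ({lower:.2f}, {upper:.2f})")
```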

What is statistics?

The two sides of the statistical coin:

  • Parameter estimation
  • Hypothesis testing

Definition: A statistical hypothesis is a specific claim regarding a population parameter.

Definition: Hypothesis testing uses data to evaluate evidence for or against statistical hypotheses.
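
One way to see hypothesis testing in action is a permutation test, sketched below. It is only one of many possible tests and is used here as an illustration, not as the course's prescribed method. The two groups of measurements and the function name `permutation_test` are hypothetical. The claim being evaluated is "the two population means do not differ".

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the observed difference and the fraction of random relabelings
    whose difference is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel individuals at random
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations


# Hypothetical measurements, for illustration only.
treated = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5]
control = [10.2, 11.8, 10.9, 12.4, 11.1, 10.7]

observed, p_value = permutation_test(treated, control)
print(f"observed difference in means: {observed:.2f}")
print(f"approximate p-value: {p_value:.4f}")
```

A small p-value means that a difference this large rarely arises from random relabeling alone, which counts as evidence against the claim of no difference.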

What is statistics? Parameter estimation

Example: A trapping study measures the rate of fruit fall in forest clear-cuts.

What is statistics? Hypothesis testing

Example: A clinical trial is carried out to determine whether taking large doses of vitamin C benefits health of advanced cancer patients.

What is probability?

Probability comes first!

…well, most of the time.

  • Many statistical techniques require assumptions about where your data is coming from (i.e. properties of the population)
  • In other words, an assumed probability model describes the population
  • Statistical techniques that are based on probability models are called parametric techniques, while those that are not are called non-parametric techniques.

Quote: “Huh?”

- Student

Data as Information

“Modern statisticians are familiar with the notion that any finite body of data contains only a limited amount of information on any point under examination; that this limit is set by the nature of the data themselves, and cannot be increased by any amount of ingenuity expended in their statistical examination: that the statistician's task, in fact, is limited to the extraction of the whole of the available information on any particular issue.”

- R. A. Fisher (biologist!)

Data as Information

There is desired and undesired information in data.

Goals:

  • Get accurate information by reducing bias (do we have the right signal?)

  • Get precise information by reducing sampling error due to random variation (increase signal-to-noise ratio)

    Definition: Bias is a systematic discrepancy between the estimates we would obtain, if we could sample a population again and again, and the true population characteristic.

    Definition: Sampling error is the difference between an estimate and the population parameter being estimated caused by chance.
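
The simulation below is a sketch of how bias and sampling error differ. It "samples a population again and again" under two schemes: random sampling, where every individual has an equal chance of selection, and a deliberately biased scheme in which larger values are more likely to be chosen. The population, the sample sizes, and the biased scheme are all hypothetical.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population of 5,000 measurements, uniform between 50 and 150.
population = [rng.uniform(50.0, 150.0) for _ in range(5_000)]
true_mean = statistics.fmean(population)


def random_sample_mean(n):
    """Every individual has an equal chance of being selected."""
    return statistics.fmean(rng.sample(population, n))


def biased_sample_mean(n):
    """Selection chance is proportional to the value itself (with replacement)."""
    return statistics.fmean(rng.choices(population, weights=population, k=n))


def summarize(sampler, n, repeats=2_000):
    """Sample again and again; return (bias, spread) of the resulting estimates."""
    estimates = [sampler(n) for _ in range(repeats)]
    return statistics.fmean(estimates) - true_mean, statistics.stdev(estimates)


bias_random, spread_n20 = summarize(random_sample_mean, 20)
bias_biased, _ = summarize(biased_sample_mean, 20)
_, spread_n80 = summarize(random_sample_mean, 80)

print(f"bias, random sampling (n=20): {bias_random:+.2f}")
print(f"bias, biased sampling (n=20): {bias_biased:+.2f}")
print(f"spread of estimates, n=20:    {spread_n20:.2f}")
print(f"spread of estimates, n=80:    {spread_n80:.2f}")
```

Under random sampling the estimates scatter around the true mean (sampling error, which shrinks as the sample grows). Under the biased scheme they are systematically too high, and a larger sample would not remove that offset.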

Sampling

Precision vs Accuracy

    “An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.”

    - John Tukey

Data as Information

For your question, there is desired (signal) and undesired (noise) information in your data.

Goals:

  • Isolate desired information by reducing or controlling for confounding factors (i.e. undesired information)

“The aim … is to provide a clear and rigorous basis for determining when a causal ordering can be said to hold between two variables or groups of variables in a model…”

- H. Simon

The Degradation of Information

Experimental Design, Data & Statistics

“Designing experiments is as much about learning to think scientifically as it is about the mechanics of the statistics that we use to analyse the data once we have it. It is about having confidence in your data, and knowing that you are measuring what you think you are measuring. It is about knowing what can be concluded from a particular type of experiment and what cannot.”

- Ruxton & Colegrave

Experimental Design, Data & Statistics

Design your experiment so that:

  • Measurements lead to useful data.
  • Useful data has information addressing your hypothesis.
  • Statistics are tailored to your data and powerful enough to separate out signal from noise.
  • Results of statistics can be properly interpreted as evidence for or against your original hypothesis.

Two key concepts of experimental design

“It might be said that the two major goals of designing experiments are to minimize random variation and account for confounding factors.”

- Ruxton & Colegrave

Definition: Random variation is the difference between measured values of the same variable taken from different experimental subjects.

Good experiments minimize or control for “unwanted” random variation, so that any variation due to the factors of interest can be detected more easily.

Definition: If we want to study the effect of variable A on variable B, but variable C also affects B, then C is a confounding factor.
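
A sketch of why confounding matters, using simulated data. In this hypothetical setup, B depends only on C, and A has no effect on B at all. When C also influences A (as it can in an observational study), A and B appear strongly related. When A is assigned at random, the apparent relationship disappears. The coefficients and the helper `pearson_r` are assumptions made for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(7)
n = 5_000

# Confounding factor C, and response B that depends on C but NOT on A.
c = [rng.gauss(0, 1) for _ in range(n)]
b = [2 * ci + rng.gauss(0, 1) for ci in c]

# Observational setting: A is itself influenced by C.
a_observed = [ci + rng.gauss(0, 1) for ci in c]

# Experimental setting: A is assigned at random, independently of C.
a_randomized = [rng.gauss(0, 1) for _ in range(n)]

r_observational = pearson_r(a_observed, b)
r_randomized = pearson_r(a_randomized, b)

print(f"correlation of A and B, observational: {r_observational:.2f}")
print(f"correlation of A and B, randomized:    {r_randomized:.2f}")
```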

Final Remarks

“Designing effective experiments needs thinking about biology more than it does mathematical calculations.”

“Experimental design is about the biology of the system, and that is why the best people to devise biological experiments are biologists themselves.”

- Ruxton & Colegrave

Assignments

  • Reading Quiz for Wednesday
    • Whitlock & Schluter, Chapter 2: Displaying data
  • Lab #0 due Wednesday at 11:59 pm