Intro to Data & Experimental Design

Alban Guillaumet, Troy University

“You can't fix by analysis what you bungled by design.”

- Light, Singer and Willett

Data as Information

There is desired and undesired information in data.

Goals:

Get accurate information by reducing bias due to confounding factors (do we have the right signal?)
- e.g., randomization, control, blinding
Get precise information by reducing sampling error due to random variation (increase signal-to-noise ratio)
- e.g., blocking, replication, balance

Two key concepts of experimental design

“It might be said that the two major goals of designing experiments are to minimize random variation and account for confounding factors.”

- Ruxton & Colegrave

Random variation

Definition: Random variation, also called between-individual variation, inter-individual variation, within-treatment variation, or noise, is the difference between measured values of the same variable taken from different experimental subjects.

It quantifies differences due to causes other than the ones we are interested in.
Good experiments minimize random variation, so that any variation due to the factors of interest can be detected more easily.
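The link between random variation and detectability can be sketched with the standard error of a difference between two group means. This is a minimal sketch, assuming two independent groups of equal size with the same within-group standard deviation; the function names and the numbers are illustrative and not part of the original slides.

```python
import math

def se_difference(sd_within, n_per_group):
    """Standard error of the difference between two group means,
    assuming independent groups of equal size and equal within-group SD."""
    return sd_within * math.sqrt(2.0 / n_per_group)

def signal_to_noise(true_difference, sd_within, n_per_group):
    """Ratio of the treatment effect (signal) to its standard error (noise)."""
    return true_difference / se_difference(sd_within, n_per_group)

# Same true effect and same replication, increasing random variation:
# the signal-to-noise ratio drops as the within-group SD grows.
for sd in (1.0, 2.0, 4.0):
    print(f"SD = {sd}: signal/noise = {signal_to_noise(1.0, sd, 10):.2f}")

# Same random variation, more replication: the ratio improves.
for n in (10, 40, 160):
    print(f"n = {n}: signal/noise = {signal_to_noise(1.0, 2.0, n):.2f}")
```

Under these assumptions, halving the random variation buys as much precision as quadrupling the number of replicates.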

Confounding factors

Definition: If we want to study the effect of variable A on variable B, but variable C also affects B, then C is a confounding factor.

Good experiments make it possible to eliminate or control the effects of confounding factors.

Example of confounding factor

Correlation does not require causation: The number of violent crimes tends to increase when ice cream sales increase.

Example of confounding factor

Temperature is the confounding factor: warmer temperatures increase ice cream sales and may also increase:
- irritability
- social interactions between people
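A small simulation can illustrate how a shared cause produces a correlation between two variables that do not affect each other. This is a minimal sketch with made-up numbers: the slopes, noise levels, and sample size are assumptions chosen for illustration, and `pearson` and `residuals` are helper names introduced here.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals(y, x):
    """Residuals of y after a least-squares straight-line fit on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical data: both variables depend on temperature, not on each other.
temperature = [rng.uniform(0, 35) for _ in range(2000)]
ice_cream_sales = [2.0 * t + rng.gauss(0, 10) for t in temperature]
violent_crimes = [1.0 * t + rng.gauss(0, 10) for t in temperature]

# Raw correlation: clearly positive, although neither variable causes the other.
r_raw = pearson(ice_cream_sales, violent_crimes)

# Correlation after removing the part of each variable explained by temperature.
r_given_temperature = pearson(residuals(ice_cream_sales, temperature),
                              residuals(violent_crimes, temperature))

print(f"raw correlation: {r_raw:.2f}")
print(f"correlation after accounting for temperature: {r_given_temperature:.2f}")
```

Once temperature is accounted for, the remaining correlation should be close to zero, which is what the confounding explanation predicts.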

Confounding factors

1 - EFFECT ON BIAS

Confounding factors may bias the estimate of the relationship between measured explanatory and response variables, sometimes even reversing (!) the apparent effect of one on the other.

2 - EFFECT ON SAMPLING ERROR

Even if they do not induce any bias, unaccounted-for confounding factors may increase the noise in the data, leading to less precise estimates and less powerful tests.

Case study

Example: Does supplemental oxygen affect the probability of surviving an ascent of a peak > 8,000 m in the Himalayas?

Case study

Discuss: Can you think of some of the potential confounding factors?

Case study

1) Preparedness, including:

the presence of an experienced guide, amount of training, proper acclimatization, use of top-quality equipment, etc.
Not accounting for these factors may lead to an overestimate of the effect of oxygen supplementation.

2) Weather and other external factors, such as:

storms, avalanches, etc.
Not accounting for these factors may lead to an underestimate of the effect of oxygen supplementation.

Experimental studies

Unlike observational studies, properly designed experiments can identify the causes of the association between treatment and response variables.
Crucial advantage = random assignment of treatments to units (randomization).
Randomly assigning the treatment (oxygen supplementation) to units (individual climbers) breaks the association between confounding and explanatory variables, so that the causal relationship between the explanatory and response variables can be assessed.
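The effect of randomization can be sketched by simulating the oxygen example twice: once as an observational study in which prepared climbers are more likely to use oxygen, and once as an experiment in which oxygen is assigned by coin flip. All probabilities and effect sizes below are invented for illustration, and the function names are hypothetical.

```python
import random

TRUE_OXYGEN_EFFECT = 0.10   # assumed gain in survival probability (made up)
PREPAREDNESS_EFFECT = 0.20  # assumed gain from being well prepared (made up)

def survives(oxygen, prepared, rng):
    """Draw one survival outcome from the assumed probabilities."""
    p = 0.60 + TRUE_OXYGEN_EFFECT * oxygen + PREPAREDNESS_EFFECT * prepared
    return rng.random() < p

def estimated_effect(randomized, n, rng):
    """Difference in survival rate between climbers with and without oxygen."""
    outcomes = {0: [], 1: []}
    for _ in range(n):
        prepared = int(rng.random() < 0.5)
        if randomized:
            # Experiment: a coin flip, independent of preparedness.
            oxygen = int(rng.random() < 0.5)
        else:
            # Observational study: prepared climbers use oxygen more often.
            oxygen = int(rng.random() < (0.8 if prepared else 0.2))
        outcomes[oxygen].append(survives(oxygen, prepared, rng))
    rate_with = sum(outcomes[1]) / len(outcomes[1])
    rate_without = sum(outcomes[0]) / len(outcomes[0])
    return rate_with - rate_without

rng = random.Random(42)  # fixed seed so the sketch is reproducible
observational = estimated_effect(randomized=False, n=20000, rng=rng)
experimental = estimated_effect(randomized=True, n=20000, rng=rng)

print(f"true effect:            {TRUE_OXYGEN_EFFECT:.2f}")
print(f"observational estimate: {observational:.2f}")
print(f"randomized estimate:    {experimental:.2f}")
```

In this setup the observational estimate mixes the oxygen effect with the preparedness effect and overestimates it, while the randomized estimate should land near the assumed true value.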

Experimental studies

An analogy: linked genes on a chromosome. They tend to be transmitted together.
Same here: oxygen supplementation, presence of an experienced guide, amount of training, proper acclimatization, use of top-quality equipment, etc. tend to be associated (perhaps because they all depend on money?).
Randomly assigning the treatment (oxygen supplementation) to units (individual climbers) would break the association between confounding and explanatory variables.

Experimental studies

However, randomization does not eliminate the variation contributed by confounding factors, only their correlation with treatment. It ensures that variation from confounding factors is spread more evenly between the treatment groups, and so it creates no bias.
Blocking is an important strategy used to account for the sampling error due to confounding factors.

Experimental design

Definition: A completely randomized design is an experimental design in which treatments are assigned to all units by randomization.

Definition: Blocks, also called strata, are groups of experimental units that share common features. In a randomized block design, each treatment is applied once within every block, with treatments assigned at random to the units of that block.
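The two designs differ only in where the shuffling happens. The sketch below assumes a balanced layout and uses hypothetical unit and block names (climbers grouped by expedition); it is one simple way to generate the assignments, not a prescribed procedure.

```python
import random

def completely_randomized(units, treatments, rng):
    """Assign treatments to all units at random, as evenly as possible."""
    n_rep, remainder = divmod(len(units), len(treatments))
    labels = list(treatments) * n_rep + list(treatments)[:remainder]
    rng.shuffle(labels)  # a single shuffle across all units
    return dict(zip(units, labels))

def randomized_block(blocks, treatments, rng):
    """Apply each treatment once per block, in random order within the block.

    `blocks` maps a block name to its list of units; every block must hold
    exactly one unit per treatment.
    """
    assignment = {}
    for name, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError(f"block {name!r} needs {len(treatments)} units")
        labels = list(treatments)
        rng.shuffle(labels)  # a separate shuffle inside each block
        assignment.update(zip(units, labels))
    return assignment

rng = random.Random(7)  # fixed seed so the sketch is reproducible
treatments = ["oxygen", "no_oxygen"]

# Hypothetical units: eight climbers, then the same climbers grouped
# into blocks of two by expedition.
climbers = [f"climber_{i}" for i in range(1, 9)]
crd = completely_randomized(climbers, treatments, rng)

blocks = {
    "expedition_A": ["climber_1", "climber_2"],
    "expedition_B": ["climber_3", "climber_4"],
    "expedition_C": ["climber_5", "climber_6"],
    "expedition_D": ["climber_7", "climber_8"],
}
rbd = randomized_block(blocks, treatments, rng)

print(crd)
print(rbd)
```

In the blocked version, every expedition contributes one climber to each treatment, so differences between expeditions cannot inflate the comparison between treatments.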

Completely randomized design

Randomized block design

Observational studies

Two strategies are possible to limit the effect of confounding factors (bias and sampling error):

i) Matching
Frequent in epidemiological studies (case-control studies)
Every individual in the target group (e.g., with a disease) is paired with a corresponding healthy individual who has the same measurements for confounding variables such as age, weight, sex, and ethnic background.
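Exact matching on categorical confounders can be sketched as a lookup: each case takes the first unused control with identical values for the matching variables. The records, field names, and the `match_cases_to_controls` helper below are hypothetical; real studies often match on ranges or use more elaborate schemes.

```python
from collections import defaultdict

def match_cases_to_controls(cases, controls, keys):
    """Pair each case with one unused control sharing the same values for `keys`.

    Returns (pairs, unmatched_cases). Each control is used at most once.
    """
    pool = defaultdict(list)
    for control in controls:
        pool[tuple(control[k] for k in keys)].append(control)

    pairs, unmatched = [], []
    for case in cases:
        candidates = pool[tuple(case[k] for k in keys)]
        if candidates:
            pairs.append((case, candidates.pop(0)))  # remove so it is not reused
        else:
            unmatched.append(case)
    return pairs, unmatched

# Hypothetical records: cases have the disease, controls are healthy.
cases = [
    {"id": "c1", "age_band": "40-49", "sex": "F"},
    {"id": "c2", "age_band": "50-59", "sex": "M"},
    {"id": "c3", "age_band": "60-69", "sex": "F"},
]
controls = [
    {"id": "h1", "age_band": "50-59", "sex": "M"},
    {"id": "h2", "age_band": "40-49", "sex": "F"},
    {"id": "h3", "age_band": "40-49", "sex": "F"},
    {"id": "h4", "age_band": "30-39", "sex": "M"},
]

pairs, unmatched = match_cases_to_controls(cases, controls, keys=("age_band", "sex"))
for case, control in pairs:
    print(case["id"], "matched with", control["id"])
print("unmatched cases:", [c["id"] for c in unmatched])
```

Cases without a suitable control are reported rather than silently dropped, since losing them changes which population the comparison describes.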

Observational studies

Two strategies are possible to limit the effect of confounding factors (bias and sampling error):

ii) Adjusting
Statistical methods such as generalized linear models (GLMs) are used to correct for differences between treatment and control groups in suspected confounding variables.
Importance of literature research and pilot studies!
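The idea of adjusting can be sketched without fitting a model. The example below swaps the GLM for a simpler stratified technique, the Mantel-Haenszel pooled odds ratio, so that it stays self-contained; it compares the crude odds ratio with the one obtained after stratifying on a suspected confounder. The counts are invented for illustration.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table:
    a = treated successes, b = treated failures,
    c = control successes, d = control failures."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio over a list of (a, b, c, d) tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Hypothetical counts, one 2x2 table per level of the confounder.
# Within each stratum the treatment has no effect (odds ratio = 1),
# but treated units are concentrated in the high-success stratum.
strata = [
    (160, 40, 40, 10),   # stratum 1: high success rate, mostly treated
    (10, 40, 40, 160),   # stratum 2: low success rate, mostly control
]

# Crude analysis: pool the strata and ignore the confounder.
totals = [sum(column) for column in zip(*strata)]
crude = odds_ratio(*totals)

adjusted = mantel_haenszel_or(strata)

print(f"crude odds ratio:    {crude:.2f}")
print(f"adjusted odds ratio: {adjusted:.2f}")
```

With these counts the crude analysis suggests a strong treatment effect that disappears once the confounder is taken into account.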

Observational study: Simpson's paradox

Fictitious example (based on a real case) of a sex-biased admission rate in a US university

Simpson's paradox


Call:
glm(formula = success ~ sex, family = binomial, data = d)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0769  -1.0769  -0.9281   1.2814   1.4492  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.61940    0.03190 -19.419   <2e-16 ***
sexM         0.37821    0.03871   9.771   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 17271  on 12761  degrees of freedom
Residual deviance: 17175  on 12760  degrees of freedom
AIC: 17179

Number of Fisher Scoring iterations: 4

Simpson's paradox


Call:
glm(formula = success ~ subject + sex, family = binomial, data = d)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2148  -1.1769  -0.8457   1.1779   1.5504  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.08753    0.04855   1.803   0.0714 .  
subjectB    -0.84322    0.04361 -19.335   <2e-16 ***
sexM        -0.08866    0.04637  -1.912   0.0559 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 17271  on 12761  degrees of freedom
Residual deviance: 16787  on 12759  degrees of freedom
AIC: 16793

Number of Fisher Scoring iterations: 4

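Linear predictor (log-odds) for a male applicant in subject B, i.e. intercept + subjectB + sexM:

        1 
-0.844346

Corresponding predicted probability of success (inverse logit of the value above):

[1] 0.3006194

Predicted probability of success for the reference group (inverse logit of the intercept alone):

[1] 0.5218685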
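The two fitted models can be compared on the probability scale by applying the inverse logit to sums of coefficients. The sketch below copies the coefficients from the printed output above (rounded to five decimals, so the results agree with the R values only up to rounding). It assumes that the reference levels are female applicants and subject A, which the output suggests but does not state.

```python
import math

def inv_logit(x):
    """Inverse of the logit link: converts log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Coefficients copied from the two glm summaries.
model_sex_only = {"intercept": -0.61940, "sexM": 0.37821}
model_with_subject = {"intercept": 0.08753, "subjectB": -0.84322, "sexM": -0.08866}

def p_sex_only(male):
    m = model_sex_only
    return inv_logit(m["intercept"] + m["sexM"] * male)

def p_with_subject(male, subject_b):
    m = model_with_subject
    return inv_logit(m["intercept"] + m["subjectB"] * subject_b + m["sexM"] * male)

# Model 1 (subject ignored): males appear to succeed more often.
print(f"all subjects  F: {p_sex_only(0):.3f}  M: {p_sex_only(1):.3f}")

# Model 2 (subject included): within each subject the difference reverses sign.
for subject_b, name in ((0, "A"), (1, "B")):
    print(f"subject {name}     F: {p_with_subject(0, subject_b):.3f}"
          f"  M: {p_with_subject(1, subject_b):.3f}")
```

Ignoring subject, the estimated success probability is higher for males; once subject is in the model, it is slightly lower for males within each subject, which is the reversal that defines Simpson's paradox.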

Simpson's paradox

“Think very carefully about potential confounding factors before interpreting data. Failure to do so can lead to highly inappropriate conclusions about causal factors.”

- Ruxton & Colegrave

Statistical tests are not a panacea! The experimental design, including deciding which variables to collect, is critical to reach a good understanding of the phenomenon under study.

Research workflow

Data Process

1) Data Planning (Experimental Design)
- Pilot Studies (miniature version of steps 2-4 below)
2) Data Collection (Experiment/Field Study)
3) Data Cleaning/Curation (e.g., remove missing values and outliers) (Data skills)
4) Data Exploration & Analysis
- Data Validation (sanity checks, e.g., do the values make biological sense?)
- Data Munging/Wrangling (raw -> processed)
- Data Visualization
- Data Analysis (Statistics)
5) Data Dissemination (Data Communication)

Final remarks # 1

1) While it is always tempting to jump into an experiment as quickly as possible, time spent planning and designing an experiment at the outset will save time and money (not to mention possible embarrassment) in the long run.

2) While wasting time and energy on badly designed experiments is foolish, causing more human or animal suffering or more disturbance to an ecosystem than is absolutely necessary is inexcusable.

- Ruxton & Colegrave

Final remarks # 2

Myth #1 - It does not matter how you collect your data, there will always be a statistical ‘fix’ that will allow you to analyse it.

Myth #2 - If you collect lots of data, something interesting will come out, and you’ll be able to detect even very subtle effects.

- Ruxton & Colegrave