Alban Guillaumet, Troy University
“You can't fix by analysis what you bungled by design.”
- Light, Singer and Willett
There is desired and undesired information in data.
Goals:
Get accurate information by reducing bias due to confounding factors (do we have the right signal?)
Get precise information by reducing sampling error due to random variation (increase signal-to-noise ratio)
“It might be said that the two major goals of designing experiments are to minimize random variation and account for confounding factors.”
- Ruxton & Colegrave
Definition:
Random variation, also called between-individual variation, inter-individual variation, within-treatment variation, or noise, is the difference between measured values of the same variable taken from different experimental subjects.
It quantifies differences due to causes other than the ones we are interested in.
Good experiments minimize random variation, so that any variation due to the factors of interest can be detected more easily.
Definition: If we want to study the effect of variable A on variable B, but variable C also affects B, then C is a confounding factor.
Good experiments allow us to eliminate or control the effects of confounding factors.
Correlation does not imply causation: the number of violent crimes tends to increase when ice cream sales increase, most plausibly because both rise in hot weather.
1 - EFFECT ON BIAS
Confounding factors may bias the estimate of the relationship between measured explanatory and response variables, sometimes even reversing (!) the apparent effect of one on another.
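A minimal R sketch of such a reversal, using made-up deterministic data (the names `c_conf`, `x`, and `y` are illustrative assumptions, not part of the original example). The true effect of `x` on `y` is set to -1, but `x` is tied to a confounder that raises `y`, so the unadjusted slope comes out positive:

```r
# Hypothetical data: the confounder C is associated with x and also drives y.
c_conf <- rep(0:4, each = 3)                    # confounding factor C
x <- c_conf + rep(c(-0.5, 0, 0.5), times = 5)   # explanatory variable A, associated with C
y <- 3 * c_conf - 1 * x                         # response B: true effect of x is -1

naive    <- lm(y ~ x)            # ignores the confounder
adjusted <- lm(y ~ x + c_conf)   # accounts for the confounder

coef(naive)["x"]      # positive: the apparent effect has the wrong sign
coef(adjusted)["x"]   # -1: the effect built into the data is recovered
```

The data contain no noise, so the adjusted fit is exact; real data would only approximate the true effect.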
2 - EFFECT ON SAMPLING ERROR
Even if they do not induce any bias, unaccounted-for confounding factors may increase the noise in the data, leading to less precise estimates and less powerful tests.
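The loss of precision can be illustrated with a small simulation. This is only a sketch under stated assumptions: the sample size, effect sizes, seed, and the variable name `covariate` are all invented for illustration. The treatment is randomized, so ignoring the covariate does not bias the estimate, but it does inflate its standard error:

```r
set.seed(1)                                     # arbitrary seed, for reproducibility only
n <- 200
treatment <- sample(rep(0:1, each = n / 2))     # randomized, hence unrelated to the covariate
covariate <- rnorm(n)                           # hypothetical nuisance factor
y <- 1 * treatment + 3 * covariate + rnorm(n)   # simulated response, true treatment effect = 1

fit_ignore  <- lm(y ~ treatment)                # covariate left in the noise
fit_account <- lm(y ~ treatment + covariate)    # covariate accounted for

# Standard error of the estimated treatment effect under each model
se_ignore  <- summary(fit_ignore)$coefficients["treatment", "Std. Error"]
se_account <- summary(fit_account)$coefficients["treatment", "Std. Error"]
c(ignored = se_ignore, accounted = se_account)
```

The second standard error should be clearly smaller, which is what "more precise estimates and more powerful tests" means in practice.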
Discuss: Suppose we compare climbers with and without oxygen supplementation. Can you think of some potential confounding factors?
1) Preparedness, including:
presence of an experienced guide, amount of training, proper acclimatization, use of top-notch equipment, etc.
Not accounting for them may lead to an overestimate of the effect of oxygen supplementation.
2) Weather and other external factors, such as:
Storms, avalanches, etc.
Not accounting for them may lead to an underestimate of the effect of oxygen supplementation.
Unlike observational studies, properly designed experiments can identify the causes of the association between treatment and response variables.
Crucial advantage = random assignment of treatments to units (randomization).
Randomizing the treatment (oxygen supplementation) across units (individual climbers) breaks the association between confounding and explanatory variables, so that the causal relationship between the explanatory and response variables can be assessed.
An analogy: linked genes on a chromosome. They tend to be transmitted together.
Same here: oxygen supplementation, presence of an experienced guide, amount of training, proper acclimatization, use of top-notch equipment, etc. tend to be associated (money?)
Randomizing the treatment (oxygen supplementation) across units (individual climbers) would break the association between confounding and explanatory variables.
However, randomization does not eliminate the variation contributed by confounding factors, only their correlation with treatment. It ensures that variation from confounding factors is spread more evenly between the treatment groups, and so it creates no bias.
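In R, a completely randomized assignment could be sketched as follows. The number of climbers, the seed, and the column names are assumptions made for illustration:

```r
set.seed(42)                                    # arbitrary seed, for reproducibility only
n_climbers <- 20                                # hypothetical number of units
climbers <- data.frame(id = seq_len(n_climbers))

# Shuffle a balanced vector of labels so that chance alone decides who gets oxygen
climbers$treatment <- sample(rep(c("oxygen", "control"), each = n_climbers / 2))

table(climbers$treatment)                       # 10 climbers in each group
```

Because the labels are shuffled independently of anything known about the climbers, guide experience, training, or equipment cannot be systematically tied to the treatment.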
Blocking is an important strategy used to account for the sampling error due to confounding factors.
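A sketch of a randomized block assignment in R, assuming a hypothetical blocking factor (experience level) with four climbers per block. Treatments are randomized separately within each block, so every block contributes equally to both groups:

```r
set.seed(7)                                     # arbitrary seed, for reproducibility only
climbers <- data.frame(
  id    = 1:12,
  block = rep(c("novice", "intermediate", "expert"), each = 4)  # hypothetical blocks
)

# Within each block, shuffle a balanced vector of 0/1 labels
climbers$treatment <- ave(
  seq_along(climbers$block), climbers$block,
  FUN = function(i) sample(rep(0:1, length.out = length(i)))
)
climbers$treatment <- ifelse(climbers$treatment == 1, "oxygen", "control")

table(climbers$block, climbers$treatment)       # 2 oxygen and 2 control per block
```

The block would then be included as a term in the analysis (for instance `response ~ treatment + block`, where `response` is a placeholder name), so that between-block variation is removed from the noise.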
Definition: A
Two strategies are possible to limit the effect of confounding factors (bias and sampling error):
i) Matching
Frequent in epidemiological studies (case-control studies)
Every individual in the target group (e.g., with a disease) is paired with a corresponding healthy individual who has the same measurements for confounding variables such as age, weight, sex, and ethnic background.
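A minimal sketch of exact one-to-one matching in R. The two small data frames and their values are invented for illustration, and real studies usually rely on dedicated matching tools and less strict criteria (for example age bands rather than identical ages):

```r
# Hypothetical cases (with the disease) and candidate healthy controls
cases <- data.frame(
  id  = c("c1", "c2", "c3"),
  sex = c("F", "M", "F"),
  age = c(40, 55, 40)
)
controls <- data.frame(
  id  = c("h1", "h2", "h3", "h4", "h5"),
  sex = c("M", "F", "F", "M", "F"),
  age = c(55, 40, 62, 30, 40)
)

used     <- rep(FALSE, nrow(controls))          # tracks controls already paired
match_id <- rep(NA_character_, nrow(cases))     # NA means no match was found
for (i in seq_len(nrow(cases))) {
  # Unused controls with identical values of the confounding variables
  ok <- which(!used & controls$sex == cases$sex[i] & controls$age == cases$age[i])
  if (length(ok) > 0) {
    match_id[i] <- controls$id[ok[1]]
    used[ok[1]] <- TRUE
  }
}

data.frame(case = cases$id, control = match_id) # pairs: c1-h2, c2-h1, c3-h5
```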
ii) Adjusting
Statistical methods such as generalized linear models (GLMs) are used to correct for differences between treatment and control groups in suspected confounding variables.
Importance of literature research and pilot studies!
Fictional (but based on a real case) example of sex-biased admission rate in a US university
Call:
glm(formula = success ~ sex, family = binomial, data = d)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0769  -1.0769  -0.9281   1.2814   1.4492

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.61940    0.03190 -19.419   <2e-16 ***
sexM         0.37821    0.03871   9.771   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 17271  on 12761  degrees of freedom
Residual deviance: 17175  on 12760  degrees of freedom
AIC: 17179

Number of Fisher Scoring iterations: 4
Call:
glm(formula = success ~ subject + sex, family = binomial, data = d)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.2148  -1.1769  -0.8457   1.1779   1.5504

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.08753    0.04855   1.803   0.0714 .
subjectB    -0.84322    0.04361 -19.335   <2e-16 ***
sexM        -0.08866    0.04637  -1.912   0.0559 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 17271  on 12761  degrees of freedom
Residual deviance: 16787  on 12759  degrees of freedom
AIC: 16793

Number of Fisher Scoring iterations: 4
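The sign change of `sexM` between the two fits can be reproduced on a toy data set. The counts below are invented for illustration and are NOT the data behind the output above. They are built so that women are admitted at a slightly higher rate within each subject, but apply mostly to the subject with the lower admission rate:

```r
# Hypothetical aggregated counts (assumption: two subjects, A and B)
d_counts <- data.frame(
  subject  = c("A", "A", "B", "B"),
  sex      = c("F", "M", "F", "M"),
  admitted = c(130, 480, 240,  50),
  rejected = c( 70, 320, 560, 150)
)

# Sex only: the subject applied to is ignored
m_sex  <- glm(cbind(admitted, rejected) ~ sex,
              family = binomial, data = d_counts)
# Sex adjusted for subject
m_both <- glm(cbind(admitted, rejected) ~ subject + sex,
              family = binomial, data = d_counts)

coef(m_sex)["sexM"]    # positive: men appear favoured overall
coef(m_both)["sexM"]   # negative: within subjects, men are not favoured
```

Here the subject acts as the confounding factor: it is associated with sex (through application choices) and it affects the admission rate.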
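Predicted admission probabilities from the second model (inverse logit of the linear predictor):
Male applicant, subject B: linear predictor = 0.08753 - 0.84322 - 0.08866 = -0.844346 (computed from the unrounded coefficients), probability = 0.3006194
Applicant of the reference sex (female) in the reference subject (intercept only): probability = 0.5218685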
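The values above can be recovered from the coefficients of the second model with the inverse logit function, `plogis()` in base R. The sketch below uses the rounded estimates printed in the summary, so results agree with the values above only to about four decimals. Reading the reference levels as "female" and "subject A" is an assumption based on the coefficient names `sexM` and `subjectB`:

```r
# Rounded estimates copied from the summary of success ~ subject + sex
b0         <-  0.08753   # intercept (assumed: female applicant, subject A)
b_subjectB <- -0.84322
b_sexM     <- -0.08866

eta_male_B   <- b0 + b_subjectB + b_sexM   # linear predictor, about -0.8444
p_male_B     <- plogis(eta_male_B)         # about 0.3006
p_female_ref <- plogis(b0)                 # about 0.5219

# Odds ratios (male vs female) before and after adjusting for subject
or_unadjusted <- exp(0.37821)              # above 1: men seem favoured
or_adjusted   <- exp(b_sexM)               # below 1: the apparent advantage disappears
```

Working on the odds-ratio scale makes the reversal easy to state: the unadjusted ratio is above 1, the adjusted one slightly below 1.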
“Think very carefully about potential confounding factors before interpreting data. Failure to do so can lead to highly inappropriate conclusions about causal factors.”
- Ruxton & Colegrave
Statistical tests are not a panacea! The experimental design, including deciding which variables to collect, is critical to reach a good understanding of the phenomenon under study.
1) While it is always tempting to jump into an experiment as quickly as possible, time spent planning and designing an experiment at the outset will save time and money (not to mention possible embarrassment) in the long run.
2) While wasting time and energy on badly designed experiments is foolish, causing more human or animal suffering or more disturbance to an ecosystem than is absolutely necessary is inexcusable.
- Ruxton & Colegrave
Myth #1 - It does not matter how you collect your data, there will always be a statistical ‘fix’ that will allow you to analyse it.
Myth #2 - If you collect lots of data, something interesting will come out, and you’ll be able to detect even very subtle effects.
- Ruxton & Colegrave