Why correct for multiple tests?
Suppose you are testing \(m\) hypotheses, each of the form: a parameter \(\beta\) equals zero versus the alternative that it does not equal zero. These are the possible outcomes:
|                        | \(\beta=0\) | \(\beta\neq0\) | Hypotheses |
|------------------------|-------------|----------------|------------|
| Claim \(\beta=0\)      | \(U\)       | \(T\)          | \(m-R\)    |
| Claim \(\beta\neq 0\)  | \(V\)       | \(S\)          | \(R\)      |
| Claims                 | \(m_0\)     | \(m-m_0\)      | \(m\)      |
Type I error or false positive (\(V\)) Say that the parameter does not equal zero when it does
Type II error or false negative (\(T\)) Say that the parameter equals zero when it doesn’t
False positive rate - The rate at which true nulls (\(\beta = 0\)) are called significant: \(E\left[\frac{V}{m_0}\right]\)
Family wise error rate (FWER) - The probability of at least one false positive \({\rm Pr}(V \geq 1)\)
False discovery rate (FDR) - The rate at which claims of significance are false \(E\left[\frac{V}{R}\right]\)
If P-values are correctly calculated, calling all \(P < \alpha\) significant will control the false positive rate at level \(\alpha\) on average.

Suppose that you perform 10,000 tests with \(\beta=0\) for all of them, and you call all \(P < 0.05\) significant. The expected number of false positives is \(10,000 \times 0.05 = 500\).
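As a quick check of that arithmetic in R (the numbers are the ones from the example above):

```r
# Expected number of false positives: m tests, all with beta = 0, cutoff alpha
m <- 10000
alpha <- 0.05
m * alpha
## [1] 500
```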
How do we avoid so many false positives?
The Bonferroni correction is the oldest multiple testing correction.
Basic idea:

* Suppose you do \(m\) tests
* You want to control FWER at level \(\alpha\) so \({\rm Pr}(V \geq 1) < \alpha\)
* Calculate P-values normally
* Set \(\alpha_{fwer} = \alpha/m\)
* Call all \(P\)-values less than \(\alpha_{fwer}\) significant
Pros: Easy to calculate, conservative

Cons: May be very conservative
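A minimal sketch of the rule in R; the vector `p` below is illustrative, not drawn from the simulations later in this section:

```r
# Bonferroni: compare each P-value to alpha / m
alpha <- 0.05
p <- c(1e-6, 0.0004, 0.03, 0.2)
alphaFwer <- alpha / length(p)  # 0.0125 here
which(p < alphaFwer)
## [1] 1 2
```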
Controlling the false discovery rate (FDR) is the most popular correction when performing many tests, say in genomics, imaging, astronomy, or other signal-processing disciplines.
Basic idea:

* Suppose you do \(m\) tests
* You want to control FDR at level \(\alpha\) so \(E\left[\frac{V}{R}\right] \leq \alpha\)
* Calculate P-values normally
* Order the P-values from smallest to largest \(P_{(1)},...,P_{(m)}\)
* Call any \(P_{(i)} \leq \alpha \times \frac{i}{m}\) significant
Pros: Still pretty easy to calculate, less conservative (maybe much less)
Cons: Allows for more false positives, may behave strangely under dependence
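A minimal sketch of the procedure in R, using the step-up convention implemented by `p.adjust(..., method = "BH")`: find the largest \(i\) with \(P_{(i)} \leq \alpha \times \frac{i}{m}\) and reject every P-value at or below \(P_{(i)}\). The function name `bhReject` and the vector `p` are illustrative:

```r
# BH step-up rule for a vector of raw P-values
bhReject <- function(p, alpha = 0.05) {
  m <- length(p)
  ord <- order(p)                                   # indices of sorted P-values
  below <- which(p[ord] <= alpha * seq_len(m) / m)  # i with P_(i) <= alpha * i / m
  reject <- rep(FALSE, m)
  if (length(below) > 0) reject[ord[seq_len(max(below))]] <- TRUE
  reject
}
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9)
sum(bhReject(p))   # same count as sum(p.adjust(p, method = "BH") <= 0.05)
## [1] 2
```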
Example:

* Suppose P-values are \(P_1,\ldots,P_m\)
* You could adjust them by taking \(P_i^{fwer} = \min(m \times P_i, 1)\) for each P-value.
* Then if you call all \(P_i^{fwer} < \alpha\) significant you will control the FWER.
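In R this manual adjustment matches the built-in `p.adjust`; a small check with an illustrative vector `p`:

```r
p <- c(1e-6, 0.0004, 0.03, 0.2)
pFwer <- pmin(length(p) * p, 1)   # min(m * P_i, 1)
all.equal(pFwer, p.adjust(p, method = "bonferroni"))
## [1] TRUE
```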
```r
set.seed(1010093)
# Simulate 1000 regressions where x and y are unrelated (beta = 0 for every test)
pValues <- rep(NA,1000)
for(i in 1:1000){
  y <- rnorm(20)
  x <- rnorm(20)
  # P-value for the slope coefficient on x
  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}
# Controls false positive rate
sum(pValues < 0.05)
## [1] 51
# Controls FWER
sum(p.adjust(pValues,method="bonferroni") < 0.05)
## [1] 0
# Controls FDR
sum(p.adjust(pValues,method="BH") < 0.05)
## [1] 0
```
```r
set.seed(1010093)
pValues <- rep(NA,1000)
for(i in 1:1000){
  x <- rnorm(20)
  # First 500 tests have beta = 0, last 500 have beta = 2
  if(i <= 500){y <- rnorm(20)} else {y <- rnorm(20,mean=2*x)}
  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}
trueStatus <- rep(c("zero","not zero"),each=500)
table(pValues < 0.05, trueStatus)
## trueStatus
## not zero zero
## FALSE 0 476
## TRUE 500 24
# All 500 true relationships discovered
# 24 of the 500 null cases (4.8%) are false positives
# Controls FWER
table(p.adjust(pValues,method="bonferroni") < 0.05,trueStatus)
## trueStatus
## not zero zero
## FALSE 23 500
## TRUE 477 0
# Controls FDR
table(p.adjust(pValues,method="BH") < 0.05,trueStatus)
## trueStatus
## not zero zero
## FALSE 0 487
## TRUE 500 13
```
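To connect these tables back to the FDR definition, here is a small sketch computing the realized false discovery proportion of each procedure; it assumes `pValues` and `trueStatus` from the simulation above, and the counts in the comments come straight from the tables:

```r
# Realized false discovery proportion: false positives / total discoveries
fdp <- function(called) sum(called & trueStatus == "zero") / max(sum(called), 1)
fdp(pValues < 0.05)                                   # 24/524, about 0.046
fdp(p.adjust(pValues, method = "bonferroni") < 0.05)  # 0/477, exactly 0
fdp(p.adjust(pValues, method = "BH") < 0.05)          # 13/513, about 0.025
```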
P-values versus adjusted P-values
```r
par(mfrow=c(1,2))
plot(pValues,p.adjust(pValues,method="bonferroni"),pch=19)
plot(pValues,p.adjust(pValues,method="BH"),pch=19)
```
Notes:

* Multiple testing is an entire subfield
* A basic Bonferroni/BH correction is usually enough
* If there is strong dependence between tests there may be problems
* Consider method="BY" (a minimal sketch follows below)
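For the dependent case, `p.adjust` also implements the Benjamini-Yekutieli correction, which controls FDR under arbitrary dependence at the price of being more conservative; a minimal sketch using `pValues` from the simulations above:

```r
# BY adjustment: FDR control that remains valid under arbitrary dependence
sum(p.adjust(pValues, method = "BY") < 0.05)
```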
Further resources:

* Multiple testing procedures with applications to genomics
* Statistical significance for genome-wide studies
* Introduction to multiple testing