The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions: Are there more male than female births? Is the rate of insanity rising?
The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment. The questions dealt with still tended to be simple: Is treatment A better than treatment B?
The era of scientific mass production, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. Which variables matter among the thousands measured? How do you relate unrelated information?
http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf



Suppose you are testing \( m \) hypotheses, each of the form that a parameter \( \beta \) equals zero versus the alternative that it does not equal zero. The possible outcomes across all \( m \) tests are shown in the table below.
|  | \( \beta=0 \) | \( \beta\neq0 \) | Hypotheses |
|---|---|---|---|
| Claim \( \beta=0 \) | \( U \) | \( T \) | \( m-R \) |
| Claim \( \beta\neq 0 \) | \( V \) | \( S \) | \( R \) |
| Claims | \( m_0 \) | \( m-m_0 \) | \( m \) |
Type I error or false positive (\( V \)) - Say that the parameter does not equal zero when it does
Type II error or false negative (\( T \)) - Say that the parameter equals zero when it doesn't
False positive rate - The rate at which truly null results (\( \beta = 0 \)) are called significant: \( E\left[\frac{V}{m_0}\right] \)
Family-wise error rate (FWER) - The probability of at least one false positive: \( {\rm Pr}(V \geq 1) \)
False discovery rate (FDR) - The rate at which claims of significance are false: \( E\left[\frac{V}{R}\right] \)
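As a concrete (hypothetical) illustration of these quantities, the sketch below tallies \( V \), \( R \), and \( m_0 \) from made-up truth and significance calls and computes the realized version of each error rate.
```r
# Hypothetical truth and claims for six tests (illustration only)
trulyNull     <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)   # hypotheses where beta = 0 is true
claimedSignif <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)  # tests where we claim beta != 0

V  <- sum(trulyNull & claimedSignif)   # false positives
R  <- sum(claimedSignif)               # total claims of significance
m0 <- sum(trulyNull)                   # truly null hypotheses

V / m0   # realized false positive proportion (its expectation is the false positive rate)
V >= 1   # at least one false positive? (FWER is the probability of this event)
V / R    # realized false discovery proportion (its expectation is the FDR; define as 0 if R = 0)
```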
If P-values are correctly calculated, calling all \( P < \alpha \) significant will control the false positive rate at level \( \alpha \) on average.
Suppose that you perform 10,000 tests, all with \( \beta = 0 \), and call all \( P < 0.05 \) significant.
The expected number of false positives is \( 10,000 \times 0.05 = 500 \).
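As a quick check of that arithmetic, here is a minimal simulation sketch (the seed is arbitrary): under the null hypothesis P-values are uniform on [0, 1], so about 5% of 10,000 null tests fall below 0.05.
```r
# 10,000 tests with beta = 0 for all of them: null P-values are uniform on [0, 1]
set.seed(42)
nullPValues <- runif(10000)
sum(nullPValues < 0.05)   # roughly 500, and every one of them is a false positive
```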
How do we avoid so many false positives?
The Bonferroni correction is the oldest multiple testing correction.
Basic idea: to control the FWER at level \( \alpha \) across \( m \) tests, calculate the P-values normally, set \( \alpha_{fwer} = \alpha / m \), and call all P-values less than \( \alpha_{fwer} \) significant (a short R sketch follows the pros and cons below).
Pros: Easy to calculate, conservative.
Cons: May be very conservative.
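A minimal sketch of the Bonferroni cutoff, using made-up P-values rather than the case studies below; p.adjust() with method = "bonferroni" flags the same tests via adjusted P-values.
```r
# Made-up P-values from 100 null tests (illustration only)
set.seed(1)
pVals <- runif(100)
alpha <- 0.05
m     <- length(pVals)

sum(pVals < alpha / m)                               # manual Bonferroni cutoff: alpha / m
sum(p.adjust(pVals, method = "bonferroni") < alpha)  # same calls via adjusted P-values
```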
Controlling the false discovery rate (FDR) is the most popular correction when performing lots of tests, say in genomics, imaging, astronomy, or other signal-processing disciplines.
Basic idea (the Benjamini-Hochberg procedure): to control the FDR at level \( \alpha \), calculate the P-values normally, order them from smallest to largest as \( P_{(1)}, \ldots, P_{(m)} \), find the largest \( i \) such that \( P_{(i)} \leq \alpha \times \frac{i}{m} \), and call the tests with the \( i \) smallest P-values significant (see the sketch after the pros and cons below).
Pros: Still pretty easy to calculate, less conservative (maybe much less)
Cons: Allows for more false positives, may behave strangely under dependence
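A minimal sketch of the Benjamini-Hochberg step-up rule described above, again with made-up P-values; p.adjust() with method = "BH" (used in the case studies below) rejects the same set of tests.
```r
# Made-up P-values: 90 null tests plus 10 tests with very small P-values (illustration only)
set.seed(2)
pVals <- c(runif(90), rbeta(10, 1, 1000))
alpha <- 0.05
m     <- length(pVals)

sortedP <- sort(pVals)
passes  <- which(sortedP <= alpha * (1:m) / m)        # compare P_(i) to alpha * i / m
nSignif <- if (length(passes) > 0) max(passes) else 0
nSignif                                               # how many tests the BH rule calls significant
sum(p.adjust(pVals, method = "BH") <= alpha)          # same count via adjusted P-values
```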

Controlling all error rates at \( \alpha = 0.20 \)
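As a hypothetical illustration, the sketch below plots ten made-up P-values against the no-correction, Bonferroni, and BH cutoffs, each controlling its error rate at \( \alpha = 0.20 \).
```r
# Ten made-up P-values: a few small ones mixed with nulls (illustration only)
set.seed(3)
pVals <- sort(c(rbeta(4, 1, 40), runif(6)))
alpha <- 0.20
m     <- length(pVals)

plot(1:m, pVals, pch = 19, xlab = "rank", ylab = "P-value")
abline(h = alpha, lty = 1)              # no correction: call P < alpha significant
abline(h = alpha / m, lty = 2)          # Bonferroni: call P < alpha / m significant
abline(a = 0, b = alpha / m, lty = 3)   # BH: call P_(i) significant while P_(i) <= alpha * i / m
```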
Example: no true positives (in all 1,000 simulated regressions the true slope is zero).
```r
set.seed(1010093)
pValues <- rep(NA, 1000)
for (i in 1:1000) {
  # x and y are simulated independently, so the true slope is zero for every test
  y <- rnorm(20)
  x <- rnorm(20)
  # Store the P-value for the slope coefficient
  pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]
}
```
```r
# Controls false positive rate
sum(pValues < 0.05)
## [1] 51

# Controls FWER
sum(p.adjust(pValues, method = "bonferroni") < 0.05)
## [1] 0

# Controls FDR
sum(p.adjust(pValues, method = "BH") < 0.05)
## [1] 0
```
With no true positives, the 51 raw rejections (about 5% of the 1,000 tests) are all false positives, while both the Bonferroni and BH corrections call nothing significant.
Now suppose half of the regressions have a real effect: the first 500 have \( \beta = 0 \) and the last 500 have \( \beta = 2 \).
```r
set.seed(1010093)
pValues <- rep(NA, 1000)
for (i in 1:1000) {
  x <- rnorm(20)
  # First 500 have beta = 0, last 500 have beta = 2
  if (i <= 500) { y <- rnorm(20) } else { y <- rnorm(20, mean = 2 * x) }
  # Store the P-value for the slope coefficient
  pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]
}
```
```r
trueStatus <- rep(c("zero", "not zero"), each = 500)
table(pValues < 0.05, trueStatus)
##        trueStatus
##         not zero zero
##   FALSE        0  476
##   TRUE       500   24
```
```r
# Controls FWER
table(p.adjust(pValues, method = "bonferroni") < 0.05, trueStatus)
##        trueStatus
##         not zero zero
##   FALSE       23  500
##   TRUE       477    0
```
```r
# Controls FDR
table(p.adjust(pValues, method = "BH") < 0.05, trueStatus)
##        trueStatus
##         not zero zero
##   FALSE        0  487
##   TRUE       500   13
```
Here the uncorrected cutoff finds all 500 true effects but makes 24 false discoveries, Bonferroni makes no false discoveries but misses 23 true effects, and BH finds all 500 true effects with 13 false discoveries (an empirical false discovery proportion of \( 13/513 \approx 2.5\% \)).
P-values versus adjusted P-values
```r
par(mfrow = c(1, 2))
plot(pValues, p.adjust(pValues, method = "bonferroni"), pch = 19)
plot(pValues, p.adjust(pValues, method = "BH"), pch = 19)
```
The Bonferroni adjustment multiplies each P-value by the number of tests (capped at 1), so most adjusted values pile up at 1, while the BH adjustment inflates the P-values much more gently.
Notes:
Further resources: