Why correct for multiple tests?
Suppose you are testing \(m\) hypotheses, each of the form: a parameter \(\beta\) equals zero versus the alternative that it does not equal zero. These are the possible outcomes:
|                        | \(\beta=0\) | \(\beta\neq0\) | Hypotheses |
|------------------------|-------------|----------------|------------|
| Claim \(\beta=0\)      | \(U\)       | \(T\)          | \(m-R\)    |
| Claim \(\beta\neq 0\)  | \(V\)       | \(S\)          | \(R\)      |
| Claims                 | \(m_0\)     | \(m-m_0\)      | \(m\)      |
Type I error or false positive (\(V\)) Say that the parameter does not equal zero when it does
Type II error or false negative (\(T\)) Say that the parameter equals zero when it doesn’t
False positive rate - The rate at which true nulls (\(\beta = 0\)) are called significant: \(E\left[\frac{V}{m_0}\right]\)
Family wise error rate (FWER) - The probability of at least one false positive \({\rm Pr}(V \geq 1)\)
False discovery rate (FDR) - The rate at which claims of significance are false \(E\left[\frac{V}{R}\right]\)
If P-values are correctly calculated, calling all \(P < \alpha\) significant will control the false positive rate at level \(\alpha\) on average.

Suppose that you perform 10,000 tests with \(\beta=0\) for all of them, and you call all \(P < 0.05\) significant. The expected number of false positives is \(10,000 \times 0.05 = 500\).
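As a quick check of that arithmetic in R (the numbers are the ones from the example above):

```r
# Expected number of false positives: m tests, all with beta = 0, cutoff alpha
m <- 10000
alpha <- 0.05
m * alpha
## [1] 500
```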
How do we avoid so many false positives?
The Bonferroni correction is the oldest multiple testing correction.
Basic idea:

* Suppose you do \(m\) tests
* You want to control FWER at level \(\alpha\) so \({\rm Pr}(V \geq 1) < \alpha\)
* Calculate P-values normally
* Set \(\alpha_{fwer} = \alpha/m\)
* Call all \(P\)-values less than \(\alpha_{fwer}\) significant
Pros: Easy to calculate, conservative

Cons: May be very conservative
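A minimal sketch of the rule in R; the vector `p` below is illustrative, not drawn from the simulations later in this section:

```r
# Bonferroni: compare each P-value to alpha / m
alpha <- 0.05
p <- c(1e-6, 0.0004, 0.03, 0.2)
alphaFwer <- alpha / length(p)  # 0.0125 here
which(p < alphaFwer)
## [1] 1 2
```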
Controlling the false discovery rate (FDR) is the most popular correction when performing many tests, say in genomics, imaging, astronomy, or other signal-processing disciplines.
Basic idea:

* Suppose you do \(m\) tests
* You want to control FDR at level \(\alpha\) so \(E\left[\frac{V}{R}\right] \leq \alpha\)
* Calculate P-values normally
* Order the P-values from smallest to largest \(P_{(1)},...,P_{(m)}\)
* Call any \(P_{(i)} \leq \alpha \times \frac{i}{m}\) significant
Pros: Still pretty easy to calculate, less conservative (maybe much less)
Cons: Allows for more false positives, may behave strangely under dependence
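A minimal sketch of the procedure in R, using the step-up convention implemented by `p.adjust(..., method = "BH")`: find the largest \(i\) with \(P_{(i)} \leq \alpha \times \frac{i}{m}\) and reject every P-value at or below \(P_{(i)}\). The function name `bhReject` and the vector `p` are illustrative:

```r
# BH step-up rule for a vector of raw P-values
bhReject <- function(p, alpha = 0.05) {
  m <- length(p)
  ord <- order(p)                                   # indices of sorted P-values
  below <- which(p[ord] <= alpha * seq_len(m) / m)  # i with P_(i) <= alpha * i / m
  reject <- rep(FALSE, m)
  if (length(below) > 0) reject[ord[seq_len(max(below))]] <- TRUE
  reject
}
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9)
sum(bhReject(p))   # same count as sum(p.adjust(p, method = "BH") <= 0.05)
## [1] 2
```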
Example:

* Suppose P-values are \(P_1,\ldots,P_m\)
* You could adjust them by taking \(P_i^{fwer} = \min(m \times P_i, 1)\) for each P-value.
* Then if you call all \(P_i^{fwer} < \alpha\) significant you will control the FWER.
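In R this manual adjustment matches the built-in `p.adjust`; a small check with an illustrative vector `p`:

```r
p <- c(1e-6, 0.0004, 0.03, 0.2)
pFwer <- pmin(length(p) * p, 1)   # min(m * P_i, 1)
all.equal(pFwer, p.adjust(p, method = "bonferroni"))
## [1] TRUE
```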
```r
set.seed(1010093)
# Simulate 1000 regressions where x and y are unrelated (beta = 0 for every test)
pValues <- rep(NA,1000)
for(i in 1:1000){
  y <- rnorm(20)
  x <- rnorm(20)
  # P-value for the slope coefficient on x
  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}
# Controls false positive rate
sum(pValues < 0.05)
## [1] 51
# Controls FWER
sum(p.adjust(pValues,method="bonferroni") < 0.05)
## [1] 0
# Controls FDR
sum(p.adjust(pValues,method="BH") < 0.05)
## [1] 0
```
```r
set.seed(1010093)
pValues <- rep(NA,1000)
for(i in 1:1000){
  x <- rnorm(20)
  # First 500 tests have beta = 0, last 500 have beta = 2
  if(i <= 500){y <- rnorm(20)} else {y <- rnorm(20,mean=2*x)}
  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}
trueStatus <- rep(c("zero","not zero"),each=500)
table(pValues < 0.05, trueStatus)
## trueStatus
## not zero zero
## FALSE 0 476
## TRUE 500 24
# All 500 true relationships discovered
# 24 of the 500 null cases (4.8%) are false positives
# Controls FWER
table(p.adjust(pValues,method="bonferroni") < 0.05,trueStatus)
## trueStatus
## not zero zero
## FALSE 23 500
## TRUE 477 0
# Controls FDR
table(p.adjust(pValues,method="BH") < 0.05,trueStatus)
## trueStatus
## not zero zero
## FALSE 0 487
## TRUE 500 13
```
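To connect these tables back to the FDR definition, here is a small sketch computing the realized false discovery proportion of each procedure; it assumes `pValues` and `trueStatus` from the simulation above, and the counts in the comments come straight from the tables:

```r
# Realized false discovery proportion: false positives / total discoveries
fdp <- function(called) sum(called & trueStatus == "zero") / max(sum(called), 1)
fdp(pValues < 0.05)                                   # 24/524, about 0.046
fdp(p.adjust(pValues, method = "bonferroni") < 0.05)  # 0/477, exactly 0
fdp(p.adjust(pValues, method = "BH") < 0.05)          # 13/513, about 0.025
```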
P-values versus adjusted P-values
```r
par(mfrow=c(1,2))
plot(pValues,p.adjust(pValues,method="bonferroni"),pch=19)
plot(pValues,p.adjust(pValues,method="BH"),pch=19)
```
Notes:

* Multiple testing is an entire subfield
* A basic Bonferroni/BH correction is usually enough
* If there is strong dependence between tests there may be problems
* Consider method="BY" (a minimal sketch follows below)
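For the dependent case, `p.adjust` also implements the Benjamini-Yekutieli correction, which controls FDR under arbitrary dependence at the price of being more conservative; a minimal sketch using `pValues` from the simulations above:

```r
# BY adjustment: FDR control that remains valid under arbitrary dependence
sum(p.adjust(pValues, method = "BY") < 0.05)
```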
Further resources:

* Multiple testing procedures with applications to genomics
* Statistical significance for genome-wide studies
* Introduction to multiple testing