Created on 14 Aug 2013
Revised on Thu Aug 15 15:17:31 2013
Key ideas
· Hypothesis testing/significance analysis is commonly overused
· Correcting for multiple testing avoids false positives (false discoveries)
· Two key components
- Error measure
- Correction
Types of errors
· Type I error or false positive - calling a truly null result significant
· Type II error or false negative - failing to call a truly non-null result significant
Error rates
- False positive rate - the rate at which truly null hypotheses are called significant
- Family-wise error rate (FWER) - the probability of at least one false positive among all the tests
- False discovery rate (FDR) - the expected proportion of significant calls that are actually false positives
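A toy numeric example may help fix the definitions. The counts below are hypothetical (made up purely for illustration): out of 1000 tests, 900 are truly null and 100 truly non-null. FWER, being the probability of at least one false positive, is a property of the whole testing procedure rather than of a single table, so only the two rates are computed here:
FP <- 50; TN <- 850  # truly null: 50 called significant, 850 not
TP <- 80; FN <- 20   # truly non-null: 80 called significant, 20 missed
FP / (FP + TN)       # false positive rate: 50/900, about 0.056
FP / (FP + TP)       # false discovery rate: 50/130, about 0.385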
1.1 Controlling family-wise error rate (FWER)
The Bonferroni correction is the oldest multiple testing correction.
Basic idea:
- Suppose you do m tests
- You want to control FWER at level α
- Calculate P-values normally
- Set α′ = α/m
- Call all P-values less than α′ significant
Pros: Easy to calculate, conservative
Cons: May be very conservative
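A minimal sketch of the rule in R, assuming a vector pValues of raw P-values; p.adjust(pValues, method = "bonferroni"), used in the case studies below, implements the same correction up to capping the adjusted values at 1:
m <- length(pValues)
alpha <- 0.05
sum(pValues < alpha / m)           # reject tests with P < alpha/m
sum(pmin(m * pValues, 1) < alpha)  # equivalent: adjust the P-values, then compare to alpha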
1.2 Controlling false discovery rate (FDR)
This is the most popular correction when performing many tests, say in genomics, imaging,
astronomy, or other signal-processing disciplines.
Basic idea:
- Suppose you do m tests
- You want to control FDR at level α
- Calculate P-values normally
- Order the P-values from smallest to largest: P(1) ≤ P(2) ≤ … ≤ P(m)
- Call any P(i) less than α · i/m significant
Pros: Still pretty easy to calculate, less conservative (maybe much less)
Cons: Allows for more false positives, may behave strangely under dependence
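A sketch of the step-up rule by hand, again assuming a vector pValues of raw P-values; p.adjust(pValues, method = "BH") implements the same procedure:
m <- length(pValues)
alpha <- 0.05
pSorted <- sort(pValues)
below <- pSorted <= alpha * (1:m) / m          # compare each P(i) to alpha*i/m
k <- if (any(below)) max(which(below)) else 0  # largest i meeting the bound
nSig <- if (k > 0) sum(pValues <= pSorted[k]) else 0
nSig                                           # number of discoveries: the k smallest P-values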
1.3 Case study I: no true positives
set.seed(1010093)
pValues <- rep(NA, 1000)
for (i in 1:1000) {
    y <- rnorm(20)
    x <- rnorm(20)  # x and y are independent, so every null hypothesis is true
    pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]  # P-value for the slope
}
sum(pValues < 0.05) # Controls false positive rate
## [1] 51
sum(p.adjust(pValues, method = "bonferroni") < 0.05) # Controls FWER
## [1] 0
sum(p.adjust(pValues, method = "BH") < 0.05) # Controls FDR
## [1] 0
1.4 Case study II: 50% true positives
set.seed(1010093)
pValues <- rep(NA, 1000)
for (i in 1:1000) {
    x <- rnorm(20)
    # First 500 beta=0, last 500 beta=2
    if (i <= 500) {
        y <- rnorm(20)
    } else {
        y <- rnorm(20, mean = 2 * x)
    }
    pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]
}
trueStatus <- rep(c("zero", "not zero"), each = 500)
table(pValues < 0.05, trueStatus)
## trueStatus
## not zero zero
## FALSE 0 476
## TRUE 500 24
table(p.adjust(pValues, method = "bonferroni") < 0.05, trueStatus) # Controls FWER
## trueStatus
## not zero zero
## FALSE 23 500
## TRUE 477 0
table(p.adjust(pValues, method = "BH") < 0.05, trueStatus) # Controls FDR
## trueStatus
## not zero zero
## FALSE 0 487
## TRUE 500 13
P-values versus adjusted P-values
oldpar = par(mfrow = c(1, 2))
plot(pValues, p.adjust(pValues, method = "bonferroni"), pch = 19)
plot(pValues, p.adjust(pValues, method = "BH"), pch = 19)
par(oldpar)
Basic ideas
· Way back in the first week we talked about simulating data from distributions in R using the rfoo functions (rnorm, rbinom, and so on).
· In general, simulations are far more flexible and useful than that:
- For bootstrapping as we saw in week 7
- For evaluating models
- For testing different hypotheses
- For sensitivity analysis
· At a minimum it is useful to simulate
- A best-case scenario
- A few examples where you know your approach won't work
- The extremes, since behavior at extreme inputs is often the most revealing
2.1 Simulating data from a model
Suppose that you have a regression model
Yi = b0 + b1 * Xi + ei
Here is an example of generating data from this model where Xi and ei are normal:
set.seed(44333)
x <- rnorm(50)
e <- rnorm(50)
b0 <- 1
b1 <- 2
y <- b0 + b1 * x + e
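As a quick sanity check (not part of the original notes), fitting the model back to the simulated data should recover estimates near b0 = 1 and b1 = 2:
summary(lm(y ~ x))$coeff  # intercept and slope estimates should be near 1 and 2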
Violating assumptions
set.seed(44333)
x <- rnorm(50)
e <- rnorm(50)
e2 <- rcauchy(50)  # heavy-tailed Cauchy errors violate the normality assumption
b0 <- 1
b1 <- 2
y <- b0 + b1 * x + e
y2 <- b0 + b1 * x + e2
oldpar = par(mfrow = c(1, 2))
plot(lm(y ~ x)$fitted, lm(y ~ x)$residuals, pch = 19, xlab = "fitted", ylab = "residuals")
plot(lm(y2 ~ x)$fitted, lm(y2 ~ x)$residuals, pch = 19, xlab = "fitted", ylab = "residuals")
par(oldpar)
Repeated simulations
set.seed(44333)
betaNorm <- betaCauch <- rep(NA, 1000)
for (i in 1:1000) {
    x <- rnorm(50)
    e <- rnorm(50)
    e2 <- rcauchy(50)
    b0 <- 1
    b1 <- 2
    y <- b0 + b1 * x + e
    y2 <- b0 + b1 * x + e2
    betaNorm[i] <- lm(y ~ x)$coeff[2]
    betaCauch[i] <- lm(y2 ~ x)$coeff[2]
}
quantile(betaNorm)
## 0% 25% 50% 75% 100%
## 1.500 1.906 2.013 2.100 2.596
quantile(betaCauch)
## 0% 25% 50% 75% 100%
## -278.352 1.130 1.965 2.804 272.391
Monte Carlo Error
boxplot(betaNorm, betaCauch, col = "blue", ylim = c(-5, 5))  # ylim clips the extreme Cauchy slope estimates
2.2 Simulation based on a data set
library(UsingR)
## Loading required package: MASS
data(galton)
nobs <- dim(galton)[1]
oldpar = par(mfrow = c(1, 2))
hist(galton$child, col = "blue", breaks = 100)
hist(galton$parent, col = "blue", breaks = 100)
par(oldpar)
Calculating means, variances
lm1 <- lm(galton$child ~ galton$parent)
parent0 <- rnorm(nobs, sd = sd(galton$parent), mean = mean(galton$parent))  # simulate parents from a normal with the observed mean and sd
child0 <- lm1$coeff[1] + lm1$coeff[2] * parent0 + rnorm(nobs, sd = summary(lm1)$sigma)  # simulate children from the fitted line plus residual noise
oldpar = par(mfrow = c(1, 2))
plot(galton$parent, galton$child, pch = 19)
plot(parent0, child0, pch = 19, col = "blue")
par(oldpar)
2.3 Simulating more complicated scenarios
library(bootstrap)
data(stamp)
nobs <- dim(stamp)[1]
hist(stamp$Thickness, col = "grey", breaks = 100, freq = FALSE)
dens <- density(stamp$Thickness)
lines(dens, col = "blue", lwd = 3)
2.4 A simulation that is too simple
plot(density(stamp$Thickness), col = "black", lwd = 3)
for (i in 1:10) {
    newThick <- rnorm(nobs, mean = mean(stamp$Thickness), sd = sd(stamp$Thickness))  # a single normal misses the multimodal shape
    lines(density(newThick), col = "grey", lwd = 3)
}
2.5 Simulating from the density estimate
plot(density(stamp$Thickness), col = "black", lwd = 3)
for (i in 1:10) {
    newThick <- rnorm(nobs, mean = stamp$Thickness, sd = dens$bw)  # one kernel draw centered at each observation
    lines(density(newThick), col = "grey", lwd = 3)
}
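Note this draws exactly one kernel sample per observation. The textbook way to sample from a Gaussian kernel density estimate (a smoothed bootstrap) first resamples the data with replacement and then adds the kernel noise; a sketch of that variant:
plot(density(stamp$Thickness), col = "black", lwd = 3)
for (i in 1:10) {
    centers <- sample(stamp$Thickness, nobs, replace = TRUE)  # resample kernel centers
    newThick <- rnorm(nobs, mean = centers, sd = dens$bw)     # add kernel noise
    lines(density(newThick), col = "grey", lwd = 3)
}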
2.6 Increasing variability
plot(density(stamp$Thickness), col = "black", lwd = 3)
for (i in 1:10) {
    newThick <- rnorm(nobs, mean = stamp$Thickness, sd = dens$bw * 1.5)  # inflate the kernel sd to add variability
    lines(density(newThick, bw = dens$bw), col = "grey", lwd = 3)
}