Created on 14 Aug 2013
Revised on Thu Aug 15 15:17:31 2013
Key ideas
· Hypothesis testing/significance analysis is commonly overused
· Correcting for multiple testing avoids false positives (false discoveries)
· Two key components
- Error measure
- Correction
Types of errors
· Type I error or false positive - calling a truly null result significant
· Type II error or false negative - failing to call a truly non-null result significant
Error rates
- False positive rate - the rate at which truly null hypotheses are called significant
- Family-wise error rate (FWER) - the probability of at least one false positive among all the tests
- False discovery rate (FDR) - the expected proportion of significant calls that are actually false positives
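A toy numeric example may help fix the definitions. The counts below are hypothetical (made up purely for illustration): out of 1000 tests, 900 are truly null and 100 truly non-null. FWER, being the probability of at least one false positive, is a property of the whole testing procedure rather than of a single table, so only the two rates are computed here:
FP <- 50; TN <- 850  # truly null: 50 called significant, 850 not
TP <- 80; FN <- 20   # truly non-null: 80 called significant, 20 missed
FP / (FP + TN)       # false positive rate: 50/900, about 0.056
FP / (FP + TP)       # false discovery rate: 50/130, about 0.385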
1.1 Controlling family-wise error rate (FWER)
The Bonferroni correction is the oldest multiple testing correction.
Basic idea:
- Suppose you do m tests
- You want to control FWER at level α
- Calculate P-values normally
- Set α′ = α/m
- Call all P-values less than α′ significant
Pros: Easy to calculate, conservative
Cons: May be very conservative
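A minimal sketch of the rule in R, assuming a vector pValues of raw P-values; p.adjust(pValues, method = "bonferroni"), used in the case studies below, implements the same correction up to capping the adjusted values at 1:
m <- length(pValues)
alpha <- 0.05
sum(pValues < alpha / m)           # reject tests with P < alpha/m
sum(pmin(m * pValues, 1) < alpha)  # equivalent: adjust the P-values, then compare to alpha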
1.2 Controlling false discovery rate (FDR)
This is the most popular correction when performing many tests, say in genomics, imaging,
astronomy, or other signal-processing disciplines.
Basic idea:
- Suppose you do m tests
- You want to control FDR at level α
- Calculate P-values normally
- Order the P-values from smallest to largest: P(1) ≤ P(2) ≤ … ≤ P(m)
- Call any P(i) less than α · i/m significant
Pros: Still pretty easy to calculate, less conservative (maybe much less)
Cons: Allows for more false positives, may behave strangely under dependence
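A sketch of the step-up rule by hand, again assuming a vector pValues of raw P-values; p.adjust(pValues, method = "BH") implements the same procedure:
m <- length(pValues)
alpha <- 0.05
pSorted <- sort(pValues)
below <- pSorted <= alpha * (1:m) / m          # compare each P(i) to alpha*i/m
k <- if (any(below)) max(which(below)) else 0  # largest i meeting the bound
nSig <- if (k > 0) sum(pValues <= pSorted[k]) else 0
nSig                                           # number of discoveries: the k smallest P-values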
1.3 Case study I: no true positives
set.seed(1010093)
pValues <- rep(NA, 1000)
for (i in 1:1000) {
    y <- rnorm(20)
    x <- rnorm(20)  # x and y are independent, so every null hypothesis is true
    pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]  # P-value for the slope
}
sum(pValues < 0.05) # Controls false positive rate
## [1] 51
sum(p.adjust(pValues, method = "bonferroni") < 0.05) # Controls FWER
## [1] 0
sum(p.adjust(pValues, method = "BH") < 0.05) # Controls FDR
## [1] 0
1.4 Case study II: 50% true positives
set.seed(1010093)
pValues <- rep(NA, 1000)
for (i in 1:1000) {
    x <- rnorm(20)
    # First 500 beta=0, last 500 beta=2
    if (i <= 500) {
        y <- rnorm(20)
    } else {
        y <- rnorm(20, mean = 2 * x)
    }
    pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]
}
trueStatus <- rep(c("zero", "not zero"), each = 500)
table(pValues < 0.05, trueStatus)
## trueStatus
## not zero zero
## FALSE 0 476
## TRUE 500 24
table(p.adjust(pValues, method = "bonferroni") < 0.05, trueStatus) # Controls FWER
## trueStatus
## not zero zero
## FALSE 23 500
## TRUE 477 0
table(p.adjust(pValues, method = "BH") < 0.05, trueStatus) # Controls FDR
## trueStatus
## not zero zero
## FALSE 0 487
## TRUE 500 13
P-values versus adjusted P-values
oldpar = par(mfrow = c(1, 2))
plot(pValues, p.adjust(pValues, method = "bonferroni"), pch = 19)
plot(pValues, p.adjust(pValues, method = "BH"), pch = 19)
par(oldpar)
Basic ideas
· Way back in the first week we talked about simulating data from distributions in R using the rfoo functions (rnorm, rbinom, and so on).
· In general, simulations are far more flexible and useful than that:
- For bootstrapping as we saw in week 7
- For evaluating models
- For testing different hypotheses
- For sensitivity analysis
· At a minimum it is useful to simulate
- A best-case scenario
- A few examples where you know your approach won't work
- The extremes, since behavior at extreme inputs is often the most revealing
2.1 Simulating data from a model
Suppose that you have a regression model
Yi = b0 + b1 * Xi + ei
Here is an example of generating data from this model where Xi and ei are normal:
set.seed(44333)
x <- rnorm(50)
e <- rnorm(50)
b0 <- 1
b1 <- 2
y <- b0 + b1 * x + e
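As a quick sanity check (not part of the original notes), fitting the model back to the simulated data should recover estimates near b0 = 1 and b1 = 2:
summary(lm(y ~ x))$coeff  # intercept and slope estimates should be near 1 and 2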
Violating assumptions
set.seed(44333)
x <- rnorm(50)
e <- rnorm(50)
e2 <- rcauchy(50)  # heavy-tailed Cauchy errors violate the normality assumption
b0 <- 1
b1 <- 2
y <- b0 + b1 * x + e
y2 <- b0 + b1 * x + e2
oldpar = par(mfrow = c(1, 2))
plot(lm(y ~ x)$fitted, lm(y ~ x)$residuals, pch = 19, xlab = "fitted", ylab = "residuals")
plot(lm(y2 ~ x)$fitted, lm(y2 ~ x)$residuals, pch = 19, xlab = "fitted", ylab = "residuals")
par(oldpar)
Repeated simulations
set.seed(44333)
betaNorm <- betaCauch <- rep(NA, 1000)
for (i in 1:1000) {
    x <- rnorm(50)
    e <- rnorm(50)
    e2 <- rcauchy(50)
    b0 <- 1
    b1 <- 2
    y <- b0 + b1 * x + e
    y2 <- b0 + b1 * x + e2
    betaNorm[i] <- lm(y ~ x)$coeff[2]
    betaCauch[i] <- lm(y2 ~ x)$coeff[2]
}
quantile(betaNorm)
## 0% 25% 50% 75% 100%
## 1.500 1.906 2.013 2.100 2.596
quantile(betaCauch)
## 0% 25% 50% 75% 100%
## -278.352 1.130 1.965 2.804 272.391
Monte Carlo Error
boxplot(betaNorm, betaCauch, col = "blue", ylim = c(-5, 5))  # ylim clips the extreme Cauchy slope estimates
2.2 Simulation based on a data set
library(UsingR)
## Loading required package: MASS
data(galton)
nobs <- dim(galton)[1]
oldpar = par(mfrow = c(1, 2))
hist(galton$child, col = "blue", breaks = 100)
hist(galton$parent, col = "blue", breaks = 100)
par(oldpar)
Calculating means, variances
lm1 <- lm(galton$child ~ galton$parent)
parent0 <- rnorm(nobs, sd = sd(galton$parent), mean = mean(galton$parent))  # simulate parents from a normal with the observed mean and sd
child0 <- lm1$coeff[1] + lm1$coeff[2] * parent0 + rnorm(nobs, sd = summary(lm1)$sigma)  # simulate children from the fitted line plus residual noise
oldpar = par(mfrow = c(1, 2))
plot(galton$parent, galton$child, pch = 19)
plot(parent0, child0, pch = 19, col = "blue")
par(oldpar)
2.3 Simulating more complicated scenarios
library(bootstrap)
data(stamp)
nobs <- dim(stamp)[1]
hist(stamp$Thickness, col = "grey", breaks = 100, freq = FALSE)
dens <- density(stamp$Thickness)
lines(dens, col = "blue", lwd = 3)
2.4 A simulation that is too simple
plot(density(stamp$Thickness), col = "black", lwd = 3)
for (i in 1:10) {
    newThick <- rnorm(nobs, mean = mean(stamp$Thickness), sd = sd(stamp$Thickness))  # a single normal misses the multimodal shape
    lines(density(newThick), col = "grey", lwd = 3)
}
2.5 Simulating from the density estimate
plot(density(stamp$Thickness), col = "black", lwd = 3)
for (i in 1:10) {
    newThick <- rnorm(nobs, mean = stamp$Thickness, sd = dens$bw)  # one kernel draw centered at each observation
    lines(density(newThick), col = "grey", lwd = 3)
}
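Note this draws exactly one kernel sample per observation. The textbook way to sample from a Gaussian kernel density estimate (a smoothed bootstrap) first resamples the data with replacement and then adds the kernel noise; a sketch of that variant:
plot(density(stamp$Thickness), col = "black", lwd = 3)
for (i in 1:10) {
    centers <- sample(stamp$Thickness, nobs, replace = TRUE)  # resample kernel centers
    newThick <- rnorm(nobs, mean = centers, sd = dens$bw)     # add kernel noise
    lines(density(newThick), col = "grey", lwd = 3)
}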
2.6 Increasing variability
plot(density(stamp$Thickness), col = "black", lwd = 3)
for (i in 1:10) {
    newThick <- rnorm(nobs, mean = stamp$Thickness, sd = dens$bw * 1.5)  # inflate the kernel sd to add variability
    lines(density(newThick, bw = dens$bw), col = "grey", lwd = 3)
}