We will use the following dataset to demonstrate the use of permutations:
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/babies.txt"
filename <- basename(url)
download(url, destfile=filename)
babies <- read.table("babies.txt", header=TRUE)
bwt.nonsmoke <- filter(babies, smoke==0) %>% select(bwt) %>% unlist
bwt.smoke <- filter(babies, smoke==1) %>% select(bwt) %>% unlist
N=10
set.seed(1)
nonsmokers <- sample(bwt.nonsmoke , N)
smokers <- sample(bwt.smoke , N)
obs <- mean(smokers) - mean(nonsmokers)
The question is whether this observed difference is statistically significant. We do not want to rely on the assumptions needed for the normal or t-distribution approximations to hold, so instead we will use permutations. We will reshuffle the data and recompute the mean. We can create one permuted sample with the following code:
dat <- c(smokers,nonsmokers)
shuffle <- sample(dat)
smokersstar <- shuffle[1:N]
nonsmokersstar <- shuffle[(N+1):(2*N)]
mean(smokersstar)-mean(nonsmokersstar)
## [1] -8.5
The last value is one observation from the null distribution we will construct. Set the seed at 1, and then repeat the permutation 1,000 times to create a null distribution. What is the permutation derived p-value for our observation?
set.seed(1)
avgdiff <- replicate(1000, {
dat <- c(smokers,nonsmokers)
shuffle <- sample(dat)
smokersstar <- shuffle[1:N]
nonsmokersstar <- shuffle[(N+1):(2*N)]
return(mean(smokersstar)-mean(nonsmokersstar))
})
hist(avgdiff)
abline(v=obs)
(sum(abs(avgdiff) > abs(obs)) + 1) / (length(avgdiff) + 1)
## [1] 0.05294705
set.seed(1)
avgdiff <- replicate(1000, {
dat <- c(smokers,nonsmokers)
shuffle <- sample(dat)
smokersstar <- shuffle[1:N]
nonsmokersstar <- shuffle[(N+1):(2*N)]
return(median(smokersstar)-median(nonsmokersstar))
})
hist(avgdiff)
# mean observed difference
mean.p <- (sum(abs(avgdiff) > abs(obs)) + 1) / (length(avgdiff) + 1)
abline(v=obs, col = "blue")
# median observed difference
obs <- median(smokersstar)-median(nonsmokersstar)
median.p <- (sum(abs(avgdiff) > abs(obs)) + 1) / (length(avgdiff) + 1)
abline(v=obs, col = "red")
mean.p
## [1] 0.2827173
median.p
## [1] 0.3396603