Say I weigh 20 people who eat burgers and 20 who don’t. My results are not significant, so I weigh all these people a second time to get a better measurement.

I’ve just doubled my dataset! 40 measurements from burger eaters. 40 from sad people. My results show that burger eaters are significantly lighter.

Are they?

get_data <- function(n = 20, resample_n = 5) {
  # Draw n independent measurements, then include each one resample_n times,
  # as if the repeated weighings were new, independent data points
  samples <- rnorm(n)
  rep(samples, resample_n)
}
is_significant <- function(...) {
  # Both groups come from the same distribution, so any 'significant'
  # difference between them is a false positive
  a <- get_data(...)
  b <- get_data(...)
  
  t.test(a, b)$p.value < 0.05
}
positive_rate <- function(...) {
  # Fraction of 10,000 simulated experiments that come out 'significant'
  total <- 0
  for (i in 1:10000) {
    total <- total + is_significant(...)
  }
  total / 10000
}

to_test <- expand.grid(
  n = 20,
  resample_n = c(1, 2, 5, 10, 20, 100)
)

# Estimate the false-positive rate at each level of duplication
for (i in 1:nrow(to_test)) {
  row <- to_test[i, ]
  to_test[i, 'rate'] <- with(row, positive_rate(n, resample_n))
}


library(ggplot2)
library(scales)

ggplot(to_test, aes(resample_n, rate)) + 
  geom_point() +
  geom_hline(yintercept=0.05, color='black') +
  scale_x_log10(breaks=c(1, 2, 5, 10, 20, 100)) +
  scale_y_continuous(labels = percent, limits=c(0, 1)) +
  geom_path(color="red") +
  ylab("Rate of 'significant' results") +
  xlab("Number of times each sample is duplicated")

No!

Here we see the rate of ‘significant’ results (y-axis) across the simulations (red line). Remember, both groups are drawn from the same distribution, so every ‘significant’ result is a false positive, and a rate of 0.05 (black line) is what we would expect by chance. The false-positive rate climbs far above that once we treat repeated measurements of the same samples as independent data (x-axis). If, for instance, we include 10 copies of each measurement, we get ‘significant’ results over 50% of the time!
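Why does this happen? Duplicating measurements barely changes the group means or standard deviations, but the t-test believes it has many more independent observations, so the standard error it computes shrinks by roughly the square root of the duplication factor, and the p-value drops with it. Here is a minimal sketch of that effect on a single pair of samples (the variable names are just illustrative, not part of the simulation above):

set.seed(1)
a <- rnorm(20)
b <- rnorm(20)

# The same data, with each measurement included 10 times
a_dup <- rep(a, 10)
b_dup <- rep(b, 10)

t.test(a, b)$p.value          # the honest p-value
t.test(a_dup, b_dup)$p.value  # smaller, though no new information was added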

Make sure your samples are independent!