I received an email asking whether bootstrapped results become more significant as n increases.
I think you have taught me enough that I can have an educated conversation about this stuff now. I understand the merits of using a bootstrap method to estimate the sample mean and the variance within it. (In fact it makes more sense than just “randomly” plugging unknowns into a statistical model.) Then using the overlap of the histograms to determine the probability the two populations are different. However, in thinking about this I was given some pause. There is no accounting for the number of observations, which might be good because you can’t “cheat” a p-value by increasing your N. This has a downside as well, I think. Two populations can exist that are different but very similar. Using a standard t-test, making enough observations will allow you to be confident that there is a subtle but true difference, whereas bootstrapping would never be able to lead you to this conclusion. Thoughts?
Here’s an experiment showing that they do.
Let’s consider two normal variables, x and y, with different means, and from each draw a small sample and a larger sample.
# Seed the random values
# so that the experiment behaves consistently
set.seed(1)
# Create the data
x_small <- rnorm( n=5, mean=1)
x_big <- rnorm(n=10, mean=1)
y_small <- rnorm( n=5, mean=0)
y_big <- rnorm(n=10, mean=0)
A t-test should show that x and y are different, and this difference should be more significant when the sample size is larger. Indeed it is.
t.test(x=x_big, y=y_big, alternative="greater")$p.value # p = 0.008151
## [1] 0.008151
t.test(x=x_small, y=y_small, alternative="greater")$p.value # p = 0.1061
## [1] 0.1061
The bootstrap should show the same trend: a larger sample size results in more significance.
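bootstrap <- function(x, y, n=10000) {
  # Preallocate the results instead of growing a vector on every iteration
  x_is_bigger <- logical(n)
  for(i in seq_len(n)) {
    # Resample each group with replacement (same size as the original sample)
    mean_x <- mean(sample(x, replace=TRUE))
    mean_y <- mean(sample(y, replace=TRUE))
    x_is_bigger[i] <- mean_x > mean_y
  }
  # Fraction of resamples in which the mean of x did not exceed the mean of y
  1 - sum(x_is_bigger) / n
}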
bootstrap(x_small, y_small) # p = 0.0541
## [1] 0.0541
bootstrap(x_big, y_big) # p = 0.0032
## [1] 0.0032
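One seed and one pair of sample sizes is thin evidence, so here is a rough sketch that repeats the comparison across several sample sizes. The helper boot_p is something I’m introducing for illustration (it is not the function above, just a compact version of the same idea), and the seed, the list of sizes, and n_boot=2000 are arbitrary assumptions. Because each row comes from a single random draw, the p-values need not fall at every step, but both columns should shrink overall as n grows.

```r
# Hypothetical helper (an assumption for this sketch): bootstrap estimate of
# how often the resampled mean of x fails to exceed the resampled mean of y
boot_p <- function(x, y, n_boot = 2000) {
  x_is_bigger <- replicate(
    n_boot,
    mean(sample(x, replace = TRUE)) > mean(sample(y, replace = TRUE))
  )
  1 - mean(x_is_bigger)
}

set.seed(42)  # arbitrary seed, used only so the sketch is reproducible
sizes <- c(5, 10, 20, 50, 100)  # arbitrary sample sizes to compare

# One row per sample size: t-test p-value next to the bootstrap p-value
results <- t(sapply(sizes, function(n) {
  x <- rnorm(n, mean = 1)
  y <- rnorm(n, mean = 0)
  c(n = n,
    t_test = t.test(x, y, alternative = "greater")$p.value,
    bootstrap = boot_p(x, y))
}))
print(results)
```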
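Why does the bootstrap get more significant with larger sample sizes? First, it does account for the number of observations: sample(x, replace=TRUE) draws as many values as x contains, so each bootstrap mean averages over the full sample size. The rest of the answer is in the central limit theorem. The mean of n draws from a distribution is approximately normal, with a standard deviation that shrinks in proportion to 1/sqrt(n), so the more values you average, the closer the mean tends to be to the true mean. The same is true when you draw bootstrap samples from a larger dataset. The resampled means of each group cluster more tightly around that group’s sample mean, so the two bootstrap distributions overlap less and a resampled mean of y exceeds a resampled mean of x less often.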
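To make that concrete, here is a minimal sketch that measures the spread of the bootstrap means directly. The helper boot_mean_sd, the sample sizes, and n_boot=2000 are my own assumptions for illustration. The data are standard normal, so the theoretical standard error of the mean is 1/sqrt(n), and the bootstrap spread should be of roughly that size, though for very small samples it can be noticeably off.

```r
# Hypothetical helper (an assumption for this sketch): standard deviation
# of the bootstrap means computed from a single sample x
boot_mean_sd <- function(x, n_boot = 2000) {
  sd(replicate(n_boot, mean(sample(x, replace = TRUE))))
}

set.seed(1)
sizes <- c(5, 50, 500)  # arbitrary sample sizes to compare

# Draw one standard normal sample per size and bootstrap its mean
spread <- sapply(sizes, function(n) boot_mean_sd(rnorm(n)))

# Compare with the theoretical standard error sigma / sqrt(n), where sigma = 1
print(data.frame(n = sizes, bootstrap_sd = spread, theory = 1 / sqrt(sizes)))
```

A narrower spread for each group is what pushes the two histograms apart, even when the difference between the group means stays the same.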