The general idea in resampling is that if you are unwilling to make distributional assumptions about your data, then the best information about the population is your sample itself. Your goal is typically to find the sampling distribution of the mean based on your sample. That means you want to assume that each item is sampled from the same distribution: your sample. You also resample with the same \(n\) as the original sample, because changing \(n\) changes the variance of the sampling distribution of the mean. Finally, you typically shift the sampling distribution of the mean so that it is centered at \(\mu_0\), as specified by \(H_0\).
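To see why \(n\) matters, here is a minimal sketch that resamples a made-up sample at several sizes. The data (`x`), the helper `boot_sd`, the seed, and the sizes are all assumptions for illustration and are not part of the analysis in this post. The standard deviation of the resampled means should shrink roughly like \(1/\sqrt{n}\).

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rexp(30)                    # hypothetical sample, not the data used in this post

# Hypothetical helper: sd of the resampled means when drawing n items
# from x with replacement, repeated `reps` times
boot_sd <- function(x, n, reps = 5000) {
  sd(replicate(reps, mean(sample(x, n, replace = TRUE))))
}

sizes <- c(5, 20, 80)
sds   <- sapply(sizes, function(n) boot_sd(x, n))

# Theoretical value: sd of the empirical distribution of x (divisor
# length(x), not length(x) - 1) divided by sqrt(n)
emp_sd <- sqrt(mean((x - mean(x))^2))
rbind(simulated = sds, theory = emp_sd / sqrt(sizes))
```

If the resample size differed from the original \(n\), the resulting distribution would be too wide or too narrow for a test about a mean based on \(n\) observations.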
Start by generating a population. For fun, I will make it non-normal.
set.seed(42)
pop <- rf(100000, 2, 10)
pop <- pop-mean(pop) # set mu=0
summary(pop)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2540 -0.9568 -0.5090 0.0000 0.3428 44.6000
sd(pop)
## [1] 1.612483
Population Distribution.
n <- 5 # sample size
# draw one sample of n items from the population
samp1 <- sample(pop, n, replace=TRUE)
summary(samp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2370 -1.0570 -0.8729 -0.3103 -0.5413 2.1560
sd(samp1)
## [1] 1.402738
Compute the \(p\)-value with a one-sample t-test, which assumes normality.
t.test(samp1)
##
## One Sample t-test
##
## data: samp1
## t = -0.49464, df = 4, p-value = 0.6468
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -2.052029 1.431430
## sample estimates:
## mean of x
## -0.3102998
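As a check on what t.test is doing, the same quantities can be computed by hand from the mean and standard deviation printed above. This is only a sketch: the variable names are mine, and because the inputs are the rounded values from the output, the results should agree with t.test only up to rounding.

```r
# Summary statistics copied from the output above
xbar <- -0.3102998   # mean(samp1)
s    <- 1.402738     # sd(samp1)
n    <- 5            # sample size
mu0  <- 0            # mean under H0

se     <- s / sqrt(n)                        # estimated standard error of the mean
t_stat <- (xbar - mu0) / se                  # one-sample t statistic
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)   # two-tailed p-value
ci     <- xbar + c(-1, 1) * qt(0.975, df = n - 1) * se   # 95% confidence interval

round(c(t = t_stat, p = p_val, lower = ci[1], upper = ci[2]), 4)
```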
Now try resampling. I sample \(n\) items from samp1 with replacement and compute the mean. I do this 100,000 times.
samp1.sdm <- replicate(100000, mean(sample(samp1, n, replace=TRUE)))
summary(samp1.sdm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.189000 0.000924 0.452400 0.498800 0.940300 3.848000
Since \(H_0\) states that \(\mu = \mu_0 = 0\), shift samp1.sdm so that its mean is \(\mu_0\).
samp1.sdm <- samp1.sdm - mean(samp1.sdm) # center at mu_0 = 0
Centered sample-based estimates of the sampling distribution of the mean (blue assumes normality and the histogram is generated via resampling).
(samp1.resample.p <- (sum(samp1.sdm >= abs(mean(samp1))) +
sum(samp1.sdm <= -abs(mean(samp1))))/length(samp1.sdm)) # 2 tailed
## [1] 0.66966
The two-tailed \(p\)-value from the t-test (0.6468) is close to the one generated via resampling (0.6697).
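The resampling steps above can be wrapped into a small function. This is a sketch, not a polished implementation: the name `resample_p`, its arguments, and the example values in `x` are hypothetical, and it generalizes the centering step to an arbitrary \(\mu_0\), which is my assumption about how the procedure would extend beyond \(\mu_0 = 0\).

```r
# Hypothetical helper: two-tailed resampling test of H0: mu = mu0
resample_p <- function(x, mu0 = 0, reps = 10000) {
  n <- length(x)
  # resample n items with replacement and record the mean, reps times
  sdm <- replicate(reps, mean(sample(x, n, replace = TRUE)))
  # shift the resampled distribution so that it is centered at mu0
  sdm <- sdm - mean(sdm) + mu0
  # proportion of resampled means at least as far from mu0 as the observed mean
  d <- abs(mean(x) - mu0)
  mean(sdm >= mu0 + d | sdm <= mu0 - d)
}

set.seed(42)                            # arbitrary seed
x <- c(-1.2, -1.1, -0.9, -0.5, 2.2)     # made-up values, not samp1
resample_p(x)
```

With `mu0 = 0` this follows the same steps as the code above, except that the two tails are combined with a logical OR instead of being summed separately.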
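# For comparison, build the sampling distribution of the mean directly from
# the population (possible here only because the population was simulated)
pop.sdm <- replicate(100000, mean(sample(pop, n, replace=TRUE)))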
summary(pop.sdm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.207000 -0.500300 -0.154500 -0.000915 0.321700 9.826000
Centered sampling distribution of the mean derived from the population (blue assumes normality and uses the population parameters; the histogram is generated by repeated sampling from the population).
(pop.sdm.p <- (sum(pop.sdm >= abs(mean(samp1))) +
sum(pop.sdm <= -abs(mean(samp1))))/length(pop.sdm)) # 2 tailed
## [1] 0.64466