Resampling By Hand

In R, of Course

Michael Erickson

2016-09-21

Introduction

The general idea in resampling is that if you are unwilling to make distributional assumptions about your data, then the best information you have about the population is your sample itself. Your goal is typically to find the sampling distribution of the mean based on your sample. That means you assume that each item is drawn from the same distribution: your sample. Each resample should have the same size \(n\) as the original sample, because changing \(n\) changes the variance of the sampling distribution of the mean. You also typically shift the sampling distribution of the mean so that it is centered at \(\mu_0\), as specified by \(H_0\).
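The two points about \(n\) and about centering can be written out as a short worked statement. This is a sketch under the usual assumption of \(n\) independent draws from a distribution with variance \(\sigma^2\). The symbols \(\sigma^2\), \(B\) (the number of resamples), and \(\bar{x}^{*}_{b}\) (the mean of the \(b\)-th resample) are notation introduced here for illustration.

\[
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n},
\qquad
\bar{x}^{*}_{b,\mathrm{centered}} = \bar{x}^{*}_{b} - \frac{1}{B}\sum_{j=1}^{B}\bar{x}^{*}_{j} + \mu_0 .
\]

The first expression shows why the resample size has to match the sample size: a different \(n\) gives a sampling distribution with a different spread. The second shifts the resampled means so that their average is exactly \(\mu_0\).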

Simulation

Start by generating a population. For fun, I will make it non-normal.

set.seed(42)
pop <- rf(100000, 2, 10)
pop <- pop-mean(pop) # set mu=0
summary(pop)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.2540 -0.9568 -0.5090  0.0000  0.3428 44.6000
sd(pop)
## [1] 1.612483

Population Distribution.

First: Run the Experiment

n <- 5
samp1 <- sample(pop, n, replace=TRUE)
summary(samp1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.2370 -1.0570 -0.8729 -0.3103 -0.5413  2.1560
sd(samp1)
## [1] 1.402738

Run a t-test

By default, t.test() tests \(H_0\colon \mu = 0\) and computes the \(p\)-value assuming normality.

t.test(samp1)
## 
##  One Sample t-test
## 
## data:  samp1
## t = -0.49464, df = 4, p-value = 0.6468
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -2.052029  1.431430
## sample estimates:
##  mean of x 
## -0.3102998

Resampling by hand

Now try resampling. I sample \(n\) items from samp1 with replacement and compute the mean. I do this 100,000 times.

samp1.sdm <- replicate(100000, mean(sample(samp1, n, replace=TRUE)))
summary(samp1.sdm)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.189000  0.000924  0.452400  0.498800  0.940300  3.848000

Since \(H_0\) states that \(\mu = \mu_0 = 0\), shift samp1.sdm so that its mean is \(\mu_0\).

samp1.sdm <- samp1.sdm - mean(samp1.sdm)

Centered sample-based estimates of the sampling distribution of the mean (blue assumes normality and the histogram is generated via resampling).

(samp1.resample.p <- (sum(samp1.sdm >= abs(mean(samp1))) + 
                        sum(samp1.sdm <= -abs(mean(samp1))))/length(samp1.sdm)) # 2 tailed
## [1] 0.66966

The two-tailed \(p\)-value from the t-test (0.6468) is close to the one generated via resampling (0.66966).
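The resampling steps above (resample, center, count the tails) can be collected into one function. This is a sketch rather than a definitive implementation: resample_p is a hypothetical helper name, its arguments mu0 and reps are assumptions, and the data vector is made up for illustration.

```r
# resample_p is a hypothetical helper, not a function from any package.
resample_p <- function(x, mu0 = 0, reps = 10000) {
  n_x <- length(x)
  # Draw n_x items with replacement and record the mean of each resample
  sdm <- replicate(reps, mean(sample(x, n_x, replace = TRUE)))
  # Center the resampled means at mu0, as H0 specifies
  sdm <- sdm - mean(sdm) + mu0
  # Two-tailed p-value: proportion of centered means at least as far
  # from mu0 as the observed sample mean
  observed <- abs(mean(x) - mu0)
  mean(abs(sdm - mu0) >= observed)
}

set.seed(1)
x <- c(-1.2, -1.1, -0.9, -0.5, 2.2)      # hypothetical data, not samp1
p <- resample_p(x)
p
```

With mu0 = 0, the last line of the function counts the same two tails as the sum() expression used above, written as a single proportion.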

Approximating the Population Sampling Distribution of the Mean

pop.sdm <- replicate(100000, mean(sample(pop, n, replace=TRUE)))
summary(pop.sdm)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.207000 -0.500300 -0.154500 -0.000915  0.321700  9.826000

Centered sampling distribution of the mean derived from the population (blue assumes normality but is generated from the population parameters and the histogram is generated from the population).

(pop.sdm.p <- (sum(pop.sdm >= abs(mean(samp1))) + 
                        sum(pop.sdm <= -abs(mean(samp1))))/length(pop.sdm)) # 2 tailed
## [1] 0.64466
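As a check on the simulation, the spread of the simulated means can be compared with the theoretical standard error \(\sigma/\sqrt{n}\), and the simulated \(p\)-value with a normal-theory one built from the population parameters. The sketch below regenerates the population so that it runs on its own. The names reps, se_theory, se_sim, and obs_mean are introduced here, obs_mean is a rounded stand-in for the sample mean reported earlier, and the replication count is reduced for speed. Results will not necessarily match the output printed in the text, since random number streams can differ across R versions.

```r
set.seed(42)
pop <- rf(100000, 2, 10)
pop <- pop - mean(pop)                   # center the population at mu = 0
n    <- 5
reps <- 20000                            # fewer replications than the text, for speed

# Simulated sampling distribution of the mean, drawn from the population
pop_sdm <- replicate(reps, mean(sample(pop, n, replace = TRUE)))

se_theory <- sd(pop) / sqrt(n)           # sigma / sqrt(n)
se_sim    <- sd(pop_sdm)                 # spread of the simulated means

# Stand-in for the observed sample mean (rounded; an assumption for illustration)
obs_mean <- -0.31
p_normal <- 2 * pnorm(-abs(obs_mean) / se_theory)   # normal-theory, two-tailed
p_sim    <- mean(abs(pop_sdm) >= abs(obs_mean))     # simulation-based, two-tailed

c(se_theory = se_theory, se_sim = se_sim, p_normal = p_normal, p_sim = p_sim)
```

The two standard errors should agree closely. The two \(p\)-values can differ more, because the population is skewed and, at \(n = 5\), the sampling distribution of the mean is still visibly non-normal.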