Resampling is a way to reuse data to generate new, hypothetical samples (called resamples) that are representative of an underlying population. It is useful when (1) the sampling distribution of an estimate is unknown, (2) traditional formulas, such as those based on the central limit theorem (CLT), are difficult or impossible to apply, or (3) the estimator is new and has not been studied yet.
Two popular resampling tools are the jackknife and the bootstrap. Although they have many similarities (e.g., both can estimate the precision of an estimator \(\hat{\theta}\)), they have a few notable differences.
The jackknife works by deleting one observation at a time from the data set and recomputing the desired statistic on the remaining observations. It is computationally simpler than bootstrapping and can sometimes be computed by hand. Its main applications are reducing the bias and estimating the variance of an estimator. The following R code performs the standard jackknife procedure:
# Jackknife
data <- c(4.3, 5.7, 3.2, 5.2, 3.1, 6.5, 4.8)
n <- length(data)
# Approximate the SE of the sample mean using the jackknife
mean.jk <- numeric(n)  # one leave-one-out mean per observation
for(i in 1:n) {
  data.jk <- data[-i]          # drop the i-th observation
  mean.jk[i] <- mean(data.jk)  # mean of the remaining n - 1 values
}
mean.jk
## [1] 4.750000 4.516667 4.933333 4.600000 4.950000 4.383333 4.666667
(SE.jk <- sqrt((n-1)^2/n)*sd(mean.jk))
## [1] 0.4748075
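The leave-one-out recipe above works for any statistic, not just the mean. The sketch below wraps it in a small helper, `jackknife()`, which is a hypothetical name introduced here for illustration and not a base R function. It uses the standard jackknife formulas \(SE = \sqrt{\frac{n-1}{n}\sum_i (\hat{\theta}_{(i)} - \bar{\theta}_{(\cdot)})^2}\) and \(\text{bias} = (n-1)(\bar{\theta}_{(\cdot)} - \hat{\theta})\), where \(\hat{\theta}_{(i)}\) is the estimate with observation \(i\) removed and \(\bar{\theta}_{(\cdot)}\) is the average of those estimates. To illustrate the bias-reduction use, it is also applied to the plug-in variance (`var.plugin`, also a name made up for this sketch), which divides by \(n\) and is therefore biased.

```r
# Same sample as in the example above
data <- c(4.3, 5.7, 3.2, 5.2, 3.1, 6.5, 4.8)

# Hypothetical helper: leave-one-out jackknife for an arbitrary statistic
jackknife <- function(x, stat) {
  n <- length(x)
  theta.hat <- stat(x)  # estimate from the full sample
  # Leave-one-out estimates, one per deleted observation
  theta.i <- sapply(seq_len(n), function(i) stat(x[-i]))
  theta.bar <- mean(theta.i)
  list(
    estimate  = theta.hat,
    se        = sqrt((n - 1) / n * sum((theta.i - theta.bar)^2)),
    bias      = (n - 1) * (theta.bar - theta.hat),
    corrected = n * theta.hat - (n - 1) * theta.bar  # estimate minus bias
  )
}

# Plug-in variance: divides by n, so it is biased downward
var.plugin <- function(x) mean((x - mean(x))^2)

jk.mean <- jackknife(data, mean)
jk.var  <- jackknife(data, var.plugin)

jk.mean$se        # same quantity as SE.jk above
jk.var$corrected  # bias-corrected variance estimate
var(data)         # unbiased sample variance, for comparison
```

Two convenient sanity checks: for the sample mean, the jackknife SE reduces to `sd(data)/sqrt(n)`, and for the plug-in variance, the bias-corrected estimate coincides with `var(data)`.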
The bootstrap was introduced by Bradley Efron in the late 1970s and is now the most popular resampling method. It uses sampling with replacement to approximate the sampling distribution of an estimator. Its main purpose is to evaluate the variance of an estimator. The bootstrap can be applied in R using the following code:
# Bootstrap
n <- length(data)
# We choose 100 as the number of bootstrap resamples
mean.boot <- numeric(100)
for(i in 1:100) {
  set.seed(34*i)  # Reseeds every iteration so each resample is reproducible
  boot <- sample(n, n, replace=TRUE)  # Indices drawn with replacement
  data.boot <- data[boot]  # Simple random sample with replacement (SRSWR)
  mean.boot[i] <- mean(data.boot)
}
(SE.boot <- sd(mean.boot))
## [1] 0.4505097
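A more compact way to write the same resampling loop, as a sketch: `replicate()` repeats the resample-and-recompute step, and a single `set.seed()` call before it keeps the whole run reproducible. The number of resamples (`B <- 2000`) and the seed value are arbitrary choices made here, not values from the example above, so the result will differ slightly from `SE.boot`.

```r
data <- c(4.3, 5.7, 3.2, 5.2, 3.1, 6.5, 4.8)
n <- length(data)

set.seed(2024)  # arbitrary seed (assumption), set once for the whole run
B <- 2000       # number of bootstrap resamples (assumption)

# Each replicate draws n values with replacement and returns their mean
mean.boot.vec <- replicate(B, mean(sample(data, n, replace = TRUE)))

(SE.boot.vec <- sd(mean.boot.vec))

# Reference value: as the number of resamples grows, the bootstrap SE of the
# mean approaches the plug-in standard deviation divided by sqrt(n)
sqrt(mean((data - mean(data))^2) / n)
```

Increasing `B` mainly reduces the Monte Carlo noise in the estimate. It does not add information beyond what the seven observations contain.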
The bootstrap can be used in many different applications, including hypothesis tests for unknown population quantities. For example, consider the airquality data in R. Suppose that we wish to test the claim that there is a significant correlation between Wind and Temperature. We can test this hypothesis by bootstrapping the correlation coefficient \(r\) to approximate its sampling distribution.
# Using Airquality Data
library(datasets)
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
cor.boot <- numeric(500)
# We use 500 bootstrap resamples here
for(i in 1:500) {
  set.seed(34*i)
  # Resample whole rows so each Wind value stays paired with its Temp value
  boot <- sample(nrow(airquality), nrow(airquality), replace=TRUE)
  data.boot <- airquality[boot,]
  cor.boot[i] <- cor(data.boot$Wind, data.boot$Temp)
}
# Is the correlation significantly different from zero?
# H0: rho = 0
# Ha: rho != 0
hist(cor.boot)
quantile(cor.boot, c(0.025, 0.975))
##       2.5%      97.5%
## -0.5863520 -0.3285571
Conclusion: Since the 95% bootstrap percentile interval does not include zero, we conclude that the correlation between Wind and Temperature is significantly different from zero at the 5% level.
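As a sanity check on the bootstrap conclusion, the classical test for a Pearson correlation is available in base R as `cor.test()`. The sketch below assumes the usual conditions for that test (roughly bivariate normal data) hold well enough for a rough comparison. It is a cross-check, not part of the bootstrap procedure itself.

```r
library(datasets)

# Classical t-test of H0: rho = 0 for the Pearson correlation
ct <- cor.test(airquality$Wind, airquality$Temp)

ct$estimate  # sample correlation r
ct$conf.int  # 95% CI based on Fisher's z transformation
ct$p.value   # two-sided p-value
```

If the classical interval also excludes zero, the two approaches lead to the same conclusion.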