Resampling Methods

Resampling is a way to reuse data to generate new, hypothetical samples (called resamples) that are representative of an underlying population. It is useful when (1) the sampling distribution of an estimate is unknown, (2) traditional formulas, such as those based on the central limit theorem (CLT), are difficult or impossible to apply, or (3) the estimator is new and has not yet been studied.

Two popular tools for resampling are the jackknife and the bootstrap. Although they have many similarities (e.g. both can estimate the precision of an estimator \(\theta\)), they do have a few notable differences.

Jackknife

The jackknife works by deleting one observation at a time from the data set and recomputing the desired statistic on the remaining observations. It is computationally simpler than bootstrapping and can sometimes be computed by hand. Its main applications are reducing the bias and evaluating the variance of an estimator. The following R code performs the standard jackknife procedure:

# Jackknife

data <- c(4.3, 5.7, 3.2, 5.2, 3.1, 6.5, 4.8)
n <- length(data)

# Calculate the approximate SE of the sample mean using the jackknife

mean.jk <- numeric(n)  # one leave-one-out mean per observation

for(i in 1:n) {
  data.jk <- data[-i]          # drop the i-th observation
  mean.jk[i] <- mean(data.jk)  # recompute the mean without it
}

mean.jk
## [1] 4.750000 4.516667 4.933333 4.600000 4.950000 4.383333 4.666667
# Jackknife SE: sqrt((n-1)/n * sum((mean.jk - mean(mean.jk))^2)), written here via sd()
(SE.jk <- sqrt((n-1)^2/n)*sd(mean.jk))
## [1] 0.4748075
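
The code above covers the variance side of the jackknife. The bias side can be sketched in the same way. The example below is a minimal illustration rather than part of the procedure above: it takes the plug-in variance (which divides by n), a biased estimator chosen here purely for demonstration, and applies the usual jackknife bias estimate, (n-1) times the difference between the mean of the leave-one-out estimates and the full-sample estimate. The names plugin.var, theta.hat, theta.jk, bias.jk and theta.corrected are introduced only for this sketch.

```r
# Jackknife bias correction (illustrative sketch)
data <- c(4.3, 5.7, 3.2, 5.2, 3.1, 6.5, 4.8)
n <- length(data)

# Hypothetical example estimator: plug-in variance, which divides by n and is biased
plugin.var <- function(x) mean((x - mean(x))^2)

theta.hat <- plugin.var(data)

# Leave-one-out estimates of the same statistic
theta.jk <- sapply(1:n, function(i) plugin.var(data[-i]))

# Jackknife estimate of the bias and the bias-corrected estimate
bias.jk <- (n - 1) * (mean(theta.jk) - theta.hat)
theta.corrected <- theta.hat - bias.jk

c(plugin = theta.hat, bias = bias.jk, corrected = theta.corrected, unbiased = var(data))
```

For this particular estimator the jackknife correction reproduces the usual unbiased sample variance, var(data), which makes it a convenient sanity check.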

Bootstrap

The bootstrap was introduced by Bradley Efron in the late 1970s and is now one of the most widely used resampling methods. It uses sampling with replacement to approximate the sampling distribution of an estimator. Its main purpose is to evaluate the variance of an estimator. The bootstrap can be applied in R using the following code:

# Bootstrap

mean.boot <- vector()
n <- length(data)

# We choose 100 as the number of bootstrap resamples
for(i in 1:100) {
  
  set.seed(34*i)  # Fixes the seed so that each resample is reproducible
  boot <- sample(n,n, replace=TRUE)  # Draws n indices with replacement
  data.boot <- data[boot]  # Simple random sample with replacement (SRSWR)
  
  mean.i.boot <- mean(data.boot)
  mean.boot <- c(mean.boot, mean.i.boot)
} 

(SE.boot <- sd(mean.boot))
## [1] 0.4505097
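
The same resampling loop works for statistics that have no simple standard-error formula, such as the median. As a sketch, the steps above can be wrapped in a small helper. The function boot.se, its arguments stat and B, and the seed value are assumptions made for this illustration, and the seed is set once before resampling rather than inside the loop.

```r
# Bootstrap SE for an arbitrary statistic (illustrative sketch)
data <- c(4.3, 5.7, 3.2, 5.2, 3.1, 6.5, 4.8)

# boot.se is a hypothetical helper, not a function from any package
boot.se <- function(x, stat, B = 1000) {
  # Draw B resamples of size length(x) with replacement and apply stat to each
  stat.boot <- replicate(B, stat(sample(x, length(x), replace = TRUE)))
  sd(stat.boot)
}

set.seed(34)  # arbitrary seed, set once for reproducibility
SE.median <- boot.se(data, median)
SE.mean <- boot.se(data, mean)
c(median = SE.median, mean = SE.mean)
```

The value for the mean should be of similar size to the jackknife and loop-based results above, though not identical, because the resamples differ.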

The bootstrap can be used in many applications, including hypothesis tests about unknown population quantities. For example, consider the airquality data in R. Suppose that we wish to test the claim that there is a significant correlation between Wind and Temp (temperature). We can test this hypothesis by bootstrapping the correlation coefficient \(r\) to approximate its sampling distribution.

# Using Airquality Data
library(datasets)
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
cor.boot <- vector()

# We use 500 bootstrap resamples here
for(i in 1:500) {
  
  set.seed(34*i)
  boot <- sample(nrow(airquality), nrow(airquality), replace=TRUE)  # resample whole rows so (Wind, Temp) pairs stay together
  data.boot <- airquality[boot,]
  cor.i.boot <- cor(data.boot$Wind, data.boot$Temp)
  cor.boot <- c(cor.boot, cor.i.boot)
} 

# Is the correlation significantly different from zero?
# Ho: rho = 0
# Ha: rho != 0

hist(cor.boot)

quantile(cor.boot, c(0.025, 0.975))
##       2.5%      97.5% 
## -0.5863520 -0.3285571

Conclusion: Since the 95% bootstrap percentile interval does not include zero, we reject Ho at the 5% level and conclude that there is a significant (negative) correlation between Wind and Temperature.
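
As a rough cross-check of this conclusion, the bootstrap interval can be compared with the classical test provided by cor.test in base R, which for the Pearson correlation reports a two-sided p-value and a confidence interval based on Fisher's z transformation. This comparison is added for illustration only and assumes the usual conditions for that test are acceptable here. The object name ct is arbitrary.

```r
# Classical test of Ho: rho = 0, for comparison with the bootstrap interval
library(datasets)

ct <- cor.test(airquality$Wind, airquality$Temp, method = "pearson")

ct$estimate   # sample correlation r
ct$conf.int   # 95% confidence interval (Fisher z based)
ct$p.value    # two-sided p-value
```

If the two approaches agree, both intervals should exclude zero and cover a similar range.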