[Resampling]

Resampling based procedures are ways to perform population based statistical inferences, while living within our data. Data Scientists tend to really like resampling based inferences, since they are very data centric procedures, they scale well to large studies and they often make very few assumptions.

Here we look at creating a confidence interval and estimating standard errors with bootstraping

Nonparametric Bootstrapping

Parameter description

boot( ) calls the statistic function R times. Each time, it generates a set of random indices, with replacement, from the integers 1:nrow(data). These indices are used within the statistic function to select a sample. The statistics are calculated on the sample and the results are accumulated in the bootobject. The bootobject structure includes

element description

You can access these as bootobject\(t0 and bootobject\)t.

Once you generate the bootstrap samples, print(bootobject) and plot(bootobject) can be used to examine the results. If the results look reasonable, you can use boot.ci( ) function to obtain confidence intervals for the statistic(s).

The format is boot.ci(bootobject, conf=, type= ) where ## parameter description - bootobject: The object returned by the boot function - conf: The desired confidence interval (default: conf=0.95) - type: The type of confidence interval returned. Possible values are “norm”, “basic”, “stud”, “perc”, “bca” and “all” (default: type=“all”)

Bootstrapping a Single Statistic (k=1)

The following example generates the bootstrapped 95% confidence interval for R-squared in the linear regression of miles per gallon (mpg) on car weight (wt) and displacement (disp). The data source is mtcars. The bootstrapped confidence interval is based on 1000 replications.

# Bootstrap 95% CI for R-Squared
library(boot)
# function to obtain R-Squared from the data
rsq <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample
  fit <- lm(formula, data=d)
  return(summary(fit)$r.square)
}
# bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=rsq,
   R=1000, formula=mpg~wt+disp)

# view results
results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~ 
##     wt + disp)
## 
## 
## Bootstrap Statistics :
##      original     bias    std. error
## t1* 0.7809306 0.01083345  0.04859227
plot(results)

# get 95% confidence interval
boot.ci(results, type="bca")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "bca")
## 
## Intervals : 
## Level       BCa          
## 95%   ( 0.6378,  0.8522 )  
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable

Bootstrapping several Statistics (k>1)

In example above, the function rsq returned a number and boot.ci returned a single confidence interval. The statistics function you provide can also return a vector. In the next example we get the 95% CI for the three model regression coefficients (intercept, car weight, displacement). In this case we add an index parameter to plot( ) and boot.ci( ) to indicate which column in bootobject$t is to analyzed.

# Bootstrap 95% CI for regression coefficients
library(boot)
# function to obtain regression weights
bs <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample
  fit <- lm(formula, data=d)
  return(coef(fit))
}
# bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=bs,
   R=1000, formula=mpg~wt+disp)

# view results
results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = mtcars, statistic = bs, R = 1000, formula = mpg ~ 
##     wt + disp)
## 
## 
## Bootstrap Statistics :
##        original        bias    std. error
## t1* 34.96055404  1.339763e-01 2.601566258
## t2* -3.35082533 -6.621430e-02 1.158595156
## t3* -0.01772474 -7.087857e-05 0.008421491
plot(results, index=1) # intercept

plot(results, index=2) # wt

plot(results, index=3) # disp

# get 95% confidence intervals
boot.ci(results, type="bca", index=1) # intercept
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "bca", index = 1)
## 
## Intervals : 
## Level       BCa          
## 95%   (29.71, 39.85 )  
## Calculations and Intervals on Original Scale
boot.ci(results, type="bca", index=2) # wt
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "bca", index = 2)
## 
## Intervals : 
## Level       BCa          
## 95%   (-5.511, -0.813 )  
## Calculations and Intervals on Original Scale
boot.ci(results, type="bca", index=3) # disp
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "bca", index = 3)
## 
## Intervals : 
## Level       BCa          
## 95%   (-0.0349, -0.0007 )  
## Calculations and Intervals on Original Scale
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

COURSE NOTES: ## The jackknife


The jackknife

  • The jackknife deletes each observation and calculates an estimate based on the remaining \(n-1\) of them
  • It uses this collection of estimates to do things like estimate the bias and the standard error
  • Note that estimating the bias and having a standard error are not needed for things like sample means, which we know are unbiased estimates of population means and what their standard errors are

The jackknife

  • We’ll consider the jackknife for univariate data
  • Let \(X_1,\ldots,X_n\) be a collection of data used to estimate a parameter \(\theta\)
  • Let \(\hat \theta\) be the estimate based on the full data set
  • Let \(\hat \theta_{i}\) be the estimate of \(\theta\) obtained by deleting observation \(i\)
  • Let \(\bar \theta = \frac{1}{n}\sum_{i=1}^n \hat \theta_{i}\)

Continued

  • Then, the jackknife estimate of the bias is \[ (n - 1) \left(\bar \theta - \hat \theta\right) \] (how far the average delete-one estimate is from the actual estimate)
  • The jackknife estimate of the standard error is \[ \left[\frac{n-1}{n}\sum_{i=1}^n (\hat \theta_i - \bar\theta )^2\right]^{1/2} \] (the deviance of the delete-one estimates from the average delete-one estimate)

Example

We want to estimate the bias and standard error of the median

library(UsingR)
## Warning: package 'UsingR' was built under R version 4.0.3
## Loading required package: MASS
## Loading required package: HistData
## Warning: package 'HistData' was built under R version 4.0.3
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 4.0.3
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
## 
##     aml
## Loading required package: Formula
## Warning: package 'Formula' was built under R version 4.0.3
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## 
## Attaching package: 'UsingR'
## The following object is masked from 'package:survival':
## 
##     cancer
data(father.son)
x <- father.son$sheight
n <- length(x)
theta <- median(x)
jk <- sapply(1 : n,
             function(i) median(x[-i])
             )
thetaBar <- mean(jk)
biasEst <- (n - 1) * (thetaBar - theta) 
seEst <- sqrt((n - 1) * mean((jk - thetaBar)^2))

Example

c(biasEst, seEst)
## [1] 0.0000000 0.1014066
library(bootstrap)
## Warning: package 'bootstrap' was built under R version 4.0.3
temp <- jackknife(x, median)
c(temp$jack.bias, temp$jack.se)
## [1] 0.0000000 0.1014066

Example

  • Both methods (of course) yield an estimated bias of 0 and a se of 0.1014066
  • Odd little fact: the jackknife estimate of the bias for the median is always \(0\) when the number of observations is even
  • It has been shown that the jackknife is a linear approximation to the bootstrap
  • Generally do not use the jackknife for sample quantiles like the median; as it has been shown to have some poor properties

Pseudo observations

  • Another interesting way to think about the jackknife uses pseudo observations
  • Let \[ \mbox{Pseudo Obs} = n \hat \theta - (n - 1) \hat \theta_{i} \]
  • Think of these as ``whatever observation \(i\) contributes to the estimate of \(\theta\)’’
  • Note when \(\hat \theta\) is the sample mean, the pseudo observations are the data themselves
  • Then the sample standard error of these observations is the previous jackknife estimated standard error.
  • The mean of these observations is a bias-corrected estimate of \(\theta\)

The bootstrap

  • The bootstrap is a tremendously useful tool for constructing confidence intervals and calculating standard errors for difficult statistics
  • For example, how would one derive a confidence interval for the median?
  • The bootstrap procedure follows from the so called bootstrap principle

The bootstrap principle

  • Suppose that I have a statistic that estimates some population parameter, but I don’t know its sampling distribution
  • The bootstrap principle suggests using the distribution defined by the data to approximate its sampling distribution

The bootstrap in practice

  • In practice, the bootstrap principle is always carried out using simulation
  • We will cover only a few aspects of bootstrap resampling
  • The general procedure follows by first simulating complete data sets from the observed data with replacement

    • This is approximately drawing from the sampling distribution of that statistic, at least as far as the data is able to approximate the true population distribution
  • Calculate the statistic for each simulated data set
  • Use the simulated statistics to either define a confidence interval or take the standard deviation to calculate a standard error

Example code

B <- 1000
resamples <- matrix(sample(x,
                           n * B,
                           replace = TRUE),
                    B, n)
medians <- apply(resamples, 1, median)
sd(medians)
## [1] 0.08536117
quantile(medians, c(.025, .975))
##     2.5%    97.5% 
## 68.44665 68.81588

Notes on the bootstrap

  • The bootstrap is non-parametric
  • Better percentile bootstrap confidence intervals correct for bias
  • There are lots of variations on bootstrap procedures; the book “An Introduction to the Bootstrap”" by Efron and Tibshirani is a great place to start for both bootstrap and jackknife information

Permutation tests

  • Consider the null hypothesis that the distribution of the observations from each group is the same
  • Then, the group labels are irrelevant
  • We then discard the group levels and permute the combined data
  • Split the permuted data into two groups with \(n_A\) and \(n_B\) observations (say by always treating the first \(n_A\) observations as the first group)
  • Evaluate the probability of getting a statistic as large or large than the one observed
  • An example statistic would be the difference in the averages between the two groups; one could also use a t-statistic

Permutation test for pesticide data

subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"),]
y <- subdata$count
group <- as.character(subdata$spray)
testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
observedStat <- testStat(y, group)
permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group)))
observedStat
## [1] 13.25
mean(permutations > observedStat)
## [1] 0

Histogram of permutations