[Resampling]

Resampling based procedures are ways to perform population based statistical inferences, while living within our data. Data Scientists tend to really like resampling based inferences, since they are very data centric procedures, they scale well to large studies and they often make very few assumptions.

Here we look at creating a confidence interval and estimating standard errors with bootstraping

Nonparametric Bootstrapping

Parameter description

data: A vector, matrix, or data frame
statistic: A function that produces the k statistics to be bootstrapped (k=1 if bootstrapping a single statistic). The function should include an indices parameter that the boot() function can use to select cases for each replication (see examples below).
R: Number of bootstrap replicates

boot( ) calls the statistic function R times. Each time, it generates a set of random indices, with replacement, from the integers 1:nrow(data). These indices are used within the statistic function to select a sample. The statistics are calculated on the sample and the results are accumulated in the bootobject. The bootobject structure includes

element description

t0: The observed values of k statistics applied to the orginal data.
t: An R x k matrix where each row is a bootstrap replicate of the k statistics.

You can access these as bootobject$t0 and bootobject$t.

Once you generate the bootstrap samples, print(bootobject) and plot(bootobject) can be used to examine the results. If the results look reasonable, you can use boot.ci( ) function to obtain confidence intervals for the statistic(s).

The format is boot.ci(bootobject, conf=, type= ) where ## parameter description - bootobject: The object returned by the boot function - conf: The desired confidence interval (default: conf=0.95) - type: The type of confidence interval returned. Possible values are “norm”, “basic”, “stud”, “perc”, “bca” and “all” (default: type=“all”)

Bootstrapping a Single Statistic (k=1)

The following example generates the bootstrapped 95% confidence interval for R-squared in the linear regression of miles per gallon (mpg) on car weight (wt) and displacement (disp). The data source is mtcars. The bootstrapped confidence interval is based on 1000 replications.

# Bootstrap 95% CI for R-Squared
library(boot)
# function to obtain R-Squared from the data
rsq <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample
  fit <- lm(formula, data=d)
  return(summary(fit)$r.square)
}
# bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=rsq,
   R=1000, formula=mpg~wt+disp)

# view results
results

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~ 
##     wt + disp)
## 
## 
## Bootstrap Statistics :
##      original     bias    std. error
## t1* 0.7809306 0.01083345  0.04859227

plot(results)

# get 95% confidence interval
boot.ci(results, type="bca")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "bca")
## 
## Intervals : 
## Level       BCa          
## 95%   ( 0.6378,  0.8522 )  
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable

Bootstrapping several Statistics (k>1)

In example above, the function rsq returned a number and boot.ci returned a single confidence interval. The statistics function you provide can also return a vector. In the next example we get the 95% CI for the three model regression coefficients (intercept, car weight, displacement). In this case we add an index parameter to plot( ) and boot.ci( ) to indicate which column in bootobject$t is to analyzed.

# Bootstrap 95% CI for regression coefficients
library(boot)
# function to obtain regression weights
bs <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample
  fit <- lm(formula, data=d)
  return(coef(fit))
}
# bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=bs,
   R=1000, formula=mpg~wt+disp)

# view results
results

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = mtcars, statistic = bs, R = 1000, formula = mpg ~ 
##     wt + disp)
## 
## 
## Bootstrap Statistics :
##        original        bias    std. error
## t1* 34.96055404  1.339763e-01 2.601566258
## t2* -3.35082533 -6.621430e-02 1.158595156
## t3* -0.01772474 -7.087857e-05 0.008421491

plot(results, index=1) # intercept

plot(results, index=2) # wt

plot(results, index=3) # disp

# get 95% confidence intervals
boot.ci(results, type="bca", index=1) # intercept

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "bca", index = 1)
## 
## Intervals : 
## Level       BCa          
## 95%   (29.71, 39.85 )  
## Calculations and Intervals on Original Scale

boot.ci(results, type="bca", index=2) # wt

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "bca", index = 2)
## 
## Intervals : 
## Level       BCa          
## 95%   (-5.511, -0.813 )  
## Calculations and Intervals on Original Scale

boot.ci(results, type="bca", index=3) # disp

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "bca", index = 3)
## 
## Intervals : 
## Level       BCa          
## 95%   (-0.0349, -0.0007 )  
## Calculations and Intervals on Original Scale

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

COURSE NOTES: ## The jackknife

The jackknife is a tool for estimating standard errors and the bias of estimators
As its name suggests, the jackknife is a small, handy tool; in contrast to the bootstrap, which is then the moral equivalent of a giant workshop full of tools
Both the jackknife and the bootstrap involve resampling data; that is, repeatedly creating new data sets from the original data

The jackknife

The jackknife deletes each observation and calculates an estimate based on the remaining $n-1$ of them
It uses this collection of estimates to do things like estimate the bias and the standard error
Note that estimating the bias and having a standard error are not needed for things like sample means, which we know are unbiased estimates of population means and what their standard errors are

The jackknife

We’ll consider the jackknife for univariate data
Let $X_1,\ldots,X_n$ be a collection of data used to estimate a parameter $\theta$
Let $\hat \theta$ be the estimate based on the full data set
Let $\hat \theta_{i}$ be the estimate of $\theta$ obtained by deleting observation $i$
Let $\bar \theta = \frac{1}{n}\sum_{i=1}^n \hat \theta_{i}$

Continued

Then, the jackknife estimate of the bias is \[ (n - 1) \left(\bar \theta - \hat \theta\right) \] (how far the average delete-one estimate is from the actual estimate)
The jackknife estimate of the standard error is \[ \left[\frac{n-1}{n}\sum_{i=1}^n (\hat \theta_i - \bar\theta )^2\right]^{1/2} \] (the deviance of the delete-one estimates from the average delete-one estimate)

Example

We want to estimate the bias and standard error of the median

library(UsingR)

## Warning: package 'UsingR' was built under R version 4.0.3

## Loading required package: MASS

## Loading required package: HistData

## Warning: package 'HistData' was built under R version 4.0.3

## Loading required package: Hmisc

## Warning: package 'Hmisc' was built under R version 4.0.3

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:boot':
## 
##     melanoma

## Loading required package: survival

## 
## Attaching package: 'survival'

## The following object is masked from 'package:boot':
## 
##     aml

## Loading required package: Formula

## Warning: package 'Formula' was built under R version 4.0.3

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## 
## Attaching package: 'UsingR'

## The following object is masked from 'package:survival':
## 
##     cancer

data(father.son)
x <- father.son$sheight
n <- length(x)
theta <- median(x)
jk <- sapply(1 : n,
             function(i) median(x[-i])
             )
thetaBar <- mean(jk)
biasEst <- (n - 1) * (thetaBar - theta) 
seEst <- sqrt((n - 1) * mean((jk - thetaBar)^2))

Example

c(biasEst, seEst)

## [1] 0.0000000 0.1014066

library(bootstrap)

## Warning: package 'bootstrap' was built under R version 4.0.3

temp <- jackknife(x, median)
c(temp$jack.bias, temp$jack.se)

## [1] 0.0000000 0.1014066

Example

Both methods (of course) yield an estimated bias of 0 and a se of 0.1014066
Odd little fact: the jackknife estimate of the bias for the median is always $0$ when the number of observations is even
It has been shown that the jackknife is a linear approximation to the bootstrap
Generally do not use the jackknife for sample quantiles like the median; as it has been shown to have some poor properties

Pseudo observations

Another interesting way to think about the jackknife uses pseudo observations
Let \[ \mbox{Pseudo Obs} = n \hat \theta - (n - 1) \hat \theta_{i} \]
Think of these as ``whatever observation $i$ contributes to the estimate of $\theta$’’
Note when $\hat \theta$ is the sample mean, the pseudo observations are the data themselves
Then the sample standard error of these observations is the previous jackknife estimated standard error.
The mean of these observations is a bias-corrected estimate of $\theta$

The bootstrap

The bootstrap is a tremendously useful tool for constructing confidence intervals and calculating standard errors for difficult statistics
For example, how would one derive a confidence interval for the median?
The bootstrap procedure follows from the so called bootstrap principle

The bootstrap principle

Suppose that I have a statistic that estimates some population parameter, but I don’t know its sampling distribution
The bootstrap principle suggests using the distribution defined by the data to approximate its sampling distribution

The bootstrap in practice

In practice, the bootstrap principle is always carried out using simulation
We will cover only a few aspects of bootstrap resampling
The general procedure follows by first simulating complete data sets from the observed data with replacement
- This is approximately drawing from the sampling distribution of that statistic, at least as far as the data is able to approximate the true population distribution
Calculate the statistic for each simulated data set
Use the simulated statistics to either define a confidence interval or take the standard deviation to calculate a standard error

Example code

B <- 1000
resamples <- matrix(sample(x,
                           n * B,
                           replace = TRUE),
                    B, n)
medians <- apply(resamples, 1, median)
sd(medians)

## [1] 0.08536117

quantile(medians, c(.025, .975))

##     2.5%    97.5% 
## 68.44665 68.81588

Notes on the bootstrap

The bootstrap is non-parametric
Better percentile bootstrap confidence intervals correct for bias
There are lots of variations on bootstrap procedures; the book “An Introduction to the Bootstrap”" by Efron and Tibshirani is a great place to start for both bootstrap and jackknife information

Permutation tests

Consider the null hypothesis that the distribution of the observations from each group is the same
Then, the group labels are irrelevant
We then discard the group levels and permute the combined data
Split the permuted data into two groups with $n_A$ and $n_B$ observations (say by always treating the first $n_A$ observations as the first group)
Evaluate the probability of getting a statistic as large or large than the one observed
An example statistic would be the difference in the averages between the two groups; one could also use a t-statistic

Permutation test for pesticide data

subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"),]
y <- subdata$count
group <- as.character(subdata$spray)
testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
observedStat <- testStat(y, group)
permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group)))
observedStat

## [1] 13.25

mean(permutations > observedStat)

## [1] 0

Resampled inference

bcaffo_jleek_rpeng UJH

[Resampling]

Parameter description

element description

Bootstrapping a Single Statistic (k=1)

Bootstrapping several Statistics (k>1)

The jackknife

The jackknife

Continued

Example

We want to estimate the bias and standard error of the median

Example

Example

Pseudo observations

The bootstrap

The bootstrap principle

The bootstrap in practice

Example code

Notes on the bootstrap

Permutation tests

Permutation test for pesticide data

Histogram of permutations