The Basics of Bootstrap

Introduction

The vector p below contains the relative abundances (on a 0-1 scale) of six species in field plots of 400 m\(^2\).

p <- c(0.06, 0.10, 0.25, 0.12, 0.30, 0.17)

Suppose one intends to estimate the diversity using the Shannon index:

\[H = -\sum_{i=1}^6 p_i \log(p_i)\]

It is easy to compute:

-sum(p * log(p))

[1] 1.662493

Good! But what if one is asked about the precision of \(H\), that is, what is the standard deviation of this estimate?

What is bootstrap?

Bootstrap (Efron, 1979) is a resampling method that can be used to estimate the accuracy of an estimate, no matter how mathematically complex the estimator is. It is somewhat similar to the jackknife, but more general. Bootstrap has several applications, including in modelling procedures (linear or not). It can be used, for example, to build nonparametric confidence intervals for the mean or the median. In cluster analysis, bootstrap can be used to check the consistency of the clustering result and the stability of the nodes of a clustering tree (dendrogram). Bootstrap is also the basis for training Random Forest models (machine learning), which build decision trees from bootstrapped versions of the data.

The central idea of bootstrap is to use the observed data to emulate their true distribution by taking \(B\) (usually, \(B > 200\)) samples, with replacement, of the same size as the original sample. For example, this could be a first bootstrap sample of p:

sample(p, replace = TRUE)

[1] 0.17 0.06 0.10 0.17 0.17 0.06

This could be the second:

sample(p, replace = TRUE)

[1] 0.30 0.10 0.25 0.06 0.10 0.10

and so on.
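Because sample() draws at random, the values printed above will differ from run to run. The following is a minimal sketch, assuming one wants reproducible draws: it fixes the random seed with set.seed() (the seed value 123 is arbitrary). It also illustrates that resampling the values is equivalent to resampling their positions, which is the form needed when whole rows of a data frame are resampled.

```r
p <- c(0.06, 0.10, 0.25, 0.12, 0.30, 0.17)

set.seed(123)                              # arbitrary seed, only for reproducibility
s1 <- sample(p, replace = TRUE)            # resample the values directly

set.seed(123)                              # same seed, hence the same random draw
idx <- sample(length(p), replace = TRUE)   # resample the positions 1..6 instead
s2 <- p[idx]

identical(s1, s2)   # compares the two routes to a bootstrap sample
```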

For every bootstrap sample, say \(p^*_j\), we compute an estimate \(H^*_j\) of \(H\). Then, with all \(B\) values of \(H^*_j\), we can obtain an estimate of the standard deviation:

\[\hat{\sigma}_{boot} = \sqrt{\frac{1}{B-1} \sum_{j=1}^B (H^*_j - \bar{H}^*)^2}\]

where \(\bar{H}^* = \frac{1}{B} \sum_{j=1}^B H^*_j\) is the mean of the bootstrap estimates.

Let’s write a function to estimate the Shannon index:

H <- function(x) -sum(x * log(x))
H(p)

[1] 1.662493

Now we can use our home-brewed function to illustrate the bootstrap with \(B = 200\):

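nboot <- 200
Hb <- numeric(nboot)  # preallocate the vector of bootstrap estimates
for (j in seq_len(nboot)) {
  Hb[j] <- H(sample(p, replace = TRUE))  # H of the j-th bootstrap sample
}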

Done! Now we have 200 estimates of the Shannon index stored in the object Hb. Let’s take a look at the frequency distribution of those values.

hist(Hb)

And we can calculate the mean and the standard deviation of Hb:

mean(Hb)

[1] 1.650782

sd(Hb)

[1] 0.1592893
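As a quick check, sd() uses the \(B-1\) denominator, so it agrees with the formula for \(\hat{\sigma}_{boot}\) given earlier. The sketch below recomputes the bootstrap estimates (with an arbitrary seed, so the numbers will not match the output printed above) and compares the two calculations.

```r
p <- c(0.06, 0.10, 0.25, 0.12, 0.30, 0.17)
H <- function(x) -sum(x * log(x))

set.seed(1)   # arbitrary seed, only for reproducibility
nboot <- 200
Hb <- replicate(nboot, H(sample(p, replace = TRUE)))

# Bootstrap standard deviation written out term by term
sd_manual <- sqrt(sum((Hb - mean(Hb))^2) / (nboot - 1))

all.equal(sd_manual, sd(Hb))   # compares the manual formula with sd()
```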

Did you see how close the mean is to the original value of \(H\)? The mean and the standard deviation become less affected by the randomness of the resampling as \(B\) increases. Try with nboot <- 2000.
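One way to experiment with the number of bootstrap samples is sketched below. The helper boot_summary() is hypothetical (it is not part of any package), and the seed and the values of \(B\) are arbitrary choices for illustration.

```r
p <- c(0.06, 0.10, 0.25, 0.12, 0.30, 0.17)
H <- function(x) -sum(x * log(x))

# Hypothetical helper: mean and sd of B bootstrap replicates of H
boot_summary <- function(B) {
  Hb <- replicate(B, H(sample(p, replace = TRUE)))
  c(B = B, mean = mean(Hb), sd = sd(Hb))
}

set.seed(42)   # arbitrary seed, only for reproducibility
res <- t(sapply(c(200, 2000, 20000), boot_summary))
res            # one row per value of B
```

Running the last two lines with different seeds shows that a larger \(B\) reduces the run-to-run variation of both summaries.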

Confidence interval

A 95% confidence interval can be built using the 2.5% and the 97.5% percentiles of Hb:

quantile(Hb, probs = c(0.025, 0.975))

    2.5%    97.5% 
1.351967 1.932569

which can be used for further inference, for instance, to compare the diversity of different sites statistically.
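The same percentile approach works for other statistics. As a minimal sketch, assuming the built-in data set trees and the median of its Height variable as the statistic of interest (both chosen here only for illustration, as are the seed and the number of replicates):

```r
data(trees)

set.seed(2024)   # arbitrary seed, only for reproducibility
nboot <- 1000
# Median of each bootstrap sample of the tree heights
med_b <- replicate(nboot, median(sample(trees$Height, replace = TRUE)))

median(trees$Height)                             # estimate from the original data
ci <- quantile(med_b, probs = c(0.025, 0.975))   # 95% percentile interval
ci
```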

Bootstrapping a matrix

Data in a data frame or matrix can be bootstrapped by resampling the rows (or columns) of observations with replacement.

For example, take the data set trees:

data(trees)
str(trees)

'data.frame':   31 obs. of  3 variables:
 $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
 $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
 $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

See the description with help(trees). We can resample the rows of this data frame and then store each bootstrapped data set as an element of a list.

nr <- nrow(trees)
nboot <- 200
boot_trees <- vector("list", nboot)  # preallocated list, one element per bootstrap sample
for (k in seq_len(nboot)) {
  boot_trees[[k]] <- trees[sample(nr, replace = TRUE), ]  # resample row indices
}

The first bootstrapped version of trees is

boot_trees[[1]]

     Girth Height Volume
17    12.9     85   33.8
1      8.3     70   10.3
18    13.3     86   27.4
16    12.9     74   22.2
6     10.8     83   19.7
2      8.6     65   10.3
14    11.7     69   21.3
1.1    8.3     70   10.3
5     10.7     81   18.8
16.1  12.9     74   22.2
10    11.2     75   19.9
10.1  11.2     75   19.9
12    11.4     76   21.0
15    12.0     75   19.1
22    14.2     80   31.7
29    18.0     80   51.5
21    14.0     78   34.5
13    11.4     76   21.4
31    20.6     87   77.0
20    13.8     64   24.9
30    18.0     80   51.0
11    11.3     79   24.2
16.2  12.9     74   22.2
2.1    8.6     65   10.3
6.1   10.8     83   19.7
29.1  18.0     80   51.5
7     11.0     66   15.6
6.2   10.8     83   19.7
31.1  20.6     87   77.0
14.1  11.7     69   21.3
16.3  12.9     74   22.2
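Once the bootstrapped data frames are stored in a list, any statistic can be recomputed on each of them. The sketch below assumes, purely for illustration, a simple linear regression of Volume on Girth and looks at the variability of its slope; the model choice and the seed are assumptions, not part of the original example.

```r
data(trees)

set.seed(10)   # arbitrary seed, only for reproducibility
nr <- nrow(trees)
nboot <- 200
boot_trees <- lapply(seq_len(nboot), function(k) {
  trees[sample(nr, replace = TRUE), ]   # rows resampled with replacement
})

# Slope of Volume on Girth, refitted on each bootstrapped data frame
slope_b <- sapply(boot_trees, function(d) {
  coef(lm(Volume ~ Girth, data = d))[["Girth"]]
})

coef(lm(Volume ~ Girth, data = trees))[["Girth"]]   # slope on the original data
sd(slope_b)                                          # bootstrap standard deviation
quantile(slope_b, probs = c(0.025, 0.975))           # 95% percentile interval
```

Resampling whole rows keeps the values of Girth, Height and Volume of each tree together, so the relationships among the variables are preserved in every bootstrapped data frame.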

Miscellanea

In a sample with \(n\) distinct observations, there are \(\binom{2n-1}{n}\) distinct bootstrap samples (ignoring the order of the values). In R, run choose(2*n-1, n).
If bootstrap samples are drawn from a known probability distribution (normal, Gamma etc.) fitted to the data, instead of from the data themselves, the procedure is called a parametric bootstrap.
The packages boot, bootstrap and many others can perform bootstrap in different ways.
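The first two notes above can be illustrated with a short sketch. The first part counts the distinct bootstrap samples for \(n = 6\) by brute force and compares the count with choose(). The second part is a parametric bootstrap of the median of trees$Height, assuming (only for the sake of the example) that the heights follow a normal distribution with the estimated mean and standard deviation; the seed and the number of replicates are arbitrary.

```r
# 1) Distinct bootstrap samples for n = 6, counted by brute force
n <- 6
all_draws <- expand.grid(rep(list(seq_len(n)), n))   # all n^n ordered draws of positions
sorted <- t(apply(all_draws, 1, sort))               # sort each draw to ignore the order
n_distinct <- nrow(unique(sorted))
n_distinct == choose(2 * n - 1, n)                   # compare with the closed formula

# 2) Parametric bootstrap of the median (normality is an assumption here)
data(trees)
x <- trees$Height
mu_hat <- mean(x)
sigma_hat <- sd(x)

set.seed(7)   # arbitrary seed, only for reproducibility
nboot <- 1000
# Each replicate is simulated from the fitted normal, not resampled from x
med_par <- replicate(nboot, median(rnorm(length(x), mean = mu_hat, sd = sigma_hat)))
sd(med_par)   # parametric bootstrap standard deviation of the median
```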

Exercises

With the data soil used in previous lessons, build 95% confidence intervals for the median of RP for each level of the factor Camada.
Compute the bootstrap distribution of the correlation between the variables Girth and Height of the data trees. Afterwards, obtain the percentiles 2.5% and 97.5% to build a 95% confidence interval for the true correlation.