Bootstrapping Data to calculate the error associated with parameter estimation

Kevin Keenan, Queen’s University, Belfast, (2013)

Distributed under a GPL3 licence

Introduction

The basic steps of bootstrapping are:

If data are organised in treatment groups etc., and the parameter of interest is calculated independently for each group, resampling should be done independently for each group. In the case of the hare data used in the example below, these groupings are Core, Periphery and Allopatry.
In each resample iteration (say out of 1000 bootstraps) the parameter of interest (in this instance, \(D\)) should be calculated and stored.
When the data have been bootstrapped a desired number of times, the parameter vector should be of length = number of bootstrap iterations. If the number of iteration is \(1000\), then the vector containing \(D\) should have \(1000\) elements.
This vector can then be used to derive a number of error values, such as \(95\%\) confidence intervals or the standard deviation.

Resampling

In general, bootstrapping refers to the process of resampling a given number of individuals (rows) from a sample, where each individual can be sampled more than once (i.e. resampling with replacement).

The data

Data should be save as a tab delimited file to be read into R as a dataframe.

The Coding process

The rest of this document will outline the bootstrapping method using R code.

Read the data into `R`

# read the text file
camDat <- read.table("Camera_data.txt", header = TRUE)
# check the first few rows of data
head(camDat)

    Wk Count Dist Angle
1 Core     1 3.20  0.67
2 Core     1 3.20  0.67
3 Core     1 3.20  0.67
4 Core     1 3.68  0.39
5 Core     1 2.75  0.18
6 Core     1 5.02  0.10

The first column will be used for grouping the data. We should find out how large each group is. This can be done using the tapply function to separate the Dist variable into three groups and calculate the length of each.

tapply(camDat$Dist, camDat$Wk, length)

Allopatry      Core Periphery 
       61        49        46

# we can check that the numbers add up
sum(tapply(camDat$Dist, camDat$Wk, length)) == nrow(camDat)

[1] TRUE

Now that we know how many rows are in each group we can use this information later when we tell the sample function how many individuals we would like to resample in each bootstrap iteration. We should assign this information to a variable for later use.

grpSize <- tapply(camDat$Dist, camDat$Wk, length)

To make life easier I tend to separate my data into designated groups before proceeding with bootstrapping, as it makes the actual bootstrap code easier to read. In this instance I will save each group as a data frame in a list. This means that the first element of the list will be the data frame containing only Allopatry data, the second element will be a data frame containing the Core data etc.

grpDat <- split(camDat, camDat$Wk)

# check that the code has done what we wanted
lapply(grpDat, head)

$Allopatry
           Wk Count Dist Angle
96  Allopatry     1 2.68  0.46
97  Allopatry     2 9.40  0.51
98  Allopatry     1 1.70  0.30
99  Allopatry     1 4.60  0.42
100 Allopatry     1 2.50  0.21
101 Allopatry     1 5.30  0.56

$Core
    Wk Count Dist Angle
1 Core     1 3.20  0.67
2 Core     1 3.20  0.67
3 Core     1 3.20  0.67
4 Core     1 3.68  0.39
5 Core     1 2.75  0.18
6 Core     1 5.02  0.10

$Periphery
          Wk Count  Dist Angle
50 Periphery     1  4.46  0.16
51 Periphery     1  2.09  0.29
52 Periphery     1  2.47  0.24
53 Periphery     1  3.12  0.69
54 Periphery     1 11.36  0.50
55 Periphery     1  2.62  0.11

lapply(grpDat, nrow)

$Allopatry
[1] 61

$Core
[1] 49

$Periphery
[1] 46

We can see that the head output show only data for each group, and the number of rows in each of the three data frames equal the group sizes calculated above.

Calculating \(D\)

Before proceeding with the bootstrapping, it is usually a good idea to write a function which calculates the parameter of interest, so that it can easily be incorporated later. The main feature of the function will be that it accepts a data frame with the same headers etc. as the original data, and it will use these data to derive \(D\). The function definition is given below and is based on the following formula.

\[D = \frac{y}{t}\frac{\pi}{vr(2+\theta)}\]

Where \(y\) = the total number of captures, \(t\) = the total camera hours, \(v\) = the distance traveled per unit time, \(r\) = the mean distance to the hare and \(latex \theta\) = the mean angle of detection.

# The function should be passed the dataframe, the constant 't',
# and the constant, 'v'.
calcD <- function(dat, tm, v){
  # calculate y: total captures
  y <- sum(dat$Count)
  # calculate the first half of the equation above
  trm1 <- y/tm
  # calculate the second half
  trm2 <- pi/((v*mean(dat$Dist, 
                      na.rm = TRUE)) * (2 + (2*mean(dat$Angle, 
                                                    na.rm = TRUE))))
  # calculate D
  return(trm1 * trm2)
}

Test the function

calcD(dat = grpDat[[1]], tm = 2880, v = 0.89)

[1] 0.007038704

Now that we have a working function, and our data are grouped appropriately, we can carry out the actual bootstrapping. Typically the sample function is used for this. Below is a brief demo of how the function works.

# Imagine I have a vector from 1:10, and I want to randomly sample
# 10 values from this vector. If I did this without replacement, I 
# would end up with the original ten values, but in random order
sample(x = 1:10, size = 10, replace = FALSE)

 [1]  5  7  3  6 10  4  2  9  8  1

# However, if I allow replacement, I can have any combination of 10
# values from the original vector
sample(x = 1:10, size = 10, replace = TRUE)

 [1]  8  9  7  6  6 10  5  3  8  6

# When sampling a dataframe, generally it is the row indexes that
# are randomly sampled, meaning we could get any cobmination of
# rows in the output. The syntax for this approach is as below:
# create an example data frame
x <- data.frame(a = 1:4, b = 1:4)
# look at the dataframe structure
x

# randomly sample four rows (without replacement)
x[sample(x = 1:nrow(x), size = 4, replace = FALSE), ]

# randomly sample four rows (with replacement)
x[sample(x = 1:nrow(x), size = 4, replace = TRUE), ]

You will notice that the rownames in the with replacement example are strange. This can be ignored since it is just because R does not allow duplicate rownames in dataframes, and is not relevant to our problem here.

Piecing it all together

Now that we have the data in the correct format and the parameter calculation function, as well as the basic principal of bootstrapping, we can start to build up the bootstrapping process.

# define the number of bootstrap iterations
nboots <- 1000

It is important to remember that we want to bootstrap the data 1000 times, but that we have three groups of data which need to be resampled. The basic step-by-step process will look something like this:

Take group 1 (i.e. Allopatry) and resample the data
Calculate \(D\) using calcD
Keep resampling the data until the length of \(D\) = nboots
Take group 2 …
…

and so on until all groups have been bootstrapped and we have three (in this example) vectors, each 1000 elements long.

The code

# write a function which samples a dataframe and calculates D
bsD <- function(dat, tm, v){
  bsDat <- dat[sample(1:nrow(dat), size = nrow(dat), replace = TRUE), ]
  dOut <- calcD(bsDat, tm, v)
  return(dOut)
}

# now use bsD on each group dataframe 1000 times
resD <- lapply(grpDat, function(x){
  d <- replicate(nboots, bsD(x, tm = 2880, v = 0.89))
  return(d)
})

This second chunk of code will output a list with three elements, each one of which is a vector of 1000 elements (i.e. each bootstrapped \(D\) value). It is from these values that we can calculate \(SD\) etc.

Calculating error

From resD, it is now possible to calculate say standard deviation as follows:

resSD <- lapply(resD, sd)
resSD

$Allopatry
[1] 0.0006429752

$Core
[1] 0.0004869402

$Periphery
[1] 0.0005708823

User-friendly function

To make thing easier, all of the above code is written is a user friendly function below. To use the function, simply copy all of the code below, paste it into your R console and choose the appropriate argument options.

densityBoot <- function(data, tm, v, nboots, error_stat){
  ## Written by Kevin Keenan 2013
  ## distributed under GPL3
  camDat <- read.table(data, header = TRUE)
  grpDat <- lapply(unique(camDat$Wk), function(x){
    return(camDat[which(camDat$Wk == x), ])
  })
  calcD <- function(dat, tm, v){
    # calculate y: total captures
    y <- sum(dat$Count)
    # calculate the first half of the equation above
    trm1 <- y/tm
    # calculate the second half
    trm2 <- pi/((v*mean(dat$Dist, 
                        na.rm = TRUE)) * (2 + (2*mean(dat$Angle, 
                                                      na.rm = TRUE))))
    # calculate D
    return(trm1 * trm2)
  }
  grpSize <- tapply(camDat$Dist, camDat$Wk, length,
                    simplify = FALSE)
  bsD <- function(dat, tm, v){
    bsDat <- dat[sample(1:nrow(dat), size = nrow(dat), replace = TRUE), ]
    dOut <- calcD(bsDat, tm, v)
    return(dOut)
  }
  resD <- lapply(grpDat, function(x){
    d <- replicate(nboots, bsD(x, tm = tm, v = v))
    return(d)
  })
  res <- list()
  if(is.element("sd", error_stat)){
    resSD <- sapply(resD, sd)
    names(resSD) <- levels(camDat$Wk)
    res$sd <- c(res, resSD)
  }
  if(is.element("ci", error_stat)){
    resCI <- t(sapply(resD, quantile, probs = c(0.025, 0.975)))
    colnames(resCI) <- c("Lower", "Upper")
    rownames(resCI) <- levels(camDat$Wk)
    res$ci <- resCI
  }
  return(res)
}

This function can be use in the following ways:

Calculate the standard deviation (i.e.standard error of the statistic)

output <- densityBoot(data = "Camera_data.txt", tm = 2880, v = 0.89,
                      nboots = 1000, error_stat = "sd")
# print output results
output

$sd
$sd$Allopatry
[1] 0.0004683233

$sd$Core
[1] 0.0005674262

$sd$Periphery
[1] 0.0006613051

Calculate the 95% ci of \(D\)

output <- densityBoot(data = "Camera_data.txt", tm = 2880, v = 0.89,
                      nboots = 1000, error_stat = "ci")
# print output results
output

$ci
                Lower       Upper
Allopatry 0.005675682 0.007498785
Core      0.005709906 0.007911399
Periphery 0.005880164 0.008316592

Calculate sd and 95% ci of \(D\)

output <- densityBoot(data = "Camera_data.txt", tm = 2880, v = 0.89,
                      nboots = 1000, error_stat = c("sd", "ci"))
# print the results
output

$sd
$sd$Allopatry
[1] 0.0004798202

$sd$Core
[1] 0.0005770059

$sd$Periphery
[1] 0.0006619398


$ci
                Lower       Upper
Allopatry 0.005684712 0.007560023
Core      0.005718971 0.008014058
Periphery 0.005840718 0.008454121

Bugs

Any issues or question regarding the above code can be directed to kkeenan02 [AT] qub.ac.uk