The Block Bootstrapping Method Illustrated

While working on a bootstrapping problem in the Lagunita Stanford MOOC on Statistical Learning, I came across a comment by a fellow student. The problem is challenging, but the video lecture and the accompanying lecture slides don’t provide a clue about how to solve the problem.

The topic being discussed, block bootstrapping, is a variant of basic bootstrapping. It is used when dealing with time series data, or other data sets in which breaking the data set into smaller chunks for sampling purposes might wreck any correlations that exist in the larger data set.

A Word of Caution

It is an important method to learn and it isn’t covered well in the lectures. One problem is the method is sensitive to the length of the blocks. There is no one way to choose the optimal block length because it will vary depending on the size of the sample being used, the way you choose to generate the data, and the type of statistic you are looking for.

Fortunately, the accompanying textbook has an example which isn’t quite what students need, but if it’s modified it turns out to be pretty useful.

One frustrated student asked for an example of the correct code to learn what he had done wrong on his submission. Another student supported this idea, writing “having the full solution (including R code) would enhance learning, particularly for those that tried hard, and failed, to get the correct answer.”

The MOOC’s authors don’t want to give the answer away, which is understandable, and are reluctant to post the code. So I thought I could use the example from the textbook to develop a compromise.

The Demonstration

The following demonstration is not strictly a block bootstrapping problem because the data set I am using is not a time series. I chose this data set because it is one the textbook works with. Students should be familiar with it.

I also don’t want to give away too much of the solution.

The main point is to demonstrate the method and leave it to others to apply it to appropriate time series data sets.

For those who might be reading this who are not taking the Stanford MOOC, I am providing descriptions step-by-step.

I hope this helps.

Getting Started

First, we need to load the libraries we will be using. You can use the fix() command to get a good view of the data set.

require(ISLR)
## Loading required package: ISLR
require(boot)
## Loading required package: boot
#fix(Auto)
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

We see we are dealing with a data set consisting of 392 rows and 9 columns.

Some people prefer the data frame editor built into R-Studio. It’s good, but the separate window generated by fix() is useful. Just remember to close it before going back to your code. When I forget, R-Studio freezes up.

Next, we are going to create a custom function called boot.fn() for our purposes. Don’t confuse this with the boot() function we use from the boot library.

From the Auto data set, I picked mpg as the dependent variable. I chose as the independent variables displacement, horsepower, and weight arbitrarily. This is an illustrative example, so I didn’t do a study to determine they are the best variables to use.

I left out cylinders because I suspect it’s not very different from displacement. And I am pretty sure mpg is unrelated to acceleration, year, origin, and name.

But if you look at the plots below you can see they aren’t a bad place to start.

That’s probably too much information, so let’s pare down the number of plots and focus on those variables of interest to us.

Much better.

Time to create boot.fn(). Since it’s a single line we don’t need to mess with brackets.

set.seed(1000)
boot.fn <- function(data, index)
        + return(coef(lm(mpg ~ displacement + horsepower + weight, data = data, subset = index)))
boot.fn(Auto, 1:392)
##  (Intercept) displacement   horsepower       weight 
## 44.855935695 -0.005768819 -0.041674144 -0.005351593

We directed our boot.fn() function to return the model’s coefficients so we have an idea of the function we derive from the data.

Now we are going to sample data from our data set. We will select the switch in the sample() command that allows us to replace the data point once it’s been selected, so we can select it again. We would have to think about using this setting were we working with a time series, but it’s good for us with these data.

boot.fn(Auto, sample(392, 392, replace = TRUE))
##  (Intercept) displacement   horsepower       weight 
## 46.311585493  0.002557759 -0.051790795 -0.006003953

Again, look at the coefficients. They are a little different from the originals, which we would expect since the function is using sampled data.

Let’s see if the boot() function from the boot library gives us similar results.

boot(Auto, boot.fn, 392)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot.fn, R = 392)
## 
## 
## Bootstrap Statistics :
##         original        bias     std. error
## t1* 44.855935695  7.345924e-02 1.2993169069
## t2* -0.005768819  1.924862e-04 0.0069251532
## t3* -0.041674144  5.779237e-05 0.0139147404
## t4* -0.005351593 -4.402735e-05 0.0006427707
summary(lm(mpg ~ displacement + horsepower + weight, data = Auto))$coef
##                  Estimate  Std. Error    t value      Pr(>|t|)
## (Intercept)  44.855935695 1.195920013 37.5074714 4.107518e-131
## displacement -0.005768819 0.006581890 -0.8764684  3.813177e-01
## horsepower   -0.041674144 0.012813862 -3.2522703  1.245063e-03
## weight       -0.005351593 0.000712354 -7.5125469  4.035835e-13

The results are identical to the boot.fn() we created. When we run the summary() command we get the same coefficient values, though standard error is off by a little bit. This is not a big concern for us. There is a good explanation for the difference on page 196 of the text, if anyone is interested.

So far, standard bootstrapping appears to be working. As mentioned earlier, we are interested in the block bootstrap method, which requires us to chop the original data set into blocks of roughly equal size, rearrange them, then build a new model.

This is the part that was missing from the textbook and the lecture video and notes. You can see why students might be frustrated.

We are going to create a new data set using the rearranged blocks of data. There are different ways to accomplish this, but our method illustrates the point.

new.rows = c(1:50, 201:250, 51:100, 351:392, 301:350, 101:150, 251:300, 151:200)

We could have just as easily decided on a different order for the blocks, different sizes for the blocks, or added more data by repeating blocks. We create a new object new.Auto using the new rows. Then we will run our custom boot.fn() function on the new data set and see what we get.

new.Auto <- Auto[new.rows, ]
newboot.fn <- function(data, index) + return(coef(lm(mpg ~ displacement + horsepower + weight, data = data, subset = index)))
newboot.fn(new.Auto, 1:392)
##  (Intercept) displacement   horsepower       weight 
## 44.855935695 -0.005768819 -0.041674144 -0.005351593

The coefficients are identical to our original, which makes sense because this is not a time series. The order of the data in the Auto set doesn’t really matter.

Again, we run the newboot.fn() function using a sampling process that allows for replacement of data then compare the results to the boot library’s boot() function.

What do you expect the results to be?

newboot.fn(new.Auto, sample(392, 392, replace = TRUE))
##  (Intercept) displacement   horsepower       weight 
## 45.705488858  0.001409017 -0.042932148 -0.006065081
boot(new.Auto, newboot.fn, 392)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = new.Auto, statistic = newboot.fn, R = 392)
## 
## 
## Bootstrap Statistics :
##         original        bias     std. error
## t1* 44.855935695  9.980158e-02 1.2952551768
## t2* -0.005768819  5.499201e-04 0.0070927467
## t3* -0.041674144 -1.117489e-03 0.0149763303
## t4* -0.005351593 -2.948623e-05 0.0006445439
summary(lm(mpg ~ displacement + horsepower + weight, data = new.Auto))$coef
##                  Estimate  Std. Error    t value      Pr(>|t|)
## (Intercept)  44.855935695 1.195920013 37.5074714 4.107518e-131
## displacement -0.005768819 0.006581890 -0.8764684  3.813177e-01
## horsepower   -0.041674144 0.012813862 -3.2522703  1.245063e-03
## weight       -0.005351593 0.000712354 -7.5125469  4.035835e-13

There you have it. The block bootstrapping method demonstrated.

The example is modified from the textbook, An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, published in 2015 by Springer.

A free (and legal) PDF version of the text is available for download at this address. It’s a very good textbook which I highly recommend, and you can’t argue with the price.