Bootstrapping local regression line chart

Quick example

There is probably a much easier way to do this (I think this saying goes for all my work (in and outside of R!)).

My goal was to create a script that would give me a feel for how uncertain my loess estimate. I was inspired by this chart on Rasmus’ blog. In fact, most of the code below was nicked off him. He created a Bayesian bootstrap of the loess function on the cars data… That’s all too complex for me just yet but I wanted to try this with a simple bootstrap.

I’ll do this using an example - we’ll need to load the cars dataset by doing data(cats). The cars dataset is from Base R and gives the speed of cars and distances taken to stop from cars in the 1920s. So we shouldn’t expect anything too fast!

This is what the dataset looks like:

head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

We can plot a single loess curve:

plot(cars$speed, cars$dist, pch = 20, col = "black", xlab = "Car speed in mph",
     ylab = "Stopping distance in ft", main = "Speed and stopping distances of cars")
lines(cars$speed, loess(dist ~ speed, cars)$fitted, lwd = 3,
      col = "tomato")

That’s all well and good but I want to have an idea about how uncertain the loess curve is. For that it’s bootstrap to the rescue!!

The bootstrap will resample with replacement the rows from the cars dataset to create n psuedo datasets and then perform the loess on each of them separately. To do this, we’ll use the boot package.

The next step is to create the function that R will perform on each pseudo dataset:

boot_fn <- function(data, indices) {
        d <- data[indices, ]
        d <- d[order(d$speed),]
        loess_fit <- loess(dist ~ speed, d,
                           control = loess.control(surface = "direct"))
        predict(loess_fit, data.frame(speed = seq(4, 25, 1)), se = T)$fit
}

The function above returns a loess prediction for distance from a set of speed values. I had quite a bit if trouble with the function. Firstly, sampling throws the data order out of whack so local regression doesn’t make sense. Secondly, my loess was not returning a smooth function (I never completely figured out why) so I had to create a newdata = seq(4, 25, 1) and then predict on that to get my lovely smooth loess fits. Ignore the control = ... in the loess function for now, we’ll come back to that later.

The next step is to call the boot function. This works by specifying the data, function and number of samples to be drawn:

library(boot)
loess_boot <- boot(cars, R = 100, statistic = boot_fn)

I use only 100 iterations to reduce code run time - IRL, you’ll proabbly want to increase that. You should then get a new list of 11 called loess_boot. I checked and the results we want are in the t component of this list:

head(loess_boot$t)

##            [,1]      [,2]      [,3]      [,4]     [,5]     [,6]     [,7]
## [1,] -19.008117 -9.708032 -1.366905  6.044529 12.57151 18.28754 23.25992
## [2,]  10.470443 10.544977 11.154259 12.307605 14.00890 16.25040 19.01433
## [3,]   6.743646  6.996026  7.948658  9.583218 11.88129 14.84610 18.58048
## [4,]  10.995560 11.960597 13.468440 15.546618 18.19410 21.34120 24.93301
## [5,]   8.900711 10.158800 11.449739 12.932088 14.75295 16.91064 18.99555
## [6,]  10.879221 11.366326 12.196478 13.412999 15.05046 17.08623 19.41082
##          [,8]     [,9]    [,10]    [,11]    [,12]    [,13]    [,14]
## [1,] 27.52041 30.87978 32.38972 34.73938 38.68427 44.91151 48.18340
## [2,] 22.29823 26.08354 30.34543 35.74833 40.50392 44.72087 48.12052
## [3,] 23.31137 28.15211 33.07154 36.50913 40.93433 44.93412 48.54920
## [4,] 29.23085 33.93875 39.04160 41.15132 40.08683 41.61119 44.68561
## [5,] 20.90144 28.06814 33.60608 40.29228 46.77543 50.31671 53.39990
## [6,] 22.11333 24.65022 28.18367 31.64140 35.04853 43.38809 48.64062
##         [,15]    [,16]    [,17]    [,18]    [,19]    [,20]    [,21]
## [1,] 51.73613 54.99778 59.60416 65.39648 72.06169 79.79215 88.79693
## [2,] 49.86876 51.69650 54.99869 59.86715 65.66431 72.43834 80.26859
## [3,] 51.38684 55.30092 60.21305 65.71515 71.78794 78.48942 85.90270
## [4,] 47.37409 51.69371 55.33129 58.83523 62.36148 66.12606 70.22972
## [5,] 52.20733 55.12818 59.79706 65.16330 71.26054 78.16805 85.94061
## [6,] 52.44463 56.59952 60.99584 66.28136 72.14276 78.84398 86.66088
##         [,22]
## [1,] 99.16674
## [2,] 89.19332
## [3,] 94.10465
## [4,] 74.70616
## [5,] 94.58211
## [6,] 95.73498

There’s a row for each sample and 22 values for the loess because we used seq(4, 25, 1) to create the smooth fit earlier.

Now onto the final bit (hoorah!). I’m in two minds on how I want to show the uncertainty in the estimate. The first method is plot all the loess fits, we can then easily see the variability in the estimate. However this gets crowded quite quickly. The second method is to plot just the 2.5% and 97.5% percentiles as some kind of 95% confidence interval.

1st method - all the lines

We need a for loop to cycle through the rows. In this case, I’m only going to sample 20 out of the 100 as IRL so it plots quicker.

plot(cars$speed, cars$dist, pch = 20, col = "black", xlab = "Car speed in mph",
     ylab = "Stopping distance in ft", main = "Speed and stopping distances of cars")

for(i in sample(nrow(loess_boot$t), 20)) {
        lines(seq(4, 25, 1), loess_boot$t[i,], col = "gray")
}

lines(cars$speed, loess(dist ~ speed, cars)$fitted, lwd = 3,
      col = "tomato")

What do you think? What does this tell us, tells us that it’s most uncertan where there is most variablity in the data (look at 14-16 and 17-20) and less uncertain where there is not much variability in the underlying - around 10. But is the chart too cluttered? Interesting to note that some of the runs go negative, which means the car started to reverse??? That’s a limitation of this naive bootstrap.

2nd method - confidence interval bands

Here we swap the for loop for two new vectors. Also this is great use of the apply function if I say so myself. It takes the matrix loess_boot$t and applies the function quantile to the columns (2).

plot(cars$speed, cars$dist, pch = 20, col = "black", xlab = "Car speed in mph",
     ylab = "Stopping distance in ft", main = "Speed and stopping distances of cars")

conf_97.5 <- apply(loess_boot$t, 2, function(x) quantile(x, .975))
conf_2.5 <- apply(loess_boot$t, 2, function(x) quantile(x, .025))
lines(seq(4, 25, 1), conf_97.5, type ="l",
      col = "gray")
lines(seq(4, 25, 1), conf_2.5, type ="l",
      col = "gray")

lines(cars$speed, loess(dist ~ speed, cars)$fitted, lwd = 3,
      col = "tomato")

How does this compare? It’s definitely cleaner! I do think it could benefit with some polishing, maybe using dashed lines for the confidence intervals. It’s comfoting to see the same story regarding narrow/wide the confidence intervals are in relation to the underlying variability of the data.

Bonus section - only for the more exceptional students

I was thinking about this and started to wonder what else I could do with this. These cars are all from 1920 so they max out at 25mph, however could we use this to estimate how far a car travelling at 30mph will travel once the brakes are hit?

Turns out the loess function can be used to extrapolate. Now, I’m not sure of the maths behind this and I still need to play around with it but following the example documentation, I used the control = loess.control(surface = "direct") argument in the loess call. I also increased the predict data frame sequence up to 30.

## create boot_fn
boot_fn <- function(data, indices) {
        d <- data[indices, ]
        d <- d[order(d$speed),]
        loess_fit <- loess(dist ~ speed, d,
                           control = loess.control(surface = "direct"))
        predict(loess_fit, data.frame(speed = seq(4, 30, 1)), se = T)$fit
}

loess_boot <- boot(cars, R = 100, statistic = boot_fn)

plot(cars$speed, cars$dist, pch = 20, col = "black", xlab = "Car speed in mph",
     ylab = "Stopping distance in ft", main = "Speed and stopping distances of cars", xlim=c(4, 30), ylim = c(0, 200))

conf_97.5 <- apply(loess_boot$t, 2, function(x) quantile(x, .975))
conf_2.5 <- apply(loess_boot$t, 2, function(x) quantile(x, .025))
lines(seq(4, 30, 1), conf_97.5, type ="l",
      col = "gray")
lines(seq(4, 30, 1), conf_2.5, type ="l",
      col = "gray")

lines(seq(4, 30, 1), colMeans(loess_boot$t, na.rm = TRUE), type ="l",
      col = "tomato", lwd = 3)

So looks like the stopping distance increases to over 150ft. Obviously this assumes car brakes didn’t improve etc but even so the extrapolation looks too high so I’m not sure I trust the loess prediction function. The confidence intervals seem to be behaving though, they get wider as speed increases.

Well, that’s all folks! Loess curves are also great for time series and I can see how that could be applied here really easily. It’ll also be interesting to try GAMs on this dataset to see if I get the same answer.

Until next time… @tojyouso

Bootstrapping local regression line chart

@tojyouso

31 March 2016

Quick example

1st method - all the lines

2nd method - confidence interval bands

Bonus section - only for the more exceptional students