Statistics is fun and interesting at every step one makes. This time I was surprised I was calculating GLM confidence intervals on response, and not the link. Here is an example of calculating confidence intervals for Poisson regression in three ways (the first one may not be “correct”). It would appear the method I used to calculate confidence (method 1) intervals differs very little from the “right” way (method 2). I give an example of bootstrapped confidence interval.

require(ggplot2)
require(gridExtra)
require(reshape2)

n <- 200 #sample size
crit.val <- qnorm(0.975)

# model parameters
beta0 <- 2
beta1 <- -0.5

# independent variable is on scale from 0 to 10
x <- runif(n = n, min = 0, max = 10)
mu <- exp(beta0 + beta1 * x)
y <- rpois(n = n, lambda = mu) # calculate response

my.pois <- data.frame(y, x)

par(mfrow = c(1, 2))
hist(my.pois$y) # distribution of response
plot(y ~ x, data = my.pois)

plot of chunk unnamed-chunk-1

par(mfrow = c(1, 1))

mdl <- glm(y ~ x, data = my.pois, family = poisson)

Method 1

This method may not be the correct way of doing this. They differ slightly with the next method (2).

ci1 <- predict(mdl, type = "response", se.fit = TRUE)
my.pois$uci.ci1 <- ci1$fit + crit.val * ci1$se.fit
my.pois$lci.ci1 <- ci1$fit - crit.val * ci1$se.fit
my.pois$fit1 <- ci1$fit

(g1 <- ggplot(my.pois, aes(x = x, y = y)) +
  theme_bw() +
  geom_point() +
  geom_line(aes(y = fit1)) +
  geom_ribbon(aes(ymin = lci.ci1, ymax = uci.ci1), alpha = 0.3))

plot of chunk unnamed-chunk-2

Method 2

This is the “correct” way of calculating confidence interval. Calculate them on the link and transform the values using the inverse of the link. Same holds true for mean predicted values.

ci2 <- predict(mdl, type = "link", se.fit = TRUE)

inverseLink <- mdl$family$linkinv
my.pois$uci.ci2 <- inverseLink(ci2$fit + crit.val * ci2$se.fit)
my.pois$lci.ci2 <- inverseLink(ci2$fit - crit.val * ci2$se.fit)
my.pois$fit2 <- inverseLink(ci2$fit)

(g2 <- ggplot(my.pois, aes(x = x, y = y)) +
  theme_bw() +
  geom_point() +
  geom_line(aes(y = fit2)) +
  geom_ribbon(aes(ymin = lci.ci2, ymax = uci.ci2), alpha = 0.3))

plot of chunk unnamed-chunk-3

Method 3

When bootstrapping, we sample the data with replacement and calculate the model. Repeat this many times.

bootstrapCI <- function(model, newdata) {
  nr <- nrow(model$data)
  data <- model$data
  data <- data[sample(1:nr, size = nr, replace = TRUE), ]
  up <- update(model, data = data)
  predict(up, newdata = newdata, type = "response")
}

newdata <- data.frame(x = my.pois$x)
but <- replicate(10, bootstrapCI(model = mdl, newdata = newdata))
my.pois$uci.ci3 <- apply(but, MARGIN = 1, FUN = quantile, probs = 0.925)
my.pois$lci.ci3 <- apply(but, MARGIN = 1, FUN = quantile, probs = 0.025)
my.pois$fit3 <- apply(but, MARGIN = 1, FUN = quantile, probs = 0.5)

(g3 <- ggplot(my.pois, aes(x = x, y = y)) +
  theme_bw() +
  geom_point() +
  geom_line(aes(y = fit3)) +
  geom_ribbon(aes(ymin = lci.ci3, ymax = uci.ci3), alpha = 0.3))

plot of chunk unnamed-chunk-4

Bootstrapped confidence intervals are usually narrower. Joris Meys has this to say about the matter:

… you derive your confidence interval from your data, which is inherently smaller than the complete population. So the chance you miss some of the extremes in your population is quite large, actually. Hence the fact that bootstrapping (without correction) gives slightly smaller confidence intervals in many cases.

Summary

Can you spot differences between the methods? Notice the breadth and skewness of the intervals.

grid.arrange(g1, g2, g3, ncol = 3)

plot of chunk unnamed-chunk-5

Range of intervals. Bootstrapped intervals (method 3) are somewhat narrower and asymmetrical compared to method 1 and 2, which overlap considerably.

ints <- apply(my.pois, MARGIN = 1, FUN = function(x) {
  data.frame(x = x["x"], 
             method1 = diff(range(x[c("lci.ci1", "uci.ci1")])),
             method2 = diff(range(x[c("lci.ci2", "uci.ci2")])),
             method3 = diff(range(x[c("lci.ci3", "uci.ci3")])))
})
ints <- do.call("rbind", ints)

ints <- melt(ints, id.vars = "x", variable.name = "method")

ggplot(ints, aes(x = x, y = value, color = method)) +
  theme_bw() +
  ylab("Width of interval") +
  geom_line(size = 1)

plot of chunk unnamed-chunk-6