Statistics manages to be fun and interesting at every step one takes. This time I was surprised to find that I had been calculating GLM confidence intervals on the response scale, and not on the link scale. Here is an example of calculating confidence intervals for Poisson regression in three ways (the first may not be “correct”). As it turns out, the intervals from method 1 differ very little from those of the “right” way (method 2). The third method gives a bootstrapped confidence interval.
require(ggplot2)
require(gridExtra)
require(reshape2)
set.seed(42) # fix the RNG so the example is reproducible
n <- 200 # sample size
crit.val <- qnorm(0.975) # critical value for a 95% Wald interval
# model parameters
beta0 <- 2
beta1 <- -0.5
# independent variable is on scale from 0 to 10
x <- runif(n = n, min = 0, max = 10)
mu <- exp(beta0 + beta1 * x) # mean via the log link
y <- rpois(n = n, lambda = mu) # simulate the Poisson response
my.pois <- data.frame(y, x)
par(mfrow = c(1, 2))
hist(my.pois$y) # distribution of response
plot(y ~ x, data = my.pois)
par(mfrow = c(1, 1))
mdl <- glm(y ~ x, data = my.pois, family = poisson)
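A quick sanity check on the fit: the estimated coefficients should land close to the simulated values (beta0 = 2, beta1 = -0.5), though how close depends on the particular random draw.
summary(mdl)$coefficients # estimates should be near beta0 = 2 and beta1 = -0.5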
This method may not be the correct way of doing this: the standard errors are computed directly on the response scale, and the interval is built symmetrically around the fitted values. The results differ only slightly from those of the next method (2).
ci1 <- predict(mdl, type = "response", se.fit = TRUE)
my.pois$uci.ci1 <- ci1$fit + crit.val * ci1$se.fit
my.pois$lci.ci1 <- ci1$fit - crit.val * ci1$se.fit
my.pois$fit1 <- ci1$fit
(g1 <- ggplot(my.pois, aes(x = x, y = y)) +
theme_bw() +
geom_point() +
geom_line(aes(y = fit1)) +
geom_ribbon(aes(ymin = lci.ci1, ymax = uci.ci1), alpha = 0.3))
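One reason to distrust method 1: the interval is symmetric around the fit on the response scale, so for small fitted means the lower bound can dip below zero, which is impossible for a Poisson mean. A quick check (whether this actually happens depends on the simulated data):
any(my.pois$lci.ci1 < 0) # TRUE would flag impossible negative bounds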
This is the “correct” way of calculating confidence intervals: compute them on the link scale and transform the bounds back using the inverse of the link function. The same holds for the mean predicted values.
ci2 <- predict(mdl, type = "link", se.fit = TRUE)
inverseLink <- mdl$family$linkinv
my.pois$uci.ci2 <- inverseLink(ci2$fit + crit.val * ci2$se.fit)
my.pois$lci.ci2 <- inverseLink(ci2$fit - crit.val * ci2$se.fit)
my.pois$fit2 <- inverseLink(ci2$fit)
(g2 <- ggplot(my.pois, aes(x = x, y = y)) +
theme_bw() +
geom_point() +
geom_line(aes(y = fit2)) +
geom_ribbon(aes(ymin = lci.ci2, ymax = uci.ci2), alpha = 0.3))
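Because the inverse link is the exponential, the back-transformed interval is asymmetric on the response scale: the upper arm is always wider than the lower arm, and the lower bound can never go negative.
# compare the two arms of the interval and confirm positivity
head(data.frame(lower.arm = my.pois$fit2 - my.pois$lci.ci2,
                upper.arm = my.pois$uci.ci2 - my.pois$fit2))
all(my.pois$lci.ci2 > 0)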
When bootstrapping, we sample the data with replacement, refit the model, and store the predictions. Repeating this many times yields a distribution of predictions, whose quantiles serve as confidence bounds.
bootstrapCI <- function(model, newdata) {
  data <- model$data
  nr <- nrow(data)
  # resample the rows with replacement and refit the model on the resample
  data <- data[sample(1:nr, size = nr, replace = TRUE), ]
  up <- update(model, data = data)
  # return predictions on the response scale
  predict(up, newdata = newdata, type = "response")
}
newdata <- data.frame(x = my.pois$x)
but <- replicate(1000, bootstrapCI(model = mdl, newdata = newdata))
my.pois$uci.ci3 <- apply(but, MARGIN = 1, FUN = quantile, probs = 0.975)
my.pois$lci.ci3 <- apply(but, MARGIN = 1, FUN = quantile, probs = 0.025)
my.pois$fit3 <- apply(but, MARGIN = 1, FUN = quantile, probs = 0.5)
(g3 <- ggplot(my.pois, aes(x = x, y = y)) +
theme_bw() +
geom_point() +
geom_line(aes(y = fit3)) +
geom_ribbon(aes(ymin = lci.ci3, ymax = uci.ci3), alpha = 0.3))
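For a more structured take on the same idea, the boot package wraps the resample-refit-predict loop and offers several interval types through boot.ci. A minimal sketch, assuming the mdl and newdata objects from above (the helper bootFit is just illustrative):
require(boot)
bootFit <- function(data, idx) {
  # refit on the resampled rows, return response-scale predictions
  up <- update(mdl, data = data[idx, ])
  predict(up, newdata = newdata, type = "response")
}
bt <- boot(my.pois, statistic = bootFit, R = 1000)
boot.ci(bt, type = "perc", index = 1) # percentile interval for the first observation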
Bootstrapped confidence intervals are usually narrower. Joris Meys has this to say about the matter:
… you derive your confidence interval from your data, which is inherently smaller than the complete population. So the chance you miss some of the extremes in your population is quite large, actually. Hence the fact that bootstrapping (without correction) gives slightly smaller confidence intervals in many cases.
Can you spot the differences between the methods? Notice the width and the skewness of the intervals.
grid.arrange(g1, g2, g3, ncol = 3)
Width of the intervals: the bootstrapped intervals (method 3) are somewhat narrower and asymmetric compared to methods 1 and 2, which overlap considerably.
ints <- data.frame(x = my.pois$x,
                   method1 = my.pois$uci.ci1 - my.pois$lci.ci1,
                   method2 = my.pois$uci.ci2 - my.pois$lci.ci2,
                   method3 = my.pois$uci.ci3 - my.pois$lci.ci3)
ints <- melt(ints, id.vars = "x", variable.name = "method")
ggplot(ints, aes(x = x, y = value, color = method)) +
theme_bw() +
ylab("Width of interval") +
geom_line(size = 1)
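A numeric summary tells the same story as the plot; exact values will vary with the simulated data:
aggregate(value ~ method, data = ints, FUN = mean) # average interval width per method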