Prediction standard error in Poisson models

This is a brief report summarizing why prediction variance in poisson models with variable group sizes (offsets) should be compared for rates not absolute counts, as the latter will by definition be proportionate to the expected count.

First, let’s simulate data for a Poisson GLM with an offset (population)

N = 1000
x = rnorm(N,1,0.5) 
beta0 = 0
beta1 = 0.05
pop = round(runif(N,500,5000),0) 
lambda = exp(beta0 + beta1* x + log(pop)) 
y = rpois(N,lambda)

Second, let’s fit a Poisson GLM to the data

m = glm(y ~ x, family="poisson",offset=log(pop))

Third, let’s generate new out-of-sample data for x and population.

x.pred   = rnorm(100, 3,0.5) 
pop.pred = seq(100,10000,100) 
newdata = data.frame( x=x.pred, pop = pop.pred)

Now we will view the prediction standard errors of the counts as a function of the offset (population size)

se.predict.count = predict.glm(m,newdata=newdata, se.fit=TRUE,type="response")$se.fit 
plot(pop.pred,se.predict.count,ylab="prediction standard error for counts",  pch=16,col="#4b6cb7",xlab="offset (population)")

The offset and the prediction standard error (for count values) are positively correlated by definition (given how the data are generated in the simulation or assumed to be generated in real datasets).

\[ y_i \sim pois( \mu_i) \]

\[log( \frac{\mu_i}{pop_i}) = \beta_0 + \beta_1 x_i \]

\[\mu_i = exp(\beta_0 + \beta_1 x_i + log(pop_i)) = exp(\beta_0 + \beta_1 x_i) * pop_i\]

This can be counter-intuitive since one would expect larger sample or population sizes to yield better precision. But when we include an offset, we should look at the standard error of the rate if we are going to compare across groups with different population sizes generating the counts.

So now instead let’s view the prediction standard errors of the rates as a function of the offset (population size).

The model parameters from the fit model were based on rates - recall that including the offset changes the other parameter values - and so to estimate the a predicted rate we exclude any offset in the prediction. To estimate a count for a given region or group, the rate can be multiplied by the group size.

There are two equivalent ways to estimate the standard errors of new predicted rates:

Set each group size to 1 as log(1) = 0 in the prediction dataset, then extract prediction variance
Use the offset in the prediction, but divide the prediction variance for the count by the population

The equivalency is shown below. There is no longer any correlation between the population size generating the counts and the prediction variance, because all are now based on standardized rates.

pop.fixed = rep(1,100) 
newdata.fixed = data.frame(x=x.pred, pop=pop.fixed)   

se.predict.rate1 = predict.glm(m,newdata=newdata.fixed, se.fit=TRUE,type="response")$se.fit 
se.predict.rate2 = se.predict.count / pop.pred 

par(mfrow=c(1,2)) 

plot(pop.pred,se.predict.rate1,ylab="prediction standard error for rates",  pch=16,col="#4b6cb7",xlab="offset (population)") 

plot(pop.pred,se.predict.rate2,ylab="prediction standard error for rates",  pch=16,col="#4b6cb7",xlab="offset (population)")

Prediction standard error in Poisson models

Kathryn Morrison - February 17, 2016