Confidence intervals around the linear predictor account only for uncertainty in estimating the expected value of the outcome given the covariates. Prediction intervals must also account for the additional variability of an individual observed outcome around that expected value.
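To see why the distinction matters, consider a Poisson outcome whose mean is known without error; a minimal illustration (the value 26 is an illustrative choice, close to the average \(\mu\) in the simulations below):

```r
# Hypothetical illustration: even if E(y|x) were known exactly, a new
# Poisson outcome with mean 26 still varies by roughly +/- 2*sqrt(26),
# i.e. about +/- 10 -- variability a CI for the mean cannot shrink away.
qpois(c(0.025, 0.975), lambda = 26)
```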
We need two functions for later:
- `generate.dataset` will generate a dataset of size N based on two normal covariates and return it split in two: the first N-1 observations (what is 'originally observed') and the final N\(^{th}\) observation, representing the 'new' value for which we want a prediction and a prediction interval.
- `get.coverage` will calculate the coverage percentage from the output of the simulations (details to follow).
generate.dataset = function(N) {
  # Two normally distributed covariates
  x1 = rnorm(N, 5, 1)
  x2 = rnorm(N, 2, 0.5)
  # True mean on the response scale (log link), rounded to a whole number
  mu = round(exp(1 + 0.25*x1 + 0.5*x2), 0)
  y = rpois(N, lambda = mu)
  ds.full = data.frame(x1 = x1, x2 = x2, y = y)
  # Split: first N-1 rows are 'observed'; the Nth row is the 'new' observation
  ds = ds.full[1:(N-1), ]
  new.data = ds.full[N, ]
  return(list(ds.full = ds.full, ds = ds, new.data = new.data))
}
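As a quick sanity check, a small run (the size 10 here is arbitrary):

```r
d = generate.dataset(10)   # 10 is an arbitrary size for illustration
nrow(d$ds)                 # 9: the 'observed' rows
d$new.data                 # the single held-out 'new' observation
```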
get.coverage = function(r) {
  # Proportion of simulated datasets whose interval contains the true y
  covered = (r$lpi < r$y.true) & (r$upi > r$y.true)
  return(sum(covered) / length(covered))
}
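For example (toy numbers, chosen only to show the expected input format):

```r
toy = data.frame(lpi = c(1, 5), upi = c(10, 6), y.true = c(4, 9))
get.coverage(toy)   # 0.5: only the first interval contains y.true
```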
Next, let's run a simulation (`classic.CI`, below) that will, for each of a given number of simulated datasets:

1. generate a dataset and fit a Poisson GLM to the 'observed' portion;
2. predict the response for the held-out 'new' observation, along with its standard error;
3. form a classic 95% confidence interval around that prediction using a t critical value; and
4. record the true value of the new observation so we can check coverage afterwards.
classic.CI = function(N.observations, N.simdatasets) {
  # t critical value for a two-sided 95% interval
  t.crit = qt(0.975, df = (N.observations - 1))
  uci = lci = y.pred = y.true = c()
  for(i in 1:N.simdatasets) {
    d = generate.dataset(N.observations)
    m = glm(y ~ x1 + x2, data = d$ds, family = "poisson")
    # Predicted mean and its standard error, on the response scale
    pred = predict(m, newdata = d$new.data, se.fit = TRUE, type = "response")
    uci[i] = pred$fit + (t.crit * pred$se.fit)
    lci[i] = pred$fit - (t.crit * pred$se.fit)
    y.pred[i] = pred$fit
    y.true[i] = d$new.data$y
  }
  # Column names match what get.coverage expects, even though these
  # are confidence limits, not prediction limits
  return(data.frame(upi = uci, lpi = lci, y.pred = y.pred, y.true = y.true))
}
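A single run looks like this (100 datasets is an arbitrary choice for a quick check):

```r
set.seed(42)   # arbitrary seed, for reproducibility
r = classic.CI(N.observations = 200, N.simdatasets = 100)
get.coverage(r)
```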
The coverage is typically around 20%, far below the nominal 95%. That is very poor, but expected: these are confidence intervals for the mean, not prediction intervals.
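A rough back-of-envelope check shows why coverage lands near 20% (approximating the Poisson outcome as normal; the standard-error figure below is an illustrative assumption, not taken from the simulation output):

\[
\Pr(\text{covered}) \;\approx\; \Pr\big(|y - \mu| < t \cdot \widehat{se}\big) \;\approx\; 2\,\Phi\!\left(\frac{t \cdot \widehat{se}}{\sqrt{\mu}}\right) - 1
\]

With \(\mu \approx e^{3.25} \approx 26\) (so \(\sqrt{\mu} \approx 5.1\)), \(t \approx 1.97\), and a standard error for the predicted mean on the order of 0.6 at these sample sizes, this is roughly \(2\Phi(0.23) - 1 \approx 0.18\), in line with the table below. The confidence interval shrinks as \(n\) grows, but the outcome's own standard deviation \(\sqrt{\mu}\) does not.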
empirical.PI = function(N.observations, N.simdatasets, N.iterations, PI.level) {
  results = list()
  # Accept either e.g. 95 or 0.95 for a 95% PI
  if (PI.level > 1.0) { PI.level = PI.level / 100 }
  for(j in 1:N.simdatasets) {
    y.distribution.samples = c()
    # For each simulation scenario, generate a new 'observed' dataset
    d = generate.dataset(N.observations)
    # Fit the Poisson GLM and save the fitted (expected) values
    m = glm(y ~ x1 + x2, data = d$ds, family = "poisson")
    expected.mean.y = predict(m, type = "response")
    for(i in 1:N.iterations) {
      # Step 1: Generate a new y with mu = E(y|x1,x2) from the original GLM
      generated.ds.i = data.frame(y = rpois(length(expected.mean.y), lambda = expected.mean.y))
      # Step 2: Fit a GLM to the generated y and the original observed x1, x2
      generated.ds.i$x1 = d$ds$x1
      generated.ds.i$x2 = d$ds$x2
      m.iteration.i = glm(y ~ ., data = generated.ds.i, family = "poisson")
      # Step 3: Using this iteration's GLM, predict y for the new observation
      prediction.iteration.i = predict.glm(m.iteration.i, type = "response", newdata = d$new.data)
      # Step 4: Using this predicted mean E(y|x), draw a plausible new outcome
      y.distribution.samples[i] = rpois(length(prediction.iteration.i), lambda = prediction.iteration.i)
    }
    results$y.true[j] = d$new.data$y
    results$y.pred[j] = predict(m, newdata = d$new.data, type = "response")
    # PI limits: empirical quantiles of the simulated outcome distribution
    results$lpi[j] = quantile(y.distribution.samples, probs = ((1 - PI.level)/2))
    results$upi[j] = quantile(y.distribution.samples, probs = ((1 + PI.level)/2))
  }
  return(data.frame(results))
}
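The scheme is essentially a parametric bootstrap: refitting on simulated responses (steps 1 and 2) propagates uncertainty in the estimated coefficients, while the final Poisson draw (step 4) adds the sampling variability of the outcome itself. As a quick check, a single-dataset run (small settings chosen only for speed):

```r
set.seed(1)   # arbitrary seed, for reproducibility
empirical.PI(N.observations = 200, N.simdatasets = 1,
             N.iterations = 500, PI.level = 95)
# one row: y.true, y.pred, and the lpi/upi prediction limits
```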
We can summarize the results by running the above simulations, varying the number of simulated datasets from 100 to 500 (note: this takes roughly an hour to run in total).
# Number of simulated datasets per scenario
n.datasets = seq(100, 500, 100)
# n.datasets fills the first unnamed argument, N.simdatasets, in each call
r1 = lapply(X = n.datasets, FUN = classic.CI, N.observations = 200)
r2 = lapply(X = n.datasets, FUN = empirical.PI, N.observations = 200, N.iterations = 500, PI.level = 0.95)
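The coverage figures in the table below can then be computed by applying `get.coverage` to each result, e.g.:

```r
data.frame(n         = n.datasets,
           classic   = sapply(r1, get.coverage),
           empirical = sapply(r2, get.coverage))
```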
| N Simulated Datasets | Classic CI Coverage\(^1\) | Empirical PI Coverage\(^{1,2}\) |
|---|---|---|
| 100 | 0.19 | 0.98 |
| 200 | 0.20 | 0.93 |
| 300 | 0.17 | 0.91 |
| 400 | 0.23 | 0.95 |
| 500 | 0.18 | 0.92 |
\(^1\) Data generation based on N simulated datasets of 200 observations each
\(^2\) Simulation for PI based on 500 iterations
To be continued…