Preamble

Confidence intervals around the linear predictor only account for uncertainty in estimating the expected value of the outcome (given the covariates). Prediction intervals must also account for the additional uncertainty arising from the variability of the observed outcome around that expected value.
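In the Poisson setting used below this distinction is concrete: the GLM models \(\log(\mu) = \beta_0 + \beta_1 x_1 + \beta_2 x_2\), and a confidence interval reflects only the uncertainty in the estimate \(\hat{\mu}\). Because a Poisson outcome has \(\mathrm{Var}(y \mid x_1, x_2) = \mu\), the observed \(y\) varies around \(\mu\) even if the coefficients were known exactly, so a prediction interval must cover that variability too.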

We need two functions for later:

  1. `generate.dataset` will generate a dataset of size \(N\) based on two normal covariates, and return the dataset split in two: the first \(N-1\) observations (what is 'originally observed') and the final \(N^{th}\) observation, representing the 'new' value for which we desire a prediction and prediction interval.

  2. `get.coverage` will calculate the coverage, i.e. the proportion of simulated 'new' observations whose true value falls inside the interval, based on the output of the simulations (details to follow).

generate.dataset = function(N) {
  x1 = rnorm(N, 5, 1)
  x2 = rnorm(N, 2, 0.5)
  # True model: log(mu) = 1 + 0.25*x1 + 0.5*x2 (mu rounded to a whole number)
  mu = round(exp(1 + 0.25*x1 + 0.5*x2), 0)
  y  = rpois(N, lambda=mu)
  ds.full  = data.frame(x1=x1, x2=x2, y=y)
  ds       = ds.full[1:(N-1), ]  # the 'originally observed' data
  new.data = ds.full[N, ]        # the held-out 'new' observation
  return(list(ds.full=ds.full, ds=ds, new.data=new.data))
}
get.coverage = function(r) {
  # Proportion of simulations where the true y falls strictly inside (lpi, upi)
  covered = (r$lpi < r$y.true) & (r$upi > r$y.true)
  return(mean(covered))
}
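As a quick sanity check of the first helper (a minimal sketch; the seed and sample size here are arbitrary):

set.seed(1)  # arbitrary seed, for reproducibility only
d = generate.dataset(200)
nrow(d$ds)   # 199 rows are 'originally observed'
d$new.data   # the held-out 200th observation, with its true y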


Frequentist prediction intervals

1. Simulation demonstrating the poor coverage of traditional confidence intervals around a prediction

Next, let’s run a simulation that will:

  1. Fit a Poisson GLM to the data
  2. Use that GLM to predict the expected value of \(y\) given new observed \(x_1\), \(x_2\)
  3. Calculate the confidence interval around that expected value and record the true \(y\) for the coverage check

classic.CI = function(N.observations, N.simdatasets) {
  t.crit = qt(0.975, df=(N.observations-1))
  uci = lci = y.pred = y.true = c()

  for(i in 1:N.simdatasets) {
    d = generate.dataset(N.observations)
    m = glm(y ~ x1 + x2, data=d$ds, family="poisson")
    # Standard error of the fitted mean only (no outcome-level variability)
    pred = predict(m, newdata=d$new.data, se.fit=TRUE, type="response")

    uci[i] = pred$fit + (t.crit * pred$se.fit)
    lci[i] = pred$fit - (t.crit * pred$se.fit)
    y.pred[i] = pred$fit
    y.true[i] = d$new.data$y
  }
  return(data.frame(upi=uci, lpi=lci, y.pred=y.pred, y.true=y.true))
}

The coverage is often around 20%, which is very poor: these are confidence intervals for the expected value, not prediction intervals for the observation.
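For example, a single illustrative run (with an arbitrary seed) confirms this using `get.coverage`:

set.seed(42)  # arbitrary seed
r = classic.CI(N.observations=200, N.simdatasets=100)
get.coverage(r)  # typically near 0.2, far below the nominal 0.95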

2. Simulation-based approach to a prediction interval with a Poisson GLM

empirical.PI = function(N.observations, N.simdatasets, N.iterations, PI.level) {

  results = list()
  if (PI.level > 1.0) { PI.level = PI.level/100 }  # So you can, e.g., put in 95 or 0.95 for a 95% PI

  for(j in 1:N.simdatasets) {
    y.distribution.samples = c()

    # For each simulation scenario, generate a new 'observed' dataset
    d = generate.dataset(N.observations)

    # Fit the Poisson GLM and save the fitted (expected) values
    m = glm(y ~ x1 + x2, data=d$ds, family="poisson")
    expected.mean.y = predict(m, type="response")

    for(i in 1:N.iterations) {

      # Simulation step 1: generate a new y variable with mu = E(y|x1,x2) from the original GLM
      generated.ds.i = data.frame(y=rpois(length(expected.mean.y), lambda=expected.mean.y))

      # Simulation step 2: using the generated y and the original observed x1, x2 data, fit a GLM
      generated.ds.i$x1 = d$ds$x1
      generated.ds.i$x2 = d$ds$x2
      m.iteration.i = glm(y ~ ., data=generated.ds.i, family="poisson")

      # Simulation step 3: using this iteration's GLM, predict y for the new observation
      prediction.iteration.i = predict(m.iteration.i, type="response", newdata=d$new.data)

      # Simulation step 4: using this predicted E(y|x1,x2), draw a plausible observed value
      y.distribution.samples[i] = rpois(1, lambda=prediction.iteration.i)
    }

    results$y.true[j] = d$new.data$y
    results$y.pred[j] = predict(m, newdata=d$new.data, type="response")
    # The PI bounds are the empirical quantiles of the simulated outcome distribution
    results$lpi[j] = quantile(y.distribution.samples, probs=((1 - PI.level)/2))
    results$upi[j] = quantile(y.distribution.samples, probs=((1 + PI.level)/2))
  }
  return(data.frame(results))
}
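To see what a single interval looks like (a minimal run with an arbitrary seed and sizes, not the full simulation):

set.seed(42)  # arbitrary seed
r = empirical.PI(N.observations=200, N.simdatasets=1, N.iterations=500, PI.level=0.95)
r  # one row: true y, predicted mean, and the empirical 95% PI bounds (lpi, upi)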


3. Simulation summary

We can summarize the results by running the above simulations (note: this takes roughly an hour to run altogether).

n.datasets = seq(100, 500, 100)  # number of simulated datasets per scenario
# n.datasets is passed to the first unmatched argument, N.simdatasets
r1 = lapply(X = n.datasets, FUN=classic.CI, N.observations=200)
r2 = lapply(X = n.datasets, FUN=empirical.PI, N.observations=200, N.iterations=500, PI.level=0.95)
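r1 and r2 are lists with one results data frame per scenario, so the coverage columns in the table below can be computed with `get.coverage` (an assumed final step; the tabulation code isn't shown above):

cov.classic   = sapply(r1, get.coverage)
cov.empirical = sapply(r2, get.coverage)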


Table: Coverage for classic CI and empirical PIs, by number of simulated datasets

| N Simulated Datasets | Classic CI Coverage\(^1\) | Empirical PI Coverage\(^{1,2}\) |
|---------------------:|--------------------------:|--------------------------------:|
| 100 | 0.19 | 0.98 |
| 200 | 0.20 | 0.93 |
| 300 | 0.17 | 0.91 |
| 400 | 0.23 | 0.95 |
| 500 | 0.18 | 0.92 |

\(^1\) Data generation based on N simulated datasets of 200 observations each
\(^2\) Simulation for PI based on 500 iterations

Bayesian prediction intervals

To be continued…