Notes 15

Poisson Counts | Part 5: Confidence Intervals for Beta Coefficients and the Mean Response

Setup

Loading packages

library(tidyverse)
library(GLMsData)
library(broom) # for tidy, augment, glance
library(statmod) # for qresid()

Noisy miner data context again

data(nminer)
head(nminer)
##   Miners Eucs Area Grazed Shrubs Bulokes Timber Minerab
## 1      0    2   22      0      1     120     16       0
## 2      0   10   11      0      1      67     25       0
## 3      1   16   51      0      1      85     13       3
## 4      1   20   22      0      1      45     12       2
## 5      1   19    4      0      1     160     14       8
## 6      1   18   61      0      1      75      6       1

Model:

  • \(Minerab \sim Poisson(\mu)\), where

  • \(\log(\mu) = \beta_0 + \beta_1 Eucs\)

Fit the model

fit <- glm(Minerab ~ Eucs, family = poisson(link="log"), data = nminer)

Confidence intervals for betas

tidy(fit, conf.int = TRUE, conf.level = .95) |> 
  select(-c(statistic, p.value))
## # A tibble: 2 × 5
##   term        estimate std.error conf.low conf.high
##   <chr>          <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)   -0.876    0.283   -1.46      -0.347
## 2 Eucs           0.114    0.0124   0.0899     0.139

Interpretation: We are 95% confident that \(\beta_1\) is between 0.090 and 0.139.

How is this calculated?

(See Section 7.2.2)

Formula:

\[\hat{\beta}_1 \pm z^\ast SE_{\hat{\beta}_1}\]

where \(z^\ast\) is the quantile of the standard normal distribution such that 95% of the probability lies between \(-z^\ast\) and \(z^\ast\). (Note: as before, this relies on \(\hat{\beta}_1\) being approximately normally distributed.)

Hmmm … We know the probability but are looking for the quantile. Sounds like a job for the qnorm() function!

How would we find \(z^\ast\) for a 95% confidence level?

zstar <- qnorm(.975)
zstar
## [1] 1.959964
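
As a quick sanity check, the probability between \(-z^\ast\) and \(z^\ast\) should come out to 0.95:

# probability captured between -zstar and zstar (should be 0.95)
pnorm(zstar) - pnorm(-zstar)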

Calculating the confidence interval formula in R (note: the endpoints come out very close to, but not exactly the same as, the tidy() values above, because with conf.int = TRUE, tidy() uses confint(), which for GLMs computes profile likelihood intervals rather than this normal-approximation formula):

tidy(fit)$estimate[2] + c(-1,1) * zstar * tidy(fit)$std.error[2]
## [1] 0.08961696 0.13834574

What is meant by “95% confidence?”

It does NOT mean: there’s a 95% probability that \(\beta_1\) is between 0.090 and 0.139.

What it means: if we take repeated samples from the population (that idea again), and calculate a separate 95% confidence interval based on each sample, 95% of the intervals will contain the true \(\beta_1\) value.

For a demonstration of this, let’s rev up our bootstrap sampler again and collect 100 samples.

set.seed(24601)
N <- 100
boot_samples <- tibble()
for(i in 1:N){
  boot_samples <- bind_rows(
    boot_samples, 
    tibble(
      n = i, 
      sample_n(nminer, size = nrow(nminer), replace = TRUE)
    )
  ) |> 
    select(n, Minerab, Eucs)
}
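
As a quick check of what the loop built, each bootstrap sample should have nrow(nminer) = 31 rows (the column name n_rows below is just our own label):

# number of rows in each bootstrap sample (should be 31 for every value of n)
boot_samples |> 
  count(n, name = "n_rows") |> 
  head()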

For each sample, we fit the model, calculate the confidence interval, and extract the endpoints.

ci <- array(dim = c(N,2))
for(i in 1:N){
  ci[i,] <- glm(Minerab ~ Eucs, family = poisson(link="log"), 
                       data = boot_samples |> filter(n == i)
                     ) |> 
    tidy(conf.int = TRUE, conf.level = .95) |> 
    filter(term == "Eucs") |> 
    select(conf.low, conf.high) |> 
    as.double()
}

This plots the 100 confidence intervals for \(\beta_1\). The horizontal line at 0.12 stands in for the true \(\beta_1\) value (even though, in practice, \(\beta_1\) is not known).

tibble(
  ci_low = ci[,1], 
  ci_high = ci[,2], 
  n = 1:N
) |> 
  ggplot(
    mapping = aes(x = n, xend = n, y = ci_low, yend = ci_high)
  ) + 
  geom_segment() + 
  geom_hline(yintercept = .12) + 
  labs(
    x = "Sample number", 
    y = "Confidence interval for beta1"
  )
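
We can also count how many of the 100 intervals capture the 0.12 value used for the horizontal line. If the long-run coverage claim holds for this stand-in "truth," most of them (roughly 95 out of 100) should contain it:

# proportion of the 100 intervals containing 0.12
mean(ci[,1] <= 0.12 & ci[,2] >= 0.12)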

“95% confidence” means that, in the long run, 95% of the intervals will contain \(\beta_1\). So 95% confidence refers to the process that produces the confidence intervals, not to the particular interval we are looking at.

Confidence intervals for \(\mu\)

(See Section 7.2.3)

ggplot(
  data = augment(fit, type.predict = "response"), 
  mapping = aes(x = Eucs, y = Minerab)
) + 
  geom_jitter(width = .3, height = .3, alpha = .5) + 
  geom_line(aes(y = .fitted)) + 
  labs(
    x = "Number of Eucalyptus trees", 
    y = "Number of Noisy Miner birds"
  )

In our data context, \(\mu\) = mean count of noisy miners on all transects with a particular number of Eucs (eucalyptus trees). (The line in the graph above is \(\hat{\mu}\).) According to our model,

\[\log(\mu) = \beta_0 + \beta_1 Eucs\]

Unfortunately, the augment() function doesn’t return confidence intervals for \(\mu\) directly.

Method:

  1. Make a confidence interval for \(\log(\mu)\).

  2. Apply exp() to both endpoints to get a confidence interval for \(\mu\).

To do #1, we’ll use the notation \(\eta\) (eta) for \(\log(\mu)\) since it is shorter. The confidence interval for \(\eta\) is

\[\hat{\eta} \pm z^\ast SE_{\hat{\eta}}\]

(Note: this is the same form as our earlier CI for \(\hat{\beta}_1\).)

We can get \(\hat{\eta}\) and its standard error from augment(). We need to specify se_fit = TRUE to have it output the standard errors.

augment(fit, se_fit = TRUE)
## # A tibble: 31 × 9
##    Minerab  Eucs .fitted .se.fit .resid   .hat .sigma .cooksd .std.resid
##      <int> <int>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
##  1       0     2  -0.648   0.260 -1.02  0.0354   1.49 0.00994     -1.04 
##  2       0    10   0.264   0.175 -1.61  0.0398   1.47 0.0281      -1.65 
##  3       3    16   0.947   0.125  0.255 0.0406   1.50 0.00151      0.261
##  4       2    20   1.40    0.110 -1.14  0.0491   1.49 0.0285      -1.17 
##  5       8    19   1.29    0.112  1.98  0.0454   1.45 0.131        2.02 
##  6       1    18   1.18    0.115 -1.46  0.0430   1.48 0.0364      -1.49 
##  7       8    12   0.492   0.156  3.56  0.0399   1.34 0.536        3.63 
##  8       5    16   0.947   0.125  1.33  0.0406   1.48 0.0501       1.36 
##  9       0     3  -0.534   0.249 -1.08  0.0363   1.49 0.0114      -1.10 
## 10       4    12   0.492   0.156  1.56  0.0399   1.47 0.0740       1.59 
## # ℹ 21 more rows

Note that we are NOT using the type.predict = "response" argument as we usually do.

  • Without type.predict = "response", we get fitted values, etc., for \(\log(\mu)\) (which is what we want here).

  • With type.predict = "response", we get fitted values, etc., for \(\mu\).

The variable .fitted gives \(\hat{\eta}\), the estimate of \(\log(\mu)\), and .se.fit gives its standard error.
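
As a quick check that the two scales line up, exponentiating the link-scale fitted values should reproduce the response-scale fitted values (the object names eta_hat and mu_hat are just our own labels):

eta_hat <- augment(fit)$.fitted                             # estimates of log(mu)
mu_hat <- augment(fit, type.predict = "response")$.fitted   # estimates of mu
all.equal(exp(eta_hat), mu_hat)                             # should be TRUE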

To focus on a single value of the explanatory variable, we can use the newdata argument. Let’s focus on Eucs = 15.

augment(fit, newdata = tibble(Eucs = 15), se_fit = TRUE)
## # A tibble: 1 × 3
##    Eucs .fitted .se.fit
##   <dbl>   <dbl>   <dbl>
## 1    15   0.834   0.132
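
As a check, this .fitted value is just the estimated linear predictor \(\hat{\beta}_0 + \hat{\beta}_1 (15)\):

# eta-hat at Eucs = 15, computed directly from the coefficient estimates
coef(fit)["(Intercept)"] + coef(fit)["Eucs"] * 15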

The \(z^\ast\) value is calculated in the same way as before. Let’s use R as a calculator to find \(z^\ast\) for a 99% confidence interval.

zstar99 <- qnorm(.995)

The 99% interval for \(\log(\mu)\) for Eucs = 15 is

# lower
0.8335088 - zstar99 * 0.132077
## [1] 0.493301
# upper
0.8335088 + zstar99 * 0.132077
## [1] 1.173717

The 99% interval for \(\mu\) for Eucs = 15 is

# lower
exp(0.8335088 - zstar99 * 0.132077)
## [1] 1.637713
# upper
exp(0.8335088 + zstar99 * 0.132077)
## [1] 3.23399
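
Rather than copying the numbers over by hand, we could do the whole calculation in one pipeline. A sketch (the column names mu_low and mu_high are just our own labels):

# 99% confidence interval for mu at Eucs = 15, in one step
augment(fit, newdata = tibble(Eucs = 15), se_fit = TRUE) |> 
  mutate(
    mu_low  = exp(.fitted - zstar99 * .se.fit),  # exp() of the lower endpoint for log(mu)
    mu_high = exp(.fitted + zstar99 * .se.fit)   # exp() of the upper endpoint for log(mu)
  )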

Interpretation: We are 99% confident that the population mean number of noisy miner birds on land plots with 15 eucalyptus trees is between 1.64 and 3.24.

Plot with 95% confidence intervals for \(\mu\) around the line for \(\hat{\mu}\):

augment(fit, se_fit = TRUE) |> 
  ggplot(
    mapping = aes(x = Eucs)
  ) + 
  geom_jitter(aes(y = Minerab)) + # points with jitter
  geom_line(aes(y = exp(.fitted))) + # line for mu-hat
  geom_ribbon(aes( #shaded region for CI
    ymin = exp(.fitted - qnorm(.975)* .se.fit), 
    ymax = exp(.fitted + qnorm(.975)* .se.fit)
  ), alpha = .2)
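
An equivalent way to build the same plot is to compute \(\hat{\mu}\) and the interval endpoints as columns first with mutate(), which keeps the ggplot() call a bit simpler. A sketch (mu_hat, mu_low, and mu_high are our own column names):

augment(fit, se_fit = TRUE) |> 
  mutate(
    mu_hat  = exp(.fitted),                          # mu-hat
    mu_low  = exp(.fitted - qnorm(.975) * .se.fit),  # lower endpoint of 95% CI for mu
    mu_high = exp(.fitted + qnorm(.975) * .se.fit)   # upper endpoint of 95% CI for mu
  ) |> 
  ggplot(mapping = aes(x = Eucs)) + 
  geom_jitter(aes(y = Minerab)) +                    # points with jitter
  geom_line(aes(y = mu_hat)) +                       # line for mu-hat
  geom_ribbon(aes(ymin = mu_low, ymax = mu_high), alpha = .2) +  # shaded region for CI
  labs(
    x = "Number of Eucalyptus trees", 
    y = "Number of Noisy Miner birds"
  )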