Notes 15
Poisson Counts | Part 5: Confidence Intervals for Beta Coefficients and the Mean Response
Setup
Loading packages
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'broom' was built under R version 4.3.3
## Warning: package 'statmod' was built under R version 4.3.3
Noisy miner data context again
## Miners Eucs Area Grazed Shrubs Bulokes Timber Minerab
## 1 0 2 22 0 1 120 16 0
## 2 0 10 11 0 1 67 25 0
## 3 1 16 51 0 1 85 13 3
## 4 1 20 22 0 1 45 12 2
## 5 1 19 4 0 1 160 14 8
## 6 1 18 61 0 1 75 6 1
Model:
\(Minerab \sim Poisson(\mu)\), where
\(\log(\mu) = \beta_0 + \beta_1 Eucs\)
Fit the model
Confidence intervals for betas
## # A tibble: 2 × 5
## term estimate std.error conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.876 0.283 -1.46 -0.347
## 2 Eucs 0.114 0.0124 0.0899 0.139
Interpretation: We are 95% confident that \(\beta_1\) is between 0.090 and 0.139.
How is this calculated?
(See Section 7.2.2)
Formula:
\[\hat{\beta}_1 \pm z^\ast SE_{\hat{\beta}_1}\]
where \(z^\ast\) is the quantile of the standard normal distribution so 95% of the probability is between \(-z^\ast\) and \(z^\ast\). (Note: this is based on \(\hat{\beta}_1\) following a normal distribution again.)
Hmmm … We know the probability but are looking for the quantile.
Sounds like a job for the qnorm()
function!
How would we find \(z^\ast\) for a 95% confidence level?
## [1] 1.959964
Calculating the confidence interval formula in R:
## [1] 0.08961696 0.13834574
What is meant by “95% confidence?”
It does NOT mean: there’s a 95% probability that \(\beta_1\) is between 0.090 and 0.139.
What it means: if we take repeated samples from the population (that idea again), and calculate a separate 95% confidence interval based on each sample, 95% of the intervals will contain the true \(\beta_1\) value.
For a demonstration of this, let’s rev up our bootstrap sampler again and collect 100 samples.
set.seed(24601)
N <- 100
boot_samples <- tibble()
for(i in 1:N){
boot_samples <- bind_rows(
boot_samples,
tibble(
n = i,
sample_n(nminer, size = nrow(nminer), replace = TRUE)
)
) |>
select(n, Minerab, Eucs)
}
For each sample, we fit the model, calculate the confidence interval, and extract the endpoints.
ci <- array(dim = c(N,2))
for(i in 1:N){
ci[i,] <- glm(Minerab ~ Eucs, family = poisson(link="log"),
data = boot_samples |> filter(n == i)
) |>
tidy(conf.int = TRUE, conf.level = .95) |>
filter(term == "Eucs") |>
select(conf.low, conf.high) |>
as.double()
}
This plots the 100 confidence intervals for \(\beta_1\). The true \(\beta_1\) value is represented by a horizontal line (even though \(\beta_1\) is not known).
tibble(
ci_low = ci[,1],
ci_high = ci[,2],
n = 1:N
) |>
ggplot(
mapping = aes(x = n, xend = n, y = ci_low, yend = ci_high)
) +
geom_segment() +
geom_hline(yintercept = .12) +
labs(
x = "Sample number",
y = "Confidence interval for beta1"
)
“95% confidence” represents that in the long run, 95% of the intervals will contain \(\beta_1\). So 95% confidence refers to the confidence interval making process, not the particular interval that we are looking at.
Confidence intervals for \(\mu\)
(See Section 7.2.3)
ggplot(
data = augment(fit, type.predict = "response"),
mapping = aes(x = Eucs, y = Minerab)
) +
geom_jitter(width = .3, height = .3, alpha = .5) +
geom_line(aes(y = .fitted)) +
labs(
x = "Number of Eucalyptus trees",
y = "Number of Noisy Miner birds"
)
In our data context, \(\mu\) = mean
count of noisy miners on all transects with a particular number of
Eucs
(eucalyptus trees). (The line in the graph above is
\(\hat{\mu}\).) According to our
model,
\[\log(\mu) = \beta_0 + \beta_1 Eucs\]
Unfortunately, the augment()
function doesn’t return
these confidence intervals directly.
Method:
Make a confidence interval for log(mu)
Apply
exp()
to both endpoints to get a confidence interval for \(\mu\).
To do #1, we’ll use the notation \(\eta\) (eta) for \(\log(\mu)\) since it is shorter. The confidence interval for \(\eta\) is
\[\hat{\eta} \pm z^\ast SE_{\hat{\eta}}\]
(Note: this is the same form as our earlier CI for \(\hat{\beta}_1\).)
We can get \(\hat{\eta}\) and its
standard error from augment()
. We need to specify
se_fit = TRUE
to have it output the standard errors.
## # A tibble: 31 × 9
## Minerab Eucs .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 2 -0.648 0.260 -1.02 0.0354 1.49 0.00994 -1.04
## 2 0 10 0.264 0.175 -1.61 0.0398 1.47 0.0281 -1.65
## 3 3 16 0.947 0.125 0.255 0.0406 1.50 0.00151 0.261
## 4 2 20 1.40 0.110 -1.14 0.0491 1.49 0.0285 -1.17
## 5 8 19 1.29 0.112 1.98 0.0454 1.45 0.131 2.02
## 6 1 18 1.18 0.115 -1.46 0.0430 1.48 0.0364 -1.49
## 7 8 12 0.492 0.156 3.56 0.0399 1.34 0.536 3.63
## 8 5 16 0.947 0.125 1.33 0.0406 1.48 0.0501 1.36
## 9 0 3 -0.534 0.249 -1.08 0.0363 1.49 0.0114 -1.10
## 10 4 12 0.492 0.156 1.56 0.0399 1.47 0.0740 1.59
## # ℹ 21 more rows
Note that we are NOT using the type.predict = "response"
argument as we usually have.
Without
type.predict = "response"
, we get fitted values, etc. for log(mu) (which is what we want here)With
type.predict = "response"
, we get fitted values, etc. for mu
The variable .fitted
gives \(\hat{\eta}\), the estimate of \(\log(\mu)\), and .se.fit
gives
its standard error.
To focus on a single observation of x, we can use the
newdata
argument. Let’s focus on when
Eucs = 15
.
## # A tibble: 1 × 3
## Eucs .fitted .se.fit
## <dbl> <dbl> <dbl>
## 1 15 0.834 0.132
The \(z^\ast\) value is calculated in the same way as before. Let’s use R as a calculator to find \(z^\ast\) for a 99% confidence interval.
The 99% interval for \(\log(\mu)\) for Eucs = 15 is
## [1] 0.493301
## [1] 1.173717
The 99% interval for \(\mu\) for Eucs = 15 is
## [1] 1.637713
## [1] 3.23399
Interpretation: We are 99% confident that the population mean number of noisy miner birds on land plots with 15 eucalyptus trees is between 1.64 and 3.24.
Plot with 95% confidence intervals for \(\mu\) around the line for \(\hat{\mu}\):
augment(fit, se_fit = TRUE) |>
ggplot(
mapping = aes(x = Eucs)
) +
geom_jitter(aes(y = Minerab)) + # points with jitter
geom_line(aes(y = exp(.fitted))) + # line for mu-hat
geom_ribbon(aes( #shaded region for CI
ymin = exp(.fitted - qnorm(.975)* .se.fit),
ymax = exp(.fitted + qnorm(.975)* .se.fit)
), alpha = .2)