Notes 13

Poisson Counts | Part 3: Modeling Rates

Setup

Loading packages

library(tidyverse)
library(GLMsData)
library(broom) # for tidy, augment, glance
library(statmod) # for qresid()

Section 10.3 - Modeling Rates

In theory, the possible values of a Poisson response are the unbounded counts 0, 1, 2, … However, the Poisson model can also be used when each count response has an upper bound, provided that upper bound is very large.

“For such applications, the maximum number of events is usually representative of some population, and the response can be usefully viewed as a rate rather than just as a count.

“The size of each population needs to be specified to make comparisons meaningful. For example, consider comparing the number of people with a certain disease in various cities. The number of cases in each city may be useful information for planning purposes. However, quoting just the number of people with the disease in each city is an unfair comparison, as some cities have a far larger population than others. Comparing the number of people with the disease per unit of population (for example, per thousand people) is a fairer comparison. That is, the disease rate is often more suitable for modelling than the actual number of people with the disease.

“In principle, rates can be treated as proportions, and analysed using binomial GLMs, but Poisson GLMs are more convenient when the populations are large and the rates are relatively small, less than 1% say.”

Example 10.1 - danishlc data

Number of incidents of lung cancer from 1968 to 1971 in four Danish cities for different age groups.

data(danishlc)
tibble(danishlc)
## # A tibble: 24 × 4
##    Cases   Pop Age   City      
##    <int> <int> <fct> <fct>     
##  1    11  3059 40-54 Fredericia
##  2    11   800 55-59 Fredericia
##  3    11   710 60-64 Fredericia
##  4    10   581 65-69 Fredericia
##  5    11   509 70-74 Fredericia
##  6    10   605 >74   Fredericia
##  7    13  2879 40-54 Horsens   
##  8     6  1083 55-59 Horsens   
##  9    15   923 60-64 Horsens   
## 10    10   834 65-69 Horsens   
## # ℹ 14 more rows

Looking at Fredericia:

  • The number of cases of lung cancer in each age group is remarkably similar

  • However, using the number of cases does not accurately reflect the information in the data, because about five times as many people are in the 40-54 age group as in the >74 age group (see the sketch below).
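A quick sketch (not part of the original notes) that backs this up for Fredericia: the case counts barely change across the age groups, while the populations differ by roughly a factor of five.

# Fredericia only: Cases are similar across age groups, but Pop for the
# 40-54 group is about five times Pop for the >74 group.
danishlc |>
  filter(City == "Fredericia") |>
  select(Age, Cases, Pop)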

It’s better to consider the rate of lung cancer, say the number of cases per 1000 population. Let’s calculate this rate:

danishlc <- danishlc |> 
  mutate(rate1k = Cases / Pop * 1000) 

Plot of rates at different ages for different cities.

ggplot(
  data = danishlc,
  mapping = aes(x = Age, y = rate1k, color = City, group = City)
) + 
  geom_point() + 
  geom_line() + 
  labs(y = "Cases per 1000")

Note that the default alphabetical ordering of the Age variable puts the oldest age group first. We can force the order of the levels with the factor() function; after releveling, re-running the plot code above shows the age groups in increasing order on the x-axis.

danishlc <- danishlc |> 
  mutate(Age = factor(Age, levels = c("40-54", "55-59", "60-64", "65-69", "70-74", ">74")))

Interpreting the plot: What are the effects of Age and City?

  • Age: the rate of lung cancer increases with age

  • City: there is no clear, consistent City effect across the age groups (see the table sketch after this list)
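To back up this reading of the plot, here is a small sketch (not in the original notes) that lays the rates out as a table with one column per city:

# Rates per 1000 with one column per city, rounded to 2 decimal places,
# for a quick comparison of cities within each age group.
danishlc |>
  select(Age, City, rate1k) |>
  pivot_wider(names_from = City, values_from = rate1k) |>
  mutate(across(where(is.numeric), ~ round(.x, 2)))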

Writing the model in notation

  • \(y_i\) = observed number of lung cancers in city/age group \(i\)

  • \(T_i\) = population of group \(i\)

  • observed rate = \(y_i / T_i\), expected rate = \(\mu_i / T_i\)

Using log link function, model is

  1. Random component: \(y \sim Poisson(\mu)\)

  2. Systematic component: \(\log(\mu/T)\) = linear predictor, where “linear predictor” is \(\beta_0 + \beta_1 x_1 + ...\)

To fit the model, we want \(\log(\mu)\) on the left side of #2 by itself. Using the property

\[\log(\mu/T) = \log(\mu) - \log(T)\]

we can write the systematic component (#2) as

  2. \(\log(\mu) = \log(T)\) + linear predictor

Note that the beta parameters need to be estimated, but no parameters need to be estimated for \(\log(T)\): the values of \(T\) are simply the populations given in the data.

A term like \(\log(T)\) that doesn’t require parameters to be estimated is called an offset.

Age only model

A model using Age only (not City) as an explanatory variable is fit as:

fit_age_only <- glm(
  Cases ~ offset(log(Pop)) + Age, 
  family = poisson(link = "log"), 
  data = danishlc
)
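As an aside not in the original notes: glm() also accepts an offset argument, so the same model can be specified without writing offset() into the formula. A minimal sketch (the name fit_age_only_alt is just for illustration):

fit_age_only_alt <- glm(
  Cases ~ Age,                     # offset no longer in the formula
  family = poisson(link = "log"), 
  data = danishlc,
  offset = log(Pop)                # offset supplied as an argument instead
)
# coef(fit_age_only_alt) should match coef(fit_age_only)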

Estimated beta coefficients:

tidy(fit_age_only)
## # A tibble: 6 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    -5.86     0.174    -33.7  1.30e-248
## 2 Age55-59        1.08     0.248      4.36 1.29e-  5
## 3 Age60-64        1.50     0.231      6.49 8.66e- 11
## 4 Age65-69        1.75     0.229      7.64 2.22e- 14
## 5 Age70-74        1.85     0.235      7.85 4.00e- 15
## 6 Age>74          1.41     0.250      5.63 1.80e-  8

About the terms of the model:

  • Note that the Age variable created 5 terms. Each of these is a dummy variable.

  • How many categories of Age are in the dataset? 6. So the number of dummy variables created is one less than the number of categories.

  • The category that doesn’t get a dummy variable is called the reference category. In this case, the reference category is 40-54. In general, R makes the first category (in level order) the reference category. (A sketch of the dummy coding appears after this list.)
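To see the dummy coding explicitly, here is a small sketch (not part of the original notes) using model.matrix(), which builds the design matrix that glm() uses internally:

# Design matrix for the Age-only model: the first column is the intercept,
# the other five columns are the 0/1 dummy variables.  Rows for the
# reference category (40-54) have all five dummies equal to 0.
model.matrix(~ Age, data = danishlc) |> head()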

Calculating predicted rates

Recall: The systematic component is written as \(\log(\mu) = \log(T)\) + linear predictor, or in terms of rates:

\(\log(\mu/T)\) = linear predictor

Plugging the beta estimates into the linear predictor gives (where \(I(\cdot)\) equals 1 if the observation is in that age group and 0 otherwise):

\[\log(\mu/T) = -5.86 + 1.08\, I(\text{Age55-59}) + \dots + 1.41\, I(\text{Age>74})\]

Let’s calculate the predicted rate for the different age groups. (Note: we’ll use the pull() function here, which does the same thing as the $ operator for selecting a variable from a dataset but works better with “piping style” code.)

  • For the 40-54 age group:
exp(-5.86) #plugging in round numbers
## [1] 0.002851244
exp( #retrieving exact numbers
  tidy(fit_age_only) |>
    filter(term == "(Intercept)") |>
    pull(estimate)
)
## [1] 0.002844828
  • For the 55-59 age group:
exp(-5.86 + 1.08) #plugging in round numbers
## [1] 0.008395999
exp( #retrieving exact numbers
  tidy(fit_age_only) |>
    filter(term == "(Intercept)") |>
    pull(estimate) + 
  tidy(fit_age_only) |>
    filter(term == "Age55-59") |>
    pull(estimate)
)
## [1] 0.008396746
  • For the 60-64 age group:
exp(
  tidy(fit_age_only) |>
    filter(term == "(Intercept)") |>
    pull(estimate) + 
  tidy(fit_age_only) |>
    filter(term == "Age60-64") |>
    pull(estimate)
)
## [1] 0.01277101
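A more compact sketch (not in the original notes) of the same calculations, pulling the estimates from the named coefficient vector returned by coef():

# The same predicted rates, using coef() instead of tidy() |> filter() |> pull().
b <- coef(fit_age_only)
exp(b[["(Intercept)"]])                    # ages 40-54, approx 0.00284
exp(b[["(Intercept)"]] + b[["Age55-59"]])  # ages 55-59, approx 0.00840
exp(b[["(Intercept)"]] + b[["Age60-64"]])  # ages 60-64, approx 0.01277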

Interpreting beta coefficients

Interpret beta coefficient estimate for Age55-59 = 1.08

When interpreting the beta coefficient estimate for Age55-59 = 1.08, we ask what impact this term has on the predicted rate.

Note: when calculating predicted rates, the Age55-59 = 1.08 coefficient is the only difference in the calculation between two age groups: 40-54 and 55-59. It is part of the calculation for age group 55-59 but not part of the calculation for age group 40-54.

So what is the change between the predicted rates for ages 40-54 and 55-59? (This is the interpretation of the beta coefficient estimate for Age55-59 = 1.08.)

The rate (cases per 1000) of lung cancer for ages 55-59 is exp(1.08) ≈ 2.94 times the rate of lung cancer for ages 40-54.

Interpret beta coefficient estimate for Age60-64 = 1.50

The rate (cases per 1000) of lung cancer for ages 60-64 is exp(1.50) ≈ 4.48 times the rate of lung cancer for ages 40-54.
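These rate ratios can also be read directly from the fitted model by exponentiating the coefficients; a quick sketch (not in the original notes):

# Exponentiated coefficients are rate ratios relative to the reference
# category (ages 40-54): about 2.95 and 4.49, consistent with the
# exp(1.08) and exp(1.50) calculations above that use rounded estimates.
exp(coef(fit_age_only)[c("Age55-59", "Age60-64")])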

Reference category

Why do you think the category for which there is no dummy variable (Age = "40-54" in this case) is called the reference category?

Because all of the coefficient interpretations are made in reference to it: each estimated coefficient compares an age group to the 40-54 group.

Calculating predicted rates with augment() function

In our glm() function, we specified the model as

\[\log(\mu) = \log(T) + linear\ predictor\]

When we use the augment() function,

  • The default .fitted values are the “link” values \(\log(\hat{\mu})\)

  • Previously we used the type.predict = "response" option so .fitted was \(\hat{\mu}\)

But neither of these is what we want. We want the predicted rate \(\hat{\mu}/T\). So we need to separately divide .fitted by Pop to get the predicted rates.

augment(fit_age_only, type.predict = "response") |> 
  mutate(pop = `offset(log(Pop))` |> exp()) |>  # recover the population from the offset column
  mutate(pred_rate = .fitted / pop) |>          # predicted rate = mu-hat / population
  select(pred_rate)
## # A tibble: 24 × 1
##    pred_rate
##        <dbl>
##  1   0.00284
##  2   0.00840
##  3   0.0128 
##  4   0.0164 
##  5   0.0180 
##  6   0.0116 
##  7   0.00284
##  8   0.00840
##  9   0.0128 
## 10   0.0164 
## # ℹ 14 more rows

Let’s check that these values match those we calculated from the beta estimates. We will use the distinct() function so we get only one row per age group.

augment(fit_age_only, type.predict = "response") |> 
  mutate(pop = exp(`offset(log(Pop))`)) |>
  mutate(pred_rate = .fitted / pop) |> 
  distinct(Age, .keep_all = TRUE) |> 
  select(Age, pred_rate)
## # A tibble: 6 × 2
##   Age   pred_rate
##   <fct>     <dbl>
## 1 40-54   0.00284
## 2 55-59   0.00840
## 3 60-64   0.0128 
## 4 65-69   0.0164 
## 5 70-74   0.0180 
## 6 >74     0.0116
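As a final sketch (not part of the original notes), we can line the predicted rates up against the observed rates computed earlier. This version assumes augment() is given the full data via its newdata argument, so the City and rate1k columns are carried along:

# Observed rate per 1000 (rate1k) next to the predicted rate per 1000.
# Using newdata = danishlc keeps all of the original columns.
augment(fit_age_only, newdata = danishlc, type.predict = "response") |>
  mutate(pred_rate1k = .fitted / Pop * 1000) |>
  select(City, Age, rate1k, pred_rate1k)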