Poisson regression for rates

References

Regression Models for Count Data in R: http://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf
Applied Regression Analysis and Multivariable Methods, 4th edition
Introductory Statistics with R (ISwR), 2nd edition
UCLA Data Analysis Examples: http://www.ats.ucla.edu/stat/r/dae/poissonreg.htm

Poisson regression

Poisson regression is used for outcome variables that only take positive values such as counts and rates. The model assumes Var(outcome) = E(outcome) conditional on predictors.

Create rate ratio calculation function

glm.RR <- function(GLM.RESULT, digits = 2) {

    if (GLM.RESULT$family$family == "binomial") {
        LABEL <- "OR"
    } else if (GLM.RESULT$family$family == "poisson") {
        LABEL <- "RR"
    } else {
        stop("Not logistic or Poisson model")
    }

    COEF      <- stats::coef(GLM.RESULT)
    CONFINT   <- stats::confint(GLM.RESULT)
    TABLE     <- cbind(coef=COEF, CONFINT)
    TABLE.EXP <- round(exp(TABLE), digits)

    colnames(TABLE.EXP)[1] <- LABEL

    TABLE.EXP
}

Skin cancer example

The example was taken from Applied Regression Analysis and Multivariable Methods, 4th Edition.

Skin cancer dataset from “Non-melanoma skin cancer among Caucasians in four areas of the United States.” (www.pubmed.org/4422100).

Dataset is availabe at http://academic.cengage.com/resource_uploads/downloads/0495384968_88614.zip (melanom.txt). As it is small it was included below.

Prepare dataset

## Create a dataset manually
nonmel <- read.table(header = TRUE,
                     text = "
   cases city u1 u2 u3 u4 u5 u6 u7      n
1      1    0  1  0  0  0  0  0  0 172675
2     16    0  0  1  0  0  0  0  0 123065
3     30    0  0  0  1  0  0  0  0  96216
4     71    0  0  0  0  1  0  0  0  92051
5    102    0  0  0  0  0  1  0  0  72159
6    130    0  0  0  0  0  0  1  0  54722
7    133    0  0  0  0  0  0  0  1  32185
8     40    0  0  0  0  0  0  0  0   8328
9      4    1  1  0  0  0  0  0  0 181343
10    38    1  0  1  0  0  0  0  0 146207
11   119    1  0  0  1  0  0  0  0 121374
12   221    1  0  0  0  1  0  0  0 111353
13   259    1  0  0  0  0  1  0  0  83004
14   310    1  0  0  0  0  0  1  0  55932
15   226    1  0  0  0  0  0  0  1  29007
16    65    1  0  0  0  0  0  0  0   7583
")

## Create age.range variable and city variable
nonmel <- within(nonmel, {
    age.range <- rep(c("15_24","25_34","35_44","45_54","55_64","65_74","75_84","85+"), 2)
    age.range <- factor(age.range)
    age.range <- relevel(age.range, ref = "85+")

    city <- factor(city, 0:1, c("Minneapolis", "Dallas"))
})

## rop unnecessary columns
nonmel <- nonmel[c("cases","n","city","age.range")]

## Check data
nonmel

   cases      n        city age.range
1      1 172675 Minneapolis     15_24
2     16 123065 Minneapolis     25_34
3     30  96216 Minneapolis     35_44
4     71  92051 Minneapolis     45_54
5    102  72159 Minneapolis     55_64
6    130  54722 Minneapolis     65_74
7    133  32185 Minneapolis     75_84
8     40   8328 Minneapolis       85+
9      4 181343      Dallas     15_24
10    38 146207      Dallas     25_34
11   119 121374      Dallas     35_44
12   221 111353      Dallas     45_54
13   259  83004      Dallas     55_64
14   310  55932      Dallas     65_74
15   226  29007      Dallas     75_84
16    65   7583      Dallas       85+

It describes the number of cases of non-melanoma skin cancer among residents of two cities in each age category.

Poisson regression modeling

The outcome of interest here is rates (cases/n), however, glm() can only take counts as outcome for Poisson regression.

By transforming the equation.

\( log(cases / n) = \beta_0 + \sum_i {\beta_i X_i} \)

\( log(cases) - log(n) = \beta_0 + \sum_i {\beta_i X_i} \)

\( log(cases) = log(n) + \beta_0 + \sum_i {\beta_i X_i} \)

Thus, it can be modeled by including a log(n) term with coeffcient of 1 (no \( \beta \) for this term). This is called an offset and, modeled with offset() in R.

These two are equivalent

## Including offset(log(n)) in the right hand side
model.1 <- glm(cases ~ city + age.range + offset(log(n)), family = poisson(link = "log"), data = nonmel)
## Using the offset option
model.1 <- glm(cases ~ city + age.range, offset = log(n), family = poisson(link = "log"), data = nonmel)

## Results from regular Poisson
summary(model.1)


Call:
glm(formula = cases ~ city + age.range, family = poisson(link = "log"), 
    data = nonmel, offset = log(n))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5060  -0.4857   0.0164   0.3693   1.2476  

Coefficients:
               Estimate Std. Error z value  Pr(>|z|)    
(Intercept)     -5.4834     0.1037  -52.89   < 2e-16 ***
cityDallas       0.8039     0.0522   15.40   < 2e-16 ***
age.range15_24  -6.1742     0.4577  -13.49   < 2e-16 ***
age.range25_34  -3.5440     0.1675  -21.16   < 2e-16 ***
age.range35_44  -2.3268     0.1275  -18.25   < 2e-16 ***
age.range45_54  -1.5790     0.1138  -13.87   < 2e-16 ***
age.range55_64  -1.0869     0.1109   -9.80   < 2e-16 ***
age.range65_74  -0.5288     0.1086   -4.87 0.0000011 ***
age.range75_84  -0.1157     0.1109   -1.04       0.3    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2789.6810  on 15  degrees of freedom
Residual deviance:    8.2585  on  7  degrees of freedom
AIC: 120.5

Number of Fisher Scoring iterations: 4

glm.RR(model.1, 3)

                  RR 2.5 % 97.5 %
(Intercept)    0.004 0.003  0.005 # The risk of cancer for oldest in Minn. is 0.004 per person (4/1000 people)
cityDallas     2.234 2.018  2.477 # Compared to Minn. , risk ratio of cancer is 2.234 within levels of covariates
age.range15_24 0.002 0.001  0.005 # Compared to oldest, risk ratio of cancer is 0.002 within levels of covariates
age.range25_34 0.029 0.021  0.040 # Compared to oldest, risk ratio of cancer is 0.029 within levels of covariates
age.range35_44 0.098 0.076  0.126 # Compared to oldest, risk ratio of cancer is 0.098 within levels of covariates
age.range45_54 0.206 0.166  0.259 # Compared to oldest, risk ratio of cancer is 0.206 within levels of covariates
age.range55_64 0.337 0.272  0.421 # Compared to oldest, risk ratio of cancer is 0.337 within levels of covariates
age.range65_74 0.589 0.478  0.733 # Compared to oldest, risk ratio of cancer is 0.589 within levels of covariates
age.range75_84 0.891 0.720  1.112 # Compared to oldest, risk ratio of cancer is 0.891 within levels of covariates

Prediction

## Predict case per person (n = 1) for oldest people in the Minneapolis
exp(predict(model.1, newdata = data.frame(city = "Minneapolis", age.range = "85+", n = 1)))

       1 
0.004155


## Create dataset to predict for
newdat1 <- nonmel[c("city","age.range")]
newdat2 <- newdat1

## Predicted number of cases per person
newdat1$n <- 1
nonmel$pred.cases.per.one <- exp(predict(model.1, newdat1))

## Predicted number of cases per one thousand persons
newdat2$n <- 1000
nonmel$pred.cases.per.thousand <- exp(predict(model.1, newdat2))

## Predicted number of cases per actual population
nonmel$pred.cases <- exp(predict(model.1))

## Show prediction results
nonmel

   cases      n        city age.range pred.cases.per.one pred.cases.per.thousand pred.cases
1      1 172675 Minneapolis     15_24        0.000008653                0.008653      1.494
2     16 123065 Minneapolis     25_34        0.000120073                0.120073     14.777
3     30  96216 Minneapolis     35_44        0.000405559                0.405559     39.021
4     71  92051 Minneapolis     45_54        0.000856708                0.856708     78.861
5    102  72159 Minneapolis     55_64        0.001401347                1.401347    101.120
6    130  54722 Minneapolis     65_74        0.002448700                2.448700    133.998
7    133  32185 Minneapolis     75_84        0.003701282                3.701282    119.126
8     40   8328 Minneapolis       85+        0.004155093                4.155093     34.604
9      4 181343      Dallas     15_24        0.000019333                0.019333      3.506
10    38 146207      Dallas     25_34        0.000268272                0.268272     39.223
11   119 121374      Dallas     35_44        0.000906114                0.906114    109.979
12   221 111353      Dallas     45_54        0.001914086                1.914086    213.139
13   259  83004      Dallas     55_64        0.003130936                3.130936    259.880
14   310  55932      Dallas     65_74        0.005470969                5.470969    306.002
15   226  29007      Dallas     75_84        0.008269529                8.269529    239.874
16    65   7583      Dallas       85+        0.009283448                9.283448     70.396

Quasi-Poisson regression with an estimated dispersion parameter

## quasi-Poisson to allow the scale parameter to change from 1. Show the dispersion parameter.
model.1q <- glm(cases ~ city + age.range, offset = log(n), family = quasipoisson(link = "log"), data = nonmel)
summary(model.1q)


Call:
glm(formula = cases ~ city + age.range, family = quasipoisson(link = "log"), 
    data = nonmel, offset = log(n))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5060  -0.4857   0.0164   0.3693   1.2476  

Coefficients:
               Estimate Std. Error t value      Pr(>|t|)    
(Intercept)     -5.4834     0.1117  -49.08 0.00000000038 ***
cityDallas       0.8039     0.0563   14.29 0.00000195327 ***
age.range15_24  -6.1742     0.4932  -12.52 0.00000478575 ***
age.range25_34  -3.5440     0.1805  -19.64 0.00000022172 ***
age.range35_44  -2.3268     0.1373  -16.94 0.00000061180 ***
age.range45_54  -1.5790     0.1227  -12.87 0.00000396332 ***
age.range55_64  -1.0869     0.1195   -9.10 0.00003983884 ***
age.range65_74  -0.5288     0.1170   -4.52        0.0027 ** 
age.range75_84  -0.1157     0.1195   -0.97        0.3656    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 1.161)

    Null deviance: 2789.6810  on 15  degrees of freedom
Residual deviance:    8.2585  on  7  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

The quasi-Poisson method can be used to estimate the dispersion parameter, i.e., degree of overdispersion. It is 1.16, and almost equal to 1, which is the value used in regular Poisson regression. Thus, the Poisson model is appears ok. If it is large, the standard error estimation from the quasi-Poisson model should be used. Here because of the near one dispersion parameter, the SE estimates are almost identical in both methods.

link function: \( g(\mu) = ln(\mu) \)
variance function: \( V(\mu) = \theta\mu \) where \( \theta \) is estimated unlike Poisson regression where it is fixed at 1.

Robust sandwich covariance matrix estimator

## Load sandwich package for robust estimator
library(sandwich)
## Load lmtest package for coeftest
library(lmtest)
## Poisson model with SE estimated via robust variance estimator
coeftest(model.1, vcov = sandwich)


z test of coefficients:

               Estimate Std. Error z value         Pr(>|z|)    
(Intercept)     -5.4834     0.0798  -68.72          < 2e-16 ***
cityDallas       0.8039     0.0406   19.82          < 2e-16 ***
age.range15_24  -6.1742     0.1575  -39.20          < 2e-16 ***
age.range25_34  -3.5440     0.0793  -44.69          < 2e-16 ***
age.range35_44  -2.3268     0.1126  -20.66          < 2e-16 ***
age.range45_54  -1.5790     0.0821  -19.23          < 2e-16 ***
age.range55_64  -1.0869     0.0726  -14.97          < 2e-16 ***
age.range65_74  -0.5288     0.0737   -7.17 0.00000000000075 ***
age.range75_84  -0.1157     0.0909   -1.27              0.2    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Another way to overcome the problem with overdispersion (underestimation of the variance) in Poisson regression is to use the robust sandwich covariance matrix estimator.

Negative binomial regression

## Load MASS
library(MASS)
## Negative binomial regression
model.1nb <- glm.nb(cases ~ city + age.range + offset(log(n)), data = nonmel)
summary(model.1nb)


Call:
glm.nb(formula = cases ~ city + age.range + offset(log(n)), data = nonmel, 
    init.theta = 1600830.821, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5059  -0.4857   0.0164   0.3693   1.2476  

Coefficients:
               Estimate Std. Error z value  Pr(>|z|)    
(Intercept)     -5.4834     0.1037  -52.89   < 2e-16 ***
cityDallas       0.8039     0.0522   15.40   < 2e-16 ***
age.range15_24  -6.1742     0.4577  -13.49   < 2e-16 ***
age.range25_34  -3.5440     0.1675  -21.16   < 2e-16 ***
age.range35_44  -2.3268     0.1275  -18.25   < 2e-16 ***
age.range45_54  -1.5790     0.1138  -13.87   < 2e-16 ***
age.range55_64  -1.0869     0.1109   -9.80   < 2e-16 ***
age.range65_74  -0.5288     0.1086   -4.87 0.0000011 ***
age.range75_84  -0.1157     0.1110   -1.04       0.3    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1600831) family taken to be 1)

    Null deviance: 2789.499  on 15  degrees of freedom
Residual deviance:    8.258  on  7  degrees of freedom
AIC: 122.5

Number of Fisher Scoring iterations: 1

              Theta:  1600831 
          Std. Err.:  34893356 
Warning while fitting theta: iteration limit reached 

 2 x log-likelihood:  -102.5

Yet another way to account for overdispersion is using negative binomial regression, which models the variance function as follows. In this specific case the \( \theta \) is very large, and it does not converge within iteration limits.

link function: \( g(\mu) = ln(\mu) \)
variance function: \( V(\mu) = \mu + \mu^2 / \theta \)

Lung cancer example from Introductory Statistics with R (ISwR) Book

E.B. Andersen (1977), Multiplicative Poisson models with unequal cell rates, Scandinavian Journal of Statistics, 4:153-158.

Data preparation

## Load ISwR package
library(ISwR)
## Load data
data(eba1977)
eba1977

         city   age  pop cases
1  Fredericia 40-54 3059    11
2     Horsens 40-54 2879    13
3     Kolding 40-54 3142     4
4       Vejle 40-54 2520     5
5  Fredericia 55-59  800    11
6     Horsens 55-59 1083     6
7     Kolding 55-59 1050     8
8       Vejle 55-59  878     7
9  Fredericia 60-64  710    11
10    Horsens 60-64  923    15
11    Kolding 60-64  895     7
12      Vejle 60-64  839    10
13 Fredericia 65-69  581    10
14    Horsens 65-69  834    10
15    Kolding 65-69  702    11
16      Vejle 65-69  631    14
17 Fredericia 70-74  509    11
18    Horsens 70-74  634    12
19    Kolding 70-74  535     9
20      Vejle 70-74  539     8
21 Fredericia   75+  605    10
22    Horsens   75+  782     2
23    Kolding   75+  659    12
24      Vejle   75+  619     7

Lung cancer incidence in four Danish cities 1968-1971

Description:
     This data set contains counts of incident lung cancer cases and
     population size in four neighbouring Danish cities by age group.

Format:
     A data frame with 24 observations on the following 4 variables:
     city a factor with levels Fredericia, Horsens, Kolding,
          and Vejle.
     age a factor with levels 40-54, 55-59, 60-64, 65-69,
          70-74, and 75+.
     pop a numeric vector, number of inhabitants.
     cases a numeric vector, number of lung cancer cases.

Poisson model

## Fit Poisson model
model.2 <- glm(cases ~ city + age, offset = log(pop), family = poisson(link = "log"), data = eba1977)
summary(model.2)


Call:
glm(formula = cases ~ city + age, family = poisson(link = "log"), 
    data = eba1977, offset = log(pop))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6357  -0.6730  -0.0344   0.3726   1.8527  

Coefficients:
            Estimate Std. Error z value          Pr(>|z|)    
(Intercept)   -5.632      0.200  -28.12           < 2e-16 ***
cityHorsens   -0.330      0.182   -1.82             0.069 .  
cityKolding   -0.372      0.188   -1.98             0.048 *  
cityVejle     -0.272      0.188   -1.45             0.147    
age55-59       1.101      0.248    4.43 0.000009230223483 ***
age60-64       1.519      0.232    6.56 0.000000000055276 ***
age65-69       1.768      0.229    7.70 0.000000000000013 ***
age70-74       1.857      0.235    7.89 0.000000000000003 ***
age75+         1.420      0.250    5.67 0.000000014075139 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 129.908  on 23  degrees of freedom
Residual deviance:  23.447  on 15  degrees of freedom
AIC: 137.8

Number of Fisher Scoring iterations: 5


## Check dispersion parameter with quasi-Poisson regression
model.2q <- glm(cases ~ city + age, offset = log(pop), family = quasipoisson(link = "log"), data = eba1977)
summary(model.2q)


Call:
glm(formula = cases ~ city + age, family = quasipoisson(link = "log"), 
    data = eba1977, offset = log(pop))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6357  -0.6730  -0.0344   0.3726   1.8527  

Coefficients:
            Estimate Std. Error t value         Pr(>|t|)    
(Intercept)   -5.632      0.246  -22.93 0.00000000000043 ***
cityHorsens   -0.330      0.223   -1.48          0.15884    
cityKolding   -0.372      0.230   -1.61          0.12756    
cityVejle     -0.272      0.230   -1.18          0.25561    
age55-59       1.101      0.305    3.62          0.00254 ** 
age60-64       1.519      0.284    5.35 0.00008166545653 ***
age65-69       1.768      0.281    6.28 0.00001469556390 ***
age70-74       1.857      0.289    6.43 0.00001125399928 ***
age75+         1.420      0.307    4.63          0.00033 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 1.504)

    Null deviance: 129.908  on 23  degrees of freedom
Residual deviance:  23.447  on 15  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

The dispersion parameter is 1.5, and is not so large. The Poisson model is probably acceptable. However, the significance of the main effect of living in Kolding city is qualitatively different (significant in Poisson, and non-significant quasi-Poisson).

glm.RR(model.2, 3)

               RR 2.5 % 97.5 %
(Intercept) 0.004 0.002  0.005 # risk of lung cancer is 0.004 per person among 40-54 yrs in Fredericia.
cityHorsens 0.719 0.503  1.026 # risk ratio of lung cancer is 0.719 compared to Fredericia within levels of age
cityKolding 0.690 0.476  0.995 # risk ratio of lung cancer is 0.690 compared to Fredericia within levels of age
cityVejle   0.762 0.525  1.099 # risk ratio of lung cancer is 0.762 compared to Fredericia within levels of age
age55-59    3.007 1.843  4.901 # risk ratio of lung cancer is 3.007 compared to 40-54 yrs within levels of city
age60-64    4.566 2.907  7.236 # risk ratio of lung cancer is 4.566 compared to 40-54 yrs within levels of city
age65-69    5.857 3.748  9.249 # risk ratio of lung cancer is 5.857 compared to 40-54 yrs within levels of city
age70-74    6.404 4.043 10.212 # risk ratio of lung cancer is 6.404 compared to 40-54 yrs within levels of city
age75+      4.136 2.523  6.762 # risk ratio of lung cancer is 4.136 compared to 40-54 yrs within levels of city

Goodness of fit test If the residual deviance is close enough to the residual degrees of freedom, it is a good fit. It can be tested by Chi-squared test.

list(residual.deviance           = deviance(model.2),
     residual.degrees.of.freedom = df.residual(model.2),
     chisq.p.value               = pchisq(deviance(model.2), df.residual(model.2), lower = F)
     )

$residual.deviance
[1] 23.45

$residual.degrees.of.freedom
[1] 15

$chisq.p.value
[1] 0.07509

Robust sandwich covariance matrix estimator

## Poisson model with SE estimated via robust variance estimator
coeftest(model.2, vcov = sandwich)


z test of coefficients:

            Estimate Std. Error z value        Pr(>|z|)    
(Intercept)   -5.632      0.188  -29.93         < 2e-16 ***
cityHorsens   -0.330      0.187   -1.76           0.078 .  
cityKolding   -0.372      0.154   -2.41           0.016 *  
cityVejle     -0.272      0.107   -2.56           0.011 *  
age55-59       1.101      0.241    4.58 0.0000047610015 ***
age60-64       1.519      0.246    6.17 0.0000000006998 ***
age65-69       1.768      0.246    7.20 0.0000000000006 ***
age70-74       1.857      0.222    8.37         < 2e-16 ***
age75+         1.420      0.342    4.15 0.0000325995756 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Another way to overcome the problem with overdispersion (underestimation of the variance) in Poisson regression is to use the robust sandwich covariance matrix estimator.

Negative binomial regression

## Negative binomial regression
model.2nb <- glm.nb(cases ~ city + age + offset(log(pop)), data = eba1977)
summary(model.2nb)


Call:
glm.nb(formula = cases ~ city + age + offset(log(pop)), data = eba1977, 
    init.theta = 119797.75, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6357  -0.6729  -0.0344   0.3726   1.8526  

Coefficients:
            Estimate Std. Error z value          Pr(>|z|)    
(Intercept)   -5.632      0.200  -28.12           < 2e-16 ***
cityHorsens   -0.330      0.182   -1.82             0.069 .  
cityKolding   -0.372      0.188   -1.98             0.048 *  
cityVejle     -0.272      0.188   -1.45             0.147    
age55-59       1.101      0.248    4.43 0.000009236801490 ***
age60-64       1.519      0.232    6.56 0.000000000055373 ***
age65-69       1.768      0.229    7.70 0.000000000000013 ***
age70-74       1.857      0.235    7.89 0.000000000000003 ***
age75+         1.420      0.250    5.67 0.000000014090354 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(119798) family taken to be 1)

    Null deviance: 129.895  on 23  degrees of freedom
Residual deviance:  23.446  on 15  degrees of freedom
AIC: 139.8

Number of Fisher Scoring iterations: 1

              Theta:  119798 
          Std. Err.:  3824543 
Warning while fitting theta: iteration limit reached 

 2 x log-likelihood:  -119.8

link function: \( g(\mu) = ln(\mu) \)
variance function: \( V(\mu) = \mu + \mu^2 / \theta \)

Lung cancer deaths in British male physicians (Frome, 1983).

Applied Regression Analysis and Multivariable Methods, 4th Edition
www.pubmed.org/6652201
Available as lung.cancer dataset in SMPracticals package

library(SMPracticals)
data(lung.cancer)

A data frame with 63 observations on the following 4 variables. (?lung.cancer)

years.smok a factor giving the number of years smoking
cigarettes a factor giving cigarette consumption
Time man-years at risk
y number of deaths

head(lung.cancer, 15)

   years.smok cigarettes  Time y
1       15-19          0 10366 1
2       15-19        1-9  3121 0
3       15-19      10-14  3577 0
4       15-19      15-19  4317 0
5       15-19      20-24  5683 0
6       15-19      25-34  3042 0
7       15-19        35+   670 0
8       20-24          0  8162 0
9       20-24        1-9  2937 0
10      20-24      10-14  3286 1
11      20-24      15-19  4214 0
12      20-24      20-24  6385 1
13      20-24      25-34  4050 1
14      20-24        35+  1166 0
15      25-29          0  5969 0

Poisson regression

## Poisson
model.3 <- glm(y ~ years.smok + cigarettes,
                         offset = log(Time),
                         family = poisson(link = "log"), data = lung.cancer)    
summary(model.3)


Call:
glm(formula = y ~ years.smok + cigarettes, family = poisson(link = "log"), 
    data = lung.cancer, offset = log(Time))

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.833  -0.856  -0.381   0.424   2.176  

Coefficients:
                Estimate Std. Error z value     Pr(>|z|)    
(Intercept)      -12.578      1.148  -10.96      < 2e-16 ***
years.smok20-24    0.947      1.155    0.82      0.41220    
years.smok25-29    1.702      1.080    1.57      0.11528    
years.smok30-34    3.203      1.020    3.14      0.00170 ** 
years.smok35-39    3.242      1.024    3.17      0.00155 ** 
years.smok40-44    4.209      1.014    4.15 0.0000329906 ***
years.smok45-49    4.448      1.017    4.37 0.0000122579 ***
years.smok50-54    4.905      1.020    4.81 0.0000015245 ***
years.smok55-59    5.413      1.024    5.29 0.0000001242 ***
cigarettes1-9      1.220      0.707    1.72      0.08455 .  
cigarettes10-14    2.099      0.636    3.30      0.00097 ***
cigarettes15-19    2.309      0.633    3.65      0.00026 ***
cigarettes20-24    2.901      0.596    4.87 0.0000011134 ***
cigarettes25-34    3.116      0.595    5.24 0.0000001606 ***
cigarettes35+      3.606      0.605    5.96 0.0000000025 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 445.099  on 62  degrees of freedom
Residual deviance:  51.471  on 48  degrees of freedom
AIC: 201.3

Number of Fisher Scoring iterations: 6


## Quasi-Poisson
model.3q <- glm(y ~ years.smok + cigarettes,
                         offset = log(Time),
                         family = quasipoisson(link = "log"), data = lung.cancer)    
summary(model.3q)


Call:
glm(formula = y ~ years.smok + cigarettes, family = quasipoisson(link = "log"), 
    data = lung.cancer, offset = log(Time))

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.833  -0.856  -0.381   0.424   2.176  

Coefficients:
                Estimate Std. Error t value       Pr(>|t|)    
(Intercept)      -12.578      1.342   -9.37 0.000000000002 ***
years.smok20-24    0.947      1.351    0.70        0.48662    
years.smok25-29    1.702      1.264    1.35        0.18446    
years.smok30-34    3.203      1.193    2.68        0.00996 ** 
years.smok35-39    3.242      1.198    2.71        0.00939 ** 
years.smok40-44    4.209      1.186    3.55        0.00087 ***
years.smok45-49    4.448      1.190    3.74        0.00049 ***
years.smok50-54    4.905      1.193    4.11        0.00015 ***
years.smok55-59    5.413      1.198    4.52 0.000040376655 ***
cigarettes1-9      1.220      0.827    1.47        0.14680    
cigarettes10-14    2.099      0.744    2.82        0.00695 ** 
cigarettes15-19    2.309      0.740    3.12        0.00306 ** 
cigarettes20-24    2.901      0.697    4.16        0.00013 ***
cigarettes25-34    3.116      0.696    4.48 0.000046139733 ***
cigarettes35+      3.606      0.707    5.10 0.000005779992 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 1.368)

    Null deviance: 445.099  on 62  degrees of freedom
Residual deviance:  51.471  on 48  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 6

Rate ratios

glm.RR(model.3, 3)

                     RR  2.5 %  97.5 %
(Intercept)       0.000  0.000    0.00
years.smok20-24   2.578  0.330   52.12
years.smok25-29   5.483  0.936  103.58
years.smok30-34  24.604  5.210  439.82
years.smok35-39  25.591  5.348  459.10
years.smok40-44  67.278 14.583 1195.17
years.smok45-49  85.425 18.301 1522.31
years.smok50-54 134.941 28.597 2411.61
years.smok55-59 224.385 46.942 4024.10
cigarettes1-9     3.387  0.893   16.06
cigarettes10-14   8.159  2.663   35.40
cigarettes15-19  10.064  3.319   43.47
cigarettes20-24  18.190  6.669   74.90
cigarettes25-34  22.561  8.293   92.80
cigarettes35+    36.816 13.153  153.43

Robust sandwich covariance matrix estimator

## Poisson model with SE estimated via robust variance estimator
coeftest(model.3, vcov = sandwich)


z test of coefficients:

                Estimate Std. Error z value    Pr(>|z|)    
(Intercept)      -12.578      1.457   -8.63     < 2e-16 ***
years.smok20-24    0.947      1.118    0.85     0.39699    
years.smok25-29    1.702      1.160    1.47     0.14255    
years.smok30-34    3.203      1.075    2.98     0.00289 ** 
years.smok35-39    3.242      1.083    2.99     0.00275 ** 
years.smok40-44    4.209      1.074    3.92 0.000089147 ***
years.smok45-49    4.448      1.071    4.15 0.000032653 ***
years.smok50-54    4.905      1.078    4.55 0.000005334 ***
years.smok55-59    5.413      1.096    4.94 0.000000784 ***
cigarettes1-9      1.220      0.806    1.51     0.13025    
cigarettes10-14    2.099      0.673    3.12     0.00181 ** 
cigarettes15-19    2.309      0.698    3.31     0.00094 ***
cigarettes20-24    2.901      0.644    4.51 0.000006555 ***
cigarettes25-34    3.116      0.645    4.83 0.000001381 ***
cigarettes35+      3.606      0.650    5.55 0.000000028 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Another way to overcome the problem with overdispersion (underestimation of the variance) in Poisson regression is to use the robust sandwich covariance matrix estimator.

Negative binomial regression

## Negative binomial regression
model.3nb <- glm.nb(y ~ years.smok + cigarettes + offset(log(Time)), data = lung.cancer)
summary(model.3nb)


Call:
glm.nb(formula = y ~ years.smok + cigarettes + offset(log(Time)), 
    data = lung.cancer, init.theta = 5028.906523, link = log)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.833  -0.856  -0.381   0.424   2.176  

Coefficients:
                Estimate Std. Error z value     Pr(>|z|)    
(Intercept)      -12.578      1.148  -10.96      < 2e-16 ***
years.smok20-24    0.947      1.155    0.82      0.41224    
years.smok25-29    1.702      1.080    1.57      0.11531    
years.smok30-34    3.203      1.020    3.14      0.00170 ** 
years.smok35-39    3.242      1.024    3.17      0.00155 ** 
years.smok40-44    4.209      1.014    4.15 0.0000330267 ***
years.smok45-49    4.448      1.017    4.37 0.0000122755 ***
years.smok50-54    4.905      1.020    4.81 0.0000015254 ***
years.smok55-59    5.413      1.024    5.29 0.0000001243 ***
cigarettes1-9      1.220      0.707    1.72      0.08457 .  
cigarettes10-14    2.099      0.636    3.30      0.00097 ***
cigarettes15-19    2.309      0.633    3.65      0.00026 ***
cigarettes20-24    2.901      0.596    4.87 0.0000011162 ***
cigarettes25-34    3.116      0.595    5.24 0.0000001609 ***
cigarettes35+      3.606      0.605    5.96 0.0000000025 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(5029) family taken to be 1)

    Null deviance: 444.862  on 62  degrees of freedom
Residual deviance:  51.455  on 48  degrees of freedom
AIC: 203.3

Number of Fisher Scoring iterations: 1

              Theta:  5029 
          Std. Err.:  20348 
Warning while fitting theta: alternation limit reached 

 2 x log-likelihood:  -171.3

link function: \( g(\mu) = ln(\mu) \)
variance function: \( V(\mu) = \mu + \mu^2 / \theta \)

Prediction

## Create dataset to predict for
newdat1 <- lung.cancer[c("years.smok","cigarettes")]
newdat2 <- newdat1

## Predicted number of cases per person
newdat1$Time <- 1
lung.cancer$pred.cases.per.one <- exp(predict(model.3, newdat1))

## Predicted number of cases per one thousand persons
newdat2$Time <- 1000
lung.cancer$pred.cases.per.thousand <- exp(predict(model.3, newdat2))

## Predicted number of cases per actual population
lung.cancer$pred.cases <- exp(predict(model.3))

## Show prediction results
lung.cancer

   years.smok cigarettes  Time  y pred.cases.per.one pred.cases.per.thousand pred.cases
1       15-19          0 10366  1        0.000003445                0.003445    0.03572
2       15-19        1-9  3121  0        0.000011671                0.011671    0.03643
3       15-19      10-14  3577  0        0.000028111                0.028111    0.10055
4       15-19      15-19  4317  0        0.000034674                0.034674    0.14969
5       15-19      20-24  5683  0        0.000062673                0.062673    0.35617
6       15-19      25-34  3042  0        0.000077732                0.077732    0.23646
7       15-19        35+   670  0        0.000126847                0.126847    0.08499
8       20-24          0  8162  0        0.000008882                0.008882    0.07249
9       20-24        1-9  2937  0        0.000030086                0.030086    0.08836
10      20-24      10-14  3286  1        0.000072465                0.072465    0.23812
11      20-24      15-19  4214  0        0.000089384                0.089384    0.37666
12      20-24      20-24  6385  1        0.000161559                0.161559    1.03155
13      20-24      25-34  4050  1        0.000200380                0.200380    0.81154
14      20-24        35+  1166  0        0.000326989                0.326989    0.38127
15      25-29          0  5969  0        0.000018890                0.018890    0.11276
16      25-29        1-9  2288  0        0.000063987                0.063987    0.14640
17      25-29      10-14  2546  1        0.000154121                0.154121    0.39239
18      25-29      15-19  3185  0        0.000190105                0.190105    0.60549
19      25-29      20-24  5483  1        0.000343609                0.343609    1.88401
20      25-29      25-34  4290  4        0.000426176                0.426176    1.82830
21      25-29        35+  1482  0        0.000695452                0.695452    1.03066
22      30-34          0  4496  0        0.000084774                0.084774    0.38114
23      30-34        1-9  2015  0        0.000287157                0.287157    0.57862
24      30-34      10-14  2219  2        0.000691651                0.691651    1.53477
25      30-34      15-19  2560  4        0.000853139                0.853139    2.18404
26      30-34      20-24  4687  6        0.001542020                1.542020    7.22745
27      30-34      25-34  4268  9        0.001912560                1.912560    8.16281
28      30-34        35+  1580  4        0.003120993                3.120993    4.93117
29      35-39          0  3512  0        0.000088174                0.088174    0.30967
30      35-39        1-9  1648  1        0.000298675                0.298675    0.49222
31      35-39      10-14  1826  0        0.000719392                0.719392    1.31361
32      35-39      15-19  1893  0        0.000887358                0.887358    1.67977
33      35-39      20-24  3646  5        0.001603870                1.603870    5.84771
34      35-39      25-34  3529  9        0.001989272                1.989272    7.02014
35      35-39        35+  1336  6        0.003246174                3.246174    4.33689
36      40-44          0  2201  0        0.000231804                0.231804    0.51020
37      40-44        1-9  1310  2        0.000785196                0.785196    1.02861
38      40-44      10-14  1386  1        0.001891234                1.891234    2.62125
39      40-44      15-19  1334  2        0.002332804                2.332804    3.11196
40      40-44      20-24  2411 12        0.004216465                4.216465   10.16590
41      40-44      25-34  2424 11        0.005229660                5.229660   12.67670
42      40-44        35+   924 10        0.008533972                8.533972    7.88539
43      45-49          0  1421  0        0.000294328                0.294328    0.41824
44      45-49        1-9   927  0        0.000996988                0.996988    0.92421
45      45-49      10-14   988  2        0.002401360                2.401360    2.37254
46      45-49      15-19   849  2        0.002962035                2.962035    2.51477
47      45-49      20-24  1567  9        0.005353779                5.353779    8.38937
48      45-49      25-34  1409 10        0.006640265                6.640265    9.35613
49      45-49        35+   556  7        0.010835854               10.835854    6.02473
50      50-54          0  1121  0        0.000464935                0.464935    0.52119
51      50-54        1-9   710  3        0.001574889                1.574889    1.11817
52      50-54      10-14   684  4        0.003793301                3.793301    2.59462
53      50-54      15-19   470  2        0.004678971                4.678971    2.19912
54      50-54      20-24   857  7        0.008457083                8.457083    7.24772
55      50-54      25-34   663  5        0.010489277               10.489277    6.95439
56      50-54        35+   255  4        0.017116828               17.116828    4.36479
57      55-59          0   826  2        0.000773114                0.773114    0.63859
58      55-59        1-9   606  0        0.002618793                2.618793    1.58699
59      55-59      10-14   449  3        0.006307664                6.307664    2.83214
60      55-59      15-19   280  5        0.007780394                7.780394    2.17851
61      55-59      20-24   416  7        0.014062800               14.062800    5.85012
62      55-59      25-34   284  3        0.017442019               17.442019    4.95353
63      55-59        35+   104  1        0.028462595               28.462595    2.96011