Poisson regression is used for outcome variables that only take positive values such as counts and rates. The model assumes Var(outcome) = E(outcome) conditional on predictors.
Create rate ratio calculation function
glm.RR <- function(GLM.RESULT, digits = 2) {
if (GLM.RESULT$family$family == "binomial") {
LABEL <- "OR"
} else if (GLM.RESULT$family$family == "poisson") {
LABEL <- "RR"
} else {
stop("Not logistic or Poisson model")
}
COEF <- stats::coef(GLM.RESULT)
CONFINT <- stats::confint(GLM.RESULT)
TABLE <- cbind(coef=COEF, CONFINT)
TABLE.EXP <- round(exp(TABLE), digits)
colnames(TABLE.EXP)[1] <- LABEL
TABLE.EXP
}
The example was taken from Applied Regression Analysis and Multivariable Methods, 4th Edition.
Skin cancer dataset from “Non-melanoma skin cancer among Caucasians in four areas of the United States.” (www.pubmed.org/4422100).
Dataset is availabe at http://academic.cengage.com/resource_uploads/downloads/0495384968_88614.zip (melanom.txt). As it is small it was included below.
Prepare dataset
## Create a dataset manually
nonmel <- read.table(header = TRUE,
text = "
cases city u1 u2 u3 u4 u5 u6 u7 n
1 1 0 1 0 0 0 0 0 0 172675
2 16 0 0 1 0 0 0 0 0 123065
3 30 0 0 0 1 0 0 0 0 96216
4 71 0 0 0 0 1 0 0 0 92051
5 102 0 0 0 0 0 1 0 0 72159
6 130 0 0 0 0 0 0 1 0 54722
7 133 0 0 0 0 0 0 0 1 32185
8 40 0 0 0 0 0 0 0 0 8328
9 4 1 1 0 0 0 0 0 0 181343
10 38 1 0 1 0 0 0 0 0 146207
11 119 1 0 0 1 0 0 0 0 121374
12 221 1 0 0 0 1 0 0 0 111353
13 259 1 0 0 0 0 1 0 0 83004
14 310 1 0 0 0 0 0 1 0 55932
15 226 1 0 0 0 0 0 0 1 29007
16 65 1 0 0 0 0 0 0 0 7583
")
## Create age.range variable and city variable
nonmel <- within(nonmel, {
age.range <- rep(c("15_24","25_34","35_44","45_54","55_64","65_74","75_84","85+"), 2)
age.range <- factor(age.range)
age.range <- relevel(age.range, ref = "85+")
city <- factor(city, 0:1, c("Minneapolis", "Dallas"))
})
## rop unnecessary columns
nonmel <- nonmel[c("cases","n","city","age.range")]
## Check data
nonmel
cases n city age.range
1 1 172675 Minneapolis 15_24
2 16 123065 Minneapolis 25_34
3 30 96216 Minneapolis 35_44
4 71 92051 Minneapolis 45_54
5 102 72159 Minneapolis 55_64
6 130 54722 Minneapolis 65_74
7 133 32185 Minneapolis 75_84
8 40 8328 Minneapolis 85+
9 4 181343 Dallas 15_24
10 38 146207 Dallas 25_34
11 119 121374 Dallas 35_44
12 221 111353 Dallas 45_54
13 259 83004 Dallas 55_64
14 310 55932 Dallas 65_74
15 226 29007 Dallas 75_84
16 65 7583 Dallas 85+
It describes the number of cases of non-melanoma skin cancer among residents of two cities in each age category.
Poisson regression modeling
The outcome of interest here is rates (cases/n), however, glm() can only take counts as outcome for Poisson regression.
By transforming the equation.
\( log(cases / n) = \beta_0 + \sum_i {\beta_i X_i} \)
\( log(cases) - log(n) = \beta_0 + \sum_i {\beta_i X_i} \)
\( log(cases) = log(n) + \beta_0 + \sum_i {\beta_i X_i} \)
Thus, it can be modeled by including a log(n) term with coeffcient of 1 (no \( \beta \) for this term). This is called an offset and, modeled with offset() in R.
These two are equivalent
## Including offset(log(n)) in the right hand side
model.1 <- glm(cases ~ city + age.range + offset(log(n)), family = poisson(link = "log"), data = nonmel)
## Using the offset option
model.1 <- glm(cases ~ city + age.range, offset = log(n), family = poisson(link = "log"), data = nonmel)
## Results from regular Poisson
summary(model.1)
Call:
glm(formula = cases ~ city + age.range, family = poisson(link = "log"),
data = nonmel, offset = log(n))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5060 -0.4857 0.0164 0.3693 1.2476
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.4834 0.1037 -52.89 < 2e-16 ***
cityDallas 0.8039 0.0522 15.40 < 2e-16 ***
age.range15_24 -6.1742 0.4577 -13.49 < 2e-16 ***
age.range25_34 -3.5440 0.1675 -21.16 < 2e-16 ***
age.range35_44 -2.3268 0.1275 -18.25 < 2e-16 ***
age.range45_54 -1.5790 0.1138 -13.87 < 2e-16 ***
age.range55_64 -1.0869 0.1109 -9.80 < 2e-16 ***
age.range65_74 -0.5288 0.1086 -4.87 0.0000011 ***
age.range75_84 -0.1157 0.1109 -1.04 0.3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2789.6810 on 15 degrees of freedom
Residual deviance: 8.2585 on 7 degrees of freedom
AIC: 120.5
Number of Fisher Scoring iterations: 4
glm.RR(model.1, 3)
RR 2.5 % 97.5 %
(Intercept) 0.004 0.003 0.005 # The risk of cancer for oldest in Minn. is 0.004 per person (4/1000 people)
cityDallas 2.234 2.018 2.477 # Compared to Minn. , risk ratio of cancer is 2.234 within levels of covariates
age.range15_24 0.002 0.001 0.005 # Compared to oldest, risk ratio of cancer is 0.002 within levels of covariates
age.range25_34 0.029 0.021 0.040 # Compared to oldest, risk ratio of cancer is 0.029 within levels of covariates
age.range35_44 0.098 0.076 0.126 # Compared to oldest, risk ratio of cancer is 0.098 within levels of covariates
age.range45_54 0.206 0.166 0.259 # Compared to oldest, risk ratio of cancer is 0.206 within levels of covariates
age.range55_64 0.337 0.272 0.421 # Compared to oldest, risk ratio of cancer is 0.337 within levels of covariates
age.range65_74 0.589 0.478 0.733 # Compared to oldest, risk ratio of cancer is 0.589 within levels of covariates
age.range75_84 0.891 0.720 1.112 # Compared to oldest, risk ratio of cancer is 0.891 within levels of covariates
Prediction
## Predict case per person (n = 1) for oldest people in the Minneapolis
exp(predict(model.1, newdata = data.frame(city = "Minneapolis", age.range = "85+", n = 1)))
1
0.004155
## Create dataset to predict for
newdat1 <- nonmel[c("city","age.range")]
newdat2 <- newdat1
## Predicted number of cases per person
newdat1$n <- 1
nonmel$pred.cases.per.one <- exp(predict(model.1, newdat1))
## Predicted number of cases per one thousand persons
newdat2$n <- 1000
nonmel$pred.cases.per.thousand <- exp(predict(model.1, newdat2))
## Predicted number of cases per actual population
nonmel$pred.cases <- exp(predict(model.1))
## Show prediction results
nonmel
cases n city age.range pred.cases.per.one pred.cases.per.thousand pred.cases
1 1 172675 Minneapolis 15_24 0.000008653 0.008653 1.494
2 16 123065 Minneapolis 25_34 0.000120073 0.120073 14.777
3 30 96216 Minneapolis 35_44 0.000405559 0.405559 39.021
4 71 92051 Minneapolis 45_54 0.000856708 0.856708 78.861
5 102 72159 Minneapolis 55_64 0.001401347 1.401347 101.120
6 130 54722 Minneapolis 65_74 0.002448700 2.448700 133.998
7 133 32185 Minneapolis 75_84 0.003701282 3.701282 119.126
8 40 8328 Minneapolis 85+ 0.004155093 4.155093 34.604
9 4 181343 Dallas 15_24 0.000019333 0.019333 3.506
10 38 146207 Dallas 25_34 0.000268272 0.268272 39.223
11 119 121374 Dallas 35_44 0.000906114 0.906114 109.979
12 221 111353 Dallas 45_54 0.001914086 1.914086 213.139
13 259 83004 Dallas 55_64 0.003130936 3.130936 259.880
14 310 55932 Dallas 65_74 0.005470969 5.470969 306.002
15 226 29007 Dallas 75_84 0.008269529 8.269529 239.874
16 65 7583 Dallas 85+ 0.009283448 9.283448 70.396
Quasi-Poisson regression with an estimated dispersion parameter
## quasi-Poisson to allow the scale parameter to change from 1. Show the dispersion parameter.
model.1q <- glm(cases ~ city + age.range, offset = log(n), family = quasipoisson(link = "log"), data = nonmel)
summary(model.1q)
Call:
glm(formula = cases ~ city + age.range, family = quasipoisson(link = "log"),
data = nonmel, offset = log(n))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5060 -0.4857 0.0164 0.3693 1.2476
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.4834 0.1117 -49.08 0.00000000038 ***
cityDallas 0.8039 0.0563 14.29 0.00000195327 ***
age.range15_24 -6.1742 0.4932 -12.52 0.00000478575 ***
age.range25_34 -3.5440 0.1805 -19.64 0.00000022172 ***
age.range35_44 -2.3268 0.1373 -16.94 0.00000061180 ***
age.range45_54 -1.5790 0.1227 -12.87 0.00000396332 ***
age.range55_64 -1.0869 0.1195 -9.10 0.00003983884 ***
age.range65_74 -0.5288 0.1170 -4.52 0.0027 **
age.range75_84 -0.1157 0.1195 -0.97 0.3656
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 1.161)
Null deviance: 2789.6810 on 15 degrees of freedom
Residual deviance: 8.2585 on 7 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 4
The quasi-Poisson method can be used to estimate the dispersion parameter, i.e., degree of overdispersion. It is 1.16, and almost equal to 1, which is the value used in regular Poisson regression. Thus, the Poisson model is appears ok. If it is large, the standard error estimation from the quasi-Poisson model should be used. Here because of the near one dispersion parameter, the SE estimates are almost identical in both methods.
Robust sandwich covariance matrix estimator
## Load sandwich package for robust estimator
library(sandwich)
## Load lmtest package for coeftest
library(lmtest)
## Poisson model with SE estimated via robust variance estimator
coeftest(model.1, vcov = sandwich)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.4834 0.0798 -68.72 < 2e-16 ***
cityDallas 0.8039 0.0406 19.82 < 2e-16 ***
age.range15_24 -6.1742 0.1575 -39.20 < 2e-16 ***
age.range25_34 -3.5440 0.0793 -44.69 < 2e-16 ***
age.range35_44 -2.3268 0.1126 -20.66 < 2e-16 ***
age.range45_54 -1.5790 0.0821 -19.23 < 2e-16 ***
age.range55_64 -1.0869 0.0726 -14.97 < 2e-16 ***
age.range65_74 -0.5288 0.0737 -7.17 0.00000000000075 ***
age.range75_84 -0.1157 0.0909 -1.27 0.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Another way to overcome the problem with overdispersion (underestimation of the variance) in Poisson regression is to use the robust sandwich covariance matrix estimator.
Negative binomial regression
## Load MASS
library(MASS)
## Negative binomial regression
model.1nb <- glm.nb(cases ~ city + age.range + offset(log(n)), data = nonmel)
summary(model.1nb)
Call:
glm.nb(formula = cases ~ city + age.range + offset(log(n)), data = nonmel,
init.theta = 1600830.821, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5059 -0.4857 0.0164 0.3693 1.2476
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.4834 0.1037 -52.89 < 2e-16 ***
cityDallas 0.8039 0.0522 15.40 < 2e-16 ***
age.range15_24 -6.1742 0.4577 -13.49 < 2e-16 ***
age.range25_34 -3.5440 0.1675 -21.16 < 2e-16 ***
age.range35_44 -2.3268 0.1275 -18.25 < 2e-16 ***
age.range45_54 -1.5790 0.1138 -13.87 < 2e-16 ***
age.range55_64 -1.0869 0.1109 -9.80 < 2e-16 ***
age.range65_74 -0.5288 0.1086 -4.87 0.0000011 ***
age.range75_84 -0.1157 0.1110 -1.04 0.3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(1600831) family taken to be 1)
Null deviance: 2789.499 on 15 degrees of freedom
Residual deviance: 8.258 on 7 degrees of freedom
AIC: 122.5
Number of Fisher Scoring iterations: 1
Theta: 1600831
Std. Err.: 34893356
Warning while fitting theta: iteration limit reached
2 x log-likelihood: -102.5
Yet another way to account for overdispersion is using negative binomial regression, which models the variance function as follows. In this specific case the \( \theta \) is very large, and it does not converge within iteration limits.
E.B. Andersen (1977), Multiplicative Poisson models with unequal cell rates, Scandinavian Journal of Statistics, 4:153-158.
Data preparation
## Load ISwR package
library(ISwR)
## Load data
data(eba1977)
eba1977
city age pop cases
1 Fredericia 40-54 3059 11
2 Horsens 40-54 2879 13
3 Kolding 40-54 3142 4
4 Vejle 40-54 2520 5
5 Fredericia 55-59 800 11
6 Horsens 55-59 1083 6
7 Kolding 55-59 1050 8
8 Vejle 55-59 878 7
9 Fredericia 60-64 710 11
10 Horsens 60-64 923 15
11 Kolding 60-64 895 7
12 Vejle 60-64 839 10
13 Fredericia 65-69 581 10
14 Horsens 65-69 834 10
15 Kolding 65-69 702 11
16 Vejle 65-69 631 14
17 Fredericia 70-74 509 11
18 Horsens 70-74 634 12
19 Kolding 70-74 535 9
20 Vejle 70-74 539 8
21 Fredericia 75+ 605 10
22 Horsens 75+ 782 2
23 Kolding 75+ 659 12
24 Vejle 75+ 619 7
Lung cancer incidence in four Danish cities 1968-1971
Description:
This data set contains counts of incident lung cancer cases and
population size in four neighbouring Danish cities by age group.
Format:
A data frame with 24 observations on the following 4 variables:
city a factor with levels Fredericia, Horsens, Kolding,
and Vejle.
age a factor with levels 40-54, 55-59, 60-64, 65-69,
70-74, and 75+.
pop a numeric vector, number of inhabitants.
cases a numeric vector, number of lung cancer cases.
Poisson model
## Fit Poisson model
model.2 <- glm(cases ~ city + age, offset = log(pop), family = poisson(link = "log"), data = eba1977)
summary(model.2)
Call:
glm(formula = cases ~ city + age, family = poisson(link = "log"),
data = eba1977, offset = log(pop))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6357 -0.6730 -0.0344 0.3726 1.8527
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.632 0.200 -28.12 < 2e-16 ***
cityHorsens -0.330 0.182 -1.82 0.069 .
cityKolding -0.372 0.188 -1.98 0.048 *
cityVejle -0.272 0.188 -1.45 0.147
age55-59 1.101 0.248 4.43 0.000009230223483 ***
age60-64 1.519 0.232 6.56 0.000000000055276 ***
age65-69 1.768 0.229 7.70 0.000000000000013 ***
age70-74 1.857 0.235 7.89 0.000000000000003 ***
age75+ 1.420 0.250 5.67 0.000000014075139 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 129.908 on 23 degrees of freedom
Residual deviance: 23.447 on 15 degrees of freedom
AIC: 137.8
Number of Fisher Scoring iterations: 5
## Check dispersion parameter with quasi-Poisson regression
model.2q <- glm(cases ~ city + age, offset = log(pop), family = quasipoisson(link = "log"), data = eba1977)
summary(model.2q)
Call:
glm(formula = cases ~ city + age, family = quasipoisson(link = "log"),
data = eba1977, offset = log(pop))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6357 -0.6730 -0.0344 0.3726 1.8527
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.632 0.246 -22.93 0.00000000000043 ***
cityHorsens -0.330 0.223 -1.48 0.15884
cityKolding -0.372 0.230 -1.61 0.12756
cityVejle -0.272 0.230 -1.18 0.25561
age55-59 1.101 0.305 3.62 0.00254 **
age60-64 1.519 0.284 5.35 0.00008166545653 ***
age65-69 1.768 0.281 6.28 0.00001469556390 ***
age70-74 1.857 0.289 6.43 0.00001125399928 ***
age75+ 1.420 0.307 4.63 0.00033 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 1.504)
Null deviance: 129.908 on 23 degrees of freedom
Residual deviance: 23.447 on 15 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
The dispersion parameter is 1.5, and is not so large. The Poisson model is probably acceptable. However, the significance of the main effect of living in Kolding city is qualitatively different (significant in Poisson, and non-significant quasi-Poisson).
glm.RR(model.2, 3)
RR 2.5 % 97.5 %
(Intercept) 0.004 0.002 0.005 # risk of lung cancer is 0.004 per person among 40-54 yrs in Fredericia.
cityHorsens 0.719 0.503 1.026 # risk ratio of lung cancer is 0.719 compared to Fredericia within levels of age
cityKolding 0.690 0.476 0.995 # risk ratio of lung cancer is 0.690 compared to Fredericia within levels of age
cityVejle 0.762 0.525 1.099 # risk ratio of lung cancer is 0.762 compared to Fredericia within levels of age
age55-59 3.007 1.843 4.901 # risk ratio of lung cancer is 3.007 compared to 40-54 yrs within levels of city
age60-64 4.566 2.907 7.236 # risk ratio of lung cancer is 4.566 compared to 40-54 yrs within levels of city
age65-69 5.857 3.748 9.249 # risk ratio of lung cancer is 5.857 compared to 40-54 yrs within levels of city
age70-74 6.404 4.043 10.212 # risk ratio of lung cancer is 6.404 compared to 40-54 yrs within levels of city
age75+ 4.136 2.523 6.762 # risk ratio of lung cancer is 4.136 compared to 40-54 yrs within levels of city
Goodness of fit test If the residual deviance is close enough to the residual degrees of freedom, it is a good fit. It can be tested by Chi-squared test.
list(residual.deviance = deviance(model.2),
residual.degrees.of.freedom = df.residual(model.2),
chisq.p.value = pchisq(deviance(model.2), df.residual(model.2), lower = F)
)
$residual.deviance
[1] 23.45
$residual.degrees.of.freedom
[1] 15
$chisq.p.value
[1] 0.07509
Robust sandwich covariance matrix estimator
## Poisson model with SE estimated via robust variance estimator
coeftest(model.2, vcov = sandwich)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.632 0.188 -29.93 < 2e-16 ***
cityHorsens -0.330 0.187 -1.76 0.078 .
cityKolding -0.372 0.154 -2.41 0.016 *
cityVejle -0.272 0.107 -2.56 0.011 *
age55-59 1.101 0.241 4.58 0.0000047610015 ***
age60-64 1.519 0.246 6.17 0.0000000006998 ***
age65-69 1.768 0.246 7.20 0.0000000000006 ***
age70-74 1.857 0.222 8.37 < 2e-16 ***
age75+ 1.420 0.342 4.15 0.0000325995756 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Another way to overcome the problem with overdispersion (underestimation of the variance) in Poisson regression is to use the robust sandwich covariance matrix estimator.
Negative binomial regression
## Negative binomial regression
model.2nb <- glm.nb(cases ~ city + age + offset(log(pop)), data = eba1977)
summary(model.2nb)
Call:
glm.nb(formula = cases ~ city + age + offset(log(pop)), data = eba1977,
init.theta = 119797.75, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6357 -0.6729 -0.0344 0.3726 1.8526
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.632 0.200 -28.12 < 2e-16 ***
cityHorsens -0.330 0.182 -1.82 0.069 .
cityKolding -0.372 0.188 -1.98 0.048 *
cityVejle -0.272 0.188 -1.45 0.147
age55-59 1.101 0.248 4.43 0.000009236801490 ***
age60-64 1.519 0.232 6.56 0.000000000055373 ***
age65-69 1.768 0.229 7.70 0.000000000000013 ***
age70-74 1.857 0.235 7.89 0.000000000000003 ***
age75+ 1.420 0.250 5.67 0.000000014090354 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(119798) family taken to be 1)
Null deviance: 129.895 on 23 degrees of freedom
Residual deviance: 23.446 on 15 degrees of freedom
AIC: 139.8
Number of Fisher Scoring iterations: 1
Theta: 119798
Std. Err.: 3824543
Warning while fitting theta: iteration limit reached
2 x log-likelihood: -119.8
Yet another way to account for overdispersion is using negative binomial regression, which models the variance function as follows. In this specific case the \( \theta \) is very large, and it does not converge within iteration limits.
library(SMPracticals)
data(lung.cancer)
A data frame with 63 observations on the following 4 variables. (?lung.cancer)
head(lung.cancer, 15)
years.smok cigarettes Time y
1 15-19 0 10366 1
2 15-19 1-9 3121 0
3 15-19 10-14 3577 0
4 15-19 15-19 4317 0
5 15-19 20-24 5683 0
6 15-19 25-34 3042 0
7 15-19 35+ 670 0
8 20-24 0 8162 0
9 20-24 1-9 2937 0
10 20-24 10-14 3286 1
11 20-24 15-19 4214 0
12 20-24 20-24 6385 1
13 20-24 25-34 4050 1
14 20-24 35+ 1166 0
15 25-29 0 5969 0
Poisson regression
## Poisson
model.3 <- glm(y ~ years.smok + cigarettes,
offset = log(Time),
family = poisson(link = "log"), data = lung.cancer)
summary(model.3)
Call:
glm(formula = y ~ years.smok + cigarettes, family = poisson(link = "log"),
data = lung.cancer, offset = log(Time))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.833 -0.856 -0.381 0.424 2.176
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.578 1.148 -10.96 < 2e-16 ***
years.smok20-24 0.947 1.155 0.82 0.41220
years.smok25-29 1.702 1.080 1.57 0.11528
years.smok30-34 3.203 1.020 3.14 0.00170 **
years.smok35-39 3.242 1.024 3.17 0.00155 **
years.smok40-44 4.209 1.014 4.15 0.0000329906 ***
years.smok45-49 4.448 1.017 4.37 0.0000122579 ***
years.smok50-54 4.905 1.020 4.81 0.0000015245 ***
years.smok55-59 5.413 1.024 5.29 0.0000001242 ***
cigarettes1-9 1.220 0.707 1.72 0.08455 .
cigarettes10-14 2.099 0.636 3.30 0.00097 ***
cigarettes15-19 2.309 0.633 3.65 0.00026 ***
cigarettes20-24 2.901 0.596 4.87 0.0000011134 ***
cigarettes25-34 3.116 0.595 5.24 0.0000001606 ***
cigarettes35+ 3.606 0.605 5.96 0.0000000025 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 445.099 on 62 degrees of freedom
Residual deviance: 51.471 on 48 degrees of freedom
AIC: 201.3
Number of Fisher Scoring iterations: 6
## Quasi-Poisson
model.3q <- glm(y ~ years.smok + cigarettes,
offset = log(Time),
family = quasipoisson(link = "log"), data = lung.cancer)
summary(model.3q)
Call:
glm(formula = y ~ years.smok + cigarettes, family = quasipoisson(link = "log"),
data = lung.cancer, offset = log(Time))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.833 -0.856 -0.381 0.424 2.176
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.578 1.342 -9.37 0.000000000002 ***
years.smok20-24 0.947 1.351 0.70 0.48662
years.smok25-29 1.702 1.264 1.35 0.18446
years.smok30-34 3.203 1.193 2.68 0.00996 **
years.smok35-39 3.242 1.198 2.71 0.00939 **
years.smok40-44 4.209 1.186 3.55 0.00087 ***
years.smok45-49 4.448 1.190 3.74 0.00049 ***
years.smok50-54 4.905 1.193 4.11 0.00015 ***
years.smok55-59 5.413 1.198 4.52 0.000040376655 ***
cigarettes1-9 1.220 0.827 1.47 0.14680
cigarettes10-14 2.099 0.744 2.82 0.00695 **
cigarettes15-19 2.309 0.740 3.12 0.00306 **
cigarettes20-24 2.901 0.697 4.16 0.00013 ***
cigarettes25-34 3.116 0.696 4.48 0.000046139733 ***
cigarettes35+ 3.606 0.707 5.10 0.000005779992 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 1.368)
Null deviance: 445.099 on 62 degrees of freedom
Residual deviance: 51.471 on 48 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 6
Rate ratios
glm.RR(model.3, 3)
RR 2.5 % 97.5 %
(Intercept) 0.000 0.000 0.00
years.smok20-24 2.578 0.330 52.12
years.smok25-29 5.483 0.936 103.58
years.smok30-34 24.604 5.210 439.82
years.smok35-39 25.591 5.348 459.10
years.smok40-44 67.278 14.583 1195.17
years.smok45-49 85.425 18.301 1522.31
years.smok50-54 134.941 28.597 2411.61
years.smok55-59 224.385 46.942 4024.10
cigarettes1-9 3.387 0.893 16.06
cigarettes10-14 8.159 2.663 35.40
cigarettes15-19 10.064 3.319 43.47
cigarettes20-24 18.190 6.669 74.90
cigarettes25-34 22.561 8.293 92.80
cigarettes35+ 36.816 13.153 153.43
Robust sandwich covariance matrix estimator
## Poisson model with SE estimated via robust variance estimator
coeftest(model.3, vcov = sandwich)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.578 1.457 -8.63 < 2e-16 ***
years.smok20-24 0.947 1.118 0.85 0.39699
years.smok25-29 1.702 1.160 1.47 0.14255
years.smok30-34 3.203 1.075 2.98 0.00289 **
years.smok35-39 3.242 1.083 2.99 0.00275 **
years.smok40-44 4.209 1.074 3.92 0.000089147 ***
years.smok45-49 4.448 1.071 4.15 0.000032653 ***
years.smok50-54 4.905 1.078 4.55 0.000005334 ***
years.smok55-59 5.413 1.096 4.94 0.000000784 ***
cigarettes1-9 1.220 0.806 1.51 0.13025
cigarettes10-14 2.099 0.673 3.12 0.00181 **
cigarettes15-19 2.309 0.698 3.31 0.00094 ***
cigarettes20-24 2.901 0.644 4.51 0.000006555 ***
cigarettes25-34 3.116 0.645 4.83 0.000001381 ***
cigarettes35+ 3.606 0.650 5.55 0.000000028 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Another way to overcome the problem with overdispersion (underestimation of the variance) in Poisson regression is to use the robust sandwich covariance matrix estimator.
Negative binomial regression
## Negative binomial regression
model.3nb <- glm.nb(y ~ years.smok + cigarettes + offset(log(Time)), data = lung.cancer)
summary(model.3nb)
Call:
glm.nb(formula = y ~ years.smok + cigarettes + offset(log(Time)),
data = lung.cancer, init.theta = 5028.906523, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.833 -0.856 -0.381 0.424 2.176
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.578 1.148 -10.96 < 2e-16 ***
years.smok20-24 0.947 1.155 0.82 0.41224
years.smok25-29 1.702 1.080 1.57 0.11531
years.smok30-34 3.203 1.020 3.14 0.00170 **
years.smok35-39 3.242 1.024 3.17 0.00155 **
years.smok40-44 4.209 1.014 4.15 0.0000330267 ***
years.smok45-49 4.448 1.017 4.37 0.0000122755 ***
years.smok50-54 4.905 1.020 4.81 0.0000015254 ***
years.smok55-59 5.413 1.024 5.29 0.0000001243 ***
cigarettes1-9 1.220 0.707 1.72 0.08457 .
cigarettes10-14 2.099 0.636 3.30 0.00097 ***
cigarettes15-19 2.309 0.633 3.65 0.00026 ***
cigarettes20-24 2.901 0.596 4.87 0.0000011162 ***
cigarettes25-34 3.116 0.595 5.24 0.0000001609 ***
cigarettes35+ 3.606 0.605 5.96 0.0000000025 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(5029) family taken to be 1)
Null deviance: 444.862 on 62 degrees of freedom
Residual deviance: 51.455 on 48 degrees of freedom
AIC: 203.3
Number of Fisher Scoring iterations: 1
Theta: 5029
Std. Err.: 20348
Warning while fitting theta: alternation limit reached
2 x log-likelihood: -171.3
Yet another way to account for overdispersion is using negative binomial regression, which models the variance function as follows. In this specific case the \( \theta \) is very large, and it does not converge within iteration limits.
Prediction
## Create dataset to predict for
newdat1 <- lung.cancer[c("years.smok","cigarettes")]
newdat2 <- newdat1
## Predicted number of cases per person
newdat1$Time <- 1
lung.cancer$pred.cases.per.one <- exp(predict(model.3, newdat1))
## Predicted number of cases per one thousand persons
newdat2$Time <- 1000
lung.cancer$pred.cases.per.thousand <- exp(predict(model.3, newdat2))
## Predicted number of cases per actual population
lung.cancer$pred.cases <- exp(predict(model.3))
## Show prediction results
lung.cancer
years.smok cigarettes Time y pred.cases.per.one pred.cases.per.thousand pred.cases
1 15-19 0 10366 1 0.000003445 0.003445 0.03572
2 15-19 1-9 3121 0 0.000011671 0.011671 0.03643
3 15-19 10-14 3577 0 0.000028111 0.028111 0.10055
4 15-19 15-19 4317 0 0.000034674 0.034674 0.14969
5 15-19 20-24 5683 0 0.000062673 0.062673 0.35617
6 15-19 25-34 3042 0 0.000077732 0.077732 0.23646
7 15-19 35+ 670 0 0.000126847 0.126847 0.08499
8 20-24 0 8162 0 0.000008882 0.008882 0.07249
9 20-24 1-9 2937 0 0.000030086 0.030086 0.08836
10 20-24 10-14 3286 1 0.000072465 0.072465 0.23812
11 20-24 15-19 4214 0 0.000089384 0.089384 0.37666
12 20-24 20-24 6385 1 0.000161559 0.161559 1.03155
13 20-24 25-34 4050 1 0.000200380 0.200380 0.81154
14 20-24 35+ 1166 0 0.000326989 0.326989 0.38127
15 25-29 0 5969 0 0.000018890 0.018890 0.11276
16 25-29 1-9 2288 0 0.000063987 0.063987 0.14640
17 25-29 10-14 2546 1 0.000154121 0.154121 0.39239
18 25-29 15-19 3185 0 0.000190105 0.190105 0.60549
19 25-29 20-24 5483 1 0.000343609 0.343609 1.88401
20 25-29 25-34 4290 4 0.000426176 0.426176 1.82830
21 25-29 35+ 1482 0 0.000695452 0.695452 1.03066
22 30-34 0 4496 0 0.000084774 0.084774 0.38114
23 30-34 1-9 2015 0 0.000287157 0.287157 0.57862
24 30-34 10-14 2219 2 0.000691651 0.691651 1.53477
25 30-34 15-19 2560 4 0.000853139 0.853139 2.18404
26 30-34 20-24 4687 6 0.001542020 1.542020 7.22745
27 30-34 25-34 4268 9 0.001912560 1.912560 8.16281
28 30-34 35+ 1580 4 0.003120993 3.120993 4.93117
29 35-39 0 3512 0 0.000088174 0.088174 0.30967
30 35-39 1-9 1648 1 0.000298675 0.298675 0.49222
31 35-39 10-14 1826 0 0.000719392 0.719392 1.31361
32 35-39 15-19 1893 0 0.000887358 0.887358 1.67977
33 35-39 20-24 3646 5 0.001603870 1.603870 5.84771
34 35-39 25-34 3529 9 0.001989272 1.989272 7.02014
35 35-39 35+ 1336 6 0.003246174 3.246174 4.33689
36 40-44 0 2201 0 0.000231804 0.231804 0.51020
37 40-44 1-9 1310 2 0.000785196 0.785196 1.02861
38 40-44 10-14 1386 1 0.001891234 1.891234 2.62125
39 40-44 15-19 1334 2 0.002332804 2.332804 3.11196
40 40-44 20-24 2411 12 0.004216465 4.216465 10.16590
41 40-44 25-34 2424 11 0.005229660 5.229660 12.67670
42 40-44 35+ 924 10 0.008533972 8.533972 7.88539
43 45-49 0 1421 0 0.000294328 0.294328 0.41824
44 45-49 1-9 927 0 0.000996988 0.996988 0.92421
45 45-49 10-14 988 2 0.002401360 2.401360 2.37254
46 45-49 15-19 849 2 0.002962035 2.962035 2.51477
47 45-49 20-24 1567 9 0.005353779 5.353779 8.38937
48 45-49 25-34 1409 10 0.006640265 6.640265 9.35613
49 45-49 35+ 556 7 0.010835854 10.835854 6.02473
50 50-54 0 1121 0 0.000464935 0.464935 0.52119
51 50-54 1-9 710 3 0.001574889 1.574889 1.11817
52 50-54 10-14 684 4 0.003793301 3.793301 2.59462
53 50-54 15-19 470 2 0.004678971 4.678971 2.19912
54 50-54 20-24 857 7 0.008457083 8.457083 7.24772
55 50-54 25-34 663 5 0.010489277 10.489277 6.95439
56 50-54 35+ 255 4 0.017116828 17.116828 4.36479
57 55-59 0 826 2 0.000773114 0.773114 0.63859
58 55-59 1-9 606 0 0.002618793 2.618793 1.58699
59 55-59 10-14 449 3 0.006307664 6.307664 2.83214
60 55-59 15-19 280 5 0.007780394 7.780394 2.17851
61 55-59 20-24 416 7 0.014062800 14.062800 5.85012
62 55-59 25-34 284 3 0.017442019 17.442019 4.95353
63 55-59 35+ 104 1 0.028462595 28.462595 2.96011