\[ \mathrm{P}(Y=y \mid \lambda)=\frac{e^{-\lambda} \lambda^y}{y !} \]
Poisson distribution not equal Poisson regression
# numbers from
x<-seq(0,30,1)
# calculate the distribution function
# based on the parameters
pdf1<- dpois(x, 1)
pdf2<- dpois(x, 7)
pdf3<- dpois(x, 12)
# Plotting the PDF
plot(x,pdf1,type = "l")
lines(x,pdf2 ,col=2)
lines(x,pdf3 ,col=3)
Negative binomial regression is a popular generalization of Poisson regression because it loosens the highly restrictive assumption that the variance is equal to the mean made by the Poisson model. The traditional negative binomial regression model is based on the Poisson-gamma mixture distribution. This model is popular because it models the Poisson heterogeneity with a gamma distribution.
Poisson regression is typically used to model count data. But, sometimes, it is more relevant to model rates instead of counts. This is relevant when individuals are not followed the same amount of time. For example, six cases over 1 year should not amount to the same as six cases over 10 years.
In Poisson regression the dependent variable \((Y)\) is an observed count that follows the Poisson distribution. The rate \(\lambda\) is determined by a set of \(k\) predictors \(\mathbf{X}=\left(X_1, \ldots, X_k\right)\).
since
\[ \mu_x=\lambda=\exp \{\mathbf{X} \beta\} . \]
\[ \log \mu_x=\beta_0+\beta_1 x \]
(where \(\mu_x\) is the expected count for those with covariate \(x\) )
for outcome is rate not count
\[ \log \frac{\mu_x}{t_x}=\beta_0^{\prime}+\beta_1^{\prime} x \]
(where \(t_x\) is the exposure time for those with covariate \(x\) ). Now, the last equation could be rewritten
\[ \log \mu_x=\log t_x+\beta_0^{\prime}+\beta_1^{\prime} x \]
and \(\log t_x\) plays the role of an offset.
# example
p <- read.csv("https://stats.idre.ucla.edu/stat/data/poisson_sim.csv")
p <- within(p, {
prog <- factor(prog, levels=1:3, labels=c("General", "Academic",
"Vocational"))
id <- factor(id)
})
head(p)
## id num_awards prog math
## 1 45 0 Vocational 41
## 2 108 0 General 41
## 3 15 0 Vocational 44
## 4 67 0 Vocational 42
## 5 153 0 Vocational 40
## 6 51 0 General 42
hist(p$num_awards,probability = T)
lines(density(p$num_awards))
summary(m1 <- glm(num_awards ~ prog + math, family="poisson", data=p))
##
## Call:
## glm(formula = num_awards ~ prog + math, family = "poisson", data = p)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2043 -0.8436 -0.5106 0.2558 2.6796
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.24712 0.65845 -7.969 1.60e-15 ***
## progAcademic 1.08386 0.35825 3.025 0.00248 **
## progVocational 0.36981 0.44107 0.838 0.40179
## math 0.07015 0.01060 6.619 3.63e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 287.67 on 199 degrees of freedom
## Residual deviance: 189.45 on 196 degrees of freedom
## AIC: 373.5
##
## Number of Fisher Scoring iterations: 6
rate data can be modeled by including the log(n) term with coefficient of 1. This is called an offset.
library(ISwR)
data(eba1977)
cancer.data = eba1977
head(cancer.data)
## city age pop cases
## 1 Fredericia 40-54 3059 11
## 2 Horsens 40-54 2879 13
## 3 Kolding 40-54 3142 4
## 4 Vejle 40-54 2520 5
## 5 Fredericia 55-59 800 11
## 6 Horsens 55-59 1083 6
# histogram count
hist(cancer.data$case,probability = T)
lines(density(cancer.data$case))
# histogram rate
hist(cancer.data$case/cancer.data$pop,probability = T)
lines(density(cancer.data$case/cancer.data$pop))
# using offset
poisson.model.rate <- glm(cases ~ city + age+ offset(log(pop)), family = poisson(link = "log"), data = cancer.data)
#display summary
summary(poisson.model.rate)
##
## Call:
## glm(formula = cases ~ city + age + offset(log(pop)), family = poisson(link = "log"),
## data = cancer.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.63573 -0.67296 -0.03436 0.37258 1.85267
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.6321 0.2003 -28.125 < 2e-16 ***
## cityHorsens -0.3301 0.1815 -1.818 0.0690 .
## cityKolding -0.3715 0.1878 -1.978 0.0479 *
## cityVejle -0.2723 0.1879 -1.450 0.1472
## age55-59 1.1010 0.2483 4.434 9.23e-06 ***
## age60-64 1.5186 0.2316 6.556 5.53e-11 ***
## age65-69 1.7677 0.2294 7.704 1.31e-14 ***
## age70-74 1.8569 0.2353 7.891 3.00e-15 ***
## age75+ 1.4197 0.2503 5.672 1.41e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 129.908 on 23 degrees of freedom
## Residual deviance: 23.447 on 15 degrees of freedom
## AIC: 137.84
##
## Number of Fisher Scoring iterations: 5
Compare to 40-54, age55-59 group increases the number of cases by exp (1.1)=3, while controlling for the effect of city.