Poisson regression

\[ \mathrm{P}(Y=y \mid \lambda)=\frac{e^{-\lambda} \lambda^y}{y !} \]

Poisson distribution not equal Poisson regression

# numbers from  
x<-seq(0,30,1)

# calculate the distribution function
# based on the parameters
pdf1<- dpois(x, 1)
pdf2<- dpois(x, 7)
pdf3<- dpois(x, 12)

# Plotting the PDF 
plot(x,pdf1,type = "l")
lines(x,pdf2 ,col=2)
lines(x,pdf3 ,col=3)

Negative binomial regression is a popular generalization of Poisson regression because it loosens the highly restrictive assumption that the variance is equal to the mean made by the Poisson model. The traditional negative binomial regression model is based on the Poisson-gamma mixture distribution. This model is popular because it models the Poisson heterogeneity with a gamma distribution.

Poisson regression is typically used to model count data. But, sometimes, it is more relevant to model rates instead of counts. This is relevant when individuals are not followed the same amount of time. For example, six cases over 1 year should not amount to the same as six cases over 10 years.

In Poisson regression the dependent variable \((Y)\) is an observed count that follows the Poisson distribution. The rate \(\lambda\) is determined by a set of \(k\) predictors \(\mathbf{X}=\left(X_1, \ldots, X_k\right)\).

since

\[ \mu_x=\lambda=\exp \{\mathbf{X} \beta\} . \]

\[ \log \mu_x=\beta_0+\beta_1 x \]

(where \(\mu_x\) is the expected count for those with covariate \(x\) )

for outcome is rate not count

\[ \log \frac{\mu_x}{t_x}=\beta_0^{\prime}+\beta_1^{\prime} x \]

(where \(t_x\) is the exposure time for those with covariate \(x\) ). Now, the last equation could be rewritten

\[ \log \mu_x=\log t_x+\beta_0^{\prime}+\beta_1^{\prime} x \]

and \(\log t_x\) plays the role of an offset.

Dependent variable is count

# example
p <- read.csv("https://stats.idre.ucla.edu/stat/data/poisson_sim.csv")
p <- within(p, {
  prog <- factor(prog, levels=1:3, labels=c("General", "Academic", 
                                            "Vocational"))
  id <- factor(id)
})

head(p)
##    id num_awards       prog math
## 1  45          0 Vocational   41
## 2 108          0    General   41
## 3  15          0 Vocational   44
## 4  67          0 Vocational   42
## 5 153          0 Vocational   40
## 6  51          0    General   42
hist(p$num_awards,probability = T)
lines(density(p$num_awards))

summary(m1 <- glm(num_awards ~ prog + math, family="poisson", data=p))
## 
## Call:
## glm(formula = num_awards ~ prog + math, family = "poisson", data = p)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2043  -0.8436  -0.5106   0.2558   2.6796  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -5.24712    0.65845  -7.969 1.60e-15 ***
## progAcademic    1.08386    0.35825   3.025  0.00248 ** 
## progVocational  0.36981    0.44107   0.838  0.40179    
## math            0.07015    0.01060   6.619 3.63e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 287.67  on 199  degrees of freedom
## Residual deviance: 189.45  on 196  degrees of freedom
## AIC: 373.5
## 
## Number of Fisher Scoring iterations: 6

Dependent variable is rate (count/population)

rate data can be modeled by including the log(n) term with coefficient of 1. This is called an offset.

library(ISwR)
data(eba1977)
cancer.data = eba1977
head(cancer.data)
##         city   age  pop cases
## 1 Fredericia 40-54 3059    11
## 2    Horsens 40-54 2879    13
## 3    Kolding 40-54 3142     4
## 4      Vejle 40-54 2520     5
## 5 Fredericia 55-59  800    11
## 6    Horsens 55-59 1083     6
# histogram count
hist(cancer.data$case,probability = T)
lines(density(cancer.data$case))

# histogram rate
hist(cancer.data$case/cancer.data$pop,probability = T)
lines(density(cancer.data$case/cancer.data$pop))

# using offset
poisson.model.rate <- glm(cases ~ city + age+ offset(log(pop)), family = poisson(link = "log"), data = cancer.data)

#display summary
summary(poisson.model.rate)
## 
## Call:
## glm(formula = cases ~ city + age + offset(log(pop)), family = poisson(link = "log"), 
##     data = cancer.data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.63573  -0.67296  -0.03436   0.37258   1.85267  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.6321     0.2003 -28.125  < 2e-16 ***
## cityHorsens  -0.3301     0.1815  -1.818   0.0690 .  
## cityKolding  -0.3715     0.1878  -1.978   0.0479 *  
## cityVejle    -0.2723     0.1879  -1.450   0.1472    
## age55-59      1.1010     0.2483   4.434 9.23e-06 ***
## age60-64      1.5186     0.2316   6.556 5.53e-11 ***
## age65-69      1.7677     0.2294   7.704 1.31e-14 ***
## age70-74      1.8569     0.2353   7.891 3.00e-15 ***
## age75+        1.4197     0.2503   5.672 1.41e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 129.908  on 23  degrees of freedom
## Residual deviance:  23.447  on 15  degrees of freedom
## AIC: 137.84
## 
## Number of Fisher Scoring iterations: 5
  • Interpretation

Compare to 40-54, age55-59 group increases the number of cases by exp (1.1)=3, while controlling for the effect of city.