23/2/2021

Poisson distribution

  • Using this as an example today
  • When used with a generalized linear model often called a loglinear regression
  • The Poisson distribution is a non-normal distribution that is effective in modelling strictly positive integer datat (such as counts). It has a single parameter \(\lambda\) which is both the mean and the variance of the response variable.

Do Prussians get kicked in the head a lot by horses?

The Poisson distribution was made famous by the Russian statistician Ladislaus Bortkiewicz (1868-1931), who counted the number of soldiers killed by being kicked by a horse each year in each of 14 cavalry corps in the Prussian army over a 20-year period

Exploratory data analysis

Me <- tapply(prussian$y, prussian$corp, mean)
Var <- tapply(prussian$y, prussian$corp, var)
rbind(Me, Var) %>% kbl() %>%
  kable_classic(full_width = F, html_font = "Cambria", font_size = 12)
G I II III IV IX V VI VII VIII X XI XIV XV
Me 0.8 0.800000 0.6000000 0.6000000 0.4000000 0.6500000 0.5500000 0.8500000 0.6000000 0.3500000 0.7500000 1.250000 1.200000 0.4000000
Var 0.8 1.115789 0.6736842 0.5684211 0.2526316 0.5552632 0.3657895 0.9763158 0.5684211 0.2394737 0.8289474 1.460526 1.326316 0.4631579

Exploratory data analysis

ggplot(prussian, aes(y, fill = corp)) +
  geom_histogram(binwidth=.5, position="dodge") +
  xlab("Number of deaths by Horse") + ylab("Count")

The predictor function

This specifies the explanatory variables in the model

\[ \eta_i = \beta_0 + \beta_1corp + \beta_2year + \beta_3corp:year \]

The link function and the variance function

Translating that to R

#library(pscl)
berlin<- filter(prussian, corp == c("G","I", "IX"))
berlin_model <- glm(y ~ corp+year, family=poisson,data=berlin)
summary(berlin_model)
## 
## Call:
## glm(formula = y ~ corp + year, family = poisson, data = berlin)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.60256  -1.28196  -0.09747   0.10987   1.62149  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.927216   3.519994  -0.548    0.584
## corpI        0.145268   0.534633   0.272    0.786
## corpIX      -0.008883   0.556465  -0.016    0.987
## year         0.022832   0.041260   0.553    0.580
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 20.496  on 18  degrees of freedom
## Residual deviance: 20.082  on 15  degrees of freedom
## AIC: 57.671
## 
## Number of Fisher Scoring iterations: 5

Pseudo R squared

  • For Poisson models there is no true R squared. Instead we can calculate the explained deviance (pseudo R squared) \[ pseudo R^2 = 100\frac{null deviance - residual deviance}{null deviance} \]
100*((berlin_model$null.deviance-berlin_model$deviance)/berlin_model$null.deviance)
## [1] 2.020372

Model validation: Overdispersion

  • Overdispersion means that a Poisson distribution does not adequately model the variance and is not appropriate for the analysis
  • Is the residual deviance of the fitted model bigger than the residual degrees of freedom (> 1.2 problematic).
  • We’ll discuss in the practical what to do about it.
berlin_model$deviance/berlin_model$df.residual
## [1] 1.338786

Model validation: autoplot

For Prussia

prussian_model <- glm(y ~ corp+year+corp:year, family=poisson,data=prussian)
Anova (prussian_model)
## Analysis of Deviance Table (Type II tests)
## 
## Response: y
##           LR Chisq Df Pr(>Chisq)  
## corp       26.1369 13     0.0163 *
## year        2.2866  1     0.1305  
## corp:year   8.8681 13     0.7828  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For Prussia

100*((prussian_model$null.deviance-prussian_model$deviance)/prussian_model$null.deviance)
## [1] 11.53719
prussian_model$deviance/prussian_model$df.residual
## [1] 1.134671

For Prussia

#library(ggfortify)
autoplot(prussian_model)

For Prussia

plot(residuals(prussian_model)~predict(prussian_model,type = "link"))

Post hoc tests

#library(emmeans)
marginal = emmeans(prussian_model, ~ corp)
## NOTE: Results may be misleading due to involvement in interactions
pairs(marginal)
##  contrast   estimate    SE  df z.ratio p.value
##  G - I       0.00000 0.360 Inf  0.000  1.0000 
##  G - II      0.27010 0.385 Inf  0.701  1.0000 
##  G - III     0.26885 0.385 Inf  0.698  1.0000 
##  G - IV      0.71458 0.448 Inf  1.596  0.9475 
##  G - IX      0.20024 0.379 Inf  0.528  1.0000 
##  G - V       0.40938 0.407 Inf  1.005  0.9993 
##  G - VI     -0.05547 0.356 Inf -0.156  1.0000 
##  G - VII     0.29917 0.392 Inf  0.764  1.0000 
##  G - VIII    0.82990 0.463 Inf  1.793  0.8799 
##  G - X       0.09743 0.373 Inf  0.262  1.0000 
##  G - XI     -0.40693 0.331 Inf -1.228  0.9946 
##  G - XIV    -0.39924 0.330 Inf -1.211  0.9953 
##  G - XV      0.69315 0.441 Inf  1.570  0.9538 
##  I - II      0.27010 0.385 Inf  0.701  1.0000 
##  I - III     0.26885 0.385 Inf  0.698  1.0000 
##  I - IV      0.71458 0.448 Inf  1.596  0.9475 
##  I - IX      0.20024 0.379 Inf  0.528  1.0000 
##  I - V       0.40938 0.407 Inf  1.005  0.9993 
##  I - VI     -0.05547 0.356 Inf -0.156  1.0000 
##  I - VII     0.29917 0.392 Inf  0.764  1.0000 
##  I - VIII    0.82990 0.463 Inf  1.793  0.8799 
##  I - X       0.09743 0.373 Inf  0.262  1.0000 
##  I - XI     -0.40693 0.331 Inf -1.228  0.9946 
##  I - XIV    -0.39924 0.330 Inf -1.211  0.9953 
##  I - XV      0.69315 0.441 Inf  1.570  0.9538 
##  II - III   -0.00125 0.409 Inf -0.003  1.0000 
##  II - IV     0.44448 0.468 Inf  0.950  0.9996 
##  II - IX    -0.06986 0.403 Inf -0.173  1.0000 
##  II - V      0.13927 0.430 Inf  0.324  1.0000 
##  II - VI    -0.32557 0.381 Inf -0.854  0.9999 
##  II - VII    0.02907 0.415 Inf  0.070  1.0000 
##  II - VIII   0.55980 0.483 Inf  1.160  0.9969 
##  II - X     -0.17268 0.397 Inf -0.435  1.0000 
##  II - XI    -0.67703 0.358 Inf -1.889  0.8330 
##  II - XIV   -0.66934 0.357 Inf -1.875  0.8403 
##  II - XV     0.42305 0.462 Inf  0.916  0.9997 
##  III - IV    0.44573 0.468 Inf  0.953  0.9996 
##  III - IX   -0.06861 0.403 Inf -0.170  1.0000 
##  III - V     0.14053 0.429 Inf  0.327  1.0000 
##  III - VI   -0.32432 0.381 Inf -0.851  0.9999 
##  III - VII   0.03032 0.415 Inf  0.073  1.0000 
##  III - VIII  0.56105 0.482 Inf  1.163  0.9969 
##  III - X    -0.17142 0.397 Inf -0.432  1.0000 
##  III - XI   -0.67578 0.358 Inf -1.887  0.8340 
##  III - XIV  -0.66809 0.357 Inf -1.873  0.8414 
##  III - XV    0.42430 0.462 Inf  0.919  0.9997 
##  IV - IX    -0.51434 0.463 Inf -1.111  0.9980 
##  IV - V     -0.30521 0.486 Inf -0.628  1.0000 
##  IV - VI    -0.77005 0.444 Inf -1.734  0.9040 
##  IV - VII   -0.41541 0.473 Inf -0.878  0.9998 
##  IV - VIII   0.11532 0.534 Inf  0.216  1.0000 
##  IV - X     -0.61716 0.457 Inf -1.349  0.9871 
##  IV - XI    -1.12151 0.425 Inf -2.641  0.3138 
##  IV - XIV   -1.11382 0.423 Inf -2.631  0.3204 
##  IV - XV    -0.02143 0.515 Inf -0.042  1.0000 
##  IX - V      0.20914 0.424 Inf  0.493  1.0000 
##  IX - VI    -0.25571 0.375 Inf -0.682  1.0000 
##  IX - VII    0.09893 0.409 Inf  0.242  1.0000 
##  IX - VIII   0.62966 0.478 Inf  1.318  0.9896 
##  IX - X     -0.10281 0.391 Inf -0.263  1.0000 
##  IX - XI    -0.60717 0.352 Inf -1.727  0.9068 
##  IX - XIV   -0.59948 0.350 Inf -1.712  0.9122 
##  IX - XV     0.49291 0.457 Inf  1.079  0.9985 
##  V - VI     -0.46485 0.403 Inf -1.152  0.9971 
##  V - VII    -0.11021 0.435 Inf -0.253  1.0000 
##  V - VIII    0.42052 0.500 Inf  0.840  0.9999 
##  V - X      -0.31195 0.418 Inf -0.746  1.0000 
##  V - XI     -0.81631 0.382 Inf -2.137  0.6754 
##  V - XIV    -0.80862 0.381 Inf -2.125  0.6843 
##  V - XV      0.28377 0.481 Inf  0.591  1.0000 
##  VI - VII    0.35464 0.388 Inf  0.915  0.9997 
##  VI - VIII   0.88537 0.459 Inf  1.927  0.8120 
##  VI - X      0.15290 0.368 Inf  0.415  1.0000 
##  VI - XI    -0.35146 0.326 Inf -1.076  0.9986 
##  VI - XIV   -0.34377 0.325 Inf -1.058  0.9988 
##  VI - XV     0.74862 0.438 Inf  1.710  0.9129 
##  VII - VIII  0.53073 0.488 Inf  1.088  0.9984 
##  VII - X    -0.20174 0.403 Inf -0.501  1.0000 
##  VII - XI   -0.70610 0.365 Inf -1.933  0.8084 
##  VII - XIV  -0.69841 0.364 Inf -1.920  0.8162 
##  VII - XV    0.39398 0.467 Inf  0.843  0.9999 
##  VIII - X   -0.73247 0.472 Inf -1.550  0.9582 
##  VIII - XI  -1.23683 0.441 Inf -2.806  0.2209 
##  VIII - XIV -1.22914 0.440 Inf -2.796  0.2260 
##  VIII - XV  -0.13675 0.528 Inf -0.259  1.0000 
##  X - XI     -0.50436 0.345 Inf -1.464  0.9738 
##  X - XIV    -0.49667 0.343 Inf -1.448  0.9762 
##  X - XV      0.59572 0.451 Inf  1.320  0.9895 
##  XI - XIV    0.00769 0.298 Inf  0.026  1.0000 
##  XI - XV     1.10008 0.418 Inf  2.632  0.3198 
##  XIV - XV    1.09239 0.417 Inf  2.621  0.3266 
## 
## Results are given on the log (not the response) scale. 
## P value adjustment: tukey method for comparing a family of 14 estimates