Poisson logistic regression is a generalized linear model form of regression analysis used to model count data and contingency tables. Poisson regression assumes the response variable Y has a Poisson distribution, and assumes the logarithm of its expected value can be modeled by a linear combination of unknown parameters. A Poisson regression model is sometimes known as a log-linear model, especially when used to model contingency tables.
In R, glm function are used for the Logistic regression with following parameters.
- formula: predictors used in formula (Systematic Component)
- family: "poisson" needs to be chosen(Random Component)
- link: "log".(Link Component)
Data:
This data has been downloaded from http://people.stern.nyu.edu/jsimonof/AnalCatData/ the book Analyzing Categorical Data by J.S. Simonoff.
Data has 54 observation of 4 variables: Year: it is the year by th time shark attack was happened. Population: of Florida Attacks: Number of attacks(Counts) Fatalities: Number of death.
Objective: In this study, I applied the Poisson logistic regression from simplest(null) complex(saturated) the model on Florida Shark data set..
Methods to use: - Quasipoisson logistic regression (Null to Saturated) - Choosing best model based on the extra poisson variation - Negative Binomial logistic Regression - Drop in deviance test to compare the models …
Load Libraries
#Loading necessary libraries
library(readxl)
#PREPARING WORK SPAcE
# Clear the workspace:
rm(list = ls())
shark <- read_excel("floridashark.xls")
head(shark)
## # A tibble: 6 x 4
## Year Population Attacks Fatalities
## <dbl> <dbl> <dbl> <dbl>
## 1 1946 2473000 0 0
## 2 1947 2539000 1 1
## 3 1948 2578000 0 0
## 4 1949 2668000 0 0
## 5 1950 2771305 1 0
## 6 1951 2980000 0 0
nrow(shark)
## [1] 54
Grouping the data (based on the data collection type)
# US Navy funding for International Shark Attack File ended in 1968
shark$Navy <- 1 * (shark$Year < 1969)
# ISAF moved to Florida Museum of Natural History in 1988
shark$Museum <- 1 * (shark$Year > 1987)
# volusia County started improved reporting after 1993
shark$Volusia <- 1 * (shark$Year > 1993)
head(shark)
## # A tibble: 6 x 7
## Year Population Attacks Fatalities Navy Museum Volusia
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1946 2473000 0 0 1 0 0
## 2 1947 2539000 1 1 1 0 0
## 3 1948 2578000 0 0 1 0 0
## 4 1949 2668000 0 0 1 0 0
## 5 1950 2771305 1 0 1 0 0
## 6 1951 2980000 0 0 1 0 0
Plots
plot(shark$Year, shark$Attacks, pch = 16, col = "blue",
xlab= "YEAR", ylab="Number of Attacks")
title("Attacks versus Year")
plot(shark$Population, shark$Attacks, pch = 16, col = "blue",
xlab= "Florida Population", ylab="Number of Attacks")
title("Attacks versus Population")
Is there an evidence for Extra-Poisson variation?
In order to address this question is, we can divide the time to 5 year windows and cjeck their means and variances.
So, if there is Poisson like variation? we should expect to see the mean and variance same. If variances higher than the mean, then I have an extra poisson variation, and if it less than the mean, than I have under poisson variation which is really rare.
#construct (approximate) five-year windows:
window <- factor(shark$Year %/%5) #division modulo 5
table(window)
## window
## 389 390 391 392 393 394 395 396 397 398 399
## 4 5 5 5 5 5 5 5 5 5 5
Now compute mean and variance for each window,
#Compute means and variances for each windows
m <- tapply(shark$Attacks, INDEX = window, FUN = "mean")
v <- tapply(shark$Attacks, INDEX = window, FUN = "var")
#Plot means and variances
plot(m, v, pch = 16, col = "blue", xlab = "window mean", ylab = "window variance")
abline(c(0, 1), lwd = 3, col = "red")
Yes, definitely, thee is an extra poisson variation. The observations are spread out around the 1-1 reference line.
If mean equals variance, we would expect these observations around the reference line.
Model exploration
fit_null <- glm(Attacks ~ 1,
offset = log(Population),
family = poisson(link = log), data = shark)
deviance(fit_null)
## [1] 176.9336
fit_null <- glm(Attacks ~ 1,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
deviance(fit_null)
## [1] 176.9336
As you notice, quasipoisson and poisson deviance are the same, it is becauase it doesnt care about the dispersion parameter when it calculates the deviance.
Fit many models, then extract their residual deviances:
fit_NYMV <- glm(Attacks ~ Navy + Museum + Year + Volusia,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
fit_NYMV
##
## Call: glm(formula = Attacks ~ Navy + Museum + Year + Volusia, family = quasipoisson(link = log),
## data = shark, offset = log(Population))
##
## Coefficients:
## (Intercept) Navy Museum Year Volusia
## -96.43402 0.82384 -0.04582 0.04147 0.34575
##
## Degrees of Freedom: 53 Total (i.e. Null); 49 Residual
## Null Deviance: 176.9
## Residual Deviance: 95.2 AIC: NA
fit_NYV <- glm(Attacks ~ Navy + Year + Volusia,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
fit_NMV <- glm(Attacks ~ Navy + Museum + Volusia,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
fit_YMV <- glm(Attacks ~ Museum + Year + Volusia,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
fit_NV <- glm(Attacks ~ Navy + Volusia,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
fit_YV <- glm(Attacks ~ Year + Volusia,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
fit_MV <- glm(Attacks ~ Museum + Volusia,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
fit_NYM <- glm(Attacks ~ Navy + Museum + Year,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
fit_NY <- glm(Attacks ~ Navy + Year,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
fit_NM <- glm(Attacks ~ Navy + Museum,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
fit_YM <- glm(Attacks ~ Museum + Year,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
fit_N <- glm(Attacks ~ Navy,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
fit_Y <- glm(Attacks ~ Year,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
fit_M <- glm(Attacks ~ Museum,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
fit_V <- glm(Attacks ~ Volusia,
offset = log(Population),
family = quasipoisson(link = log), data = shark)
# Each fit is a list: create a list of lists
fit <- list(fit_NYMV,
fit_NYV, fit_NMV, fit_YMV, fit_NYM,
fit_NV, fit_YV, fit_MV, fit_NY, fit_NM, fit_YM,
fit_N, fit_Y, fit_M, fit_V,
fit_null)
# Apply the "deviance" function to each element of list:
qdev <- lapply(fit, FUN = "deviance")
qdev
## [[1]]
## [1] 95.19651
##
## [[2]]
## [1] 95.24397
##
## [[3]]
## [1] 104.9774
##
## [[4]]
## [1] 102.2859
##
## [[5]]
## [1] 99.55595
##
## [[6]]
## [1] 111.8997
##
## [[7]]
## [1] 103.407
##
## [[8]]
## [1] 104.9789
##
## [[9]]
## [1] 99.57217
##
## [[10]]
## [1] 122.3856
##
## [[11]]
## [1] 114.6168
##
## [[12]]
## [1] 166.3714
##
## [[13]]
## [1] 119.1093
##
## [[14]]
## [1] 122.3871
##
## [[15]]
## [1] 112.7842
##
## [[16]]
## [1] 176.9336
Focus on the most competitive models: residual deviance close to that of the largest model, but not the largest model:
# Each fit is a list: create a list of lists
fit <- list(fit_NYMV,
fit_NYV, fit_NMV, fit_YMV, fit_NYM,
fit_NV, fit_YV, fit_MV, fit_NY, fit_NM, fit_YM,
fit_N, fit_Y, fit_M, fit_V,
fit_null)
summary(fit[[3]])
##
## Call:
## glm(formula = Attacks ~ Navy + Museum + Volusia, family = quasipoisson(link = log),
## data = shark, offset = log(Population))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0581 -1.0013 -0.3830 0.4763 4.4833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.343325 0.140194 -102.310 < 2e-16 ***
## Navy 0.006403 0.233022 0.027 0.97819
## Museum 0.413589 0.221032 1.871 0.06718 .
## Volusia 0.593899 0.209040 2.841 0.00649 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 2.044056)
##
## Null deviance: 176.93 on 53 degrees of freedom
## Residual deviance: 104.98 on 50 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
summary(fit[[4]])
##
## Call:
## glm(formula = Attacks ~ Museum + Year + Volusia, family = quasipoisson(link = log),
## data = shark, offset = log(Population))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0710 -0.9188 -0.2888 0.5196 4.1630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.25082 20.02398 -1.860 0.0687 .
## Museum 0.20103 0.26869 0.748 0.4578
## Year 0.01161 0.01015 1.145 0.2579
## Volusia 0.52440 0.21568 2.431 0.0187 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 2.003421)
##
## Null deviance: 176.93 on 53 degrees of freedom
## Residual deviance: 102.29 on 50 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
summary(fit[[8]])
##
## Call:
## glm(formula = Attacks ~ Museum + Volusia, family = quasipoisson(link = log),
## data = shark, offset = log(Population))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0617 -0.9983 -0.3792 0.4819 4.4761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.3410 0.1109 -129.340 < 2e-16 ***
## Museum 0.4113 0.2023 2.033 0.04726 *
## Volusia 0.5939 0.2070 2.869 0.00597 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 2.00394)
##
## Null deviance: 176.93 on 53 degrees of freedom
## Residual deviance: 104.98 on 51 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
Is there evidence of extra-Poisson variation? Note that We use quasipoisson to fit the models. Model3 has Dispersion parameter for quasipoisson family is 2.044056, so it is a sign for extre poisson variation, otherwise it should have been around 1. Another sign is that min and max deviance residuals, we dont expect that large deviances bigger than plus/minus 3.
Model4 has Dispersion parameter for quasipoisson family is 2.003421, so it is a sign for extre poisson variation, otherwise it should have been around 1. Again, another sign is that min and max deviance residuals, we dont expect that large deviances bigger than plus/minus 3.
Model8 has Dispersion parameter for quasipoisson family is 2.00394, so it is a sign for extre poisson variation, otherwise it should have been around 1. Again, another sign is that min and max deviance residuals, we dont expect that large deviances bigger than plus/minus 3.
Plot the best models:
plot(shark$Year, shark$Attacks, pch = 16, col = "blue")
lines(shark$Year, predict(fit[[3]], type = "response"), lwd = 2, col = "black")
lines(shark$Year, predict(fit[[4]], type = "response"), lwd = 2, col = "green")
lines(shark$Year, predict(fit[[8]], type = "response"), lwd = 2, col = "red")
As you can see that, there are jumps on the graph, it is because the data collection type was changed.
Model without offset
Note that when we use offset(log population)if fixes regression coefficent and it does not estimate it.
What if we had not used an offset (log(Population) with known coefficient equal to one), but instead used log(Population) as a covariate with unknown coefficient?
fit_noff <- glm(Attacks ~ log(Population) + Volusia + Museum,
family = quasipoisson(link = log), data = shark)
summary(fit_noff)
##
## Call:
## glm(formula = Attacks ~ log(Population) + Volusia + Museum, family = quasipoisson(link = log),
## data = shark)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0972 -0.9464 -0.2752 0.4703 4.0800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.7418 4.4999 -4.609 2.82e-05 ***
## log(Population) 1.4044 0.2836 4.951 8.79e-06 ***
## Volusia 0.5511 0.2078 2.651 0.0107 *
## Museum 0.1853 0.2505 0.740 0.4629
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 1.978416)
##
## Null deviance: 387.38 on 53 degrees of freedom
## Residual deviance: 100.73 on 50 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
Since the log population coefficient is very close to 1, we dont see much difference.
Plot the best models without having offset
plot(shark$Year, shark$Attacks, pch = 16, col = "blue")
lines(shark$Year, predict(fit[[8]], type = "response"), lwd = 2, col = "red")
lines(shark$Year, predict(fit_noff, type = "response"), lwd = 2, col = "black")
Creating a calibrated data set
In order to avoid jumps, we can calibrate the data across all years.
This means we want to construct predictions with the variable Museum and Volusia always equal to one.
comparable <- shark # we will keep the Year and Population values
comparable$Museum <- 1
comparable$Volusia <- 1
plot(shark$Year, shark$Attacks, pch = 16, col = "blue")
lines(shark$Year, predict(fit[[8]], newdata = comparable, type = "response"),
lwd = 2, col = "red")
lines(shark$Year, predict(fit_noff, newdata = comparable, type = "response"),
lwd = 2, col = "black")
Negative Binomial Regression
As an alternative to the quasi-likelihood approach, we could do Negative Binomial regression.
I will be using MASS library to glm.nb function.
library(MASS)
fit_NB <- glm.nb(formula = Attacks ~ Museum + Volusia
+ offset(log(Population)),
data = shark)
summary(fit_NB)
##
## Call:
## glm.nb(formula = Attacks ~ Museum + Volusia + offset(log(Population)),
## data = shark, init.theta = 8.024950684, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6945 -0.8388 -0.2389 0.4015 3.1019
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.36178 0.09814 -146.332 <2e-16 ***
## Museum 0.43445 0.21134 2.056 0.0398 *
## Volusia 0.59249 0.25078 2.363 0.0181 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(8.025) family taken to be 1)
##
## Null deviance: 98.973 on 53 degrees of freedom
## Residual deviance: 67.664 on 51 degrees of freedom
## AIC: 269.95
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 8.02
## Std. Err.: 4.41
##
## 2 x log-likelihood: -261.953
This estimated coefficients and standard errors are almost identical to quasi poisson model.
Plot Negative binomial regression and Poisson regression fit
plot(shark$Year, shark$Attacks, pch = 16, col = "blue")
lines(shark$Year, predict(fit[[8]], type = "response"), lwd = 2, col = "red")
lines(shark$Year, exp(predict(fit_NB)), lwd = 2, col = "green")
Finally, comparing the Poisson, quasi-Poisson, and negative binomial mean-variance models: \[\begin{align*} \mbox{Poisson:} & \quad\mbox{variance}=\mbox{mean}\\ \mbox{quasi-Poisson:} & \quad \mbox{variance}=\psi\times \mbox{mean}\\ \mbox{Negative Binomial:} & \quad\mbox{variance}=\mbox{mean}+\frac{\mbox{mean}^2}{\theta}. \end{align*}\]
where \(\psi\) and \(\theta\) is constant. Note that Poisson and Quasipossion are lines and Negative binomial is quadratic function.
plot(m, v, pch = 16, col = "blue",
xlab = "window mean", ylab = "window variance")
abline(c(0,1), col = "red", lwd = 2)
mu_NB <- sort(exp(predict(fit_NB)))
lines(mu_NB, mu_NB + mu_NB^2 / fit_NB$theta, col = "green", lwd = 2)
mu <- sort(exp(predict(fit[[8]])))
psi <- summary(fit[[8]])$dispersion
lines(mu, psi * mu, col = "black", lwd = 2)
So there are two different ways to handle extra poisson variation; quasipoisson or Negative binomial.