We will go through some theory about Poisson regression models and then work through a complete example on a subset of a real dataset: we will fit a model, perform model selection with a stepwise method, validate it, and interpret the output. Bear with me through this detailed example and you will get a good overview of what fitting a Poisson model entails.
We often use the Poisson distribution to model count data, that is, a response variable that takes nonnegative integer values. If \(Y \sim Poi(\lambda)\) with \(\lambda > 0\), then the Probability Mass Function (PMF) is given by
\[ P(Y = y) = \frac{\lambda^{y} e^{- \lambda}}{y!}, \ \ \ y = 0,1,2,...\]
In addition, for a Poisson distributed random variable we have \(E[Y] = var(Y) = \lambda\): the mean of the distribution is equal to its variance. Moreover, for independent Poisson variables, \(\sum_{i=1}^{n} y_{i} \sim Poi(\sum_{i=1}^{n} \lambda_{i})\). A quick simulation check of these properties is sketched below; the code that follows then draws bar plots of Poisson samples for several values of \(\lambda\).
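As a quick sanity check (a small simulation sketch, not part of the analysis that follows), both properties are easy to verify numerically:
# Quick simulation check: for a Poisson sample the mean and variance agree,
# and the sum of two independent Poisson variables is again Poisson.
set.seed(2023)
y <- rpois(n = 1e5, lambda = 4)
c(mean = mean(y), variance = var(y))   # both close to 4
z <- rpois(n = 1e5, lambda = 1) + rpois(n = 1e5, lambda = 3)
c(mean = mean(z), variance = var(z))   # again close to 1 + 3 = 4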
# packages used for the plots
library(ggplot2)
library(dplyr)
library(gridExtra)

set.seed(2023)
s1 <- data.frame('data' = rpois(n = 1000, lambda = 0.5))
s2 <- data.frame('data' = rpois(n = 1000, lambda = 2))
s3 <- data.frame('data' = rpois(n = 1000, lambda = 10))
p1 <- s1 %>% ggplot() +
  geom_bar(aes(x = data, y = after_stat(count / sum(count))), width = 0.5,
           fill = 'firebrick3') +
  labs(x = 'y', y = 'proportion', title = expression(lambda == 0.5)) +
  theme_minimal()
p2 <- s2 %>% ggplot() +
  geom_bar(aes(x = data, y = after_stat(count / sum(count))), width = 0.75,
           fill = 'firebrick3') +
  labs(x = 'y', y = 'proportion', title = expression(lambda == 2)) +
  theme_minimal()
p3 <- s3 %>% ggplot() +
  geom_bar(aes(x = data, y = after_stat(count / sum(count))), width = 0.75,
           fill = 'firebrick3') +
  labs(x = 'y', y = 'proportion', title = expression(lambda == 10)) +
  theme_minimal()
Finally, we can display the three simulated Poisson samples side by side.
grid.arrange(p1, p2, p3, nrow = 1)
Now let us consider \(n\) independent observations \(y_{1},...,y_{n}\) for which we assume a Poisson distribution conditionally on a set of \(p\) categorical or numerical covariates \(x_{j}\), for \(j = 1,..., p\). The model is given by the following equation
\[ ln \bigg( E[y_{i} \mid x_{i}] \bigg) = ln \big( \lambda_{i} \big) = \beta_{0} + \beta_{1} x_{i1} + ... + \beta_{p} x_{ip} = \boldsymbol{x}_{i}^{T} \boldsymbol{\beta} \]
with \(i = 1,..., n\), \(\boldsymbol{x}_{i} = (1, x_{i1}, ..., x_{ip})^{T}\) and \(\boldsymbol{\beta} = ( \beta_{0} ,..., \beta_{p})^{T}\).
The natural (canonical) link function for the Poisson model is the log link, which ensures that \(\lambda_{i} > 0\). It follows that
\[ E \big[ y_{i} \mid x_{i} \big] =\lambda_{i} = e^{\beta_{0} + \beta_{1} x_{i1} + ... + \beta_{p} x_{ip}} = e^{ \boldsymbol{x}_{i}^{T} \boldsymbol{\beta}} \]
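As a small numerical illustration (the coefficient and covariate values below are made up for this sketch), the inverse log link turns any linear predictor into a strictly positive mean:
# Applying the inverse log link to a linear predictor (hypothetical values)
beta <- c(0.5, 0.2, -0.1)    # (beta0, beta1, beta2)
x    <- c(1, 3, 2)           # (1, x1, x2), leading 1 for the intercept
eta    <- sum(x * beta)      # linear predictor x^T beta = 0.9
lambda <- exp(eta)           # fitted mean, always positive (about 2.46)
c(eta = eta, lambda = lambda)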
The Poisson Generalized Linear Model (GLM) is suitable for modeling a count response \(Y\) when the following assumptions are met.
Count response: The response variable is a count (non-negative integers), i.e. the number of times an event occurs in a homogeneous time interval or a given region of space (e.g. the number of goals scored during a football game). It is suitable for grouped or ungrouped data, since a sum of independent Poisson distributed observations is also Poisson. When the response is a category (e.g. a ranking), we should consider a Multinomial GLM instead.
Independent events: The counts, i.e. the events, are assumed to be independent of each other. When this assumption does not hold, we should consider a Generalized Linear Mixed Model (GLMM) instead.
Mean equal to variance: The factors affecting the mean also affect the variance, and the variance is assumed to be equal to the mean (equidispersion). When this assumption does not hold, we should consider a Quasipoisson GLM for overdispersed (or underdispersed) data, or a Negative Binomial GLM instead (a minimal sketch of the latter follows below).
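As an aside, the Negative Binomial alternative mentioned above can be fitted with MASS::glm.nb. A minimal sketch on simulated overdispersed counts; the data frame df and its columns x and y are hypothetical and not part of the Affairs analysis below.
# Negative binomial fit on simulated overdispersed counts (illustrative sketch)
library(MASS)
set.seed(1)
df   <- data.frame(x = runif(100))
df$y <- rnbinom(100, mu = exp(1 + 0.5 * df$x), size = 1.5)  # overdispersed counts
nb_fit <- glm.nb(y ~ x, data = df)
summary(nb_fit)  # also reports the estimated dispersion parameter theta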
The log-likelihood function of the model is given by
\[ l( \boldsymbol{y}, \boldsymbol{\beta}) = \sum_{i=1}^{n} \bigg( y_{i} \boldsymbol{x}_{i}^{T} \boldsymbol{\beta} - e^{\boldsymbol{x}_{i}^{T} \boldsymbol{\beta}} - ln(y_{i}!) \bigg) \]
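As a sketch (on simulated data, not the Affairs subset used later), this log-likelihood can be evaluated by hand at the fitted coefficients and checked against logLik():
# Poisson log-likelihood computed by hand versus logLik()
set.seed(1)
x_sim   <- runif(50)
y_sim   <- rpois(50, lambda = exp(1 + 0.5 * x_sim))
fit_sim <- glm(y_sim ~ x_sim, family = 'poisson')
eta_hat <- drop(model.matrix(fit_sim) %*% coef(fit_sim))
sum(y_sim * eta_hat - exp(eta_hat) - lfactorial(y_sim))  # log-likelihood by hand
logLik(fit_sim)                                          # should match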
Differentiating with respect to \(\boldsymbol{\beta}\) and setting the derivatives equal to \(0\) yields the Maximum Likelihood (score) equations
\[ \sum_{i=1}^{n} \big(y_{i} - e^{\boldsymbol{x}_{i}^{T} \boldsymbol{\beta}} \big) x_{ij} = 0 \]
with \(j = 0,..., p\) and \(x_{i0} = 1\).
There is no closed-form solution for the Maximum Likelihood equations. We therefore have to resort to numerical optimization, for example the Iteratively Weighted Least Squares (IWLS) algorithm or the Newton-Raphson algorithm to obtain estimates of the regression coefficients.
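A minimal Newton-Raphson / IRLS sketch for the Poisson log link is shown below, purely for illustration; the helper poisson_irls and the simulated data are made up for this sketch, and in practice glm() performs these iterations for us.
# Hand-rolled IRLS for a Poisson GLM with log link (illustrative sketch only)
poisson_irls <- function(X, y, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))                     # starting values
  for (it in 1:maxit) {
    eta    <- drop(X %*% beta)                # linear predictor
    lambda <- exp(eta)                        # current fitted means
    W      <- diag(lambda)                    # IRLS weights for the log link
    z      <- eta + (y - lambda) / lambda     # working response
    beta_new <- drop(solve(t(X) %*% W %*% X, t(X) %*% W %*% z))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}
# Check on simulated data: should be close to coef(glm(y ~ x, family = 'poisson'))
set.seed(1)
x <- runif(200)
y <- rpois(200, lambda = exp(0.5 + 1.2 * x))
poisson_irls(cbind(1, x), y)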
\(\beta_{0}\) is the log of the mean response when all covariates \(x_{j}\) are equal to \(0\); thus \(e^{\beta_{0}}\) is the expected count in that baseline situation.
\(\beta_{j}\), for \(j > 0\), is the change in the log of the mean when \(x_{j}\) increases by one unit, all other covariates held constant; thus \(e^{\beta_{j}}\) is the multiplicative change in the mean (the rate ratio). For example, if \(\beta_{j} = 0.3\), a one-unit increase in \(x_{j}\) multiplies the expected count by \(e^{0.3} \approx 1.35\), i.e. a \(35\%\) increase.
We will fit a Poisson regression model to a subset of the ‘Affairs’ dataset (after W. H. Greene).
There are \(n= 20\) observations and \(8\) variables in the reduced dataset. The variable ‘affairs’ is the number of extramarital affairs in the past year and is our response variable. We will include as covariates the variables ‘gender’, ‘age’, ‘yearsmarried’, ‘children’, ‘religiousness’, ‘education’ and ‘rating’ in our analysis. ‘religiousness’ ranges from \(1\) (anti) to \(5\) (very) and ‘rating’ is a self rating of the marriage, ranging from \(1\) (very unhappy) to \(5\) (very happy).
data(Affairs, package = 'AER')
set.seed(2023)
data <- Affairs[sample(nrow(Affairs), size = 20, replace = FALSE),-c(8)]
head(data)
## affairs gender age yearsmarried children religiousness education rating
## 174 12 female 42 15 yes 5 9 1
## 1895 0 female 32 15 yes 2 14 4
## 1540 0 male 32 10 yes 3 20 5
## 1226 0 female 32 15 yes 4 18 4
## 1445 12 male 37 15 yes 5 17 2
## 526 12 female 42 15 yes 4 12 1
dim(data)
## [1] 20 8
class(data)
## [1] "data.frame"
# Poisson model
poisson.model <- glm(affairs ~ .,
family = 'poisson', data = data)
summary(poisson.model)
##
## Call:
## glm(formula = affairs ~ ., family = "poisson", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8004 -0.9069 -0.5699 0.0426 3.3917
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.55732 2433.69469 0.000 0.999817
## gendermale 2.90347 0.81913 3.545 0.000393 ***
## age -0.17067 0.05807 -2.939 0.003293 **
## yearsmarried 0.11233 0.08247 1.362 0.173180
## childrenyes 14.46413 2433.69273 0.006 0.995258
## religiousness -0.07241 0.17862 -0.405 0.685188
## education -0.50119 0.12543 -3.996 6.45e-05 ***
## rating -0.80337 0.17495 -4.592 4.39e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 150.341 on 19 degrees of freedom
## Residual deviance: 41.005 on 12 degrees of freedom
## AIC: 88.102
##
## Number of Fisher Scoring iterations: 15
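Exponentiating the estimated coefficients (and, if desired, their Wald confidence limits) turns them into rate ratios, i.e. multiplicative effects on the mean. A short snippet, output not shown:
# Rate ratios and Wald-type 95% confidence intervals on the mean scale
exp(coef(poisson.model))
exp(confint.default(poisson.model))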
The deviance of the model (also called G-statistic) is given by
\[ D_{model} = 2 \sum_{i=1}^{n} \bigg( y_{i} ln\bigg( \frac{y_{i}}{\hat{\lambda}_{i}} \bigg) - (y_{i} - \hat{\lambda}_{i}) \bigg) \]
where \(\hat{\lambda}_{i} = e^{ \boldsymbol{x}_{i}^{T} \boldsymbol{\hat{\beta}}}\) is the fitted value of \(\lambda_{i}\).
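As a check, the deviance can be computed by hand from the fitted values (terms with \(y_{i} = 0\) contribute \(0\) to \(y_{i} \, ln(y_{i}/\hat{\lambda}_{i})\)) and compared with deviance(); output not shown:
# Deviance computed by hand versus the value reported by deviance()
y_obs   <- data$affairs
lam_hat <- fitted(poisson.model)
D_hand  <- 2 * sum(ifelse(y_obs == 0, 0, y_obs * log(y_obs / lam_hat)) - (y_obs - lam_hat))
c(by_hand = D_hand, from_glm = deviance(poisson.model))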
The deviance can be used as a goodness-of-fit test. We test \(H_{0}\): ‘The model is appropriate’ versus \(H_{1}\): ‘The model is not appropriate’. Under \(H_{0}\), we have that
\[ D_{model} \overset{\text{approx.}}{\sim} \chi_{n-(p+1)}^{2} \]
where \(p+1\) is the number of parameters of the model. We reject \(H_{0}\) at level \(\alpha\) when \(D_{model}\) exceeds the \(1-\alpha\) quantile \(\chi_{1-\alpha,\, n-(p+1)}^{2}\).
# p-value of Residual deviance goodness-of-fit test
1 - pchisq(deviance(poisson.model), df = poisson.model$df.residual)
## [1] 4.890236e-05
Our model does not fit the data well: the p-value is about \(4.9 \times 10^{-5}\), so \(H_{0}\) is rejected.
The Pearson goodness-of-fit statistic is given by
\[ X^{2} = \sum_{i=1}^{n} \frac{(y_{i} -\hat{ \lambda}_{i})^{2}}{\hat{\lambda}_{i}} \]
where \(\hat{\lambda}_{i} = e^{ \boldsymbol{x}_{i}^{T} \boldsymbol{\hat{\beta}}}\) is the fitted value of \(\lambda_{i}\).
We test \(H_{0}\): ‘The model is appropriate’ versus \(H_{1}\): ‘The model is not appropriate’. Under \(H_{0}\), we have that
\[ X^{2} \overset{\text{approx.}}{\sim} \chi_{n-(p+1)}^{2} \]
where \(p+1\) is the number of parameters of the model. We reject \(H_{0}\) at level \(\alpha\) when \(X^{2}\) exceeds the \(1-\alpha\) quantile \(\chi_{1-\alpha,\, n-(p+1)}^{2}\).
# Pearson's goodness-of-fit
Pearson <- sum((data$affairs - poisson.model$fitted.values)^2
/ poisson.model$fitted.values)
1 - pchisq(Pearson, df = poisson.model$df.residual)
## [1] 4.702802e-05
The Pearson statistic leads to the same conclusion: the p-value is about \(4.7 \times 10^{-5}\), so \(H_{0}\) is rejected.
The variance of \(y_{i}\) is approximated by \((y_{i} - \hat{\lambda}_{i})^{2}\). The first graph shows that the range of this quantity differs markedly from the range of the fitted means, and the second graph shows a pattern in the Pearson residuals. The assumption \(E[Y] = var(Y)\) does not seem to hold, so let us examine the dispersion of the data and consider a Quasipoisson model in case of overdispersion.
lambdahat <- fitted(poisson.model)
par(mfrow = c(1,2), pty = "s")
plot(lambdahat, (data$affairs - lambdahat)^2,
     xlab = expression(hat(lambda)), ylab = expression((y - hat(lambda))^2))
plot(lambdahat, resid(poisson.model, type = "pearson"),
     xlab = expression(hat(lambda)), ylab = "Pearson Residuals")
The variance of \(Y\) may be proportional to its mean rather than equal to it. We can write
\[ var(Y) = \phi \, E[Y] = \phi \lambda \]
where \(\phi\) is a scale parameter of dispersion and is equal to \(1\) if the equality \(E[Y] = var(Y)\) holds. If \(\phi > 1\), the data are overdispersed and if \(\phi < 1\), the data are underdispersed. If a Poisson model is fitted under overdispersion of the response, then the standard errors of the estimated coefficients are underestimated. The scale parameter \(\phi\) can be estimated as
\[ \hat{\phi} = \frac{\sum_{i=1}^{n} \frac{(y_{i} -\hat{ \lambda}_{i})^{2}}{\hat{\lambda}_{i}}}{n-(p+1)} = \frac{X^{2}}{n-(p+1)} \]
# Estimated dispersion parameter
Pearson / poisson.model$df.residual
## [1] 3.425568
The estimated dispersion parameter is roughly equal to \(3.43\) for our data. Let us try a Quasipoisson regression model.
The fitted Quasipoisson model yields the following R output. Note that the coefficients and the deviance are identical to those of the Poisson fit; only the standard errors are inflated by the estimated dispersion, so the deviance goodness-of-fit test is unchanged.
# Quasipoisson model
quasipoisson.model <- glm(affairs ~ .,
family = 'quasipoisson', data = data)
summary(quasipoisson.model)
##
## Call:
## glm(formula = affairs ~ ., family = "quasipoisson", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8004 -0.9069 -0.5699 0.0426 3.3917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.55732 4504.35291 0.000 0.9999
## gendermale 2.90347 1.51608 1.915 0.0796 .
## age -0.17067 0.10748 -1.588 0.1383
## yearsmarried 0.11233 0.15264 0.736 0.4759
## childrenyes 14.46413 4504.34928 0.003 0.9975
## religiousness -0.07241 0.33059 -0.219 0.8303
## education -0.50119 0.23216 -2.159 0.0518 .
## rating -0.80337 0.32380 -2.481 0.0289 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 3.425568)
##
## Null deviance: 150.341 on 19 degrees of freedom
## Residual deviance: 41.005 on 12 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 15
# p-value of Residual deviance goodness-of-fit test
1 - pchisq(deviance(quasipoisson.model), df = quasipoisson.model$df.residual)
## [1] 4.890236e-05
Some variables may not be relevant to the model or may have low explanatory power. Stepwise model selection offers one way to choose covariates, based on reducing the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC); since these criteria are not available for Quasipoisson models, the selection is performed on the Poisson fit.
# variable selection using BIC
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
stepAIC(poisson.model, direction = 'both', k = log(dim(data)[1]))
## Start: AIC=96.07
## affairs ~ gender + age + yearsmarried + children + religiousness +
## education + rating
##
## Df Deviance AIC
## - religiousness 1 41.169 93.236
## - children 1 41.231 93.298
## - yearsmarried 1 43.037 95.104
## <none> 41.005 96.068
## - age 1 51.372 103.439
## - gender 1 57.363 109.430
## - education 1 63.560 115.627
## - rating 1 77.775 129.841
##
## Step: AIC=93.24
## affairs ~ gender + age + yearsmarried + children + education +
## rating
##
## Df Deviance AIC
## - children 1 41.407 90.478
## - yearsmarried 1 43.038 92.109
## <none> 41.169 93.236
## + religiousness 1 41.005 96.068
## - age 1 53.378 102.448
## - gender 1 57.369 106.440
## - education 1 63.894 112.965
## - rating 1 78.697 127.768
##
## Step: AIC=90.48
## affairs ~ gender + age + yearsmarried + education + rating
##
## Df Deviance AIC
## - yearsmarried 1 43.478 89.553
## <none> 41.407 90.478
## + children 1 41.169 93.236
## + religiousness 1 41.231 93.298
## - age 1 54.563 100.638
## - gender 1 58.957 105.032
## - education 1 65.960 112.036
## - rating 1 79.587 125.662
##
## Step: AIC=89.55
## affairs ~ gender + age + education + rating
##
## Df Deviance AIC
## <none> 43.478 89.553
## + yearsmarried 1 41.407 90.478
## + children 1 43.038 92.109
## + religiousness 1 43.478 92.549
## - gender 1 59.877 102.956
## - education 1 66.504 109.584
## - age 1 77.193 120.272
## - rating 1 88.952 132.032
##
## Call: glm(formula = affairs ~ gender + age + education + rating, family = "poisson",
## data = data)
##
## Coefficients:
## (Intercept) gendermale age education rating
## 12.2632 2.6951 -0.1084 -0.4541 -0.8337
##
## Degrees of Freedom: 19 Total (i.e. Null); 15 Residual
## Null Deviance: 150.3
## Residual Deviance: 43.48 AIC: 84.57
According to the stepwise output, the variables 'gender', 'age', 'education' and 'rating' carry most of the explanatory power, while 'yearsmarried', 'children' and 'religiousness' contribute little. The next step is to compare, in terms of predictive performance, the full Quasipoisson model with a reduced Quasipoisson model that keeps only four of the covariates.
We will compare the two models in terms of prediction using leave-one-out cross-validation (LOOCV); the model with the lowest Root Mean Squared Error (RMSE) will be preferred.
# Leave-one-out cross-validation (LOOCV)
pred.cv.mod_1 <- pred.cv.mod_2 <- numeric(dim(data)[1])
for (i in 1:dim(data)[1]) {
  mod_1 <- glm(affairs ~ .,
               family = 'quasipoisson', data = data, subset = -i)
  mod_2 <- glm(affairs ~ children + yearsmarried + religiousness + rating,
               family = 'quasipoisson', data = data, subset = -i)
  pred.cv.mod_1[i] <- predict.glm(mod_1, data[i, ], type = 'response')
  pred.cv.mod_2[i] <- predict.glm(mod_2, data[i, ], type = 'response')
}
error.mod_1 <- (1/dim(data)[1]) * sum((data$affairs - pred.cv.mod_1)^2)
error.mod_2 <- (1/dim(data)[1]) * sum((data$affairs - pred.cv.mod_2)^2)
# Root Mean Squared Error (RMSE)
sqrt(c(error.mod_1, error.mod_2))
## [1] 14557.069929 5.789588
Clearly, the reduced model with four covariates yields far better predictions than the complete model, whose enormous RMSE reflects its numerically unstable 'children' and intercept estimates, and should therefore be preferred. The RMSE nevertheless remains fairly large, suggesting potential outliers in the dataset.
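The same LOOCV errors can be cross-checked with boot::cv.glm; a sketch, whose default cost is the mean squared prediction error, so the square root of delta[1] is comparable to the RMSEs above:
# LOOCV via boot::cv.glm (K defaults to n), as a cross-check of the loop above
library(boot)
cv_full    <- cv.glm(data, glm(affairs ~ ., family = 'quasipoisson', data = data))
cv_reduced <- cv.glm(data, glm(affairs ~ children + yearsmarried + religiousness + rating,
                               family = 'quasipoisson', data = data))
sqrt(c(cv_full$delta[1], cv_reduced$delta[1]))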
Next, we refit the reduced Quasipoisson model on the full subset and use diagnostic plots to check whether the assumptions are fulfilled.
# Reduced Quasipoisson model
quasipoisson.model.2 <- glm(affairs ~ children + yearsmarried + religiousness + rating,
                            family = 'quasipoisson', data = data)
# Diagnostic plots
par(mfrow = c(2,3))
plot(quasipoisson.model.2, which = 1:6)
Based on the Cook’s distance, observation \(174\) appears to be atypical and to have a strong influence on the parameter estimates as well as on the predictions. This observation should be removed.
# Final model
# Remove the outlier identified by Cook's distance
round(cooks.distance(quasipoisson.model.2)) # observation 174 is atypical
## 174 1895 1540 1226 1445 526 903 1138 1548 648 374 344 1671 929 227 1654
## 11 0 0 0 1 2 0 0 1 0 0 1 0 0 0 0
## 670 635 1341 516
## 0 0 0 0
data2 <- data[ - which.max(round(cooks.distance(quasipoisson.model.2))), ]
quasipoisson.model.3 = glm(affairs ~ children + yearsmarried + religiousness + rating,
family = 'quasipoisson', data = data2, maxit = 100)
summary(quasipoisson.model.3)
##
## Call:
## glm(formula = affairs ~ children + yearsmarried + religiousness +
## rating, family = "quasipoisson", data = data2, maxit = 100)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8931 -1.8296 -0.9807 0.3044 4.0127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.76079 3344.36345 -0.004 0.9972
## childrenyes 16.45699 3344.36313 0.005 0.9961
## yearsmarried -0.08343 0.07988 -1.044 0.3140
## religiousness -0.14587 0.29007 -0.503 0.6229
## rating -0.83127 0.34010 -2.444 0.0284 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 5.058069)
##
## Null deviance: 137.169 on 18 degrees of freedom
## Residual deviance: 68.519 on 14 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 14
# p-value of Residual deviance goodness-of-fit test
1 - pchisq(deviance(quasipoisson.model.3), df = quasipoisson.model.3$df.residual)
## [1] 3.574799e-09
# Pearson's goodness-of-fit
Pearson <- sum((data2$affairs - quasipoisson.model.3$fitted.values)^2
/ quasipoisson.model.3$fitted.values)
1 - pchisq(Pearson, df = quasipoisson.model.3$df.residual)
## [1] 1.374603e-09
Once the outlier has been removed, the parameter estimates change noticeably; 'rating' is the only covariate that remains significant at the \(5\%\) level, while the huge standard errors attached to the intercept and to 'children' indicate that these two terms are estimated very imprecisely. This is our final model.
The problems of overdispersion, covariate selection and influence of outliers have been addressed. The final Quasipoisson model explains about \(50 \%\) of the null deviance (\(1 - 68.5/137.2 \approx 0.50\)), although the goodness-of-fit tests above show that the absolute fit is still far from perfect.
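The proportion of explained deviance and the rate ratios used in the interpretation below can be read directly off the fitted object (output not shown):
# Proportion of the null deviance explained by the final model,
# and multiplicative effects (rate ratios) of the covariates
1 - quasipoisson.model.3$deviance / quasipoisson.model.3$null.deviance
exp(coef(quasipoisson.model.3))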
According to the fitted coefficients, the number of years of marriage, the level of religiousness and a happy self-rated marriage all appear to be negatively related to the average number of affairs, whereas the coefficient for 'children' is large and positive but comes with an enormous standard error and should not be over-interpreted. Caution is warranted in any case, since the reduced dataset only contains 19 observations and only 'rating' is statistically significant.
For one additional year of marriage, the change in the mean response, all other covariates held constant, is a factor \(e^{-0.083} \approx 0.92\), hence a decrease of about \(8 \%\) in the average number of affairs in the past year.
For a one-point increase in religiousness, the change in the mean response, all other covariates held constant, is a factor \(e^{-0.146} \approx 0.86\), hence a decrease of about \(14 \%\).
When the self rating of the marriage increases by one point (towards ‘very happy’), the change in the mean response, all other covariates held constant, is a factor \(e^{-0.831} \approx 0.44\), hence a decrease of about \(56 \%\) in the average number of affairs in the past year.