This blog examines the mathematics behind Poisson regression for count data. I then create some simulated data, fit a Poisson regression to it, and explore R’s functionality for doing so.
I cover residuals and residual analysis only briefly, as the next blog will cover those topics for generalized linear models (GLMs) more generally.
The Poisson distribution, briefly
The Poisson is an unbounded count distribution (0, 1, 2, …) characterized by \(\mu = E(Y) = Var(Y)\). Its probability mass function is:
\[P(Y = y) = \frac{e^{-\mu} \mu^{y}}{y!}\]
where \(\mu > 0\) is the mean (and variance), and \(y = 0, 1, 2, \ldots\). Poisson distributions with \(\mu\) in \((1, 2, 4, 8, 16)\) look like:
set.seed(1805)
plot(density(rpois(1000, 16)), col='purple', xlim=c(-1, 30), ylim=c(0, .65),
main='Poisson Distributions')
lines(density(rpois(1000, 8)), col='green')
lines(density(rpois(1000, 4)), col='blue')
lines(density(rpois(1000, 2)), col='red')
lines(density(rpois(1000, 1)))
The distribution becomes more normal as \(\mu\) increases. The expected value and variance are equal: \(E(Y) = Var(Y) = \mu\).
The Problem
The Poisson distribution is often used to model the number of events that occur over a period of time. In such circumstances, \(\mu = \lambda t\), where \(\lambda\) is the rate of events per unit of time, and \(t\) is the number of units of time.
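For example, if events arrive at a rate of \(\lambda = 2\) per hour and we observe for \(t = 3\) hours, then \(\mu = \lambda t = 6\), and the probability of exactly five events can be computed with dpois() (the rate and time values here are purely illustrative):
lambda <- 2                          # illustrative rate: 2 events per hour
t_hours <- 3                         # illustrative observation window: 3 hours
dpois(5, lambda = lambda * t_hours)  # P(Y = 5) with mu = lambda * t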
Closer to actual practice, Poisson regression is often turned to for small counts, where larger values become increasingly rare.
As with all regression, we can state it simply as,
\[Y_i = E(Y_i) + \epsilon_i\]
that is, a systematic and a random component. Next, suppose we have a matrix \(X\) of potential predictor variables with which we would like to explain or predict \(E(Y_i)\). Assuming a linear and additive model gives us the linear predictor, familiar from the two blog posts on linear and binary logistic regression:
\[\eta = x^T \beta\]
Now we need to link the linear predictor to the dependent variable \(Y \sim Pois(\mu)\), subject to the requirement that all predictions are nonnegative. The canonical link function is the log, giving us:
\[E(Y_i) = \mu_i = \exp(\eta_i) = \exp(x_i^T \beta)\]
Parameter estimation
Now that we have the functional form in place, we need to find the optimal values of \(\beta\). Maximum likelihood estimation will generate the values of \(\beta\) that are most likely given the data. But first, we have to define the likelihood of a set of \(\beta\) values:
\[lik(\beta) = \prod_{i=1}^{n} \frac{e^{-\mu_i} \mu_i^{y_i} }{ y_i! }\]
where each \(\mu_i = \exp(x_i^T \beta)\) is defined in terms of \(\beta\) and \(x_i\) as above. That is, this equation gives the likelihood of a particular set of \(\beta\) values given the observed data.
We prefer to work with the log-likelihood because it is easier (converting products to sums, etc.):
\[ l(\beta) = \sum_{i=1}^{n} \left( y_i x_i^T \beta - \exp(x_i^T \beta) - \log(y_i!) \right) \]
Differentiating with respect to \(\beta\) and setting the result to zero gives the score equations that the maximum likelihood estimate \(\hat{\beta}\) must satisfy:
\[ \sum_{i=1}^{n} \left( y_i - \exp(x_i^T \hat{\beta}) \right) x_{ij} = 0, \quad \forall j \]
which can also be written as:
\[ X^T y = X^T \hat{\mu} \]
At this point in the derivation, numerical methods (e.g., Newton-Raphson or gradient descent) must be employed, as there is no closed-form solution.
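To make that step concrete, here is a minimal sketch of iteratively reweighted least squares (the Fisher scoring approach glm() uses internally) applied to the score equations above. The function name fit_poisson_irls and its arguments are illustrative, not from any package, and X is assumed to include an intercept column:
# Minimal IRLS sketch for Poisson regression with a log link (illustrative)
fit_poisson_irls <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))                      # start all coefficients at zero
  for (i in seq_len(max_iter)) {
    eta <- as.vector(X %*% beta)               # linear predictor
    mu  <- exp(eta)                            # inverse of the log link
    w   <- mu                                  # working weights: Var(Y_i) = mu_i
    z   <- eta + (y - mu) / mu                 # working response
    beta_new <- solve(t(X) %*% (w * X), t(X) %*% (w * z))
    if (max(abs(beta_new - beta)) < tol) {
      beta <- beta_new
      break
    }
    beta <- beta_new
  }
  as.vector(beta)
}
Applied to a design matrix with an intercept column, e.g. cbind(1, X), the result should closely match the coefficients returned by glm() for the same data.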
Simulating data
Our simulated dataset will consist of 1000 observations of three variables. \(X_1\) will be normally distributed with \(\mu = 0\) and \(\sigma = 0.5\). \(X_2\) will be a balanced dichotomous variable, where each value \(0, 1\) is equally likely. The final variable, \(X_3\), is itself Poisson distributed with \(\mu = 0.5\).
n <- 1000
set.seed(1804)
X <- matrix(0, nrow=n, ncol=3)
X[, 1] <- rnorm(n, mean=0, sd=.5) # normally distributed IV
X[, 2] <- sample(0:1, size=n, replace=TRUE, prob=c(0.5, 0.5)) # dichotomous IV (balanced)
X[, 3] <- rpois(n, lambda=.5) # Poisson distributed
Set the parameters that govern the relationship between the variables and \(Y\):
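Based on the population values printed later in the post, the parameters correspond to \(\log 1.5\), \(\log 4\), and \(\log 1.1\):
beta_1 <- log(1.5) # coefficient for the normally distributed IV
beta_2 <- log(4)   # coefficient for the dichotomous IV
beta_3 <- log(1.1) # coefficient for the Poisson distributed IV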
Use these parameters to create the linear predictor \(\eta\), and use the log link function to put it all together:
eta <- beta_1* X[,1] + beta_2*X[,2] + beta_3*X[,3]
g <- exp(eta)
y <- rpois(n, g)
df <- data.frame(y=y, x=X)
Graphically, our simulated dataset looks like:
library(ggplot2)   # plotting
library(gridExtra) # grid.arrange() for arranging plots
p1 <- ggplot(df, aes(x=x.1, y=y)) + geom_point(alpha=0.33) + geom_smooth()
p2 <- ggplot(df, aes(x=as.factor(x.2), y=y)) + geom_boxplot()
p3 <- ggplot(df, aes(x=jitter(x.3), y=y)) + geom_point(alpha=0.33) + geom_smooth()
grid.arrange(p1, p2, p3, ncol=3)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
p4 <- ggplot(df, aes(x=x.1, color=1, fill=1)) + geom_density(alpha=0.4, show.legend=FALSE)
p5 <- ggplot(df, aes(x=x.2, color=1, fill=1)) + geom_bar(show.legend=FALSE)
p6 <- ggplot(df, aes(x=x.3, color=1, fill=1)) + geom_bar(show.legend=FALSE)
grid.arrange(p4, p5, p6, ncol=3)
Poisson regression modeling
Using R’s base glm() function with the poisson family:
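A sketch of the fitting call, matching the Call line in the summary below (the fit is stored as m, which the residual analysis later relies on):
m <- glm(y ~ x.1 + as.factor(x.2) + x.3, family=poisson(), data=df)
summary(m)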
##
## Call:
## glm(formula = y ~ x.1 + as.factor(x.2) + x.3, family = poisson(),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1371 -1.1544 -0.1061 0.6041 3.0341
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.04929 0.04667 -1.056 0.291
## x.1 0.42939 0.03829 11.215 < 2e-16 ***
## as.factor(x.2)1 1.41386 0.04927 28.696 < 2e-16 ***
## x.3 0.12380 0.02499 4.953 7.29e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2389.4 on 999 degrees of freedom
## Residual deviance: 1156.5 on 996 degrees of freedom
## AIC: 3483.5
##
## Number of Fisher Scoring iterations: 5
The population parameters,
## [1] 0.40546511 1.38629436 0.09531018
are well within the 95 percent confidence intervals for the fitted parameters:
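These are profile-likelihood intervals obtained with confint(), which is what produces the profiling message below:
confint(m)  # profile-likelihood 95% confidence intervals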
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -0.14197144 0.04101902
## x.1 0.35434945 0.50442580
## as.factor(x.2)1 1.31824465 1.51142510
## x.3 0.07436155 0.17233767
Despite the different kinds of independent variables, the model recovered each parameter with reasonable accuracy.
The proportion of deviance explained is about half, so that’s good:
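This is computed as one minus the ratio of the residual deviance to the null deviance reported in the summary above:
1 - m$deviance / m$null.deviance  # proportion of deviance explained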
## [1] 0.5159861
Residuals and diagnostics
I will only briefly cover residuals and diagnostics for Poisson regression. These topics will be covered extensively in a future post.
We can define the deviance and Pearson residuals, each with a standardized version:
dev_res <- residuals(m, type='deviance')
dev_res_std <- residuals(m, type='deviance') / sqrt(1 - hatvalues(m))
pearson_res <- residuals(m, type='pearson')
pearson_res_std <- residuals(m, type='pearson') / sqrt(1 - hatvalues(m))
par(mfrow=c(1, 2))
plot(density(pearson_res), main='Deviance (red) v. Pearson Residuals', xlim=c(-5, 5))
lines(density(dev_res), col='red')
plot(density(pearson_res_std), main='Deviance Standardized (red) v.\n Pearson Standardized Residuals', xlim=c(-5, 5))
lines(density(dev_res_std), col='red')
We can see there isn’t much of a difference, although the shape of the Pearson residuals (black) trails off to the right. Examine the deviance standardized residuals by variable:
df$resid <- dev_res_std
df$y_hat <- predict(m)
r1 <- ggplot(df, aes(x=x.1, y=resid)) + geom_point(alpha=0.33) + geom_smooth()
r2 <- ggplot(df, aes(x=as.factor(x.2), y=resid)) + geom_boxplot()
r3 <- ggplot(df, aes(x=x.3, y=resid)) + geom_point(alpha=0.33) + geom_smooth(method='lm')
grid.arrange(r1, r2, r3, ncol=3)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
These plots show residuals are mostly gathered around 0, as expected.
We should also check the dispersion. Here it is not much larger than 1, so it makes sense to use Poisson regression rather than a negative binomial alternative:
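A common estimate of the dispersion is the Pearson chi-square statistic divided by the residual degrees of freedom; a sketch of that calculation:
sum(residuals(m, type='pearson')^2) / m$df.residual  # dispersion estimate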
## [1] 1.058634
Outliers
The halfnorm plot shows a couple of outliers:
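A sketch of how the plot and the flagged rows can be produced, assuming the halfnorm() function from the faraway package (the row indices match the output below):
library(faraway)    # provides halfnorm()
halfnorm(df$resid)  # half-normal plot of the standardized deviance residuals
df[c(182, 92), ]    # the two observations flagged as potential outliers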
## y x.1 x.2 x.3 resid y_hat
## 182 0 0.2446701 1 1 -3.141653 1.5934303
## 92 5 -0.2036232 0 0 3.037182 -0.1367196
Inference
Overall, the model appears to be performing as expected. Now we can draw inferences from it. Let’s re-examine the coefficients:
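The estimates below are extracted with coef():
coef(m)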
## (Intercept) x.1 as.factor(x.2)1 x.3
## -0.0492866 0.4293861 1.4138612 0.1237977
The intercept can be interpreted as: when \(x_1, x_2,\) and \(x_3\) equal zero, the expected log count is -0.0493. Since the link function used here is the log, we can convert it back to the count scale with exp(): the expected count when the other variables are 0 is \(\exp(-0.0493) = 0.9519\), or almost one.
The coefficients:
An increase of one unit in \(x_1\) is associated with a multiplicative change in the expected count of \(\exp(0.4294) = 1.5363\), i.e., about a 54 percent increase.
The expected count when \(x_2 = 1\) is \(\exp(1.4138) = 4.1118\) times that when \(x_2 = 0\), on average.
An increase of one unit in \(x_3\) multiplies the expected count by \(\exp(0.1238) = 1.1318\).
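These multiplicative effects can be obtained in one step by exponentiating the coefficient vector:
exp(coef(m))  # multiplicative change in the expected count per unit increase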