Deriving Poisson Regression

Ben Horvath

November 2019

This blog examines the mathematics behind Poisson regression for count data. I then create some simulated data, subject it to Poisson regression, and explore R’s functionality.

I cover residuals and residual analysis very briefly, as the next blog will concern those topics for generalized linear models (GLMs) more generally.

The Poisson distribution, briefly

The Poisson is an unbounded count distribution (0, 1, 2, …) characterized by \(\mu = E(Y) = Var(Y)\). Its probability mass function is:

\[P(Y = y) = \frac{e^{-\mu} \mu^{y}}{y!}\]

where \(\mu > 0\) is the mean (and variance), and \(y = 0, 1, 2, \ldots\). Poisson distributions with \(\mu\) in \((1, 2, 4, 8, 16)\) look like:
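A quick base-R sketch of how such a figure can be drawn with dpois(); the panel layout and range of \(y\) values are my own choices:

    mus <- c(1, 2, 4, 8, 16)
    ys  <- 0:30
    par(mfrow = c(1, 5))
    for (mu in mus) {
        # P(Y = y) for y = 0, 1, ..., 30 at this value of mu
        plot(ys, dpois(ys, lambda = mu), type = "h",
             xlab = "y", ylab = "P(Y = y)", main = paste("mu =", mu))
    }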

The distribution becomes more symmetric and approximately normal as \(\mu\) increases, and the expected value and variance are equal: \(E(Y) = Var(Y) = \mu\).

The Problem

The Poisson distribution is often used to model the number of events that occur over a period of time. In such circumstances, \(\mu = \lambda t\), where \(\lambda\) is the rate of events per unit of time and \(t\) is the number of units of time.

Closer to actual practice, Poisson regression is most often used for small counts, where larger values become increasingly rare.

As with all regression, we can state it simply as,

\[Y_i = E(Y_i) + \epsilon_i\]

that is, a systematic and a random component. Next, suppose we have a matrix \(X\) of potential predictor variables with which we would like to explain or predict \(E(Y_i)\). Assuming a linear and additive model gives us the linear predictor, familiar from the two blog posts on linear and binary logistic regression:

\[\eta = x^T \beta\]

Now we need to link the linear predictor to the dependent variable \(Y \sim Pois(\mu)\), subject to the requirement that all predictions are nonnegative. The canonical link function is the log, giving us:

\[E(Y_i) = \mu_i = exp(\eta_i) = exp(x_i^T \beta)\]

Parameter estimation

Now that we have the functional form in place, we need to find the optimal values of \(\beta\). Maximum likelihood estimation finds the values of \(\beta\) most likely to have generated the observed data. But first, we have to write down the likelihood of a given set of \(\beta\)'s:

\[lik(\beta) = \prod_{i=1}^{n} \frac{e^{-\mu_i} \mu_i^{y_i} }{ y_i! }\]

where each \(\mu_i\) is defined in terms of \(\beta\) and \(x_i\) as elucidated above. That is, this expression gives the likelihood of a particular set of \(\beta\) values, given the data.

We prefer to work with the log-likelihood because it is easier to handle (products become sums, etc.). Taking logs and substituting \(\mu_i = exp(x_i^T \beta)\), so that \(log(\mu_i) = x_i^T \beta\), gives:

\[ l(\beta) = \sum_{i=1}^{n} \left( y_i x_i^T \beta - exp(x_i^T \beta) - log(y_i!) \right) \]

Differentiating with respect to \(\beta\) and setting the result to zero gives the maximum likelihood estimate \(\hat{\beta}\) as the solution of:

\[ \sum_{i=1}^{n} \left( y_i - exp(x_i^T \hat{\beta}) \right) x_{ij} = 0, \quad \forall j \]

which can also be written as:

\[ X^T y = X^T \hat{\mu} \]

At this point in the derivation, numerical methods (e.g., Newton-Raphson, or the iteratively reweighted least squares algorithm behind R's glm()) must be employed, as there is no closed-form solution.
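R's glm() fits the model via Fisher scoring (the "Fisher Scoring iterations" reported in its output). As an illustrative sketch, the log-likelihood above can also be maximized directly with a general-purpose optimizer such as optim(); the function names here are my own:

    # Negative log-likelihood of a Poisson regression (lgamma(y + 1) = log(y!))
    neg_loglik <- function(beta, X, y) {
        eta <- X %*% beta
        -sum(y * eta - exp(eta) - lgamma(y + 1))
    }

    # Maximize the likelihood numerically with BFGS, a quasi-Newton method
    fit_poisson_mle <- function(X, y) {
        optim(par = rep(0, ncol(X)), fn = neg_loglik,
              X = X, y = y, method = "BFGS")$par
    }

In practice glm() is preferable, since Fisher scoring exploits the structure of the GLM and also yields standard errors and deviances as a byproduct.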

Simulating data

Our simulated dataset will consist of 1,000 observations of three variables. \(X_1\) will be normally distributed with \(\mu = 0\) and \(\sigma^2 = 0.5\). \(X_2\) will be a balanced dichotomous variable, where each value \(0, 1\) is equally likely. The final variable \(X_3\) is itself Poisson distributed with \(\mu = 0.5\).
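A sketch of how these covariates might be generated; the seed is my own choice, and note that \(\sigma^2 = 0.5\) corresponds to sd = sqrt(0.5):

    set.seed(1804)  # arbitrary seed; the original is not shown
    n   <- 1000
    x.1 <- rnorm(n, mean = 0, sd = sqrt(0.5))   # normal, mu = 0, var = 0.5
    x.2 <- rbinom(n, size = 1, prob = 0.5)      # balanced dichotomous
    x.3 <- rpois(n, lambda = 0.5)               # Poisson with mu = 0.5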

Set the parameters that govern the relationship between the variables and \(Y\):
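The population parameters printed later are exactly log(1.5), log(4), and log(1.1), so a plausible reconstruction is:

    b.1 <- log(1.5)   # 0.4055
    b.2 <- log(4.0)   # 1.3863
    b.3 <- log(1.1)   # 0.0953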

Use these parameters to create the linear predictor \(\eta\), and use the log link function to put it all together:
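Along these lines, with no intercept term (none appears among the population parameters printed below):

    eta <- b.1 * x.1 + b.2 * x.2 + b.3 * x.3   # linear predictor
    y   <- rpois(n, lambda = exp(eta))         # inverse of the log link
    df  <- data.frame(y, x.1, x.2, x.3)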

Graphically, our simulated dataset looks like:


Poisson regression modeling

Using R’s base glm() function with the poisson family:
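Reconstructing the call from the output below (the model object name fit is mine, and it is reused in the snippets that follow):

    fit <- glm(y ~ x.1 + as.factor(x.2) + x.3, family = poisson(), data = df)
    summary(fit)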

## 
## Call:
## glm(formula = y ~ x.1 + as.factor(x.2) + x.3, family = poisson(), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1371  -1.1544  -0.1061   0.6041   3.0341  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -0.04929    0.04667  -1.056    0.291    
## x.1              0.42939    0.03829  11.215  < 2e-16 ***
## as.factor(x.2)1  1.41386    0.04927  28.696  < 2e-16 ***
## x.3              0.12380    0.02499   4.953 7.29e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2389.4  on 999  degrees of freedom
## Residual deviance: 1156.5  on 996  degrees of freedom
## AIC: 3483.5
## 
## Number of Fisher Scoring iterations: 5

The population parameters,

## [1] 0.40546511 1.38629436 0.09531018

are well within the 95 percent confidence intervals for the fitted parameters:
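These are profile-likelihood intervals from confint(), which is what prints the "Waiting for profiling to be done..." message:

    confint(fit)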

## Waiting for profiling to be done...
##                       2.5 %     97.5 %
## (Intercept)     -0.14197144 0.04101902
## x.1              0.35434945 0.50442580
## as.factor(x.2)1  1.31824465 1.51142510
## x.3              0.07436155 0.17233767

Despite the different kinds of independent variables, the model recovered each parameter with reasonable accuracy.

The proportion of deviance explained, a rough analogue of \(R^2\) for GLMs, is about half, which is encouraging:
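It is computed as one minus the ratio of residual to null deviance; assuming the fit object from above:

    1 - fit$deviance / fit$null.deviance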

## [1] 0.5159861

Residuals and diagnostics

I will only briefly cover residuals and diagnostics for Poisson regression. These topics will be covered extensively in a future post.

We can define the deviance and Pearson residuals, each with a standardized version:
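In R these come out of residuals() and rstandard() applied to the fitted model; the variable names are my own:

    d_resid <- residuals(fit, type = "deviance")   # deviance residuals
    p_resid <- residuals(fit, type = "pearson")    # Pearson residuals
    d_std   <- rstandard(fit, type = "deviance")   # standardized deviance
    p_std   <- rstandard(fit, type = "pearson")    # standardized Pearson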

We can see there isn't much of a difference, although the Pearson residuals (black) trail off further to the right. Next, examine the standardized deviance residuals by variable:


These plots show residuals are mostly gathered around 0, as expected.

We should also check the dispersion, which is not much larger than 1, so it makes sense to use Poisson regression rather than the negative binomial alternative:
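A common estimate of the dispersion is the sum of squared Pearson residuals divided by the residual degrees of freedom; assuming the fit object from above:

    sum(residuals(fit, type = "pearson")^2) / fit$df.residual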

## [1] 1.058634

Outliers

The halfnorm plot shows a couple of outliers:
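halfnorm() lives in the faraway package; a sketch of the call, again assuming the fit object:

    library(faraway)
    halfnorm(residuals(fit), nlab = 2)   # label the two most extreme residuals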

##     y        x.1 x.2 x.3     resid      y_hat
## 182 0  0.2446701   1   1 -3.141653  1.5934303
## 92  5 -0.2036232   0   0  3.037182 -0.1367196

Inference

Overall, the model appears to be performing as expected, so we can draw inferences from it. Let's re-examine the coefficients:

##     (Intercept)             x.1 as.factor(x.2)1             x.3 
##      -0.0492866       0.4293861       1.4138612       0.1237977

The intercept can be interpreted as: when \(x_1, x_2,\) and \(x_3\) equal zero, the expected log count is -0.0493. Since the link function used here is the log, we can convert it to the count scale with exp(): the expected count when the other variables are 0 is \(exp(-0.0493) = 0.9519\), or almost one.

The coefficients:

  • An increase of one unit in \(x_1\) multiplies the expected count by \(exp(0.4294) = 1.5363\), roughly a 54 percent increase.

  • The expected count when \(x_2 = 1\) is \(exp(1.4138) = 4.1118\) times the expected count when \(x_2 = 0\), on average.

  • An increase of one unit in \(x_3\) multiplies the expected count by \(exp(0.1238) = 1.1318\).
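Equivalently, exponentiating the whole coefficient vector (assuming the fit object from above) gives all of these multiplicative effects at once:

    exp(coef(fit))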