Introduction

Linear regression is a well-established modeling technique used for modeling a response variable using numerous explanatory variables (or what we call features in machine learning jargon). Ordinary linear regression is well suited for modeling response variables that have a normal distribution. For modeling other types of response variables, other derivatives of linear regression are better suited.
Two of the most common and widely used ones are logistic regression, which is used for modeling a Bernoulli outcome, and Poisson regression, used for modeling a Poisson outcome.
In this post, I try to explain Poisson regression and present a demo for it using real data¹. I use different models and compare them.

What is a Poisson Outcome

The Poisson distribution is the discrete probability distribution of the number of events occurring in a fixed amount of time or space, given that these events occur with a known constant mean rate and occur independently. For example, the number of calls received in a call center during any minute has a Poisson probability distribution.
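Formally, a Poisson random variable with rate \(\lambda\) takes the value \(k\) with probability

\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots \]

and both its mean and its variance are equal to \(\lambda\).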

library(tidyverse)  # for map_dfr(), tibble() and ggplot()

# Simulate 10,000 draws for each rate and plot the counts
lambdas <- c(2, 5, 10, 50)
pois_data <- map_dfr(lambdas, ~tibble(Value = rpois(1e4, .x), lambda = .x))
ggplot(pois_data, aes(x = Value)) +
  geom_bar(width = 0.8, fill = "blue", alpha = 0.6) +
  facet_wrap(~lambda, scales = "free")

The Poisson distribution is well suited for modeling count data.

For sufficiently large values of λ (say λ > 1000), the normal distribution with mean λ and variance λ is an excellent approximation to the Poisson distribution.
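Out of curiosity, a quick numeric check (a small sketch, not from the original post) compares the two distributions' CDFs at λ = 1000:

# Compare Poisson and normal CDFs at lambda = 1000, using a 0.5
# continuity correction for the normal approximation
lambda <- 1000
k <- 900:1100
max(abs(ppois(k, lambda) - pnorm(k + 0.5, mean = lambda, sd = sqrt(lambda))))
# the largest gap is well under 1%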

Why Linear Regression

  1. A linear regression model predicts the target as a weighted sum of the feature inputs. The linearity of the learned relationship makes the interpretation easy.
  2. Linear regression models are simple; they are not as prone to overfitting as more complex models (tree models, for example).
  3. Linear regression has no or few hyper-parameters, which simplifies the training process.
  4. Training and prediction require modest processing resources, which makes these models fast.
  5. They can be extended to capture non-linear relationships (e.g. feature transformations, GLMs, GAMs, etc.). Some of those terms will be explained later.

Generalized Linear Models

Generalized Linear Models (GLM) were first introduced by [Nelder and Wedderburn in 1972](https://www.jstor.org/stable/2344614). They are robust, popular, and still widely used today. The generalized linear model is a family of models that includes the classic linear regression model. It extends linear regression to other models that have proved very useful and practical.

It has three components:

- An exponential family model for the response.
- A systematic component via a linear predictor.
- A link function that connects the mean of the response to the linear predictor.

To illustrate this, let’s examine those three components for various models.

a. A classic linear regression model

  1. Response family: The linear regression model assumes that the response variable (the outcome or label) has a normal distribution.
  2. Linear predictor: This is simply the weighted sum of the features, \(\eta_i = \sum_{k=1}^{p}X_{ki}\beta_k\).
  3. The link function is the identity function.
    So, the model is \[ Y_i = \sum_{k=1}^{p}X_{ki}\beta_k + \epsilon_i \]

with errors that are independent and identically normally distributed: \[ \epsilon_i \overset{iid}{\sim} N(0,\sigma^2) \]

b. Logistic regression

In this case, the response family is a Bernoulli distribution, that's \(Y_i \sim Bernoulli(\mu_i)\).
The linear predictor is again the weighted sum of the features, that's \(\eta_i = \sum_{k=1}^{p}X_{ki}\beta_k\).
The link function is the logit function, that's \(\eta_i = \log(\frac{\mu_i}{1-\mu_i})\).

c. Poisson regression

For Poisson regression, the GLM has the following components:
1. The response family is a Poisson distribution, that's \(Y_i \sim Poisson(\mu_i)\).
2. The linear predictor is again \(\eta_i = \sum_{k=1}^{p}X_{ki}\beta_k\).
3. The link function is the log function, so \(\eta_i = \log(\mu_i)\).
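Inverting the log link gives \(\mu_i = \exp(\sum_{k=1}^{p}X_{ki}\beta_k)\), so the features act multiplicatively on the expected count: increasing feature \(k\) by one unit multiplies the expected count by \(e^{\beta_k}\).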

glmnet models

You may come across the term "glmnet model", so here is a brief note about it. glmnet is a popular R implementation of elastic net models. An elastic net model is a linear model with a mix of L1 (lasso) and L2 (ridge) regularization. glmnet is written in FORTRAN and is fast. It also has supporting functions for selecting and cross-validating the penalty amount.
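As a quick illustration, here is a minimal sketch of fitting a penalized Poisson regression with glmnet. It is not used elsewhere in this post and assumes the data_train / label data introduced later:

library(glmnet)

# Build a numeric feature matrix and the count response
x <- model.matrix(label ~ . - 1, data = data_train)
y <- data_train$label

# Cross-validate the penalty; alpha = 0.5 mixes L1 and L2 equally
cv_fit <- cv.glmnet(x, y, family = "poisson", alpha = 0.5)

# Predict expected counts at the penalty with the lowest CV error
predict(cv_fit, newx = x[1:5, ], s = "lambda.min", type = "response")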

A Sample Case Study

The data set contains count data collected from a real application. I normally use the tidymodels framework for my machine learning projects. In this post, I preferred to use base R functions, to focus on the differences between the different Poisson models². My main interest here is illustrating the various models, not getting the best performance metrics. So, the examples below don't go through a formal workflow for model tuning (resampling, cross validation, hyper-parameter tuning, etc.).

Note
This is not a formal model tuning process. In a strict tuning process, I would use cross validation to validate the model performance. Instead, my aim here is to illustrate the various Poisson models, so I relax this procedure.

The data has already passed through the pre-processing steps (e.g. normalization, transformation, etc.).

The data has 40 predictors (named PC01 to PC40)³.

The response variable is called label. Let’s explore it.

summary(data_train$label)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    5.00   14.00   28.78   34.00 1038.00
# Explore Response Var ----
mean(data_train$label)
## [1] 28.77936
var(data_train$label)
## [1] 1933.054
MASS::fitdistr(data_train$label, "poisson")
##      lambda   
##   28.77935777 
##  ( 0.01656724)

The variance is much larger than the mean, so this looks like an over-dispersed Poisson distribution. It may be better modeled by a quasi-Poisson model or a negative binomial model, as will be shown later.
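A quick way to quantify this is the dispersion index (variance divided by mean), which equals 1 for a true Poisson distribution:

# Dispersion index: 1 for a true Poisson; here it is roughly 67
var(data_train$label) / mean(data_train$label)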

# dpois is defined only at integer x, so evaluate the fitted pmf on an
# integer grid instead of using stat_function (which samples non-integer x)
pois_curve <- tibble(x = 0:100, d = dpois(0:100, mean(data_train$label)))
ggplot(data_train, aes(x = label)) +
  geom_histogram(bins = 60, aes(y = ..density..)) +
  geom_line(data = pois_curve, aes(x = x, y = d), color = "red", inherit.aes = FALSE)

ggplot(data_train, aes(x = label)) +
  geom_bar() +
  scale_x_continuous(limits = c(0, 50))

data_train %>% 
  count(label) %>% 
  slice_head(n = 10) %>% 
  kbl() %>% 
  kable_classic(full_width = F, position = "left")
| label |    n |
|------:|-----:|
|     0 | 5757 |
|     1 | 5636 |
|     2 | 5014 |
|     3 | 4655 |
|     4 | 4190 |
|     5 | 3775 |
|     6 | 3465 |
|     7 | 3113 |
|     8 | 3019 |
|     9 | 2698 |

The distribution is not really a Poisson one:
1. It is zero inflated (see the quick check below).
2. It is over-dispersed and has a flatter tail.
3. It has some outliers at the right end.
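As a quick check of the zero inflation (a small sketch, not part of the original analysis), compare the observed share of zeros with what a Poisson with the fitted mean would produce:

# Observed fraction of zeros in the training data
mean(data_train$label == 0)
# Fraction of zeros under Poisson(28.78): dpois(0, 28.78) is about 3e-13,
# i.e. essentially no zeros are expected at this mean
dpois(0, lambda = mean(data_train$label))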

Ordinary Linear Regression

Let's start with a plain vanilla linear regression model. I will use the built-in lm() function for that.

system.time(lm_fit <- lm(
  label ~ .,
  data = data_train))
##    user  system elapsed 
##   3.282   0.020   0.283

Linear models are simple and fast; training this one took well under a second of elapsed time.

Let's have a look at the top six coefficients.

broom::tidy(lm_fit) %>% 
  arrange(desc(abs(statistic))) %>% 
  slice_head(n = 6) %>% 
  kbl() %>% 
  kable_classic(full_width = F, position = "left")
| term        |  estimate | std.error |  statistic | p.value |
|-------------|----------:|----------:|-----------:|--------:|
| (Intercept) | 28.783592 | 0.0978353 |  294.20469 |       0 |
| PC01        |  3.431140 | 0.0141770 |  242.02122 |       0 |
| PC04        | -3.428379 | 0.0324431 | -105.67358 |       0 |
| PC20        |  5.219567 | 0.0656759 |   79.47456 |       0 |
| PC10        |  2.885033 | 0.0528761 |   54.56209 |       0 |
| PC06        | -2.221422 | 0.0429615 |  -51.70730 |       0 |

Now, let’s check the performance metrics.
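calc_model_metrics() is a small helper I use throughout this post. Its definition isn't shown here; the sketch below is a minimal reconstruction of what it is assumed to do (compute R-squared and RMSE on the train and test sets, plus a few diagnostic plots):

library(tidyverse)

calc_model_metrics <- function(fit, train, test, type = "response") {
  # R-squared and RMSE for one data set
  calc <- function(df, name) {
    pred <- predict(fit, newdata = df, type = type)
    tibble(Data = name,
           rsq  = cor(df$label, pred)^2,
           rmse = sqrt(mean((df$label - pred)^2)))
  }
  metrics <- bind_rows(calc(train, "Train"), calc(test, "Test"))

  # Diagnostic plots on the training data
  plot_df <- tibble(actual = train$label,
                    predicted = predict(fit, newdata = train, type = type))
  plots <- list(
    # p1: actual vs. predicted label distributions
    p1 = ggplot(pivot_longer(plot_df, everything()),
                aes(value, color = name)) + geom_density(),
    # p2: residual histogram
    p2 = ggplot(plot_df, aes(actual - predicted)) + geom_histogram(bins = 60),
    # p3: predicted vs. actual scatter
    p3 = ggplot(plot_df, aes(actual, predicted)) + geom_point(alpha = 0.2)
  )
  list(metrics = metrics, plots = plots)
}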

res_lm <- calc_model_metrics(lm_fit, data_train, data_test, type = "response")
res_lm$metrics %>% 
  kbl(caption = "Model Metrics", digits = 2) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "left")
Model Metrics

| Data  |  rsq |  rmse |
|-------|-----:|------:|
| Train | 0.48 | 31.67 |
| Test  | 0.48 | 31.67 |

The model has subpar performance: the R-squared (rsq) is only 0.48, and the root mean squared error (rmse) is 31.7, which is considerably larger than the mean of the data.

Now, let’s examine the model more closely.

plot_grid(
  res_lm$plots$p1,
  res_lm$plots$p2, 
  res_lm$plots$p3,
  ncol = 2, rel_widths = c(1.5, 1, 1))

The top left plot shows the distributions of the actual and predicted labels. Since this is an ordinary linear regression model, the predictions follow a roughly normal distribution, and the mismatch between the predicted and true label distributions is clear.

Notes:
1. Predictions are poor when the label is high.
2. Predictions follow a normal distribution and include negative values, which are impossible for counts.

Using a generalized linear model with a Poisson response is expected to produce a better fit. Let's try it.

Poisson Regression using glm()

Now, let's try fitting a Poisson regression model. I will use the glm() function, which ships with base R and implements generalized linear models.

system.time(glm_pois_fit <- glm(
  label ~ .,
  family = "poisson",
  data = data_train))
##    user  system elapsed 
##  17.446   0.132   1.489
res_glm_pois <- calc_model_metrics(glm_pois_fit, data_train, data_test, type = "response")
res_glm_pois$metrics %>% 
  kbl(caption = "Model Metrics", digits = 2) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "left")
Model Metrics

| Data  |  rsq |  rmse |
|-------|-----:|------:|
| Train | 0.62 | 27.23 |
| Test  | 0.64 | 26.67 |

Just switching to a Poisson response significantly enhanced the metrics.

Coefficients

Let's have a look at the top six coefficients.

broom::tidy(glm_pois_fit) %>% 
  arrange(desc(abs(statistic))) %>% 
  slice_head(n = 6) %>% 
  kbl() %>% 
  kable_classic(full_width = F, position = "left")
| term        |   estimate | std.error | statistic | p.value |
|-------------|-----------:|----------:|----------:|--------:|
| (Intercept) |  2.7305066 | 0.0009524 | 2866.8450 |       0 |
| PC01        |  0.1641104 | 0.0001499 | 1095.0280 |       0 |
| PC20        |  0.1459469 | 0.0003512 |  415.5406 |       0 |
| PC11        | -0.0928925 | 0.0002576 | -360.6672 |       0 |
| PC05        | -0.0508081 | 0.0001877 | -270.7335 |       0 |
| PC04        | -0.0502220 | 0.0002094 | -239.7806 |       0 |

Residual Plots

Let’s check the residual plots.

plot_grid(
  res_glm_pois$plots$p1,
  res_glm_pois$plots$p2, 
  res_glm_pois$plots$p3,
  ncol = 2, rel_widths = c(1.5, 1, 1))

This is much better:
1. The residual histogram is no longer skewed.
2. The distribution of the predicted values is now close to that of the true values.

Quasi-Poisson

What if I use the quasipoisson() family instead of the plain vanilla Poisson family? I expect it to do better, as the response variable is over-dispersed.

system.time(qpois_fit <- glm(
  label ~ .,
  data = data_train,
  family  = quasipoisson()))
##    user  system elapsed 
##  14.209   0.155   1.160

Metrics

res_qpois <- calc_model_metrics(qpois_fit, data_train, data_test, type = "response")
res_qpois$metrics %>% 
  kbl(caption = "Model Metrics", digits = 2) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "left")
Model Metrics

| Data  |  rsq |  rmse |
|-------|-----:|------:|
| Train | 0.62 | 27.23 |
| Test  | 0.64 | 26.67 |

Coefficients

broom::tidy(qpois_fit) %>% 
  arrange(desc(abs(statistic))) %>% 
  slice_head(n = 6) %>% 
  kbl() %>% 
  kable_classic(full_width = F, position = "left")
| term        |   estimate | std.error | statistic | p.value |
|-------------|-----------:|----------:|----------:|--------:|
| (Intercept) |  2.7305066 | 0.0045967 | 594.00895 |       0 |
| PC01        |  0.1641104 | 0.0007233 | 226.88929 |       0 |
| PC20        |  0.1459469 | 0.0016951 |  86.09981 |       0 |
| PC11        | -0.0928925 | 0.0012430 | -74.73008 |       0 |
| PC05        | -0.0508081 | 0.0009057 | -56.09586 |       0 |
| PC04        | -0.0502220 | 0.0010109 | -49.68243 |       0 |

Residual Plots

plot_grid(
  res_qpois$plots$p1,
  res_qpois$plots$p2, 
  res_qpois$plots$p3,
  ncol = 2, rel_widths = c(1.5, 1, 1))

It gives exactly the same results as the Poisson regression, because the quasi-Poisson model estimates the same parameter coefficients; only the standard errors, and hence the p-values and confidence intervals, change. This is expected, as described in this [Cross Validated answer](https://stats.stackexchange.com/questions/176918/identical-coefficients-estimated-in-poisson-vs-quasi-poisson-model).
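You can verify this from the two coefficient tables above: the quasi-Poisson standard errors are the Poisson ones scaled by the square root of the estimated dispersion parameter, which can be read off the fitted model:

# Estimated dispersion parameter; the quasi-Poisson SEs equal the Poisson
# SEs times sqrt(dispersion). From the tables: 0.0045967 / 0.0009524 is
# about 4.8, so the dispersion estimate should be around 23.
summary(qpois_fit)$dispersion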

Negative Binomial Response

Another common remedy for over-dispersion is a negative binomial response, which adds an extra shape parameter that lets the variance exceed the mean. MASS::glm.nb() fits it.

library(MASS, quietly = TRUE)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
system.time(nbin_fit <- glm.nb(
  label ~ .,
  data = data_train))
##    user  system elapsed 
##  62.903   0.763   8.681

Metrics

res_nbin <- calc_model_metrics(nbin_fit, data_train, data_test, type = "response")
res_nbin$metrics %>% 
  kbl(caption = "Model Metrics", digits = 2) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "left")
Model Metrics

| Data  |  rsq |         rmse |
|-------|-----:|-------------:|
| Train | 0.00 | 1.814739e+09 |
| Test  | 0.56 | 3.286000e+01 |
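The training metrics look degenerate: an R-squared of zero and an enormous RMSE. With a log link, a handful of extreme linear predictors is enough to blow up \(\exp(\eta_i)\) and wreck the squared-error metrics. A quick sanity check (a sketch; I can't verify it against the private data) is to inspect the range of the fitted counts:

# Look for extreme fitted values on the training set; a single huge
# prediction can dominate the RMSE
summary(predict(nbin_fit, newdata = data_train, type = "response"))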

Coefficients

broom::tidy(nbin_fit) %>% 
  arrange(desc(abs(statistic))) %>% 
  slice_head(n = 6) %>% 
  kbl() %>% 
  kable_classic(full_width = F, position = "left")
| term        |   estimate | std.error |  statistic | p.value |
|-------------|-----------:|----------:|-----------:|--------:|
| (Intercept) |  2.7092062 | 0.0021088 | 1284.70989 |       0 |
| PC01        |  0.1639295 | 0.0003581 |  457.82244 |       0 |
| PC20        |  0.1446654 | 0.0013270 |  109.01819 |       0 |
| PC04        | -0.0718339 | 0.0006863 | -104.66943 |       0 |
| PC15        | -0.0956607 | 0.0011528 |  -82.98073 |       0 |
| PC11        | -0.0841760 | 0.0010788 |  -78.02805 |       0 |

Residual Plots

plot_grid(
  res_nbin$plots$p1,
  res_nbin$plots$p2, 
  res_nbin$plots$p3,
  ncol = 2, rel_widths = c(1.5, 1, 1))

Hurdle Model

What if we try a hurdle model? A hurdle model handles the excess zeros directly: a binary model decides between zero and positive counts, and a zero-truncated count model handles the positive counts (see the formulation below).
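Formally, with \(\pi_i\) the probability of a zero from the binary part and \(\mu_i\) the mean of the Poisson part (standard notation, not from the original post):

\[ P(Y_i = y) = \begin{cases} \pi_i & y = 0 \\ (1-\pi_i)\,\dfrac{e^{-\mu_i}\mu_i^{y}/y!}{1 - e^{-\mu_i}} & y > 0 \end{cases} \]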

library(pscl, quietly = TRUE)
system.time(hurdle_fit <- hurdle(
  label ~ .,
  data = data_train,
  dist  = "poisson"))
##    user  system elapsed 
## 245.361   0.985  20.830
res_hurdle <- calc_model_metrics(hurdle_fit, data_train, data_test, type = "response")

Metrics

res_hurdle$metrics %>% 
  kbl(caption = "Model Metrics", digits = 2) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "left")
Model Metrics

| Data  |  rsq |  rmse |
|-------|-----:|------:|
| Train | 0.62 | 27.21 |
| Test  | 0.64 | 26.65 |

Coefficients

Residual Plots

plot_grid(
  res_hurdle$plots$p1,
  res_hurdle$plots$p2, 
  res_hurdle$plots$p3,
  ncol = 2, rel_widths = c(1.5, 1, 1))

Zero-inflated Model

What if we try a zero-inflated model? Unlike the hurdle model, a zero-inflated model mixes a point mass at zero with a full Poisson distribution, so a zero can come either from the structural-zero component or from the Poisson itself (see the formulation below).
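Formally, with \(\pi_i\) the mixing probability of the structural-zero component (standard notation, not from the original post):

\[ P(Y_i = y) = \begin{cases} \pi_i + (1-\pi_i)\,e^{-\mu_i} & y = 0 \\ (1-\pi_i)\,\dfrac{e^{-\mu_i}\mu_i^{y}}{y!} & y > 0 \end{cases} \]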

library(pscl, quietly = TRUE)
system.time(zeroinf_fit <- zeroinfl(
  label ~ .,
  data = data_train,
  dist  = "poisson"))
##    user  system elapsed 
## 417.913   1.989  37.954
res_zeroinf <- calc_model_metrics(zeroinf_fit, data_train, data_test, type = "response")

Metrics

res_zeroinf$metrics %>% 
  kbl(caption = "Model Metrics", digits = 2) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "left")
Model Metrics

| Data  |  rsq |  rmse |
|-------|-----:|------:|
| Train | 0.62 | 27.21 |
| Test  | 0.64 | 26.65 |

Coefficients

Residual Plots

plot_grid(
  res_zeroinf$plots$p1,
  res_zeroinf$plots$p2, 
  res_zeroinf$plots$p3,
  ncol = 2, rel_widths = c(1.5, 1, 1))

Conclusion

Using a Poisson model to model a count response variable yields better and more sensible results than using a classic linear regression model.
If the response variable doesn't strictly follow a Poisson distribution, then a quasi-Poisson, negative binomial, hurdle, or zero-inflated model may give better results.

References

Poisson regression is a well-established technique and has countless online references. However, the references below helped me a lot in understanding Poisson regression and its tools.


  1. Unfortunately, I can't share this data even though it is anonymized (using PCA).

  2. I used the tidymodels recipes package for pre-processing the data.

  3. Actually, those predictors are the outcome of a PCA transformation.