Introduction
What is a Poisson Outcome
Why Linear Regression
- Generalised Linear Models
- glmnet models
A Sample Case Study
Conclusion
References

Introduction

The linear regression is a well established modeling technique that is used for modeling a response variable using numerous explanatory variables (or what we call features in the machine learning jargon). The ordinay linear regression is well suited for modeling response variables that have a normal distributions. For modeling other types of response variables, other derivatives of the linear regression are more suited.
Two of the most common and widely used ones are logistic regression, which is used for modeling a Bernoulli outcome, and Poisson regression, used for modeling a Poisson outcome.
In this post, I try to explain the Poisson regression and present a demo for it using real data.¹. I use different models and compare between them.

What is a Poisson Outcome

The Poisson distribution is the discrete probability distribution of the number of events occurring in a fixed amount of time or space, given that these events occur with a known constant mean rate and occur independently. For example, the number of calls received in a call center during any minute has a Poisson probability distribution.

lambas <- c(2, 5, 10, 50)
pois_data <- map_dfr(lambas, ~tibble(Value = rpois(1e4, .x), lambda = .x))
ggplot(pois_data, aes(x = Value)) +
  geom_bar(width = 0.8, fill = "blue", alpha = 0.6) +
  facet_wrap(~lambda, scales = "free")

The Poisson distribution is well suited for modeling count data.

For sufficiently large values of λ, (say λ>1000), the normal distribution with mean λ and variance λ is an excellent approximation to the Poisson distribution.

Why Linear Regression

A linear regression model predicts the target as a weighted sum of the feature inputs. The linearity of the learned relationship makes the interpretation easy.
Linear regression models are simple, they are not prone to overfitting as much as other more complex models (tree models for example).
Linear regression has no or few hyper-parameters, which simplifies the training process.
The training and prediction process requires modest processing resources, which makes them fast.
They can be extended to capture non-linear relationships (i.e feature transformation, GLM, GAM, ..etc).^Some of those terms will be explained later.

Generalised Linear Models

Generalized Linear Models (GLM) were first introduced by (Nelder and Wedderburn in 1972)[https://www.jstor.org/stable/2344614]. They are very robust & popular and are still widely used today. The generalized linear model is a family of models that includes the classic linear regression model. It extends the linear regression model to also include other models that proved to be very useful and practical.

It has 3 Components:
- An exponential family model for the response. - A systematic component via a linear predictor. - A link function that connects the means of the response to the linear predictor.

To illustrate this, let’s examine those three components for various models.

a. A classic linear regression model

Response family: The linear regression model assumes that the response variable (the outcome or label) has a normal distribution.
Linear predictors: This is simply the weighted sum of the features. \(\eta = \sum_{k=1}^{p}X_{ki}\beta_k\).
The link function is the identity function.
So, the linear model is modeled by \[ Y = \sum_{k=1}^{p}X_{ki}\beta_k + \epsilon_i \]

with an error that belongs to normal distribution \[ \epsilon \overset{iid}\sim N(0,\sigma^2) \]

b. Logistic regression

In this case, the response family is a Bernoulli distribution, that’s \(Y_i \sim Bernoulli(\mu_i)\).
The linear predictor is also the weighted sum of the features, that’s \(\eta = \sum_{k=1}^{p}X_{ki}\beta_k\).
The link function is the logit function, that’s \(\eta = log(\frac{\mu}{1-\mu})\).

c. Poisson regression

For the Poisson regression, the glm would have the below components:
1. Response family is a Poisson distribution, that’s \(Y_i \sim Poisson(\mu_i)\).
1. Linear predictor is also \(\eta = \sum_{k=1}^{p}X_{ki}\beta_k\).
1. The link function is the log function, so \(\eta = log(\mu)\).

glmnet models

You may come across the term “glmnet model”, so I’m giving a brief about it. glmnet is a popular R implementation of the elastic net models. An elastic net model is a linear model with elastic L1 (lasso) and L2 (ridge) regularization. glmnet is written in FORTRAN and it is fast. It also has supporting functions for the selection and cross validation of the best penalty amount.

A Sample Case Study

The data models count data collected from a real data. I normally use the tidymodels framework for doing my machine learning projects. In this post, I preferred to use base R functions, to focus on the differences between different Poisson models². My main interest here is illustrating the various models, it is not getting the best performance metrics. So, the examples below don’t go through a formal workflow for model tuning (resampling, cross validation, hyper-parameter tuning, ,,,,etc).

Note
This is not a formal model tuning process. In a strict tuning process, I would use cross validation to validate the model performance. Instead, my aim here is to illustrate the various poisson modeling, so I relax this procedure.

The data has already passed through the pre-processing steps (e.g. normalization, transformation, ..etc).

The data has 40 predictors (named PC01 to PC04)³.

The response variable is called label. Let’s explore it.

summary(data_train$label)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    5.00   14.00   28.78   34.00 1038.00

# Explore Response Var ----
mean(data_train$label)

## [1] 28.77936

var(data_train$label)

## [1] 1933.054

MASS::fitdistr(data_train$label, "poisson")

##      lambda   
##   28.77935777 
##  ( 0.01656724)

The variance is larger than the mean, so this is more like a dispersed Poisson distribution. This may be better modeled by a quassi-poisson model or a negative binomial model, as will be shown later.

ggplot(data_train, aes(x = label)) +
  geom_histogram(bins = 60, aes(y = ..density..)) +
  stat_function(fun = dpois, color = "red", args = list(lambda = 28.5), n = 11)

ggplot(data_train, aes(x = label)) +
  geom_bar() +
  scale_x_continuous(limits = c(0, 50))

data_train %>% 
  count(label) %>% 
  slice_head(n = 10) %>% 
  kbl() %>% 
  kable_classic(full_width = F, position = "left")

label	n
0	5757
1	5636
2	5014
3	4655
4	4190
5	3775
6	3465
7	3113
8	3019
9	2698

The distribution is not really a poisson one:
1. It is zero inflated. 2. It is over dispersed and has flatter tail. 3. It has some outliers at the right end.

Ordinal Linear Regression

Let’s start with a plain vanilla linear regression model. I will use the built in lm() for that.

system.time(lm_fit <- lm(
  label ~ .,
  data = data_train))

##    user  system elapsed 
##   3.282   0.020   0.283

Linear models are simple and fast, it only took few seconds to train this model.

Let’s have a look on the top 6 coefficients

broom::tidy(lm_fit) %>% 
  arrange(desc(abs(statistic))) %>% 
  slice_head(n = 6) %>% 
  kbl() %>% 
  kable_classic(full_width = F, position = "left")

term	estimate	std.error	statistic
(Intercept)	28.783592	0.0978353	294.20469
PC01	3.431140	0.0141770	242.02122
PC04	-3.428379	0.0324431	-105.67358
PC20	5.219567	0.0656759	79.47456
PC10	2.885033	0.0528761	54.56209
PC06	-2.221422	0.0429615	-51.70730

Now, let’s check the performance metrics.

res_lm <- calc_model_metrics(lm_fit, data_train, data_test, type = "response")
res_lm$metrics %>% 
  kbl(caption = "Model Metrics", digits = 2) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "left")

Model Metrics
Data	rsq	rmse
Train	0.48	31.67
Test	0.48	31.67

The model have subpar performance, the r square (rsq) is 0.48. The root square mean error (rmsq) is 31.7, which is considerably larger than the mean of the data.

Now, let’s examine the model more closely.

plot_grid(
  res_lm$plots$p1,
  res_lm$plots$p2, 
  res_lm$plots$p3,
  ncol = 2, rel_widths = c(1.5, 1, 1))

The top left plot shows the distribution of the actual label and the predicted label. Since this is a normal linear regression model, the predicted label distribution follows a normal distribution. The mismatch between the predicted and true label distributions is clear.

Notes:
1. Prediction is not correct when the label is high.
2. Predictions follow a normal distribution and has negative values.

Using a generalized linear model with a Poisson response is expected to produce a better and more fit match. Let’s try it.

Poisson Regression using `glm()`

Now, let’s try fitting a Poisson regression model. I will use the glm() function. glm() is included in standard R, it implements the “Generalized Linear Model”.

system.time(glm_pois_fit <- glm(
  label ~ .,
  family = "poisson",
  data = data_train))

##    user  system elapsed 
##  17.446   0.132   1.489

res_glm_pois <- calc_model_metrics(glm_pois_fit, data_train, data_test, type = "response")
res_glm_pois$metrics %>% 
  kbl(caption = "Model Metrics", digits = 2) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "left")

Model Metrics
Data	rsq	rmse
Train	0.62	27.23
Test	0.64	26.67

Just switching to a Poisson response significantly enhanced the metrics.

Coeffecients

Let’s have a look on the top 6 coefficients

broom::tidy(glm_pois_fit) %>% 
  arrange(desc(abs(statistic))) %>% 
  slice_head(n = 6) %>% 
  kbl() %>% 
  kable_classic(full_width = F, position = "left")

term	estimate	std.error	statistic
(Intercept)	2.7305066	0.0009524	2866.8450
PC01	0.1641104	0.0001499	1095.0280
PC20	0.1459469	0.0003512	415.5406
PC11	-0.0928925	0.0002576	-360.6672
PC05	-0.0508081	0.0001877	-270.7335
PC04	-0.0502220	0.0002094	-239.7806

Residual Plots

Let’s check the residual plots.

plot_grid(
  res_glm_pois$plots$p1,
  res_glm_pois$plots$p2, 
  res_glm_pois$plots$p3,
  ncol = 2, rel_widths = c(1.5, 1, 1))

This is much better:
1. The residual histogram is now not skewed.
2. The distribution of the predicted values is now close to the true values.

Quasi-Poisson

What if I use a quasipoisson() family instead of a plain vanilla Poisson family. I expect it should be better as the response variable is over dispersed.

system.time(qpois_fit <- glm(
  label ~ .,
  data = data_train,
  family  = quasipoisson()))

##    user  system elapsed 
##  14.209   0.155   1.160

Metrics

res_qpois <- calc_model_metrics(qpois_fit, data_train, data_test, type = "response")
res_qpois$metrics %>% 
  kbl(caption = "Model Metrics", digits = 2) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "left")

Model Metrics
Data	rsq	rmse
Train	0.62	27.23
Test	0.64	26.67

Coeffecients

broom::tidy(qpois_fit) %>% 
  arrange(desc(abs(statistic))) %>% 
  slice_head(n = 6) %>% 
  kbl() %>% 
  kable_classic(full_width = F, position = "left")

term	estimate	std.error	statistic
(Intercept)	2.7305066	0.0045967	594.00895
PC01	0.1641104	0.0007233	226.88929
PC20	0.1459469	0.0016951	86.09981
PC11	-0.0928925	0.0012430	-74.73008
PC05	-0.0508081	0.0009057	-56.09586
PC04	-0.0502220	0.0010109	-49.68243

Residual Plots

plot_grid(
  res_qpois$plots$p1,
  res_qpois$plots$p2, 
  res_qpois$plots$p3,
  ncol = 2, rel_widths = c(1.5, 1, 1))

It has exactly the same results as the Poisson regression. This is because the quasi-poisson estimates the same parameter coefficients. It changes the p-values and confidence intervals. This is normal, as described in this [stack overflow answer] (https://stats.stackexchange.com/questions/176918/identical-coefficients-estimated-in-poisson-vs-quasi-poisson-model)

Negative Binomial Response

library(MASS, quietly = TRUE)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

system.time(nbin_fit <- glm.nb(
  label ~ .,
  data = data_train))

##    user  system elapsed 
##  62.903   0.763   8.681

Metrics

res_nbin <- calc_model_metrics(nbin_fit, data_train, data_test, type = "response")
res_nbin$metrics %>% 
  kbl(caption = "Model Metrics", digits = 2) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "left")

Model Metrics
Data	rsq	rmse
Train	0.00	1.814739e+09
Test	0.56	3.286000e+01

Coeffecients

broom::tidy(nbin_fit) %>% 
  arrange(desc(abs(statistic))) %>% 
  slice_head(n = 6) %>% 
  kbl() %>% 
  kable_classic(full_width = F, position = "left")

term	estimate	std.error	statistic
(Intercept)	2.7092062	0.0021088	1284.70989
PC01	0.1639295	0.0003581	457.82244
PC20	0.1446654	0.0013270	109.01819
PC04	-0.0718339	0.0006863	-104.66943
PC15	-0.0956607	0.0011528	-82.98073
PC11	-0.0841760	0.0010788	-78.02805

Residual Plots

plot_grid(
  res_nbin$plots$p1,
  res_nbin$plots$p2, 
  res_nbin$plots$p3,
  ncol = 2, rel_widths = c(1.5, 1, 1))

Hurdle Model

What if we try a hurdle model

library(pscl, quietly = TRUE)
system.time(hurdle_fit <- hurdle(
  label ~ .,
  data = data_train,
  dist  = "poisson"))

##    user  system elapsed 
## 245.361   0.985  20.830

res_hurdle <- calc_model_metrics(hurdle_fit, data_train, data_test, type = "response")

Metrics

res_hurdle$metrics %>% 
  kbl(caption = "Model Metrics", digits = 2) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "left")

Model Metrics
Data	rsq	rmse
Train	0.62	27.21
Test	0.64	26.65

Coeffecients

Residual Plots

plot_grid(
  res_hurdle$plots$p1,
  res_hurdle$plots$p2, 
  res_hurdle$plots$p3,
  ncol = 2, rel_widths = c(1.5, 1, 1))

Zero-inflated Model

What if we try a zero inflated model

library(pscl, quietly = TRUE)
system.time(zeroinf_fit <- zeroinfl(
  label ~ .,
  data = data_train,
  dist  = "poisson"))

##    user  system elapsed 
## 417.913   1.989  37.954

res_zeroinf <- calc_model_metrics(zeroinf_fit, data_train, data_test, type = "response")

Metrics

res_zeroinf$metrics %>% 
  kbl(caption = "Model Metrics", digits = 2) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "left")

Model Metrics
Data	rsq	rmse
Train	0.62	27.21
Test	0.64	26.65

Coeffecients

Residual Plots

plot_grid(
  res_zeroinf$plots$p1,
  res_zeroinf$plots$p2, 
  res_zeroinf$plots$p3,
  ncol = 2, rel_widths = c(1.5, 1, 1))

Conclusion

Using a poisson model to model a count response variable yields better and more representable results than using a classic regression model.
If the response variable doesn’t strictly follow a poisson distribution, then using a quasi model and/or a zero inflated model may give better results.

References

The Poisson regression is a well established technique and has countless online refrences. However, the below refrences helped me a lot in understanding the Poisson regression and its tools.

Brian Caffo Youtube lectures on the Generalised Linear Models.
The companion book by Brian Caffo, Regression Models for Data Science in R. The book is available on Leanpub.
A glmnet webinar presented by Dr.Trevor Hastie.
Regression Models for Count Data in R, Zeileis et al.

Unfortunately, I can’t share this data even though it is anonymized (using PCA).↩︎
I used the tidymodels recipes package for pre-processing the data↩︎
Actually, those predictors are the outcome of a PCA transfrmation↩︎

Poisson Regression

Not all linear regression models are made equal

Introduction

What is a Poisson Outcome

Why Linear Regression

Generalised Linear Models

a. A classic linear regression model

b. Logistic regression

c. Poisson regression

glmnet models

A Sample Case Study

Ordinal Linear Regression

Poisson Regression using glm()

Coeffecients

Residual Plots

Quasi-Poisson

Metrics

Coeffecients

Residual Plots

Negative Binomial Response

Metrics

Coeffecients

Residual Plots

Hurdle Model

Metrics

Coeffecients

Residual Plots

Zero-inflated Model

Metrics

Coeffecients

Residual Plots

Conclusion

References

Poisson Regression using `glm()`