The linear regression is a well established modeling technique that is used for modeling a response variable using numerous explanatory variables (or what we call features in the machine learning jargon). The ordinay linear regression is well suited for modeling response variables that have a normal distributions. For modeling other types of response variables, other derivatives of the linear regression are more suited.
Two of the most common and widely used ones are logistic regression, which is used for modeling a Bernoulli outcome, and Poisson regression, used for modeling a Poisson outcome.
In this post, I try to explain the Poisson regression and present a demo for it using real data.1. I use different models and compare between them.
The Poisson distribution is the discrete probability distribution of the number of events occurring in a fixed amount of time or space, given that these events occur with a known constant mean rate and occur independently. For example, the number of calls received in a call center during any minute has a Poisson probability distribution.
lambas <- c(2, 5, 10, 50)
pois_data <- map_dfr(lambas, ~tibble(Value = rpois(1e4, .x), lambda = .x))
ggplot(pois_data, aes(x = Value)) +
geom_bar(width = 0.8, fill = "blue", alpha = 0.6) +
facet_wrap(~lambda, scales = "free")
The Poisson distribution is well suited for modeling count data.
For sufficiently large values of λ, (say λ>1000), the normal distribution with mean λ and variance λ is an excellent approximation to the Poisson distribution.
Generalized Linear Models (GLM) were first introduced by (Nelder and Wedderburn in 1972)[https://www.jstor.org/stable/2344614]. They are very robust & popular and are still widely used today. The generalized linear model is a family of models that includes the classic linear regression model. It extends the linear regression model to also include other models that proved to be very useful and practical.
It has 3 Components:
- An exponential family model for the response. - A systematic component via a linear predictor. - A link function that connects the means of the response to the linear predictor.
To illustrate this, let’s examine those three components for various models.
with an error that belongs to normal distribution \[ \epsilon \overset{iid}\sim N(0,\sigma^2) \]
In this case, the response family is a Bernoulli distribution, that’s \(Y_i \sim Bernoulli(\mu_i)\).
The linear predictor is also the weighted sum of the features, that’s \(\eta = \sum_{k=1}^{p}X_{ki}\beta_k\).
The link function is the logit function, that’s \(\eta = log(\frac{\mu}{1-\mu})\).
For the Poisson regression, the glm would have the below components:
1. Response family is a Poisson distribution, that’s \(Y_i \sim Poisson(\mu_i)\).
1. Linear predictor is also \(\eta = \sum_{k=1}^{p}X_{ki}\beta_k\).
1. The link function is the log function, so \(\eta = log(\mu)\).
You may come across the term “glmnet model”, so I’m giving a brief about it. glmnet is a popular R implementation of the elastic net models. An elastic net model is a linear model with elastic L1 (lasso) and L2 (ridge) regularization. glmnet is written in FORTRAN and it is fast. It also has supporting functions for the selection and cross validation of the best penalty amount.
The data models count data collected from a real data. I normally use the tidymodels framework for doing my machine learning projects. In this post, I preferred to use base R functions, to focus on the differences between different Poisson models2. My main interest here is illustrating the various models, it is not getting the best performance metrics. So, the examples below don’t go through a formal workflow for model tuning (resampling, cross validation, hyper-parameter tuning, ,,,,etc).
Note
This is not a formal model tuning process. In a strict tuning process, I would use cross validation to validate the model performance. Instead, my aim here is to illustrate the various poisson modeling, so I relax this procedure.
The data has already passed through the pre-processing steps (e.g. normalization, transformation, ..etc).
The data has 40 predictors (named PC01 to PC04)3.
The response variable is called label. Let’s explore it.
summary(data_train$label)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 5.00 14.00 28.78 34.00 1038.00
# Explore Response Var ----
mean(data_train$label)
## [1] 28.77936
var(data_train$label)
## [1] 1933.054
MASS::fitdistr(data_train$label, "poisson")
## lambda
## 28.77935777
## ( 0.01656724)
The variance is larger than the mean, so this is more like a dispersed Poisson distribution. This may be better modeled by a quassi-poisson model or a negative binomial model, as will be shown later.
ggplot(data_train, aes(x = label)) +
geom_histogram(bins = 60, aes(y = ..density..)) +
stat_function(fun = dpois, color = "red", args = list(lambda = 28.5), n = 11)
ggplot(data_train, aes(x = label)) +
geom_bar() +
scale_x_continuous(limits = c(0, 50))
data_train %>%
count(label) %>%
slice_head(n = 10) %>%
kbl() %>%
kable_classic(full_width = F, position = "left")
| label | n |
|---|---|
| 0 | 5757 |
| 1 | 5636 |
| 2 | 5014 |
| 3 | 4655 |
| 4 | 4190 |
| 5 | 3775 |
| 6 | 3465 |
| 7 | 3113 |
| 8 | 3019 |
| 9 | 2698 |
The distribution is not really a poisson one:
1. It is zero inflated. 2. It is over dispersed and has flatter tail. 3. It has some outliers at the right end.
Let’s start with a plain vanilla linear regression model. I will use the built in lm() for that.
system.time(lm_fit <- lm(
label ~ .,
data = data_train))
## user system elapsed
## 3.282 0.020 0.283
Linear models are simple and fast, it only took few seconds to train this model.
Let’s have a look on the top 6 coefficients
broom::tidy(lm_fit) %>%
arrange(desc(abs(statistic))) %>%
slice_head(n = 6) %>%
kbl() %>%
kable_classic(full_width = F, position = "left")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 28.783592 | 0.0978353 | 294.20469 | 0 |
| PC01 | 3.431140 | 0.0141770 | 242.02122 | 0 |
| PC04 | -3.428379 | 0.0324431 | -105.67358 | 0 |
| PC20 | 5.219567 | 0.0656759 | 79.47456 | 0 |
| PC10 | 2.885033 | 0.0528761 | 54.56209 | 0 |
| PC06 | -2.221422 | 0.0429615 | -51.70730 | 0 |
Now, let’s check the performance metrics.
res_lm <- calc_model_metrics(lm_fit, data_train, data_test, type = "response")
res_lm$metrics %>%
kbl(caption = "Model Metrics", digits = 2) %>%
kable_classic(full_width = F, html_font = "Cambria", position = "left")
| Data | rsq | rmse |
|---|---|---|
| Train | 0.48 | 31.67 |
| Test | 0.48 | 31.67 |
The model have subpar performance, the r square (rsq) is 0.48. The root square mean error (rmsq) is 31.7, which is considerably larger than the mean of the data.
Now, let’s examine the model more closely.
plot_grid(
res_lm$plots$p1,
res_lm$plots$p2,
res_lm$plots$p3,
ncol = 2, rel_widths = c(1.5, 1, 1))
The top left plot shows the distribution of the actual label and the predicted label. Since this is a normal linear regression model, the predicted label distribution follows a normal distribution. The mismatch between the predicted and true label distributions is clear.
Notes:
1. Prediction is not correct when the label is high.
2. Predictions follow a normal distribution and has negative values.
Using a generalized linear model with a Poisson response is expected to produce a better and more fit match. Let’s try it.
glm()Now, let’s try fitting a Poisson regression model. I will use the glm() function. glm() is included in standard R, it implements the “Generalized Linear Model”.
system.time(glm_pois_fit <- glm(
label ~ .,
family = "poisson",
data = data_train))
## user system elapsed
## 17.446 0.132 1.489
res_glm_pois <- calc_model_metrics(glm_pois_fit, data_train, data_test, type = "response")
res_glm_pois$metrics %>%
kbl(caption = "Model Metrics", digits = 2) %>%
kable_classic(full_width = F, html_font = "Cambria", position = "left")
| Data | rsq | rmse |
|---|---|---|
| Train | 0.62 | 27.23 |
| Test | 0.64 | 26.67 |
Just switching to a Poisson response significantly enhanced the metrics.
Let’s have a look on the top 6 coefficients
broom::tidy(glm_pois_fit) %>%
arrange(desc(abs(statistic))) %>%
slice_head(n = 6) %>%
kbl() %>%
kable_classic(full_width = F, position = "left")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.7305066 | 0.0009524 | 2866.8450 | 0 |
| PC01 | 0.1641104 | 0.0001499 | 1095.0280 | 0 |
| PC20 | 0.1459469 | 0.0003512 | 415.5406 | 0 |
| PC11 | -0.0928925 | 0.0002576 | -360.6672 | 0 |
| PC05 | -0.0508081 | 0.0001877 | -270.7335 | 0 |
| PC04 | -0.0502220 | 0.0002094 | -239.7806 | 0 |
Let’s check the residual plots.
plot_grid(
res_glm_pois$plots$p1,
res_glm_pois$plots$p2,
res_glm_pois$plots$p3,
ncol = 2, rel_widths = c(1.5, 1, 1))
This is much better:
1. The residual histogram is now not skewed.
2. The distribution of the predicted values is now close to the true values.
What if I use a quasipoisson() family instead of a plain vanilla Poisson family. I expect it should be better as the response variable is over dispersed.
system.time(qpois_fit <- glm(
label ~ .,
data = data_train,
family = quasipoisson()))
## user system elapsed
## 14.209 0.155 1.160
res_qpois <- calc_model_metrics(qpois_fit, data_train, data_test, type = "response")
res_qpois$metrics %>%
kbl(caption = "Model Metrics", digits = 2) %>%
kable_classic(full_width = F, html_font = "Cambria", position = "left")
| Data | rsq | rmse |
|---|---|---|
| Train | 0.62 | 27.23 |
| Test | 0.64 | 26.67 |
broom::tidy(qpois_fit) %>%
arrange(desc(abs(statistic))) %>%
slice_head(n = 6) %>%
kbl() %>%
kable_classic(full_width = F, position = "left")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.7305066 | 0.0045967 | 594.00895 | 0 |
| PC01 | 0.1641104 | 0.0007233 | 226.88929 | 0 |
| PC20 | 0.1459469 | 0.0016951 | 86.09981 | 0 |
| PC11 | -0.0928925 | 0.0012430 | -74.73008 | 0 |
| PC05 | -0.0508081 | 0.0009057 | -56.09586 | 0 |
| PC04 | -0.0502220 | 0.0010109 | -49.68243 | 0 |
plot_grid(
res_qpois$plots$p1,
res_qpois$plots$p2,
res_qpois$plots$p3,
ncol = 2, rel_widths = c(1.5, 1, 1))
It has exactly the same results as the Poisson regression. This is because the quasi-poisson estimates the same parameter coefficients. It changes the p-values and confidence intervals. This is normal, as described in this [stack overflow answer] (https://stats.stackexchange.com/questions/176918/identical-coefficients-estimated-in-poisson-vs-quasi-poisson-model)
library(MASS, quietly = TRUE)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
system.time(nbin_fit <- glm.nb(
label ~ .,
data = data_train))
## user system elapsed
## 62.903 0.763 8.681
res_nbin <- calc_model_metrics(nbin_fit, data_train, data_test, type = "response")
res_nbin$metrics %>%
kbl(caption = "Model Metrics", digits = 2) %>%
kable_classic(full_width = F, html_font = "Cambria", position = "left")
| Data | rsq | rmse |
|---|---|---|
| Train | 0.00 | 1.814739e+09 |
| Test | 0.56 | 3.286000e+01 |
broom::tidy(nbin_fit) %>%
arrange(desc(abs(statistic))) %>%
slice_head(n = 6) %>%
kbl() %>%
kable_classic(full_width = F, position = "left")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.7092062 | 0.0021088 | 1284.70989 | 0 |
| PC01 | 0.1639295 | 0.0003581 | 457.82244 | 0 |
| PC20 | 0.1446654 | 0.0013270 | 109.01819 | 0 |
| PC04 | -0.0718339 | 0.0006863 | -104.66943 | 0 |
| PC15 | -0.0956607 | 0.0011528 | -82.98073 | 0 |
| PC11 | -0.0841760 | 0.0010788 | -78.02805 | 0 |
plot_grid(
res_nbin$plots$p1,
res_nbin$plots$p2,
res_nbin$plots$p3,
ncol = 2, rel_widths = c(1.5, 1, 1))
What if we try a hurdle model
library(pscl, quietly = TRUE)
system.time(hurdle_fit <- hurdle(
label ~ .,
data = data_train,
dist = "poisson"))
## user system elapsed
## 245.361 0.985 20.830
res_hurdle <- calc_model_metrics(hurdle_fit, data_train, data_test, type = "response")
res_hurdle$metrics %>%
kbl(caption = "Model Metrics", digits = 2) %>%
kable_classic(full_width = F, html_font = "Cambria", position = "left")
| Data | rsq | rmse |
|---|---|---|
| Train | 0.62 | 27.21 |
| Test | 0.64 | 26.65 |
plot_grid(
res_hurdle$plots$p1,
res_hurdle$plots$p2,
res_hurdle$plots$p3,
ncol = 2, rel_widths = c(1.5, 1, 1))
What if we try a zero inflated model
library(pscl, quietly = TRUE)
system.time(zeroinf_fit <- zeroinfl(
label ~ .,
data = data_train,
dist = "poisson"))
## user system elapsed
## 417.913 1.989 37.954
res_zeroinf <- calc_model_metrics(zeroinf_fit, data_train, data_test, type = "response")
res_zeroinf$metrics %>%
kbl(caption = "Model Metrics", digits = 2) %>%
kable_classic(full_width = F, html_font = "Cambria", position = "left")
| Data | rsq | rmse |
|---|---|---|
| Train | 0.62 | 27.21 |
| Test | 0.64 | 26.65 |
plot_grid(
res_zeroinf$plots$p1,
res_zeroinf$plots$p2,
res_zeroinf$plots$p3,
ncol = 2, rel_widths = c(1.5, 1, 1))
Using a poisson model to model a count response variable yields better and more representable results than using a classic regression model.
If the response variable doesn’t strictly follow a poisson distribution, then using a quasi model and/or a zero inflated model may give better results.
The Poisson regression is a well established technique and has countless online refrences. However, the below refrences helped me a lot in understanding the Poisson regression and its tools.