Poisson Regression model is used to model count data. These are response variables that are discrete data with non-negative, whole (non-fraction) integer values that count something.
In other words, it shows which explanatory variables have a notable effect on the response variable. Poisson Regression involves regression models in which the response variable is in the form of counts and not fractional numbers.
Much like Oridinary Least Squares regression, Poisson regression has some general model assumptions:
Poisson Response: The response variable is a count per unit of time Independence: The observations must be independent of each other Mean=Variance: The mean of a Poisson random variable must be equal to its variance. Linearity: The log of the mean rate, log(λ), must be a linear function of x.
In this example we will use a dataset, where the response variable is the number of awards earned by students at one high school. The independent variables include the type of program in which the student was enrolled in (vocational, general or academic) and each student’s final exam score in math.
library(ggplot2)
library(skimr)
library(tidyr)
library(dplyr)
awards <- read.csv("https://stats.idre.ucla.edu/stat/data/poisson_sim.csv")
skim(awards)
| Name | awards |
| Number of rows | 200 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 100.50 | 57.88 | 1 | 50.75 | 100.5 | 150.25 | 200 | ▇▇▇▇▇ |
| num_awards | 0 | 1 | 0.63 | 1.05 | 0 | 0.00 | 0.0 | 1.00 | 6 | ▇▁▁▁▁ |
| prog | 0 | 1 | 2.02 | 0.69 | 1 | 2.00 | 2.0 | 2.25 | 3 | ▃▁▇▁▃ |
| math | 0 | 1 | 52.65 | 9.37 | 33 | 45.00 | 52.0 | 59.00 | 75 | ▃▆▇▅▂ |
awards %>% select(-id) %>% gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scale = "free", ncol = 3) +
geom_histogram(binwidth = 1) +
theme_minimal()
awards$prog <- factor(awards$prog, levels=1:3, labels=c("General", "Academic", "Vocational"))
awards_m <- glm(num_awards ~ prog + math, family="poisson", data=awards)
summary(awards_m)
##
## Call:
## glm(formula = num_awards ~ prog + math, family = "poisson", data = awards)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2043 -0.8436 -0.5106 0.2558 2.6796
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.24712 0.65845 -7.969 1.60e-15 ***
## progAcademic 1.08386 0.35825 3.025 0.00248 **
## progVocational 0.36981 0.44107 0.838 0.40179
## math 0.07015 0.01060 6.619 3.63e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 287.67 on 199 degrees of freedom
## Residual deviance: 189.45 on 196 degrees of freedom
## AIC: 373.5
##
## Number of Fisher Scoring iterations: 6
To evaluate the performance of a GLM model, we consider the following from the model summary:
Coefficients: As with linear regression these represent the beta coefficients and their statistical significance, below 0.05
AIC (Akaike Information Criteria): To measure the overall model in linear regression \(R^2\) was used. In Logistic regression AIC is used as a measure of fit. Residual deviance indicates the response predicted by a model on adding independent variables. A lower AIC means a better model.
Deviance: Deviance is also a measure of goodness of fit of a model. Higher numbers indicates bad fit. Deviance is reported in two forms, Null Deviance and Residual Deviance
Null Deviance shows how well the response variable is predicted with nothing but an intercept.
Residual Deviance shows how well the response variable is predicted with the independent variables.
The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better.