Blog 3

Poisson Regression

Poisson Regression model is used to model count data. These are response variables that are discrete data with non-negative, whole (non-fraction) integer values that count something.

In other words, it shows which explanatory variables have a notable effect on the response variable. Poisson Regression involves regression models in which the response variable is in the form of counts and not fractional numbers.

Assumptions

Much like Oridinary Least Squares regression, Poisson regression has some general model assumptions:

Poisson Response: The response variable is a count per unit of time Independence: The observations must be independent of each other Mean=Variance: The mean of a Poisson random variable must be equal to its variance. Linearity: The log of the mean rate, log(λ), must be a linear function of x.

In this example we will use a dataset, where the response variable is the number of awards earned by students at one high school. The independent variables include the type of program in which the student was enrolled in (vocational, general or academic) and each student’s final exam score in math.

library(ggplot2)
library(skimr)
library(tidyr)
library(dplyr)

Load and Visualize

awards <- read.csv("https://stats.idre.ucla.edu/stat/data/poisson_sim.csv")
skim(awards)

Data summary
Name	awards
Number of rows	200
Number of columns	4
_______________________
Column type frequency:
numeric	4
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	100.50	57.88	1	50.75	100.5	150.25	200	▇▇▇▇▇
num_awards	1	0.63	1.05	0	0.00	0.0	1.00	6	▇▁▁▁▁
prog	1	2.02	0.69	1	2.00	2.0	2.25	3	▃▁▇▁▃
math	1	52.65	9.37	33	45.00	52.0	59.00	75	▃▆▇▅▂

awards %>% select(-id) %>% gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~key, scale = "free",  ncol = 3) +
  geom_histogram(binwidth = 1) +
  theme_minimal()

awards$prog <- factor(awards$prog, levels=1:3, labels=c("General", "Academic", "Vocational"))

Build Models

awards_m <- glm(num_awards ~ prog + math, family="poisson", data=awards)
summary(awards_m)

## 
## Call:
## glm(formula = num_awards ~ prog + math, family = "poisson", data = awards)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2043  -0.8436  -0.5106   0.2558   2.6796  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -5.24712    0.65845  -7.969 1.60e-15 ***
## progAcademic    1.08386    0.35825   3.025  0.00248 ** 
## progVocational  0.36981    0.44107   0.838  0.40179    
## math            0.07015    0.01060   6.619 3.63e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 287.67  on 199  degrees of freedom
## Residual deviance: 189.45  on 196  degrees of freedom
## AIC: 373.5
## 
## Number of Fisher Scoring iterations: 6

Interpreting Model Results

To evaluate the performance of a GLM model, we consider the following from the model summary:

Coefficients: As with linear regression these represent the beta coefficients and their statistical significance, below 0.05
AIC (Akaike Information Criteria): To measure the overall model in linear regression \(R^2\) was used. In Logistic regression AIC is used as a measure of fit. Residual deviance indicates the response predicted by a model on adding independent variables. A lower AIC means a better model.
Deviance: Deviance is also a measure of goodness of fit of a model. Higher numbers indicates bad fit. Deviance is reported in two forms, Null Deviance and Residual Deviance

Null Deviance shows how well the response variable is predicted with nothing but an intercept.
Residual Deviance shows how well the response variable is predicted with the independent variables.

The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better.