In these notes, we will discuss several types of count models
- Ordered Probit/Logit
- Poisson Model
- Negative Binomial
- Hurdle Model
- Zero Inflated Poisson Model
Suppose your dependent variable is categorical in nature and these categories have an ordinal property (i.e., you can order them)
Examples: Surveys including Likert Scales (strongly agree, agree, neither, disagree, strongly disagree)
Medalling in the Olympics (Gold, Silver, Bronze)
Soda size preference (small, medium, large, extra large)
In these cases of ordinal categorical variables, we use an ordered probit/logit model.
Notice that while these responses can be ordered, the distance between levels is not constant.
That is, we do not know how large an increase is needed to move someone from strongly disagree to disagree.
The math of the ordered probit/logit is very similar to that of a simple binary model.
There is just one difference: we need to estimate the level (cut-off) parameters.
Consider three observed outcomes: \(y = 0, 1, 2\). Consider the latent variable model without a constant: \[y^{*} = \beta_1 x_1 +...+\beta_k x_k +\epsilon\] where \(\epsilon \sim N(0,1)\). Define two cut-off points \(\alpha_1 < \alpha_2\). We do not observe \(y^{*}\), but we do observe choices according to the following rule:
Example (labor force participation): y = 0 if inactive, y = 1 if part-time, and y = 2 if full-time
\(y^{*} = \beta_e educ +\beta_k kids +\epsilon\), where \(\epsilon \sim N(0,1)\)
Then \(y = 0\) if \(\beta_e educ +\beta_k kids +\epsilon < \alpha_1\)
\(y = 1\) if \(\alpha_1 < \beta_e educ +\beta_k kids +\epsilon < \alpha_2\)
\(y = 2\) if \(\alpha_2 < \beta_e educ +\beta_k kids +\epsilon\)
We could alternatively introduce a constant \(\beta_0\) and assume that \(\alpha_1\) = 0.
\[\log L=\sum_{i=1}^{n} \sum_j Z_{ij}\log[F_{ij}-F_{i,j-1}]\]
where \(F_{ij}=F(\alpha_j-\beta_e educ -\beta_k kids)\) and \(Z_{ij}\) is an indicator that equals 1 if person \(i\) selects option \(j\) and zero otherwise.
We choose the values of \(\alpha\) and \(\beta\) that maximize the value of the log likelihood function.
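To make the estimation concrete, here is a minimal sketch of this log likelihood for the three-outcome case, written as an R function (hypothetical inputs X, y, and parameter vector par; the logistic CDF plays the role of F):
ologit_loglik <- function(par, X, y) {
  # par = (beta_1, ..., beta_k, alpha_1, alpha_2) with alpha_1 < alpha_2
  beta   <- par[1:ncol(X)]
  alpha1 <- par[ncol(X) + 1]
  alpha2 <- par[ncol(X) + 2]
  xb <- X %*% beta
  p0 <- plogis(alpha1 - xb)                       # Pr(y = 0)
  p1 <- plogis(alpha2 - xb) - plogis(alpha1 - xb) # Pr(y = 1)
  p2 <- 1 - plogis(alpha2 - xb)                   # Pr(y = 2)
  sum(log(ifelse(y == 0, p0, ifelse(y == 1, p1, p2))))
}
# maximizing this with optim(..., control = list(fnscale = -1))
# recovers the alpha and beta estimates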
A study looks at factors that influence the decision of whether to apply to graduate school.
College juniors are asked if they are unlikely, somewhat likely, or very likely to apply to graduate school.
Hence, our outcome variable has three categories.
Data on parental educational status, whether the undergraduate institution is public or private, and current GPA are also collected.
library(foreign)
dat <- read.dta("https://stats.idre.ucla.edu/stat/data/ologit.dta")
head(dat)
##             apply pared public  gpa
## 1     very likely     0      0 3.26
## 2 somewhat likely     1      0 3.21
## 3        unlikely     1      1 3.94
## 4 somewhat likely     0      0 2.81
## 5 somewhat likely     0      0 2.53
## 6        unlikely     0      1 2.59
## one at a time, table apply, pared, and public
lapply(dat[, c("apply", "pared", "public")], table)
## $apply
##
##        unlikely somewhat likely     very likely
##             220             140              40
##
## $pared
##
##   0   1
## 337  63
##
## $public
##
##   0   1
## 343  57
library(MASS)
## fit the ordered logit model and store the results in 'm'
m <- polr(apply ~ pared + public + gpa, data = dat, Hess = TRUE)
## view a summary of the model
summary(m)
## Call:
## polr(formula = apply ~ pared + public + gpa, data = dat, Hess = TRUE)
##
## Coefficients:
##           Value Std. Error t value
## pared   1.04769     0.2658  3.9418
## public -0.05879     0.2979 -0.1974
## gpa     0.61594     0.2606  2.3632
##
## Intercepts:
##                             Value  Std. Error t value
## unlikely|somewhat likely    2.2039 0.7795     2.8272
## somewhat likely|very likely 4.2994 0.8043     5.3453
##
## Residual Deviance: 717.0249
## AIC: 727.0249
Let's say we want to know the probability that someone whose parents are educated is very likely to go to graduate school.
We want to know \[Pr(y=2|pared=1, public=0)\] across a range of GPA values.
## First let's find the cumulative probability of being less than 2.
gpa<-as.vector(seq(0,4,.1))
pared.f<-rep(1,length(gpa))
public.f<-rep(0,length(gpa))
fakedata=as.matrix(rbind(pared.f,public.f,gpa))
## logistic CDF: Pr(y < 2) = F(alpha_2 - x'beta)
p2 = exp(m$zeta[2] - t(fakedata) %*% m$coefficients) /
  (1 + exp(m$zeta[2] - t(fakedata) %*% m$coefficients))
## but we want to know Pr(y=2)=1-Pr(y<2)
p2=1-p2
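As a sanity check, the same probabilities can be obtained from polr's own predict method (a sketch assuming the fitted object m and the gpa grid defined above):
## cross-check the hand calculation against polr's predicted probabilities
newdat <- data.frame(pared = 1, public = 0, gpa = gpa)
phat <- predict(m, newdata = newdat, type = "probs")
head(phat[, "very likely"])  # should match p2 computed above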
Count data are non-negative integer counts of the outcome variable and can include zero.
These data present problems for typical linear regression methods because the outcome variable is non-negative, while the regression error ranges over \((-\infty, \infty)\).
We cannot simply take the natural log of the outcome variable because zeros are present.
Also, the outcome is discrete rather than continuous.
Count data capture the number of occurrences over a fixed time interval, such as the number of physician office visits in a year (the example used below).
The Poisson density function is \[Pr(Y=y)=\frac{e^{-\lambda}\lambda^y}{y!}\]
where the \(E[y|X]=Var[y|X]=\lambda\)
For our purposes, we assume \(\lambda=\exp(X\beta)\)
\(\beta\) is interpreted as a semi-elasticity: a one-unit increase in \(X\) changes \(\lambda\) by approximately \(100\beta\) percent (exactly, by a factor of \(e^{\beta}\)).
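A quick simulation illustrates this interpretation (a sketch with made-up data; the coefficient 0.3 implies roughly a 30% increase in \(\lambda\) per unit of x):
set.seed(123)
x <- rnorm(1000)
y <- rpois(1000, exp(0.5 + 0.3 * x))  # true lambda = exp(0.5 + 0.3x)
coef(glm(y ~ x, family = poisson))    # slope estimate should be near 0.3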
The equidispersion property of the Poisson distribution, \(E[y|X]=Var[y|X]=\lambda\), is very often violated in practice.
When this property fails to hold, we state that there is overdispersion in the data. In this case, the negative binomial can be used.
Overdispersion causes the standard errors estimated using the Poisson distribution to be too small. An alternative approach is to estimate heteroskedasticity-robust standard errors.
Wooldridge (1999) demonstrated that using robust standard errors compensates for the overdispersion problem and often performs better than the negative binomial distribution.
The Negative Binomial distribution has a less restrictive relationship between the mean and variance of y. \[Var(y)=\lambda+\alpha \lambda^2\]
We can think of \(\alpha\) as the overdispersion parameter.
Step 1. Estimate the count model using the negative binomial distribution.
Step 2. Test \(H_0:\alpha=0\) vs \(H_a:\alpha\neq0\)
Step 3. If you reject \(H_0\), overdispersion is present and the negative binomial is preferred; otherwise the Poisson is adequate. (A sketch of this test appears after the models are estimated below.)
library(MASS); library(AER) # MASS contains the negative binomial model; AER provides the NMES1988 data
data("NMES1988")
plot(table(NMES1988$visits),xlab="Physician Office Visits", ylab="Counts")
First, notice the shape of Physician office visits. Does this look normally distributed to you?
Should we be concerned with the mass of zeros?
How about those outliers?
fm_pois <- glm(visits ~ ., data = NMES1988, family = poisson) # Standard Poisson Model
# Calculating robust standard errors
cov.m1 <- vcovHC(fm_pois, type="HC0")
std.err <- sqrt(diag(cov.m1))
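These robust standard errors can also be attached directly to the coefficient table; one way is coeftest() from the lmtest package (loaded with AER):
lmtest::coeftest(fm_pois, vcov. = cov.m1)  # z-tests using the robust variance matrix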
fm_qpois <- glm(visits ~ ., data = NMES1988, family = quasipoisson) # Quasi-Poisson
fm_nbin <- glm.nb(visits ~ ., data = NMES1988) # Negative Binomial
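With both models in hand, Step 2 above can be carried out as a likelihood-ratio test (a sketch; because \(\alpha = 0\) sits on the boundary of the parameter space, the usual chi-squared p-value is conventionally halved):
## LR test of H0: alpha = 0 (Poisson) vs. Ha: alpha > 0 (negative binomial)
lr <- 2 * (logLik(fm_nbin) - logLik(fm_pois))
pchisq(as.numeric(lr), df = 1, lower.tail = FALSE) / 2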
stargazer::stargazer(fm_pois, fm_pois, fm_qpois, fm_nbin, type = "html",
                     se = list(NULL, std.err, NULL, NULL),
                     column.labels = c("Poisson", "Poisson Robust",
                                       "Quasi Poisson", "Neg. Bin"))
Dependent variable: visits (physician office visits)

|                   | Poisson     | Poisson (robust SE) | Quasi-Poisson | Negative Binomial |
|-------------------|-------------|---------------------|---------------|-------------------|
|                   | (1)         | (2)                 | (3)           | (4)               |
| hospital          | 0.182***    | 0.182***            | 0.182***      | 0.240***          |
|                   | (0.006)     | (0.022)             | (0.015)       | (0.020)           |
| chronic           | 0.175***    | 0.175***            | 0.175***      | 0.204***          |
|                   | (0.004)     | (0.011)             | (0.011)       | (0.012)           |
| insuranceyes      | 0.183***    | 0.183***            | 0.183***      | 0.192***          |
|                   | (0.017)     | (0.043)             | (0.044)       | (0.040)           |
| gendermale        | -0.116***   | -0.116***           | -0.116***     | -0.136***         |
|                   | (0.013)     | (0.036)             | (0.034)       | (0.031)           |
| school            | 0.022***    | 0.022***            | 0.022***      | 0.021***          |
|                   | (0.002)     | (0.005)             | (0.005)       | (0.004)           |
| Constant          | 1.053***    | 1.053***            | 1.053***      | 0.986***          |
|                   | (0.023)     | (0.064)             | (0.061)       | (0.054)           |
| Observations      | 4,406       | 4,406               | 4,406         | 4,406             |
| Log Likelihood    | -18,151.350 | -18,151.350         |               | -12,209.240       |
| theta             |             |                     |               | 1.180*** (0.033)  |
| Akaike Inf. Crit. | 36,314.700  | 36,314.700          |               | 24,430.490        |

Note: *p<0.1; **p<0.05; ***p<0.01
A problem that plagues both the Poisson and negative binomial models is an excessive number of zeros in the data.
There are two approaches to address these zeros:
- Hurdle Model (also called the Two-Part Model)
- Zero-Inflated Model
The Hurdle Model treats the distribution of zeros separately from the distribution of positive values.
For example, your choice to start exercising may be different than your choice about how often to exercise.
The first part of the model is typically captured by a probit or logit model.
The second part is captured by a Poisson or negative binomial model that is truncated at zero (only counts of 1 or more).
Let the process generating the zeros be \(f_1(\cdot)\) and the process generating the positive responses be \(f_2(\cdot)\), then the two-part density is \[g(y)=\left\{\begin{matrix} f_1(0) & y=0 \\ \frac{1-f_1(0)}{1-f_2(0)}f_2(y) & y>0 \end{matrix}\right.\]
If \(f_1(0)=f_2(0)\), then the hurdle model reduces to the standard count model.
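A small sketch makes the construction concrete, with the positive part taken to be a zero-truncated Poisson (hypothetical parameter values p0 and lambda):
## hurdle density: p0 = f1(0) governs the zeros,
## the positive part is a Poisson truncated at zero
dhurdle <- function(y, p0, lambda) {
  ifelse(y == 0,
         p0,
         (1 - p0) * dpois(y, lambda) / (1 - dpois(0, lambda)))
}
sum(dhurdle(0:100, p0 = 0.4, lambda = 2))  # a valid density: sums to 1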
The zero inflated model treats zeros differently than the hurdle model.
Zeros are produced for two reasons:
- There is an exogenous mass of zeros that is independent of the count process.
- There is a set of zeros generated by the count process itself.
Example: You may drink alcohol or you may not. Even if you do drink alcohol, you may not drink alcohol every day. On the days you do drink, you may have 1, 2, 3 or more drinks.
Let the process generating the structural zeros be \(f_1(\cdot)\) and the process generating the random counts including zero be \(f_2(\cdot)\), then the two-part density is \[g(y)=\left\{\begin{matrix} f_1(0)+(1-f_1(0))f_2(0) & y=0 \\ (1-f_1(0))f_2(y) & y>0 \end{matrix}\right.\]
Think of this model as a finite mixture of a zero mass and the count model.
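The same kind of sketch works for the zero-inflated density, where pi0 plays the role of \(f_1(0)\) (hypothetical parameter values):
## zero-inflated Poisson density: a mixture of a point mass at zero
## and an ordinary Poisson count process
dzip <- function(y, pi0, lambda) {
  ifelse(y == 0,
         pi0 + (1 - pi0) * dpois(0, lambda),
         (1 - pi0) * dpois(y, lambda))
}
sum(dzip(0:100, pi0 = 0.4, lambda = 2))  # a valid density: sums to 1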
Next, we will use R to estimate these models.
The package you will need is pscl: zero-inflation and hurdle models via zeroinfl() and hurdle()
library(pscl)
dt_hurdle <- hurdle(visits ~ hospital + chronic + insurance + school + gender |
hospital + chronic + insurance + school + gender,
data = NMES1988, dist = "negbin")
dt_zinb <- zeroinfl(visits ~ hospital + chronic + insurance + school + gender |
                      hospital + chronic + insurance + school + gender,
                    data = NMES1988, dist = "negbin")
stargazer::stargazer(fm_pois, fm_pois, fm_qpois, fm_nbin, dt_hurdle, dt_zinb,
                     type = "html", se = list(NULL, std.err, NULL, NULL, NULL, NULL),
                     column.labels = c("Poisson", "Poisson Robust", "Quasi Poisson",
                                       "Neg. Bin", "Hurdle", "Zero-Inflated"))
Dependent variable: visits (physician office visits)

|                   | Poisson     | Poisson (robust SE) | Quasi-Poisson | Negative Binomial | Hurdle (neg. bin.) | Zero-Inflated (neg. bin.) |
|-------------------|-------------|---------------------|---------------|-------------------|--------------------|---------------------------|
|                   | (1)         | (2)                 | (3)           | (4)               | (5)                | (6)                       |
| hospital          | 0.182***    | 0.182***            | 0.182***      | 0.240***          | 0.235***           | 0.223***                  |
|                   | (0.006)     | (0.022)             | (0.015)       | (0.020)           | (0.022)            | (0.020)                   |
| chronic           | 0.175***    | 0.175***            | 0.175***      | 0.204***          | 0.154***           | 0.155***                  |
|                   | (0.004)     | (0.011)             | (0.011)       | (0.012)           | (0.012)            | (0.012)                   |
| insuranceyes      | 0.183***    | 0.183***            | 0.183***      | 0.192***          | 0.064              | 0.096**                   |
|                   | (0.017)     | (0.043)             | (0.044)       | (0.040)           | (0.043)            | (0.042)                   |
| gendermale        | -0.116***   | -0.116***           | -0.116***     | -0.136***         | -0.078**           | -0.090***                 |
|                   | (0.013)     | (0.036)             | (0.034)       | (0.031)           | (0.033)            | (0.031)                   |
| school            | 0.022***    | 0.022***            | 0.022***      | 0.021***          | 0.015***           | 0.016***                  |
|                   | (0.002)     | (0.005)             | (0.005)       | (0.004)           | (0.005)            | (0.004)                   |
| Constant          | 1.053***    | 1.053***            | 1.053***      | 0.986***          | 1.262***           | 1.251***                  |
|                   | (0.023)     | (0.064)             | (0.061)       | (0.054)           | (0.058)            | (0.056)                   |
| Observations      | 4,406       | 4,406               | 4,406         | 4,406             | 4,406              | 4,406                     |
| Log Likelihood    | -18,151.350 | -18,151.350         |               | -12,209.240       | -12,125.900        | -12,126.080               |
| theta             |             |                     |               | 1.180*** (0.033)  |                    |                           |
| Akaike Inf. Crit. | 36,314.700  | 36,314.700          |               | 24,430.490        |                    |                           |

Note: *p<0.1; **p<0.05; ***p<0.01
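With all of these models estimated, a quick way to compare fit is by log likelihood and AIC (a sketch assuming the model objects above; the quasi-Poisson fit has no genuine likelihood and is omitted):
## compare fit across the count models (higher logLik / lower AIC is better)
fits <- list(poisson = fm_pois, negbin = fm_nbin,
             hurdle = dt_hurdle, zinb = dt_zinb)
sapply(fits, function(f) c(logLik = as.numeric(logLik(f)), AIC = AIC(f)))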