In the UVA StatLab blog post, Getting Started with Hurdle Models, the following mathematical expression is presented to predict the expected mean count of a hurdle model.
E[y|x]=1−f1(0|x)1−f2(0|x)μ2(x)
This says the expected count of a hurdle model (y) given our predictors (x) is a product of two things: a ratio and a mean. The ratio is the probability of a non-zero in the first process divided the probability of a non-zero in the second untruncated process. The f symbols represent distributions. Recall these are logistic and Poisson, respectively, by default but can be others.
In this note we show how to calulate the ratio:
1−f1(0|x)1−f2(0|x)
First we load the AER package (for the example data) and the pscl package for the hurdle
function.
library(AER)
library(pscl)
Next we load the example data and select a few columns.
data("NMES1988")
# select certain columns; Col 1 is number of visits
nmes <- NMES1988[, c(1, 6:8, 13, 15, 18)]
names(nmes)
## [1] "visits" "hospital" "health" "chronic" "gender" "school"
## [7] "insurance"
Now we fit a hurdle model using all predictors.
mod.hurdle <- hurdle(visits ~ ., data = nmes)
The predicted ratio of non-zero probabilities using predict
with type = "zero"
for the first 5 observations.
predict(mod.hurdle, type = "zero")[1:5]
## 1 2 3 4 5
## 0.8928488 0.9215760 0.9758270 0.8728787 0.9033806
How are these calculated?
Recall the ratio is the probability of a non-zero in the first process (zero hurdle model) divided by the probability of a non-zero in the second untruncated process (count model).
First we calculate the value in the numerator: the probability of a non-zero in the first process (zero hurdle model).
To begin we create a model matrix. These are the values we are plugging into our model. We create a matrix because we are going to use linear algebra to perform the calculations.
Z <- model.matrix(mod.hurdle$terms$zero, mod.hurdle$model,
contrasts = mod.hurdle$contrasts$zero)
Notice we use the model object, mod.hurdle
, to extract what we need to create the model matrix. See the help pages for ?model.matrix
and ?hurdle
for more information.
Now we fit the zero hurdle model to the data using matrix multiplication. The result is on the log-odds scale since the count model is a binary logit model. To get probabilities we need to take the inverse logit, which we can do with the plogis
function.
# The [,1] extracts a vector from the resulting matrix
p0_zero <- plogis(Z %*% mod.hurdle$coefficients$zero)[,1]
Next we calculate the value in the denominator. Again we create a model matrix. This is simply a matrix of the predictor values we are going to “plug” into our model to get predicted counts.
X <- model.matrix(mod.hurdle$terms$count, mod.hurdle$model,
contrasts = mod.hurdle$contrasts$count)
Then we fit the count model to the data. The result is on the log scale since the count model is a poisson model. To get expected counts we need to exponentiate, which we can do with the exp
function.
# The [,1] extracts a vector from the resulting matrix
mu <- exp(X %*% mod.hurdle$coefficients$count)[,1]
Now we use the expected counts to calculate expected probabilties of non-zero counts. For this we use the ppois
function. Below the code says calculate the probability of getting a count greater than 0 for the given expected mean. In this case mu
is a vector with 4406 expected counts. The result is a vector of 4406 probabilties.
p0_count <- ppois(0,lambda = mu, lower.tail = FALSE)
Finally we take the ratio and show the first 5 values.
(p0_zero/p0_count)[1:5]
## 1 2 3 4 5
## 0.8928488 0.9215760 0.9758270 0.8728787 0.9033806
# same as...
predict(mod.hurdle, type = "zero")[1:5]
## 1 2 3 4 5
## 0.8928488 0.9215760 0.9758270 0.8728787 0.9033806
The predict.hurdle
function actually uses log probabilties to carry out this calculation for numerical stability. Instead of division we use subtraction. The result is the same:
# take log() of probabilties
p0_zero <- log(plogis(Z %*% mod.hurdle$coefficients$zero)[,1])
# set log.p = TRUE
p0_count <- ppois(0, lambda = mu, lower.tail = FALSE, log.p = TRUE)
# subtract and then exponentiate
exp(p0_zero - p0_count)[1:5]
## 1 2 3 4 5
## 0.8928488 0.9215760 0.9758270 0.8728787 0.9033806