Calculating ratio of non-zero probabilities in a hurdle model

In the UVA StatLab blog post, Getting Started with Hurdle Models, the following mathematical expression is presented to predict the expected mean count of a hurdle model.

$E[y | \textbf{x}] = \frac{1 - f_{1}(0|\textbf{x})}{1 - f_{2}(0|\textbf{x})} \mu_{2}(\textbf{x})$

This says the expected count of a hurdle model ( $y$ ) given our predictors ( $\textbf{x}$ ) is a product of two things: a ratio and a mean. The ratio is the probability of a non-zero in the first process divided the probability of a non-zero in the second untruncated process. The f symbols represent distributions. Recall these are logistic and Poisson, respectively, by default but can be others.

In this note we show how to calulate the ratio:

$\frac{1 - f_{1}(0|\textbf{x})}{1 - f_{2}(0|\textbf{x})}$

First we load the AER package (for the example data) and the pscl package for the hurdle function.

library(AER)
library(pscl)

Next we load the example data and select a few columns.

data("NMES1988")

# select certain columns; Col 1 is number of visits
nmes <- NMES1988[, c(1, 6:8, 13, 15, 18)]
names(nmes)

## [1] "visits"    "hospital"  "health"    "chronic"   "gender"    "school"   
## [7] "insurance"

Now we fit a hurdle model using all predictors.

mod.hurdle <- hurdle(visits ~ ., data = nmes)

The predicted ratio of non-zero probabilities using predict with type = "zero" for the first 5 observations.

predict(mod.hurdle, type = "zero")[1:5]

##         1         2         3         4         5 
## 0.8928488 0.9215760 0.9758270 0.8728787 0.9033806

How are these calculated?

Recall the ratio is the probability of a non-zero in the first process (zero hurdle model) divided by the probability of a non-zero in the second untruncated process (count model).

the probability of a non-zero in the first process (zero hurdle model)

First we calculate the value in the numerator: the probability of a non-zero in the first process (zero hurdle model).

To begin we create a model matrix. These are the values we are plugging into our model. We create a matrix because we are going to use linear algebra to perform the calculations.

Z <- model.matrix(mod.hurdle$terms$zero, mod.hurdle$model, 
                  contrasts = mod.hurdle$contrasts$zero)

Notice we use the model object, mod.hurdle, to extract what we need to create the model matrix. See the help pages for ?model.matrix and ?hurdle for more information.

Now we fit the zero hurdle model to the data using matrix multiplication. The result is on the log-odds scale since the count model is a binary logit model. To get probabilities we need to take the inverse logit, which we can do with the plogis function.

# The [,1] extracts a vector from the resulting matrix 
p0_zero <- plogis(Z %*% mod.hurdle$coefficients$zero)[,1]

the probability of a non-zero in the second untruncated process (count model)

Next we calculate the value in the denominator. Again we create a model matrix. This is simply a matrix of the predictor values we are going to “plug” into our model to get predicted counts.

X <- model.matrix(mod.hurdle$terms$count, mod.hurdle$model, 
                  contrasts = mod.hurdle$contrasts$count)

Then we fit the count model to the data. The result is on the log scale since the count model is a poisson model. To get expected counts we need to exponentiate, which we can do with the exp function.

# The [,1] extracts a vector from the resulting matrix 
mu <- exp(X %*% mod.hurdle$coefficients$count)[,1]

Now we use the expected counts to calculate expected probabilties of non-zero counts. For this we use the ppois function. Below the code says calculate the probability of getting a count greater than 0 for the given expected mean. In this case mu is a vector with 4406 expected counts. The result is a vector of 4406 probabilties.

p0_count <- ppois(0,lambda = mu, lower.tail = FALSE)

calulate the ratio

Finally we take the ratio and show the first 5 values.

(p0_zero/p0_count)[1:5]

##         1         2         3         4         5 
## 0.8928488 0.9215760 0.9758270 0.8728787 0.9033806

# same as...
predict(mod.hurdle, type = "zero")[1:5]

##         1         2         3         4         5 
## 0.8928488 0.9215760 0.9758270 0.8728787 0.9033806

The predict.hurdle function actually uses log probabilties to carry out this calculation for numerical stability. Instead of division we use subtraction. The result is the same:

# take log() of probabilties
p0_zero <- log(plogis(Z %*% mod.hurdle$coefficients$zero)[,1])
# set log.p = TRUE
p0_count <- ppois(0, lambda = mu, lower.tail = FALSE, log.p = TRUE)
# subtract and then exponentiate
exp(p0_zero - p0_count)[1:5]

##         1         2         3         4         5 
## 0.8928488 0.9215760 0.9758270 0.8728787 0.9033806

Calculating ratio of non-zero probabilities in a hurdle model

Clay Ford

3/24/2020

the probability of a non-zero in the first process (zero hurdle model)

the probability of a non-zero in the second untruncated process (count model)

calulate the ratio