Zero inflation, and overdispersion
Overdispersion
We generally use the Poisson distribution when we are dealing with counts. Poisson is used to model a positive, integer response variable. Some examples ow when we could use Poisson include the number of insect species present in a tree, number of bacteria in water, or some more complex questions, for example, is plant abundance influenced by presence of herbivore insects, or is flock size influenced by resource availability.
Those are all generally good examples of when to use Poisson, however, Poisson regression assumes that the mean \(\lambda_i\) is equal to the residual variance.
This is different from the linear models where we have homoscedasticity (equal variance, denoted by \(\sigma^2\) ), independently of the expected value of \(y_i\). In the case of Poisson:
\(\lambda_i\) is the expected value
\(\lambda_i\) is the variance
Compare the following two figures.
Figure 1: Linear model (normal distribution of residuals)
Figure 2: glm (Poisson distribution of residuals)
This essentially shows that on the Poisson distribution, the variance depends on the predicted (or expected, or mean) value.
So, what is the issue?
More often than not, the residual variance of is larger that the expected value. This is called over dispersion.
We can use a package called performance
to check for it. So, that is a great thing! It is pretty easy to use.
After you run a model, you can simply do the following:
::check_overdispersion(model) performance
But, to make matters worse, overdispersion is far from the only issue when working with counts. Oftentimes we have more zeroes than expected
Zero inflated data
Simply told, zero inflation happens when there are more zeroes in the response variable, than expected under the Poisson distribution (or, if we are working with other distributions, under other distributions).
Usually we have both issues, zero-inflation and overdispersion
We have discussed this, but think about reasons why we have zero-inflated data and overdispersion (more variance in the residuals than expected). Think particularly about your research topic and try to figure out if there are systems that would see such overdispersion and zero inflation
Fortunately, testing it is as easy as testing for overdispersion.
::check_zeroinflation(model) performance
Let’s test for over-dispersion and zero-inflation
We will be working with the same example from class. We are exploring horseshoe crab data (available at: https://users.stat.ufl.edu/). You can download the data from Canvas (crabs.txt
)
Here is an image of horseshoe crabs:
Photo and caption taken from Brockman et al 2017: https://www.sciencedirect.com/science/article/pii/S0003347217303469
In these crabs, females release pheromones and as a result a male attaches to the female using claws, and releases sperm. While this happens, other males (called satellites) follow the pair and try to also fertilize eggs. Let’s start by doing the following:
- Download the dataset
- Read it into R (it is a .txt file, so use the
read.table
function. Make sure to set header as true!) - Explore it (use
head
,str
orsummary
)
These are the variables:
sat: Number of males that surround a female. This includes both attached and satellite males.
Weight: Female weight
Width: Carapace width,
Color: Female color;
1- light medium, 2 - medium, 3 - dark medium, 4 - dark
Spine: Female spine condition
1 - both good, 2 - one worn or broken, 3 - both worn or broken
The data is originally published in: Study of nesting horseshoe crabs (J. Brockman, Ethology, 1996).
After we explored the data, hopefully you noticed that we need to change two variables to factor:
- Change color to factor and change spine to factor
Data analysis (Poisson)
Now, we will analyze the data. Let’s assume you think Poisson is always correct when dealing with counts. We will run the same models and analysis as in the lecture.
We are wondering whether the color of the female has an effect on the amount of males it can attract.
- Write your null and alternative hypotheses.
- Run a Poisson
glm
with color as the explanatory variable and satellites as the response variable, and print the summary - Run an Anova (remember, use the car package Anova function)
- Make a statistical decision (reject, or fail to reject)
- If you reject the null, then run a post-hoc test (you can use
emmeans
) and describe the results - Write a “scientific” or “biological” conclusion
- Plot the estimated mean and CI’s for each color. The x axis should be color, and the y axis should be number of satellite males. Use a point to represent the mean, and a CI to represent the confidence intervals. (you may need to google this, as we haven’t done it before, but I trust you can figure it out)
In other labs, you would have been done at this point. Actually, our Poisson lab had residuals that were very over-dispersed. But now, we know better, and we will check for overdispersion and zero-inflation
In reality, we would test for over-dispersion and zero-inflation right after running the glm. No need to run Anovas
or any other analyses if the data is severely over-dispersed.
OK, so, let’s check for over dispersion and zero-inflation
- Use the package
performance
and test for over dispersion and for zero inflation - When you check for over-dispersion you get a p-value. This means you are running a test. What is the null hypothesis and what is the alternative hypothesis of this test?
Sometimes it is hard to identify exactly what is zero-inflation. The check_zeroinflation
function compares the expected number of zeroes with the observed ones. However, it is up to you to decide if the observed difference is acceptable or not. In this case, the difference is so large, that it is obvious that we have many more zeroes than expected.
What next?
At this point you probably realized that we have over-dispersed residuals and way more zeroes than expected under a Poisson distribution. And we need to solve for it, particularly because it is affecting our p-value estimates and therefore our inferences.
We will explore some ways you can solve this.
Quasipoison
Using a “quasipoisson” distribution is a good way to deal with overdispersed residuals. This distribution is similar to poisson, but the residual variance is \(\lambda_i \phi\) where \(\phi\) is a variance inflation factor. This helps with the variance estimations, but does nothing for zero inflation.
In this case, we would skip quasipoisson because of the zero-inflation and go straight to the next one. But, this is an assignment, so, we won’t skip it.
- Run the same
glm
, but use the quasipoisson distribution this time, and print the summary - Use the package
performance
and test for over dispersion and for zero inflation
Notice how the quassipoisson summary gives you no AIC value? When there is overdispersion, our likelihood values are also affected.
If we are using a quasipoisson distribution to test multiple models then we need to use a QAIC (QuassiAIC) instead of AIC. The QAIC equation includes a \(\hat{c}\) (you will generally see this described as c-hat), which is a variance inflation factor:
\[ QAIC = \frac{-2 * log-likelihood}{\hat{c}} + 2 * K \]
You can estimate c-hat using the AICcmodavg
package, but that package won’t estimate the QAIC of any quasipoisson model (they are purists who believe this shouldn’t be done). Anyway… pretty long tangent, and the take-home message is, quassipoison doesn’t fix everything and can be a pain to use… let’s skip to some generally better alternatives.
Negative binomial
This is a distribution similar to Poisson. It is a discrete probability of only positive integers. This makes it great for counts. The difference is that while Poisson has a single parameter (\(\lambda\)) that represents both mean and variance, the Negative binomial has two: \(\mu\) and \(\theta\).
The mean or expected value is \(\mu\) while the variance is \(\mu + \frac{\mu^2}{\theta}\) . This makes sense, look at the first figure of the Poisson distribution (Fig 2), as the expected value (AKA mean) increases, so does the variance. When \(\theta\) is very large, then negative binomial and Poisson converge. When we run a negative binomial glm, the link function is the same, the only thing that changes is that \(\theta\) is estimated and the residuals are expected to have a negative binomial distribution.
Unfortunately, the negative binomial distribution is not included in the glm function. We need to use the MASS
package. And we run it using the glm.nb
function. This function works the same as glm
but we do not have to specify the distribution.
- Run a negative binomial glm with color as the explanatory variable and satellites as the response variable, and print the summary
- Use the package
performance
and test for over dispersion and for zero inflation. Compare it to the results from the Poisson model. How about the zeroes? - Run an Anova (remember, use the car package Anova function)
- Make a statistical decision (reject, or fail to reject)
- If you reject the null, then run a post-hoc test (you can use emmeans) and describe the results
- Write a “scientific” or “biological” conclusion
- Compare the conclusion you got using this distribution to the one you got with the Poisson distribution
- Which one do you think is better?
The negative binomial is usually the best compromise. But we will explore some other potential solutions.
Hurdle models
Hurdle models are a “two-in-one” situation.
We essentially run a binomial model (with a logit link) where the response variables are either 0 (absence of satellite males) or 1 (presence of satellite males). Then, we use a count model (with either poisson distribution or negative binomial) to estimate the expected number of satellite males.
Why do this? This is when the process that creates zeroes is independent from the process that creates counts. In other words, when the count-based process cannot generate zeroes, but they are generated by a different process.
In biological terms, that would mean that females that release pheromones have some number of satellite males, while females that release pheromones all have at least one satellite.
This is how we run a hurdle model:
We need the pscl
package for this.
To run a hurdle model with a Poisson distribution we do:
<-hurdle(sat~color|color, data=crabs)
hurdmodsummary(hurdmod)
Call:
hurdle(formula = sat ~ color | color, data = crabs)
Pearson residuals:
Min 1Q Median 3Q Max
-1.3218 -0.9542 -0.1097 0.6348 4.3573
Count model coefficients (truncated poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.6902 0.1446 11.688 <2e-16 ***
color2 -0.1894 0.1558 -1.216 0.2242
color3 -0.3890 0.1794 -2.168 0.0302 *
color4 0.1690 0.2083 0.811 0.4171
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.0986 0.6667 1.648 0.0994 .
color2 -0.1226 0.7053 -0.174 0.8620
color3 -0.7309 0.7338 -0.996 0.3192
color4 -1.8608 0.8087 -2.301 0.0214 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 11
Log-likelihood: -369.5 on 8 Df
Here we are running two mini models, and each model can have its own set of covariates (separated by |
). Covariates left of the |
are part of the count model, while covariates to the right are part of the hurdle model (binomial).
To run it with a negative binomial distribution:
<-hurdle(sat~color|color, data=crabs, dist="negbin")
hurmodnbsummary(hurmodnb)
Call:
hurdle(formula = sat ~ color | color, data = crabs, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.12222 -0.86733 -0.09559 0.55308 3.79647
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.6684 0.2109 7.910 2.57e-15 ***
color2 -0.1998 0.2257 -0.885 0.376
color3 -0.4124 0.2543 -1.622 0.105
color4 0.1763 0.3100 0.569 0.570
Log(theta) 1.6463 0.3696 4.454 8.43e-06 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.0986 0.6667 1.648 0.0994 .
color2 -0.1226 0.7053 -0.174 0.8620
color3 -0.7309 0.7338 -0.996 0.3192
color4 -1.8608 0.8087 -2.301 0.0214 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 5.1879
Number of iterations in BFGS optimization: 20
Log-likelihood: -359.6 on 9 Df
Like I said before, you can use different covariates (explanatory variables) in the count model and in the hurdle model.
1- Run another hurdle model (with Poisson distribution) in which the explanatory variable for counts is color, but with different explanatory variables for the hurdle portion of the model. Check the data and make a decision on what explanatory variable (or variables) to use.
2- Run the same model using negative binomial as the count distribution
3- Use AICcmodavg to estimate the AICc of the four hurdle models, and decide which one is the best model.
We CANNOT test these models for overdispersion and zero inflation. But we do know the count residuals do have some overdispersion problems, so a negative binomial distribution is probably best.
As I said before, these models are the best when the processes that cause zeroes are independent from the ones that cause counts. If the processes aren’t independent (i.e., a crab with pheromones may not attract any males and therefore have 0 satellite males), then a mixture model is best.
Zero-inflated Mixture models
These are similar to hurdle models. We have two mini-models, but the count portion of the model is allow to have zeroes as well.
We run zero-inflated mixture models just like we did hurdle models, but we use the zeroinf
function rather than the hurdle
one.
Run the same four models that you ran for the last section (hurdle models) and compare them to the hurdle models.
Again… We CANNOT test these models for overdispersion and zero inflation. But we do know the count residuals have some overdispersion problems, so a negative binomial distribution is probably best.
Now… sometimes when we run hurdle models and zero-inflated mixture models we essentially get the same result. We essentially get the same coefficients, same Likelihood, same AICc, same everything. If this is the case, then the processes that causes zeroes and ones is probably independent and we are better off with the hurdle models.
You can run these models in package glmmTMB. This can be helpful if you are running multiple models that you want to compare. I shy away from overusing this package because it tends to have issues (as you experienced when trying to download and run it!) I do favor it for mixed effects models (next week’s lab!).
The issues are related to the fact that it has a lot of dependencies, so it needs to download tons of packages to function. And if one of those packages gets an update, it might result in glmmTMB not working with that partiuclar version. Finally, it usually requires you to have a fairly recent version of R, and constantly updating R can be a pain and unnecessary most of the time
Which distribution/model should I pick?
There is not an easy and straightforward answer to this question (sorry!). A potential answer is… any model that is “good enough”. I know I would definitely not pick the poisson glm (or the quasi-poisson one), and if I had my pick, I would probably go with a Hurdle binomial model in this case. However, I will tell you how NOT TO pick a model. You should not explore all “viable” models and go with the one that has the most interesting results or the one that is more publishable! So… how do we choose one?
I like to use a method to choose one. I also like to define the method “a priori” in order to not introduce my bias when selecting a model.
In this case, you will use AICc and make an AIC table and use that to choose the best model. BUT BE AWARE! This result does not tell us is the model is good. It only tells us if it’s the best of the bunch (they could all be terrible, and you would still get a “best of the bunch” model). In other words: “The AIC comparisons only tell us which of our models is “best”, but they do not tell us if any of the models are actually any good” (Fieberg. Statistic for Ecologists. https://fw8051statistics4ecologists.netlify.app/).
We should do a a thing caled GOODNESS OF FIT which will tell us if our model is good. In next week’s lab you will do a goodness of fit test for this dataset (and your best model) and will also do model validation. You will also do that for the beaches mixed effects model example that we ran in class.
Meanwhile, we will only choose an AICc model, and assume it is good (FYI… most published papers rarely have goodness of fit tests, and oftentimes just run a single model with an assumed distribution).
Using the AICcmodavg package estimate AICc and \(\Delta AICc\) for all of the models you ran (except quasipoisson) and publish a table.
Be aware! AICcmodavg won’t allow you to use the aictab
function on models that were obtained using different functions (glm
, glm.nb
, hurdle
, zeroinf
). So, I recommend you use AICcmodavg::AICc
to estimate the AICc for each model, and then construct your own AIC table. Your table should include a column for k, which is the number of parameters (or coefficients). You can use the summary(model)
function in order to count the parameters (or run aictab
for a single model if you want the package to count them for you).
If you use modelname$coefficients
to count the number of parameters you may be undercounting, as \(log\theta\) is essentially a coefficient in the negative binomial distribution.
Final question: Explore and interpret the results using the best model.