set.seed(42) # for reproducibility
# helper functions
expit <- function(x) exp(x)/(1+exp(x))
# parameters
N <- 100 # number of baskets
p <- .35 # probability of an item being low-fhss food in the basket
n <- 10  # number of items in a basket
# simulated data
x <- rbinom(N,n,p) # this vector contains data on N baskets, specifically, how many (count) low-fhss items are in the basket
Modelling count data using a Binomial model
Bits and bobs
The Binomial distribution is a discrete probability law that describes the probability of observing \(x\) successes over \(n\) independent trials, each with constant probability of success \(p\). The probability mass function is easy enough to work with, which allows us to obtain closed form expressions for quantities such as its expected value: if \(X\sim Bin(n,p)\), then \(E(X) = np\), which is quite handy to know.
Handiness aside, this property follows from the very definition of a Binomial, which is ultimately a sum of repeated experiments (\(n\) times) with possible outcome \(0\) or \(1\) (these are called Bernoulli experiments), carried out independently and at the same conditions every time (constant \(p\)). As such, the Binomial distribution summarises the behaviour of a probabilistic experiment as a count; in this sense, the Binomial distribution is a probabilistic model for count data.
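Just to have them written down somewhere (these are standard textbook facts, nothing specific to our example): the probability mass function is \[ P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x = 0, 1, \dots, n, \] and, writing \(X = \sum_{i=1}^n B_i\) with the \(B_i\) independent Bernoulli(\(p\)) variables, the expectation follows immediately as \[ E(X) = \sum_{i=1}^n E(B_i) = np. \]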
I won’t delve here into whether it’s an appropriate choice or not for your specific application. It certainly is not the only one - as you have rightly noted, there are alternatives such as the Negative Binomial and the Poisson. A note: the former is called “negative” Binomial because it describes the probability of the number of failures - as opposed to successes - in a sequence of independent Bernoulli trials under the same circumstances (rings a bell?) before a pre-specified number of successes occur. So if you are happy with the Negative Binomial distribution describing counts, it should be fairly easy to also see the Binomial as doing the same thing (although on a different count!). The Poisson is another matter, and is typically used when you do not know, or do not want to assume, a fixed number of trials.
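If it helps to see the three side by side, here is a quick illustrative sketch - the numbers are arbitrary, picked only for the comparison, and have nothing to do with the basket example below:
rbinom(5, size = 10, prob = 0.35)  # Binomial: successes out of a fixed number of trials
rnbinom(5, size = 3, prob = 0.35)  # Negative Binomial: failures before a fixed number of successes
rpois(5, lambda = 3.5)             # Poisson: counts with no notion of a number of trials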
A textbook experiment and estimation of \(p\)
Now on to what made me twitch my nose yesterday when you were showing me a Binomial glm being applied to proportion data straight away. I had to go away and think a bit about it - I hadn’t quite clocked what the issue was clearly enough to explain it properly, and for that I apologise! Let’s see if I can remedy that now.
Imagine that you are collecting data on shopping baskets. A basket contains exactly \(n=10\) items, each of which could be a low-fhss item with probability \(p=0.35\), independently of the others. Then, if you were to describe the probability of observing a certain number of low-fhss items in a basket, that probability would be described by a Binomial distribution with parameters \(n\) and \(p\).
Now, imagine you don’t know \(p\), and to estimate it you have collected a certain amount of data points (baskets, say \(N=100\)). How do we tackle the problem?
In this situation, you can take a number of approaches. The setting up has already been done in the code block at the top, which simulates exactly this scenario.
Now we can look at estimation. We want to know \(p\), so we could, keeping it simple, provide an estimate based on the sample of realisations of the process. In this case, this would amount to computing the average of the proportion of low-fhss items in each basket. As the basket size is known, we can simply compute it as \[ \widehat{p}=\frac{1}{N}\sum_{i=1}^N \frac{x_i}{n} \] which in this case amounts to evaluating
mean(x/n)
[1] 0.358
The estimator we have used is good - under the assumptions it is actually the best possible, and coincides with the maximum likelihood estimator. The drawback is that this is a raw estimate, meaning it cannot account for covariates - but we don’t have covariates here, so let’s just move on.
A second approach, which is especially helpful when you start considering covariates, is to look at the problem from a more explicit statistical modelling perspective. What this means is that we look at describing the behaviour of the phenomenon of interest (here the number of low-fhss items in baskets) using some kind of mathematical relationship endowed with a probabilistic structure. For the case at hand, we might want to consider a generalised linear regression model, with Binomial error component and canonical link (it’s a mouthful, I know):
mod_glm_binomial <- glm(cbind(x,n-x)~1,family=binomial)
Now, a word on the notation cbind(x,n-x): this is one correct - albeit redundant here - way to specify a Binomial model for count data, since the pair (number of successes, number of failures) contains all the information you need about \(n\) and the number of successes, and hence to estimate \(p\). By inverting the canonical link (which is a logit) on the estimated intercept for this model, we can revert back to an estimate for \(p\), which is not only similar, but coincides exactly with the one obtained with the simple method before (not by chance: these glms are fitted using the maximum likelihood method):
expit(coef(mod_glm_binomial))
(Intercept)
0.358
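Incidentally, if you want to see what that two-column response actually looks like, you can just print the first few rows:
head(cbind(x, n - x)) # column 1: low-fhss items (successes), column 2: the rest (failures)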
Now, what about using proportion data directly like you saw being done in so many places? Let’s try:
proportion_data <- x/n
mod_glm_binomial_proportion <- glm(proportion_data~1,family=binomial)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
expit(coef(mod_glm_binomial_proportion))
(Intercept)
0.358
Exactly the same again! Although, we do get a warning of some kind. Hm. I wonder where that comes from? I know that a Binomial distribution is a model for count data, so maybe I could have expected some kind of error to trigger by chucking in non-integer values, but why does the model still fit, and actually succeed in providing the same answer as the “correct” methods?
A coding detour
As I’m quite a curious person, and I was not entirely sure how this had been implemented in the first place (hence my confusion yesterday), I dug up the source code for the Binomial family in family.R, which is used by stats::glm() and everything else that draws from there (including glmmTMB); you can find a version here. If you scroll down a bit to line 322 you can read the section we are interested in and familiarise yourself with how this is implemented, if you want, but the key takeaways are:
1. the input could potentially be a factor (as in yes/no) - this is for when one fits a logistic regression model!
2. if the input is numerical, it gets adjusted for the regression weights, which should be provided as an argument to the function call via weights=... (what happens if not? we’ll see later)
3. beyond the two cases above, which are handled automatically by the algorithm, “for the ‘binomial’ family, y must be a vector of 0 and 1’s or a 2 column matrix where col 1 is no. successes and col 2 is no. failures”
Number 2 is why proportion data is accepted and “only” returns the non-integer values warning. Now let’s see why this is potentially problematic.
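(Quick aside before we do: if you’d rather poke at this from an R session instead of the online sources, the same initialisation code is stored right inside the family object, so you can print it directly - just standard R introspection, nothing specific to this post.)
binomial()$initialize # the expression glm() evaluates when setting up a binomial fit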
A more realistic experiment
Let’s say we are not willing anymore to assume that each basket has the same amount of items. In other words, we do not take \(n\) as a fixed value for all baskets, rather we record the actual number of items that compose each basket, say \(n_i, i =1, ..., N\) (let’s still assume we collect \(N=100\) baskets). This could make sense, unless we force each buyer to purchase exactly the same number of items (which I don’t think is the case here?)
Now, our sample is a collection of realisations \(y_i\), each coming from a Binomial distribution with parameters \(n_i\) and \(p\). As long as the baskets are independent of each other (whatever that might mean) we can still relatively easily attempt to estimate \(p\).
So let’s now generate, for the sake of this example, the basket sizes (the how is not important here) and their contents.
ni <- 2 + rbinom(N,25,.32) # at least two items, with an average of 10 items per basket
y <- rbinom(N,ni,p) # this will return the number of successes (low-fhss items) for 100 baskets, each with its own size, each with the same probability p
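Just to double check that the basket sizes really do vary now, rather than being fixed at 10, you could have a quick look at
summary(ni) # basket sizes now differ from basket to basket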
Let’s now apply the same rationale as before. Unfortunately, the crude estimator from before is not a good one anymore in this case: the fact that the \(n_i\) are not constant means that the best estimator is now a different one. To be explicit, the maximum likelihood estimator for this case would be \[ \widehat{p}=\frac{\sum_{i=1}^N y_i}{\sum_{i=1}^N n_i} \] meaning a simple
mean(y/ni)
[1] 0.3263368
would not yield the maximum likelihood estimate. We should rather compute
sum(y)/sum(ni)
[1] 0.3210369
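In case you are curious where that estimator comes from, here is the derivation - entirely standard maximum likelihood algebra, nothing specific to this example. The log-likelihood for independent \(y_i \sim Bin(n_i, p)\) is, up to an additive constant, \[ \ell(p) = \sum_{i=1}^N \left[ y_i \log p + (n_i - y_i)\log(1-p) \right], \] and setting its derivative to zero, \[ \frac{\sum_{i=1}^N y_i}{p} - \frac{\sum_{i=1}^N (n_i - y_i)}{1-p} = 0 \quad\Longrightarrow\quad \widehat{p} = \frac{\sum_{i=1}^N y_i}{\sum_{i=1}^N n_i}. \]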
Now again via the “correct” glm approach
mod2_glm_binomial <- glm(cbind(y,ni-y)~1,family=binomial)
expit(coef(mod2_glm_binomial))
(Intercept)
0.3210369
which returns exactly the maximum likelihood estimate, and the “proportion” one
proportion_data_2 <- y/ni
mod2_glm_binomial_proportion <- glm(proportion_data_2~1,family=binomial)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
expit(coef(mod2_glm_binomial_proportion))
(Intercept)
0.3263368
which does not (alongside throwing the usual warning).
As you can see, the “correct” approach returns what we expected to see, whereas the “proportion” approach does not. Why? Because we haven’t specified the regression weights that are required when taking this path. And why are they required? Because the Binomial model is one for count data, so the algorithm needs something to “revert back” to full information. If you don’t provide any, the implicit assumption is that all the \(n_i\) are equal, in which case you wouldn’t even see a difference in the estimate (although I could probably come up with a counterexample involving covariates); but if you do provide them, the algorithm stops complaining, because you have done the right thing!
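If you want to see the “revert back” idea written out (glossing over additive constants, and with the caveat that this is my sketch of the algebra rather than a quote from the implementation): feeding the glm the proportions \(y_i/n_i\) together with prior weights \(n_i\) means each observation contributes \[ n_i\left[ \frac{y_i}{n_i}\log p + \left(1 - \frac{y_i}{n_i}\right)\log(1-p)\right] = y_i \log p + (n_i - y_i)\log(1-p), \] which is exactly the contribution of the count \(y_i\) out of \(n_i\) trials, so maximising either log-likelihood gives the same \(\widehat{p}\).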
Let’s try again, this time specifying the weights (which are nothing more than the number of items in the baskets \(n_i\)):
mod2_glm_binomial_proportion_with_weights <- glm(y/ni~1,family=binomial,weights=ni)
expit(coef(mod2_glm_binomial_proportion_with_weights))
(Intercept)
0.3210369
Ta-daaaa! No more warnings, and the estimate we wanted!
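And just to gather the three estimates in one place for a final visual check (nothing new, just re-printing what we have already computed; the labels are mine):
c(closed_form       = sum(y)/sum(ni),
  glm_counts        = unname(expit(coef(mod2_glm_binomial))),
  glm_props_weights = unname(expit(coef(mod2_glm_binomial_proportion_with_weights))))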
Wrapping up
I got a bit carried away and this ended up WAY longer than intended, my apologies. I’ll wrap up quickly.
The Binomial distribution is one describing a count. The probability of observing specific counts is a function of both the number of trials and the probability of each individual success. Because of this, we can express the information needed to estimate the parameter \(p\) equivalently either via number of successes and number of failures, or number of successes (or failures) and number of trials, or proportions and number of trials. Not proportions alone, as we have seen.
Aside from the Binomial-specific example, the key takeaway here is the following:
always make sure you understand exactly what the model you are using is, and also what the function you are using is doing.
This is often difficult. There are tons of alternative implementations for statistical models - although generalised linear models have been around long enough to be somewhat less controversial at this point - and if one is not confident with the theory (here you would have needed to know about maximum likelihood estimators in the context of generalised linear models) then it might become difficult to troubleshoot what is happening. Hope this helps a bit and always happy to chat.