One way to try to identify outliers is to examine the distribution of the model parameters themselves. However, this is going to be hard: there are a lot of individual-level parameters, and accordingly each one is individually pretty poorly identified. So you might end up having to run another model to make sense of the output of your model, which seems like a dangerously recursive road to go down.

Alternatively, one could modify the model itself to pick up on outliers during the fitting process. Most obviously, change the distributions of the individual-level parameters from normal distributions to some kind of thick-tailed, spike-and-slab model. The downside is that this would be a lot of work, and I suspect it would be extremely hard to fit. It might also compromise the regularization the model performs, which is important for making sure you don’t confuse sampling error with monkey deviance.
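For concreteness, one way that change could look (this parameterization is purely illustrative, not something from the actual model): replace the normal prior on an individual-level parameter $\alpha_i$ with a two-component mixture, where a small probability $\pi$ routes an animal into a much wider “outlier” component:

$$\alpha_i \sim (1-\pi)\,\mathcal{N}(\mu, \sigma^2) + \pi\,\mathcal{N}(\mu, c\,\sigma^2), \qquad c \gg 1.$$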

While mulling over these unsavory menu options, it occurred to me that there is a much simpler way of doing this that avoids dealing with messy and under-determined distributions of hyperparameters and random effects. I’ve been using information-theoretic metrics to take animal variability across the 2^d possible patterns of behavior and collapse it into a single number for “repeatability”: how close animals are, on average, to the average animal. We can apply the same framework to ask how far each animal is from the average animal, and look at the distribution of those distances to try to detect outliers. Specifically, we’ll use KL divergence to quantify the difference between each animal’s probability distribution over all possible behavioral patterns and the population-average distribution.
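As a minimal sketch of that computation (assuming you can pull each animal’s model-implied distribution over the 2^d patterns out of the fit; the array names and dummy Dirichlet draws below are placeholders, not the actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
n_animals, d = 30, 6                       # hypothetical: 30 monkeys, 6 binary behaviors
n_patterns = 2**d                          # all possible behavioral patterns

# Placeholder for each animal's model-implied distribution over the 2^d patterns
# (in practice these would come from the fitted individual-level parameters).
pattern_probs = rng.dirichlet(np.ones(n_patterns), size=n_animals)

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two distributions over the same finite support."""
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

pop_avg = pattern_probs.mean(axis=0)       # the "average animal's" distribution
kl_per_animal = np.array([kl_divergence(p, pop_avg) for p in pattern_probs])
```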

First let’s look at the estimated KL divergences for each monkey under the quadratic exponential binomial model.

Those… sure are distributions? They definitely have tails. How far out an animal’s KL divergence needs to be before it means something is hard to say, though. I mean, someone has to be the weirdest monkey.

One way we can turn this into something more interpretable is by looking at the uncertainty about how weird each monkey is. That can tell us how confident we should be that, say, the monkey with the highest expected KL divergence really is weirder than any other monkey.

Here we’re looking at the posteriors of the weirdness ranks of each monkey. The y-axis is something like a normality percentile, where being in the 90th percentile means that you are closer to the average monkey than 90% of the population. In other words, lower is weirder. The thick and thin lines are 80% and 95% CIs, respectively. The animals are ordered along the x-axis by their average percentile across the three versions of the data set.
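A sketch of how those rank posteriors can be computed, assuming a `kl_draws` array of posterior draws of each animal’s KL divergence (the gamma draws below just stand in for real posterior samples):

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws, n_animals = 4000, 30

# Placeholder posterior draws of each animal's KL divergence from the
# population-average distribution; real draws would come from the fitted model.
kl_draws = rng.gamma(shape=2.0, scale=0.1, size=(n_draws, n_animals))

# Within each draw, an animal's normality percentile is the fraction of
# animals that sit *further* from the average than it does (lower = weirder).
percentiles = (kl_draws[:, :, None] < kl_draws[:, None, :]).mean(axis=2)

mean_pct = percentiles.mean(axis=0)                      # posterior mean percentile
ci80 = np.quantile(percentiles, [0.10, 0.90], axis=0)    # thick lines
ci95 = np.quantile(percentiles, [0.025, 0.975], axis=0)  # thin lines
order = np.argsort(mean_pct)                             # weirdest monkeys first
```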

As you can see, most of the monkeys fall into the uncertainty soup, where they could be weird, they could be normal, we really don’t know. However, somewhere between two and four animals stand out as being consistently below the 20th normality percentile.

We can reproduce these analyses with the 8-state behavioral state model:

The results are pretty consistent across the models.

Extensions

One issue we might run into in this analysis is that we might classify the alphas as deviant simply because there are not many alphas. One way of getting around that: if we include dominance rank as a fixed effect in the model, we can compare each animal’s probability distribution to the expected distribution for an animal with its fixed effects, rather than to the global population average.
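In code, the only change from the earlier sketch is the reference distribution: instead of the population average, each animal is compared against a hypothetical `expected_probs` array holding the model-implied distribution for a generic animal sharing its fixed effects (the Dirichlet draws below stand in for real model output):

```python
import numpy as np

rng = np.random.default_rng(2)
n_animals, n_patterns = 30, 64

# Placeholders: each animal's own distribution, and the distribution the model
# expects for a generic animal with that animal's fixed effects (e.g., rank).
pattern_probs = rng.dirichlet(np.ones(n_patterns), size=n_animals)
expected_probs = rng.dirichlet(np.ones(n_patterns), size=n_animals)

def kl_divergence(p, q, eps=1e-12):
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

# Deviance relative to a covariate-matched reference, not the global average:
kl_adjusted = np.array([kl_divergence(p, q)
                        for p, q in zip(pattern_probs, expected_probs)])
```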

In any case, if one were looking for a general measure of monkey weirdness to regress against, say, de novo variant load, I think this would be a good place to start.