All the analyses where I compared predicted vs. observed responses as a function of evidence focused specifically on the exculpatory and inculpatory data. I hadn’t gotten around to looking at the ambiguous evidence, really just because I wasn’t sure what to do with it. However, upon closer (or rather, any) observation, it seems that there might be something there.

As a reminder, this is what the main effects of the evidence types look like:

Specifically, these are the population-average effects for each evidence type as inferred from a logistic regression on the legal guilt judgments. All subsequent analyses use the binary legal guilt judgments. In light of how the juror bias results evaporated when we controlled for random responders, the model is augmented with a “lapse” component that takes into account that different subjects may be more or less attentive to the actual task. Accordingly, on each trial each subject has some (subject-specific) probability of responding randomly. This lapse component will be present in all model-based analyses discussed below.
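To make the lapse component concrete, here is a minimal sketch of how a subject-specific lapse probability enters the response probability (a hypothetical helper function for illustration, not the actual fitting code):

p_guilty <- function(eta, lapse) {
  ## eta: linear predictor for the trial; lapse: subject's probability of responding randomly
  ## with probability lapse the response is a coin flip; otherwise it follows the logistic model
  lapse * 0.5 + (1 - lapse) * plogis(eta)
}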

Anyway, the main effect of ambiguous evidence is small and appropriately… ambiguous, though note that for ambiguous physical evidence in particular there is a lot of variability from scenario to scenario, suggesting that for some scenarios the ambiguous evidence may be read as consistently exculpatory or inculpatory.

However, if we look at the interaction between ambiguous evidence and in/ex-culpatory evidence, more substantive, consistent effects start to emerge. These interactions can be interpreted as the presence of a piece of ambiguous evidence changing the weight of a piece of in/ex-culpatory evidence, or vice versa. Let’s start with the usual trick of ignoring the difference between physical, document, etc., and look at a) the relationship between voting guilty and the balance of inculpatory vs. exculpatory evidence, and b) how that relationship changes with the presence vs. absence of ambiguous evidence.
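As a rough sketch, the raw-data summary behind the figure below is built along these lines (object and column names here are hypothetical):

library(dplyr)
library(ggplot2)

trials %>%
  mutate(balance = n_inc - n_exc,    ## inculpatory minus exculpatory count
         any_amb = n_amb > 0) %>%    ## is any ambiguous evidence present?
  group_by(balance, any_amb) %>%
  summarise(p_guilty = mean(guilty), .groups = "drop") %>%
  ggplot(aes(balance, p_guilty, colour = any_amb)) +
  geom_point() +
  geom_line() +
  labs(x = "# inculpatory - # exculpatory", y = "P(vote guilty)",
       colour = "Ambiguous evidence present")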

We see a modest but visible effect in the raw data here, but in the opposite direction of what one might expect under a confirmation bias theory. If ambiguous evidence were interpreted as being consistent with the stronger evidence, the curve should get sharper when ambiguous evidence is present; what we observe is the opposite, in that ambiguous evidence makes the curve shallower, as if it were pushing back against the balance of unambiguous evidence.

Next we can fit an actual regression with random effects and everything to see if our visual impression from the raw data holds up to statistical scrutiny. In the first model, we’ll include the main effects of “# exculpatory”, “# inculpatory”, and “# ambiguous”, as well as the interactions between “# ambiguous” and both “# exculpatory” and “# inculpatory”, for a total of five regressors. I’ll call this the counting model, since everything is based on just counting up the amount of each kind of evidence. As always, this model includes random effects for both scenario and subject on all five regressors.
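For reference, the regression structure is roughly the following brms-style formula (variable names are assumptions, and the lapse component is omitted here, though the actual model includes it):

library(brms)

counting_formula <- bf(
  guilty ~ n_exc + n_inc + n_amb + n_amb:n_exc + n_amb:n_inc +
    (n_exc + n_inc + n_amb + n_amb:n_exc + n_amb:n_inc | subject) +
    (n_exc + n_inc + n_amb + n_amb:n_exc + n_amb:n_inc | scenario),
  family = bernoulli())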

The interaction effects estimated from the counting model are consistent with what we saw in the raw data: the “# ambiguous” by “# exculpatory” interaction effect is positive (more guilty), whereas the main effect of exculpatory evidence is negative (less guilty), and the same thing but reversed for inculpatory evidence. The magnitudes of these interactions are shown on the right side of the figure below.

In the second model, we’ll try to disaggregate a bit by separating out each type of ambiguous evidence. That is, where the counting model treats each type of ambiguous evidence as interchangeable, now we’ll estimate main effects and interactions separately for ambiguous physical evidence, ambiguous document evidence, and ambiguous witness evidence (there is no ambiguous character evidence). These interactions are shown in the figure below, alongside the aggregate interaction from the counting model. In order to keep the number of parameters down, we’ll continue to treat all inculpatory and exculpatory evidence types as equivalent, and accordingly just count up the number of them on each trial. Under this model, we find that the exculpatory interactions are trending positive (more guilty) for all three types of ambiguous evidence. This makes sense: when the counting model adds up all those trending-positive effects, it gets a strong positive effect. However, the inculpatory interactions are… weird? We see negative (less guilty) interactions for ambiguous document and witness evidence, but a positive (more guilty) interaction for ambiguous physical evidence. These disaggregated effects do not obviously average out to what we saw in the counting model, which is something of a puzzle.
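For reference, the fixed-effects structure of this disaggregated model amounts to the following R formula shorthand (hypothetical variable names; random effects and the lapse component omitted):

## expands to the three ambiguous-type counts, the two unambiguous counts, and
## each ambiguous-type count crossed with the exculpatory and inculpatory counts
disagg_fixed <- guilty ~ (n_amb_phys + n_amb_doc + n_amb_wit) * (n_exc + n_inc)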

Of course, this assumption that all types of inculpatory (exculpatory) evidence are weighted the same is a terrible one, as we clearly see in the main effects estimates from the beginning. I’ve tried relaxing that assumption and interacting ambiguous evidence with the inferred net weight of in/ex-culpatory evidence on each trial. That model was not very well behaved, but it seemed to give basically the same results as the model described previously, except that the positive interactions with ambiguous physical evidence were both more robust.

For completeness, I’ll note that whenever I’ve included interaction effects between exculpatory and inculpatory evidence, they are always centered on zero, consistent with the original analyses from last year.

Finally, I’ll briefly document the analysis of all the pair-wise interactions, which chronologically preceded and motivated all the above analyses, but had no interpretable outcome. As you’ll recall, when analyzing the case strength ratings I fit a model that included all 45 pair-wise interactions between each type and level of evidence (phys/inc x doc/inc, phys/inc x doc/exc, etc.), and found that the model was quite confident that the net effect of these interactions was very small. However, I hadn’t bothered running that same analysis on the binary judgments until I was getting everything together for the manuscript outline. Lo and behold, the binary judgments gave different results, the substance of which can be summarized by the two posterior distributions shown below.
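For reference, the full set of pairwise interaction regressors can be generated mechanically from the per-trial evidence indicators. Assuming ten type/level indicator columns (since choose(10, 2) = 45), something like the following (with a hypothetical data frame name) produces all 45 pairwise products alongside the main effects:

## design matrix with intercept, 10 main effects, and all 45 pairwise interactions
X <- model.matrix(~ .^2, data = evidence_indicators)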

On the left we have the posterior distribution of the standard deviation of the population-average interaction effects; that is, how far from zero the average interaction effects tend to be. Note that the posterior is piled up at zero, indicating that the model is perfectly happy to have all the interactions be zero. However, we also can’t really rule out that the average interaction effects are pretty substantial: on the logistic scale, a value like 0.25 isn’t nothing. This is contrary to what we saw with the continuous ratings, suggesting that the graded nature of the ratings response gives us more power to rule out interaction effects.

The more interesting part is the right side of the plot, which shows the standard deviation of the subject-specific interaction effects. This posterior is mostly not piled up at zero, suggesting the model has given it some kind of explanatory role. To test the “significance” of this more rigorously, we can use model comparison tools that approximate how well a model would predict a held-out data point. Specifically, we’ll compare the model with all the interactions to the model with no interactions.

The first metric is LOO (approximate leave-one-out cross-validation, via the loo package), which gives us the result below:

compare(loo(fit_null),loo(fit_allint))
## elpd_diff        se 
##       6.2       3.1

The first “diff” number is the estimate of how much better the full interaction model fits the data. That it’s positive means the full interaction model does better than no interactions, and the value of 6.2 is on the log probability scale, so it suggests the data are roughly five hundred times (e^6.2 ≈ 490) more probable under the full interaction model. However, the second value is the standard error of the first value. Going by the “two standard error” rule, we can be… kinda but not super confident about the full interaction model being better.

There’s another metric we can use called WAIC, which gives us substantially more confident results in favor of the interaction model:

compare(extract_log_lik(fit_null) %>% waic(),extract_log_lik(fit_allint) %>% waic())
## elpd_diff        se 
##      27.3       3.1

However, WAIC is, I believe, considered less “robust” than LOO, so any of this is at best suggestive rather than definitive.

I made various attempts to see if I could narrow these results down to some particular interaction effects. This involved estimating a separate standard deviation across subjects for each of the 45 interaction effects, with those 45 standard deviations themselves having some kind of regularizing hyperprior. Long story short, I found no evidence of heterogeneity across the interaction effects. Mostly, we just don’t have the power to say much of anything about the individual interactions, so the only thing the models are able to pick up on is something aggregated across a bunch of them. This result motivated the analyses presented first, in which I manually aggregated interaction effects in a quasi-hypothesis-driven way.