preamble: modeling response scale use
I wanted to note an “under the hood” change to the way the model works, the technical details of which are described here. Briefly, the motivation was to more accurately capture the way participants tend to favor the ends of the scale and avoid the middle region, leading to a bowl-shaped distribution of case-strength ratings. The previous model, in which each piece of evidence moves you up or down the scale by some fixed amount, failed (spectacularly!) to capture this. Under the new model, different regions of the scale are allowed to be “stickier” than others, so ratings bunch up in the sticky regions. Each piece of evidence is still evaluated independently of the other evidence, in that each piece pushes you up or down the scale with some fixed force – however, it takes less force to move you through the middle of the scale than it does to move you all the way to the ends.
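To make that concrete, here is a minimal sketch of the idea (illustrative only, with made-up effect sizes, not the fitted model): evidence acts additively on an unbounded latent “force” scale, and the observed rating is a squashed version of that latent value, so a fixed latent push moves the rating a lot in the middle and very little near the endpoints.

```python
# Minimal sketch of the scale-use idea (illustrative numbers, not the fitted model).
import numpy as np

rng = np.random.default_rng(0)

def latent_to_rating(latent, scale_max=100.0):
    """Map an unbounded latent case-strength value onto the 0-100 rating scale.
    The logistic squashing means a fixed latent push moves the rating a lot
    near the middle of the scale and very little near the "sticky" endpoints."""
    return scale_max / (1.0 + np.exp(-latent))

# hypothetical latent-scale effects for a few evidence types
effects = {"physical": 1.5, "witness": 1.0, "character": 0.8}

n_cases = 10_000
latent = rng.normal(0.0, 1.0, n_cases)        # baseline strength + unmodeled variation
for beta in effects.values():
    status = rng.integers(-1, 2, n_cases)     # -1 exculpatory, 0 absent, +1 inculpatory
    latent += beta * status

ratings = latent_to_rating(latent)
counts, _ = np.histogram(ratings, bins=10, range=(0, 100))
print(counts)  # with these made-up numbers, the first and last bins get the most mass: the bowl shape
```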
The main motivation for getting this bowl shape right was that, if we didn’t, we might mistake for confirmation bias an effect that is better explained as an artifact of how people use the scale. In the end I don’t know how much any of it mattered!
main effects of evidence types

Here we see the average effects – across scenarios and participants – of each evidence category. I believe we reproduce the findings of the previous paper, with inculpatory character and witness evidence being weaker and inculpatory physical evidence being the strongest. Ambiguous evidence effects are weaker and… ambiguous. Most strikingly, across all categories exculpatory evidence is far less exculpatory than inculpatory evidence is inculpatory!
Technical note: these effect sizes are presented in terms of points on the scale, but because of the modeling changes described above, the exact number of points a piece of evidence contributes depends on where you are on the scale. The proper interpretation of the y-axis here is the number of points a piece of evidence would contribute when added to a case that is otherwise of average strength, which in this data turns out to be a little below the halfway point of the scale.
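As a purely illustrative example of that conversion, reusing the hypothetical squashing function from the sketch above (the numbers are made up, not fitted values): a latent-scale effect gets translated into scale points by adding it to an average-strength case and taking the difference on the rating scale.

```python
import numpy as np

def latent_to_rating(latent, scale_max=100.0):
    return scale_max / (1.0 + np.exp(-latent))   # same squashing as the sketch above

average_case_latent = -0.2   # maps to ~45 on the scale, a bit below the halfway point
physical_effect = 1.5        # hypothetical latent-scale effect of one piece of evidence

points = (latent_to_rating(average_case_latent + physical_effect)
          - latent_to_rating(average_case_latent))
print(round(points, 1))      # points contributed when added to an average-strength case
```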
the search for confirmation bias
First I’ll look for various sorts of systematic discrepancies between the model and the data that could be interpreted as indications of confirmation bias. Spoiler alert: I won’t find any. Never before have I encountered the problem of a model fitting too well.
If participants have a tendency to underweight exculpatory evidence in an otherwise strong case, or to underweight inculpatory evidence in an otherwise weak case, one might expect that our model, in which evidence weight is constant, would systematically underestimate the ratings of strong cases and/or overestimate the ratings of weak cases. Do we observe such a discrepancy?

Here, each point represents the average rating across all scenarios with the same configuration of evidence, and we compare the observed rating (y-axis) to the rating predicted by the model (x-axis). Under the prediction described above, we might expect more points to lie below the line at the low end of the x-axis (the real ratings are lower than the predicted ratings) and/or above the line at the high end of the x-axis (the real ratings are higher than the predicted ratings). There is no sign of a systematic discrepancy of that sort.
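For concreteness, here is a sketch of how such a check can be run on the trial-level data; the dataframe layout and the column names ("config", "rating", "prediction") are assumptions about the data, not the actual variable names used in the analysis.

```python
import numpy as np
import pandas as pd

def calibration_slope(trials: pd.DataFrame) -> float:
    """Slope of (observed - predicted) average rating on predicted rating,
    computed across evidence configurations. A positive slope would mean the
    model underestimates strong cases and/or overestimates weak ones."""
    by_config = trials.groupby("config")[["rating", "prediction"]].mean()
    residual = by_config["rating"] - by_config["prediction"]
    return float(np.polyfit(by_config["prediction"], residual, 1)[0])
```

In the plot above, the analogous slope is visually indistinguishable from zero.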
Another way of thinking about confirmation bias is that the impact of adding in a piece of exculpatory evidence depends on how much inculpatory evidence is present, and vice versa. We can examine this prediction as well.

Here each point is the average response across scenarios with a given number of exculpatory and inculpatory pieces of evidence. It appears that the (average) marginal effect of one more piece of evidence of either type is pretty consistent across evidence configurations, with some compression of the effect as you head towards the extremes of the scale. I’m not seeing any systematic difference between the model predictions and the observed ratings. Also, I haven’t even attempted to propagate the uncertainty in the model predictions here, but I’m pretty sure that uncertainty would swamp the distinctions between the model predictions and the observed data.
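A sketch of the corresponding computation (again with assumed column names, here "n_inculpatory", "n_exculpatory", and "rating"): tabulate mean ratings by the counts of each evidence type, then look at how the marginal effect of one more exculpatory item changes as the inculpatory count grows. Under confirmation bias, that marginal effect should shrink as the case gets stronger.

```python
import pandas as pd

def marginal_exculpatory_effect(trials: pd.DataFrame) -> pd.DataFrame:
    """Mean rating by (n_inculpatory, n_exculpatory), then the change in mean
    rating from adding one more exculpatory item, at each inculpatory count."""
    means = (trials.groupby(["n_inculpatory", "n_exculpatory"])["rating"]
                   .mean()
                   .unstack("n_exculpatory"))
    # difference between adjacent exculpatory counts, one row per inculpatory count
    return means.diff(axis=1)
```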
Finally, yet another possibility is that something like confirmation bias might show up in the variability of responses rather than in their averages. For example, imagine that for every scenario the participant flips a coin (or, equivalently, uses some information we’re not accounting for) to decide that the defendant is guilty or innocent, and then downweights/upweights evidence that contradicts/supports that verdict. Under such behavior, our model predicts that exculpatory and inculpatory evidence should cancel each other out on any given trial, but what’s actually happening is that people are swinging hard in one direction or the other and the canceling out only happens in the averages. This should manifest as increasing variance in trial-by-trial responses as the amount of evidence increases, which the model should not be able to capture.
This is a much messier graph! We’re plotting scenarios in the same way as the previous graph, but here we’re looking at the standard deviation across ratings rather than their mean. The things to note are: a) the model predictions aren’t quite as good as they are for the rating means, but they do capture the same overall trends, and b) to the extent that the model gets it wrong, it consistently thinks there should be more variability in the data than there actually is – that is, people are more consistent than the model is able to pick up on. This contradicts the idea that unpredictable confirmation bias would manifest as excess “swinginess” in the responses.
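To make the “swinginess” prediction concrete, here is a toy simulation (not the actual analysis, and with made-up effect sizes) of the coin-flip story above: the biased strategy leaves the average ratings balanced, but it makes the trial-to-trial spread grow with the amount of evidence, whereas the unbiased strategy does not.

```python
import numpy as np

rng = np.random.default_rng(1)

def rating_sd(n_pairs, biased, n_trials=20_000, effect=10.0, discount=0.3, noise=5.0):
    """SD of simulated ratings for cases with n_pairs inculpatory items and
    n_pairs exculpatory items (so the evidence is balanced on average)."""
    ratings = 50.0 + rng.normal(0.0, noise, n_trials)
    stance = np.where(rng.random(n_trials) < 0.5, 1.0, -1.0)   # +1 = leaning "guilty"
    for _ in range(n_pairs):
        for sign in (1.0, -1.0):            # one inculpatory and one exculpatory item
            weight = np.where(stance == sign, 1.0, discount) if biased else 1.0
            ratings += sign * effect * weight
    return ratings.std()

for n in (1, 2, 3, 4):
    print(n, round(rating_sd(n, biased=False), 1), round(rating_sd(n, biased=True), 1))
# without the bias the spread stays flat as evidence is added; with it, the spread grows with n
```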
directly modeling evidence interactions
Another possibility is that something like confirmation bias isn’t causing systematic discrepancies between the model and the observed data, but is rather simply getting absorbed into the model’s residual variability. We can therefore attempt to modify the model directly and see if any such modification increases the explanatory power of the model – that is, decreases the amount of unexplained response variability.
One general way of thinking about confirmation bias is in terms of interaction effects between pieces of evidence: each piece of evidence changes the effect of the others. We can therefore augment our previous model by simply adding pairwise interactions between pieces of evidence, which combine linearly with the main effects. We should then be able to discern confirmation bias by fitting and examining these interaction effects. Technical note: for the interactions, I fit fixed effects and per-scenario random effects, with a hierarchical normal prior on both the fixed and the random effects.
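A hedged sketch of what that augmentation could look like, written in PyMC and ignoring the scale-use link from the preamble for brevity; the variable names, prior scales, and data layout are placeholders, not the fitted model.

```python
import numpy as np
import pymc as pm

def build_interaction_model(X, scenario_idx, y):
    """X: (n_trials, n_evidence) signed evidence indicators.
    scenario_idx: (n_trials,) integer scenario labels. y: observed ratings."""
    n_trials, n_ev = X.shape
    n_scen = int(scenario_idx.max()) + 1
    # all pairwise products of the evidence columns
    pairs = [(i, j) for i in range(n_ev) for j in range(i + 1, n_ev)]
    XX = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])

    with pm.Model() as model:
        beta = pm.Normal("beta_main", 0.0, 5.0, shape=n_ev)                  # main effects
        tau_fix = pm.HalfNormal("tau_interaction_fixed", 1.0)
        gamma = pm.Normal("gamma_fixed", 0.0, tau_fix, shape=len(pairs))     # fixed interactions
        tau_rand = pm.HalfNormal("tau_interaction_random", 1.0)
        gamma_s = pm.Normal("gamma_scenario", 0.0, tau_rand,
                            shape=(n_scen, len(pairs)))                      # per-scenario interactions

        mu = (pm.math.dot(X, beta)
              + pm.math.dot(XX, gamma)
              + (XX * gamma_s[scenario_idx]).sum(axis=1))
        sigma = pm.HalfNormal("sigma", 10.0)
        pm.Normal("rating", mu, sigma, observed=y)
    return model
```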
The result we obtain is that the estimated interaction effects are all extremely small, with credible intervals containing zero. However, there are a lot of interaction effects, and it is possible for many very small effects to add up to something substantial even if the sign of any given effect cannot be established with any confidence. To get a sense of the net contribution of the interaction effects, I calculated how much they, along with several other components of the model, contribute to the total variability in judgements predicted by the model. This gives a sense of how much each piece of the model contributes to the model’s predictions for behavior.

As you can see, not only do the interaction effects contribute much less to predicted behavior than the other pieces of the model, their contribution actually goes negative! A negative contribution to behavioral variance suggests that the best use the model could find for the interaction effects was to cancel out the contributions of other effects.
Technical note: for the above plot, I’m using the posterior predictive variability of a new set of responses from the exact same set of subjects and scenarios. That is, the only thing that is getting resampled in the simulations from the posterior is the residual variability.
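One plausible way to compute that kind of decomposition (a sketch of the idea, not the code actually used): for each component of the predictor, compare the variance of the simulated responses with and without that component’s contribution; the difference can go negative when a component mostly cancels the others.

```python
import numpy as np

def variance_contribution(components, sigma_resid, rng=np.random.default_rng(2)):
    """components: dict mapping component name -> (n_draws, n_trials) array of
    that component's contribution to the predictor on each posterior draw.
    sigma_resid: (n_draws,) residual SDs; only the residual is resampled,
    matching the posterior-predictive setup described in the note above."""
    names = list(components)
    draws, trials = components[names[0]].shape
    noise = rng.normal(0.0, sigma_resid[:, None], size=(draws, trials))
    full = sum(components.values()) + noise
    total_var = full.var(axis=1).mean()            # predicted response variance
    out = {}
    for name in names:
        reduced = full - components[name]          # drop one component, keep the rest
        out[name] = total_var - reduced.var(axis=1).mean()
    return out                                     # negative values = net cancellation
```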