We had a total of 728 participants, of whom 302 completed the task; however, 3 participants were excluded due to anomalies in their data, such as having the wrong number of catch trials recorded (there are a handful of bugs like that I haven’t been able to track down). Of the 299 participants with usable data, the condition breakdown is as follows:
| cond_evidence | cond_question = ['bard', 'mltn', 'subj'] | cond_question = ['bard'] |
|---|---|---|
| credible | 50 | 49 |
| inculpatory | 51 | 49 |
| random | 51 | 49 |
I’ve only taken a cursory look at the differences between ‘standards of guilt’ conditions (above labeled ‘cond_question’), and will focus here on the comparison between evidence distributions.
First let’s look at the distribution of case strength ratings from the different evidence conditions:
As expected, increasing the average strength of the evidence presented causes responses to pile up progressively farther towards the high end of the scale.
We can compare these rating distributions to the previous two data sets we collected:
The ‘random’ evidence condition looks pretty similar to the first round of data we collected back in fall 2019 (albeit with a somewhat less extreme bowl shape), which also used a random evidence distribution. Meanwhile, the ‘credible’ evidence condition is a good match for the summer 2020 data set, which used close to the same criteria for evidence. Having at least approximately replicated both of the previous data sets, we can go ahead and rule out cohort effects as an explanation for the differences between them. Which is very, very good! Whatever is going on, it looks like it is predictable, repeatable, and driven by the distribution of evidence being presented.
In comparison, here are the average case strength ratings across the ‘questions asked’ conditions:
Asking about standards of guilt other than BARD doesn’t appear to have any effect, at least at the aggregate level.
What was interesting about the difference between Fall 2019 and Summer 2020 is that the rightward shift in the ratings distribution seemed more extreme than we would have expected based purely on the difference in the evidence presented. We can examine this question more carefully in our new data set. My strategy is to first fit our model of the ratings to the ‘random’ evidence condition, then ask the fitted model what ratings it would expect from new batches of participants doing the ‘credible’ and ‘inculpatory’ conditions. I show the observed and predicted rating distributions below, as solid and dashed lines respectively (plus shaded credible intervals!). For easier visual comparison I’ll plot the cumulative distributions of the ratings rather than their density distributions:
This is maybe not the most legible figure I have ever made, but the takeaway is that the observed distribution of ratings in the ‘credible’ condition is more extreme than you would expect if the ‘credible’ ratings were generated by the same model as the ‘random’ ratings. The same is… maaaaybe true of the inculpatory condition, but it’s kinda borderline, as you can see from the solid green line falling juuuust outside the shaded green area. One might have expected the opposite: that whatever changes occur going from the random to the credible condition should be amplified by going to the even more extreme inculpatory condition. This does not seem to be what is happening.
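For concreteness, here is a rough sketch of how that fit-then-predict comparison can be set up, assuming we already have posterior draws of the baseline and evidence weights from the ‘random’-condition fit and the evidence design matrix for the ‘credible’ condition. Every array below is a placeholder standing in for the real fitted quantities, and the linear-plus-squash form is an assumption for illustration; the actual model has more structure than this.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

n_draws, n_items, n_cases = 2000, 12, 300                  # placeholder sizes
baseline = rng.normal(0.0, 0.1, size=n_draws)              # posterior draws of the baseline
weights = rng.normal(0.0, 0.2, size=(n_draws, n_items))    # posterior draws of the evidence weights
sigma = np.abs(rng.normal(1.0, 0.05, size=n_draws))        # rating noise per draw

X_credible = rng.integers(0, 2, size=(n_cases, n_items))   # placeholder evidence design matrix
ratings_obs = rng.uniform(0, 1, size=n_cases)              # placeholder observed ratings (0-1 slider)

def to_slider(latent):
    """Squash unbounded latent 'guilt' onto the bounded 0-1 slider."""
    return 1.0 / (1.0 + np.exp(-latent))

grid = np.linspace(0, 1, 101)
cdfs = np.empty((n_draws, grid.size))
for d in range(n_draws):
    latent = baseline[d] + X_credible @ weights[d] + rng.normal(0, sigma[d], size=n_cases)
    cdfs[d] = (to_slider(latent)[:, None] <= grid).mean(axis=0)   # one predicted CDF per draw

lo, hi = np.percentile(cdfs, [2.5, 97.5], axis=0)                 # 95% predictive band
obs_cdf = (ratings_obs[:, None] <= grid).mean(axis=0)

plt.fill_between(grid, lo, hi, alpha=0.3, label="predicted from 'random' fit")
plt.plot(grid, obs_cdf, label="observed ('credible')")
plt.xlabel("case strength rating")
plt.ylabel("cumulative proportion")
plt.legend()
plt.show()
```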
So far we’ve treated our model of the ratings as a black box, but of course we can also look inside at the baseline and evidence weights to try to figure out what is driving the change in behavior across evidence conditions. (In the ‘random’ condition, we find that pretty much every parameter is consistent with what we saw in the original 2019 data set, which I won’t bother plotting at the moment.) Below I plot the average baseline case strength and evidence weights in the ‘credible’ and ‘inculpatory’ conditions in terms of their difference from the ‘random’ condition:
A few observations: first, tilting the evidence distribution in favor of the prosecution increases the baseline case strength by, like, a lot. However, the weights on the evidence items themselves shift consistently towards favoring the defense! In other words, exculpatory items become more exculpatory, inculpatory items become less inculpatory, and ambiguous items generally go from weakly favoring the prosecution to weakly favoring the defense or being neutral. Though this shift is not “significant” for most individual evidence items, the magnitude of the shift is surprisingly consistent across items, almost as if a flat penalty were being applied to the prosecution (relative to the random condition, that is) for every piece of evidence presented. Second, as before, the inculpatory condition is not associated with a more extreme deviation from the ‘random’ model; if anything the opposite is true, in that the difference between ‘random’ and ‘inculpatory’ is smaller than that between ‘random’ and ‘credible’. However, as you can tell just from looking at the error bars, we don’t really have the power to resolve differences between the ‘credible’ and ‘inculpatory’ conditions.
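One quick way to make the “flat penalty” observation concrete is to look at the per-item shifts directly: if the change were a constant decrement applied to every evidence item, the shifts should have a small spread across items and shouldn’t scale with the original weights. A minimal numpy sketch, with placeholder values standing in for the posterior-mean weights from the two fits:

```python
import numpy as np

rng = np.random.default_rng(1)
n_items = 12
w_random = rng.normal(0.0, 0.5, size=n_items)                    # placeholder posterior-mean weights ('random' fit)
w_credible = w_random - 0.3 + rng.normal(0, 0.05, size=n_items)  # placeholder: roughly uniform negative shift

delta = w_credible - w_random                                    # per-item shift toward the defense
print(f"mean shift: {delta.mean():+.3f}")
print(f"sd of shift across items: {delta.std(ddof=1):.3f}")
# A flat penalty predicts a small across-item sd relative to the mean shift,
# and no relationship between the size of the shift and the original weight:
print(f"corr(shift, original weight): {np.corrcoef(delta, w_random)[0, 1]:+.3f}")
```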
Every question we asked of the case strength ratings we can also ask of the binary BARD question (I haven’t yet touched the other binary judgements, MLTN and subjective). As you would expect, and as with the case strength ratings, guilty judgements were more common as the evidence menu favored the prosecution more heavily. However, once again the real question is whether, in the stronger evidence conditions, we are getting more guilty judgements than we would have expected based on the ‘random’ condition. As before, we can answer this question by fitting the model to the ‘random’ condition and then generating predictions for the ‘credible’ and ‘inculpatory’ conditions from that model.
…and we actually get quite a different answer for the BARD judgements compared to the case strength ratings. Instead of the observed responses favoring the prosecution more strongly than expected as the average evidence strength goes up, if anything the opposite happens, and participants vote guilty less often than we would have expected! Note that the observed behavior is well within the credible intervals, however, so “if anything” is doing quite a lot of work in the previous sentence.
And yet, if we look at how the baseline and evidence weights change between the ‘random’ condition and the other two, we see a similar pattern of changes with the BARD responses as with the case strength ratings:
Well, similar but not identical. The baseline goes up, but only in the ‘credible’ condition and not the ‘inculpatory’ condition, while the evidence weights look generally decreased (favoring the defense), but not nearly so uniformly as with the case strength ratings. Instead of a flat decrement across all the evidence, this looks more like the inculpatory evidence weights are decreasing more than the rest. Though again, with the size of the error bars we should avoid reading too much into the differences between evidence types and conditions.
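For the record, the BARD version of the fit-and-predict exercise is the same idea with a binary response. A bare-bones sketch using an off-the-shelf logistic regression, with placeholder data and a much simpler model than the one actually fit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_items, n_cases = 12, 300

X_random = rng.integers(0, 2, size=(n_cases, n_items))    # placeholder evidence matrices
y_random = rng.integers(0, 2, size=n_cases)               # placeholder BARD verdicts (1 = guilty)
X_credible = rng.integers(0, 2, size=(n_cases, n_items))
y_credible = rng.integers(0, 2, size=n_cases)

# The intercept plays the role of the baseline, the coefficients the evidence weights.
model = LogisticRegression(max_iter=1000).fit(X_random, y_random)

pred_rate = model.predict_proba(X_credible)[:, 1].mean()  # guilty rate expected under the 'random' fit
obs_rate = y_credible.mean()
print(f"predicted guilty rate: {pred_rate:.2f}, observed: {obs_rate:.2f}")
```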
So, that’s all a bit confusing, isn’t it? The good news is that we seem to have replicated and explained the difference between our previous data sets. The rest of the news isn’t actually bad at all, but it is going to require some serious thinking through, which as the frequently very tired father of a four-month-old I was hoping I might avoid.
Of the results presented here, one seems very interpretable and two seem rather anomalous. I’ll start with the anomalies.
We (well, I, at least) have been considering the case strength ratings as more or less a continuous proxy for the guilt judgements. Here, however, we see that while increasing the average strength of evidence drives case strength ratings to disproportionately favor the prosecution, the rate of guilty responses remains consistent with (or is maybe even lower than) what you’d expect under the ‘random’ condition. This suggests that the two responses are actually getting at distinct and separable constructs. I’m not sure what to think about this issue yet, but one way forward is to look at the other binary judgements (MLTN and subjective) and see whether they behave more like the case strength ratings or like BARD.
The other big mystery that I have no good ideas for is why the evidence weights should shift to favor the defense! Below I will talk about a couple candidate explanations and their discontents.
The simplest explanation that comes to mind is that when the cases are stronger, some kind of dynamic range adaptation kicks in in order to use more of the scale: if you take the increase in baseline as given, then decreasing the evidence weights keeps responses from bunching up at the high end of the scale. This is a very boring explanation which is basically all about the response modality. The problems with this explanation are fourfold: First, there’s no reason to take the increase in baseline as given! I can’t think of any reason that the burden of range adaptation should fall disproportionately on the evidence weights while the baseline is allowed to have the opposite effect (unless the baseline comes “first” in terms of sequential evidence integration, and therefore doesn’t “trigger” the range adaptation?). Second, in total, participants do the opposite of range adaptation: ratings in the ‘credible’ and ‘inculpatory’ conditions are more clumped up at the high end than expected from the ‘random’ condition. Third, our model for the ratings maps from an unbounded “guilt” space to the bounded “slider” space, and that mapping seems to be nearly identical between conditions. Range adaptation should, I think, show up in the model as a squishing or stretching of that mapping, and we don’t see that. Fourth, we see similar effects in the BARD question, where presumably there is no finite range to adapt to. This is the simplest and most compelling argument to me, but one could counter that people are adapting their BARD judgements to be consistent with their ratings, and that therefore the range adaptation in the latter is spilling over into the former.
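To make the third point a bit more concrete: if the latent-to-slider mapping is a squashing function with a gain and an offset (an assumption for illustration here; the real model’s mapping may be parameterized differently), range adaptation would show up as a change in that gain or offset between conditions, which is exactly what we don’t see. A toy illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

def slider(latent, gain=1.0, offset=0.0):
    """Map unbounded latent 'guilt' onto the bounded 0-1 slider."""
    return 1.0 / (1.0 + np.exp(-(gain * latent + offset)))

latent = np.linspace(-5, 5, 200)
plt.plot(latent, slider(latent, gain=1.0), label="no adaptation")
plt.plot(latent, slider(latent, gain=0.5), label="range-adapted (stretched mapping)")
plt.xlabel("latent guilt")
plt.ylabel("slider rating")
plt.legend()
plt.show()
```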
These arguments all suggest to me that range adaptation is not a good explanation for the changes in evidence weight, but I don’t know that I’d bet my life on any of them. We could pretty easily rule it out by running a condition where we drop the slider altogether and ask only for binary BARD judgements, in which case there is no continuous scale to “corrupt” the binary judgement.
I also thought that maybe people have finite attention, and when the prosecutor seems on the level, they weight their prior more heavily and pay less attention to the evidence. Then I realized this doesn’t work, because it would imply that all evidence weights should shrink towards zero rather than shifting negatively. An alternative version that might work is that people effectively grade cases on a curve, such that if they have more faith in the prosecutor they actually scrutinize the evidence more skeptically. I really don’t know how we would test this, though!
We have a pretty good handle on what is happening, but I don’t have a good mental model that would explain why. The baseline increase in favor of the prosecution can be interpreted as learning to trust that the prosecutor is competent, or conversely, in the random condition, learning that the prosecution may actually be totally incompetent. This is a very appealing and intuitive story, and ironically one could think of it as a form of confirmation bias. We could try to confirm this by running a task where, before every case, we tell subjects a story about how the prosecutor is either great at their job and everyone loves them, or terrible and under investigation for fabricating evidence (or something). Then we can see whether those manipulations have effects similar to the evidence manipulations. In theory we should also be able to study the process of learning within the data sets we’ve already collected. I haven’t tried yet, but we could try to explicitly model learning and use that as a way to unpack what is driving the changes.
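As a very crude first pass at modeling learning, one could let the baseline drift with trial number, separately by evidence condition, and see whether the drift differs. A sketch with placeholder data and column names (the real model would presumably let the evidence weights drift too, and do so hierarchically rather than with a single linear term):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 600
df = pd.DataFrame({
    "rating": rng.uniform(0, 1, n),                      # placeholder case strength ratings
    "trial": np.tile(np.arange(20), n // 20),            # placeholder trial index within session
    "evidence_strength": rng.normal(0, 1, n),            # placeholder summary of the evidence menu
    "cond": rng.choice(["random", "credible"], size=n),  # placeholder condition labels
})

# The baseline is allowed to drift with trial, separately per condition.
fit = smf.ols("rating ~ evidence_strength + trial * C(cond)", data=df).fit()
print(fit.params.filter(like="trial"))
```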