Does the continuous rating distort BARD judgements?

Previously we compared participant judgements across evidence distributions that were either balanced between prosecution and defense or favored the prosecution. We found that when the evidence distribution favored the prosecution rather than being balanced, baseline judgements favored the prosecution more, but evidence weights shifted towards favoring the defense. One possible explanation for this phenomenon is that the use of the continuous slider induced a “ceiling effect”, whereby subjects avoided clustering their responses too tightly on one side of the scale by adjusting evidence weights to favor the opposite side. Contradicting this explanation, the phenomenon was observed in the binary BARD judgement as well as in the continuous case strength rating; however, a ceiling effect in the continuous response could still manifest in the binary response due to correlation between the two. To rule out artifactual explanations based on how subjects use the continuous scale, we ran a version of the experiment in which participants were randomly assigned either to make both the continuous and binary responses or to make only the binary response. As before, participants were also randomly assigned to one of three evidence distributions.

A total of 293 participants completed the task, of whom 4 were excluded because they ended up with the wrong number of recorded events, which suggests some kind of server error (I am not certain exactly which exclusion criterion they triggered). The final counts of subjects per condition were:

##              cond_rating
## cond_evidence  0  1
##   credible    48 49
##   inculpatory 50 48
##   random      48 46

(where a cond_rating of 0 indicates the condition without the continuous response, and 1 the condition with it).

As a sanity check, we will first estimate the baseline and evidence weights for the case strength ratings and compare them to those obtained from the previous experiment, conducted in December. The results are shown in the figure below. For the sake of comprehensibility, I have collapsed across evidence types (physical, document, etc.) and am plotting the average evidence weight within each level (exculpatory, ambiguous, inculpatory).

The current results are noisier than the December dataset; this is expected, since only half the subjects in the current dataset gave case strength ratings. Nevertheless, I feel confident proclaiming that we have replicated the effect of evidence distribution on both the baseline and the evidence weights. The most substantial point of discrepancy between the two datasets is the baseline of the credible condition, which is lower in the current dataset than in the previous one; however, it is still well above the baseline in the random condition, which is the main point of interest.

Next, we examine the baseline and evidence weights for the BARD judgements as displayed in the figure below.

Again, the new datasets are smaller, which manifests as wider credible intervals. We replicate the overall shift in evidence weights towards the defense in both conditions, regardless of whether subjects gave a continuous response. We do not replicate the change in baseline in either condition, though we see a trend in the same direction among subjects who gave a continuous response; given the width of the credible intervals, however, we don’t really contradict the previous study either. To sum up, we clearly replicate the effect of evidence distributions on case strength ratings; for BARD judgements we replicate the effect on evidence weights but obtain an ambiguous result for the effect on the baseline. Most importantly, we see no evidence for a qualitative difference between BARD judgements given alongside versus without case strength ratings.

But does it blend?

Now that we have our model, we can attempt to fit it to our data and see if we can reproduce the differences between evidence distributions. However, learning models of this sort can be challenging to fit even to tasks that were designed specifically around such models, which our task definitely was not. Therefore, while for the static model we were able to do full Bayesian inference on the evidence weights at both the individual and population levels, it is not feasible to do the same with the learning model. Instead, we will use maximum likelihood estimation to fit a single point estimate of the learning model parameters across the entire dataset. This procedure is vastly easier than full Bayesian inference in a hierarchical model, but unfortunately it means that we can’t use the model fits as a basis for inference about the population, because the entire dataset isn’t necessarily representative of any one individual’s behavior. When we fit the learning model we are therefore not doing inference, but rather using the data to calibrate the model and pick a sensible set of parameters consistent with the data. We can then determine whether the model, when sensibly calibrated, is capable of producing the qualitative differences between evidence conditions which we are trying to explain.
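To make the calibration step concrete, here is a minimal sketch of what such a fit might look like in R. The function learning_model_nll() (which would return the pooled negative log-likelihood of every response given a single shared parameter vector) and the parameter names are hypothetical placeholders, not the actual implementation.

```r
# Hypothetical sketch: calibrate the learning model by maximum likelihood,
# pooling all subjects and trials into one negative log-likelihood.
# learning_model_nll() and the parameter names below are placeholders.
fit <- optim(
  par    = c(nu0 = 1, nu1 = 1, learning_rate = 0.1),  # arbitrary starting values
  fn     = function(par) learning_model_nll(par, data = all_trials),
  method = "Nelder-Mead"
)
fit$par  # the single point estimate used to calibrate the model
```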

My hands being well and thoroughly wrung, we can now actually fit the model and see what happens. I will fit the model to the December dataset and estimate the effect of learning by calculating the inferred baselines and evidence effects for each subject at both the beginning and end of the task. To compare evidence distributions, I then average those baselines and evidence effects across all subjects assigned to the same evidence distribution.
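The summarization step is nothing more exotic than a grouped average. As a sketch (the data frame subject_fits, with one row per subject and time point holding that subject’s inferred baseline and average evidence effect, is a hypothetical intermediate object):

```r
library(dplyr)

# Hypothetical intermediate: one row per subject and time point ("start"/"end"),
# containing that subject's inferred baseline and average evidence effect.
subject_fits %>%
  group_by(cond_evidence, time) %>%
  summarise(
    baseline        = mean(baseline),
    evidence_effect = mean(evidence_effect),
    .groups = "drop"
  )
```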

We see that the learning model captures several qualitative differences between evidence conditions. All evidence conditions start the task with negative baselines, consistent with the baseline in the random condition; after the task, the credible and inculpatory conditions have shifted substantially towards favoring the prosecution. Similarly, all evidence conditions start the task with evidence weights that overall favor the prosecution; after the task, the evidence weights in the inculpatory and credible conditions have shifted downwards to slightly favor the defense. However, the model does not accurately reflect the observed differences between the credible and inculpatory conditions. After learning, the model predicts that the baselines in the credible and inculpatory conditions should be similar, whereas we observe in the December dataset that the inculpatory baseline is lower than the credible baseline (note, however, that the model predictions do match the February dataset, to which the model was not fit!). The learning model also predicts that evidence weights should end up lower in the inculpatory condition than in the credible condition, whereas we consistently observe the opposite across experiments and response modalities.

We can actually rectify the above discrepancies between the learning model and the static model by making one more modification to the model: allowing the implicit total evidence (as governed by \(\nu_0\) and \(\nu_1\)) to differ between the random and credible evidence conditions (recall that we were already allowing the inculpatory condition to have its own \(\nu_0\) and \(\nu_1\)). This change allows the model to unlink the learning effects of the random and credible conditions, and the results are beautiful:
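Concretely (writing a condition index explicitly, which is my own notation), the modification lets each evidence condition \(c\) have its own intercept and slope in the implicit-evidence mapping from the appendix:

\[ \nu_t^{(c)} = \nu_0^{(c)} + \Sigma x_t \, \nu_1^{(c)}, \qquad c \in \{\text{random},\ \text{credible},\ \text{inculpatory}\} \]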

In this version of the model, learning results in the inculpatory condition’s baseline and evidence effects falling between those of the random and credible conditions (ever so slightly in the case of the evidence weights). We also see much sharper distinctions between the random condition and the other two conditions. The reason for the difference between this version of the model and the previous one is that this model infers that, in the random condition, the implicit total amount of evidence is the same as the observed amount of evidence: in terms of the parameters, \(\nu_0 \approx 0\) and \(\nu_1 \approx 1\), so that \(\nu_t = \nu_0 + \Sigma x_t \nu_1 \approx \Sigma x_t\). This means that while learning about the average case strength \(\bar{w}\) does occur in the random condition, it has no impact on behavior. Unfortunately, I can think of no a priori justification (or even a post hoc one, really) for why the implicit total evidence should differ between the random and credible conditions. This is mainly because the random and credible conditions have very similar distributions of observed evidence amounts, as you can see for yourself below:

In conclusion, we have one version of the learning model for which I can give a reasonable rationale for every choice made in its specification and which gets most things, but not everything, right; and another version of the model which contains one choice for which I have no justification but which gets basically everything right. I have wrestled with this conundrum well past the point of diminishing returns, and so, dear reader, I hereby pass the buck to you!

Appendix: A Missing-at-random model for implicit evidence amounts

When describing the learning model I claimed to have a plausible justification for the implicit total evidence amount being a linear function of the observed evidence amount, \(\nu_t = \nu_0 + \Sigma x_t \nu_1\) (credit goes to John for the original idea, though apparently he actually had something else in mind). Imagine that the actual total number of pieces of evidence that exists for a case is sampled from a discretized normal distribution, such that

\[ p(\nu = x) \propto \Phi(x+0.5 | \mu,\sigma) - \Phi(x-0.5 | \mu,\sigma)\] where \(\Phi\) is the cumulative normal distribution function and \(x\) is a nonnegative integer. Imagine next that each piece of evidence is actually observed with fixed probability \(\theta\), such that the amount of observed evidence (which I’ll now write \(S_x\)) is distributed

\[ S_x \sim \textrm{Binom}(x,\theta) \] The problem of going from observed to implicit evidence amounts can then be recast as an inference problem: finding \(\nu\) given \(S_x\), or rather calculating the function \(\textrm{E}[\nu | S_x]\). This function can, with a little algebra, be easily approximated to an arbitrary degree of precision, but it doesn’t have a nice interpretable analytical form (at least not one that I can figure out). Still, we can compute and visualize the function, as shown below.
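For completeness, the “little algebra” is just Bayes’ rule applied to the discretized-normal prior and the binomial observation model:

\[ \textrm{E}[\nu | S_x = s] = \frac{\sum_{x \ge s} x \, p(\nu = x) \binom{x}{s} \theta^{s} (1-\theta)^{x-s}}{\sum_{x \ge s} p(\nu = x) \binom{x}{s} \theta^{s} (1-\theta)^{x-s}} \]

Because the prior puts negligible mass on large \(x\), truncating the sums at a modest upper bound gives an arbitrarily good approximation.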

The points are the expected value of \(\nu | S_x\) for parameters \(\mu = 3.5\), \(\sigma=1\), and \(\theta=0.75\); or, in other words, how much total evidence you should think there is after observing some amount of evidence, assuming that on average there are 3.5 pieces of evidence per case, the amount of evidence per case has a variance of 1, and each piece of evidence is shown to you with probability 0.75. The red and blue lines represent the linear functions relating observed to implicit total evidence, fit to the credible and inculpatory evidence conditions, respectively. The linear functions imply that eventually, as the amount of observed evidence increases, you conclude that there is actually less total evidence than you have observed, as represented by the red and blue lines crossing the unity line. However, if you treat this as an inference problem where evidence is missing at random, then the relationship between observed and total evidence amounts starts out looking linear(ish) at low quantities of observed evidence, but as you observe more and more evidence you converge towards thinking that you must have observed all the evidence that is really out there. An important caveat is that this function is extremely sensitive to the setting of \(\sigma\): if you make \(\sigma\) too big then you no longer get this nice clean convergence to the unity line.
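For concreteness, here is a minimal R sketch of how those points can be computed; the truncation bound nu_max is my own arbitrary choice, and the parameter values are the ones quoted above.

```r
# Discretized-normal prior over the true total number of evidence pieces nu,
# truncated at a generous upper bound.
nu_max <- 30
nu     <- 0:nu_max
prior  <- pnorm(nu + 0.5, mean = 3.5, sd = 1) - pnorm(nu - 0.5, mean = 3.5, sd = 1)
prior  <- prior / sum(prior)

# Posterior expectation E[nu | S_x] under the Binomial(nu, theta) observation model.
theta <- 0.75
E_nu_given_Sx <- sapply(0:8, function(s_x) {
  lik  <- dbinom(s_x, size = nu, prob = theta)  # zero whenever s_x > nu
  post <- prior * lik / sum(prior * lik)
  sum(nu * post)
})
round(E_nu_given_Sx, 2)  # expected total evidence after observing 0..8 pieces
```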

This is a crude model, and one we are really not equipped to actually interrogate in our current task, but I think it’s good enough as motivation for the current model. One could imagine elaborating upon this model in many interesting ways: instead of evidence being missing at random, what if you have a prosecution and defense each “censoring” pieces of evidence based on their weight? Or some probability of fabricating evidence? In fact, one could use this, combined with the learning model, as an integrated model of “persuasion under asymmetric information” with application to advertising, or political debate, or… dating? Probably? But all that will have to wait for another day.