Does the continuous rating distort BARD judgements?

Previously we compared participant judgements across evidence distributions that were either balanced between prosecution and defense or favored the prosecution. We found that when the evidence distribution favored the prosecution rather than being balanced, baseline judgements favored the prosecution more, but evidence weights shifted towards favoring the defense. One possible explanation for this phenomenon is that the use of the continuous slider induced a “ceiling effect”, whereby subjects avoided clustering their responses too tightly on one side of the scale by adjusting evidence weights to favor the opposite side. Contradicting this explanation, the phenomenon was observed in the binary BARD judgement as well as in the continuous case strength rating; however, a ceiling effect in the continuous response could still manifest in the binary response due to correlation between the two. To rule out artifactual explanations based on how subjects use the continuous scale, we ran a version of the experiment in which participants were randomly assigned either to make both the continuous and binary responses or to make only the binary response. As before, participants were also randomly assigned to one of three evidence distributions.

A total of 293 participants completed the task, of whom 4 were excluded because they ended up with the wrong number of recorded events, which suggests some kind of server error (I am not certain exactly which exclusion criterion they triggered). The final counts of subjects per condition were:

##              cond_rating
## cond_evidence  0  1
##   credible    48 49
##   inculpatory 50 48
##   random      48 46

(where a cond_rating of 0 indicates the condition without the continuous response, and 1 the condition with it).

As a sanity check, we will first estimate the baseline and evidence weights for the case strength ratings and compare them to those obtained from the previous experiment, conducted in December. The results are shown in the figure below. For the sake of comprehensibility, I have collapsed across evidence types (physical, document, etc.) and am plotting the average evidence weight within each level (exculpatory, ambiguous, inculpatory).

The current results are noisier than the December dataset; this is expected, since only half the subjects in the current dataset gave case strength ratings. Nevertheless, I feel confident proclaiming that we have replicated the effect of evidence distribution on both the baseline and the evidence weights. The most substantial point of discrepancy between the two datasets is the baseline of the credible condition, which is lower in the current dataset than in the previous one; however, it is still well above the baseline in the random condition, which is the main point of interest.

Next, we examine the baseline and evidence weights for the BARD judgements as displayed in the figure below.

Again, the new datasets are smaller, which manifests as wider credible intervals. We replicate the overall shift in evidence weights towards the defense in both conditions, regardless of whether subjects gave a continuous response. We do not replicate the change in baseline in either condition, though we see a trend in the same direction among subjects who gave a continuous response; given the width of the credible intervals, however, we don’t really contradict the previous study either. To sum up, we clearly replicate the effect of evidence distributions on case strength ratings; for BARD judgements we replicate the effect on evidence weights but obtain an ambiguous result for the effect on the baseline. Most importantly, we see no evidence for a qualitative difference between BARD judgements given alongside versus without case strength ratings.

But does it blend?

Now that we have our model, we can attempt to fit it to our data and see if we can reproduce the differences between evidence distributions. However, learning models of this sort can be challenging to fit even to tasks that were designed specifically around such models, which our task definitely was not. Therefore, while for the static model we were able to do full Bayesian inference on the evidence weights at both the individual and population levels, it is not feasible to do the same with the learning model. Instead, we will use maximum likelihood estimation to fit a single point estimate of the learning model parameters across the entire dataset. This procedure is vastly easier than full Bayesian inference in a hierarchical model, but unfortunately it means that we can’t use the model fits as a basis for inference about the population, because the entire dataset isn’t necessarily representative of any one individual’s behavior. When we fit the learning model we are therefore not doing inference, but rather using the data to calibrate the model and pick a sensible set of parameters consistent with the data. We can then determine whether the model, when sensibly calibrated, is capable of producing the qualitative differences between evidence conditions which we are trying to explain.
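To make the calibration step concrete, here is a minimal sketch of what such a fit might look like in R. The function learning_model_nll() (which would return the pooled negative log-likelihood of every response given a single shared parameter vector) and the parameter names are hypothetical placeholders, not the actual implementation.

```r
# Hypothetical sketch: calibrate the learning model by maximum likelihood,
# pooling all subjects and trials into one negative log-likelihood.
# learning_model_nll() and the parameter names below are placeholders.
fit <- optim(
  par    = c(nu0 = 1, nu1 = 1, learning_rate = 0.1),  # arbitrary starting values
  fn     = function(par) learning_model_nll(par, data = all_trials),
  method = "Nelder-Mead"
)
fit$par  # the single point estimate used to calibrate the model
```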

My hands being well and thoroughly wrung, we can now actually fit the model and see what happens. I will fit the model to the December dataset and estimate the effect of learning by calculating the inferred baselines and evidence effects for each subject at both the beginning and end of the task. To compare evidence distributions, I then average those baselines and evidence effects across all subjects assigned to the same evidence distribution.
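The summarization step is nothing more exotic than a grouped average. As a sketch (the data frame subject_fits, with one row per subject and time point holding that subject’s inferred baseline and average evidence effect, is a hypothetical intermediate object):

```r
library(dplyr)

# Hypothetical intermediate: one row per subject and time point ("start"/"end"),
# containing that subject's inferred baseline and average evidence effect.
subject_fits %>%
  group_by(cond_evidence, time) %>%
  summarise(
    baseline        = mean(baseline),
    evidence_effect = mean(evidence_effect),
    .groups = "drop"
  )
```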

We see that the learning model captures several qualitative differences between evidence conditions. All evidence conditions start the task with negative baselines, consistent with the baseline in the random condition; after the task, the credible and inculpatory conditions have shifted substantially towards favoring the prosecution. Similarly, all evidence conditions start the task with evidence weights that overall favor the prosecution; after the task, the evidence weights in the inculpatory and credible conditions have shifted downwards to slightly favor the defense. However, the model does not accurately reflect the observed differences between the credible and inculpatory conditions. After learning, the model predicts that the baselines in the credible and inculpatory conditions should be similar, whereas we observe in the December dataset that the inculpatory baseline is lower than the credible baseline (note, however, that the model predictions do match the February dataset, to which the model was not fit!). The learning model also predicts that evidence weights should end up lower in the inculpatory condition than in the credible condition, whereas we consistently observe the opposite across experiments and response modalities.

We can actually rectify the above discrepancies between the learning model and the static model by making one more modification to the model: allowing the implicit total evidence (as governed by \(\nu_0\) and \(\nu_1\)) to differ between the random and credible evidence conditions (recall that we were already allowing the inculpatory condition to have its own \(\nu_0\) and \(\nu_1\)). This change allows the model to unlink the learning effects of the random and credible conditions, and the results are beautiful:
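Concretely (writing a condition index explicitly, which is my own notation), the modification lets each evidence condition \(c\) have its own intercept and slope in the implicit-evidence mapping from the appendix:

\[ \nu_t^{(c)} = \nu_0^{(c)} + \Sigma x_t \, \nu_1^{(c)}, \qquad c \in \{\text{random},\ \text{credible},\ \text{inculpatory}\} \]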

In this version of the model, learning results in the inculpatory condition’s baseline and evidence effects falling between those of the random and credible conditions (ever so slightly in the case of the evidence weights). We also see much sharper distinctions between the random condition and the other two conditions. The reason for the difference between this version of the model and the previous one is that this model infers that, in the random condition, the implicit total amount of evidence is the same as the observed amount of evidence: in terms of the parameters, \(\nu_0 \approx 0\) and \(\nu_1 \approx 1\), so that \(\nu_t = \nu_0 + \Sigma x_t \nu_1 \approx \Sigma x_t\). This means that while learning about the average case strength \(\bar{w}\) does occur in the random condition, it has no impact on behavior. Unfortunately, I can think of no a priori justification (or even a post hoc one, really) for why the implicit total evidence should differ between the random and credible conditions. This is mainly because the random and credible conditions have very similar distributions of observed evidence amounts, as you can see for yourself below:

In conclusion, we have one version of the learning model for which I can give a reasonable rationale for every choice made in its specification and which gets most things, but not everything, right; and another version of the model which contains one choice for which I have no justification but which gets basically everything right. I have wrestled with this conundrum well past the point of diminishing returns, and so, dear reader, I hereby pass the buck to you!

Appendix: A Missing-at-random model for implicit evidence amounts

When describing the learning model I claimed to have a plausible justification for the implicit total evidence amount being a linear function of the observed evidence amount, \(\nu_t = \nu_0 + \Sigma x_t \nu_1\) (credit goes to John for the original idea, though apparently he actually had something else in mind). Imagine that the actual total number of pieces of evidence that exists for a case is sampled from a discretized normal distribution, such that

\[ p(\nu = x) \propto \Phi(x+0.5 | \mu,\sigma) - \Phi(x-0.5 | \mu,\sigma)\] where \(\Phi\) is the cumulative normal distribution function and \(x\) is a nonnegative integer. Imagine next that each piece of evidence is actually observed with fixed probability \(\theta\), such that the amount of observed evidence (which I’ll now write \(S_x\)) is distributed

\[ S_x \sim \textrm{Binom}(x,\theta) \] The problem of going from observed to implicit evidence amounts can then be recast as an inference problem: finding \(\nu\) given \(S_x\), or rather calculating the function \(\textrm{E}[\nu | S_x]\). This function can, with a little algebra, be easily approximated to an arbitrary degree of precision, but it doesn’t have a nice interpretable analytical form (at least not one that I can figure out). Still, we can compute and visualize the function, as shown below.
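For completeness, the “little algebra” is just Bayes’ rule applied to the discretized-normal prior and the binomial observation model:

\[ \textrm{E}[\nu | S_x = s] = \frac{\sum_{x \ge s} x \, p(\nu = x) \binom{x}{s} \theta^{s} (1-\theta)^{x-s}}{\sum_{x \ge s} p(\nu = x) \binom{x}{s} \theta^{s} (1-\theta)^{x-s}} \]

Because the prior puts negligible mass on large \(x\), truncating the sums at a modest upper bound gives an arbitrarily good approximation.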

The points are the expected value of \(\nu | S_x\) for parameters \(\mu = 3.5\), \(\sigma=1\), and \(\theta=0.75\); or, in other words, how much total evidence you should think there is after observing some amount of evidence, assuming that on average there are 3.5 pieces of evidence per case, the amount of evidence per case has a variance of 1, and each piece of evidence is shown to you with probability 0.75. The red and blue lines represent the linear functions relating observed to implicit total evidence, fit to the credible and inculpatory evidence conditions, respectively. The linear functions imply that eventually, as the amount of observed evidence increases, you conclude that there is actually less total evidence than you have observed, as represented by the red and blue lines crossing the unity line. However, if you treat this as an inference problem where evidence is missing at random, then the relationship between observed and total evidence amounts starts out looking linear(ish) at low quantities of observed evidence, but as you observe more and more evidence you converge towards thinking that you must have observed all the evidence that is really out there. An important caveat is that this function is extremely sensitive to the setting of \(\sigma\): if you make \(\sigma\) too big then you no longer get this nice clean convergence to the unity line.
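For concreteness, here is a minimal R sketch of how those points can be computed; the truncation bound nu_max is my own arbitrary choice, and the parameter values are the ones quoted above.

```r
# Discretized-normal prior over the true total number of evidence pieces nu,
# truncated at a generous upper bound.
nu_max <- 30
nu     <- 0:nu_max
prior  <- pnorm(nu + 0.5, mean = 3.5, sd = 1) - pnorm(nu - 0.5, mean = 3.5, sd = 1)
prior  <- prior / sum(prior)

# Posterior expectation E[nu | S_x] under the Binomial(nu, theta) observation model.
theta <- 0.75
E_nu_given_Sx <- sapply(0:8, function(s_x) {
  lik  <- dbinom(s_x, size = nu, prob = theta)  # zero whenever s_x > nu
  post <- prior * lik / sum(prior * lik)
  sum(nu * post)
})
round(E_nu_given_Sx, 2)  # expected total evidence after observing 0..8 pieces
```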

This is a crude model, and one we are really not equipped to actually interrogate in our current task, but I think it’s good enough as motivation for the current model. One could imagine elaborating upon this model in many interesting ways: instead of evidence being missing at random, what if you have a prosecution and defense each “censoring” pieces of evidence based on their weight? Or some probability of fabricating evidence? In fact, one could use this, combined with the learning model, as an integrated model of “persuasion under asymmetric information” with application to advertising, or political debate, or… dating? Probably? But all that will have to wait for another day.