Let’s start by viewing the distributions of case strength ratings as a function of guilt judgements:
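Something like the following ggplot2 call sketches that figure (this is not the original plotting code; the column names rating, subjguilt, and legalguilt are taken from the regression code further down):

library(ggplot2)
# sketch: stack the case strength distributions by the combination of
# the two (binary) guilt judgements
ggplot(dat, aes(x = rating, fill = interaction(subjguilt, legalguilt))) +
  geom_histogram(binwidth = 5)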
As you might hope, users almost never respond that they would vote to find a defendant guilty (legal guilt) unless they also say that they believe the defendant is guilty (subjective guilt) (indicated by the very small amount of green). Subjective and legal judgements almost always agree (indicated by the relatively small amount of blue), which is somewhat disturbing if you think that the standard of reasonable doubt should be substantially higher than that of subjective belief.
We can plot something like a population-level psychometric function for each kind of guilt judgement as a function of case strength, like so:
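One way to produce such a plot, as a sketch under the same column-name assumptions as above (again, not the original code):

library(data.table)
# reshape to one row per (trial, judgement type), then fit a logistic
# curve for each judgement type across all users
long <- melt(dat, id.vars = c("rating", "uid"),
             measure.vars = c("subjguilt", "legalguilt"))
ggplot(long, aes(rating, value, colour = variable)) +
  geom_smooth(method = "glm", method.args = list(family = binomial))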
Here it looks like asking about legal versus subjective guilt has two effects: first, it shifts the curve to the right, and second (and more subtly), it seems to sharpen the curve. We see the same effects if we do this a bit more rigorously and fit a proper mixed-effects logistic regression:
library(lme4)     # glmer
library(magrittr) # %>%
# melt() and := come from data.table, loaded above
glmer(value ~ 1 + rating*variable + (1 + rating*as.numeric(I(variable=="legalguilt")) || uid),
      family = binomial,
      data = melt(dat, id.vars = c("rating","uid"),
                  measure.vars = c("subjguilt","legalguilt"))[, rating := rating/100]) %>%
  summary()
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula:
## value ~ 1 + rating * variable + (1 + rating * as.numeric(I(variable ==
## "legalguilt")) || uid)
## Data: melt(dat, id.vars = c("rating", "uid"), measure.vars = c("subjguilt",
## "legalguilt"))[, `:=`(rating, rating/100)]
##
## AIC BIC logLik deviance df.resid
## 2777 2829 -1380 2761 5016
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -13.863 -0.228 -0.017 0.173 9.954
##
## Random effects:
## Groups Name Variance Std.Dev.
## uid (Intercept) 1.937785 1.3920
## uid.1 rating 8.293724 2.8799
## uid.2 as.numeric(I(variable == "legalguilt")) 2.046765 1.4307
## uid.3 rating:as.numeric(I(variable == "legalguilt")) 0.000243 0.0156
## Number of obs: 5024, groups: uid, 81
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.936 0.244 -16.14 < 2e-16 ***
## rating 11.790 0.621 18.98 < 2e-16 ***
## variablelegalguilt -3.537 0.383 -9.24 < 2e-16 ***
## rating:variablelegalguilt 3.175 0.702 4.52 6.1e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) rating vrbllg
## rating -0.596
## varbllglglt -0.214 0.160
## rtng:vrbllg 0.335 -0.379 -0.832
## convergence code: 0
## Model failed to converge with max|grad| = 0.0123404 (tol = 0.001, component 1)
Here “rating” is the effect of going from 0 to 100 on the case strength scale, while “variablelegalguilt” is the difference between legal judgements and subjective judgements. The interaction effect “rating:variablelegalguilt” agrees with our (or at least, my) visual impression that case strength has a bigger impact on legal than on subjective judgements.
The shift of the curve has a clear interpretation – legal guilt requires a stronger case than subjective guilt. The sharpening of the curve when going from subjective to legal judgements is a bit more ambiguous. One interpretation is that with subjective guilt, people let factors other than case strength influence their judgement, whereas they focus more specifically on case strength when judging legal guilt.
Next we’ll look at the relationships among the legal severity of crimes, user ratings of crime punishability, and case strength. To examine the two response scales jointly, as well as the correlations between them, we’ll use a multivariate linear mixed-effects model. Note that this model treats responses as purely Gaussian, ignoring the bounded response scales and how people actually use those scales. Accordingly, predictions from this model won’t match the observed data, and exact numbers for effect sizes and credible intervals should perhaps not be taken too seriously. However, for questions of the form “do these two things have a positive or negative relationship?”, this simple model should be adequate.
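The model was fit with brms; here is a minimal sketch of the call, under the assumption that it matches the severity model printed at the end of this post minus the severity terms:

library(brms)
# |p| and |q| share the random-effect correlations across the two
# response variables; sampler settings copied from the output below
mvfit <- brm(
  mvbind(rating, rate_punishment) ~ 1 + physical + document + witness +
    character + (1 | p | uid) + (1 | q | scenario),
  data = dat, chains = 4, iter = 1500, warmup = 500, thin = 3
)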
Note that here “severity” simply means each crime’s position in the original spreadsheet. The spreadsheet also has a column called “severity” containing single letters or letter-number combinations, which I assume correspond to a formal legal classification, but I wasn’t sure what to do with those.
Crime severity is, unsurprisingly, strongly predictive of how severely people think a crime should be punished. Different crimes do also appear to have different case strength baselines, but the magnitude of these differences is much smaller than for punishment ratings, and there is no systematic relationship between the case strength of a crime and the crime’s severity.
This all seems very sensible, but on the other hand – haven’t we found that, in general, case strength ratings and punishment ratings are correlated at the level of individual observations? Do we still observe this in the present data set? And if so, what accounts for this correlation, if not crime severity?
In the raw data we do see a correlation of about 0.3 between the two ratings. However, the relationship doesn’t seem to be linear. It looks like people might be reasoning something like: if a crime isn’t punishable, then cases are necessarily weak, whereas for a very punishable crime a case can be either strong or weak. This doesn’t particularly help us explain what, other than crime severity, is driving the correlation.
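For reference, the 0.3 figure is just the raw observation-level correlation between the two response columns (names as in the model output below):

# raw correlation between the two rating scales, ~0.3 in this data set
dat[, cor(rating, rate_punishment)]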
To try to get a little more insight, we can refit the model while incorporating crime severity directly as a predictor.
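The refit formula is reproduced in the summary() output at the bottom of this post; a sketch of the corresponding call follows, with the caveat that the 0–1 rescaling of severity is a guess, since the Data: line in that output is truncated:

# hypothetical rescaling: crime's position in the spreadsheet mapped to 0-1
dat_sev <- cbind(dat, severity = dat[, (scenario - 1) / max(scenario - 1)])
mvfit_severe <- brm(
  mvbind(rating, rate_punishment) ~ 1 + severity + physical + document +
    witness + character + (1 + severity | p | uid) + (1 | q | scenario),
  data = dat_sev, chains = 4, iter = 1500, warmup = 500, thin = 3
)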
First off, we can see that the model agrees that severity has a huge effect on punishment ratings, but a small-if-any effect on case strength ratings:
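A hypothetical way to draw that comparison from the fit (since severity runs from 0 to 1, each coefficient is already the least-to-most-severe change):

# posterior intervals for the two severity coefficients
mcmc_plot(mvfit_severe,
          pars = c("b_rating_severity", "b_ratepunishment_severity"))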
The x-axis in the above plot shows the expected change in points from the least to most severe crime.
So according to the fit model, what is the source of the correlation between punishment and case strength ratings? The answer is, pretty much everything else. Most strikingly, the effects of evidence are very similar across the two scales, albeit much larger for case strength than punishment:
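A sketch of how that comparison might be pulled out of the fit (again, not the original plotting code):

# pair each evidence coefficient on the punishment scale with its
# counterpart on the case strength scale and scatter the estimates
fe <- as.data.table(fixef(mvfit_severe), keep.rownames = TRUE)  # rownames -> "rn"
fe <- fe[grepl("physical|document|witness|character", rn)]
fe[, scale := ifelse(grepl("^ratepunishment_", rn), "punishment", "strength")]
fe[, category := sub("^(rating|ratepunishment)_", "", rn)]
ggplot(dcast(fe, category ~ scale, value.var = "Estimate"),
       aes(strength, punishment, label = category)) +
  geom_point() + geom_text(vjust = -0.5)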
In the above plot, each point is an evidence category (that is, a combination of evidence type and strength/direction). Additionally, users’ baselines for case strength and punishment are correlated at 0.45 (CI = [0.24, 0.63]) (that is, people who think cases are generally stronger also think punishments should be harsher), residual response-level variability between the two scales is correlated at 0.33 (CI = [0.30, 0.37]), and, most intriguingly, a crime’s leftover punishability after crime severity is regressed out is correlated with its case strength (albeit with credible intervals crossing zero) at 0.32 (CI = [-0.05, 0.63]). That is, crimes that elicit higher punishment ratings also elicit stronger case strength ratings (probably), but in a way unrelated to the official classification of crime severity!
What does it all mean? I have no idea. I’ve included the full model output below in case anyone feels inclined to pore over the numbers trying to make sense of it all.
summary(mvfit_severe)
## Family: MV(gaussian, gaussian)
## Links: mu = identity; sigma = identity
## mu = identity; sigma = identity
## Formula: rating ~ 1 + severity + physical + document + witness + character + (1 + severity | p | uid) + (1 | q | scenario)
## rate_punishment ~ 1 + severity + physical + document + witness + character + (1 + severity | p | uid) + (1 | q | scenario)
## Data: cbind(dat, severity = dat[, (scenario - 1)/max(sce (Number of observations: 2512)
## Samples: 4 chains, each with iter = 1500; warmup = 500; thin = 3;
## total post-warmup samples = 1334
##
## Group-Level Effects:
## ~scenario (Number of levels: 31)
## Estimate Est.Error l-95% CI
## sd(rating_Intercept) 6.52 1.04 4.80
## sd(ratepunishment_Intercept) 10.49 1.50 7.97
## cor(rating_Intercept,ratepunishment_Intercept) 0.32 0.18 -0.05
## u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(rating_Intercept) 8.79 1.01 914 1190
## sd(ratepunishment_Intercept) 13.87 1.00 928 1192
## cor(rating_Intercept,ratepunishment_Intercept) 0.63 1.00 728 975
##
## ~uid (Number of levels: 81)
## Estimate Est.Error
## sd(rating_Intercept) 8.79 0.86
## sd(rating_severity) 4.94 2.13
## sd(ratepunishment_Intercept) 14.83 1.28
## sd(ratepunishment_severity) 25.02 2.51
## cor(rating_Intercept,rating_severity) 0.64 0.25
## cor(rating_Intercept,ratepunishment_Intercept) 0.45 0.10
## cor(rating_severity,ratepunishment_Intercept) 0.19 0.29
## cor(rating_Intercept,ratepunishment_severity) -0.13 0.13
## cor(rating_severity,ratepunishment_severity) -0.18 0.29
## cor(ratepunishment_Intercept,ratepunishment_severity) 0.33 0.11
## l-95% CI u-95% CI Rhat
## sd(rating_Intercept) 7.18 10.56 1.00
## sd(rating_severity) 0.67 9.04 1.01
## sd(ratepunishment_Intercept) 12.63 17.64 1.00
## sd(ratepunishment_severity) 20.39 30.40 1.00
## cor(rating_Intercept,rating_severity) -0.03 0.95 1.00
## cor(rating_Intercept,ratepunishment_Intercept) 0.24 0.63 1.01
## cor(rating_severity,ratepunishment_Intercept) -0.47 0.67 1.04
## cor(rating_Intercept,ratepunishment_severity) -0.37 0.14 1.00
## cor(rating_severity,ratepunishment_severity) -0.72 0.40 1.06
## cor(ratepunishment_Intercept,ratepunishment_severity) 0.11 0.54 1.00
## Bulk_ESS Tail_ESS
## sd(rating_Intercept) 932 1246
## sd(rating_severity) 1095 851
## sd(ratepunishment_Intercept) 805 1025
## sd(ratepunishment_severity) 1053 1211
## cor(rating_Intercept,rating_severity) 1253 1279
## cor(rating_Intercept,ratepunishment_Intercept) 658 1178
## cor(rating_severity,ratepunishment_Intercept) 116 332
## cor(rating_Intercept,ratepunishment_severity) 1081 1357
## cor(rating_severity,ratepunishment_severity) 74 277
## cor(ratepunishment_Intercept,ratepunishment_severity) 1197 1223
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat
## rating_Intercept 32.10 2.21 27.66 36.43 1.00
## ratepunishment_Intercept 52.97 2.90 47.18 58.68 1.00
## rating_severity 3.06 4.33 -5.55 11.09 1.01
## rating_physicalclear_ex -10.48 1.29 -12.99 -7.91 1.00
## rating_physicalambiguous 4.07 1.29 1.66 6.64 1.00
## rating_physicalclear_in 27.86 1.27 25.44 30.36 1.00
## rating_documentclear_ex -10.49 1.31 -13.05 -7.89 1.00
## rating_documentambiguous 4.27 1.33 1.68 6.76 1.00
## rating_documentclear_in 19.63 1.31 17.17 22.23 1.00
## rating_witnessclear_ex -7.47 1.32 -10.06 -4.93 1.00
## rating_witnessambiguous -1.69 1.25 -4.11 0.89 1.00
## rating_witnessclear_in 13.76 1.28 11.31 16.35 1.00
## rating_characterclear_ex -4.77 1.10 -7.01 -2.81 1.00
## rating_characterclear_in 8.27 1.14 5.97 10.42 1.00
## ratepunishment_severity 48.13 7.03 35.44 63.10 1.01
## ratepunishment_physicalclear_ex -1.00 1.11 -3.15 1.04 1.00
## ratepunishment_physicalambiguous 2.39 1.12 0.25 4.49 1.00
## ratepunishment_physicalclear_in 6.74 1.17 4.45 9.06 1.00
## ratepunishment_documentclear_ex -3.63 1.12 -5.73 -1.49 1.00
## ratepunishment_documentambiguous 0.55 1.13 -1.61 2.76 1.00
## ratepunishment_documentclear_in 4.83 1.14 2.66 7.16 1.00
## ratepunishment_witnessclear_ex -1.89 1.14 -4.18 0.30 1.00
## ratepunishment_witnessambiguous 0.43 1.15 -1.73 2.62 1.00
## ratepunishment_witnessclear_in 5.54 1.13 3.35 7.72 1.00
## ratepunishment_characterclear_ex -1.59 0.96 -3.44 0.28 1.00
## ratepunishment_characterclear_in 3.16 0.97 1.32 4.99 1.00
## Bulk_ESS Tail_ESS
## rating_Intercept 1051 1116
## ratepunishment_Intercept 938 1160
## rating_severity 868 1110
## rating_physicalclear_ex 1293 1181
## rating_physicalambiguous 1326 1320
## rating_physicalclear_in 1361 1265
## rating_documentclear_ex 1315 1212
## rating_documentambiguous 1401 1283
## rating_documentclear_in 1312 1240
## rating_witnessclear_ex 1415 1246
## rating_witnessambiguous 1304 1064
## rating_witnessclear_in 1293 1248
## rating_characterclear_ex 1421 1177
## rating_characterclear_in 1216 1148
## ratepunishment_severity 948 1194
## ratepunishment_physicalclear_ex 1181 1342
## ratepunishment_physicalambiguous 1353 1394
## ratepunishment_physicalclear_in 1406 1247
## ratepunishment_documentclear_ex 1179 1282
## ratepunishment_documentambiguous 1410 1192
## ratepunishment_documentclear_in 1381 1366
## ratepunishment_witnessclear_ex 1320 1135
## ratepunishment_witnessambiguous 1389 1247
## ratepunishment_witnessclear_in 1243 1322
## ratepunishment_characterclear_ex 1299 1284
## ratepunishment_characterclear_in 1284 1324
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## sigma_rating 22.56 0.33 21.91 23.20 1.00 1485
## sigma_ratepunishment 19.56 0.28 19.04 20.12 1.00 1418
## Tail_ESS
## sigma_rating 1276
## sigma_ratepunishment 1341
##
## Residual Correlations:
## Estimate Est.Error l-95% CI u-95% CI Rhat
## rescor(rating,ratepunishment) 0.33 0.02 0.30 0.37 1.00
## Bulk_ESS Tail_ESS
## rescor(rating,ratepunishment) 1682 1368
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).