---
title: "Presenting: An Excessive Quantity of Plots"
output:
  html_notebook: default
  html_document:
    df_print: paged
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE, tidy = TRUE, fig.height=3, fig.width=5)
library(stringr)
library(gridExtra)
library(lme4)
# data.table, ggplot2, and magrittr are used throughout; load them explicitly
# in case process_data.R doesn't already do so
library(data.table)
library(ggplot2)
library(magrittr)
options(scipen = 999)
source("/home/seth/code/casereveal/analysis/process_data.R")
```

Particularly regarding the recently collected pilot data, consisting of 33 mTurk subjects.

## Marginal distributions of ratings

First thing, let's take a look at the overall distribution of responses to see if there are any red flags.

```{r}
# Keep only non-test subjects who completed the task
msubj <- subjdat[!str_detect(uid,"test") & task_complete,uid]
scendat <- scendat[uid %in% msubj]
# Drop clicks with no guilt response recorded
clickdat <- clickdat[!is.na(guilty)]
dat <- merge(scendat,clickdat,by=c("uid","scenario"))
# Bin questions into four equal-sized periods for the time-on-task analyses below
dat[,period:=cut_number(question,4)]

qplot(dat$rating) + xlab("case strength")
qplot(dat$rate_punishment) + xlab("punishability")
```

For case strength we're getting a lot of clumping on the lower end of the scale, but overall people are using the entire scale.
Interesting how U-shaped it seems to be -- maybe because it's clear what a 0 pt case and a 100 pt case look like, but not so obvious what a 50 pt case looks like.
For punishment, on the other hand, we see a lot of people maxing out the scale.
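
As a rough quantification of that clumping (the cutoffs of 10 and 90 here are arbitrary -- just a sketch):

```{r}
# Share of case strength ratings piled up at each end of the scale
dat[,.(prop_low = mean(rating <= 10), prop_high = mean(rating >= 90))]
# Share of punishability ratings maxing out the scale
dat[,mean(rate_punishment == 100)]
```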

Of course what we really want to see is that individuals are using the whole scale, not just the population.
That's tougher to visualize with so many participants but we can give it a go.
```{r}
ggplot(dat,aes(x=rating,color=uid)) + stat_ecdf(geom="step", pad=FALSE) + theme(legend.position = "none") + xlab("case strength") + ylab("proportion")
ggplot(dat,aes(x=rate_punishment,color=uid)) + stat_ecdf(geom="step", pad=FALSE) + theme(legend.position = "none") + xlab("punishability") + ylab("proportion")
```

Here we're looking at the empirical cumulative distribution for each subject -- so, for each value on the x-axis, what proportion of their ratings were at or below that value.
It appears that use of the case strength scale is fairly consistent across subjects, but punishment seems more... idiosyncratic, with more subjects leaning more heavily on one side of the scale or another.
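
To make that concrete, the ECDF evaluated at, say, 50 is just the share of a subject's ratings at or below 50 -- here computed for whichever subject happens to come first in the data:

```{r}
# ECDF of case strength at x = 50 for one subject: P(rating <= 50)
dat[uid == uid[1], mean(rating <= 50)]
```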

We can also see this by just plotting the standard deviations of each participant's ratings:

```{r}
ggplot(dat[,.(rating=sd(rating),rate_punishment=sd(rate_punishment)),by=uid],aes(y=rating,x=rate_punishment)) + geom_point() + coord_fixed() + geom_abline() + ylab("case strength (sd)") + xlab("punishability (sd)")
```

For any given subject, case strength ratings tend to be more variable than punishability ratings.
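
If we want more than a visual impression, a quick paired test on the per-subject standard deviations would do (a sketch, not a planned analysis):

```{r}
sddat <- dat[,.(sd_strength = sd(rating), sd_punish = sd(rate_punishment)),by=uid]
# Paired comparison of each subject's spread on the two scales
wilcox.test(sddat$sd_strength, sddat$sd_punish, paired = TRUE)
```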


## Evidence manipulation check

Moving on, we can repeat our previous sanity check of looking at case strength ratings as a function of the total number of inculpatory versus exculpatory pieces of evidence in a case:
```{r}
evidat <- dat[,.(n_exculp=str_detect(c(physical,document,witness,character),"ex") %>% sum(),
       n_inculp=str_detect(c(physical,document,witness,character),"in") %>% sum(),
       rating, period),by=.(uid,scenario)]

ggplot(evidat,aes(x=n_inculp-n_exculp,y=rating)) + geom_smooth(method="lm", se=F) + geom_point(position=position_jitter(width=0.1),alpha=0.25) + ylim(c(-1,101)) + ylab("case strength")
```

Lookin' good.
However, Pate brought up that we should check whether specific cases might be behaving badly due to confusing wording, etc., so we can look at this relationship on a case-by-case basis.

```{r}
qplot(evidat[,cor(n_inculp-n_exculp,rating),by=scenario]$V1) + xlab("correlation") + ylab("number of scenarios")
```

Here we're looking at the distribution, across scenarios, of the correlation between the case strength rating and the inculpatory/exculpatory balance.
Every case seems to have at least a modest relationship in the right direction.
Though there appears to be some heterogeneity, it's hard to tell how much of that heterogeneity is "real" versus how much is just the consequence of noise.
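
To follow up on Pate's concern directly, we can also pull out the scenarios at the low end of that distribution for a closer read of their wording (just a convenience view):

```{r}
cordat <- evidat[,.(r = cor(n_inculp - n_exculp, rating)),by=scenario]
# The five scenarios with the weakest evidence-rating relationship
cordat[order(r)][1:5]
```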

We can take a crack at that latter question by fitting a simple mixed-effects model, which will do its best to sensibly distribute variability across subjects, scenarios, and response noise:
```{r}
fit <- lmer(rating ~ 1 + n_exculp + n_inculp + (1|uid) + (0+n_exculp|uid) + (0+n_inculp|uid) + 
       (1|scenario) + (0+n_exculp|scenario) + (0+n_inculp|scenario),data=evidat, REML=F)
print(fit, digits=2)
```

The main takeaway here is that we see negligible variability across subjects and scenarios in the marginal effect of a piece of evidence.
The one exception is inculpatory evidence across scenarios, but even there the standard deviation is only about 10% of the size of the corresponding fixed effect.
The best interpretation is not that there is _actually_ little to no variability across subjects or scenarios in the evidence effects, but rather that the model doesn't see any convincing evidence for such variability.
We do see more substantial variability across both subjects and scenarios in the baselines, which I believe is consistent with the previous findings.
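
If we wanted to push on that directly, one option is a likelihood-ratio test against a model that drops the scenario-level inculpatory slope (a sketch; note such tests are conservative because the null variance sits on the boundary at zero):

```{r}
# Refit without the scenario-level n_inculp slope and compare
fit0 <- update(fit, . ~ . - (0 + n_inculp | scenario))
anova(fit0, fit)
```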

## Time on task and consequences thereof

It's worth checking how much time participants are actually spending on the task to see if the projection we give in the mTurk listing is accurate.

```{r, fig.width=6}
# Time spent on the rating task proper: first start to last stop, per subject
timedat <- dat[,.(task_time=difftime(max(as.POSIXct(stop)),min(as.POSIXct(start)), units="mins")),by=uid]
# Total Qualtrics session time, from survey start to EndDate
timedat <- merge(timedat,subjdat[uid %in% dat$uid,.(uid, qualtrics_time=difftime(as.POSIXct(EndDate),as.POSIXct(start),units="mins"))])
# Subtract the task itself, leaving time spent on the rest of the survey
timedat[,qualtrics_time:=qualtrics_time-task_time]
ggplot(melt(timedat),aes(x=reorder(uid,-value),y=value,fill=ordered(variable,levels=c("qualtrics_time","task_time")))) + geom_bar(stat = "identity") + scale_fill_discrete(NULL) + scale_x_discrete(labels=NULL) + ylab("time (mins)")
```

That's quite a spread, with some participants going crazy fast.
The mTurk listing currently says that the task takes around 30 minutes, which is quite a bit longer than most people actually take.
Still, as we've had no difficulty recruiting subjects, I'd rather remain conservative with our advertised time than risk subjects getting impatient.
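
For the record, the actual task times are easy to summarize:

```{r}
# Five-number summary of task time, in minutes
summary(as.numeric(timedat$task_time))
```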

On a related note, given that our task is a bit on the long side by mTurk standards, I was worried that subjects might be getting less contemplative in their responses towards the end.
Accordingly, I redid some of the above analyses across four different time bins.

```{r}
ggplot(dat,aes(x=rating,fill=period)) + geom_histogram(bins=20) + xlab("case strength")
ggplot(evidat,aes(x=n_inculp-n_exculp,y=rating, color=period)) + geom_smooth(method="lm", se=F) + geom_point(position=position_jitter(width=0.1),alpha=0.3) + ylim(c(-1,101)) + ylab("case strength")
```

If there are any changes over the course of the task, they're certainly subtle.
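
One way to put a number on that is to compute the evidence-balance correlation within each quarter of the task:

```{r}
# Correlation between evidence balance and case strength, by time bin
evidat[,.(r = cor(n_inculp - n_exculp, rating)),by=period][order(period)]
```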

Somewhat surprisingly, we don't even see much of a change in response times over the course of the task:
```{r, fig.width=7}
# Per-question response time: diff of cumulative elapsed time between consecutive
# questions (the prepended 0 makes the first question's rt its own elapsed time)
rtdat <- dat[,.(rt=c(0,difftime(as.POSIXct(stop), min(as.POSIXct(start)), units="secs")) %>% diff(),question),by=uid]
ggplot(rtdat,aes(y=rt,x=ordered(question+1))) + geom_boxplot() + coord_cartesian(ylim=c(0,125)) + ylab("response time") + xlab("question number") + geom_smooth(aes(y=rt,x=question+1),method="loess")
```

While participants do speed up over the course of the task, it's again pretty subtle.
Note that a) I am chopping off a number of much longer outliers which render the graph unreadable, and b) the questions which appear to be sticking up relative to their neighbors follow catch trials, and the time spent on the comprehension question isn't being subtracted out.

## Conclusion

Data quality continues to look excellent.
Going in I was very concerned about the length of the task, but I see no indication that this is actually a problem.
There are only two issues we may want to address.

First, the bunching around zero of the case strength ratings.
While this is hardly catastrophic, data near the lower bound is intrinsically less informative.
I suggest that instead of fully randomizing the evidence as we are doing now, each scenario should have at least one piece of evidence that is either inculpatory or ambiguous.
I imagine that this would also be more ecologically valid.
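
As a concrete illustration (a hypothetical sketch -- `draw_evidence` and the valence labels are made up here, not our actual generation code), one could simply rejection-sample the four evidence slots:

```{r}
# Hypothetical sketch of constrained evidence randomization: resample the four
# slots until at least one is inculpatory or ambiguous
draw_evidence <- function() {
  repeat {
    ev <- sample(c("inculpatory","exculpatory","ambiguous"), 4, replace = TRUE)
    if (any(ev %in% c("inculpatory","ambiguous"))) return(ev)
  }
}
draw_evidence()
```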

Second, the inflated 100s for punishment.
This is even less of a problem because frankly, I have no particular plans for that data anyway.
However, we might want to revisit the wording again.
I changed the wording from "life in prison" to "extremely severe punishment" so that people wouldn't be trying to figure out how to weigh different kinds of punishment against each other (e.g., fines vs. prison time), but as a consequence subjects might be internally normalizing the value of the slider to the severity of the crime.
I'm open to suggestions on this matter.