This is wall-clock time over the whole experiment (whether participants were paying attention or not).
How long are individual trials taking?
If we exclude trials longer than 1 minute (where people plausibly got distracted doing other things), the mean RTs:
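Something like the following, assuming the response-time column is `rt` in milliseconds (both the column name and the units are assumptions):

```r
library(tidyverse)

# Per-trial RTs, dropping trials over a minute (participant plausibly away).
good_stuff |>
  filter(rt <= 60 * 1000) |>
  summarize(mean_rt_sec = mean(rt) / 1000,
            median_rt_sec = median(rt) / 1000)
```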
So, 10-20 seconds per trial generally.
Average accuracy is around 62%, with moderate person-to-person variability.
The key question is whether there are differences between the conditions: not huge ones. Here are mean accuracy and bootstrapped 95% CIs for each condition. There’s a 10-percentage-point difference between 2-player round 6 and 6-player round 6 (the round 1s are in the middle).
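For reference, one way to get those cells (using tidyboot here is an assumption about tooling; any bootstrap would do):

```r
library(tidyboot)

# Mean accuracy with bootstrapped 95% CIs for each group size x round cell.
good_stuff |>
  group_by(group_size, round) |>
  tidyboot_mean(correct)
```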
And the same plot with small per-participant dots.
Tangrams vary a lot in guessability.
We predict accuracy as a function of group size, round, their interaction, and trial order, with random effects by tangram and by participant:
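A minimal sketch of the brms call that would produce the summary below; the formula, family, and sampler settings are read off the output, while priors and any control arguments are assumptions (brms defaults):

```r
library(brms)

# Bayesian logistic regression of guess accuracy, with maximal random effects
# by tangram and by participant. Formula, family, chains, and iterations match
# the summary below; priors are left at brms defaults in this sketch.
acc_model <- brm(
  correct ~ group_size * round + trial_order +
    (group_size * round | correct_tangram) +
    (group_size * round + trial_order | workerid),
  data = good_stuff,
  family = bernoulli(link = "logit"),
  chains = 4, iter = 2000, warmup = 1000
)
summary(acc_model)
```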
## Family: bernoulli
## Links: mu = logit
## Formula: correct ~ group_size * round + trial_order + (group_size * round | correct_tangram) + (group_size * round + trial_order | workerid)
## Data: good_stuff (Number of observations: 3600)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Group-Level Effects:
## ~correct_tangram (Number of levels: 12)
##                                                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)                                               0.98      0.23     0.63     1.51 1.00     1484     1857
## sd(group_size6_player)                                      0.17      0.12     0.01     0.45 1.00     1400     1777
## sd(roundround_6)                                            0.48      0.18     0.19     0.89 1.00     1784     1757
## sd(group_size6_player:roundround_6)                         0.18      0.14     0.01     0.50 1.00     1727     1915
## cor(Intercept,group_size6_player)                          -0.23      0.41    -0.87     0.64 1.00     4992     3066
## cor(Intercept,roundround_6)                                -0.01      0.30    -0.57     0.58 1.00     2855     2645
## cor(group_size6_player,roundround_6)                        0.25      0.42    -0.66     0.89 1.01      954     1451
## cor(Intercept,group_size6_player:roundround_6)             -0.02      0.43    -0.79     0.79 1.00     5930     3127
## cor(group_size6_player,group_size6_player:roundround_6)    -0.06      0.45    -0.84     0.80 1.00     3030     3109
## cor(roundround_6,group_size6_player:roundround_6)           0.11      0.44    -0.75     0.85 1.00     3790     3594
##
## ~workerid (Number of levels: 60)
##                                                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)                                               0.64      0.12     0.42     0.88 1.00     1810     2031
## sd(group_size6_player)                                      0.18      0.12     0.01     0.45 1.00      930     1866
## sd(roundround_6)                                            0.22      0.13     0.01     0.48 1.01     1025     1438
## sd(trial_order)                                             0.01      0.00     0.00     0.01 1.01      488     1367
## sd(group_size6_player:roundround_6)                         0.26      0.17     0.02     0.65 1.00     1076     1965
## cor(Intercept,group_size6_player)                           0.12      0.36    -0.60     0.77 1.00     4059     2970
## cor(Intercept,roundround_6)                                -0.47      0.34    -0.91     0.39 1.00     2421     2529
## cor(group_size6_player,roundround_6)                       -0.07      0.39    -0.77     0.69 1.00     3200     3006
## cor(Intercept,trial_order)                                  0.05      0.34    -0.58     0.71 1.00     2957     2641
## cor(group_size6_player,trial_order)                         0.09      0.39    -0.68     0.79 1.00      788     2065
## cor(roundround_6,trial_order)                              -0.07      0.39    -0.76     0.69 1.00     1356     2557
## cor(Intercept,group_size6_player:roundround_6)             -0.09      0.37    -0.76     0.64 1.00     3378     2911
## cor(group_size6_player,group_size6_player:roundround_6)    -0.02      0.39    -0.74     0.74 1.00     2435     2830
## cor(roundround_6,group_size6_player:roundround_6)          -0.14      0.41    -0.80     0.68 1.00     2144     2926
## cor(trial_order,group_size6_player:roundround_6)           -0.06      0.40    -0.77     0.73 1.00     2692     3172
##
## Population-Level Effects:
##                                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept                           0.66      0.31     0.05     1.29 1.00      973     1789
## group_size6_player                  0.14      0.13    -0.12     0.38 1.00     3441     2928
## roundround_6                       -0.26      0.18    -0.63     0.09 1.00     2303     2393
## trial_order                        -0.00      0.00    -0.01     0.00 1.00     6121     3216
## group_size6_player:roundround_6     0.40      0.17     0.06     0.74 1.00     3520     3145
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
So basically this confirms what we see above: the percentages aren’t that different across conditions, but there is a credible interaction (estimate 0.40, 95% CI [0.06, 0.74]): 2-player games end up less transparent than 6-player games, even though they start in (roughly) the same place and don’t change that much from start to end.
Logistic models are annoying to think about, so we can look at predictions:
Dotted lines are 95% posterior predictions from the model (no random effects), solid lines are mean and 95% CI from bootstrapping that category.
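A sketch of how those no-random-effects predictions could be computed with tidybayes; the factor level names here are guesses based on the coefficient labels:

```r
library(tidyverse)
library(tidybayes)

# Posterior expected accuracy per condition cell, ignoring random effects
# (re_formula = NA). Level names ("2_player", "round_6", ...) are assumptions
# inferred from the coefficient labels; trial_order is held at its mean.
newdata <- expand_grid(
  group_size = c("2_player", "6_player"),
  round = c("round_1", "round_6"),
  trial_order = mean(good_stuff$trial_order)
)

newdata |>
  add_epred_draws(acc_model, re_formula = NA) |>
  group_by(group_size, round) |>
  median_qi(.epred, .width = 0.95)
```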
It seems a bit odd that the model predictions are all larger than the bootstrap estimates, so we check what’s causing it. It seems to be that adding the mixed effects pulls the low outliers up more (because the low ones sit further out?), though this may also be partly an artifact of averaging on the logit scale.
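A quick illustration of the logit-scale point: averaging and the inverse logit don’t commute, so how much spread the draws have (e.g., from random effects) changes the back-transformed number.

```r
# With spread on the logit scale, mean(plogis(x)) != plogis(mean(x)).
draws <- rnorm(1e5, mean = 0.5, sd = 1)  # stand-in for posterior draws
mean(plogis(draws))  # ~0.60: average of probabilities
plogis(mean(draws))  # ~0.62: probability at the average logit
```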
How does RT vary with transcript length?
How does accuracy vary with transcript length?
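The kind of join this needs looks roughly like the sketch below; `chats`, `text`, and the key columns are all hypothetical names for however the transcript data is stored:

```r
library(tidyverse)

# Hypothetical sketch: total transcript length per game x tangram x round,
# joined onto the naive-guesser trials. Table and column names are assumptions.
transcript_lengths <- chats |>
  group_by(game_id, correct_tangram, round) |>
  summarize(chat_chars = sum(str_length(text)), .groups = "drop")

with_length <- good_stuff |>
  left_join(transcript_lengths, by = c("game_id", "correct_tangram", "round"))

# RT and accuracy against transcript length:
ggplot(with_length, aes(chat_chars, rt)) + geom_point(alpha = .3) + geom_smooth()
ggplot(with_length, aes(chat_chars, as.numeric(correct))) + geom_smooth()
```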
Note there’s some join or exclusion issue happening where we don’t have round results for some of these. I thought it might be an NA issue, but apparently not; it looks like a results-vs-chat mismatch, which I will deal with later. Grrr.
What’s the relationship between original (in-game) listener accuracy and naive-reader accuracy here?
Definitely positive correlation here, which makes sense.
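A sketch of the check, aggregating to game x tangram; `orig_correct` is a hypothetical column for the in-game listener’s accuracy:

```r
# Correlate in-game listener accuracy with naive-reader accuracy here.
acc_compare <- good_stuff |>
  group_by(game_id, correct_tangram) |>
  summarize(naive_acc = mean(correct),
            orig_acc = mean(orig_correct), .groups = "drop")

cor(acc_compare$naive_acc, acc_compare$orig_acc, use = "complete.obs")
```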
Just to check for correlation issues: original accuracy vs. transcript length. There’s some correlation, but it’s not huge.
So, given these four data points, we have some things to explain.
What “theories” or proto-theories do we have to work with:
The “theories” also need to be consistent with our other sources of evidence:
We didn’t really have priors on this. These (first-round) accuracies are substantially lower than in-game listener (first-round) accuracies; we might expect some satisficing, some “it’s easier to have a conversation than to read one,” and also some “this isn’t including the right details.”
The big question might be the really big image-to-image differences: to some extent it’s not that accuracy is ~62%, it’s that accuracy is a mix running from roughly 35% to 85% across tangrams.
Theory implications: ???
People don’t start off doing anything different by group size. Maybe people on this task don’t read that far? Or the questions people ask are sufficiently random that the clarifications aren’t that differential?
Here we have the issue that in real games, initial accuracy for 2-player is (marginally?) greater than for 6-player, so there may or may not be something to explain here…
How far people actually read, and how much time they spend, might also be part of it.
So: descriptions vary, and we can think of “goodness to partner” and “goodness to naive reader” as having a shared “overall goodness” component and then a “path dependence” component (or something).
And we might imagine that both of these generally increase over time.
Things that don’t narrow down the set of tangrams (“diamond head”) aren’t useful at all.
Maybe we should factor out a “nicheness” component: how broad is the required prior knowledge? Is it based on a general sense of what bunnies look like, on pop culture, or on our personal conversation history?
There also seems to be a distinction between how commonly a description comes up and how well it fits the image (or differentially fits the image). “Cobra” might not be commonly used but may be high-fit.
So one question is: does everything follow the same path, just with different time components, or are there qualitatively different paths? The issue is that we think path speed is a group × tangram thing.
So there are a few things that contribute to difficulty, like the tangram and maybe the group dynamic (difficulty might be externally measurable with KiloGram, or the image’s distance from the others, or something).
Is there a way to info-theory this?
What’s a non-ad-hoc way…
We could deep-dive on a couple of games to see whether, within a game/image, we see a U-shape.
But what do we think the parameters are…
There are lots of functional forms that one could fit.
But what do we think the process is? Plausible ones, on a per-trajectory basis:

* could asymptote up as the tangram is better / more clearly described
* if in-group clarity exceeds some threshold, the description can be reduced, which may or may not reduce clarity for naive readers depending on in-groupiness?
No idea what those do to description length; we’re going to have to build some toy process model, aren’t we?
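A first stab at such a toy model, purely speculative, just to make the hypothesized dynamics concrete: clarity asymptotes up each round, and once in-group clarity passes a threshold the description gets compressed, which shortens it and may cost naive readers.

```r
# Toy process model (speculative): asymptotic improvement plus
# threshold-triggered compression. All parameters are made up.
simulate_trajectory <- function(rounds = 6, gain = 0.5, threshold = 0.8,
                                compress_cost = 0.05) {
  clarity <- naive <- len <- numeric(rounds)
  c_now <- 0.3   # starting in-group clarity
  l_now <- 10    # starting description length (arbitrary units)
  for (r in seq_len(rounds)) {
    c_now <- c_now + gain * (1 - c_now)          # asymptote toward 1
    if (c_now > threshold) l_now <- 0.7 * l_now  # compression kicks in
    clarity[r] <- c_now
    len[r] <- l_now
    naive[r] <- c_now - compress_cost * (10 - l_now)  # compression hurts naive readers
  }
  data.frame(round = seq_len(rounds), clarity, naive, length = len)
}
simulate_trajectory()
```

Even this crude version produces in-group clarity that keeps rising while naive-reader clarity plateaus or dips once compression starts, which is the qualitative pattern we’d want to compare against per-game trajectories.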