This is wall-clock time over the whole experiment (whether participants were paying attention or not).
How long are individual trials taking?
If we exclude trials longer than 1 minute (where people plausibly got distracted doing other things), the mean RTs:
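Something like the following, assuming the response-time column is `rt` in milliseconds (both the column name and the units are assumptions):

```r
library(tidyverse)

# Per-trial RTs, dropping trials over a minute (participant plausibly away).
good_stuff |>
  filter(rt <= 60 * 1000) |>
  summarize(mean_rt_sec = mean(rt) / 1000,
            median_rt_sec = median(rt) / 1000)
```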
So, 10-20 seconds per trial generally.
Average accuracy is around 62%, with moderate person-to-person variability.
The key question is whether there are differences between the conditions: not huge ones. Here are mean accuracy and bootstrapped 95% CIs for each condition. There’s a 10-percentage-point difference between 2-player round 6 and 6-player round 6 (the round 1s are in the middle).
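For reference, one way to get those cells (using tidyboot here is an assumption about tooling; any bootstrap would do):

```r
library(tidyboot)

# Mean accuracy with bootstrapped 95% CIs for each group size x round cell.
good_stuff |>
  group_by(group_size, round) |>
  tidyboot_mean(correct)
```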
And the same plot with small per-participant dots.
Tangrams vary a lot in guessability.
We predict accuracy as a function of group size, round, their interaction, and trial order, with random effects by tangram and by participant:
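A minimal sketch of the brms call that would produce the summary below; the formula, family, and sampler settings are read off the output, while priors and any control arguments are assumptions (brms defaults):

```r
library(brms)

# Bayesian logistic regression of guess accuracy, with maximal random effects
# by tangram and by participant. Formula, family, chains, and iterations match
# the summary below; priors are left at brms defaults in this sketch.
acc_model <- brm(
  correct ~ group_size * round + trial_order +
    (group_size * round | correct_tangram) +
    (group_size * round + trial_order | workerid),
  data = good_stuff,
  family = bernoulli(link = "logit"),
  chains = 4, iter = 2000, warmup = 1000
)
summary(acc_model)
```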
## Family: bernoulli
## Links: mu = logit
## Formula: correct ~ group_size * round + trial_order + (group_size * round | correct_tangram) + (group_size * round + trial_order | workerid)
## Data: good_stuff (Number of observations: 3600)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Group-Level Effects:
## ~correct_tangram (Number of levels: 12)
##                                                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)                                               0.98      0.23     0.63     1.51 1.00     1484     1857
## sd(group_size6_player)                                      0.17      0.12     0.01     0.45 1.00     1400     1777
## sd(roundround_6)                                            0.48      0.18     0.19     0.89 1.00     1784     1757
## sd(group_size6_player:roundround_6)                         0.18      0.14     0.01     0.50 1.00     1727     1915
## cor(Intercept,group_size6_player)                          -0.23      0.41    -0.87     0.64 1.00     4992     3066
## cor(Intercept,roundround_6)                                -0.01      0.30    -0.57     0.58 1.00     2855     2645
## cor(group_size6_player,roundround_6)                        0.25      0.42    -0.66     0.89 1.01      954     1451
## cor(Intercept,group_size6_player:roundround_6)             -0.02      0.43    -0.79     0.79 1.00     5930     3127
## cor(group_size6_player,group_size6_player:roundround_6)    -0.06      0.45    -0.84     0.80 1.00     3030     3109
## cor(roundround_6,group_size6_player:roundround_6)           0.11      0.44    -0.75     0.85 1.00     3790     3594
##
## ~workerid (Number of levels: 60)
##                                                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)                                               0.64      0.12     0.42     0.88 1.00     1810     2031
## sd(group_size6_player)                                      0.18      0.12     0.01     0.45 1.00      930     1866
## sd(roundround_6)                                            0.22      0.13     0.01     0.48 1.01     1025     1438
## sd(trial_order)                                             0.01      0.00     0.00     0.01 1.01      488     1367
## sd(group_size6_player:roundround_6)                         0.26      0.17     0.02     0.65 1.00     1076     1965
## cor(Intercept,group_size6_player)                           0.12      0.36    -0.60     0.77 1.00     4059     2970
## cor(Intercept,roundround_6)                                -0.47      0.34    -0.91     0.39 1.00     2421     2529
## cor(group_size6_player,roundround_6)                       -0.07      0.39    -0.77     0.69 1.00     3200     3006
## cor(Intercept,trial_order)                                  0.05      0.34    -0.58     0.71 1.00     2957     2641
## cor(group_size6_player,trial_order)                         0.09      0.39    -0.68     0.79 1.00      788     2065
## cor(roundround_6,trial_order)                              -0.07      0.39    -0.76     0.69 1.00     1356     2557
## cor(Intercept,group_size6_player:roundround_6)             -0.09      0.37    -0.76     0.64 1.00     3378     2911
## cor(group_size6_player,group_size6_player:roundround_6)    -0.02      0.39    -0.74     0.74 1.00     2435     2830
## cor(roundround_6,group_size6_player:roundround_6)          -0.14      0.41    -0.80     0.68 1.00     2144     2926
## cor(trial_order,group_size6_player:roundround_6)           -0.06      0.40    -0.77     0.73 1.00     2692     3172
##
## Population-Level Effects:
##                                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept                           0.66      0.31     0.05     1.29 1.00      973     1789
## group_size6_player                  0.14      0.13    -0.12     0.38 1.00     3441     2928
## roundround_6                       -0.26      0.18    -0.63     0.09 1.00     2303     2393
## trial_order                        -0.00      0.00    -0.01     0.00 1.00     6121     3216
## group_size6_player:roundround_6     0.40      0.17     0.06     0.74 1.00     3520     3145
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
So basically this confirms what we see above: the percentages aren’t that different across conditions, but there is a credible interaction (estimate 0.40, 95% CI [0.06, 0.74]): 2-player games end up less transparent than 6-player games, even though they start in (roughly) the same place and don’t change that much from start to end.
Logistic models are annoying to think about, so we can look at predictions:
Dotted lines are 95% posterior predictions from the model (no random effects), solid lines are mean and 95% CI from bootstrapping that category.
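A sketch of how those no-random-effects predictions could be computed with tidybayes; the factor level names here are guesses based on the coefficient labels:

```r
library(tidyverse)
library(tidybayes)

# Posterior expected accuracy per condition cell, ignoring random effects
# (re_formula = NA). Level names ("2_player", "round_6", ...) are assumptions
# inferred from the coefficient labels; trial_order is held at its mean.
newdata <- expand_grid(
  group_size = c("2_player", "6_player"),
  round = c("round_1", "round_6"),
  trial_order = mean(good_stuff$trial_order)
)

newdata |>
  add_epred_draws(acc_model, re_formula = NA) |>
  group_by(group_size, round) |>
  median_qi(.epred, .width = 0.95)
```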
It seems a bit odd that the model predictions are all larger than the bootstrap estimates, so we check what’s causing it. It seems to be that adding the mixed effects pulls the low outliers up more (because the low ones sit further out?), though this may also be partly an artifact of averaging on the logit scale.
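A quick illustration of the logit-scale point: averaging and the inverse logit don’t commute, so how much spread the draws have (e.g., from random effects) changes the back-transformed number.

```r
# With spread on the logit scale, mean(plogis(x)) != plogis(mean(x)).
draws <- rnorm(1e5, mean = 0.5, sd = 1)  # stand-in for posterior draws
mean(plogis(draws))  # ~0.60: average of probabilities
plogis(mean(draws))  # ~0.62: probability at the average logit
```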
How does RT vary with transcript length?
How does accuracy vary with transcript length?
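The kind of join this needs looks roughly like the sketch below; `chats`, `text`, and the key columns are all hypothetical names for however the transcript data is stored:

```r
library(tidyverse)

# Hypothetical sketch: total transcript length per game x tangram x round,
# joined onto the naive-guesser trials. Table and column names are assumptions.
transcript_lengths <- chats |>
  group_by(game_id, correct_tangram, round) |>
  summarize(chat_chars = sum(str_length(text)), .groups = "drop")

with_length <- good_stuff |>
  left_join(transcript_lengths, by = c("game_id", "correct_tangram", "round"))

# RT and accuracy against transcript length:
ggplot(with_length, aes(chat_chars, rt)) + geom_point(alpha = .3) + geom_smooth()
ggplot(with_length, aes(chat_chars, as.numeric(correct))) + geom_smooth()
```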
Note there’s some join or exclusion issue happening where we don’t have round results for some of these. I thought it might be an NA issue, but apparently not; it looks like a results-vs-chat mismatch, which I will deal with later. Grrr.
What’s the relationship between original (in-game) listener accuracy and naive-reader accuracy here?
Definitely positive correlation here, which makes sense.
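A sketch of the check, aggregating to game x tangram; `orig_correct` is a hypothetical column for the in-game listener’s accuracy:

```r
# Correlate in-game listener accuracy with naive-reader accuracy here.
acc_compare <- good_stuff |>
  group_by(game_id, correct_tangram) |>
  summarize(naive_acc = mean(correct),
            orig_acc = mean(orig_correct), .groups = "drop")

cor(acc_compare$naive_acc, acc_compare$orig_acc, use = "complete.obs")
```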
Just to check for correlation issues: original accuracy vs. transcript length. There’s some correlation, but it’s not huge.
So, given these four data points, we have some things to explain.
What “theories” or proto-theories do we have to work with:
The “theories” also need to be consistent with our other sources of evidence:
We didn’t really have priors on this. These (first-round) accuracies are substantially lower than in-game listener (first-round) accuracies; we might expect some satisficing, some “it’s easier to have a conversation than to read one,” and also some “this isn’t including the right details.”
The big question might be the really big image-to-image differences: to some extent it’s not that accuracy is ~62%, it’s that accuracy is a mix running from roughly 35% to 85% across tangrams.
Theory implications: ???
People don’t start off doing anything different by group size. Maybe people on this task don’t read that far? Or the questions people ask are sufficiently random that the clarifications aren’t that differential?
Here we have the issue that in real games, initial accuracy for 2-player is (marginally?) greater than for 6-player, so there may or may not be something to explain here…
How far people actually read, and how much time they spend, might also be part of it.
So: descriptions vary, and we can think of “goodness to partner” and “goodness to naive reader” as having a shared “overall goodness” component and then a “path dependence” component (or something).
And we might imagine that both of these generally increase over time.
Things that don’t narrow down the set of tangrams (“diamond head”) aren’t useful at all.
Maybe we should factor out a “nicheness” component: how broad is the required prior knowledge? Is it based on a general sense of what bunnies look like, on pop culture, or on our personal conversation history?
There also seems to be a distinction between how commonly a description comes up and how well it fits the image (or differentially fits the image). “Cobra” might not be commonly used but may be high-fit.
So one question is: does everything follow the same path, just with different time components, or are there qualitatively different paths? The issue is that we think path speed is a group × tangram thing.
So there are a few things that contribute to difficulty, like the tangram and maybe the group dynamic (difficulty might be externally measurable with KiloGram, or the image’s distance from the others, or something).
Is there a way to info-theory this?
What’s a non-ad-hoc way…
We could deep-dive on a couple of games to see whether, within a game/image, we see a U-shape.
But what do we think the parameters are…
There are lots of functional forms that one could fit.
But what do we think the process is? Plausible ones, on a per-trajectory basis:

* could asymptote up as the tangram is better / more clearly described
* if in-group clarity exceeds some threshold, the description can be reduced, which may or may not reduce clarity for naive readers depending on in-groupiness?
No idea what those do to description length; we’re going to have to build some toy process model, aren’t we?
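A first stab at such a toy model, purely speculative, just to make the hypothesized dynamics concrete: clarity asymptotes up each round, and once in-group clarity passes a threshold the description gets compressed, which shortens it and may cost naive readers.

```r
# Toy process model (speculative): asymptotic improvement plus
# threshold-triggered compression. All parameters are made up.
simulate_trajectory <- function(rounds = 6, gain = 0.5, threshold = 0.8,
                                compress_cost = 0.05) {
  clarity <- naive <- len <- numeric(rounds)
  c_now <- 0.3   # starting in-group clarity
  l_now <- 10    # starting description length (arbitrary units)
  for (r in seq_len(rounds)) {
    c_now <- c_now + gain * (1 - c_now)          # asymptote toward 1
    if (c_now > threshold) l_now <- 0.7 * l_now  # compression kicks in
    clarity[r] <- c_now
    len[r] <- l_now
    naive[r] <- c_now - compress_cost * (10 - l_now)  # compression hurts naive readers
  }
  data.frame(round = seq_len(rounds), clarity, naive, length = len)
}
simulate_trajectory()
```

Even this crude version produces in-group clarity that keeps rising while naive-reader clarity plateaus or dips once compression starts, which is the qualitative pattern we’d want to compare against per-game trajectories.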