We want a data format that has the utterances divided up for feeding into SBERT.
Take 1:
- divide on end punctuation (.!?) and on returns
- include speaker and listener (easy to filter later)
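A minimal sketch of this kind of split, in Python since the embedding step runs in Jupyter. The field names (`playerId`, `role`, `text`) and function names are hypothetical, and the regex assumes end punctuation is followed by whitespace; the chunking actually used for these notes may differ.

```python
import re

# Split after end punctuation (. ! ?) followed by whitespace, or on line breaks.
_SPLIT = re.compile(r"(?<=[.!?])\s+|[\r\n]+")

def split_utterance(text):
    """Return the sentence-like pieces of one chat message."""
    return [piece.strip() for piece in _SPLIT.split(text) if piece.strip()]

def explode_message(message):
    """One output row per piece, keeping the role so listeners can be filtered later.

    `message` is a dict with hypothetical keys: playerId, role, text.
    """
    return [
        {"playerId": message["playerId"], "role": message["role"], "text": piece}
        for piece in split_utterance(message["text"])
    ]

rows = explode_message(
    {"playerId": "p1", "role": "speaker",
     "text": "weird shape. like a palm tree\nlooking right"}
)
```

Keeping `role` on every row means speaker-only analyses are a simple filter later on.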
Open question: do we use the pre- or post-filtered/cleaned transcripts? Here we’ll use raw (no spellcheck, no filter).
Take 2: we concatenate everything (joined with a space), still with no spellcheck or filtering.
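The concatenation step could be sketched as below. The grouping keys (`gameId`, `playerId`, `trial`) are assumed column names, not necessarily the ones in the real data, and rows are assumed to already be in chat order.

```python
from collections import defaultdict

def concatenate_utterances(rows):
    """Join every text for the same (gameId, playerId, trial) with a single space.

    No spellcheck or filtering is applied, only whitespace trimming.
    """
    grouped = defaultdict(list)
    for row in rows:
        text = row["text"].strip()
        if text:
            grouped[(row["gameId"], row["playerId"], row["trial"])].append(text)
    return {key: " ".join(texts) for key, texts in grouped.items()}

# Toy rows for illustration only.
rows = [
    {"gameId": "g1", "playerId": "p1", "trial": 0, "text": "sitting on the ground"},
    {"gameId": "g1", "playerId": "p1", "trial": 0, "text": "facing left"},
    {"gameId": "g1", "playerId": "p2", "trial": 0, "text": "ok"},
]
combined = concatenate_utterances(rows)
```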
This is where the utterances need to be embedded in Jupyter.
[Deprecated] We embed each sentence separately, but then average across the embeddings to get a vector for each player:trial combo.
How different are average versus concat embeddings:
9672 of the `r test %>% nrow()` person-trial combos were single-line utterances to begin with, so these have the same embedding regardless of method.
Of the others, this is the cosine similarity distribution between the two methods (grouped by player count).
So the average similarity is around .87 for that half (and 1 for the identical half), but there are some outliers that are lower than .8.
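To make the comparison concrete, here is a minimal sketch of the two quantities involved, in plain Python. It assumes the sentence vectors have already been produced by the embedding model; the toy vectors are made up for illustration and are not SBERT output.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of equal-length vectors (the 'average' method)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy stand-ins: two per-sentence vectors and one vector for the concatenated text.
sentence_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
concat_vector = [1.0, 1.0, 0.5]
method_agreement = cosine_similarity(mean_vector(sentence_vectors), concat_vector)

# A single-line utterance: averaging one vector returns it unchanged,
# so the two methods agree exactly.
single = [0.2, 0.4, 0.4]
single_agreement = cosine_similarity(mean_vector([single]), single)
```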
Switching the diagrams below to use the concatenate method.
(Still questions about cleaning etc.)
This is done on the raw transcripts. Some things that could contribute to “similarity” include longer strings (because there will be more averaging, which tends to pull embeddings toward the center) and more non-reference language, which we expect to be more uniform (maybe?).
We could attempt to “solve” these in future by annotating for the “reference” expression (or by using the cleaned transcript as a halfway point).
Do tangram descriptions diverge sooner in smaller groups? How similar are descriptions of the same tangram in the same block across games?
(only looking at speaker utts)
Within the same tangram, repNum, and numPlayers, but across different games
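One way this grouping could be computed is sketched below. The record keys (`gameId`, `tangram`, `repNum`, `numPlayers`, `vector`) are assumptions about the data layout, and the toy vectors are made up; the real analysis uses the SBERT embeddings of the concatenated speaker utterances.

```python
import math
from collections import defaultdict
from itertools import combinations

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def across_game_similarities(records):
    """Mean cosine similarity per (tangram, repNum, numPlayers) cell,
    using only pairs of descriptions that come from different games."""
    cells = defaultdict(list)
    for rec in records:
        cells[(rec["tangram"], rec["repNum"], rec["numPlayers"])].append(rec)

    means = {}
    for cell, recs in cells.items():
        sims = [
            cosine_similarity(r1["vector"], r2["vector"])
            for r1, r2 in combinations(recs, 2)
            if r1["gameId"] != r2["gameId"]
        ]
        if sims:  # cells with a single game have no pairs to compare
            means[cell] = sum(sims) / len(sims)
    return means

# Toy records for illustration only.
records = [
    {"gameId": "g1", "tangram": "A", "repNum": 0, "numPlayers": 2, "vector": [1.0, 0.0]},
    {"gameId": "g2", "tangram": "A", "repNum": 0, "numPlayers": 2, "vector": [1.0, 0.0]},
    {"gameId": "g3", "tangram": "A", "repNum": 0, "numPlayers": 2, "vector": [0.0, 1.0]},
    {"gameId": "g4", "tangram": "A", "repNum": 0, "numPlayers": 6, "vector": [0.0, 1.0]},
]
means = across_game_similarities(records)
```

`combinations` visits each unordered pair once, so the result is an upper-triangle summary with no self-pairs.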
Similarity is higher in larger games and earlier in the games. Note that 6noro starts out looking like a 6p game and ends looking like a 2-3 player game.
(only looking at speaker utts)
Within a game and a round, how different are the descriptions of different tangrams?
Within game, repNum, numPlayers, between tangrams
Earlier reps have more similarity, as do large games.
How similar is a tangram description on round N to N+1 as a function of rep and game size?
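A sketch of the round N to N+1 pairing, assuming one embedding per game, tangram, and rep (for example the speaker's concatenated description). Key names and toy vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def adjacent_round_similarities(records):
    """Map (gameId, tangram, earlier_rep) to the similarity with the next rep."""
    by_key = {(r["gameId"], r["tangram"], r["repNum"]): r["vector"] for r in records}
    sims = {}
    for (game, tangram, rep), vector in by_key.items():
        following = by_key.get((game, tangram, rep + 1))
        if following is not None:  # the last rep has no successor
            sims[(game, tangram, rep)] = cosine_similarity(vector, following)
    return sims

# Toy records for illustration only.
records = [
    {"gameId": "g1", "tangram": "A", "repNum": 0, "vector": [1.0, 0.0]},
    {"gameId": "g1", "tangram": "A", "repNum": 1, "vector": [1.0, 1.0]},
    {"gameId": "g1", "tangram": "A", "repNum": 2, "vector": [1.0, 1.0]},
]
adjacent = adjacent_round_similarities(records)
```

The result can then be averaged by earlier rep and group size to compare, say, the 0-1 pair against the 4-5 pair.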
6noro has really similar descriptions throughout.
Pairs of adjacent rounds have more similar descriptions when the rounds are later (4-5 is closer than 0-1). Descriptions are more similar with larger groups in 0-1, but less similar in larger groups in 2-3, 3-4, and 4-5.
Here’s where 6noro really looks like a 2p game.
A semi-random selection of games & tangrams, trying to get some diversity.
earlier | later | earlierText | laterText | sim |
---|---|---|---|---|
0 | 1 | dinosaur one leg lifted head backwards | head tilted backward, one leg raised little standing position | 0.6322160 |
0 | 2 | dinosaur one leg lifted head backwards | looks like a stomping penguin to me, you called it a dinosaur square on upper right side | 0.4868374 |
0 | 3 | dinosaur one leg lifted head backwards | penguin stomping feet, dinosaur, square head on the right | 0.5793958 |
0 | 4 | dinosaur one leg lifted head backwards | dinosaur, head tilted backward right | 0.8121429 |
0 | 5 | dinosaur one leg lifted head backwards | stomping penguin dinosaur | 0.5108741 |
1 | 2 | head tilted backward, one leg raised little standing position | looks like a stomping penguin to me, you called it a dinosaur square on upper right side | 0.2924130 |
1 | 3 | head tilted backward, one leg raised little standing position | penguin stomping feet, dinosaur, square head on the right | 0.3743544 |
1 | 4 | head tilted backward, one leg raised little standing position | dinosaur, head tilted backward right | 0.5450624 |
1 | 5 | head tilted backward, one leg raised little standing position | stomping penguin dinosaur | 0.1429436 |
2 | 3 | looks like a stomping penguin to me, you called it a dinosaur square on upper right side | penguin stomping feet, dinosaur, square head on the right | 0.8326169 |
2 | 4 | looks like a stomping penguin to me, you called it a dinosaur square on upper right side | dinosaur, head tilted backward right | 0.5779020 |
2 | 5 | looks like a stomping penguin to me, you called it a dinosaur square on upper right side | stomping penguin dinosaur | 0.7573282 |
3 | 4 | penguin stomping feet, dinosaur, square head on the right | dinosaur, head tilted backward right | 0.6258747 |
3 | 5 | penguin stomping feet, dinosaur, square head on the right | stomping penguin dinosaur | 0.8331335 |
4 | 5 | dinosaur, head tilted backward right | stomping penguin dinosaur | 0.5017789 |
earlier | later | earlierText | laterText | sim |
---|---|---|---|---|
0 | 1 | sitting on the ground facing left | sitting facing left, can’t see arms, legs are outstretched | 0.5795866 |
0 | 2 | sitting on the ground facing left | sitting, facing left, head bowed, legs outstretched some | 0.6725468 |
0 | 3 | sitting on the ground facing left | mopey guy facing left legs outstretched | 0.1782776 |
0 | 4 | sitting on the ground facing left | mopey guy | 0.0669464 |
0 | 5 | sitting on the ground facing left | moper | 0.0535394 |
1 | 2 | sitting facing left, can’t see arms, legs are outstretched | sitting, facing left, head bowed, legs outstretched some | 0.7637988 |
1 | 3 | sitting facing left, can’t see arms, legs are outstretched | mopey guy facing left legs outstretched | 0.4570391 |
1 | 4 | sitting facing left, can’t see arms, legs are outstretched | mopey guy | 0.0721988 |
1 | 5 | sitting facing left, can’t see arms, legs are outstretched | moper | -0.0354502 |
2 | 3 | sitting, facing left, head bowed, legs outstretched some | mopey guy facing left legs outstretched | 0.4424953 |
2 | 4 | sitting, facing left, head bowed, legs outstretched some | mopey guy | 0.1479361 |
2 | 5 | sitting, facing left, head bowed, legs outstretched some | moper | 0.0124740 |
3 | 4 | mopey guy facing left legs outstretched | mopey guy | 0.6457207 |
3 | 5 | mopey guy facing left legs outstretched | moper | 0.3183658 |
4 | 5 | mopey guy | moper | 0.5250157 |
earlier | later | earlierText | laterText | sim |
---|---|---|---|---|
0 | 1 | weird shape. like a palm tree or bunny ears from the head body is of 2 triangles. looking right | body looks like a triangle head has 2 bunny ears sticking out | 0.7387307 |
1 | 2 | body looks like a triangle head has 2 bunny ears sticking out | bunny ears traiangle body | 0.7907372 |
2 | 3 | bunny ears traiangle body | bunny ears | 0.8592576 |
3 | 4 | bunny ears | the bunny or girl in a kimono facing right yes | 0.5323732 |
4 | 5 | the bunny or girl in a kimono facing right yes | bunny ears | 0.5323732 |
earlier | later | earlierText | laterText | sim |
---|---|---|---|---|
0 | 1 | this is the triangle head person haha | this is the one with the triangle ears like a bunny | 0.5474127 |
0 | 2 | this is the triangle head person haha | good ol triangle head | 0.6664659 |
0 | 3 | this is the triangle head person haha | our fave mr triangle head | 0.6533870 |
0 | 4 | this is the triangle head person haha | triangle head ftw! | 0.6695504 |
0 | 5 | this is the triangle head person haha | triangle head<3 our fave :’( haha | 0.7322333 |
1 | 2 | this is the one with the triangle ears like a bunny | good ol triangle head | 0.4983486 |
1 | 3 | this is the one with the triangle ears like a bunny | our fave mr triangle head | 0.4054277 |
1 | 4 | this is the one with the triangle ears like a bunny | triangle head ftw! | 0.4208512 |
1 | 5 | this is the one with the triangle ears like a bunny | triangle head<3 our fave :’( haha | 0.4181670 |
2 | 3 | good ol triangle head | our fave mr triangle head | 0.7457436 |
2 | 4 | good ol triangle head | triangle head ftw! | 0.7452906 |
2 | 5 | good ol triangle head | triangle head<3 our fave :’( haha | 0.6989794 |
3 | 4 | our fave mr triangle head | triangle head ftw! | 0.6391654 |
3 | 5 | our fave mr triangle head | triangle head<3 our fave :’( haha | 0.7786541 |
4 | 5 | triangle head ftw! | triangle head<3 our fave :’( haha | 0.6734986 |
This example really raises the question of what to do with non-referential language (or dubiously referential language). For instance, the sentiment in “good ol”, “our fave”, and “haha” is also relevantly similar across later iterations, but is it “reference” per se?
We could try to pull just the referential descriptions out?
Include listener utterances: group within game, within tangram, and within round, comparing across utterers.
Should we code for whether a pair includes the speaker?
(This is nonsense)
We want to look at whether properties of the utterances (like distinctiveness) are related to how successful people were that trial.
We’re going to need to un-upper-triangle this (mirror each pair so that every item appears on both sides)!
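A minimal sketch of the mirroring step, assuming the pairwise table holds one `(a, b, sim)` row per unordered pair. The function names and toy values are made up for illustration.

```python
def mirror_pairs(upper_rows):
    """Turn upper-triangle rows (a, b, sim) into both (a, b, sim) and (b, a, sim)."""
    full = []
    for a, b, sim in upper_rows:
        full.append((a, b, sim))
        full.append((b, a, sim))
    return full

def mean_similarity_per_item(upper_rows):
    """Average similarity of each item to all the others it was paired with."""
    totals = {}
    for item, _other, sim in mirror_pairs(upper_rows):
        total, count = totals.get(item, (0.0, 0))
        totals[item] = (total + sim, count + 1)
    return {item: total / count for item, (total, count) in totals.items()}

# Toy upper-triangle rows for illustration only.
upper = [("A", "B", 0.5), ("A", "C", 0.25), ("B", "C", 0.75)]
per_item = mean_similarity_per_item(upper)
```

Without the mirroring, the item that only ever appears in the second column (here `C`) would get no row of its own when grouping by the first column.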
First try grouping by rep and game
We can also un-upper-triangle this to look on a per-tangram basis. We see that more distinctive tangram descriptions (lower similarity) correlate with higher success rates.
(Note this is a correlation, causality is likely … complicated and cyclic)
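For reference, a plain Pearson correlation between per-tangram mean similarity and success rate could be computed as below. The numbers are toy values chosen to be perfectly linear, not results from this dataset, and Pearson is only one possible choice of statistic.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy values for illustration only: one entry per tangram.
mean_similarity = [0.2, 0.4, 0.6, 0.8]
success_rate = [0.9, 0.8, 0.7, 0.6]
r = pearson_r(mean_similarity, success_rate)
```

A negative `r` is the direction described above: lower similarity (more distinctive descriptions) going with higher success.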
Is there anything good to do with listener data?
How to look for “same speaker” effects? We would want to take the same tangram over all the rounds and then use the distance to the final (or first) round, split by whether the speaker is the same. Whether there is a same speaker is deeply confounded with group size!