We have lots of language data from tangrams experiments. My goal here is to take a finer-grained, quantitative look at the language used, to get some grasp on how language changes over the course of a game, with a focus on which description elements stay and which go.
This sample is the 4-player rotate games (~ 20 games total)
They were “chunked” to extract each descriptive ~phrase. This gets rid of game talk and hedges (most of them), and separates long multi-part descriptions into their parts.
This was done with gpt-4 using the prompt: “Here’s a partial transcript of people describing images. Extract verbatim a list of the descriptive phrases that are used. If there are no descriptive phrases used, return an empty list. As an example, if the transcript was ‘It looks like a magician, and, uhh, I think he’s got a rabbit.’, the response would be [‘a magician’,‘he’s got a rabbit’]. As an example, if the transcript was ‘big triangle arm facing left head on the right’, the response would be [‘big triangle arm’, ‘facing left’, ‘head on the right’]. Return just a list of the descriptive phrases. Here’s the text:”
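For reference, a minimal sketch of what the extraction call could look like from R via httr (hitting the chat completions endpoint directly; the extract_chunks helper and its arguments are placeholders, not the actual pipeline code):

library(httr)

# hypothetical helper: send one utterance plus the chunking prompt above to gpt-4
extract_chunks <- function(utterance, prompt, api_key = Sys.getenv("OPENAI_API_KEY")) {
  resp <- POST(
    "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", api_key)),
    body = list(
      model = "gpt-4",
      messages = list(list(role = "user", content = paste(prompt, utterance)))
    ),
    encode = "json"
  )
  # the model's reply is the bracketed list of phrases, returned as a string
  content(resp)$choices[[1]]$message$content
}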
Chunks were checked and corrected to ensure they were substrings (modulo spelling normalization). Chunks were added/removed/split when I noticed there were problems.
Chunks were hand-tagged (twice + adjudication) for being “abstract”. Chunks were regex-tagged for “body”, “posture”, “position”, and “shape”. Each chunk could have 0 or more labels.
TODO: are there more canonical (not ad-hoc) classes for posture, shape, etc. that I could use?
regexes used:
library(stringr)

# body parts, with word boundaries so e.g. "arm" doesn't match "charm"
detect_body <- str_c("\\b(face|head|heads|back|shoulder|shoulders|",
                     "arm|arms|leg|legs|foot|feet|body|knee|knees|",
                     "toe|toes|hand|hands|butt|heel|heels|ear|ears|nose|neck|chest|hair)\\b")
# shape terms, mostly stems (e.g. "squar" catches square/squares/squarish)
detect_shape <- str_c("squar|triangle|triangular|diamond|shape|",
                      "trapez|angle|degree|parallel|rhomb|box|cube")
# spatial/position terms
detect_position <- str_c("right|left|above|below|under|over|top|",
                         "bottom|behind|side|beneath")
# posture/action terms, again mostly stems
detect_posture <- str_c("kick|crouch|squat|kneel|knelt|stood|",
                        "stand|sit|sat|lying|walk|facing|fall|looking|",
                        "lean|seat|laying")
Here, we take advantage of having cleaner descriptions (w/o hedges, filler) to rerun the content analyses done previously.
We see the trifecta of expected effects, yay!
As a version of “reduction”, can look at how # of chunks changes over time.
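A sketch of the counting, assuming one row per chunk and (hypothetical) gameId / tangram identifier columns alongside repNum:

library(dplyr)

# chunks per tangram per repetition; plot n_chunks against repNum to see the trend
chunk_counts <- chunks_tagged %>%
  group_by(gameId, tangram, repNum) %>%
  summarise(n_chunks = n(), .groups = "drop")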
The number of chunks declines over time!
TODO: could also look at chunk lengths, could try to tie this to performance, but mostly people get things right…
Want to check what the chunk type tagging is doing.
What isn’t getting labelled into any category?
What labels / combinations are common?
We might expect that abstract becomes more common over time while “lower level” types decrease.
For this, we impose a category hierarchy where abstract > posture > body > shape > position for tie-breaking chunks with multiple labels. (So a chunk counts as the first category in that order that applies.)
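Concretely, the tie-breaking is just a first-match-wins case_when over the label flags (sketch, assuming the logical tag columns from above plus a hand-coded abstract column):

library(dplyr)

# a chunk gets the highest-priority category that applies; unlabelled chunks fall through to "other"
chunks_typed <- chunks_tagged %>%
  mutate(type = case_when(
    abstract ~ "abstract",
    posture  ~ "posture",
    body     ~ "body",
    shape    ~ "shape",
    position ~ "position",
    TRUE     ~ "other"
  ))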
Abstract mostly stays constant, other categories, especially “body” decrease.
Try splitting it up by whether everyone got it right or at least someone got it wrong.
We find that descriptions that someone gets wrong tend to be longer – not sure how to disentangle causality here. Could try lagging? (note that this is low-feedback and rotate, so people don’t have a great sense of what worked or not)
Tangrams vary a lot in how fast the chunks decrease and what types of descriptions are used.
One issue with the prior SBERT analyses is that we can’t easily distinguish the emergence of conventions from the dropping of other descriptions or of extra non-referential verbiage.
A lot of the analyses we care about want some form of chunk-to-chunk similarity so we can look at what stays or drops by comparing to the end. (or distinctiveness, etc.)
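A small helper for that, assuming the SBERT embeddings are available as numeric matrices with one row per chunk:

# pairwise cosine similarity between two sets of chunk embeddings (rows = chunks)
cosine_sim <- function(A, B) {
  A <- A / sqrt(rowSums(A^2))   # L2-normalize rows
  B <- B / sqrt(rowSums(B^2))
  A %*% t(B)                    # entry [i, j] is cos(A_i, B_j)
}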
We could look at the emergence of “conventions” by specifying what conventions are (ex. last round chunks) and then looking for when a chunk of at least YY similarity occurs. The big question is what cutoff to use for similarity!
Rather than pick a cutoff, we can look at all the cutoffs at once.
Here we look at what the first chunk is that is at least $SIM on cosine similarity to an end chunk. We look at when these occur and how many were from listeners.
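A sketch of that sweep (using the cosine_sim helper above; chunk_emb / end_emb are embeddings for a single game × tangram in time order, and chunk_info with repNum and role columns is a hypothetical companion table):

library(dplyr)
library(purrr)

first_over_threshold <- function(chunk_emb, end_emb, chunk_info,
                                 cutoffs = seq(0.5, 0.95, by = 0.05)) {
  # max similarity of each earlier chunk to any final-round chunk
  max_sim <- apply(cosine_sim(chunk_emb, end_emb), 1, max)
  map_dfr(cutoffs, function(cut) {
    idx <- which(max_sim >= cut)[1]   # first chunk (in time order) to cross the cutoff
    tibble(
      cutoff        = cut,
      repNum        = if (is.na(idx)) NA_real_ else chunk_info$repNum[idx],
      from_listener = if (is.na(idx)) NA else chunk_info$role[idx] == "listener"
    )
  })
}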
So, listeners are the originators ~15% of the time.
And a lot of “conventions” emerge fairly early.
Often there are multiple chunks used in a description. Which is given first versus later might be meaningful.
I think the way to look at this is which ordinal position a chunk is in, without conditioning on how many chunks the description has, because that’s too many combinations.
Could in future separate singletons from others.
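A sketch of the position coding (assuming rows are already in utterance order within each gameId × tangram × repNum group, and using the hypothetical names from above):

library(dplyr)

chunks_positioned <- chunks_typed %>%
  group_by(gameId, tangram, repNum) %>%
  mutate(
    n_in_description = n(),
    position = case_when(
      n_in_description == 1 ~ "only",    # singletons kept separate, per the note above
      row_number() == 1     ~ "first",
      TRUE                  ~ "later"
    )
  ) %>%
  ungroup()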
Still not a good visual, but we see that for abstract, it’s increasingly 1st and decreasingly later. (not distinguishing first and only)
Should somehow take into account how long the descriptions a chunk is part of generally are.
Basically the environments where different type chunks are found are different – some are more likely to be one-of-many in a description, and others more one-of-few or singletons.
And this probably interacts with the decrease in number of chunks over time.
We would really like to get some predictive traction on what chunks are likely to “stay” versus not (a rough sketch of one test follows the list below).
One might think that …
chunks with certain category labels are more likely to stick
chunks that are more dissimilar from descriptions used for other tangrams (by same person) are more likely to stick
chunks that are more dissimilar to descriptions from other groups are more likely to stick
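One rough way to test these jointly would be a chunk-level logistic regression, something like the sketch below (all names hypothetical: stuck would mean the chunk has a sufficiently similar match in the final round, and the dissim_* columns would be the dissimilarity measures described above):

# predict whether a chunk "sticks" from its category and how dissimilar it is
# to descriptions of other tangrams / from other groups (sketch only)
stick_model <- glm(
  stuck ~ type + dissim_other_tangrams + dissim_other_groups,
  family = binomial,
  data = chunk_predictors
)
summary(stick_model)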
An initial attempt at the dissimilarity predictors didn’t work (got a very messy result implying the opposite) – one notes that similarity to a later chunk and similarity to same-time chunks could be correlated just on the basis of generic similarity / not being in a dumb corner of SBERT space.
Another question is what “stay” means:
how do we operationalize the new/old-ness of a chunk?
implicitly we want a graph of chunk evolution/parenthood
could either be discrete and say a chunk is new if there is no previous chunk with similarity > SIM to it, and otherwise it’s a child of whichever previous chunk it has the highest similarity with???
probably want some continuous measure instead?
If we ignore discontinuities where a chunk skips a block, could look at mean and highest sims to the previous and to the next block (rough sketch after this list)? And could look at when sim-to-next > sim-to-previous?
this includes both speaker and listener chunks
want to give each thing a mean distinctiveness rating and then see if that has predictive value?
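A sketch of those per-chunk similarity summaries (reusing cosine_sim from above; emb_prev / emb_now / emb_next are embedding matrices for adjacent blocks of the same game × tangram, all hypothetical names):

library(tibble)

block_sims <- function(emb_prev, emb_now, emb_next) {
  to_prev <- cosine_sim(emb_now, emb_prev)
  to_next <- cosine_sim(emb_now, emb_next)
  tibble(
    mean_sim_prev = rowMeans(to_prev),
    max_sim_prev  = apply(to_prev, 1, max),
    mean_sim_next = rowMeans(to_next),
    max_sim_next  = apply(to_next, 1, max),
    # crude flag: chunk looks more like what comes next than what came before
    forward_leaning = apply(to_next, 1, max) > apply(to_prev, 1, max)
  )
}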
well, if I force a straight line, it has non-zero predictive value in the unexpected direction – things that are more similar to descriptions of other tangrams are also more similar to an end description. So we’re just getting “not in a silly corner of SBERT space”, not anything useful.
what if we try distinctive (within game measure) with types?
so whether we use the average or the max, the abstract and “other” categories are more different from what is used for other tangrams (which makes sense).
(not sure it’s interesting, but it at least makes sense)
Again, not entirely sure how to interpret.
less similar to other descriptions in same round? (less similar to descriptions of other tangrams? of other games?)
could consider a tSNE
There’s probably something to do for round to round in addition to the to-later and from-first
perhaps more unique chunks are more likely to stick? (is this independent of type?)
is there a way to look at drop-out rate
did we lose repNum somewhere in the sbert process?
Fix rechunking problem in a better way!!!! (when I split up chunks manually and chunk numbers got messed up)