Goal and Method

We have a lot of language data from tangrams experiments. My goal here is to take a finer-grained, quantitative look at the language used, to get a handle on how language changes over the course of a game, with a focus on which description elements stay and which go.

Method

The sample is the 4-player rotate games (~20 games total).

Chunking

Transcripts were “chunked” to extract each descriptive phrase (roughly). This gets rid of game talk and most hedges, and separates long multi-part descriptions into their parts.

This was done with gpt-4 using the prompt: “Here’s a partial transcript of people describing images. Extract verbatim a list of the descriptive phrases that are used.
If there are no descriptive phrases used, return an empty list.
As an example, if the transcript was ‘It looks like a magician, and, uhh, I think he’s got a rabbit.’, the response would be [‘a magician’,‘he’s got a rabbit’].
As an example, if the transcript was ‘big triangle arm facing left head on the right’, the response would be [‘big triangle arm’, ‘facing left’, ‘head on the right’].
Return just a list of the descriptive phrases. Here’s the text:”
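
(For reference, a minimal sketch of how each transcript might be sent off, assuming the OpenAI chat completions endpoint via httr; the function name and request shape are illustrative, not the actual pipeline code.)

library(httr)

chunk_prompt <- "<the full prompt text above>"  # paste the prompt verbatim here

get_chunks <- function(transcript, api_key = Sys.getenv("OPENAI_API_KEY")) {
  resp <- POST(
    "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", api_key)),
    body = list(
      model = "gpt-4",
      messages = list(list(role = "user",
                           content = paste(chunk_prompt, transcript)))
    ),
    encode = "json"
  )
  # the reply is a bracketed list of phrases as text; parsed downstream
  content(resp)$choices[[1]]$message$content
}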

Chunks were checked and corrected to ensure they were substrings of the transcript (modulo spelling normalization). Chunks were added, removed, or split when I noticed problems.
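
A minimal sketch of the substring check, assuming a data frame with one row per chunk and a column holding the source transcript (column names are illustrative):

library(dplyr)
library(stringr)

normalize <- function(x) {
  x %>% str_to_lower() %>% str_replace_all("[^a-z ]", " ") %>% str_squish()
}

chunk_df <- chunk_df %>%
  mutate(is_substring = str_detect(normalize(transcript), fixed(normalize(chunk))))

# rows with is_substring == FALSE were the ones inspected and fixed by hand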

Tagging

Chunks were hand-tagged (twice, plus adjudication) for being “abstract”. Chunks were regex-tagged for “body”, “posture”, “position”, and “shape”. Each chunk could have 0 or more labels.

TODO: are there more canonical (not ad hoc) classes for posture, shape, etc. that I could use?

regexes used:

detect_body <- str_c("\\b(face|head|heads|back|shoulder|shoulders|",
                     "arm|arms|leg|legs|foot|feet|body|knee|knees|",
                     "toe|toes|hand|hands|body|butt|heel|heels|ear|ears|",
                     "nose|neck|chest|hair)\\b")

detect_shape <- str_c("squar|triangle|triangular|diamond|shape|",
                      "trapez|angle|degree|parallel|rhomb|box|cube")

detect_position <- str_c("right|left|above|below|under|over|top|bottom|behind|side|beneath")

detect_posture <- str_c("kick|crouch|squat|kneel|knelt|stood|",
                        "stand|sit|sat|lying|walk|facing|fall|looking|",
                        "lean|seat|laying")

Repeat SBERT analyses

Here we take advantage of having cleaner descriptions (without hedges and filler) to rerun the content analyses done previously.

We see the trifecta of expected effects, yay!

Number of chunks over time

As one measure of “reduction”, we can look at how the number of chunks per description changes over time.
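
A minimal sketch of that count, assuming per-chunk rows with game, repetition, and tangram identifiers (column names are illustrative):

library(dplyr)
library(ggplot2)

chunk_counts <- chunk_df %>%
  group_by(gameId, repNum, tangram) %>%
  summarize(n_chunks = n(), .groups = "drop")

ggplot(chunk_counts, aes(repNum, n_chunks)) +
  stat_summary(fun = mean, geom = "line") +
  labs(x = "repetition", y = "mean chunks per description")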

The number of chunks declines over time!

TODO: could also look at chunk lengths and try to tie this to performance, but mostly people get things right…

Chunk type examination

Want to check what the chunk type tagging is doing.

What isn’t getting labelled into any category?

What labels / combinations are common?

Look at number of chunks of each type over time

We might expect that abstract becomes more common over time while “lower level” types decrease.

For this, we impose a category hierarchy of abstract > posture > body > shape > position for tie-breaking chunks with multiple labels (so a chunk counts as the first category in that list that applies).
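
A sketch of the tie-breaking, assuming logical label columns as in the tagging sketch above (the “other” bucket catches chunks with no label):

library(dplyr)

chunk_df <- chunk_df %>%
  mutate(primary_type = case_when(
    abstract ~ "abstract",
    posture  ~ "posture",
    body     ~ "body",
    shape    ~ "shape",
    position ~ "position",
    TRUE     ~ "other"
  ))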

Abstract mostly stays constant; the other categories, especially “body”, decrease.

Tie to performance

Try splitting descriptions by whether everyone got the tangram right or at least one person got it wrong.

We find that descriptions that someone gets wrong tend to be longer; not sure how to disentangle causality here. Could try lagging? (Note that this is low-feedback and rotate, so people don’t have a great sense of what worked or not.)
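
One way the lagging idea could look, as a sketch: does being gotten wrong on one repetition predict longer descriptions on the next? (correct_all and n_chunks are illustrative names for per-trial accuracy and description length.)

library(dplyr)

trial_df %>%
  arrange(gameId, tangram, repNum) %>%
  group_by(gameId, tangram) %>%
  mutate(prev_all_correct = lag(correct_all)) %>%
  ungroup() %>%
  filter(!is.na(prev_all_correct)) %>%
  group_by(prev_all_correct) %>%
  summarize(mean_chunks = mean(n_chunks), .groups = "drop")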

Per tangram

Tangrams vary a lot in how fast the chunks decrease and what types of descriptions are used.

Chunk-to-chunk SBERT

One issue with the prior SBERT analyses is that we can’t easily distinguish the emergence of conventions from the dropping of other descriptions or of extra non-referential verbiage.

A lot of the analyses we care about want some form of chunk-to-chunk similarity, so we can look at what stays or drops by comparing to the end (or at distinctiveness, etc.).

When do “conventions” emerge?

We could look at the emergence of “conventions” by specifying what conventions are (e.g., last-round chunks) and then looking for when a chunk of at least YY similarity occurs. The big question is what cutoff to use for similarity!

Rather than pick a cutoff, we can look at all the cutoffs at once.

Here we look at the first chunk that is at least $SIM in cosine similarity to an end chunk. We look at when these occur and how many came from listeners.
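
A sketch of the cutoff sweep, assuming SBERT embeddings are stored in a list-column per chunk (all data frame and column names are illustrative):

library(dplyr)
library(purrr)

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# sim_to_end: max cosine similarity from each chunk to any last-block chunk
# for the same game x tangram
chunk_df <- chunk_df %>%
  group_by(gameId, tangram) %>%
  mutate(sim_to_end = map_dbl(embedding, function(emb) {
    end_embs <- embedding[repNum == max(repNum)]
    max(map_dbl(end_embs, ~ cosine(emb, .x)))
  })) %>%
  ungroup()

# for each cutoff, the first chunk that clears it, per game x tangram
first_matches <- map_dfr(seq(0.5, 0.9, by = 0.1), function(cutoff) {
  chunk_df %>%
    filter(sim_to_end >= cutoff) %>%
    group_by(gameId, tangram) %>%
    slice_min(repNum, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    mutate(cutoff = cutoff)
})
# then tabulate when these occur and what fraction came from listeners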

So, listeners are the originators ~15% of the time.

And a lot of “conventions” emerge fairly early.

What types of labels are most like the end?

Ordering

Often there are multiple chunks used in a description. Which is given first versus later might be meaningful.

I think the way to look at this is the chunk’s ordinal position within the description, without conditioning on the total number of chunks, because that’s too many cells.

Could in future separate singletons from others.
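
A sketch of the position coding (same illustrative column names as above):

library(dplyr)

chunk_df <- chunk_df %>%
  group_by(gameId, repNum, tangram) %>%
  mutate(position_in_desc = row_number(),
         n_in_desc = n(),
         is_first = position_in_desc == 1,
         is_singleton = n_in_desc == 1) %>%
  ungroup()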

Still not a good visual, but we see that abstract chunks are increasingly first and decreasingly later (not distinguishing first from only).

Should somehow take into account how long a description it is generally part of.

Basically, different chunk types occur in different environments: some are more likely to be one of many chunks in a description, others one of few or singletons.

And this probably interacts with the decrease in number of chunks over time.

Stickiness

We would really like to get some predictive traction on what chunks are likely to “stay” versus not.

One might think that …

An initial attempt at this didn’t work (it got a very messy result implying the opposite); note that similarity to a later chunk and similarity to same-time chunks could be correlated simply through generic similarity, i.e., not being in a silly corner of SBERT space.

Another question is what “stay” means:

If we ignore discontinuities where a chunk skips a block, we could look at mean and highest similarities to the previous and to the next block, and at when similarity to the next is greater than similarity to the previous.
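
A sketch of one version of that, assuming per-chunk sim_prev / sim_next columns (the highest cosine similarity to any chunk for the same game x tangram in the previous / next block, computed from the embeddings as in the sketches above):

library(dplyr)

chunk_df %>%
  mutate(more_like_next = sim_next > sim_prev) %>%
  group_by(primary_type) %>%
  summarize(prop_more_like_next = mean(more_like_next, na.rm = TRUE),
            .groups = "drop")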

Distinctiveness

This includes both speaker and listener chunks.

We want to give each chunk a mean distinctiveness rating and then see whether that has predictive value.
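
A sketch of such a rating, reusing cosine() from above and an embedding list-column (names illustrative); here distinctiveness is 1 minus the mean similarity to chunks for other tangrams in the same game:

library(dplyr)

distinct_within_game <- function(embeddings, tangrams) {
  sapply(seq_along(embeddings), function(i) {
    others <- embeddings[tangrams != tangrams[i]]
    1 - mean(sapply(others, function(e) cosine(embeddings[[i]], e)))
  })
}

chunk_df <- chunk_df %>%
  group_by(gameId) %>%
  mutate(distinctiveness = distinct_within_game(embedding, tangram)) %>%
  ungroup()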

Well, if I force a straight line, it has non-zero predictive value in the unexpected direction: chunks that are more similar to descriptions of other tangrams are also more similar to an end description. So we’re just getting “not in a silly corner of SBERT space”, not anything useful.

What if we try distinctiveness (as a within-game measure) broken down by type?

So whether we use the average or the most similar chunk, the abstract and “other” categories are more different from what is used for other tangrams (which makes sense).

(not sure it’s interesting, but it at least makes sense)

Divergence between games

Again, not entirely sure how to interpret.

What labels are stickiest?

Does distinctiveness predict stickiness?

There’s probably something to do for round-to-round similarity in addition to the to-later and from-first comparisons.

Do any similarities predict stickiness?

What analyses would we want? (TODO: implement)

TODO notes for Veronica