Analyses just of this

How often is the highest-likelihood label the correct one?

Split by tangram. We know that tangrams vary in codeability.
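A minimal sketch of this measure, overall and split by tangram. The record layout (a `target` label plus a `probs` dictionary over candidate labels) and the toy values are assumptions for illustration, not the actual data format.

```python
from collections import defaultdict

# Hypothetical records, one per utterance: the true tangram and the model's
# probability for each candidate label. Values are made up for illustration.
predictions = [
    {"target": "A", "probs": {"A": 0.7, "B": 0.2, "C": 0.1}},
    {"target": "A", "probs": {"A": 0.3, "B": 0.6, "C": 0.1}},
    {"target": "B", "probs": {"A": 0.1, "B": 0.8, "C": 0.1}},
    {"target": "C", "probs": {"A": 0.5, "B": 0.3, "C": 0.2}},
]

def top_label(probs):
    """Return the label with the highest probability."""
    return max(probs, key=probs.get)

def top1_accuracy(records):
    """Fraction of records whose highest-probability label is the target."""
    hits = sum(top_label(r["probs"]) == r["target"] for r in records)
    return hits / len(records)

def top1_accuracy_by_tangram(records):
    """Top-1 accuracy computed separately for each target tangram."""
    groups = defaultdict(list)
    for r in records:
        groups[r["target"]].append(r)
    return {tangram: top1_accuracy(rs) for tangram, rs in groups.items()}
```

Grouping by the target tangram keeps the per-tangram denominators explicit, which matters if some tangrams have far fewer utterances than others.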

By probability assigned

An alternative is to look at how much probability the model assigned to the correct answer.
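This graded measure could be sketched as follows. The record layout and the toy values are again assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical records with made-up probabilities (same assumed layout:
# a true label plus the model's probability for each candidate label).
predictions = [
    {"target": "A", "probs": {"A": 0.7, "B": 0.2, "C": 0.1}},
    {"target": "A", "probs": {"A": 0.3, "B": 0.6, "C": 0.1}},
    {"target": "B", "probs": {"A": 0.1, "B": 0.8, "C": 0.1}},
    {"target": "C", "probs": {"A": 0.5, "B": 0.3, "C": 0.2}},
]

def correct_prob(record):
    """Probability the model assigned to the true label (0 if it is missing)."""
    return record["probs"].get(record["target"], 0.0)

def mean_correct_prob(records):
    """Average probability on the true label across records."""
    return sum(correct_prob(r) for r in records) / len(records)

def mean_correct_prob_by_tangram(records):
    """Average probability on the true label, per target tangram."""
    groups = defaultdict(list)
    for r in records:
        groups[r["target"]].append(correct_prob(r))
    return {tangram: sum(ps) / len(ps) for tangram, ps in groups.items()}
```

Unlike top-1 accuracy, this gives partial credit when the correct label is a close second.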

Confusion matrices

Of top option.

Of probability mass.
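The two kinds of confusion matrix can be sketched side by side. Rows are true tangrams and columns are predicted labels; the record layout and toy values are assumptions for illustration.

```python
# Hypothetical records with made-up probabilities.
predictions = [
    {"target": "A", "probs": {"A": 0.7, "B": 0.2, "C": 0.1}},
    {"target": "A", "probs": {"A": 0.3, "B": 0.6, "C": 0.1}},
    {"target": "B", "probs": {"A": 0.1, "B": 0.8, "C": 0.1}},
    {"target": "C", "probs": {"A": 0.5, "B": 0.3, "C": 0.2}},
]
labels = ["A", "B", "C"]

def confusion_top(records, labels):
    """Counts of (true label, top predicted label) pairs."""
    matrix = {t: {p: 0 for p in labels} for t in labels}
    for r in records:
        top = max(r["probs"], key=r["probs"].get)
        matrix[r["target"]][top] += 1
    return matrix

def confusion_mass(records, labels):
    """Mean probability mass on each label, per true label (rows sum to 1)."""
    matrix = {t: {p: 0.0 for p in labels} for t in labels}
    counts = {t: 0 for t in labels}
    for r in records:
        counts[r["target"]] += 1
        for p in labels:
            matrix[r["target"]][p] += r["probs"].get(p, 0.0)
    for t in labels:
        if counts[t]:  # leave rows with no records as all zeros
            for p in labels:
                matrix[t][p] /= counts[t]
    return matrix
```

The count-based matrix shows which tangrams win outright, while the mass-based one also captures near misses that never become the top option.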

Compare with people

We want to know how the model compares qualitatively to humans, i.e. whether they agree on which cases are harder and which are easier.

We could look at this in various ways, but the cleanest comparison uses the naive human guessing data that we have.

Each point is one of the 12 conditions (round 1/6 x 2/6 person x rotate/thin/thick).

This is slightly unfair in some ways, since the model and the human guessers might be seeing different subsets of the data.

The model sees the data on a per-utterance basis, while humans see it on a per-transcript basis. If comparison is what we care about, it may make sense in future to show the model something more like what the people see.
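One way to quantify the per-condition agreement is a correlation across the 12 condition points. The sketch below assumes per-condition accuracies are already available as dictionaries keyed by (round, group size, condition); the function names and that layout are hypothetical, and Pearson correlation is just one possible summary.

```python
from itertools import product
from math import sqrt

# The 12 conditions: round (1 or 6) x group size (2 or 6) x condition type.
conditions = list(product([1, 6], [2, 6], ["rotate", "thin", "thick"]))

def pearson(xs, ys):
    """Pearson correlation; undefined (raises) if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def compare(model_acc, human_acc):
    """Correlate accuracies over the conditions present in both dictionaries."""
    shared = [c for c in conditions if c in model_acc and c in human_acc]
    return pearson([model_acc[c] for c in shared],
                   [human_acc[c] for c in shared])
```

Restricting to shared conditions avoids silently pairing a model score with a missing human score.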

Taking only the first utterance

We assume the first utterance is the most contentful, and that later ones may be more about answering questions or adding details.
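Selecting the first utterance per trial could look like the sketch below. The fields `game`, `trial`, and `index` (position within the trial) and the example texts are hypothetical.

```python
# Hypothetical utterance records; texts and field names are made up.
utterances = [
    {"game": "g1", "trial": 1, "index": 0, "text": "looks like a dancer"},
    {"game": "g1", "trial": 1, "index": 1, "text": "with the arms up"},
    {"game": "g1", "trial": 2, "index": 0, "text": "the rabbit"},
]

def first_utterances(records):
    """Keep only the earliest utterance in each (game, trial)."""
    first = {}
    for u in records:
        key = (u["game"], u["trial"])
        if key not in first or u["index"] < first[key]["index"]:
            first[key] = u
    return list(first.values())
```

Comparing on `index` rather than relying on input order keeps the result stable if the records are shuffled.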

Taking only singleton utterances

This has less data, especially in some conditions, but is the most comparable.
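A sketch of the singleton filter, keeping only trials that contain exactly one utterance. The field names and example texts are hypothetical.

```python
from collections import Counter

# Hypothetical utterance records; texts and field names are made up.
utterances = [
    {"game": "g1", "trial": 1, "index": 0, "text": "looks like a dancer"},
    {"game": "g1", "trial": 1, "index": 1, "text": "with the arms up"},
    {"game": "g1", "trial": 2, "index": 0, "text": "the rabbit"},
]

def singleton_utterances(records):
    """Keep utterances from trials that contain exactly one utterance."""
    counts = Counter((u["game"], u["trial"]) for u in records)
    return [u for u in records if counts[(u["game"], u["trial"])] == 1]
```

Counting the kept records per condition afterwards would show where the data gets thin.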