Thoughts for later

Analyses of just CLIP results

We have results from 3 models (ft, pt_base, pt_large).

Of highest likelihood option

How often is the highest likelihood label the correct one?

In the 25-40% range; there are some intriguing potential patterns, but they could be noise.
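For bookkeeping, here is a minimal sketch of how this is computed. All names are hypothetical: assume a long data frame `clip_long` with one row per model x utterance x candidate tangram, holding the probability the model assigned to that candidate.

```r
library(dplyr)

# Keep each model's highest-likelihood candidate per utterance,
# then check how often that candidate is the true target.
clip_top1 <- clip_long |>
  group_by(model, utt_id) |>
  slice_max(prob, n = 1, with_ties = FALSE) |>
  ungroup() |>
  mutate(correct = candidate == target)

clip_top1 |>
  group_by(model) |>
  summarise(top1_acc = mean(correct))
```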

Highest likelihood by tangram

Split by tangram.

We know that tangrams vary in codeability.

Tangrams vary widely in model performance; they also vary in which round the model does best on.
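A sketch of the split, reusing the hypothetical `clip_top1` frame from the previous sketch (one row per model x utterance, assuming it also carries the `target` tangram and the `round`):

```r
# Top-1 accuracy per model x tangram
acc_by_tangram <- clip_top1 |>
  group_by(model, target) |>
  summarise(acc = mean(correct), n = n(), .groups = "drop")

# For each model x tangram, the round with the highest accuracy
best_round <- clip_top1 |>
  group_by(model, target, round) |>
  summarise(acc = mean(correct), .groups = "drop") |>
  group_by(model, target) |>
  slice_max(acc, n = 1, with_ties = FALSE)
```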

By probability assigned

An alternative is to look at how much probability the correct answer got.

This mostly tracks the above, which makes sense.
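A sketch on the same hypothetical `clip_long` frame: instead of taking the argmax, just pull out the probability the model put on the true target.

```r
# Mean probability assigned to the correct tangram, per model and tangram
prob_on_target <- clip_long |>
  filter(candidate == target) |>
  group_by(model, target) |>
  summarise(mean_prob = mean(prob), .groups = "drop")
```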

Confusion matrices

Of top option.

Of probability mass.
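Two corresponding sketches on the hypothetical frames above: a hard confusion matrix from the top option, and a soft one from average probability mass.

```r
library(tidyr)

# Hard: rows = true tangram, columns = top-option guess, cells = counts
conf_top <- clip_top1 |>
  count(model, target, candidate) |>
  pivot_wider(names_from = candidate, values_from = n, values_fill = 0)

# Soft: rows = true tangram, columns = candidate, cells = mean probability assigned
conf_mass <- clip_long |>
  group_by(model, target, candidate) |>
  summarise(mean_prob = mean(prob), .groups = "drop") |>
  pivot_wider(names_from = candidate, values_from = mean_prob)
```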

Compare error patterns among CLIP models
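The output below comes from the standard `cor.test()` calls shown here; `wide` is presumably one row per item with one column of scores per model (`ft`, `pt_base`, `pt_large`), though whether those columns hold top-1 correctness or probability on the target is not restated here.

```r
# Pairwise Pearson correlations between the three models' per-item scores
cor.test(wide$ft, wide$pt_base)
cor.test(wide$ft, wide$pt_large)
cor.test(wide$pt_base, wide$pt_large)
```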

```
## 
##  Pearson's product-moment correlation
## 
## data:  wide$ft and wide$pt_base
## t = 75.812, df = 37670, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3550403 0.3725634
## sample estimates:
##      cor 
## 0.363834
## 
##  Pearson's product-moment correlation
## 
## data:  wide$ft and wide$pt_large
## t = 96.984, df = 37670, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4388766 0.4550379
## sample estimates:
##       cor 
## 0.4469937
## 
##  Pearson's product-moment correlation
## 
## data:  wide$pt_base and wide$pt_large
## t = 119.97, df = 37670, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5184261 0.5330398
## sample estimates:
##       cor 
## 0.5257717
```

Compare with tg-matcher results

(Always comparing to the FT model.) Basically, we want to know how the model qualitatively compares to humans, i.e. whether there is alignment on which items are harder and which are easier.

We could look at this in various ways, but the cleanest comparison uses the naive human guessing data.

Each point is one of the 12 conditions (round 1/6 x 2/6 person x rotate/thin/thick).
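A sketch of this comparison. Other than the condition structure, everything here is hypothetical: `ft_top1` is the FT model's per-utterance correctness (as in the earlier sketch, restricted to the FT model) and `tg_matcher` is the naive human guessing data, one row per human guess, both with condition columns `round`, `group_size`, and `style`.

```r
library(dplyr)
library(ggplot2)

model_by_cond <- ft_top1 |>
  group_by(round, group_size, style) |>   # round 1/6, 2/6 person, rotate/thin/thick
  summarise(model_acc = mean(correct), .groups = "drop")

human_by_cond <- tg_matcher |>
  group_by(round, group_size, style) |>
  summarise(human_acc = mean(correct), .groups = "drop")

both <- inner_join(model_by_cond, human_by_cond,
                   by = c("round", "group_size", "style"))

# One point per condition (12 total)
ggplot(both, aes(human_acc, model_acc)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

cor.test(both$human_acc, both$model_acc)
```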

The model is very bad at ice skater?

This is slightly unfair since the model and the humans might be seeing different subsets.

The model sees descriptions on a per-utterance basis, while humans see them on a per-transcript basis. If comparison is what we care about, it may make sense in the future to show the model something more like what the people see.

Taking only the first utterance

Assume the first utterance is the most contentful, and that later ones are more about addressing questions or adding details.

Taking only singleton utterances

This has less data, especially in some conditions, but is the most comparable.
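Both of these restrictions (first utterance only, and singleton transcripts only) are simple filters; a sketch, with `ft_utts` entirely hypothetical (one row per utterance, carrying a game/trial identifier and an utterance index):

```r
library(dplyr)

# (a) Only the first utterance of each transcript
first_utts <- ft_utts |>
  group_by(gameId, trial) |>
  filter(utt_index == min(utt_index)) |>
  ungroup()

# (b) Only transcripts that consist of a single utterance
singletons <- ft_utts |>
  group_by(gameId, trial) |>
  filter(n() == 1) |>
  ungroup()
```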

Comparisons with mpt accuracies

Again, only using FT model.

The same caveats as in the previous comparison with people apply. The model’s error pattern does not seem particularly correlated with the human error pattern.

Could consider doing within-tangram analyses, utterance by utterance, or something along those lines?
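One way to make the tangram-level version of this concrete, as a sketch; `mpt` (human matcher accuracy data) and all column names are hypothetical:

```r
library(dplyr)

model_by_tangram <- ft_top1 |>
  group_by(tangram) |>
  summarise(model_acc = mean(correct))

mpt_by_tangram <- mpt |>
  group_by(tangram) |>
  summarise(mpt_acc = mean(correct))

joined <- inner_join(model_by_tangram, mpt_by_tangram, by = "tangram")
cor.test(joined$model_acc, joined$mpt_acc)
```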

Comparison with kilogram naming divergence

(only using FT model)

* part naming divergence (PND): “PND is computed identically to SND, but with the concatenation of all part names of an annotation as the input text”
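A parallel sketch against kilogram’s PND scores, reusing `model_by_tangram` from the previous sketch; `kilogram_pnd` (one row per tangram with its PND) is hypothetical:

```r
# Correlate per-tangram FT accuracy with part naming divergence
joined <- inner_join(model_by_tangram, kilogram_pnd, by = "tangram")
cor.test(joined$model_acc, joined$pnd)
```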