The goal of this study is to test the ability of a variety of cross-situational word learning models to account for a wide range of experimental data. Many models of cross-situational word learning have been proposed, yet most have been tested on only a few experiments, often different ones from model to model, and compared against only one or two other models. Our goal is to scale up both the number of models evaluated and the number of experiment designs (and participants) modeled, recognizing that the plurality of models, and the theories and intuitions they represent, can only be winnowed down via rigorous comparison. We will find best-fitting parameters for associative models, hypothesis-testing models, and a few hybrid models, and will quantify generalization on held-out data as well as qualitatively on other experiments. Finally, we will interpret the results and discuss what they entail for theories of cross-situational word learning.
The modeled data are average accuracies from 726 word-object pairs in 44 experimental conditions, in which a total of 1696 subjects participated. Most of these data have been previously published:
We will first optimize each model's parameters with 5-fold cross-validation, fitting to 80% of the conditions and testing generalization on the held-out 20%. We will then optimize each model's parameters on the entire dataset, and use these parameters to test generalization to a selection of other experiments from the literature for which we do not have item-level accuracy.
Each condition consists of an ordered list of training trials, each presenting 1-4 words and 2-4 objects. We also require a test function, which typically presents each word a single time alongside an m-alternative forced choice (mAFC) array of objects, where m is the number of objects seen during training, though sometimes only a subset of the objects is tested.
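To make the input format concrete, the sketch below (in R, the language of our analyses) shows one way a condition could be represented; the field and object names are illustrative, not the actual data format:

```r
# Illustrative sketch (not the actual data format): one condition as an
# ordered list of training trials plus a test specification.
condition <- list(
  trials = list(
    list(words = c("mipen", "bosa"), objects = c("dog", "shovel")),
    list(words = c("bosa", "regli"), objects = c("shovel", "ball"))
    # ... one entry per trial, with 1-4 words and 2-4 objects each
  ),
  test = list(
    words = c("mipen", "bosa", "regli"),  # each word tested once
    m = 3  # mAFC: choose the referent from among m familiar objects
  )
)
```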
We fit the following associative models: Fazly [@fazly2010], Kachergis [@kachergis2012], …, and a baseline co-occurrence counting model. We also fit two hypothesis-testing models: guess-and-test [@trueswell2011; @blythe2010] and propose-but-verify [@trueswell2013]. Finally, we fit two hybrid models that store multiple, graded hypotheses, but only a subset of all possible associations: pursuit [@stevens2017] and a stochastic version of the Kachergis model. These models and their free parameters are described below.
In the propose-but-verify hypothesis-testing model [@trueswell2013], whenever a word is heard that has no remembered referent, one of the presented referents is selected at random as its hypothesis. The next time that word occurs, the previously proposed referent is remembered with probability \(\alpha\), a free parameter. If the remembered referent is verified to be present, the future probability of recalling it is increased by \(\epsilon\) (another free parameter). If the remembered referent is not present, the old hypothesis is assumed to be forgotten and a new proposal is selected from the available referents. This model implements trial-level mutual exclusivity by selecting new proposals only from among the referents that are not yet part of a hypothesis.
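A minimal sketch of this update rule for one training trial, assuming `words` and `objects` are character vectors and `hyp` is a named list holding each word's current proposal and its recall probability:

```r
# Illustrative propose-but-verify update for a single training trial.
# hyp: named list mapping each word to list(referent, recall);
# alpha: initial recall probability; eps: increment after verification.
pbv_update <- function(hyp, words, objects, alpha, eps) {
  for (w in words) {
    h <- hyp[[w]]
    if (!is.null(h) && runif(1) < h$recall && h$referent %in% objects) {
      h$recall <- min(1, h$recall + eps)  # verified: strengthen recall
      hyp[[w]] <- h
      next
    }
    # No proposal, recall failed, or proposal disconfirmed: propose anew,
    # excluding referents already hypothesized for other words (trial-level ME).
    taken <- unlist(lapply(hyp, `[[`, "referent"))
    free <- setdiff(objects, taken)
    pool <- if (length(free) > 0) free else objects
    hyp[[w]] <- list(referent = pool[sample(length(pool), 1)], recall = alpha)
  }
  hyp
}
```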
The guess-and-test hypothesis-testing model is based on the description given by @medina2011 of one-shot (i.e., "fast mapping") learning, which posits that "i) learners hypothesize a single meaning based on their first encounter with a word, ii) learners neither weight nor even store back-up alternative meanings, and iii) on later encounters, learners attempt to retrieve this hypothesis from memory and test it against a new context, updating it only if it is disconfirmed." In summary, guess-and-test learners do not reach a final hypothesis by comparing multiple episodic memories of prior contexts or multiple semantic hypotheses. We give this model two free parameters: a probability of successful encoding (\(s\), hypothesis formation) and a probability \(f\) of forgetting a hypothesis at retrieval. This model is quite similar to the guess-and-test model formally analyzed by @blythe2010 to determine the theoretical long-term efficacy of cross-situational word learning.
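A corresponding sketch, with `hyp` here a named list mapping each word to its single hypothesized referent; \(s\) and \(f\) are as defined above:

```r
# Illustrative guess-and-test update for a single training trial.
# s: probability of encoding a new guess; f: probability of forgetting at retrieval.
gat_update <- function(hyp, words, objects, s, f) {
  for (w in words) {
    h <- hyp[[w]]
    retrieved <- !is.null(h) && runif(1) >= f
    if (retrieved && h %in% objects) next  # hypothesis confirmed: keep it
    # No hypothesis, forgotten at retrieval, or disconfirmed: guess one
    # presented referent, encoding it with probability s (else nothing stored).
    if (runif(1) < s) {
      hyp[[w]] <- objects[sample(length(objects), 1)]
    } else {
      hyp[[w]] <- NULL
    }
  }
  hyp
}
```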
We will fit each model using five-fold cross-validation, leaving out 9 of the 44 conditions in each of the first four folds and 8 in the final fold. Each model's 2 or 3 free parameters will be optimized using differential evolution [@DEoptim], a global optimization algorithm that does not assume a differentiable fitness landscape (an assumption that may not hold here) and works well when there may be many local minima.
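As a sketch of this procedure, assuming a hypothetical `model_sse()` that simulates a model on a set of training conditions and returns the summed squared error against observed accuracies (the parameter bounds shown are placeholders):

```r
library(DEoptim)  # differential evolution optimizer, CRAN package DEoptim

# One parameter fit per cross-validation fold.
set.seed(1)
folds <- split(sample(44), rep(1:5, length.out = 44))  # fold sizes 9, 9, 9, 9, 8
best_params <- lapply(folds, function(test_idx) {
  train_idx <- setdiff(1:44, test_idx)
  fit <- DEoptim(
    fn = function(params) model_sse(params, conditions[train_idx]),
    lower = rep(1e-3, 3), upper = rep(5, 3),  # assumed bounds for 3 parameters
    control = DEoptim.control(itermax = 200, trace = FALSE)
  )
  fit$optim$bestmem  # best parameter vector found for this fold
})
```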
| Model | SSE | r |
|---|---|---|
| kachergis_sampling | 17.737 | 0.743 |
| fazly | 18.717 | 0.731 |
| kachergis | 18.922 | 0.723 |
| novelty | 22.016 | 0.672 |
| uncertainty | 23.059 | 0.692 |
| Bayesian_decay | 23.950 | 0.624 |
| rescorla-wagner | 31.795 | 0.618 |
| strength | 39.047 | 0.460 |
| trueswell2012 | 42.919 | 0.352 |
| guess-and-test | 43.009 | 0.322 |
| pursuit_detailed | 49.333 | 0.311 |
| Model | SSE | r |
|---|---|---|
| kachergis_sampling | 18.024 | 0.740 |
| fazly | 18.771 | 0.730 |
| kachergis | 19.066 | 0.721 |
| novelty | 22.014 | 0.671 |
| uncertainty | 23.016 | 0.694 |
| Bayesian_decay | 23.708 | 0.627 |
| rescorla-wagner | 31.794 | 0.618 |
| strength | 39.065 | 0.459 |
| trueswell2012 | 42.543 | 0.349 |
| guess-and-test | 43.475 | 0.290 |
| pursuit_detailed | 46.698 | 0.342 |
(Figure: model fits plotted by condition means.)

(Figure: model fits plotted by individual items.)
(Not yet done for the sampling models; we may only want to do this for a version with cross-validation.)
| Model | SSE | r |
|---|---|---|
| kachergis | 6.774 | 0.906 |
| Bayesian_decay | 7.541 | 0.894 |
| novelty | 7.793 | 0.891 |
| fazly | 8.275 | 0.884 |
| uncertainty | 13.375 | 0.826 |
| strength | 14.079 | 0.803 |
| rescorla-wagner | 42.234 | 0.632 |
Using the parameters optimized for each model over all of the conditions, we simulate model performance on a selection of published experiments and compare it to participants' group-level performance in these conditions. The first three studies tested adult participants, and include two experiments that have been interpreted as supporting hypothesis-testing theories of word learning.
One such study is Koehne, Trueswell & Gleitman (2013), in which each of 16 novel nouns was assigned two meanings with different co-occurrence frequencies: one referent was present whenever the noun was heard (six times; the 100% referent), while the other was present on only half of those occasions (three times; the 50% referent). All other objects co-occurred with a given noun only once (17%). Trial order was manipulated within participants at four levels, defined by whether trials containing the 50% referent (P, present) and trials lacking it (A, absent) were blocked (AAAPPP, PPPAAA) or interleaved (APAPAP, PAPAPA), and by whether a noun's first encounter was an A trial (AAAPPP, APAPAP) or a P trial (PPPAAA, PAPAPA).
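To make this design concrete, the sketch below reconstructs the four order conditions; the object names and the omission of the once-co-occurring filler objects are simplifications:

```r
# Illustrative reconstruction of the four order conditions: each noun occurs
# six times; P = trial includes the 50% referent, A = trial omits it.
orders <- list(
  blocked_A_first     = c("A", "A", "A", "P", "P", "P"),
  blocked_P_first     = c("P", "P", "P", "A", "A", "A"),
  interleaved_A_first = c("A", "P", "A", "P", "A", "P"),
  interleaved_P_first = c("P", "A", "P", "A", "P", "A")
)
# Objects present on one trial for a given noun (fillers omitted for brevity).
noun_trial <- function(noun, code) {
  objs <- paste0(noun, "_ref100")                           # always present
  if (code == "P") objs <- c(objs, paste0(noun, "_ref50"))  # on P trials only
  objs
}
trials_for_noun <- function(noun, cond) lapply(orders[[cond]], noun_trial, noun = noun)
```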
Medina et al. (2011) tasked adult participants with learning 12 nonce words across 60 training trials, each presenting a single word. The number of referents per trial varied: High Informative (HI) trials showed two referents, while Low Informative (LI) trials showed five. Unlike most cross-situational word learning experiments, each word corresponded not to a single referent but to a category of five referents, each appearing on a separate trial (e.g., "bosa" might appear with five different pictures of bears across five trials). Between-subjects conditions presented the 12 HI trials (one per word) at the beginning, middle, or end of training, or not at all (HI-absent).
Simulated accuracy of the kachergis model by condition and training vignette:

| Condition | Vignette 1 | Vignette 2 | Vignette 3 | Vignette 4 | Vignette 5 |
|---|---|---|---|---|---|
| HI first  | 0.083 | 0.055 | 0.138 | 0.160 | 0.175 |
| HI middle | 0.077 | 0.129 | 0.260 | 0.278 | 0.286 |
| HI last   | 0.077 | 0.129 | 0.180 | 0.190 | 0.254 |
| HI absent | 0.077 | 0.129 | 0.180 | 0.190 | 0.203 |
Yu, Zhong, and Fricker (2012) pre-trained adult participants with three word-object associations and found higher performance on the other 15 words after this pre-training, demonstrating that knowing even a few word meanings can improve learning for other co-occurring words.
Simulated accuracy for the pre-trained words versus the other words:

```
## pretrain    other
## 0.819675 0.527990
## [1] 0.01263725
## pretrain    other
##  0.82864  0.61774
## pretrain    other
## 0.849655 0.703090
```
Suanda et al. (2014) varied contextual diversity in a cross-situational word learning study of 5- to 7-year-olds, presenting children with 8 to-be-learned words, shown two per trial across a total of 16 trials. (Test format: 2AFC?)
- 12 word-object pairs, 2 pairs/trial, 36 trials; 6 pairs massed, 6 interleaved; 2AFC testing.
- 6 word-object pairs, 2 pairs/trial, 30 trials.