Proposing a Developmental Multimodal Benchmark

Author: Alvin W.M. Tan

Published: March 9, 2024

The objective of this proposal is to construct a developmentally appropriate benchmark suite for vision–language machine learning (ML) models, permitting comparison between model and human learning trajectories. Such a suite would allow ML models to be evaluated as candidate cognitive models of human learning, enabling an analysis of the similarities and differences between models and humans in their learning patterns and processes. Such an analysis can be used both to better understand early child development and to construct more data-efficient ML models.

Motivation

Recent developments in vision–language models (VLMs) have attempted to train models on developmentally realistic quantities of input data (Wang et al. 2023; Vong et al. 2024). These models are claimed to have learnt representations for both input streams that are effective and well aligned across modalities, as supported by generalisation tests on proximal or distal test sets. However, one fundamental problem is that there is little indication of how children themselves would perform on such tests, so it is difficult to estimate the ability level of a trained VLM; even though the generalisation tests yield quantitative metrics, these are interpreted through a qualitative lens.

To demonstrate why this is problematic, we conduct a re-evaluation of the Child’s View for Contrastive Learning (CVCL) model from Vong et al. (2024). Here, we use a manually curated, labelled subset of the SAYCam data (Sullivan et al. 2021), which is itself a superset of the training data used by Vong et al. (2024). Specifically, SAYCam comprises data from three children (S, A, and Y), and Vong et al. (2024) used only the S data for training. Our labelled SAYCam dataset contains 26 image categories of 5 images each, sampled across S, A, and Y (82, 34, and 14 images respectively; the uneven sampling is due to the nonuniform distribution of the target object categories). These image categories were all present in the vocabulary of the language input for child S, ensuring that they could, in principle, have been learnt during model training. Furthermore, the test images are of similar visual quality to those used during training, although the A and Y subsets may be slightly more out-of-domain for CVCL since they capture a different set of visual experiences. Hence, if the CVCL model has learnt robust representations, we would expect performance of around 61.6% (as observed by Vong et al. (2024)) on the S subset, and slightly lower but above-chance performance on the A and Y subsets. We use the same testing setup as Vong et al. (2024), randomly grouping the images into 4-alternative forced choice trials (four images : one word).
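To make the evaluation procedure concrete, below is a minimal sketch of how such 4-alternative forced choice trials can be constructed and scored for top-1 accuracy. The `encode_image` and `encode_text` methods are placeholders for whatever interface a given VLM (e.g., CVCL) exposes, and the trial construction here is an illustrative assumption rather than the exact sampling scheme used by Vong et al. (2024).

```python
import random
import numpy as np

def evaluate_4afc(model, images_by_category, n_trials_per_category=5, seed=0):
    """Score a model on 4-alternative forced choice trials (four images : one word).

    `model` is assumed to expose `encode_image(image)` and `encode_text(word)`,
    each returning a 1-D embedding; these names are placeholders for the
    actual VLM interface.
    """
    rng = random.Random(seed)
    categories = list(images_by_category)
    correct = []

    for target in categories:
        for _ in range(n_trials_per_category):
            # One target image plus three distractor images from other categories.
            target_img = rng.choice(images_by_category[target])
            distractor_cats = rng.sample([c for c in categories if c != target], 3)
            candidates = [target_img] + [rng.choice(images_by_category[c]) for c in distractor_cats]

            # Cosine similarity between the word embedding and each candidate image.
            text_emb = model.encode_text(target)
            img_embs = np.stack([model.encode_image(img) for img in candidates])
            sims = img_embs @ text_emb / (
                np.linalg.norm(img_embs, axis=1) * np.linalg.norm(text_emb) + 1e-8
            )

            # Top-1 accuracy: index 0 is the target by construction.
            correct.append(int(np.argmax(sims) == 0))

    return float(np.mean(correct))  # chance level is 0.25
```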

Here we plot the top-1 accuracies for each object category and each child, with dashed lines indicating chance level (25% for a 4-alternative forced choice trial). Notably, performance on some object categories (e.g., cat, chair) is much worse in our evaluation than in the paper, and overall performance on the S subset (which should be the highest) is at chance.

Accuracies by category on labelled SAYCam

Accuracies by child on labelled SAYCam

We then plot the choice probabilities directly, as a way of better capturing the variability across categories and children. Here the problem is more apparent and drastic: the model performs effectively at chance on almost all categories and for all children, a fact that is obscured when only top-1 accuracy is used.

Choice probabilities by category on labelled SAYCam

Choice probabilities by child on labelled SAYCam
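A choice probability here can be read as the normalised score the model assigns to the target image within a trial. Below is a minimal sketch, assuming softmax normalisation of the similarities with an arbitrary temperature; it shows why near-chance performance can be hidden by top-1 accuracy.

```python
import numpy as np

def choice_probability(sims, target_index=0, temperature=1.0):
    """Convert image-word similarities from one 4-AFC trial into a choice
    probability for the target image via a softmax.

    Unlike top-1 accuracy, this retains graded information: a model that is
    barely above chance yields probabilities near 0.25 rather than a hard 0/1.
    The temperature is an assumption; the appropriate value depends on how the
    model's similarities are scaled (e.g., a learnt logit scale).
    """
    sims = np.asarray(sims, dtype=float) / temperature
    probs = np.exp(sims - sims.max())          # numerically stable softmax
    probs /= probs.sum()
    return probs[target_index]                 # chance level is 1 / len(sims)

# Example: near-identical similarities give a probability close to chance (0.25),
# even though the argmax (top-1 accuracy) would still count this trial as correct.
print(choice_probability([0.31, 0.30, 0.29, 0.30]))
```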

One possible reason for the disparity in results is that many of the test set images in Vong et al. (2024) have low-level visual cues (e.g., colour, luminance) similar to those of training examples (see, for example, their Figure 4C). The internal heterogeneity of each category may also vary across categories, such that categories with many images similar to training examples would drive up accuracy values. Our hand-picked examples avoid such issues by intentionally including only visually distinct examples, making each test item more independent of the other items (at the cost, of course, of having far fewer items).

We also observe a similar spread of results when using a looking-while-listening paradigm (Fernald et al. 2008) with stimuli from Frank et al. (2016). This is a two-alternative forced choice task (two images : one word). Because there is a single trial per category, we report choice probabilities, which are notably all close to the chance level of 0.5.

Choice probabilities by category on looking-while-listening

Significance

ML research has been advanced by simultaneous progress in at least four directions: (1) training data, (2) model architectures, (3) training objectives, and (4) evaluation benchmarks. In the developmental VLM world, training data has been improved through the collection of medium-to-large-scale naturalistic datasets (Long et al. 2022; Sullivan et al. 2021), allowing for developmentally realistic training. There has also been continued progress on the architectural front, with newer models, including vision transformers, proving useful in extracting information from the training data. Contrastive learning and other unsupervised methods have emerged as effective objectives that are nonetheless developmentally plausible (Zhuang et al. 2021). However, there is a notable lack of effective evaluation benchmarks, which limits our ability to draw conclusions from the performance of developmental VLMs. Hence, we propose the construction of a small benchmark suite that tests multiple aspects of vision–language understanding in VLMs and that has associated data collected from children, providing an appropriate cognitive evaluation of developmental VLMs. Specifically, this benchmark suite will allow us to compare the growth trajectories of VLMs (over training) with those of children (over maturation), giving us insight into the learning processes of machines and humans (rather than simply comparing endpoints).

Benchmark

We propose a benchmark comprising at least three components: lexicon, syntax, and semantics. Additional components are possible, but here we focus on these three, which are readily applicable to children. There are a few desiderata for this benchmark:

  1. Tasks within the benchmark should be suitable for both VLMs and children. This allows for maximum comparability using a similar paradigm, and for the collection of benchmark data from children.
  2. Tasks within the benchmark should be applicable across a range of VLM architectures. Notably, many VLMs do not have text generation capacity, and hence we focus on matching tasks (i.e., identifying the best choice for a prompt among a set of options), a capability that most VLMs have; a minimal sketch of such a task interface follows this list.
  3. Tasks within the benchmark should cover a range of levels of representation. Such coverage would allow for a more comprehensive and systematic analysis of VLMs.
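As a concrete illustration of the second desideratum, here is a minimal sketch of a matching-task interface that could be shared across benchmark components. The `MatchingTrial` structure and the `model.match` method are hypothetical names rather than an existing API; the triplet task (described under Semantics) would need only a minor variant without a prompt.

```python
import numpy as np
from dataclasses import dataclass
from typing import List

@dataclass
class MatchingTrial:
    """One benchmark trial: a prompt (word or sentence) and k candidate images,
    exactly one of which is correct. The same structure covers looking-while-
    listening (k = 2) as well as the visual vocabulary task and TROG (k = 4)."""
    prompt: str
    candidate_images: List[str]   # paths or identifiers for the candidate images
    target_index: int             # index of the correct image

def score_trial(model, trial: MatchingTrial) -> float:
    """Return the model's choice probability for the target image.
    `model.match(prompt, images)` is a placeholder for whatever matching
    interface a given VLM exposes; it should return one score per image."""
    scores = np.asarray(model.match(trial.prompt, trial.candidate_images), dtype=float)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return float(probs[trial.target_index])
```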

Lexicon

To evaluate the lexicon multimodally, we use a combination of the looking-while-listening paradigm and the visual vocabulary task currently being developed by Long et al. The looking-while-listening paradigm already has substantial associated data from multiple experiments targeting infants and young toddlers, which are being collated in Peekbank (Zettersten et al. 2023). We are currently involved in a project analysing these data to understand item characteristics, which will allow accurate assessment of developmental response trajectories for this paradigm. A sample looking-while-listening trial is shown below:

Sample looking-while-listening trial

The visual vocabulary task is a new task currently under development, which uses a four-alternative forced choice format across a range of images and is designed for a much wider age range. Initial waves of data collection have been conducted, and statistical modelling of item difficulties is underway. A sample visual vocabulary trial is shown below:

Sample visual vocabulary trial

Syntax

To evaluate syntactic understanding, we use the Test for Reception of Grammar (TROG) (Bishop 1989). TROG is a four-alternative forced choice task in which four images are presented with a sentence; the stimuli are grouped by difficulty, such that children who are more adept at grammar are able to answer the more difficult items correctly. The images from the original study are available online on OSF, along with norms from the paper; item-response theoretic analyses may make it possible to recover more precise item-level information for this task. A sample TROG trial is shown below:

Sample TROG trial
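To illustrate what such an item-response theoretic analysis could look like, below is a bare-bones sketch of a one-parameter logistic (Rasch) model fitted by joint maximum likelihood. This is an assumption about the analysis approach rather than an existing pipeline; in practice one would use an established IRT package and account for guessing on a four-alternative task.

```python
import numpy as np
from scipy.optimize import minimize

def rasch_prob(theta, b):
    """Rasch (1PL) model: probability of a correct response for a person with
    ability `theta` on an item with difficulty `b`."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def fit_rasch(responses):
    """Jointly estimate person abilities and item difficulties by maximum
    likelihood from a persons x items matrix of 0/1 responses.
    Sketch only: no guessing parameter (4-AFC chance is 0.25) and a simple
    joint-likelihood fit rather than a marginal or Bayesian approach."""
    n_persons, n_items = responses.shape

    def neg_log_lik(params):
        theta = params[:n_persons, None]       # person abilities
        b = params[n_persons:][None, :]        # item difficulties
        p = np.clip(rasch_prob(theta, b), 1e-6, 1 - 1e-6)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

    init = np.zeros(n_persons + n_items)
    result = minimize(neg_log_lik, init, method="L-BFGS-B")

    theta_hat = result.x[:n_persons]
    b_hat = result.x[n_persons:]
    shift = b_hat.mean()                       # fix location indeterminacy: centre difficulties at 0
    return theta_hat - shift, b_hat - shift
```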

Semantics

To evaluate the semantic information known by VLMs or children, we propose a triplet task based on THINGS (Hebart et al. 2023). THINGS is a large dataset of images across many item categories, with associated behavioural and neural data collected from adults. One important behavioural outcome is similarity ratings, which were collected using a triplet task (choosing the odd one out from three images). We propose collecting similarity ratings from children across ages, making use of the tablet setup we have at the Children’s Discovery Museum of San Jose (which has been used in other studies, including Long et al. (2024); see also McKnight 2022 for some developmental data). This approach is effective because each individual trial is relatively quick and can be completed autonomously by children; the community-centric setup in the Museum therefore permits the collection of a large number of trials from a large number of children. It is also possible to apply the triplet task to words, allowing for the evaluation of both visual and language modalities, although this is contingent on children’s reading ability and may therefore be restricted to older children. A sample triplet task trial is shown below:

Sample triplet task trial
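A VLM can be scored on the same triplet task directly from its image embeddings: the odd one out is the image excluded from the most similar pair. The sketch below assumes a placeholder `encode_image` method; the actual interface will depend on the model.

```python
import numpy as np
from itertools import combinations

def odd_one_out(model, triplet_images):
    """Pick the odd one out from three images: compute pairwise cosine
    similarities between image embeddings and return the image excluded from
    the most similar pair. `model.encode_image` is a placeholder for the
    VLM's image encoder."""
    embs = np.stack([model.encode_image(img) for img in triplet_images])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)

    # Find the most similar pair; the remaining image is the odd one out.
    best_pair = max(combinations(range(3), 2), key=lambda ij: embs[ij[0]] @ embs[ij[1]])
    return ({0, 1, 2} - set(best_pair)).pop()
```

The model’s choices on a fixed set of triplets can then be compared directly with children’s choices on the same triplets, or with the adult similarity structure reported in THINGS.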

Evaluation

Once the benchmark has been constructed, it will be possible to collect data from children (as needed) and to evaluate VLMs in parallel. However, most VLMs are not trained developmentally, and do not have trajectory (i.e., checkpoint) information. We will demonstrate our benchmark by making use of (1) the developmentally trained VLMs from Vong et al. (2024) (without checkpoints), (2) other open VLMs with checkpoint information (without developmental training), and (3) other state-of-the-art VLMs (with neither). These evaluations will help us to understand the comprehensiveness of our benchmark, and will provide strong grounding for future VLM training projects (e.g., using data from Long et al. (2022)) in which both developmental training and checkpoints are available.
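As a sketch of how the trajectory comparison could proceed, the snippet below loops over model checkpoints and benchmark components, collecting scores into a long-format table for plotting against children’s developmental trajectories. The `load_checkpoint` and `run_task` functions are hypothetical placeholders for model loading and per-component scoring.

```python
import pandas as pd

def benchmark_trajectory(checkpoint_paths, benchmark_tasks, load_checkpoint, run_task):
    """Evaluate a series of model checkpoints on each benchmark component,
    returning a long-format table suitable for plotting growth trajectories.

    `load_checkpoint` should return a model given a checkpoint path, and
    `run_task` should return a score (e.g., mean choice probability) given a
    model and a task name; both are placeholders, not an existing API.
    """
    rows = []
    for step, path in checkpoint_paths.items():     # e.g., {1000: "ckpt_1000.pt", ...}
        model = load_checkpoint(path)
        for task in benchmark_tasks:                 # e.g., ["lexicon", "syntax", "semantics"]
            rows.append({"step": step, "task": task, "score": run_task(model, task)})
    return pd.DataFrame(rows)
```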

References

Bishop, D. V. M. 1989. Test for Reception of Grammar (TROG). 2nd ed. Manchester, Eng.: Medical Research Council.
Fernald, Anne E., Renate Zangl, Ana Luz Portillo, and Virginia A. Marchman. 2008. “Using Eye Movements to Monitor Spoken Language Comprehension by Infants and Young Children: Looking While Listening.” In Developmental Psycholinguistics: On-Line Methods in Children’s Language Processing, edited by Irina A. Sekerina, Eva M. Fernández, and Harald Clahsen. Language Acquisition and Language Disorders. John Benjamins Publishing Company. https://doi.org/10.1075/lald.44.06fer.
Frank, Michael C., Elise Sugarman, Alexandra C. Horowitz, Molly L. Lewis, and Daniel Yurovsky. 2016. “Using Tablets to Collect Data From Young Children.” Journal of Cognition and Development 17 (1): 1–17. https://doi.org/10.1080/15248372.2015.1061528.
Hebart, Martin N, Oliver Contier, Lina Teichmann, Adam H Rockter, Charles Y Zheng, Alexis Kidder, Anna Corriveau, Maryam Vaziri-Pashkam, and Chris I Baker. 2023. “THINGS-data, a Multimodal Collection of Large-Scale Datasets for Investigating Object Representations in Human Brain and Behavior.” Edited by Morgan Barense, Floris P de Lange, Talia Konkle, and Enrico Glerean. eLife 12 (February): e82580. https://doi.org/10.7554/eLife.82580.
Long, Bria, Judith E. Fan, Holly Huey, Zixian Chai, and Michael C. Frank. 2024. “Parallel Developmental Changes in Children’s Production and Recognition of Line Drawings of Visual Concepts.” Nature Communications 15 (1): 1191. https://doi.org/10.1038/s41467-023-44529-9.
Long, Bria, Sarah Goodin, George Kachergis, Virginia A. Marchman, Samaher Radwan, Robert Sparks, Violet Xiang, et al. 2022. “The BabyView Camera: Designing a New Head-mounted Camera to Capture Children’s Early Social and Visual Environment.” November 2, 2022. https://doi.org/10.31234/osf.io/238jk.
McKnight, D. E. 2022. “Age-Related Differences in Object-Similarity Judgment.” Master’s thesis, Edmonton, AB: University of Alberta. https://doi.org/10.7939/r3-mz1n-v607.
Sullivan, Jessica, Michelle Mei, Andrew Perfors, Erica Wojcik, and Michael C. Frank. 2021. “SAYCam: A Large, Longitudinal Audiovisual Dataset Recorded From the Infant’s Perspective.” Open Mind: Discoveries in Cognitive Science 5 (May): 20–29. https://doi.org/10.1162/opmi_a_00039.
Vong, Wai Keen, Wentao Wang, A. Emin Orhan, and Brenden M. Lake. 2024. “Grounded Language Acquisition Through the Eyes and Ears of a Single Child.” Science 383 (6682): 504–11. https://doi.org/10.1126/science.adi1374.
Wang, Wentao, Wai Keen Vong, Najoung Kim, and Brenden M. Lake. 2023. “Finding Structure in One Child’s Linguistic Experience.” Cognitive Science 47 (6): e13305. https://doi.org/10.1111/cogs.13305.
Zettersten, Martin, Daniel Yurovsky, Tian Linger Xu, Sarp Uner, Angeline Sin Mei Tsui, Rose M. Schneider, Annissa N. Saleh, et al. 2023. “Peekbank: An Open, Large-Scale Repository for Developmental Eye-Tracking Data of Children’s Word Recognition.” Behavior Research Methods 55 (5): 2485–2500. https://doi.org/10.3758/s13428-022-01906-4.
Zhuang, Chengxu, Siming Yan, Aran Nayebi, Martin Schrimpf, Michael C. Frank, James J. DiCarlo, and Daniel L. K. Yamins. 2021. “Unsupervised Neural Network Models of the Ventral Visual Stream.” Proceedings of the National Academy of Sciences 118 (3): e2014196118. https://doi.org/10.1073/pnas.2014196118.