Replication of CVCL Model by Vong et al. (2024, Science)

Author

Jane Yang (j7yang@ucsd.edu)

Published

October 30, 2024

Introduction

I am broadly interested in how children’s early vocabulary development intersects with their visual category learning. It fascinates me that a child can easily grasp fundamental properties of objects from just a few examples, while highly trained models often fall short. By integrating infants’ visual and linguistic experiences into computational models, I aim to explore the mechanisms through which children learn categories from everyday experiences. In Vong et al. (2024), the authors proposed the Child’s View for Contrastive Learning (CVCL) model, which embodies a form of cross-situational associative learning. This model tracks the co-occurrences of words and their possible visual referents to establish mappings. By reproducing the findings from this paper and understanding the model’s implementation details, I will learn how to use contrastive language-image pre-training models in my own research. Ultimately, my long-term goal is to develop a cognitively realistic model that learns robust representations from children’s everyday experiences.

First, I will obtain the SAYCam training dataset from Databrary and download the pre-trained CVCL model from the HuggingFace Hub. To familiarize myself with the dataset, I will randomly sample frames and utterances and feed them into the CVCL model, encoding both modalities to quickly assess the model’s performance before proceeding further. With a basic understanding of the model, I will then follow the analysis pipeline outlined in the paper to reproduce its main figures. These analyses fall into four key categories: (1) descriptive analysis of the training data, (2) t-SNE plots showing the alignment of vision and language from a child’s perspective, (3) image classification accuracy comparing CVCL, CLIP, and a linear probe, and (4) attention maps generated by Grad-CAM to illustrate CVCL’s object localization capabilities across four different categories. Challenges will likely arise during model evaluation, particularly in implementing CLIP and the other comparison models and in benchmarking their image classification performance against CVCL. Finally, I will test the models’ generalization by evaluating them on novel visual exemplars not included in the training dataset.
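
As a first sanity check, the sketch below shows how I expect to load the pre-trained checkpoint and encode one sampled frame-utterance pair into the joint embedding space. The class and method names (`MultiModalLitModel.load_model`, `encode_image`, `tokenize`, `encode_text`) and the file path are assumptions based on my reading of the authors’ released code and may need adjusting once I have the repository set up.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from multimodal.multimodal_lit import MultiModalLitModel  # assumed module path from the authors' release

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pre-trained CVCL checkpoint and its image preprocessing transform (assumed API)
cvcl, preprocess = MultiModalLitModel.load_model(model_name="cvcl")
cvcl = cvcl.to(device).eval()

# Encode one sampled frame (hypothetical file name) into the joint embedding space
image = preprocess(Image.open("sampled_frame.jpg").convert("RGB")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = cvcl.encode_image(image)

# Encode one sampled utterance
tokens, token_len = cvcl.tokenize(["ball"])
with torch.no_grad():
    text_features = cvcl.encode_text(tokens.to(device), token_len)

# Cosine similarity gives a quick read on image-text alignment
print(F.cosine_similarity(image_features, text_features).item())
```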

Link to GitHub repo: https://github.com/JaneYang07/vong2024_replication
Link to the original paper: https://www.science.org/doi/abs/10.1126/science.adi1374

Methods

Materials

“SAYCam-S dataset of longitudinal egocentric video recordings from an individual child, which consists of clips over a 1.5-year period of the child’s life (6 to 25 months), with a total of 600,000 video frames paired with 37,500 transcribed utterances (extracted from 61 hours of video).” (Vong et al., 2024)

Procedure

CVCL was trained on the SAYCam-S dataset, using a contrastive learning model architecture that integrates both a vision encoder and a language encoder. Within this structure, images and corresponding utterances are embedded into a joint vector space via modality-specific neural networks. During training, the model learns by adjusting similarity metrics: matched image-utterance pairs are drawn closer (increased cosine similarity), while mismatched pairs are pushed apart.
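
To make the training objective concrete, below is a minimal sketch of a symmetric, CLIP-style contrastive loss, which is how I understand CVCL’s treatment of matched versus mismatched pairs within a batch; the function name, variable names, and temperature value are illustrative rather than taken from the released code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-utterance pairs.

    image_emb, text_emb: (batch, dim) outputs of the vision and language encoders.
    Matched pairs lie on the diagonal of the similarity matrix; every other entry
    in the same row or column acts as a mismatched (negative) pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Cosine similarities between every image and every utterance in the batch
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(len(image_emb), device=image_emb.device)

    # Cross-entropy in both directions pulls matched pairs together
    # and pushes mismatched pairs apart
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```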

After the training phase, CVCL was evaluated alongside several alternative models. The evaluation process adapted a well-known child testing procedure, prompting models with a target category label to identify the corresponding visual referent among four candidate images. This selection relied on cosine similarity to the label, allowing for a straightforward assessment of model performance. The models were tested on the Labeled-S dataset, an evaluation set with frames annotated for 22 visual concepts that were consistently observed across both the visual and linguistic experiences of the child.
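
The forced-choice trials can be scored directly from the embeddings. The sketch below implements a single 4AFC trial under my assumption that label and image embeddings have already been computed with the encoders above; category-level accuracy is then just the mean over trials.

```python
import torch.nn.functional as F

def four_afc_trial(label_emb, candidate_image_embs, target_index=0):
    """One 4AFC trial: choose the candidate image most similar to the label embedding.

    label_emb: (1, dim) embedding of the target category label.
    candidate_image_embs: (4, dim) embeddings of the target image plus three foils.
    Returns 1.0 if the highest cosine similarity falls on the target, else 0.0.
    """
    sims = F.cosine_similarity(label_emb, candidate_image_embs)  # shape (4,)
    return float(sims.argmax().item() == target_index)

# Accuracy for a concept is the mean over its trials, e.g.:
# accuracy = sum(four_afc_trial(lab, imgs) for lab, imgs in trials) / len(trials)
```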

Among the alternative models, CVCL-Shuffled was trained on a dataset where co-occurring frames and utterances were randomly shuffled, breaking the original links between frames and utterances while retaining information from each modality independently. Another model, CVCL-Random Features, was designed to test the reliance on strong visual embeddings by randomly initializing and freezing the vision encoder during training. Additionally, a Linear Probe model was developed by fitting a linear classifier on top of the frozen pretrained vision encoder, which had been initialized through self-supervision.
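
Two of these controls are straightforward to sketch. The snippet below shows (a) how the shuffled control could break image-utterance co-occurrences and (b) a linear probe fit with scikit-learn’s logistic regression as a stand-in for the paper’s linear classifier; the data layout (lists of utterances, precomputed feature matrices, integer labels) is my own assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def shuffle_pairs(utterances, seed=0):
    """CVCL-Shuffled control: permute the utterances so that frames and
    utterances no longer co-occur, while keeping each modality intact."""
    rng = np.random.default_rng(seed)
    return rng.permutation(np.array(utterances, dtype=object)).tolist()

def linear_probe_accuracy(train_features, train_labels, test_features, test_labels):
    """Linear probe: fit a linear classifier on frozen vision-encoder features
    (the encoder itself is never updated) and report held-out accuracy."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_features, train_labels)
    return probe.score(test_features, test_labels)
```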

To assess out-of-distribution generalization, CVCL’s performance was tested on the Konkle Objects dataset. This dataset comprises 64 visual concepts, each with a corresponding label in CVCL’s vocabulary, presented as single-object images on a white background.

Analysis Plan

The primary goal of this project is to deepen my understanding of implementing the CVCL model and its alternatives, with a focus on model training and evaluation. Following the procedures outlined above, I will train CVCL and the alternative models. Evaluation will be conducted on the Labeled-S and Konkle Objects datasets, applying a four-alternative forced-choice (4AFC) image recognition test to assess accuracy.

Additionally, I plan to generate t-SNE plots of the image and text embeddings in the joint embedding space (using cosine similarity as the distance measure), offering a visual representation of the alignment between vision and language from a child’s perspective.
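
A minimal sketch of this visualization, assuming the image embeddings and their Labeled-S category labels have already been extracted; I use scikit-learn’s t-SNE with a cosine metric to mirror the similarity measure used elsewhere, and the perplexity value is a placeholder to tune.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_tsne(image_embs, category_labels, perplexity=30, seed=0):
    """Project joint-space image embeddings to 2-D with t-SNE and color points
    by Labeled-S category. Text embeddings for each category label could be
    concatenated to image_embs before fitting to place them on the same map."""
    coords = TSNE(n_components=2, metric="cosine",
                  perplexity=perplexity, random_state=seed).fit_transform(np.asarray(image_embs))
    labels = np.asarray(category_labels)
    for label in np.unique(labels):
        mask = labels == label
        plt.scatter(coords[mask, 0], coords[mask, 1], s=5, label=str(label))
    plt.legend(fontsize=6, markerscale=2)
    plt.title("t-SNE of CVCL image embeddings by category")
    plt.show()
```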

Differences from Original Study

I will try to follow the training procedure for CVCL and the alternative models; however, the models may not perform exactly as described in the paper. I anticipate that the classification accuracy across categories during evaluation will look slightly different from the results reported in the paper.

Methods Addendum (Post Data Collection)

You can comment this section out prior to the final report, once data collection is complete.

Actual Sample

Sample size, demographics, data exclusions based on rules spelled out in analysis plan

Differences from pre-data collection methods plan

Any differences from what was described as the original plan, or “none”.

Results

Data preparation

Data preparation following the analysis plan.

Confirmatory analysis

The analyses as specified in the analysis plan.

Side-by-side graph with original graph is ideal here

Exploratory analyses

Any follow-up analyses desired (not required).

Discussion

Summary of Replication Attempt

Open the discussion section with a paragraph summarizing the primary result from the confirmatory analysis and the assessment of whether it replicated, partially replicated, or failed to replicate the original result.

Commentary

Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis, (b) assessment of the meaning of the replication (or not) - e.g., for a failure to replicate, are the differences between original and present study ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the replication attempt. None of these need to be long.