Replication of CVCL Model by Vong et al. (2024, Science)
Introduction
I am broadly interested in how children’s early vocabulary development intersects with their visual category learning. It fascinates me that a child can easily grasp fundamental properties of objects from just a few examples, while highly trained models often fall short. By integrating infants’ visual and linguistic experiences into computational models, I aim to explore the mechanisms through which children learn categories from everyday experiences. In their paper, Vong et al. (2024) proposed the Child’s View for Contrastive Learning (CVCL) model, which embodies a form of cross-situational associative learning: it tracks the co-occurrences of words and their possible visual referents to establish mappings. By reproducing the findings from this paper and understanding the model’s implementation details, I will learn how to use contrastive language-image pre-training models in my own research. Ultimately, my long-term goal is to develop a cognitively realistic model that learns robust representations from children’s everyday experiences.
First, I will obtain the SAYCam training dataset from Databrary and download the pre-trained CVCL model from the HuggingFace Hub. To familiarize myself with the dataset, I will randomly sample image-utterance pairs and feed them to the CVCL model, encoding images and utterances to quickly assess the model’s performance before proceeding further (a sketch of this step is given below). With a basic understanding of the model, I will then follow the analysis pipeline outlined in the paper to reproduce its main figures. These analyses fall into four categories: (1) descriptive analysis of the training data, (2) t-SNE plots showing the alignment of vision and language from a child’s perspective, (3) image classification accuracy comparing CVCL, CLIP, and a linear probe, and (4) attention maps generated by Grad-CAM to illustrate CVCL’s object localization across four categories. Challenges will likely arise during model evaluation, particularly in implementing CLIP and the other comparison models and comparing their performance with CVCL on image classification. Finally, I will test the models’ generalization by evaluating them on novel visual exemplars not included in the training dataset.
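The sketch below is a minimal example of this first step: loading the pretrained checkpoint and embedding one sampled image-utterance pair. It assumes the loading and encoding interface shown in the authors’ multimodal-baby repository (MultiModalLitModel.load_model, tokenize, encode_image, encode_text); the exact names may differ, and the frame path and utterance are placeholders.

```python
# Minimal sketch: load the pretrained CVCL checkpoint and embed one sampled
# image-utterance pair. Interface names follow the authors' multimodal-baby
# README and may need adjusting; the frame path and utterance are placeholders.
import torch
import torch.nn.functional as F
from PIL import Image
from multimodal.multimodal_lit import MultiModalLitModel

device = "cuda" if torch.cuda.is_available() else "cpu"
cvcl, preprocess = MultiModalLitModel.load_model(model_name="cvcl")
cvcl = cvcl.to(device).eval()

# Placeholder frame and utterance standing in for a randomly sampled pair.
image = preprocess(Image.open("sampled_frame.jpg").convert("RGB")).unsqueeze(0).to(device)
tokens, token_len = cvcl.tokenize(["look at the ball"])
tokens, token_len = tokens.to(device), token_len.to(device)

with torch.no_grad():
    image_emb = cvcl.encode_image(image)            # (1, d) image embedding
    text_emb = cvcl.encode_text(tokens, token_len)  # (1, d) utterance embedding

print(f"cosine similarity: {F.cosine_similarity(image_emb, text_emb).item():.3f}")
```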
Link to GitHub repo: https://github.com/JaneYang07/vong2024_replication
Link to the original paper: https://www.science.org/doi/abs/10.1126/science.adi1374
Design Overview
Image classification accuracy was the only measure in the paper. The model architecture, training dataset, evaluation dataset, and alternative models can all be treated as factors of the study. It is a within-participant design in the sense that the paper proposed one primary model trained on the data of a single child; the measure was not repeated. The study is replicable if the model is trained on the same dataset, but it may not generalize to a different dataset (e.g., to the SAYCam-A and SAYCam-L data).
Methods
Materials
“SAYCam-S dataset of longitudinal egocentric video recordings from an individual child, which consists of clips over a 1.5-year period of the child’s life (6 to 25 months), with a total of 600,000 video frames paired with 37,500 transcribed utterances (extracted from 61 hours of video).”
Procedure
CVCL was trained on the SAYCam-S dataset, using a contrastive learning model architecture that integrates both a vision encoder and a language encoder. Within this structure, images and corresponding utterances are embedded into a joint vector space via modality-specific neural networks. During training, the model learns by adjusting similarity metrics: matched image-utterance pairs are drawn closer (increased cosine similarity), while mismatched pairs are pushed apart.
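The snippet below is an illustrative sketch of this symmetric contrastive objective, not the authors’ exact implementation: within a batch, matched image-utterance pairs sit on the diagonal of a cosine-similarity matrix and are pulled together, while off-diagonal (mismatched) pairs are pushed apart. The temperature value here is an assumption.

```python
# Illustrative sketch of a symmetric contrastive (InfoNCE-style) loss:
# matched image-utterance pairs on the diagonal are pulled together,
# mismatched pairs within the batch are pushed apart.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) cosine-similarity matrix, scaled by temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```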
After the training phase, CVCL was evaluated alongside several alternative models. The evaluation process adapted a well-known child testing procedure, prompting models with a target category label to identify the corresponding visual referent among four candidate images. This selection relied on cosine similarity to the label, allowing for a straightforward assessment of model performance. The models were tested on the Labeled-S dataset, an evaluation set with frames annotated for 22 visual concepts that were consistently observed across both the visual and linguistic experiences of the child.
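A single trial of this forced-choice procedure can be sketched as follows, assuming the label and candidate image embeddings have already been computed with the model’s text and image encoders; the variable names are mine, not the authors’.

```python
# Sketch of one 4AFC trial: given a target label embedding and four candidate
# image embeddings (one target, three foils), choose the image with the
# highest cosine similarity to the label.
import torch
import torch.nn.functional as F

def run_trial(label_emb: torch.Tensor,       # (d,) embedding of e.g. "ball"
              candidate_embs: torch.Tensor,  # (4, d) one target + three foils
              target_idx: int) -> bool:
    sims = F.cosine_similarity(label_emb.unsqueeze(0), candidate_embs, dim=-1)
    return int(sims.argmax()) == target_idx

# Accuracy is the mean of these boolean outcomes over many trials,
# aggregated per concept across the 22 Labeled-S categories.
```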
Among the alternative models, CVCL-Shuffled was trained on a dataset where co-occurring frames and utterances were randomly shuffled, breaking the original links between frames and utterances while retaining information from each modality independently. Another model, CVCL-Random Features, was designed to test the reliance on strong visual embeddings by randomly initializing and freezing the vision encoder during training. Additionally, a Linear Probe model was developed by fitting a linear classifier on top of the frozen pretrained vision encoder, which had been initialized through self-supervision.
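As one concrete (assumed) way to implement the linear-probe comparison, a linear classifier such as scikit-learn’s LogisticRegression can be fit on the frozen vision-encoder embeddings; the original implementation may use a different classifier or optimizer, and the file paths below are placeholders.

```python
# Minimal sketch of the linear-probe baseline: fit a linear classifier on
# frozen vision-encoder embeddings and score it on held-out frames.
# File paths and array shapes are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

train_embs = np.load("train_embs.npy")      # (n_train, d) frozen embeddings
train_labels = np.load("train_labels.npy")  # (n_train,) concept ids
test_embs = np.load("test_embs.npy")
test_labels = np.load("test_labels.npy")

probe = LogisticRegression(max_iter=1000)
probe.fit(train_embs, train_labels)
print("linear probe accuracy:", probe.score(test_embs, test_labels))
```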
To assess out-of-distribution generalization, CVCL’s performance was tested on the Konkle Objects dataset. This dataset comprises 64 visual concepts, each with a corresponding label in CVCL’s vocabulary, presented as single-object images on a white background.
Analysis Plan
The primary goal of this project is to deepen my understanding of implementing the CVCL model and its alternatives, with a focus on model training and evaluation. Following the procedures outlined above, I will train CVCL and the alternative models accordingly. Evaluation will be conducted using the Labeled-S and Konkle Objects datasets, applying a four-alternative forced-choice (4AFC) image recognition test for accuracy assessment.
Additionally, I plan to generate t-SNE plots based on the cosine similarity of text and image embeddings, offering a visual representation of the alignment between vision and language from a child’s perspective.
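A minimal sketch of this visualization, assuming the Labeled-S image embeddings and their concept labels have already been extracted and saved (the file paths are placeholders):

```python
# Sketch of the t-SNE visualization: project image embeddings into 2-D and
# color points by concept to inspect vision-language alignment.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

image_embs = np.load("labeled_s_image_embs.npy")                  # (n, d), placeholder path
concepts = np.load("labeled_s_concepts.npy", allow_pickle=True)   # (n,) concept labels

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(image_embs)

for concept in np.unique(concepts):
    mask = concepts == concept
    plt.scatter(coords[mask, 0], coords[mask, 1], s=5, label=concept)
plt.legend(fontsize=6, markerscale=2)
plt.title("t-SNE of CVCL image embeddings by concept")
plt.savefig("tsne_labeled_s.png", dpi=300)
```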
Differences from Original Study
I will try to follow the training procedure of CVCL and the alternative models; however, the models may not perform exactly as described in the paper. I anticipate that the classification accuracy across categories during evaluation will look slightly different from the results in the paper.
Methods Addendum (Post Data Collection)
You can comment this section out prior to the final report, once data collection is complete.
Actual Sample
Sample size, demographics, data exclusions based on rules spelled out in analysis plan
Differences from pre-data collection methods plan
Any differences from what was described as the original plan, or “none”.
Project Check #1
Outcome measure of the success/failure
The primary outcome measure is image classification accuracy across object categories. The classification accuracy of the trained model will be calculated during the model evaluation stage. I will then compute the root mean squared error (RMSE) between the trained model’s per-category classification accuracy and the accuracy scores reported in the paper. The goal is an RMSE of less than 10%.
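A short sketch of how this criterion would be computed, where the two arrays are placeholders for the 22 per-concept accuracies from my run and from the paper’s released results:

```python
# Sketch of the project-check criterion: RMSE between replicated and reported
# per-category classification accuracies. File paths are placeholders.
import numpy as np

replicated = np.load("replicated_accuracy.npy")  # shape (22,), values in [0, 1]
reported = np.load("reported_accuracy.npy")      # from the paper's data files

rmse = np.sqrt(np.mean((replicated - reported) ** 2))
print(f"RMSE = {rmse:.3f}  (target: < 0.10)")
```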
Progress description
So far I have asked the primary investigator for access to the training dataset hosted on Databrary. Before I can train the model on these data, I have been reading the paper that describes the training dataset to become more familiar with it. I have also set up the environment needed for model training on the server; all required packages are documented in the requirements.txt file. In addition, I have looked into the analysis code to understand the procedure for reproducing the figures in the paper.
Project Check #2
Progress description
I now have access to the training dataset hosted on Databrary; however, downloading the data from Databrary has been a hurdle. Downloads kept timing out, even when I tried to download one file at a time. After configuring the network options, I was finally able to download one session of the training videos. I still need to figure out a way to download all sessions from subject S to obtain the complete training dataset.
Before I had access to the training data, I tried out the pretrained model on similar egocentric-view images; however, I ran into some issues on the server where I set up the project. I do not have sudo permission to update a system-level package needed for the code to run smoothly. I have created a ticket and will talk to the IT service department to see how I can resolve this issue. I will attach a screenshot of the error for reference.
Roadmap to future steps
My goal is still to obtain the full training dataset and train the models as proposed. If, under unforeseen circumstances, I cannot get access to the full training dataset, I will focus on reproducing the analysis results using the intermediate data files the authors provide, without re-training the models.
Results
Data preparation
Data preparation following the analysis plan.
Confirmatory analysis
The analyses as specified in the analysis plan.
Side-by-side graph with original graph is ideal here
Exploratory analyses
Any follow-up analyses desired (not required).
Discussion
Summary of Replication Attempt
Open the discussion section with a paragraph summarizing the primary result from the confirmatory analysis and the assessment of whether it replicated, partially replicated, or failed to replicate the original result.
Commentary
Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis, (b) assessment of the meaning of the replication (or not) - e.g., for a failure to replicate, are the differences between original and present study ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the replication attempt. None of these need to be long.