Reproducibility of CVCL Model by Vong et al. (2024, Science)

Author

Jane Yang (j7yang@ucsd.edu)

Published

December 11, 2024

Introduction

AI models have demonstrated remarkable language learning capabilities, yet the way they learn differs fundamentally from human language acquisition. While AI systems are trained on trillions of words, children learn language from exposure to merely millions of words per year. This vast disparity in learning input has prompted skepticism among researchers about drawing parallels between AI language learning and human linguistic development.

To explore the potential connections between artificial and human language learning, an innovative approach would involve training an AI model exclusively on the linguistic input received by a single child, thereby revealing the model’s learning performance under dramatically constrained data conditions. Such an experiment could provide insights into the mechanisms of language acquisition and the comparative capacities of artificial and human learning systems.

A team of researchers from New York University conducted an experiment that directly addressed this question. By training a multimodal model, CVCL (Child’s View for Contrastive Learning), using only the visual and auditory experiences captured by a headcam worn by a single child from 6 to 25 months of age, they sought to understand the potential for language acquisition under severely restricted input conditions.

The study’s findings, published in Science, were remarkable. Despite having access to video recordings representing only about 1% of the child’s waking hours, the model learned a significant number of words and concepts. This result challenged existing assumptions about the massive data requirements for language learning, suggesting that even limited, contextually rich input can support meaningful linguistic and conceptual development.

Initially, my objective was to reproduce the CVCL model comprehensively, covering the complete process from model training to evaluation across multiple datasets. However, due to unexpected challenges and practical constraints, I pivoted to evaluating the pretrained CVCL model on the Konkle Objects Dataset. Despite my best efforts, full reproduction of the original evaluation results was hindered by several obstacles and time constraints, which I detail in the subsequent sections.

Link to GitHub repo: https://github.com/ucsd-psych201a/vong2024
Link to the original paper: https://www.science.org/doi/abs/10.1126/science.adi1374

Procedure

I first configured a conda environment on the server, installed all the relevant libraries, and obtained the source code from GitHub. I then created a script called ‘trial_test.py’ to test whether I could run the pretrained CVCL model successfully and to check that the environment was set up properly. The script ran without issue (see figure-environment).

[Figure: environment]
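My ‘trial_test.py’ essentially amounted to loading the pretrained model and confirming it initializes. A minimal sketch, assuming the MultiModalLitModel.load_model entry point that the authors’ repository exposes (the exact interface may differ across versions):

```python
import torch
from multimodal.multimodal_lit import MultiModalLitModel

# Use the GPU when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# load_model fetches the pretrained CVCL checkpoint (hosted on
# HuggingFace) and returns the model plus its image preprocessor.
cvcl, preprocess = MultiModalLitModel.load_model(model_name="cvcl")
cvcl = cvcl.to(device)
cvcl.eval()

print(f"CVCL loaded successfully on {device}")
```

If this runs end to end, both the environment and the checkpoint download are working.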

After making sure the environment was set up properly, I downloaded the Konkle Objects Dataset (Konkle et al., 2010) and transferred it to the server. For the evaluation, the paper selected 64 object categories from the Konkle Objects Dataset to match the vocabulary in the headcam videos, so I needed to process the evaluation dataset to match the metadata used in the original paper. With the generous help of the paper’s original author, I obtained the original metadata for the evaluation dataset. I then wrote a script calling the helper functions from ‘multimodal_data_module.py’ to restructure the evaluation dataset, resize each image to 50% of its original size, and then resize the resulting image to 224 x 224 (see figure-resize-restructure).

I initially ran into issues running the helper functions from ‘multimodal_data_module.py’. I suspected a system path configuration error at first, but later realized it was caused by a surprising discrepancy in folder structure (see figure-error1). I resolved the conflict by removing the ‘multimodal’ prefix from the import, because all the other imported scripts were in the same folder as ‘multimodal_data_module.py’.
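Concretely, the fix was a one-line change to the import statement (the imported name below is illustrative; my script imported whichever helpers it needed):

```python
# Failing import: the 'multimodal' package prefix does not resolve when
# the script lives in the same folder as multimodal_data_module.py:
# from multimodal.multimodal_data_module import MultiModalDataModule

# Working import: reference the sibling module directly.
from multimodal_data_module import MultiModalDataModule
```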

[Figure: resize-restructure]
[Figure: error1]
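As for the resizing step itself, here is a minimal standalone sketch of my reading of the 50%-then-224 x 224 logic, without the repo’s helpers; the folder names are placeholders, and this is not the authors’ exact preprocessing code:

```python
from pathlib import Path
from PIL import Image

SRC = Path("konkle_objects/raw")   # placeholder input folder
DST = Path("konkle_objects/eval")  # placeholder output folder

for img_path in SRC.rglob("*.jpg"):
    img = Image.open(img_path).convert("RGB")

    # Step 1: shrink the image to 50% of its original dimensions.
    half = img.resize((img.width // 2, img.height // 2), Image.LANCZOS)

    # Step 2: paste the shrunken image onto a white canvas of the
    # original size, then resize the whole canvas to 224 x 224, the
    # input resolution the CVCL vision encoder expects.
    canvas = Image.new("RGB", img.size, (255, 255, 255))
    offset = ((img.width - half.width) // 2, (img.height - half.height) // 2)
    canvas.paste(half, offset)

    out_path = DST / img_path.relative_to(SRC)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    canvas.resize((224, 224), Image.LANCZOS).save(out_path)
```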

I downloaded the pretrained checkpoint of the CVCL model from HuggingFace and then ran the ‘eval.py’ script to perform a full evaluation on the Konkle Objects Dataset. However, the script would not run: it raised an error that seemed to stem from ‘train.py’ calling a function that had changed in a newer version of the ‘pytorch_lightning’ library. I tried updating the add_parser() function in ‘train.py’ to match the latest version of the same function, but that did not resolve the error (see figure-error2).

[Figure: error2]
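If my diagnosis was right, a likely culprit is the removal of Trainer.add_argparse_args in pytorch_lightning 2.x, which breaks 1.x-era training scripts at argument-parsing time. A sketch of the old pattern and one possible workaround (the flag set below is illustrative, not what ‘train.py’ actually declares):

```python
import argparse
import pytorch_lightning as pl

parser = argparse.ArgumentParser()

# pytorch_lightning 1.x let scripts inject every Trainer flag at once:
#     parser = pl.Trainer.add_argparse_args(parser)
# That classmethod was removed in 2.x, so 1.x-era scripts crash here.
# One workaround is to declare the needed Trainer flags by hand:
parser.add_argument("--accelerator", type=str, default="auto")
parser.add_argument("--devices", type=int, default=1)
parser.add_argument("--max_epochs", type=int, default=1)
args = parser.parse_args([])

trainer = pl.Trainer(
    accelerator=args.accelerator,
    devices=args.devices,
    max_epochs=args.max_epochs,
)
```

Alternatively, pinning pytorch-lightning below 2.0 in the conda environment should sidestep the API change entirely.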

Summary

It has been a journey trying to reproduce the paper faithfully. Thanks to the generous support of Wai Keen, Bria, Mike, Janna, and Khuyen, I gained a better understanding of the model architecture and pipeline through hands-on experience. Although I was not able to resolve all the bugs I encountered, nor did I have enough time to reproduce the results, this has truly been a valuable learning experience!