This document summarizes our approach to deriving a standard model for early word learning.
We begin with an item response theory (IRT) model. The goal is to predict whether a particular child \(c\) will know a particular word \(w\), based on the child’s learning rate \(\theta\) and age \(t\), and on the word’s difficulty \(d\). This is a modified Rasch model.
\[ p(w = 1 \mid \theta, t, d) = \frac{1}{1 + \exp(-(t\theta - d))} \]
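For concreteness, here is a minimal sketch of this item response function in Python (the function and variable names are ours, for illustration only):

```python
import numpy as np

def p_produce(theta, t, d):
    """Modified Rasch model: probability that a child with learning
    rate `theta`, at age `t`, knows a word with difficulty `d`."""
    return 1.0 / (1.0 + np.exp(-(t * theta - d)))

# e.g., theta = 0.1 at age t = 24 (in months, say), difficulty d = 2.0:
# p_produce(0.1, 24, 2.0)  # -> ~0.60
```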
The trouble with this seemingly simple logistic formulation is that it is computationally difficult for large datasets. Because each observation depends on both a word and a child, if there are \(M\) words and \(N\) children, even the simplest model has \(M + N\) parameters and \(MN\) observations. Further, each observation depends on a different combination of parameters.
Previous work deals with this problem in similar ways, though using different model classes (Hidaka 2013; McMurray 2007; Mollica and Piantadosi 2017). All three of these papers focus on word-level difficulty and do not assume heterogeneity between learners. This assumption allows them to describe overall patterns (e.g., acceleration in vocabulary growth) but does not allow them to fulfill our goal of linking specific units of input measurement for individuals to particular outcome measures.
Here we try to derive a version of this model that allows us to model heterogeneity across learners using different functional forms, while maintaining some distributional assumptions about word difficulty.
In particular, we note that for each child, knowledge of each word is a Bernoulli variable whose parameter is given by the IRT model above. The probability of having a particular vocabulary size \(V\) from a set of words \(W = \{w_1, \ldots, w_i, \ldots, w_M\}\) with associated difficulties \(D\) then follows a Poisson binomial distribution, with parameters \(p_i\) given by the probabilities of producing each word for that child at that time. This probability is
\[ p(V = v \mid \theta, t, D) = \sum_{A \in F_v} \prod_{i \in A} p_i \prod_{j \in A^c} (1 - p_j) \]
where \(F_v\) is the set of all subsets of \(v\) integers that can be selected from \(\{1, \ldots, M\}\), \(A\) is one such subset, and \(A^c\) is its complement. Intuitively, the sum ranges over all the different ways to produce \(v\) of \(M\) words; for each of these, the probabilities of production are multiplied across the words produced, and the probabilities of non-production are multiplied across the words not produced. Unfortunately, because the sum runs over all \(\binom{M}{v}\) subsets, this quantity is very difficult to compute directly.
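As a sanity check, here is a direct (naive) transcription of this sum in Python; it enumerates all \(\binom{M}{v}\) subsets, so it is usable only for very small \(M\):

```python
from itertools import combinations
from math import prod

def pmf_poisson_binomial_exact(v, p):
    """Exact P(V = v): sum over every size-v subset A of words,
    multiplying p_i for words in A and (1 - p_j) for words in A's
    complement. There are C(M, v) such subsets."""
    M = len(p)
    total = 0.0
    for A in combinations(range(M), v):
        in_A = set(A)
        total += (prod(p[i] for i in A)
                  * prod(1 - p[j] for j in range(M) if j not in in_A))
    return total

# e.g., three words with p = [0.9, 0.5, 0.1]:
# pmf_poisson_binomial_exact(1, [0.9, 0.5, 0.1])  # -> 0.455
```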
Luckily, by the central limit theorem, we can use a normal approximation for this quantity (described in Hong 2013).
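Concretely, the Poisson binomial has mean \(\mu = \sum_{i=1}^{M} p_i\) and variance \(\sigma^2 = \sum_{i=1}^{M} p_i (1 - p_i)\), so for moderately large \(M\),

\[ p(V \le v \mid \theta, t, D) \approx \Phi\left(\frac{v + 0.5 - \mu}{\sigma}\right), \]

where \(\Phi\) is the standard normal CDF and the \(0.5\) is the usual continuity correction. A minimal sketch in Python, using `scipy` (function names are ours):

```python
from scipy.stats import norm

def cdf_vocab_normal_approx(v, p):
    """Normal approximation (with continuity correction) to P(V <= v),
    where p is the vector of per-word production probabilities p_i."""
    mu = sum(p)
    sigma = sum(pi * (1 - pi) for pi in p) ** 0.5
    return norm.cdf((v + 0.5 - mu) / sigma)
```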
Hidaka, Shohei. 2013. “A Computational Model Associating Learning Process, Word Attributes, and Age of Acquisition.” PLoS ONE 8 (11). Public Library of Science.
Hong, Yili. 2013. “On Computing the Distribution Function for the Poisson Binomial Distribution.” Computational Statistics & Data Analysis 59. Elsevier: 41–51.
McMurray, Bob. 2007. “Defusing the Childhood Vocabulary Explosion.” Science 317 (5838). American Association for the Advancement of Science: 631.
Mollica, Francis, and Steven T. Piantadosi. 2017. “How Data Drive Early Word Learning: A Cross-Linguistic Waiting Time Analysis.” Open Mind 1 (2). MIT Press: 67–77.