A Bayesian ‘Standard’ Model of Early Word Learning

Baseline vs. Frequency Model

Our baseline model estimates a difficulty for each word and an input rate (tokens per hour) for each child, and assumes that each child receives 12 waking hours of input per day.

Although this model does not include word frequency, it does include a uniform scaling factor (\(P(w_n) = 1/27693\)) that treats every word as equally, and fairly, unlikely. This keeps the estimated difficulty parameters on the same scale as those in the word frequency model. The denominator is the number of unique non-hapax word types in CHILDES, which acknowledges the bulk of words that do not appear on the CDI.

The word frequency model uses word frequency estimates from CHILDES instead of a uniform probability. These probabilities range from 3.88e-7 (e.g., “green beans”, “TV”, “washing machine”) to 0.044 (“you”).
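To make the difference between the two variants concrete, here is a minimal Python sketch of how expected exposures to a single word might be computed. It assumes, and this is not stated in the text, that expected tokens are the product of the child's hourly rate, 12 waking hours per day, the days elapsed since the start age, and the word's probability. The function name, the 30.44 days-per-month constant, and the example child are all hypothetical.

```python
DAYS_PER_MONTH = 30.44      # assumption: average month length
WAKING_HOURS_PER_DAY = 12   # from the model description
UNIFORM_P = 1 / 27693       # baseline: one over the non-hapax CHILDES types


def expected_tokens(rate_per_hour, age_months, start_age_months, p_word):
    """Expected tokens of one word heard so far (hypothetical functional form)."""
    # No input is counted before the start age of accumulation.
    months_of_input = max(age_months - start_age_months, 0.0)
    hours = months_of_input * DAYS_PER_MONTH * WAKING_HOURS_PER_DAY
    return rate_per_hour * hours * p_word


# Hypothetical child: 24 months old, 1198 tokens/hour, input counted from 12 months.
baseline = expected_tokens(1198, 24, 12, UNIFORM_P)  # same value for every word
frequent = expected_tokens(1198, 24, 12, 0.044)      # most frequent word, "you"
rare = expected_tokens(1198, 24, 12, 3.88e-7)        # least frequent, e.g. "TV"
print(f"baseline: {baseline:.1f}, 'you': {frequent:.0f}, 'TV': {rare:.2f}")
```

Under this assumed form, the baseline gives every word the same expected exposure, while the frequency model spreads exposures across roughly five orders of magnitude, which is why the two sets of difficulty estimates need a shared scale to be comparable.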

Because inference ran into convergence issues when we used an empirical prior on rate (a normal with mean = 1198 and SD = 84), for this round we used uniform priors on [0, 2000] for both rate and difficulty, and a uniform prior on [0, 16] months for the start age of input accumulation. We plan to revisit data-informed priors, at least for rate, after we examine the results below.

Start Age of Input Accumulation

What do the estimated start ages for accumulation look like in these models? This parameter was allowed to vary from 0 to 16 months, but we would reasonably expect accumulation to start by 12 months, when most infants are starting to produce their first words.

In the baseline model the mean estimated start age was 15.73 months, while in the word frequency model the estimated start age was 15.97 months. These values are startlingly high: they imply that children do not start accumulating words until well after one year of age, by which point most children have in fact started producing one or more words. It’s unclear whether this will be corrected by using a normal input rate prior, and/or by using a normal prior on the start age itself (\(N(6,4)\), perhaps?).

Input Rate

What are the models’ distributions of estimated input rates (tokens per hour) per subject?

Subjects’ mean estimated input rate for the baseline model is 229.97 tokens per hour (SD = 363.28), while the mean rate for the word frequency model is 269.42 tokens per hour (SD = 595.54). Although roughly 20% of the subjects are estimated to have hourly input rates in a reasonable empirical range (500 - 2000), the majority are estimated to have surprisingly low input rates. For example, 100 tokens per hour corresponds to fewer than 0.5 million tokens per year, less than half the lowest amount observed by Hart & Risley 1995.
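The annual figure follows directly from the model's assumption of 12 waking hours of input per day. A small Python sketch of the arithmetic (the function name is illustrative, and a 365-day year is assumed):

```python
WAKING_HOURS_PER_DAY = 12  # from the baseline model's assumption
DAYS_PER_YEAR = 365        # assumption: non-leap year


def tokens_per_year(rate_per_hour):
    """Annual input implied by an hourly token rate."""
    return rate_per_hour * WAKING_HOURS_PER_DAY * DAYS_PER_YEAR


# The example rate, the two model means, and the bounds of the empirical range.
for rate in (100, 229.97, 269.42, 500, 2000):
    print(f"{rate:>8} tokens/hour -> {tokens_per_year(rate):,.0f} tokens/year")
```

At 100 tokens per hour this gives 438,000 tokens per year, which is where the "fewer than 0.5 million" figure comes from.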

Word Difficulties

What are the models’ distributions of estimated word difficulties?

The mean estimated word difficulty for the baseline model is 7.52 (SD = 1.94), while the mean for the word frequency model is 21.13 (SD = 67.82). Although the estimated word difficulties in the baseline model are approximately normally distributed, the mean number of times a word must be heard to be learned is quite low. The mean word difficulty in the word frequency model is higher primarily because a small proportion of words are estimated to require hundreds of tokens to learn; the median in that model is only 3.28.

Comparing the Models’ Estimated Parameters

How do subjects’ input rates recovered in the baseline model compare to those in the frequency model?

Children’s estimated rates from the two models are highly correlated (r = 0.95).

Word difficulties from the baseline model were not correlated with word difficulties in the frequency model (r = 0.07).

Age and input rate

Is there a correlation between children’s estimated input rates and their age? In the baseline model, the correlation of children’s age and estimated rate was -0.55. In the frequency model, the correlation of children’s age and estimated rate was -0.57.

Only the 15-month-olds have realistic estimated input rates (and given that the estimated start age is ~15.7-16.0 months, this is not comforting).

Word difficulty by lexical class

Examine a small subset of words

Estimated Word Difficulty vs. CHILDES Frequency

In the baseline model, the estimated word difficulty was essentially uncorrelated with log word frequency from CHILDES (r = -0.07). In the frequency model, the estimated word difficulty was positively correlated with log word frequency from CHILDES (r = 0.58).
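For reference, a correlation of this kind can be computed as a Pearson r between difficulty and log frequency. The Python sketch below uses synthetic placeholder values, not the estimates reported here; the helper name `pearson_r` and the way the placeholder difficulties are generated are assumptions made purely for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)
# Placeholder frequencies, log-uniform over the CHILDES range quoted earlier.
log_lo, log_hi = math.log(3.88e-7), math.log(0.044)
log_freq = [random.uniform(log_lo, log_hi) for _ in range(500)]
# Placeholder difficulties: loosely increasing with log frequency, plus noise.
difficulty = [5 + 0.8 * lf + random.gauss(0, 3) for lf in log_freq]

r = pearson_r(difficulty, log_freq)
print(f"r = {r:.2f}")
```

Taking the log first matters because the frequencies span about five orders of magnitude, so a correlation on the raw probabilities would be dominated by a handful of very frequent words.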

Model comparison

Which of these models fits the data better?

Note: this takes a while to run…save output?