Main findings

-There is high individual variability in children’s lexical organization, compared to adults’ organization.

-Despite this high individual variability, the associations produced by children have interesting common properties:

  1. They tend to be more phonetically similar. In particular, children tend to have more minimal pairs in their associations (e.g., “house” -> “mouse”).

  2. They tend to be less thematically/semantically related, although not at chance.

  3. The phonetic and semantic relatedness in children’s associations interact: The pairs that are more similar phonetically are less similar semantically.

Descriptive statistics

Number of subjects in each age group. (In some analyses I collapse “Younger” and “Older” into one category to gain statistical power.)

## # A tibble: 3 x 2
##      Age2 N_subjects
##     <chr>      <int>
## 1   Adult         60
## 2   Older         17
## 3 Younger         43

Number of cues (Experimenter’s word) by age group, and average number of subjects per cue

## # A tibble: 3 x 3
##      Age2 N_cues Ave_subjects
##     <chr>  <int>        <dbl>
## 1   Adult     66    28.787879
## 2   Older     66     7.333333
## 3 Younger     65    16.615385

Ages in each group

## # A tibble: 15 x 2
##      Age    Age2
##    <int>   <chr>
##  1     3 Younger
##  2     4 Younger
##  3     5 Younger
##  4     6   Older
##  5     7   Older
##  6     8   Older
##  7    18   Adult
##  8    19   Adult
##  9    20   Adult
## 10    21   Adult
## 11    22   Adult
## 12    24   Adult
## 13    26   Adult
## 14    38   Adult
## 15    43   Adult

Result 0: Development of paradigmatic vs. syntagmatic associations

This is a replication of previous findings (Cite XX). The relations were hand-coded as either paradigmatic or syntagmatic.

Result 1: Development of associations’ entropy

I measure agreement among subjects in their associations as a function of age. Agreement is quantified through standard information-theoretical measures (here I use Normalized Entropy).

For each cue \(y\) (i.e., Experimenter’s word), I compute the normalized entropy defined as:

\[H(y) = -\sum_{i=1}^{N} \frac{p(x_i)\,\log_2 p(x_i)}{\log_2 N}\]

\(p(x_i)\) is the probability of a target \(x_i\) (i.e., Child’s word), which I obtain, for each cue, by averaging across subjects’ responses for that cue. \(H\) takes values between 0 (no disagreement) and 1 (no agreement).

\(N\) is the total number of answers provided by all subjects for a given cue.
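As a minimal sketch of this computation (written in Python for illustration; it is not the code used for the analysis), the normalized entropy of one cue can be obtained from the list of answers it received. The function name `normalized_entropy` is hypothetical. I assume here that the sum runs over the distinct targets and that \(N\) is the total number of answers, so that \(H = 1\) when every subject gives a different answer.

```python
import math
from collections import Counter


def normalized_entropy(targets):
    """Normalized entropy of the answers given to a single cue.

    `targets` is the list of all answers for that cue (one per subject).
    Assumption: the sum runs over distinct targets, and N is the total
    number of answers, so H = 1 when all answers differ.
    """
    n = len(targets)
    if n < 2:
        # log2(1) = 0, so the ratio is undefined; a single answer
        # carries no disagreement, so return 0 by convention.
        return 0.0
    counts = Counter(targets)
    # p * log2(1 / p) is the same quantity as -p * log2(p).
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return entropy / math.log2(n)


# Toy answers for one cue (made up for illustration).
full_agreement = ["dog", "dog", "dog", "dog"]
no_agreement = ["dog", "mouse", "hat", "milk"]
print(normalized_entropy(full_agreement), normalized_entropy(no_agreement))
```

Averaging this value over all cues within an age group would give the quantity plotted per group.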

The graph below shows the average entropy across all cues (with 95% CI).

I find that entropy for children is very high, and it is higher than adults’ entropy. This means that, compared to adults, children differ strongly at the individual level in terms of their associations.

Below I study the properties of the cue -> target relations. I investigate the phonetic and semantic proximity, as well as the correlation between these two measures.

Result 2: Development of the phonetic proximity in the free associations

I investigate the extent to which the cue -> target relations are determined by phonetic proximity (e.g., “house” -> “mouse”).

I convert the orthographic words into their phonetic transcription, and I measure the Levenshtein distance (also known as “edit” distance). This measure counts the minimum number of operations (insertions, deletions, substitutions) required to change one string into another. For example, the edit distance of the pair (house -> mouse) is 1, because only one operation is needed: substituting “h” with “m” (for illustration, this example uses the orthographic form rather than the phonetic transcription).
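The edit distance described above can be sketched with the standard dynamic-programming recurrence. This is an illustrative Python version, not the implementation used for the analysis; the function name `edit_distance` and the phoneme lists in the usage lines are assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences.

    Works on strings (character by character) or on lists of phoneme
    symbols, so multi-character phonemes count as a single unit.
    """
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j symbols of `b`.
    prev = list(range(len(b) + 1))
    for i, sym_a in enumerate(a, start=1):
        curr = [i]
        for j, sym_b in enumerate(b, start=1):
            cost = 0 if sym_a == sym_b else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


# Orthographic example from the text.
print(edit_distance("house", "mouse"))

# Hypothetical phoneme-level transcriptions, one list item per phoneme.
print(edit_distance(["h", "aʊ", "s"], ["m", "aʊ", "θ"]))
```

Passing lists of phonemes rather than raw strings keeps a diphthong such as “aʊ” from being counted as two symbols.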

I remove from the analysis the cases where targets were identical to the cues (e.g., cat -> cat).

I find that children’s associations are, on average, more phonetically related than adults’ associations. One interesting explanation is that children’s free associations are determined to a greater extent by phonetic proximity. If this is the case, we would expect the difference between children and adults to be driven mostly by minimal pairs (e.g., mouse -> house). I test this prediction in what follows.

For each cue, I measure the probability that the target will have a given edit distance. E.g., for the cue “house”, what is the probability that the target will have an edit distance of 1 (e.g., “mouse”), an edit distance of 2 (e.g., “mouth”), and so on. Then I compute the average proportions across all cues (with 95% CI).
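The per-cue proportions and their average across cues can be sketched as follows. This is an illustrative Python version under simplifying assumptions: the input is a list of (cue, edit distance) pairs with identical cue-target pairs already removed, the function name `mean_distance_proportions` is hypothetical, the data are made up, and the confidence intervals are left out.

```python
from collections import Counter, defaultdict


def mean_distance_proportions(rows):
    """Average, across cues, of the proportion of targets at each distance.

    `rows` is an iterable of (cue, edit_distance) pairs, one per response.
    """
    per_cue = defaultdict(Counter)
    for cue, dist in rows:
        per_cue[cue][dist] += 1

    all_distances = sorted({d for counts in per_cue.values() for d in counts})
    result = {}
    for d in all_distances:
        # A cue with no target at distance d contributes a proportion of 0.
        proportions = [counts[d] / sum(counts.values())
                       for counts in per_cue.values()]
        result[d] = sum(proportions) / len(proportions)
    return result


# Toy responses (made up for illustration).
toy_rows = [
    ("house", 1), ("house", 1), ("house", 3), ("house", 4),
    ("cat", 1), ("cat", 2),
]
print(mean_distance_proportions(toy_rows))
```

Computing the proportion within each cue first, and only then averaging, gives every cue equal weight regardless of how many subjects responded to it.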

I collapse “Older” and “Younger” into one age group (“Young”).

For a given cue, children and adults are equally likely to give a target which is NOT a minimal pair. They only differ in the probability of answering with a minimal pair: Children are more likely to give a minimal pair as a target.

Result 3: Development of the semantic proximity in the free associations

Now I investigate the semantic proximity of the free associations.

I measure the semantic similarity of the cue-target pairs using the state-of-the-art distributional semantic model known as Word2Vec. In this model, two words are similar if they tend to occur in similar contexts in a large corpus of text.

I use vectors trained on two kinds of texts:

Measure 1: the semantic similarity is derived from a model trained on a (very) large Wikipedia corpus (around 6B tokens). This measure is supposed to offer a quantification of similarity from an adult perspective. It is supposed to be a rather “objective” representation of semantic similarity in English.

Measure 2: the semantic similarity is derived from a model trained on a corpus of child-directed speech. This measure is supposed to approximate the perspective of the child, because children are more likely to derive co-occurrence similarity from such a specific corpus than from a more language-representative text such as Wikipedia.
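For both measures, the similarity of a cue-target pair is computed from the two word vectors. The sketch below assumes cosine similarity, which is the usual choice with Word2Vec but is an assumption here rather than something stated above. The three-dimensional vectors are toy values invented for the example; real vectors would come from the trained models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity is undefined for a zero vector
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical, not taken from any trained model).
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.1],
    "spoon": [0.0, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["dog"], toy_vectors["cat"]))
print(cosine_similarity(toy_vectors["dog"], toy_vectors["spoon"]))
```

The same function would be applied twice per pair, once with the Wikipedia-based vectors and once with the vectors trained on child-directed speech.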

Wikipedia-based co-occurrence similarity

CHILDES-based co-occurrence similarity

For both models, I find a clear developmental trend in terms of the semantic proximity of the associations.

One might wonder if this is related to the tension with phonetic similarity. That is, children’s answers are less semantically similar because children also tend to give phonetically similar answers (which are not necessarily semantically related). If this is the case, we should expect a greater drop in semantic similarity for shorter phonetic distances. I will explore this in the next section.

Result 4: Correlation of Semantic vs. Phonetic similarity

For each value of the edit distance between cue and target, I compute the average semantic similarity of the cue-target pairs.

For this analysis I collapsed “Older” and “Younger” into one age group “Young”.
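The grouping step can be sketched as follows: each response carries an age group, a phonetic distance, and a semantic similarity, and the similarities are averaged within each (age group, distance) cell after collapsing the two child groups. This is an illustrative Python version with made-up numbers; the names `COLLAPSE` and `mean_similarity_by_distance` are hypothetical.

```python
from collections import defaultdict

# Collapse the two child groups into a single "Young" group.
COLLAPSE = {"Younger": "Young", "Older": "Young", "Adult": "Adult"}


def mean_similarity_by_distance(rows):
    """Mean semantic similarity per (collapsed age group, edit distance).

    `rows` is an iterable of (age_group, edit_distance, similarity) triples.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for age_group, dist, similarity in rows:
        key = (COLLAPSE[age_group], dist)
        totals[key][0] += similarity
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Toy responses (made up for illustration).
toy_rows = [
    ("Younger", 1, 0.1),
    ("Older", 1, 0.3),
    ("Adult", 1, 0.6),
    ("Younger", 3, 0.4),
]
print(mean_similarity_by_distance(toy_rows))
```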

Wikipedia-based co-occurrence similarity

CHILDES-based co-occurrence similarity

I found that the semantic similarity is, overall, higher in adults’ associations across almost all values of phonetic distance. However, I also found that the highest discrepancy between children and adults was in the minimal pairs (i.e., phon_dist = 1). Children tend to provide targets that are phonetically minimal pairs with the cue, regardless of their semantic relatedness. Indeed:

-As a first observation, the mean semantic similarity for minimal pairs was closer to chance than for other values of phonetic distance. The exception is longer phonetic distances (> 5), where statistical power is also lower.

-As a second analysis, we can compare the semantic similarity of the minimal pairs to the total average across all phonetic distances (far left of the graph). For children (but not adults), the semantic similarity of the minimal pairs is much lower than the average. (A more statistically sound comparison would contrast the set of minimal pairs with the set of non-minimal pairs; I expect it to give similar results.)