This template shows how to fit age of acquisition (AoA) prediction models using the data and scripts in this repository.

The general steps are:

- loading the provided CDI data and predictor data
- adding your predictor(s) to the provided predictor data
- using the functions in scripts/prep_data.R to prepare the data for modeling
- using the functions in scripts/fit_models.R to fit models and extract information from them

Load Wordbank data

Loading cached Wordbank data for English:

Creating saved Wordbank data for an uncached language (example with English):

Creating saved Wordbank data one step at a time (potentially making changes between steps) for an uncached language (example with English):

Loading Wordbank data for multiple languages (cached or not):

## Loading data for Croatian...
## Loading data for Danish...
## Loading data for English (American)...
## Loading data for Norwegian...
## Loading data for Russian...
## Loading data for Turkish...
## Loading data for Spanish (Mexican)...
## Loading data for Italian...
## Loading data for Swedish...
## Loading data for French (Quebecois)...

## # A tibble: 154,860 × 7
##    language measure  uni_lemma   age num_true total items           
##    <chr>    <chr>    <chr>     <int>    <int> <int> <list>          
##  1 Croatian produces about        16        0    13 <tibble [1 × 5]>
##  2 Croatian produces about        17        6    25 <tibble [1 × 5]>
##  3 Croatian produces about        18        5    22 <tibble [1 × 5]>
##  4 Croatian produces about        19        1    21 <tibble [1 × 5]>
##  5 Croatian produces about        20        1    26 <tibble [1 × 5]>
##  6 Croatian produces about        21        2    29 <tibble [1 × 5]>
##  7 Croatian produces about        22        3    24 <tibble [1 × 5]>
##  8 Croatian produces about        23        7    26 <tibble [1 × 5]>
##  9 Croatian produces about        24        8    30 <tibble [1 × 5]>
## 10 Croatian produces about        25        8    30 <tibble [1 × 5]>
## # … with 154,850 more rows

Load predictors

Ratings and phonemes

Merge the by-concept predictors (babiness, concreteness, etc.) into the unilemmas and the by-word predictors (phonemes) into the words/definitions.
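The two merges described above can be sketched in base R as follows. This is a minimal illustration, not the repository's own code: the data frames `items`, `concept_preds`, and `word_preds`, the `definition` join key, and all values are made up for the example.

```r
# Toy CDI items: one row per word form (definition) in each language.
# All values below are invented for illustration.
items <- data.frame(
  language   = c("English (American)", "English (American)", "Croatian"),
  uni_lemma  = c("dog", "apple", "dog"),
  definition = c("dog", "apple", "pas"),
  stringsAsFactors = FALSE
)

# By-concept predictors: one row per uni_lemma, shared across languages.
concept_preds <- data.frame(
  uni_lemma    = c("dog", "apple"),
  concreteness = c(4.85, 5.00),
  babiness     = c(6.2, 4.1),
  stringsAsFactors = FALSE
)

# By-word predictors: one row per language-specific word form.
word_preds <- data.frame(
  language    = c("English (American)", "English (American)", "Croatian"),
  definition  = c("dog", "apple", "pas"),
  length_phon = c(3, 3, 3),
  stringsAsFactors = FALSE
)

# Left joins: keep every item even if a predictor is missing (it becomes NA).
merged <- merge(items, concept_preds, by = "uni_lemma", all.x = TRUE)
merged <- merge(merged, word_preds, by = c("language", "definition"), all.x = TRUE)
print(merged)
```

Using left joins (`all.x = TRUE`) keeps items without a predictor value in the data, so that missing values can be handled later, for example at the imputation step.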

CHILDES

Loading cached CHILDES data for English:

Loading cached CHILDES data for multiple languages:

Creating saved CHILDES data for English, potentially changing which metrics are computed and/or arguments that are passed to childesr:

Creating saved CHILDES data for many languages:

Prepare data for modeling

Frequency transformations

Order of operations: frequency, weighted, log transformed and Laplace smoothed, residualized.

## # A tibble: 4,061 × 20
##    language uni_lemma tokens    n_tokens count count_first count_last count_solo
##    <chr>    <chr>     <chr>        <int> <int>       <int>      <int>      <int>
##  1 Croatian about     o                1   111          66          1         14
##  2 Croatian airplane  avion,av…        4    54           4         28          4
##  3 Croatian all       sve,svi          2   306          47         31          1
##  4 Croatian all gone  nema,nem…        7   604         265         59         31
##  5 Croatian animal    životinj…        3    33           1         20          0
##  6 Croatian another   druga,dr…       10   228          16         49          3
##  7 Croatian apple     jabuka,j…        3     9           0          6          0
##  8 Croatian arm       ruka,ruk…        5    84           3         43          0
##  9 Croatian armchair  fotelje,…        3    11           0          4          0
## 10 Croatian asleep    spava,sp…       10    92          13         34          7
## # … with 4,051 more rows, and 12 more variables: mlu <dbl>, length_char <dbl>,
## #   length_phon <dbl>, freq_raw <dbl>, sumcount <int>, sumcount_first <int>,
## #   sumcount_last <int>, sumcount_solo <int>, frequency <dbl>,
## #   final_frequency <dbl>, first_frequency <dbl>, solo_frequency <dbl>
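As a rough sketch of the smoothing, log transformation, and residualization steps, the base R example below uses counts copied from the first rows of the output above. The helper `laplace_log()` is hypothetical, the weighting step is omitted, and residualizing final frequency against overall frequency is an assumption about what "residualized" means here, not a description of scripts/prep_data.R.

```r
# Counts taken from the Croatian rows shown in the output above.
freqs <- data.frame(
  uni_lemma  = c("about", "airplane", "all", "apple", "arm"),
  count      = c(111, 54, 306, 9, 84),
  count_last = c(1, 28, 31, 6, 43)
)

# Hypothetical helper: Laplace (add-one) smoothing keeps zero counts finite
# after the log; counts are normalised to proportions before logging.
laplace_log <- function(x) log((x + 1) / sum(x + 1))

freqs$frequency       <- laplace_log(freqs$count)
freqs$final_frequency <- laplace_log(freqs$count_last)

# Residualise (assumed meaning): keep the part of final_frequency that is
# not linearly predicted by overall frequency.
fit <- lm(final_frequency ~ frequency, data = freqs)
freqs$final_frequency_resid <- resid(fit)
print(freqs)
```

After residualization the positional frequency is uncorrelated with overall frequency in the sample, which is one way to reduce collinearity among frequency-based predictors.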

Combine mapped predictors and CHILDES predictors:

Imputation

## # A tibble: 10 × 3
## # Groups:   language [10]
##    language           frequency naperlang
##    <chr>                  <dbl>     <int>
##  1 Croatian                  NA        52
##  2 Danish                    NA        82
##  3 English (American)        NA        18
##  4 French (Quebecois)        NA        60
##  5 Italian                   NA        25
##  6 Norwegian                 NA        54
##  7 Russian                   NA       261
##  8 Spanish (Mexican)         NA        47
##  9 Swedish                   NA        56
## 10 Turkish                   NA        84

## Imputing Croatian with 20 steps...
## Imputing Danish with 20 steps...
## Imputing English (American) with 20 steps...
## Imputing French (Quebecois) with 20 steps...
## Imputing Italian with 20 steps...
## Imputing Norwegian with 20 steps...
## Imputing Russian with 20 steps...
## Imputing Spanish (Mexican) with 20 steps...
## Imputing Swedish with 20 steps...
## Imputing Turkish with 20 steps...

## # A tibble: 10 × 3
## # Groups:   language [10]
##    language           data                imputed            
##    <chr>              <list>              <list>             
##  1 Croatian           <tibble [423 × 13]> <tibble [423 × 13]>
##  2 Danish             <tibble [418 × 13]> <tibble [418 × 13]>
##  3 English (American) <tibble [693 × 13]> <tibble [693 × 13]>
##  4 Norwegian          <tibble [409 × 13]> <tibble [409 × 13]>
##  5 Russian            <tibble [696 × 13]> <tibble [696 × 13]>
##  6 Turkish            <tibble [438 × 13]> <tibble [438 × 13]>
##  7 Spanish (Mexican)  <tibble [431 × 13]> <tibble [431 × 13]>
##  8 Italian            <tibble [425 × 13]> <tibble [425 × 13]>
##  9 Swedish            <tibble [407 × 13]> <tibble [407 × 13]>
## 10 French (Quebecois) <tibble [636 × 13]> <tibble [636 × 13]>

Fit models

AoA lm model

## # A tibble: 20 × 7
## # Groups:   language, measure [20]
##    language           measure     data      model  coefficients  rsquared  preds
##    <chr>              <chr>       <list>    <list> <list>        <list>    <lis>
##  1 Croatian           produces    <tibble … <lm>   <tibble [55 … <tibble … <chr…
##  2 Croatian           understands <tibble … <lm>   <tibble [55 … <tibble … <chr…
##  3 Danish             produces    <tibble … <lm>   <tibble [55 … <tibble … <chr…
##  4 Danish             understands <tibble … <lm>   <tibble [55 … <tibble … <chr…
##  5 English (American) produces    <tibble … <lm>   <tibble [55 … <tibble … <chr…
##  6 English (American) understands <tibble … <lm>   <tibble [55 … <tibble … <chr…
##  7 Norwegian          produces    <tibble … <lm>   <tibble [55 … <tibble … <chr…
##  8 Norwegian          understands <tibble … <lm>   <tibble [55 … <tibble … <chr…
##  9 Russian            produces    <tibble … <lm>   <tibble [66 … <tibble … <chr…
## 10 Russian            understands <tibble … <lm>   <tibble [66 … <tibble … <chr…
## 11 Turkish            produces    <tibble … <lm>   <tibble [55 … <tibble … <chr…
## 12 Turkish            understands <tibble … <lm>   <tibble [55 … <tibble … <chr…
## 13 Spanish (Mexican)  produces    <tibble … <lm>   <tibble [55 … <tibble … <chr…
## 14 Spanish (Mexican)  understands <tibble … <lm>   <tibble [55 … <tibble … <chr…
## 15 Italian            produces    <tibble … <lm>   <tibble [55 … <tibble … <chr…
## 16 Italian            understands <tibble … <lm>   <tibble [55 … <tibble … <chr…
## 17 Swedish            produces    <tibble … <lm>   <tibble [55 … <tibble … <chr…
## 18 Swedish            understands <tibble … <lm>   <tibble [55 … <tibble … <chr…
## 19 French (Quebecois) produces    <tibble … <lm>   <tibble [55 … <tibble … <chr…
## 20 French (Quebecois) understands <tibble … <lm>   <tibble [55 … <tibble … <chr…

Evaluate collinearity

Evaluate collinearity between predictors using the variance inflation factor (VIF) of a set of predictors (pred). For a single language:

## # A tibble: 20 × 5
##    language           measure                    data predictors      vif_results
##    <chr>              <chr>       <list<tibble[,16]>> <chr>                 <dbl>
##  1 English (American) produces             [693 × 16] frequency              2.24
##  2 English (American) produces             [693 × 16] mlu                    1.87
##  3 English (American) produces             [693 × 16] final_frequency        2.29
##  4 English (American) produces             [693 × 16] valence                1.11
##  5 English (American) produces             [693 × 16] concreteness           2.10
##  6 English (American) produces             [693 × 16] babiness               1.09
##  7 English (American) produces             [693 × 16] first_frequency        2.29
##  8 English (American) produces             [693 × 16] solo_frequency         2.20
##  9 English (American) produces             [693 × 16] length_char            1.44
## 10 English (American) produces             [693 × 16] n_tokens               1.17
## 11 English (American) understands          [408 × 16] frequency              2.12
## 12 English (American) understands          [408 × 16] mlu                    2.76
## 13 English (American) understands          [408 × 16] final_frequency        2.25
## 14 English (American) understands          [408 × 16] valence                1.14
## 15 English (American) understands          [408 × 16] concreteness           1.97
## 16 English (American) understands          [408 × 16] babiness               1.11
## 17 English (American) understands          [408 × 16] first_frequency        2.82
## 18 English (American) understands          [408 × 16] solo_frequency         2.75
## 19 English (American) understands          [408 × 16] length_char            1.33
## 20 English (American) understands          [408 × 16] n_tokens               1.15
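For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The base R sketch below computes it by hand on simulated data. The function `vif_manual()` and the toy data are illustrative assumptions and are not the repository's implementation.

```r
# Hypothetical helper: VIF for each predictor, computed from the R-squared
# of a regression of that predictor on the remaining predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Simulated data: solo_frequency is built to correlate with frequency,
# while concreteness is independent of both.
set.seed(1)
n <- 200
toy <- data.frame(frequency = rnorm(n), concreteness = rnorm(n))
toy$solo_frequency <- 0.8 * toy$frequency + rnorm(n, sd = 0.6)

vifs <- vif_manual(toy, c("frequency", "concreteness", "solo_frequency"))
print(round(vifs, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others; in this toy data the two correlated frequency measures get larger values than concreteness.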

The same VIF evaluation, for a set of languages:

## # A tibble: 200 × 5
##    language measure                 data predictors      vif_results
##    <chr>    <chr>    <list<tibble[,16]>> <chr>                 <dbl>
##  1 Croatian produces          [402 × 16] frequency              1.98
##  2 Croatian produces          [402 × 16] mlu                    1.22
##  3 Croatian produces          [402 × 16] final_frequency        1.32
##  4 Croatian produces          [402 × 16] valence                1.08
##  5 Croatian produces          [402 × 16] concreteness           1.46
##  6 Croatian produces          [402 × 16] babiness               1.05
##  7 Croatian produces          [402 × 16] first_frequency        1.38
##  8 Croatian produces          [402 × 16] solo_frequency         1.39
##  9 Croatian produces          [402 × 16] length_char            1.38
## 10 Croatian produces          [402 × 16] n_tokens               1.30
## # … with 190 more rows

Cross-validation

## # A tibble: 50 × 12
##    language   measure   name test   train model test_word lexical_category   aoa
##    <chr>      <chr>    <int> <list> <lis> <lis> <chr>     <fct>            <dbl>
##  1 English (… produces     1 <int … <int… <lm>  mommy     other             10.4
##  2 English (… produces     1 <int … <int… <lm>  daddy     other             10.4
##  3 English (… produces     1 <int … <int… <lm>  child     other             31.2
##  4 English (… produces     1 <int … <int… <lm>  uh oh     other             13.6
##  5 English (… produces     1 <int … <int… <lm>  grandpa   other             18.8
##  6 English (… produces     1 <int … <int… <lm>  give me … other             25.2
##  7 English (… produces     1 <int … <int… <lm>  child's … other             19.9
##  8 English (… produces     1 <int … <int… <lm>  nurse     other             31.7
##  9 English (… produces     1 <int … <int… <lm>  vagina    nouns             30.7
## 10 English (… produces     1 <int … <int… <lm>  belly bu… nouns             20.6
## # … with 40 more rows, and 3 more variables: aoa_pred <dbl>, abs_dev <dbl>,
## #   se <dbl>
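One common way to cross-validate a linear AoA model is leave-one-out: refit the model without one word, predict that word's AoA, and record the absolute deviation. The sketch below does this in base R on simulated data. Whether scripts/fit_models.R uses this exact scheme is an assumption, as is reading `abs_dev` in the output above as the absolute difference between `aoa` and `aoa_pred`.

```r
# Simulated words with two predictors and a noisy linear AoA (toy values).
set.seed(42)
n <- 40
toy <- data.frame(frequency = rnorm(n), length_char = rpois(n, 5))
toy$aoa <- 20 - 2 * toy$frequency + 0.5 * toy$length_char + rnorm(n)

# Leave-one-out: fit on all rows except i, then predict the held-out row.
loo <- do.call(rbind, lapply(seq_len(n), function(i) {
  fit  <- lm(aoa ~ frequency + length_char, data = toy[-i, ])
  pred <- predict(fit, newdata = toy[i, , drop = FALSE])
  data.frame(test_row = i, aoa = toy$aoa[i], aoa_pred = unname(pred))
}))

# Absolute deviation between observed and predicted AoA (assumed definition).
loo$abs_dev <- abs(loo$aoa - loo$aoa_pred)
print(mean(loo$abs_dev))
```

The mean absolute deviation is in the same units as AoA (months), which makes it easy to compare models with different predictor sets.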

Plot model results

Production:

- Concreteness and valence all over the place
- Babiness consistently negative
- Length char consistently positive
- Mlu mostly non significant, consistently positive
- Final frequency consistently negative
- First frequency consistently negative
- Solo frequency consistently negative
- Frequency consistently negative

- !!!!Final frequency Spanish Mexican
- !!!!Final frequency Turkish
- !!!!Frequency Turkish
- !!!!N tokens negative direction

Comprehension:

- Concreteness all over the place
- Valence consistently negative
- Babiness consistently negative
- Ntokens consistently negative
- Length_char all over the place
- Mlu consistently positive
- Final frequency all over the place
- First frequency all over the place
- Solo frequency consistently negative

- !!!!Frequency Turkish

AoA prediction template