This template shows how to fit age of acquisition (AoA) prediction models using the data and scripts in this repository.
The general steps are:

- loading the provided CDI data and predictor data
- adding your predictor(s) to the provided predictor data
- using the functions in scripts/prep_data.R to prepare the data for modeling
- using the functions in scripts/fit_models.R to fit models and extract information from them
Loading cached Wordbank data for English:
Creating saved Wordbank data for an uncached language (example with English):
Creating saved Wordbank data one step at a time (potentially making changes between steps) for an uncached language (example with English):
Loading Wordbank data for multiple languages (cached or not):
## Loading data for Croatian...
## Loading data for Danish...
## Loading data for English (American)...
## Loading data for Norwegian...
## Loading data for Russian...
## Loading data for Turkish...
## Loading data for Spanish (Mexican)...
## Loading data for Italian...
## Loading data for Swedish...
## Loading data for French (Quebecois)...
## # A tibble: 154,860 × 7
## language measure uni_lemma age num_true total items
## <chr> <chr> <chr> <int> <int> <int> <list>
## 1 Croatian produces about 16 0 13 <tibble [1 × 5]>
## 2 Croatian produces about 17 6 25 <tibble [1 × 5]>
## 3 Croatian produces about 18 5 22 <tibble [1 × 5]>
## 4 Croatian produces about 19 1 21 <tibble [1 × 5]>
## 5 Croatian produces about 20 1 26 <tibble [1 × 5]>
## 6 Croatian produces about 21 2 29 <tibble [1 × 5]>
## 7 Croatian produces about 22 3 24 <tibble [1 × 5]>
## 8 Croatian produces about 23 7 26 <tibble [1 × 5]>
## 9 Croatian produces about 24 8 30 <tibble [1 × 5]>
## 10 Croatian produces about 25 8 30 <tibble [1 × 5]>
## # … with 154,850 more rows
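Counts like `num_true` and `total` by age are commonly turned into an AoA estimate by fitting a logistic curve over age and taking the age at which half of the children are predicted to know the word. The sketch below illustrates that idea on made-up counts for a single word. It is an assumption about the general approach, not necessarily what the functions in scripts/prep_data.R or scripts/fit_models.R do.

```r
# Made-up by-age counts for one word, using the column names from the table above.
word_data <- data.frame(
  age = 16:25,
  num_true = c(1, 2, 4, 6, 9, 12, 15, 18, 20, 22),
  total = rep(25, 10)
)

# Logistic regression of the proportion of children who know the word on age.
fit <- glm(cbind(num_true, total - num_true) ~ age,
           family = binomial, data = word_data)

# The fitted probability is 0.5 where intercept + slope * age = 0.
b <- coef(fit)
aoa <- unname(-b["(Intercept)"] / b["age"])
aoa
```

The 50% threshold is a convention for this sketch; a different threshold would shift every estimate.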
Merge the by-concept predictors (babiness, concreteness, etc.) into the unilemmas, and the by-word predictors (phonemes) into the words/definitions.
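As a rough illustration of the two merges, the sketch below joins toy tables with base R. The table names (`items`, `concept_preds`, `word_preds`), the join keys, and all values are assumptions made for this example; the repository's own data and helper functions may be organized differently.

```r
# Toy by-concept predictors, keyed by uni_lemma (values are made up).
concept_preds <- data.frame(
  uni_lemma = c("about", "apple", "arm"),
  concreteness = c(1.8, 5.0, 4.9),
  stringsAsFactors = FALSE
)

# Toy by-word predictors, keyed by language and definition (values are made up).
word_preds <- data.frame(
  language = "Croatian",
  definition = c("o", "jabuka", "ruka"),
  length_phon = c(1, 6, 4),
  stringsAsFactors = FALSE
)

# Toy item table linking each word/definition to its uni_lemma.
items <- data.frame(
  language = "Croatian",
  definition = c("o", "jabuka", "ruka", "sve"),
  uni_lemma = c("about", "apple", "arm", "all"),
  stringsAsFactors = FALSE
)

# Left joins: keep every item and fill NA where a predictor is missing.
merged <- merge(items, concept_preds, by = "uni_lemma", all.x = TRUE)
merged <- merge(merged, word_preds, by = c("language", "definition"), all.x = TRUE)
merged
```

Using left joins keeps items that lack a predictor value, so the gaps stay visible as `NA` rather than silently dropping rows.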
Loading cached CHILDES data for English:
Loading cached CHILDES data for multiple languages:
Creating saved CHILDES data for English, potentially changing which metrics are computed and/or arguments that are passed to childesr:
Creating saved CHILDES data for many languages:
Processing order: compute frequency, weight, log transform with Laplace smoothing, then residualize.
## # A tibble: 4,061 × 20
## language uni_lemma tokens n_tokens count count_first count_last count_solo
## <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 Croatian about o 1 111 66 1 14
## 2 Croatian airplane avion,av… 4 54 4 28 4
## 3 Croatian all sve,svi 2 306 47 31 1
## 4 Croatian all gone nema,nem… 7 604 265 59 31
## 5 Croatian animal životinj… 3 33 1 20 0
## 6 Croatian another druga,dr… 10 228 16 49 3
## 7 Croatian apple jabuka,j… 3 9 0 6 0
## 8 Croatian arm ruka,ruk… 5 84 3 43 0
## 9 Croatian armchair fotelje,… 3 11 0 4 0
## 10 Croatian asleep spava,sp… 10 92 13 34 7
## # … with 4,051 more rows, and 12 more variables: mlu <dbl>, length_char <dbl>,
## # length_phon <dbl>, freq_raw <dbl>, sumcount <int>, sumcount_first <int>,
## # sumcount_last <int>, sumcount_solo <int>, frequency <dbl>,
## # final_frequency <dbl>, first_frequency <dbl>, solo_frequency <dbl>
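To make the smoothing, log transformation, and residualization of the frequency measures concrete, here is a minimal base R sketch on illustrative counts. It assumes that Laplace smoothing means adding 1 to each count, and that residualizing means regressing a positional frequency (here utterance-final) on overall frequency and keeping the residuals. The weighting step is not shown, and `smooth_log_freq` is a hypothetical helper name.

```r
# Illustrative counts: overall count and utterance-final count per word.
counts <- data.frame(
  uni_lemma = c("about", "airplane", "all", "apple", "arm"),
  count = c(111, 54, 306, 9, 84),
  count_last = c(1, 28, 31, 6, 43)
)

# Laplace smoothing adds 1 to every count so that zero counts stay finite
# under log(); dividing by the smoothed total gives a relative frequency.
smooth_log_freq <- function(x) {
  smoothed <- x + 1
  log(smoothed / sum(smoothed))
}

counts$frequency <- smooth_log_freq(counts$count)
counts$final_raw <- smooth_log_freq(counts$count_last)

# Residualize: keep the part of final frequency not explained by frequency.
counts$final_frequency <- resid(lm(final_raw ~ frequency, data = counts))
counts
```

After residualization, `final_frequency` is uncorrelated with `frequency` by construction, which is one reason to residualize before entering both into the same model.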
Combine mapped predictors and CHILDES predictors:
## # A tibble: 10 × 3
## # Groups: language [10]
## language frequency naperlang
## <chr> <dbl> <int>
## 1 Croatian NA 52
## 2 Danish NA 82
## 3 English (American) NA 18
## 4 French (Quebecois) NA 60
## 5 Italian NA 25
## 6 Norwegian NA 54
## 7 Russian NA 261
## 8 Spanish (Mexican) NA 47
## 9 Swedish NA 56
## 10 Turkish NA 84
## Imputing Croatian with 20 steps...
## Imputing Danish with 20 steps...
## Imputing English (American) with 20 steps...
## Imputing French (Quebecois) with 20 steps...
## Imputing Italian with 20 steps...
## Imputing Norwegian with 20 steps...
## Imputing Russian with 20 steps...
## Imputing Spanish (Mexican) with 20 steps...
## Imputing Swedish with 20 steps...
## Imputing Turkish with 20 steps...
## # A tibble: 10 × 3
## # Groups: language [10]
## language data imputed
## <chr> <list> <list>
## 1 Croatian <tibble [423 × 13]> <tibble [423 × 13]>
## 2 Danish <tibble [418 × 13]> <tibble [418 × 13]>
## 3 English (American) <tibble [693 × 13]> <tibble [693 × 13]>
## 4 Norwegian <tibble [409 × 13]> <tibble [409 × 13]>
## 5 Russian <tibble [696 × 13]> <tibble [696 × 13]>
## 6 Turkish <tibble [438 × 13]> <tibble [438 × 13]>
## 7 Spanish (Mexican) <tibble [431 × 13]> <tibble [431 × 13]>
## 8 Italian <tibble [425 × 13]> <tibble [425 × 13]>
## 9 Swedish <tibble [407 × 13]> <tibble [407 × 13]>
## 10 French (Quebecois) <tibble [636 × 13]> <tibble [636 × 13]>
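The "Imputing ... with 20 steps" messages suggest an iterative procedure. One way such an imputation could work is sketched below: missing values start at the column mean and are then repeatedly re-predicted from the other predictors. The function name `impute_iterative`, the argument `max_steps`, and the choice of linear regression are assumptions for this sketch, not a description of the repository's implementation.

```r
# Iteratively impute missing predictor values with linear regressions.
impute_iterative <- function(data, predictors, max_steps = 20) {
  # Remember which cells were missing in the original data.
  missing <- lapply(data[predictors], is.na)
  # Step 0: start every missing value at the column mean.
  for (p in predictors) {
    data[[p]][missing[[p]]] <- mean(data[[p]], na.rm = TRUE)
  }
  for (step in seq_len(max_steps)) {
    for (p in predictors) {
      if (!any(missing[[p]])) next
      others <- setdiff(predictors, p)
      # Fit on originally observed rows, then re-predict the missing ones.
      fit <- lm(reformulate(others, response = p),
                data = data[!missing[[p]], ])
      data[[p]][missing[[p]]] <- predict(fit, newdata = data[missing[[p]], ])
    }
  }
  data
}

set.seed(1)
toy <- data.frame(frequency = rnorm(30), mlu = rnorm(30),
                  concreteness = rnorm(30))
toy$frequency[c(3, 7)] <- NA
toy$mlu[10] <- NA

imputed <- impute_iterative(toy, c("frequency", "mlu", "concreteness"))
colSums(is.na(imputed))
```

Only cells that were originally missing are ever overwritten, so observed values pass through unchanged.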
## # A tibble: 20 × 7
## # Groups: language, measure [20]
## language measure data model coefficients rsquared preds
## <chr> <chr> <list> <list> <list> <list> <lis>
## 1 Croatian produces <tibble … <lm> <tibble [55 … <tibble … <chr…
## 2 Croatian understands <tibble … <lm> <tibble [55 … <tibble … <chr…
## 3 Danish produces <tibble … <lm> <tibble [55 … <tibble … <chr…
## 4 Danish understands <tibble … <lm> <tibble [55 … <tibble … <chr…
## 5 English (American) produces <tibble … <lm> <tibble [55 … <tibble … <chr…
## 6 English (American) understands <tibble … <lm> <tibble [55 … <tibble … <chr…
## 7 Norwegian produces <tibble … <lm> <tibble [55 … <tibble … <chr…
## 8 Norwegian understands <tibble … <lm> <tibble [55 … <tibble … <chr…
## 9 Russian produces <tibble … <lm> <tibble [66 … <tibble … <chr…
## 10 Russian understands <tibble … <lm> <tibble [66 … <tibble … <chr…
## 11 Turkish produces <tibble … <lm> <tibble [55 … <tibble … <chr…
## 12 Turkish understands <tibble … <lm> <tibble [55 … <tibble … <chr…
## 13 Spanish (Mexican) produces <tibble … <lm> <tibble [55 … <tibble … <chr…
## 14 Spanish (Mexican) understands <tibble … <lm> <tibble [55 … <tibble … <chr…
## 15 Italian produces <tibble … <lm> <tibble [55 … <tibble … <chr…
## 16 Italian understands <tibble … <lm> <tibble [55 … <tibble … <chr…
## 17 Swedish produces <tibble … <lm> <tibble [55 … <tibble … <chr…
## 18 Swedish understands <tibble … <lm> <tibble [55 … <tibble … <chr…
## 19 French (Quebecois) produces <tibble … <lm> <tibble [55 … <tibble … <chr…
## 20 French (Quebecois) understands <tibble … <lm> <tibble [55 … <tibble … <chr…
Evaluate collinearity among a set of predictors (pred) using the variance inflation factor (VIF). For a single language:
## # A tibble: 20 × 5
## language measure data predictors vif_results
## <chr> <chr> <list<tibble[,16]>> <chr> <dbl>
## 1 English (American) produces [693 × 16] frequency 2.24
## 2 English (American) produces [693 × 16] mlu 1.87
## 3 English (American) produces [693 × 16] final_frequency 2.29
## 4 English (American) produces [693 × 16] valence 1.11
## 5 English (American) produces [693 × 16] concreteness 2.10
## 6 English (American) produces [693 × 16] babiness 1.09
## 7 English (American) produces [693 × 16] first_frequency 2.29
## 8 English (American) produces [693 × 16] solo_frequency 2.20
## 9 English (American) produces [693 × 16] length_char 1.44
## 10 English (American) produces [693 × 16] n_tokens 1.17
## 11 English (American) understands [408 × 16] frequency 2.12
## 12 English (American) understands [408 × 16] mlu 2.76
## 13 English (American) understands [408 × 16] final_frequency 2.25
## 14 English (American) understands [408 × 16] valence 1.14
## 15 English (American) understands [408 × 16] concreteness 1.97
## 16 English (American) understands [408 × 16] babiness 1.11
## 17 English (American) understands [408 × 16] first_frequency 2.82
## 18 English (American) understands [408 × 16] solo_frequency 2.75
## 19 English (American) understands [408 × 16] length_char 1.33
## 20 English (American) understands [408 × 16] n_tokens 1.15
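For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The base R sketch below computes it on simulated data. `compute_vif` is a hypothetical helper name, and the data are made up, so the values will not match the table above.

```r
# VIF for each predictor: regress it on the remaining predictors,
# then compute 1 / (1 - R^2).
compute_vif <- function(data, predictors) {
  vapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  }, numeric(1))
}

set.seed(42)
n <- 200
toy <- data.frame(frequency = rnorm(n), valence = rnorm(n))
# solo_frequency is built to correlate with frequency, so both get a higher VIF.
toy$solo_frequency <- 0.8 * toy$frequency + rnorm(n, sd = 0.6)

preds <- c("frequency", "solo_frequency", "valence")
vifs <- compute_vif(toy, preds)
vifs

# The same helper can be applied per language by splitting a combined table.
toy$language <- rep(c("Croatian", "Danish"), each = n / 2)
vif_by_language <- lapply(split(toy, toy$language), compute_vif,
                          predictors = preds)
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values indicate that more of its variance is shared with the rest of the set.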
Evaluate collinearity among a set of predictors (pred) using the variance inflation factor (VIF). For a set of languages:
## # A tibble: 200 × 5
## language measure data predictors vif_results
## <chr> <chr> <list<tibble[,16]>> <chr> <dbl>
## 1 Croatian produces [402 × 16] frequency 1.98
## 2 Croatian produces [402 × 16] mlu 1.22
## 3 Croatian produces [402 × 16] final_frequency 1.32
## 4 Croatian produces [402 × 16] valence 1.08
## 5 Croatian produces [402 × 16] concreteness 1.46
## 6 Croatian produces [402 × 16] babiness 1.05
## 7 Croatian produces [402 × 16] first_frequency 1.38
## 8 Croatian produces [402 × 16] solo_frequency 1.39
## 9 Croatian produces [402 × 16] length_char 1.38
## 10 Croatian produces [402 × 16] n_tokens 1.30
## # … with 190 more rows
## # A tibble: 50 × 12
## language measure name test train model test_word lexical_category aoa
## <chr> <chr> <int> <list> <lis> <lis> <chr> <fct> <dbl>
## 1 English (… produces 1 <int … <int… <lm> mommy other 10.4
## 2 English (… produces 1 <int … <int… <lm> daddy other 10.4
## 3 English (… produces 1 <int … <int… <lm> child other 31.2
## 4 English (… produces 1 <int … <int… <lm> uh oh other 13.6
## 5 English (… produces 1 <int … <int… <lm> grandpa other 18.8
## 6 English (… produces 1 <int … <int… <lm> give me … other 25.2
## 7 English (… produces 1 <int … <int… <lm> child's … other 19.9
## 8 English (… produces 1 <int … <int… <lm> nurse other 31.7
## 9 English (… produces 1 <int … <int… <lm> vagina nouns 30.7
## 10 English (… produces 1 <int … <int… <lm> belly bu… nouns 20.6
## # … with 40 more rows, and 3 more variables: aoa_pred <dbl>, abs_dev <dbl>,
## # se <dbl>
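The `train`, `test`, `aoa_pred`, and `abs_dev` columns above are consistent with a cross-validation setup in which each word's AoA is predicted by a model fit without that word. The sketch below shows a leave-one-out version on simulated data. The leave-one-out scheme, the reading of `se` as squared error, and the model formula are all assumptions made for illustration.

```r
set.seed(7)
n <- 40
# Simulated predictors and AoA values (made up for this sketch).
toy <- data.frame(frequency = rnorm(n), concreteness = rnorm(n))
toy$aoa <- 20 - 2 * toy$frequency - 1 * toy$concreteness + rnorm(n)

# For each row: fit on all other rows, then predict the held-out row.
loo <- do.call(rbind, lapply(seq_len(n), function(i) {
  fit <- lm(aoa ~ frequency + concreteness, data = toy[-i, ])
  pred <- unname(predict(fit, newdata = toy[i, ]))
  data.frame(
    test_row = i,
    aoa = toy$aoa[i],
    aoa_pred = pred,
    abs_dev = abs(toy$aoa[i] - pred),   # absolute deviation
    se = (toy$aoa[i] - pred)^2          # squared error (assumed meaning)
  )
}))

mean_abs_dev <- mean(loo$abs_dev)
mean_abs_dev
```

Averaging `abs_dev` over held-out words gives an out-of-sample error in the same units as AoA (months).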
Production:

- Concreteness and valence: no consistent direction
- Babiness: consistently negative
- Length (length_char): consistently positive
- MLU: mostly non-significant, consistently positive
- Final frequency: consistently negative
- First frequency: consistently negative
- Solo frequency: consistently negative
- Frequency: consistently negative

Flagged for attention (production):

- Final frequency: Spanish (Mexican)
- Final frequency: Turkish
- Frequency: Turkish
- Number of tokens (n_tokens): negative direction
Comprehension:

- Concreteness: no consistent direction
- Valence: consistently negative
- Babiness: consistently negative
- Number of tokens (n_tokens): consistently negative
- Length (length_char): no consistent direction
- MLU: consistently positive
- Final frequency: no consistent direction
- First frequency: no consistent direction
- Solo frequency: consistently negative

Flagged for attention (comprehension):

- Frequency: Turkish