Can we identify words on the CDI that are more bookish, speechy, or from TV? (Controlling for difficulty, since more bookish words are probably also more difficult.)
Using this distributional source information, can we find features of children (e.g., mother’s education) that relate to knowledge of subsets of words (e.g., bookish words)?
To start, we will look at American English before extending to British English and French. We will begin with unlemmatized corpora and disregard part-of-speech classification.
Our data sources, summarized as counts of the words in each corpus by CDI status (1 = on the CDI, 0 = not):
##
## books speech TV
## 0 826776 26918 62240
## 1 638 618 636
##
## books speech tv
## 0 42045 38107 83681
## 1 596 656 652
We will calculate keyness scores for each word, i.e., the ratio of normalized frequency in a focus corpus to normalized frequency in a reference corpus. For now, we will use the subset of words found in all four corpora (N=4229), but we may instead follow Dawson et al. (2021) and add a constant (e.g., 10) to all normalized frequencies in every corpus, so as not to eliminate the bulk of words that do not appear in the smaller (child-directed) corpora.
For our reference corpus, we will use adult speech (movie subtitles), as this is the target language distribution that children will eventually learn. (We could instead use child-directed speech, as that is what is given to children, or Google Books frequencies, which could be considered the epitome of an ‘educated’ distribution.) Non-SUBTLEX adult speech is also available from the Charlesworth corpus.
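A rough sketch of this computation (the helper name `smooth_keyness`, the per-million scaling, and the `prop_booky` definition in the comments are our assumptions, not taken verbatim from the analysis code):
# keyness of word w = (f_focus(w) / N_focus) / (f_ref(w) / N_ref), with an additive
# constant k added to both normalized frequencies (cf. Dawson et al., 2021)
smooth_keyness <- function(focus_counts, ref_counts, k = 10) {
  focus_norm <- 1e6 * focus_counts / sum(focus_counts) # frequency per million tokens
  ref_norm <- 1e6 * ref_counts / sum(ref_counts)
  (focus_norm + k) / (ref_norm + k)
}
# prop_booky is presumably the share of a word's normalized frequency contributed
# by the book corpus, i.e. books / (books + speech + tv)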
We first examine the keyness of words in child-directed speech vs. adult speech. Here are the 10 words most over-represented in child-directed speech compared to adult movies, and the 10 words most under-represented in CHILDES compared to adult movies:
#ch_freq %>% arrange(desc(ch_book_vs_speech)) %>% head(10) %>% kable(digits=2)
#ch_freq_smooth %>% arrange(desc(ch_book_vs_speech)) %>% head(10) %>% kable(digits=2)
p1 <- ch_freq %>% mutate(Word = ifelse(on_cdi==1, "CDI", "Non-CDI")) %>%
ggplot(aes(x=CHILDES, y=`Children's Books`, color=Word)) +
geom_point(alpha=.3) + theme_classic() + geom_smooth(method='lm') +
scale_x_log10() + scale_y_log10() + ggtitle("Raw Frequencies (intersection)")
p2 <- ch_freq_smooth %>% mutate(Word = ifelse(on_cdi==1, "CDI", "Non-CDI")) %>%
ggplot(aes(x=CHILDES, y=`Children's Books`, color=Word)) +
geom_point(alpha=.3) + theme_classic() + geom_smooth(method='lm') +
scale_x_log10() + scale_y_log10() + ggtitle("Smoothed Frequencies (union)")
ggpubr::ggarrange(p1, p2, nrow=1, common.legend = T)
## `geom_smooth()` using formula 'y ~ x'
cor(subset(ch_freq, on_cdi==1)$CHILDES, subset(ch_freq, on_cdi==1)$`Children's Books`) # .75
## [1] 0.7522343
cor(subset(ch_freq, on_cdi==0)$CHILDES, subset(ch_freq, on_cdi==0)$`Children's Books`) # .42
## [1] 0.4237489
The same comparison, using the Charlesworth corpora:
p1 <- ch_freq_charles %>% mutate(Word = ifelse(on_cdi==1, "CDI", "Non-CDI")) %>%
ggplot(aes(x=speech, y=books, color=Word)) +
geom_point(alpha=.3) + theme_classic() + geom_smooth(method='lm') +
scale_x_log10() + scale_y_log10() + ggtitle("Raw Frequencies (intersection)")
p2 <- ch_freq_smooth_charles %>% mutate(Word = ifelse(on_cdi==1, "CDI", "Non-CDI")) %>%
ggplot(aes(x=speech, y=books, color=Word)) +
geom_point(alpha=.3) + theme_classic() + geom_smooth(method='lm') +
scale_x_log10() + scale_y_log10() + ggtitle("Smoothed Frequencies (union)")
ggpubr::ggarrange(p1, p2, nrow=1, common.legend = T)
## `geom_smooth()` using formula 'y ~ x'
p1 <- ch_freq_smooth %>% filter(on_cdi==1) %>%
ggplot(aes(x=CHILDES, y=`Children's Books`)) +
geom_point(alpha=.3) + theme_classic() +
scale_x_log10() + scale_y_log10() +
geom_text_repel(aes(label=word)) +
geom_abline(intercept = 0, slope = 1, linetype="dashed") +
ggtitle("CHILDES vs. Montag book corpus")
# could label only the most extreme points via geom_text() on a filtered subset
p2 <- ch_freq_smooth_charles %>% filter(on_cdi==1) %>%
ggplot(aes(x=speech, y=books)) +
geom_point(alpha=.3) + theme_classic() +
scale_x_log10() + scale_y_log10() +
geom_text_repel(aes(label=word)) + xlab("CHILDES") +
geom_abline(intercept = 0, slope = 1, linetype="dashed") +
ggtitle("CHILDES vs. FB Children's Books")
p3 <- ch_freq_smooth_charles %>% filter(on_cdi==1) %>%
ggplot(aes(x=speech, y=tv)) +
geom_point(alpha=.3) + theme_classic() +
scale_x_log10() + scale_y_log10() +
geom_text_repel(aes(label=word)) + ylab("Children's TV and Movies") +
geom_abline(intercept = 0, slope = 1, linetype="dashed") +
ggtitle("CHILDES vs. Children's Movies")
# merge in lexical_class and color points by it?
ggpubr::ggarrange(p1, p2, p3, nrow=1, common.legend = T)
## Warning: ggrepel: 639 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 639 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 642 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
ggsave("CHILDES_smoothed_norm_freqs_vs_books_TV.pdf", width=12, height=6)
## Warning: ggrepel: 633 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 630 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 636 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Below we show the average keyness of child-directed speech and children’s books for CDI vs. non-CDI words, using adult speech as the reference corpus, first for the words common to all corpora and then for the union of all words across the corpora (Laplace-smoothed).
# the CDI/non-CDI difference holds even if we filter out many low-frequency words
ch_key <- ch_freq %>% #filter(CHILDES>1, `Children's Books`>1) %>%
group_by(on_cdi) %>%
tidyboot_mean(ch_book_vs_speech)
ch_key_smooth <- ch_freq_smooth %>%
group_by(on_cdi) %>%
tidyboot_mean(ch_book_vs_speech)
p1 <- ch_key %>% mutate(CDI = ifelse(on_cdi==1, "CDI Words", "non-CDI Words")) %>%
ggplot(aes(x=CDI, y=mean)) +
geom_pointrange(aes(ymin=ci_lower, ymax=ci_upper)) +
theme_classic() + ylab("Children's Book Keyness") + xlab("")
p2 <- ch_key_smooth %>% mutate(CDI = ifelse(on_cdi==1, "CDI Words", "non-CDI Words")) %>%
ggplot(aes(x=CDI, y=mean)) +
geom_pointrange(aes(ymin=ci_lower, ymax=ci_upper)) +
theme_classic() + ylab("Children's Book Keyness") + xlab("")
ggpubr::ggarrange(p1, p2, nrow=1, common.legend = T)
ch_booky <- ch_freq %>% group_by(on_cdi) %>%
tidyboot_mean(prop_booky)
ch_booky_smooth <- ch_freq_smooth %>% group_by(on_cdi) %>%
tidyboot_mean(prop_booky)
p1 <- ch_booky %>% mutate(CDI = ifelse(on_cdi==1, "CDI Words", "non-CDI Words")) %>%
ggplot(aes(x=CDI, y=mean)) +
geom_pointrange(aes(ymin=ci_lower, ymax=ci_upper)) +
theme_classic() + ylab("Proportion of Book Occurrences")
p2 <- ch_booky_smooth %>% mutate(CDI = ifelse(on_cdi==1, "CDI Words", "non-CDI Words")) %>%
ggplot(aes(x=CDI, y=mean)) +
geom_pointrange(aes(ymin=ci_lower, ymax=ci_upper)) +
theme_classic() + ylab("Proportion of Book Occurrences")
ggpubr::ggarrange(p1, p2, nrow=1, common.legend = T)
A table of the smoothed normalized frequencies for each CDI word, based on the Charlesworth corpora:
ch_freq_smooth_charles %>% filter(on_cdi==1) %>%
select(-ch_book_vs_speech) %>%
DT::datatable() %>%
DT::formatRound(columns=c('books','tv','speech'), digits=0) %>%
DT::formatPercentage(columns=c('prop_booky'), digits=0)
Which corpus best predicts words’ AoA? Which corpora contribute unique variance to predicting AoA? (Does the book corpus interact with SES, since we expect that high-SES parents read more to their children?)
load("data/en_ws_production.Rdata")
dd <- ch_freq_smooth_charles %>% filter(on_cdi==1) %>%
select(-ch_book_vs_speech)
summary(dd)
## word on_cdi books tv
## Length:656 Min. :1 Min. : 10.00 Min. : 10.00
## Class :character 1st Qu.:1 1st Qu.: 19.19 1st Qu.: 42.14
## Mode :character Median :1 Median : 74.36 Median : 117.77
## Mean :1 Mean : 852.34 Mean : 785.30
## 3rd Qu.:1 3rd Qu.: 293.14 3rd Qu.: 402.05
## Max. :1 Max. :51379.87 Max. :45860.64
## speech prop_booky
## Min. : 10.18 Min. :0.001372
## 1st Qu.: 67.05 1st Qu.:0.192417
## Median : 156.39 Median :0.314225
## Mean : 974.55 Mean :0.355120
## 3rd Qu.: 474.73 3rd Qu.:0.508740
## Max. :48017.26 Max. :0.887475
# poor man's AoA: mean production probability per item (lower = acquired later)
cdi_acc <- d_en_ws %>% group_by(definition, category, lexical_class) %>%
summarise(mean_produces = mean(produces, na.rm=T)) %>%
mutate(word = case_when(definition=="a lot" ~ "lot",
definition=="all gone" ~ "gone",
#definition=="babysitter's name" ~ "babysitter",
definition=="buttocks/bottom*" ~ "butt",
definition=="call (on phone)" ~ "call",
definition=="can (auxiliary)" ~ "can",
definition=="can (object)" ~ "can",
definition=="chicken (animal)" ~ "chicken",
definition=="chicken (food)" ~ "chicken",
definition=="clean (action)" ~ "clean",
definition=="clean (description)" ~ "clean",
definition=="church*" ~ "church", # add temple, synagogue..?
definition=="daddy*" ~ "daddy",
definition=="mommy*" ~ "mommy",
definition=="grandma*" ~ "grandma",
definition=="grandpa*" ~ "grandpa",
definition=="vagina*" ~ "vagina",
definition=="penis*" ~ "penis",
definition=="drink (object)" ~ "drink",
definition=="drink (action)" ~ "drink",
definition=="did/did ya" ~ "did",
definition=="dress (object)" ~ "dress",
definition=="dry (action)" ~ "dry",
definition=="dry (description)" ~ "dry",
definition=="baa baa" ~ "bah",
definition=="choo choo" ~ "choo",
definition=="woof woof" ~ "woof",
definition=="yum yum" ~ "yum",
definition=="fish (animal)" ~ "fish",
definition=="fish (food)" ~ "fish",
definition=="gas station" ~ "gas", # or station? or average?
definition=="work (action)" ~ "work",
definition=="work (place)" ~ "work",
definition=="water (beverage)" ~ "water",
definition=="water (not beverage)" ~ "water",
definition=="wanna/want to" ~ "wanna", # or want
definition=="watch (action)" ~ "watch",
definition=="watch (object)" ~ "watch",
definition=="TV" ~ "tv",
definition=="washing machine" ~ "washing",
definition=="lawn mower" ~ "mower",
definition=="soda/pop" ~ "soda",
definition=="swing (action)" ~ "swing",
definition=="swing (object)" ~ "swing",
TRUE ~ definition)) %>%
filter(!is.element(definition, c("babysitter's name", "child's own name", "gonna get you!", "give me five!")))
## `summarise()` has grouped output by 'definition', 'category'. You can override using the `.groups` argument.
# remove some words we can't match
itdat <- left_join(cdi_acc, dd)
## Joining, by = "word"
# need to fix all the parenthetical / starred / slashed items
itdat[which(!complete.cases(itdat)),]
## # A tibble: 44 × 10
## # Groups: definition, category [44]
## definition category lexical_class mean_produces word on_cdi books tv
## <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 baa baa sounds other 0.726 bah NA NA NA
## 2 belly button body_par… nouns 0.622 belly … NA NA NA
## 3 buttocks/bo… body_par… nouns 0.544 butt NA NA NA
## 4 drink (beve… food_dri… nouns 0.593 drink … NA NA NA
## 5 french fries food_dri… nouns 0.509 french… NA NA NA
## 6 gas station places other 0.238 gas NA NA NA
## 7 go potty games_ro… other 0.521 go pot… NA NA NA
## 8 gonna/going… helping_… function_wor… 0.282 gonna/… NA NA NA
## 9 gotta/got to helping_… function_wor… 0.167 gotta/… NA NA NA
## 10 green beans food_dri… nouns 0.307 green … NA NA NA
## # … with 34 more rows, and 2 more variables: speech <dbl>, prop_booky <dbl>
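Rather than enumerating every item in the case_when() above, a generic normalizer along these lines could handle most of the parenthetical, starred, and slashed forms (a sketch only; `normalize_definition` is a hypothetical helper, and multiword items like “gas station” would still need case-by-case decisions):
library(stringr)
normalize_definition <- function(x) {
  x %>%
    str_remove(" *\\([^)]*\\)") %>% # "can (object)" -> "can"
    str_remove_all("\\*") %>%       # "daddy*" -> "daddy"
    str_extract("^[^/]+") %>%       # "soda/pop" -> "soda"
    str_trim()
}
normalize_definition(c("can (object)", "daddy*", "soda/pop")) # "can" "daddy" "soda"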
ddd <- itdat %>% filter(!is.na(on_cdi))
summary(lm(mean_produces ~ prop_booky, data=ddd)) # r^2 = .22; more booky = harder
##
## Call:
## lm(formula = mean_produces ~ prop_booky, data = ddd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33885 -0.10905 -0.01788 0.09270 0.45780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.54247 0.01214 44.70 <2e-16 ***
## prop_booky -0.38494 0.02920 -13.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1514 on 630 degrees of freedom
## Multiple R-squared: 0.2162, Adjusted R-squared: 0.215
## F-statistic: 173.8 on 1 and 630 DF, p-value: < 2.2e-16
# log and scale (without centering) the frequency predictors so coefficients are comparable
ddd$books = scale(log(ddd$books), center=F)[,1]
ddd$tv = scale(log(ddd$tv), center=F)[,1]
ddd$speech = scale(log(ddd$speech), center=F)[,1]
# we should tack on adult_books
cor.test(itdat$books, itdat$speech) # .73
##
## Pearson's product-moment correlation
##
## data: itdat$books and itdat$speech
## t = 26.441, df = 630, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6860737 0.7602459
## sample estimates:
## cor
## 0.7252575
cor.test(itdat$tv, itdat$speech) # .83
##
## Pearson's product-moment correlation
##
## data: itdat$tv and itdat$speech
## t = 37.006, df = 630, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8013227 0.8506763
## sample estimates:
## cor
## 0.8275923
cor.test(itdat$tv, itdat$books) # .94
##
## Pearson's product-moment correlation
##
## data: itdat$tv and itdat$books
## t = 67.224, df = 630, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9265349 0.9457217
## sample estimates:
## cor
## 0.9368292
# Books and TV are most correlated; books and speech least correlated
# correlations change if we scale and log the frequencies...
cor.test(ddd$books, ddd$speech) # .81
##
## Pearson's product-moment correlation
##
## data: ddd$books and ddd$speech
## t = 34.882, df = 630, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7833010 0.8367256
## sample estimates:
## cor
## 0.8117043
cor.test(ddd$tv, ddd$speech) # .86
##
## Pearson's product-moment correlation
##
## data: ddd$tv and ddd$speech
## t = 41.713, df = 630, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8346215 0.8762715
## sample estimates:
## cor
## 0.8568381
cor.test(ddd$tv, ddd$books) # .87
##
## Pearson's product-moment correlation
##
## data: ddd$tv and ddd$books
## t = 45.055, df = 630, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8537643 0.8908800
## sample estimates:
## cor
## 0.8735865
# with all 3 sources, only speech and books are significant
summary(lm(mean_produces ~ books + speech + tv, data=ddd))
##
## Call:
## lm(formula = mean_produces ~ books + speech + tv, data = ddd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49532 -0.10147 -0.00957 0.09145 0.44777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.29524 0.02184 13.519 <2e-16 ***
## books -0.37662 0.03339 -11.280 <2e-16 ***
## speech 0.45929 0.04261 10.780 <2e-16 ***
## tv 0.01697 0.04557 0.372 0.71
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1495 on 628 degrees of freedom
## Multiple R-squared: 0.2386, Adjusted R-squared: 0.235
## F-statistic: 65.62 on 3 and 628 DF, p-value: < 2.2e-16
# same predictors in a different order; lm() fits are order-invariant, so the output is identical
summary(lm(mean_produces ~ books + tv + speech, data=ddd))
##
## Call:
## lm(formula = mean_produces ~ books + tv + speech, data = ddd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49532 -0.10147 -0.00957 0.09145 0.44777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.29524 0.02184 13.519 <2e-16 ***
## books -0.37662 0.03339 -11.280 <2e-16 ***
## tv 0.01697 0.04557 0.372 0.71
## speech 0.45929 0.04261 10.780 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1495 on 628 degrees of freedom
## Multiple R-squared: 0.2386, Adjusted R-squared: 0.235
## F-statistic: 65.62 on 3 and 628 DF, p-value: < 2.2e-16
summary(lm(mean_produces ~ books + speech, data=ddd))
##
## Call:
## lm(formula = mean_produces ~ books + speech, data = ddd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50584 -0.10149 -0.00948 0.09314 0.45282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.29663 0.02150 13.79 <2e-16 ***
## books -0.36927 0.02690 -13.72 <2e-16 ***
## speech 0.46754 0.03637 12.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1494 on 629 degrees of freedom
## Multiple R-squared: 0.2385, Adjusted R-squared: 0.2361
## F-statistic: 98.49 on 2 and 629 DF, p-value: < 2.2e-16
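To quantify each corpus’s unique contribution directly, one option (a sketch, reusing the model already fit above) is nested model comparisons via F-tests:
m_full <- lm(mean_produces ~ books + speech + tv, data=ddd)
anova(update(m_full, . ~ . - tv), m_full)     # TV's unique variance (n.s., as above)
anova(update(m_full, . ~ . - books), m_full)  # books' unique variance
anova(update(m_full, . ~ . - speech), m_full) # speech's unique variance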
# let's take the 200 bookiest CDI items and look at relation of subscore to mom_ed
booky_items <- itdat %>% arrange(desc(prop_booky)) %>% head(200)
samp <- d_demo %>% filter(!is.na(mom_ed)) # 2773
booky_scores <- d_en_ws %>% filter(is.element(definition, booky_items$definition)) %>%
group_by(data_id) %>% summarise(booky_production = sum(produces))
# most children get 0 booky words (they're hard)
# hist(booky_scores$booky_production)
samp <- left_join(samp, booky_scores)
## Joining, by = "data_id"
#AoA ~ book_freq * mom_ed + spoken_freq * mom_ed
summary(lm(production ~ age, data=samp)) # .465
##
## Call:
## lm(formula = production ~ age, data = samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -501.96 -97.08 -9.08 105.10 442.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -467.7552 15.4683 -30.24 <2e-16 ***
## age 32.8239 0.6684 49.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 149.5 on 2771 degrees of freedom
## Multiple R-squared: 0.4653, Adjusted R-squared: 0.4651
## F-statistic: 2411 on 1 and 2771 DF, p-value: < 2.2e-16
summary(lm(production ~ age + mom_ed, data=samp)) # r^2 = .477
##
## Call:
## lm(formula = production ~ age + mom_ed, data = samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -494.10 -98.27 -8.20 101.73 489.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -617.2343 54.5348 -11.318 < 2e-16 ***
## age 33.1189 0.6645 49.842 < 2e-16 ***
## mom_edSome Secondary 78.2147 54.0253 1.448 0.14780
## mom_edSecondary 132.7710 52.8525 2.512 0.01206 *
## mom_edSome College 128.1405 52.6907 2.432 0.01508 *
## mom_edCollege 147.9896 52.5906 2.814 0.00493 **
## mom_edSome Graduate 160.2862 53.6271 2.989 0.00282 **
## mom_edGraduate 168.1299 52.7087 3.190 0.00144 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 148.1 on 2765 degrees of freedom
## Multiple R-squared: 0.4765, Adjusted R-squared: 0.4751
## F-statistic: 359.5 on 7 and 2765 DF, p-value: < 2.2e-16
summary(lm(booky_production ~ age , data=samp)) # r^2 = .452
##
## Call:
## lm(formula = booky_production ~ age, data = samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -128.490 -25.485 -2.487 25.510 139.515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -141.5253 4.3585 -32.47 <2e-16 ***
## age 9.0005 0.1883 47.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.12 on 2770 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.4519, Adjusted R-squared: 0.4517
## F-statistic: 2284 on 1 and 2770 DF, p-value: < 2.2e-16
summary(lm(booky_production ~ age + mom_ed, data=samp)) # .462
##
## Call:
## lm(formula = booky_production ~ age + mom_ed, data = samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -126.132 -25.644 -3.391 24.205 149.644
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -180.1241 15.3860 -11.707 < 2e-16 ***
## age 9.0815 0.1875 48.442 < 2e-16 ***
## mom_edSome Secondary 21.9320 15.2423 1.439 0.15030
## mom_edSecondary 33.8121 14.9114 2.268 0.02343 *
## mom_edSome College 32.4484 14.8659 2.183 0.02914 *
## mom_edCollege 37.7979 14.8375 2.547 0.01091 *
## mom_edSome Graduate 40.1762 15.1300 2.655 0.00797 **
## mom_edGraduate 44.4836 14.8709 2.991 0.00280 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.77 on 2764 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.4619, Adjusted R-squared: 0.4606
## F-statistic: 339 on 7 and 2764 DF, p-value: < 2.2e-16
samp %>%
mutate(nonbooky_production = production - booky_production) %>%
ggplot(aes(x=booky_production, y=production, color=mom_ed, shape=mom_ed)) +
geom_point(alpha=.3) + theme_classic() #+ geom_smooth()
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 585 rows containing missing values (geom_point).
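A first pass at the interaction question from the earlier comment (a sketch; note that it ignores the different maximum scores of the booky and non-booky subscales):
# does mom_ed predict booky-word production over and above overall vocabulary?
samp_long <- samp %>%
  mutate(nonbooky_production = production - booky_production) %>%
  tidyr::pivot_longer(c(booky_production, nonbooky_production),
                      names_to="subscore", values_to="score")
summary(lm(score ~ age + subscore * mom_ed, data=samp_long))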
d_age <- d_en_ws %>% left_join(d_demo %>% select(data_id, age)) %>%
filter(!is.na(age)) %>%
group_by(definition, category, lexical_class, age) %>%
summarise(mean_produces = mean(produces, na.rm=T))
## Joining, by = "data_id"
## `summarise()` has grouped output by 'definition', 'category', 'lexical_class'. You can override using the `.groups` argument.
d_age %>% ggplot(aes(x=age, y=mean_produces, group=definition, color=lexical_class)) +
geom_line(alpha=.3) + theme_classic()
#d50pct <- d_age %>% filter(mean_produces < .55, mean_produces > .45)
# ToDo: get AoA model fits
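One way to tackle the ToDo above (a sketch; published AoA estimates often use robust or Bayesian variants of this fit): fit a logistic regression of production on age for each word, and take the age at which the fitted probability crosses .5.
fit_aoa <- function(df) {
  m <- glm(produces ~ age, family=binomial, data=df)
  tibble(aoa = -coef(m)[1] / coef(m)[2]) # age at which P(produces) = .5
}
aoa_fits <- d_en_ws %>%
  left_join(d_demo %>% select(data_id, age)) %>%
  filter(!is.na(age)) %>%
  group_by(definition) %>%
  group_modify(~ fit_aoa(.x))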