Goal

Can we identify words on the CDI that are disproportionately bookish, speechy, or TV-like? (Controlling for difficulty, since more bookish words are probably also more difficult.)

Using this distributional source information, can we find features of children (e.g., mother's education) that relate to knowledge of particular subsets of words (e.g., bookish words)?

To start, we will look at American English before extending to British English and French. We will begin with unlemmatized corpora and disregard part-of-speech classification.

##    
##      books speech     TV
##   0 826776  26918  62240
##   1    638    618    636
##    
##     books speech    tv
##   0 42045  38107 83681
##   1   596    656   652

Our data sources:

Normalize Word Frequencies

We will calculate keyness scores for each word, i.e., the ratio of a word's normalized frequency in the focus corpus to its normalized frequency in a reference corpus. For now, we use the subset of words found in all four corpora (N=4229), but we may follow Dawson et al. (2021) and add a constant (e.g., 10) to all normalized frequencies in every corpus, so as not to eliminate the bulk of words that do not appear in the smaller (child-directed) corpora.
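As a minimal sketch of this computation (illustrative only: `add_keyness`, its arguments, and the assumption that the input columns hold raw counts are ours, not the objects used below; the constant `k` plays the role of the Dawson-style smoothing constant):

# assumes dplyr is loaded (as elsewhere in this document)
# compute per-million normalized frequencies, optionally adding a constant k,
# then keyness = normalized focus frequency / normalized reference frequency
add_keyness <- function(counts, focus, reference, k = 0) {
  counts %>%
    mutate(
      focus_pm = {{ focus }} / sum({{ focus }}) * 1e6 + k,
      ref_pm   = {{ reference }} / sum({{ reference }}) * 1e6 + k,
      keyness  = focus_pm / ref_pm
    )
}

# e.g., book keyness relative to child-directed speech, with constant 10:
# keyness_df <- add_keyness(ch_freq, `Children's Books`, CHILDES, k = 10)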

Child-directed Speech vs. Books

Most 'Booky' Child Words

#ch_freq %>% arrange(desc(ch_book_vs_speech)) %>% head(10) %>% kable(digits=2)
#ch_freq_smooth %>% arrange(desc(ch_book_vs_speech)) %>% head(10) %>% kable(digits=2)

p1 <- ch_freq %>% mutate(Word = ifelse(on_cdi==1, "CDI", "Non-CDI")) %>%
  ggplot(aes(x=CHILDES, y=`Children's Books`, color=Word)) + 
  geom_point(alpha=.3) + theme_classic() + geom_smooth(method='lm') +
  scale_x_log10() + scale_y_log10() + ggtitle("Raw Frequencies (intersection)")

p2 <- ch_freq_smooth %>% mutate(Word = ifelse(on_cdi==1, "CDI", "Non-CDI")) %>%
  ggplot(aes(x=CHILDES, y=`Children's Books`, color=Word)) + 
  geom_point(alpha=.3) + theme_classic() + geom_smooth(method='lm') +
  scale_x_log10() + scale_y_log10() + ggtitle("Smoothed Frequencies (union)")

ggpubr::ggarrange(p1, p2, nrow=1, common.legend = T)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

cor(subset(ch_freq, on_cdi==1)$CHILDES, subset(ch_freq, on_cdi==1)$`Children's Books`) # .75
## [1] 0.7522343
cor(subset(ch_freq, on_cdi==0)$CHILDES, subset(ch_freq, on_cdi==0)$`Children's Books`) # .42
## [1] 0.4237489

Same, but with Charlesworth corpora

p1 <- ch_freq_charles %>% mutate(Word = ifelse(on_cdi==1, "CDI", "Non-CDI")) %>%
  ggplot(aes(x=speech, y=books, color=Word)) + 
  geom_point(alpha=.3) + theme_classic() + geom_smooth(method='lm') +
  scale_x_log10() + scale_y_log10() + ggtitle("Raw Frequencies (intersection)")

p2 <- ch_freq_smooth_charles %>% mutate(Word = ifelse(on_cdi==1, "CDI", "Non-CDI")) %>%
  ggplot(aes(x=speech, y=books, color=Word)) + 
  geom_point(alpha=.3) + theme_classic() + geom_smooth(method='lm') +
  scale_x_log10() + scale_y_log10() + ggtitle("Smoothed Frequencies (union)")

ggpubr::ggarrange(p1, p2, nrow=1, common.legend = T)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Which CDI words are more booky/speechy?

p1 <- ch_freq_smooth %>% filter(on_cdi==1) %>%
  ggplot(aes(x=CHILDES, y=`Children's Books`)) + 
  geom_point(alpha=.3) + theme_classic() + 
  scale_x_log10() + scale_y_log10() +
  geom_text_repel(aes(label=word)) +
  geom_abline(intercept = 0, slope = 1, linetype="dashed") +
  ggtitle("CHILDES vs. Montag book corpus")
  #geom_text(data=subset(ch_freq_smooth, on_cdi==1 & (wt > 4 | mpg > 25)), aes(label=name))

p2 <- ch_freq_smooth_charles %>% filter(on_cdi==1) %>%
  ggplot(aes(x=speech, y=books)) + 
  geom_point(alpha=.3) + theme_classic() + 
  scale_x_log10() + scale_y_log10() +
  geom_text_repel(aes(label=word)) + xlab("CHILDES") + 
  geom_abline(intercept = 0, slope = 1, linetype="dashed") +
  ggtitle("CHILDES vs. FB Children's Books")

p3 <- ch_freq_smooth_charles %>% filter(on_cdi==1) %>%
  ggplot(aes(x=speech, y=tv)) + 
  geom_point(alpha=.3) + theme_classic() + 
  scale_x_log10() + scale_y_log10() +
  geom_text_repel(aes(label=word)) + ylab("Children's TV and Movies") + 
  geom_abline(intercept = 0, slope = 1, linetype="dashed") +
  ggtitle("CHILDES vs. Children's Movies")

# merge in lexical_class and color points by it?

ggpubr::ggarrange(p1, p2, p3, nrow=1, common.legend = T)
## Warning: ggrepel: 639 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: ggrepel: 639 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 642 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

ggsave("CHILDES_smoothed_norm_freqs_vs_books_TV.pdf", width=12, height=6)
## Warning: ggrepel: 633 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 630 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 636 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# holds even if we filter out many low frequency words
ch_key <- ch_freq %>% #filter(CHILDES>1, `Children's Books`>1) %>%
  group_by(on_cdi) %>%
  tidyboot_mean(ch_book_vs_speech)

ch_key_smooth <- ch_freq_smooth %>% 
  group_by(on_cdi) %>%
  tidyboot_mean(ch_book_vs_speech)

p1 <- ch_key %>% mutate(CDI = ifelse(on_cdi==1, "CDI Words", "non-CDI Words")) %>%
  ggplot(aes(x=CDI, y=mean)) + 
  geom_pointrange(aes(ymin=ci_lower, ymax=ci_upper)) +
  theme_classic() + ylab("Children's Book Keyness") + xlab("")

p2 <- ch_key_smooth %>% mutate(CDI = ifelse(on_cdi==1, "CDI Words", "non-CDI Words")) %>%
  ggplot(aes(x=CDI, y=mean)) + 
  geom_pointrange(aes(ymin=ci_lower, ymax=ci_upper)) +
  theme_classic() + ylab("Children's Book Keyness") + xlab("")

ggpubr::ggarrange(p1, p2, nrow=1, common.legend = T)

# bootstrap the mean proportion of book occurrences (prop_booky) for CDI vs. non-CDI words
ch_booky <- ch_freq %>% group_by(on_cdi) %>%
  tidyboot_mean(prop_booky)

ch_booky_smooth <- ch_freq_smooth %>% group_by(on_cdi) %>%
  tidyboot_mean(prop_booky)

p1 <- ch_booky %>% mutate(CDI = ifelse(on_cdi==1, "CDI Words", "non-CDI Words")) %>%
  ggplot(aes(x=CDI, y=mean)) + 
  geom_pointrange(aes(ymin=ci_lower, ymax=ci_upper)) +
  theme_classic() + ylab("Proportion of Book Occurrences") 

p2 <- ch_booky_smooth %>% mutate(CDI = ifelse(on_cdi==1, "CDI Words", "non-CDI Words")) %>%
  ggplot(aes(x=CDI, y=mean)) + 
  geom_pointrange(aes(ymin=ci_lower, ymax=ci_upper)) +
  theme_classic() + ylab("Proportion of Book Occurrences")

ggpubr::ggarrange(p1, p2, nrow=1, common.legend = T)

Booky vs. Speechy vs. TV Words

Table based on Charlesworth corpora

ch_freq_smooth_charles %>% filter(on_cdi==1) %>% 
  select(-ch_book_vs_speech) %>%
  DT::datatable() %>%
  DT::formatRound(columns=c('books','tv','speech'), digits=0) %>%
  DT::formatPercentage(columns=c('prop_booky'), digits=0)

Predicting AoA

Which corpus best predicts words’ AoA? Which corpora contribute unique variance to predicting AoA? (Does the book corpus interact with SES, since we expect that higher-SES parents read more to their children?)

load("data/en_ws_production.Rdata")


dd <- ch_freq_smooth_charles %>% filter(on_cdi==1) %>% 
  select(-ch_book_vs_speech)

summary(dd)
##      word               on_cdi      books                tv          
##  Length:656         Min.   :1   Min.   :   10.00   Min.   :   10.00  
##  Class :character   1st Qu.:1   1st Qu.:   19.19   1st Qu.:   42.14  
##  Mode  :character   Median :1   Median :   74.36   Median :  117.77  
##                     Mean   :1   Mean   :  852.34   Mean   :  785.30  
##                     3rd Qu.:1   3rd Qu.:  293.14   3rd Qu.:  402.05  
##                     Max.   :1   Max.   :51379.87   Max.   :45860.64  
##      speech           prop_booky      
##  Min.   :   10.18   Min.   :0.001372  
##  1st Qu.:   67.05   1st Qu.:0.192417  
##  Median :  156.39   Median :0.314225  
##  Mean   :  974.55   Mean   :0.355120  
##  3rd Qu.:  474.73   3rd Qu.:0.508740  
##  Max.   :48017.26   Max.   :0.887475
# poor man's AoA
cdi_acc <- d_en_ws %>% group_by(definition, category, lexical_class) %>%
  summarise(mean_produces = mean(produces, na.rm=T)) %>%
  mutate(word = case_when(definition=="a lot" ~ "lot",
                          definition=="all gone" ~ "gone",
                          #definition=="babysitter's name" ~ "babysitter",
                          definition=="buttocks/bottom*" ~ "butt",
                          definition=="call (on phone)" ~ "call",
                          definition=="can (auxiliary)" ~ "can",
                          definition=="can (object)" ~ "can",
                          definition=="chicken (animal)" ~ "chicken",
                          definition=="chicken (food)" ~ "chicken",
                          definition=="clean (action)" ~ "clean",
                          definition=="clean (description)" ~ "clean",
                          definition=="church*" ~ "church", # add temple, synagogue..?
                          definition=="daddy*" ~ "daddy",
                          definition=="mommy*" ~ "mommy",
                          definition=="grandma*" ~ "grandma",
                          definition=="grandpa*" ~ "grandpa",
                          definition=="vagina*" ~ "vagina",
                          definition=="penis*" ~ "penis",
                          definition=="drink (object)" ~ "drink",
                          definition=="drink (action)" ~ "drink",
                          definition=="did/did ya" ~ "did",
                          definition=="dress (object)" ~ "dress",
                          definition=="dry (action)" ~ "dry",
                          definition=="dry (description)" ~ "dry",
                          definition=="baa baa" ~ "bah",
                          definition=="choo choo" ~ "choo",
                          definition=="woof woof" ~ "woof",
                          definition=="yum yum" ~ "yum",
                          definition=="fish (animal)" ~ "fish",
                          definition=="fish (food)" ~ "fish",
                          definition=="gas station" ~ "gas", # or station? or average?
                          definition=="work (action)" ~ "work",
                          definition=="work (place)" ~ "work",
                          definition=="water (beverage)" ~ "water",
                          definition=="water (not beverage)" ~ "water",
                          definition=="wanna/want to" ~ "wanna", # or want
                          definition=="watch (action)" ~ "watch",
                          definition=="watch (object)" ~ "watch",
                          definition=="TV" ~ "tv",
                          definition=="washing machine" ~ "washing",
                          definition=="lawn mower" ~ "mower",
                          definition=="soda/pop" ~ "soda",
                          definition=="swing (action)" ~ "swing",
                          definition=="swing (object)" ~ "swing",
                          TRUE ~ definition)) %>%
  filter(!is.element(definition, c("babysitter's name", "child's own name", "gonna get you!", "give me five!")))
## `summarise()` has grouped output by 'definition', 'category'. You can override using the `.groups` argument.
# remove some words we can't match

itdat <- left_join(cdi_acc, dd)
## Joining, by = "word"
# need to fix all the parenthetical / starred / slashed items
itdat[which(!complete.cases(itdat)),]
## # A tibble: 44 × 10
## # Groups:   definition, category [44]
##    definition   category  lexical_class mean_produces word    on_cdi books    tv
##    <chr>        <chr>     <chr>                 <dbl> <chr>    <dbl> <dbl> <dbl>
##  1 baa baa      sounds    other                 0.726 bah         NA    NA    NA
##  2 belly button body_par… nouns                 0.622 belly …     NA    NA    NA
##  3 buttocks/bo… body_par… nouns                 0.544 butt        NA    NA    NA
##  4 drink (beve… food_dri… nouns                 0.593 drink …     NA    NA    NA
##  5 french fries food_dri… nouns                 0.509 french…     NA    NA    NA
##  6 gas station  places    other                 0.238 gas         NA    NA    NA
##  7 go potty     games_ro… other                 0.521 go pot…     NA    NA    NA
##  8 gonna/going… helping_… function_wor…         0.282 gonna/…     NA    NA    NA
##  9 gotta/got to helping_… function_wor…         0.167 gotta/…     NA    NA    NA
## 10 green beans  food_dri… nouns                 0.307 green …     NA    NA    NA
## # … with 34 more rows, and 2 more variables: speech <dbl>, prop_booky <dbl>
ddd <- itdat %>% filter(!is.na(on_cdi)) 

summary(lm(mean_produces ~ prop_booky, data=ddd)) # r^2 = .22; more booky = harder
## 
## Call:
## lm(formula = mean_produces ~ prop_booky, data = ddd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33885 -0.10905 -0.01788  0.09270  0.45780 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.54247    0.01214   44.70   <2e-16 ***
## prop_booky  -0.38494    0.02920  -13.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1514 on 630 degrees of freedom
## Multiple R-squared:  0.2162, Adjusted R-squared:  0.215 
## F-statistic: 173.8 on 1 and 630 DF,  p-value: < 2.2e-16
# put the log frequencies on a comparable scale (scaled, not centered)
ddd$books = scale(log(ddd$books), center=F)[,1]
ddd$tv = scale(log(ddd$tv), center=F)[,1]
ddd$speech = scale(log(ddd$speech), center=F)[,1]

# we should tack on adult_books
cor.test(itdat$books, itdat$speech) # .73
## 
##  Pearson's product-moment correlation
## 
## data:  itdat$books and itdat$speech
## t = 26.441, df = 630, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6860737 0.7602459
## sample estimates:
##       cor 
## 0.7252575
cor.test(itdat$tv, itdat$speech) # .83
## 
##  Pearson's product-moment correlation
## 
## data:  itdat$tv and itdat$speech
## t = 37.006, df = 630, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8013227 0.8506763
## sample estimates:
##       cor 
## 0.8275923
cor.test(itdat$tv, itdat$books) # .94
## 
##  Pearson's product-moment correlation
## 
## data:  itdat$tv and itdat$books
## t = 67.224, df = 630, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9265349 0.9457217
## sample estimates:
##       cor 
## 0.9368292
# Books and TV are most correlated; books and speech least correlated

# correlations change if we scale and log the frequencies...
cor.test(ddd$books, ddd$speech) # .81
## 
##  Pearson's product-moment correlation
## 
## data:  ddd$books and ddd$speech
## t = 34.882, df = 630, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7833010 0.8367256
## sample estimates:
##       cor 
## 0.8117043
cor.test(ddd$tv, ddd$speech) # .86
## 
##  Pearson's product-moment correlation
## 
## data:  ddd$tv and ddd$speech
## t = 41.713, df = 630, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8346215 0.8762715
## sample estimates:
##       cor 
## 0.8568381
cor.test(ddd$tv, ddd$books) # .87
## 
##  Pearson's product-moment correlation
## 
## data:  ddd$tv and ddd$books
## t = 45.055, df = 630, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8537643 0.8908800
## sample estimates:
##       cor 
## 0.8735865
# with all 3 sources, only speech and books are significant
summary(lm(mean_produces ~ books + speech + tv, data=ddd)) # 
## 
## Call:
## lm(formula = mean_produces ~ books + speech + tv, data = ddd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49532 -0.10147 -0.00957  0.09145  0.44777 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.29524    0.02184  13.519   <2e-16 ***
## books       -0.37662    0.03339 -11.280   <2e-16 ***
## speech       0.45929    0.04261  10.780   <2e-16 ***
## tv           0.01697    0.04557   0.372     0.71    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1495 on 628 degrees of freedom
## Multiple R-squared:  0.2386, Adjusted R-squared:  0.235 
## F-statistic: 65.62 on 3 and 628 DF,  p-value: < 2.2e-16
summary(lm(mean_produces ~ books + tv + speech, data=ddd))
## 
## Call:
## lm(formula = mean_produces ~ books + tv + speech, data = ddd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49532 -0.10147 -0.00957  0.09145  0.44777 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.29524    0.02184  13.519   <2e-16 ***
## books       -0.37662    0.03339 -11.280   <2e-16 ***
## tv           0.01697    0.04557   0.372     0.71    
## speech       0.45929    0.04261  10.780   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1495 on 628 degrees of freedom
## Multiple R-squared:  0.2386, Adjusted R-squared:  0.235 
## F-statistic: 65.62 on 3 and 628 DF,  p-value: < 2.2e-16
summary(lm(mean_produces ~ books + speech, data=ddd))
## 
## Call:
## lm(formula = mean_produces ~ books + speech, data = ddd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50584 -0.10149 -0.00948  0.09314  0.45282 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.29663    0.02150   13.79   <2e-16 ***
## books       -0.36927    0.02690  -13.72   <2e-16 ***
## speech       0.46754    0.03637   12.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1494 on 629 degrees of freedom
## Multiple R-squared:  0.2385, Adjusted R-squared:  0.2361 
## F-statistic: 98.49 on 2 and 629 DF,  p-value: < 2.2e-16
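An equivalent way to quantify each source’s unique contribution is to compare nested models directly; a sketch using the same scaled log frequencies:

# does each frequency source add unique variance beyond the other two?
full      <- lm(mean_produces ~ books + speech + tv, data=ddd)
no_books  <- lm(mean_produces ~ speech + tv, data=ddd)
no_speech <- lm(mean_produces ~ books + tv, data=ddd)
no_tv     <- lm(mean_produces ~ books + speech, data=ddd)

anova(no_books, full)   # unique variance from book frequency
anova(no_speech, full)  # unique variance from speech frequency
anova(no_tv, full)      # unique variance from TV frequency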
# let's take the 200 bookiest CDI items and look at relation of subscore to mom_ed
booky_items <- itdat %>% arrange(desc(prop_booky)) %>% head(200)

samp <- d_demo %>% filter(!is.na(mom_ed)) # 2773

booky_scores <- d_en_ws %>% filter(is.element(definition, booky_items$definition)) %>%
  group_by(data_id) %>% summarise(booky_production = sum(produces))

# most children get 0 booky words (they're hard)
# hist(booky_scores$booky_production)
samp <- left_join(samp, booky_scores)
## Joining, by = "data_id"
#AoA ~ book_freq * mom_ed + spoken_freq * mom_ed
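# A hedged sketch of the model sketched above, at the item-by-child level.
# Everything here (d_items, m_int) is introduced for illustration only; a fuller
# treatment would use lme4::glmer with random effects for child and item.
d_items <- d_en_ws %>%
  left_join(d_demo %>% select(data_id, age, mom_ed)) %>%
  left_join(itdat %>% ungroup() %>% select(definition, books, speech)) %>%
  filter(!is.na(mom_ed), !is.na(books))

m_int <- glm(produces ~ age + log(books) * mom_ed + log(speech) * mom_ed,
             data = d_items, family = binomial)
# summary(m_int)  # slow on the full item-by-child table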

summary(lm(production ~ age, data=samp)) # .465
## 
## Call:
## lm(formula = production ~ age, data = samp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -501.96  -97.08   -9.08  105.10  442.28 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -467.7552    15.4683  -30.24   <2e-16 ***
## age           32.8239     0.6684   49.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 149.5 on 2771 degrees of freedom
## Multiple R-squared:  0.4653, Adjusted R-squared:  0.4651 
## F-statistic:  2411 on 1 and 2771 DF,  p-value: < 2.2e-16
summary(lm(production ~ age + mom_ed, data=samp)) # r^2 = .477
## 
## Call:
## lm(formula = production ~ age + mom_ed, data = samp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -494.10  -98.27   -8.20  101.73  489.76 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -617.2343    54.5348 -11.318  < 2e-16 ***
## age                    33.1189     0.6645  49.842  < 2e-16 ***
## mom_edSome Secondary   78.2147    54.0253   1.448  0.14780    
## mom_edSecondary       132.7710    52.8525   2.512  0.01206 *  
## mom_edSome College    128.1405    52.6907   2.432  0.01508 *  
## mom_edCollege         147.9896    52.5906   2.814  0.00493 ** 
## mom_edSome Graduate   160.2862    53.6271   2.989  0.00282 ** 
## mom_edGraduate        168.1299    52.7087   3.190  0.00144 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 148.1 on 2765 degrees of freedom
## Multiple R-squared:  0.4765, Adjusted R-squared:  0.4751 
## F-statistic: 359.5 on 7 and 2765 DF,  p-value: < 2.2e-16
summary(lm(booky_production ~ age , data=samp)) # r^2 = .452
## 
## Call:
## lm(formula = booky_production ~ age, data = samp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.490  -25.485   -2.487   25.510  139.515 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -141.5253     4.3585  -32.47   <2e-16 ***
## age            9.0005     0.1883   47.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.12 on 2770 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4519, Adjusted R-squared:  0.4517 
## F-statistic:  2284 on 1 and 2770 DF,  p-value: < 2.2e-16
summary(lm(booky_production ~ age + mom_ed, data=samp)) # .462
## 
## Call:
## lm(formula = booky_production ~ age + mom_ed, data = samp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -126.132  -25.644   -3.391   24.205  149.644 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -180.1241    15.3860 -11.707  < 2e-16 ***
## age                     9.0815     0.1875  48.442  < 2e-16 ***
## mom_edSome Secondary   21.9320    15.2423   1.439  0.15030    
## mom_edSecondary        33.8121    14.9114   2.268  0.02343 *  
## mom_edSome College     32.4484    14.8659   2.183  0.02914 *  
## mom_edCollege          37.7979    14.8375   2.547  0.01091 *  
## mom_edSome Graduate    40.1762    15.1300   2.655  0.00797 ** 
## mom_edGraduate         44.4836    14.8709   2.991  0.00280 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.77 on 2764 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4619, Adjusted R-squared:  0.4606 
## F-statistic:   339 on 7 and 2764 DF,  p-value: < 2.2e-16
samp %>% 
  mutate(nonbooky_production = production - booky_production) %>%
  ggplot(aes(x=booky_production, y=production, color=mom_ed, shape=mom_ed)) +
  geom_point(alpha=.3) + theme_classic() #+ geom_smooth()
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 585 rows containing missing values (geom_point).

d_age <- d_en_ws %>% left_join(d_demo %>% select(data_id, age)) %>%
  filter(!is.na(age)) %>%
  group_by(definition, category, lexical_class, age) %>%
  summarise(mean_produces = mean(produces, na.rm=T))
## Joining, by = "data_id"
## `summarise()` has grouped output by 'definition', 'category', 'lexical_class'. You can override using the `.groups` argument.
d_age %>% ggplot(aes(x=age, y=mean_produces, group=definition, color=lexical_class)) + 
  geom_line(alpha=.3) + theme_classic()

#d50pct <- d_age %>% filter(mean_produces < .55, mean_produces > .45)
# ToDo: get AoA model fits
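A sketch of one standard approach to the AoA fits flagged in the ToDo: fit a logistic curve to each word's production-by-age trajectory and take the age at which the fitted curve crosses 50% production (names like `aoa_fits` are ours):

aoa_fits <- d_age %>%
  group_by(definition) %>%
  group_modify(~ {
    # logistic fit of proportion producing on age; quasibinomial avoids the
    # non-integer-successes warning since mean_produces is a proportion
    fit <- glm(mean_produces ~ age, family = quasibinomial(), data = .x)
    tibble(aoa = as.numeric(-coef(fit)[1] / coef(fit)[2]))  # age at 50% production
  })
# note: estimates will be unstable for words with flat or non-monotonic curves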

Comparing to Adult Speech

For our reference corpus, we will use adult speech (movie subtitles), as this is the target language distribution that children will eventually learn. (We could instead use child-directed speech, since that is what children actually hear, or Google Books frequencies, which could be considered the epitome of an ‘educated’ distribution.)

Non-SUBTLEX adult speech (Charlesworth corpus):

Child-directed speech vs. Adult speech

We first examine the keyness of words in child-directed speech relative to adult speech. Here are the 10 words most over-represented in child-directed speech compared to adult movies:

Here are the 10 words most under-represented in CHILDES, compared to adult movies:
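These tables can be produced along the following lines (a sketch: `freq_adult` and the `Adult Speech` column stand in for whatever frame holds the normalized frequencies of both corpora):

cds_vs_adult <- freq_adult %>%
  mutate(keyness = CHILDES / `Adult Speech`)

cds_vs_adult %>% arrange(desc(keyness)) %>% head(10) %>% knitr::kable(digits=2)  # over-represented in CDS
cds_vs_adult %>% arrange(keyness) %>% head(10) %>% knitr::kable(digits=2)        # under-represented in CDS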

Below we show the average keyness of child-directed speech and children’s books for CDI vs. non-CDI words, using adult speech as the reference corpus. This is based on the words common to all corpora.

Now we do the same for the union of all words across the corpora (Laplace-smoothed).
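A sketch paralleling the book-keyness plots above (the object `freq_adult_smooth` and the column `speech_vs_adult` are placeholders for the smoothed, adult-referenced keyness scores):

key_adult <- freq_adult_smooth %>%
  group_by(on_cdi) %>%
  tidyboot_mean(speech_vs_adult)

key_adult %>%
  mutate(CDI = ifelse(on_cdi==1, "CDI Words", "non-CDI Words")) %>%
  ggplot(aes(x=CDI, y=mean)) +
  geom_pointrange(aes(ymin=ci_lower, ymax=ci_upper)) +
  theme_classic() + ylab("CDS Keyness (vs. Adult Speech)") + xlab("")
# repeat with the book keyness column (e.g., books_vs_adult) for children's books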

Other Ideas