Of the 289 corpora in childes-db (v2018.1), we chose to tag two: Manchester, from the Eng-UK collection, and Providence, from the Eng-NA collection.
We chose our word types from the Words and Sentences list from the MacArthur-Bates Communicative Development Inventories (MB-CDI).
This gives 499 lemmas (lemmatized with the WordNet Lemmatizer) and 725 lemma+POS word types.
We subsampled to a maximum of 50 tokens per word type, per speaker type (caretaker and child), per 3-month interval.
At 299,568 selected tokens, this is a proportion of 0.078232 (about 7.8%) of the tokens available in the Providence and Manchester corpora.
These tokens were divided across sets of at most 24 continuous tokens (about 15 minutes of tagging, each).
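The subsampling and batching described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the pipeline actually used: the table `tokens`, its column names, and the way age is binned into 3-month intervals are all assumptions, and reading "continuous" as "consecutive in `token_id` order" is also an assumption.

```r
library(dplyr)

set.seed(1)  # reproducible sampling for the illustration

# Toy stand-in for the token table; every column name here is hypothetical.
common <- tibble(
  token_id     = 1:800,
  word_type    = rep(c("dog_n", "run_v"), each = 400),
  speaker_type = rep(c("caretaker", "child"), times = 400),
  age_months   = rep(c(24, 25, 27, 28), each = 100, times = 2)
)
rare <- tibble(
  token_id     = 801:810,
  word_type    = "zebra_n",
  speaker_type = "child",
  age_months   = 26
)
tokens <- bind_rows(common, rare)

selected <- tokens %>%
  # Assumed binning: 24-26 months -> 24, 27-29 months -> 27, and so on.
  mutate(age_interval = floor(age_months / 3) * 3) %>%
  group_by(word_type, speaker_type, age_interval) %>%
  # Keep at most 50 tokens per cell; smaller cells are kept whole.
  slice_sample(n = 50) %>%
  ungroup() %>%
  # Batch into tagging sets of at most 24 tokens, in token_id order.
  arrange(token_id) %>%
  mutate(set_id = (row_number() - 1) %/% 24 + 1)
```

In this toy example the eight common cells hold 100 tokens each and are capped at 50, while the single rare cell keeps all 10 of its tokens.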
TODO: These should be out of tokens that are in the MB-CDI WSWG, not all tokens, to properly show the subsampling.
token_info_df %>%
  ggplot() +
  geom_bar(aes(x = age_interval, fill = requires_tags)) +
  scale_y_continuous(labels = comma) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~corpus_name)
filter(token_info_df, requires_tags == TRUE) %>%
  ggplot() +
  geom_bar(aes(x = age_interval, fill = tagged)) +
  scale_y_continuous(labels = comma) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~corpus_name)
TODO: Get demographic information from qualtrics
num_users <- raw_tags_df %>%
  group_by(user_type) %>%
  summarize(n_tokens = length(unique(token_id)),
            n_tags = n(),
            n_participants = length(unique(participant_id))) %>%
  mutate(tokens_per_part = n_tokens / n_participants) %>%
  arrange(desc(tokens_per_part))
We recruited 6 participant types:
* subject pool: Students at UC Berkeley given 2 hours of tagging as an assignment - 208
* in_lab_staff: Hired RAs through UC Berkeley and University of Edinburgh - 18
* berkeley_rpp: Undergraduates in the UC Berkeley Psych Department, 1 hour of tagging - 337
* princeton_rpp: Undergraduates in the Princeton Psych Department, 2 hours of tagging - 257
* berkeley_two_hour_rpp: Undergraduates in the UC Berkeley Psych Department, 2 hours of tagging - 104
* berkeley_ra_pool: UC Berkeley Lab RAs given the option to tag for credit - 2
n_participants_per_token <- raw_tags_df %>% group_by(token_id) %>% summarise(n_part = length(unique(participant_id)))
20.76% of our tokens were tagged by two or more participants.
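A figure like the one above can be derived from a per-token participant count. The sketch below uses a small made-up table (`toy_tags`) in place of `raw_tags_df`, so the resulting percentage is illustrative only and does not reproduce the reported value.

```r
library(dplyr)

# Toy stand-in for raw_tags_df, with only the two columns used here.
toy_tags <- tibble(
  token_id       = c(1, 1, 2, 3, 3, 3, 4, 5),
  participant_id = c("a", "b", "a", "a", "b", "c", "c", "b")
)

# Number of distinct participants who tagged each token.
per_token <- toy_tags %>%
  group_by(token_id) %>%
  summarise(n_part = n_distinct(participant_id), .groups = "drop")

# Percentage of tokens tagged by two or more participants.
pct_multi <- 100 * mean(per_token$n_part >= 2)
```

Here tokens 1 and 3 have more than one tagger, so `pct_multi` is 40 for the toy data.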
idk_tags <- raw_tags_df %>% filter(sense_offset == -1)
wrong_pos_tags <- raw_tags_df %>% filter(sense_offset == -3)
sense_tags <- raw_tags_df %>% filter(sense_offset != -1 &
sense_offset != -3)
2.13% of tags were marked “I don’t know” and 3.67% were marked “Wrong Part of Speech”.
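As a rough sketch of how such shares can be computed, the example below takes the special `sense_offset` codes from the filters above (-1 for "I don't know", -3 for "Wrong Part of Speech") and uses all tags as the denominator. The table `toy_tags` and its sense values are invented for illustration.

```r
library(dplyr)

# Toy stand-in for raw_tags_df; positive offsets are made-up sense identifiers.
toy_tags <- tibble(
  sense_offset = c(-1, -3, -3, 101, 102, 101, 205, 101, 310, 102)
)

# Share of all tags carrying each special code, as a percentage.
tag_shares <- toy_tags %>%
  summarise(
    pct_idk       = 100 * mean(sense_offset == -1),
    pct_wrong_pos = 100 * mean(sense_offset == -3)
  )
```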
raw_tags_df %>%
  group_by(participant_id, token_id) %>%
  summarize(n = n()) %>%
  ggplot() +
  geom_histogram(aes(x = n), bins = 15) +
  xlab("Number of Senses Assigned to Token") +
  ylab("Num. Tokens")
Distribution of average diversity (average across children) for types, by POS
Rank-order the senses of each type by their frequency within each child. Then calculate a rank correlation coefficient (Spearman) between the rankings across all children, probably within each part of speech?
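One possible shape for this analysis, for a single word type, is sketched below. The count matrix, the sense and child labels, and the choice to average the pairwise coefficients are all assumptions made for illustration. `cor()` with `method = "spearman"` ranks each column internally, so raw frequencies can be passed directly.

```r
# Toy sense-by-child frequency counts for one word type (made-up numbers).
sense_counts <- matrix(
  c(10, 8, 9,
     5, 6, 1,
     1, 2, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(sense = c("s1", "s2", "s3"),
                  child = c("c1", "c2", "c3"))
)

# Pairwise Spearman correlation between children's sense rankings.
rho <- cor(sense_counts, method = "spearman")

# One summary per word type: the mean over distinct pairs of children.
mean_rho <- mean(rho[upper.tri(rho)])
```

In the toy data, children c1 and c2 rank the senses identically (rho = 1), while c3 swaps the two less frequent senses (rho = 0.5 against each of the others).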