Method

choice of items to tag

selection of corpora

Of the 289 corpora in childes-db (v2018.1), we chose to tag the Manchester corpus from the Eng-UK collection and the Providence corpus from the Eng-NA collection.

choice of types (MB-CDI WSWG, lemmatized w/ the WordNet Lemmatizer)

We chose our word types from the Words and Sentences list of the MacArthur-Bates Communicative Development Inventories (MB-CDI).

This yields 499 lemmas (lemmatized with the WordNet Lemmatizer) and 725 lemma+POS word types.
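
As an illustration of this step, the WordNet Lemmatizer can be driven from R via reticulate; the reticulate wiring here is our own sketch, not necessarily how the lemmas were produced:

library(reticulate)

# load NLTK's WordNet lemmatizer (assumes a Python environment with nltk
# and the WordNet data installed)
nltk <- import("nltk")
wnl <- nltk$stem$WordNetLemmatizer()

# inflected forms collapse to a single lemma type
wnl$lemmatize("dogs", pos = "n")  # "dog"
wnl$lemmatize("ran", pos = "v")   # "run"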

downsampling of tokens

We subsampled to a maximum of 50 tokens per word type, per speaker type (caretaker and child), per 3-month interval.
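
A minimal sketch of this cap in dplyr, assuming word-type and speaker-type columns (lemma and speaker_type are our hypothetical names; age_interval matches the plots below):

library(dplyr)

set.seed(42)  # make the subsample reproducible

downsampled_df <- token_info_df %>%
  group_by(lemma, speaker_type, age_interval) %>%
  slice_sample(n = 50) %>%  # groups with fewer than 50 tokens are kept whole
  ungroup()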

The 299,568 selected tokens represent 7.82% (a proportion of 0.078232) of the tokens available in the Providence and Manchester corpora.

These tokens were divided into sets of at most 24 contiguous tokens (about 15 minutes of tagging each).
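
Continuing the sketch above, the division into tagging sets could look like this (transcript_id and token_order are hypothetical ordering columns):

tagging_sets_df <- downsampled_df %>%
  arrange(transcript_id, token_order) %>%          # keep tokens contiguous
  group_by(transcript_id) %>%
  mutate(set_id = ceiling(row_number() / 24)) %>%  # at most 24 tokens per set
  ungroup()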

TODO: These should be out of tokens that are in the MB-CDI WSWG, not all tokens, to properly show the subsampling.

token_info_df %>%
  ggplot() +
  geom_bar(aes(x = age_interval, fill = requires_tags)) +
  scale_y_continuous(labels = comma) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~corpus_name)

filter(token_info_df, requires_tags == TRUE) %>%
  ggplot() +
  geom_bar(aes(x = age_interval, fill = tagged)) +
  scale_y_continuous(labels = comma) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~corpus_name)

participants: heterogeneous set of taggers

TODO: Get demographic information from Qualtrics

num_users <- raw_tags_df %>%
  group_by(user_type) %>%
  summarize(n_tokens = length(unique(token_id)),
            n_tags = n(),
            n_participants = length(unique(participant_id))) %>%
  mutate(tokens_per_part = n_tokens / n_participants) %>%
  arrange(desc(tokens_per_part))

We recruited six participant types:
* subject_pool: Students at UC Berkeley given 2 hours of tagging as a course assignment - 208
* in_lab_staff: RAs hired through UC Berkeley and the University of Edinburgh - 18
* berkeley_rpp: Undergraduates in the UC Berkeley Psychology Department, 1 hour of tagging - 337
* princeton_rpp: Undergraduates in the Princeton Psychology Department, 2 hours of tagging - 257
* berkeley_two_hour_rpp: Undergraduates in the UC Berkeley Psychology Department, 2 hours of tagging - 104
* berkeley_ra_pool: UC Berkeley lab RAs given the option to tag for credit - 2

double-coding units

n_participants_per_token <- raw_tags_df %>%
  group_by(token_id) %>%
  summarise(n_part = length(unique(participant_id)))

20.76% of our tokens were tagged by two or more participants.
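
This figure can be read directly off n_participants_per_token; a minimal sketch:

# share of tokens tagged by two or more participants
mean(n_participants_per_token$n_part >= 2)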

participant agreement calculation

Treatment of toys, idioms, etc.

POS fixing; “I don’t know” as a response

# sense_offset == -1 encodes an "I don't know" response;
# sense_offset == -3 encodes a "Wrong Part of Speech" response
idk_tags <- raw_tags_df %>% filter(sense_offset == -1)
wrong_pos_tags <- raw_tags_df %>% filter(sense_offset == -3)

# everything else is a genuine sense selection
sense_tags <- raw_tags_df %>% filter(sense_offset != -1,
                                     sense_offset != -3)

2.13% of tags were marked with “I don’t know” and 3.67% were marked with “Wrong Part of Speech.”
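
These proportions follow directly from the data frames above; a minimal sketch:

# proportion of all tags in each special category
nrow(idk_tags) / nrow(raw_tags_df)        # ~0.0213
nrow(wrong_pos_tags) / nrow(raw_tags_df)  # ~0.0367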

Participants select combinations of senses

raw_tags_df %>%
  group_by(participant_id, token_id) %>%
  summarize(n = n()) %>%
  ggplot() +
  geom_histogram(aes(x = n), bins = 15) +
  xlab("Number of Senses Assigned to Token") +
  ylab("Num. Tokens")

Majority rule; treatment of tags with multiple taggers

# for each token, keep the sense_offset chosen most often across taggers
majority_tag_token <- raw_tags_df %>%
  count(token_id, sense_offset) %>%
  group_by(token_id) %>%
  slice_max(n, n = 1, with_ties = FALSE)

164,008 (100%) of tokens have a single tag that won the majority of the tag responses.
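
A sketch of how that uniqueness can be checked: a token lacks a single winner exactly when two or more senses tie for its top vote count.

# number of senses tied for the top vote on each token
tie_check <- raw_tags_df %>%
  count(token_id, sense_offset) %>%
  group_by(token_id) %>%
  filter(n == max(n)) %>%
  summarise(n_winners = n())

# share of tokens whose winning tag is unique
mean(tie_check$n_winners == 1)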

Results

Total number tagged (tokens, types)

Average sense diversity per type/token over children per part of speech

Distribution of average diversity (average across children) for types, by POS

PMI distribution for senses: if high, interesting; if low, a good indication that the senses are orthogonal

Sense diversity by age: is it increasing?

Use type entropy to predict AoP (fit any model?)
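
A sketch of the entropy computation that would feed such a model; token_senses_df (one row per token with its lemma and majority sense) and the commented model line are hypothetical:

# Shannon entropy of each type's sense distribution
type_entropy <- token_senses_df %>%
  count(lemma, sense_offset) %>%
  group_by(lemma) %>%
  mutate(p = n / sum(n)) %>%
  summarise(entropy = -sum(p * log2(p)))

# hypothetical model: predict AoP from sense entropy
# lm(aop ~ entropy, data = left_join(aop_df, type_entropy, by = "lemma"))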

Convergence across kids

Rank-order senses according to frequency within each child for each type; then calculate a correlation coefficient (Spearman) between the rankings across all children, probably within each part of speech?
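
A sketch of this analysis under an assumed token_senses_df with child_id, lemma, and sense_offset columns (hypothetical names):

library(tidyr)

# sense frequencies per child for each type
sense_freqs <- token_senses_df %>%
  count(child_id, lemma, sense_offset)

# Spearman correlation between the first two children's sense frequencies
# for one illustrative type; cor(method = "spearman") ranks the values for us
wide <- sense_freqs %>%
  filter(lemma == "run") %>%
  pivot_wider(id_cols = sense_offset, names_from = child_id,
              values_from = n, values_fill = 0)
cor(wide[[2]], wide[[3]], method = "spearman")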

British vs. American English