Stimulus selection for NextKids

What animal items should we use in the NextKids experiment? Ideally, we’d want them to (a) be sufficiently similar yet diverse enough to reveal interesting embeddings, (b) have high naming agreement, and (c) vary in age of acquisition (aoa) so we can look at embeddings as a function of vocabulary size.

Martin, Clint, and I discussed using the Snodgrass and Vanderwart pictures (1980, n = 270), since there are already lots of existing norms for them. One open question is whether we want to use the black-and-white or the (newer) colored versions, e.g.:

knitr::include_graphics(c("pics/021.jpg", "pics/021_color.jpg"))

knitr::include_graphics(c("pics/003.jpg","pics/003_color.jpg"))

There are 43 animals in the Snodgrass and Vanderwart set. The next question is which subset to use, which depends on how many items we need. In terms of similarity, I’m not totally sure how to go about selecting a set of items with the right level of diversity. Would it be useful to draw on existing similarity norms to guide our choice, e.g. from this paper?
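One possible way to use similarity norms for this is a greedy "farthest-point" selection: start from the two least similar items, then repeatedly add the item that is least similar to everything already chosen. The sketch below is only an illustration. The coordinates are made up to produce a toy dissimilarity matrix, and `select_diverse` is a hypothetical helper, not something from the norms paper.

```r
# Made-up 2-D "feature" coordinates, used only to build a toy dissimilarity matrix.
# With real norms, d would be 1 - similarity (or any pairwise dissimilarity).
items <- c("dog", "cat", "horse", "cow", "duck", "fish")
coords <- matrix(c(1.0, 1.0,
                   1.2, 1.0,
                   3.0, 1.0,
                   3.2, 1.1,
                   1.0, 4.0,
                   5.0, 5.0),
                 ncol = 2, byrow = TRUE, dimnames = list(items, NULL))
d <- as.matrix(dist(coords))

# Hypothetical helper: greedily pick k items (k >= 2) that are far apart.
select_diverse <- function(d, k) {
  # start from the pair with the largest dissimilarity
  chosen <- unname(which(d == max(d), arr.ind = TRUE)[1, ])
  while (length(chosen) < k) {
    remaining <- setdiff(seq_len(nrow(d)), chosen)
    # dissimilarity of each remaining item to its nearest already-chosen item
    nearest <- apply(d[remaining, chosen, drop = FALSE], 1, min)
    chosen <- c(chosen, remaining[which.max(nearest)])
  }
  rownames(d)[chosen]
}

picked <- select_diverse(d, 3)
print(picked)
```

This maximizes spread only; a set that is "sufficiently similar yet diverse" would probably also need an upper bound on dissimilarity, which this sketch does not attempt.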

In terms of naming agreement, Martin and Clint found this very useful paper, which has norms for the SV pictures as rated by 5- and 6-year-olds.

In terms of aoa, I estimated aoas using the Wordbank dataset, which is based on parent-report checklists (CDI). 17 of the 43 SV items are not on the CDI, so for those items I estimated aoa from adult aoa ratings (Kuperman et al.), rescaled to match the Wordbank data.

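## Read in data, clean and merge.
library(dplyr)    # %>%, rename, filter, mutate, select, joins
library(tidyr)    # gather
library(stringr)  # str_trim
library(broom)    # tidy
library(corrr)    # correlate, shave, fashion
library(ggplot2)
library(knitr)    # kable

# norms for Snodgrass and Vanderwart pictures (from Cycowicz et al., 1997)
snod.norms = read.csv("cycowicz_data.csv")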

# wordbank aoas
aoa.norms.wb = read.csv("eng_ws_production_aoas.csv")  %>%
  rename(wb.aoa = aoa) %>%
  filter(category == "animals") %>%
  # keep only the text before any "(" in the CDI item label, then trim whitespace
  mutate(definition = unlist(lapply(strsplit(as.character(definition),
                                             "\\("), function(x) x[1])),
  definition = str_trim(definition))

# Kuperman et al AOAs (for missing values from wordbank)
aoa.norms.kuperman = read.csv("AoA_ratings_Kuperman_et_al_BRM.csv") %>%
  select(Word, Rating.Mean) %>%
  rename(adult.aoa = Rating.Mean)

sna = snod.norms %>%
  left_join(aoa.norms.wb %>% select(wb.aoa, definition),
            by = c("Intentional.name" = "definition")) %>%
  left_join(aoa.norms.kuperman, by = c("Intentional.name" = "Word")) %>%
  filter(is_insect == 0) %>%
  select(-is_insect, -modal.name) %>%
  arrange(wb.aoa)

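## Rescale adult aoas for missing wb aoas.
# Regress wb.aoa on adult.aoa (rows with a missing value are dropped by lm)
scale.params = lm(wb.aoa ~ adult.aoa, sna) %>%
  tidy() %>%
  pull(estimate)

# pull() gives a plain numeric vector: intercept first, then slope
intercept = scale.params[1]
slope = scale.params[2]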

sna = sna %>%
  mutate(all.aoa = ifelse(is.na(wb.aoa), 
                          intercept + (adult.aoa*slope), wb.aoa),
         imputed = ifelse(is.na(wb.aoa), "estimated", "reported")) %>%
  arrange(all.aoa) %>%
  mutate(n = 1:n())

#ggplot(sna, aes(x = all.aoa, y = adult.aoa)) +
#  geom_point(aes(color = imputed), size = 2) +
#  geom_smooth(method = "lm")

Correlations between norms

Aoas are negatively correlated with percent agreement (r = -.65): words that are learned later have lower naming agreement. Familiarity and (visual) complexity are additional child-rated measures.

correlate(sna %>% select(percent_agreement, Familiarity, Complexity, all.aoa)) %>%
  shave() %>%
  fashion() %>%
  kable()

rowname	percent_agreement	Familiarity	Complexity
percent_agreement
Familiarity	.40
Complexity	-.39	-.26
all.aoa	-.65	-.46	.25

Distribution of aoas and percentage agreement

Aoas are given in months and denote the age at which at least 50 percent of children know the word. The red dashed reference line in the left facet corresponds to two years of age (24 months); the one in the right facet marks 50 percent agreement.

sna %>%
  gather(measure, value, percent_agreement, all.aoa) %>%
  mutate(ref = ifelse(measure == "all.aoa", 24, 50)) %>%
  ggplot(aes(x = value)) +
  geom_histogram() + 
  geom_vline(aes(xintercept = ref), linetype = 2, color = "red") +
  facet_wrap(~measure, scales = "free_x") +
  theme_bw()
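To make the aoa definition above concrete (the age at which at least 50 percent of children know the word), here is a minimal sketch of how such a value could be derived from proportion-by-age data. The counts are hypothetical, and the logistic fit is an assumption about one reasonable procedure, not necessarily how the Wordbank aoas used here were computed.

```r
# Hypothetical data: out of 50 children per age group, how many are
# reported to produce a given word.
ages <- seq(16, 30, by = 2)            # age in months
n_children <- 50
n_producing <- c(3, 6, 12, 21, 30, 39, 44, 47)

# Simple version: first sampled age at which at least half the children produce the word
aoa_first <- ages[which(n_producing / n_children >= 0.5)[1]]

# Smoothed version: fit a logistic curve and solve for the 50% point,
# i.e. the age where intercept + slope * age = 0
fit <- glm(cbind(n_producing, n_children - n_producing) ~ ages,
           family = binomial)
b <- coef(fit)
aoa_50 <- unname(-b[1] / b[2])

print(c(first_age = aoa_first, logistic = aoa_50))
```

The smoothed estimate can fall between sampled ages, which is why the aoas in the table are not whole months.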

All items

(sorted by aoa)

kable(sna %>% select(11,2,9,4,5,6), 
      col.names = c("", "name", "aoa (months)", "% child agreement",
                    "child familiarity", "child visual complexity"))

	name	aoa (months)	% child agreement	child familiarity	child visual complexity
1	dog	12.79127	100	3.47	3.80
2	duck	16.37416	93	2.67	3.60
3	bird	16.41657	90	3.07	4.07
4	cat	17.28150	100	3.00	3.80
5	fish	18.57089	100	2.93	3.13
6	bear	18.58330	80	2.50	3.93
7	cow	19.63397	87	2.70	3.89
8	horse	20.06142	97	3.53	3.53
9	rabbit	20.06388	70	3.40	3.93
10	pig	20.47164	100	1.81	3.67
11	monkey	20.87461	97	3.20	3.60
12	elephant	22.09924	100	2.93	4.00
13	chicken	22.16242	50	1.97	4.10
14	mouse	22.39705	77	2.52	3.34
15	lion	22.42629	90	2.00	3.93
16	sheep	22.79360	40	2.44	4.04
17	owl	23.27922	93	1.83	4.03
18	tiger	23.28858	77	2.21	4.57
19	fox	23.57170	73	2.21	3.79
20	giraffe	23.80614	83	2.79	4.14
21	snake	23.83154	97	2.33	2.53
22	camel	23.86402	90	2.17	3.90
23	goat	24.18881	60	2.07	4.00
24	squirrel	24.38394	83	3.07	3.97
25	zebra	24.40859	97	2.38	4.17
26	skunk	24.54609	73	2.64	4.00
27	alligator	24.70700	60	1.80	4.13
28	seal	24.87089	80	2.59	3.83
29	kangaroo	25.29313	90	2.93	3.90
30	gorilla	25.91024	70	2.47	4.13
31	peacock	25.97520	40	2.04	4.44
32	deer	26.02419	70	2.77	3.92
33	eagle	26.20256	30	2.77	4.31
34	seahorse	26.20256	50	2.70	4.04
35	penguin	26.33383	93	2.86	3.71
36	rooster	26.72931	40	2.59	3.97
37	rhinoceros	26.75472	43	1.85	3.77
38	donkey	27.66212	70	2.60	3.60
39	swan	27.79407	57	2.93	3.29
40	ostrich	28.11887	37	2.09	4.00
41	raccoon	29.32063	47	2.04	4.13
42	leopard	29.48302	30	2.15	3.77
43	lobster	31.43181	53	2.08	4.62

So, the question is, what should we try to optimize in selecting the stimulus set?
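As a starting point for that discussion, the sketch below shows one way the trade-off could be explored: filter on naming agreement, then check how much aoa spread survives. It uses a handful of rows copied from the table above (aoas rounded to two decimals); the 90 percent cutoff and the 24-month split are arbitrary assumptions, not recommendations.

```r
# A few rows from the "All items" table (aoa rounded to 2 decimals)
items <- data.frame(
  name = c("dog", "cat", "fish", "pig", "elephant", "chicken", "sheep",
           "snake", "zebra", "penguin", "eagle", "leopard", "lobster"),
  aoa = c(12.79, 17.28, 18.57, 20.47, 22.10, 22.16, 22.79,
          23.83, 24.41, 26.33, 26.20, 29.48, 31.43),
  agreement = c(100, 100, 100, 100, 100, 50, 40,
                97, 97, 93, 30, 30, 53)
)

min_agreement <- 90  # hypothetical cutoff

# Keep only items that children name consistently
candidates <- items[items$agreement >= min_agreement, ]

# Split the survivors at two years to see how much late-aoa coverage is left
candidates$aoa_bin <- cut(candidates$aoa,
                          breaks = c(0, 24, Inf),
                          labels = c("by 24 months", "after 24 months"))

print(table(candidates$aoa_bin))
print(range(candidates$aoa))
```

Because aoa and agreement are negatively correlated, a strict agreement cutoff removes most late-acquired items, so the two criteria likely need to be balanced rather than optimized separately.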


Molly Lewis

2017-02-22
