What animal items should we use in the NextKids experiment? Ideally, we’d want them to (a) be sufficiently similar yet diverse enough to reveal interesting embeddings, (b) have high naming agreement, and (c) vary in age of acquisition (aoa) so we can look at embeddings as a function of vocab size.

Martin, Clint, and I discussed using the Snodgrass and Vanderwart pictures (1980, n = 270), since there are already lots of existing norms on them. Note that one question here is whether we want to use the black-and-white or the (newer) colored versions, e.g.:

knitr::include_graphics(c("pics/021.jpg", "pics/021_color.jpg"))

knitr::include_graphics(c("pics/003.jpg","pics/003_color.jpg"))

There are 43 animals in the Snodgrass and Vanderwart set. The next question is which subset to use, which depends on how many items we need. In terms of similarity, I’m not totally sure how to go about selecting a set of items with the right level of diversity (would it be useful to use existing similarity norms to guide our choice, e.g. from this paper?).
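One possibility, sketched below, would be to pick items greedily so that each new item is as dissimilar as possible from the ones already chosen. This assumes we had a symmetric item-by-item similarity matrix (here called sim_mat, a placeholder) built from existing similarity norms; the function name and the choice of k are purely illustrative.

# Sketch only: greedy max-min ("farthest point") selection of k diverse items.
# Assumes sim_mat is a symmetric items x items similarity matrix with item
# names as its rownames/colnames, e.g. derived from published similarity norms.
pick_diverse <- function(sim_mat, k, seed_item = rownames(sim_mat)[1]) {
  items <- rownames(sim_mat)
  chosen <- seed_item
  while (length(chosen) < k) {
    remaining <- setdiff(items, chosen)
    # each candidate's similarity to its closest already-chosen item
    closest_sim <- sapply(remaining, function(i) max(sim_mat[i, chosen]))
    # add the candidate that is least similar to the current set
    chosen <- c(chosen, remaining[which.min(closest_sim)])
  }
  chosen
}

# e.g. pick_diverse(sim_mat, k = 20)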

In terms of naming agreement, Martin and Clint found this very useful paper, which has norms for the SV pictures as rated by 5- and 6-year-olds.

In terms of aoa, I estimated aoas using the Wordbank dataset, which is based on parent-report checklists (CDI). 17 of the 43 SV items are not on the CDI, so for those items I estimated aoa from adult aoa estimates, rescaled based on the Wordbank data.

## Read in data, clean and merge.

# libraries used throughout
library(tidyverse)  # dplyr, tidyr, ggplot2, stringr
library(broom)      # tidy()
library(corrr)      # correlate(), shave(), fashion()
library(knitr)      # kable(), include_graphics()

# norms for Snodgrass and Vanderwart pictures (from Cycowicz et al., 1997)
snod.norms = read.csv("cycowicz_data.csv")

# wordbank aoas
aoa.norms.wb = read.csv("eng_ws_production_aoas.csv")  %>%
  rename(wb.aoa = aoa) %>%
  filter(category == "animals") %>%
  # keep only the part of each definition before any "(" and trim whitespace
  mutate(definition = unlist(lapply(strsplit(as.character(definition),
                                             "\\("), function(x) x[1])),
         definition = str_trim(definition))

# Kuperman et al AOAs (for missing values from wordbank)
aoa.norms.kuperman = read.csv("AoA_ratings_Kuperman_et_al_BRM.csv") %>%
  select(Word, Rating.Mean) %>%
  rename(adult.aoa = Rating.Mean)

sna = snod.norms %>%
  left_join(aoa.norms.wb %>% select(wb.aoa, definition),
            by = c("Intentional.name" = "definition")) %>%
  left_join(aoa.norms.kuperman, by = c("Intentional.name" = "Word")) %>%
  filter(is_insect == 0) %>%      # animals only (drop insects)
  select(-is_insect, -modal.name) %>%
  arrange(wb.aoa)
## Rescale adult aoas for missing wb aoas.
scale.params = tidy(lm(wb.aoa ~ adult.aoa, data = sna))

intercept = scale.params$estimate[1]
slope = scale.params$estimate[2]

sna = sna %>%
  mutate(all.aoa = ifelse(is.na(wb.aoa), 
                          intercept + (adult.aoa*slope), wb.aoa),
         imputed = ifelse(is.na(wb.aoa), "estimated", "reported")) %>%
  arrange(all.aoa) %>%
  mutate(n = 1:n())

#ggplot(sna, aes(x = all.aoa, y = adult.aoa)) +
#  geom_point(aes(color = imputed), size = 2) +
#  geom_smooth(method = "lm") 

## Correlations between norms

Aoas are negatively correlated with percent agreement (words that are learned later have lower naming agreement). Familiarity and (visual) complexity are additional child-rated measures.

correlate(sna %>%  select(4:6,9)) %>%
  shave() %>%
  fashion() %>%
  kable()
| rowname           | percent_agreement | Familiarity | Complexity | all.aoa |
|-------------------|-------------------|-------------|------------|---------|
| percent_agreement |                   |             |            |         |
| Familiarity       | .40               |             |            |         |
| Complexity        | -.39              | -.26        |            |         |
| all.aoa           | -.65              | -.46        | .25        |         |

## Distribution of aoas and percent agreement

Aoas are given in months and denote the age at which at least 50 percent of children know the word. The red dashed reference lines mark two years of age (24 months) in the aoa facet and 50 percent in the agreement facet.

sna %>%
  gather(measure, value, c(4,9)) %>%
  mutate(ref = ifelse(measure == "all.aoa", 24, 50)) %>%
  ggplot(aes(x = value)) +
  geom_histogram() + 
  geom_vline(aes(xintercept = ref), linetype = 2, color = "red") +
  facet_wrap(~measure, scales = "free_x") +
  theme_bw()

## All items

(sorted by aoa)

kable(sna %>% select(11,2,9,4,5,6), 
      col.names = c("", "name", "aoa (months)", "% child agreement",
                    "child familiarity", "child visual complexity"))
|    | name       | aoa (months) | % child agreement | child familiarity | child visual complexity |
|----|------------|--------------|-------------------|-------------------|-------------------------|
| 1  | dog        | 12.79127     | 100               | 3.47              | 3.80                    |
| 2  | duck       | 16.37416     | 93                | 2.67              | 3.60                    |
| 3  | bird       | 16.41657     | 90                | 3.07              | 4.07                    |
| 4  | cat        | 17.28150     | 100               | 3.00              | 3.80                    |
| 5  | fish       | 18.57089     | 100               | 2.93              | 3.13                    |
| 6  | bear       | 18.58330     | 80                | 2.50              | 3.93                    |
| 7  | cow        | 19.63397     | 87                | 2.70              | 3.89                    |
| 8  | horse      | 20.06142     | 97                | 3.53              | 3.53                    |
| 9  | rabbit     | 20.06388     | 70                | 3.40              | 3.93                    |
| 10 | pig        | 20.47164     | 100               | 1.81              | 3.67                    |
| 11 | monkey     | 20.87461     | 97                | 3.20              | 3.60                    |
| 12 | elephant   | 22.09924     | 100               | 2.93              | 4.00                    |
| 13 | chicken    | 22.16242     | 50                | 1.97              | 4.10                    |
| 14 | mouse      | 22.39705     | 77                | 2.52              | 3.34                    |
| 15 | lion       | 22.42629     | 90                | 2.00              | 3.93                    |
| 16 | sheep      | 22.79360     | 40                | 2.44              | 4.04                    |
| 17 | owl        | 23.27922     | 93                | 1.83              | 4.03                    |
| 18 | tiger      | 23.28858     | 77                | 2.21              | 4.57                    |
| 19 | fox        | 23.57170     | 73                | 2.21              | 3.79                    |
| 20 | giraffe    | 23.80614     | 83                | 2.79              | 4.14                    |
| 21 | snake      | 23.83154     | 97                | 2.33              | 2.53                    |
| 22 | camel      | 23.86402     | 90                | 2.17              | 3.90                    |
| 23 | goat       | 24.18881     | 60                | 2.07              | 4.00                    |
| 24 | squirrel   | 24.38394     | 83                | 3.07              | 3.97                    |
| 25 | zebra      | 24.40859     | 97                | 2.38              | 4.17                    |
| 26 | skunk      | 24.54609     | 73                | 2.64              | 4.00                    |
| 27 | alligator  | 24.70700     | 60                | 1.80              | 4.13                    |
| 28 | seal       | 24.87089     | 80                | 2.59              | 3.83                    |
| 29 | kangaroo   | 25.29313     | 90                | 2.93              | 3.90                    |
| 30 | gorilla    | 25.91024     | 70                | 2.47              | 4.13                    |
| 31 | peacock    | 25.97520     | 40                | 2.04              | 4.44                    |
| 32 | deer       | 26.02419     | 70                | 2.77              | 3.92                    |
| 33 | eagle      | 26.20256     | 30                | 2.77              | 4.31                    |
| 34 | seahorse   | 26.20256     | 50                | 2.70              | 4.04                    |
| 35 | penguin    | 26.33383     | 93                | 2.86              | 3.71                    |
| 36 | rooster    | 26.72931     | 40                | 2.59              | 3.97                    |
| 37 | rhinoceros | 26.75472     | 43                | 1.85              | 3.77                    |
| 38 | donkey     | 27.66212     | 70                | 2.60              | 3.60                    |
| 39 | swan       | 27.79407     | 57                | 2.93              | 3.29                    |
| 40 | ostrich    | 28.11887     | 37                | 2.09              | 4.00                    |
| 41 | raccoon    | 29.32063     | 47                | 2.04              | 4.13                    |
| 42 | leopard    | 29.48302     | 30                | 2.15              | 3.77                    |
| 43 | lobster    | 31.43181     | 53                | 2.08              | 4.62                    |

So, the question is: what should we try to optimize in selecting the stimulus set?
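For concreteness, here is one purely illustrative way to combine the norms above: keep items with reasonable child naming agreement, then take picks spread evenly across the aoa range so that vocabulary size varies across the set. The 70 percent cutoff and the set size of 20 are placeholders, not recommendations; the column names follow the sna data frame built above.

# Illustrative only: filter by child naming agreement, then spread across aoa.
candidate_set = sna %>%
  filter(percent_agreement >= 70) %>%                 # arbitrary cutoff
  arrange(all.aoa) %>%
  slice(unique(round(seq(1, n(), length.out = 20))))  # ~evenly spaced by aoa

# candidate_set %>% select(Intentional.name, all.aoa, percent_agreement)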