There are two samples of complexity norms, RC43 and RC44. They are identical except that RC44 gives no examples in the instructions and has no anchor trials (ball and motherboard). We made these changes hoping to get more variability in our sample, but it doesn’t really look like we achieved this. Nevertheless, the complexity bias is much bigger in the second sample. Does it make sense to use this sample instead?

Get norms

RC43 and RC44

# read each results file and stack the rows
files = dir("../experiments/RC43/production-results/")
d43 = data.frame()
for (i in seq_along(files)) {  # seq_along() is safe when the directory is empty
    s <- as.data.frame(fromJSON(paste("../experiments/RC43/production-results/", files[i], sep = "")))
    d43 = rbind(d43, s)
}
d43$exp = 43

# clean up names
names(d43) = unlist(strsplit(names(d43), "rs."))[unlist(strsplit(names(d43), "rs."))
                                             != "answe"]
d43 = d43 %>%
  gather(variable, value, contains("_")) %>%
  mutate(trial_num =  unlist(lapply(strsplit(as.character(variable),
                                      "_"),function(x) x[2])),
         variable = unlist(lapply(strsplit(as.character(variable),
                                      "_"),function(x) x[1]))) %>%
  spread(variable, value) %>%
  mutate(value = as.numeric(value)) %>%
  filter(condition == "swadesh") %>%  # keep only rows from the swadesh condition
  select(WorkerId, exp, trial_num, value, word) %>%
  filter(word != "motherboard" & word != "ball")  # drop the two anchor trials
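
The name clean-up step above splits each column name on the regular expression "rs." and discards the leftover "answe" fragment. The sketch below illustrates that on made-up column names and compares it with a more targeted `sub()` call. The names in `raw_names` are hypothetical; the real JSON fields are assumed to carry an "answers." prefix, which is inferred from the code and not confirmed by the data.

```r
# Hypothetical column names, assumed to mimic the flattened JSON
raw_names <- c("WorkerId", "condition", "answers.word_1", "answers.value_1")

# Approach used above: split on the regex "rs." and drop the "answe" fragment
parts <- unlist(strsplit(raw_names, "rs."))
via_split <- parts[parts != "answe"]

# Alternative (an assumption, not the original code): strip a literal
# leading "answers." and leave every other name untouched
via_sub <- sub("^answers\\.", "", raw_names)

print(via_split)
print(via_sub)
```

On these names the two approaches agree. The `sub()` version may be safer in general, because the split version would also cut any other column whose name happens to contain "rs" followed by another character.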

# read each results file and stack the rows
files = dir("../experiments/RC44/production-results/")
d44 = data.frame()
for (i in seq_along(files)) {  # seq_along() is safe when the directory is empty
    s <- as.data.frame(fromJSON(paste("../experiments/RC44/production-results/", files[i], sep = "")))
    d44 = rbind(d44, s)
}
d44$exp = 44

# clean up names
names(d44) = unlist(strsplit(names(d44), "rs."))[unlist(strsplit(names(d44), "rs."))
                                             != "answe"]
d44 = d44 %>%
  gather(variable, value, contains("_")) %>%
  mutate(trial_num =  unlist(lapply(strsplit(as.character(variable),
                                      "_"),function(x) x[2])),
         variable = unlist(lapply(strsplit(as.character(variable),
                                      "_"),function(x) x[1]))) %>%
  spread(variable, value) %>%
  mutate(value = as.numeric(value)) %>%
  select(WorkerId, exp, trial_num, value, word)

d = rbind(d43, d44) %>%
      mutate(wordLength = nchar(word))
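
To make the gather / mutate / spread reshape used for both samples easier to follow, here is a minimal sketch on a toy data frame. The data frame `wide` and its contents are invented for illustration; the only assumption carried over from the code above is that trial-level columns are named like `word_1`, `value_1`, and so on.

```r
library(dplyr)
library(tidyr)

# Toy wide-format data: one row per worker, one column pair per trial
# (hypothetical values, not taken from RC43 or RC44)
wide <- data.frame(WorkerId = c("w1", "w2"),
                   word_1 = c("dog", "sun"), value_1 = c("2", "1"),
                   word_2 = c("sun", "dog"), value_2 = c("3", "4"),
                   stringsAsFactors = FALSE)

long <- wide %>%
  gather(variable, value, contains("_")) %>%                 # stack all trial columns
  mutate(trial_num = sapply(strsplit(variable, "_"), `[`, 2),  # "word_1" -> "1"
         variable  = sapply(strsplit(variable, "_"), `[`, 1)) %>%  # "word_1" -> "word"
  spread(variable, value) %>%                                # one column each for word, value
  mutate(value = as.numeric(value))                          # ratings back to numbers

print(long)
```

The result has one row per worker and trial, with `word` and `value` as separate columns, which is the shape the later `group_by(word, exp)` summaries rely on.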

Look at distributions of complexity norms.

The distributions don’t look wildly different; in fact, there’s less variability in the second sample.

d %>%
    ggplot(aes(x=value)) +
    geom_density(fill = "red") +
    facet_grid(~exp) +
    geom_vline(aes(xintercept = mean(value))) +
    xlim(1,7) +
    theme_bw()

sw.means = d %>%
  mutate(word = trimws(word)) %>%
  group_by(word, exp) %>%
  summarise(mean = mean(value))  %>%
  mutate(wordLength.eng = nchar(word))


sw.means %>%
  ungroup() %>%
  group_by(exp) %>%
  summarise(sd = sd(mean),
            mean = mean(mean)) %>%
  kable()

exp	sd	mean
43	0.5930842	2.877284
44	0.4361398	2.537750

Complexity bias in two samples

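The bias is much bigger in the second sample (RC44): the correlation between mean complexity rating and word length is r = 0.51 there, versus r = 0.32 in RC43.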

ggplot(sw.means, aes(y = wordLength.eng, x = mean )) +
  geom_label(aes(label = word),position = "jitter") +
  facet_grid(~exp) +
  geom_smooth(method = "lm") +
  ylab("word length (char)") +
  xlab("Mean complexity rating") +
  theme_bw()

sw.means %>%
  group_by(exp) %>%
  do(tidy(cor.test(.$wordLength.eng, .$mean))) %>%
  select(exp, estimate, statistic, p.value) %>%
  kable()

exp	estimate	statistic	p.value
43	0.3168701	2.059443	0.0463512
44	0.5122684	3.676922	0.0007269
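
One rough way to ask whether the two correlations differ reliably is a Fisher r-to-z comparison, sketched below using only the values in the table above. This is not part of the original analysis. It assumes the two correlations are independent, which is questionable here because both samples rate the same words, so treat the result as a loose guide at best. The helper `n_from_rt` is a hypothetical name; it recovers the number of words from the reported r and t.

```r
# Values copied from the cor.test table above
r43 <- 0.3168701; t43 <- 2.059443
r44 <- 0.5122684; t44 <- 3.676922

# Recover n from t = r * sqrt(df) / sqrt(1 - r^2), with df = n - 2
n_from_rt <- function(r, t) round(t^2 * (1 - r^2) / r^2) + 2
n43 <- n_from_rt(r43, t43)
n44 <- n_from_rt(r44, t44)

# Fisher r-to-z comparison; ASSUMES the two correlations are independent
z_diff <- (atanh(r44) - atanh(r43)) / sqrt(1 / (n43 - 3) + 1 / (n44 - 3))
p_two_sided <- 2 * pnorm(-abs(z_diff))

print(c(n43 = n43, n44 = n44, z = z_diff, p = p_two_sided))
```

Under these assumptions the z statistic comes out near 1 (two-sided p around 0.3), so the gap between the two correlations may be less clear-cut than the point estimates suggest. A test for dependent correlations would be more appropriate if the item-level data are used directly.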

Swadesh Complexity norms (comparing two samples)

Molly Lewis

2016-10-21
