To estimate a language’s complexity bias for a larger sample of languages, we are norming the words included in the ASJP database (Wikipedia). This database contains translations of the same set of words for a very large sample of languages. The word list is based on the Swadesh list (N = 100 words), but most languages only have translations for a sublist (N = 40 words).
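For orientation, here is a minimal sketch of the wide layout the pipeline below assumes: one row per language, an ISO column, and one column per word, with empty strings marking missing translations. The values are made up; I and name stand in for the first and last word columns referenced in the gather() call below.

# Hypothetical two-language illustration of the assumed wide layout;
# the real column names and order come from wichmann_2013.csv.
toy = data.frame(
  ISO  = c("deu", "fra"),
  I    = c("ix", "je"),
  you  = c("du", "tu"),
  name = c("name", ""),   # "" marks a missing translation
  stringsAsFactors = FALSE)
toy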

# dplyr/tidyr for the data manipulation below, ggplot2 for the plots
library(dplyr)
library(tidyr)
library(ggplot2)

wichmann = read.csv("wichmann_2013.csv")

num_words = wichmann %>%
  distinct(ISO, .keep_all = TRUE) %>%    # one row per language; keep the word columns
  gather(word, translation, I:name) %>%  # long format: one row per language-word pair
  filter(translation != "") %>%          # drop missing translations
  filter(ISO != "") %>%                  # drop rows without an ISO code
  group_by(ISO) %>%
  summarise(N_WORDS = n()) %>%           # number of translated words per language
  arrange(-N_WORDS)

TOTAL_LANG = nrow(num_words)
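As a quick check on the claim that most languages only cover the 40-word sublist, we can count how many languages reach each list size; this is a sketch, with the cutoffs 40 and 100 taken from the list sizes above.

# How many languages have at least the 40-word sublist,
# and how many have the full 100-word list?
num_words %>%
  summarise(
    at_least_40  = sum(N_WORDS >= 40),
    at_least_100 = sum(N_WORDS >= 100))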

ggplot(num_words, aes(x = N_WORDS)) +
  geom_histogram() +
  ylab("Number of languages") +
  xlab("Number of words") +
  annotate("text", 75, 550,
           label = paste("Total languages =\n", TOTAL_LANG), size = 6, color = "red")

This is what the distribution of word lengths looks like for the set of 40 common words.

words = read.csv("RC43_words.csv")  

words %>%
  select(swadesh) %>%
  mutate(nchar = nchar(as.character(swadesh))) %>%  # length of each English word
  ggplot(aes(x = nchar)) +
  ggtitle("Distribution of word lengths for 40 words") +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = 1:10) +
  ylab("Number of words") +   # each bar counts words, not languages
  xlab("Number of characters")

And, here are the words:

sort(words$swadesh)
##  [1] blood    bone     breast   come     die      dog      drink   
##  [8] ear      eye      fire     fish     full     hand     hear    
## [15] horn     I        knee     leaf     liver    louse    mountain
## [22] name     new      night    nose     one      path     person  
## [29] see      skin     star     stone    sun      tongue   tooth   
## [36] tree     two      water    we       you     
## 40 Levels: blood bone breast come die dog drink ear eye fire fish ... you

This is a very small sample of words, with limited variability in length. Is it worth norming these, or should we try to norm the full 100-word list?
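To put a number on that limited variability, here is a quick summary of the length distribution, sketched with the words data frame loaded above.

# Summary statistics for the lengths of the 40 English words
word_lens = nchar(as.character(words$swadesh))
summary(word_lens)
sd(word_lens)   # standard deviation of word length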