To estimate a language's complexity bias for a larger sample of languages, we are norming the words included in the ASJP database (Wikipedia). This database contains translations of the same set of words for a very large sample of languages. The word list is based on the Swadesh list (N = 100 words), but most languages only have translations for a sublist (N = 40 words).
library(dplyr)
library(tidyr)
library(ggplot2)

wichmann = read.csv("wichmann_2013.csv")

# Count how many translated words are available for each language.
num_words = wichmann %>%
  distinct(ISO, .keep_all = TRUE) %>%   # one row per language (keep word columns)
  gather(word, translation, I:name) %>% # long format: one row per word
  filter(translation != "") %>%
  filter(ISO != "") %>%
  group_by(ISO) %>%
  summarise(N_WORDS = n()) %>%
  arrange(-N_WORDS)
TOTAL_LANG = nrow(num_words)
ggplot(num_words, aes(x = N_WORDS)) +
  geom_histogram(aes(y = ..count..)) +
  ylab("Number of languages") +
  xlab("Number of words") +
  annotate("text", x = 75, y = 550,
           label = paste0("Total languages =\n", TOTAL_LANG),
           size = 6, color = "red")
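As a quick check on the histogram above, we can also count how many languages have the complete 40-word sublist. This is a minimal sketch using the num_words data frame computed in the previous chunk:

# Number (and proportion) of languages with all 40 sublist words translated
sum(num_words$N_WORDS >= 40)
mean(num_words$N_WORDS >= 40)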
This is what the distribution of word lengths looks like for the set of 40 common words.
words = read.csv("RC43_words.csv")

# Plot the distribution of (English) word lengths for the 40-word sublist.
words %>%
  select(swadesh) %>%
  mutate(nchar = nchar(as.character(swadesh))) %>%
  ggplot(aes(x = nchar)) +
  ggtitle("Distribution of word lengths for 40 words") +
  geom_histogram(aes(y = ..count..), binwidth = 1) +
  scale_x_continuous(breaks = 1:10) +
  ylab("Number of words") +
  xlab("Number of characters")
And here are the words:
sort(words$swadesh)
## [1] blood bone breast come die dog drink
## [8] ear eye fire fish full hand hear
## [15] horn I knee leaf liver louse mountain
## [22] name new night nose one path person
## [29] see skin star stone sun tongue tooth
## [36] tree two water we you
## 40 Levels: blood bone breast come die dog drink ear eye fire fish ... you
This is a very small sample of words with limited variability in length. Is it worth norming these, or should we try to norm the full 100?
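To put a rough number on that limited variability, here is a quick summary of the length distribution, sketched with the words data frame loaded above (the summary column names are just illustrative):

# Summary statistics for word length in the 40-word sublist
words %>%
  mutate(nchar = nchar(as.character(swadesh))) %>%
  summarise(mean_len = mean(nchar),
            sd_len   = sd(nchar),
            min_len  = min(nchar),
            max_len  = max(nchar))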