This is a revised markdown analysis of the RC43 dataset of Swadesh words.
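The analyses below assume a setup chunk along the following lines; the package list is inferred from the functions used in the code and is an assumption, not part of the original file.
library(dplyr)      # select(), filter(), group_by(), summarise(), joins
library(tidyr)      # gather()
library(ggplot2)    # plots
library(broom)      # tidy() for cor.test() and lm() output
library(knitr)      # kable() tables
library(data.table) # fread()
library(stringr)    # str_split()
library(GGally)     # ggcorr()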
Is word length correlated with complexity ratings in English? Yes.
sw.means = read.csv("swadesh_norms.csv") %>%
select(-1)
ggplot(sw.means, aes(y = wordLength.eng, x = mean )) +
geom_label(aes(label = word),position = "jitter") +
geom_smooth(method = "lm") +
xlim(1,7) +
ylab("word length (char)") +
xlab("Mean complexity rating") +
theme_bw(base_size = 18)
tidy(cor.test(sw.means$wordLength.eng, sw.means$mean)) %>%
select(estimate, statistic, p.value) %>%
kable()
| estimate | statistic | p.value |
|---|---|---|
| 0.3168701 | 2.059443 | 0.0463512 |
Is word length correlated with complexity ratings in other languages? On average, yes.
Read in the translations. The dataset comes from http://asjp.clld.org/.
# note: had to manually delete all "'" characters from the raw text file (read.table doesn't parse them); an alternative is sketched after this chunk
swadesh = read.table("dataset.tab", fill = T,
header = T, sep = "\t") %>%
select(-e,-hh,-lon,-lat, -wcode) %>%
gather("word","translation",I:name) %>%
mutate(nchar = unlist( # get translation length
lapply(
lapply(
strsplit(
gsub("[[:space:]]", "", translation) ,
","),
nchar), mean))) %>%
filter(translation != "") %>%
filter(is.element(word, sw.means$word)) %>% # get crit words only
left_join(sw.means) %>% # add in complexity norms
rename(complexity = mean) %>%
arrange(-pop, iso) %>%
select(-ci_lower, -ci_upper)
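As an alternative to manually deleting the "'" characters, read.table()'s quote (and comment.char) arguments can be relaxed so that apostrophes and "#" are read as literal characters rather than delimiters. This is only a sketch (swadesh.alt is a placeholder name, and it is untested against this particular raw file):
swadesh.alt = read.table("dataset.tab", fill = TRUE, header = TRUE, sep = "\t",
                         quote = "",        # do not treat ' or " as quoting characters
                         comment.char = "") # do not treat # as a comment character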
NOTE: Length is coded in the ASJP coding system. About ASJPcode: "We have developed an ASJP orthography (ASJPcode) into which all original lexical lists are converted (3.2). ASJPcode can be viewed as a very simplified version of the International Phonetic Alphabet (IPA). A major feature of this standard orthography is that it entails only symbols found on the common QWERTY keyboard for English. Words on lists in a given orthography can typically be converted into ASJPcode in a short period of time, usually less than an hour, by a trained transcriber. [...] ASJPcode consists of 41 symbols (representing 7 vowels and 34 consonants), all found on the standard QWERTY keyboard (see Appendix C for a full description of ASJPcode). Some symbols of ASJPcode, like those of IPA, represent only one sound, e.g., N = velar nasal (IPA: N). Some single ASJP symbols represent sounds designated by combined symbols in IPA, e.g., C = voiceless palato-alveolar affricate (IPA: tS). Unlike IPA, some ASJP symbols can represent more than one sound, e.g., c = both the voiceless alveolar affricate (IPA: ts) and the voiced alveolar affricate (IPA: dz). Some symbols are cover symbols for a relatively broad range of sounds, usually including those occurring rarely in languages. For example, L is used to represent all laterals other than normal l (the voiced alveolar lateral
approximant). The symbols used for vocalic sounds cover broad ranges. For example, the symbols a and 3 represent all central vowels, with a restricted to the low central vowel and 3 covering all other central vowels. ASJPcode is designed to represent all the commonly occurring sounds of the world's languages. Occasionally, rare sounds are encountered in languages not explicitly identified in the orthography. Such a sound is represented by a symbol in the orthography that identifies the sound that is closest to the rare sound in place and manner of articulation. For example, S, which represents the voiceless palato-alveolar fricative (IPA: S), can be used to designate the relatively rarely occurring retroflexed palato-alveolar fricative (IPA: ʂ)." (Brown et al., 2008)
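To make the length computation in the chunk above concrete, here is a worked example on a made-up, ASJP-style entry with two comma-separated synonyms (the entry is hypothetical, not taken from the dataset): whitespace is stripped, the entry is split on commas, and the mean number of ASJP symbols per variant is taken.
example = "nene, kato"                                          # hypothetical two-variant entry
variants = strsplit(gsub("[[:space:]]", "", example), ",")[[1]] # "nene" "kato"
mean(nchar(variants))                                           # mean length in ASJP symbols: 4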
ms = swadesh %>%
filter(iso != "") %>%
group_by(iso, word) %>%
summarise(names = names[1],
wls_fam = wls_fam[1],
wls_gen = wls_gen[1],
log.pop = log(mean(pop, na.rm = TRUE)),
nchar = mean(nchar),
complexity = complexity[1])
total.languages = ms %>%
ungroup() %>%
distinct(iso) %>%
summarise(count = n())
# add number of items in each language (max == 40), and mean and sd length
ms = ms %>%
left_join(ms %>%
group_by(iso) %>%
summarise(n = n(),
mean.length = mean(nchar),
sd.length = sd(nchar)))
complete.languages = ms %>%
ungroup() %>%
filter(n == 40) %>%
distinct(iso) %>%
summarise(count = n())
There are 4674 languages in this dataset. 1037 of those languages have translations for all 40 words. The analyses below include only languages with the full set of 40 words. When all languages are included, we see the same pattern, but it is weaker (presumably due to measurement noise).
MIN_ITEMS = 40
empirical.corrs = ms %>%
filter(n > (MIN_ITEMS-1)) %>%
group_by(iso) %>%
summarise(r.empirical = tidy(cor.test(nchar,
complexity))$estimate,
p = tidy(cor.test(nchar, complexity))$p.value,
sig = ifelse(p<.05, "*", ""),
language = tolower(names[1]),
fam = tolower(wls_fam[1]),
wls_gen = tolower(wls_gen[1]),
mean.length = mean(nchar),
sd.length = sd(nchar),
log.pop = log.pop[1])
# write.csv(empirical.corrs, "swadesh_cb.csv")
ggplot(empirical.corrs, aes(x = r.empirical)) +
geom_density() +
geom_vline(aes(xintercept = 0)) +
theme_bw()
Across languages, the mean complexity bias is 0.13.
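This is presumably just the average of the per-language correlations computed above:
mean(empirical.corrs$r.empirical) # ~0.13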
ec.fam = empirical.corrs %>%
group_by(fam) %>%
summarise(r.empirical = mean(r.empirical))
ggplot(ec.fam, aes(x = r.empirical)) +
geom_density() +
geom_vline(aes(xintercept = 0)) +
theme_bw()
Across families, the mean complexity bias is 0.1.
Read in the google complexity bias data.
google.cb = read.csv('/Documents/GRADUATE_SCHOOL/Projects/langLearnVar/data/lewis_2015.csv') %>%
select(language, corr)
google.raw <- fread('https://raw.githubusercontent.com/mllewis/RC/master/data/corpus/xling_lens.csv')
google.summary = google.raw %>%
gather( "language", "length",contains("len")) %>%
mutate(language = as.character(tolower(lapply(str_split(language,"_"),
function(x) {x[1]})))) %>%
group_by(language) %>%
summarise(mean.length = mean(length, na.rm = T),
sd.length = sd(length, na.rm = T))
google = left_join(google.cb, google.summary) %>%
mutate(dataset = "google")
Munge the swadesh data.
swadesh.ms = empirical.corrs %>%
select(language, r.empirical, mean.length, sd.length ) %>%
rename(corr = r.empirical) %>%
mutate(dataset = "swadesh")
all.cbs = rbind(google,swadesh.ms)
all.cbs %>%
group_by(dataset) %>%
summarise(corr = mean(corr),
mean.length = mean(mean.length, na.rm = T),
sd.length = mean(sd.length, na.rm = T)) %>%
kable()
| dataset | corr | mean.length | sd.length |
|---|---|---|---|
| google | 0.3370378 | 10.148804 | 4.84671 |
| swadesh | 0.1292288 | 4.376036 | 1.47627 |
The correlation is higher in the google dataset, as are the mean word length and its sd.
Correlation between the complexity bias correlations in the two samples (for overlapping languages):
all.cbs.only = inner_join(google,swadesh.ms, by="language") %>%
select(language, corr.x, corr.y) %>%
rename(google.corr = corr.x,
swadesh.corr = corr.y)
ggplot(all.cbs.only, aes(x = google.corr, y = swadesh.corr)) +
geom_label(aes(label = language)) +
geom_smooth(method = "lm") +
theme_bw()
tidy(cor.test(all.cbs.only$google.corr, all.cbs.only$swadesh.corr)) %>%
select(estimate, statistic, p.value) %>%
kable()
| estimate | statistic | p.value |
|---|---|---|
| 0.2815381 | 1.901491 | 0.0641125 |
The two samples are positively correlated, though the correlation is only marginally significant (p = .064).
Next, let’s look at properties of the distribution of word lengths in each language that might affect the magnitude of the complexity bias correlation:
all.cbs %>%
filter(dataset == "google") %>%
select(corr, mean.length, sd.length) %>%
ggcorr(label = T)
all.cbs %>%
filter(dataset == "swadesh") %>%
select(corr, mean.length, sd.length) %>%
ggcorr(label = T)
In both datasets, mean word length and the sd of word length are correlated with each other. Next, we ask whether these aspects of the length distribution are related to the magnitude of the complexity bias.
tidy(lm(corr ~ sd.length + mean.length, data = swadesh.ms)) %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0506746 | 0.0282402 | 1.7944124 | 0.0730394 |
| sd.length | -0.0059012 | 0.0145299 | -0.4061425 | 0.6847219 |
| mean.length | 0.0199418 | 0.0081493 | 2.4470422 | 0.0145690 |
tidy(lm(corr ~ sd.length + mean.length, data = google)) %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.3570051 | 0.0303313 | 11.770199 | 0.0000000 |
| sd.length | -0.0262404 | 0.0071590 | -3.665356 | 0.0004655 |
| mean.length | 0.0105364 | 0.0052502 | 2.006861 | 0.0484682 |
In the google dataset, languages that have longer words on average tend to have a bigger complexity bias. Interestingly, this goes in the opposite direction for sd (smaller sd, bigger complexity bias). This is hard to interpret, though, because sd.length and mean.length are highly correlated (a quick check of this collinearity is sketched below). The relationship is less robust in the swadesh sample, but this seems likely due to floor effects.
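A quick check of that collinearity, using the all.cbs data frame defined above (a sketch; the column name length.collinearity is just a label):
all.cbs %>%
  group_by(dataset) %>%
  summarise(length.collinearity = cor(mean.length, sd.length, use = "complete.obs")) %>%
  kable()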
Do languages with bigger populations have a smaller complexity bias (as we find in the google sample)? No.
empirical.corrs = empirical.corrs %>%
filter(is.finite(log.pop))
ggplot(empirical.corrs, aes(x = log.pop, y = r.empirical)) +
#geom_histogram() +
geom_point() +
geom_smooth(method = "lm") +
theme_bw()
tidy(cor.test(empirical.corrs$log.pop, empirical.corrs$r.empirical)) %>%
select(estimate, statistic, p.value) %>%
kable()
| estimate | statistic | p.value |
|---|---|---|
| 0.0502521 | 1.577537 | 0.1149937 |
tidy(lm(r.empirical ~ log.pop + mean.length, data = empirical.corrs)) %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.0002835 | 0.0380624 | -0.007448 | 0.9940589 |
| log.pop | 0.0035157 | 0.0014212 | 2.473752 | 0.0135384 |
| mean.length | 0.0209184 | 0.0069457 | 3.011713 | 0.0026643 |
tidy(cor.test(empirical.corrs$log.pop, empirical.corrs$mean.length)) %>%
select(estimate, statistic, p.value) %>%
kable()
| estimate | statistic | p.value |
|---|---|---|
| -0.3237582 | -10.72858 | 0 |
ggplot(empirical.corrs, aes(x = log.pop, y = mean.length)) +
#geom_histogram() +
geom_point() +
geom_smooth(method = "lm") +
theme_bw()
In this sample, population and mean.length are reliably negatively correlated (r = -.32). Controlling for length, there’s a positive correlation between complexity bias and population.
empirical.corrs.fam = empirical.corrs %>%
group_by(fam) %>%
summarise(log.pop = mean(log.pop),
r.empirical = mean(r.empirical))
ggplot(empirical.corrs.fam, aes(x = log.pop, y = r.empirical)) +
#geom_histogram() +
geom_point() +
geom_smooth(method = "lm") +
theme_bw()
tidy(cor.test(empirical.corrs.fam$log.pop, empirical.corrs.fam$r.empirical)) %>%
select(estimate, statistic, p.value) %>%
kable()
| estimate | statistic | p.value |
|---|---|---|
| -0.0484087 | -0.4918715 | 0.6238568 |
At the family level, there is no relationship with population.
In summary:
* In a sample of about 1000 languages with 40 words each, there tends to be a complexity bias (a positive correlation between word length and complexity)
* In the set of languages that overlaps with the google sample, the complexity biases from the two datasets are correlated
* but the complexity bias is smaller overall in the swadesh sample
* and, unlike in the google sample, there is no correlation with population.