This is a revised markdown analysis of the RC43 dataset of Swadesh words.
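The analyses below assume a setup chunk along the following lines; the package list is inferred from the functions used in the code and is an assumption, not part of the original file.
library(dplyr)      # select(), filter(), group_by(), summarise(), joins
library(tidyr)      # gather()
library(ggplot2)    # plots
library(broom)      # tidy() for cor.test() and lm() output
library(knitr)      # kable() tables
library(data.table) # fread()
library(stringr)    # str_split()
library(GGally)     # ggcorr()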
Is word length correlated with complexity ratings in English? Yes.
sw.means = read.csv("swadesh_norms.csv") %>%
select(-1)
ggplot(sw.means, aes(y = wordLength.eng, x = mean )) +
geom_label(aes(label = word),position = "jitter") +
geom_smooth(method = "lm") +
xlim(1,7) +
ylab("word length (char)") +
xlab("Mean complexity rating") +
theme_bw(base_size = 18)
tidy(cor.test(sw.means$wordLength.eng, sw.means$mean)) %>%
select(estimate, statistic, p.value) %>%
kable()
| estimate | statistic | p.value |
|---|---|---|
| 0.3168701 | 2.059443 | 0.0463512 |
Is word length correlated with complexity ratings in other languages? On average, yes.
Read in the translations. The dataset comes from http://asjp.clld.org/.
# note: had to manually delete all "'" characters from the raw text file (read.table doesn't parse them); an alternative is sketched after this chunk
swadesh = read.table("dataset.tab", fill = T,
header = T, sep = "\t") %>%
select(-e,-hh,-lon,-lat, -wcode) %>%
gather("word","translation",I:name) %>%
mutate(nchar = unlist( # get translation length
lapply(
lapply(
strsplit(
gsub("[[:space:]]", "", translation) ,
","),
nchar), mean))) %>%
filter(translation != "") %>%
filter(is.element(word, sw.means$word)) %>% # get crit words only
left_join(sw.means) %>% # add in complexity norms
rename(complexity = mean) %>%
arrange(-pop, iso) %>%
select(-ci_lower, -ci_upper)
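As an alternative to manually deleting the "'" characters, read.table()'s quote (and comment.char) arguments can be relaxed so that apostrophes and "#" are read as literal characters rather than delimiters. This is only a sketch (swadesh.alt is a placeholder name, and it is untested against this particular raw file):
swadesh.alt = read.table("dataset.tab", fill = TRUE, header = TRUE, sep = "\t",
                         quote = "",        # do not treat ' or " as quoting characters
                         comment.char = "") # do not treat # as a comment character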
NOTE: Length is coded in the ASJP coding system. About ASJPcode: "We have developed an ASJP orthography (ASJPcode) into which all original lexical lists are converted (3.2). ASJPcode can be viewed as a very simplified version of the International Phonetic Alphabet (IPA). A major feature of this standard orthography is that it entails only symbols found on the common QWERTY keyboard for English. Words on lists in a given orthography can typically be converted into ASJPcode in a short period of time, usually less than an hour, by a trained transcriber. [...] ASJPcode consists of 41 symbols (representing 7 vowels and 34 consonants), all found on the standard QWERTY keyboard (see Appendix C for a full description of ASJPcode). Some symbols of ASJPcode, like those of IPA, represent only one sound, e.g., N = velar nasal (IPA: N). Some single ASJP symbols represent sounds designated by combined symbols in IPA, e.g., C = voiceless palato-alveolar affricate (IPA: tS). Unlike IPA, some ASJP symbols can represent more than one sound, e.g., c = both the voiceless alveolar affricate (IPA: ts) and the voiced alveolar affricate (IPA: dz). Some symbols are cover symbols for a relatively broad range of sounds, usually including those occurring rarely in languages. For example, L is used to represent all laterals other than normal l (the voiced alveolar lateral
approximant). The symbols used for vocalic sounds cover broad ranges. For example, the symbols a and 3 represent all central vowels, with a restricted to the low central vowel and 3 covering all other central vowels. ASJPcode is designed to represent all the commonly occurring sounds of the world's languages. Occasionally, rare sounds are encountered in languages not explicitly identified in the orthography. Such a sound is represented by a symbol in the orthography that identifies the sound that is closest to the rare sound in place and manner of articulation. For example, S, which represents the voiceless palato-alveolar fricative (IPA: S), can be used to designate the relatively rarely occurring retroflexed palato-alveolar fricative (IPA: ʂ)." (Brown et al., 2008)
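To make the length computation in the chunk above concrete, here is a worked example on a made-up, ASJP-style entry with two comma-separated synonyms (the entry is hypothetical, not taken from the dataset): whitespace is stripped, the entry is split on commas, and the mean number of ASJP symbols per variant is taken.
example = "nene, kato"                                          # hypothetical two-variant entry
variants = strsplit(gsub("[[:space:]]", "", example), ",")[[1]] # "nene" "kato"
mean(nchar(variants))                                           # mean length in ASJP symbols: 4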
ms = swadesh %>%
filter(iso != "") %>%
group_by(iso, word) %>%
summarise(names = names[1],
wls_fam = wls_fam[1],
wls_gen = wls_gen[1],
log.pop = log(mean(pop, na.rm = TRUE)),
nchar = mean(nchar),
complexity = complexity[1])
total.languages = ms %>%
ungroup() %>%
distinct(iso) %>%
summarise(count = n())
# add number of items in each language (max == 40), and mean and sd length
ms = ms %>%
left_join(ms %>%
group_by(iso) %>%
summarise(n = n(),
mean.length = mean(nchar),
sd.length = sd(nchar)))
complete.languages = ms %>%
ungroup() %>%
filter(n == 40) %>%
distinct(iso) %>%
summarise(count = n())
There are 4674 languages in this dataset. 1037 of those languages have translations for all 40 words. The analyses below include only languages with the full set of 40 words. When all languages are included, we see the same pattern, but it is weaker (presumably due to measurement noise).
MIN_ITEMS = 40
empirical.corrs = ms %>%
filter(n > (MIN_ITEMS-1)) %>%
group_by(iso) %>%
summarise(r.empirical = tidy(cor.test(nchar,
complexity))$estimate,
p = tidy(cor.test(nchar, complexity))$p.value,
sig = ifelse(p<.05, "*", ""),
language = tolower(names[1]),
fam = tolower(wls_fam[1]),
wls_gen = tolower(wls_gen[1]),
mean.length = mean(nchar),
sd.length = sd(nchar),
log.pop = log.pop[1])
# write.csv(empirical.corrs, "swadesh_cb.csv")
ggplot(empirical.corrs, aes(x = r.empirical)) +
geom_density() +
geom_vline(aes(xintercept = 0)) +
theme_bw()
Across languages, the mean complexity bias is 0.13.
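This is presumably just the average of the per-language correlations computed above:
mean(empirical.corrs$r.empirical) # ~0.13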
ec.fam = empirical.corrs %>%
group_by(fam) %>%
summarise(r.empirical = mean(r.empirical))
ggplot(ec.fam, aes(x = r.empirical)) +
geom_density() +
geom_vline(aes(xintercept = 0)) +
theme_bw()
Across families, the mean complexity bias is 0.1.
Read in the google complexity bias data.
google.cb = read.csv('/Documents/GRADUATE_SCHOOL/Projects/langLearnVar/data/lewis_2015.csv') %>%
select(language, corr)
google.raw <- fread('https://raw.githubusercontent.com/mllewis/RC/master/data/corpus/xling_lens.csv')
google.summary = google.raw %>%
gather( "language", "length",contains("len")) %>%
mutate(language = as.character(tolower(lapply(str_split(language,"_"),
function(x) {x[1]})))) %>%
group_by(language) %>%
summarise(mean.length = mean(length, na.rm = T),
sd.length = sd(length, na.rm = T))
google = left_join(google.cb, google.summary) %>%
mutate(dataset = "google")
Munge the swadesh data.
swadesh.ms = empirical.corrs %>%
select(language, r.empirical, mean.length, sd.length ) %>%
rename(corr = r.empirical) %>%
mutate(dataset = "swadesh")
all.cbs = rbind(google,swadesh.ms)
all.cbs %>%
group_by(dataset) %>%
summarise(corr = mean(corr),
mean.length = mean(mean.length, na.rm = T),
sd.length = mean(sd.length, na.rm = T)) %>%
kable()
| dataset | corr | mean.length | sd.length |
|---|---|---|---|
| google | 0.3370378 | 10.148804 | 4.84671 |
| swadesh | 0.1292288 | 4.376036 | 1.47627 |
The correlation is higher in the google dataset, as are the mean word length and its sd.
Correlation between the complexity bias correlations in the two samples (for overlapping languages):
all.cbs.only = inner_join(google,swadesh.ms, by="language") %>%
select(language, corr.x, corr.y) %>%
rename(google.corr = corr.x,
swadesh.corr = corr.y)
ggplot(all.cbs.only, aes(x = google.corr, y = swadesh.corr)) +
geom_label(aes(label = language)) +
geom_smooth(method = "lm") +
theme_bw()
tidy(cor.test(all.cbs.only$google.corr, all.cbs.only$swadesh.corr)) %>%
select(estimate, statistic, p.value) %>%
kable()
| estimate | statistic | p.value |
|---|---|---|
| 0.2815381 | 1.901491 | 0.0641125 |
The two samples are positively correlated, though the correlation is only marginally significant (p = .064).
Next, let’s look at properties of the distribution of word lengths in each language that might affect the magnitude of the complexity bias correlation:
all.cbs %>%
filter(dataset == "google") %>%
select(corr, mean.length, sd.length) %>%
ggcorr(label = T)
all.cbs %>%
filter(dataset == "swadesh") %>%
select(corr, mean.length, sd.length) %>%
ggcorr(label = T)
In both datasets, mean word length and the sd of word length are correlated with each other. Next, we ask whether these aspects of the length distribution are related to the magnitude of the complexity bias.
tidy(lm(corr ~ sd.length + mean.length, data = swadesh.ms)) %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0506746 | 0.0282402 | 1.7944124 | 0.0730394 |
| sd.length | -0.0059012 | 0.0145299 | -0.4061425 | 0.6847219 |
| mean.length | 0.0199418 | 0.0081493 | 2.4470422 | 0.0145690 |
tidy(lm(corr ~ sd.length + mean.length, data = google)) %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.3570051 | 0.0303313 | 11.770199 | 0.0000000 |
| sd.length | -0.0262404 | 0.0071590 | -3.665356 | 0.0004655 |
| mean.length | 0.0105364 | 0.0052502 | 2.006861 | 0.0484682 |
In the google dataset, languages that have longer words on average tend to have a bigger complexity bias. Interestingly, this goes in the opposite direction for sd (smaller sd, bigger complexity bias). This is hard to interpret, though, because sd.length and mean.length are highly correlated (a quick check of this collinearity is sketched below). The relationship is less robust in the swadesh sample, but this seems likely due to floor effects.
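A quick check of that collinearity, using the all.cbs data frame defined above (a sketch; the column name length.collinearity is just a label):
all.cbs %>%
  group_by(dataset) %>%
  summarise(length.collinearity = cor(mean.length, sd.length, use = "complete.obs")) %>%
  kable()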
Do languages with bigger populations have a smaller complexity bias (as we find in the google sample)? No.
empirical.corrs = empirical.corrs %>%
filter(is.finite(log.pop))
ggplot(empirical.corrs, aes(x = log.pop, y = r.empirical)) +
#geom_histogram() +
geom_point() +
geom_smooth(method = "lm") +
theme_bw()
tidy(cor.test(empirical.corrs$log.pop, empirical.corrs$r.empirical)) %>%
select(estimate, statistic, p.value) %>%
kable()
| estimate | statistic | p.value |
|---|---|---|
| 0.0502521 | 1.577537 | 0.1149937 |
tidy(lm(r.empirical ~ log.pop + mean.length, data = empirical.corrs)) %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.0002835 | 0.0380624 | -0.007448 | 0.9940589 |
| log.pop | 0.0035157 | 0.0014212 | 2.473752 | 0.0135384 |
| mean.length | 0.0209184 | 0.0069457 | 3.011713 | 0.0026643 |
tidy(cor.test(empirical.corrs$log.pop, empirical.corrs$mean.length)) %>%
select(estimate, statistic, p.value) %>%
kable()
| estimate | statistic | p.value |
|---|---|---|
| -0.3237582 | -10.72858 | 0 |
ggplot(empirical.corrs, aes(x = log.pop, y = mean.length)) +
#geom_histogram() +
geom_point() +
geom_smooth(method = "lm") +
theme_bw()
In this sample, population and mean.length are reliably negatively correlated (r = -.32). Controlling for length, there’s a positive correlation between complexity bias and population.
empirical.corrs.fam = empirical.corrs %>%
group_by(fam) %>%
summarise(log.pop = mean(log.pop),
r.empirical = mean(r.empirical))
ggplot(empirical.corrs.fam, aes(x = log.pop, y = r.empirical)) +
#geom_histogram() +
geom_point() +
geom_smooth(method = "lm") +
theme_bw()
tidy(cor.test(empirical.corrs.fam$log.pop, empirical.corrs.fam$r.empirical)) %>%
select(estimate, statistic, p.value) %>%
kable()
| estimate | statistic | p.value |
|---|---|---|
| -0.0484087 | -0.4918715 | 0.6238568 |
At the family level, there is no relationship with population.
In summary:
* In a sample of about 1000 languages with 40 words each, there tends to be a complexity bias (a positive correlation between word length and complexity)
* In the set of languages that overlaps with the google sample, the complexity biases from the two datasets are correlated
* but the complexity bias is smaller overall in the swadesh sample
* and, unlike in the google sample, there is no correlation with population.