Language learnability variability

Across several sets of analyses, I’ve been looking at the relationship between proxies for learnability pressures and features of language. Here I try to aggregate these analyses.

There are three motivating theoretical ideas. The first is the idea of multiple timescales: We assume there are multiple timescales – in-the-moment, development, and language evolution – that are relevant to the phenomenon of language.

Second, we consider the possibility that language adapts to its contextual niche at all timescales. At the in-the-moment timescales, this might mean taking into account the character of the referential alternatives when formulating an utterance. At the developmental timescale, we assume the learner has acquired a lexicon that is tailored to the child’s cognitive capacity. And, at the evolution timescale, we consider the possibility that language systems might adapt over time to the learnability pressures of the population (“Linguistic Niche Hypothesis,” Lupyan & Dale, 2011). This might mean, for example, that languages with more adult second language speakers become easier to learn.

The third motivating theory comes from pragmatics: competing communicative pressures that give rise to equilibria. These competeing pressures have been formulated in a number of different ways across the literature, e.g., Horn’s Speaker and Hearer pressure. The consequences of these pressures have been considered primarily at the timescale of in-the-moment communication. The present work extends the relevance of these pressures to longer timescales. In particuar, these analyses explore the following hypothesis: The same equilibria that emerge as a consequence of communicative pressures at the pragmatic timescale, also emerge at the developmental and language evolution timescales. These analyses try to explore predictions of this hypothesis.

As a first step in testing this proposals, we try to aggregate demographic predictors and linguistic outocome variables.

Next, we look for parallels between ontonogeny and cross-linguistic variability. (i.e. Are the features of easy-to-learn languages shared with the vocabularies of young children?)

A secondardy prediction is that languages that are easier to learn for L2 speakers should also be easier to learn for children. This could be tested by looking at mean AOA in a language and learnability features.

Learnability and language variables

Read in all data

Population features

Merge together all population learnability features:

population size (Lupyan & Dale)
area (Lupyan & Dale)
number of neighbors (Lupyan & Dale)
temperature (Lupyan & Dale)
number of L2 speakers (Bentz)

Read in Lupyan and Dale data

ld = read.table("../data/lupyan_2010.txt", fill = T, 
               header = T, sep = "\t", na.strings = "*") %>%
    rename(log.pop2 = logpop2, 
           mean.temp = aveTemp,
           n.neighbors = numNeighbors,
           sum.precip = sumPrecip,
           sd.temp = sdTemp,
           sd.precip = sdPrecip)

# fix language labels
ld$ethnologue_lang_name = tolower(ld$ethnologue_lang_name)
ld$ethnologue_lang_name[ld$ethnologue_lang_name == "standard german"] <- "german"
ld$ethnologue_lang_name[ld$ethnologue_lang_name == "tosk albanian"] <- "albanian"
ld$ethnologue_lang_name[ld$ethnologue_lang_name == "catalan-valencian-balear"] <- "catalan"
ld$ethnologue_lang_name[ld$ethnologue_lang_name == "haitian creole french"] <- "haitian.creole"
ld$ethnologue_lang_name[ld$ethnologue_lang_name == "irish gaelic"] <- "irish"
ld$ethnologue_lang_name[ld$ethnologue_lang_name == "central khmer"] <- "khmer"

# Need to collapse across different dialects of same language (e.g. "eastern mongolian" and "peripherial mongolian") [arabic, azerbaijani, chinese, hmong, mongolian, yiddish]

# but first, remove sign languages (e.g."chinese sign language")
ld = ld[-which(grepl("sign language", ld$ethnologue_lang_name)),]

ld$eln2 = ifelse(grepl("arabic", ld$ethnologue_lang_name), "arabic",
                   ifelse(grepl("azerbaijani", ld$ethnologue_lang_name), "azerbaijani",
                          ifelse(grepl("chinese", ld$ethnologue_lang_name), "chinese",
                                 ifelse(grepl("hmong", ld$ethnologue_lang_name), "hmong",
                                        ifelse(grepl("mongolian", ld$ethnologue_lang_name), "mongolian",
                                               ifelse(grepl("yiddish", ld$ethnologue_lang_name), "yiddish",
                                                      ld$ethnologue_lang_name))))))
ld$eln2 = as.factor(ld$eln2)

# 12 langs not in lupyan 2010, but in LF [belarusian, bosnian, cebuana, croatian, esperanto, filipino, kanada, latin, norwegian, persian, punjabi, serbian]

# get means across language
demo_ld = ld %>%
     group_by(eln2) %>%
     summarise_each(funs(mean(., na.rm = TRUE)), 
                    c(8:9, 16:121))  %>%
     select(1, 93:95, 100:101, 103:105, 108)

# add in data family
fams = ld %>%
      group_by(eln2) %>%
      filter(row_number() == 1) %>%
      select(eln2, langFamily, langGenus)  

demo_ld = left_join(fams, demo_ld, by = "eln2") %>%
          rename(language = eln2) %>%
          ungroup()

Read in Bentz data (Note here that we are only looking at languages from the UDHR corpus. The other corpora have smaller languages only.)

demo_bentz = read.csv("../data/bentz_2015.csv") %>%
    gather(temp, LDT, starts_with("LDT")) %>% 
    unite(temp1, measure, temp, sep = ".") %>% 
    spread(temp1, LDT) %>%
    filter(text == "UDHR") %>%
    select(-iso_639_3, -fileName, 
           -contains("Stock"), -contains("speakers"),
           -contains("Region"), -text,
           -genus_wals, -family_wals)  %>%
    mutate(Language = gsub("[[:space:]]", "", tolower(Language))) %>%
    rename(language = Language)

demo_bentz$language[demo_bentz$language == "swahili,tanzania"] <- "swahili"
demo_bentz$language[demo_bentz$language == "irishgaelic"] <- "irish"

Join all population variables

demo = left_join(demo_ld, demo_bentz, by="language") %>%
       mutate(language = as.factor(language))

Linguistic features

Merge together all linguistic features:

complexity bias (Lewis & Frank)
dependency length (Futrell)
type-token ratio (Bentz)
uid-ness (Pelligrino)
aoa (wor)
morphological complexity (Lupyan & Dale) – missing
mean length – missing

# complexity bias
cb = read.csv("../data/lewis_2015.csv") %>%
      rename(complexity.bias = corr,
              p.complexity.bias = p.corr,
              mono.complexity.bias = mono.cor,
              open.complexity.bias = open.cor) %>%
    select(-X.1, -X, -lower.ci, -upper.ci, -checked, -mean.length) %>%
    filter(language != "english") # english is an outlier in this dataset because norms colelcted in english

# dependency length
dl = read.csv("../data/futrell_2015.csv")

# UID
uid = read.csv("../data/pellegrino_2015.csv")

# phoneme inventory
labs = read.csv("../data/ethnologue_labs.csv", 
                fileEncoding="latin1") %>%
         select(-CountryID, -LangStatus) 
phoneme = read.csv("../data/moran_2012.csv") %>%
          select(lang, pho, con, vow, logPop) %>%
          left_join(labs, by = c("lang" = "LangID")) %>%
          select(-lang) %>%
          rename(language = Name, log.pop.moran = logPop) %>%
          mutate(language = tolower(language))

# aoa (wordbank)
aoa_data = read.csv("../data/aoa_data.csv") %>%
          filter( aoa > 5 & aoa < 31) 

aoa_central = aoa_data %>% 
  mutate(language = tolower(language)) %>%
  filter(measure == "understands", 
         lexical_category == "nouns") %>%
  group_by(language) %>%
  summarise(wb.mean.aoa = mean(aoa))

# mean length - is this IPA? - fix this!
asjp = read.csv("../data/asjp.csv") %>% 
  select(1,11:110) %>%
  rename(language = names) %>%
  mutate(language = tolower(language)) %>%
  gather(word,translation,I:name) %>%
  mutate(nchar = unlist(
    lapply(
      lapply(
        strsplit(
          gsub("[[:space:]]", "", translation) ,
          ","), 
           nchar), mean))) %>%
  filter(translation != "")

# subset to only those words in the swadesh list (n = 40)
swadesh.words  = asjp[asjp$language == "afrikaans", "word"] 
asjp = asjp %>%
      filter(is.element(word, swadesh.words))

# fix language names so map onto (minimally) cb dataset
asjp$language[asjp$language == "standard_arabic"] <- "arabic"
asjp$language[asjp$language == "standard_german"] <- "german"
asjp$language[asjp$language == "haitian_creole"] <- "haitian.creole"
asjp$language[asjp$language == "igbo_onitsha"] <- "igbo"
asjp$language[asjp$language == "irish_gaelic"] <- "irish"
asjp$language = ifelse(grepl("norwegian", asjp$language),
                       "norwegian", asjp$language)
asjp$language = ifelse(grepl("javanese", asjp$language),
                       "javanese", asjp$language)
asjp$language = ifelse(grepl("armenian", asjp$language),
                       "armenian", asjp$language)
asjp$language = ifelse(grepl("azerbaijani", asjp$language),
                       "azerbaijani", asjp$language)
asjp$language = ifelse(grepl("chinese", asjp$language),
                       "chinese", asjp$language)
asjp$language = ifelse(grepl("hmong", asjp$language),
                       "hmong", asjp$language)
asjp$language = ifelse(grepl("punjabi", asjp$language),
                       "punjabi", asjp$language)
asjp$language = ifelse(grepl("yiddish", asjp$language),
                       "yiddish", asjp$language)

asjp.d = asjp %>%
        group_by(language) %>%
        summarize(asjp.mean.length = mean(nchar, na.rm = T))

AOA (Luniewska)

aoa_lun.raw = read.csv("../data/luniewska_2015.csv", header = T, 
                   fileEncoding = "latin1")
  
# aoas
aoa_lun.aoa = aoa_lun.raw %>%
  gather(language_aoa, aoa, grep("aoa", names(aoa_lun.raw))) %>%
  mutate(language_aoa = tolower(unlist(lapply(strsplit(
    as.character(language_aoa),"_"), function(x) x[1])))) %>%
  select(-3:-27) %>%
  rename(language = language_aoa) %>%
  group_by(language, word_class) %>%
  summarize(lun.mean.aoa= mean(aoa, na.rm = T)) %>%
  spread(word_class,lun.mean.aoa) %>%
  rename(lun.aoa.noun = noun, lun.aoa.verb = verb)

# mean length (just for fun)
trans_col_is = which(is.element(names(aoa_lun.raw),names(aoa_lun.raw)
                                [!grepl("aoa", names(aoa_lun.raw))]))
trans_col_is = trans_col_is[-1:-2]
aoa_lun = aoa_lun.raw %>%
  gather(language, translation, trans_col_is) %>%
  mutate(language = tolower(language)) %>%
  select(-3:-27) %>%
  group_by(language) %>%
  mutate(length = nchar(translation)) %>%
  summarize(lun.mean.length = mean(length, na.rm = T)) %>%
  left_join(aoa_lun.aoa)

Merge population with all language variables

d = full_join(demo, cb) %>%
    full_join(dl) %>%
    full_join(uid) %>%
    full_join(aoa_central) %>%
    full_join(phoneme) %>%
    left_join(asjp.d) %>%
    left_join(aoa_lun)

Distributions of variables

#l = ""
#for (i in 1:length(names(d))) {
#  l = paste(l, '"', names(d)[i], '",', sep = "")
#}
demo_vars = c("area","perimeter","n.neighbors","mean.temp","sum.precip","sd.temp","sd.precip","growingSeason","log.pop2","RatioL2","PercL2", "log.pop.moran")

lang_vars = c("a.LDT","a.LDTinv","a.LDTscaled","H.LDT","H.LDTinv","H.LDTscaled","TTR.LDT","TTR.LDTinv","TTR.LDTscaled","complexity.bias","p.complexity.bias","mono.complexity.bias","open.complexity.bias","mean.length","lun.mean.length","m_rand","m_obs","mean_dependency_diff","fixed.random.baseline.slope","observed.slope","slope_dif","information.density","syllable.rate","information.rate","wb.mean.aoa","pho","con","vow", "aoa_lun", "lun.mean.length.noun",
  "lun.mean.length.verb", "lun.aoa.noun", "lun.aoa.verb")

d %>%
  gather(variable, num, area:length(d))  %>%
  mutate(var_type = ifelse(is.element(variable, demo_vars),
                           "demo", "lang")) %>%
  ggplot(aes(x=num, fill = var_type)) + 
    geom_histogram(position = "identity") +
    facet_wrap( ~ variable, scales = "free") + 
    theme_bw()

Take the log of of the non-normal-looking variables and re-plot

d = d %>%
    mutate(log.area = log(area),
              log.perimeter = log(perimeter),
              log.n.neighbors = log(n.neighbors),
              log.sum.precip = log(sum.precip),
              log.sd.temp = log(sd.temp),
              log.sd.precip = log(sd.precip),
              log.RatioL2= log(RatioL2),
              log.n.phonemes = log(pho),
              log.n.consonants = log(con),
              log.n.vow = log(vow),
              log.asjp.mean.length = log(asjp.mean.length))%>%
    select(-area, -perimeter, -n.neighbors, -sd.temp,
           -sd.precip, -RatioL2, -sum.precip, -pho, -con, -vow, -asjp.mean.length)

demo_vars = c(intersect(names(d), demo_vars), "log.area", "log.perimeter", "log.n.neighbors", "log.sum.precip", "log.sd.temp", "log.sd.precip", "log.RatioL2")
lang_vars = c(intersect(names(d), lang_vars), "lang_vars", "log.n.phonemes", "log.n.consonants", "log.n.vow", "log.asjp.mean.length")

d %>%
  gather(variable, num, mean.temp:length(d))  %>%
  mutate(var_type = ifelse(is.element(variable, demo_vars), "demo", "lang")) %>%
  arrange(var_type) %>%
  ggplot(aes(x=num, fill = var_type)) + 
    geom_histogram(position = "identity") +
    facet_wrap( ~ variable, scales = "free") + 
    theme_bw()

Describe data available for each language

    d %>%
    gather(variable, num, language:length(d))  %>%
    group_by(variable) %>%
    summarize(counts = length(which(is.na(num) == F)))  %>%
    mutate(var_type = ifelse(is.element(variable, demo_vars), "demo", "lang")) %>%
    mutate(counts_trunc = ifelse(counts > 150, 150, counts))  %>%
    ggplot(aes(y=counts_trunc, x = variable, fill = var_type, order = variable)) + 
      geom_bar(stat = "identity") +
      theme_bw() +
      theme(axis.text.x = element_text(angle = 90, hjust = 1),
            legend.position = "none") +
      ylab("Number of languages (truncated at 150)") +
      ggtitle("Number languages per measure")

Relationships between variables

Correlation matrix (stuff in the upper right quadrant matters)

is.na(d) <- sapply(d, is.infinite)
corrs = cor(d[,-c(1:3)], use = "pairwise.complete.obs")
corrplot(corrs)

Scatter plots

d.scatter = d %>% 
  select(-a.LDT, -a.LDTinv, -a.LDTscaled, -H.LDTinv,
         -H.LDTscaled, -TTR.LDT, -TTR.LDTinv, -TTR.LDTscaled,
         -complexity.bias, -mono.complexity.bias,
         -open.complexity.bias, -m_rand, -mean_dependency_diff,
         -fixed.random.baseline.slope, -slope_dif,
         -PercL2, -information.rate, -log.pop.moran,
         -log.n.vow, -log.n.phonemes, -log.perimeter,
         -observed.slope, -lun.mean.length) %>%
  gather(lang_measure, lang_value, 
         which(is.element(names(.), lang_vars))) %>%
  group_by(lang_measure) %>%
  gather(pop_measure, pop_value,
         which(is.element(names(.), demo_vars)))

text.pos = d.scatter %>%
          group_by(lang_measure, pop_measure) %>%
          summarize(min_lang = min(lang_value, na.rm = T),
                    min_pop = min(pop_value, na.rm = T),
                    dif_lang =  max(lang_value, na.rm = T) - min_lang,
                    dif_pop =  max(pop_value, na.rm = T) - min_pop) 

is.na(d.scatter) <- sapply(d.scatter, is.infinite)

# get correlations
d.scatter.corrs = d.scatter %>%
  group_by(lang_measure, pop_measure) %>%
  do(tidy(cor.test(~ .$pop_value + .$lang_value))) %>%
  mutate(sig = ifelse(p.value < .05, "*", "")) %>%
  mutate(sig.col = ifelse(p.value < .05 & estimate > 0, "pos",
                          ifelse(p.value < .05 & estimate < 0, "neg",
                                 "none"))) %>%
  mutate(lab = paste("r = ", round(estimate,2),sig, sep = "")) %>%
  mutate(pop_value = 1, lang_value = 1) %>% # this is a hack
  full_join(text.pos) %>%
  ungroup

ggplot(d.scatter, aes(x = pop_value, y = lang_value)) +
      geom_point(size = .5) +
      geom_smooth(method = "lm", color = "green") +
      facet_grid(lang_measure~pop_measure, scale = "free") +
      geom_text(aes(min_pop + .4*dif_pop, min_lang + .05*dif_lang, label=lab),
              data = d.scatter.corrs, color = "red") +
      geom_rect(data = d.scatter.corrs, aes(fill = sig.col), 
                xmin = -Inf, xmax = Inf,
                ymin = -Inf, ymax = Inf, alpha = 0.2) +
    scale_fill_manual(name = "grp",values = c( "mediumblue", "grey99","red1")) +
    theme_bw() +
    theme(legend.position = "none")

ggplot(d.scatter, aes(x = pop_value, y = lang_value)) +
      geom_point(size = .5) +
      geom_smooth(method = "lm", color = "green") +
      facet_grid(lang_measure~pop_measure, scale = "free") +
    theme_bw() +
    theme(legend.position = "none")

WALS features of complexity

# aggregate across countries to get quantitative measures by language
most.frequent.level = function (x){
    mf = names(which.max(table(x)))
    return(mf)
}

# aggregate L&D data by language
qual = ld %>%
  colwise(as.factor)(.)  %>%
  select(-1:-17, -104:-122) %>%
  #select(-1:-3, -5:-9, -13, -15:-17, -104:-122) %>%
  group_by(eln2) %>%
  summarise_each(funs(most.frequent.level))

qual <- colwise(as.factor)(qual)

# include 
labNames= read.csv("../data/lupyan_WALS_feature_names.csv")
toInclude = labNames[labNames$include == 1, ]
qualVarNames = intersect(names(qual), toInclude$Name)

# Some vars are missing from L&D data set ["HASIND" "GILDIF" "AUWPRH" "DRYSBV" "DRYOBV" "DRYXOV" "DRYREL" "POLANT" "POLAPP" "SONPER" "SONNON" "NICMTP" "NICNMP" "DAHTEA"]

# remap factor levels to human readable values (not all present)
labMappings = read.csv("../data/lupyan_WALS_lab_mappings.csv")
for (i in 1:length(qualVarNames)){
  thisVarLabs = labMappings[labMappings$featureName == qualVarNames[i],]
  old = thisVarLabs$oldLab
  new = thisVarLabs$newLab
  col_i = grep(qualVarNames[i], colnames(qual))
  qual[,col_i] = mapvalues(qual[,col_i], 
                           from = as.character(old),
                           to = as.character(new))
}

featureCats = levels(droplevels(toInclude$Feature.Class))

Parallels between language ontogeny and evolution

get measures of developmental change in vocabulary

Learnability pressures predicts AOA across languages

Predict AOA by social variables, by social variable

aoa_central = aoa_data %>% 
  mutate(language = tolower(language)) %>%
  group_by(language, measure, lexical_category) %>%
  summarise(mean_aoa = mean(aoa))
  
d3 = inner_join(d, aoa_central, by = "language") %>%
  select(language, log.RatioL2, log.pop2,
         log.numNeighbors, log.area, 
         mean.temp, mean_aoa, measure, lexical_category) %>%
  mutate(language = as.factor(language)) %>%
  gather(pop_measure, pop_value, 2:6) 

ggplot(d3, aes(y = mean_aoa, 
              x = pop_value,
              label = language,
              group = measure,
              color = measure)) +
  geom_text(size = 3, aes(group = measure)) +
  facet_grid(lexical_category ~ pop_measure, scales = "free") +
  geom_smooth(method = "lm") +
  theme_bw()