Across several sets of analyses, I’ve been looking at the relationship between proxies for learnability pressures and features of language. Here I try to aggregate these analyses.
There are three motivating theoretical ideas. The first is the idea of multiple timescales: We assume there are multiple timescales – in-the-moment, development, and language evolution – that are relevant to the phenomenon of language.
Second, we consider the possibility that language adapts to its contextual niche at all timescales. At the in-the-moment timescales, this might mean taking into account the character of the referential alternatives when formulating an utterance. At the developmental timescale, we assume the learner has acquired a lexicon that is tailored to the child’s cognitive capacity. And, at the evolution timescale, we consider the possibility that language systems might adapt over time to the learnability pressures of the population (“Linguistic Niche Hypothesis,” Lupyan & Dale, 2011). This might mean, for example, that languages with more adult second language speakers become easier to learn.
The third motivating theory comes from pragmatics: competing communicative pressures that give rise to equilibria. These competeing pressures have been formulated in a number of different ways across the literature, e.g., Horn’s Speaker and Hearer pressure. The consequences of these pressures have been considered primarily at the timescale of in-the-moment communication. The present work extends the relevance of these pressures to longer timescales. In particuar, these analyses explore the following hypothesis: The same equilibria that emerge as a consequence of communicative pressures at the pragmatic timescale, also emerge at the developmental and language evolution timescales. These analyses try to explore predictions of this hypothesis.
As a first step in testing this proposals, we try to aggregate demographic predictors and linguistic outocome variables.
Next, we look for parallels between ontonogeny and cross-linguistic variability. (i.e. Are the features of easy-to-learn languages shared with the vocabularies of young children?)
A secondardy prediction is that languages that are easier to learn for L2 speakers should also be easier to learn for children. This could be tested by looking at mean AOA in a language and learnability features.
Merge together all population learnability features:
Read in Lupyan and Dale data
ld = read.table("../data/lupyan_2010.txt", fill = T,
header = T, sep = "\t", na.strings = "*") %>%
rename(log.pop2 = logpop2,
mean.temp = aveTemp,
n.neighbors = numNeighbors,
sum.precip = sumPrecip,
sd.temp = sdTemp,
sd.precip = sdPrecip)
# fix language labels
ld$ethnologue_lang_name = tolower(ld$ethnologue_lang_name)
ld$ethnologue_lang_name[ld$ethnologue_lang_name == "standard german"] <- "german"
ld$ethnologue_lang_name[ld$ethnologue_lang_name == "tosk albanian"] <- "albanian"
ld$ethnologue_lang_name[ld$ethnologue_lang_name == "catalan-valencian-balear"] <- "catalan"
ld$ethnologue_lang_name[ld$ethnologue_lang_name == "haitian creole french"] <- "haitian.creole"
ld$ethnologue_lang_name[ld$ethnologue_lang_name == "irish gaelic"] <- "irish"
ld$ethnologue_lang_name[ld$ethnologue_lang_name == "central khmer"] <- "khmer"
# Need to collapse across different dialects of same language (e.g. "eastern mongolian" and "peripherial mongolian") [arabic, azerbaijani, chinese, hmong, mongolian, yiddish]
# but first, remove sign languages (e.g."chinese sign language")
ld = ld[-which(grepl("sign language", ld$ethnologue_lang_name)),]
ld$eln2 = ifelse(grepl("arabic", ld$ethnologue_lang_name), "arabic",
ifelse(grepl("azerbaijani", ld$ethnologue_lang_name), "azerbaijani",
ifelse(grepl("chinese", ld$ethnologue_lang_name), "chinese",
ifelse(grepl("hmong", ld$ethnologue_lang_name), "hmong",
ifelse(grepl("mongolian", ld$ethnologue_lang_name), "mongolian",
ifelse(grepl("yiddish", ld$ethnologue_lang_name), "yiddish",
ld$ethnologue_lang_name))))))
ld$eln2 = as.factor(ld$eln2)
# 12 langs not in lupyan 2010, but in LF [belarusian, bosnian, cebuana, croatian, esperanto, filipino, kanada, latin, norwegian, persian, punjabi, serbian]
# get means across language
demo_ld = ld %>%
group_by(eln2) %>%
summarise_each(funs(mean(., na.rm = TRUE)),
c(8:9, 16:121)) %>%
select(1, 93:95, 100:101, 103:105, 108)
# add in data family
fams = ld %>%
group_by(eln2) %>%
filter(row_number() == 1) %>%
select(eln2, langFamily, langGenus)
demo_ld = left_join(fams, demo_ld, by = "eln2") %>%
rename(language = eln2) %>%
ungroup()
Read in Bentz data (Note here that we are only looking at languages from the UDHR corpus. The other corpora have smaller languages only.)
demo_bentz = read.csv("../data/bentz_2015.csv") %>%
gather(temp, LDT, starts_with("LDT")) %>%
unite(temp1, measure, temp, sep = ".") %>%
spread(temp1, LDT) %>%
filter(text == "UDHR") %>%
select(-iso_639_3, -fileName,
-contains("Stock"), -contains("speakers"),
-contains("Region"), -text,
-genus_wals, -family_wals) %>%
mutate(Language = gsub("[[:space:]]", "", tolower(Language))) %>%
rename(language = Language)
demo_bentz$language[demo_bentz$language == "swahili,tanzania"] <- "swahili"
demo_bentz$language[demo_bentz$language == "irishgaelic"] <- "irish"
Join all population variables
demo = left_join(demo_ld, demo_bentz, by="language") %>%
mutate(language = as.factor(language))
Merge together all linguistic features:
# complexity bias
cb = read.csv("../data/lewis_2015.csv") %>%
rename(complexity.bias = corr,
p.complexity.bias = p.corr,
mono.complexity.bias = mono.cor,
open.complexity.bias = open.cor) %>%
select(-X.1, -X, -lower.ci, -upper.ci, -checked, -mean.length) %>%
filter(language != "english") # english is an outlier in this dataset because norms colelcted in english
# dependency length
dl = read.csv("../data/futrell_2015.csv")
# UID
uid = read.csv("../data/pellegrino_2015.csv")
# phoneme inventory
labs = read.csv("../data/ethnologue_labs.csv",
fileEncoding="latin1") %>%
select(-CountryID, -LangStatus)
phoneme = read.csv("../data/moran_2012.csv") %>%
select(lang, pho, con, vow, logPop) %>%
left_join(labs, by = c("lang" = "LangID")) %>%
select(-lang) %>%
rename(language = Name, log.pop.moran = logPop) %>%
mutate(language = tolower(language))
# aoa (wordbank)
aoa_data = read.csv("../data/aoa_data.csv") %>%
filter( aoa > 5 & aoa < 31)
aoa_central = aoa_data %>%
mutate(language = tolower(language)) %>%
filter(measure == "understands",
lexical_category == "nouns") %>%
group_by(language) %>%
summarise(wb.mean.aoa = mean(aoa))
# mean length - is this IPA? - fix this!
asjp = read.csv("../data/asjp.csv") %>%
select(1,11:110) %>%
rename(language = names) %>%
mutate(language = tolower(language)) %>%
gather(word,translation,I:name) %>%
mutate(nchar = unlist(
lapply(
lapply(
strsplit(
gsub("[[:space:]]", "", translation) ,
","),
nchar), mean))) %>%
filter(translation != "")
# subset to only those words in the swadesh list (n = 40)
swadesh.words = asjp[asjp$language == "afrikaans", "word"]
asjp = asjp %>%
filter(is.element(word, swadesh.words))
# fix language names so map onto (minimally) cb dataset
asjp$language[asjp$language == "standard_arabic"] <- "arabic"
asjp$language[asjp$language == "standard_german"] <- "german"
asjp$language[asjp$language == "haitian_creole"] <- "haitian.creole"
asjp$language[asjp$language == "igbo_onitsha"] <- "igbo"
asjp$language[asjp$language == "irish_gaelic"] <- "irish"
asjp$language = ifelse(grepl("norwegian", asjp$language),
"norwegian", asjp$language)
asjp$language = ifelse(grepl("javanese", asjp$language),
"javanese", asjp$language)
asjp$language = ifelse(grepl("armenian", asjp$language),
"armenian", asjp$language)
asjp$language = ifelse(grepl("azerbaijani", asjp$language),
"azerbaijani", asjp$language)
asjp$language = ifelse(grepl("chinese", asjp$language),
"chinese", asjp$language)
asjp$language = ifelse(grepl("hmong", asjp$language),
"hmong", asjp$language)
asjp$language = ifelse(grepl("punjabi", asjp$language),
"punjabi", asjp$language)
asjp$language = ifelse(grepl("yiddish", asjp$language),
"yiddish", asjp$language)
asjp.d = asjp %>%
group_by(language) %>%
summarize(asjp.mean.length = mean(nchar, na.rm = T))
AOA (Luniewska)
aoa_lun.raw = read.csv("../data/luniewska_2015.csv", header = T,
fileEncoding = "latin1")
# aoas
aoa_lun.aoa = aoa_lun.raw %>%
gather(language_aoa, aoa, grep("aoa", names(aoa_lun.raw))) %>%
mutate(language_aoa = tolower(unlist(lapply(strsplit(
as.character(language_aoa),"_"), function(x) x[1])))) %>%
select(-3:-27) %>%
rename(language = language_aoa) %>%
group_by(language, word_class) %>%
summarize(lun.mean.aoa= mean(aoa, na.rm = T)) %>%
spread(word_class,lun.mean.aoa) %>%
rename(lun.aoa.noun = noun, lun.aoa.verb = verb)
# mean length (just for fun)
trans_col_is = which(is.element(names(aoa_lun.raw),names(aoa_lun.raw)
[!grepl("aoa", names(aoa_lun.raw))]))
trans_col_is = trans_col_is[-1:-2]
aoa_lun = aoa_lun.raw %>%
gather(language, translation, trans_col_is) %>%
mutate(language = tolower(language)) %>%
select(-3:-27) %>%
group_by(language) %>%
mutate(length = nchar(translation)) %>%
summarize(lun.mean.length = mean(length, na.rm = T)) %>%
left_join(aoa_lun.aoa)
Merge population with all language variables
d = full_join(demo, cb) %>%
full_join(dl) %>%
full_join(uid) %>%
full_join(aoa_central) %>%
full_join(phoneme) %>%
left_join(asjp.d) %>%
left_join(aoa_lun)
#l = ""
#for (i in 1:length(names(d))) {
# l = paste(l, '"', names(d)[i], '",', sep = "")
#}
demo_vars = c("area","perimeter","n.neighbors","mean.temp","sum.precip","sd.temp","sd.precip","growingSeason","log.pop2","RatioL2","PercL2", "log.pop.moran")
lang_vars = c("a.LDT","a.LDTinv","a.LDTscaled","H.LDT","H.LDTinv","H.LDTscaled","TTR.LDT","TTR.LDTinv","TTR.LDTscaled","complexity.bias","p.complexity.bias","mono.complexity.bias","open.complexity.bias","mean.length","lun.mean.length","m_rand","m_obs","mean_dependency_diff","fixed.random.baseline.slope","observed.slope","slope_dif","information.density","syllable.rate","information.rate","wb.mean.aoa","pho","con","vow", "aoa_lun", "lun.mean.length.noun",
"lun.mean.length.verb", "lun.aoa.noun", "lun.aoa.verb")
d %>%
gather(variable, num, area:length(d)) %>%
mutate(var_type = ifelse(is.element(variable, demo_vars),
"demo", "lang")) %>%
ggplot(aes(x=num, fill = var_type)) +
geom_histogram(position = "identity") +
facet_wrap( ~ variable, scales = "free") +
theme_bw()
Take the log of of the non-normal-looking variables and re-plot
d = d %>%
mutate(log.area = log(area),
log.perimeter = log(perimeter),
log.n.neighbors = log(n.neighbors),
log.sum.precip = log(sum.precip),
log.sd.temp = log(sd.temp),
log.sd.precip = log(sd.precip),
log.RatioL2= log(RatioL2),
log.n.phonemes = log(pho),
log.n.consonants = log(con),
log.n.vow = log(vow),
log.asjp.mean.length = log(asjp.mean.length))%>%
select(-area, -perimeter, -n.neighbors, -sd.temp,
-sd.precip, -RatioL2, -sum.precip, -pho, -con, -vow, -asjp.mean.length)
demo_vars = c(intersect(names(d), demo_vars), "log.area", "log.perimeter", "log.n.neighbors", "log.sum.precip", "log.sd.temp", "log.sd.precip", "log.RatioL2")
lang_vars = c(intersect(names(d), lang_vars), "lang_vars", "log.n.phonemes", "log.n.consonants", "log.n.vow", "log.asjp.mean.length")
d %>%
gather(variable, num, mean.temp:length(d)) %>%
mutate(var_type = ifelse(is.element(variable, demo_vars), "demo", "lang")) %>%
arrange(var_type) %>%
ggplot(aes(x=num, fill = var_type)) +
geom_histogram(position = "identity") +
facet_wrap( ~ variable, scales = "free") +
theme_bw()
Describe data available for each language
d %>%
gather(variable, num, language:length(d)) %>%
group_by(variable) %>%
summarize(counts = length(which(is.na(num) == F))) %>%
mutate(var_type = ifelse(is.element(variable, demo_vars), "demo", "lang")) %>%
mutate(counts_trunc = ifelse(counts > 150, 150, counts)) %>%
ggplot(aes(y=counts_trunc, x = variable, fill = var_type, order = variable)) +
geom_bar(stat = "identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "none") +
ylab("Number of languages (truncated at 150)") +
ggtitle("Number languages per measure")
Correlation matrix (stuff in the upper right quadrant matters)
is.na(d) <- sapply(d, is.infinite)
corrs = cor(d[,-c(1:3)], use = "pairwise.complete.obs")
corrplot(corrs)
Scatter plots
d.scatter = d %>%
select(-a.LDT, -a.LDTinv, -a.LDTscaled, -H.LDTinv,
-H.LDTscaled, -TTR.LDT, -TTR.LDTinv, -TTR.LDTscaled,
-complexity.bias, -mono.complexity.bias,
-open.complexity.bias, -m_rand, -mean_dependency_diff,
-fixed.random.baseline.slope, -slope_dif,
-PercL2, -information.rate, -log.pop.moran,
-log.n.vow, -log.n.phonemes, -log.perimeter,
-observed.slope, -lun.mean.length) %>%
gather(lang_measure, lang_value,
which(is.element(names(.), lang_vars))) %>%
group_by(lang_measure) %>%
gather(pop_measure, pop_value,
which(is.element(names(.), demo_vars)))
text.pos = d.scatter %>%
group_by(lang_measure, pop_measure) %>%
summarize(min_lang = min(lang_value, na.rm = T),
min_pop = min(pop_value, na.rm = T),
dif_lang = max(lang_value, na.rm = T) - min_lang,
dif_pop = max(pop_value, na.rm = T) - min_pop)
is.na(d.scatter) <- sapply(d.scatter, is.infinite)
# get correlations
d.scatter.corrs = d.scatter %>%
group_by(lang_measure, pop_measure) %>%
do(tidy(cor.test(~ .$pop_value + .$lang_value))) %>%
mutate(sig = ifelse(p.value < .05, "*", "")) %>%
mutate(sig.col = ifelse(p.value < .05 & estimate > 0, "pos",
ifelse(p.value < .05 & estimate < 0, "neg",
"none"))) %>%
mutate(lab = paste("r = ", round(estimate,2),sig, sep = "")) %>%
mutate(pop_value = 1, lang_value = 1) %>% # this is a hack
full_join(text.pos) %>%
ungroup
ggplot(d.scatter, aes(x = pop_value, y = lang_value)) +
geom_point(size = .5) +
geom_smooth(method = "lm", color = "green") +
facet_grid(lang_measure~pop_measure, scale = "free") +
geom_text(aes(min_pop + .4*dif_pop, min_lang + .05*dif_lang, label=lab),
data = d.scatter.corrs, color = "red") +
geom_rect(data = d.scatter.corrs, aes(fill = sig.col),
xmin = -Inf, xmax = Inf,
ymin = -Inf, ymax = Inf, alpha = 0.2) +
scale_fill_manual(name = "grp",values = c( "mediumblue", "grey99","red1")) +
theme_bw() +
theme(legend.position = "none")
ggplot(d.scatter, aes(x = pop_value, y = lang_value)) +
geom_point(size = .5) +
geom_smooth(method = "lm", color = "green") +
facet_grid(lang_measure~pop_measure, scale = "free") +
theme_bw() +
theme(legend.position = "none")
WALS features of complexity
# aggregate across countries to get quantitative measures by language
most.frequent.level = function (x){
mf = names(which.max(table(x)))
return(mf)
}
# aggregate L&D data by language
qual = ld %>%
colwise(as.factor)(.) %>%
select(-1:-17, -104:-122) %>%
#select(-1:-3, -5:-9, -13, -15:-17, -104:-122) %>%
group_by(eln2) %>%
summarise_each(funs(most.frequent.level))
qual <- colwise(as.factor)(qual)
# include
labNames= read.csv("../data/lupyan_WALS_feature_names.csv")
toInclude = labNames[labNames$include == 1, ]
qualVarNames = intersect(names(qual), toInclude$Name)
# Some vars are missing from L&D data set ["HASIND" "GILDIF" "AUWPRH" "DRYSBV" "DRYOBV" "DRYXOV" "DRYREL" "POLANT" "POLAPP" "SONPER" "SONNON" "NICMTP" "NICNMP" "DAHTEA"]
# remap factor levels to human readable values (not all present)
labMappings = read.csv("../data/lupyan_WALS_lab_mappings.csv")
for (i in 1:length(qualVarNames)){
thisVarLabs = labMappings[labMappings$featureName == qualVarNames[i],]
old = thisVarLabs$oldLab
new = thisVarLabs$newLab
col_i = grep(qualVarNames[i], colnames(qual))
qual[,col_i] = mapvalues(qual[,col_i],
from = as.character(old),
to = as.character(new))
}
featureCats = levels(droplevels(toInclude$Feature.Class))
get measures of developmental change in vocabulary
Predict AOA by social variables, by social variable
aoa_central = aoa_data %>%
mutate(language = tolower(language)) %>%
group_by(language, measure, lexical_category) %>%
summarise(mean_aoa = mean(aoa))
d3 = inner_join(d, aoa_central, by = "language") %>%
select(language, log.RatioL2, log.pop2,
log.numNeighbors, log.area,
mean.temp, mean_aoa, measure, lexical_category) %>%
mutate(language = as.factor(language)) %>%
gather(pop_measure, pop_value, 2:6)
ggplot(d3, aes(y = mean_aoa,
x = pop_value,
label = language,
group = measure,
color = measure)) +
geom_text(size = 3, aes(group = measure)) +
facet_grid(lexical_category ~ pop_measure, scales = "free") +
geom_smooth(method = "lm") +
theme_bw()