“First, if this paper were written before the Piantadosi 2011 paper and the Mahowald 2013 papers, I could understand why the authors would feel that by controlling for frequency in experiment 9, they could get away from the Zipfian assumption that the main predictor of word length is frequency. By now the common assumption (e.g. Seyfarth 2014; Cognition) is that predictability, rather than frequency, should be the main determiner of length. For the first two series of experiments the authors may have felt that they could evade that issue by constructing novel objects, or objects for which no prior predictability of frequency accounts would apply, but that’s a very narrow view of predictability-based accounts. For instance, in the artificial object experiment series, if number of parts predicts complexity, it is not far-fetched to assume that subjects do have some predictability assessment for the novel objects, based on a naïve P(object) =
To address this, here we use the 2-gram surprisal measure calculated from the British National Corpus (BNC) and reported in Piantadosi et al. (2011). I’m not using the Google Books measure because I keep running into an error when I try to process the bigrams.
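For reference, here is a minimal sketch of what a 2-gram surprisal measure computes, assuming it is the average negative log probability of a word given the one word that precedes it. The toy corpus and all variable names are hypothetical; the values used in this analysis come from the precomputed BNC file, not from code like this.

```r
# Toy corpus (hypothetical); the real measure is estimated from BNC n-gram counts.
tokens <- c("the", "dog", "saw", "the", "cat", "and", "the", "dog", "ran")

# Each bigram pairs a word with the word that precedes it (its context).
context <- head(tokens, -1)
word    <- tail(tokens, -1)

# Conditional probability P(word | context) from bigram and context counts.
bigram.count  <- table(paste(context, word))
context.count <- table(context)
p.cond <- as.numeric(bigram.count[paste(context, word)]) /
  as.numeric(context.count[context])

# Surprisal of each token in bits, then averaged over all occurrences of a word.
token.surprisal <- -log2(p.cond)
mean.surprisal  <- tapply(token.surprisal, word, mean)
print(round(mean.surprisal, 2))
```

Words that are fully predictable from their context (here "the") get a mean surprisal of zero, while words that appear in contexts with several possible continuations get higher values.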
In a model predicting word length (in phonemes) from complexity, log frequency, and surprisal, complexity is a reliable predictor of word length; in the regression output reported below, neither surprisal nor log BNC frequency is.
I also added to the cross-linguistic figure the correlation between complexity and length, partialling out surprisal. The reviewer suggests using the residual effect in Experiment 10 rather than partialling out surprisal, but this seems unnecessarily complicated. The partial correlation is more consistent with the other analyses and more straightforward. What do you think?
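One point that may help decide: for Pearson correlations, the partial correlation is identical to the correlation between residuals once both variables have been regressed on the control. The sketch below illustrates this on simulated data (hypothetical values and effect sizes, not our norms). If the reviewer instead means residualizing length only, that would give a semi-partial correlation, which is a different quantity.

```r
set.seed(1)
n <- 500

# Simulated stand-ins for the three variables (hypothetical data).
surprisal  <- rnorm(n)
complexity <- 0.3 * surprisal + rnorm(n)
len        <- 0.5 * complexity + 0.3 * surprisal + rnorm(n)

# Partial correlation of complexity and length, computed from pairwise correlations.
r.xy <- cor(complexity, len)
r.xz <- cor(complexity, surprisal)
r.yz <- cor(len, surprisal)
partial.r <- (r.xy - r.xz * r.yz) / sqrt((1 - r.xz^2) * (1 - r.yz^2))

# Same quantity via residuals: regress both variables on surprisal,
# then correlate what is left over.
resid.r <- cor(resid(lm(complexity ~ surprisal)),
               resid(lm(len ~ surprisal)))

print(all.equal(partial.r, resid.r))
```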
Read in BNC data and complexity norms.
bnc.2gram = read.table('../data/corpus/English-KNN-H-2.txt', header = TRUE) %>%
  mutate(log.bnc.frequency = log(context.count)) %>%
  top_n(25000, abs(log.bnc.frequency)) # following Piantadosi, restrict to most frequent words
lf.data = read.csv("../data/corpus/english_complexity_norms.csv") %>%
group_by(word) %>%
select(-X, -workerid, -trial) %>%
summarise_each(funs(mean)) %>%
left_join(bnc.2gram, by="word")
Surprisals and frequencies for our 499 words are approximately normally distributed:
ggplot(lf.data, aes(x=surprisal)) +
geom_histogram(fill = "black", alpha = .6 , binwidth = 1, origin = -0.5) +
xlab("Surprisal") +
ggtitle('BNC Surprisal') +
themeML
ggplot(lf.data, aes(x=log.bnc.frequency)) +
geom_histogram(fill = "black", alpha = .6 , binwidth = 1, origin = -0.5) +
xlab("BNC log frequency") +
ggtitle('BNC log frequency') +
themeML
Surprisal and complexity are correlated with each other (r = .29), and both are correlated with length (surprisal: r = .42; complexity: r = .67).
Run the Piantadosi analysis on the complexity words and on all words (unigram surprisals).
Run the Piantadosi analysis on the complexity words and on all words (bigram surprisals).
Run the Piantadosi analysis on the complexity words and on all words (trigram surprisals).
paired.r(abs(cor(bnc.2gram$len, bnc.2gram$log.bnc.frequency, method = "spearman")),
cor(bnc.2gram$len, bnc.2gram$surprisal, method = "spearman"),n=length(bnc.2gram$len))
## Call: paired.r(xy = abs(cor(bnc.2gram$len, bnc.2gram$log.bnc.frequency,
## method = "spearman")), xz = cor(bnc.2gram$len, bnc.2gram$surprisal,
## method = "spearman"), n = length(bnc.2gram$len))
## [1] "test of difference between two independent correlations"
## z = 3.6 With probability = 0
paired.r(abs(cor(bnc.3gram$len, bnc.3gram$log.bnc.frequency, method = "spearman")),
cor(bnc.3gram$len, bnc.3gram$surprisal, method = "spearman"),n=length(bnc.3gram$len))
## Call: paired.r(xy = abs(cor(bnc.3gram$len, bnc.3gram$log.bnc.frequency,
## method = "spearman")), xz = cor(bnc.3gram$len, bnc.3gram$surprisal,
## method = "spearman"), n = length(bnc.3gram$len))
## [1] "test of difference between two independent correlations"
## z = 4.19 With probability = 0
pcor.test(bnc.2gram$len, bnc.2gram$log.bnc.frequency,bnc.2gram$surprisal,method = "spearman")
## estimate p.value statistic n gn Method Use
## 1 -0.005151654 0.4136878 -0.8174211 25179 1 Spearman Var-Cov matrix
tidy(lm(mrc.phon ~ complexity + surprisal + log.bnc.frequency, lf.data))
## term estimate std.error statistic p.value
## 1 (Intercept) -0.02933587 1.63366388 -0.0179571 9.856816e-01
## 2 complexity 1.07923234 0.06705291 16.0952349 8.640533e-46
## 3 surprisal 0.32585515 0.24551279 1.3272431 1.851459e-01
## 4 log.bnc.frequency -0.04332642 0.10223195 -0.4238050 6.719239e-01
Now, let’s look cross-linguistically.
Read in xling data and merge with English complexity norms
Get correlations between length and complexity for all 499 words, with bootstrapped CIs
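As an illustration of this step, the sketch below computes a percentile bootstrap confidence interval for a correlation. The notes do not say which bootstrap method is used, so the percentile interval, the 1000 resamples, and the simulated data are all assumptions made for the example.

```r
set.seed(1)
n <- 499

# Simulated stand-ins for complexity ratings and word lengths (hypothetical data).
complexity <- rnorm(n)
len        <- 0.6 * complexity + rnorm(n)

observed.r <- cor(complexity, len)

# Resample word indices with replacement and recompute the correlation each time.
boot.r <- replicate(1000, {
  idx <- sample(n, replace = TRUE)
  cor(complexity[idx], len[idx])
})

# Percentile 95% confidence interval from the bootstrap distribution.
ci <- quantile(boot.r, c(0.025, 0.975))
print(observed.r)
print(ci)
```

Resampling whole words (rows) rather than the two variables separately keeps each complexity rating paired with its own length, which is what the correlation depends on.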
Partial correlations (with frequency and surprisal)
Get correlation for monomorphemic and open class subsets
Prepare data for plotting
We counted the number of Unicode characters in each translation. Variability in word length within languages was positively correlated with complexity ratings. The correlation coefficients are plotted below for each language. Red bars indicate languages where the accuracy of the translations was checked by a native speaker, and pink bars indicate unchecked languages. The dashed line indicates the grand mean correlation across languages. Full circles indicate the correlation between complexity and length, partialling out log spoken frequency in English. Empty circles show the correlation between complexity and length, partialling out surprisal. Triangles indicate the correlation between complexity and length for the subset of words that are monomorphemic in English. Squares indicate the correlation between complexity and length for the subset of open class words.