The normal distribution occurs frequently in many fields of science.
It would be surprising if no normal distribution could be found in language.
So can we find such a distribution in the occurrences of a specific word in a Chinese corpus?
For example, 我們(Nh) is a high-frequency word ranked 22nd in the Sinica Speech Corpora:
corpora = read.table("1_Wordfrequency.txt", header = TRUE, sep = "\t", fileEncoding = "CP950")
head(corpora, 22)
## 詞項..Word. 詞類.CKIP_POS. 拼音.Pinyin. 詞音節數.Word_size.
## 1 的 DE de5 1
## 2 我 Nh wo3 1
## 3 是 SHI shi4 1
## 4 你 Nh ni3 1
## 5 就 D jiu4 1
## 6 有 V_2 you3 1
## 7 對 P dui4 1
## 8 不 D bu4 1
## 9 個 Nf ge5 1
## 10 他 Nh ta1 1
## 11 一 Neu yi1 1
## 12 就是 Cbb jiu4 shi4 2
## 13 那 Nep na4 1
## 14 在 P zai4 1
## 15 然後 D ran2 hou4 2
## 16 都 D dou1 1
## 17 對 VH dui4 1
## 18 說 VE shuo1 1
## 19 很 Dfa hen3 1
## 20 也 D ye3 1
## 21 因為 Cbb yin1 wei4 2
## 22 我們 Nh wo3 men5 2
## 詞頻.Frequency. 累積詞頻.Accumulated.Frequency.
## 1 15778 3.892
## 2 13999 7.345
## 3 13397 10.649
## 4 7429 12.481
## 5 7092 14.230
## 6 6991 15.955
## 7 6705 17.608
## 8 6677 19.255
## 9 6330 20.817
## 10 5453 22.162
## 11 5301 23.469
## 12 5260 24.767
## 13 4827 25.957
## 14 4694 27.115
## 15 4473 28.218
## 16 4419 29.308
## 17 4414 30.397
## 18 4242 31.443
## 19 4100 32.454
## 20 3882 33.412
## 21 3435 34.259
## 22 3412 35.101
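freq = corpora[, 5]  # column 5 is 詞頻.Frequency.
ranks = 1:length(freq)
# Rank-frequency plot on log-log axes; 我們(Nh) is rank 22 with frequency 3412.
plot(log(ranks), log(freq))
text(log(22), log(3412), "we(Nh)", col = "red")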
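As a rough check on the shape of the rank-frequency plot, a straight line can be fitted through the log-log points. The sketch below is an added illustration, not part of the original analysis: it uses only the 22 frequencies printed in the table above, and the names `freq22`, `ranks22`, `fit` and `slope` are assumptions introduced here.

```r
# Frequencies of the 22 top-ranked words, copied from the table printed above.
freq22 <- c(15778, 13999, 13397, 7429, 7092, 6991, 6705, 6677, 6330, 5453, 5301,
            5260, 4827, 4694, 4473, 4419, 4414, 4242, 4100, 3882, 3435, 3412)
ranks22 <- seq_along(freq22)

# Least-squares line through the log-log points (illustrative only).
fit <- lm(log(freq22) ~ log(ranks22))
slope <- unname(coef(fit)[2])

plot(log(ranks22), log(freq22), xlab = "log(rank)", ylab = "log(frequency)")
abline(fit, lty = 2)  # dashed fitted line
print(slope)
```

With only 22 points the slope is a coarse summary at best; running the same fit on the full `freq` vector would be more informative.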
Let's examine the frequency distribution of 我們(Nh) in an ASBC corpus sample of about 40,000 words (40,740 tokens), split into 40 equal-sized chunks:
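KW = "我們(Nh)"
# Read one sentence per line; tokens within a sentence are separated by spaces.
corpus.file = scan("sentences.txt", "char", sep = "\n", fileEncoding = "UTF-8")
word.list = lapply(corpus.file, strsplit, split = " ")
word.vector = unlist(word.list)
# Assign every token to one of 40 equal-width chunks by position.
asbc = data.frame(word = word.vector, chunk = cut(1:length(word.vector), breaks = 40,
    labels = FALSE))
asbc$KW = asbc$word == KW
# Count the keyword occurrences in each chunk and test the counts for normality.
countofKW = tapply(asbc$KW, asbc$chunk, sum)
hist(countofKW)
shapiro.test(countofKW)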
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.8984, p-value = 0.001717
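W = 0.898 is fairly close to 1, so the counts look roughly normal, although the p-value (0.0017) still rejects strict normality at the 0.05 level. Is this near-normal shape just chance?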
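One way to get a feel for what the test reports when a word really is scattered at random is to run the same chunk-and-count steps on simulated tokens. The following is a minimal sketch under stated assumptions: the 40,000-token length, the 0.01 keyword rate, the seed and all `toy_*` names are hypothetical and are not taken from the ASBC data.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in for word.vector: 40,000 tokens in which the keyword
# appears independently with probability 0.01 (both values are assumptions).
n_tokens <- 40000
toy_words <- ifelse(runif(n_tokens) < 0.01, "KW", "other")

# Same chunking idea as in the corpus code: 40 equal-width chunks by position,
# then the number of keyword hits in each chunk.
toy_chunk <- cut(seq_along(toy_words), breaks = 40, labels = FALSE)
toy_counts <- tapply(toy_words == "KW", toy_chunk, sum)

hist(toy_counts)
toy_test <- shapiro.test(toy_counts)
print(toy_test)
```

Comparing W and the p-value from such a random baseline with the corpus result gives some sense of how far the real counts depart from purely random placement.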
Let's re-examine normality as the corpus size varies:
a = function(word.vector, word = KW, b = 40) {
    # Name the column explicitly and compare against the `word` argument.
    asbc = data.frame(word = word.vector, chunk = cut(1:length(word.vector), breaks = b,
        labels = FALSE))
    asbc$KW = asbc$word == word
    countofKW = tapply(asbc$KW, asbc$chunk, sum)
    print(length(word.vector))
    print(shapiro.test(countofKW))
}
len = length(word.vector)
for (i in seq(len/40, len, len/40)) a(word.vector[1:i])
## [1] 1018
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.147, p-value = 6.64e-14
##
## [1] 2037
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.479, p-value = 9.707e-11
##
## [1] 3055
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.4604, p-value = 5.988e-11
##
## [1] 4074
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.5745, p-value = 1.425e-09
##
## [1] 5092
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.5325, p-value = 4.181e-10
##
## [1] 6111
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.611, p-value = 4.433e-09
##
## [1] 7129
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.6284, p-value = 7.794e-09
##
## [1] 8148
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.5529, p-value = 7.514e-10
##
## [1] 9166
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.5977, p-value = 2.909e-09
##
## [1] 10185
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.5603, p-value = 9.352e-10
##
## [1] 11203
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.5675, p-value = 1.155e-09
##
## [1] 12222
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.6607, p-value = 2.334e-08
##
## [1] 13240
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.7189, p-value = 2.019e-07
##
## [1] 14259
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.7198, p-value = 2.093e-07
##
## [1] 15277
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.6843, p-value = 5.424e-08
##
## [1] 16296
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.6812, p-value = 4.845e-08
##
## [1] 17314
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.7186, p-value = 1.995e-07
##
## [1] 18333
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.7115, p-value = 1.509e-07
##
## [1] 19351
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.7589, p-value = 1.045e-06
##
## [1] 20370
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.7748, p-value = 2.103e-06
##
## [1] 21388
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.7597, p-value = 1.08e-06
##
## [1] 22407
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.7147, p-value = 1.71e-07
##
## [1] 23425
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.8164, p-value = 1.515e-05
##
## [1] 24444
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.8061, p-value = 9.106e-06
##
## [1] 25462
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.8308, p-value = 3.173e-05
##
## [1] 26481
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.8032, p-value = 7.919e-06
##
## [1] 27499
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.8445, p-value = 6.65e-05
##
## [1] 28518
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.869, p-value = 0.0002696
##
## [1] 29536
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.8765, p-value = 0.0004239
##
## [1] 30555
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.8254, p-value = 2.402e-05
##
## [1] 31573
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.823, p-value = 2.125e-05
##
## [1] 32592
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.9067, p-value = 0.003007
##
## [1] 33610
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.8448, p-value = 6.73e-05
##
## [1] 34629
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.9264, p-value = 0.0123
##
## [1] 35647
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.9045, p-value = 0.002596
##
## [1] 36666
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.818, p-value = 1.64e-05
##
## [1] 37684
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.8535, p-value = 0.0001095
##
## [1] 38703
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.8626, p-value = 0.0001847
##
## [1] 39721
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.8583, p-value = 0.0001444
##
## [1] 40740
##
## Shapiro-Wilk normality test
##
## data: countofKW
## W = 0.8984, p-value = 0.001717
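Therefore, as the corpus grows, the per-chunk occurrence counts of a high-frequency word such as 我們(Nh) in a chunked, high-quality Chinese corpus move closer to a normal distribution: W rises from 0.147 at 1,018 words to between about 0.82 and 0.93 from 32,592 words onward. Every run still rejects strict normality at the 0.05 level, so the trend is suggestive rather than conclusive.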
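The long run of printed test results is easier to read when W and the p-value are collected into one table and plotted against corpus size. The sketch below shows one possible way to do that. It runs on synthetic tokens rather than the ASBC data, and `normality_at`, `toy_words`, `sizes` and `results` are hypothetical names introduced for illustration.

```r
# Shapiro-Wilk W and p-value for the keyword counts in b equal-width chunks.
# `normality_at` is a hypothetical helper, not part of the original code.
normality_at <- function(words, keyword, b = 40) {
  chunk <- cut(seq_along(words), breaks = b, labels = FALSE)
  counts <- tapply(words == keyword, chunk, sum)
  test <- shapiro.test(counts)
  c(size = length(words), W = unname(test$statistic), p = test$p.value)
}

set.seed(2)  # arbitrary seed for reproducibility
# Synthetic stand-in for word.vector (length and keyword rate are assumptions).
toy_words <- ifelse(runif(40000) < 0.01, "KW", "other")

# Evaluate growing prefixes of the token vector and gather one row per size.
sizes <- seq(10000, 40000, by = 10000)
results <- as.data.frame(t(sapply(sizes, function(n) {
  normality_at(toy_words[1:n], "KW")
})))
print(results)
plot(results$size, results$W, type = "b",
     xlab = "corpus size (words)", ylab = "W")
```

Replacing `toy_words` with the real `word.vector` and `"KW"` with `KW` would give the same table for the corpus itself.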
In the future, 10-fold cross-validation of the corpus could be performed.
If the same pattern holds in a manually tagged corpus, normality could be regarded as one criterion for a good-enough corpus.