Normal Distribution in Corpus Linguistics?

The normal distribution occurs frequently in many fields of science.
It would be surprising if it never appeared in language.
So can we find such a distribution in the occurrences of a specific word in a Chinese corpus?
For example, 我們(Nh) is a high-frequency word ranked 22nd in the Sinica Speech Corpora:

corpora = read.table("1_Wordfrequency.txt", header = TRUE, sep = "\t", fileEncoding = "CP950")
head(corpora, 22)
##    詞項..Word. 詞類.CKIP_POS. 拼音.Pinyin. 詞音節數.Word_size.
## 1           的             DE         de5                    1
## 2           我             Nh         wo3                    1
## 3           是            SHI        shi4                    1
## 4           你             Nh         ni3                    1
## 5           就              D        jiu4                    1
## 6           有            V_2        you3                    1
## 7           對              P        dui4                    1
## 8           不              D         bu4                    1
## 9           個             Nf         ge5                    1
## 10          他             Nh         ta1                    1
## 11          一            Neu         yi1                    1
## 12        就是            Cbb   jiu4 shi4                    2
## 13          那            Nep         na4                    1
## 14          在              P        zai4                    1
## 15        然後              D   ran2 hou4                    2
## 16          都              D        dou1                    1
## 17          對             VH        dui4                    1
## 18          說             VE       shuo1                    1
## 19          很            Dfa        hen3                    1
## 20          也              D         ye3                    1
## 21        因為            Cbb   yin1 wei4                    2
## 22        我們             Nh    wo3 men5                    2
##    詞頻.Frequency. 累積詞頻.Accumulated.Frequency.
## 1            15778                           3.892
## 2            13999                           7.345
## 3            13397                          10.649
## 4             7429                          12.481
## 5             7092                          14.230
## 6             6991                          15.955
## 7             6705                          17.608
## 8             6677                          19.255
## 9             6330                          20.817
## 10            5453                          22.162
## 11            5301                          23.469
## 12            5260                          24.767
## 13            4827                          25.957
## 14            4694                          27.115
## 15            4473                          28.218
## 16            4419                          29.308
## 17            4414                          30.397
## 18            4242                          31.443
## 19            4100                          32.454
## 20            3882                          33.412
## 21            3435                          34.259
## 22            3412                          35.101
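freq = corpora[, 5]  # column 5 is 詞頻 (Frequency)
ranks = seq_along(freq)
# Log-log rank-frequency plot; mark 我們(Nh) at rank 22, frequency 3412
plot(log(ranks), log(freq))
text(log(22), log(3412), "we(Nh)", col = "red")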

[Figure: log-log plot of word frequency against rank, with 我們(Nh) labelled "we(Nh)" in red]
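
The rank-frequency plot can also be summarised numerically. As a minimal sketch (not part of the original analysis), the snippet below fits a straight line to the log-log points with `lm()`, using only the 22 frequencies printed in the table above. The names `top22`, `rank22`, `fit` and `slope` are illustrative, and a slope estimated from the 22 top-ranked words alone says little about the whole corpus.

```r
# Frequencies of the 22 top-ranked words, copied from the table above
top22 = c(15778, 13999, 13397, 7429, 7092, 6991, 6705, 6677, 6330, 5453, 5301,
    5260, 4827, 4694, 4473, 4419, 4414, 4242, 4100, 3882, 3435, 3412)
rank22 = seq_along(top22)
# Least-squares line through the log-log points
fit = lm(log(top22) ~ log(rank22))
# The slope is negative because frequency falls as rank increases
slope = unname(coef(fit)[2])
print(slope)
```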

Let's examine the frequency distribution of 我們(Nh) in an ASBC corpus sample of about 40,000 words (40,740 tokens), split into 40 equal chunks:

KW = "我們(Nh)"
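# Read the file line by line; tokens on a line are separated by spaces
corpus.file = scan("sentences.txt", "char", sep = "\n", fileEncoding = "UTF-8")
word.list = lapply(corpus.file, strsplit, split = " ")
word.vector = unlist(word.list)
# Assign each token to one of 40 equal-sized consecutive chunks
asbc = data.frame(word = word.vector, chunk = cut(seq_along(word.vector), breaks = 40, 
    labels = FALSE))
# Count the keyword occurrences in each chunk
asbc$KW = asbc$word == KW
countofKW = tapply(asbc$KW, asbc$chunk, sum)
hist(countofKW)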

[Figure: histogram of countofKW, the number of occurrences of 我們(Nh) in each of the 40 chunks]
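
To make the chunking step concrete, here is a minimal sketch on a made-up vector of 12 tokens (not corpus data), with "we" standing in for the keyword. `cut()` with `labels = FALSE` returns the chunk index of each position, and `tapply()` then sums the keyword hits per chunk.

```r
# Toy token vector, invented for illustration; the keyword is "we"
tokens = c("we", "go", "we", "eat", "rice", "you", "we", "ok", "he", "says", "fine", "yes")
# Three equal-sized consecutive chunks: tokens 1-4, 5-8 and 9-12
chunk = cut(seq_along(tokens), breaks = 3, labels = FALSE)
# Summing a logical vector per chunk gives the number of keyword hits per chunk
counts = tapply(tokens == "we", chunk, sum)
print(counts)
```

Here the three chunks contain 2, 1 and 0 occurrences of the keyword. The analysis above does the same thing with 40 chunks.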

shapiro.test(countofKW)
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.8984, p-value = 0.001717
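
A note on reading this output: the null hypothesis of the Shapiro-Wilk test is that the sample comes from a normal distribution, so a small p-value is evidence against normality, while W close to 1 indicates a closer fit. A minimal sketch on two artificial samples (both made up for illustration, not corpus data):

```r
# Evenly spaced normal quantiles: about as normal-looking as 40 values can be
near_normal = qnorm(ppoints(40))
# Mostly zeros with a few large values: strongly right-skewed
skewed = c(rep(0, 35), 50, 60, 70, 80, 90)
res_normal = shapiro.test(near_normal)
res_skewed = shapiro.test(skewed)
# W near 1 with a large p-value: no evidence against normality
print(c(W = unname(res_normal$statistic), p = res_normal$p.value))
# W well below 1 with a tiny p-value: normality is rejected
print(c(W = unname(res_skewed$statistic), p = res_skewed$p.value))
```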

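The counts may look roughly normal in the histogram, but the test rejects normality at the 0.05 level (W = 0.8984, p = 0.0017). Is this result just an artifact of the corpus size?
Let's re-examine normality while varying the corpus size: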

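a = function(word.vector, word = KW, b = 40) {
    # Name the column explicitly instead of relying on partial matching of asbc$word
    asbc = data.frame(word = word.vector, chunk = cut(seq_along(word.vector), breaks = b, 
        labels = FALSE))
    # Use the `word` argument rather than the global KW
    asbc$KW = asbc$word == word
    countofKW = tapply(asbc$KW, asbc$chunk, sum)
    print(length(word.vector))
    print(shapiro.test(countofKW))
}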
len = length(word.vector)
for (i in seq(len/40, len, len/40)) a(word.vector[1:i])
## [1] 1018
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.147, p-value = 6.64e-14
## 
## [1] 2037
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.479, p-value = 9.707e-11
## 
## [1] 3055
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.4604, p-value = 5.988e-11
## 
## [1] 4074
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.5745, p-value = 1.425e-09
## 
## [1] 5092
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.5325, p-value = 4.181e-10
## 
## [1] 6111
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.611, p-value = 4.433e-09
## 
## [1] 7129
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.6284, p-value = 7.794e-09
## 
## [1] 8148
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.5529, p-value = 7.514e-10
## 
## [1] 9166
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.5977, p-value = 2.909e-09
## 
## [1] 10185
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.5603, p-value = 9.352e-10
## 
## [1] 11203
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.5675, p-value = 1.155e-09
## 
## [1] 12222
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.6607, p-value = 2.334e-08
## 
## [1] 13240
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.7189, p-value = 2.019e-07
## 
## [1] 14259
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.7198, p-value = 2.093e-07
## 
## [1] 15277
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.6843, p-value = 5.424e-08
## 
## [1] 16296
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.6812, p-value = 4.845e-08
## 
## [1] 17314
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.7186, p-value = 1.995e-07
## 
## [1] 18333
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.7115, p-value = 1.509e-07
## 
## [1] 19351
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.7589, p-value = 1.045e-06
## 
## [1] 20370
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.7748, p-value = 2.103e-06
## 
## [1] 21388
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.7597, p-value = 1.08e-06
## 
## [1] 22407
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.7147, p-value = 1.71e-07
## 
## [1] 23425
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.8164, p-value = 1.515e-05
## 
## [1] 24444
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.8061, p-value = 9.106e-06
## 
## [1] 25462
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.8308, p-value = 3.173e-05
## 
## [1] 26481
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.8032, p-value = 7.919e-06
## 
## [1] 27499
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.8445, p-value = 6.65e-05
## 
## [1] 28518
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.869, p-value = 0.0002696
## 
## [1] 29536
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.8765, p-value = 0.0004239
## 
## [1] 30555
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.8254, p-value = 2.402e-05
## 
## [1] 31573
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.823, p-value = 2.125e-05
## 
## [1] 32592
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.9067, p-value = 0.003007
## 
## [1] 33610
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.8448, p-value = 6.73e-05
## 
## [1] 34629
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.9264, p-value = 0.0123
## 
## [1] 35647
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.9045, p-value = 0.002596
## 
## [1] 36666
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.818, p-value = 1.64e-05
## 
## [1] 37684
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.8535, p-value = 0.0001095
## 
## [1] 38703
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.8626, p-value = 0.0001847
## 
## [1] 39721
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.8583, p-value = 0.0001444
## 
## [1] 40740
## 
##  Shapiro-Wilk normality test
## 
## data:  countofKW 
## W = 0.8984, p-value = 0.001717
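
The long printout is easier to compare as a table. A minimal sketch, assuming a hypothetical helper `shapiro_by_size()` (not in the original) that returns W and the p-value for each corpus size instead of printing them. It is run here on a randomly generated token vector, so the numbers it produces are not corpus results.

```r
# Hypothetical helper: Shapiro-Wilk on per-chunk keyword counts, for several corpus sizes
shapiro_by_size = function(tokens, keyword, sizes, b = 40) {
    rows = lapply(sizes, function(n) {
        sub = tokens[seq_len(n)]
        chunk = cut(seq_along(sub), breaks = b, labels = FALSE)
        counts = tapply(sub == keyword, chunk, sum)
        test = shapiro.test(as.vector(counts))
        data.frame(size = n, W = unname(test$statistic), p = test$p.value)
    })
    do.call(rbind, rows)
}

# Synthetic tokens: "kw" occurs with probability 0.05 (an arbitrary choice)
set.seed(1)
tokens = sample(c("kw", "other"), 40000, replace = TRUE, prob = c(0.05, 0.95))
res = shapiro_by_size(tokens, "kw", sizes = c(10000, 20000, 30000, 40000))
print(res)
```

With the real data, a call such as `shapiro_by_size(word.vector, KW, floor(seq(len/40, len, len/40)))` should give one row per corpus size, which makes the trend in W easier to see or plot.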

Therefore, as the corpus grows, the occurrence of a high-frequency word such as 我們(Nh) across the chunks of a quality Chinese corpus moves closer to a normal distribution: W rises from 0.147 at about 1,000 words to between roughly 0.82 and 0.93 beyond 32,000 words. The p-value nevertheless stays below 0.05 at every size tested, so strict normality is still rejected.
In the future, 10-fold cross-validation of the corpus could be performed.
If the trend holds in a manually tagged corpus, normality could be regarded as one criterion for a good-enough corpus.