Lexicon acquisition is an essential first stage in language development. Automatic acquisition of a Chinese lexicon is of particular importance because written Chinese has no delimiter marking word boundaries. High-quality automatic lexicon acquisition may contribute to the study of the vocabulary life cycle, and it can also be used to estimate the size of the mental lexicon across languages. In addition, automatic acquisition of new words would facilitate word segmentation on a dynamic corpus. Before investigating the acquisition of words, it is necessary to discuss what a word is. Cruse (2000) characterizes a prototypical word as a minimal permutable element: a word can be moved within a sentence, but it cannot be interrupted, and its parts cannot be reordered. Sagot (2005) retrieved a Slovak lexicon from a raw corpus using a morphological description, but that approach is applicable only to other languages with rich morphology. In this paper we conduct a pilot study to acquire the lexicon of languages including those without rich morphology (e.g., Chinese). From the above literature review, it seems that we need a more operable definition of Chinese words. In this study, a word is defined as a lexical unit that can be developed from early acquired basic words, and this definition will be evaluated by applying it to a raw corpus.
MOTIVATION
Word segmentation is a fundamental task in NLP. It is simple for alphabetic languages, where it relies mostly on delimiters, but much more difficult for non-alphabetic languages such as Chinese. A consistent word definition is also of theoretical and practical importance, especially for machine translation. Therefore, this study proposes an operational definition of words for language processing.
A word is defined as a lexical unit that can be developed from early acquired basic words.
METHOD
We examined raw sentences and segmented sentences in the Child Language Data Exchange System (CHILDES).
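# Read every transcript in the corpus directory whose file name contains "CHOU2".
dir <- "D:/tccmcorpus/"
files <- list.files(dir, "CHOU2")
# One row per utterance: speaker code, utterance text, and its %mor tier.
df <- data.frame(names = NA, utterance = NA, mor = NA)
counter <- 0
for (file in files) {
  corpus.file <- scan(paste(dir, file, sep = ""), what = "char", sep = "\n")
  for (i in corpus.file) {
    first.character <- substr(i, 1, 1)
    tier.identifier <- substr(i, 2, 4)  # e.g. "CHI" in "*CHI:" or "mor" in "%mor:"
    ann <- substring(i, 7)              # text after the tier label and the tab
    if (first.character == "*") {
      # Main tier: start a new utterance row.
      counter <- counter + 1
      df[counter, ]$names <- tier.identifier
      df[counter, ]$utterance <- ann
    } else if (tier.identifier == "mor") {
      # Dependent %mor tier: attach it to the most recent utterance.
      df[counter, ]$mor <- ann
    }
  }
}
# Collect every space-delimited token produced by the child (speaker code CHI).
count <- 1
chou2.words.df <- data.frame(word = NA)
chou2.df <- df[df$names == "CHI", ]
for (utterance in chou2.df$utterance) {
  for (word in unlist(strsplit(utterance, " "))) {
    chou2.words.df[count, ] <- word
    count <- count + 1
  }
}
# Baseline vocabulary: the distinct word types among those tokens.
chou2.voc.vector <- unique(chou2.words.df$word)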
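The extraction step above can also be written without explicit loops. The following is a minimal sketch on a few made-up CHAT-style lines; the data and the names `chat.lines` and `toy.voc.vector` are hypothetical and are not taken from the corpus. It assumes that each main-tier line begins with a label such as `*CHI:` followed by a tab, so the utterance text starts at the seventh character.

```r
# Toy CHAT-style lines (hypothetical data, not from the corpus).
chat.lines <- c(
  "@Begin",
  "*CHI:\tmama bao .",
  "%mor:\tn|mama v|bao .",
  "*MOT:\thao .",
  "*CHI:\tbao bao ."
)

# Keep main-tier lines spoken by the child and strip the "*CHI:<tab>" label.
is.chi <- startsWith(chat.lines, "*CHI:")
chi.utterances <- substring(chat.lines[is.chi], 7)

# Split on single spaces and reduce the tokens to distinct word types.
chi.tokens <- unlist(strsplit(chi.utterances, " ", fixed = TRUE))
toy.voc.vector <- unique(chi.tokens)
print(toy.voc.vector)
```

Note that the utterance-final period is kept as a token, just as in the loop-based version, so punctuation counts as a known "word" unless it is filtered out beforehand.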
RESULTS
If we consider the words produced at 2 years and 1 month of age against the vocabulary acquired by age 2, known words account for 67% of the words, and a further 13% can be identified automatically (80% in total; 98% for CHOU).
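# Read the later transcript that is evaluated against the baseline vocabulary.
corpus.file <- scan("D:/tccmcorpus/HTC01_CHOU300_12.cha", what = "char", sep = "\n")
df <- data.frame(names = NA, utterance = NA, mor = NA, candidate = NA, iov = NA,
                 oov = NA, guess = NA)
counter <- 0
for (i in corpus.file) {
  first.character <- substr(i, 1, 1)
  tier.identifier <- substr(i, 2, 4)
  ann <- substring(i, 7)
  if (first.character == "*") {
    # Main tier: start a new utterance row.
    counter <- counter + 1
    df[counter, ]$names <- tier.identifier
    df[counter, ]$utterance <- ann
  } else if (tier.identifier == "mor") {
    # Dependent %mor tier: attach it to the most recent utterance.
    df[counter, ]$mor <- ann
  }
}
# pan_voc200=append(pan_voc200,'.')
chou300.df <- df[df$names == "CHI", ]
for (j in seq_len(nrow(chou300.df))) {
  utter <- chou300.df[j, ]$utterance
  words <- unlist(strsplit(utter, split = " "))
  iov <- oov <- guess <- 0
  known <- character(0)
  for (word in words) {
    if (word %in% chou2.voc.vector) {
      # In-vocabulary token: already part of the baseline vocabulary.
      known <- c(known, word)
      iov <- iov + 1
    } else {
      # Out-of-vocabulary token.
      oov <- oov + 1
    }
  }
  # Store the known words without the leading "NA" that paste() on an NA cell would add.
  chou300.df$candidate[j] <- paste(known, collapse = " ")
  chou300.df$iov[j] <- iov
  chou300.df$oov[j] <- oov
  # If exactly one token is unknown, the known words delimit it, so it counts as identified.
  if (oov == 1) {
    chou300.df$guess[j] <- iov + 1
  } else {
    chou300.df$guess[j] <- iov
  }
}
iov <- sum(chou300.df$iov)
oov <- sum(chou300.df$oov)
guess <- sum(chou300.df$guess)
# base: share of tokens already in the vocabulary; accuracy: share after adding the guessed tokens.
base <- iov/(iov + oov)
accuracy <- guess/(iov + oov)
print(accuracy)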
## [1] 0.9857
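To make the scoring rule easier to follow, here is a small self-contained sketch of the same idea on invented data. The function name `coverage.with.guess`, the toy vocabulary, and the toy utterances are all hypothetical. As in the script above, an utterance with exactly one out-of-vocabulary token is credited with one additional identified token, on the assumption that the surrounding known words delimit the unknown one.

```r
# Count in-vocabulary (iov) and out-of-vocabulary (oov) tokens per utterance,
# crediting one extra token when an utterance has exactly one unknown word.
coverage.with.guess <- function(utterances, vocabulary) {
  iov <- oov <- guess <- 0
  for (utter in utterances) {
    words <- unlist(strsplit(utter, " ", fixed = TRUE))
    known <- sum(words %in% vocabulary)
    unknown <- length(words) - known
    iov <- iov + known
    oov <- oov + unknown
    guess <- guess + known + (unknown == 1)
  }
  c(base = iov / (iov + oov), accuracy = guess / (iov + oov))
}

# Hypothetical vocabulary and utterances, for illustration only.
toy.vocabulary <- c("mama", "bao", ".")
toy.utterances <- c("mama bao .", "mama qiu .", "gou gou .")
result <- coverage.with.guess(toy.utterances, toy.vocabulary)
print(result)
```

In this toy case, 6 of the 9 tokens are in the vocabulary, so `base` is 6/9. The second utterance has a single unknown token, which is credited, so `accuracy` is 7/9. The third utterance has two unknown tokens and gains nothing, which illustrates the limit of the rule: it cannot separate adjacent unknown words.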