- Motivation
- Lexicon Model
- DeepLEX: framework and applications
- Conclusion
visual, auditory, verbal, and invisible suprasystems interacting in a congruent state
Affect is a broad, all-encompassing term that covers emotion, feelings, and mood together.
不是我要說他,已經什麼都有了他還在那邊邱屁邱。別人我不敢說,這種人喔實在是 $#* :-(
library(quanteda)
tokens("不是我要說他,已經什麼都有了他還在那邊邱屁邱。我覺得喔這種人喔實在是 $#*")
## tokens from 1 document.
## text1 :
##  [1] "不是"   "我要"   "說"     "他"     ","      "已經"   "什麼"
##  [8] "都有"   "了"     "他"     "還在"   "那邊"   "邱"     "屁"
## [15] "邱"     "。"     "我"     "覺得"   "喔"     "這種"   "人"
## [22] "喔"     "實在是" "$"      "#"      "*"
linguistically shallow
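The tokenizer output above splits the colloquial chunk 邱屁邱 into three single-character tokens. As a minimal sketch of how a lexicon that lists such chunks could change the result, the base R function below applies greedy forward maximum matching against a toy lexicon. The function name `fmm_segment`, the `max_len` parameter, and the three-entry lexicon are hypothetical and introduced only for illustration; this is not how quanteda segments text, nor a description of DeepLEX itself.

```r
# Greedy forward maximum matching: at each position, take the longest
# substring (up to max_len characters) that is listed in the lexicon;
# fall back to a single character when nothing matches.
fmm_segment <- function(text, lexicon, max_len = 4L) {
  chars <- strsplit(text, "")[[1]]   # split into single characters
  n <- length(chars)
  out <- character(0)
  i <- 1L
  while (i <= n) {
    # Try candidate lengths from longest to shortest.
    for (len in seq(min(max_len, n - i + 1L), 1L)) {
      cand <- paste(chars[i:(i + len - 1L)], collapse = "")
      if (len == 1L || cand %in% lexicon) {
        out <- c(out, cand)
        i <- i + len
        break
      }
    }
  }
  out
}

# Toy lexicon (an assumption for this example, not real DeepLEX entries).
toy_lexicon <- c("還在", "那邊", "邱屁邱")
print(fmm_segment("他還在那邊邱屁邱", toy_lexicon))
```

With this toy lexicon the chunk 邱屁邱 is kept as one unit, and with an empty lexicon the function degrades to character-by-character splitting. Greedy matching is the simplest possible strategy and ignores ambiguity, which is part of why a richer lexical resource matters.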
Meanings are negotiated, subordinated to the sequential requirements of talk-in-interaction.
What we typically regard as fixed meanings, such as those codified in a dictionary, are merely sedimented or stabilized structures that emerge as negotiated recurring patterns that have achieved cross-textual consistency.
Pre-packaged information (formulaic expressions) is relatively immune from negotiation. Such expressions figure prominently in oral discourse and often coincide with the boundaries of intonation units, where syntactic and pragmatic completion points tend to converge (Huang, 1995).
It takes the functional position (usage-based view) in determining units and patterns (in Chinese), and grounds the model ontologically in the relation between linguistic objects and situations (bits of reality) (Langacker 1987, 1988, 1999; Croft 2002; Tomasello 2003; Bybee 2006, 2010).
Lexical data at different levels are modularized (for practical reasons only) into, for example, a syntax-semantics module, an emotion module, a discourse and pragmatics module, a diachronic module, etc. Researchers from different fields can start new collaborations based on these modules.
library(jiebaR)
seg <- worker()
seg["據台大語言所小編謝舒凱表示,宅宅也是非常用功 der"]
## [1] "據"     "台大"   "語言所" "小編"   "謝舒凱" "表示"   "宅宅"
## [8] "也"     "是"     "非常"   "用功"   "der"
txt_khmer <- "តៃវ៉ាន់បោះជំហានឆ្ពោះទៅរកការធ្វើពាណិជ្ជកម្មនៅអាស៊ីដើម្បីកាត់បន្ថយភាពអាស្រ័យលើប្រទេសចិន "
# Taiwan Steps up Asia Business to Reduce Dependence on China
# taivean baohchomhan chhpaohtow rokkarothveu peanechchokamm now asai daembi
# katbanthoy pheap asry leu bratesa chen
tokens(txt_khmer)
## tokens from 1 document.
## text1 :
## [1] "តៃវ៉ាន់" "បោះជំហាន" "ឆ្ពោះទៅ" "រកការធ្វើ" "ពាណិជ្ជកម្ម" "នៅ" "អាស៊ី"
## [8] "ដើម្បី" "កាត់បន្ថយ" "ភាព" "អាស្រ័យ" "លើ" "ប្រទេស" "ចិន"
| Hanzi | Semantics | Emotion | Lexical Age | Acquisition | Social Network | Morpho-syntax | ... |
|---|---|---|---|---|---|---|---|
| phonetics | sense | polarity | 1930.freq | 3y.freq | indegree | POS | ... |
| components | relations | classes | 1940.freq | 4y.freq | outdegree | productivity | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
At the moment there are 55k units (ranging from characters to lexical chunks) with more than 150 variables. The scope and size are still evolving; with concerted, long-term effort, we believe this resource will be valuable for deep natural language processing and intelligent applications.
DeepLEX-based pilot studies:
LexicoR: an R package for Chinese Lexicon(s)
## Loading `database.rds`...
## # A tibble: 3,156 x 300
##    lu_trad lu_simp cld.C1 cld.C123Backwar… cld.C123Backwar…
##    <chr>   <chr>   <chr>             <dbl>            <dbl>
##  1 心      心      心                   NA               NA
##  2 心星    心星    <NA>                 NA               NA
##  3 心淨    心净    <NA>                 NA               NA
##  4 心驚    心惊    心                   NA               NA
##  5 平心    平心    <NA>                 NA               NA
##  6 舒心    舒心    舒                   NA               NA
##  7 心寒    心寒    心                   NA               NA
##  8 丹心    丹心    <NA>                 NA               NA
##  9 心疚    心疚    <NA>                 NA               NA
## 10 淨心    净心    <NA>                 NA               NA
## # … with 3,146 more rows, and 295 more variables:
## #   cld.C123ConditionalProbability <dbl>, cld.C123Entropy <dbl>,
## #   cld.C12BackwardConditionalProbability <dbl>,
## #   cld.C12BackwardEntropy <dbl>, cld.C12ConditionalProbability <dbl>,
## #   cld.C12Entropy <dbl>, cld.C1BackwardConditionalProbability <dbl>,
## #   cld.C1BackwardEntropy <dbl>, cld.C1ConditionalProbability <dbl>,
## #   cld.C1Entropy <dbl>, cld.C1FamilyFrequency <dbl>,
## #   cld.C1FamilySize <int>, cld.C1Frequency <dbl>,
## #   cld.C1FrequencyRaw <int>, cld.C1FrequencyRawSUBTL <int>,
## #   cld.C1FrequencyRawWeibo <int>, cld.C1FrequencySUBTL <dbl>,
## #   cld.C1FrequencyWeibo <dbl>, cld.C1Friends <int>,
## #   cld.C1FriendsFrequency <dbl>, cld.C1HomographsFrequency <dbl>,
## #   cld.C1HomographTokens <int>, cld.C1HomographTypes <int>,
## #   cld.C1HomophonesFrequency <dbl>, cld.C1HomophoneTokens <int>,
## #   cld.C1HomophoneTypes <int>, cld.C1InitialDiphoneFrequency <dbl>,
## #   cld.C1InitialPhonemeFrequency <dbl>, cld.C1IPA <chr>,
## #   cld.C1MaxDiphoneFrequency <dbl>, cld.C1MaxPhonemeFrequency <dbl>,
## #   cld.C1MeanDiphoneFrequency <dbl>, cld.C1MeanPhonemeFrequency <dbl>,
## #   cld.C1MinDiphoneFrequency <dbl>, cld.C1MinPhonemeFrequency <dbl>,
## #   cld.C1OLDPixels <dbl>, cld.C1Phonemes <int>,
## #   cld.C1PhonologicalFrequency <dbl>, cld.C1PhonologicalN <int>,
## #   cld.C1PictureSize <int>, cld.C1Pinyin <chr>, cld.C1Pixels <int>,
## #   cld.C1PLD <dbl>, cld.C1PR <chr>,
## #   cld.C1PRBackwardEnemiesFrequency <dbl>,
## #   cld.C1PRBackwardEnemiesTokens <int>,
## #   cld.C1PRBackwardEnemiesTypes <int>, cld.C1PREnemiesFrequency <dbl>,
## #   cld.C1PREnemiesTokens <int>, cld.C1PREnemiesTypes <int>,
## #   cld.C1PRFamilySize <int>, cld.C1PRFrequency <dbl>,
## #   cld.C1PRFriends <int>, cld.C1PRFriendsFrequency <dbl>,
## #   cld.C1PRPinyin <chr>, cld.C1PRRegularity <int>, cld.C1PRStrokes <int>,
## #   cld.C1RE <dbl>, cld.C1SR <chr>, cld.C1SRFamilySize <int>,
## #   cld.C1SRFrequency <dbl>, cld.C1SRStrokes <int>, cld.C1Strokes <int>,
## #   cld.C1Structure <fct>, cld.C1Tone <fct>, cld.C1Type <fct>,
## #   cld.C2 <chr>, cld.C2FamilyFrequency <dbl>, cld.C2FamilySize <int>,
## #   cld.C2Frequency <dbl>, cld.C2FrequencyRaw <int>,
## #   cld.C2FrequencyRawSUBTL <int>, cld.C2FrequencyRawWeibo <int>,
## #   cld.C2FrequencySUBTL <dbl>, cld.C2FrequencyWeibo <dbl>,
## #   cld.C2Friends <int>, cld.C2FriendsFrequency <dbl>,
## #   cld.C2HomographsFrequency <dbl>, cld.C2HomographTokens <int>,
## #   cld.C2HomographTypes <int>, cld.C2HomophonesFrequency <dbl>,
## #   cld.C2HomophoneTokens <int>, cld.C2HomophoneTypes <int>,
## #   cld.C2InitialDiphoneFrequency <dbl>,
## #   cld.C2InitialPhonemeFrequency <dbl>, cld.C2IPA <chr>,
## #   cld.C2MaxDiphoneFrequency <dbl>, cld.C2MaxPhonemeFrequency <dbl>,
## #   cld.C2MeanDiphoneFrequency <dbl>, cld.C2MeanPhonemeFrequency <dbl>,
## #   cld.C2MinDiphoneFrequency <dbl>, cld.C2MinPhonemeFrequency <dbl>,
## #   cld.C2OLDPixels <dbl>, cld.C2Phonemes <int>,
## #   cld.C2PhonologicalFrequency <dbl>, cld.C2PhonologicalN <int>,
## #   cld.C2PictureSize <int>, cld.C2Pinyin <chr>, cld.C2Pixels <int>,
## #   cld.C2PLD <dbl>, …
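The column names in the printout above share the prefix `cld.`, which suggests one way the modular organisation of the lexicon could surface in a single flat table. The following is a minimal base R sketch under that assumption: each module contributes columns under its own prefix, and a module is pulled out by prefix together with the key column. The toy data frame, the `emo.` prefix, the `emo.polarity` column, and the helper `select_module` are hypothetical and are not part of LexicoR's documented interface; only `lu_trad` and `cld.C1Strokes` are names taken from the printout.

```r
# Toy stand-in for the lexicon table (values are illustrative only).
toy <- data.frame(
  lu_trad       = c("心", "平心"),
  cld.C1Strokes = c(4L, 5L),                       # stroke counts of 心 and 平
  emo.polarity  = c(NA_character_, NA_character_), # hypothetical emotion module
  stringsAsFactors = FALSE
)

# Keep the key column plus every column whose name starts with `prefix`.
select_module <- function(lexicon, prefix, key = "lu_trad") {
  keep <- names(lexicon) == key | startsWith(names(lexicon), prefix)
  lexicon[, keep, drop = FALSE]
}

cld_part <- select_module(toy, "cld.")
print(names(cld_part))
```

Keeping the key column in every extracted module means the pieces can later be recombined with `merge()` on `lu_trad`, which is one simple way for separate teams to work on separate modules.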
Bybee, J. 1998. The Emergent Lexicon. CLS 34: The Panels.
Huang, S.F. 1995. Emergent Lexical Semantics.
Hsieh, S.K. et al. 2019. Fluid Annotation: A Granularity-aware Annotation Tool for Chinese Word Fluidity. LREC. 2019.
Hsieh, Shu-Kai and Yu-Hsiang Tseng. 2019. Linguistic Granularity Annotation Framework: a granularity-aware approach to Chinese NLP. Journal of Granular Computing. Springer. (under review).
Tseng, Yu-Hsiang and Shu-Kai Hsieh. 2019. Eigencharacters.
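Hsieh, Shu-Kai (謝舒凱) and Yu-Hsiang Tseng (曾昱翔). 2019. 深度詞庫:邁向知識導向的人工智慧基礎 [DeepLEX: Toward a knowledge-oriented foundation for artificial intelligence]. 中華心理學刊 (Chinese Journal of Psychology) 61(3).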