DeepLEX: yet another Lexicon?

Outline

Motivation
Lexicon Model
DeepLEX: framework and applications
Conclusion

Motivation

Sentiment and Sense-aware computing (2015-)
- What is an ideal Lexicon Model for (NLP/NLU)?

Language Complexity in Communicative Settings

visual, auditory, verbal, and invisible suprasystems interacting in a congruent state

Polymorphic behaviour: A sign (word, phrase) can have multiple senses in varied contexts. Resulting ambiguity in use: lexical, structural, scope (每個人都喜歡他的孩子, 'Everyone likes his child'), anaphora/deictic expressions (他上禮拜開始就應該待在這裡, 'He should have been staying here since last week'), etc.
Paralinguistic and extralinguistic information merging in communicative settings: kinesics (body gestures), acoustic (vocal activity), silence, etc.

Affective Expressions

Affect is a broader, all-encompassing term which covers emotion, feelings, and mood together.

Emotion, Mood, Feeling, Attitude, Temperament, Personality trait

Toxic comments

Language, Discourse and Multimodality

Expressive Units of Emotions

Affect lexicons oversimplify the affective process

Pattern Grammar of Affect

Local Grammar of Affect (Bednarek, 2008)
Use pattern grammar (Hunston, 1999) in approaching evaluative (semantic) prosody (Lee and Hsieh, 2016).

Chinese Evaluative Patterns

Reading (evaluative) texts

Chinese readers do not follow segmentation rules; they tend to chunk single words into larger information units (Liu et al., 2013).

Deeper understanding/resources of affective expressions?

Multimodal affective states (paralinguistics)
Lexical items vs lexical chunks/bundles
Sense-sentiment concurrent processing
Tracing varieties and changes

Form-Meaning pairings: functional take

Shift towards usage-based model
A word-meaning pair is fluid in nature: its granularity (in terms of form length and meaning varieties) is influenced by its underlying ontology (paradigmatic dimension), surrounding context (syntagmatic dimension), and real-world application (pragmatic force).

「還在那邊」(‘still over there’)

BEFORE

不是我要說他，已經什麼都有了他還在那邊邱屁邱。別人我不敢說，這種人喔實在是 $#* :-(

AFTER

library(quanteda)
tokens("不是我要說他，已經什麼都有了他還在那邊邱屁邱。我覺得喔這種人喔實在是 $#*")

## tokens from 1 document.
## text1 :
##  [1] "不是"   "我要"   "說"     "他"     "，"     "已經"   "什麼"  
##  [8] "都有"   "了"     "他"     "還在"   "那邊"   "邱"     "屁"    
## [15] "邱"     "。"     "我"     "覺得"   "喔"     "這種"   "人"    
## [22] "喔"     "實在是" "$"      "#"      "*"

Revisit Chinese wordhood (and treat word segmentation as a matter of wordhood annotation rather than processing).
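A minimal sketch of what "segmentation as annotation" could look like in practice, assuming we simply keep competing analyses of the same string as character spans on separate layers. The helper `to_spans()` and the layer names `word` and `chunk` are hypothetical and are not part of DeepLEX.

```r
# Toy example: one string, two analyses kept side by side as span
# annotations instead of a single "correct" token sequence.
text <- "還在那邊"

# Hypothetical helper: turn a vector of segments into start/end spans.
to_spans <- function(segments, layer) {
  lens <- nchar(segments)
  ends <- cumsum(lens)
  data.frame(
    layer = layer,
    form  = segments,
    start = ends - lens + 1,
    end   = ends,
    stringsAsFactors = FALSE
  )
}

annotations <- rbind(
  to_spans(c("還在", "那邊"), layer = "word"),   # fine-grained reading
  to_spans(c("還在那邊"),     layer = "chunk")   # the same string as one unit
)

# Each span can be read back from the original string, so no layer
# overwrites another.
annotations$recovered <- substring(text, annotations$start, annotations$end)
print(annotations)
```

Because every layer points back into the same string, adding a new granularity is an extra set of rows rather than a re-segmentation.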

The Lexicon/Lexica in NLP

linguistically shallow
cannot interpret ‘usage in context’.
- e.g., ‘我常常’ (‘I often’) in depressive texts

Theories

enumerative, generative, emergent

The Enumerative Lexicon

(Miller and Fellbaum, 1990)

WordNet
- A fine-grained sense inventory and lexical relations database; cf. Chinese Wordnet (中文詞彙網路)

The Generative Lexicon

(Pustejovsky, 1995)

The Emergent Lexicon

(Huang, 1995; Bybee, 1998)

Meanings are negotiated, subordinated to the sequential requirements of talk-in-interaction.
What we typically regard as fixed meanings, such as those codified in a dictionary, are merely sedimented or stabilized structures that emerge as negotiated recurring patterns that have achieved cross-textual consistency.

DeepLEX (Hsieh, 2018; Hsieh and Tseng, 2019)

Rationale

The lexicon reflects (chunks of) linguistic experience.

Pre-packaged information (formulaic expressions) is relatively immune from negotiation. Such expressions figure prominently in oral discourse and very often coincide with the boundaries of intonation units, where syntactic and pragmatic completion points often converge (Huang, 1995).

DeepLEX: Assumptions

DeepLEX takes the functional position (usage-based view) in determining units and patterns (in Chinese), together with an ontological grounding of the relation between linguistic objects and situations (bits of reality) (Langacker 1987, 1988, 1999; Croft 2002; Tomasello 2003; Bybee 2006, 2010).
Lexical data at different levels are modularized (for practical reasons only), e.g., a syntax-semantics module, an emotion module, a discourse and pragmatics module, a diachronic module, etc. Researchers from different fields can build new collaborations on top of these modules.
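The modular assumption can be sketched as separate tables that share a lexical-unit key and are joined only when needed. This is an illustration under stated assumptions: the table names, the column names (`unit`, `n_senses`, `polarity`) and all values are made up for the example and do not come from DeepLEX.

```r
# Hypothetical module tables; names and values are illustrative only.
semantics <- data.frame(
  unit     = c("心", "心寒", "舒心"),
  n_senses = c(5L, 1L, 1L),            # made-up counts
  stringsAsFactors = FALSE
)
emotion <- data.frame(
  unit     = c("心寒", "舒心"),
  polarity = c("negative", "positive"),  # made-up labels
  stringsAsFactors = FALSE
)

# A left join keeps every unit from the semantics module and fills NA
# where the emotion module has no entry yet.
lexicon <- merge(semantics, emotion, by = "unit", all.x = TRUE)
print(lexicon)
```

Keeping modules apart means a group working on, say, emotion annotation can extend its own table without touching the others; the shared key is the only contract.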

First challenge from CWS

No more gold standard, please

library(jiebaR)
seg <- worker()
seg["據台大語言所小編謝舒凱表示，宅宅也是非常用功 der"]

##  [1] "據"     "台大"   "語言所" "小編"   "謝舒凱" "表示"   "宅宅"  
##  [8] "也"     "是"     "非常"   "用功"   "der"
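One way to work without a single gold standard is to compare segmentations with each other rather than against a reference. The sketch below, in base R, reduces each segmentation to its set of boundary positions and lists where two analyses agree or differ. The helper `boundaries()` is hypothetical, and the second segmentation of 非常用功 is an invented alternative reading used only for illustration.

```r
# Offsets after which a segmentation places a boundary
# (the end of the string is not counted as a boundary).
boundaries <- function(segments) {
  ends <- cumsum(nchar(segments))
  ends[-length(ends)]
}

seg_a <- c("非常", "用功")       # as in the segmenter output above
seg_b <- c("非", "常用", "功")   # hypothetical alternative analysis

b_a <- boundaries(seg_a)
b_b <- boundaries(seg_b)

shared <- intersect(b_a, b_b)   # boundaries both analyses agree on
only_a <- setdiff(b_a, b_b)     # boundaries proposed only by seg_a
only_b <- setdiff(b_b, b_a)     # boundaries proposed only by seg_b

print(list(shared = shared, only_a = only_a, only_b = only_b))
```

Disagreeing boundaries are then treated as information about wordhood variation rather than as errors against a reference.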

We are not alone (Japanese, Thai, Lao, Khmer)

txt_khmer <- "តៃវ៉ាន់បោះជំហានឆ្ពោះទៅរកការធ្វើពាណិជ្ជកម្មនៅអាស៊ីដើម្បីកាត់បន្ថយភាពអាស្រ័យលើប្រទេសចិន
"
# English: Taiwan Steps up Asia Business to Reduce Dependence on China
# Romanization: taivean baohchomhan chhpaohtow rokkarothveu peanechchokamm now asai daembi
#   katbanthoy pheap asry leu bratesa chen
tokens(txt_khmer)

## tokens from 1 document.
## text1 :
##  [1] "តៃវ៉ាន់"    "បោះជំហាន"  "ឆ្ពោះទៅ"  "រកការធ្វើ" "ពាណិជ្ជកម្ម" "នៅ"      "អាស៊ី"     
##  [8] "ដើម្បី"    "កាត់បន្ថយ"  "ភាព"      "អាស្រ័យ"    "លើ"      "ប្រទេស"   "ចិន"

Form varieties at different levels

Character variants (異體字)
Fragments
Non-verbal expressions
Mixture (e.g., alphabetic words: A錢, B咖, ...)
Formulaic expressions (incl. quadrasyllabic idiomatic expressions (QIEs))

Meaning representation: symbolic/numeric dimensions

sense vector
pragmatic annotation
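A minimal sketch of how the symbolic and numeric dimensions could sit together in one entry, assuming each sense carries a vector plus a pragmatic label. The three-dimensional vectors, the sense names `locative` and `stance`, and the labels are toy values invented for illustration, not DeepLEX data.

```r
# Cosine similarity between two numeric vectors.
cosine <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Hypothetical entry: each sense pairs a (toy) vector with a pragmatic label.
entry <- list(
  form = "還在那邊",
  senses = list(
    locative = list(vector = c(1, 0, 0), pragmatics = "neutral description"),
    stance   = list(vector = c(0, 1, 1), pragmatics = "speaker disapproval")
  )
)

# Toy vector standing in for the context of one usage.
context_vec <- c(0.1, 0.9, 0.8)

# Score every sense against the context and pick the closest one.
scores <- vapply(entry$senses,
                 function(s) cosine(s$vector, context_vec),
                 numeric(1))
best <- names(which.max(scores))

print(scores)
print(entry$senses[[best]]$pragmatics)
```

The numeric side selects a sense in context, while the symbolic side states what that sense does pragmatically.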

Modules

Hanzi       Semantics  Emotion   Lexical.Age  Acquisition  Social Network  Morpho-syntax  ...
phonetics   sense      polarity  1930.freq    3y.freq      indegree        POS            ...
components  relations  classes   1940.freq    4y.freq      outdegree       productivity   ...
...         ...        ...       ...          ...          ...             ...            ...

At the moment there are 55k units (ranging from characters to lexical chunks) with more than 150 variables. The scope and size are still evolving; with concerted, long-term effort, we believe this resource will be valuable for deep natural language processing and intelligent applications.

Linguistic Granularity Annotation

Distribution (2017)

Experiments and Applications

Experiments

(Hsieh and Tseng, 2019)

DeepLEX-based pilot studies:

Affective dialogue system
Mega- and meta-analysis

CLD + DeepLEX

LexicoR: an R package for Chinese lexicon(s)

## Loading `database.rds`...

## # A tibble: 3,156 x 300
##    lu_trad lu_simp cld.C1 cld.C123Backwar… cld.C123Backwar…
##    <chr>   <chr>   <chr>             <dbl>            <dbl>
##  1 心      心      心                   NA               NA
##  2 心星    心星    <NA>                 NA               NA
##  3 心淨    心净    <NA>                 NA               NA
##  4 心驚    心惊    心                   NA               NA
##  5 平心    平心    <NA>                 NA               NA
##  6 舒心    舒心    舒                   NA               NA
##  7 心寒    心寒    心                   NA               NA
##  8 丹心    丹心    <NA>                 NA               NA
##  9 心疚    心疚    <NA>                 NA               NA
## 10 淨心    净心    <NA>                 NA               NA
## # … with 3,146 more rows, and 295 more variables:
## #   cld.C123ConditionalProbability <dbl>, cld.C123Entropy <dbl>,
## #   cld.C12BackwardConditionalProbability <dbl>,
## #   cld.C12BackwardEntropy <dbl>, cld.C12ConditionalProbability <dbl>,
## #   cld.C12Entropy <dbl>, cld.C1BackwardConditionalProbability <dbl>,
## #   cld.C1BackwardEntropy <dbl>, cld.C1ConditionalProbability <dbl>,
## #   cld.C1Entropy <dbl>, cld.C1FamilyFrequency <dbl>,
## #   cld.C1FamilySize <int>, cld.C1Frequency <dbl>,
## #   cld.C1FrequencyRaw <int>, cld.C1FrequencyRawSUBTL <int>,
## #   cld.C1FrequencyRawWeibo <int>, cld.C1FrequencySUBTL <dbl>,
## #   cld.C1FrequencyWeibo <dbl>, cld.C1Friends <int>,
## #   cld.C1FriendsFrequency <dbl>, cld.C1HomographsFrequency <dbl>,
## #   cld.C1HomographTokens <int>, cld.C1HomographTypes <int>,
## #   cld.C1HomophonesFrequency <dbl>, cld.C1HomophoneTokens <int>,
## #   cld.C1HomophoneTypes <int>, cld.C1InitialDiphoneFrequency <dbl>,
## #   cld.C1InitialPhonemeFrequency <dbl>, cld.C1IPA <chr>,
## #   cld.C1MaxDiphoneFrequency <dbl>, cld.C1MaxPhonemeFrequency <dbl>,
## #   cld.C1MeanDiphoneFrequency <dbl>, cld.C1MeanPhonemeFrequency <dbl>,
## #   cld.C1MinDiphoneFrequency <dbl>, cld.C1MinPhonemeFrequency <dbl>,
## #   cld.C1OLDPixels <dbl>, cld.C1Phonemes <int>,
## #   cld.C1PhonologicalFrequency <dbl>, cld.C1PhonologicalN <int>,
## #   cld.C1PictureSize <int>, cld.C1Pinyin <chr>, cld.C1Pixels <int>,
## #   cld.C1PLD <dbl>, cld.C1PR <chr>,
## #   cld.C1PRBackwardEnemiesFrequency <dbl>,
## #   cld.C1PRBackwardEnemiesTokens <int>,
## #   cld.C1PRBackwardEnemiesTypes <int>, cld.C1PREnemiesFrequency <dbl>,
## #   cld.C1PREnemiesTokens <int>, cld.C1PREnemiesTypes <int>,
## #   cld.C1PRFamilySize <int>, cld.C1PRFrequency <dbl>,
## #   cld.C1PRFriends <int>, cld.C1PRFriendsFrequency <dbl>,
## #   cld.C1PRPinyin <chr>, cld.C1PRRegularity <int>, cld.C1PRStrokes <int>,
## #   cld.C1RE <dbl>, cld.C1SR <chr>, cld.C1SRFamilySize <int>,
## #   cld.C1SRFrequency <dbl>, cld.C1SRStrokes <int>, cld.C1Strokes <int>,
## #   cld.C1Structure <fct>, cld.C1Tone <fct>, cld.C1Type <fct>,
## #   cld.C2 <chr>, cld.C2FamilyFrequency <dbl>, cld.C2FamilySize <int>,
## #   cld.C2Frequency <dbl>, cld.C2FrequencyRaw <int>,
## #   cld.C2FrequencyRawSUBTL <int>, cld.C2FrequencyRawWeibo <int>,
## #   cld.C2FrequencySUBTL <dbl>, cld.C2FrequencyWeibo <dbl>,
## #   cld.C2Friends <int>, cld.C2FriendsFrequency <dbl>,
## #   cld.C2HomographsFrequency <dbl>, cld.C2HomographTokens <int>,
## #   cld.C2HomographTypes <int>, cld.C2HomophonesFrequency <dbl>,
## #   cld.C2HomophoneTokens <int>, cld.C2HomophoneTypes <int>,
## #   cld.C2InitialDiphoneFrequency <dbl>,
## #   cld.C2InitialPhonemeFrequency <dbl>, cld.C2IPA <chr>,
## #   cld.C2MaxDiphoneFrequency <dbl>, cld.C2MaxPhonemeFrequency <dbl>,
## #   cld.C2MeanDiphoneFrequency <dbl>, cld.C2MeanPhonemeFrequency <dbl>,
## #   cld.C2MinDiphoneFrequency <dbl>, cld.C2MinPhonemeFrequency <dbl>,
## #   cld.C2OLDPixels <dbl>, cld.C2Phonemes <int>,
## #   cld.C2PhonologicalFrequency <dbl>, cld.C2PhonologicalN <int>,
## #   cld.C2PictureSize <int>, cld.C2Pinyin <chr>, cld.C2Pixels <int>,
## #   cld.C2PLD <dbl>, …

Conclusion and Future Work

References

Bybee, J. 1998. The Emergent Lexicon. CLS 34: The Panels.
Huang, S.F. 1995. Emergent Lexical Semantics.
Hsieh, S.K. et al. 2019. Fluid Annotation: A Granularity-aware Annotation Tool for Chinese Word Fluidity. LREC.
Hsieh, Shu-Kai and Yu-Hsiang Tseng. 2019. Linguistic Granularity Annotation Framework: a granularity-aware approach to Chinese NLP. Journal of Granular Computing. Springer. (under review).
Tseng, Yu-Hsiang and Shu-Kai Hsieh. 2019. Eigencharacters.
Hsieh, Shu-Kai and Yu-Hsiang Tseng. 2019. 深度詞庫:邁向知識導向的人工智慧基礎 [DeepLEX: Toward a knowledge-oriented foundation for artificial intelligence]. 中華心理學刊 (Chinese Journal of Psychology) 61(3).