Outline

  • Motivation
  • Lexicon Model
  • DeepLEX: framework and applications
  • Conclusion

Motivation

  • Sentiment and Sense-aware computing (2015-)
    • What is an ideal Lexicon Model for (NLP/NLU)?

Language Complexity in Communicative Settings

visual, auditory, verbal, and invisible suprasystems interacting in a congruent state

  • Polymorphic behaviour: a sign (word, phrase) can have multiple senses in varied contexts. The resulting ambiguity in use may be lexical, structural, scopal (每個人都喜歡他的孩子, 'Everyone likes his child'), anaphoric or deictic (他上禮拜開始就應該待在這裡, 'He should have been staying here since last week'), etc.
  • Paralinguistic and extralinguistic information merging in communicative settings: kinesics (body gestures), acoustic (vocal activity), silence, etc.

Affective Expressions

Affect is a broad, all-encompassing term that covers emotion, feeling, and mood together.

  • Emotion, Mood, Feeling, Attitude, Temperament, Personal trait

Toxic comments

Language, Discourse and Multimodality

Expressive Units of Emotions

An affect lexicon oversimplifies the affective process

Drawing

Pattern Grammar of Affect

  • Local Grammar of Affect (Bednarek, 2008)
  • Use pattern grammar (Hunston, 1999) in approaching evaluative (semantic) prosody (Lee and Hsieh, 2016).

Chinese Evaluative Patterns

Reading (evaluative) texts

  • Chinese readers do not follow the segmentation rules and tend to chunk single words into larger information units (Liu et al. 2013)

Deeper understanding/resources of affective expressions?

  • Multimodal affective states (paralinguistics)
  • Lexical items vs lexical chunks/bundles
  • Sense-sentiment concurrent processing
  • Tracing varieties and changes

Form-Meaning pairings: functional take

  • Shift towards usage-based model
  • The word-meaning pair is fluid in nature: its granularity (in form length and in range of meanings) is influenced by its underlying ontology (paradigmatic dimension), its surrounding context (syntagmatic dimension), and its real-world application (pragmatic force).

「還在那邊」(‘still over there’)

BEFORE

不是我要說他,已經什麼都有了他還在那邊邱屁邱。別人我不敢說,這種人喔實在是 $#* :-(

AFTER

library(quanteda)  # provides tokens()
tokens("不是我要說他,已經什麼都有了他還在那邊邱屁邱。我覺得喔這種人喔實在是 $#*")
## tokens from 1 document.
## text1 :
##  [1] "不是"   "我要"   "說"     "他"     ","     "已經"   "什麼"  
##  [8] "都有"   "了"     "他"     "還在"   "那邊"   "邱"     "屁"    
## [15] "邱"     "。"     "我"     "覺得"   "喔"     "這種"   "人"    
## [22] "喔"     "實在是" "$"      "#"      "*"

  • Revisit Chinese wordhood, and treat word segmentation as wordhood annotation rather than processing.
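
One way to picture segmentation as annotation is to keep the string intact and store several boundary layers over it, one per granularity. The sketch below is only an illustration in base R: the helper `cut_at()` and the two layers are hypothetical and are not part of DeepLEX or quanteda.

```r
# Hypothetical helper: cut a string after the given character offsets.
cut_at <- function(text, breaks) {
  # Split into Unicode characters (assumes the input is UTF-8).
  chars <- intToUtf8(utf8ToInt(text), multiple = TRUE)
  ends <- sort(unique(c(breaks, length(chars))))
  starts <- c(1, ends[-length(ends)] + 1)
  mapply(function(s, e) paste(chars[s:e], collapse = ""), starts, ends,
         USE.NAMES = FALSE)
}

text <- "他還在那邊"

# Two annotation layers over the same string (illustrative, not gold labels).
layers <- list(
  fine   = c(1, 2, 3, 4),  # a boundary after every character
  coarse = c(1)            # subject, then one formulaic chunk
)

segmentations <- lapply(layers, cut_at, text = text)
print(segmentations)
```

Both layers describe the same text, so neither has to be declared the single correct segmentation; a downstream task can pick the granularity it needs.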

The Lexicon/Lexica in NLP

  • linguistically shallow

  • cannot interpret ‘usage in context’.
    • e.g., ‘我常常’ (‘I often’) in depressive texts

Theories

Enumerative, generative, emergent

The Enumerative Lexicon

(Miller and Fellbaum, 1990)

  • WordNet

The Generative Lexicon

(Pustejovsky, 1995)

The Emergent Lexicon

(Huang, 1995; Bybee, 1998)

  • Meanings are negotiated, subordinated to the sequential requirements of talk-in-interaction.

  • What we typically regard as fixed meanings, such as those codified in a dictionary, are merely sedimented or stabilized structures that emerge as negotiated recurring patterns that have achieved cross-textual consistency.

DeepLEX (Hsieh, 2018; Hsieh and Tseng 2019)

Rationale

  • The lexicon reflects (chunks of) linguistic experience.

Pre-packaged information (formulaic expressions) is relatively immune from negotiation. Such expressions figure prominently in oral discourse and very often coincide with the boundaries of intonation units, where syntactic and pragmatic completion points often converge (Huang, 1995).

DeepLEX: Assumptions

  • It takes the functional position (usage-based view) in determining units and patterns (in Chinese), as well as the ontological grounding on the relation between linguistic objects and situations (bits of reality). (Langacker 1987, 1988, 1999; Croft 2002; Tomasello 2003; Bybee 2006, 2010)

  • Lexical data at different levels are modularized (for practical reasons only) into, for example, a syntax-semantics module, an emotion module, a discourse and pragmatics module, and a diachronic module. Researchers from different fields can start new collaborations based on these modules.

First challenge from CWS

  • No more gold standard, please

library(jiebaR)
seg <- worker()
seg["據台大語言所小編謝舒凱表示,宅宅也是非常用功 der"]
##  [1] "據"     "台大"   "語言所" "小編"   "謝舒凱" "表示"   "宅宅"  
##  [8] "也"     "是"     "非常"   "用功"   "der"
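
If no single segmentation is treated as gold, two segmenters can still be compared with each other. The sketch below is one possible measure, not the evaluation used in DeepLEX: it computes the Jaccard overlap of the internal boundary positions implied by two token sequences. The function names and the second segmentation are hypothetical.

```r
# Number of characters per token, counted as Unicode code points.
n_chars <- function(tokens) {
  vapply(tokens, function(tok) length(utf8ToInt(tok)), integer(1),
         USE.NAMES = FALSE)
}

# Internal boundary offsets implied by a token sequence.
boundaries <- function(tokens) {
  ends <- cumsum(n_chars(tokens))
  ends[-length(ends)]  # drop the trivial boundary at the end of the string
}

# Jaccard overlap between the boundary sets of two segmentations.
# Note: returns NaN when neither segmentation has an internal boundary.
boundary_agreement <- function(a, b) {
  ba <- boundaries(a)
  bb <- boundaries(b)
  length(intersect(ba, bb)) / length(union(ba, bb))
}

seg_a <- c("台大", "語言所", "小編")      # as in the jiebaR output above
seg_b <- c("台大", "語言", "所", "小編")  # hypothetical finer segmentation
agreement <- boundary_agreement(seg_a, seg_b)
print(agreement)
```

The score only says how far two analyses agree on where units start and end; it makes no claim about which one is right.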

We are not alone (Japanese, Thai, Lao, Khmer)

txt_khmer <- "តៃវ៉ាន់បោះជំហានឆ្ពោះទៅរកការធ្វើពាណិជ្ជកម្មនៅអាស៊ីដើម្បីកាត់បន្ថយភាពអាស្រ័យលើប្រទេសចិន"
# Translation: Taiwan Steps up Asia Business to Reduce Dependence on China
# Romanization: taivean baohchomhan chhpaohtow rokkarothveu peanechchokamm now asai daembi
#               katbanthoy pheap asry leu bratesa chen
tokens(txt_khmer)
## tokens from 1 document.
## text1 :
##  [1] "តៃវ៉ាន់"    "បោះជំហាន"  "ឆ្ពោះទៅ"  "រកការធ្វើ" "ពាណិជ្ជកម្ម" "នៅ"      "អាស៊ី"     
##  [8] "ដើម្បី"    "កាត់បន្ថយ"  "ភាព"      "អាស្រ័យ"    "លើ"      "ប្រទេស"   "ចិន"

Form varieties at different levels

  • Character variants (異體字)
  • Fragments
  • Non-verbal expressions
  • Mixture (e.g., words with alphabetic characters such as A錢, B咖, …)
  • Formulaic expressions (incl. quadri-syllabic expressions (QIEs))

Meaning representation: symbolic/numeric dimensions

  • sense vector
  • pragmatic annotation

Modules

Module      Hanzi        Semantics   Emotion    Lexical age   Acquisition   Social network   Morpho-syntax   …
Variables   phonetics    sense       polarity   1930.freq     3y.freq       indegree         POS             …
            components   relations   classes    1940.freq     4y.freq       outdegree        productivity    …
            …            …           …          …             …             …                …               …

At the moment there are 55k units (ranging from characters to lexical chunks) with more than 150 variables. The scope and size are still evolving; with concerted, long-term effort, we believe this resource will be valuable for deep natural language processing and intelligent applications.

Linguistic Granularity Annotation

Distribution (2017)

Experiments and Applications

Experiments

(Hsieh and Tseng, 2019)

DeepLEX-based pilot studies:

  • Affective dialogue system
  • Mega- and meta-analysis

CLD + DeepLEX

## Loading `database.rds`...
## # A tibble: 3,156 x 300
##    lu_trad lu_simp cld.C1 cld.C123Backwar… cld.C123Backwar…
##    <chr>   <chr>   <chr>             <dbl>            <dbl>
##  1 心      心      心                   NA               NA
##  2 心星    心星    <NA>                 NA               NA
##  3 心淨    心净    <NA>                 NA               NA
##  4 心驚    心惊    心                   NA               NA
##  5 平心    平心    <NA>                 NA               NA
##  6 舒心    舒心    舒                   NA               NA
##  7 心寒    心寒    心                   NA               NA
##  8 丹心    丹心    <NA>                 NA               NA
##  9 心疚    心疚    <NA>                 NA               NA
## 10 淨心    净心    <NA>                 NA               NA
## # … with 3,146 more rows, and 295 more variables:
## #   cld.C123ConditionalProbability <dbl>, cld.C123Entropy <dbl>,
## #   cld.C12BackwardConditionalProbability <dbl>,
## #   cld.C12BackwardEntropy <dbl>, cld.C12ConditionalProbability <dbl>,
## #   cld.C12Entropy <dbl>, cld.C1BackwardConditionalProbability <dbl>,
## #   cld.C1BackwardEntropy <dbl>, cld.C1ConditionalProbability <dbl>,
## #   cld.C1Entropy <dbl>, cld.C1FamilyFrequency <dbl>,
## #   cld.C1FamilySize <int>, cld.C1Frequency <dbl>,
## #   cld.C1FrequencyRaw <int>, cld.C1FrequencyRawSUBTL <int>,
## #   cld.C1FrequencyRawWeibo <int>, cld.C1FrequencySUBTL <dbl>,
## #   cld.C1FrequencyWeibo <dbl>, cld.C1Friends <int>,
## #   cld.C1FriendsFrequency <dbl>, cld.C1HomographsFrequency <dbl>,
## #   cld.C1HomographTokens <int>, cld.C1HomographTypes <int>,
## #   cld.C1HomophonesFrequency <dbl>, cld.C1HomophoneTokens <int>,
## #   cld.C1HomophoneTypes <int>, cld.C1InitialDiphoneFrequency <dbl>,
## #   cld.C1InitialPhonemeFrequency <dbl>, cld.C1IPA <chr>,
## #   cld.C1MaxDiphoneFrequency <dbl>, cld.C1MaxPhonemeFrequency <dbl>,
## #   cld.C1MeanDiphoneFrequency <dbl>, cld.C1MeanPhonemeFrequency <dbl>,
## #   cld.C1MinDiphoneFrequency <dbl>, cld.C1MinPhonemeFrequency <dbl>,
## #   cld.C1OLDPixels <dbl>, cld.C1Phonemes <int>,
## #   cld.C1PhonologicalFrequency <dbl>, cld.C1PhonologicalN <int>,
## #   cld.C1PictureSize <int>, cld.C1Pinyin <chr>, cld.C1Pixels <int>,
## #   cld.C1PLD <dbl>, cld.C1PR <chr>,
## #   cld.C1PRBackwardEnemiesFrequency <dbl>,
## #   cld.C1PRBackwardEnemiesTokens <int>,
## #   cld.C1PRBackwardEnemiesTypes <int>, cld.C1PREnemiesFrequency <dbl>,
## #   cld.C1PREnemiesTokens <int>, cld.C1PREnemiesTypes <int>,
## #   cld.C1PRFamilySize <int>, cld.C1PRFrequency <dbl>,
## #   cld.C1PRFriends <int>, cld.C1PRFriendsFrequency <dbl>,
## #   cld.C1PRPinyin <chr>, cld.C1PRRegularity <int>, cld.C1PRStrokes <int>,
## #   cld.C1RE <dbl>, cld.C1SR <chr>, cld.C1SRFamilySize <int>,
## #   cld.C1SRFrequency <dbl>, cld.C1SRStrokes <int>, cld.C1Strokes <int>,
## #   cld.C1Structure <fct>, cld.C1Tone <fct>, cld.C1Type <fct>,
## #   cld.C2 <chr>, cld.C2FamilyFrequency <dbl>, cld.C2FamilySize <int>,
## #   cld.C2Frequency <dbl>, cld.C2FrequencyRaw <int>,
## #   cld.C2FrequencyRawSUBTL <int>, cld.C2FrequencyRawWeibo <int>,
## #   cld.C2FrequencySUBTL <dbl>, cld.C2FrequencyWeibo <dbl>,
## #   cld.C2Friends <int>, cld.C2FriendsFrequency <dbl>,
## #   cld.C2HomographsFrequency <dbl>, cld.C2HomographTokens <int>,
## #   cld.C2HomographTypes <int>, cld.C2HomophonesFrequency <dbl>,
## #   cld.C2HomophoneTokens <int>, cld.C2HomophoneTypes <int>,
## #   cld.C2InitialDiphoneFrequency <dbl>,
## #   cld.C2InitialPhonemeFrequency <dbl>, cld.C2IPA <chr>,
## #   cld.C2MaxDiphoneFrequency <dbl>, cld.C2MaxPhonemeFrequency <dbl>,
## #   cld.C2MeanDiphoneFrequency <dbl>, cld.C2MeanPhonemeFrequency <dbl>,
## #   cld.C2MinDiphoneFrequency <dbl>, cld.C2MinPhonemeFrequency <dbl>,
## #   cld.C2OLDPixels <dbl>, cld.C2Phonemes <int>,
## #   cld.C2PhonologicalFrequency <dbl>, cld.C2PhonologicalN <int>,
## #   cld.C2PictureSize <int>, cld.C2Pinyin <chr>, cld.C2Pixels <int>,
## #   cld.C2PLD <dbl>, …
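
The table above looks like a left join in which CLD variables carry a `cld.` prefix and units missing from CLD receive `NA`. A minimal base-R sketch of such a join follows. It is an assumption about how the merge could be done, not the authors' pipeline: the toy data frames, the key column name `Word`, and all values are invented for illustration.

```r
# Toy stand-ins for the two resources (values are illustrative only).
deeplex <- data.frame(
  lu_trad = c("心", "心驚", "舒心", "心星"),
  lu_simp = c("心", "心惊", "舒心", "心星"),
  stringsAsFactors = FALSE
)
cld <- data.frame(
  Word = c("心", "心惊", "舒心"),  # assumed key column, simplified script
  C1 = c("心", "心", "舒"),
  C1Strokes = c(4L, 4L, 12L),
  stringsAsFactors = FALSE
)

# Prefix every CLD variable except the key, so its origin stays visible.
key <- "Word"
is_var <- names(cld) != key
names(cld)[is_var] <- paste0("cld.", names(cld)[is_var])

# Left join: keep every DeepLEX unit, fill missing CLD values with NA.
merged <- merge(deeplex, cld, by.x = "lu_simp", by.y = key,
                all.x = TRUE, sort = FALSE)
print(merged)
```

Keeping unmatched units in the result makes coverage gaps between the two resources easy to count with `is.na()`.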

Conclusion and Future Work

References

  • Bybee, J. 1998. The Emergent Lexicon. CLS 34: The Panels.

  • Huang, S.F. 1995. Emergent Lexical Semantics.

  • Hsieh, S.K. et al. 2019. Fluid Annotation: A Granularity-aware Annotation Tool for Chinese Word Fluidity. LREC 2019.

  • Hsieh, Shu-Kai and Yu-Hsiang Tseng. 2019. Linguistic Granularity Annotation Framework: a granularity-aware approach to Chinese NLP. Journal of Granular Computing. Springer. (under review).

  • Tseng, Yu-Hsiang and Shu-Kai Hsieh. 2019. Eigencharacters.

  • Hsieh, Shu-Kai (謝舒凱) and Yu-Hsiang Tseng (曾昱翔). 2019. 深度詞庫:邁向知識導向的人工智慧基礎 [Deep lexicon: toward a knowledge-oriented foundation for artificial intelligence]. 中華心理學刊 (Chinese Journal of Psychology) 61(3).