Chinese Corpus Linguistics and Big Data: Methods and Reflections

謝舒凱
Graduate Institute of Linguistics, National Taiwan University

Table of Contents

  • Background
  • Issues
  • Topics for Discussion

Basics

McEnery and Hardie. 2012. Corpus Linguistics: Method, Theory and Practice. Cambridge. Support website.

Corpus Linguistics: From Past to Future

  • 40 to 50 years of development and debate on introspection vs. corpus evidence.

  • A separate field of linguistics? (Tool vs. theory.) Corpus linguistics has become an indispensable component of the methodological toolbox throughout linguistics.

  • But we must not confuse corpus data with language itself. Corpora allow us to observe language, but they are not language itself.

  • Technology shapes how scientific work is conceived. What kind of future progression can be predicted for corpus linguistics?

A Debate of Uncertain Value: Corpus-based versus Corpus-driven

  • The corpus as tool: corpus-based studies typically use corpus data to explore a theory or hypothesis, usually one established in the current literature, in order to validate, refute or refine it. The definition of corpus linguistics as a method underpins this approach to the use of corpus data in linguistics.

  • The corpus as theory: corpus-driven linguistics rejects the characterization of corpus linguistics as a method and claims instead that the corpus itself should be the sole source of our hypotheses about language (Tognini-Bonelli 2001). This position is associated with the neo-Firthians.

  • "All corpus linguistics can just be described as corpus-based."

The Web Corpus: Background

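  • The information oil field: in 2011, the amount of data generated every two days equaled all the data humanity had produced from the dawn of civilization up to 2003 (1.8 ZB). MIT professor Deb Roy's language acquisition experiment (3 years, 90,000 hours, 200 TB) (cited from 林守德, 科學人, 2013.03).

  • 90% of all the data humanity has produced to date was created in the past two years.

  • Facebook gains more than 100 million photos and 50 million likes every day; Twitter produces 230 million tweets (7 TB) every day.

  • Digitization of the world (medical cloud, education cloud): everything becomes instrumented (every object in the lifeworld is sensed), interconnected (data can be sent to a back end for analysis), and intelligent (an AI-web supports decision making).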

Big Data Analysis

  • Big Data Analysis: the material returned from a web search tends to be an undifferentiated mass, which requires a new data science to process it and extract meaningful patterns.

The Web Corpus

  • Web for Corpus: collecting the texts from the Web.

  • Web as Corpus: accessing the Web as a corpus (in real-time), e.g., WebCorp.

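  • Multimodality (images, video and audio); Human Language Archive

Development of Corpus Systems and Tools

  • A corpus and a corpus search tool are two different things that are easily confused.

  • The 3rd-generation concordancer is almost synonymous with "corpus analysis tool"; well-known recent tools include WordSmith, AntConc and Xaira. But what are their limitations?

  • 4th generation: From Website to Web Service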

    • CQP (Corpus Query Processor)(CWB) and SQL-based (BYU)
    • Word Sketch Engine

Development of Corpus Systems and Tools

From Built-in and Customized Tools to Self-made Analysis Programs

  • It has been suggested that corpus linguists, rather than using general corpus analysis packages, should instead fully embrace computer programming and individually develop their own ad hoc tools to address the tasks that face them.

    • you can do analyses that are not possible with concordancers;
    • you can do analyses 'more quickly and more accurately';
    • you can tailor the output to fit your own research needs;
    • you can analyse a corpus of any size.
  • Building a web corpus has become easy, thanks to tools (e.g., BootCaT) and APIs (Plurk, Sina Weibo, and ... everything!)
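As an illustration of what such a self-made tool might look like, here is a minimal keyword-in-context (KWIC) concordancer. It is only a sketch: the function names `tokenize` and `kwic`, the `width` parameter and the sample sentence are assumptions made for this example, and the regular-expression tokenizer only suits space-delimited languages such as English, not unsegmented Chinese text.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (crude; English-like text only)."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, keyword, width=3):
    """Return a (left context, keyword, right context) triple for every hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Hypothetical sample text, adapted from an earlier slide.
sample = "Corpora allow us to observe language, but corpora are not language itself."

for left, kw, right in kwic(tokenize(sample), "language"):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} [{kw}] {right}")
```

Because the output is an ordinary list of tuples, it can be sorted, filtered or counted in ways a fixed concordancer interface may not offer, which is the point of the "tailor the output" argument above.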

Taiwan Corpora: ASBC, TCCM, Social Corpus and i-Corpus

  • Diverse, audience-specific, dynamic, developing, open, context-rich, linked.

lope-corpora

TCCM

CWM

CWN

Corpora, Broadly Construed, Can Be Diverse and Fun (1)

[We feel Fine] http://www.wefeelfine.org/

Corpus Analysis, Broadly Construed, Can Be Diverse and Fun (1)

"How does language change as you travel to different regions? Recall the classic soda vs. pop vs. coke question: some people use the word “soda” to describe their soft drinks, others use “pop”, and still others use “coke”. Who says what where?"

Soft drink terms around the U.S.; soft drink terms around the world

We can probably hypothesize that:

  • The South is pretty Coke-heavy.
  • Soda belongs to the Northeast and far West.
  • Pop gets the mid-West, except for some interesting spots of blue around Wisconsin and the Illinois-Missouri border.
  • “pop” seems to be prevalent only in parts of the United States and Canada.

Corpus Analysis, Broadly Construed, Can Be Diverse and Fun (2)

You can also make some other fun observations during the break: what do people want for Christmas, compared to what they actually get?

corpus typology

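On the Road Ahead, Academia and Industry Must Not Give Up on Each Other

Opening Pandora's Box

  • Pandora's box: multivariate, multimodal, multimedia, multilingual, etc.
  • From Corpus Linguistics to Cloud Linguistics.
  • Multilingual, multimodal, open, cross-disciplinary, multimedia, cloud-based.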

Future of Corpus in Linguistics

A Call for Reform in Linguistics Education

  • Urgent need for linguistic data scientists and/or language engineers!

There will be an important role for the corpus specialist whose research is concerned with the methodology itself - the construction and annotation of corpora, the development of new tools and new procedures, the expansion of the conceptual bases of the methodology and other such issues. [1]

  • It is (not) realistic to expect every linguist to become fully competent in programming or more complex statistical analyses. But a shift in what is meant by 'corpus linguist' is expected: from corpus user to a "researcher into the methodology, esp. one who develops new methods and enables other linguists to apply them."

Future of Corpus Linguistics in Humanities and Science

The New Generation of Corpus Linguists Serves More Than Linguistics

  • Just as corpus linguistics has become increasingly integrated as a method with other fields of linguistics, it may/will be adopted outside linguistics by other disciplines, within the humanities and social sciences in particular.

  • The triangulation of corpus methods with other research methodologies will be an important further step in enhancing both the rigour of corpus linguistics and its incorporation into all kinds of research, both linguistic and non-linguistic.

  • Such methodological pluralism is already happening, to some extent, in the case of corpus methods and the methods of experimental psycholinguistics and neurolinguistics.

Corpus Methodology: Accessing and Analyzing Methods

  • The shift mentioned has already taken place to an extent, in that the research with the greatest impact in corpus linguistics is very often valued not for what it discovers about language, but for the methods it introduces or develops.

The findings about particular English grammatical constructions made by Stefanowitsch and Gries (2003), for example, are not especially revolutionary in themselves. It is, rather, the method that these findings exemplify - and the associated theoretical and statistical apparatus linked to collostruction - that makes this paper a key contribution to recent research in corpus linguistics.

General Use of Corpora: Who Uses Corpora and How?

  • Who? How?

  • The focus is usually on "behavioral measures" of the linguistic units of interest (frequency, distribution and co-occurrence patterns): (word) frequency, concordance, collocation, collostruction, n-grams/lexical bundles/multi-word units, etc.

  • The problem with corpus-informed research: researchers use the corpus simply as a bank of examples to illustrate a theory they are developing. This runs counter to the scientific method, insofar as there is no attempt to account for the rest of the (potentially falsifying) evidence in the corpus.

  • The development of qualitative versus quantitative approaches is not without controversy, e.g., Critical Discourse Analysis, which undertakes a detailed analysis of a small amount of data, taking into account not just the text itself, but also the social context in which it was produced and the social context in which it was interpreted.
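The frequency and co-occurrence measures listed above can be sketched in a few lines. The example below counts word frequencies and bigrams and scores one candidate collocation with pointwise mutual information (PMI). PMI is just one common association measure chosen for illustration; the slides do not prescribe it, and the toy sentence and function names are assumptions.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def pmi(tokens, w1, w2):
    """PMI(w1, w2) = log2(P(w1 w2) / (P(w1) * P(w2))), estimated from raw counts."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(bigrams(tokens))
    if bigram_counts[(w1, w2)] == 0:
        return float("-inf")  # the pair never occurs adjacently
    p_pair = bigram_counts[(w1, w2)] / (len(tokens) - 1)
    p_w1 = unigram_counts[w1] / len(tokens)
    p_w2 = unigram_counts[w2] / len(tokens)
    return math.log2(p_pair / (p_w1 * p_w2))


# Hypothetical toy corpus, already tokenized on whitespace.
tokens = "the strong tea and the strong coffee and the weak tea".split()

print(Counter(tokens).most_common(3))           # word frequency list
print(Counter(bigrams(tokens)).most_common(2))  # most frequent bigrams
print(pmi(tokens, "strong", "tea"))             # association score for one pair
```

On a corpus this small the scores mean little; PMI is also known to favor rare pairs, which is one reason real studies usually combine it with frequency thresholds or other measures.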

Corpus-based Empirical Methods

  • Fundamental commitment to empiricism

  • Both corpus linguistics and other experimental linguistics study the language system not directly but by observation of epiphenomena: either output on a large scale, or the blood-flow requirements or some other physiological feature associated with it.

  • Distribution: toward Unified Empirical Linguistics [1], where evidence of all kinds - textual, psychological and neurological - is as a matter of course used in concert to uncover the nature of language. In such a context, corpus linguistics will reach its full potential as a methodology.

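Old Problems, Now More Serious: Issues

  • Is balance needed? How big is big enough? How to determine sample size (balance, representativeness and comparability)
  • The nightmare of Chinese word segmentation: word segmentation error as unbiased sampling error
  • A corresponding empirical research methodology: new empirical methodology reloaded
  • Research ethics and personal data protection: issues of copyright/copyleft and of research ethics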
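To make the segmentation issue concrete, the sketch below runs two classic dictionary-based heuristics, forward and backward maximum matching, on the same string and gets two different segmentations. This is a textbook illustration under assumed inputs (a four-word toy lexicon and a made-up example phrase); it is not the segmenter used for ASBC or the NTU Plurk Corpus.

```python
def forward_max_match(text, lexicon, max_len=3):
    """Greedy left-to-right matching: take the longest lexicon word at each position."""
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or piece in lexicon:
                out.append(piece)
                i += size
                break
    return out


def backward_max_match(text, lexicon, max_len=3):
    """Greedy right-to-left matching: take the longest lexicon word ending at each position."""
    out, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                out.append(piece)
                j -= size
                break
    return out[::-1]


# Hypothetical toy lexicon: study, graduate student, life, origin.
LEXICON = {"研究", "研究生", "生命", "起源"}
text = "研究生命起源"  # intended reading: "study the origin of life"

print(forward_max_match(text, LEXICON))   # ['研究生', '命', '起源']
print(backward_max_match(text, LEXICON))  # ['研究', '生命', '起源']
```

Only the backward pass recovers the intended reading here, and other strings favor the forward pass, so any frequency count built on top of either output inherits its errors.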

Balance, Representativeness and Comparability

  • These three are ideals which corpus builders strive for but rarely, if ever, attain.

  • Biber's proposal for representativeness: measure internal variation within a corpus - i.e., a corpus is representative if it fully captures the variability of a language.

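  • Is this a matter of degree, or a pseudo-scientific question? (Talk of degree presupposes that 100% is clearly defined.)

The Segmentation Problem Is Inherently Unsolvable: Comparing NTU Plurk Corpus and ASBC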

PLURK corpus PLURK corpus

Things Get Worse When Scaled

PLURK corpus PLURK corpus

Conclusion and Discussion

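  • The massive growth and diversification of language data is a trend that language researchers and practitioners must take seriously.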

    From Corpus Linguistics to Cloud Linguistics

    Big Data Analysis

  • Corpus processing skills are worth learning for everyone who works with language.

    Revisiting old friends: computational linguistics

    Reinvigorated friendship: Semantic tagging and Sentiment Analysis/ Opinion Mining

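  • Equally important is methodological pluralism in empirical research, which needs time to build up and mature within the research community. This includes:

    Statistics and machine learning (assisted and self-taught)

    Looking at corpora in an exploratory way (exploratory data analysis)

    Reproducibility: doing linguistics in a reproducible way. Results only count if someone can follow your recipe and get the same answers. See http://www.reproducibleresearch.org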
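One small, concrete habit in this direction is to fix the random seed and record a fingerprint of the exact data used, so that someone else can check they are following the same recipe on the same corpus. The sketch below is only an illustration: the helper names `corpus_digest` and `reproducible_sample`, the seed value and the stand-in corpus are all assumptions.

```python
import hashlib
import random


def corpus_digest(lines):
    """Return a SHA-256 fingerprint of the corpus, to be reported alongside results."""
    h = hashlib.sha256()
    for line in lines:
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def reproducible_sample(lines, k, seed):
    """Draw k lines with a private, seeded generator so the draw can be repeated."""
    rng = random.Random(seed)
    return rng.sample(lines, k)


# Hypothetical stand-in for a real corpus: one "sentence" per list item.
corpus = [f"sentence {i}" for i in range(100)]

SEED = 42  # arbitrary; what matters is that it is reported with the results
print(corpus_digest(corpus))
print(reproducible_sample(corpus, 5, SEED))
```

Reporting the seed, the digest and the Python version together with the findings lets a reader rerun the sampling step and confirm that they start from the same data.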

Discussion for Solutions?

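  • A call for the government to take this infrastructure seriously and sustain it over the long term (a national language resource monitoring center)

  • Hoping for a paradigm shift in linguistic theory that abandons the notion of the word

    • The word is a functional unit of convenience.
    • An annotation-based mindset can hardly "solve" non-annotated phenomena (segmentation is also annotation).
  • Audience-specific customization of corpora: find the 'local optimum'

  • Human computation (crowdsourcing) and an open market

    • (A corpus marketplace: an open market for the corpora and tools that each person has prepared. There is strength in unity.) Determine a sufficient sample size and clean the corpus (easy-to-use tools are needed); this also improves mutual reproducibility.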