Table of Contents
- Background
- Issues
- Topics for Discussion
謝舒凱
Graduate Institute of Linguistics, National Taiwan University
McEnery and Hardie. 2012. Corpus Linguistics: Method, Theory and Practice. Cambridge. Support website
40-50 years of development and debate on introspection vs. corpus evidence.
A separate field of linguistics? (Tool vs. theory) > Corpus linguistics has become an indispensable component of the methodological toolbox throughout linguistics.
But we must not confuse corpus data with language itself. Corpora allow us to observe language, but they are not language itself.
Technology shapes the way scientific work is conceived. What kind of future developments can be predicted for corpus linguistics?
The corpus-as-tool view: Corpus-based studies typically use corpus data in order to explore a theory or hypothesis, typically one established in the current literature, in order to validate it, refute it or refine it. The definition of corpus linguistics as a method underpins this approach to the use of corpus data in linguistics.
The corpus-as-theory view: Corpus-driven linguistics rejects the characterization of corpus linguistics as a method and claims instead that the corpus itself should be the sole source of our hypotheses about language (Tognini-Bonelli 2001). >> neo-Firthians
"All corpus linguistics can just be described as corpus-based."
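The information oil field: in 2011, the amount of data produced every two days equaled all the data produced from the beginning of human civilization up to 2003 (1.8 ZB). MIT professor Deb Roy's language acquisition experiment (3 years, 90,000 hours, 200 TB) (cited from 林守德, 科學人, 2013.03).
90% of all the data humans have produced to date was created in the past two years.
Facebook gains over a hundred million photos and fifty million likes every day; Twitter produces 230 million tweets (7 TB) every day.
Digitization of the world (medical cloud, education cloud): everything is instrumented (every object in the lived world is sensed), interconnected (data can be sent to a back end for analysis), and intelligent (AI-web supports decision-making).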
Big Data Analysis: the material returned from the web search tends to be an undifferentiated mass, which requires a new Data Science to process and extract meaningful patterns.
Web for Corpus: collecting the texts from the Web.
Web as Corpus: accessing the Web as a corpus (in real-time), e.g., WebCorp.
Multimodality (images, video and audio); Human Language Archive
A corpus and a corpus search tool or system are two different things that are easily confused.
The 3rd-generation concordancer is practically a synonym for corpus analysis tool. Well-known recent tools include WordSmith, AntConc, and Xaira, but what are their limitations?
4th generation: from website to web service
From built-in, to customized, to self-built analysis tools
It has been suggested that corpus linguists, rather than using general corpus analysis packages, should instead fully embrace computer programming and individually develop their own ad hoc tools to address the tasks that face them.
Building a web corpus has become easy (e.g., BootCaT), and so has using APIs (Plurk, Sina Weibo, and ... everything!)
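As a rough illustration of what a self-built ad hoc tool might look like, here is a minimal keyword-in-context (KWIC) concordancer in Python. It is a sketch under simple assumptions, not a replacement for WordSmith or AntConc: the tokenizer is a crude regular expression that only handles lowercase ASCII text, and the function names `tokenize` and `kwic` are hypothetical, chosen only for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and extract runs of letters, digits, or apostrophes.

    This is a deliberately crude tokenizer (an assumption for the sketch);
    real corpora, and Chinese text in particular, need proper segmentation.
    """
    return re.findall(r"[a-z0-9']+", text.lower())

def kwic(tokens, keyword, window=3):
    """Return (left context, node, right context) for every hit of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Sample sentence taken from the slides above.
sample = "Corpora allow us to observe language, but they are not language itself."
tokens = tokenize(sample)
lines = kwic(tokens, "language")

# Print the concordance with the node word aligned in one column.
for left, node, right in lines:
    print(f"{left:>30} [{node}] {right}")
```

The point of writing even a tool this small is control: the window size, the tokenization, and the output format can all be changed to fit the research question, which is harder with a packaged concordancer.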




[We feel Fine] http://www.wefeelfine.org/
How does language change as you travel to different regions? Recall the classic soda vs. pop vs. coke question: some people use the word “soda” to describe their soft drinks, others use “pop”, and still others use “coke”. Who says what where?

We can probably hypothesize that:
You can also make some other fun observations during the break: What do people want for Christmas, compared to what they actually get?

Google Endangered Language Project
in Linguistics: a call for reform in linguistics education
Linguistic data scientist and/or language engineer!
There will be an important role for corpus specialists whose research is concerned with the methodology itself - the construction and annotation of corpora, the development of new tools and new procedures, the expansion of the conceptual bases of the methodology and other such issues. [1]
in Humanities and Science: the new generation of corpus linguists serves more than linguistics alone
Just as corpus linguistics has become increasingly integrated as a method with other fields of linguistics, it may/will be adopted outside linguistics by other disciplines within the humanities and social science in particular.
The triangulation of corpus methods with other research methodologies will be an important further step in enhancing both the rigour of corpus linguistics and its incorporation into all kinds of research, both linguistic and non-linguistic.
The findings about particular English grammatical constructions made by Stefanowitsch and Gries (2003), for example, are not especially revolutionary in themselves. It is, rather, the method that these findings exemplify - and the associated theoretical and statistical apparatus linked to collostruction - that makes this paper a key contribution to recent research in corpus linguistics.
Who? How is it used?
The focus is usually on "behavioral measures" of the linguistic units of interest (frequency, distribution, and co-occurrence patterns): (word) frequency, concordance, collocation, collostruction, n-grams / lexical bundles / multi-word units, etc.
The problem with corpus-informed research: researchers use the corpus simply as a bank of examples to illustrate a theory they are developing. This runs counter to the scientific method, insofar as there is no attempt to account for the rest of the (potentially falsifying) evidence in the corpus.
Whether to develop in a qualitative or a quantitative direction is not without controversy, e.g., Critical Discourse Analysis.
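The measures listed above can be sketched in a few lines of Python. The example below computes word frequencies, bigrams (the simplest n-grams), and pointwise mutual information (PMI) as a collocation score. PMI is only one common association measure and is chosen here as an assumption; the slides do not name a specific one. The toy sentence and the helper names `tokenize`, `ngrams`, and `pmi` are likewise made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Crude lowercase tokenizer (an assumption for this sketch)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return all contiguous n-grams as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def pmi(bigram, unigram_freq, bigram_freq, n_tokens, n_bigrams):
    """Pointwise mutual information: log2( P(x, y) / (P(x) * P(y)) )."""
    w1, w2 = bigram
    p_xy = bigram_freq[bigram] / n_bigrams
    p_x = unigram_freq[w1] / n_tokens
    p_y = unigram_freq[w2] / n_tokens
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical toy corpus, far too small for real conclusions.
text = "the cat sat on the mat and the cat slept on the mat"
tokens = tokenize(text)

unigram_freq = Counter(tokens)       # word frequency list
bigram_list = ngrams(tokens, 2)      # all adjacent word pairs
bigram_freq = Counter(bigram_list)   # bigram frequency list

print(unigram_freq.most_common(3))
for bigram, count in bigram_freq.most_common(3):
    score = pmi(bigram, unigram_freq, bigram_freq, len(tokens), len(bigram_list))
    print(bigram, count, round(score, 3))
```

On real data, PMI is known to favor rare pairs, so it is usually combined with a minimum frequency threshold or compared against other measures such as log-likelihood.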
To undertake a detailed analysis of a small amount of data, taking into account not just the text itself, but also the social context in which it was produced and the social context in which it was interpreted.
Fundamental commitment to empiricism
Both corpus linguistics and experimental linguistics study the language system not directly but through observation of epiphenomena - output on a large scale, or the blood-flow requirements or some other physiological feature associated with it.
Distribution: toward a Unified Empirical Linguistics [1], where evidence of all kinds - textual, psychological and neurological - is as a matter of course used in concert to uncover the nature of language. In such a context, corpus linguistics will reach its full potential as a methodology.
These three are ideals which corpus builders strive for but rarely, if ever, attain.
Biber's proposal for representativeness: measure internal variation within a corpus - i.e., a corpus is representative if it fully captures the variability of a language.
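Is this a matter of degree, or a pseudo-scientific question? (A matter of degree presupposes that what counts as 100% is clearly defined.)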
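To make the idea of measuring internal variation concrete, the sketch below computes how much the rate of one feature varies across the texts of a small corpus. This is not Biber's actual procedure; it is only a toy illustration of the general idea, assuming the feature is a single word and that variation is summarized by the standard deviation and coefficient of variation of its per-text rate. The texts and the function names are hypothetical.

```python
import re
import statistics

def tokenize(text):
    """Crude lowercase tokenizer (an assumption for this sketch)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def rate_per_thousand(tokens, feature):
    """Occurrences of the feature per 1,000 tokens in one text."""
    return 1000 * tokens.count(feature) / len(tokens)

def internal_variation(texts, feature):
    """Summarize how much the feature rate varies across texts."""
    rates = [rate_per_thousand(tokenize(t), feature) for t in texts]
    mean = statistics.mean(rates)
    sd = statistics.stdev(rates)  # sample standard deviation
    cv = sd / mean if mean else float("nan")
    return {"rates": rates, "mean": mean, "sd": sd, "cv": cv}

# Hypothetical mini-corpus of three "texts" with four tokens each.
texts = ["the a b c", "the the a b", "a b c d"]
result = internal_variation(texts, "the")
print(result)
```

A high coefficient of variation suggests that the texts differ a lot with respect to this feature, so more or more varied samples would be needed before the corpus could plausibly be said to capture that variability. Where to put the threshold is exactly the open question raised above.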


Massive growth and diversity of data is a trend that language researchers and practitioners must take seriously. From Corpus Linguistics to Cloud Linguistics
Big Data Analysis
Corpus processing techniques are worth learning for everyone who works with language. Revisiting old friends: computational linguistics
Reinvigorated friendship: Semantic tagging and Sentiment Analysis/ Opinion Mining
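Methodological pluralism needs time to accumulate and mature within the research community. This includes: statistics and machine learning (assisted and self-taught)
Looking at corpus data in an exploratory way (exploratory data analysis)
Reproducibility: doing linguistics in a reproducible way. Results only count if someone can follow your recipe and get the same answers. See http://www.reproducibleresearch.org
A call for a proactive government to take this infrastructure seriously and sustain it over the long term (a national language resource monitoring center)
Anticipating a paradigm shift in linguistic theory: abandoning the concept of the word
Audience-specific customization of corpora: find the 'local optimum'
Human computation (crowd-sourcing) and the open market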