Introducing Corpus Linguistics [I]

An R-chitecture

Shu-Kai Hsieh 謝舒凱
Graduate Institute of Linguistics, National Taiwan University

Introducing Corpus Linguistics

Part one: Background, Theory and Success Story, Methods

Introduces the concept, history, types, and methods of corpora. (March 6)
Part two: Web Corpus: Put into Practice

Introduces tools for building and analyzing web corpora. (July 3)

Basics
Accessing and Analyzing Methods
Corpus-based Studies
WARNING: Empirical Methodology needed
Conclusion and Discussion

Basics

This guided reading is based mainly on the following book, supplemented with research notes.

McEnery and Hardie. 2012. Corpus Linguistics: Method, Theory and Practice. Cambridge. Support website

What is a Corpus?

~~Corpus callosum?~~
~~Textual collection / archive?~~
Linguistic database?
Database with linguistically annotated information

Data, Data, Data!

"Corpus linguistics is the study of language data on a large scale - the computer-aided analysis of very extensive collections of transcribed utterances or written texts." [1]

Data are values of qualitative or quantitative variables, belonging to a set of items (i.e., populations). (wiki)
Linguistic data: Fieldwork, Experiment and Corpus.
Linguistics: a data-intensive discipline and a textually mediated world.

Corpus Linguistics: History

Looking at trends evident in the history of corpus linguistics up to the present time and considering how those trends are likely to continue, or, rather, how we think they should continue. [1]

Corpus Linguistics: From Past to Future

40-50 years of development and debate on introspection vs. corpus evidence.
A separate field of linguistics? (Tool vs. theory): corpus linguistics has become an indispensable component of the methodological toolbox throughout linguistics.
But we must not confuse corpus data with language itself. Corpora allow us to observe language, but they are not language itself.
Technology drives thinking: what kind of future developments can be predicted for corpus linguistics?

A debate of uncertain value: Corpus-based versus Corpus-driven

Corpus as tool: Corpus-based studies typically use corpus data in order to explore a theory or hypothesis, typically one established in the current literature, in order to validate it, refute it or refine it. The definition of corpus linguistics as a method underpins this approach to the use of corpus data in linguistics.
Corpus as theory: Corpus-driven linguistics rejects the characterization of corpus linguistics as a method and claims instead that the corpus itself should be the sole source of our hypotheses about language (Tognini-Bonelli 2001). This is the neo-Firthian view.
"All corpus linguistics can just be described as corpus-based."

Corpus Typology

corpus typology

Determined by the research topic.

Corpus Typology

Two broad approaches to the issue of choosing what data to collect:

Monitor approach (Sinclair 1991:24-6), where the corpus continually expands to include more and more texts over time; and
Balanced or Snapshot approach (Biber 1993 and Leech 2007), where a careful sample corpus, reflecting the language as it exists at a given point in time, is constructed according to a specific sampling frame.

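The Web Corpus

Background

The information oil field: the amount of data produced every two days in 2011 equals all the data produced from the dawn of civilization up to 2003 (1.8 ZB). Professor Deb Roy's language acquisition experiment at MIT (90,000 hours over 3 years, 200 TB) (cited from Shou-De Lin 林守德, Scientific American Chinese edition 科學人, 2013.03).
Facebook: more than a hundred million new photos and fifty million new likes every day.
Digitization of the world (medical cloud, education cloud).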
Big Data Analysis: the material returned from the web search tends to be an undifferentiated mass, which requires a new Data Science to process and extract meaningful patterns.

The Web Corpus

Web for Corpus: collecting the texts from the Web.
Web as Corpus: accessing the Web as a corpus (in real-time), e.g., WebCorp.
Multimodality (images, videos and audios); Human Language Archive

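Development of Corpus Systems and Tools

A corpus and a corpus search tool are two different things that are easily confused.
The 3rd-generation concordancer is almost a synonym for "corpus analysis tool"; well-known recent tools include WordSmith, AntConc, and Xaira. But what are their limitations?
4th-generation: From Website to Web Service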
- CQP (Corpus Query Processor)(CWB) and SQL-based (BYU)
- Word Sketch Engine

Development of Corpus Systems and Tools

From built-in, to customized, to self-made analysis programs

It has been suggested that corpus linguists, rather than using general corpus analysis packages, should instead fully embrace computer programming and individually develop their own ad hoc tools to address the tasks that face them.
- you can do analyses that are not possible with concordancers;
- you can do analyses 'more quickly and more accurately';
- you can tailor the output to fit your own research needs;
- you can analyse a corpus of any size.
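
As an illustration of what such an ad hoc tool might look like, here is a minimal keyword-in-context (KWIC) concordancer in base R. It is only a sketch: the function name `kwic`, the toy text, and the window size are all assumptions made for this example, and the tokenizer only handles plain lowercase English.

```r
# Minimal KWIC concordancer (illustrative sketch; `kwic` is a hypothetical helper)
kwic <- function(text, keyword, window = 3) {
  # Lowercase, then split on anything that is not a letter or apostrophe
  tokens <- strsplit(tolower(text), "[^a-z']+")[[1]]
  tokens <- tokens[tokens != ""]
  idx  <- seq_along(tokens)
  hits <- which(tokens == tolower(keyword))
  data.frame(
    left  = vapply(hits, function(i)
      paste(tokens[idx >= i - window & idx < i], collapse = " "), character(1)),
    node  = tokens[hits],
    right = vapply(hits, function(i)
      paste(tokens[idx > i & idx <= i + window], collapse = " "), character(1)),
    stringsAsFactors = FALSE
  )
}

# Toy text (assumption, adapted from the wording of these slides)
toy_text <- "The corpus is large. A corpus allows us to observe language, but a corpus is not language itself."
conc <- kwic(toy_text, "corpus", window = 2)
print(conc)
```

Because the output is an ordinary data frame, it can be sorted, filtered, or exported in whatever shape a given study needs, which is the kind of tailoring the list above refers to.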

Implementation Techniques for Corpus Systems

Database (SQL? NoSQL? MongoDB?)
Query Language (CQL)
Index Algorithm
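
To give a feel for the indexing item, the following base R sketch builds a toy inverted index, i.e. a map from each word to the documents that contain it. The three documents and the helper `lookup` are invented for illustration; production corpus systems use far more elaborate, compressed, position-aware indexes.

```r
# Toy inverted index (illustrative sketch; documents and names are hypothetical)
docs <- c(d1 = "corpus linguistics studies language data",
          d2 = "web corpus data grows every day",
          d3 = "language is not the corpus")

tokens_by_doc <- strsplit(tolower(docs), " ", fixed = TRUE)

# Long table of (word, document) pairs
pairs <- data.frame(
  word = unlist(tokens_by_doc, use.names = FALSE),
  doc  = rep(names(docs), lengths(tokens_by_doc)),
  stringsAsFactors = FALSE
)

# Group document ids by word; unique() drops repeats within a document
index <- lapply(split(pairs$doc, pairs$word), unique)

# Look up a word; unknown words return an empty character vector
lookup <- function(index, word) {
  hit <- index[[word]]
  if (is.null(hit)) character(0) else hit
}

lookup(index, "corpus")
lookup(index, "language")
```

The index is built once, so each query becomes a lookup by name rather than a scan over every text.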

Popular English Corpora: BNC, ANC, WaC

BYU: Brigham Young University

BNCweb-CQP edition

Taiwanese Corpora: ASBC, TCCM, Social Corpus and i-Corpus (NTU-LOPE)

Diverse, audience-specific, dynamic, evolving, open, context-rich, and linked.

lope-corpora

TCCM

CWM

CWN

Broadly speaking, corpora can be diverse and fun (1)

[We feel Fine] http://www.wefeelfine.org/

Broadly speaking, corpus analysis can be diverse and fun (1)

How does language change as you travel to different regions? Recall the classic soda vs. pop vs. coke question: some people use the word “soda” to describe their soft drinks, others use “pop”, and still others use “coke”. Who says what where?

Soft drink terms around the U.S. Soft drink terms around the world

We can probably hypothesize that:

The South is pretty Coke-heavy.
Soda belongs to the Northeast and far West.
Pop gets the mid-West, except for some interesting spots of blue around Wisconsin and the Illinois-Missouri border.
“pop” seems to be prevalent only in parts of the United States and Canada.

Broadly speaking, corpus analysis can be diverse and fun (2)

You can also do some other fun observations during break: What do people want for Christmas, compared to what they actually get?

corpus typology

On the road ahead, academia and industry must not give up on each other

Google Endangered Language Project

Opening Pandora's Box

Pandora's box: multivariate, multimodal, multimedia, multilingualism, etc.
From Corpus Linguistics to Cloud Linguistics.
Multilingual, multimodal, open, cross-disciplinary, multimedia, and cloud-based.

Future of Corpus `in` Linguistics

A call for reform in linguistics education

Urgent need for linguistic data scientists and/or language engineers!

There will be an important role for corpus specialists whose research is concerned with the methodology itself - the construction and annotation of corpora, the development of new tools and new procedures, the expansion of the conceptual bases of the methodology and other such issues. [1]

linguistic data scientist

It is (not) realistic to expect every linguist to become fully competent in programming or more complex statistical analyses. But a shift in what is meant by 'corpus linguist' is expected: from corpus user to a "researcher into the methodology, esp. one who develops new methods and enables other linguists to apply them."

Future of Corpus Linguistics `in` Humanities and Science

The new generation of corpus linguists serves more than linguistics

Just as corpus linguistics has become increasingly integrated as a method with other fields of linguistics, it may/will be adopted outside linguistics by other disciplines within the humanities and social science in particular.
The triangulation of corpus methods with other research methodologies will be an important further step in enhancing both the rigour of corpus linguistics and its incorporation into all kinds of research, both linguistic and non-linguistic.

Such methodological pluralism is already happening, to some extent, in the case of corpus methods and the methods of experimental psycholinguistics and neurolinguistics.

Brain and Corpus Evidence

Some attempts to link neurolinguistics and corpus linguistics.

Corpus-driven Semantic Space

NTU-AS-Toulouse M3 project

linguistic data scientist

Accessing and Analyzing Methods

The shift mentioned has already taken place to an extent, in that the research with the greatest impact in corpus linguistics is very often valued not for what it discovers about language, but for the methods it introduces or develops.

The findings about particular English grammatical constructions made by Stefanowitsch and Gries (2003), for example, are not especially revolutionary in themselves. It is, rather, the method that these findings exemplify - and the associated theoretical and statistical apparatus linked to collostruction - that makes this paper a key contribution to recent research in corpus linguistics.

Who Uses Corpora, and How?

Who? How?
The focus is usually on "behavioral measures" of the linguistic units of interest (frequency, distribution, and co-occurrence patterns): (word) frequency, concordance, collocation, collostruction, n-grams/lexical bundles/multi-word units, etc.
The problem with corpus-informed research: researchers use the corpus simply as a bank of examples to illustrate a theory they are developing. This runs counter to the scientific method, insofar as there is no attempt to account for the rest of the (potentially falsifying) evidence in the corpus.
The development of qualitative versus quantitative approaches is not without controversy, e.g., Critical Discourse Analysis, which undertakes a detailed analysis of a small amount of data, taking into account not just the text itself, but also the social context in which it was produced and the social context in which it was interpreted.
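
The behavioral measures named above (frequency, n-grams, co-occurrence) can be sketched in a few lines of base R. The sentence is a toy example, and pointwise mutual information is used here only as one simple association score chosen for illustration; the slides do not prescribe a particular measure.

```r
# Toy text (assumption): whitespace-separated, already lowercased
text   <- "the cat sat on the mat and the cat saw the dog"
tokens <- strsplit(text, " ", fixed = TRUE)[[1]]

# Word frequency list, most frequent first
freq <- sort(table(tokens), decreasing = TRUE)

# Bigrams: pair each token with the one that follows it
bigrams     <- paste(head(tokens, -1), tail(tokens, -1))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Pointwise mutual information for one bigram, as a simple association score
pmi <- function(w1, w2) {
  p_pair <- bigram_freq[[paste(w1, w2)]] / length(bigrams)
  p_w1   <- freq[[w1]] / length(tokens)
  p_w2   <- freq[[w2]] / length(tokens)
  log2(p_pair / (p_w1 * p_w2))
}

print(freq)
print(bigram_freq)
pmi("the", "cat")
```

Counting every token and every bigram, rather than picking out convenient examples, is also what keeps the analysis accountable to the whole corpus.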

Corpus-based Empirical Methods

Fundamental commitment to empiricism
Both corpus linguistics and experimental linguistics study the language system not directly but through the observation of epiphenomena: output on a large scale, or blood-flow requirements or some other physiological feature associated with it.
Distribution: toward a Unified Empirical Linguistics [1], where evidence of all kinds - textual, psychological and neurological - is as a matter of course used in concert to uncover the nature of language. In such a context, corpus linguistics will reach its full potential as a methodology.

Empirical Methods and Statistics

For most of our lives, we have to make decisions based on incomplete information ... Empirical methods and statistics are two sides of the same coin: it is pointless to study one without the other. Statistics not only provide us with methods for summarizing sample data sets, they also allow us to make confident statements about entire populations.

- **Summarizing the data**: descriptive statistics summarize various attributes of a variable.

- **Characterizing the data**: Prior to building a predictive model or looking for hidden trends in the data, 
it is important to characterize the variables and the relationships between them and statistics gives us many 
tools to accomplish this.

- **Making statements about 'hidden' facts**: Once a group of observations within the data has been defined 
as interesting through the use of data mining techniques, 
statistics gives us the ability to make confident statements about the group.
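
The three roles of statistics listed above can be sketched with base R. The data are simulated: the sentence lengths, group means, and sample sizes are assumptions made up for this example, not measurements from any corpus.

```r
set.seed(42)  # make the simulated data reproducible

# Hypothetical sentence lengths (in words) for two registers
lengths_df <- data.frame(
  register = rep(c("spoken", "written"), each = 50),
  n_words  = c(round(rnorm(50, mean = 12, sd = 4)),
               round(rnorm(50, mean = 20, sd = 6)))
)

# 1. Summarizing: attributes of a single variable
summary(lengths_df$n_words)

# 2. Characterizing: how the variable relates to another (here, register)
aggregate(n_words ~ register, data = lengths_df, FUN = mean)

# 3. Making statements about the population: Welch two-sample t-test
tt <- t.test(n_words ~ register, data = lengths_df)
tt$conf.int   # confidence interval for the difference in means
tt$p.value
```

A t-test is only one option; whether it is appropriate depends on the shape of the data, which is the point of choosing a statistical test deliberately.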

Of course, the more you know, the better

(Kochanski, 2009): How to formulate empirical research questions and hypotheses (with a good understanding of the fundamental logic of experiment design)?

Counting and Sampling: When are two counts significantly different? When counts get sparse: how to deal with just a few examples
$\sqrt{N}$ is your friend: planning the size of an experiment. How many subjects? How much can you say if it's significant? If it's not? Bonferroni corrections and doing more than one test.
Choosing your statistical test: (Gaussian or non-Gaussian; Continuous data, ordered data, vs. separate categories; Paired sample vs. not; t-tests or Non-parametric relatives of t-tests)
Linear regression and the stuff you do to prepare for it. If your data is too rich: Principal component analysis, Multidimensional Scaling, etc.
Modern statistics that you should be aware of: Monte-Carlo simulation; Bootstrap Resampling, etc.
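
Several of the items above map onto base R functions. The sketch below uses invented counts and p-values throughout (all numbers are assumptions for illustration): a chi-squared test for two counts, Fisher's exact test when counts are sparse, a Bonferroni correction, and a bootstrap confidence interval.

```r
# When are two counts significantly different?
# Hypothetical: a word occurs 120 times in corpus A and 80 times in corpus B,
# each corpus containing 100,000 tokens.
counts <- matrix(c(120, 99880,
                    80, 99920),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(corpus = c("A", "B"),
                                 token  = c("target", "other")))
chi <- chisq.test(counts)
chi$p.value

# When counts get sparse, an exact test is the safer choice
sparse <- matrix(c(3, 997,
                   9, 991), nrow = 2, byrow = TRUE)
fisher.test(sparse)$p.value

# Doing more than one test: Bonferroni multiplies each p-value by the
# number of tests (capped at 1)
p_raw <- c(0.004, 0.03, 0.20)
p_adj <- p.adjust(p_raw, method = "bonferroni")
p_adj

# Bootstrap resampling: a confidence interval for a mean without
# assuming a Gaussian distribution
set.seed(1)
x <- c(3, 5, 2, 8, 6, 4, 7, 5, 9, 4)
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))
ci <- quantile(boot_means, c(0.025, 0.975))
ci
```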

One more reminder: linguists must not underestimate the expertise that data analysis requires.

Starting with general data analysis needs:

You have some linguistic data that you need to collect, summarize, transform, explore, visualize, or present. In a word, we want to make sense of our data and communicate that understanding to others!

What would be the best collection of methods helpful in exploring corpus data?

The corpus processing pipeline you should at least understand

In general, corpus data science involves a chain of tasks:

Pre-processing (cleaning, tokenizing, segmentation, etc)
Data annotation (Semi-automatic) Labeling (POS tagging) and Management
Exploratory Data Analysis (with workable knowledge of Statistics)
Hypothesis testing
Prediction and Statistical Modeling, etc
Presentation and Web application (Demo: Shiny-LexicoR)

Pre-processing

Data collection and running a pilot experiment

Getting word frequencies from Google:

- Do you get about the same frequencies on different days? 
- Does it reproduce (approximately) known frequency ratios? 
- Look at some of the documents: Are they what you expect? 
- Find the words in the documents: Are they used in the way you expect?

Pre-processing

Cleaning the data: Prior to analysis, it is important to consider applying certain transformations to the data, since many data analysis methods will have difficulty making sense of data in its raw form. Some common transformations include normalization, etc.
Data segmentation/tokenization
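
A minimal cleaning and tokenizing step might look like the following base R sketch. The function names, the toy input, and the specific normalization choices (lowercasing, dropping URLs and punctuation) are assumptions for illustration. Splitting on spaces only works for space-delimited scripts; Chinese text needs a dedicated word segmenter.

```r
# Toy raw data (assumption): messy web-style text
raw <- c("  Corpus   Linguistics is FUN!!  ",
         "Visit http://example.org for more :)")

clean_text <- function(x) {
  x <- tolower(x)                      # case normalization
  x <- gsub("https?://[^ ]+", " ", x)  # drop URLs
  x <- gsub("[^a-z' ]+", " ", x)       # keep letters, apostrophes, spaces
  x <- gsub(" +", " ", x)              # collapse repeated spaces
  trimws(x)
}

# Split cleaned text into word tokens, one character vector per document
tokenize <- function(x) strsplit(clean_text(x), " ", fixed = TRUE)

tokens <- tokenize(raw)
tokens
```

Every choice here (what counts as noise, whether case matters) changes the counts downstream, so it is worth recording the cleaning steps alongside the results.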

Annotation

The corpus processing skill that linguists should be best at.

Annotated vs. unannotated: whether or not the corpus has been analysed in a particular way yet. In annotation we engage in a process of labeling.
Metadata tells you something about the text itself; Textual Markup encodes information within the text other than the actual words; and Annotation encodes linguistic information within a corpus text in such a way that we can systematically and accurately recover that analysis later.
Corpus annotation is a commonplace of linguistics (!), but that does not mean we always do it well.
Issues: assumptions, inaccuracy, and inconsistency (also in terms of inter-annotator agreement).
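
Inter-annotator agreement is often quantified with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below implements it in base R; the two annotators' POS labels are invented for illustration, and `cohen_kappa` is a hypothetical helper name.

```r
# Hypothetical POS labels from two annotators for the same ten tokens
ann1 <- c("N", "V", "N", "ADJ", "V", "N", "N", "V", "ADJ", "N")
ann2 <- c("N", "V", "N", "N",   "V", "N", "V", "V", "ADJ", "N")

cohen_kappa <- function(a, b) {
  lv  <- union(a, b)  # shared label set so the table is square
  tab <- table(factor(a, levels = lv), factor(b, levels = lv))
  n   <- sum(tab)
  p_obs <- sum(diag(tab)) / n                        # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2    # chance agreement
  (p_obs - p_exp) / (1 - p_exp)
}

kappa <- cohen_kappa(ann1, ann2)
kappa
```

Here the annotators agree on 8 of 10 tokens, but kappa comes out lower than 0.8 because some of that agreement would be expected by chance given how often each label is used.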

Exploratory Data Analysis

EDA is more than the methods – it represents an attitude or philosophy about how data should be explored. Tukey (1977) makes a clear distinction between confirmatory data analysis, where one is primarily interested in drawing inferential conclusions, and exploratory methods, where one is placing few assumptions on the distributional shape of the data and simply looking for interesting patterns.
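
In the exploratory spirit described above, one can start by simply looking at a variable before testing anything. This base R sketch uses two of Tukey's own tools, the five-number summary and the stem-and-leaf display; the sentence and the choice of word length as the variable are assumptions for illustration.

```r
# Toy sentence (assumption, adapted from the wording of these slides)
sentence <- "corpora allow us to observe language but they are not language itself"
tokens   <- strsplit(sentence, " ", fixed = TRUE)[[1]]
word_len <- nchar(tokens)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(word_len)
five

# Stem-and-leaf display: a quick text-only look at the distribution's shape
stem(word_len)

# Frequency table of word lengths
table(word_len)
```

None of these steps assumes a particular distribution; they only show what is there.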

Issues that exploratory analysis must address: Total accountability, Falsifiability and Replicability

Corpora bring linguistic research back into the context of the scientific method (Leech, 1992).
Confirmation bias: If you approach a corpus with a specific theory in mind, it can be easy to unintentionally focus on and pull out only the examples from the corpus that support the theory. But the theory can never be shown to be false by such an approach.
Such an approach runs counter to one of the key features of the scientific method identified by Popper (1934): falsifiability.
The principle of total accountability is simply that we must not select a favourable subset of the data in this way.
Another key feature of the scientific method is replicability. A result is considered replicable if a reapplication of the methods that led to it consistently produces the same result. New results are typically considered provisional until they are known to be replicable.

Employing Statistical Techniques

Descriptive statistics
Significance and Hypotheses Testing
Multivariate Analysis: investigating structure and relationships within the data, rather than testing the significance of a particular result, e.g., factor analysis, clustering, multi-dimensional methods, etc.
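
As a small multivariate illustration, the base R sketch below clusters five texts by three linguistic features and runs a principal component analysis on the same table. The feature values (rates per 1,000 words) and text names are invented for this example.

```r
# Hypothetical feature rates per 1,000 words for five texts
feat <- matrix(c( 20, 300, 15,
                  25, 290, 12,
                  90, 180,  2,
                 100, 170,  1,
                  95, 175,  3),
               nrow = 5, byrow = TRUE,
               dimnames = list(c("news1", "news2", "chat1", "chat2", "chat3"),
                               c("pronouns", "nouns", "passives")))

# Hierarchical clustering on standardized features
d      <- dist(scale(feat))
hc     <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)   # cut the tree into two clusters
groups

# Principal component analysis of the same table
pca <- prcomp(feat, scale. = TRUE)
summary(pca)
```

The aim is to see which texts pattern together and along which dimensions, not to test a single hypothesis.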

Highly recommended: R/R-chitecture (enhanced with Python)

R-chitecture

When analyzing moderate-to-large data sets, Excel and other corpus tools don't have the power or flexibility. R was designed for these situations, with good graphical capabilities and a large, robust library of contributed packages.

Examples: Corpus-based Studies

Function: Functional Linguistics
Computation: Computational Linguistics
Process: Psycholinguistics and Language Acquisition
Application: Lexicography and Language Teaching

Functional Linguistics

Computational Linguistics

Psycholinguistics and Language Acquisition

Lexicography and Language Teaching

WARNING Words

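Things to watch out for

Is balance needed? How large is large enough? Balance, representativeness and comparability
The nightmare of Chinese word segmentation
Your empirical research methodology: empirical methodology needed
Data science will permeate the humanities and social sciences; pay attention to the development and exploitation of corpus tools
Experimental ethics and personal data protection: issues of copyright law and of research ethics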

Comparing Corpora: Why and How?

COCA and ANC http://corpus.byu.edu/coca/compare-anc.asp
LOB and Brown
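
One common way to compare a word's frequency across two corpora is a log-likelihood (G²) keyness score. It is swapped in here purely as an illustration of the "how"; the slides do not prescribe a measure. The sketch uses the simplified two-cell form of the statistic, with invented counts and corpus sizes, and `log_likelihood` is a hypothetical helper name.

```r
# G^2 for one word: a, b are its frequencies in corpus 1 and corpus 2;
# n1, n2 are the corpus sizes in tokens.
log_likelihood <- function(a, b, n1, n2) {
  e1 <- n1 * (a + b) / (n1 + n2)   # expected frequency in corpus 1
  e2 <- n2 * (a + b) / (n1 + n2)   # expected frequency in corpus 2
  term <- function(o, e) if (o == 0) 0 else o * log(o / e)
  2 * (term(a, e1) + term(b, e2))
}

# Hypothetical counts: 120 hits in 100,000 tokens vs 80 hits in 100,000 tokens
g2 <- log_likelihood(120, 80, 1e5, 1e5)
g2

# Compare against the chi-squared critical value for 1 degree of freedom
g2 > qchisq(0.95, df = 1)
```

Because the expected frequencies are scaled by corpus size, the score can be used even when the two corpora differ greatly in size, although it says nothing about whether they are comparable in composition.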

Balance, Representativeness and Comparability

These three are ideals which corpus builders strive for but rarely, if ever, attain.
Biber's proposal for representativeness: measure internal variation within a corpus - i.e., a corpus is representative if it fully captures the variability of a language.
Is this a matter of degree, or a pseudo-scientific question? (Speaking of degree presupposes that 100% is clearly defined.)

Comparing NTU Plurk Corpus and ASBC

PLURK corpus

Conclusion and Discussion

Conclusion

The massive growth and diversity of language data is a trend that language researchers and practitioners must take seriously.

From Corpus Linguistics to Cloud Linguistics

Big Data Analysis
Corpus processing techniques are worth learning for every language professional.

Revisiting old friends: computational linguistics

Reinvigorated friendship: Semantic tagging and Sentiment Analysis/ Opinion Mining

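Equally important is methodological pluralism in empirical research, which needs time to build up and mature within the research community. This includes:

Statistics and machine learning (assisted and self-taught)

Looking at corpus data in an exploratory way (exploratory data analysis)

Reproducibility: doing linguistics in a reproducible way. Results only count if someone can follow your recipe and get the same answers. See http://www.reproducibleresearch.org

Open Vocabulary Project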

PLURK corpus

# Scatter plot of horsepower against fuel economy with a smoothed trend line
library(ggplot2)
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() + geom_smooth()

plot of chunk md-cars-scatter

Reference

[1] Tony McEnery and Andrew Hardie. 2012. Corpus Linguistics. Cambridge Textbooks in Linguistics.

You are welcome to write in and join us: email shukaihsieh@ntu.edu.tw