Introducing Corpus Linguistics [I]

An R-chitecture

Shu-Kai Hsieh 謝舒凱
Graduate Institute of Linguistics, National Taiwan University

Introducing Corpus Linguistics

  • Part one: Background, Theory and Success Story, Methods

    Introduces the concept, history, types and methods of corpora. (March 6)

  • Part two: Web Corpus: Put into Practice

    Introduces tools for building and analyzing web corpora. (July 3)

Table of Contents

  • Basics
  • Accessing and Analyzing Methods
  • Corpus-based Studies
  • WARNING: Empirical Methodology needed
  • Conclusion and Discussion

Basics

This guided reading is based mainly on the following book, with added research notes.

McEnery and Hardie. 2012. Corpus Linguistics: Method, Theory and Practice. Cambridge. Support website

What Is a Corpus?

  • The corpus callosum?
  • A textual collection or archive?
  • A linguistic database?
  • A database with linguistically annotated information

Data, Data, Data!

"Corpus linguistics is the study of language data on a large scale - the computer-aided analysis of very extensive collections of transcribed utterances or written texts." [1]

  • Data are values of qualitative or quantitative variables, belonging to a set of items (i.e., populations). (wiki)

  • Linguistic data: Fieldwork, Experiment and Corpus.

  • Linguistics: a data-intensive discipline and a textually mediated world.

Corpus Linguistics: History

Looking at trends evident in the history of corpus linguistics up to the present time and considering how those trends are likely to continue, or, rather, how we think they should continue. [1]

Corpus Linguistics: From Past to Future

  • 40-50 years of development and debate on introspection vs. corpus evidence.

  • A separate field of linguistics? (Tool vs. theory.) Corpus linguistics has become an indispensable component of the methodological toolbox throughout linguistics.

  • But we must not confuse corpus data with language itself. Corpora allow us to observe language, but they are not language itself.

  • Technology drives thinking: what kind of future progressions can be predicted for corpus linguistics?

A debate of uncertain value: Corpus-based versus Corpus-driven

  • The corpus-as-tool view: Corpus-based studies typically use corpus data in order to explore a theory or hypothesis, typically one established in the current literature, in order to validate it, refute it or refine it. The definition of corpus linguistics as a method underpins this approach to the use of corpus data in linguistics.

  • The corpus-as-theory view: Corpus-driven linguistics rejects the characterization of corpus linguistics as a method and claims instead that the corpus itself should be the sole source of our hypotheses about language (Tognini-Bonelli 2001). This is the neo-Firthian position.

  • "All corpus linguistics can just be described as corpus-based."

Corpus Typology

[Figure: corpus typology]

Determined by the research topic.

Corpus Typology

Two broad approaches to the issue of choosing what data to collect:

  • Monitor approach (Sinclair 1991:24-6), where the corpus continually expands to include more and more texts over time; and

  • Balanced or Snapshot approach (Biber 1993 and Leech 2007), where a careful sample corpus, reflecting the language as it exists at a given point in time, is constructed according to a specific sampling frame.

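The Web Corpus

Background

  • The information oil field: in 2011, the amount of data produced every two days equalled all the data produced from the dawn of civilization up to 2003 (1.8 ZB). Professor Deb Roy's language acquisition experiment at MIT (3 years, 90,000 hours, 200 TB) (cited from Lin Shou-de 林守德, 科學人 [Scientific American, Taiwan edition], 2013.03).

  • Facebook: more than a hundred million photos and fifty million likes are added every day.

  • Digitization of the world (healthcare cloud, education cloud)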

  • Big Data Analysis: the material returned from the web search tends to be an undifferentiated mass, which requires a new Data Science to process and extract meaningful patterns.

The Web Corpus

  • Web for Corpus: collecting the texts from the Web.

  • Web as Corpus: accessing the Web as a corpus (in real-time), e.g., WebCorp.

  • Multimodality (images, video and audio); Human Language Archive

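Development of Corpus Systems and Tools

  • A corpus and a corpus search tool are two things that are easily confused.

  • Third-generation concordancers are almost synonymous with corpus analysis tools; well-known recent tools include WordSmith, AntConc and Xaira. But what are their limitations?

  • 4th generation: from website to web service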

    • CQP (Corpus Query Processor)(CWB) and SQL-based (BYU)
    • Word Sketch Engine

Development of Corpus Systems and Tools

From built-in, to customized, to self-made analysis tools

  • It has been suggested that corpus linguists, rather than using general corpus analysis packages, should instead fully embrace computer programming and individually develop their own ad hoc tools to address the tasks that face them.

    • you can do analyses that are not possible with concordancers;
    • you can do analyses 'more quickly and more accurately';
    • you can tailor the output to fit your own research needs;
    • you can analyse a corpus of any size.

Implementation Techniques for Corpus Systems

  • Database (SQL? NoSQL? MongoDB?)

  • Query language (CQL)

  • Indexing algorithm
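
To make the "indexing algorithm" item concrete, here is a minimal sketch of a positional inverted index in Python (the language the slides mention as a companion to R). The names `build_index` and `find_phrase` and the two toy documents are assumptions made for illustration only; production engines such as CWB rely on far more compact, compressed structures.

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def find_phrase(index, first, second):
    """Return (doc_id, position) pairs where `first` is directly followed by `second`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        hits.extend((doc_id, p) for p in positions if p + 1 in following)
    return sorted(hits)

# Toy documents (hypothetical), keyed by document id.
docs = {
    "d1": "corpus linguistics is the study of language data",
    "d2": "a corpus is not language itself",
}
index = build_index(docs)
hits = find_phrase(index, "corpus", "linguistics")
```

Because token positions are stored, a two-word query is answered by looking up both words and comparing positions, without rescanning the texts. That is the basic reason an indexed corpus system can stay responsive as the corpus grows.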

Popular English corpora: BNC, ANC, WaC

Taiwan corpora: ASBC, TCCM, Social Corpus and i-Corpus (NTU-LOPE)

  • Diverse, audience-specific, dynamic, evolving, open, context-rich, linked.

[Figures: lope-corpora; TCCM; CWM; CWN]

Broadly speaking, corpora can be diverse and fun (1)

Broadly speaking, corpus analysis can be diverse and fun (1)

"How does language change as you travel to different regions? Recall the classic soda vs. pop vs. coke question: some people use the word “soda” to describe their soft drinks, others use “pop”, and still others use “coke”. Who says what where?"

[Figures: Soft drink terms around the U.S.; Soft drink terms around the world]

We can probably hypothesize that:

  • The South is pretty Coke-heavy.
  • Soda belongs to the Northeast and far West.
  • Pop gets the mid-West, except for some interesting spots of blue around Wisconsin and the Illinois-Missouri border.
  • “pop” seems to be prevalent only in parts of the United States and Canada.

Broadly speaking, corpus analysis can be diverse and fun (2)

You can also do some other fun observations during break: What do people want for Christmas, compared to what they actually get?

[Figure: corpus typology]

On the road ahead, academia and industry must not give up on each other

Opening Pandora's Box

  • Pandora's box: multivariate, multimodal, multimedia, multilingualism, etc.
  • From Corpus Linguistics to Cloud Linguistics.
  • Multilingual, multimodal, open, cross-disciplinary, multimedia, cloud.

Future of Corpus in Linguistics

A call for reform in linguistics education

  • Urgent need for linguistic data scientists and/or language engineers!

There will be an important role for corpus specialists whose research is concerned with the methodology itself - the construction and annotation of corpora, the development of new tools and new procedures, the expansion of the conceptual bases of the methodology and other such issues. [1]

[Figure: linguistic data scientist]

  • It is (not) realistic to expect every linguist to become fully competent in programming or more complex statistical analyses. But a shift in what is meant by 'corpus linguist' is expected: from corpus user to a "researcher into the methodology, esp. one who develops new methods and enables other linguists to apply them."

Future of Corpus Linguistics in Humanities and Science

The new generation of corpus linguists serves more than linguistics

  • Just as corpus linguistics has become increasingly integrated as a method with other fields of linguistics, it may/will be adopted outside linguistics by other disciplines within the humanities and social science in particular.

  • The triangulation of corpus methods with other research methodologies will be an important further step in enhancing both the rigour of corpus linguistics and its incorporation into all kinds of research, both linguistic and non-linguistic.

  • Such methodological pluralism is already happening, to some extent, in the case of corpus methods and the methods of experimental psycholinguistics and neurolinguistics.

Brain and Corpus Evidence

  • Some attempts to link neurolinguistics and corpus linguistics.

Corpus-driven Semantic Space

NTU-AS-Toulous M3 project

[Figure: linguistic data scientist]

Accessing and Analyzing Methods

The shift mentioned has already taken place to an extent, in that the research with the greatest impact in corpus linguistics is very often valued not for what it discovers about language, but for the methods it introduces or develops.

The findings about particular English grammatical constructions made by Stefanowitsch and Gries (2003), for example, are not especially revolutionary in themselves. It is, rather, the method that these findings exemplify - and the associated theoretical and statistical apparatus linked to collostruction - that makes this paper a key contribution to recent research in corpus linguistics.

Who Uses Corpora and How?

  • Who? How?

  • The focus is usually on "behavioural measures" of the linguistic units of interest (frequency, distribution, co-occurrence patterns): (word) frequency, concordance, collocation, collostruction, n-grams/lexical bundles/multi-word units, etc.

  • The problem with corpus-informed research: researchers use the corpus simply as a bank of examples to illustrate a theory they are developing. This runs counter to the scientific method, insofar as there is no attempt to account for the rest of the (potentially falsifying) evidence in the corpus.

  • The qualitative-versus-quantitative development is not without controversy: e.g., Critical Discourse Analysis, which undertakes a detailed analysis of a small amount of data, taking into account not just the text itself, but also the social context in which it was produced and the social context in which it was interpreted.
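
The basic measures named above (word frequency, concordance, n-grams) can be sketched in a few lines of Python. This is an illustrative sketch only: the tokenizer is deliberately crude, and the function names and the sample sentence are assumptions, not part of any particular corpus tool.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on runs of letters and apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(tokens, keyword, window=3):
    """Return keyword-in-context lines: (left context, keyword, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

def ngrams(tokens, n=2):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = ("Corpora allow us to observe language, "
        "but corpora are not language itself.")
tokens = tokenize(text)
freq = Counter(tokens)                          # word frequency list
concordance = kwic(tokens, "language", window=2)  # every hit, not a hand-picked one
bigram_freq = Counter(ngrams(tokens, 2))        # raw material for lexical bundles
```

Note that `kwic` returns every occurrence of the keyword rather than selected examples, which is the habit that separates systematic corpus use from treating the corpus as a bank of illustrations.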

Corpus-based Empirical Methods

  • Fundamental commitment to empiricism

  • Both corpus linguistics and experimental linguistics study the language system not directly but by observing epiphenomena: language output on a large scale in the one case, and blood-flow requirements or some other physiological feature associated with language processing in the other.

  • Distribution: toward Unified Empirical Linguistics [1], where evidence of all kinds - textual, psychological and neurological - is as a matter of course used in concert to uncover the nature of language. In such a context, corpus linguistics will reach its full potential as a methodology.

Empirical Methods and Statistics

For most of our lives, we have to make decisions based on incomplete information ... Empirical methods and statistics are two sides of the same coin: it is pointless to study one without the other. Statistics not only provide us with methods for summarizing sample data sets, they also allow us to make confident statements about an entire population.

- **Summarizing the data**: descriptive statistics summarize various attributes of a variable.

- **Characterizing the data**: Prior to building a predictive model or looking for hidden trends in the data, 
it is important to characterize the variables and the relationships between them and statistics gives us many 
tools to accomplish this.

- **Making statements about 'hidden' facts**: Once a group of observations within the data has been defined 
as interesting through the use of data mining techniques, 
statistics give us the ability to make confident statements about the group.
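
As a small illustration of moving from summarizing a sample to making a statement about the population, the sketch below computes descriptive statistics and a percentile bootstrap confidence interval for the mean. The sentence-length values are invented for the example, and the function names, the number of resamples and the fixed seed are assumptions chosen only to keep the sketch reproducible.

```python
import random
import statistics

def summarize(values):
    """Basic descriptive statistics for one numeric variable."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

# Hypothetical sentence lengths (in words) from a small sample of texts.
sentence_lengths = [12, 15, 9, 22, 17, 14, 30, 11, 16, 19]
summary = summarize(sentence_lengths)
ci_low, ci_high = bootstrap_ci(sentence_lengths)
```

The summary describes only the ten sampled sentences; the interval is the hedged statement about the population they were drawn from, and it is only as trustworthy as the sampling behind it.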

Of course, the more you know the better

(Kochanski, 2009): How to formulate empirical research questions and hypotheses (and a good understanding of the fundamental logic of experiment design)?

  • Counting and Sampling: When are two counts significantly different? When counts get sparse: how to deal with just a few examples

  • √N is your friend: planning the size of an experiment. How many subjects? How much can you say if it's significant? If it's not? Bonferroni corrections and doing more than one test.

  • Choosing your statistical test: (Gaussian or non-Gaussian; Continuous data, ordered data, vs. separate categories; Paired sample vs. not; t-tests or Non-parametric relatives of t-tests)

  • Linear regression and the stuff you do to prepare for it. If your data is too rich: Principal component analysis, Multidimensional Scaling, etc.

  • Modern statistics that you should be aware of: Monte-Carlo simulation; Bootstrap Resampling, etc.
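
The "when are two counts significantly different?" and "√N is your friend" points can be illustrated with a rough sketch. It assumes the two counts behave like Poisson counts from equal-sized samples and uses a normal approximation, which is reasonable only when the counts are not sparse; the counts and function names are hypothetical.

```python
import math

def count_difference_z(n1, n2):
    """z-score for the difference between two Poisson counts from equal-sized samples.

    Each count N has a standard deviation of about sqrt(N), so the difference
    N1 - N2 has a standard deviation of about sqrt(N1 + N2).
    """
    return (n1 - n2) / math.sqrt(n1 + n2)

def two_sided_p(z):
    """Two-sided p-value under a normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold when running n_tests tests."""
    return alpha / n_tests

# Hypothetical counts of one word in two equal-sized corpus samples.
z = count_difference_z(130, 100)
p = two_sided_p(z)
significant_alone = p < 0.05
significant_of_20 = p < bonferroni_alpha(0.05, 20)
```

With these numbers the difference passes the conventional 0.05 threshold when tested on its own, but not once the threshold is corrected for twenty tests. For sparse counts, an exact test is the safer choice.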

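One more nagging remark: linguists must not underestimate the expertise that data analysis requires.

Starting with general data analysis needs: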

  • You have some linguistic data that you need to collect, summarize, transform, explore, visualize, or present. In a word, we want to make sense of our data and communicate that understanding to others!
  • What would be the best collection of methods helpful in exploring corpus data?

The corpus processing pipeline you should at least understand

In general, corpus data science involves a chain of tasks:

  1. Pre-processing (cleaning, tokenizing, segmentation, etc)
  2. Data annotation (Semi-automatic) Labeling (POS tagging) and Management
  3. Exploratory Data Analysis (with workable knowledge of Statistics)
  4. Hypothesis testing
  5. Prediction and Statistical Modeling, etc
  6. Presentation and Web application (Demo: Shiny-LexicoR)

Pre-processing

  • Data collection and running a pilot experiment

Getting word frequencies from Google:

- Do you get about the same frequencies on different days? 
- Does it reproduce (approximately) known frequency ratios? 
- Look at some of the documents: Are they what you expect? 
- Find the words in the documents: Are they used in the way you expect?

Pre-processing

  • Cleaning the data: prior to analysis, it is important to consider applying certain transformations to the data, since many data analysis methods have difficulty making sense of data in its raw form. Some common transformations include normalization, etc.

  • Data segmentation/tokenization
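
A minimal sketch of what cleaning and tokenization can look like for whitespace-delimited text, assuming Unicode NFKC normalization, lowercasing and whitespace collapsing are the transformations wanted. The right set of transformations depends on the research question, and the function names and sample string are illustrative only.

```python
import re
import unicodedata

def normalize(text):
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. full-width letters become ASCII
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# Raw text with full-width letters, irregular spacing and a line break.
raw = "Ｃｏｒｐｕｓ   Linguistics:\n the study of  language DATA!"
clean = normalize(raw)
tokens = tokenize(clean)
```

This regular-expression approach relies on spaces between words, so it does not carry over to Chinese, where word boundaries have to be inferred by a segmenter.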

Annotation

The corpus processing skill that linguists should be best at.

  • Annotated vs. unannotated: whether or not the corpus has been analysed in a particular way yet. In annotation we engage in a process of labeling.

  • Metadata tells you something about the text itself; Textual Markup encodes information within the text other than the actual words; and Annotation encodes linguistic information within a corpus text in such a way that we can systematically and accurately recover that analysis later.

  • Corpus annotation is a commonplace of linguistics (!), but that does not mean that we do it well.

  • Assumption; Inaccuracy and Inconsistency (also in terms of inter-annotator agreement).
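
Inter-annotator agreement is often reported with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is one straightforward way to compute it for two annotators; the POS tags are invented for the example.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected if each annotator labelled at random with their own label rates.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical POS tags assigned by two annotators to the same ten tokens.
annotator_1 = ["N", "V", "N", "ADJ", "N", "V", "N", "N", "ADJ", "V"]
annotator_2 = ["N", "V", "N", "N",   "N", "V", "V", "N", "ADJ", "V"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 8 of 10 tokens, but because some of that agreement would arise by chance, kappa comes out noticeably lower than 0.8.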

Exploratory Data Analysis

EDA is more than the methods – it represents an attitude or philosophy about how data should be explored. Tukey (1977) makes a clear distinction between confirmatory data analysis, where one is primarily interested in drawing inferential conclusions, and exploratory methods, where one is placing few assumptions on the distributional shape of the data and simply looking for interesting patterns.

Questions exploratory analysis must answer: Total accountability, Falsifiability and Replicability

  • Corpora bring language research back into the context of the scientific method (Leech, 1992).

  • Confirmation bias: If you approach a corpus with a specific theory in mind, it can be easy to unintentionally focus on and pull out only the examples from the corpus that support the theory. But the theory can never be shown to be false by such an approach.

  • Such an approach runs counter to one of the key features of the scientific method identified by Popper (1934): falsifiability.

  • The principle of total accountability is simply that we must not select a favourable subset of the data in this way.

  • Another key feature of the scientific method is replicability. A result is considered replicable if a reapplication of the methods that led to it consistently produces the same result. New results are typically considered provisional until they are known to be replicable.

Employing Statistical Techniques

  • Descriptive statistics

  • Significance and Hypotheses Testing

  • Multivariate Analysis: investigating structure and relationships within the data, rather than testing the significance of a particular result, e.g., factor analysis, clustering, multi-dimensional methods, etc.

Strongly recommended: R/R-chitecture (enhanced with Python)

R-chitecture

When analyzing moderate-to-large data sets, Excel and other corpus tools don't have the power or flexibility. R was designed for these situations, with good graphical capabilities and a large, robust library of contributed packages.

Examples: Corpus-based Studies

  1. Function: Functional Linguistics
  2. Computation: Computational Linguistics
  3. Process: Psycholinguistics and Language Acquisition
  4. Application: Lexicography and Language Teaching

Functional Linguistics

Computational Linguistics

Psycholinguistics and Language Acquisition

Lexicography and Language Teaching

WARNING Words

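Things to watch out for

  • Does a corpus need to be balanced? How big is big enough? Balance, representativeness and comparability
  • The nightmare of Chinese word segmentation
  • Your empirical research methodology: an empirical methodology is needed
  • Data science will permeate the humanities and social sciences; keep an eye on the development and exploitation of corpus tools
  • Research ethics and personal data protection: issues of copyright law and of research ethics ('fair use')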
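
Why Chinese segmentation is a nightmare can be seen from a toy example. The sketch below uses forward maximum matching, a simple dictionary-based baseline chosen here only for illustration (it is not claimed to be the method used for any corpus mentioned in these slides); the four-word lexicon is hypothetical.

```python
def forward_max_match(text, lexicon, max_len=3):
    """Greedy left-to-right segmentation: always take the longest lexicon match."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if candidate in lexicon or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Toy lexicon (hypothetical): "研究" research, "研究生" graduate student,
# "生命" life, "起源" origin.
lexicon = {"研究", "研究生", "生命", "起源"}
segments = forward_max_match("研究生命起源", lexicon)
```

The greedy matcher yields 研究生 / 命 / 起源, whereas the intended reading is 研究 / 生命 / 起源 ("study the origin of life"). Any frequency count built on such output inherits the segmentation error.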

Comparing Corpora: Why and How?

Balance, Representativeness and Comparability

  • These three are ideals which corpus builders strive for but rarely, if ever, attain.

  • Biber's proposal for representativeness: measure internal variation within a corpus - i.e., a corpus is representative if it fully captures the variability of a language.

  • Is this a matter of degree, or a pseudo-scientific question? (Talking about degree presupposes that 100% is well defined.)
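
One common way to approach the "how" of comparing corpora is to score each word by how strongly its frequency differs between them, for instance with a log-likelihood (G2) statistic. The sketch below is an illustration under that assumption, not the procedure used for the Plurk and ASBC comparison; the two tiny "corpora" are invented.

```python
import math
from collections import Counter

def log_likelihood(count_a, total_a, count_b, total_b):
    """Log-likelihood (G2) score for one word's frequency in corpus A vs. corpus B."""
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = total_a * (count_a + count_b) / (total_a + total_b)
    expected_b = total_b * (count_a + count_b) / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def keywords(tokens_a, tokens_b, top=3):
    """Rank words by how strongly their frequency differs between two corpora."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    scores = {
        word: log_likelihood(freq_a[word], total_a, freq_b[word], total_b)
        for word in set(freq_a) | set(freq_b)
    }
    # Highest score first; ties broken alphabetically for a stable ranking.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top]

# Two tiny hypothetical "corpora": informal microblog text vs. edited prose.
microblog = "lol so tired today lol need coffee lol".split()
edited = "the committee reviewed the report today and the budget".split()
top_words = keywords(microblog, edited)
```

Words shared at similar rates (here "today") score near zero, while words concentrated in one corpus rise to the top. The score normalizes for corpus size but says nothing about whether either corpus is balanced or representative.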

Comparing NTU Plurk Corpus and ASBC

[Figures: PLURK corpus]

Conclusion and Discussion

Conclusion

  • The massive growth and diversity of language data is a trend that language researchers and practitioners must take seriously.

    From Corpus Linguistics to Cloud Linguistics

    Big Data Analysis

  • Corpus processing techniques are worth learning for everyone who works with language.

    Revisiting old friends: computational linguistics

    Reinvigorated friendship: Semantic tagging and Sentiment Analysis/ Opinion Mining

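  • Also important is methodological pluralism in empirical research, which needs time to build up and mature within the research community. This includes:

    Statistics and machine learning (assisted and self-taught)

    Looking at corpora through exploratory data analysis

    Reproducibility: doing linguistics in a reproducible way. Results only count if someone can follow your recipe and get the same answers. See http://www.reproducibleresearch.org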

Open Vocabulary Project (開放詞彙計劃)

[Figure: PLURK corpus]

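library(ggplot2)
# Scatter plot of horsepower (hp) against fuel economy (mpg) with a smoothed trend line
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() + geom_smooth()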

[Figure: scatter plot of mpg against hp (mtcars)]

Further Reading

Rice and Newman (eds). 2012. Empirical and Experimental Methods in Cognitive/Functional Research. CSLI.

R.H. Baayen. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge.

Reference

[1] Tony McEnery and Andrew Hardie. 2012. Corpus Linguistics: Method, Theory and Practice. Cambridge Textbooks in Linguistics. Cambridge University Press.

You are welcome to write and join us: shukaihsieh@ntu.edu.tw