Part one: Background, Theory and Success Story, Methods
Introduces the concepts, history, types, and methods of corpora. (March 6)
Part two: Web Corpus: Put into Practice
Introduces tools for building and analyzing web corpora. (July 3)
Shu-Kai Hsieh 謝舒凱
Graduate Institute of Linguistics, National Taiwan University
This guided reading is based mainly on the following book, with added research notes.
McEnery and Hardie. 2012. Corpus Linguistics: Method, Theory and Practice. Cambridge. Support website
"Corpus linguistics is the study of language data on a large scale - the computer-aided analysis of very extensive collections of transcribed utterances or written texts." [1]
Data are values of qualitative or quantitative variables, belonging to a set of items (i.e., populations). (wiki)
Linguistic data: Fieldwork, Experiment and Corpus.
Linguistics: a data-intensive discipline and a textually mediated world.
Looking at trends evident in the history of corpus linguistics up to the present time and considering how those trends are likely to continue, or, rather, how we think they should continue. [1]
40-50 years of development and debate on introspection vs. corpus evidence.
A separate field of linguistics? (Tool vs. theory) Corpus linguistics has become an indispensable component of the methodological toolbox throughout linguistics.
But we must not confuse corpus data with language itself. Corpora allow us to observe language, but they are not language itself.
Technology drives thinking: what kind of future progress can be predicted for corpus linguistics?
The corpus-as-tool view: Corpus-based studies typically use corpus data in order to explore a theory or hypothesis, typically one established in the current literature, in order to validate it, refute it or refine it. The definition of corpus linguistics as a method underpins this approach to the use of corpus data in linguistics.
The corpus-as-theory view: Corpus-driven linguistics rejects the characterization of corpus linguistics as a method and claims instead that the corpus itself should be the sole source of our hypotheses about language (Tognini-Bonelli 2001). This position is associated with the neo-Firthians.
"All corpus linguistics can just be described as corpus-based."
Determined by the research topic.
Two broad approaches to the issue of choosing what data to collect:
Monitor approach
(Sinclair 1991:24-6), where the corpus continually expands to include more and more texts over time; and
Balanced or Snapshot approach
(Biber 1993 and Leech 2007), where a careful sample corpus, reflecting the language as it exists at a given point in time, is constructed according to a specific sampling frame.
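Background knowledge
Information as an oil field: in 2011, the amount of data generated every two days equaled all the data humankind had produced from the beginning of civilization up to 2003 (1.8 ZB). Professor Deb Roy's language acquisition experiment at MIT (3 years, 90,000 hours, 200 TB) (cited from 林守德 Shou-De Lin, Scientific American (Chinese edition), 2013.03).
Facebook: more than a hundred million photos and fifty million likes are added every day.
Digitization of the world (medical cloud, education cloud)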
Big Data Analysis: the material returned from the web search tends to be an undifferentiated mass, which requires a new Data Science to process and extract meaningful patterns.
Web for Corpus: collecting the texts from the Web.
Web as Corpus: accessing the Web as a corpus (in real-time), e.g., WebCorp.
Multimodality (images, videos and audios); Human Language Archive
A corpus and a corpus search tool are two things that are easily confused.
The 3rd-generation concordancer is almost synonymous with "corpus analysis tool". Well-known recent tools include WordSmith, AntConc, and Xaira. But what are their limitations?
4th-generation: from website to web service.
From built-in, to customized, to self-made analysis tools.
It has been suggested that corpus linguists, rather than using general corpus analysis packages, should instead fully embrace computer programming and individually develop their own ad hoc tools to address the tasks that face them.
Database (SQL? NoSQL? MongoDB?)
Query language (CQL)
Indexing algorithm
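To make the indexing point concrete, here is a minimal sketch of a positional inverted index, the kind of structure a corpus query engine could build on. It is an illustration only: the function names `build_index` and `find_phrase` and the three sample sentences are made up for this example, and real systems use far more compact and faster structures.

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            index[token].append((doc_id, pos))
    return index

def find_phrase(index, phrase):
    """Return (doc_id, start_position) for every occurrence of a phrase."""
    words = phrase.lower().split()
    if not words:
        return []
    hits = []
    # Start from the postings of the first word, then check that each
    # following word occurs at the next position in the same document.
    for doc_id, pos in index.get(words[0], []):
        if all((doc_id, pos + i) in index.get(w, []) for i, w in enumerate(words)):
            hits.append((doc_id, pos))
    return hits

# Hypothetical three-sentence "corpus"
docs = [
    "the corpus is not language itself",
    "a corpus lets us observe language",
    "language itself is never observed directly",
]
index = build_index(docs)
print(index["corpus"])                        # postings for a single word
print(find_phrase(index, "language itself"))  # phrase query
```

A query language such as CQL adds pattern matching over annotations on top of lookups like these. The index is what keeps queries fast as the corpus grows.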
[We feel Fine] http://www.wefeelfine.org/
How does language change as you travel to different regions? Recall the classic soda vs. pop vs. coke question: some people use the word “soda” to describe their soft drinks, others use “pop”, and still others use “coke”. Who says what where?
We can probably hypothesize that:
You can also make some other fun observations during the break: What do people want for Christmas, compared to what they actually get?
Google Endangered Language Project
in Linguistics: a call for reform in linguistics education
linguistic data scientist and/or language engineer
There will be an important role for corpus specialists whose research is concerned with the methodology itself - the construction and annotation of corpora, the development of new tools and new procedures, the expansion of the conceptual bases of the methodology and other such issues. [1]
in Humanities and Science: the new generation of corpus linguists serves more than linguistics alone
Just as corpus linguistics has become increasingly integrated as a method with other fields of linguistics, it may/will be adopted outside linguistics by other disciplines within the humanities and social sciences in particular.
The triangulation of corpus methods with other research methodologies will be an important further step in enhancing both the rigour of corpus linguistics and its incorporation into all kinds of research, both linguistic and non-linguistic.
NTU-AS-Toulouse M3 project
The shift mentioned has already taken place to an extent, in that the research with the greatest impact in corpus linguistics is very often valued not for what it discovers about language, but for the methods it introduces or develops.
The findings about particular English grammatical constructions made by Stefanowitsch and Gries (2003), for example, are not especially revolutionary in themselves. It is, rather, the method that these findings exemplify - and the associated theoretical and statistical apparatus linked to collostruction - that makes this paper a key contribution to recent research in corpus linguistics.
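As a rough illustration of the statistical side of collostruction: the attraction between a word and a construction can be measured from a 2x2 contingency table with a one-tailed Fisher exact test. The sketch below assumes that setup. The function name `fisher_exact_greater` and all counts are invented for illustration and are not taken from Stefanowitsch and Gries (2003).

```python
from math import comb, log10

def fisher_exact_greater(a, b, c, d):
    """One-tailed Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a: target word inside the construction
    b: target word elsewhere
    c: other words inside the construction
    d: other words elsewhere
    Returns the probability of observing `a` or more co-occurrences if
    word and construction were independent, with all margins held fixed.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        # Hypergeometric probability of exactly k co-occurrences
        p += comb(row1, k) * comb(n - row1, col1 - k) / denom
    return p

# Hypothetical counts: the word occurs 100 times in a 10,000-slot corpus,
# 30 of them inside a construction that occurs 200 times in total.
p = fisher_exact_greater(30, 70, 170, 9730)
print(p, -log10(p))  # a larger -log10(p) means stronger attraction
```

With these made-up counts, chance alone would predict about 2 co-occurrences (100 × 200 / 10,000), so 30 observed cases give a very small p-value.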
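Who? How is it used?
The focus is usually on "behavioral measures" of the linguistic unit of interest (frequency, distribution, and co-occurrence patterns): (word) frequency, concordance, collocation, collostruction, n-grams/lexical bundles/multi-word units, etc.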
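The three basic measures named above (frequency, concordance, co-occurrence) can be sketched in a few lines. This is a minimal illustration on a made-up two-sentence text; the helper name `kwic` and the window size are assumptions made for the example.

```python
import re
from collections import Counter

text = (
    "Corpora allow us to observe language, but they are not language itself. "
    "We must not confuse corpus data with language itself."
)
tokens = re.findall(r"\w+", text.lower())

# 1. Word frequency
freq = Counter(tokens)

# 2. Concordance (keyword in context) with `width` tokens on each side
def kwic(tokens, keyword, width=3):
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# 3. Co-occurrence: counts of adjacent word pairs (bigrams).
#    Note: this simple version also pairs words across sentence boundaries.
bigrams = Counter(zip(tokens, tokens[1:]))

print(freq.most_common(3))
for line in kwic(tokens, "language"):
    print(line)
print(bigrams[("language", "itself")])
```

Raw bigram counts are only a starting point. Collocation studies usually go on to compare them against what chance co-occurrence would predict.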
The problem with corpus-informed research: Researchers use the corpus simply as a bank of examples to illustrate a theory they are developing. This runs counter to the scientific method, insofar as there is no attempt to account for the rest of the (potentially falsifying) evidence in the corpus.
Whether the field should develop in a qualitative or a quantitative direction is not without controversy, e.g., Critical Discourse Analysis.
To undertake a detailed analysis of a small amount of data, taking into account not just the text itself, but also the social context in which it was produced and the social context in which it was interpreted.
Fundamental commitment to empiricism
Both corpus linguistics and other experimental approaches in linguistics study the language system not directly but by observing epiphenomena: output on a large scale, or the blood-flow requirements or some other physiological feature associated with it.
Distribution: toward Unified Empirical Linguistics [1], where evidence of all kinds - textual, psychological and neurological - is as a matter of course used in concert to uncover the nature of language. In such a context, corpus linguistics will reach its full potential as a methodology.
For most of our lives, we have to make decisions on the basis of incomplete information ... Empirical methods and statistics are two sides of the same coin: it is pointless to study one without the other. Statistics not only provide us with methods for summarizing sample data sets, they also allow us to make confident statements about entire populations.
- **Summarizing the data**: Statistics summarize various attributes of a variable.
- **Characterizing the data**: Prior to building a predictive model or looking for hidden trends in the data,
  it is important to characterize the variables and the relationships between them, and statistics gives us many
  tools to accomplish this.
- **Making statements about 'hidden' facts**: Once a group of observations within the data has been defined
  as interesting through the use of data mining techniques,
  statistics gives us the ability to make confident statements about the group.
(Kochanski, 2009): How to formulate empirical research questions and hypotheses (with a good understanding of the fundamental logic of experimental design)?
Counting and sampling: When are two counts significantly different?
When counts get sparse: how to deal with just a few examples.
Sqrt(N) (the square root of N) is your friend: planning the size of an experiment.
How many subjects? How much can you say if it's significant? If it's not?
Bonferroni corrections and doing more than one test.
Choosing your statistical test: Gaussian or non-Gaussian; continuous data, ordered data, or separate categories; paired samples or not; t-tests or non-parametric relatives of t-tests.
Linear regression and the stuff you do to prepare for it. If your data is too rich: principal component analysis, multidimensional scaling, etc.
Modern statistics that you should be aware of: Monte Carlo simulation, bootstrap resampling, etc.
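Two of the points above, the sqrt(N) rule for counts and the Bonferroni correction, can be sketched as follows. The sketch assumes that each count behaves like a Poisson variable (so its standard deviation is roughly sqrt(N)), that the two counts come from samples of equal size, and that a normal approximation is acceptable. The function names and the counts 120 and 90 are hypothetical.

```python
from math import sqrt, erfc

def count_difference_z(n1, n2):
    """z-score for the difference between two counts from equal-sized samples.

    Each count is treated as Poisson, so the standard deviation of the
    difference is about sqrt(n1 + n2).
    """
    return (n1 - n2) / sqrt(n1 + n2)

def two_sided_p(z):
    """Two-sided p-value under a normal approximation."""
    return erfc(abs(z) / sqrt(2))

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold when n_tests tests are run."""
    return alpha / n_tests

# Hypothetical counts of one word in two equal-sized sub-corpora
z = count_difference_z(120, 90)
p = two_sided_p(z)
print(round(z, 2), round(p, 3))

# Significant as a single test, but not if it is one of ten tests
print(p < 0.05, p < bonferroni_alpha(0.05, 10))
```

The normal approximation breaks down when counts are sparse (the "just a few examples" case above). An exact test or a resampling method is the safer choice there.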
First, in terms of general data analysis needs:
In general, corpus data science involves a chain of tasks:
- Pre-processing (cleaning, tokenizing, segmentation, etc.)
- Data annotation: (semi-automatic) labeling (POS tagging) and management
- Exploratory Data Analysis (with workable knowledge of statistics)
- Hypothesis testing
- Prediction and statistical modeling, etc.
- Presentation and web application (Demo: Shiny-LexicoR)

- Do you get about the same frequencies on different days?
- Does it reproduce (approximately) known frequency ratios?
- Look at some of the documents: Are they what you expect?
- Find the words in the documents: Are they used in the way you expect?
Cleaning the data: Prior to analysis, it is important to consider applying certain transformations to the data, since many data analysis techniques have difficulty making sense of data in its raw form. Some common transformations include normalization, etc.
Data segmentation/tokenization
Annotation is the corpus-processing skill that linguists should be best at.
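A minimal sketch of the cleaning and tokenizing steps for English text, using only the standard library. The helper names `clean` and `tokenize`, the regular expressions, and the sample string are assumptions chosen for illustration. Real pipelines need rules tuned to the data source.

```python
import re
import unicodedata

def clean(text):
    """Normalize Unicode, strip HTML-like tags, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)   # e.g., full-width to half-width forms
    text = re.sub(r"<[^>]+>", " ", text)         # crude tag removal
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Lowercase and split into word tokens, keeping internal apostrophes and hyphens."""
    return re.findall(r"\w+(?:['-]\w+)*", text.lower())

raw = "<p>Corpora  allow us to   observe language;\nthey aren't language itself.</p>"
print(clean(raw))
print(tokenize(clean(raw)))
```

This whitespace- and punctuation-based approach does not carry over to Chinese, where text is written without spaces between words and a dedicated word segmentation step is needed.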
Annotated vs. unannotated: whether or not the corpus has been analysed in a particular way yet. In annotation we engage in a process of labeling.
Metadata tells you something about the text itself; textual markup encodes information within the text other than the actual words; and annotation encodes linguistic information within a corpus text in such a way that we can systematically and accurately recover that analysis later.
Corpus annotation is a commonplace of linguistics (!), but that does not mean that we do it well.
Assumption; Inaccuracy and Inconsistency (also in terms of inter-annotator agreement).
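One common way to quantify inter-annotator agreement is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The slides do not name a specific measure, so kappa is an assumption here, and the two label sequences are invented.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the annotators agree
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    # Undefined (division by zero) if both annotators use one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical POS tags from two annotators for the same eight tokens
ann_a = ["N", "V", "N", "N", "ADJ", "V", "N", "V"]
ann_b = ["N", "V", "N", "V", "ADJ", "V", "N", "N"]
kappa = cohens_kappa(ann_a, ann_b)
print(round(kappa, 3))
```

Here the annotators agree on 6 of 8 items (0.75), but because chance agreement is already about 0.41, kappa is noticeably lower than the raw agreement rate.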
EDA is more than the methods; it represents an attitude or philosophy about how data should be explored. Tukey (1977) makes a clear distinction between confirmatory data analysis, where one is primarily interested in drawing inferential conclusions, and exploratory methods, where one is placing few assumptions on the distributional shape of the data and simply looking for interesting patterns.
Corpora bring language research back into the context of the scientific method (Leech, 1992).
Confirmation bias: If you approach a corpus with a specific theory in mind, it can be easy to unintentionally focus on and pull out only the examples from the corpus that support the theory. But the theory can never be shown to be false by such an approach.
Such an approach runs counter to one of the key features of the scientific method identified by Popper (1934): falsifiability.
The principle of total accountability is simply that we must not select a favourable subset of the data in this way.
Another key feature of the scientific method is replicability. A result is considered replicable if a reapplication of the methods that led to it consistently produces the same result. New results are typically considered provisional until they are known to be replicable.
Descriptive statistics
Significance and Hypotheses Testing
Multivariate Analysis: investigating structure and relationships within the data, rather than testing the significance of a particular result. E.g., factor analysis, clustering, multi-dimensional methods, etc.
When analyzing moderate-to-large data sets, Excel and other corpus tools don't have the power or flexibility. R was designed for these situations, with good graphical capabilities and a large, robust library of contributed packages.
COCA and ANC http://corpus.byu.edu/coca/compare-anc.asp
LOB and Brown
These three are ideals which corpus builders strive for but rarely, if ever, attain.
Biber's proposal for representativeness: measure internal variation within a corpus - i.e., a corpus is representative if it fully captures the variability of a language.
Is it a matter of degree, or a pseudo-scientific question? (Talking about degree presupposes that we know clearly what 100% would be.)
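- Massive growth and diversity of data is a trend that language researchers and practitioners must take seriously.
  - From Corpus Linguistics to Cloud Linguistics
  - Big Data Analysis
- Corpus processing techniques are worth learning for everyone who works with language.
  - Revisiting old friends: computational linguistics
  - Reinvigorated friendship: semantic tagging and sentiment analysis/opinion mining
- Methodological pluralism needs time to build up and mature within the research community. It includes:
  - Statistics and machine learning (assisted and self-taught)
  - Exploratory data analysis
  - Reproducibility: Doing linguistics in a reproducible way. Results only count if someone can follow your recipe and get the same answers. See http://www.reproducibleresearch.org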
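library(ggplot2)
# Scatter plot of horsepower (hp) against fuel economy (mpg) with a smoothed trend line
ggplot(mtcars, aes(hp, mpg)) + geom_point() + geom_smooth()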
Rice and Newman (eds). 2012. Empirical and Experimental Methods in Cognitive/Functional Research. CSLI.
R.H. Baayen. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge.
[1] Tony McEnery and Andrew Hardie. 2012. Corpus Linguistics. Cambridge Textbooks in Linguistics.
You are welcome to write in and join us: shukaihsieh@ntu.edu.tw