Email Mining for Journalists

M. Edward (Ed) Borasky
December 17, 2013

Text Mining 101

A corpus (plural corpora) is a collection of documents
In the tm package (Feinerer et al. 2008), there are two types of corpora:
- Volatile: Resides in RAM (but can be saved to and restored from disk!)
- Permanent: Resides on disk with only some indexing in RAM

Stemming: replace variations on a word with its stem
Most often done using an algorithm called the Porter stemmer
You'll also see the term Snowball - that's Martin Porter's framework for writing stemmers, and it's the implementation tm uses
- Example: replace “stem”, “stems”, “stemmer”, “stemming”, “stemmed” with “stem”.
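
To make the idea concrete, here is a minimal sketch of suffix stripping in Python. It is a toy written for this example, not the Porter or Snowball algorithm: the suffix list and the `toy_stem` name are assumptions, and real stemmers apply many more rules.

```python
# Toy stemmer for illustration only. This is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "er", "s")  # hypothetical rule list, checked in order


def toy_stem(word):
    """Strip one common suffix, then undo a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # "stemm" -> "stem": drop one copy of a doubled consonant.
            if len(stem) > 3 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


words = ["stem", "stems", "stemmer", "stemming", "stemmed"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['stem', 'stem', 'stem', 'stem', 'stem']
```

In practice you would call a library stemmer rather than write your own rules.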

Stopwords: frequently occurring words with little semantic value
- Examples: a, an, and, or, this, …
Stopword removal cuts down the dataset size and run time
Defining stopwords depends on what's in the corpus
- Usually start with a predefined set and add more stopwords
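
A small sketch of that workflow, assuming a tiny hand-written base list and a few made-up additions for a mailing-list corpus. Both sets and the `remove_stopwords` helper are hypothetical; real predefined lists are much longer.

```python
# Hypothetical predefined list (real ones have hundreds of entries).
BASE_STOPWORDS = {"a", "an", "and", "or", "this", "the", "of", "to", "in", "is"}

# Hypothetical corpus-specific additions for email data.
EXTRA_STOPWORDS = {"re", "wrote", "list"}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = "this is a patch to the list and the bug is in the parser".split()
kept = remove_stopwords(tokens, BASE_STOPWORDS | EXTRA_STOPWORDS)
print(kept)  # ['patch', 'bug', 'parser']
```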

A document is treated as a bag of words
After you've cleaned a document you have a bag
- aka multiset - like a set but can have more than one copy of an element
- like a set, order doesn't matter
Of words - we got rid of the numbers, punctuation, etc.
- we probably stemmed and maybe removed stopwords
In the literature, you'll hear the words called terms or types
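
In Python, a bag of words maps naturally onto `collections.Counter`. The sketch below assumes a very simple cleaning step (lowercase, keep only runs of the letters a-z); the `bag_of_words` name and the two sample sentences are made up for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, keep alphabetic runs only, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


bag_a = bag_of_words("The patch fixes 2 bugs; the patch is small.")
bag_b = bag_of_words("Small is the patch: the patch fixes bugs!")

print(bag_a["patch"])   # 2 (a multiset keeps the duplicate)
print(bag_a == bag_b)   # True (word order does not matter)
```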

We have some documents in our corpus - bags of words
We'll call the words terms
Make a matrix - a spreadsheet
- The row labels are documents
- The column labels are terms
That's a document-term matrix
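
As a rough sketch, the matrix can be built from the bags with plain Python lists. The three tiny documents and their names below are invented for illustration, and the cells hold raw counts.

```python
from collections import Counter

# Hypothetical, already-cleaned documents (lists of terms).
docs = {
    "msg1": ["patch", "bug", "patch"],
    "msg2": ["bug", "report"],
    "msg3": ["release", "patch"],
}

# Column labels: every distinct term in the corpus, in sorted order.
terms = sorted({t for tokens in docs.values() for t in tokens})

# One bag (Counter) per document; a missing term counts as 0.
bags = {name: Counter(tokens) for name, tokens in docs.items()}

# Rows are documents, columns are terms.
dtm = [[bags[name][term] for term in terms] for name in docs]

print(terms)
for name, row in zip(docs, dtm):
    print(name, row)
```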

What goes in the cells? Numbers!
Three common options
1. One if the term occurs in the document, zero if it doesn't
2. The number of times the term occurs in the document
3. A mathematical function of the term's counts in the document and in the corpus (tf-idf is a common choice)
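
The three options can be sketched on a small count matrix. For the third, this example assumes one common tf-idf variant, count times log(N / document frequency); other variants exist and the source does not name a specific one. The matrix values are made up.

```python
import math

# Option 2: raw counts (rows are documents, columns are terms).
counts = [
    [1, 2, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Option 1: one if the term occurs in the document, zero if it doesn't.
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

# Document frequency: in how many documents does each term occur?
# Assumes every term occurs in at least one document.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Option 3 (assumed variant): tf-idf = count * log(N / document frequency).
tfidf = [
    [c * math.log(n_docs / doc_freq[j]) for j, c in enumerate(row)]
    for row in counts
]

for row in tfidf:
    print([round(v, 3) for v in row])
```

Terms that occur in few documents get a larger weight than terms that occur almost everywhere.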

Most of the numbers are zero, so the matrix is sparse
So usually you only have to store triples
- (document index, term index, non-zero value)
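
A minimal sketch of the triple representation, using a small made-up matrix. The `to_dense` helper is hypothetical and is only there to show that nothing is lost.

```python
# A small dense document-term matrix (mostly zeros in real corpora).
dense = [
    [1, 2, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]

# Keep only (document index, term index, non-zero value).
triples = [
    (i, j, value)
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
]


def to_dense(triples, n_rows, n_cols):
    """Rebuild the full matrix from the triples."""
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for i, j, value in triples:
        matrix[i][j] = value
    return matrix


print(triples)
print(to_dense(triples, 3, 4) == dense)  # True
```

Here 6 triples stand in for 12 cells; with thousands of terms per corpus the saving is far larger.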

What do we do with the matrix? Computational linear algebra, of course!
Most text mining algorithms are based on linear algebra / vector spaces
Programming language choices
- R and Python both have good libraries for this
- The hard stuff is usually done via C/C++
- A lot of Java code out there too

The transpose of a document-term matrix is called … wait for it …
- A term-document matrix!
- You'll see both in the literature
- We'll stick with document-term matrix
- That's the way R Commander's text mining code works
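
Switching between the two layouts is a one-liner. This sketch uses plain Python lists and a made-up matrix:

```python
# Document-term matrix: rows are documents, columns are terms.
dtm = [
    [1, 2, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]

# Term-document matrix: rows are terms, columns are documents.
tdm = [list(column) for column in zip(*dtm)]

print(tdm)  # [[1, 1, 0], [2, 0, 1], [0, 0, 1], [0, 1, 0]]
```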

RcmdrPlugin.temis (Bouchet-Valat & Bastin 2013): an R Commander plugin that adds basic and advanced text mining functions
Plan of attack
1. Get some data - the 2006 R-devel mailing list, to be precise (Bohn et al. 2011; Feinerer et al. 2008)
2. Unpack to a flat directory - each message is a single file
3. Build a corpus and document-term matrix
4. Save the script and package it
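
Steps 2 and 3 can be sketched with Python's standard library. This is not the tm workflow, just an illustration under stated assumptions: one message per file in a flat directory, plain-text bodies, and a simple letters-only tokenizer. The `read_message` and `build_corpus` helpers, the file names, and the two sample messages are all made up.

```python
import re
import tempfile
from collections import Counter
from email import message_from_string, policy
from pathlib import Path


def read_message(path):
    """Parse one message file; return (subject, plain-text body)."""
    raw = Path(path).read_text(encoding="utf-8", errors="replace")
    msg = message_from_string(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return str(msg["Subject"] or ""), text


def build_corpus(directory):
    """Map each file name to a bag of words built from subject and body."""
    corpus = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            subject, text = read_message(path)
            tokens = re.findall(r"[a-z]+", (subject + " " + text).lower())
            corpus[path.name] = Counter(tokens)
    return corpus


# Demo with two made-up messages in a temporary flat directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "0001.txt").write_text(
        "From: a@example.org\nSubject: Patch for parser\n\n"
        "The patch fixes the parser bug.\n",
        encoding="utf-8",
    )
    Path(tmp, "0002.txt").write_text(
        "From: b@example.org\nSubject: Re: Patch for parser\n\n"
        "Thanks, applied.\n",
        encoding="utf-8",
    )
    corpus = build_corpus(tmp)

for name, bag in corpus.items():
    print(name, dict(bag))
```

From these bags, the document-term matrix follows by choosing a column for each distinct term.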

Angela Bohn, Ingo Feinerer, Kurt Hornik, Patrick Mair (2011). Content-Based Social Network Analysis of Mailing Lists. http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Bohn~et~al.pdf
Milan Bouchet-Valat, Gilles Bastin (2013). RcmdrPlugin.temis, a Graphical Integrated Text Mining Solution in R. http://journal.r-project.org/archive/2013-1/bouchetvalat-bastin.pdf
Ingo Feinerer, Kurt Hornik, David Meyer (2008). Text Mining Infrastructure in R. http://www.jstatsoft.org/v25/i05/
John Fox (2005). The R Commander: A Basic Statistics Graphical User Interface to R. http://www.jstatsoft.org/v14/i09
Thomas Landauer, Danielle McNamara, Simon Dennis, Walter Kintsch (2007). Handbook of Latent Semantic Analysis. http://www.amazon.com/Handbook-Semantic-University-Institute-Cognitive-ebook/dp/B00CXU38Z4