Email Mining for Journalists
M. Edward (Ed) Borasky
December 17, 2013
Corpora
- A corpus (plural corpora) is a collection of documents
- In the tm package (Feinerer et al. 2008), there are two types of corpora
  - Volatile: Resides in RAM (but can be saved to and restored from disk!)
  - Permanent: Resides on disk with only some indexing in RAM
Typical Text Cleaning Operations
- Remove Non-ASCII characters
- Remove numbers
- Consolidate whitespace
- Remove punctuation
- Convert to all lower case
- Can be done in almost any programming language via regular expressions
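As a rough illustration of the regular-expression approach, here is a minimal Python sketch of the cleaning steps listed above. The function name `clean_text` and the sample string are made up for this example, and whitespace is consolidated last so that gaps left by the other steps collapse too.

```python
import re

def clean_text(text: str) -> str:
    """Apply the typical cleaning steps to one document (illustrative only)."""
    # Remove non-ASCII characters by dropping anything that cannot be encoded.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Remove numbers (replace with a space so neighbouring words stay apart).
    text = re.sub(r"[0-9]+", " ", text)
    # Remove punctuation: anything that is not a letter, digit or whitespace.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Convert to all lower case.
    text = text.lower()
    # Consolidate whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Budget “FINAL”  v2 -- see attached!!"))
```

Note that dropping non-ASCII characters also strips accented letters, which may or may not be what you want for a given corpus.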
Stemming
- Replace variations on a word with its stem
- Most often done using an algorithm called the Porter Stemmer
- You'll also see the term Snowball - that's Martin Porter's framework for implementing stemmers, including an updated Porter Stemmer
- Example: replace “stems”, “stemming”, “stemmed” with “stem” (the Porter algorithm leaves “stemmer” unchanged).
Stopwords
- Frequently occurring words with little semantic value
- Examples: a, an, and, or, this, …
- Stopword removal cuts down the dataset size and run time
- Defining stopwords depends on what's in the corpus
- Usually start with a predefined set and add more stopwords
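A minimal Python sketch of "start with a predefined set and add more". Both word sets here are tiny and hypothetical: real predefined lists are much longer, and the extra words are only a guess at what might be noise in an email corpus.

```python
# A tiny stand-in for a predefined stopword list (real lists are much longer).
BASE_STOPWORDS = {"a", "an", "and", "or", "this", "the", "of", "to"}

# Hypothetical corpus-specific additions, e.g. email boilerplate.
EXTRA_STOPWORDS = {"re", "fwd", "sent"}

def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set, order preserved."""
    return [t for t in tokens if t not in stopwords]

tokens = "re this is a copy of the budget and the memo".split()
kept = remove_stopwords(tokens, BASE_STOPWORDS | EXTRA_STOPWORDS)
print(kept)
```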
Bag of Words
- A document is treated as a bag of words
- After you've cleaned a document you have a bag
  - aka multiset - like a set but can have more than one copy of an element
  - Like a set, order doesn't matter
- Of words - we got rid of the numbers, punctuation, etc.
  - We probably stemmed and maybe removed stopwords
- In the literature, you'll hear the words called terms or types
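One way to picture a bag of words in Python is `collections.Counter`, which behaves like a multiset. The two sample documents are invented; they contain the same words in a different order, so their bags compare equal.

```python
from collections import Counter

# Two already-cleaned documents with the same words in a different order.
doc_a = "budget memo budget meeting".split()
doc_b = "meeting budget memo budget".split()

bag_a = Counter(doc_a)  # maps each term to the number of copies in the bag
bag_b = Counter(doc_b)

print(bag_a["budget"])  # how many copies of "budget" are in the bag
print(bag_a == bag_b)   # order doesn't matter, only the counts
```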
Document-Term Matrix
- We have some documents in our corpus - bags of words
- We'll call the words terms
- Make a matrix - a spreadsheet
- The row labels are documents
- The column labels are terms
- That's a document-term matrix
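A small Python sketch of building such a matrix from bags of words. The document names and texts are hypothetical, and the cells hold plain term counts, which is only one of several possible choices.

```python
from collections import Counter

# Hypothetical cleaned documents, keyed by a document label.
docs = {
    "email1": "budget memo budget",
    "email2": "memo meeting",
}

# One bag of words per document.
bags = {name: Counter(text.split()) for name, text in docs.items()}

# Column labels: every distinct term in the corpus, sorted for a stable order.
terms = sorted(set().union(*bags.values()))

# Row labels are the documents; each row has one cell per term.
matrix = [[bags[name][term] for term in terms] for name in docs]

print(terms)
for name, row in zip(docs, matrix):
    print(name, row)
```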
Ah, But What's in the Cells?
- Numbers!
- Three common options
  - One if the term occurs in the document, zero if it doesn't
  - The number of times the term occurs in the document
  - A weight computed from how often the term occurs in the document and in the corpus as a whole (tf-idf is a common example)
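The three options can be sketched in Python as follows. The count matrix is a made-up example (rows are documents, columns are terms), and the third option uses a basic tf-idf weight, count times log(N / document frequency), purely as an assumed illustration of "a mathematical function"; other weighting formulas exist.

```python
import math

# Hypothetical count matrix: 2 documents (rows) by 3 terms (columns).
counts = [
    [2, 0, 1],
    [0, 1, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Option 1: one if the term occurs in the document, zero if it doesn't.
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

# Option 2: the raw counts themselves (already in `counts`).

# Option 3: a weight that also depends on the corpus. Here: basic tf-idf.
# doc_freq[j] is the number of documents that contain term j.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
tfidf = [
    [c * math.log(n_docs / doc_freq[j]) for j, c in enumerate(row)]
    for row in counts
]

print(binary)
print(tfidf)
```

With this weighting, a term that appears in every document (the third column here) gets a weight of zero, since it does nothing to tell documents apart.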
The Document-Term Matrix is Sparse
- Most of the numbers are zero
- So usually you only have to store triples
  - (document index, term index, non-zero value)
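A minimal Python sketch of the triple representation, using a made-up dense matrix. The helper `to_dense` is a hypothetical name added here only to show that nothing is lost by storing the non-zero cells alone.

```python
# Hypothetical dense document-term matrix: rows are documents, columns are terms.
dense = [
    [2, 0, 1],
    [0, 1, 1],
]

# Keep only the non-zero cells as (document index, term index, value) triples.
triples = [
    (i, j, value)
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
]

def to_dense(triples, n_rows, n_cols):
    """Rebuild the full matrix from the triples; missing cells are zero."""
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for i, j, value in triples:
        matrix[i][j] = value
    return matrix

print(triples)
print(to_dense(triples, 2, 3) == dense)
```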
What Can You Do With a Matrix?
- Computational linear algebra, of course!
- Most text mining algorithms are based on linear algebra / vector spaces
- Programming language choices
  - R and Python both have good libraries for this
  - The hard stuff is usually done via C/C++
  - A lot of Java code out there too
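As one small example of the vector-space view, each row of a document-term matrix can be treated as a vector, and two documents compared by the cosine of the angle between them. Cosine similarity is chosen here only as an illustration; the matrix is invented, and the sketch assumes no document is empty (an all-zero row would divide by zero).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical document-term matrix with 3 documents and 3 terms.
dtm = [
    [2, 0, 1],
    [0, 1, 1],
    [4, 0, 2],  # same mix of terms as the first document, twice as long
]

print(round(cosine_similarity(dtm[0], dtm[2]), 3))
print(round(cosine_similarity(dtm[0], dtm[1]), 3))
```

The first and third documents use the same terms in the same proportions, so they score as essentially identical even though one is longer; the first and second share only one term and score much lower.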
Wait - You Mangled the Documents - How Can This Possibly Work?
- It works surprisingly well, actually
- The tricky part is visualizing the multi-dimensional objects
- See Landauer et al. (2007) for some of the math and applications
One Final Note
- The transpose of a document-term matrix is called … wait for it …
- A term-document matrix!
- You'll see both in the literature
- We'll stick with document-term matrix
- That's the way R Commander's text mining code works
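For a matrix stored as a list of rows, the transpose is a one-liner in Python. The matrix below is a made-up example with two documents and three terms.

```python
# Hypothetical document-term matrix: 2 documents (rows) by 3 terms (columns).
dtm = [
    [2, 0, 1],
    [0, 1, 1],
]

# Transposing swaps rows and columns, giving a term-document matrix.
tdm = [list(column) for column in zip(*dtm)]

print(tdm)
```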
Tools - R Commander and RcmdrPlugin.temis
R Commander
- R Commander (Fox 2005) is a general purpose GUI for R
- Most common packaged analyses are available
- Can generate reports and web pages
- Can save a script of the processing for documentation or later execution!
- Wide range of plugins, including …
Coming Soon To a Github Repo Near You!
- Explore the text mining algorithms in RcmdrPlugin.temis
- Add social network analysis via package snatm (Bohn et al. 2011)
- Add functionality to the package with journalism uses in mind
- Port the Bash and Perl pieces to R so the package will run on Windows
- ???
- Profit!
References
- Angela Bohn, Ingo Feinerer, Kurt Hornik, Patrick Mair (2011). Content-Based Social Network Analysis of Mailing Lists. http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Bohn~et~al.pdf
- Milan Bouchet-Valat, Gilles Bastin (2013). RcmdrPlugin.temis, a Graphical Integrated Text Mining Solution in R. http://journal.r-project.org/archive/2013-1/bouchetvalat-bastin.pdf
- Ingo Feinerer, Kurt Hornik, David Meyer (2008). Text Mining Infrastructure in R. http://www.jstatsoft.org/v25/i05/
- John Fox (2005). The R Commander: A Basic Statistics Graphical User Interface to R. http://www.jstatsoft.org/v14/i09
- Thomas Landauer, Danielle McNamara, Simon Dennis, Walter Kintsch (2007). Handbook of Latent Semantic Analysis. http://www.amazon.com/Handbook-Semantic-University-Institute-Cognitive-ebook/dp/B00CXU38Z4