Email Mining for Journalists

M. Edward (Ed) Borasky
December 17, 2013

Text Mining 101

Corpora

  • A corpus (plural corpora) is a collection of documents
  • In the tm package (Feinerer et al. 2008), there are two types of corpora
    • Volatile: Resides in RAM (but can be saved to and restored from disk!)
    • Permanent: Resides on disk with only some indexing in RAM

Typical Text Cleaning Operations

  • Remove non-ASCII characters
  • Remove numbers
  • Consolidate whitespace
  • Remove punctuation
  • Convert to all lower case
  • Can be done in almost any programming language via regular expressions
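
The cleaning steps above can be sketched with regular expressions. This is a minimal illustration in Python's standard library, not the tm package's own code; the function name clean_text and the sample string are made up for the example. Whitespace is consolidated last because the earlier substitutions leave runs of spaces behind.

```python
import re

def clean_text(text: str) -> str:
    """Apply the cleaning steps from the list above to one document."""
    # Remove non-ASCII characters (note: this turns "Café" into "Caf")
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Remove numbers
    text = re.sub(r"[0-9]+", " ", text)
    # Remove punctuation (anything that is not a letter, digit, or whitespace)
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Convert to all lower case
    text = text.lower()
    # Consolidate whitespace
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_text("Re: Café budget  MEETING, 3pm in Room 12!")
print(cleaned)  # re caf budget meeting pm in room
```

Dropping non-ASCII characters outright is the bluntest option; it is shown here only because it matches the list above.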

Stemming

  • Replace variations on a word with its stem
    • Example: replace “stem”, “stems”, “stemmer”, “stemming”, “stemmed” with “stem”
  • Most often done using an algorithm called the Porter Stemmer
  • You'll also see the term Snowball - that's Porter's language for writing stemmers, and the implementation you'll usually find in packages
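
To make the idea concrete, here is a deliberately naive suffix stripper in Python. It is not the Porter Stemmer and not Snowball; the suffix list, the minimum stem length, and the name naive_stem are assumptions chosen only so the slide's example works. A real Porter or Snowball stemmer applies many more rules and conditions, so its output can differ from this sketch.

```python
def naive_stem(word: str) -> str:
    """Toy stemmer: strip one common suffix, then undo consonant doubling."""
    for suffix in ("ing", "ed", "er", "s"):
        # Only strip if at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # "stemm" -> "stem": drop one of a doubled final consonant
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeioulsz":
        word = word[:-1]
    return word

words = ["stem", "stems", "stemmer", "stemming", "stemmed"]
stems = [naive_stem(w) for w in words]
print(stems)  # ['stem', 'stem', 'stem', 'stem', 'stem']
```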

Stopwords

  • Frequently occurring words with little semantic value
    • Examples: a, an, and, or, this, …
  • Stopword removal cuts down the dataset size and run time
  • Defining stopwords depends on what's in the corpus
    • Usually start with a predefined set and add more stopwords
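
A small Python sketch of that workflow: start with a predefined set, then add words that are noise in this particular corpus. Both word lists here are illustrative assumptions, not the tm package's stopword list.

```python
# Predefined starter set (illustrative, far shorter than a real list)
BASE_STOPWORDS = {"a", "an", "and", "or", "this", "the", "is", "of", "to"}

# Extra stopwords for an email corpus (an assumption for this example)
corpus_specific = {"re", "fwd", "sent"}

stopwords = BASE_STOPWORDS | corpus_specific

tokens = "re this is the budget and the schedule".split()
kept = [t for t in tokens if t not in stopwords]
print(kept)  # ['budget', 'schedule']
```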

Text Mining 102

Bag of Words

  • A document is treated as a bag of words
  • After you've cleaned a document you have a bag
    • aka multiset - like a set but can have more than one copy of an element
    • like a set, order doesn't matter
  • Of words - we got rid of the numbers, punctuation, etc.
    • we probably stemmed and maybe removed stopwords
  • In the literature, you'll see the words called terms or types
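
In Python, one way to model a bag is collections.Counter, which behaves like a multiset. The token list below is a made-up example of an already-cleaned document.

```python
from collections import Counter

tokens = "budget meeting budget vote".split()
bag = Counter(tokens)

print(bag["budget"])  # 2
print(bag["agenda"])  # 0 (terms not in the bag count as zero)

# Order doesn't matter: the reversed document gives the same bag
same_bag = bag == Counter(reversed(tokens))
print(same_bag)  # True
```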

Document-Term Matrix

  • We have some documents in our corpus - bags of words
  • We'll call the words terms
  • Make a matrix - a spreadsheet
    • The row labels are documents
    • The column labels are terms
  • That's a document-term matrix
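
A minimal sketch of building such a matrix in plain Python, assuming two tiny, already-cleaned documents with invented names. The cells hold term counts here, which is only one of the possible choices.

```python
from collections import Counter

docs = {
    "email1": "budget meeting budget",
    "email2": "meeting agenda",
}

# One bag of words per document
bags = {name: Counter(text.split()) for name, text in docs.items()}

# Column labels: every term seen anywhere in the corpus, sorted
terms = sorted(set().union(*bags.values()))

# Rows are documents, columns are terms
matrix = [[bags[name][term] for term in terms] for name in docs]

print(terms)   # ['agenda', 'budget', 'meeting']
print(matrix)  # [[0, 2, 1], [1, 0, 1]]
```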

Ah, But What's in the Cells?

  • Numbers!
  • Three common options
    1. One if the term occurs in the document, zero if it doesn't
    2. The number of times the term occurs in the document
    3. A weighting function of how often the term occurs in the document and across the corpus (TF-IDF is a common choice)
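
The three options can be compared on a small count matrix. The sketch below uses one common variant of TF-IDF for the third option, count times log(N / document frequency); other variants exist, and this is not necessarily the one any particular package uses. The counts are invented.

```python
import math

# Option 2: raw counts (rows are documents, columns are terms)
counts = [[0, 2, 1],
          [1, 0, 1]]
n_docs = len(counts)
n_terms = len(counts[0])

# Option 1: one if the term occurs in the document, zero if it doesn't
binary = [[1 if c else 0 for c in row] for row in counts]

# Option 3: weight each count by how rare the term is in the corpus
doc_freq = [sum(1 for row in counts if row[j]) for j in range(n_terms)]
tfidf = [[c * math.log(n_docs / doc_freq[j]) for j, c in enumerate(row)]
         for row in counts]

print(binary)    # [[0, 1, 1], [1, 0, 1]]
print(doc_freq)  # [1, 1, 2]
print(tfidf[0][2])  # 0.0 (a term found in every document gets weight zero)
```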

The Document-Term Matrix is Sparse

  • Most of the numbers are zero
  • So usually you only have to store triples
    • (document index, term index, non-zero value)
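
As a sketch, the triples for a small dense matrix can be listed like this in Python. Indices are zero-based here, which is an assumption of the example rather than a convention from the slides.

```python
dense = [[0, 2, 1],
         [1, 0, 1]]

# Keep only the non-zero cells as (document index, term index, value)
triples = [(i, j, value)
           for i, row in enumerate(dense)
           for j, value in enumerate(row)
           if value != 0]

print(triples)  # [(0, 1, 2), (0, 2, 1), (1, 0, 1), (1, 2, 1)]
```

With only six cells the saving is small; in a real corpus with thousands of terms, almost every cell is zero and the triples are a tiny fraction of the full matrix.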

What Can You Do With a Matrix?

  • Computational linear algebra, of course!
  • Most text mining algorithms are based on linear algebra / vector spaces
  • Programming language choices
    • R and Python both have good libraries for this
    • The hard stuff is usually done via C/C++
    • A lot of Java code out there too
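
One simple vector-space operation is cosine similarity, which treats each row of the document-term matrix as a vector and measures the angle between two documents. The sketch below uses plain Python for clarity; the rows are invented, and a real project would more likely lean on a linear algebra library.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc_a = [0, 2, 1]
doc_b = [1, 0, 1]
doc_c = [0, 4, 2]  # same proportions as doc_a, just twice as long

sim_ab = cosine(doc_a, doc_b)
sim_ac = cosine(doc_a, doc_c)
print(round(sim_ab, 3))  # 0.316
print(round(sim_ac, 3))  # 1.0
```

Document length drops out of the comparison: doc_a and doc_c use the same terms in the same proportions, so they score as identical.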

Wait - You Mangled the Documents - How Can This Possibly Work?

  • It works surprisingly well, actually
  • The tricky part is visualizing the multi-dimensional objects
  • See Landauer et al. (2007) for some of the math and applications

One Final Note

  • The transpose of a document-term matrix is called … wait for it …
    • A term-document matrix!
    • You'll see both in the literature
    • We'll stick with document-term matrix
    • That's the way R Commander's text mining code works
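
For a matrix stored as a list of rows, the transpose is a one-liner in Python. The numbers are an invented example.

```python
# Document-term matrix: rows are documents, columns are terms
dtm = [[0, 2, 1],
       [1, 0, 1]]

# Term-document matrix: rows are terms, columns are documents
tdm = [list(column) for column in zip(*dtm)]

print(tdm)  # [[0, 1], [2, 0], [1, 1]]
```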

Tools - R Commander and RcmdrPlugin.temis

R Commander

  • R Commander (Fox, 2005) is a general purpose GUI for R
  • Most common packaged analyses are available
  • Can generate reports and web pages
  • Can save a script of the processing for documentation or later execution!
  • Wide range of plugins, including …

RcmdrPlugin.temis

Demo

Road Map

Coming Soon To a GitHub Repo Near You!

  • Explore the text mining algorithms in RcmdrPlugin.temis
  • Add social network analysis via package snatm (Feinerer et al. 2011)
  • Add functionality to the package with journalism uses in mind
  • Port the Bash and Perl pieces to R so the package will run on Windows
  • ???
  • Profit!

References