Rafael Marino. GSSA Data Analyst.
Programming Capability SIT. July 25th 2016.
This presentation is an offshoot of the final Capstone project of the Johns Hopkins University Data Science Specialization. The project consisted of building a predictive-text Shiny app that estimates and displays the word the user is most likely to type next, given the text entered so far.
CRAN Natural Language Processing Task View.
Text Mining Infrastructure in R <- Comprehensive Journal of Statistical Software article.
Intro to the tm package <- Getting started pdf.
This presentation will focus on preprocessing.
blogs <- readLines("C:/Users/marino.re/Box Sync/Capstone/data/en_US/en_US.blogs.txt",
                   encoding = "UTF-8", skipNul = TRUE)
set.seed(50) # Reproducibility seed
blogsSample <- sample(blogs, size = length(blogs) * 0.01) # Sampling 1% of the lines
rm(blogs) # Free the memory held by the full file
Corpora can be created with the VCorpus() function (the V stands for volatile: the corpus is an R object held entirely in memory, so once the object is deleted the corpus is gone). A source must then be specified; in this case the source is a character vector, so VectorSource() can be used.
library(tm)
corpus <- VCorpus(VectorSource(blogsSample))
as.character(corpus[[6835]])
[1] "The charity inspired by the encounter has raised $60m and in 2009 said it was supporting 54 schools in across Homeward serving 28,475 students. Obama donated $100,000 to the group from the proceeds of his Nobel prize. The book has become required reading in the US of A."
| Function | What does it do? |
|---|---|
| asPlain() | Converts the document to a plain text document |
| loadDoc() | Triggers load on demand |
| removeCitation() | Removes citations from e-mails |
| removeMultipart() | Removes non-text from multipart e-mails |
| removeNumbers() | Removes numbers |
| removePunctuation() | Removes punctuation marks |
| removeSignature() | Removes signatures from e-mails |
| removeWords() | Removes a given set of words (e.g., stopwords) |
| replaceWords() | Replaces a set of words with a given phrase |
| stemDoc() | Stems the text document |
| stripWhitespace() | Removes extra whitespace |
| tmTolower() | Conversion to lower case letters |
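The function names above follow the JSS article; in current versions of tm some have changed, e.g. stemDoc() is now stemDocument() and lower-casing is handled with content_transformer(tolower). As a minimal sketch of stemming the sampled corpus (assuming the SnowballC package is installed, since stemDocument() relies on its Porter stemmer):

library(tm)
library(SnowballC) # provides the Porter stemmer used by stemDocument()
# Stemming reduces inflected forms ("running", "runs") to a common root ("run"),
# so related words collapse into a single term when counting frequencies
stemmedCorpus <- tm_map(corpus, stemDocument)
as.character(stemmedCorpus[[6835]])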
transformations <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower)) # base tolower() wrapped for tm
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  # Removing words leaves extra white space, so stripWhitespace has to come last
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}
The same document before the transformations:
"The charity inspired by the encounter has raised $60m and in 2009 said it was supporting 54 schools in across Homeward serving 28,475 students. Obama donated $100,000 to the group from the proceeds of his Nobel prize. The book has become required reading in the US of A."
cleanCorpus <- transformations(corpus)
as.character(cleanCorpus[[6835]])
[1] " charity inspired encounter raised m said supporting schools across homeward serving students obama donated group proceeds nobel prize book become required reading us "
N-grams. An n-gram is a contiguous sequence of n items from a given sequence of text or speech.
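As a small base-R illustration (the token vector below is made up for the example), the bigrams (n = 2) of a sentence are obtained by pairing each word with the one that follows it; for whole corpora a custom tokenizer, e.g. from RWeka or the NLP package's ngrams() function, is usually plugged into tm instead.

tokens <- c("the", "book", "has", "become", "required", "reading")
bigrams <- paste(head(tokens, -1), tail(tokens, -1)) # pair each word with its successor
bigrams
# "the book" "book has" "has become" "become required" "required reading"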
Document Term Matrix (DTM). A DTM is a matrix in which each document occupies a row and each unique word in the whole corpus constitutes a feature, or column. This is a very convenient structure for carrying out frequency counts.
One problem with DTMs is that the matrix can be very sparse, which makes calculations slow and memory-hungry. The slam package (Sparse Lightweight Arrays and Matrices) is recommended for dealing with sparse matrices.
|  | Term 1 | Term 2 | Term 3 | … | nth Term |
|---|---|---|---|---|---|
| Doc 1 | 0 | 2 | 0 | … | w |
| Doc 2 | 3 | 0 | 1 | … | x |
| Doc 3 | 0 | 0 | 1 | … | y |
| … | … | … | … | … | … |
| mth Doc | 0 | 0 | 1 | … | z |
Each element of the matrix is the frequency count of a specific term in a specific document.
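A minimal sketch of building a DTM from the cleaned corpus above and counting term frequencies with slam (tm stores the DTM as a sparse simple_triplet_matrix, so the slam helpers compute sums without converting it to a dense matrix):

library(tm)
library(slam)
dtm <- DocumentTermMatrix(cleanCorpus) # documents in rows, unique terms in columns
dim(dtm) # number of documents x number of terms
termFreq <- sort(col_sums(dtm), decreasing = TRUE) # total frequency of each term
head(termFreq, 10) # ten most frequent terms in the sample
dtmSmall <- removeSparseTerms(dtm, 0.99) # drop terms missing from more than 99% of documents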