Rafael Marino. GSSA Data Analyst.
Programming Capability SIT. July 25th 2016.
This presentation is an offshoot of the final Capstone project of the Johns Hopkins University Data Science Specialization. The project consisted of building a predictive-text Shiny app that estimates and displays the word the user is most likely to type next, given the text entered so far.
CRAN Natural Language Processing Task View.
Text Mining Infrastructure in R <- Comprehensive Journal of Statistical Software article.
Intro to the tm package <- Getting started pdf.
This presentation will focus on preprocessing.
blogs <- readLines("C:/Users/marino.re/Box Sync/Capstone/data/en_US/en_US.blogs.txt",
                   encoding = "UTF-8", skipNul = TRUE)
set.seed(50) # Reproducibility seed
blogsSample <- sample(blogs, size = length(blogs) * 0.01) # Sampling 1% of the lines
rm(blogs) # Free the memory held by the full file
Corpora can be created with the VCorpus() function (the V stands for volatile: the corpus is an R object held entirely in memory, so once the object is deleted the corpus is gone). A source must then be specified; in this case the source is a character vector, so VectorSource() can be used.
library(tm)
corpus <- VCorpus(VectorSource(blogsSample))
as.character(corpus[[6835]])
[1] "The charity inspired by the encounter has raised $60m and in 2009 said it was supporting 54 schools in across Homeward serving 28,475 students. Obama donated $100,000 to the group from the proceeds of his Nobel prize. The book has become required reading in the US of A."
| Function | What does it do? |
|---|---|
| asPlain() | Converts the document to a plain text document |
| loadDoc() | Triggers load on demand |
| removeCitation() | Removes citations from e-mails |
| removeMultipart() | Removes non-text from multipart e-mails |
| removeNumbers() | Removes numbers |
| removePunctuation() | Removes punctuation marks |
| removeSignature() | Removes signatures from e-mails |
| removeWords() | Removes a given set of words (e.g., stopwords) |
| replaceWords() | Replaces a set of words with a given phrase |
| stemDoc() | Stems the text document |
| stripWhitespace() | Removes extra whitespace |
| tmTolower() | Conversion to lower case letters |
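The function names above follow the JSS article; in current versions of tm some have changed, e.g. stemDoc() is now stemDocument() and lower-casing is handled with content_transformer(tolower). As a minimal sketch of stemming the sampled corpus (assuming the SnowballC package is installed, since stemDocument() relies on its Porter stemmer):

library(tm)
library(SnowballC) # provides the Porter stemmer used by stemDocument()
# Stemming reduces inflected forms ("running", "runs") to a common root ("run"),
# so related words collapse into a single term when counting frequencies
stemmedCorpus <- tm_map(corpus, stemDocument)
as.character(stemmedCorpus[[6835]])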
transformations <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower)) # base tolower() wrapped for tm
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  # Removing words leaves extra white space, so stripWhitespace has to come last
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}
The same document before the transformations:
"The charity inspired by the encounter has raised $60m and in 2009 said it was supporting 54 schools in across Homeward serving 28,475 students. Obama donated $100,000 to the group from the proceeds of his Nobel prize. The book has become required reading in the US of A."
cleanCorpus <- transformations(corpus)
as.character(cleanCorpus[[6835]])
[1] " charity inspired encounter raised m said supporting schools across homeward serving students obama donated group proceeds nobel prize book become required reading us "
N-grams. An n-gram is a contiguous sequence of n items from a given sequence of text or speech.
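As a small base-R illustration (the token vector below is made up for the example), the bigrams (n = 2) of a sentence are obtained by pairing each word with the one that follows it; for whole corpora a custom tokenizer, e.g. from RWeka or the NLP package's ngrams() function, is usually plugged into tm instead.

tokens <- c("the", "book", "has", "become", "required", "reading")
bigrams <- paste(head(tokens, -1), tail(tokens, -1)) # pair each word with its successor
bigrams
# "the book" "book has" "has become" "become required" "required reading"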
Document Term Matrix (DTM). A DTM is a matrix in which each document occupies a row and each unique word in the whole corpus constitutes a feature, or column. This is a very convenient structure for carrying out frequency counts.
One problem with DTMs is that the matrix can be very sparse, which makes calculations slow and memory-hungry. The slam package (Sparse Lightweight Arrays and Matrices) is recommended for dealing with sparse matrices.
|  | Term 1 | Term 2 | Term 3 | … | nth Term |
|---|---|---|---|---|---|
| Doc 1 | 0 | 2 | 0 | … | w |
| Doc 2 | 3 | 0 | 1 | … | x |
| Doc 3 | 0 | 0 | 1 | … | y |
| … | … | … | … | … | … |
| mth Doc | 0 | 0 | 1 | … | z |
Each element of the matrix is the frequency count of a specific term in a specific document.
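A minimal sketch of building a DTM from the cleaned corpus above and counting term frequencies with slam (tm stores the DTM as a sparse simple_triplet_matrix, so the slam helpers compute sums without converting it to a dense matrix):

library(tm)
library(slam)
dtm <- DocumentTermMatrix(cleanCorpus) # documents in rows, unique terms in columns
dim(dtm) # number of documents x number of terms
termFreq <- sort(col_sums(dtm), decreasing = TRUE) # total frequency of each term
head(termFreq, 10) # ten most frequent terms in the sample
dtmSmall <- removeSparseTerms(dtm, 0.99) # drop terms missing from more than 99% of documents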