This document summarizes the exploratory analysis performed on a corpus of documents. The objective of the project is to develop and implement a text prediction algorithm: based on the last words typed by the user, the application suggests the most likely next word.
The analysis relies on two R packages: tm and slam. tm is a suite of text mining functions. For a quick overview of the functions included in the tm package, visit this link. In this project we use the VCorpus function to convert a vector of documents into a corpus and then the DocumentTermMatrix function to convert that corpus into a document term matrix (DTM). The latter can tokenize using custom functions. After the DTM is constructed, the slam package is used to sum over the columns of that sparse matrix and create a vector of n-grams and their frequencies across all the documents.

We load the data for blogs, tweets, and news, and sample each collection to reduce the size of the whole corpus, keeping only 5% of every collection of documents.
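A minimal sketch of the loading and sampling step (the file names, the encoding options, and the use of rbinom to draw the 5% sample are assumptions; the original loading code is not shown here):
# Read the three raw collections (file names assumed)
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
# Keep roughly 5% of every collection
set.seed(1234)
blogs.sample   <- blogs[rbinom(length(blogs), 1, 0.05) == 1]
twitter.sample <- twitter[rbinom(length(twitter), 1, 0.05) == 1]
news.sample    <- news[rbinom(length(news), 1, 0.05) == 1]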
Basic Summary
The following table summarizes the number of documents (lines) in each source (blogs, news, and tweets), the total number of words in each collection, and the average number of words per line (or document). For instance, the blogs collection contains close to 900,000 lines or documents and a total word count of over 38 million; on average, each line in this collection contains about 43 words.
| sourceDoc | numberLines | totalNumberWords | meanNumberWords |
|---|---|---|---|
| Blogs | 899288 | 38370723 | 42.67 |
| Tweets | 2360148 | 31149374 | 13.2 |
| News | 1010242 | 35783083 | 35.42 |
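For reference, a sketch of how these summary statistics could be computed (the use of stringi::stri_count_words is an assumption; any word-counting routine would give similar figures):
library(stringi)
# Word counts per line for each (full, unsampled) collection
words.blogs   <- stri_count_words(blogs)
words.twitter <- stri_count_words(twitter)
words.news    <- stri_count_words(news)
data.frame(
  sourceDoc        = c("Blogs", "Tweets", "News"),
  numberLines      = c(length(blogs), length(twitter), length(news)),
  totalNumberWords = c(sum(words.blogs), sum(words.twitter), sum(words.news)),
  meanNumberWords  = c(mean(words.blogs), mean(words.twitter), mean(words.news))
)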
# Combine the three sampled collections into a single character vector
all.docs.sample <- c(blogs.sample, twitter.sample, news.sample)
library(tm)
# Create a volatile corpus
docs.s <- VCorpus(VectorSource(all.docs.sample))
# Look at the contents
docs.s[[3]]$content
[1] "Of course it goes without saying that all the music above would probably have been familiar to Gardel's audience, too. But it's not that widely known here. It's not difficult or inaccessible music, but it's not a routine part of the popular culture in the same way that it would have been when Gardel was playing El dia que mi quieras, or even as it was when my grandmother's relations were making their own entertainment in Australia with performances of Bizet's Au fond du temple saint."
All these transformations are done using the tm_map function from the tm package, which allows us to apply several operations such as stop word (or other word list) removal, punctuation removal, and conversion to lower case.
# Stop word removal
#docs.s <- tm_map(docs.s, removeWords, stopwords("english"))
# to lower case
docs.s <- tm_map(docs.s, content_transformer(tolower))
# Profanity removal
# We use a list of 10 censored words stored in the vector prof
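# (The definition of prof is not shown here; a minimal possibility, assuming a
#  plain-text file with one censored word per line:)
# prof <- readLines("profanity_words.txt")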
docs.s <- tm_map(docs.s, removeWords, prof)
# Punctuation removal
docs.s <- tm_map(docs.s, removePunctuation, preserve_intra_word_dashes = TRUE)
Using the corpus of documents, we now construct a Document Term Matrix (DTM). This object is a simple triplet matrix structure (efficient for storing large sparse matrices), with each document as a row and each n-gram (or term) as a column.
dtm.docs <- DocumentTermMatrix(docs.s)
Once we have constructed the DTM, we can use the column apply function from the slam package to roll up the DTM and obtain a named vector of frequencies (the total number of times each n-gram appears across all documents), with the n-grams as the names of the vector.
# To get the word dist, we use the slam package for ops with simple triplet mat
library(slam)
sums <- colapply_simple_triplet_matrix(dtm.docs,FUN=sum)
sums <- sort(sums, decreasing=T)
In this case, we create three different tokenizer functions (based on the NGramTokenizer function from the RWeka package) in order to construct DTMs for 2-grams, 3-grams, and 4-grams.
# Functions
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min=2, max=2))}
ThreegramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min=3, max=3))}
FourgramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min=4, max=4))}
# Bigrams
options(mc.cores=1)
dtm.docs.2g <- DocumentTermMatrix(docs.s, control=list(tokenize=BigramTokenizer))
#Threegrams
options(mc.cores=1)
dtm.docs.3g <- DocumentTermMatrix(docs.s, control=list(tokenize=ThreegramTokenizer))
#Fourgrams
options(mc.cores=1)
dtm.docs.4g <- DocumentTermMatrix(docs.s, control=list(tokenize=FourgramTokenizer))
# freqTerms.4g.docs <- findFreqTerms(dtm.docs.4g,20,Inf)
Using these DTMs, we now convert them into frequency vectors. Notice that we sort the resulting vectors in descending order, so that the top entries are the most common n-grams.
# To get the bigram dist, we use the slam package for ops with simple triplet mat
sums.2g <- colapply_simple_triplet_matrix(dtm.docs.2g,FUN=sum)
sums.2g <- sort(sums.2g, decreasing=T)
# To get the threegram dist, we use the slam package for ops with simple triplet mat
sums.3g <- colapply_simple_triplet_matrix(dtm.docs.3g,FUN=sum)
sums.3g <- sort(sums.3g, decreasing=T)
# To get the fourgram dist, we use the slam package for ops with simple triplet mat
sums.4g <- colapply_simple_triplet_matrix(dtm.docs.4g,FUN=sum)
sums.4g <- sort(sums.4g, decreasing=T)
Let’s now plot histograms for each n-gram distribution.
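A minimal sketch of how such plots could be generated from the sorted frequency vectors (the base-graphics barplot calls and the 2x2 layout are assumptions, not the original plotting code):
# Bar plots of the 50 most frequent n-grams of each order
par(mfrow = c(2, 2), mar = c(8, 4, 2, 1))
barplot(head(sums, 50),    las = 2, cex.names = 0.5, main = "Top 50 1-grams")
barplot(head(sums.2g, 50), las = 2, cex.names = 0.5, main = "Top 50 2-grams")
barplot(head(sums.3g, 50), las = 2, cex.names = 0.5, main = "Top 50 3-grams")
barplot(head(sums.4g, 50), las = 2, cex.names = 0.5, main = "Top 50 4-grams")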
Notice how, in the case of single terms (1-grams), a few words have much larger frequencies than the rest. For instance, the 1-gram “the” appears 238 thousand times in our sample of documents (recall we only take 5% of each of the three collections), followed by “and”, which appears 120 thousand times, about half the frequency of the top term! From there, the frequencies of the most common words decrease rapidly: beyond the top 50 terms, the frequency of the next most common 1-grams is below 10 thousand.
As we increase the order of the n-grams, the frequencies drop but the distributions become less skewed. For instance, the top 4-gram is “the end of the”, which appears 374 times. As we see in the plot, the top 50 4-grams occur between 85 and 374 times in all the sampled documents.
Based on this exploratory analysis, I now sketch a basic algorithm for text prediction using n-gram tables.
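The predict.ngram function used below is defined elsewhere; the following is a minimal sketch of how it could work, assuming it looks up the typed words as a prefix in the sorted 3-gram frequency vector and returns the most frequent continuations:
# Assumed, simplified version of predict.ngram: match the input text as a
# prefix of the 3-grams and return the top continuations by frequency.
predict.ngram <- function(input.text, n.top = 3) {
  prefix  <- paste0("^", tolower(input.text), " ")
  matches <- sums.3g[grepl(prefix, names(sums.3g))]
  head(sort(matches, decreasing = TRUE), n.top)
}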
For instance, a prediction based on the last two words of the phrase “and a case of” would be:
input.text <- "case of"
predict.ngram(input.text)
case of the case of a case of an
         26         8          4 