This document summarizes the exploratory analysis performed on a corpus of documents. The objective of the project is to develop and implement a text prediction algorithm: based on the last words typed by the user, the application suggests the most likely next word.
The analysis relies on two R packages: tm and slam. tm is a suite of text mining functions. For a quick overview of the functions included in the tm package, visit this link. In this project we use the VCorpus function to convert a vector of documents into a corpus and then the DocumentTermMatrix function to convert that corpus into a document term matrix (DTM). The latter can tokenize using custom functions. After the DTM is constructed, the slam package is used to sum over the columns of that sparse matrix and create a vector of n-grams and their frequencies across all the documents.

We load the data for blogs, tweets, and news, and sample each collection to reduce the size of the whole corpus, keeping only 5% of every collection of documents.
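A minimal sketch of the loading and sampling step (the file names, the encoding options, and the use of rbinom to draw the 5% sample are assumptions; the original loading code is not shown here):
# Read the three raw collections (file names assumed)
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
# Keep roughly 5% of every collection
set.seed(1234)
blogs.sample   <- blogs[rbinom(length(blogs), 1, 0.05) == 1]
twitter.sample <- twitter[rbinom(length(twitter), 1, 0.05) == 1]
news.sample    <- news[rbinom(length(news), 1, 0.05) == 1]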
Basic Summary
The following table summarizes the number of documents (lines) in each source (blogs, news, and tweets), the total number of words in each collection, and the average number of words per line (or document). For instance, the blogs collection contains close to 900,000 lines or documents and a total word count of over 38 million; on average, each line in this collection contains about 43 words.
| sourceDoc | numberLines | totalNumberWords | meanNumberWords |
|---|---|---|---|
| Blogs | 899288 | 38370723 | 42.67 |
| Tweets | 2360148 | 31149374 | 13.2 |
| News | 1010242 | 35783083 | 35.42 |
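For reference, a sketch of how these summary statistics could be computed (the use of stringi::stri_count_words is an assumption; any word-counting routine would give similar figures):
library(stringi)
# Word counts per line for each (full, unsampled) collection
words.blogs   <- stri_count_words(blogs)
words.twitter <- stri_count_words(twitter)
words.news    <- stri_count_words(news)
data.frame(
  sourceDoc        = c("Blogs", "Tweets", "News"),
  numberLines      = c(length(blogs), length(twitter), length(news)),
  totalNumberWords = c(sum(words.blogs), sum(words.twitter), sum(words.news)),
  meanNumberWords  = c(mean(words.blogs), mean(words.twitter), mean(words.news))
)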
# Combine the three sampled collections into a single character vector
all.docs.sample <- c(blogs.sample, twitter.sample, news.sample)
library(tm)
# Create a volatile corpus
docs.s <- VCorpus(VectorSource(all.docs.sample))
# Look at the contents
docs.s[[3]]$content
[1] "Of course it goes without saying that all the music above would probably have been familiar to Gardel's audience, too. But it's not that widely known here. It's not difficult or inaccessible music, but it's not a routine part of the popular culture in the same way that it would have been when Gardel was playing El dia que mi quieras, or even as it was when my grandmother's relations were making their own entertainment in Australia with performances of Bizet's Au fond du temple saint."
All these transformations are done using the tm_map function from the tm package, which allows us to apply several operations such as stop word (or other word list) removal, punctuation removal, and conversion to lower case.
# Stop word removal
#docs.s <- tm_map(docs.s, removeWords, stopwords("english"))
# to lower case
docs.s <- tm_map(docs.s, content_transformer(tolower))
# Profanity removal
# We use a list of 10 censored words stored in the vector prof
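# (The definition of prof is not shown here; a minimal possibility, assuming a
#  plain-text file with one censored word per line:)
# prof <- readLines("profanity_words.txt")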
docs.s <- tm_map(docs.s, removeWords, prof)
# Punctuation removal
docs.s <- tm_map(docs.s, removePunctuation, preserve_intra_word_dashes = TRUE)
Using the corpus of documents, we now construct a Document Term Matrix (DTM). This object is a simple triplet matrix structure (efficient for storing large sparse matrices), with each document as a row and each n-gram (or term) as a column.
dtm.docs <- DocumentTermMatrix(docs.s)
Once we have constructed the DTM, we can use the column apply function from the slam package to roll up the DTM and obtain a named vector of frequencies (the total number of times each n-gram appears across all documents), with the n-grams as the names of the vector.
# To get the word dist, we use the slam package for ops with simple triplet mat
library(slam)
sums <- colapply_simple_triplet_matrix(dtm.docs,FUN=sum)
sums <- sort(sums, decreasing=T)
In this case, we create three different tokenizer functions (based on the NGramTokenizer function from the RWeka package) in order to construct DTMs for 2-grams, 3-grams, and 4-grams.
# Functions
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min=2, max=2))}
ThreegramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min=3, max=3))}
FourgramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min=4, max=4))}
# Bigrams
options(mc.cores=1)
dtm.docs.2g <- DocumentTermMatrix(docs.s, control=list(tokenize=BigramTokenizer))
#Threegrams
options(mc.cores=1)
dtm.docs.3g <- DocumentTermMatrix(docs.s, control=list(tokenize=ThreegramTokenizer))
#Fourgrams
options(mc.cores=1)
dtm.docs.4g <- DocumentTermMatrix(docs.s, control=list(tokenize=FourgramTokenizer))
# freqTerms.4g.docs <- findFreqTerms(dtm.docs.4g,20,Inf)
Using these DTMs, we now convert them into frequency vectors. Notice that we sort the resulting vectors in descending order, so that the top entries are the most common n-grams.
# To get the bigram dist, we use the slam package for ops with simple triplet mat
sums.2g <- colapply_simple_triplet_matrix(dtm.docs.2g,FUN=sum)
sums.2g <- sort(sums.2g, decreasing=T)
# To get the threegram dist, we use the slam package for ops with simple triplet mat
sums.3g <- colapply_simple_triplet_matrix(dtm.docs.3g,FUN=sum)
sums.3g <- sort(sums.3g, decreasing=T)
# To get the fourgram dist, we use the slam package for ops with simple triplet mat
sums.4g <- colapply_simple_triplet_matrix(dtm.docs.4g,FUN=sum)
sums.4g <- sort(sums.4g, decreasing=T)
Let’s now plot histograms for each n-gram distribution.
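A minimal sketch of how such plots could be generated from the sorted frequency vectors (the base-graphics barplot calls and the 2x2 layout are assumptions, not the original plotting code):
# Bar plots of the 50 most frequent n-grams of each order
par(mfrow = c(2, 2), mar = c(8, 4, 2, 1))
barplot(head(sums, 50),    las = 2, cex.names = 0.5, main = "Top 50 1-grams")
barplot(head(sums.2g, 50), las = 2, cex.names = 0.5, main = "Top 50 2-grams")
barplot(head(sums.3g, 50), las = 2, cex.names = 0.5, main = "Top 50 3-grams")
barplot(head(sums.4g, 50), las = 2, cex.names = 0.5, main = "Top 50 4-grams")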
Notice how, in the case of single terms (1-grams), a few words have much larger frequencies than the rest. For instance, the 1-gram “the” appears 238 thousand times in our sample of documents (recall we only take 5% of each of the three collections), followed by “and”, which appears 120 thousand times, about half the frequency of the top term! From there, the frequencies of the most common words decrease rapidly: beyond the top 50 terms, the frequency of the next most common 1-grams is below 10 thousand.
As we increase the order of the n-grams, the frequencies drop but the distributions become less skewed. For instance, the top 4-gram is “the end of the”, which appears 374 times. As we see in the plot, the top 50 4-grams occur between 85 and 374 times in all the sampled documents.
Based on this exploratory analysis, I now sketch a basic algorithm for text prediction using n-gram tables.
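The predict.ngram function used below is defined elsewhere; the following is a minimal sketch of how it could work, assuming it looks up the typed words as a prefix in the sorted 3-gram frequency vector and returns the most frequent continuations:
# Assumed, simplified version of predict.ngram: match the input text as a
# prefix of the 3-grams and return the top continuations by frequency.
predict.ngram <- function(input.text, n.top = 3) {
  prefix  <- paste0("^", tolower(input.text), " ")
  matches <- sums.3g[grepl(prefix, names(sums.3g))]
  head(sort(matches, decreasing = TRUE), n.top)
}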
For instance, a prediction based on the last two words of the phrase “and a case of” would be:
input.text <- "case of"
predict.ngram(input.text)
case of the case of a case of an
         26         8          4 