Initial Viewing of Data and Data Reduction

The three text files that we will use to build the prediction algorithm are quite large. The blogs file contains about 900,000 lines and 37 million words; the news file about one million lines and 34 million words; and the twitter file about two million lines and 30 million words.
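
These counts can be reproduced with a few lines of R. The following is only a minimal sketch; the file names are an assumption on my part, not taken from this report.

library(stringi)

# Read each raw file; the file names below are assumed
blogs.full   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news.full    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter.full <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Line and word counts for the blogs file; the other two work the same way
length(blogs.full)                  # line count, roughly 900,000
sum(stri_count_words(blogs.full))   # word count, roughly 37 million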

A subsample of 5,000 lines was drawn from each file to work with during the initial stages.
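
The report does not specify how the lines were chosen, so the following is only a sketch, assuming a random sample with a fixed seed:

set.seed(1234)  # assumed seed, included only for reproducibility

# Draw 5,000 random lines from each full file
blogs   <- sample(blogs.full, 5000)
news    <- sample(news.full, 5000)
twitter <- sample(twitter.full, 5000)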

Create a Corpus

A corpus was created from the three text samples using the tm library, and profanity was removed from the corpus. The list of profane words to remove was obtained here: https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

library(tm)

# Combine the three subsamples and build a corpus
contexts <- c(blogs, news, twitter)
doc.vec <- VectorSource(contexts)
doc.corpus <- Corpus(doc.vec)
# Remove profanity using the word list linked above
profanity <- readLines("project/profanity_list.txt")
doc.corpus <- tm_map(doc.corpus, removeWords, profanity)

Tokenization

Document term matrices were created by dividing the corpus into groupings of one, two, and three terms using the RWeka library. These groupings, called “n-grams”, will be used to build an n-gram model that predicts the next word from the previous one, two, or three words. Sparse terms (terms not commonly occurring in the corpus) were removed, and the data was reshaped into a table with the following columns: Docs (blogs, news, or twitter), Terms, and value.
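
As a sketch of how the bi-gram matrix can be built with tm and RWeka (the tokenizer wrapper and the sparsity threshold are assumptions, not values from this report):

library(RWeka)
library(reshape2)

# Tokenizer that splits text into two-word groupings (bi-grams)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Build the document term matrix and drop sparse terms
# (the 0.4 threshold is an assumed value)
DTM_bi_gram <- DocumentTermMatrix(doc.corpus, control = list(tokenize = BigramTokenizer))
DTM_bi_gram.common <- removeSparseTerms(DTM_bi_gram, 0.4)

# Reshape into the Docs / Terms / value layout shown below
DTM_bi_gram.common.dense <- melt(as.matrix(DTM_bi_gram.common))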

Looking, for example, at the head (first six rows) of the document term matrix in which the corpus is divided into pairs of words, we can see that the most common pairs are “of the”, “in the”, “to the”, and “to be”, at least in the context of the blogs and news texts.

##       Docs  Terms value
## 10354    1 of the  1000
## 10355    2 of the   876
## 7268     2 in the   875
## 7267     1 in the   837
## 16366    1 to the   502
## 15829    1  to be   385

Visualization

Here is a visualization of the counts of the five most common bi-grams (pairs of words).

# Select the five most frequent bi-grams
top.pairs <- head(DTM_bi_gram.common.dense[order(-DTM_bi_gram.common.dense$value), ], n = 5)

# Plot their counts, labelling each bar with its word pair
barplot(top.pairs$value, names.arg = top.pairs$Terms,
        main = "Frequency of the Top Five Word Pairs")