HM
July 26, 2015
The data used was graciously made available at: http://www.corpora.heliohost.org/
After loading the data sets, I will create a corpus from each one.
I will clean each corpus using the tm package in R, removing punctuation, extra white space, profanity, and stop words.
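The cleaning step might look roughly like the sketch below. It is only an illustration under stated assumptions: `sample_text` and the one-word `profanity` list are hypothetical stand-ins (the actual data sets and profanity list are not shown here), and the order of the transformations is one reasonable choice rather than a record of the exact pipeline.

```r
library(tm)

# Hypothetical sample input; the real input would be the loaded data sets.
sample_text <- c("Hello, World!  This is   a test.",
                 "The darn cat sat on the mat...")

# Hypothetical profanity list, for illustration only.
profanity <- c("darn")

# Build a corpus with one document per element of the character vector.
corpus <- VCorpus(VectorSource(sample_text))

# Lower-case first so stop words and profanity match regardless of case.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), profanity))

# Collapse the runs of spaces left behind by the removals.
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a plain character vector.
cleaned <- trimws(unname(sapply(corpus, as.character)))
print(cleaned)
```

Stripping white space last matters in this sketch, because removing words and punctuation leaves gaps behind.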
I will then tokenize it using the RWeka package.
This will produce 2-gram, 3-gram, 4-gram, and 5-gram tokens.
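A minimal sketch of the tokenizing step follows, assuming the cleaned text is available as a character vector. The names `cleaned_text`, `make_ngrams`, and `tokens_by_n` are hypothetical and chosen only for this example; `NGramTokenizer` and `Weka_control` come from RWeka, which needs a working Java installation.

```r
library(RWeka)

# Hypothetical cleaned text standing in for the real corpus content.
cleaned_text <- "hello world test one two three"

# Return all n-grams of exactly size n (min and max are both set to n).
make_ngrams <- function(text, n) {
  NGramTokenizer(text, Weka_control(min = n, max = n))
}

# Build one token vector per n-gram size, from 2-grams up to 5-grams.
tokens_by_n <- lapply(2:5, function(n) make_ngrams(cleaned_text, n))
names(tokens_by_n) <- paste0(2:5, "gram")

print(tokens_by_n[["2gram"]])
```

Keeping each n-gram size in its own list element makes it easy to deduplicate or count the sizes separately later.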
# Drop duplicate tokens so each one appears only once
tokens <- unique(tokens)
date()
[1] "Sun Jul 26 16:49:38 2015"