HM
8/23/2015
The data used was graciously made available at: http://www.corpora.heliohost.org/
After loading the data sets, I created corpora from them.
I cleaned each corpus using the tm package in R, removing punctuation, white space, and profanity.
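The cleaning step can be illustrated with a minimal sketch. It is written in Python rather than R and is not the tm code used for this report. The function name clean_text and the BLOCKED_WORDS list are hypothetical placeholders, since the actual word list is not given here.

```python
import string

# Hypothetical placeholder list; the word list actually used is not given.
BLOCKED_WORDS = {"darn", "heck"}

def clean_text(text, blocked=BLOCKED_WORDS):
    """Strip punctuation, drop blocked words, and collapse white space."""
    # Remove every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on any run of white space and drop blocked words (case-insensitive).
    kept = [word for word in text.split() if word.lower() not in blocked]
    # Joining on single spaces leaves exactly one space between words.
    return " ".join(kept)

cleaned = clean_text("Well,  darn it!  That's   fine.")
```

In this sketch, removing punctuation also removes apostrophes, so "That's" becomes "Thats". Whether that is acceptable depends on how the n-grams are used afterwards.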
I then tokenized them using the RWeka package.
This resulted in 2-grams, 3-grams, 4-grams, and 5-grams.
Finally, duplicates were removed and the frequency of each term was calculated.
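The tokenizing and counting steps can be sketched as follows. This is an illustration in Python under simple assumptions (tokens are split on white space, and n-grams do not cross line boundaries), not the RWeka code used for this report. The names ngrams, ngram_frequencies, and the sample lines are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_frequencies(lines, n):
    """Count each distinct n-gram across all lines."""
    counts = Counter()
    for line in lines:
        # Lines shorter than n tokens contribute no n-grams.
        counts.update(ngrams(line.split(), n))
    return counts

# Made-up sample text, already cleaned.
sample = ["the cat sat on the mat", "the cat sat down"]

# One frequency table per n-gram size, from 2-grams to 5-grams.
tables = {n: ngram_frequencies(sample, n) for n in range(2, 6)}
```

Counting with a Counter handles both remaining steps at once: each distinct n-gram appears as a single key, so duplicates collapse, and the value stored under that key is its frequency.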