Clean the data set with the “tm” package.
- Convert all words to lowercase
- Remove stop words
- Strip punctuation
- Strip numbers
- Remove extra whitespace
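
The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the `raw_text` sample is hypothetical, and it assumes the English stop-word list that ships with “tm”.

```r
library(tm)

# Hypothetical sample input; the real project reads the news, blog, and Twitter files.
raw_text <- c("The 3 Quick brown foxes, jumping!!",
              "It is   2 PM in the City.")

corpus <- VCorpus(VectorSource(raw_text))

# Apply the steps in the order listed above.
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # stop words (assumed English)
corpus <- tm_map(corpus, removePunctuation)                  # punctuation
corpus <- tm_map(corpus, removeNumbers)                      # numbers
corpus <- tm_map(corpus, stripWhitespace)                    # collapse repeated whitespace

# Pull the cleaned text back out as a plain character vector.
cleaned <- vapply(corpus,
                  function(doc) paste(content(doc), collapse = " "),
                  character(1))
print(cleaned)
```

Note that `stripWhitespace` collapses runs of whitespace into a single space but may leave a leading or trailing space; `trimws()` can remove those if needed.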
Merge the cleaned news, blog, and Twitter data sets.
Build a 3-gram dictionary with the “ngram” package.
- The package tokenizes every run of three consecutive words and calculates the frequency and probability of each 3-gram.
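
As a rough sketch of the merge and 3-gram steps, assuming the cleaned text is already available as character vectors: the three sample vectors below are hypothetical stand-ins for the real data, and merging is done here with base R `paste`, which may differ from what the project used.

```r
library(ngram)

# Hypothetical cleaned samples standing in for the three sources.
news    <- "the quick brown fox jumps"
blogs   <- "over the quick brown dog"
twitter <- "and the lazy cat sleeps"

# Merge everything into one string.
# Assumption: 3-grams that span the boundary between sources are acceptable.
merged <- paste(c(news, blogs, twitter), collapse = " ")

# Tokenize into 3-grams and tabulate them.
ng  <- ngram(merged, n = 3)
tbl <- get.phrasetable(ng)  # columns: ngrams, freq, prop

head(tbl)
```

In the resulting table, `freq` is the count of each 3-gram and `prop` is its share of all 3-grams, which serves as the probability estimate.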