Mark A. Jack
November 24, 2016
Three data files, one each from a 'blogs', a 'twitter' and a 'news' feed, are combined into a single text corpus using the 'tm' package.
Using the 'quanteda' library, the corpus is tokenized and cleaned: punctuation, numbers, surplus white space etc. are removed and words are converted to lowercase.
'ngrams' - unigrams, bigrams, trigrams and quadgrams - are generated via a document-feature matrix (dfm).
A 'dfm' allows for quick and easy analysis of the most frequently occurring ngrams.
We show the number of occurrences of each of the most common unigrams and bigrams in horizontal bar plots.
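The tokenizing and counting steps described above can be sketched in a few lines. This is a minimal, language-agnostic illustration written in plain Python rather than with 'quanteda'; the helper names (tokenize, ngrams, ngram_counts) and the three example sentences are assumptions made for this sketch and are not taken from the project.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only alphabetic words (apostrophes allowed).

    Punctuation, digits and white space are dropped as a side effect.
    """
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(documents, n):
    """Count n-grams over all documents without crossing document boundaries."""
    counts = Counter()
    for doc in documents:
        counts.update(ngrams(tokenize(doc), n))
    return counts

# Hypothetical mini-corpus, for illustration only.
docs = [
    "I love New York.",
    "New York is big, and I love it!",
    "2016 was a big year",
]
bigram_counts = ngram_counts(docs, 2)
top_bigrams = bigram_counts.most_common(2)  # candidates for a bar plot
```

The resulting counter plays the role of one column-summed 'dfm': its most common entries are the values one would feed into a horizontal bar plot.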
For unigrams, bigrams, and trigrams, 1% of the selected text corpus was used. The sample for the quadgrams was restricted to 0.1% of the corpus due to memory limitations for the generated look-up tables.
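One way to draw such a sample is to pick lines at random with a fixed seed so that the result is reproducible. The sketch below assumes random line sampling; the source does not state how the 1% and 0.1% subsets were chosen, and the function name sample_lines, the seed value and the placeholder corpus are hypothetical.

```python
import random

def sample_lines(lines, fraction, seed=2016):
    """Return a reproducible random subset holding about `fraction` of the lines."""
    rng = random.Random(seed)                      # fixed seed: same sample every run
    k = max(1, round(len(lines) * fraction))       # always keep at least one line
    return rng.sample(lines, k)

# Placeholder corpus of 10,000 lines, for illustration only.
corpus_lines = [f"line {i}" for i in range(10_000)]
sample_1pct = sample_lines(corpus_lines, 0.01)     # used for uni-, bi- and trigrams
sample_01pct = sample_lines(corpus_lines, 0.001)   # smaller sample for quadgrams
```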
The continuation probability of each unigram is estimated from the number of distinct bigrams in which it appears as the continuation, i.e. as the final word; higher-order ngrams are treated analogously.
Probabilities are corrected via Kneser-Ney smoothing, which assigns a non-zero likelihood to 'ngrams' missing from the corpus.
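To make the smoothing step concrete, here is a minimal sketch of interpolated Kneser-Ney for bigrams only. It is an illustration under stated assumptions, not the project's implementation: the source does not say which Kneser-Ney variant was used, the discount of 0.75 is merely a common default, and the function name kneser_ney_bigram and the toy counts are hypothetical.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Return a function p(word, prev) using interpolated Kneser-Ney smoothing."""
    context_totals = Counter()        # how often `prev` starts any bigram
    followers = defaultdict(set)      # distinct words seen after `prev`
    predecessors = defaultdict(set)   # distinct words seen before `word`
    for (prev, word), count in bigram_counts.items():
        context_totals[prev] += count
        followers[prev].add(word)
        predecessors[word].add(prev)
    n_bigram_types = len(bigram_counts)

    def probability(word, prev):
        # Continuation probability: share of distinct bigram types ending in `word`.
        p_cont = len(predecessors.get(word, ())) / n_bigram_types
        total = context_totals.get(prev, 0)
        if total == 0:
            return p_cont             # unseen context: fall back to continuation
        # Discounted relative frequency of the observed bigram.
        discounted = max(bigram_counts.get((prev, word), 0) - discount, 0) / total
        # Probability mass freed by discounting, redistributed via p_cont.
        backoff_weight = discount * len(followers[prev]) / total
        return discounted + backoff_weight * p_cont

    return probability

# Toy counts, for illustration only.
counts = Counter({("new", "york"): 3, ("new", "car"): 1,
                  ("in", "york"): 1, ("a", "car"): 1})
p = kneser_ney_bigram(counts)
vocab = {"new", "york", "car", "in", "a"}
seen = p("york", "new")     # bigram observed in the toy counts
unseen = p("york", "a")     # bigram never observed, still gets probability mass
```

In this toy example the bigram ('a', 'york') never occurs, yet it receives a non-zero probability because 'york' follows two distinct words elsewhere; the probabilities for each seen context still sum to one over the vocabulary.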