Mark A. Jack
November 14, 2016
Three data files, from a 'blogs', a 'twitter' and a 'news' feed, are combined into a text corpus using the 'tm' package.
Via the library 'quanteda', the corpus is tokenized and cleaned: punctuation, numbers and extra white space are removed, and all words are converted to lowercase.
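The cleaning steps above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the 'quanteda' call used for the report; the helper name clean_tokens is hypothetical, and punctuation is replaced by a space, so a word like "don't" is split in two.

```r
# Hypothetical helper (not part of 'tm' or 'quanteda'):
# lowercase the text, drop punctuation and numbers, then split on white space.
clean_tokens <- function(text) {
  text <- tolower(text)
  text <- gsub("[[:punct:]]+", " ", text)          # punctuation -> space
  text <- gsub("[[:digit:]]+", " ", text)          # numbers -> space
  text <- trimws(gsub("[[:space:]]+", " ", text))  # collapse extra white space
  if (nchar(text) == 0) character(0) else strsplit(text, " ", fixed = TRUE)[[1]]
}

tokens <- clean_tokens("Hello, World! 42 times...")
print(tokens)  # "hello" "world" "times"
```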
'ngrams' (unigrams, bigrams, trigrams and quadgrams) are generated via a document-feature matrix (dfm).
A 'dfm' allows for quick and easy analysis of the most frequently occurring ngrams.
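As a rough illustration of what the ngram counts contain, the sketch below builds bigrams from a toy token vector and ranks them by frequency in base R. The helper make_ngrams and the toy tokens are assumptions made for this example; the report itself relies on the 'dfm' from 'quanteda', and on a real corpus ngrams would be built per document so that they do not cross document boundaries.

```r
# Hypothetical helper: all n-grams of order n from one token vector.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy input, not taken from the corpus.
tokens <- c("the", "cat", "sat", "on", "the", "cat")

bigrams <- make_ngrams(tokens, 2)
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
print(head(bigram_counts, 3))  # "the cat" comes first with a count of 2
```

The same helper with n = 1, 3 or 4 gives unigrams, trigrams and quadgrams.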
Three horizontal bar plots show the number of occurrences of the most common words and of the most common 2-, 3- and 4-word combinations (unigrams, bigrams, trigrams, quadgrams).
The continuation probability of each unigram is estimated from the bigrams that the unigram completes, and likewise for higher-order ngrams.
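A minimal sketch of this estimate, assuming the usual Kneser-Ney definition: the continuation probability of a word is the number of distinct bigram types that end in that word, divided by the total number of distinct bigram types. The toy bigrams and the variable names are assumptions made for this example.

```r
# Toy bigram occurrences, not taken from the corpus.
bigrams <- c("the cat", "cat sat", "sat on", "on the", "the cat", "a cat")

# Work on distinct bigram types, not on raw occurrences.
bigram_types <- unique(bigrams)

# Last word of each bigram type (greedy match up to the last space).
last_word <- sub("^.* ", "", bigram_types)

# Share of bigram types that each word completes.
p_cont <- table(last_word) / length(bigram_types)
print(p_cont)  # cat 0.4, on 0.2, sat 0.2, the 0.2
```

Here "cat" completes two distinct bigram types ("the cat" and "a cat") out of five, so it gets the highest continuation probability even though "the cat" occurs twice.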
Probabilities are corrected via Kneser-Ney smoothing, which estimates the likelihood of ngrams that are missing from the corpus.
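For the bigram case, interpolated Kneser-Ney smoothing can be sketched in base R as below. This is an illustration under assumptions, not the code behind the report: the function name kn_bigram_prob, the toy bigrams and the fixed discount d = 0.75 are all choices made for this example. A discount d is subtracted from every observed bigram count, and the freed probability mass is redistributed in proportion to the continuation probability of the next word.

```r
# Hypothetical helper: interpolated Kneser-Ney probability of `word` given `prev`.
# `bigrams` holds bigram occurrences as "w1 w2" strings; `d` is an assumed discount.
kn_bigram_prob <- function(prev, word, bigrams, d = 0.75) {
  counts <- table(bigrams)
  types  <- names(counts)
  first  <- sub(" .*$", "", types)  # first word of each bigram type
  last   <- sub("^.* ", "", types)  # last word of each bigram type

  # Continuation probability: share of bigram types ending in `word`.
  p_cont <- sum(last == word) / length(types)

  from_prev <- first == prev
  c_prev <- sum(counts[from_prev])  # occurrences of bigrams starting with `prev`
  if (c_prev == 0) return(p_cont)   # unseen context: use continuation probability

  c_bigram <- sum(counts[types == paste(prev, word)])
  lambda <- d * sum(from_prev) / c_prev  # mass freed by discounting
  max(c_bigram - d, 0) / c_prev + lambda * p_cont
}

# Toy data, not taken from the corpus.
bigrams <- c("the cat", "cat sat", "sat on", "on the", "the cat", "the dog")

p_seen   <- kn_bigram_prob("the", "cat", bigrams)  # observed bigram
p_unseen <- kn_bigram_prob("the", "sat", bigrams)  # bigram missing from the data
print(c(seen = p_seen, unseen = p_unseen))
```

The missing bigram "the sat" still receives a non-zero probability, and the probabilities of all words following "the" sum to one in this toy example.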