Coursera Capstone Project - A natural language processing (NLP) model with 'ngram' continuation probabilities via Kneser-Ney smoothing

Mark A. Jack

November 14, 2016

Text corpus, package 'quanteda' and ngrams

  • Three data files from 'blogs', 'twitter' and 'news' feeds are combined into a text corpus using the 'tm' package.

  • Via the library 'quanteda', the corpus is tokenized: punctuation, numbers and extra white space are removed, and words are converted to lowercase.

  • 'ngrams' - unigrams, bigrams, trigrams and quadgrams - are generated via a document-feature matrix (dfm).

  • A 'dfm' allows for quick and easy analysis of the most frequently occurring ngrams.
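
The tokenize-then-count pipeline described in the bullets above can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's R code: the function names tokenize, ngrams and ngram_counts are hypothetical, the two sample sentences are made up, and a simple counter stands in for quanteda's 'dfm'.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    # Digits, punctuation and white space are dropped, similar in spirit
    # to the cleaning step described above (an assumption, not quanteda).
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(documents, n):
    """Count how often each n-gram occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(ngrams(tokenize(doc), n))
    return counts

# Made-up sample data for illustration only.
docs = ["The cat sat on the mat.", "The cat ate 2 fish!"]
bigrams = ngram_counts(docs, 2)
print(bigrams.most_common(1))  # [(('the', 'cat'), 2)]
```

Sorting the counter by frequency, as most_common does here, gives the kind of ranking that the bar plots of the most frequent ngrams are based on.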

Most frequently occurring ngrams

Three horizontal bar plots show the number of occurrences of each of the most common words and 2-, 3- and 4-word combinations (unigrams, bigrams, trigrams, quadgrams).

Figure 1:

Bigram and trigram occurrences, alphabetically sorted


Figure 2:

Continuation probabilities, Kneser-Ney smoothing and shiny app

  • The continuation probability of each unigram is estimated from the number of distinct bigrams in which it occurs as the continuation; higher-order ngrams are treated analogously.

  • Probabilities are corrected via Kneser-Ney smoothing, which 'estimates' the likelihood of 'ngrams' missing from the corpus by reserving some probability mass for them.
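
To make the two bullets above concrete, here is a minimal sketch of interpolated Kneser-Ney smoothing for bigrams only, written in plain Python rather than the project's R. The function name kneser_ney_bigram, the discount of 0.75 and the toy counts are assumptions for illustration; the project's own implementation, which also covers trigrams and quadgrams, may differ. Words never seen as a continuation still receive probability zero in this sketch.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Return a function prob(w, v) estimating P(w | v) for bigrams (v, w)."""
    context_total = Counter()         # c(v): how often v starts a bigram
    followers = defaultdict(set)      # distinct words seen after v
    predecessors = defaultdict(set)   # distinct words seen before w
    for (v, w), c in bigram_counts.items():
        context_total[v] += c
        followers[v].add(w)
        predecessors[w].add(v)
    n_bigram_types = len(bigram_counts)

    def prob(w, v):
        # Continuation probability: share of distinct bigram types ending in w.
        p_cont = len(predecessors.get(w, ())) / n_bigram_types
        total = context_total.get(v, 0)
        if total == 0:
            return p_cont  # unseen context: rely on the continuation term
        # Subtract a fixed discount from every observed bigram count ...
        discounted = max(bigram_counts.get((v, w), 0) - discount, 0) / total
        # ... and hand the freed mass to the continuation distribution.
        backoff_weight = discount * len(followers[v]) / total
        return discounted + backoff_weight * p_cont

    return prob

# Made-up toy counts for illustration only.
counts = Counter({("the", "cat"): 2, ("the", "mat"): 1,
                  ("a", "cat"): 1, ("cat", "sat"): 1})
prob = kneser_ney_bigram(counts)
print(prob("sat", "the"))  # 0.125, although "the sat" never occurs
```

In this toy example the probabilities of "cat", "mat" and "sat" after "the" sum to one, so the mass given to the unseen bigram is exactly what the discount removed from the observed ones.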