Coursera Capstone Project - A natural language processing (NLP) model with 'ngram' continuation probabilities via Kneser-Ney smoothing.

Mark A. Jack; November 24, 2016.

Text corpus, package 'quanteda' and ngrams

  • Three data files from 'blogs', 'twitter' and 'news' feeds are combined into a single text corpus using the 'tm' package.

  • Via the library 'quanteda', the corpus is tokenized; punctuation, numbers, symbols and extra white space are removed and the text is converted to lowercase.

  • 'ngrams' - unigrams, bigrams, trigrams and quadgrams - are generated via a document-frequency matrix (dfm).

  • A 'dfm' allows for quick and easy analysis of the most frequently occurring ngrams (see the sketch after this list).
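A minimal sketch of this pipeline, using quanteda end-to-end (quanteda's corpus() stands in for the 'tm' step here; the file names are those of the standard capstone data set and are an assumption):

    library(quanteda)

    # Assumed capstone file names (en_US.* files)
    blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
    twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
    news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)

    # Combine the three feeds into one corpus
    corp <- corpus(c(blogs, twitter, news))

    # Tokenize, strip punctuation/numbers/symbols, then lowercase
    toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE,
                   remove_symbols = TRUE)
    toks <- tokens_tolower(toks)

    # Document-frequency matrices for unigrams through quadgrams
    dfm_1 <- dfm(toks)
    dfm_2 <- dfm(tokens_ngrams(toks, n = 2))
    dfm_3 <- dfm(tokens_ngrams(toks, n = 3))
    dfm_4 <- dfm(tokens_ngrams(toks, n = 4))

    # The dfm makes ranking the most frequent ngrams a one-liner
    topfeatures(dfm_1, 20)
    topfeatures(dfm_2, 20)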

Most frequently occurring ngrams

We show the number of occurrences of each of the most common unigrams and bigrams in horizontal bar plots.

Figure 1. Unigram and bigram occurrences in descending order.
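One way such a bar plot can be produced (a sketch rather than the actual plotting code; it assumes the unigram dfm 'dfm_1' from the pipeline sketch above and uses ggplot2):

    library(ggplot2)

    # Top 20 unigrams from the dfm; topfeatures() returns a named numeric vector
    top_uni <- topfeatures(dfm_1, 20)
    plot_df <- data.frame(ngram = names(top_uni), count = unname(top_uni))

    # Horizontal bars, most frequent unigram on top
    ggplot(plot_df, aes(x = reorder(ngram, count), y = count)) +
      geom_bar(stat = "identity") +
      coord_flip() +
      labs(x = NULL, y = "Occurrences")

reorder() sorts the bars by count, so after coord_flip() the most frequent ngram appears at the top of the plot.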

Trigram and quadgram occurrences

For unigrams, bigrams, and trigrams, 1% of the selected text corpus was used. The sample for the quadgrams was restricted to 0.1% of the corpus due to memory limitations.

Figure 2. Trigram and quadgram occurrences, alphabetically sorted.
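One simple way to draw such fractional samples (a sketch, assuming the line vectors from the pipeline sketch above; the 1% and 0.1% rates are the ones quoted in the text):

    set.seed(1234)  # reproducible sampling

    # Keep each line independently with probability 'rate'
    sample_lines <- function(x, rate) x[rbinom(length(x), size = 1, prob = rate) == 1]

    lines_small <- sample_lines(c(blogs, twitter, news), 0.01)   # uni-/bi-/trigrams
    lines_tiny  <- sample_lines(c(blogs, twitter, news), 0.001)  # quadgrams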

Continuation probabilities, Kneser-Ney smoothing and shiny app

  • The continuation probability of each unigram is estimated from the number of distinct bigrams that the unigram completes (and analogously for higher-order ngrams); see the sketch below.

  • Probabilities are corrected via Kneser-Ney smoothing, which redistributes probability mass so that ngrams missing from the corpus still receive a non-zero estimated likelihood.
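A minimal base-R sketch of the unigram continuation probability, P_cont(w) = (number of distinct bigram types ending in w) / (total number of distinct bigram types). It assumes a data frame 'bigrams' with columns 'w1' and 'w2' (hypothetical names) listing the observed bigrams:

    # 'bigrams': one row per observed bigram token, columns w1 (history)
    # and w2 (continuation word) -- hypothetical names for illustration
    types  <- unique(bigrams[, c("w1", "w2")])  # keep distinct bigram types only
    n_hist <- table(types$w2)                   # distinct histories preceding each word

    # Continuation probability: share of all bigram types that end in w
    p_cont <- n_hist / nrow(types)

Counting types rather than tokens is the key idea: a word that appears in many distinct contexts gets a high continuation probability even if its raw frequency is modest. Full Kneser-Ney smoothing combines these continuation probabilities with absolute discounting of the raw ngram counts.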