Data Science Capstone

Douglas M Okamoto
October 17, 2016

Corpus is Latin for “body.” A corpus may contain texts in a single language or in multiple languages. The corpora in this presentation consist of one or more 0.1% random samples of size n = 2,000, 150 and 5,000 lines extracted from the blogs, news and twitter text files, respectively, archived in the Helsinki Corpora English language database.

Text File        Line Count    Word Count
en_us blogs         899,288    37,279,275
en_us news           77,259    34,494,539
en_us twitter     2,360,148    30,451,128

N-gram Prediction Model - Algorithm

An n-gram is a sequence of n words, e.g., “new york” is a sequence of two words or bigram (n=2). An n-gram model is a probabilistic language model for predicting the next item in such a sequence. For example, “new york city” is a trigram (n=3) with the third word “city” predicted from a probabilistic language model.
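
As a minimal sketch of the definition above, the following Python snippet splits a phrase into its n-grams. The helper name `ngrams` and the choice to lowercase and split on whitespace are assumptions made for illustration; the presentation does not say how the text was tokenized.

```python
from typing import List, Tuple


def ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive words in text, lowercased.

    Tokenization by whitespace is an assumption for this sketch.
    """
    words = text.lower().split()
    # A text with fewer than n words yields an empty list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


bigrams = ngrams("New York City", 2)
trigrams = ngrams("New York City", 3)
print(bigrams)   # [('new', 'york'), ('york', 'city')]
print(trigrams)  # [('new', 'york', 'city')]
```

The phrase “new york city” therefore contains two bigrams and one trigram.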

If at first you don't succeed.

This n-gram prediction model uses bootstrap re-sampling: it draws successive corpora of 0.1% random samples and, from each one, predicts the fourth word of a quadrigram from the trigram formed by its first three words.
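
One way to picture the re-sampling idea is the sketch below, which is not the application's actual code. It draws a fresh random sample of lines on each attempt and stops at the first sample in which the trigram is followed by some word. All function names, the retry limit `max_tries`, and the choice to sample without replacement are assumptions for illustration.

```python
import random
from collections import Counter
from typing import List, Optional


def sample_lines(lines: List[str], fraction: float, rng: random.Random) -> List[str]:
    """Draw a simple random sample of the lines (without replacement, by assumption)."""
    k = min(len(lines), max(1, round(len(lines) * fraction)))
    return rng.sample(lines, k)


def predict_from_sample(sample: List[str], trigram: str) -> Optional[str]:
    """Return the word that most often follows trigram in sample, or None if unseen."""
    target = tuple(trigram.lower().split())
    followers: Counter = Counter()
    for line in sample:
        words = line.lower().split()
        # Slide a four-word window: three words of context plus the follower.
        for i in range(len(words) - 3):
            if tuple(words[i:i + 3]) == target:
                followers[words[i + 3]] += 1
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def predict_with_retries(lines: List[str], trigram: str, fraction: float = 0.001,
                         max_tries: int = 10, seed: Optional[int] = None) -> Optional[str]:
    """Re-sample up to max_tries times until a sample yields a prediction."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        guess = predict_from_sample(sample_lines(lines, fraction, rng), trigram)
        if guess is not None:
            return guess
    return None  # plays the role of the NA result


# Toy corpus, invented for this example.
toy = ["new york city mayor spoke today"] * 3 + ["the weather is nice"] * 97
print(predict_with_retries(toy, "New York City", fraction=1.0))
```

With a small sampling fraction a single sample can easily miss the trigram altogether, which is why the sketch retries with a new sample instead of giving up after one draw.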

Shiny Application - Input

Enter a trigram (three words), e.g., “New York City.”

Shiny Application - Output

Press the Go! button. If the result is NA, press the Go! button again to try another random sample.

Maximum Likelihood Estimation (MLE)

In the example, the quadrigram “new york city mayor” occurs three times. Since the trigram “new york city” occurs six times, the maximum likelihood estimate of the conditional probability that “mayor” is the next word is 3/6 = 0.5.
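
The arithmetic can be checked with a short Python sketch. The function name `mle_probabilities` is hypothetical, and the follower words other than “mayor” are invented so that the counts add up to the six trigram occurrences; only the 3-out-of-6 figure comes from the example. The sketch also assumes every occurrence of the trigram is followed by a word, so the trigram count equals the sum of the follower counts.

```python
from collections import Counter
from typing import Dict


def mle_probabilities(followers: Counter) -> Dict[str, float]:
    """MLE: P(word | trigram) = count(trigram + word) / count(trigram)."""
    total = sum(followers.values())  # count(trigram), under the stated assumption
    return {word: count / total for word, count in followers.items()}


# "mayor": 3 is from the example; "council" and "police" are made-up fillers.
followers = Counter({"mayor": 3, "council": 2, "police": 1})
probs = mle_probabilities(followers)
print(probs["mayor"])  # 0.5
```

Under this estimate, the predicted next word is simply the follower with the highest count.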