Data Science Capstone

Douglas M Okamoto, PhD
October 9, 2016

Corpus is the Latin word for “body.” A corpus may contain texts in a single language (monolingual corpus) or in multiple languages (multilingual corpus). The corpora used here consist of one or more 0.1% stratified random samples of n = 2,000, 150 and 5,000 lines from the blogs, news and twitter text files, respectively, in the HC Corpora English-language database.

Text File	Line Count	Word Count
en_us blogs	899,288 lines	37,279,275 words
en_us news	77,259 lines	34,494,539 words
en_us twitter	2,360,148 lines	30,451,128 words
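
The stratified sampling described above can be sketched as follows. This is a minimal illustration in Python, not the author's code: the function name `build_corpus`, the toy line lists and the per-source sample sizes are all hypothetical stand-ins for the real text files.

```python
import random

def build_corpus(sources, sizes, seed=None):
    """Stratified sample: draw sizes[name] lines from each source separately."""
    rng = random.Random(seed)
    corpus = []
    for name, lines in sources.items():
        k = min(sizes[name], len(lines))  # never ask for more lines than exist
        corpus.extend(rng.sample(lines, k))  # sampling without replacement
    return corpus

# Toy stand-ins for the blogs, news and twitter files (hypothetical data).
sources = {
    "blogs": [f"blog line {i}" for i in range(100)],
    "news": [f"news line {i}" for i in range(20)],
    "twitter": [f"tweet {i}" for i in range(200)],
}
sizes = {"blogs": 10, "news": 2, "twitter": 20}

corpus = build_corpus(sources, sizes, seed=1)
print(len(corpus))  # 10 + 2 + 20 = 32 lines
```

Sampling each source separately keeps the small news file represented in every corpus instead of letting the much larger twitter file crowd it out.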

N-gram Prediction Model - Algorithm

An n-gram is a sequence of n words, e.g., “new york” is a sequence of two words, or bigram (n=2). An n-gram model is a probabilistic language model for predicting the next item in such a sequence. For example, “new york city” is a trigram (n=3) whose third word, “city,” can be predicted from the first two words, “new york.”
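
To make the definition concrete, here is a small Python sketch that extracts n-grams from a line of text. The helper names `tokenize` and `ngrams` are illustrative, and the lowercase, letters-only tokenization is an assumption rather than the cleaning used in the application.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("New York City churches")
print(ngrams(tokens, 2))
# [('new', 'york'), ('york', 'city'), ('city', 'churches')]
print(ngrams(tokens, 3))
# [('new', 'york', 'city'), ('york', 'city', 'churches')]
```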

If at first you don't succeed …

This n-gram prediction model uses bootstrap re-sampling, i.e., it draws successive corpora (0.1% random samples) to predict the fourth word in a quadrigram (n=4) from the first three words. For example, “churches” is the fourth word in the quadrigram “new york city churches” (Slide 4).
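
The retry idea can be sketched as a loop that draws a fresh random sample until the trigram is found. This is a hedged Python illustration, not the application's code: `predict_fourth`, `predict_with_resampling`, `max_tries` and the toy lines are hypothetical, and each sample is assumed to be drawn without replacement.

```python
import random
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())

def predict_fourth(sample, trigram):
    """Return the most frequent word following `trigram` in the sample, or None."""
    counts = Counter()
    for line in sample:
        tokens = tokenize(line)
        for i in range(len(tokens) - 3):
            if tuple(tokens[i:i + 3]) == trigram:
                counts[tokens[i + 3]] += 1
    return counts.most_common(1)[0][0] if counts else None

def predict_with_resampling(lines, trigram, sample_size, max_tries=10, seed=None):
    """Draw successive random samples until one of them yields a prediction."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        sample = rng.sample(lines, min(sample_size, len(lines)))
        word = predict_fourth(sample, trigram)
        if word is not None:
            return word
    return None  # plays the role of NA: no sample contained the trigram

# Toy lines (hypothetical data).
lines = [
    "I love new york city churches",
    "see you in new york",
    "good morning everyone",
]
print(predict_with_resampling(lines, ("new", "york", "city"), sample_size=2, seed=0))
```

A small sample can miss a rare trigram entirely, which is why a single draw may return nothing while a later draw succeeds.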

Shiny Application - Input

Without loss of generality, the trigram is input one word at a time.

Input three words, e.g., “New”, “York” and “City.”

Shiny Application - Output

If pressing the “Go!” button returns NA, press the button again.

Maximum Likelihood Estimation (MLE)

In the example, the quadrigram “new york city churches” occurs three times. Since there are six occurrences of the trigram “new york city,” the maximum likelihood estimate of the conditional probability that “churches” is the fourth word is 3/6 = 0.5.
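
The arithmetic above is the quadrigram count divided by the count of its leading trigram. A minimal Python sketch, using only the two counts quoted in the example (the function name `mle_probability` is illustrative):

```python
from collections import Counter

# Counts taken from the example: the trigram occurs six times,
# and the quadrigram ending in "churches" occurs three times.
trigram_counts = Counter({("new", "york", "city"): 6})
quadrigram_counts = Counter({("new", "york", "city", "churches"): 3})

def mle_probability(quadrigram):
    """MLE of P(fourth word | first three words) = count(quadrigram) / count(trigram)."""
    trigram = quadrigram[:3]
    if trigram_counts[trigram] == 0:
        return 0.0  # unseen trigram: no estimate is available
    return quadrigram_counts[quadrigram] / trigram_counts[trigram]

print(mle_probability(("new", "york", "city", "churches")))  # 3 / 6 = 0.5
```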