---
output: html_document
---
Douglas M Okamoto, PhD
October 9, 2016
Corpus is Latin for “body”; a corpus may contain texts in a single language or text data in multiple languages. The corpora used here consist of 0.1% stratified random samples of n = 2,000, 150, and 5,000 lines drawn from the blogs, news, and twitter text files in the HC Corpora English-language database.
| Text File | Line Count | Word Count |
|---|---|---|
| en_US blogs | 899,288 | 37,279,275 |
| en_US news | 77,259 | 34,494,539 |
| en_US twitter | 2,360,148 | 30,451,128 |
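As a rough sketch of how such a sample might be drawn, the R snippet below keeps each line of a file with probability 0.001; the en_US.* file names follow the HC Corpora naming convention and are assumptions here rather than part of the original text.

```r
# Minimal sketch: keep each line of a corpus file with probability `rate`.
set.seed(2016)                                 # reproducible samples
sample_lines <- function(path, rate = 0.001) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), 1, rate) == 1]   # retain ~0.1% of the lines
}

blogs  <- sample_lines("en_US.blogs.txt")      # file names assumed, not from the text
news   <- sample_lines("en_US.news.txt")
tweets <- sample_lines("en_US.twitter.txt")
```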
An n-gram is a sequence of n words; e.g., “new york” is a two-word sequence, or bigram (n = 2). An n-gram model is a probabilistic language model for predicting the next item in such a sequence. For example, “new york city” is a trigram (n = 3) whose third word, “city,” is predicted from the model.
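A minimal base-R sketch of counting n-grams is shown below; the cleaning regex and the example sentences are illustrative assumptions, not the app's actual preprocessing.

```r
# Count n-grams in a character vector of lines (sketch only).
ngram_counts <- function(corpus, n = 2) {
  words <- unlist(strsplit(tolower(corpus), "[^a-z']+"))   # crude tokenization
  words <- words[words != ""]
  if (length(words) < n) return(table(character(0)))
  grams <- vapply(seq_len(length(words) - n + 1),
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)                    # most frequent first
}

# e.g., the bigram "new york" appears twice in these two toy sentences
ngram_counts(c("New York City walking tours", "I love New York"), n = 2)
```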
If at first you don't succeed …
This n-gram prediction model uses bootstrap resampling, i.e., successive corpora drawn as 0.1% random samples, to predict the fourth word of a quadrigram (n = 4) from its first three words; e.g., “walking” is the next word predicted in the quadrigram “new york city walking” (Slide 4).
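The sketch below shows one way such a prediction could be computed from a quadrigram count table, for instance one built with the ngram_counts sketch above; it returns NA when the trigram is absent from the current resample, which matches the “press the button again” behaviour described below. The function name and interface are assumptions for illustration.

```r
# Predict the fourth word of a quadrigram from its first three words (sketch).
# `quad` is assumed to be a named count table of four-word strings.
predict_fourth <- function(trigram, quad) {
  hits <- quad[startsWith(names(quad), paste0(tolower(trigram), " "))]
  if (length(hits) == 0) return(NA_character_)   # trigram absent from this resample
  sub(".* ", "", names(hits)[which.max(hits)])   # last word of the most frequent match
}

# predict_fourth("new york city", ngram_counts(corpus, n = 4))
```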
Without loss of generality, the three-word trigram is entered one word at a time.
Input three words, e.g., “New”, “York” and “City.”
If pressing the “Go!” button returns NA, press it again.
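Purely for illustration, a minimal Shiny sketch of a three-word input with a “Go!” button might look as follows; the layout, input names, and server logic are assumptions, not the published app's code.

```r
library(shiny)

ui <- fluidPage(
  textInput("w1", "First word"),
  textInput("w2", "Second word"),
  textInput("w3", "Third word"),
  actionButton("go", "Go!"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    input$go      # depend on the button so each press re-runs the prediction
    # `quad` is assumed to be a precomputed quadrigram count table; the real app
    # presumably draws a fresh 0.1% resample on each press, which this sketch omits.
    isolate(predict_fourth(paste(input$w1, input$w2, input$w3), quad))
  })
}

# shinyApp(ui, server)
```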
In the example, the quadrigram “new york city walking” occurs three times. Since the trigram “new york city” occurs six times, the conditional probability that “walking” is the fourth word is 3/6, i.e., 50-50.
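Stated as a calculation, the worked example amounts to dividing the quadrigram count by the trigram count:

```r
count_quadrigram <- 3                    # occurrences of "new york city walking"
count_trigram    <- 6                    # occurrences of "new york city"
count_quadrigram / count_trigram         # P(walking | new york city) = 0.5
```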