JJ Espinoza
December 2015
The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file for details on the corpora available. The text data consist of blogs, tweets, and news stories collected via a web crawler.
The frequencies of word triples, doubles, and singles are calculated and saved in lookup tables.
The model uses these lookup tables to predict the next word.
If the phrase is not found, the model returns the word “the”.
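
As a rough illustration of how such lookup tables could be built, here is a minimal Python sketch. The function name `build_lookup_tables`, the whitespace tokenizer, and the three sample sentences are assumptions made for this example, not the code or data behind the actual model.

```python
from collections import Counter, defaultdict


def build_lookup_tables(sentences):
    """Count single words, word pairs, and word triples in raw sentences."""
    singles = Counter()              # word -> frequency
    doubles = defaultdict(Counter)   # last word -> Counter of next words
    triples = defaultdict(Counter)   # (last two words) -> Counter of next words

    for sentence in sentences:
        # Naive lowercase whitespace tokenizer (an assumption for this sketch).
        tokens = sentence.lower().split()
        singles.update(tokens)
        for first, second in zip(tokens, tokens[1:]):
            doubles[first][second] += 1
        for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
            triples[(first, second)][third] += 1

    return singles, doubles, triples


# Hypothetical toy corpus, only to show the shape of the tables.
sample = ["I love you", "I love bacon", "I love you too"]
singles, doubles, triples = build_lookup_tables(sample)
print(triples[("i", "love")].most_common())
```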
| Last Two Words | Next Word | Frequency |
|---|---|---|
| I love | you | 100 |
| I love | bacon | 75 |

| Last Word | Next Word | Frequency |
|---|---|---|
| love | it | 200 |
| love | sucks | 150 |
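
The lookup itself can be sketched in a few lines of Python using the example rows from the two tables. This is a sketch under stated assumptions: the order of the checks (two-word table first, then one-word table, then the default) is one plausible reading of the description, and the names `TRIPLE_TABLE`, `DOUBLE_TABLE`, and `predict_next_word` are hypothetical.

```python
# Example rows copied from the tables above, keyed by lowercase words.
TRIPLE_TABLE = {("i", "love"): {"you": 100, "bacon": 75}}
DOUBLE_TABLE = {"love": {"it": 200, "sucks": 150}}
DEFAULT_WORD = "the"  # returned when the phrase is not found


def predict_next_word(phrase):
    """Return the most frequent next word, falling back to shorter contexts."""
    tokens = phrase.lower().split()

    # Assumed first step: look up the last two words.
    if len(tokens) >= 2:
        candidates = TRIPLE_TABLE.get((tokens[-2], tokens[-1]))
        if candidates:
            return max(candidates, key=candidates.get)

    # Assumed second step: look up the last word only.
    if tokens:
        candidates = DOUBLE_TABLE.get(tokens[-1])
        if candidates:
            return max(candidates, key=candidates.get)

    # Phrase not found in either table.
    return DEFAULT_WORD


print(predict_next_word("I love"))       # found in the two-word table
print(predict_next_word("We love"))      # found only in the one-word table
print(predict_next_word("Hello world"))  # not found, so the default is used
```

With these example rows, "I love" resolves to "you", "We love" falls back to "it", and an unseen phrase returns “the”.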