Developing a predictive language algorithm requires a corpus of texts on which to base the language model. For this project we use the HC Corpora. As a first step in developing the language model, the next section presents an exploratory analysis of the HC Corpora.
The corpus considered here comprises three sub-collections: a number of blog entries, a set of news items, and a set of Twitter posts. A summary of the contents of these three collections is given in the table below.
| Collection | Lines | Words |
|---|---|---|
| Blog | 899288 | 37334114 |
| News | 1010242 | 34365936 |
| Twitter | 2360148 | 30359804 |
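Counts like those in the table can be produced with a single pass over each file. The following is a minimal sketch, not the procedure used for this report: the function name `summarize` is hypothetical, and splitting on whitespace is an assumed definition of a "word" that may differ from the tokenization behind the figures above.

```python
def summarize(lines):
    """Return (line_count, word_count) for an iterable of text lines.

    Assumption: a "word" is any whitespace-delimited token.
    """
    line_count = 0
    word_count = 0
    for line in lines:
        line_count += 1
        word_count += len(line.split())
    return line_count, word_count


if __name__ == "__main__":
    # Small in-memory sample; a real run would pass an open file handle instead.
    sample = ["A first blog entry.", "Second entry, a bit longer than the first."]
    print(summarize(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a large collection never has to be held in memory at once.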
The most frequent terms in each of the three collections are given below, along with the number of times each term appears in its collection.
| Blog Terms | Blog Counts | News Terms | News Counts | Twitter Terms | Twitter Counts |
|---|---|---|---|---|---|
| the | 1855769 | the | 1970687 | the | 934163 |
| and | 1086109 | and | 884180 | you | 543619 |
| that | 459499 | for | 352942 | and | 433678 |
| for | 362866 | that | 346539 | for | 384533 |
| you | 296852 | with | 254686 | that | 232893 |
| with | 286176 | said | 250342 | with | 172994 |
| was | 278002 | was | 228905 | your | 170769 |
| this | 257977 | his | 157567 | have | 168048 |
| have | 218541 | from | 152087 | this | 162725 |
| but | 203446 | but | 150908 | are | 158111 |
From the above it is clear that the most frequent words are not unique to any one collection. Five terms ("the", "and", "that", "for", and "with") appear in the top ten of all three collections, and most of the remaining terms are shared by two of them.
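A term-frequency table of this kind can be sketched with a counter over lower-cased tokens. This is an illustration under stated assumptions rather than the method used for the report: `top_terms` and its parameters are hypothetical names, and the regular expression is an assumed tokenizer. The `min_length` filter is included only because the lists above contain no one- or two-letter terms, which suggests, but does not confirm, that short tokens were dropped.

```python
import re
from collections import Counter


def top_terms(lines, k=10, min_length=1):
    """Return the k most frequent terms as (term, count) pairs.

    Assumptions: tokens are runs of letters or apostrophes, compared
    case-insensitively; tokens shorter than min_length are ignored.
    """
    counts = Counter()
    for line in lines:
        for token in re.findall(r"[a-z']+", line.lower()):
            if len(token) >= min_length:
                counts[token] += 1
    return counts.most_common(k)


if __name__ == "__main__":
    sample = ["The cat and the dog", "the end of it"]
    for term, count in top_terms(sample, k=3, min_length=3):
        print(term, count)
```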
To examine the distribution of word frequencies, we consider a histogram for each of the three collections. For legibility, the vertical axis of each histogram has been scaled using a log(x)+1 representation, where x is the number of tokens that fall in a given count bin.
In these histograms we see that the most frequent words account for a disproportionate share of each collection's total word count, while infrequent words dominate the list of all unique tokens.
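The binning and log(x)+1 scaling can be sketched as follows. The function name `scaled_histogram` and the `bin_width` parameter are hypothetical, and the natural logarithm is an assumption, since the text does not state the base. Empty bins are left out, which avoids taking the logarithm of zero.

```python
import math
from collections import Counter


def scaled_histogram(term_counts, bin_width):
    """Bin terms by occurrence count and scale each bin height as log(x) + 1.

    term_counts maps each term to its number of occurrences.
    Returns {(low, high): scaled_height}, where x is the number of
    distinct terms whose count lies in the inclusive range low..high.
    """
    bins = Counter()
    for count in term_counts.values():
        bins[(count - 1) // bin_width] += 1  # bin 0 holds counts 1..bin_width
    return {
        (b * bin_width + 1, (b + 1) * bin_width): math.log(x) + 1
        for b, x in sorted(bins.items())
    }


if __name__ == "__main__":
    counts = {"the": 5, "cat": 1, "dog": 1, "sat": 2}
    for (low, high), height in scaled_histogram(counts, bin_width=2).items():
        print(f"{low}-{high}: {height:.3f}")
```

With this scaling a bin that holds a single term has height 1 rather than 0, so sparsely populated bins in the long tail remain visible next to the heavily populated low-count bins.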
Moving forward, the corpus considered here will be used to construct a language model built from unigrams, bigrams, and possibly trigrams. This model will give the frequency of occurrence of any one-, two-, or three-word phrase. For ease of calculation and storage, only phrases that occur more often than some to-be-determined threshold will be kept. From this language model it will be possible to construct a predictive model that returns the most likely next word given a partial sentence or phrase as input.
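The plan above can be sketched in a few functions. This is only an illustration of the idea. All names (`count_ngrams`, `prune`, `predict_next`) are hypothetical, whitespace tokenization is assumed, and the prediction step uses a naive back-off from the longest available context to shorter ones, a choice the plan does not specify.

```python
from collections import Counter


def count_ngrams(lines, n):
    """Count every n-word phrase (as a tuple of tokens) in the given lines."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def prune(counts, threshold):
    """Keep only phrases that occur more than `threshold` times."""
    return {gram: c for gram, c in counts.items() if c > threshold}


def predict_next(phrase, models):
    """Return the most likely next word, or None if nothing matches.

    models maps n to pruned n-gram counts. The longest context is tried
    first; shorter contexts are used only when no candidate is found.
    """
    tokens = phrase.lower().split()
    for n in sorted(models, reverse=True):
        context = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
        if len(context) != n - 1:
            continue  # phrase is too short to supply this much context
        candidates = {g[-1]: c for g, c in models[n].items() if g[:-1] == context}
        if candidates:
            return max(candidates, key=candidates.get)
    return None


if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "the cat ate the fish", "the cat sat down"]
    models = {n: prune(count_ngrams(corpus, n), threshold=1) for n in (1, 2, 3)}
    print(predict_next("I saw the cat", models))
```

Scanning every stored n-gram for each prediction is acceptable for a toy corpus. A model built from the full collections would more likely index phrases by their context so that lookups do not grow with the size of the table.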