Introduction

Developing a predictive language algorithm requires a corpus of texts from which to construct a language model. This project uses the HC Corpora. As a first step in developing the language model, the next section presents an exploratory analysis of the HC Corpora.

Exploratory Analysis

The corpus considered here comprises three sub-collections: a number of blog entries, a set of news items, and a set of Twitter posts. A summary of the content of these three collections is given in the table below.

Collection      Lines       Words
Blog           899288    37334114
News          1010242    34365936
Twitter       2360148    30359804
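Counts like those in the table can be reproduced with a single pass over each collection. The following is a minimal sketch, assuming that a "word" is any whitespace-separated token; the source does not state how words were delimited, and the function name and sample data are hypothetical.

```python
from typing import Iterable, Tuple


def count_lines_and_words(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of lines, number of whitespace-separated words)."""
    n_lines = 0
    n_words = 0
    for line in lines:
        n_lines += 1
        # Assumption: words are delimited by whitespace only.
        n_words += len(line.split())
    return n_lines, n_words


# Hypothetical three-line sample standing in for one collection.
sample = ["the quick brown fox", "jumps over", "the lazy dog"]
print(count_lines_and_words(sample))
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, so a large collection never needs to be held in memory at once.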

The ten most frequent terms in each of the three collections are given below, along with the number of times each term appears in its collection.

Blog Terms   Blog Counts   News Terms   News Counts   Twitter Terms   Twitter Counts
the              1855769   the              1970687   the                     934163
and              1086109   and               884180   you                     543619
that              459499   for               352942   and                     433678
for               362866   that              346539   for                     384533
you               296852   with              254686   that                    232893
with              286176   said              250342   with                    172994
was               278002   was               228905   your                    170769
this              257977   his               157567   have                    168048
have              218541   from              152087   this                    162725
but               203446   but               150908   are                     158111
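A ranking of this kind can be sketched with a frequency counter. The table contains no terms shorter than three letters, which suggests that a minimum word length was applied during tokenization; that is an inference, not something the text states. The token pattern, the `min_len` parameter, and the sample lines below are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed token pattern: runs of lowercase letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def top_terms(lines: Iterable[str], n: int = 10, min_len: int = 3) -> List[Tuple[str, int]]:
    """Return the n most frequent terms with at least min_len characters."""
    counts: Counter = Counter()
    for line in lines:
        for token in WORD_RE.findall(line.lower()):
            # Assumption: short tokens are dropped, as the table seems to imply.
            if len(token) >= min_len:
                counts[token] += 1
    return counts.most_common(n)


# Hypothetical sample lines.
sample_lines = ["The cat and the dog", "and the bird"]
print(top_terms(sample_lines, n=2))
```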

From the above it is clear that the most frequent words are largely not unique to a given collection. Five terms (the, and, that, for, with) appear in the top ten of all three collections.

To examine the distribution of word frequencies in each collection, we consider a histogram for each of the three collections. For legibility, the vertical axis of each histogram has been scaled using a log(x)+1 representation, where x is the number of tokens that fall in a given count bin.

[Figures: word-frequency histograms, one for each of the three collections]

In these histograms we see that the most frequent words account for a disproportionate share of the total word count of each collection, while infrequent words dominate the list of all unique tokens.
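The log(x)+1 scaling of the histogram heights can be illustrated without any plotting library. This is a sketch under stated assumptions: the logarithm is taken to be the natural log, bins have a fixed hypothetical width, and empty bins are left at height 0 because log(0) is undefined. None of these choices is specified in the text.

```python
import math
from typing import Dict, List


def scaled_histogram(term_counts: Dict[str, int], bin_width: int) -> List[float]:
    """Bin terms by their count, then scale each bin height as log(x) + 1."""
    if not term_counts:
        return []
    n_bins = max(term_counts.values()) // bin_width + 1
    tokens_per_bin = [0] * n_bins
    for count in term_counts.values():
        tokens_per_bin[count // bin_width] += 1
    # Assumption: empty bins stay at 0, since log(0) is undefined.
    return [math.log(x) + 1 if x > 0 else 0.0 for x in tokens_per_bin]


# Hypothetical term counts: a few rare terms and two frequent ones.
example = {"the": 9, "and": 5, "cat": 1, "dog": 1, "bird": 2}
print(scaled_histogram(example, bin_width=5))
```

With this scaling, a bin holding a single token still has height 1, so sparsely populated bins at the high-frequency end remain visible next to the heavily populated low-frequency bins.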

Future Work

Moving forward, the corpus considered here will be used to construct a language model built from unigrams, bigrams, and possibly trigrams. This model will give the frequency of occurrence for any one-, two-, or three-word phrase. For ease of calculation and storage, only the phrases that occur more often than some to-be-determined threshold will be kept. From this language model it will be possible to construct a predictive model that predicts the most likely next word from an input partial sentence or phrase.
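The planned model can be sketched as a table of n-gram counts with a pruning threshold, plus a lookup that picks the most frequent continuation. The back-off from trigram to bigram to unigram context shown here is one simple strategy chosen for illustration; the text does not say how predictions will be made. The function names, the `min_count` parameter, and the sample sentences are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

NgramCounts = Dict[Tuple[str, ...], int]


def build_ngram_counts(sentences: Iterable[List[str]], max_n: int = 3,
                       min_count: int = 1) -> NgramCounts:
    """Count all n-grams up to max_n and keep those seen at least min_count times."""
    counts: Counter = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def predict_next(counts: NgramCounts, context: List[str], max_n: int = 3) -> Optional[str]:
    """Return the most frequent next word, backing off to shorter contexts."""
    for k in range(min(max_n - 1, len(context)), -1, -1):
        prefix = tuple(context[len(context) - k:])
        # Linear scan for clarity; a real model would index n-grams by prefix.
        candidates = {gram[-1]: c for gram, c in counts.items()
                      if len(gram) == k + 1 and gram[:k] == prefix}
        if candidates:
            return max(candidates, key=candidates.get)
    return None


# Hypothetical tokenized sentences.
sentences = [["i", "like", "tea"], ["i", "like", "coffee"], ["i", "like", "tea"]]
model = build_ngram_counts(sentences)
print(predict_next(model, ["i", "like"]))
```

Raising `min_count` shrinks the table at the cost of coverage, which is the trade-off the to-be-determined threshold will need to settle.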