We perform an exploratory analysis of the en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt corpora from the Capstone Dataset.

Summary statistics

We list some basic statistics for the three corpora: word counts, line counts, and the number of unique words.

##         wordCounts lineCounts uniqueWordCounts
## blogs     31707675     899288           210255
## news      28522322    1010242           188288
## twitter   24704578    2360148           212535
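
These statistics can be computed along the following lines. This is only a sketch: the file path, the use of the stringi package, and the tokenization rules are assumptions, not necessarily what was used for the table above.

library(stringi)

blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

# line count is simply the number of lines read
lineCount <- length(blogs)

# word count: total number of word tokens over all lines
wordCount <- sum(stri_count_words(blogs))

# unique words: lower-case the text, extract word tokens, count distinct tokens
words <- unlist(stri_extract_all_words(stri_trans_tolower(blogs)))
words <- words[!is.na(words)]
uniqueWordCount <- length(unique(words))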

Interesting findings

Word frequencies

Here are the most frequent words. For each word we give both its occurrence count and the fraction of the total word count that it accounts for. Note that raw counts grow with corpus size, but the fraction for a given word should converge as the corpus grows.

head(blogsFreqTable)
##   word   count   fraction
## 1  the 1839731 0.05802163
## 2  and 1069692 0.03373606
## 3   to 1057192 0.03334183
## 4    a  889301 0.02804687
## 5   of  870964 0.02746855
## 6    i  751141 0.02368956
head(newsFreqTable)
##   word   count   fraction
## 1  the 1939362 0.06799453
## 2   to  896061 0.03141613
## 3  and  873231 0.03061571
## 4    a  862459 0.03023804
## 5   of  769229 0.02696937
## 6   in  664848 0.02330974
head(twitterFreqTable)
##   word  count   fraction
## 1  the 925617 0.03746743
## 2   to 779326 0.03154581
## 3    i 702017 0.02841647
## 4    a 602349 0.02438208
## 5  you 475664 0.01925408
## 6  and 429049 0.01736719
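
A frequency table such as blogsFreqTable can be built along these lines, reusing the words vector from the sketch above (the exact tokenization used for the tables in this report may differ):

# tabulate word occurrences and sort in decreasing order of frequency
counts <- sort(table(words), decreasing = TRUE)

blogsFreqTable <- data.frame(
  word     = names(counts),
  count    = as.integer(counts),
  fraction = as.integer(counts) / sum(counts),
  stringsAsFactors = FALSE
)
head(blogsFreqTable)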

In the rest of this document, to save space, we examine only the blogs corpus. Results for the news and twitter corpora are similar.

We give a histogram of the log of the fractional occurrence of each word. Most words occur very rarely, so a histogram of the raw fractions would be highly skewed to the right; taking the log spreads the distribution out. Since each fraction is less than 1.0, its log is negative.
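
A minimal sketch of how such a histogram could be drawn, assuming the blogsFreqTable built above:

# histogram of the log fractional frequency of each word; all values are negative
hist(log(blogsFreqTable$fraction),
     main = "Log fractional word frequencies (blogs)",
     xlab = "log(fraction of total word count)")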

As you can see, most words occur only rarely, while a few words occur very frequently.

Frequencies of 2-grams and 3-grams

We find the most frequent bigrams and trigrams. Note that

[…] an n-gram is a contiguous sequence of n items from a given sequence of text or speech. […] An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”.

—from Wikipedia

Because of the computation time, we tokenize only a sample of the corpus, converting everything to lower case.

Here are the most frequent bigrams:

##  [1] "of the"   "in the"   "to the"   "on the"   "to be"    "for the" 
##  [7] "and the"  "and i"    "i was"    "it is"    "in a"     "at the"  
## [13] "with the" "i have"   "it was"   "is a"

Here are the most frequent trigrams:

##  [1] "one of the"    "a lot of"      "be able to"    "it was a"     
##  [5] "to be a"       "i had to"      "it is a"       "out of the"   
##  [9] "the end of"    "some of the"   "the fact that" "as well as"   
## [13] "i am not"      "this is a"     "i have a"      "a couple of"

Word coverage

In the blogs corpus, you need only 112 unique words to cover 50% of all word instances, and only 8350 words to cover 90%.

This plot shows in general how many unique words you need to cover a given fraction of all word instances.
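
The coverage numbers and the plot can be derived from the frequency table roughly as follows (assuming blogsFreqTable is sorted by decreasing count, as above):

# cumulative fraction of all word instances covered by the top k unique words
coverage <- cumsum(blogsFreqTable$fraction)

# unique words needed to reach 50% and 90% coverage
which(coverage >= 0.5)[1]
which(coverage >= 0.9)[1]

# coverage curve: fraction of word instances covered vs. number of unique words
plot(coverage, type = "l",
     xlab = "number of unique words (most frequent first)",
     ylab = "fraction of word instances covered")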

As you can see, the number of words needed increases rapidly if you want to cover more than 90% of the words in a corpus.

Prediction algorithm and Shiny app

I think it is significant that only a small fraction of the unique words is needed to cover 90% of a corpus. I expect that, in general, a similarly small fraction of n-grams is needed to cover 90% of all n-gram instances in a given corpus for a given n. I will need to test this assumption, though.

This means that the prediction model can be kept small while still covering 90% of likely cases.
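
The n-gram version of this assumption could be checked the same way as for single words, for example with the trigram counts from the earlier sketch:

# fraction of all trigram instances accounted for by each distinct trigram
triCounts   <- sort(table(trigrams), decreasing = TRUE)
triCoverage <- cumsum(as.vector(triCounts) / length(trigrams))

# share of distinct trigrams needed to cover 90% of all trigram instances
which(triCoverage >= 0.9)[1] / length(triCoverage)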