We perform an exploratory analysis of the en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt corpora from the Capstone Dataset.
We begin with some basic statistics for each corpus: word counts, line counts, and the number of unique words.
```
##         wordCounts lineCounts uniqueWordCounts
## blogs     31707675     899288           210255
## news      28522322    1010242           188288
## twitter   24704578    2360148           212535
```
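For reference, here is a minimal sketch of how these statistics could be computed in base R. The file paths and the simple whitespace tokenization are assumptions; the actual analysis may have used a different tokenizer, which would change the exact counts.

```r
# Assumed helper: basic corpus statistics using whitespace tokenization.
corpusStats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- unlist(strsplit(tolower(lines), "\\s+"))
  words <- words[words != ""]
  c(wordCounts       = length(words),
    lineCounts       = length(lines),
    uniqueWordCounts = length(unique(words)))
}

files <- c(blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt",
           twitter = "en_US.twitter.txt")
t(sapply(files, corpusStats))
```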
Here are the most frequent words in each corpus. For each word we list both its occurrence count and the fraction of the total word count that it accounts for. Note that raw counts grow with corpus size, but the fraction of occurrences should converge for a given word.
```r
head(blogsFreqTable)
##   word   count   fraction
## 1  the 1839731 0.05802163
## 2  and 1069692 0.03373606
## 3   to 1057192 0.03334183
## 4    a  889301 0.02804687
## 5   of  870964 0.02746855
## 6    i  751141 0.02368956
```

```r
head(newsFreqTable)
##   word   count   fraction
## 1  the 1939362 0.06799453
## 2   to  896061 0.03141613
## 3  and  873231 0.03061571
## 4    a  862459 0.03023804
## 5   of  769229 0.02696937
## 6   in  664848 0.02330974
```

```r
head(twitterFreqTable)
##   word  count   fraction
## 1  the 925617 0.03746743
## 2   to 779326 0.03154581
## 3    i 702017 0.02841647
## 4    a 602349 0.02438208
## 5  you 475664 0.01925408
## 6  and 429049 0.01736719
```
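A frequency table such as blogsFreqTable could be built along the following lines. The helper name and the whitespace tokenization are assumptions, so the exact counts and fractions may differ slightly from those shown above.

```r
# Assumed helper: word frequency table with counts and fractions of the
# total word count, sorted from most to least frequent.
makeFreqTable <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- unlist(strsplit(tolower(lines), "\\s+"))
  words <- words[words != ""]
  counts <- sort(table(words), decreasing = TRUE)
  data.frame(word      = names(counts),
             count     = as.integer(counts),
             fraction  = as.numeric(counts) / length(words),
             row.names = NULL)
}

blogsFreqTable <- makeFreqTable("en_US.blogs.txt")
head(blogsFreqTable)
```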
In the rest of this document, to save space, we only examine the blogs corpus. Results for the news and twitter corpora would be similar.
We give a histogram of the log of the fractional occurrence of each word. Most words have a very low frequency of occurrence and a histogram of the fractional frequencies would be highly skewed to the right. Hence, we take the log. Note that since the fractional frequency is always less than 1.0, the log is negative.
As you can see, most words occur infrequently, but there are a few words that occur frequently.
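As a sketch, the histogram described above could be produced from the frequency table like this (assuming the blogsFreqTable layout shown earlier; the original plot may have used different binning or a different plotting package):

```r
# Histogram of the (natural) log of each word's fractional occurrence.
# Since every fraction is below 1.0, all of the log values are negative.
hist(log(blogsFreqTable$fraction),
     breaks = 50,
     main   = "Log fractional occurrence of words (blogs)",
     xlab   = "log(fraction)")
```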
We find the most frequent bigrams and trigrams. Note that
> […] an n-gram is a contiguous sequence of n items from a given sequence of text or speech. […] An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”.
>
> —from Wikipedia
To keep the computation time manageable, we use only a sample of the corpus. We also convert everything to lower case when tokenizing.
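A sketch of the sampling and n-gram extraction step is shown below. The 5% sample size and the embed()-based n-gram construction are assumptions; a tokenization package such as RWeka or quanteda could be used instead.

```r
# Sample a fraction of the corpus lines to keep computation manageable.
set.seed(1)
lines       <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
sampleLines <- sample(lines, size = round(0.05 * length(lines)))

# Tokenize each line to lower-case words, then build n-grams with embed().
ngrams <- function(line, n) {
  words <- unlist(strsplit(tolower(line), "\\s+"))
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  # embed() returns the words in reverse order, so flip the columns back.
  apply(embed(words, n)[, n:1, drop = FALSE], 1, paste, collapse = " ")
}

bigrams  <- unlist(lapply(sampleLines, ngrams, n = 2))
trigrams <- unlist(lapply(sampleLines, ngrams, n = 3))

# The 16 most frequent bigrams and trigrams in the sample.
head(names(sort(table(bigrams),  decreasing = TRUE)), 16)
head(names(sort(table(trigrams), decreasing = TRUE)), 16)
```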
Here are the most frequent bigrams:
## [1] "of the" "in the" "to the" "on the" "to be" "for the"
## [7] "and the" "and i" "i was" "it is" "in a" "at the"
## [13] "with the" "i have" "it was" "is a"
Here are the most frequent trigrams:
## [1] "one of the" "a lot of" "be able to" "it was a"
## [5] "to be a" "i had to" "it is a" "out of the"
## [9] "the end of" "some of the" "the fact that" "as well as"
## [13] "i am not" "this is a" "i have a" "a couple of"
You need only 112 unique words to cover 50% of all word instances in the blogs corpus, and only 8350 unique words to cover 90%.
This plot shows, in general, how many unique words are needed to cover a given fraction of all word instances.
As you can see, the number of words needed rises sharply once you try to cover more than about 90% of the word instances in the corpus.
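Here is a sketch of how the coverage numbers and the plot could be derived from the frequency table (assuming the blogsFreqTable built earlier; the exact figures depend on the tokenization used, so they may differ slightly from those quoted above):

```r
# Cumulative coverage: fraction of all word instances accounted for by the
# k most frequent unique words (blogsFreqTable is sorted by decreasing count).
coverage <- cumsum(blogsFreqTable$fraction)

# Smallest number of unique words reaching a given coverage level.
wordsNeeded <- function(p) which(coverage >= p)[1]
wordsNeeded(0.5)
wordsNeeded(0.9)

plot(coverage, type = "l",
     xlab = "Number of most frequent unique words",
     ylab = "Fraction of all word instances covered")
```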
I think it is significant that only a small fraction of the unique words is needed to cover 90% of the corpus. I suspect that, in general, a similarly small fraction of n-grams covers 90% of all n-gram instances in a given corpus for a given n, though I still need to test that assumption.
If that holds, our model can be of limited size and still cover 90% of likely cases.