Our dataset contains 899,288 blog entries, 1,010,242 news entries, and 2,360,148 Twitter entries, for a total of 4,269,678 text entries. From each class of data (Blogs, News, and Twitter), 50,000 random entries were read into R, for a total of 150,000 sampled text entries.
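A minimal sketch of this sampling step is shown below; the file names and the fixed seed are assumptions for illustration and are not taken from the original analysis.

```r
# Read each raw file and keep a random sample of 50,000 lines.
# File names and the seed are assumptions, not from the original code.
set.seed(1234)

sample_lines <- function(path, n = 50000) {
  all_lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(all_lines, n)
}

blogs_sample   <- sample_lines("en_US.blogs.txt")
news_sample    <- sample_lines("en_US.news.txt")
twitter_sample <- sample_lines("en_US.twitter.txt")
```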
In order to explore the data, we remove punctuation and numbers and convert all text to lower case before computing word frequencies.
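One way to apply these cleaning steps is sketched below; using base R string functions here is an assumption, as the original report may have relied on a package such as tm instead.

```r
# Lower-case the text and strip punctuation and digits.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("[[:digit:]]", "", x)
  x
}

blogs_clean   <- clean_text(blogs_sample)
news_clean    <- clean_text(news_sample)
twitter_clean <- clean_text(twitter_sample)
```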
We explore word frequency for each text class on its own (Blogs, News, Twitter) and for all of them combined. To do so, the stream of text is broken into words in a process called tokenization. We explored word frequencies by tokenizing the texts into groups of one, two, and three words (unigrams, bigrams, and trigrams).
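The sketch below shows how this tokenization can be done with the tidytext package (shown for the news sample; the same call is repeated per class). The choice of tidytext is an assumption; the original report may have used another tokenizer such as RWeka or quanteda.

```r
library(dplyr)
library(tidytext)

news_df <- tibble(text = news_clean)

# Tokenize into unigrams, bigrams, and trigrams.
unigrams <- news_df %>% unnest_tokens(ngram, text, token = "ngrams", n = 1)
bigrams  <- news_df %>% unnest_tokens(ngram, text, token = "ngrams", n = 2)
trigrams <- news_df %>% unnest_tokens(ngram, text, token = "ngrams", n = 3)
```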
In the following graphs we see the 10 most common unigrams, bigrams, and trigrams for the 50,000 sampled news entries.
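A plot like these can be produced as sketched below; the specific plotting choices (ggplot2, horizontal bar chart) are assumptions for illustration.

```r
library(ggplot2)

# Count unigrams and keep the 10 most frequent.
top10_unigrams <- unigrams %>%
  count(ngram, sort = TRUE) %>%
  slice_head(n = 10)

ggplot(top10_unigrams, aes(x = reorder(ngram, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency",
       title = "10 most common unigrams (news)")
```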
In the next three graphs we see the 10 most common unigrams, bigrams, and trigrams of the blog entries in our sample.
In these three graphs we see the 10 most common unigrams, bigrams, and trigrams for the Twitter entries.
Lastly, we see the 10 most common unigrams, bigrams, and trigrams of the three classes combined.
Finally, the table below shows the number of words, the number of distinct terms, the number of unique terms (terms that appear only once), and the proportion of unique terms in each sampled dataset; a sketch of how these figures can be computed follows the table.
Class | Number of words | Number of terms | Number of unique terms | Proportion of unique terms |
---|---|---|---|---|
All | 4,330,285 | 129,474 | 69,355 | 0.54 |
Blogs | 2,040,146 | 75,937 | 38,615 | 0.51 |
News | 1,664,802 | 73,771 | 36,489 | 0.49 |
Twitter | 625,337 | 40,965 | 23,973 | 0.59 |
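The summary statistics above can be reproduced from the unigram counts as sketched below. Treating "words" as total tokens, "terms" as distinct tokens, and "unique terms" as tokens that occur exactly once is our reading of the description above, not confirmed code from the original analysis.

```r
# Summarise one class (e.g. the news unigrams built earlier).
summarise_class <- function(unigram_df) {
  counts <- unigram_df %>% count(ngram)
  tibble(
    n_words        = sum(counts$n),           # total tokens
    n_terms        = nrow(counts),            # distinct terms
    n_unique_terms = sum(counts$n == 1),      # terms seen only once
    p_unique_terms = round(sum(counts$n == 1) / nrow(counts), 2)
  )
}

summarise_class(unigrams)
```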