date: 2015-12-29

The Dataset

Our dataset consists of 899,288 blog entries, 1,010,242 news entries, and 2,360,148 Twitter entries, totaling 4,269,678 text entries. From each class of data (blogs, news, and Twitter), 50,000 random entries were read into R, for a total of 150,000 sampled text entries.
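
A minimal sketch of this sampling step in R is shown below. The file names are assumptions (the report does not list the raw file paths), and the seed is arbitrary:

```r
# Sampling sketch: file names are assumptions, not taken from the report
set.seed(1234)  # arbitrary seed for reproducibility

sample_lines <- function(path, n = 50000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, n)
}

blogs   <- sample_lines("en_US.blogs.txt")
news    <- sample_lines("en_US.news.txt")
twitter <- sample_lines("en_US.twitter.txt")
```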

Cleaning the Dataset

To prepare the data for exploration, we remove punctuation, remove numbers, and convert all text to lower case.
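
One plausible way to perform these cleaning steps is with the tm package; the report does not name its tooling, so treat this as a sketch:

```r
library(tm)

# Build a corpus from the sampled entries and apply the cleaning steps
corpus <- VCorpus(VectorSource(c(blogs, news, twitter)))
corpus <- tm_map(corpus, content_transformer(tolower))  # lower case
corpus <- tm_map(corpus, removePunctuation)             # strip punctuation
corpus <- tm_map(corpus, removeNumbers)                 # strip numbers
corpus <- tm_map(corpus, stripWhitespace)               # collapse extra spaces
```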

Word Frequency

We explore word frequency for each text class separately (blogs, news, Twitter) and for all of them combined. To do this, the stream of text was broken into words, a process called tokenization. We tokenized the texts into groups of one, two, and three words (unigrams, bigrams, and trigrams) and explored the frequency of each.
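
A sketch of this tokenization using RWeka's NGramTokenizer together with tm, again an assumption about tooling rather than the report's documented method:

```r
library(tm)
library(RWeka)

# n-gram tokenizers passed to the term-document matrix constructor
bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

unigrams <- TermDocumentMatrix(corpus)  # default: single-word tokens
bigrams  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))
trigrams <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tok))
```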

Frequencies of words in News

In the following graphs we see the 10 most common unigrams, bigrams, and trigrams for the 50,000 sampled news entries.
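
As a hedged example, a top-10 bar chart like these could be produced with ggplot2; `news_unigrams` is a hypothetical term-document matrix built from the news sample alone:

```r
library(ggplot2)

# Extract the n most frequent terms from a term-document matrix
top_terms <- function(tdm, n = 10) {
  freq <- sort(slam::row_sums(tdm), decreasing = TRUE)[1:n]
  data.frame(term = names(freq), freq = freq)
}

df <- top_terms(news_unigrams)  # hypothetical object name
ggplot(df, aes(x = reorder(term, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Term", y = "Frequency")
```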

Frequencies of words in Blogs

In the next three graphs we see the 10 most common unigrams, bigrams, and trigrams of the blog entries in our dataset.

Frequencies of words in Twitter

In these three graphs we see the 10 most common unigrams, bigrams, and trigrams for Twitter.

All texts

Finally, we see the 10 most common unigrams, bigrams, and trigrams for the three datasets combined.

Data Summary

To summarize, the table below shows, for each sampled dataset, the total number of words, the number of distinct terms, the number of unique terms (terms that appear only once), and the proportion of unique terms.

| Class   | Words     | Distinct terms | Unique terms | Proportion unique |
|---------|-----------|----------------|--------------|-------------------|
| All     | 4,330,285 | 129,474        | 69,355       | 0.54              |
| Blogs   | 2,040,146 | 75,937         | 38,615       | 0.51              |
| News    | 1,664,802 | 73,771         | 36,489       | 0.49              |
| Twitter | 625,337   | 40,965         | 23,973       | 0.59              |
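
As a sketch, these statistics can be derived from a unigram term-document matrix like the one built earlier; the function name is illustrative:

```r
# Summary statistics from a term-document matrix (illustrative names)
summarise_tdm <- function(tdm) {
  freq <- slam::row_sums(tdm)       # total occurrences of each distinct term
  data.frame(
    words        = sum(freq),       # total word count (all tokens)
    terms        = length(freq),    # number of distinct terms
    unique_terms = sum(freq == 1),  # terms that appear exactly once
    p_unique     = round(sum(freq == 1) / length(freq), 2)
  )
}

summarise_tdm(unigrams)  # repeat per class for the Blogs/News/Twitter rows
```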