Summary

This report provides an exploratory analysis of a large sample of blogs, news and tweets in English taken from HC Corpora. It begins by exploring the size of the data sets (or corpora) and then proceeds to analyze the distribution of terms (words, essentially), bigrams and trigrams. Finally, next steps for modelling are laid out.

Corpora Size

The table below summarizes the size of the three data sets in terms of number of elements, terms (or words) and unique terms (after removing numbers and punctuation). Given the sheer size of the data sets (blogs with 899,288 elements, news with 1,010,242 and twitter with 2,360,148), only a portion of each was explored here: 10,000 elements from each. While this covers only a small fraction of the raw data (roughly 0.4% to 1.1% of each data set), the sample should suffice for analytical purposes.

Blogs contain the most terms (317,243) and unique terms (31,048). Tweets contain far fewer terms than blogs or news, which is not surprising given the restrictions on their length.

Data set   Elements   Total_Terms   Unique_Terms
blogs        10,000       317,243         31,048
news         10,000       273,688         30,565
twitter      10,000        95,010         15,155
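
The report does not show the code behind this table, so the following is only a minimal sketch of how such a summary could be computed. The function names `tokenize` and `summarize`, the random seed, and the exact cleaning rules (lowercasing, dropping apostrophes, treating every other non-letter as a separator) are assumptions made for illustration, not the author's method.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Split one element into terms after removing numbers and punctuation.

    Assumption: terms are lowercased ASCII words; apostrophes are dropped
    so that "don't" becomes "dont" rather than two separate tokens.
    """
    lowered = text.lower().replace("'", "").replace("\u2019", "")
    cleaned = re.sub(r"[^a-z\s]", " ", lowered)
    return cleaned.split()


def summarize(elements, sample_size=10_000, seed=1):
    """Sample elements and count total and unique terms in the sample."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(elements, min(sample_size, len(elements)))
    counts = Counter(term for element in sample for term in tokenize(element))
    return {
        "elements": len(sample),
        "total_terms": sum(counts.values()),
        "unique_terms": len(counts),
    }
```

Calling `summarize` once per data set (with each data set loaded as a list of strings, one string per element) would yield one row of the table.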

Term Frequencies

The distribution of term frequencies (after removing standard ‘stopwords’) is explored in the plots below. The predominant theme is that term frequencies are highly skewed, with the vast majority of terms appearing relatively few times and a few terms appearing relatively frequently. The news data set is especially skewed, which may be due to the relatively standardized language and format of news articles compared with blogs and tweets.

Anywhere from roughly 1% to 4% of unique terms are required to cover at least 50% of term instances among the data sets. For 90% coverage, roughly 40% to 56% of unique terms are needed.
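
One way to compute such coverage figures is to sort terms by frequency and accumulate counts until the target share of term instances is reached. The sketch below assumes term counts are already held in a `collections.Counter`; the function name `unique_terms_for_coverage` is hypothetical and the report's own calculation may differ.

```python
from collections import Counter


def unique_terms_for_coverage(counts, target):
    """Return how many of the most frequent terms are needed so that
    their combined count reaches `target` (e.g. 0.5) of all instances."""
    total = sum(counts.values())
    running = 0
    # most_common() yields (term, count) pairs from most to least frequent
    for needed, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running >= target * total:
            return needed
    return len(counts)


# Toy example: 10 term instances spread over 4 unique terms
toy = Counter({"a": 5, "b": 3, "c": 1, "d": 1})
share_for_half = unique_terms_for_coverage(toy, 0.5) / len(toy)
```

Dividing the result by the number of unique terms gives the percentage quoted above. In the toy example a single term out of four already covers half of all instances.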

The most frequent terms below provide some extra colour to the previous distribution plots. As might be expected, the word ‘said’ is by far the most frequent term among news documents. The top twitter terms tend to reflect views or opinions about something, such as ‘like’, ‘love’ and ‘good’.

Top 10 Most Frequent Terms

Rank   blogs term   Count   news term   Count   twitter term   Count
1      one          1,327   said        2,484   just             656
2      will         1,233   will        1,084   like             509
3      just         1,158   one           835   get              473
4      can          1,137   new           675   love             435
5      like         1,091   year          598   good             419
6      time         1,016   also          586   will             402
7      get            768   can           576   thanks           394
8      know           717   two           576   can              385
9      now            676   just          552   day              374
10     people         627   last          524   one              359

Bigram and Trigram Frequencies

The plots below illustrate how bigrams and trigrams are distributed in the data sets. The main difference between these distributions and that of single terms is that the tail is generally fatter, especially with trigrams. This makes sense as it’s unlikely to get very high frequencies of a sequence of words, especially as the sequence gets longer.

The most frequent bigrams and trigrams in the news data set are what one would expect (mainly cities, times and politicians), as are those in tweets, where greetings dominate. Other interesting findings are the preponderance of New York-related terms and the trigram ‘u u u’ taking first place in the news data set.

Top 10 Most Frequent Bigrams

Rank   blogs bigram   Count   news bigram   Count   twitter bigram    Count
1      can see           62   last year       126   right now            69
2      first time        59   new york        108   last night           57
3      new york          59   high school      89   happy birthday       39
4      make sure         58   st louis         79   just got             37
5      right now         50   years ago        69   good morning         36
6      even though       49   new jersey       66   looking forward      34
7      last year         49   last week        60   can get              32
8      feel like         46   los angeles      46   follow back          31
9      years ago         46   first time       45   thanks follow        30
10     every day         41   health care      43   feel like            25

Top 10 Most Frequent Trigrams

Rank   blogs trigram          Count   news trigram             Count   twitter trigram        Count
1      new york times             9   u u u                       17   happy mothers day         17
2      new york city              8   president barack obama      13   let us know                9
3      amazon services llc        6   first time since            12   cinco de mayo              8
4      happy new year             6   two years ago               12   happy new year             8
5      new york ny                6   gov chris christie          11   follow follow back         6
6      two weeks ago              6   new york city               11   lovz lovz lovz             6
7      hotel birmingham nec       5   pates fountain parks        11   lies lies lies             5
8      makes feel like            5   new york times               9   o o o                      5
9      year old daughter          5   st louis county              9   show last night            5
10     years ago now              5   us district court            9   brenda brenda brenda       4
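
Frequency tables like the ones above can be produced by sliding a window of n words over each cleaned element and counting the resulting sequences. The following is a minimal sketch under the assumption that each element has already been tokenized and had stopwords removed; `ngrams` and `top_ngrams` are illustrative names, not functions from the report.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    # zip over n shifted copies of the list; stops at the shortest copy
    return zip(*(tokens[i:] for i in range(n)))


def top_ngrams(documents, n, k=10):
    """Count n-grams within each document and return the k most frequent.

    `documents` is a list of token lists. N-grams never span two
    documents, since each document is windowed separately.
    """
    counts = Counter()
    for tokens in documents:
        counts.update(" ".join(gram) for gram in ngrams(tokens, n))
    return counts.most_common(k)


# Toy input standing in for two cleaned, tokenized elements
docs = [["new", "york", "city"], ["new", "york", "times"]]
top_bigrams = top_ngrams(docs, n=2, k=1)
```

Because stopwords are removed before windowing in this sketch, a trigram such as ‘makes feel like’ can arise from text where a stopword originally sat between the words.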

Next Steps

The plan ahead for developing the prediction algorithm and app can be grouped under the following main steps: