This report provides an exploratory analysis of a large sample of blogs, news and tweets in English taken from HC Corpora. It begins by exploring the size of the data sets (or corpora) and then proceeds to analyze the distribution of terms (words, essentially), bigrams and trigrams. Finally, next steps for modelling are laid out.
The table below summarizes the size of the 3 data sets in terms of number of elements, terms (or words) and unique terms (after removing numbers and punctuation). Given the sheer size of the data sets (blogs with 899,288 elements; news with 1,010,242; and twitter with 2,360,148), only a portion of each data set was explored here, specifically 10,000 elements from each. While this doesn’t cover the majority of the raw data, this sample should suffice for analytical purposes.
Blogs contain the most terms (317,243) and unique terms (31,048). Tweets are much smaller in size than blogs or news, which is not surprising given the restrictions on their length.
| Data set | Elements | Total Terms | Unique Terms |
|---|---|---|---|
| blogs | 10,000 | 317,243 | 31,048 |
| news | 10,000 | 273,688 | 30,565 |
| twitter | 10,000 | 95,010 | 15,155 |
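As a rough illustration of how these summary counts can be produced, the Python sketch below samples 10,000 elements (lines) from each file and tallies total and unique terms after stripping numbers and punctuation. The file names, cleaning rule and random seed are assumptions for the sketch, not details taken from the original analysis.

```python
import random
import re

def summarize_sample(path, sample_size=10_000, seed=42):
    """Sample lines from a corpus file and count total and unique terms.

    Numbers and punctuation are dropped before counting, mirroring the
    preprocessing described above (the exact tokenization is assumed).
    """
    with open(path, encoding="utf-8", errors="ignore") as f:
        lines = f.readlines()

    random.seed(seed)
    sample = random.sample(lines, min(sample_size, len(lines)))

    terms = []
    for line in sample:
        cleaned = re.sub(r"[^a-z\s]", " ", line.lower())  # keep letters only
        terms.extend(cleaned.split())

    return {
        "elements": len(sample),
        "total_terms": len(terms),
        "unique_terms": len(set(terms)),
    }

# Assumed file names; adjust to wherever the corpora are stored locally.
for name in ("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"):
    print(name, summarize_sample(name))
```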
The distribution of term frequencies (after removing standard ‘stopwords’) is explored in the plots below. The predominant theme is that term frequencies are highly skewed, with the vast majority of terms appearing relatively few times and a few terms appearing very frequently. The news data set is especially skewed, which may be due to the relatively standardized language and format of news articles compared with blogs and tweets.
Anywhere from roughly 1% to 4% of unique terms are required to cover at least 50% of term instances among the data sets. For 90% coverage, roughly 40% to 56% of unique terms are needed.
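For context, coverage here means sorting unique terms from most to least frequent and walking down the list until the cumulative count reaches the target share of all term instances. A minimal Python sketch of that calculation, with a toy stopword list standing in for the standard one, might look like this:

```python
from collections import Counter

# Tiny illustrative stopword list; in practice this would come from a
# standard stopword resource (an assumption, not the report's exact list).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"}

def coverage_fraction(terms, target=0.5):
    """Fraction of unique terms (most frequent first) needed to cover
    `target` of all term instances, after stopword removal."""
    counts = Counter(t for t in terms if t not in STOPWORDS)
    total = sum(counts.values())
    running, needed = 0, 0
    for _, freq in counts.most_common():
        running += freq
        needed += 1
        if running / total >= target:
            break
    return needed / len(counts)

# Toy token list; in the report this would be a sampled, cleaned corpus.
tokens = "the cat sat on the mat and the cat ate".split()
print(coverage_fraction(tokens, 0.5), coverage_fraction(tokens, 0.9))
```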
The most frequent terms below provide some extra colour to the previous distribution plots. As might be expected, the word ‘said’ is by far the most frequent term among news documents. The top twitter terms tend to reflect views or opinions about something, such as ‘like’, ‘love’ and ‘good’.
| Rank | blogs | Freq | news | Freq | twitter | Freq |
|---|---|---|---|---|---|---|
| 1 | one | 1,327 | said | 2,484 | just | 656 |
| 2 | will | 1,233 | will | 1,084 | like | 509 |
| 3 | just | 1,158 | one | 835 | get | 473 |
| 4 | can | 1,137 | new | 675 | love | 435 |
| 5 | like | 1,091 | year | 598 | good | 419 |
| 6 | time | 1,016 | also | 586 | will | 402 |
| 7 | get | 768 | can | 576 | thanks | 394 |
| 8 | know | 717 | two | 576 | can | 385 |
| 9 | now | 676 | just | 552 | day | 374 |
| 10 | people | 627 | last | 524 | one | 359 |
The plots below illustrate how bigrams and trigrams are distributed in the data sets. The main difference between these distributions and that of single terms is that the tail is generally fatter, especially for trigrams. This makes sense, as any given sequence of words is unlikely to appear very frequently, and the longer the sequence, the rarer it becomes.
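To make the counting concrete, frequency tables like those below can be built by sliding a window of length n over the token stream and tallying each window. The Python sketch here shows the idea; the tokenization and any stopword filtering applied beforehand are assumptions rather than the report’s exact pipeline.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams (as space-joined strings) in a token list."""
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)

# Toy example; in the report, `tokens` would be the cleaned sample corpus.
tokens = "new york is not new jersey but new york is big".split()
print(ngram_counts(tokens, 2).most_common(3))  # top bigrams
print(ngram_counts(tokens, 3).most_common(3))  # top trigrams
```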
The most frequent bigrams and trigrams in the news data set are what one would expect (mainly cities, time references and politicians), as are those among tweets, where greetings dominate. Other findings worth noting are the preponderance of New York-related terms and the trigram ‘u u u’ taking first place in the news data set.
| Rank | blogs | Freq | news | Freq | twitter | Freq |
|---|---|---|---|---|---|---|
| 1 | can see | 62 | last year | 126 | right now | 69 |
| 2 | first time | 59 | new york | 108 | last night | 57 |
| 3 | new york | 59 | high school | 89 | happy birthday | 39 |
| 4 | make sure | 58 | st louis | 79 | just got | 37 |
| 5 | right now | 50 | years ago | 69 | good morning | 36 |
| 6 | even though | 49 | new jersey | 66 | looking forward | 34 |
| 7 | last year | 49 | last week | 60 | can get | 32 |
| 8 | feel like | 46 | los angeles | 46 | follow back | 31 |
| 9 | years ago | 46 | first time | 45 | thanks follow | 30 |
| 10 | every day | 41 | health care | 43 | feel like | 25 |
| Rank | blogs | Freq | news | Freq | twitter | Freq |
|---|---|---|---|---|---|---|
| 1 | new york times | 9 | u u u | 17 | happy mothers day | 17 |
| 2 | new york city | 8 | president barack obama | 13 | let us know | 9 |
| 3 | amazon services llc | 6 | first time since | 12 | cinco de mayo | 8 |
| 4 | happy new year | 6 | two years ago | 12 | happy new year | 8 |
| 5 | new york ny | 6 | gov chris christie | 11 | follow follow back | 6 |
| 6 | two weeks ago | 6 | new york city | 11 | lovz lovz lovz | 6 |
| 7 | hotel birmingham nec | 5 | pates fountain parks | 11 | lies lies lies | 5 |
| 8 | makes feel like | 5 | new york times | 9 | o o o | 5 |
| 9 | year old daughter | 5 | st louis county | 9 | show last night | 5 |
| 10 | years ago now | 5 | us district court | 9 | brenda brenda brenda | 4 |
The plan ahead for developing the prediction algorithm and app can be grouped under the following main steps: