Synopsis

This document presents an exploratory analysis of the English portion of the HC Corpora data. That data will later be used to create a predictive text model. The main objective is to understand the corpus and obtain some basic statistical information and features.

The HC Corpora data consists of four corpora, each in a different language (English, German, Finnish and Russian). For every language there is a twitter, news and blogs file, indicating the source from which the text comes. For this task and the predictive model only the English corpus will be used.

Data Analysis

The corpus consists of the following three files:

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Given that this is only an exploratory data analysis, only a small portion of each file's lines is used. To keep the sample representative, the lines are drawn at random. For the blogs and news files 22,500 lines are extracted, while for the twitter file 67,500 lines are chosen, because a Twitter entry is assumed to be much shorter than a news or blog entry and we want a similar number of words from each source for the statistics.
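A minimal sketch of this sampling step in base R, assuming the files live in a data/ folder; the folder path, the helper name sample_lines and the fixed seed are illustrative assumptions, not the exact code used for the report.

# Read every line of a file and keep a random sample of n of them.
sample_lines <- function(path, n, seed = 1234) {
  set.seed(seed)                                   # reproducible sampling
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, min(n, length(lines)))
}

blogs   <- sample_lines("data/en_US.blogs.txt",   22500)
news    <- sample_lines("data/en_US.news.txt",    22500)
twitter <- sample_lines("data/en_US.twitter.txt", 67500)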

The basic statistics for every file are shown below: the file size in bytes, the number of words in the imported sample, the number of imported lines, the total number of lines in the full file, the average number of words per imported line, and the estimated word count for the whole file.

               size (bytes)    words   importedLines   totalLines   avgWords    estWords
blogs.stats       210160014   927872           22500       899288       41.2    37085518
news.stats        205811889   748821           22500      1010242       33.3    33621797
twitter.stats     167105338   847991           67500      2360148       12.6    29650137
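These figures can be reproduced with something along the following lines. This is a sketch only: it reuses the hypothetical sample_lines output from above, and counting words by splitting on whitespace is an assumption about how the report counted them.

# File size, word count of the sample, line counts, and an estimate of the
# total word count of the whole file (average words per line x total lines).
file_stats <- function(path, sampled) {
  total_lines <- length(readLines(path, skipNul = TRUE))
  words       <- sum(lengths(strsplit(sampled, "\\s+")))
  avg_words   <- words / length(sampled)
  data.frame(size          = file.size(path),
             words         = words,
             importedLines = length(sampled),
             totalLines    = total_lines,
             avgWords      = round(avg_words, 1),
             estWords      = round(avg_words * total_lines))
}

blogs.stats   <- file_stats("data/en_US.blogs.txt",   blogs)
news.stats    <- file_stats("data/en_US.news.txt",    news)
twitter.stats <- file_stats("data/en_US.twitter.txt", twitter)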

After that, tokenization is applied to the three samples in order to create the respective n-grams, with n taking integer values between 1 and 4. The most frequent n-grams are plotted against the normalized cumulative frequency.
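The tokenization code is not shown in the report; the following base-R sketch illustrates how the n-gram counts and their normalized cumulative frequency could be obtained. The cleaning rules (lower-casing, keeping only letters and apostrophes) and the helper names tokenize and ngram_freq are assumptions.

# Lower-case, keep only letters and apostrophes, split on whitespace.
tokenize <- function(lines) {
  tokens <- unlist(strsplit(tolower(gsub("[^A-Za-z' ]", " ", lines)), "\\s+"))
  tokens[nchar(tokens) > 0]
}

# Count n-grams by pasting together n consecutive tokens.
ngram_freq <- function(tokens, n) {
  if (n > 1) {
    m     <- length(tokens) - n + 1
    grams <- tokens[1:m]
    for (k in 2:n) grams <- paste(grams, tokens[k:(m + k - 1)])
  } else grams <- tokens
  sort(table(grams), decreasing = TRUE)
}

# Normalized cumulative frequency, e.g. for the twitter bigrams.
freq <- ngram_freq(tokenize(twitter), 2)
cum  <- cumsum(freq) / sum(freq)
plot(cum, type = "l", xlab = "n-gram rank", ylab = "normalized cumulative frequency")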

[Figure: n-gram normalized cumulative frequency]

From the plot one can see that the most frequent n-grams are the ones that contribute the most to the cumulative frequency. The steeper the curve, the larger the contribution of the top n-grams; the more linear the curve, the more evenly the frequency is spread across n-grams.

The unigram curve is very steep, which means that a small number of distinct words is enough to cover the majority of the word occurrences in the corpus. To support this claim, we calculate the percentage of unigrams (unique words) needed to cover 85% of the total words in each file.
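This coverage figure can be computed from the sorted unigram counts, for example as follows; this is a sketch that reuses the hypothetical tokenize and ngram_freq helpers from above.

# Percentage of distinct words needed to cover `target` of all word occurrences.
coverage_pct <- function(tokens, target = 0.85) {
  freq <- ngram_freq(tokens, 1)                      # unigram counts, most frequent first
  rank <- which(cumsum(freq) / sum(freq) >= target)[1]
  round(100 * rank / length(freq), 2)
}

c(twitter = coverage_pct(tokenize(twitter)),
  blogs   = coverage_pct(tokenize(blogs)),
  news    = coverage_pct(tokenize(news)))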

                                     twitter    blogs     news
unique words for 85% coverage (%)      19.83    21.68    28.49

We see that twitter has the lowest percentage (19.83%) while news has the highest (28.49%). The proposed explanation is that news articles use more formal language than Twitter posts, which are mostly informal and personal, and this in turn leads to a wider variety of words in the news file.

After that, a table is built with the 15 most frequent bigrams and the number of times each one appears in the sampled lines, with one column per file.
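The report itself appears to build this table with dplyr (grouping by an origin column); an equivalent base-R sketch using the hypothetical helpers above could look like this.

# Top k bigrams of a sample, formatted as "bigram: count".
top_bigrams <- function(tokens, k = 15) {
  freq <- ngram_freq(tokens, 2)
  paste0(names(freq)[1:k], ": ", freq[1:k])
}

data.frame(twitter = top_bigrams(tokenize(twitter)),
           blogs   = top_bigrams(tokenize(blogs)),
           news    = top_bigrams(tokenize(news)))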

twitter              blogs               news
in the: 2213         of the: 4755        of the: 4088
for the: 2080        in the: 3792        in the: 3958
of the: 1688         to the: 2161        to the: 1787
on the: 1383         on the: 1809        on the: 1705
to be: 1310          to be: 1681         for the: 1508
to the: 1250         and the: 1471       at the: 1291
thanks for: 1201     for the: 1468       and the: 1202
at the: 1062         and i: 1261         in a: 1180
i love: 1055         i was: 1202         to be: 1055
going to: 996        it was: 1191        with the: 975
have a: 966          at the: 1175        from the: 804
thank you: 919       is a: 1163          with a: 754
if you: 916          it is: 1141         he said: 737
for a: 881           i have: 1103        of a: 713
i dont: 853          with the: 1080      as a: 712

As can be seen, the Twitter column contains much more personal expression, while in news and blogs almost all of the most frequent bigrams are connectors and prepositions.

Conclusions

There are notable differences between the data files and in how their words relate to each other (as evidenced by the n-gram models). An ideal predictor would take into account the context in which a word is used in order to make the correct prediction. Given that, a different model should be fitted to each data file, and the algorithm must then learn which model is the correct one to use for a given input. This approach should also make the process faster, because using smaller separate models could increase the speed of the prediction process.

Disclaimer

Since this is an executive type of report, I tried to include as little code as possible in the HTML. If you want to see the source code, it can be found on GitHub. In case there is any doubt about how this report was made, please refer to the source or contact the author directly.