This document presents an exploratory analysis of the English data of the HC Corpora. The data will later be used to build a predictive text model. The main objective is to understand the corpus and obtain some basic statistics and features.
The HC Corpora data consists of four corpora, each in a different language (English, German, Finnish, and Russian). For every language there is a Twitter, a news, and a blogs file, named after the source the text comes from. For this task and the predictive model, only the English corpus will be used.
The corpus consists of the following three files:
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
Given that this is only an exploratory data analysis, only a small portion of each file's lines will be used. To keep the sample representative, the lines are drawn at random. For the blogs and news files 22500 lines are extracted, while for the Twitter file 67500 lines are chosen. This is because a Twitter entry is assumed to be much shorter than a news or blog one, and we want a similar number of words from each source for the statistics.
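A minimal sketch of the sampling step is shown below. The file names and sample sizes follow the description above, but the seed and helper name are illustrative and not necessarily the code used for this report.

```r
# Sketch: draw a random sample of lines from each file.
# Assumes the en_US.* files are in the working directory.
set.seed(1234)  # illustrative seed, not necessarily the one used in the report

sample_lines <- function(path, n) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, n)
}

blogs.sample   <- sample_lines("en_US.blogs.txt",   22500)
news.sample    <- sample_lines("en_US.news.txt",    22500)
twitter.sample <- sample_lines("en_US.twitter.txt", 67500)
```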
The basic statistics for each file are shown below.
| file | size (bytes) | words (in sample) | importedLines | totalLines | avgWords (per line) | estWords (full file) |
|---|---|---|---|---|---|---|
| blogs.stats | 210160014 | 927872 | 22500 | 899288 | 41.2 | 37085518 |
| news.stats | 205811889 | 748821 | 22500 | 1010242 | 33.3 | 33621797 |
| twitter.stats | 167105338 | 847991 | 67500 | 2360148 | 12.6 | 29650137 |
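These figures could be obtained roughly as sketched below: the file size is read from disk, the sampled words come from splitting on whitespace, and the estimated total words are the average words per sampled line times the total line count. Function and column names mirror the table but are illustrative.

```r
# Sketch: per-file statistics (illustrative; counting totalLines by reading
# the whole file is slow but simple).
file_stats <- function(path, sampled) {
  words.sampled <- sum(lengths(strsplit(sampled, "\\s+")))
  total.lines   <- length(readLines(path, skipNul = TRUE))
  avg.words     <- words.sampled / length(sampled)
  data.frame(
    size          = file.size(path),                 # bytes on disk
    words         = words.sampled,                   # words in the sampled lines
    importedLines = length(sampled),
    totalLines    = total.lines,
    avgWords      = round(avg.words, 1),             # average words per sampled line
    estWords      = round(avg.words * total.lines)   # estimated words in the full file
  )
}

blogs.stats <- file_stats("en_US.blogs.txt", blogs.sample)
```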
Next, tokenization is applied to the three samples in order to build the corresponding n-grams, with n taking integer values from 1 to 4. The most frequent n-grams are plotted against the normalized cumulative frequency.
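The report does not state which tokenizer was used; one simple possibility is a whitespace tokenizer in base R, as sketched here, which also computes the normalized cumulative frequency shown in the plot below.

```r
# Sketch: build n-grams per line so they never cross line boundaries
# (a dedicated package such as quanteda or tm could be used instead).
make_ngrams <- function(lines, n) {
  unlist(lapply(lines, function(line) {
    words <- strsplit(gsub("[^a-z' ]", " ", tolower(line)), "\\s+")[[1]]
    words <- words[words != ""]
    if (length(words) < n) return(character(0))
    vapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
}

# Sorted frequency table and normalized cumulative frequency, e.g. for unigrams
unigrams <- sort(table(make_ngrams(blogs.sample, 1)), decreasing = TRUE)
cum.freq <- cumsum(unigrams) / sum(unigrams)
```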
N-gram normalized cumulative frequency
From the plot, one can see that the most frequent n-grams are the ones that contribute the most to the cumulative frequency. The steeper the curve, the higher the contribution of the top n-grams; the more linear the curve, the lower their contribution.
The unigram curve is very steep, which means that a small number of words is enough to cover the majority of the word occurrences in the corpus. To support this claim, we calculate the percentage of unique unigrams (words) needed to cover 85% of the total words in each file.
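This coverage percentage can be read directly off the sorted unigram frequencies, for example as follows (a sketch reusing the `unigrams` table from the tokenization step above):

```r
# Sketch: percentage of unique words needed to cover a given share of all
# word occurrences (85% by default).
coverage_pct <- function(freqs, coverage = 0.85) {
  cum <- cumsum(freqs) / sum(freqs)
  n.needed <- which(cum >= coverage)[1]
  round(100 * n.needed / length(freqs), 2)
}

coverage_pct(unigrams)  # e.g. for the blogs sample
```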
| twitter | blogs | news |
|---|---|---|
| 19.83 | 21.68 | 28.49 |
We see that Twitter has the lowest percentage (19.83%) while news has the highest (28.49%). A proposed explanation is that news articles use a more formal language than Twitter posts, which are mostly informal and personal. This results in a wider variety of words in the news file.
Next, a table is built containing the 15 most frequent bigrams and the number of times each one appears in the sampled lines, with one column per file.
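One way such a table could be assembled is sketched below with dplyr, grouping the bigram counts by an `origin` column that labels the source file; the object and column names are illustrative.

```r
# Sketch: top-15 bigrams per source file (illustrative names).
library(dplyr)

bigram_counts <- bind_rows(
  data.frame(origin = "twitter", bigram = make_ngrams(twitter.sample, 2),
             stringsAsFactors = FALSE),
  data.frame(origin = "blogs",   bigram = make_ngrams(blogs.sample, 2),
             stringsAsFactors = FALSE),
  data.frame(origin = "news",    bigram = make_ngrams(news.sample, 2),
             stringsAsFactors = FALSE)
) %>%
  count(origin, bigram) %>%   # times each bigram appears, per source
  group_by(origin) %>%
  slice_max(n, n = 15)        # keep the 15 most frequent per source
```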
| twitter | blogs | news |
|---|---|---|
| in the: 2213 | of the: 4755 | of the: 4088 |
| for the: 2080 | in the: 3792 | in the: 3958 |
| of the: 1688 | to the: 2161 | to the: 1787 |
| on the: 1383 | on the: 1809 | on the: 1705 |
| to be: 1310 | to be: 1681 | for the: 1508 |
| to the: 1250 | and the: 1471 | at the: 1291 |
| thanks for: 1201 | for the: 1468 | and the: 1202 |
| at the: 1062 | and i: 1261 | in a: 1180 |
| i love: 1055 | i was: 1202 | to be: 1055 |
| going to: 996 | it was: 1191 | with the: 975 |
| have a: 966 | at the: 1175 | from the: 804 |
| thank you: 919 | is a: 1163 | with a: 754 |
| if you: 916 | it is: 1141 | he said: 737 |
| for a: 881 | i have: 1103 | of a: 713 |
| i dont: 853 | with the: 1080 | as a: 712 |
As can be seen, the Twitter column shows a lot more personal expression, while in the news and blogs columns almost all of the top bigrams are connectors and prepositions.
There are notable differences between the data files in how words relate to each other, as evidenced by the n-gram models. An ideal predictor would take into account the context in which a word is used in order to make the correct prediction. Given that, a different model should be fitted to each data file, and the algorithm must then learn which model is appropriate for a given input. This approach could also speed up prediction, since smaller, separate models are faster to query.
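As a rough illustration of that idea, prediction could dispatch to the per-source bigram counts and return the most frequent continuation. This is purely a sketch of the proposed architecture, reusing the illustrative `bigram_counts` object from above (which only holds the top 15 bigrams per source, whereas a real model would keep the full table).

```r
# Sketch: pick the most likely next word for a given source file.
predict_next <- function(word, src = c("twitter", "blogs", "news")) {
  src <- match.arg(src)
  candidates <- bigram_counts %>%
    filter(origin == src, startsWith(bigram, paste0(word, " "))) %>%
    arrange(desc(n))
  if (nrow(candidates) == 0) return(NA_character_)
  sub("^\\S+ ", "", candidates$bigram[1])  # second word of the best bigram
}

predict_next("thank", src = "twitter")
```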
Since this is an executive type of report, I tried to include as little code as possible in the HTML. The source code can be found on GitHub. If there is any doubt about how this report was made, please refer to the source or contact the author directly.