This document presents an exploratory analysis of the English data of the HC Corpora. The data will later be used to build a predictive text model. The main objective is to understand the corpus and obtain some basic statistics and features.
The HC Corpora data consists of four corpora, each in a different language (English, German, Finnish, and Russian). For every language there is a Twitter, a news, and a blogs file, named after the source the text comes from. For this task and the predictive model, only the English corpus will be used.
The corpus consists of the following three files:
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
Given that this is only an exploratory data analysis, only a small portion of each file's lines will be used. To keep the sample representative, the lines are drawn at random. For the blogs and news files 22500 lines are extracted, while for the Twitter file 67500 lines are chosen. This is because a Twitter entry is assumed to be much shorter than a news or blog one, and we want a similar number of words from each source for the statistics.
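A minimal sketch of the sampling step is shown below. The file names and sample sizes follow the description above, but the seed and helper name are illustrative and not necessarily the code used for this report.

```r
# Sketch: draw a random sample of lines from each file.
# Assumes the en_US.* files are in the working directory.
set.seed(1234)  # illustrative seed, not necessarily the one used in the report

sample_lines <- function(path, n) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, n)
}

blogs.sample   <- sample_lines("en_US.blogs.txt",   22500)
news.sample    <- sample_lines("en_US.news.txt",    22500)
twitter.sample <- sample_lines("en_US.twitter.txt", 67500)
```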
The basic statistics for each file are shown below.
| file | size (bytes) | words (in sample) | importedLines | totalLines | avgWords (per line) | estWords (full file) |
|---|---|---|---|---|---|---|
| blogs.stats | 210160014 | 927872 | 22500 | 899288 | 41.2 | 37085518 |
| news.stats | 205811889 | 748821 | 22500 | 1010242 | 33.3 | 33621797 |
| twitter.stats | 167105338 | 847991 | 67500 | 2360148 | 12.6 | 29650137 |
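These figures could be obtained roughly as sketched below: the file size is read from disk, the sampled words come from splitting on whitespace, and the estimated total words are the average words per sampled line times the total line count. Function and column names mirror the table but are illustrative.

```r
# Sketch: per-file statistics (illustrative; counting totalLines by reading
# the whole file is slow but simple).
file_stats <- function(path, sampled) {
  words.sampled <- sum(lengths(strsplit(sampled, "\\s+")))
  total.lines   <- length(readLines(path, skipNul = TRUE))
  avg.words     <- words.sampled / length(sampled)
  data.frame(
    size          = file.size(path),                 # bytes on disk
    words         = words.sampled,                   # words in the sampled lines
    importedLines = length(sampled),
    totalLines    = total.lines,
    avgWords      = round(avg.words, 1),             # average words per sampled line
    estWords      = round(avg.words * total.lines)   # estimated words in the full file
  )
}

blogs.stats <- file_stats("en_US.blogs.txt", blogs.sample)
```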
Next, tokenization is applied to the three samples in order to build the corresponding n-grams, with n taking integer values from 1 to 4. The most frequent n-grams are plotted against the normalized cumulative frequency.
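The report does not state which tokenizer was used; one simple possibility is a whitespace tokenizer in base R, as sketched here, which also computes the normalized cumulative frequency shown in the plot below.

```r
# Sketch: build n-grams per line so they never cross line boundaries
# (a dedicated package such as quanteda or tm could be used instead).
make_ngrams <- function(lines, n) {
  unlist(lapply(lines, function(line) {
    words <- strsplit(gsub("[^a-z' ]", " ", tolower(line)), "\\s+")[[1]]
    words <- words[words != ""]
    if (length(words) < n) return(character(0))
    vapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
}

# Sorted frequency table and normalized cumulative frequency, e.g. for unigrams
unigrams <- sort(table(make_ngrams(blogs.sample, 1)), decreasing = TRUE)
cum.freq <- cumsum(unigrams) / sum(unigrams)
```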
N-gram normalized cumulative frequency
From the plot, one can see that the most frequent n-grams are the ones that contribute the most to the cumulative frequency. The steeper the curve, the higher the contribution of the top n-grams; the more linear the curve, the lower their contribution.
The unigram curve is very steep, which means that a small number of words is enough to cover the majority of the word occurrences in the corpus. To support this claim, we calculate the percentage of unique unigrams (words) needed to cover 85% of the total words in each file.
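This coverage percentage can be read directly off the sorted unigram frequencies, for example as follows (a sketch reusing the `unigrams` table from the tokenization step above):

```r
# Sketch: percentage of unique words needed to cover a given share of all
# word occurrences (85% by default).
coverage_pct <- function(freqs, coverage = 0.85) {
  cum <- cumsum(freqs) / sum(freqs)
  n.needed <- which(cum >= coverage)[1]
  round(100 * n.needed / length(freqs), 2)
}

coverage_pct(unigrams)  # e.g. for the blogs sample
```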
| twitter | blogs | news |
|---|---|---|
| 19.83 | 21.68 | 28.49 |
We see that Twitter has the lowest percentage (19.83%) while news has the highest (28.49%). A proposed explanation is that news articles use a more formal language than Twitter posts, which are mostly informal and personal. This results in a wider variety of words in the news file.
Next, a table is built containing the 15 most frequent bigrams and the number of times each one appears in the sampled lines, with one column per file.
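One way such a table could be assembled is sketched below with dplyr, grouping the bigram counts by an `origin` column that labels the source file; the object and column names are illustrative.

```r
# Sketch: top-15 bigrams per source file (illustrative names).
library(dplyr)

bigram_counts <- bind_rows(
  data.frame(origin = "twitter", bigram = make_ngrams(twitter.sample, 2),
             stringsAsFactors = FALSE),
  data.frame(origin = "blogs",   bigram = make_ngrams(blogs.sample, 2),
             stringsAsFactors = FALSE),
  data.frame(origin = "news",    bigram = make_ngrams(news.sample, 2),
             stringsAsFactors = FALSE)
) %>%
  count(origin, bigram) %>%   # times each bigram appears, per source
  group_by(origin) %>%
  slice_max(n, n = 15)        # keep the 15 most frequent per source
```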
| twitter | blogs | news |
|---|---|---|
| in the: 2213 | of the: 4755 | of the: 4088 |
| for the: 2080 | in the: 3792 | in the: 3958 |
| of the: 1688 | to the: 2161 | to the: 1787 |
| on the: 1383 | on the: 1809 | on the: 1705 |
| to be: 1310 | to be: 1681 | for the: 1508 |
| to the: 1250 | and the: 1471 | at the: 1291 |
| thanks for: 1201 | for the: 1468 | and the: 1202 |
| at the: 1062 | and i: 1261 | in a: 1180 |
| i love: 1055 | i was: 1202 | to be: 1055 |
| going to: 996 | it was: 1191 | with the: 975 |
| have a: 966 | at the: 1175 | from the: 804 |
| thank you: 919 | is a: 1163 | with a: 754 |
| if you: 916 | it is: 1141 | he said: 737 |
| for a: 881 | i have: 1103 | of a: 713 |
| i dont: 853 | with the: 1080 | as a: 712 |
As can be seen, the Twitter column shows a lot more personal expression, while in the news and blogs columns almost all of the top bigrams are connectors and prepositions.
There are notable differences between the data files in how words relate to each other, as evidenced by the n-gram models. An ideal predictor would take into account the context in which a word is used in order to make the correct prediction. Given that, a different model should be fitted to each data file, and the algorithm must then learn which model is appropriate for a given input. This approach could also speed up prediction, since smaller, separate models are faster to query.
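As a rough illustration of that idea, prediction could dispatch to the per-source bigram counts and return the most frequent continuation. This is purely a sketch of the proposed architecture, reusing the illustrative `bigram_counts` object from above (which only holds the top 15 bigrams per source, whereas a real model would keep the full table).

```r
# Sketch: pick the most likely next word for a given source file.
predict_next <- function(word, src = c("twitter", "blogs", "news")) {
  src <- match.arg(src)
  candidates <- bigram_counts %>%
    filter(origin == src, startsWith(bigram, paste0(word, " "))) %>%
    arrange(desc(n))
  if (nrow(candidates) == 0) return(NA_character_)
  sub("^\\S+ ", "", candidates$bigram[1])  # second word of the best bigram
}

predict_next("thank", src = "twitter")
```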
Since this is an executive type of report, I tried to include as little code as possible in the HTML. The source code can be found on GitHub. If there is any doubt about how this report was made, please refer to the source or contact the author directly.