1. Introduction

This is an exploratory analysis of the HC Corpora (www.corpora.heliohost.org), which was downloaded from the link provided by Coursera.

The purpose of this analysis is to find the key properties of the data so that we can build a model to predict potential user input. Here the key properties are the word and word-pair frequencies in the corpus. With this information we can predict the word a user will type next from its neighbouring words.

2. Corpus data acquisition and cleaning

2.1 Database Directory Structure

The HC Corpora has already been downloaded into a directory named “final”; its directory structure is shown below:

../final
├── de_DE
├── en_US
├── fi_FI
└── ru_RU

As you can see, the corpus contains text files in multiple languages. To keep things simple, we will analyse only the English (en_US) corpus for now.

2.2 Basic Information

The first step is to get the line, word, and byte counts of each English corpus file:

                       Lines       Words        Bytes
en_US.twitter.txt    2360148    30374206    167105338
en_US.blogs.txt       899288    37334690    210160014
en_US.news.txt       1010242    34372720    205811889
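
One way to compute such counts in R is sketched below; the paths under ../final/en_US/ are an assumption, and this is not necessarily how the table above was produced.

    # Count lines, words, and bytes for one corpus file.
    count_file <- function(path) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      data.frame(File  = basename(path),
                 Lines = length(lines),
                 Words = sum(lengths(strsplit(lines, "\\s+"))),
                 Bytes = file.size(path))
    }

    files <- file.path("../final/en_US",
                       c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"))
    do.call(rbind, lapply(files, count_file))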

2.3 Sampling

As shown in section 2.2, the corpus files are very large and contain a huge number of lines. To understand the patterns we do not need the full files; we can instead sample a very small subset. The assumption is that the key properties of the subset are representative of the original data. Since the key properties we use are word and word-pair frequencies, they can indeed be estimated from a small subset that mimics the population.

We save the sampled data to a file “subdata.txt” in the same directory as the original corpus. We randomly choose about 1% of the lines from the blogs, news, and twitter files and combine them into this single subdata.txt file; a sketch of the sampling step is shown below, followed by the updated file listing.
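
A minimal sketch of this 1% sampling, assuming the en_US files sit under ../final/en_US/ and using a per-line coin flip at rate 0.01 (the seed value is arbitrary):

    set.seed(1234)                       # arbitrary seed for reproducibility
    sample_lines <- function(path, rate = 0.01) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      lines[as.logical(rbinom(length(lines), 1, rate))]   # keep ~1% of lines
    }

    files <- file.path("../final/en_US",
                       c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
    writeLines(unlist(lapply(files, sample_lines)), "../final/en_US/subdata.txt")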

                       Lines       Words        Bytes
en_US.twitter.txt    2360148    30374206    167105338
en_US.blogs.txt       899288    37334690    210160014
en_US.news.txt       1010242    34372720    205811889
subdata.txt            43800     1088933      6181730

3. Exploratory analysis

3.1 Data Preparation

Now that we have a small file, subdata.txt, we will do the exploratory analysis on it. We use the excellent R package quanteda for this task; see the attachment for the code.

The procedure is as follows (a quanteda sketch is shown after the list):

  1. Use quanteda in R to read the sample and create a corpus.
  2. Create the document-feature matrix:
    2.1 Convert all words to lower case.
    2.2 Tokenize into unigrams, bigrams, and trigrams.
    2.3 Remove unwanted tokens (here we remove numbers and punctuation).
    2.4 Ignore the Seven Dirty Words (https://en.wikipedia.org/wiki/Seven_dirty_words).
    2.5 Stem using the English stemming rules.
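
Below is a minimal sketch of this pipeline with quanteda's tokens/dfm functions. The profanity list file name is a placeholder, and the step numbers in the comments map back to the list above; the attachment contains the actual code.

    library(quanteda)

    text <- readLines("../final/en_US/subdata.txt", encoding = "UTF-8", skipNul = TRUE)
    corp <- corpus(text)

    # Placeholder: one word per line, the Seven Dirty Words
    profanity <- readLines("seven_dirty_words.txt")

    toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)   # step 2.3
    toks <- tokens_tolower(toks)                                       # step 2.1
    toks <- tokens_remove(toks, pattern = profanity)                   # step 2.4
    toks <- tokens_wordstem(toks, language = "english")                # step 2.5

    dfm1 <- dfm(toks)                          # unigrams
    dfm2 <- dfm(tokens_ngrams(toks, n = 2))    # bigrams  (step 2.2)
    dfm3 <- dfm(tokens_ngrams(toks, n = 3))    # trigrams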

3.2 Word/Word-Pair Counts Analysis

The following shows the top 10 counts for unigrams, bigrams, and trigrams. From these counts we can see which words and word pairs are the most frequent, which is what the prediction model will be built on.

unigram  the     to      and     a       of      in      i       for     is      that
count    51464   29744   26030   25443   21659   17604   17523   11827   11316   11192

bigram   of_the  in_the  on_the  to_the  for_the  to_be  at_the  and_the  in_a  it_was
count    4610    4533    2244    2234    2189     1884   1544    1388     1258  1145

trigram  one_of_the  a_lot_of  thank_for_the  i_want_to  to_be_a  go_to_be  look_forward_to  be_abl_to  out_of_the  the_end_of
count    382         331       248            233        205      203       185              173        159         155
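
These top-10 lists correspond to quanteda's topfeatures() applied to the three document-feature matrices built in section 3.1 (a sketch reusing dfm1, dfm2, and dfm3 from the pipeline above):

    topfeatures(dfm1, 10)   # top 10 unigrams
    topfeatures(dfm2, 10)   # top 10 bigrams
    topfeatures(dfm3, 10)   # top 10 trigrams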

As the table above shows, some words are much more frequent than others. That means we can exclude low-frequency words from the word list to save space while still covering most of the probability mass. The following shows the token counts for the original data and for the subsets needed to cover 90% and 50% of all token occurrences.

Table 1: Token Counts for Different Probability Coverage
          100% Covered   90% Covered   50% Covered
unigram          60263          7791           140
bigram          451380        344657         27528
trigram         889120        782397        355505
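
The coverage counts in Table 1 can be computed by sorting the token frequencies and finding how many distinct tokens are needed before the cumulative frequency reaches the target coverage. A sketch (coverage_count is a helper name of my own choosing):

    # How many distinct tokens cover a given fraction of all token occurrences?
    coverage_count <- function(d, coverage) {
      freq <- sort(colSums(d), decreasing = TRUE)
      cum  <- cumsum(freq) / sum(freq)
      if (coverage >= 1) length(freq) else which(cum >= coverage)[1]
    }

    sapply(list(unigram = dfm1, bigram = dfm2, trigram = dfm3),
           function(d) c(`100%` = coverage_count(d, 1.0),
                         `90%`  = coverage_count(d, 0.9),
                         `50%`  = coverage_count(d, 0.5)))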

3.3 Word/Word-Pair Counts Analysis Without Stop Words

Sometimes we do not want to predict English stop words, so for comparison we show the same counts and tables here with stop words removed.
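
Dropping stop words only adds one step to the pipeline from section 3.1: removing quanteda's built-in English stop word list, ideally before stemming so the stop words still match their surface forms. A sketch, reusing corp and profanity from the earlier code:

    toks_nostop <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
    toks_nostop <- tokens_tolower(toks_nostop)
    toks_nostop <- tokens_remove(toks_nostop, pattern = c(stopwords("en"), profanity))
    toks_nostop <- tokens_wordstem(toks_nostop, language = "english")

    dfm1_nostop <- dfm(toks_nostop)
    topfeatures(dfm1_nostop, 10)   # top 10 unigrams without stop words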

unigram  will    just    said    one     like    can     time    get     new     good
count    3390    3270    3214    3106    2806    2604    2319    2314    2092    1907

bigram   of_the  in_the  on_the  to_the  for_the  to_be  at_the  and_the  in_a  it_was
count    4610    4533    2244    2234    2189     1884   1544    1388     1258  1145

trigram  one_of_the  a_lot_of  thank_for_the  i_want_to  to_be_a  go_to_be  look_forward_to  be_abl_to  out_of_the  the_end_of
count    382         331       248            233        205      203       185              173        159         155

The following table shows the word/word-pair counts for the different coverage rates, without stop words.

Table 2: Token Counts for Different Probability Coverage (Without Stop Words)
          100% Covered   90% Covered   50% Covered
unigram          60091         16123          1038
bigram          451380        344657         27528
trigram         889120        782397        355505

4. Summary

So far we have obtained the word/word-pair counts, and we have seen that only a very small part of the frequency-sorted dictionary is needed to cover the bulk of the word instances in the language.

The next step is to build a basic predictive model for user input. We will build the model from the 90%-coverage data to save a lot of space, and we will use a basic N-gram algorithm (https://en.wikipedia.org/wiki/N-gram) for the model.
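
As a rough preview of that model rather than its final implementation, a next-word lookup over the trigram counts from section 3 could look like the hypothetical helper below:

    # Hypothetical sketch: return the most frequent trigram continuation of the
    # last two (lower-cased, stemmed) words the user typed.
    predict_next <- function(w1, w2, trigram_freq) {
      prefix  <- paste(w1, w2, sep = "_")
      matches <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, "_"))]
      if (length(matches) == 0) return(NA_character_)
      sub(paste0("^", prefix, "_"), "", names(matches)[which.max(matches)])
    }

    trigram_freq <- colSums(dfm3)              # dfm3 from section 3.1
    predict_next("one", "of", trigram_freq)    # should return "the"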