1. Introduction

This is an exploratory analysis of the HC Corpora (www.corpora.heliohost.org), which was downloaded from the link provided by Coursera.

The purpose of this analysis is to find the key properties of the data so that we can build a model to predict potential user input. Here the key properties are the word and word-pair frequencies in the corpus. With this information we can predict the word a user will type next from its neighbouring words.

2. Corpus data acquisition and cleaning

2.1 Database Directory Structure

The HC Corpora has already been downloaded into a directory named “final”; its directory structure is shown below:

../final
├── de_DE
├── en_US
├── fi_FI
└── ru_RU

As you can see, the corpus contains text files in multiple languages. To keep things simple, we will analyse only the English (en_US) corpus for now.

2.2 Basic Information

The first step is to get the line, word, and byte counts of each English corpus file:

                       Lines       Words        Bytes
en_US.twitter.txt    2360148    30374206    167105338
en_US.blogs.txt       899288    37334690    210160014
en_US.news.txt       1010242    34372720    205811889
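
One way to compute such counts in R is sketched below; the paths under ../final/en_US/ are an assumption, and this is not necessarily how the table above was produced.

    # Count lines, words, and bytes for one corpus file.
    count_file <- function(path) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      data.frame(File  = basename(path),
                 Lines = length(lines),
                 Words = sum(lengths(strsplit(lines, "\\s+"))),
                 Bytes = file.size(path))
    }

    files <- file.path("../final/en_US",
                       c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"))
    do.call(rbind, lapply(files, count_file))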

2.3 Sampling

As shown in section 2.2, the corpus files are very large and contain a huge number of lines. To understand the patterns we do not need the full files; we can instead sample a very small subset. The assumption is that the key properties of the subset are representative of the original data. Since the key properties we use are word and word-pair frequencies, they can indeed be estimated from a small subset that mimics the population.

We save the sampled data to a file “subdata.txt” in the same directory as the original corpus. We randomly choose about 1% of the lines from the blogs, news, and twitter files and combine them into this single subdata.txt file; a sketch of the sampling step is shown below, followed by the updated file listing.
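
A minimal sketch of this 1% sampling, assuming the en_US files sit under ../final/en_US/ and using a per-line coin flip at rate 0.01 (the seed value is arbitrary):

    set.seed(1234)                       # arbitrary seed for reproducibility
    sample_lines <- function(path, rate = 0.01) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      lines[as.logical(rbinom(length(lines), 1, rate))]   # keep ~1% of lines
    }

    files <- file.path("../final/en_US",
                       c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
    writeLines(unlist(lapply(files, sample_lines)), "../final/en_US/subdata.txt")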

                       Lines       Words        Bytes
en_US.twitter.txt    2360148    30374206    167105338
en_US.blogs.txt       899288    37334690    210160014
en_US.news.txt       1010242    34372720    205811889
subdata.txt            43800     1088933      6181730

3. Exploratory analysis

3.1 Data Preparation

Now that we have a small file, subdata.txt, we will do the exploratory analysis on it. We use the excellent R package quanteda for this task; see the attachment for the code.

The procedure is as follows (a quanteda sketch is shown after the list):

  1. Use quanteda in R to read the sample and create a corpus.
  2. Create the document-feature matrix:
    2.1 Convert all words to lower case.
    2.2 Tokenize into unigrams, bigrams, and trigrams.
    2.3 Remove unwanted tokens (here we remove numbers and punctuation).
    2.4 Ignore the Seven Dirty Words (https://en.wikipedia.org/wiki/Seven_dirty_words).
    2.5 Stem using the English stemming rules.
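
Below is a minimal sketch of this pipeline with quanteda's tokens/dfm functions. The profanity list file name is a placeholder, and the step numbers in the comments map back to the list above; the attachment contains the actual code.

    library(quanteda)

    text <- readLines("../final/en_US/subdata.txt", encoding = "UTF-8", skipNul = TRUE)
    corp <- corpus(text)

    # Placeholder: one word per line, the Seven Dirty Words
    profanity <- readLines("seven_dirty_words.txt")

    toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)   # step 2.3
    toks <- tokens_tolower(toks)                                       # step 2.1
    toks <- tokens_remove(toks, pattern = profanity)                   # step 2.4
    toks <- tokens_wordstem(toks, language = "english")                # step 2.5

    dfm1 <- dfm(toks)                          # unigrams
    dfm2 <- dfm(tokens_ngrams(toks, n = 2))    # bigrams  (step 2.2)
    dfm3 <- dfm(tokens_ngrams(toks, n = 3))    # trigrams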

3.2 Word/Word-Pair Counts Analysis

The following shows the top 10 counts for unigrams, bigrams, and trigrams. From these counts we can see which words and word pairs are the most frequent, which is what the prediction model will be built on.

unigram  the     to      and     a       of      in      i       for     is      that
count    51464   29744   26030   25443   21659   17604   17523   11827   11316   11192

bigram   of_the  in_the  on_the  to_the  for_the  to_be  at_the  and_the  in_a  it_was
count    4610    4533    2244    2234    2189     1884   1544    1388     1258  1145

trigram  one_of_the  a_lot_of  thank_for_the  i_want_to  to_be_a  go_to_be  look_forward_to  be_abl_to  out_of_the  the_end_of
count    382         331       248            233        205      203       185              173        159         155
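
These top-10 lists correspond to quanteda's topfeatures() applied to the three document-feature matrices built in section 3.1 (a sketch reusing dfm1, dfm2, and dfm3 from the pipeline above):

    topfeatures(dfm1, 10)   # top 10 unigrams
    topfeatures(dfm2, 10)   # top 10 bigrams
    topfeatures(dfm3, 10)   # top 10 trigrams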

As the table above shows, some words are much more frequent than others. That means we can exclude low-frequency words from the word list to save space while still covering most of the probability mass. The following shows the token counts for the original data and for the subsets needed to cover 90% and 50% of all token occurrences.

Table 1: Token Counts for Different Probability Coverage
          100% Covered   90% Covered   50% Covered
unigram          60263          7791           140
bigram          451380        344657         27528
trigram         889120        782397        355505
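
The coverage counts in Table 1 can be computed by sorting the token frequencies and finding how many distinct tokens are needed before the cumulative frequency reaches the target coverage. A sketch (coverage_count is a helper name of my own choosing):

    # How many distinct tokens cover a given fraction of all token occurrences?
    coverage_count <- function(d, coverage) {
      freq <- sort(colSums(d), decreasing = TRUE)
      cum  <- cumsum(freq) / sum(freq)
      if (coverage >= 1) length(freq) else which(cum >= coverage)[1]
    }

    sapply(list(unigram = dfm1, bigram = dfm2, trigram = dfm3),
           function(d) c(`100%` = coverage_count(d, 1.0),
                         `90%`  = coverage_count(d, 0.9),
                         `50%`  = coverage_count(d, 0.5)))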

3.3 Word/Word-Pair Counts Analysis Without Stop Words

Sometimes we do not want to predict English stop words, so for comparison we show the same counts and tables here with stop words removed.
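
Dropping stop words only adds one step to the pipeline from section 3.1: removing quanteda's built-in English stop word list, ideally before stemming so the stop words still match their surface forms. A sketch, reusing corp and profanity from the earlier code:

    toks_nostop <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
    toks_nostop <- tokens_tolower(toks_nostop)
    toks_nostop <- tokens_remove(toks_nostop, pattern = c(stopwords("en"), profanity))
    toks_nostop <- tokens_wordstem(toks_nostop, language = "english")

    dfm1_nostop <- dfm(toks_nostop)
    topfeatures(dfm1_nostop, 10)   # top 10 unigrams without stop words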

unigram  will    just    said    one     like    can     time    get     new     good
count    3390    3270    3214    3106    2806    2604    2319    2314    2092    1907

bigram   of_the  in_the  on_the  to_the  for_the  to_be  at_the  and_the  in_a  it_was
count    4610    4533    2244    2234    2189     1884   1544    1388     1258  1145

trigram  one_of_the  a_lot_of  thank_for_the  i_want_to  to_be_a  go_to_be  look_forward_to  be_abl_to  out_of_the  the_end_of
count    382         331       248            233        205      203       185              173        159         155

The following table shows the word/word-pair counts for the different coverage rates, without stop words.

Table 2: Token Counts for Different Probability Coverage (Without Stop Words)
          100% Covered   90% Covered   50% Covered
unigram          60091         16123          1038
bigram          451380        344657         27528
trigram         889120        782397        355505

4. Summary

So far we have obtained the word/word-pair counts, and we have seen that only a very small part of the frequency-sorted dictionary is needed to cover the bulk of the word instances in the language.

The next step is to build a basic predictive model for user input. We will build the model from the 90%-coverage data to save a lot of space, and we will use a basic N-gram algorithm (https://en.wikipedia.org/wiki/N-gram) for the model.
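
As a rough preview of that model rather than its final implementation, a next-word lookup over the trigram counts from section 3 could look like the hypothetical helper below:

    # Hypothetical sketch: return the most frequent trigram continuation of the
    # last two (lower-cased, stemmed) words the user typed.
    predict_next <- function(w1, w2, trigram_freq) {
      prefix  <- paste(w1, w2, sep = "_")
      matches <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, "_"))]
      if (length(matches) == 0) return(NA_character_)
      sub(paste0("^", prefix, "_"), "", names(matches)[which.max(matches)])
    }

    trigram_freq <- colSums(dfm3)              # dfm3 from section 3.1
    predict_next("one", "of", trigram_freq)    # should return "the"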