This is an exploratory analysis of the HC Corpora (www.corpora.heliohost.org), downloaded from the link provided by Coursera.
The purpose of this analysis is to identify the key properties of the data so that we can build a model to predict potential user input. Here the key properties are the word and word-pair frequencies in the corpus. Knowing these frequencies, we can predict what a user will type next from the neighbouring words.
The HC Corpora has already been downloaded into a directory named “final”; its structure is shown below:
```
../final
├── de_DE
├── en_US
├── fi_FI
└── ru_RU
```
As you can see, the corpus contains text files in multiple languages. To keep things simple, we will restrict the analysis to the English corpus for now.
The first step is to get the line count, word count and file size of each English corpus file (a sketch of this computation follows the table):
| File | Lines | Words | Bytes |
|---|---|---|---|
| en_US.twitter.txt | 2360148 | 30374206 | 167105338 |
| en_US.blogs.txt | 899288 | 37334690 | 210160014 |
| en_US.news.txt | 1010242 | 34372720 | 205811889 |
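The counts above can be reproduced in base R. This is a minimal sketch, not the code used for the report: the file paths are assumptions, and the whitespace-based word count is only an approximation (the original figures may come from a tool such as `wc`).

```r
# Count lines, words (whitespace-separated tokens) and bytes for each file.
# Paths are assumed; adjust them to your local copy of the corpus.
files <- file.path("final", "en_US",
                   c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"))

summarise_file <- function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  data.frame(File  = basename(f),
             Lines = length(lines),
             Words = sum(lengths(strsplit(lines, "\\s+"))),
             Bytes = file.size(f))
}

do.call(rbind, lapply(files, summarise_file))
```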
As you can see in section 2.2, the corpus files are very large, containing millions of lines. To understand the patterns we do not need the full files; we can draw a random sample to get a much smaller subset. The assumption is that the key properties of the subset represent those of the original data. We will use the word and word-pair frequencies as these key properties, and they can indeed be computed on the small subset to approximate the full corpus.
We will save the subset to a file “subdata.txt” in the same directory as the original data. We randomly choose about 1% of the lines of the original blogs, news and twitter files and combine them into one subdata.txt file. A sketch of the sampling step follows.
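A minimal sketch of the 1% sampling, assuming a line-by-line coin flip with `rbinom`; the seed and the exact output path are assumptions.

```r
# Keep roughly 1% of the lines of each English file and combine the
# samples into one file, subdata.txt.
files <- file.path("final", "en_US",
                   c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"))

set.seed(1234)        # assumed seed, purely for reproducibility
sample_rate <- 0.01

sampled <- unlist(lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), size = 1, prob = sample_rate) == 1]
}))

writeLines(sampled, file.path("final", "en_US", "subdata.txt"))
```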
| File | Lines | Words | Bytes |
|---|---|---|---|
| en_US.twitter.txt | 2360148 | 30374206 | 167105338 |
| en_US.blogs.txt | 899288 | 37334690 | 210160014 |
| en_US.news.txt | 1010242 | 34372720 | 205811889 |
| subdata.txt | 43800 | 1088933 | 6181730 |
Now that we have the small file subdata.txt, we will do the exploratory analysis on it. We will use the R package quanteda for this task; see the attachment for the full code.
The procedure is:

1. Read subdata.txt and build a quanteda corpus.
2. Tokenize the text, removing punctuation and numbers, lower-casing and stemming the words.
3. Build unigram, bigram and trigram tokens and their document-feature matrices.
4. Count the tokens and sort them by frequency.

A sketch of this procedure is shown below.
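This is a minimal sketch of the procedure with quanteda; the exact cleaning options are assumptions, and the stemming step is inferred from the stemmed forms (e.g. `be_abl_to`) in the tables below.

```r
library(quanteda)

# Read the sample and build a corpus.
text <- readLines(file.path("final", "en_US", "subdata.txt"),
                  encoding = "UTF-8", skipNul = TRUE)
toks <- tokens(corpus(text), remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_wordstem(tokens_tolower(toks))

# Unigram, bigram and trigram document-feature matrices.
uni_dfm <- dfm(toks)
bi_dfm  <- dfm(tokens_ngrams(toks, n = 2))
tri_dfm <- dfm(tokens_ngrams(toks, n = 3))

topfeatures(uni_dfm, 10)   # ten most frequent words
topfeatures(bi_dfm,  10)   # ten most frequent word pairs
topfeatures(tri_dfm, 10)   # ten most frequent word triples
```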
The following tables show the top 10 counts for unigrams, bigrams and trigrams. They tell us which words and word pairs are the most frequent, and these will be used to build the prediction model.
| | the | to | and | a | of | in | i | for | is | that |
|---|---|---|---|---|---|---|---|---|---|---|
| unigram | 51464 | 29744 | 26030 | 25443 | 21659 | 17604 | 17523 | 11827 | 11316 | 11192 |

| | of_the | in_the | on_the | to_the | for_the | to_be | at_the | and_the | in_a | it_was |
|---|---|---|---|---|---|---|---|---|---|---|
| bigram | 4610 | 4533 | 2244 | 2234 | 2189 | 1884 | 1544 | 1388 | 1258 | 1145 |

| | one_of_the | a_lot_of | thank_for_the | i_want_to | to_be_a | go_to_be | look_forward_to | be_abl_to | out_of_the | the_end_of |
|---|---|---|---|---|---|---|---|---|---|---|
| trigram | 382 | 331 | 248 | 233 | 205 | 203 | 185 | 173 | 159 | 155 |
As the tables above show, some words are far more frequent than others. That means we can exclude low-frequency words from the word list to save space while still covering most of the probability mass. The following table shows how many distinct tokens are needed to cover 100%, 90% and 50% of all token instances (a sketch of the coverage computation follows the table).
| | 100% Covered | 90% Covered | 50% Covered |
|---|---|---|---|
| unigram | 60263 | 7791 | 140 |
| bigram | 451380 | 344657 | 27528 |
| trigram | 889120 | 782397 | 355505 |
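A minimal sketch of the coverage computation, assuming the dfm objects from the previous sketch: sort the features by frequency and count how many are needed to reach a given share of all token instances.

```r
# Number of features (words / n-grams) needed to cover a given fraction
# of all token instances.
coverage_count <- function(d, coverage) {
  freq <- sort(colSums(d), decreasing = TRUE)   # frequency-sorted dictionary
  cum  <- cumsum(freq) / sum(freq)              # cumulative probability mass
  sum(cum < coverage) + 1                       # features needed to reach it
}

sapply(list(unigram = uni_dfm, bigram = bi_dfm, trigram = tri_dfm),
       function(d) c("100%" = nfeat(d),
                     "90%"  = coverage_count(d, 0.9),
                     "50%"  = coverage_count(d, 0.5)))
```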
Sometimes we do not want to predict English stop words, so we show similar tables here without stop words, purely for comparison. Removing the stop words only needs one extra tokenization step, sketched below.
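A minimal sketch of the stop-word variant, reusing the tokens object from the earlier sketch; whether stop words are removed before or after stemming is an assumption.

```r
# Drop English stop words before building the n-grams.
toks_nostop <- tokens_remove(toks, pattern = stopwords("en"))
topfeatures(dfm(toks_nostop), 10)   # ten most frequent non-stop words
```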
| | will | just | said | one | like | can | time | get | new | good |
|---|---|---|---|---|---|---|---|---|---|---|
| unigram | 3390 | 3270 | 3214 | 3106 | 2806 | 2604 | 2319 | 2314 | 2092 | 1907 |

| | of_the | in_the | on_the | to_the | for_the | to_be | at_the | and_the | in_a | it_was |
|---|---|---|---|---|---|---|---|---|---|---|
| bigram | 4610 | 4533 | 2244 | 2234 | 2189 | 1884 | 1544 | 1388 | 1258 | 1145 |

| | one_of_the | a_lot_of | thank_for_the | i_want_to | to_be_a | go_to_be | look_forward_to | be_abl_to | out_of_the | the_end_of |
|---|---|---|---|---|---|---|---|---|---|---|
| trigram | 382 | 331 | 248 | 233 | 205 | 203 | 185 | 173 | 159 | 155 |
The following table shows the word/word-pair counts for the different coverage rates, without stop words.
| | 100% Covered | 90% Covered | 50% Covered |
|---|---|---|---|
| unigram | 60091 | 16123 | 1038 |
| bigram | 451380 | 344657 | 27528 |
| trigram | 889120 | 782397 | 355505 |
So far we have obtained the word and word-pair counts, and we have seen that a very small part of the frequency-sorted dictionary is enough to represent most of the word instances in the language.
The next step is to build a basic predictive model for user input. We will build the model from the 90%-coverage data, which saves a lot of space, and we will use a basic N-gram algorithm (https://en.wikipedia.org/wiki/N-gram).
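As an illustration of the planned model (not the final implementation), here is a toy backoff-style lookup over the bigram and trigram counts built earlier: given the last one or two words, return the most frequent continuation.

```r
# Toy next-word prediction from the n-gram counts: look up the most frequent
# trigram starting with the last two (stemmed) words, else back off to bigrams.
bi_freq  <- sort(colSums(bi_dfm),  decreasing = TRUE)
tri_freq <- sort(colSums(tri_dfm), decreasing = TRUE)

predict_next <- function(w1, w2) {
  hits <- grep(paste0("^", w1, "_", w2, "_"), names(tri_freq), value = TRUE)
  if (length(hits) == 0)                        # back off to the bigram table
    hits <- grep(paste0("^", w2, "_"), names(bi_freq), value = TRUE)
  if (length(hits) == 0) return(NA_character_)
  sub(".*_", "", hits[1])                       # last word of the best match
}

predict_next("one", "of")   # expected to return "the" on this sample
```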