The goal at this stage is to understand the data, an essential step toward building a good prediction model. This report is an exploratory analysis of the HC Corpora databases, which will be used to build the Capstone text prediction model.
The HC Corpora databases are separated by language (German, English, Finnish, and Russian) and classified by source: Blogs, News, and Twitter. This analysis works only with the English sources.
The three English databases are explored to determine their size in MB, the number of lines they contain, the number of words, the number of characters, and the length (in characters) of the longest line; from these we also derive the average number of words per line. The results are shown in the table below, followed by a sketch of how such statistics can be computed.
| Source | Size (MB) | Lines | Words | Characters | Longest line (chars) | Avg. words/line |
|---|---|---|---|---|---|---|
| Blogs | 200.4 | 899,288 | 79,779,789 | 206,824,505 | 40,833 | 88.71 |
| News | 196.3 | 1,010,242 | 74,316,341 | 203,223,159 | 11,384 | 73.56 |
| Twitter | 159.4 | 2,360,148 | 65,264,908 | 162,096,241 | 140 | 27.65 |
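The statistics above can be reproduced with base R alone. The sketch below is illustrative, not the report's actual code; the file path is an assumption based on the standard Capstone dataset layout.

```r
# A base-R sketch for the summary statistics in the table above.
# The file path is illustrative.
summarise_file <- function(path) {
  lines  <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  nchars <- nchar(lines)
  nwords <- lengths(strsplit(lines, "\\s+"))
  data.frame(
    size_mb        = round(file.size(path) / 1024^2, 1),  # size on disk in MB
    n_lines        = length(lines),
    n_words        = sum(nwords),
    n_chars        = sum(nchars),
    longest_line   = max(nchars),                         # longest line, in characters
    words_per_line = round(mean(nwords), 2)
  )
}

summarise_file("en_US/en_US.twitter.txt")
```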
A little preprocessing was applied to the databases: removing unwanted characters and repeated spaces, removing common English stop words, and stemming derived words to their base forms. Diagrams of the most frequent words in each of the three data sources are then presented.
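The report does not show its cleaning code, but the steps it describes map directly onto the `tm` package. The following is a minimal sketch under that assumption; the input file and sample size are illustrative.

```r
# A minimal preprocessing sketch with the 'tm' package (an assumption;
# the report does not name its tools). File path and sample size are
# illustrative.
library(tm)

lines  <- readLines("en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
corpus <- VCorpus(VectorSource(sample(lines, 1000)))     # small sample for speed

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)              # unwanted characters
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("en"))   # English stop words
corpus <- tm_map(corpus, stemDocument)                   # derived words to base forms
corpus <- tm_map(corpus, stripWhitespace)                # repeated spaces
```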
*(Figures: most frequent words for Blogs, News, and Twitter.)*
The customary next step is an N-gram analysis, for N = 1, 2, 3, and 4. A bigram analysis, for example, is a study of the frequencies of all pairs of words found in the corpus. With these N-gram frequency tables, the prediction algorithm can be trained in the next stage of the project; a minimal counting sketch follows.
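To make the idea concrete, here is a small base-R sketch that counts N-grams in a vector of cleaned lines. Function and variable names are illustrative; this is not the report's implementation.

```r
# A minimal N-gram counting sketch in base R. Assumes 'lines' is a
# character vector of cleaned text; names are illustrative.
count_ngrams <- function(lines, n = 2) {
  words  <- strsplit(tolower(lines), "\\s+")
  ngrams <- unlist(lapply(words, function(w) {
    if (length(w) < n) return(character(0))
    # paste each run of n consecutive words into one N-gram
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE)   # frequency table, most common first
}

head(count_ngrams(c("the quick brown fox", "the quick dog"), n = 2))
```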