Jose Antonio Garcia Ramirez
December 23, 2017
We have three large text files (collections of words): blogs, news, and twitter. After unzipping them, we obtain the following summary:
| File | Lines | Maximum line length (characters) |
|---|---|---|
| blogs | 899,288 | 40,833 |
| news | 77,259 | 5,760 |
| twitter | 2,360,148 | 140 |
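These figures can be reproduced with a short script. A minimal sketch follows, assuming the unzipped files are named en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt (the file names are an assumption, not taken from the text above):

```python
# Sketch: count the lines and the maximum line length of each corpus file.
# The file names are assumed; adjust them to match the unzipped archive.
files = ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]

for path in files:
    n_lines = 0
    max_len = 0
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            n_lines += 1
            max_len = max(max_len, len(line.rstrip("\n")))
    print(f"{path}: {n_lines:,} lines, longest line {max_len:,} characters")
```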
Following the work of [1], we apply a series of transformations to clean the data.
[1]: Pengda Qin, Weiran Xu, and Jun Guo, 'A Targeted Retraining Scheme of Unsupervised Word Embeddings for Specific Supervised Tasks,' in Advances in Knowledge Discovery and Data Mining, Springer, 2017.
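The exact transformations are not reproduced here. Purely as an illustration, a typical cleaning step for this kind of corpus (lowercasing, removing digits and punctuation, collapsing whitespace) might look like the sketch below; these steps are hypothetical and not necessarily the ones used in [1] or in this analysis.

```python
import re

def clean_line(line: str) -> str:
    """Illustrative cleaning only: lowercase, drop digits and punctuation,
    collapse whitespace. Hypothetical steps, not necessarily those of [1]."""
    line = line.lower()
    line = re.sub(r"[0-9]+", " ", line)       # remove numbers
    line = re.sub(r"[^a-z'\s]", " ", line)    # keep letters, apostrophes, spaces
    return re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace

print(clean_line("My 2 cats -- they're GREAT!"))  # -> my cats they're great
```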
Due to the limitations of our computer equipment, we split the blogs and news files into 8 parts each, processed each part separately (extracting the n-grams), and finally joined the results from all parts.
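A minimal sketch of this split/process/join step is shown below, assuming bigram extraction with Python's collections.Counter over already cleaned lines; the chunk size of 120,000 lines and the file name are illustrative assumptions, not values taken from the text above.

```python
from collections import Counter
from itertools import islice

def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples, e.g. bigrams for n=2."""
    return zip(*(tokens[i:] for i in range(n)))

def count_ngrams(lines, n):
    """Count n-grams over an iterable of (already cleaned) lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts

def process_in_parts(path, n=2, lines_per_part=120_000):
    """Read the file in pieces of lines_per_part lines, extract n-grams
    from each piece, and merge the partial counts into one Counter."""
    total = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        while True:
            chunk = list(islice(f, lines_per_part))
            if not chunk:
                break
            total.update(count_ngrams(chunk, n))
    return total

# Example usage (the file name is an assumption):
# bigrams = process_in_parts("en_US.blogs.txt", n=2)
# print(bigrams.most_common(10))
```

Processing the corpus one piece at a time keeps only a single chunk of lines in memory at once, which is the point of splitting the large files in the first place.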