0. Executive Summary

A first exploratory analysis is performed on the target (sampled) documents. After cleaning, the documents are tokenized (split into terms) and n-grams are identified.

Visualization follows, via bar charts of the most frequent n-grams and word clouds, to highlight the top-ranked uni-, bi-, and tri-grams.

As n-grams look like a solid base for a potential predictive algorithm, the next actions will be driven by this tentative strategy.

1. Document sampling, loading, corpus generation and check

Documents are sampled by lines (10^5 lines per file in this case) to allow reasonable performance without losing much information, and loaded directly into a tm VCorpus. The rest of the actions are performed on this structure.

##                   Length Class             Mode
## en_US.blogs.txt   2      PlainTextDocument list
## en_US.news.txt    2      PlainTextDocument list
## en_US.twitter.txt 2      PlainTextDocument list
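
As a reference, a minimal sketch of the sampling and loading step is shown below; the seed, the intermediate "sample" directory, and the use of DirSource are assumptions, not necessarily the exact calls used for this report.

library(tm)
set.seed(1234)                                    # assumed seed, for reproducibility

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
dir.create("sample", showWarnings = FALSE)

# Sample 10^5 lines from each file and write the sample to disk
for (f in files) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  writeLines(sample(lines, min(1e5, length(lines))), file.path("sample", f))
}

# Load the sampled files directly into a tm VCorpus and check it
corpus <- VCorpus(DirSource("sample", encoding = "UTF-8"),
                  readerControl = list(language = "en"))
summary(corpus)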

2. Exploratory, Transformation & Cleaning actions

Text is transformed to lower case; punctuation is removed, as are English stopwords, numbers, and unnecessary whitespace. Stemming is not applied, as it is not considered a good technique for this application.
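
A sketch of these transformations with the standard tm mappings, applied to the corpus built above:

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# No stemming: whole word forms are kept for this application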

The term-document matrix is then compacted by removing terms that do not appear in at least 2 of the 3 documents (objective: maximum 33% sparsity).
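
A minimal sketch of this step; the exact sparsity argument (0.34, i.e. terms must appear in at least 2 of the 3 documents) is an assumption consistent with the 33% objective.

tdm <- TermDocumentMatrix(corpus)
tdm_compact <- removeSparseTerms(tdm, sparse = 0.34)   # keep terms present in >= 2 of 3 documents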

                    Raw sampled documents (a)                  After cleaning          After removing sparse terms
                    Lines     Words       Chars        Max Line   Words       Distinct   Words       Distinct
en_US.blogs.txt     100,000   4,145,547   23,136,988   19,799     2,129,218   125,226    2,029,547   54,953
en_US.news.txt      100,000   3,393,823   20,123,815   2,377      1,918,991   105,182    1,835,630   53,513
en_US.twitter.txt   100,000   1,286,014   6,881,642    155        696,709     63,756     661,233     35,541
Corpus              300,000   8,825,384   50,142,445   19,799     4,744,918   208,615    4,526,410   58,458

(a) Raw document counts include terms of length 1 and 2.

3. N-gram tokenization and plotting

Documents are tokenized into uni-, bi-, and tri-grams. A term matrix is generated for each of the three cases.
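
A sketch of this tokenization, assuming the RWeka n-gram tokenizer is used as the tokenize control of each term-document matrix (the actual tokenizer may differ):

library(RWeka)

# One tokenizer per n-gram order
unigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# One term-document matrix per n-gram order
tdm_uni <- TermDocumentMatrix(corpus, control = list(tokenize = unigram_tok))
tdm_bi  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))
tdm_tri <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tok))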

Two-column matrices are generated to build lists ordered by the number of n-gram occurrences; these are converted to data frames and finally plotted with ggplot2.
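
For illustration, a minimal sketch of this ranking and plotting step for bigrams (the column names and the top-20 cut-off are assumptions):

library(ggplot2)

# Rank bigrams by total occurrences across the corpus
freq <- sort(rowSums(as.matrix(tdm_bi)), decreasing = TRUE)
freq_df <- data.frame(ngram = names(freq), count = freq)[1:20, ]

# Horizontal bar chart of the top 20 bigrams
ggplot(freq_df, aes(x = reorder(ngram, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Bigram", y = "Occurrences", title = "Top 20 bigrams")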

In addition, word clouds are generated to give an immediate visual impression of the most frequent n-grams.
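
A sketch of the word cloud for unigrams, using the wordcloud package (the word limit and colour palette are assumptions):

library(wordcloud)
library(RColorBrewer)

# Most frequent unigrams, plotted with size proportional to frequency
uni_freq <- sort(rowSums(as.matrix(tdm_uni)), decreasing = TRUE)
wordcloud(names(uni_freq), uni_freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))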

5. Tentative strategy principles

Given clear performance limitations, the data will need to be sampled in a way that minimizes the loss of information. The optimal number of lines to sample still needs to be analyzed.

The most frequent groupings of words (n-grams) look like a promising basis for the prediction algorithm. N-gram rankings do not vary much with sample size, as long as the samples are large enough (>1,500 lines sampled per file).

[END OF REPORT]