0. Executive Summary

A first exploratory analysis is performed on the target (sampled) documents. After cleaning, the documents are tokenized (split into terms) and n-grams are identified.

Visualization follows, via bar charts of the most frequent n-grams and word clouds, to highlight the top-ranked uni-, bi-, and tri-grams.

As n-grams look like a solid base for a potential predictive algorithm, the next actions will be driven by this tentative strategy.

1. Document sampling, loading, corpus generation and check

Documents are sampled by lines (10^5 lines per file in this case) to allow reasonable performance without losing much information, and loaded directly into a tm VCorpus. The rest of the actions are performed on this structure.

##                   Length Class             Mode
## en_US.blogs.txt   2      PlainTextDocument list
## en_US.news.txt    2      PlainTextDocument list
## en_US.twitter.txt 2      PlainTextDocument list
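
As a reference, a minimal sketch of the sampling and loading step is shown below; the seed, the intermediate "sample" directory, and the use of DirSource are assumptions, not necessarily the exact calls used for this report.

library(tm)
set.seed(1234)                                    # assumed seed, for reproducibility

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
dir.create("sample", showWarnings = FALSE)

# Sample 10^5 lines from each file and write the sample to disk
for (f in files) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  writeLines(sample(lines, min(1e5, length(lines))), file.path("sample", f))
}

# Load the sampled files directly into a tm VCorpus and check it
corpus <- VCorpus(DirSource("sample", encoding = "UTF-8"),
                  readerControl = list(language = "en"))
summary(corpus)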

2. Exploratory, Transformation & Cleaning actions

Text is transformed to lower case; punctuation is removed, as are English stopwords, numbers, and unnecessary whitespace. Stemming is not applied, as it is not considered a good technique for this application.
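
A sketch of these transformations with the standard tm mappings, applied to the corpus built above:

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# No stemming: whole word forms are kept for this application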

The term-document matrix is then compacted by removing terms that do not appear in at least 2 of the 3 documents (objective: maximum 33% sparsity).
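
A minimal sketch of this step; the exact sparsity argument (0.34, i.e. terms must appear in at least 2 of the 3 documents) is an assumption consistent with the 33% objective.

tdm <- TermDocumentMatrix(corpus)
tdm_compact <- removeSparseTerms(tdm, sparse = 0.34)   # keep terms present in >= 2 of 3 documents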

                    Raw sampled documents (a)                  After cleaning          After removing sparse terms
                    Lines     Words       Chars        Max Line   Words       Distinct   Words       Distinct
en_US.blogs.txt     100,000   4,145,547   23,136,988   19,799     2,129,218   125,226    2,029,547   54,953
en_US.news.txt      100,000   3,393,823   20,123,815   2,377      1,918,991   105,182    1,835,630   53,513
en_US.twitter.txt   100,000   1,286,014   6,881,642    155        696,709     63,756     661,233     35,541
Corpus              300,000   8,825,384   50,142,445   19,799     4,744,918   208,615    4,526,410   58,458

(a) Raw document counts include terms of length 1 and 2.

3. N-gram tokenization and plotting

Documents are tokenized into uni-, bi-, and tri-grams. A term matrix is generated for each of the three cases.
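
A sketch of this tokenization, assuming the RWeka n-gram tokenizer is used as the tokenize control of each term-document matrix (the actual tokenizer may differ):

library(RWeka)

# One tokenizer per n-gram order
unigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# One term-document matrix per n-gram order
tdm_uni <- TermDocumentMatrix(corpus, control = list(tokenize = unigram_tok))
tdm_bi  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))
tdm_tri <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tok))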

Two-column matrices are generated to build lists ordered by the number of n-gram occurrences; these are converted to data frames and finally plotted with ggplot2.
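
For illustration, a minimal sketch of this ranking and plotting step for bigrams (the column names and the top-20 cut-off are assumptions):

library(ggplot2)

# Rank bigrams by total occurrences across the corpus
freq <- sort(rowSums(as.matrix(tdm_bi)), decreasing = TRUE)
freq_df <- data.frame(ngram = names(freq), count = freq)[1:20, ]

# Horizontal bar chart of the top 20 bigrams
ggplot(freq_df, aes(x = reorder(ngram, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Bigram", y = "Occurrences", title = "Top 20 bigrams")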

In addition, word clouds are generated to give an immediate visual impression of the most frequent n-grams.
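
A sketch of the word cloud for unigrams, using the wordcloud package (the word limit and colour palette are assumptions):

library(wordcloud)
library(RColorBrewer)

# Most frequent unigrams, plotted with size proportional to frequency
uni_freq <- sort(rowSums(as.matrix(tdm_uni)), decreasing = TRUE)
wordcloud(names(uni_freq), uni_freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))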

5. Tentative strategy principles

Given clear performance limitations, the data will need to be sampled in a way that minimizes the loss of information. The optimal number of lines to sample still needs to be analyzed.

The most frequent groupings of words (n-grams) look like a promising basis for the prediction algorithm. N-gram rankings do not vary much with sample size, as long as the samples are large enough (>1,500 lines sampled per file).

[END OF REPORT]