Pier Luigi Olearo
20 August 2015
Predictive algorithm for texts
The data used for the analysis were extracted from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
In particular, only the data contained in the en_US directory have been used.
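As a minimal loading sketch (in Python; the specific file names are an assumption, since the report does not list the contents of the en_US directory, which in the SwiftKey dataset normally holds blogs, news and Twitter text files):

    import os

    # Assumed directory layout and file names (not stated in the report above).
    DATA_DIR = "final/en_US"
    FILES = ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]

    def load_lines(path):
        """Read one corpus file line by line, skipping undecodable bytes."""
        with open(path, encoding="utf-8", errors="ignore") as f:
            return f.read().splitlines()

    corpus = {name: load_lines(os.path.join(DATA_DIR, name)) for name in FILES}
    for name, lines in corpus.items():
        print(name, len(lines), "lines")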
The size of the data does not allow efficient tokenization of the full corpus.
The analysis is therefore limited to 60% of the total data.
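A hedged sketch of how such a 60% sample could be drawn (line-level random sampling is an assumption; the report does not specify the sampling scheme):

    import random

    def sample_lines(lines, fraction=0.6, seed=42):
        """Keep roughly `fraction` of the lines, chosen independently at random."""
        rng = random.Random(seed)
        return [line for line in lines if rng.random() < fraction]

    # Example: sampled = {name: sample_lines(lines) for name, lines in corpus.items()}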
During the analysis each file was divided into ten smaller subsets, and the search for the most significant words (n-grams) was carried out on each subset.
The final frequency of each n-gram was then computed by summing the frequencies from the individual subsets, weighted by subset size, according to the law of total probability.
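The chunk-and-combine step could look like the following sketch: each file is split into ten subsets, n-gram counts are accumulated per subset, and the per-subset relative frequencies are combined as a weighted average with weights proportional to subset size, i.e. P(g) = sum_k P(g | subset_k) * P(subset_k). Function and variable names are illustrative, not taken from the original analysis.

    from collections import Counter

    def ngrams(tokens, n):
        """Yield the n-grams of a token list as tuples."""
        return zip(*(tokens[i:] for i in range(n)))

    def chunk_frequencies(lines, n=2, chunks=10):
        """Split the lines into `chunks` subsets and return, for each subset,
        its n-gram counts and its total number of n-grams."""
        size = max(1, len(lines) // chunks)
        results = []
        for start in range(0, len(lines), size):
            counts, total = Counter(), 0
            for line in lines[start:start + size]:
                tokens = line.lower().split()
                for g in ngrams(tokens, n):
                    counts[g] += 1
                    total += 1
            results.append((counts, total))
        return results

    def combine(results):
        """Weighted average of per-subset relative frequencies:
        P(g) = sum_k P(g | subset_k) * P(subset_k), with P(subset_k)
        proportional to subset size (law of total probability)."""
        grand_total = sum(total for _, total in results) or 1
        combined = Counter()
        for counts, total in results:
            weight = total / grand_total
            for g, c in counts.items():
                combined[g] += (c / total) * weight
        return combined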
N-gram frequencies are very low, generally under 2%.
From a statistical point of view, such low frequencies introduce high variability, which makes it impossible to obtain predictions with a sufficiently narrow confidence interval.
Prediction accuracy is further lowered by some highly common words (articles, adverbs, conjunctions) that do not form a meaningful statistical sample.
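One way to probe this effect (a possible mitigation, not something the analysis above actually applies) is to discard n-grams composed entirely of such stop words; a minimal sketch, with an illustrative stop-word list:

    # Illustrative, incomplete stop-word list; a real one would come from a
    # standard resource such as NLTK's English stop-word list.
    STOP_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in", "is", "that"}

    def drop_stopword_ngrams(freqs):
        """Discard n-grams whose tokens are all stop words."""
        return {g: f for g, f in freqs.items()
                if not all(tok in STOP_WORDS for tok in g)}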
This implementation shows that, in theory, good prediction accuracy can be reached. Possible practical optimizations could be: