Data Science Capstone

Pier Luigi Olearo
20 August 2015

Predictive algorithm for texts

Database

The data used for the analysis were extracted from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Only the files contained in the en_US directory were used:

  • en_US.blogs.txt
  • en_US.news.txt
  • en_US.twitter.txt

Analysis

The data set is too large to be tokenized efficiently in a single pass. The analysis is therefore limited to 60% of the total data.
During the analysis, each file was divided into ten smaller files, and the most significant words were searched for in each of them.
The final frequency of each n-gram was computed by summing its frequencies over the individual subsets, in accordance with the law of total probability.
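
The procedure described above can be sketched as follows. This is only a minimal illustration in Python, not the code used for the analysis: the whitespace tokenizer, the random line sampling, and the function names (ngrams, count_in_chunks, relative_frequencies) are all assumptions made for the example. It shows that summing the per-chunk counts and dividing by the overall total gives the same result as weighting each chunk's relative frequency by the chunk's share of the data.

```python
from collections import Counter
import random


def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_in_chunks(lines, n=2, n_chunks=10, sample_fraction=0.6, seed=0):
    """Sample a fraction of the lines, split the sample into chunks,
    count n-grams chunk by chunk, and sum the per-chunk counts."""
    rng = random.Random(seed)
    # Assumption: the 60% subset is drawn by keeping each line at random.
    sampled = [line for line in lines if rng.random() < sample_fraction]
    # Ceiling division, so that at most n_chunks chunks are produced.
    chunk_size = max(1, -(-len(sampled) // n_chunks))
    total = Counter()
    for start in range(0, len(sampled), chunk_size):
        chunk_counts = Counter()
        for line in sampled[start:start + chunk_size]:
            # Assumption: a naive tokenizer (lowercase, split on whitespace).
            chunk_counts.update(ngrams(line.lower().split(), n))
        # Counter.update adds counts, so this sums over the chunks.
        total.update(chunk_counts)
    return total


def relative_frequencies(counts):
    """Turn summed counts into relative frequencies."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}
```

A possible call, with a toy input in place of the real files, would be relative_frequencies(count_in_chunks(["a b c", "a b d"], n=2, sample_fraction=1.0)).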

Results

N-gram frequencies are very low, generally under 2%. From a statistical point of view, such low frequencies introduce high variability, which prevents obtaining a prediction with a useful confidence interval.
Prediction accuracy is further lowered by very common words such as articles, adverbs and conjunctions, which do not represent a statistical sample.
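
The variability argument above can be made concrete with a small sketch. It assumes that an n-gram frequency is treated as a binomial proportion estimated from n observations and uses the normal-approximation (Wald) interval; this is one possible reading of the claim, not the computation performed in the analysis. The function names and the numbers in the usage note are purely illustrative.

```python
import math


def proportion_interval(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    p_hat: estimated relative frequency, n: number of observations,
    z: normal quantile (1.96 corresponds to roughly 95% confidence).
    """
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


def relative_error(p_hat, n):
    """Standard error divided by the estimate itself.

    For a fixed n this ratio grows as p_hat gets smaller, which is why
    rare n-grams are estimated with high relative uncertainty.
    """
    return math.sqrt((1 - p_hat) / (p_hat * n))
```

With hypothetical values, relative_error(0.02, 1000) is about 0.22, while relative_error(0.5, 1000) is about 0.03: for the same number of observations, the estimate of a 2% frequency is relatively much less precise than that of a 50% frequency.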

Conclusion

This implementation shows that, in theory, a good prediction can be reached. Possible practical optimizations are:

  • improve tokenization routine
  • create a cloud database in order to analyse frequent words