This is the capstone project for the Data Science Specialization offered by Johns Hopkins University.
The task is to build a predictive text application that suggests the next word a user is likely to type, using a model trained on a data set drawn from blogs, Twitter and news.
The source data set contains text files in four languages, drawn from Twitter, blogs and news. For this capstone we use only the English version (under the ‘/en_US/’ folder).
A first pass over the source data gives the following statistics:
##        Source.files   Lines    Words Unique_words
## 1 en_US.twitter.txt 2360148 17111806       302505
## 2   en_US.blogs.txt  899288 19347162       252893
## 3    en_US.news.txt 1010242 19760894       212079
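As a rough illustration of how statistics like these could be computed, here is a minimal Python sketch. It is not the code used for the analysis: the function names `corpus_stats` and `file_stats` are hypothetical, and it assumes that a "word" is any whitespace-delimited token and that unique words are counted case-insensitively. The table above may have been produced with different tokenization rules, so the numbers need not match exactly.

```python
from pathlib import Path
from typing import Iterable


def corpus_stats(lines: Iterable[str]) -> dict:
    """Count lines, whitespace-separated words, and distinct lowercased words."""
    n_lines = 0
    n_words = 0
    vocabulary = set()
    for line in lines:
        n_lines += 1
        # Assumption: a "word" is any whitespace-delimited token.
        tokens = line.split()
        n_words += len(tokens)
        # Assumption: unique words are compared case-insensitively.
        vocabulary.update(token.lower() for token in tokens)
    return {"Lines": n_lines, "Words": n_words, "Unique_words": len(vocabulary)}


def file_stats(path: str) -> dict:
    """Stream one corpus file line by line instead of loading it whole."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        stats = corpus_stats(handle)
    stats["Source.files"] = Path(path).name
    return stats


# Tiny in-memory example; the empty string stands for a blank line.
sample = ["The quick brown fox", "the lazy dog", ""]
print(corpus_stats(sample))
# {'Lines': 3, 'Words': 7, 'Unique_words': 6}
```

Calling `file_stats` on each of the three English files (for example a path such as `en_US/en_US.twitter.txt`, assuming that is where the data was unpacked) would yield one row per file in the same shape as the table.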
More details of the exploratory data analysis can be found on this page: https://rpubs.com/rmmoya/swiftkey_project_data_analysis