Marco Letico
27 February 2018
The application was built for the final Capstone project of the Data Science Specialization offered by Johns Hopkins University through Coursera, in cooperation with SwiftKey. These slides pitch the application and walk through how to use it. You can access the app by clicking here.
The application was built from 3 raw text files collected from Twitter, blogs, and news websites.
First, the data was thoroughly cleaned and split into sub-sentences, using punctuation as the delimiter. A corpus was then created and processed to extract the document-term matrix for the n-grams, from which we calculated the n-gram frequencies. With the frequencies in hand, we used the Markov chain model below to predict the next word:
\[ P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1} \ldots w_{i-n}) \]
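To make the approximation concrete, here is a minimal Python sketch of how n-gram frequencies can be turned into conditional probabilities. It is an illustration only, not the code used in the application: the function names (`sub_sentences`, `count_ngrams`, `probability`, `predict`), the simple regex-based cleaning, and the plain relative-frequency estimate are all assumptions made for this example.

```python
import re
from collections import Counter, defaultdict


def sub_sentences(text):
    """Lowercase the text and split it into word lists at punctuation marks.

    Assumption: a very simple cleaning step that keeps only letters and apostrophes.
    """
    parts = re.split(r"[.,;:!?]+", text.lower())
    return [re.findall(r"[a-z']+", p) for p in parts if p.strip()]


def count_ngrams(text, max_n=5):
    """Map each context (tuple of preceding words) to a Counter of next words.

    n-grams never cross a sub-sentence boundary; max_n is the longest n-gram counted.
    """
    counts = defaultdict(Counter)
    for words in sub_sentences(text):
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                *context, target = words[i:i + n]
                counts[tuple(context)][target] += 1
    return counts


def probability(counts, context, word):
    """Relative-frequency estimate of P(word | context); 0.0 for an unseen context."""
    followers = counts.get(tuple(context))
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())


def predict(counts, context):
    """Most frequent word seen after the context, or None if the context is unseen."""
    followers = counts.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]
```

For example, after `counts = count_ngrams("I love data science. I love data, and I love R!")`, the call `predict(counts, ["i", "love"])` returns `"data"`, because "data" follows "i love" in two of the three occurrences of that context. This sketch returns nothing for an unseen context; a real predictor would need some strategy for that case.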
In this case we considered n-grams up to a maximum of 5-grams. The algorithm works in the following way:
Find the complete analysis here: https://rpubs.com/mletico/361214
You can get and deploy the application by downloading the following repo: https://github.com/MarcoLeti/DataScience-SwiftKeyCapstoneProject
What we did not cover: