Alejandro Balderas
18 March 2018
This is the final project for the Data Science Specialization offered on Coursera by Johns Hopkins University. In this project, a predictive text model is built using Natural Language Processing techniques such as n-grams and the Katz back-off model.
The input for this project was text data from Twitter, blogs and news feeds, provided by SwiftKey. An exploratory data analysis phase was completed and can be found under the following link:
The data set is converted into tokens using the quanteda package, and the tokens are then combined into n-grams of different lengths. The number of times each n-gram appears in the text is counted and stored with each feature. These frequencies give us an estimate of the probability that a certain word comes after a given set of words. Below you can see the top trigrams from the blog data set:
       feature  frequency
1   one_of_the       4859
2     a_lot_of       4095
3   as_well_as       2292
4  some_of_the       2283
5      to_be_a       2275
6     it_was_a       2273
With this kind of information we can assume that, most of the time, the most probable word after the text “the end of” will be “the”. In the same way, the table above shows that “one of” is most often followed by “the”.
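The counting step can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the quanteda code used in the project; the function names and the toy text are assumptions made for this example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens, joined with '_'."""
    grams = ("_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)

def next_word_probabilities(counts, prefix):
    """Relative frequency of each word that follows `prefix` in the counts."""
    start = prefix + "_"
    followers = {g[len(start):]: c for g, c in counts.items() if g.startswith(start)}
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}

# Toy text invented for this example (not the SwiftKey data).
tokens = "one of the best and one of the few but one of them".split()
trigrams = ngram_counts(tokens, 3)
probs = next_word_probabilities(trigrams, "one_of")
best = max(probs, key=probs.get)  # the most frequent follower of "one of"
```

Here the probability of a word is simply its share of all trigrams that start with the same two words, with no smoothing applied.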
The algorithm takes the last 4 words of the input, searches for them in the 5-gram data and returns the word that follows them as the prediction. If the algorithm does not find a match, it “backs off”: it takes the last 3 words and searches for them in the 4-gram data. This process continues with shorter and shorter n-grams until a match is found. If no match is found at all, a random sample from the 6 most common words in the data set is returned.
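The back-off search described above can be sketched roughly as follows. This is a simplified Python illustration, not the application's code: it picks the most frequent follower at each level and omits the probability discounting of a full Katz back-off model. The names `build_tables`, `predict` and `common_words`, and the toy text, are assumptions for this example.

```python
import random
from collections import Counter, defaultdict

def build_tables(tokens, max_n=5):
    """For each n in 2..max_n, map an (n-1)-word context to a Counter of next words."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict(words, tables, common_words, max_n=5, rng=random):
    """Try the longest context first, then back off to shorter ones."""
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough input words for this n-gram size
        context = tuple(words[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    # No n-gram matched: fall back to a random common word.
    return rng.choice(common_words)

# Toy text invented for this example (not the SwiftKey data).
tokens = "at the end of the day we got to the end of the road".split()
tables = build_tables(tokens)
common_words = [w for w, _ in Counter(tokens).most_common(6)]

prediction = predict("we reached the end of".split(), tables, common_words)
fallback = predict(["zebra"], tables, common_words)
```

In the first call the 4-word context is not in the 5-gram table, so the search backs off to the 4-gram table, where “the end of” is found. In the second call nothing matches and one of the most common words is returned at random.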
Try out the application and see for yourself whether it delivers the expected outcome.
As an add-on, I wrote code that generates a random sentence based on the words you have already written. Try it out in the extra tab of the app.
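One possible way to build such a sentence generator is to repeatedly sample the next word in proportion to how often it followed the most recent words. The Python sketch below is a guess at the general idea and does not reproduce the app's actual code; `build_followers`, `random_sentence` and the toy text are hypothetical.

```python
import random
from collections import Counter, defaultdict

def build_followers(tokens, context_size=2):
    """Map each context of 1..context_size words to a Counter of next words."""
    followers = defaultdict(Counter)
    for k in range(1, context_size + 1):
        for i in range(len(tokens) - k):
            followers[tuple(tokens[i:i + k])][tokens[i + k]] += 1
    return followers

def random_sentence(seed_words, followers, length=8, context_size=2, rng=None):
    """Extend seed_words by sampling next words weighted by their frequency."""
    rng = rng or random.Random()
    words = list(seed_words)
    while len(words) < length:
        options = None
        # Prefer the longest known context, then back off to a shorter one.
        for k in range(min(context_size, len(words)), 0, -1):
            options = followers.get(tuple(words[-k:]))
            if options:
                break
        if not options:
            break  # dead end: no known continuation
        choices, weights = zip(*options.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)

# Toy text invented for this example (not the SwiftKey data).
tokens = "at the end of the day we got to the end of the road".split()
followers = build_followers(tokens)
sentence = random_sentence(["the", "end"], followers, length=6, rng=random.Random(0))
```

Sampling by frequency rather than always taking the top word is what makes the output vary from run to run.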