Leonard
12/8/2020
This is the final project in the Coursera Data Science Specialization by Johns Hopkins University.
The assignment was done in collaboration with SwiftKey.
The task involved processing a large amount of raw text data and turning it into an application that predicts the user's next word based on the words typed before it.
The data, taken from blogs, news articles, and Twitter, was provided as part of the assignment.
The first task in this capstone was to understand the data: tokenize it, build n-grams, and create exploratory graphs. The idea is to break the text down into bigrams (pairs of two words), trigrams (sequences of three words), and so on.
I also removed profanity from the text before moving on to the next step.
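To give a concrete idea of the n-gram step, here is a small sketch in R using the tidytext package. It is illustrative only, not necessarily the exact code I used, and `corpus_df` is a placeholder for the cleaned, profanity-filtered text.

```r
# Sketch: building bigram and trigram counts with tidytext
# (illustrative; `corpus_df` stands in for the cleaned text data)
library(dplyr)
library(tidytext)

corpus_df <- tibble::tibble(
  text = c("this is a small example sentence",
           "another line of sample text")
)

# Break the text into bigrams and trigrams and count how often each occurs
bigrams <- corpus_df %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2) %>%
  count(ngram, sort = TRUE)

trigrams <- corpus_df %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 3) %>%
  count(ngram, sort = TRUE)

head(bigrams)
```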
You can find the full report on this task in my Milestone 1 article.
The model I used for this assignment is backoff with relative frequencies, also known as Stupid Backoff (Brants et al. 2007). I chose this model because the scope of prediction is vast and because of the simplicity of the model itself.
The choice was also based on a very good explanation in the Stanford NLP course, which describes this model as a good method for web-scale n-grams.
The algorithm goes as follows:
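In essence, Stupid Backoff scores a candidate word by its relative frequency given the longest available context; if that n-gram was never seen, it backs off to the shorter context and multiplies the score by a fixed discount (α = 0.4 in Brants et al. 2007). Below is a minimal sketch in R, assuming pre-computed n-gram count tables; the table layout and function name are illustrative, not my exact implementation.

```r
# Sketch of Stupid Backoff scoring (Brants et al. 2007).
# `freq_tables` is assumed to be a list of data frames, one per n-gram order,
# each with columns: prefix (the first n-1 words), word (the last word), count.
stupid_backoff <- function(prefix_words, word, freq_tables, alpha = 0.4) {
  n <- length(prefix_words) + 1
  if (n == 1) {
    # Base case: unigram relative frequency in the corpus
    uni <- freq_tables[[1]]
    return(sum(uni$count[uni$word == word]) / sum(uni$count))
  }
  tbl <- freq_tables[[n]]
  prefix <- paste(prefix_words, collapse = " ")
  num <- sum(tbl$count[tbl$prefix == prefix & tbl$word == word])
  den <- sum(tbl$count[tbl$prefix == prefix])
  if (num > 0 && den > 0) {
    num / den  # relative frequency of the full n-gram
  } else {
    # Unseen n-gram: back off to the shorter context, discounted by alpha
    alpha * stupid_backoff(prefix_words[-1], word, freq_tables, alpha)
  }
}

# To predict the next word, score every candidate and take the highest:
# scores <- sapply(candidates, function(w) stupid_backoff(context, w, freq_tables))
```

Because Stupid Backoff uses scores rather than true probabilities, there is no smoothing or normalization step, which is what keeps it simple and fast on large n-gram tables.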
The final step was to build the application itself using Shiny in R.
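For illustration, the skeleton of such a Shiny app might look like the sketch below; `predict_next_word()` stands in for the Stupid Backoff predictor and is not the actual code of my app.

```r
# Minimal Shiny skeleton (illustrative; predict_next_word() is a placeholder
# for the Stupid Backoff predictor built in the previous step)
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  verbatimTextOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    predict_next_word(input$phrase)
  })
}

shinyApp(ui, server)
```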
Below is the link to the final app.