San James
01/02/2017
Ever had to input text into a portable device? If you have, it must have been apparent that, with the small keyboards they come with, you hardly get both speed and accuracy of input.
This project aims to give the user a substantial improvement in both speed and accuracy when inputting text.
Using the HC Corpora provided by SwiftKey in collaboration with Coursera to support the Data Science Specialization Capstone, we created a text prediction model. We then created a Shiny app that takes input from the user and returns predictions of the next possible word. The user can select any of these predictions. Repeating this word after word can greatly enhance both the speed and the accuracy of text input.
We leverage the theory advanced by Andrey Markov, a Russian mathematician, that the future of a process can be predicted from its present state just as well as from its entire history. Based on this theory, we create a model that uses up to the last 4 words of the input phrase to predict the next word. Such a model is referred to as an N-gram model, where 'N' is the length of the word sequence considered (the preceding words plus the word being predicted). The model uses the probability of a word given the preceding word(s) to predict the next word.
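The idea of predicting from only the last few words can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation: it ranks candidates by raw counts and falls back to shorter contexts when a longer one was never seen, which is a simplification swapped in here for readability. The function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def build_ngram_counts(tokens, max_context=4):
    """Map every context of 1..max_context words to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(1, len(tokens)):
        for k in range(1, max_context + 1):
            if i - k < 0:
                break  # not enough preceding words for a context of length k
            context = tuple(tokens[i - k:i])
            counts[context][tokens[i]] += 1
    return counts


def predict_next(counts, phrase, top_k=3, max_context=4):
    """Return up to top_k candidate next words, using the longest context seen in training."""
    words = phrase.lower().split()
    for k in range(min(max_context, len(words)), 0, -1):
        context = tuple(words[-k:])
        if context in counts:
            return [word for word, _ in counts[context].most_common(top_k)]
    return []  # no context of any length was seen


# Toy corpus (assumption: lower-cased, whitespace-tokenised text).
corpus = "the cat sat on the mat the cat sat on the sofa the cat ate the fish".split()
model = build_ngram_counts(corpus)
print(predict_next(model, "the cat"))
```

Only the last four words of the phrase are ever consulted, which is the Markov assumption in practice: earlier words have no effect on the prediction.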
Several approaches exist, such as Katz backoff, Kneser-Ney, interpolation and tree models; we use Kneser-Ney because of a major advantage it offers. Unlike models that look only at the probability of a word given the words preceding it, Kneser-Ney also takes into account the number of distinct bigrams the word completes (its continuation probability). Furthermore, it is well suited to smaller datasets. Its efficacy is illustrated by the classic example of 'San Francisco': 'Francisco' is frequent, but it is almost always preceded by 'San'. Judging 'Francisco' by its raw frequency alone would give it a much higher probability than it deserves in other contexts.
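The 'San Francisco' effect can be made concrete with a small Python sketch. It computes only the continuation probability, that is, the share of distinct bigram types that end in a given word, and compares it with plain relative frequency. It is not a full Kneser-Ney implementation (discounting and interpolation are left out), and the function names and toy corpus are assumptions made for illustration.

```python
from collections import Counter


def unigram_probabilities(tokens):
    """Relative frequency of each word in the corpus."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}


def continuation_probabilities(tokens):
    """Fraction of distinct bigram types that each word completes."""
    bigram_types = set(zip(tokens, tokens[1:]))
    # For each word, count how many different words precede it.
    preceders = Counter(second for _, second in bigram_types)
    total_types = len(bigram_types)
    return {word: n / total_types for word, n in preceders.items()}


# Toy corpus: 'francisco' and 'glasses' are equally frequent, but
# 'francisco' only ever follows 'san'.
corpus = ("san francisco san francisco san francisco "
          "new glasses my glasses old glasses").split()

unigram = unigram_probabilities(corpus)
continuation = continuation_probabilities(corpus)
print(unigram["francisco"], unigram["glasses"])
print(continuation["francisco"], continuation["glasses"])
```

In this toy corpus both words have the same raw frequency, yet 'glasses' completes three distinct bigrams while 'francisco' completes one, so the continuation probability ranks 'glasses' as the likelier word in an unseen context.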
The Shiny app takes a phrase and returns the top 3 candidates for the next word in the phrase.