Justin
10/9/2016
Whether you are texting a friend or searching for something on Google, word prediction can help you get results faster.
However, predicting the next word is not easy. Language is dynamic and fluid, changing constantly over time. In addition, the interplay of language and technology has spawned new words and terms such as “lol”.
With this in mind, we have developed an approach that blends some of the best techniques in modern natural language modeling, while adding a dash of randomness for fun.
The core of our approach is the Stupid Backoff algorithm. As the authors of “Large Language Models in Machine Translation” note, it is “inexpensive to calculate in a distributed environment while approaching the quality of Kneser-Ney smoothing”. Since we had acquired a large corpus of news articles, blog posts, and tweets, we found this simple approach excellent for handling the data quickly: we score n-grams by frequency and rank the candidates accordingly, \[ S(\omega_i) = \frac{f(\omega_i)}{\eta}, \] where \( f(\omega_i) \) is the frequency of candidate word \( \omega_i \) and \( \eta \) is the total count across candidates. Because language is semi-random, when the probability of accurately predicting the next word begins to decline, we shift gears and employ a Markov chain model with a small random tuning parameter, to mimic the random nature of language.
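To make the mechanics concrete, here is a minimal sketch of the scoring-with-backoff idea in R. It assumes pre-computed n-gram count tables; the table layout, the helper names, and the discount factor alpha = 0.4 (the heuristic value Brants et al. report) are illustrative assumptions, not our production code.

```r
# Sketch of frequency scoring with backoff over pre-computed n-gram
# counts. The table layout and alpha = 0.4 (the heuristic discount from
# Brants et al.) are assumptions for illustration only.

# freq: a list of data frames, one per n-gram order, each with columns
#   prefix (the preceding n-1 words), word (the candidate), count
score_candidates <- function(prefix_words, freq, alpha = 0.4) {
  n <- length(prefix_words) + 1      # highest order we have context for
  while (n >= 2) {
    prefix <- paste(tail(prefix_words, n - 1), collapse = " ")
    hits <- freq[[n]][freq[[n]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      # relative frequency within the matched prefix, discounted once
      # per backoff step taken
      steps <- length(prefix_words) - (n - 1)
      hits$score <- alpha^steps * hits$count / sum(hits$count)
      return(hits[order(-hits$score), c("word", "score")])
    }
    n <- n - 1                       # back off to a shorter prefix
  }
  # nothing matched at any order: fall back to unigram frequencies
  uni <- freq[[1]]
  uni$score <- alpha^length(prefix_words) * uni$count / sum(uni$count)
  uni[order(-uni$score), c("word", "score")]
}

# The "dash of randomness": a small jitter so near-tied candidates are
# not always broken the same way
pick_word <- function(cands, eps = 0.01) {
  cands$word[which.max(cands$score + runif(nrow(cands), 0, eps))]
}
```

Note that the discounting never requires normalizing across n-gram orders, which is exactly what makes the scheme so cheap to compute on a large corpus.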
In the spirit of Edward Tufte, our app takes a very minimalist approach to presenting the algorithm's output. It is lightweight and very fast, updating dynamically as one types. We chose the minimalist approach because, in practice, anyone interested in the app would already have a user interface via a cell phone or web browser. Simply start typing, and the predictions appear.
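As a rough illustration (not our exact code), this kind of dynamically updating interface takes only a few lines of Shiny; predict_next() below is a hypothetical stand-in for the backoff scorer sketched above.

```r
library(shiny)

# Skeleton of a minimal, dynamically updating prediction interface.
# predict_next() is a hypothetical stand-in for the backoff scorer above.
ui <- fluidPage(
  textInput("phrase", label = NULL, placeholder = "Start typing..."),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    words <- strsplit(trimws(input$phrase), "\\s+")[[1]]
    if (length(words) == 0) return("")
    predict_next(words)   # top-ranked next word for the current context
  })
}

shinyApp(ui, server)
```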
Please try our app here!
We hope you enjoyed our novel approach to the problem of text prediction!
We want to thank Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean of Google, Inc., who inspired us to try the Stupid Backoff approach. We also want to thank Daniel Jurafsky and James H. Martin for publicly posting the September 1, 2014 draft of Speech and Language Processing, particularly Chapter 4 on n-grams and Chapter 8 on Hidden Markov Models.
Finally, we must thank the talented Drew Schmidt, who created the ngram package we used. He wrote the n-gram functions in C with an R wrapper, so we can all create n-grams amazingly fast.
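For anyone curious, getting an n-gram frequency table out of the package takes just a couple of calls; the toy string below is ours for illustration, while our actual corpora were of course far larger.

```r
library(ngram)

# Toy example: tokenize a short string into bigrams and inspect the
# resulting frequency table
txt <- "to be or not to be that is the question"
bg  <- ngram(txt, n = 2)
get.phrasetable(bg)   # n-grams with counts and proportions
```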