NextKey

S.S.
Aug 2015

What is NextKey?

NextKey is a web application that predicts the next word from a phrase that you enter. The main goals of the project are accuracy and speed behind a user-friendly interface. Similar predictive-text software exists on mobile platforms. NextKey uses Natural Language Processing (NLP) and text mining, and relies on a word dictionary stored as n-gram data structures to make its predictions.

How to use

Start typing words in the Phrase box and the predicted next word appears in the Prediction box. Optionally, tick the check box to see the top 3 results. Click About in the left menu for more detailed help on the functionality. Six information boxes in the application are updated in real time, and each shows a different aspect of what is going on 'under the hood':

Prediction Time: The time required to predict the next word.
Predictions Found: The number of predicted words found (up to 10).
N-Gram Match Found: The n-gram data structure where the top predicted word is found.
Typed Words: The number of words you have typed (separated by space).
Average Typed Characters: The average number of characters in the words you have typed.
N-Gram Frequency Count: The frequency count of the n-gram where the top predicted word is found.
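
As an illustration of how the two typing statistics above could be derived, here is a minimal Python sketch. The function name `typing_stats` is hypothetical, and splitting on whitespace is an assumption based on the "separated by space" description; the application's own code is not shown in this document.

```python
from typing import Tuple


def typing_stats(phrase: str) -> Tuple[int, float]:
    """Return (typed word count, average characters per word) for a phrase."""
    # Assumption: words are whatever remains after splitting on whitespace.
    words = phrase.split()
    if not words:
        # Nothing typed yet: report zero words and a zero average.
        return 0, 0.0
    total_chars = sum(len(word) for word in words)
    return len(words), total_chars / len(words)


count, average = typing_stats("see you next")
print(count, round(average, 2))
```

Returning zeros for an empty phrase avoids a division by zero before the user has typed anything.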

Prediction algorithm

Our prediction model uses the 'stupid back-off' algorithm, which is commonly used with large Web n-gram datasets. In general, the algorithm first looks for the best match in the 4-gram table and, if one is found, shows the predicted word. If no match is found, it moves on to the 3-gram table, then the 2-gram table, and so on (i.e. 4-3-2-1). Finally, if no matching word is found at any level, the top 1-grams are shown.

[Figure: NextKey flow chart]
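
The back-off lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the table layout (a dictionary from a context tuple to a count-sorted list of candidates), the function name `predict_next`, and the sample data are all hypothetical, since the application's real storage format is not given here.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: context words -> [(next_word, count), ...],
# with each candidate list already sorted by descending count.
NGramTable = Dict[Tuple[str, ...], List[Tuple[str, int]]]


def predict_next(phrase: str, tables: Dict[int, NGramTable],
                 top_unigrams: List[str], k: int = 1) -> Tuple[List[str], int]:
    """Return up to k predicted words and the n-gram order that matched."""
    words = phrase.lower().split()
    # Try the longest context first: the 4-gram table is keyed on the
    # last 3 typed words, the 3-gram table on the last 2, and so on.
    for n in (4, 3, 2):
        context_len = n - 1
        if len(words) < context_len:
            continue  # not enough typed words for this order
        context = tuple(words[-context_len:])
        candidates = tables.get(n, {}).get(context)
        if candidates:
            return [word for word, _ in candidates[:k]], n
    # No match at any order: fall back to the most frequent single words.
    return top_unigrams[:k], 1


# Made-up sample tables, for illustration only.
tables = {
    4: {("thanks", "for", "the"): [("follow", 12), ("help", 7)]},
    3: {("for", "the"): [("first", 40), ("follow", 12)]},
    2: {("the",): [("same", 90), ("first", 80)]},
}
top_unigrams = ["the", "to", "and"]

print(predict_next("Thanks for the", tables, top_unigrams))
print(predict_next("waiting for the", tables, top_unigrams, k=2))
```

The sketch returns on the first order that has a match, which mirrors the 4-3-2-1 description. The published stupid back-off scheme also discounts scores by a fixed factor at each back-off step; that scoring is left out here.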

How it works

Data, Prediction and Results

The data is sourced from HC Corpora. This large 580 MB unstructured dataset consists of sentences from blogs, news and Twitter. The dataset is cleaned (for example, punctuation is removed), stripped of profanity, tokenized, and formed into compressed n-gram data structures that are indexed by key for very fast retrieval. The final dataset holds 1-grams to 4-grams in only 10.6 MB and uses 72 MB of memory. Each n-gram data structure contains ~750,000 of the most common phrases. Our back-off prediction model compares the user's typed text against the n-gram dataset and finds the next word in less than 0.005 seconds. On Jan Hagelauer's benchmark, our top-1 accuracy was 14%, with a class-leading runtime of 1.68 ms.
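
To make the data preparation steps more concrete, the following Python sketch cleans and tokenizes sentences, counts n-grams, keeps only the most common ones, and indexes them by their leading words. It is a simplified assumption of how such a pipeline could look, not the project's actual code: the names `tokenize` and `build_table`, the punctuation rule, and the `banned` word set are all hypothetical.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def tokenize(sentence: str, banned: Set[str]) -> List[str]:
    """Lowercase a sentence, strip punctuation, and drop banned words."""
    # Assumption: keep only letters, apostrophes and spaces.
    cleaned = re.sub(r"[^a-z' ]+", " ", sentence.lower())
    return [word for word in cleaned.split() if word not in banned]


def build_table(sentences: Iterable[str], n: int, banned: Set[str],
                max_entries: int) -> Dict[Tuple[str, ...], List[Tuple[str, int]]]:
    """Count n-grams and index the most common ones by their first n-1 words."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence, banned)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    table = defaultdict(list)
    # most_common() yields n-grams by descending count, so every
    # candidate list ends up sorted with the best prediction first.
    for gram, count in counts.most_common(max_entries):
        table[gram[:-1]].append((gram[-1], count))
    return dict(table)


# Made-up sample sentences, for illustration only.
sample = ["Thanks for the follow!", "Thanks for the help.", "thanks for the follow"]
print(build_table(sample, 4, set(), max_entries=750_000))
```

Keying each table on the leading words means a prediction is a single dictionary lookup, which is one plausible way to get the fast retrieval described above. The `max_entries` cap plays the role of the ~750,000 phrase limit.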

Conclusion

Our application is very fast, small and easy to use, and it achieves relatively high accuracy on this specific benchmark. With more development, it could be a strong candidate for porting to smartphones and tablets.