S.S.
Aug 2015
This web application predicts the next word based on a phrase that you enter. The main goals of the project are accuracy and speed behind a user-friendly interface. Similar predictive text software exists on mobile platforms. Natural Language Processing (NLP) and text mining are used to achieve the results. A word dictionary built from n-gram data structures assists in the prediction.
Simply start typing words in the Phrase box and you will see the single predicted next word in the Prediction box. You can optionally tick the check box to see the top 3 results. Click on About in the left menu for more detailed help on the functionality. The application also has six information boxes that are updated in real time, each giving a different view of what is going on 'under the hood'.
Our prediction model determines the next word using the 'stupid back-off' algorithm commonly used for large Web n-gram datasets. In general, the algorithm first looks for the best match in the 4-gram table and, if one is found, shows the predicted word. If no match is found, it moves on to the 3-gram table, and so on down the orders (i.e. 4-3-2-1). Finally, if no matching word is found, the top 1-grams are shown.
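The back-off order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the application's actual code: the function name predict_next, the layout of the tables dictionary (context tuple mapped to next words already sorted by frequency) and the toy data are all hypothetical. It returns the first match from the highest available order and does not compute any back-off scores.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: tables[n] maps an (n-1)-word context tuple to
# candidate next words, already sorted from most to least frequent.
NgramTable = Dict[Tuple[str, ...], List[str]]


def predict_next(phrase: str, tables: Dict[int, NgramTable],
                 top_unigrams: List[str], k: int = 1) -> List[str]:
    """Return up to k predicted next words, backing off 4 -> 3 -> 2 -> 1."""
    words = phrase.lower().split()
    for n in (4, 3, 2):
        context_len = n - 1
        if len(words) < context_len:
            continue  # not enough typed words for this n-gram order
        context = tuple(words[-context_len:])
        candidates = tables.get(n, {}).get(context)
        if candidates:
            return candidates[:k]
    # No higher-order match: fall back to the most common single words.
    return top_unigrams[:k]


# Toy tables, invented purely for illustration.
tables = {
    4: {("thanks", "for", "the"): ["follow", "help"]},
    3: {("for", "the"): ["first", "follow"]},
    2: {("the",): ["same", "first"]},
}
top_unigrams = ["the", "to", "and"]

print(predict_next("Thanks for the", tables, top_unigrams))       # 4-gram match
print(predict_next("waiting for the", tables, top_unigrams))      # backs off to 3-gram
print(predict_next("something unseen", tables, top_unigrams, 3))  # falls back to 1-grams
```

Passing k=3 corresponds to the optional top 3 results in the interface. Because each lookup is a dictionary access keyed on the last few typed words, the cost per prediction stays small regardless of how many phrases the tables hold.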
The data is sourced from HC Corpora. This large 580MB unstructured dataset consists of sentences from blogs, news and Twitter. The dataset is cleaned (for example, punctuation is removed), stripped of profanity, tokenized and formed into compressed n-gram data structures that are indexed using keys for very fast retrieval. The final dataset, covering 1-grams to 4-grams, is only 10.6MB on disk and uses 72MB of memory. Each n-gram data structure contains ~750,000 of the most common phrases. Our back-off prediction model searches the n-gram dataset, compares it with the user-typed text and finds the next word in less than 0.005 seconds. Our accuracy was 14% for the top-1 predicted word, with a class-leading 1.68 ms runtime, using Jan Hagelauer's benchmark.
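The preparation steps above (cleaning, profanity removal, tokenizing and building keyed n-gram structures) might look roughly like the following Python sketch. The function names tokenize and build_table, the regular expression, the choice to simply drop banned words and the sample sentences are all assumptions made for illustration; the compression step is not shown.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def tokenize(sentence: str, banned: Set[str]) -> List[str]:
    """Lowercase, strip punctuation and digits, and drop banned words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in banned]


def build_table(sentences: Iterable[str], n: int, banned: Set[str],
                max_phrases: int) -> Dict[Tuple[str, ...], List[str]]:
    """Count n-grams, keep the most common ones, and key them by context."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence, banned)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    table = defaultdict(list)
    # most_common() yields n-grams by descending count, so each context's
    # list of next words ends up ordered from most to least frequent.
    for gram, _ in counts.most_common(max_phrases):
        table[gram[:-1]].append(gram[-1])
    return dict(table)


# Tiny invented sample standing in for the blog, news and Twitter text.
sample = [
    "Thanks for the follow!",
    "Thanks for the help.",
    "thanks for the follow",
]
four_grams = build_table(sample, n=4, banned=set(), max_phrases=750_000)
print(four_grams[("thanks", "for", "the")])
```

Keying each table on the context words is one way to get the indexed, fast retrieval the text describes, and capping the table with max_phrases mirrors keeping only the most common phrases per n-gram order.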
Our application is very fast, small and easy to use, and it has relatively high accuracy on this specific benchmark. With further development, it would be well suited for porting to smartphones and tablets.