Next Word Prediction

JB
August 2015

The Application

A next word prediction application predicts the next word, given a phrase of at least one word. This is especially useful when typing on the small keyboard of a mobile device: if the prediction is correct, the user can add the next word with a single click.

Our application is a prototype for the English language, built to test performance and accuracy. It is web-based and returns a prediction for the next word after the user enters a phrase in English.

How To Use The App

The application is easy to use. On the left side, there is an input field for entering text. After you click the submit button, the predicted next word is shown on the right side.

How the App Works

The application makes its prediction from a dictionary of n-grams, each mapped to possible next words with a certain probability. For the current version we use n = 3, so the dictionary consists of a long list of single words, bigrams and trigrams, each mapped to possible next words and their probabilities. After the user enters a phrase, the algorithm looks for this phrase in the dictionary and predicts the next word with the highest probability. The dictionary was created from a large text corpus compiled from various sources, which was broken down into a list of single words, bigrams, trigrams and 4-grams. This list was then aggregated into unique terms and their frequencies. To create the dictionary, the last word of each bigram, trigram and 4-gram was split off as the word to be predicted, leaving the preceding one, two or three words as the lookup key. We explain the algorithm on the next slide.
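As an illustration of the dictionary construction described above, here is a minimal Python sketch. It is not the application's actual code: the function name `build_dictionary`, the parameter `max_context`, the use of relative frequencies as probabilities, and the toy corpus are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_dictionary(tokens, max_context=3):
    """Map each context of 1..max_context words to next-word probabilities.

    Hypothetical helper: probabilities are assumed to be plain relative
    frequencies of the words that follow each context in the corpus.
    """
    counts = defaultdict(Counter)
    for n in range(1, max_context + 1):
        # Each (n + 1)-gram is split into an n-word context and its last word.
        for i in range(len(tokens) - n):
            context = tuple(tokens[i:i + n])
            counts[context][tokens[i + n]] += 1

    dictionary = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        dictionary[context] = {word: c / total for word, c in followers.items()}
    return dictionary


# Toy corpus, assumed to be already tokenized and lowercased.
corpus = "the cat sat on the mat the cat ran".split()
dictionary = build_dictionary(corpus)
print(dictionary[("sat", "on", "the")])  # {'mat': 1.0}
print(dictionary[("the", "cat")])
```

A real corpus would also need tokenization and cleaning, and probably pruning of rare n-grams to keep the dictionary small; none of that is shown here.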

The Algorithm

The application uses an algorithm called 'stupid backoff'. It is a simple algorithm with the advantage of producing a prediction very quickly. We chose it because an immediate next word prediction is important when typing on a mobile device. Since we use n = 3 in the current version, the algorithm takes the last three words of the input phrase (or the entire phrase if it is three words long or shorter) and looks for them in the dictionary. If there is a match, the prediction is the word that follows this phrase with the highest probability according to the dictionary. If the phrase is not found in the dictionary, the algorithm drops the first word and searches the dictionary again for the shortened phrase. It repeats this until a match is found or the phrase is empty. In the latter case, the prediction is the most frequently occurring word in English text, which is 'the'.
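The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the application's implementation: the function name `predict_next_word`, the whitespace tokenization with lowercasing, and the contents of `toy_dictionary` are invented for the example. It follows the lookup as described on this slide and does not compute backed-off scores across levels.

```python
def predict_next_word(phrase, dictionary, max_context=3, default="the"):
    """Return the most probable next word, backing off to shorter contexts.

    Hypothetical helper: `dictionary` maps tuples of context words to
    {next_word: probability}; tokenization is a plain lowercase split.
    """
    words = phrase.lower().split()
    context = words[-max_context:]  # last three words, or fewer if shorter
    while context:
        candidates = dictionary.get(tuple(context))
        if candidates:
            # Match found: pick the next word with the highest probability.
            return max(candidates, key=candidates.get)
        context = context[1:]  # drop the first word and search again
    return default  # nothing matched, or the phrase was empty


# Toy dictionary with made-up probabilities, for illustration only.
toy_dictionary = {
    ("thank", "you", "very"): {"much": 0.9, "little": 0.1},
    ("you", "very"): {"much": 0.8, "well": 0.2},
    ("very",): {"good": 0.5, "much": 0.3, "well": 0.2},
}

print(predict_next_word("thank you very", toy_dictionary))   # much
print(predict_next_word("that is very", toy_dictionary))     # good
print(predict_next_word("hello world", toy_dictionary))      # the
```

Each backoff step costs one dictionary lookup, so a prediction needs at most three lookups with n = 3, which fits the goal of an immediate response.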