Word Predictor

Charles McGuinness, December 2014
Screenshot

Introduction to the Application

I designed the application with several goals in mind:

Accept a sequence of words taken from English-language text and predict the most likely next word
Allow the user to test the predictive model with several phrases
Help the user keep score of the tests
Add an element of fun and whimsy to the process

Using the Application

The application keeps score in a “hangman”-like game. To use it, the user enters a phrase into the input box and then presses the “Predict Next Word” button. After a brief calculation, the program displays the predicted word. The user then compares the word predicted by the program with the word from their test case. If they match, the user presses the “Yes” button; if not, the “No” button. A “Let's Start over” button is available to reset the counts if needed.

As the tests progress, the program keeps score and updates a drawing of a “hangman” character. After five tests (the definition of a round), the program either celebrates its success or laments its failure.

Model Generation

The prediction algorithm runs on a pre-computed set of n-grams. The final model's n-grams are derived from all n-grams of five words or fewer produced by parsing all of the corpora.
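As an illustration of the counting step, here is a minimal Python sketch of how n-grams of up to five words might be tallied. It is not the application's actual code: the names count_ngrams and tokenize are hypothetical, and the tokenizer (lowercase, split on whitespace) is an assumption, since the text does not say how the corpora were cleaned or split.

```python
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase the text and split on whitespace.
    return text.lower().split()


def count_ngrams(sentences, max_n=5):
    """Count every n-gram of length 1..max_n in an iterable of token lists."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of n words across the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


counts = count_ngrams([tokenize("the young man saw the young dog")])
```

Each key is a tuple of words, so the last element is the word to be predicted and everything before it is the predictor.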

The initial, very large list of n-grams is pruned in three steps:

All n-grams that occurred twice or less were removed.
All n-grams with the same “predictor” (all but the last word) were compared, and only the most frequent was retained. For example, if “the young man” occurred 5 times and “the young dog” 2 times, I kept the n-gram ending in “man”.
If a longer n-gram was redundant, because a shorter n-gram would predict the same thing, the longer one was removed.
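The three pruning steps above can be sketched roughly as follows. This is an assumption-laden illustration rather than the original implementation: the function name prune is hypothetical, unigrams are skipped because they have no predictor, ties in step 2 go to whichever n-gram is seen first, and "redundant" is interpreted to mean that the predictor with its first word dropped already predicts the same word.

```python
from collections import Counter


def prune(counts, min_count=3):
    """Reduce raw n-gram counts to a {predictor: predicted_word} table."""
    # Step 1: drop n-grams that occur twice or less (and unigrams,
    # which have no predictor; that exclusion is an assumption).
    frequent = {g: c for g, c in counts.items()
                if c >= min_count and len(g) >= 2}

    # Step 2: for each predictor, keep only the most frequent last word.
    best = {}
    for gram, c in frequent.items():
        predictor, word = gram[:-1], gram[-1]
        if predictor not in best or c > best[predictor][1]:
            best[predictor] = (word, c)

    # Step 3: drop a longer predictor when the predictor obtained by
    # removing its first word already predicts the same word.
    model = {}
    for predictor, (word, _) in best.items():
        shorter = predictor[1:]
        if shorter and shorter in best and best[shorter][0] == word:
            continue
        model[predictor] = word
    return model


example_counts = Counter({
    ("the", "young", "man"): 5,
    ("the", "young", "dog"): 2,
    ("young", "man"): 6,
    ("old", "dog"): 4,
    ("old", "man"): 3,
})
pruned = prune(example_counts)
```

In this toy input, “the young dog” falls to step 1, “old man” loses to “old dog” in step 2, and “the young man” is dropped in step 3 because “young” alone already predicts “man”.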

Prediction Generation

At run time, a phrase is entered into the user interface and fed to the prediction algorithm, which breaks the phrase into individual words and follows these steps:

Look for the longest exact match possible. If one is found, that is used as the prediction.
Begin replacing words in the input phrase with wildcards, starting with the last word, to find a match. The algorithm tries one word at a time, then two, up to all but one word, in hopes of finding a match.
If all else fails (there is no match possible with any permutation of the words), the program returns “The”, the most common word in English usage.
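One possible reading of these steps, sketched in Python. Several details are assumptions that the description leaves open: the model is taken to map each predictor to a (word, count) pair, at most the last four words of the phrase are used (the longest predictor a 5-gram allows), shorter contexts are also tried with wildcards, and when a wildcard pattern matches several predictors the one with the highest count wins. The names predict, matches and WILDCARD are hypothetical.

```python
from itertools import combinations

WILDCARD = None  # stands in for "any single word"


def matches(pattern, predictor):
    """True if predictor equals pattern, treating WILDCARD as any word."""
    return len(pattern) == len(predictor) and all(
        p is WILDCARD or p == w for p, w in zip(pattern, predictor))


def predict(phrase, model, max_context=4, default="The"):
    words = phrase.lower().split()[-max_context:]

    # Step 1: longest exact match, dropping leading words one at a time.
    for start in range(len(words)):
        entry = model.get(tuple(words[start:]))
        if entry is not None:
            return entry[0]

    # Step 2: replace one word with a wildcard, then two, up to all but
    # one, trying positions nearest the end of the phrase first.
    for start in range(len(words)):
        context = words[start:]
        k = len(context)
        for n_wild in range(1, k):
            for positions in combinations(range(k - 1, -1, -1), n_wild):
                pattern = tuple(WILDCARD if i in positions else w
                                for i, w in enumerate(context))
                hits = [e for pred, e in model.items()
                        if matches(pattern, pred)]
                if hits:
                    # Assumed tie-break: the most frequent candidate.
                    return max(hits, key=lambda e: e[1])[0]

    # Step 3: nothing matched, so fall back to the default word.
    return default


toy_model = {
    ("the", "young"): ("man", 5),
    ("a", "very", "old"): ("dog", 4),
}
exact = predict("I saw the young", toy_model)
wild = predict("a very big", toy_model)
fallback = predict("zebra", toy_model)
```

The wildcard search here scans the whole model for every pattern, which is fine for a toy table; a real model of any size would likely need an index keyed by predictor length or by the non-wildcard words.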