Word Prediction Application

Maximus Decimus Meridius, Legio IV Flavia Felix
a.d. XIX Kal. Ian., MMXIV

Preprocessing

The dataset, a large collection of tweets, blog posts and news items (about 4 million in total), was summarized into ngram counts: the number of occurrences of each combination of n consecutive words, with n <= 4.

To reduce computation time, the dataset was split into 500 files, which were processed individually and then combined into four data tables, one for each ngram order from 1 to 4.
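
As a rough illustration of this step, the Python sketch below counts ngrams chunk by chunk and then merges the per-chunk counts into one table per ngram order. The function names (`count_ngrams`, `merge_counts`), the naive whitespace tokenizer and the two tiny in-memory chunks are assumptions made for illustration; they are not the code used to build the app.

```python
from collections import Counter

def count_ngrams(lines, max_n=4):
    """Count ngrams (as tuples of words) for n = 1..max_n in one chunk of lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        words = line.lower().split()  # assumption: naive whitespace tokenizer
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts

def merge_counts(chunk_counts, max_n=4):
    """Combine per-chunk counts into one table per ngram order."""
    total = {n: Counter() for n in range(1, max_n + 1)}
    for counts in chunk_counts:
        for n in range(1, max_n + 1):
            total[n].update(counts[n])  # Counter.update adds counts together
    return total

# Two tiny stand-ins for the 500 files (hypothetical data)
chunks = [["the cat sat on the mat"], ["the cat ran"]]
tables = merge_counts(count_ngrams(chunk) for chunk in chunks)
```

Because each chunk is counted independently, the chunks could also be processed in parallel or one at a time to limit memory use; only the merge step needs all the results.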

How the Algorithm Works

The app uses a backoff algorithm, checking up to the last 3 words entered against the library of ngrams:

1. Check the last 3 words entered against the 4-grams and return the most common following word. If no matches are found (or fewer than 3 words have been entered), go to step 2.
2. Check the last 2 words entered against the 3-grams. If no matches are found (or fewer than 2 words have been entered), go to step 3.
3. Check the last word entered against the 2-grams. If no matches are found, default to the most common word, “the”.
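
The backoff steps can be sketched in Python as follows. This is a minimal sketch, not the app's implementation: the function name `predict_next`, the layout of `tables` (a dict mapping ngram order to a dict of word tuples and counts) and the toy counts are all assumptions for illustration.

```python
def predict_next(text, tables, default="the"):
    """Back off from 4-grams to 2-grams; tables[n] maps n-word tuples to counts."""
    words = text.lower().split()
    for n in (4, 3, 2):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough words entered for this ngram order
        # Collect every word that follows the context, with its count
        candidates = {gram[-1]: count
                      for gram, count in tables[n].items()
                      if gram[:-1] == context}
        if candidates:
            return max(candidates, key=candidates.get)
    return default  # no match at any order

# Hypothetical toy tables
tables = {
    2: {("the", "cat"): 2, ("the", "mat"): 1, ("cat", "sat"): 1},
    3: {("the", "cat", "sat"): 1, ("the", "cat", "ran"): 2},
    4: {("on", "the", "cat", "sat"): 1},
}

print(predict_next("on the cat", tables))   # matched at the 4-gram level
print(predict_next("see the cat", tables))  # backs off to the 3-grams
print(predict_next("zebra", tables))        # falls through to the default
```

The linear scan over each table keeps the sketch short; a real application would more likely index the tables by context so that each lookup is a single hash or keyed-table access.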

The Word Cloud

Corresponding data tables were created with stopwords removed, i.e. words that contain little predictive information (“and”, “the”, “is”, etc.). Predictions with the stopwords removed tend to be topical in nature rather than word-for-word, and are thus well represented by a word cloud:

[Image: example word cloud]
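
To make the idea concrete, here is a small Python sketch of how stopword removal changes what follows a word. The stopword list, the function names (`remove_stopwords`, `topic_weights`) and the sample sentences are assumptions for illustration; the resulting counts are the kind of weights that could set word sizes in a word cloud.

```python
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one used by the app
STOPWORDS = {"and", "the", "is", "a", "of", "to", "in"}

def remove_stopwords(line):
    """Lowercase a line, split on whitespace, and drop stopwords."""
    return [w for w in line.lower().split() if w not in STOPWORDS]

def topic_weights(lines, last_word):
    """Count the content words that follow last_word once stopwords are removed."""
    weights = Counter()
    for line in lines:
        words = remove_stopwords(line)
        for prev, nxt in zip(words, words[1:]):
            if prev == last_word:
                weights[nxt] += 1
    return weights

# Hypothetical sample text
lines = [
    "the president of the country is elected",
    "the president is in the country",
]
weights = topic_weights(lines, "president")
```

With the stopwords in place, “president” would be followed by “of” and “is”; with them removed, it is followed by “country” in both sentences, which is the more topical association.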

App Instructions

Enter words into the text box to get the next predicted word.
To get a word cloud showing combinations of words with stopwords removed, check the box “Include topic suggestions”.
Processing time increases somewhat with this option enabled, but should remain acceptable.
More common words take longer to process.

Uses

The word cloud could be useful on touch devices, where it is easy and intuitive to tap a topic to find out more. Fun things to try:

Play with the relative popularity of celebrities' names
Enter the next predicted word and see where the algorithm takes you (it only “remembers” three words of context)