Maximus Decimus Meridius, Legio IV Flavia Felix
a.d. XIX Kal. Ian., MMXIV
The dataset, a large collection of tweets, blogs and news items (about 4 million in total), was summarized into ngram counts: counts of occurrences of each sequence of n consecutive words, with n <= 4.
To reduce computation time, the dataset was split into 500 files, which were processed individually and then combined into four data tables, one for each ngram length from 1 to 4.
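The counting and combining steps described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the function names `ngram_counts` and `combine`, the whitespace tokenization, and the two toy chunks are all assumptions made for the example.

```python
from collections import Counter

def ngram_counts(tokens, max_n=4):
    """Count every sequence of n consecutive tokens, for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def combine(chunk_counts, max_n=4):
    """Merge per-chunk counts into one table per ngram length."""
    total = {n: Counter() for n in range(1, max_n + 1)}
    for counts in chunk_counts:
        for n in total:
            total[n].update(counts[n])  # Counter.update adds counts together
    return total

# Two toy "files" standing in for the 500 real ones (hypothetical data).
chunks = ["the cat sat on the mat", "the cat ran"]
tables = combine(ngram_counts(text.split()) for text in chunks)

print(tables[2][("the", "cat")])
```

Because each chunk is counted on its own, ngrams that would span a chunk boundary are never counted in this sketch; with chunks split on item boundaries (one tweet or article never straddling two files) nothing is lost.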
The app uses a backoff algorithm, checking up to the 3 preceding words against the library of ngrams:
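A minimal sketch of the backoff idea, under stated assumptions: the function name `predict`, the toy tables, and the ranking by raw count are hypothetical, and the app may score candidates differently (for example with discounting). The sketch tries the longest available context first and falls back to shorter ones until some ngram matches.

```python
from collections import Counter

def predict(words, tables, k=3):
    """Suggest up to k next words, backing off from longer to shorter contexts."""
    max_context = max(tables) - 1  # 3 context words when tables go up to 4-grams
    for size in range(min(max_context, len(words)), -1, -1):
        # Last `size` words; an empty tuple when size == 0 (unigram fallback).
        context = tuple(words[len(words) - size:])
        candidates = Counter()
        for gram, count in tables[size + 1].items():
            if gram[:-1] == context:
                candidates[gram[-1]] += count
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
    return []

# Toy ngram tables keyed by ngram length (hypothetical data).
tables = {
    1: Counter({("the",): 3, ("cat",): 2, ("sat",): 1}),
    2: Counter({("the", "cat"): 2, ("cat", "sat"): 1}),
    3: Counter({("the", "cat", "sat"): 1}),
    4: Counter(),
}

print(predict(["i", "saw", "the", "cat"], tables))
print(predict(["hello"], tables))
```

In the first call no 4-gram matches the last three words, so the lookup backs off to the 3-gram table; in the second nothing matches at all, so the most frequent single words are returned. The linear scan keeps the sketch short; a real lookup would index each table by its context.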
Corresponding data tables were created with stopwords removed, i.e. without words that carry little predictive information (“and”, “the”, “is”, etc.). Predictions made with the stopwords removed tend to be topical in nature rather than word-for-word, and are thus well represented by a word cloud:
The word cloud could be useful on touch devices, where it's easy and intuitive to tap a topic to find out more. Fun things to try: