NextWord: a text input prediction framework

Bruno Hanzen
2017/03/21

The Coursera Data Science specialization provides comprehensive training in the basic skills of the discipline, including, but not limited to, the R language, statistics, presentation skills and Web application development.

The program concludes with the Capstone Project. The objective is to build a predictive text model that “predicts” the next word to be entered at the keyboard, based on the preceding words.

Executive Summary

We built a parameterized model that is highly flexible and can be adapted to suit the needs of different applications. It is basically made up of two parts:

  • The training pipeline: we “feed” the pipeline with text samples to extract and format statistics.
  • The runtime: we predict the “next word” based on the already entered text and the pipeline output.

Coursera provided a compilation of news, blogs and tweets prepared by Swiftkey (556 MB).

Some tips about parameter selection:

  • Input text selection is an integral part of the process. You could, for example, target specific professions by compiling typical texts they use, in order to take their specific vocabulary or sentence structures into account.
  • “Feeding” the runtime with large datasets does not improve prediction success past some limit, but is expensive in memory usage and response time. It can be tuned with the Coursera-provided benchmark.
  • Using high-order n-grams (n > 3) does not improve prediction success significantly, and also impacts memory usage and response times.
  • In order to improve prediction success, we advise adding a “sieve-as-you-type” module that narrows the search for the next word using the already entered letters (a small sketch follows this list).
  • The runtime can run on any platform that supports R, but has to take platform-specific constraints (memory) into consideration.
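As an illustration of the “sieve-as-you-type” idea, here is a minimal R sketch; the function name and interface are hypothetical, not part of our application:

    # Keep only the candidate words that start with the letters the
    # user has already typed for the current word.
    sieve_candidates <- function(candidates, typed_prefix) {
      if (nchar(typed_prefix) == 0) return(candidates)
      candidates[startsWith(candidates, typed_prefix)]
    }

    # Example: the model proposed four words, the user has typed "th"
    sieve_candidates(c("the", "that", "quick", "there"), "th")
    # [1] "the"   "that"  "there"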

The training pipeline

In the training phase, we analyse a text corpus and extract the information that will be provided to the runtime. It is the most resource-intensive part of the process, and can require substantial computing resources (memory, processor).

  • Analysis: we perform an exhaustive analysis of the corpus, and extract lists of n-grams (sets of n consecutive words included in the corpus) and their occurrence frequencies. For example, “the”, “the quick”, “the quick brown” and “the quick brown fox” are respectively 1-, 2-, 3- and 4-grams extracted from “The quick brown fox jumps over the lazy dog”.
  • Compression: the analysis generates large data volumes. In order to reduce the size, we remove all n-grams which contain words not included in a dictionary; in this case, we used hunspell. As the dictionary is customizable, it can also be used to remove profanity. We further reduce the volume by removing the least frequent n-grams (based on a user-defined parameter). A sketch of these two steps follows this list.
  • Formatting: we format the data so that the runtime can process them efficiently.
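To make the analysis and compression steps concrete, here is a simplified R sketch. The hunspell package is the one we actually used; the function names and the tokenization are illustrative simplifications, not our production pipeline:

    library(hunspell)

    # Analysis: extract n-grams and their occurrence frequencies from a
    # character vector of text lines (simplified tokenization).
    extract_ngrams <- function(lines, n) {
      tokens <- strsplit(tolower(lines), "[^a-z']+")
      grams <- unlist(lapply(tokens, function(w) {
        w <- w[nzchar(w)]
        if (length(w) < n) return(character(0))
        vapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "),
               character(1))
      }))
      sort(table(grams), decreasing = TRUE)
    }

    # Compression: drop n-grams containing out-of-dictionary words, then
    # drop the least frequent ones (min_count is the user-defined parameter).
    compress_ngrams <- function(freqs, min_count = 2) {
      words <- strsplit(names(freqs), " ")
      in_dict <- vapply(words, function(w) all(hunspell_check(w)), logical(1))
      freqs[in_dict & freqs >= min_count]
    }

    # Example: bigram counts for the sentence used above
    extract_ngrams("The quick brown fox jumps over the lazy dog", 2)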

In our case, we used an 8 GB RAM PC for the training phase. We had to apply some “tricks” to be able to process the files provided by Coursera and Swiftkey.

The Runtime

This is where the actual prediction algorithm lives. It implements the Katz back-off algorithm; the exact description can be found on Wikipedia (https://en.wikipedia.org/wiki/Katz%27s_back-off_model).

The basic principle is to look in the n-gram tables for occurrences of the sequences formed by the previously typed 1, 2, 3, … words (n-grams), and to assign a probability to the successor word found in the tables, based on its occurrence frequency in the corpus. The highest-order n-grams have precedence. A “discount” process controls the comparison of the probabilities of n-grams of different orders.

We used constant discount coefficients. It would have been possible to use a more sophisticated algorithm like Good-Turing, but it would have added to the algorithm's complexity, and our tests showed a rather small sensitivity of the prediction success to the discount parameters.
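For illustration, here is a simplified sketch of the back-off lookup with constant discounts. It assumes the n-gram tables are stored as data frames with prefix, word and freq columns (with an empty prefix for unigrams); the exact storage format of our runtime differs:

    # Back-off with constant discounts (sketch, not our exact code).
    # ngram_tables[[n]]: data frame with columns prefix (the first n-1
    # words, "" for unigrams), word (candidate next word) and freq.
    # discounts[n]: constant discount coefficient for order n.
    predict_next <- function(typed_words, ngram_tables, discounts) {
      scores <- numeric(0)
      for (n in rev(seq_along(ngram_tables))) {      # highest order first
        prefix <- paste(tail(typed_words, n - 1), collapse = " ")
        hits <- ngram_tables[[n]][ngram_tables[[n]]$prefix == prefix, ]
        if (nrow(hits) > 0) {
          p <- discounts[n] * hits$freq / sum(hits$freq)
          keep <- !(hits$word %in% names(scores))    # highest order wins
          scores[hits$word[keep]] <- p[keep]
        }
      }
      head(sort(scores, decreasing = TRUE), 5)       # top-5 candidates
    }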

We used the Coursera-supplied benchmark to test precision and runtime. We achieved an “Overall top-3 precision” of 19.34% with an average runtime of 57.41 ms. The maximum precision we could achieve was 20.5%, with more than 1,000 ms runtime.

We ran a baseline where the predicted word was simply the most frequent word in the corpus (“the”). Its precision was 10.78%, with an 8 ms runtime. Our application's precision is nearly twice as good.

The web application

The application is deployed on https://bruno-hanzen.shinyapps.io/NextWord/

User manual:

  • Text input: enter a text in the text box below the “ENTER YOUR TEXT HERE” label
  • Prediction: the predicted word is displayed in the greyed text box below the “PREDICTED WORD” label
  • A view of the inner workings is provided in the greyed text box below the “The inner working of the algorithm” label:
    • run time in ms (sorry, the headers are in French)
    • the 5 most probable words (under “end”) and their probabilities (under “proba”), in decreasing probability order. The first one is displayed above as “predicted”.

Application view