Next Worp Prediction

A Shiny web Application

Renaud Dufour

Introduction & project scope

Next word prediction (NWP) is the task of suggesting the most probable word a user will type next. Example of application are:

the SwiftKey smart keyboard
improving of the quality and quantity of written work
enhancing the development of written literacy skills
spelling assistance to people with various levels of spelling disorder

Objective of the project is to devellop a Shiny web application highliting a next word prediction algorithm. The main constraint is for the algorithm to have a small memory footprint so that it can run efficiently on a shiny server (limited to 512Mb of memory for medium instances). A second aspect is to achieve a small response time since we ideally want to predict words faster than the user types.

Architecture

The application is built on a 3-tier achitecture:

Front end: the user interface allowing the user to type in a sentence and displaying the predicted next words
Back End: a prediction algorithm querying the language model, the latter being stored in a local SQLite database
The language model: a SQLite database containing sequences of words and their probability. The model is built prior to uploading the application on the shiny server.

The above approach results in a nearly null memory footprint and fast response time. Median response time is typically about 50ms to predict 1 to 5 words. The response time can go up to about 300ms if a large number of words are to be retrieve due to recursive calls to lower order language models (which is however not the case envisioned for the target applications).

Application is available here

Model

The source data used to build the model consists in three text files of about 250Mb each (blog, news and twitter data). All three datasets are used to build the model.

The language model is built as follows:

ngrams of order 1 to 4 are extrated from each dataset using the MeTa C++ library. lowercase and alpha filters are used for preprocessing (the latter removes all non alphabetic characters). ngrams and their counts are output to text files.
ngram files are read into R and data for the three datasets are merged. ngrams with count < 5 are discarded in order to be able to handle the tables in memory.
from the merged ngram tables, language models are computed using interpolated Kneyser-Ney Smoothing method. Models are stored in a local SQLite table.
To perform prediction, the user input sentence is parsed in a similar way to the model input data. Then, look up are performed starting from the highest order model and words with highest Kneyser-Ney probability are retrieved.

Source Code and further improvements

The source code of the shiny App is available on my GitHub
More informations on the Kneyser-Ney smoothing method on wikipedia. I also recommend the Thesis from Martin Körner and this article.
Some potential improvements include :
- handling unknown words, either by limiting the vocabulary when building the model or by considering e.g. unigrams with count 1 as unknown words. This would enable to properly estimate the model accuracy by computing perplexity on a test sample.
- Using ngram encoding to decrease the size of the SQLite database
- Building the model by keeping upper/lowercase letter and punctuation to test in this enables prediction of more complex sentences
- using a tool like Spark to extract the ngram and build the model, in order achieve a scalable solution
Feel free to contact me for any question or suggestion !