Text Prediction App

Emilio González
August 2015

An application for next-word prediction that uses Natural Language Processing techniques and has been highly optimised for speed (low CPU requirements) and a low memory footprint.

Application Key Features

The R app is hosted on shinyapps.io. Visit the application link to try it. Its main characteristics are:

  • Low memory footprint: 25 MB of RAM (less on storage)
  • High speed: near-instantaneous prediction (<20 ms) after pressing the Predict button; processes more than 50 complete sentences (word predictions) per second in batch mode
  • Quick initialization (small input data file)
  • High accuracy: 15% for the best prediction and 21% for the top 3 words, measured by benchmark.r
  • Scalability: training with a new specialized corpus adds a new field of knowledge with a small increase in memory requirements and the same speed
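
As an illustration of how "best prediction" and "top 3" accuracy figures of this kind are typically computed, here is a minimal Python sketch. It is not the content of benchmark.r, which is not shown here. The function names, the toy predictor and the sample sentences are all hypothetical.

```python
def top_k_accuracy(predict, sentences, k=3):
    """Fraction of positions where the true next word is among the first k predictions."""
    hits = total = 0
    for sentence in sentences:
        words = sentence.lower().split()
        # Every word after the first is a prediction target; the words before it are the context.
        for i in range(1, len(words)):
            context, target = words[:i], words[i]
            if target in predict(context)[:k]:
                hits += 1
            total += 1
    return hits / total if total else 0.0


def toy_predict(context):
    """Hypothetical stand-in for the real predictor: always returns the same ranked words."""
    return ["the", "of", "and"]


sentences = ["one of the best", "the end of the day"]
top1 = top_k_accuracy(toy_predict, sentences, k=1)
top3 = top_k_accuracy(toy_predict, sentences, k=3)
print(f"top-1: {top1:.1%}  top-3: {top3:.1%}")
```

Any predictor that takes a list of context words and returns a ranked list of candidates can be passed in place of `toy_predict`.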

Instructions and Interface

App interface

Algorithm

  • Training phase with the complete corpus: cleaning, creation of 18,772,636 n-grams up to order 6 with count > 1, and subsequent pruning of n-grams by count and by heuristics, leaving out 2,786,639 n-grams
  • Interpolated backoff algorithm (with manually adjusted parameters) to process each n-gram and rank its predictions, taking the lower-order n-grams into account
  • Keep only the 3 best predictions for each n-gram and remove all the others, which leaves fewer than 2M n-grams
  • Storage optimization: each n-gram is substituted by a hash signature (32-bit integer), which reduces space dramatically without any loss in accuracy (collisions are handled with a traditional 'plain text' table)
  • Dictionary referenced by integer numbers
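
The steps above can be sketched roughly as follows, in Python rather than the app's R. This is a simplified illustration under stated assumptions, not the actual implementation: the interpolation weights (`lambdas`), the use of CRC32 as the 32-bit signature, the toy counts and every function name are hypothetical.

```python
import zlib
from collections import defaultdict


def signature(context):
    """32-bit signature of a context (tuple of words). CRC32 is an assumed stand-in hash."""
    return zlib.crc32(" ".join(context).encode("utf-8")) & 0xFFFFFFFF


def interpolated_scores(context, counts, lambdas):
    """Mix relative frequencies from the context and all its shorter suffixes.

    counts: {context tuple: {next word: count}}; lambdas[k] weights the suffix of length k.
    """
    scores = defaultdict(float)
    for k in range(len(context) + 1):
        suffix = context[len(context) - k:]
        followers = counts.get(suffix, {})
        total = sum(followers.values())
        for word, count in followers.items():
            scores[word] += lambdas[k] * count / total
    return dict(scores)


def build_tables(scored, hash_fn=signature, top_n=3):
    """Keep the top_n words per context and index them by signature.

    Returns (dictionary, sig_table, plain_table); contexts whose signatures
    collide are stored under their plain text instead.
    """
    dictionary, word_index = [], {}
    by_sig = defaultdict(list)
    for context, candidates in scored.items():
        best = sorted(candidates, key=candidates.get, reverse=True)[:top_n]
        ids = []
        for word in best:
            if word not in word_index:
                word_index[word] = len(dictionary)
                dictionary.append(word)
            ids.append(word_index[word])
        by_sig[hash_fn(context)].append((context, ids))
    sig_table, plain_table = {}, {}
    for sig, entries in by_sig.items():
        if len(entries) == 1:
            sig_table[sig] = entries[0][1]
        else:
            for context, ids in entries:
                plain_table[context] = ids
    return dictionary, sig_table, plain_table


# Toy counts (hypothetical); the empty tuple holds unigram counts.
counts = {
    (): {"the": 6, "for": 2, "are": 2},
    ("you",): {"are": 3, "for": 1},
    ("thank", "you"): {"for": 4, "very": 1},
}
lambdas = [0.1, 0.3, 0.6]  # assumed weights for suffix lengths 0, 1, 2
scored = {c: interpolated_scores(c, counts, lambdas) for c in counts if c}
dictionary, sig_table, plain_table = build_tables(scored)
```

Because the interpolation is done once at training time, the stored table only needs the final ranked word indices for each signature.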

Performance

  • Prediction time under 15 ms: most of the time is spent filtering user input and testing for profanity; the prediction itself is just a search in the signatures table to obtain the predicted words from the matching records (a simple stupid backoff at execution time, since everything at a lower level is preprocessed).
  • Very low memory footprint: each record in the lookup table is just 8 bytes (4 for the signature and 4 for the index of the predicted word in the dictionary). Additional tables are small (a dictionary with 200,000 entries, a set of predicted words with 64,000, and a bad-words list for profanity with 110)
  • Scalability: easy to add more training documents (e.g. specialized subjects) without impact on prediction time and only increasing storage; small devices can handle a huge amount of data thanks to the chosen data representation.
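
To make the runtime lookup and the 8-byte record size concrete, here is a minimal Python sketch (the app itself is written in R). It assumes CRC32 as the 32-bit signature, a sorted signature list searched by bisection, and a context of at most 5 words (matching n-grams up to order 6). These choices, the toy table and all names are hypothetical, and input filtering, profanity checks and collision handling are omitted.

```python
import bisect
import struct
import zlib

# One lookup record: 4-byte signature + 4-byte dictionary index = 8 bytes.
RECORD = struct.Struct("<II")


def signature(words):
    """32-bit signature of a word sequence. CRC32 is an assumed stand-in hash."""
    return zlib.crc32(" ".join(words).encode("utf-8")) & 0xFFFFFFFF


def build(records_by_context, dictionary):
    """Return parallel lists (signatures, word indices), sorted by signature then rank."""
    rows = []
    for context, words in records_by_context.items():
        for rank, word in enumerate(words):
            rows.append((signature(context), rank, dictionary.index(word)))
    rows.sort()
    return [r[0] for r in rows], [r[2] for r in rows]


def predict(text, sigs, idxs, dictionary, max_context=5):
    """Try the longest context first; on a miss, drop the oldest word and retry."""
    words = text.lower().split()[-max_context:]
    while words:
        sig = signature(words)
        lo = bisect.bisect_left(sigs, sig)
        hi = bisect.bisect_right(sigs, sig)
        if lo < hi:  # matching records are contiguous and already ranked
            return [dictionary[i] for i in idxs[lo:hi]]
        words = words[1:]
    return []


dictionary = ["for", "very", "so", "are", "can"]
table = {("thank", "you"): ["for", "very", "so"], ("you",): ["are", "can"]}
sigs, idxs = build(table, dictionary)

print(predict("I just wanted to thank you", sigs, idxs, dictionary))
print(predict("how are you", sigs, idxs, dictionary))

# Upper bound on the lookup table size, taking 2M records from the figures above.
upper_bound_bytes = 2_000_000 * RECORD.size
print(f"{upper_bound_bytes / 2**20:.1f} MiB")
```

With 8 bytes per record, 2 million records come to 16,000,000 bytes (about 15.3 MiB), which fits the stated 25 MB footprint once the small additional tables are added.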