NLP Word Prediction App

Julian Cook
21-Aug-2015

A fast, compact word-prediction application, suitable for use in mobile settings.

Introduction

This presentation briefly describes the natural-language word prediction app and the methods that drive its predictions.

User interface

  • The app has a very simple user interface: just enter a phrase in the text box.
  • Press the 'Predict Next Word' button.
  • After a short pause, the predicted word appears in blue to the right.
  • A diagnostic table below shows the possibilities that were considered.

Technology and data that drive the app

To drive the app, data was gathered by sampling 50% of the English-language corpora available from corpora.heliohost.org:

    File             en_US.blogs   en_US.news   en_US.twitter
    Items per file       899,288     1,010,242      2,360,148
    Size (MB)              248.5         249.6          301.4

  • The data was cleaned by removing all non-English words and punctuation.
  • The data was transformed into 2-grams, 3-grams, 4-grams, and 5-grams.
  • The n-grams were stored in tables in a compact SQLite database, as sketched below.
  • The SQLite database can be searched, and matches returned, very quickly using specific search algorithms.
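
The app's own implementation is not shown here; the following Python/SQLite sketch, with hypothetical table and column names (gramN, prefix, word, count), illustrates one way such n-gram tables could be laid out and indexed:

    import sqlite3

    conn = sqlite3.connect("ngrams.db")
    cur = conn.cursor()

    for n in (2, 3, 4, 5):
        # Each n-gram table stores the (n-1)-word prefix, the final word,
        # and the number of times the full n-gram appeared in the corpus.
        cur.execute(
            "CREATE TABLE IF NOT EXISTS gram{0} ("
            " prefix TEXT NOT NULL,"
            " word TEXT NOT NULL,"
            " count INTEGER NOT NULL)".format(n)
        )
        # Index the prefix so each lookup is a single B-tree search.
        cur.execute(
            "CREATE INDEX IF NOT EXISTS idx_gram{0} ON gram{0}(prefix)".format(n)
        )

    # Hypothetical example row: the 3-gram "at the end" observed 42 times.
    cur.execute("INSERT INTO gram3 VALUES (?, ?, ?)", ("at the", "end", 42))
    conn.commit()
    conn.close()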

Algorithmic approach

  • The approach used is a blend of a Google machine-translation method called 'Stupid (naive) Backoff' (Brants 2007),
  • combined with a Generalized Language Model (Korner 2013),
  • which is itself extended here to include a shift in position as well as a wildcard match.

The score of a word is first computed as: \[ Score \left({ word_{i} \vert word^{i-1}_{i-k} } \right) = \left({ \frac{Count\left(word^{i}_{i-k}\right)} {Count\left(word^{i-1}_{i-k}\right)}}\right) \] where the initial word sequences \( \left(word^{i-1}_{i-4}\right) \) searched are 5-grams (k = 4). We are looking for the maximum-likelihood 5-gram matching the word sequence.
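
As an illustration with hypothetical counts: if the 4-word context "at the end of" occurs 120 times in the corpus, and the 5-gram "at the end of the" occurs 90 times, then \( Score\left(\text{the} \mid \text{at the end of}\right) = \frac{90}{120} = 0.75 \).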

Backoff algorithm and Architecture

If the 5-gram search fails, we recursively back off to 4-grams and 3-grams. Failing this, we perform GLM (wildcard) searches on the 5-grams and also shift the position of a \( \left(word^{i-1}_{i-2}\right) \) search inside the 5-gram table to maximize our chance of success. Any result below the 5-gram level is penalized by a linear ratio of n/5, where n is the order of the n-gram searched. A sketch of this backoff appears below.
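
A minimal sketch of the backoff search, assuming the hypothetical gramN(prefix, word, count) schema from the earlier example; the predict function and its wildcard pattern are illustrative, not the app's actual code:

    import sqlite3

    def predict(conn, words):
        cur = conn.cursor()
        # Exact prefix matches first, backing off from 5-grams to 2-grams.
        for n in (5, 4, 3, 2):
            if len(words) < n - 1:
                continue
            prefix = " ".join(words[-(n - 1):])
            row = cur.execute(
                "SELECT word, count FROM gram{} "
                "WHERE prefix = ? ORDER BY count DESC LIMIT 1".format(n),
                (prefix,),
            ).fetchone()
            if row:
                word, count = row
                # Count of the context = sum of counts of n-grams sharing it.
                total = cur.execute(
                    "SELECT SUM(count) FROM gram{} WHERE prefix = ?".format(n),
                    (prefix,),
                ).fetchone()[0]
                # Apply the linear n/5 penalty to sub-5-gram results.
                return word, (n / 5) * count / total
        # GLM-style wildcard fallback on the 5-gram table: keep the first
        # three context words and let the fourth match anything.
        if len(words) >= 4:
            pattern = " ".join(words[-4:-1]) + " %"
            row = cur.execute(
                "SELECT word FROM gram5 WHERE prefix LIKE ? "
                "ORDER BY count DESC LIMIT 1",
                (pattern,),
            ).fetchone()
            if row:
                return row[0], 0.0  # wildcard scoring omitted in this sketch
        return None, 0.0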

Searches are fast, since the database used by the app has pre-indexed tables consisting of 5-grams, 4-grams, 3-grams, and 2-grams. Only the GLM search takes more than 3 seconds.
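
As a quick sanity check (again assuming the hypothetical schema and index names above), SQLite's query planner can confirm that a prefix lookup uses the index, i.e. a single B-tree search rather than a full table scan:

    import sqlite3

    conn = sqlite3.connect("ngrams.db")
    for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT word FROM gram5 WHERE prefix = ?",
        ("at the end of",),
    ):
        print(row)
    # Expected detail (roughly): SEARCH gram5 USING INDEX idx_gram5 (prefix=?)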

Possible Improvements

  • The app is already fast and compact enough for mobile use.
  • As the search backs off to lower-order n-grams (3-grams and 2-grams), context is increasingly lost.
  • This hurts the suitability of the predicted word.
  • It might be better to enlarge the 5-gram corpus and work with that directly.
  • Adding punctuation back would also help preserve context.