NLP Word Prediction App

Julian Cook
21-Aug-2015

A fast, compact word-prediction application, suitable for use in mobile settings.

Introduction

This presentation briefly describes the natural-language word prediction app and the methods that drive its predictions.

User interface

  • The app has a very simple user interface: just enter a phrase in the text box.
  • Press the 'Predict Next Word' button.
  • After a short pause, the predicted word appears in blue to the right.
  • A diagnostic table below shows the possibilities that were considered.

Technology and data that drive the app

To drive the app, data was gathered by sampling 50% of the English-language corpora available from corpora.heliohost.org:

    File             en_US.blogs   en_US.news   en_US.twitter
    Items per file       899,288     1,010,242      2,360,148
    Size (MB)              248.5         249.6          301.4

  • The data was cleaned by removing all non-English words and punctuation.
  • The data was transformed into 2-grams, 3-grams, 4-grams, and 5-grams.
  • The n-grams were stored in tables in a compact SQLite database, as sketched below.
  • The SQLite database can be searched, and matches returned, very quickly using specific search algorithms.
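
The app's own implementation is not shown here; the following Python/SQLite sketch, with hypothetical table and column names (gramN, prefix, word, count), illustrates one way such n-gram tables could be laid out and indexed:

    import sqlite3

    conn = sqlite3.connect("ngrams.db")
    cur = conn.cursor()

    for n in (2, 3, 4, 5):
        # Each n-gram table stores the (n-1)-word prefix, the final word,
        # and the number of times the full n-gram appeared in the corpus.
        cur.execute(
            "CREATE TABLE IF NOT EXISTS gram{0} ("
            " prefix TEXT NOT NULL,"
            " word TEXT NOT NULL,"
            " count INTEGER NOT NULL)".format(n)
        )
        # Index the prefix so each lookup is a single B-tree search.
        cur.execute(
            "CREATE INDEX IF NOT EXISTS idx_gram{0} ON gram{0}(prefix)".format(n)
        )

    # Hypothetical example row: the 3-gram "at the end" observed 42 times.
    cur.execute("INSERT INTO gram3 VALUES (?, ?, ?)", ("at the", "end", 42))
    conn.commit()
    conn.close()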

Algorithmic approach

  • The approach used is a blend of a Google machine-translation method called 'Stupid (naive) Backoff' (Brants 2007),
  • combined with a Generalized Language Model (Korner 2013),
  • which is itself extended here to include a shift in position as well as a wildcard match.

The score of a word is first computed as: \[ Score \left({ word_{i} \vert word^{i-1}_{i-k} } \right) = \left({ \frac{Count\left(word^{i}_{i-k}\right)} {Count\left(word^{i-1}_{i-k}\right)}}\right) \] where the initial word sequences \( \left(word^{i-1}_{i-4}\right) \) searched are 5-grams (k = 4). We are looking for the maximum-likelihood 5-gram matching the word sequence.
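
As an illustration with hypothetical counts: if the 4-word context "at the end of" occurs 120 times in the corpus, and the 5-gram "at the end of the" occurs 90 times, then \( Score\left(\text{the} \mid \text{at the end of}\right) = \frac{90}{120} = 0.75 \).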

Backoff algorithm and Architecture

If the 5-gram search fails, we recursively back off to 4-grams and 3-grams. Failing this, we perform GLM (wildcard) searches on the 5-grams and also shift the position of a \( \left(word^{i-1}_{i-2}\right) \) search inside the 5-gram table to maximize our chance of success. Any result below the 5-gram level is penalized by a linear ratio of n/5, where n is the order of the n-gram searched. A sketch of this backoff appears below.
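
A minimal sketch of the backoff search, assuming the hypothetical gramN(prefix, word, count) schema from the earlier example; the predict function and its wildcard pattern are illustrative, not the app's actual code:

    import sqlite3

    def predict(conn, words):
        cur = conn.cursor()
        # Exact prefix matches first, backing off from 5-grams to 2-grams.
        for n in (5, 4, 3, 2):
            if len(words) < n - 1:
                continue
            prefix = " ".join(words[-(n - 1):])
            row = cur.execute(
                "SELECT word, count FROM gram{} "
                "WHERE prefix = ? ORDER BY count DESC LIMIT 1".format(n),
                (prefix,),
            ).fetchone()
            if row:
                word, count = row
                # Count of the context = sum of counts of n-grams sharing it.
                total = cur.execute(
                    "SELECT SUM(count) FROM gram{} WHERE prefix = ?".format(n),
                    (prefix,),
                ).fetchone()[0]
                # Apply the linear n/5 penalty to sub-5-gram results.
                return word, (n / 5) * count / total
        # GLM-style wildcard fallback on the 5-gram table: keep the first
        # three context words and let the fourth match anything.
        if len(words) >= 4:
            pattern = " ".join(words[-4:-1]) + " %"
            row = cur.execute(
                "SELECT word FROM gram5 WHERE prefix LIKE ? "
                "ORDER BY count DESC LIMIT 1",
                (pattern,),
            ).fetchone()
            if row:
                return row[0], 0.0  # wildcard scoring omitted in this sketch
        return None, 0.0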

Searches are fast, since the database used by the app has pre-indexed tables consisting of 5-grams, 4-grams, 3-grams, and 2-grams. Only the GLM search takes more than 3 seconds.
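
As a quick sanity check (again assuming the hypothetical schema and index names above), SQLite's query planner can confirm that a prefix lookup uses the index, i.e. a single B-tree search rather than a full table scan:

    import sqlite3

    conn = sqlite3.connect("ngrams.db")
    for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT word FROM gram5 WHERE prefix = ?",
        ("at the end of",),
    ):
        print(row)
    # Expected detail (roughly): SEARCH gram5 USING INDEX idx_gram5 (prefix=?)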

Possible Improvements

  • The app is already fast and compact enough for mobile use.
  • As the search backs off to lower-order n-grams (3-grams and 2-grams), context is increasingly lost.
  • This hurts the suitability of the predicted word.
  • It might be better to enlarge the 5-gram corpus and work with that directly.
  • Adding punctuation back would also help preserve context.