Julian Cook
21-Aug-2015
A fast, compact application for word prediction, capable of being used in mobile settings.
This presentation is a brief description of the Natural Language word prediction app and the methods that drive the predictions.
To drive the app, data was gathered by sampling 50% of each of the following English-language corpora (source: corpora.heliohost.org):
| Statistic | en_US.blogs | en_US.news | en_US.twitter |
|---|---|---|---|
| Items per file | 899,288 | 1,010,242 | 2,360,148 |
| File size (MB) | 248.5 | 249.6 | 301.4 |
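The 50% sampling step can be sketched as follows. This is a minimal illustration that assumes each item is one line of text and that lines are kept independently at random; the function name `sample_lines`, the `seed` parameter and the per-line sampling scheme are assumptions, not details taken from the app.

```python
import random


def sample_lines(lines, fraction=0.5, seed=42):
    """Keep each line independently with probability `fraction`.

    Hypothetical helper: the per-line scheme and the fixed seed are
    assumptions made so that the sample is reproducible.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


if __name__ == "__main__":
    corpus = [f"item {i}" for i in range(10)]
    # Roughly half of the items survive; the exact subset depends on the seed.
    print(sample_lines(corpus))
```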
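The score of a word is first computed as: \[ Score \left({ word_{i} \vert word^{i-1}_{i-k} } \right) = \left({ \frac{Count\left(word^{i}_{i-k}\right)} {Count\left(word^{i-1}_{i-k}\right)}}\right) \] where the numerator counts occurrences of the context followed by \( word_{i} \) and the denominator counts occurrences of the context alone. The initial context \( \left(word^{i-1}_{i-4}\right) \) is the four preceding words (k=4), so the first sequences searched are 5-grams. We are looking for the maximum-likelihood 5-gram matching the word sequence.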
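As a small worked illustration, the maximum-likelihood score can be read as the number of times the context is followed by the candidate word, divided by the number of times the context occurs. The sketch below uses a toy sentence and a two-word context so the counts are easy to check by hand; the helper names `ngram_counts` and `mle_score` are hypothetical and not part of the app.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mle_score(context, word, counts_full, counts_context):
    """Count(context + word) / Count(context), or 0.0 for an unseen context."""
    context = tuple(context)
    denominator = counts_context.get(context, 0)
    if denominator == 0:
        return 0.0
    return counts_full.get(context + (word,), 0) / denominator


# Toy corpus; a two-word context (k=2) keeps the arithmetic visible.
tokens = "the cat sat on the mat the cat sat on the rug".split()
trigrams = ngram_counts(tokens, 3)
bigrams = ngram_counts(tokens, 2)

# "on the" occurs twice, followed once by "mat" and once by "rug".
print(mle_score(["on", "the"], "mat", trigrams, bigrams))  # 0.5
```

With k=4 the same function would be called with 5-gram and 4-gram count tables instead.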
If the 5-gram search fails, we recursively back off to 4-grams and then 3-grams. Failing this, we perform GLM (wild-card) searches on 5-grams and also shift the position of a \( \left(word^{i-1}_{i-2}\right) \) search inside the 5-gram table to maximize our chance of success. We penalize any sub-5-gram result by a linear factor of n/5, where n is the order of the n-gram searched for.
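The back-off and the n/5 penalty can be sketched as below. This is a simplified illustration, not the app's code: it assumes the scores are already stored in a hypothetical `tables` dictionary keyed by n-gram order, and it omits the wild-card and shifted-position searches entirely.

```python
def predict_with_backoff(context, tables, max_order=5, min_order=3):
    """Try the highest-order table first, then back off to lower orders.

    `tables` (hypothetical layout) maps an order n to a dict that maps an
    (n-1)-word context tuple to a dict of {candidate word: score}.
    Returns (word, penalized score, order used), or None if nothing matched.
    """
    for n in range(max_order, min_order - 1, -1):
        if len(context) < n - 1:
            continue  # not enough preceding words for this order
        key = tuple(context[-(n - 1):])
        candidates = tables.get(n, {}).get(key)
        if candidates:
            word, score = max(candidates.items(), key=lambda item: item[1])
            # Linear penalty: a result found at order n is scaled by n / 5.
            return word, score * n / max_order, n
    return None


# Toy tables: one 5-gram entry and one 3-gram entry.
tables = {
    5: {("have", "a", "very", "nice"): {"day": 0.6, "time": 0.3}},
    3: {("very", "nice"): {"day": 0.8}},
}

print(predict_with_backoff(["have", "a", "very", "nice"], tables))
print(predict_with_backoff(["what", "a", "very", "nice"], tables))
```

The first call matches the 5-gram table and is not penalized; the second falls through to the 3-gram table, so its score is multiplied by 3/5.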
Searches are fast, since the database used by the app has pre-indexed tables of 5-grams, 4-grams, 3-grams and 2-grams. Only the GLM search takes more than 3 seconds.
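To illustrate why an indexed table makes exact lookups fast, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. The choice of SQLite, the table name `fivegrams`, its column names and the sample rows are all assumptions made for the example; the presentation does not say which database or schema the app uses.

```python
import sqlite3

# Hypothetical schema: one column per word plus a precomputed score.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fivegrams "
    "(w1 TEXT, w2 TEXT, w3 TEXT, w4 TEXT, w5 TEXT, score REAL)"
)
# Composite index on the four context words, so an exact-context lookup
# does not need to scan the whole table.
conn.execute("CREATE INDEX idx_fivegrams_context ON fivegrams (w1, w2, w3, w4)")
conn.executemany(
    "INSERT INTO fivegrams VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("have", "a", "very", "nice", "day", 0.6),
        ("have", "a", "very", "nice", "time", 0.3),
        ("at", "the", "end", "of", "the", 0.9),
    ],
)


def lookup_best(conn, context):
    """Return (word, score) for the best 5-gram completion, or None."""
    if len(context) != 4:
        raise ValueError("a 5-gram lookup needs exactly four context words")
    query = (
        "SELECT w5, score FROM fivegrams "
        "WHERE w1 = ? AND w2 = ? AND w3 = ? AND w4 = ? "
        "ORDER BY score DESC LIMIT 1"
    )
    return conn.execute(query, tuple(context)).fetchone()


print(lookup_best(conn, ["have", "a", "very", "nice"]))
```

A pattern search that leaves the leading context words unconstrained generally cannot use an index like this one and falls back to scanning the table, which would be consistent with the wild-card search being the slow case.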