NextWord App

Benjamin Rouillé d'Orfeuil
25 March 2017

A Cool App!

NextWord aims to predict the word most likely to follow a meaningful sequence of words.

This is achieved by:

  • ingesting a large corpus made of blog posts, news articles and tweets;
  • building a 5-gram language model;
  • and using a backoff mechanism to rank candidates.

Code: https://github.com/rouille/NextWord
App: https://rouille.shinyapps.io/NextWord

Language Model

  • Clean the corpus. The procedure is described here.
  • Extract sequences of \( n \) words, with \( n \) in [1, 5], using the ngram package.
  • Prune low-count \( n \)-grams. The top 5-grams are shown below:

[Figure: plot of the top 5-grams]
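The extraction and pruning steps above can be sketched in a few lines. This is a minimal Python illustration, not the app's R code: the function names (`extract_ngrams`, `count_ngrams`, `prune`) and the pruning threshold `min_count=2` are assumptions made for the example, and the input is assumed to be already cleaned and tokenized.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(sentences, max_n=5):
    """Count n-grams for n = 1..max_n over a list of tokenized sentences."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            counts[n].update(extract_ngrams(tokens, n))
    return counts

def prune(counter, min_count=2):
    """Drop n-grams seen fewer than min_count times (threshold is an assumption)."""
    return Counter({gram: c for gram, c in counter.items() if c >= min_count})

# Toy corpus, already cleaned and tokenized.
sentences = [
    "thanks for the follow".split(),
    "thanks for the help".split(),
]
counts = count_ngrams(sentences)
trigrams = prune(counts[3])  # keeps only ("thanks", "for", "the")
```

Pruning trades a little coverage for a much smaller table, since most distinct n-grams in a large corpus are seen only once or twice.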

Next Word Candidates Ranking

Use the Stupid Backoff model, as described in this article.

If \( w_1^{l} \) is a string of \( l \) tokens, then the score, \( S \), of a prediction is computed using the relative frequencies of \( n \)-grams:

\[ S(w_i|w^{i-1}_{i-k+1}) = \begin{cases} \frac{f(w^{i}_{i-k+1})}{f(w^{i-1}_{i-k+1})} & \text{if}~f(w^{i}_{i-k+1}) > 0 \\ 0.4 \times S(w_i|w^{i-1}_{i-k+2}) & \text{otherwise} \end{cases} \]

where \( k \) is the order of the \( n \)-gram, starting at \( k \) = 5 and decreasing by one at each backoff step. The recursion ends at unigrams (\( k \) = 1):

\[ S(w_i) = \frac{f(w_i)}{N}~\text{with}~N~\text{the size of the training corpus} \]
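The recursion above translates almost directly into code. The following is a minimal Python sketch under stated assumptions, not the app's implementation: `build_counts`, `score` and `predict` are hypothetical names, all n-gram counts live in one table keyed by token tuples, and candidates are ranked by brute force over the whole vocabulary.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor from the formula above

def build_counts(tokens, max_n=5):
    """Count all n-grams (n = 1..max_n) in a token list, keyed by tuple."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def score(word, context, counts, total):
    """Stupid Backoff score S(word | context); context is a tuple of tokens."""
    if not context:
        # Unigram base case: relative frequency in the training corpus.
        return counts[(word,)] / total
    ngram = context + (word,)
    if counts[ngram] > 0:
        return counts[ngram] / counts[context]
    # Unseen n-gram: drop the oldest context token and discount by ALPHA.
    return ALPHA * score(word, context[1:], counts, total)

def predict(context, counts, total, top=3, max_n=5):
    """Rank every vocabulary word by its score and return the best ones."""
    context = tuple(context[-(max_n - 1):])  # keep at most 4 context tokens
    vocab = [gram[0] for gram in counts if len(gram) == 1]
    ranked = sorted(vocab, key=lambda w: score(w, context, counts, total),
                    reverse=True)
    return ranked[:top]

tokens = "the cat sat on the mat".split()
counts = build_counts(tokens)
total = len(tokens)
best = predict(["on", "the"], counts, total)
```

On this toy corpus, "mat" is scored from the trigram "on the mat", while "cat" is only reachable after one backoff step to the bigram "the cat", so its score is discounted by 0.4.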

Performance

We use an independent benchmark.

Results from about 30,000 predictions:

  • NextWord reaches an overall top-1 precision of 13.5% and an overall top-3 precision of 21.8%;
  • generates predictions in less than 12 ms;
  • and uses 244 MB of RAM.

NextWord occupies 34 MB of disk space.
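For readers unfamiliar with the top-1/top-3 precision figures above, here is one way such a metric can be computed. This Python sketch assumes that top-k precision means the fraction of test cases whose true next word appears among the first k ranked predictions; the benchmark's exact definition may differ, and `top_k_precision` is a hypothetical helper with made-up toy data.

```python
def top_k_precision(predictions, targets, k):
    """Fraction of cases where the true next word is in the first k predictions."""
    hits = sum(1 for preds, target in zip(predictions, targets)
               if target in preds[:k])
    return hits / len(targets)

# Toy data: ranked candidates per test case, and the word that actually followed.
predictions = [
    ["the", "a", "to"],
    ["of", "in", "mat"],
    ["you", "me", "it"],
    ["day", "time", "one"],
]
targets = ["the", "mat", "us", "time"]

top1 = top_k_precision(predictions, targets, k=1)
top3 = top_k_precision(predictions, targets, k=3)
```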