Benjamin Rouillé d'Orfeuil
25 March 2017
NextWord aims to predict the word most likely to follow a meaningful sequence of words.
This has been achieved by:
Code: https://github.com/rouille/NextWord
App: https://rouille.shinyapps.io/NextWord
NextWord uses the Stupid Backoff model as described in this article.
If \( w_1^{l} \) is a string of \( l \) tokens, then the score, \( S \), of a prediction is computed from the relative frequencies of \( n \)-grams:
\[ S(w_i|w^{i-1}_{i-k+1}) = \begin{cases} \frac{f(w^{i}_{i-k+1})}{f(w^{i-1}_{i-k+1})} & \text{if}~f(w^{i}_{i-k+1}) > 0 \\ 0.4 \times S(w_i|w^{i-1}_{i-k+2}) & \text{otherwise} \end{cases} \]
with \( k \) running from \( n \) = 5 down to 1. The recursion ends at unigrams (\( n \) = 1):
\[ S(w_i) = \frac{f(w_i)}{N}~\text{with}~N~\text{the size of the training corpus} \]
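The scoring rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not the NextWord implementation: the names `count_ngrams`, `score` and `predict` are hypothetical, the toy corpus is made up, and \( N \) is assumed to be the number of tokens in the training corpus.

```python
from collections import Counter

ALPHA = 0.4      # backoff factor from the formula above
MAX_ORDER = 5    # highest n-gram order considered (n = 5)


def count_ngrams(tokens, max_order=MAX_ORDER):
    """Count every n-gram of order 1..max_order in a list of tokens."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def score(word, context, counts, total, max_order=MAX_ORDER):
    """Stupid Backoff score S(word | context).

    `total` plays the role of N (assumed here: number of training tokens).
    """
    # Keep at most max_order - 1 context words.
    context = tuple(context)[-(max_order - 1):] if max_order > 1 else ()
    penalty = 1.0
    while context:
        seen = counts[context + (word,)]
        if seen > 0:
            # Relative frequency f(context + word) / f(context)
            return penalty * seen / counts[context]
        # Unseen: drop the oldest context word and apply the 0.4 penalty.
        context = context[1:]
        penalty *= ALPHA
    # Recursion ends at unigrams: f(word) / N
    return penalty * counts[(word,)] / total


def predict(context, counts, total, vocabulary):
    """Return the candidate word with the highest score."""
    return max(vocabulary, key=lambda w: score(w, context, counts, total))


# Toy corpus, for illustration only.
tokens = "the cat sat on the mat the cat ate".split()
counts = count_ngrams(tokens)
total = len(tokens)
vocabulary = sorted(set(tokens))

best = predict(["on", "the"], counts, total, vocabulary)  # "cat"
```

In the toy corpus the trigram "on the cat" never occurs, so the score for "cat" backs off to the bigram "the cat" and is multiplied by 0.4 once; "mat" wins no penalty but has a lower relative frequency after "on the" only when the trigram is absent, which is why comparing scores across orders is meaningful here.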
We use an independent benchmark.
Results from about 30,000 predictions:
NextWord occupies 34 MB of disk space.