Next word prediction Web App

Alexandre Nanchen
July 2016

The task

  1. Next word prediction using statistical N-gram models
  2. Model implementation and evaluation
  3. Challenges
    • Good generalization
    • Large amounts of data to prepare, filter and tokenize
    • Reduced language model memory footprint
    • Prediction responsiveness
    • Robustness to unknown words

Feature highlights

  1. Training
    • HC Corpora data: blogs, news and Twitter (> 100 million words)
    • Vocabulary selection to retain 90% of word occurrences (sketched after this list)
    • Out-of-vocabulary modeling using the <unk> symbol
    • Fast training time: < 40 seconds for more than 10 million N-grams
    • Interpolated Kneser-Ney model of order 4 with fixed smoothing parameters (formula at the end of this section)
    • Model pruning
  2. Evaluation
    • Model and perplexity comparison with the open source MITLM toolkit
    • Perplexity comparison per source (Twitter, blogs and news) and per model type
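
As a minimal sketch of the vocabulary selection and <unk> mapping described above; the function names, the coverage computation and the sample data are illustrative assumptions, not the app's actual training code:

    from collections import Counter

    def select_vocabulary(tokens, coverage=0.90):
        # Keep the most frequent words until they cover the requested
        # share of all token occurrences; everything else is OOV.
        counts = Counter(tokens)
        target = coverage * sum(counts.values())
        vocab, covered = set(), 0
        for word, count in counts.most_common():
            if covered >= target:
                break
            vocab.add(word)
            covered += count
        return vocab

    def map_oov(tokens, vocab, unk="<unk>"):
        # Replace out-of-vocabulary tokens with the <unk> symbol so the
        # model reserves probability mass for unseen words.
        return [t if t in vocab else unk for t in tokens]

    tokens = "the cat sat on the mat while the dog slept".split()
    vocab = select_vocabulary(tokens)
    print(map_oov(tokens, vocab))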

The full evaluation results are on the Web App.
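
For reference, a standard formulation of the interpolated Kneser-Ney recursion with a fixed discount $D$ (whether the app uses exactly this variant is not stated in the slides):

$$
P_{\mathrm{KN}}(w \mid h) = \frac{\max\big(c(h, w) - D,\, 0\big)}{c(h)} + \frac{D \, N_{1+}(h\,\cdot)}{c(h)} \, P_{\mathrm{KN}}(w \mid h')
$$

where $c(h, w)$ is the N-gram count (continuation counts at the lower orders), $N_{1+}(h\,\cdot)$ is the number of distinct words observed after context $h$, and $h'$ is $h$ with its oldest word dropped.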

Predictive algorithm

  1. Model compression at load time through N-gram context hashing (sketched below)
  2. Memory footprint reduction by a factor of 2.82
  3. Selection of the most probable words for each N-gram order using backoff weights and continuation probabilities
  4. Averaging of the probabilities across all N-gram orders (see the sketch after this list)
  5. Ordering by decreasing score and N-gram order
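
A minimal sketch of the context hashing in steps 1 and 2, assuming string contexts are replaced by 32-bit integers when the model is loaded; the CRC32 choice, the table layout and the sample entries are illustrative assumptions, not the app's actual scheme:

    import zlib

    def hash_context(context):
        # Map an N-gram context (tuple of words) to a 32-bit integer:
        # each key then costs a fixed 4 bytes instead of a variable-length
        # string, at the price of rare hash collisions.
        return zlib.crc32(" ".join(context).encode("utf-8"))

    # Hypothetical ARPA-style entries: (context, word) -> (log prob, backoff weight)
    ngrams = {
        (("one", "of"), "the"):  (-0.12, -0.45),
        (("of", "the"), "best"): (-0.80, -0.30),
    }

    # Rebuild the table keyed by (hashed context, word) while loading the model
    compressed = {(hash_context(ctx), w): v for (ctx, w), v in ngrams.items()}
    print(compressed[(hash_context(("one", "of")), "the")])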

For a detailed description, see the Algorithm tab on the Web App.
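
And a rough sketch of steps 3 to 5, assuming per-order tables of log probabilities; the backoff weighting is omitted for brevity, so this outlines the scoring flow rather than reproducing the deployed algorithm:

    from collections import defaultdict

    def predict(models, history, top_k=10):
        # models[n] is assumed to map a length n-1 context tuple to a dict
        # {word: log probability}; backoff weights would adjust the
        # lower-order scores but are left out of this sketch.
        scores = defaultdict(list)
        for n, model in models.items():
            context = tuple(history[-(n - 1):]) if n > 1 else ()
            for word, logp in model.get(context, {}).items():
                scores[word].append((n, logp))
        ranked = []
        for word, hits in scores.items():
            avg = sum(lp for _, lp in hits) / len(hits)  # average over orders
            best_order = max(n for n, _ in hits)
            ranked.append((word, avg, best_order))
        # Order by decreasing score, ties broken by the highest N-gram order
        ranked.sort(key=lambda t: (t[1], t[2]), reverse=True)
        return ranked[:top_k]

    models = {
        1: {(): {"the": -1.0, "a": -1.3}},
        2: {("of",): {"the": -0.2, "course": -0.9}},
    }
    print(predict(models, ["one", "of"]))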

Next word prediction Web App

  1. Display of the top 10 predicted words
  2. Verbose mode to see the algorithm in action
  3. Word cloud display of the 50 most frequent words
  4. Fast prediction time
