Antonio Rueda-Toicen
October 1st, 2016
The Shiny app here predicts the next word in a sentence using an N-gram backoff technique formulated by Brants et al. (Google guys) in “Large Language Models in Machine Translation.” N-gram histograms count how often each sequence of N consecutive words appears in a corpus of text. They serve as statistical models of which words are likely to follow others.
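As a rough illustration of what an N-gram histogram is, the sketch below counts bigrams in a toy sentence. It is written in Python rather than the app's R, and the function name `ngram_counts` and the sample sentence are assumptions made for this example, not part of the app.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus, invented for illustration only.
tokens = "the cat sat on the mat and the cat slept".split()
bigrams = ngram_counts(tokens, 2)

print(bigrams[("the", "cat")])  # 2: "the cat" occurs twice
print(bigrams[("the", "mat")])  # 1
```

The same function with `n=3` or `n=4` yields the trigram and 4-gram histograms that a backoff model draws on.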
From a training corpus of blog posts, news articles, and Twitter feeds, the app:
Try the app here. To store our training corpus we use a SQLite database that's good and fast enough for our use case.
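One way such a store could look is sketched below with Python's built-in `sqlite3` module. The table name `ngrams`, its columns, and the counts are all hypothetical; the app's actual schema is not shown in this document.

```python
import sqlite3

# In-memory database for the example; a real app would open a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ngrams (
        context TEXT NOT NULL,    -- all words but the last, space-joined
        word    TEXT NOT NULL,    -- final word of the n-gram
        freq    INTEGER NOT NULL, -- how often the n-gram was seen
        PRIMARY KEY (context, word)
    )
""")

# Made-up counts, for illustration only.
rows = [
    ("thanks for the", "follow", 12),
    ("thanks for the", "help", 7),
    ("for the", "first", 30),
]
conn.executemany("INSERT INTO ngrams VALUES (?, ?, ?)", rows)

def top_words(conn, context, k=3):
    """Return up to k (word, freq) pairs that follow `context`, most frequent first."""
    cur = conn.execute(
        "SELECT word, freq FROM ngrams WHERE context = ? "
        "ORDER BY freq DESC LIMIT ?",
        (context, k),
    )
    return cur.fetchall()

print(top_words(conn, "thanks for the"))  # [('follow', 12), ('help', 7)]
```

Because `(context, word)` is the primary key, SQLite can answer the lookup by context from the key's index instead of scanning the whole table.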
Input text is normalized:
We check whether a 4-gram known from the training corpus appears in the normalized input and, if so, use it to suggest a word. If not:
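A minimal sketch of this kind of fallback, in Python rather than the app's R: try the longest context first (three preceding words for a 4-gram), and on a miss drop the oldest word and discount the score. The names `build_model`, `predict`, and `ALPHA` are hypothetical. The factor 0.4 is the value Brants et al. report; that the app uses the same value is an assumption. Note that this is a simplification: it returns the top word of the longest context that matches, whereas the full scheme scores every candidate word separately.

```python
from collections import Counter, defaultdict

ALPHA = 0.4  # backoff discount from Brants et al.; assumed, not confirmed for the app

def build_model(tokens, max_n=4):
    """Map each context (0 to max_n-1 preceding words) to a Counter of next words."""
    model = defaultdict(Counter)
    for i in range(len(tokens)):
        for k in range(max_n):  # k = number of context words
            if i - k < 0:
                break
            model[tuple(tokens[i - k:i])][tokens[i]] += 1
    return model

def predict(model, words, max_n=4):
    """Return (word, score) from the longest known context, backing off on a miss."""
    penalty = 1.0
    for k in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - k:])  # last k words; () when k == 0
        followers = model.get(context)
        if followers:
            word, count = followers.most_common(1)[0]
            return word, penalty * count / sum(followers.values())
        penalty *= ALPHA  # discount once per backoff step
    return None, 0.0

# Toy corpus, invented for illustration only.
tokens = "thanks for the follow thanks for the help thanks for the follow".split()
model = build_model(tokens)

print(predict(model, ["thanks", "for", "the"]))  # 4-gram context is known
print(predict(model, ["all", "for", "the"]))     # backs off to "for the"
```

In the second call the three-word context was never seen, so the suggestion comes from the two-word context and its score is multiplied by `ALPHA` once.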