Pitch

Kevin S. (https://sg.linkedin.com/in/kevinsiswandi)
June, 2016

The source code of the product can be found here: https://github.com/Physicist91/swiftkey

This data product was created and refined iteratively through the following general stages:

Goal	Solution
To avoid storing the entire training data in memory for preprocessing the corpus	Created a permanent corpus on disk from the (raw) textual data
To transform the data into the right structure for analysis	Cleaned the data by applying various transformations to the corpus before N-gram tokenization.
To have a word dictionary from which the N-grams can be referenced	Created tables of trigrams, bigrams, and unigrams from the raw corpus.
To predict the next word given a sentence/phrase	Took the last N-1 words, sorted the matching N-grams by descending count, and output the top three candidates.
To handle unseen N-grams	Employed a backoff method that falls back to lower-order N-grams.
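
The last two rows can be illustrated with a minimal Python sketch. It is not the code from the repository: the function names `build_tables` and `predict_next` are hypothetical, the tables are kept in memory, and the backoff is a plain count-based fallback with no discounting or smoothing.

```python
from collections import Counter


def build_tables(tokens, max_n=3):
    """Count every 1-gram up to max_n-gram in a list of tokens."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            tables[n][tuple(tokens[i:i + n])] += 1
    return tables


def predict_next(phrase, tables, max_n=3, top_k=3):
    """Return up to top_k next-word candidates, backing off to shorter contexts."""
    words = phrase.lower().split()
    for n in range(max_n, 0, -1):
        if n - 1 > len(words):
            continue  # not enough input words for this order
        # Context is the last n-1 words (empty for unigrams).
        context = tuple(words[len(words) - (n - 1):])
        matches = Counter()
        for gram, count in tables[n].items():
            if gram[:-1] == context:
                matches[gram[-1]] += count
        if matches:
            # Highest counts first; stop at the first order that has a match.
            return [word for word, _ in matches.most_common(top_k)]
    return []


tokens = "i love you i love you i love cake i like tea".split()
tables = build_tables(tokens)
seen = predict_next("i love", tables)       # matched at the trigram level
backed_off = predict_next("we love", tables)  # falls back to bigrams
```

Here "we love" has no matching trigram, so the lookup falls back to bigrams starting with "love". A phrase with no known words falls all the way back to the most frequent unigrams.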

The backend model is completely general and can be trained on different text sources with little or no modification.
The English-language blogs data were used to train the model.
Only one table in CSV format, designed to contain 3-, 2-, and 1-grams, is used for the model in production.
End-of-sentence markers are encoded so that N-grams spanning multiple sentences can be discarded.
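
A rough Python sketch of the last two points follows. It rests on assumptions: the marker token `<eos>`, the tokenization rules, and the column layout `n, context, word, count` are all hypothetical, since the actual encoding and table design are not described here.

```python
import csv
import io
import re
from collections import Counter

EOS = "<eos>"  # hypothetical marker token; the real encoding is not specified


def tokenize_with_markers(text):
    """Lowercase the text and replace sentence-ending punctuation with EOS."""
    text = re.sub(r"[.!?]+", f" {EOS} ", text.lower())
    return re.findall(re.escape(EOS) + r"|[a-z']+", text)


def count_ngrams(tokens, max_n=3):
    """Count 1-grams up to max_n-grams, skipping any that contain EOS."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if EOS not in gram:  # drops N-grams that cross a sentence boundary
                counts[gram] += 1
    return counts


def to_csv(counts):
    """Write all N-gram orders into one table (assumed layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["n", "context", "word", "count"])
    # Longest N-grams first, then by descending count, then alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-len(kv[0]), -kv[1], kv[0]))
    for gram, count in ordered:
        writer.writerow([len(gram), " ".join(gram[:-1]), gram[-1], count])
    return buffer.getvalue()


tokens = tokenize_with_markers("I like tea. You like coffee!")
counts = count_ngrams(tokens)
table = to_csv(counts)
```

Because the marker sits between "tea" and "you", the bigram "tea you" is never counted, even though the two words are adjacent in the raw text.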

Note: the app is hosted on a free shinyapps.io account, so loading or refreshing it may take anywhere from a few seconds to a minute.

FEATURES: