Pitch

Kevin S. (https://sg.linkedin.com/in/kevinsiswandi)
June, 2016

Product Lifecycle

This data product was created and refined through iterative processes with the following general stages:

  • Getting and cleaning the data
  • Exploratory analysis
  • Using N-gram model to build a word prediction framework
  • Developing a predictive text application in shiny

The source code of the product can be found here: https://github.com/Physicist91/swiftkey

Analytic components

Goal Solution
To avoid storing the entire training data in memory for preprocessing the corpus Created a permanent corpus on disk from the (raw) textual data
To transform the data into the right structure for analysis Cleaned the data prior to N-gram tokenization by various transformations on the corpus.
To have a word dictionary from which the N-grams can be referenced Created tables of trigrams, bigrams, and unigrams from the raw corpus.
To predict the next word given a sentence/phrase Take the last N words, sort matching N-1 grams by descending count and output the top three.
To handle unseen N-grams Employ backoff method to fall back to lower grams.

Critical product decisions

  1. The backend model is completely general and can be used to train different text sources with little/no modifications.
  2. The blogs data in the English language are used to train the model.
  3. Only one table in csv format, properly designed to contain 3-, 2-, and 1-grams, is used for the model in production.
  4. End-of-sentence markers are encoded to discard N-grams spanning multiple sentences.

The shiny app

https://kevinsis.shinyapps.io/wordapp/

Note: the app was hosted on a free shinyapps.io and loading the app/refresh may take a few seconds to a minute.

FEATURES:

  • Simple: just enter a phrase/sentence; it just works.
  • Fast: near-instantaneous results.
  • Lightweight: minimum amount of RAM and processing power.