Swiftkey-Based Ngram Text Predictor

William Holst
12/15/2016

Final Project for the Coursera Data Science Capstone


What This Application Does

It predicts the next word of a phrase, or 'babbles' by extending the phrase with a run of predicted words.

  • Main Input
    • The user enters a brief phrase and selects 'Predict..' or 'Babble..'
    • The system 'cleans' the input and looks up the appropriate ngram
    • A back-off algorithm determines the 'best' available next word
    • The same algorithm generates a fun babble phrase if that option is selected
  • Frequent Ngrams - shows histograms of the most frequent phrases
  • About - explains how the algorithm works

The ngram tables were constructed from Swiftkey-provided text sets from Twitter, blogs, and news sources.
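
As a rough illustration of how ngram tables like these can be built, the sketch below cleans each line of text and counts every run of n consecutive words. It is not the project's actual code: the cleaning rules (lowercasing, keeping only letters and apostrophes), the function names `clean` and `count_ngrams`, and the three-line corpus are all assumptions made for the example.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase the text, replace anything that is not a letter,
    apostrophe, or space with a space, and split into words.
    (Assumed cleaning rules; the app's own rules may differ.)"""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return text.split()

def count_ngrams(lines, n):
    """Count every run of n consecutive words within each line."""
    counts = Counter()
    for line in lines:
        words = clean(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Tiny stand-in corpus, invented for this example.
corpus = [
    "Thanks for the follow!",
    "Thanks for the great day.",
    "Looking forward to the weekend",
]

bigrams = count_ngrams(corpus, 2)
trigrams = count_ngrams(corpus, 3)
print(bigrams.most_common(3))
```

The same `clean` step would be applied to the user's phrase before lookup, so that the input and the table keys are normalized in the same way.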

The User Interface

[Image: screenshot of the application's user interface]

The Algorithm and Performance

The application uses a simple back-off algorithm:

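  • For a phrase of length 3, pick the highest-probability next word from the quadgram table
  • If the phrase is not in the quadgram table, look up its last 2 words in the trigram table and pick the highest-probability next word
  • If those are not in the trigram table, look up the last word in the bigram table and pick the highest-probability next word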
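
The back-off steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the `TABLES` layout, the counts in it, and the names `predict` and `babble` are hypothetical. Raw counts stand in for probabilities, which gives the same winner within a single context.

```python
# Hypothetical toy tables: each maps a context (the preceding words)
# to candidate next words and their counts.
TABLES = {
    3: {("thanks", "for", "the"): {"follow": 5, "great": 2}},       # quadgram
    2: {("for", "the"): {"first": 9, "follow": 5}},                 # trigram
    1: {("the",): {"same": 20, "first": 12},
        ("follow",): {"back": 3}},                                  # bigram
}

def predict(words):
    """Return the most frequent next word, trying the longest
    available context first and backing off to shorter ones."""
    for k in (3, 2, 1):
        if len(words) < k:
            continue
        candidates = TABLES[k].get(tuple(words[-k:]))
        if candidates:
            return max(candidates, key=candidates.get)
    return None  # no table knows this context

def babble(words, length):
    """Extend the phrase by feeding each prediction back in as input."""
    words = list(words)
    for _ in range(length):
        next_word = predict(words)
        if next_word is None:
            break
        words.append(next_word)
    return words

print(predict(["thanks", "for", "the"]))   # found in the quadgram table
print(predict(["waiting", "for", "the"]))  # backs off to the trigram table
print(predict(["in", "the"]))              # backs off to the bigram table
print(" ".join(babble(["thanks", "for", "the"], 3)))
```

Because `babble` calls `predict` once per added word, a long babble is a convenient way to observe the cost of a single lookup.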

Performance of the algorithm

  • Accuracy - with random test cases of 2-, 3-, and 4-word phrases, approximately 40% correct hit rate
  • Lookup speed - easily observed with a long babble, approximately 0.2 seconds per lookup
  • App startup - around 15 seconds total for 4 tables of between 7 and 16 MB each
  • Note: the fivegram table is not used because its frequencies are too small to be useful
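
One plausible way to compute a hit rate like the one quoted above is to hold back the last word of each test phrase, predict it from the preceding words, and count exact matches. The sketch below shows only that bookkeeping. The predictor, the test phrases, and the resulting figure are invented for the example and say nothing about the real application's accuracy or how it was actually tested.

```python
# Hypothetical stand-in predictor: looks only at the last word of the context.
LOOKUP = {"the": "follow", "you": "soon"}

def toy_predict(context):
    """Return a guess for the next word, or None if the context is unknown."""
    if not context:
        return None
    return LOOKUP.get(context[-1])

def hit_rate(predict, test_phrases):
    """Fraction of phrases whose held-back final word is predicted exactly."""
    hits = 0
    for phrase in test_phrases:
        words = phrase.split()
        context, expected = words[:-1], words[-1]
        if predict(context) == expected:
            hits += 1
    return hits / len(test_phrases)

# Invented test phrases of 3 and 4 words.
TEST_PHRASES = [
    "thanks for the follow",
    "see you soon",
    "one of the best",
    "happy new year",
]

print(hit_rate(toy_predict, TEST_PHRASES))
```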

References