WordUp - Capstone Final Project Presentation

Fariz Abdul Rahman
8/31/2017

A word prediction app based on the Stupid Backoff 5-gram model

How does it work?

Key in text in the app and, after a few seconds, the top 5 candidates for the next word are served up.

Hence the product name, WordUp!

Under the hood, the app has two main components:

  1. The n-gram frequency matrix

    • Frequency matrices for 1-grams through 5-grams are built from a 15% random sample of the text corpora.
  2. The prediction algorithm

    • Based on the Stupid Backoff 5-gram model (pardon the language, but that is really what it is called!)
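The first component can be illustrated with a minimal sketch. It is written in Python for brevity and is not the app's R code: the helper names (sample_lines, tokenize, build_ngram_counts), the whitespace tokenizer, the per-line sampling, and the use of dictionaries of counters in place of frequency matrices are all assumptions made for illustration.

```python
import random
from collections import Counter

def sample_lines(lines, fraction=0.15, seed=42):
    """Keep each line independently with probability `fraction` (assumed sampling unit: lines)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def tokenize(line):
    """Very rough tokenizer (an assumption): lowercase, then split on whitespace."""
    return line.lower().split()

def build_ngram_counts(lines, max_n=5):
    """Return {n: Counter} where each Counter maps an n-word tuple to its frequency."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        words = tokenize(line)
        for n in range(1, max_n + 1):
            # Slide a window of n words across the line.
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts

# Toy corpus; on real data, pass the lines through sample_lines() first.
corpus = ["the cat sat on the mat", "the cat ate the fish"]
counts = build_ngram_counts(corpus, max_n=5)
print(counts[2][("the", "cat")])  # 2
print(counts[1][("the",)])        # 4
```

Counting every order from 1 to 5 in a single pass keeps the tables consistent with each other, which matters later because the count of an (n-1)-gram is used as the denominator when scoring an n-gram.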

Prediction Algorithm

  • When the app receives a string of text as input, the last four words are matched against the 5-gram frequency matrix.
  • If fewer than 5 candidates are matched, the last three words are matched against the 4-gram frequency matrix. This process continues with shorter n-grams until 5 candidates are found.
  • If 5 candidates have still not been found after matching 2-grams, the app uses the most frequent 1-grams to complete the list of candidates.
  • Finally, the list is sorted by descending score, calculated using the Stupid Backoff method, and presented as the result.
  • For further details on the Stupid Backoff algorithm, go here.
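The steps above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the app's R implementation: the predict function, the toy counts table, and the whitespace tokenizer are hypothetical, and the backoff factor of 0.4 is the value reported by Brants et al. (2007), since the factor the app uses is not stated here. A candidate found after backing off j levels is scored as 0.4^j × count(context + word) / count(context), with the total word count as the denominator for 1-grams.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor from Brants et al. (2007); the app's own value is an assumption

def build_counts(lines, max_n=5):
    """Toy n-gram tables: {n: Counter of n-word tuples} (hypothetical helper)."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        words = line.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts

def predict(counts, text, k=5, max_n=5):
    """Return up to k (word, score) pairs, backing off from max_n-grams to 1-grams."""
    words = text.lower().split()
    total_unigrams = sum(counts[1].values())
    scores = {}
    penalty = 1.0
    # Start from the longest n-gram order the input can support.
    start_n = min(max_n, len(words) + 1)
    for n in range(start_n, 0, -1):
        # The last n-1 words form the context (empty tuple for 1-grams).
        context = tuple(words[len(words) - (n - 1):])
        denom = counts[n - 1][context] if n > 1 else total_unigrams
        if denom > 0:
            matches = [(g[-1], c) for g, c in counts[n].items() if g[:-1] == context]
            matches.sort(key=lambda m: m[1], reverse=True)
            for word, c in matches:
                if word not in scores:  # keep the score from the highest order
                    scores[word] = penalty * c / denom
                if len(scores) == k:
                    break
        if len(scores) >= k:
            break
        penalty *= ALPHA  # each backoff step discounts the score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

counts = build_counts(["the cat sat on the mat", "the cat ate the fish"])
for word, score in predict(counts, "the cat"):
    print(f"{word}: {score:.4f}")
```

The linear scan over each table keeps the sketch short; a real app would index n-grams by their context so that lookups do not slow down as the tables grow.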

Predictive performance

  • The Stupid Backoff algorithm is inexpensive and approaches the quality of more expensive algorithms such as Kneser-Ney smoothing when the training dataset is large.

  • Its low cost helps offset the slow performance of R, which is great for setting up models and graphics but not for processing large amounts of data.

  • The corpus provided contains more than 75 million words and takes up 556 MB of storage.

  • On a 64-bit desktop with 12 GB of RAM, the largest attainable training dataset was a 15% random sample.

App screenshot

[Image: screenshot of the WordUp app]

After launching the app, the screen shown in the screenshot on the left should be visible.

A sample text input with the corresponding result will be visible.

The scores indicate the weight of each predicted word compared to other words in the list.

The app can be accessed here.