Data Science Capstone: Next Word Predictor

Hon Y. (Ken) Ho
Tue Apr 19 2016

Overview

The Shiny app Next Word Predictor takes a phrase entered by the user and returns a list of likely next words, using its built-in prediction algorithm.

  • The challenges of a web app
    • Speed - Predictions should take less than 1 second
    • Accuracy - The training data must be large enough for accurate predictions, yet small enough to keep the app responsive
    • Efficiency - The model should require relatively few resources
  • Final implementation
    • Overall performance was the priority
    • Trained on over 30 MB (about 10%) of the text data
    • A web-app-friendly, resource-light algorithm was used

How To Use This App

(Figure: instructions for using the app.)

Underlying Algorithm (1/2)

Implements the Stupid Backoff model, which, instead of interpolated smoothing, backs off to lower-order (n-1)-gram relative frequencies with a fixed discount. The scoring formula is shown below, and a code sketch follows the list.
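The backoff score as defined by Brants et al. (2007), written here for the trigram case in LaTeX notation, where $f(\cdot)$ is a raw corpus count:

    S(w_i \mid w_{i-2} w_{i-1}) =
    \begin{cases}
      \dfrac{f(w_{i-2} w_{i-1} w_i)}{f(w_{i-2} w_{i-1})} & \text{if } f(w_{i-2} w_{i-1} w_i) > 0 \\
      0.4 \cdot S(w_i \mid w_{i-1}) & \text{otherwise}
    \end{cases}

This app applies the same idea by gathering candidates from all three n-gram levels at once, with discounts of 1, 0.4, and 0.16 = 0.4^2 on their relative frequencies, as the list below describes.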

  • Finds trigram matches for the last 2 words entered
  • Finds bigram matches for the last word entered
    • Applies a fixed discount (0.4) to the scores of the matched bigrams
  • Also returns as many top unigrams as the number of predictions the user wants to see
    • Applies a fixed discount (0.16 = 0.4^2) to the scores of these unigrams
  • Finally, combines the matched n-grams and returns the words with the top scores
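A minimal sketch of this scoring step in R, assuming the frequency tables have already been loaded as data frames with columns prefix, word, and freq (all object and column names here are hypothetical illustrations, not the app's actual code):

    stupid_backoff <- function(last2, last1, trigrams, bigrams, unigrams, n = 5) {
      # Trigram candidates: relative frequency among trigrams sharing the prefix
      tri <- trigrams[trigrams$prefix == paste(last2, last1), ]
      tri_scores <- setNames(tri$freq / sum(tri$freq), tri$word)

      # Bigram candidates for the last word, discounted by 0.4
      bi <- bigrams[bigrams$prefix == last1, ]
      bi_scores <- setNames(0.4 * bi$freq / sum(bi$freq), bi$word)

      # Top-n unigrams, discounted by 0.4^2 = 0.16
      uni <- head(unigrams[order(-unigrams$freq), ], n)
      uni_scores <- setNames(0.16 * uni$freq / sum(unigrams$freq), uni$word)

      # Combine all candidates; where a word appears at several levels,
      # duplicated() keeps the first (highest-order) score
      scores <- c(tri_scores, bi_scores, uni_scores)
      scores <- scores[!duplicated(names(scores))]
      head(sort(scores, decreasing = TRUE), n)
    }

Because the discounted values are relative frequencies rather than smoothed probabilities (they need not sum to 1), Brants et al. call them "scores" instead of probabilities, which is what makes the method so cheap.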

Underlying Algorithm (2/2)

The Stupid Backoff model has the following advantages over other models:

  • Scalability - Has been trained on corpora of up to 2 trillion tokens (Brants et al., 2007)
  • Inexpensive - Requires fewer resources than, for example, Katz's Backoff model
  • Good accuracy - Approaches the quality of Kneser-Ney smoothing as the amount of training data increases

Performance Considerations

  • Splits the bigram and trigram frequency files by initial letter
    • 0.2 - 4.5 MB in size for bigrams
    • 0.5 - 20 MB in size for trigrams
  • Lazily loads the split frequency files (see the sketch after this list)
  • Uses new.env() to implement hash tables that store the split frequency files
  • Uses fread() from the data.table package for fast reading of the n-gram frequency files
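A minimal sketch of the lazy-loading cache, assuming the split bigram files are named bigrams_a.csv, bigrams_b.csv, and so on (the file names and helper function below are hypothetical, not the app's actual code):

    library(data.table)

    # An environment behaves like a hash table: lookups by name are constant time
    ngram_cache <- new.env(hash = TRUE)

    get_bigrams <- function(letter) {
      key <- paste0("bigrams_", letter)
      # Read the split file with fread() on first access only;
      # every later call for the same letter is served from the cache
      if (!exists(key, envir = ngram_cache, inherits = FALSE)) {
        assign(key, fread(paste0(key, ".csv")), envir = ngram_cache)
      }
      get(key, envir = ngram_cache, inherits = FALSE)
    }

    # e.g. get_bigrams("t") reads bigrams_t.csv once, then reuses the cached table

This keeps startup fast and memory use low: only the frequency files a user's input actually touches are ever read from disk.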