Data Science Capstone: Next Word Predictor

Hon Y. (Ken) Ho
Tue Apr 19 2016

Overview

The Shiny app Next Word Predictor takes a phrase entered by the user and returns a list of likely next words, using its built-in prediction algorithm.

  • The challenges of a web app
    • Speed - Predictions should take less than 1 second
    • Accuracy - The training data must be large enough for accurate predictions, yet small enough to keep the app responsive
    • Efficiency - The model should require relatively few resources
  • Final implementation
    • Overall performance was the priority
    • Trained on over 30 MB (about 10%) of the text data
    • A web-app-friendly, resource-light algorithm was used

How To Use This App

(Figure: instructions for using the app.)

Underlying Algorithm (1/2)

Implements the Stupid Backoff model, which, instead of interpolated smoothing, backs off to lower-order (n-1)-gram relative frequencies with a fixed discount. The scoring formula is shown below, and a code sketch follows the list.
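The backoff score as defined by Brants et al. (2007), written here for the trigram case in LaTeX notation, where $f(\cdot)$ is a raw corpus count:

    S(w_i \mid w_{i-2} w_{i-1}) =
    \begin{cases}
      \dfrac{f(w_{i-2} w_{i-1} w_i)}{f(w_{i-2} w_{i-1})} & \text{if } f(w_{i-2} w_{i-1} w_i) > 0 \\
      0.4 \cdot S(w_i \mid w_{i-1}) & \text{otherwise}
    \end{cases}

This app applies the same idea by gathering candidates from all three n-gram levels at once, with discounts of 1, 0.4, and 0.16 = 0.4^2 on their relative frequencies, as the list below describes.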

  • Finds trigram matches for the last 2 words entered
  • Finds bigram matches for the last word entered
    • Applies a fixed discount (0.4) to the scores of the matched bigrams
  • Also returns as many top unigrams as the number of predictions the user wants to see
    • Applies a fixed discount (0.16 = 0.4^2) to the scores of these unigrams
  • Finally, combines the matched n-grams and returns the words with the top scores
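A minimal sketch of this scoring step in R, assuming the frequency tables have already been loaded as data frames with columns prefix, word, and freq (all object and column names here are hypothetical illustrations, not the app's actual code):

    stupid_backoff <- function(last2, last1, trigrams, bigrams, unigrams, n = 5) {
      # Trigram candidates: relative frequency among trigrams sharing the prefix
      tri <- trigrams[trigrams$prefix == paste(last2, last1), ]
      tri_scores <- setNames(tri$freq / sum(tri$freq), tri$word)

      # Bigram candidates for the last word, discounted by 0.4
      bi <- bigrams[bigrams$prefix == last1, ]
      bi_scores <- setNames(0.4 * bi$freq / sum(bi$freq), bi$word)

      # Top-n unigrams, discounted by 0.4^2 = 0.16
      uni <- head(unigrams[order(-unigrams$freq), ], n)
      uni_scores <- setNames(0.16 * uni$freq / sum(unigrams$freq), uni$word)

      # Combine all candidates; where a word appears at several levels,
      # duplicated() keeps the first (highest-order) score
      scores <- c(tri_scores, bi_scores, uni_scores)
      scores <- scores[!duplicated(names(scores))]
      head(sort(scores, decreasing = TRUE), n)
    }

Because the discounted values are relative frequencies rather than smoothed probabilities (they need not sum to 1), Brants et al. call them "scores" instead of probabilities, which is what makes the method so cheap.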

Underlying Algorithm (2/2)

The Stupid Backoff model has the following advantages over other models:

  • Scalability - Has been trained on corpora of up to 2 trillion tokens (Brants et al., 2007)
  • Inexpensive - Requires fewer resources than, for example, Katz's Backoff model
  • Good accuracy - Approaches the quality of Kneser-Ney smoothing as the amount of training data increases

Performance Considerations

  • Splits the bigram and trigram frequency files by initial letter
    • 0.2 - 4.5 MB in size for bigrams
    • 0.5 - 20 MB in size for trigrams
  • Lazily loads the split frequency files (see the sketch after this list)
  • Uses new.env() to implement hash tables that store the split frequency files
  • Uses fread() from the data.table package for fast reading of the n-gram frequency files
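A minimal sketch of the lazy-loading cache, assuming the split bigram files are named bigrams_a.csv, bigrams_b.csv, and so on (the file names and helper function below are hypothetical, not the app's actual code):

    library(data.table)

    # An environment behaves like a hash table: lookups by name are constant time
    ngram_cache <- new.env(hash = TRUE)

    get_bigrams <- function(letter) {
      key <- paste0("bigrams_", letter)
      # Read the split file with fread() on first access only;
      # every later call for the same letter is served from the cache
      if (!exists(key, envir = ngram_cache, inherits = FALSE)) {
        assign(key, fread(paste0(key, ".csv")), envir = ngram_cache)
      }
      get(key, envir = ngram_cache, inherits = FALSE)
    }

    # e.g. get_bigrams("t") reads bigrams_t.csv once, then reuses the cached table

This keeps startup fast and memory use low: only the frequency files a user's input actually touches are ever read from disk.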