Capstone NLP Word Prediction

Phanindra Reddigari
1/18/2017

Overall Scope:

Analyze the SwiftKey text files (blogs, Twitter, and news) and develop a simple Shiny UI for next-word prediction in a free-form phrase. The UI accepts user input in a text box and, upon submission, predicts the next word in the phrase. Highlights of the Shiny App design are:

  • Implementation of Dan Jurafsky's n-Gram interpolation model for NLP
  • Hash Map indexes between 4-Grams, 3-Grams, 2-Grams, and 1-Grams for fast lookups (primary and foreign key relationships; see the sketch after this list)
  • Hash Map columns for computing n-Gram probabilities from conditional probabilities
  • Trading off accuracy for speed and memory (reducing the size of the n-Gram tables)
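As a concrete illustration of the index-building idea, the sketch below links each 2-Gram to its 1-Gram prefix and suffix by row number. The column names (freq, prefix, suffix) and the toy counts are illustrative assumptions, not the app's actual schema.

```r
## Minimal sketch (assumed column names): link each 2-Gram to its 1-Gram
## prefix and suffix by row number, giving primary/foreign key relationships.
ugset <- data.frame(freq = c(2, 2, 1, 1),
                    row.names = c("the", "quick", "brown", "red"))

bg    <- c("the quick", "quick brown", "quick red")
parts <- strsplit(bg, " ")
bgset <- data.frame(
  freq      = c(2, 1, 1),
  prefix    = match(sapply(parts, `[`, 1), rownames(ugset)),  # row index into ugset
  suffix    = match(sapply(parts, `[`, 2), rownames(ugset)),  # row index into ugset
  row.names = bg
)

## Foreign-key style lookup: recover the words of the second 2-Gram
rownames(ugset)[c(bgset$prefix[2], bgset$suffix[2])]   # "quick" "brown"
```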

Mathematical Basis:

The basic n-Gram probability is computed by interpolating conditional probabilities:

P(w4 | w1 w2 w3) = lambda1 * c(w1 w2 w3 w4) / c(w1 w2 w3) +
lambda2 * c(w2 w3 w4) / c(w2 w3) +
lambda3 * c(w3 w4) / c(w3) +
lambda4 * c(w4) / (sum of frequencies of all 1-Grams),
where w4 is the candidate word, (w1 w2 w3) is the observed prefix (successive terms back off to shorter prefixes),
c() denotes the frequency count of an n-Gram,
and lambda1 + lambda2 + lambda3 + lambda4 = 1.0

Candidate words are ranked by their aggregate (interpolated) n-Gram probability, with higher weights assigned to the higher-order n-Grams. The optimal lambda coefficients are obtained by training on known phrases.
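A minimal sketch of evaluating this formula for a single candidate word is shown below. The counts, the lambda weights, and the function name are illustrative assumptions, not values or code from the trained model.

```r
## Minimal sketch of the interpolation formula for a single candidate word w4.
## The counts and lambda weights below are illustrative assumptions.
interpolated_prob <- function(c1234, c123,   # c(w1 w2 w3 w4), c(w1 w2 w3)
                              c234,  c23,    # c(w2 w3 w4),    c(w2 w3)
                              c34,   c3,     # c(w3 w4),       c(w3)
                              c4,    n1,     # c(w4),          total 1-Gram frequency
                              lambdas = c(0.4, 0.3, 0.2, 0.1)) {
  stopifnot(isTRUE(all.equal(sum(lambdas), 1.0)))
  ratios <- c(c1234 / c123, c234 / c23, c34 / c3, c4 / n1)
  ratios[!is.finite(ratios)] <- 0   # an unseen prefix contributes nothing
  sum(lambdas * ratios)
}

## Example with made-up counts: score one candidate next word
interpolated_prob(c1234 = 3,   c123 = 10,
                  c234  = 12,  c23  = 40,
                  c34   = 60,  c3   = 300,
                  c4    = 500, n1   = 100000)
```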

Design Details - Data Structures and Algorithm for Prediction

  • Cross-references between the 4-Gram, 3-Gram, 2-Gram, and 1-Gram tables stored as integer indexes
  • 1-Gram table (ugset) stored on disk as an RDS file with words as row names and frequency as the only column
  • 2-Gram table (bgset) stored on disk as an RDS file with named rows and prefix and suffix unigram indexes
  • 3-Gram table (tgset) stored on disk as an RDS file with named rows and prefix bigram and suffix unigram indexes
  • 4-Gram table (qgset) stored on disk as an RDS file with named rows and prefix trigram and suffix unigram indexes
  • The (n-1)-Gram prefix of an n-Gram is indexed by its row number in the (n-1)-Gram table (its primary key); see the sketch below
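The sketch below shows how tables laid out this way might be loaded and traversed at prediction time. The file names, column names, and example prefix are assumptions about the layout described above, and the final ranking by raw frequency stands in for the full interpolation step.

```r
## Minimal sketch (assumed file and column names) of loading the RDS tables
## and walking the integer indexes from a 4-Gram back to its unigram suffix.
ugset <- readRDS("ugset.rds")   # rows named by word,    column: freq
bgset <- readRDS("bgset.rds")   # rows named by 2-Gram,  columns: freq, prefix, suffix
tgset <- readRDS("tgset.rds")   # rows named by 3-Gram,  columns: freq, prefix, suffix
qgset <- readRDS("qgset.rds")   # rows named by 4-Gram,  columns: freq, prefix, suffix

## Given a 3-Gram prefix typed by the user, find candidate next words
## (assumes the prefix actually occurs in tgset):
prefix_row <- match("one of the", rownames(tgset))      # primary key of the prefix
candidates <- qgset[qgset$prefix == prefix_row, ]       # 4-Grams sharing that prefix
next_words <- rownames(ugset)[candidates$suffix]        # suffix indexes -> words
next_words[order(candidates$freq, decreasing = TRUE)]   # rank by frequency
```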

Instructions for using the Shiny App

  • Input text box: Use this box to enter a new phrase or edit an existing phrase. This box supports full text navigation and editing. The text box is populated with a sample phrase at initialization or on refresh.
  • Submit Button: When the text edit is complete, click this button to run the word prediction algorithm
  • Output text box: This box displays the predicted word appended to the original phrase (a bare-bones skeleton of this layout is sketched after this list)
  • Repeat the above three steps for the next phrase
  • Initial Wait: The UI takes 20 to 30 seconds to load the n-Gram training data sets. Please allow the loading to complete. The UI is ready to use once the output shows the completion of the sample phrase.
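For reference, a stripped-down Shiny skeleton of the input box / submit button / output box layout described above. The widget labels, the sample phrase, and the call to predict_next_word() are hypothetical placeholders, not the app's actual code.

```r
library(shiny)

## Bare-bones sketch of the UI described above; predict_next_word() is a
## hypothetical placeholder for the app's actual prediction routine.
ui <- fluidPage(
  textInput("phrase", "Input text box:", value = "It was a pleasure to"),
  actionButton("submit", "Submit"),
  verbatimTextOutput("completed")
)

server <- function(input, output) {
  output$completed <- renderText({
    input$submit   # re-run only when the button is clicked
    isolate(paste(input$phrase, predict_next_word(input$phrase)))
  })
}

shinyApp(ui, server)
```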