Capstone NLP Word Prediction

Phanindra Reddigari
1/18/2017

Overall Scope:

Analyze the SwiftKey text files (blogs, Twitter, and news) and develop a simple Shiny UI for next-word prediction in a free-form phrase. The UI accepts user input in a text box and, upon submission, predicts the next word in the phrase. Highlights of the Shiny App design are:

  • Implementation of Dan Jurafsky's n-Gram interpolation model for NLP (Jurafsky interpolation)
  • Hash-map indexes between 4-Grams, 3-Grams, 2-Grams, and 1-Grams for fast lookups (primary and foreign key relationships); see the sketch after this list
  • Hash-map columns for computing the aggregate probability of each n-Gram from conditional probabilities
  • Accuracy traded for speed and memory by reducing sparsity (237,894 1-Grams; 347,112 2-Grams; 369,445 3-Grams; 253,621 4-Grams)
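
To make the primary/foreign-key idea concrete, here is a minimal R sketch of the index-building step. The tiny tables, the `prefix_ug`/`suffix_ug` column names, and the counts are illustrative assumptions; only the `ugset`/`bgset` names and the frequency column come from the design section below.

```r
# Minimal sketch: build integer "foreign key" columns that link each
# 2-Gram to its prefix and suffix 1-Grams. The tiny data here is
# hypothetical; the real app does this for 2-, 3-, and 4-Grams.
ugset <- data.frame(freq = c(50L, 30L, 20L),
                    row.names = c("make", "me", "the"))

bgset <- data.frame(freq = c(10L, 8L),
                    row.names = c("make me", "me the"))

# Split each bigram into its prefix and suffix words, then resolve them
# to row numbers of ugset; match() on row names plays the hash-map role.
parts <- strsplit(rownames(bgset), " ", fixed = TRUE)
bgset$prefix_ug <- match(vapply(parts, `[`, "", 1L), rownames(ugset))
bgset$suffix_ug <- match(vapply(parts, `[`, "", 2L), rownames(ugset))

bgset
#>         freq prefix_ug suffix_ug
#> make me   10         1         2
#> me the     8         2         3
```

Storing these cross-references as integers keeps the tables compact and turns each prefix/suffix lookup into a direct row access.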

Mathematical Basis:

Candidate words are ranked by the aggregate probability of the n-Gram, with higher weights for higher-order n-Grams. The optimal lambda coefficients, with \( \lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 1.0 \), are obtained by training on known phrases:

\[ P(w_4 \mid w_1 w_2 w_3) = \lambda_1 \frac{c(w_1 w_2 w_3 w_4)}{c(w_1 w_2 w_3)} + \lambda_2 \frac{c(w_2 w_3 w_4)}{c(w_2 w_3)} + \lambda_3 \frac{c(w_3 w_4)}{c(w_3)} + \lambda_4 \frac{c(w_4)}{\sum_w c(w)}, \]

where \( w_4 \) is the candidate word, the prefix \( (w_1 w_2 w_3) \) is an embedded trigram, \( \sum_w c(w) \) is the sum of the frequencies of all 1-Grams, and \( c(\cdot) \) is the count of an n-Gram (the frequency column in each n-Gram data set).
Example: for the sample phrase “Hey sunshine, can you follow me and make me the ___”, the candidates "happiest" and "most" are compared as follows:

\[ \LARGE. "happiest": l1 * c("make me the happiest")/c("make me the") + l2 * c("me the happiest")/c("me the") + l3 * c("the happiest")/c("the") + l4 * probability("happiest") \] \[ \LARGE "most": l1 * c("make me the most")/c("make me the") + l2 * c("me the most")/c("me the") + l3 * c("the most")/c("the") + l4 * probility("most") \]

Design Details - Data Structures and Algorithm for Prediction

  • Cross-references between 4-Grams, 3-Grams, 2-Grams, and 1-Grams stored as integer indexes
  • 1-Grams (ugset) stored on disk as RDS, with named rows and a frequency column
  • 2-Grams (bgset) stored on disk as RDS, with named rows and columns for frequency, embedded-prefix unigram index, and embedded-suffix unigram index
  • 3-Grams (tgset) stored on disk as RDS, with named rows and columns for frequency, embedded-prefix bigram index, and embedded-suffix unigram index
  • 4-Grams (qgset) stored on disk as RDS, with named rows and columns for frequency, embedded-prefix trigram index, and embedded-suffix unigram index
  • The embedded (n-1)-Gram prefix of an n-Gram is indexed by its row number in the (n-1)-Gram table; a lookup sketch follows this list
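
A minimal sketch of how the prediction lookup might traverse these indexes. Column names such as `prefix_tg` and `suffix_ug` are assumptions consistent with the layout above, and for brevity the sketch ranks by the 4-Gram conditional probability only, whereas the app interpolates all four orders.

```r
# Minimal sketch: given the trailing trigram of the user's phrase, find
# candidate 4-Grams whose embedded prefix points at that trigram, then
# rank their suffix unigrams. qgset/tgset/ugset follow the layout above.
predict_next <- function(prefix3, qgset, tgset, ugset) {
  tg_row <- match(prefix3, rownames(tgset))   # hash-map style row lookup
  if (is.na(tg_row)) return(NA_character_)    # unseen prefix: back off
  cand <- qgset[qgset$prefix_tg == tg_row, ]  # 4-Grams sharing the prefix
  if (nrow(cand) == 0L) return(NA_character_)
  # Conditional probability c(w1 w2 w3 w4) / c(w1 w2 w3) per candidate
  p <- cand$freq / tgset$freq[tg_row]
  best <- cand$suffix_ug[which.max(p)]        # integer index into ugset
  rownames(ugset)[best]                       # predicted word
}
```

Because the prefix and suffix references are precomputed integer indexes, each step is a direct subset or row access rather than a string search over the full n-Gram tables.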

Instructions for using the Shiny App

  • Input text box: Use this box (at the top) to enter a new phrase or edit an existing one. The box supports full text navigation and editing, and is populated with a sample phrase at initialization or on refresh. (A minimal UI sketch follows this list.)
  • Submit Button: When the text edit is complete, click this button to initiate the word-prediction algorithm
  • Progress Bar: Indicates the status of loading and computation
  • Output text box: Displays the predicted word appended to the original phrase
  • Repeat the steps above for the next phrase
  • Initial Wait: The UI takes 20 to 30 seconds to load the n-Gram training data sets; please allow the loading to complete. The UI is ready to use once the output box shows the completed sample phrase.
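
For orientation, a minimal Shiny skeleton with the controls described above might look like this; the widget IDs and the placeholder prediction are assumptions, not the deployed app's code.

```r
library(shiny)

# Minimal sketch of the UI described above: an input box, a Submit
# button, a progress indicator, and an output box that shows the
# phrase with the predicted word appended.
ui <- fluidPage(
  textInput("phrase", "Enter a phrase:",
            value = "Hey sunshine, can you follow me and make me the"),
  actionButton("submit", "Submit"),
  textOutput("completed")
)

server <- function(input, output) {
  prediction <- eventReactive(input$submit, {
    withProgress(message = "Computing prediction...", {
      word <- "happiest"  # placeholder for the real n-Gram lookup
      paste(input$phrase, word)
    })
  })
  output$completed <- renderText(prediction())
}

shinyApp(ui, server)
```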