N-Gram Word Predictor: Data Science Capstone

Frank Cheng
Jan 24, 2016

Overview

If you haven't tried the app yet, give it a go at https://biomystery.shinyapps.io/predNextWord/

Predicts the next word
Shows the top 5 candidate words, with probabilities, for each prediction
Responds quickly once the model data has loaded

Data processing & Algorithm

Data processing:

Tokenize: a get_tokens function takes files and returns tokens
Get n-gram frequencies: the data.table library processes the data quickly
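
As a rough illustration of the two steps above, here is a minimal Python sketch (the app itself uses R and data.table, which are not shown here). The get_tokens signature, the regular expression, and the count_ngrams helper are assumptions made for this example, not the author's code; in particular, this version takes a string rather than files.

```python
import re
from collections import Counter

def get_tokens(text):
    """Lowercase the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens, keyed by a tuple of words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy input; note that this simple version lets n-grams span sentence boundaries.
tokens = get_tokens("The cat sat on the mat. The cat ran.")
unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)
trigrams = count_ngrams(tokens, 3)
```

The resulting Counter objects play the role of the frequency tables that the later discounting step consumes.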

Algorithm:

Apply Good-Turing discounting to 1-, 2-, and 3-grams with frequency < 10
Use Katz back-off to calculate p_kz(w3|w1,w2) and p_kz(w2|w1)
Store the model using the ARPA format
Predict: use the last two words of the input to find the w that maximizes p_kz(w|w1,w2); with only one word of input, maximize p_kz(w|w1); with no input, maximize p_gt(w)
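
The discount-then-back-off idea can be sketched at the bigram level in Python as below; the trigram level follows the same pattern with a two-word context. This is a simplified illustration under stated assumptions, not the app's implementation: it uses the plain Good-Turing estimate c* = (c + 1) N_{c+1} / N_c without Katz's renormalization, backs off to undiscounted unigram frequencies rather than p_gt(w), and keeps the raw count whenever the toy data makes c* unusable. All function names are hypothetical.

```python
from collections import Counter

def good_turing(counts, k=10):
    """Replace each raw count c < k with c* = (c + 1) * N_{c+1} / N_c."""
    n = Counter(counts.values())              # N_c: number of n-grams seen c times
    adjusted = {}
    for gram, c in counts.items():
        c_star = (c + 1) * n[c + 1] / n[c] if c < k else c
        # Guard for sparse toy data (an assumption): keep c if c* is 0 or exceeds c.
        adjusted[gram] = c_star if 0 < c_star <= c else float(c)
    return adjusted

def katz_bigram(w1, w2, uni, bi, bi_star):
    """p_kz(w2 | w1): discounted bigram estimate if seen, else scaled unigram."""
    total = sum(uni.values())
    seen = {b: c for (a, b), c in bi.items() if a == w1}
    if not seen:                              # unknown context: unigram only
        return uni[w2] / total
    ctx = sum(seen.values())
    if w2 in seen:
        return bi_star[(w1, w2)] / ctx
    # Probability mass freed by discounting, shared among the unseen words.
    left_over = 1.0 - sum(bi_star[(w1, b)] for b in seen) / ctx
    unseen_mass = sum(c for w, c in uni.items() if w not in seen) / total
    return left_over * (uni[w2] / total) / unseen_mass if unseen_mass else 0.0

def predict(w1, uni, bi, bi_star, top=5):
    """Return the `top` most probable next words after w1 with their scores."""
    scores = {w: katz_bigram(w1, w, uni, bi, bi_star) for w in uni}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top]

tokens = ("the cat sat on the mat the cat ran to a dog "
          "the cat saw the dog the dog").split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
bi_star = good_turing(bi)
best = predict("the", uni, bi, bi_star)
```

In this toy corpus the singleton bigrams are discounted below 1, and the mass they give up is what lets words never seen after "the" receive a non-zero score.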

Model: stored in ARPA format

\data: (stored in model.data): 1) ngram 1: number of 1-grams 2) ngram 2: number of 2-grams 3) ngram 3: number of 3-grams
\1-gram: (model.1gram) columns: -log10(p_kz(w)), 1-gram, back-off weight a used when backing off from a higher-order gram
\2-gram: (model.2gram) columns: -log10(p_kz(w2|w1)), 2-gram, back-off weight a used when backing off from a higher-order gram
\3-gram: (model.3gram) columns: -log10(p_kz(w3|w1,w2)), 3-gram (no back-off weight is needed, since 3-grams are the highest order)
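
For readers unfamiliar with the layout, the following Python sketch renders a tiny two-order model as ARPA-style text. It follows the conventional ARPA layout, in which the first column is log10 of the probability and the optional last column is log10 of the back-off weight. The to_arpa function and all numbers are made up for illustration and are not taken from the app's model files.

```python
import math

def to_arpa(unigrams, bigrams):
    """Render {gram: (prob, backoff_or_None)} tables as ARPA-style text."""
    lines = ["\\data\\",
             f"ngram 1={len(unigrams)}",
             f"ngram 2={len(bigrams)}",
             ""]
    for order, table in ((1, unigrams), (2, bigrams)):
        lines.append(f"\\{order}-grams:")
        for gram, (prob, backoff) in sorted(table.items()):
            row = f"{math.log10(prob):.4f}\t{gram}"
            if backoff is not None:           # the highest order carries no weight
                row += f"\t{math.log10(backoff):.4f}"
            lines.append(row)
        lines.append("")
    lines.append("\\end\\")
    return "\n".join(lines)

# Made-up probabilities and back-off weights, for illustration only.
unigrams = {"the": (0.5, 0.4), "cat": (0.25, 0.8), "dog": (0.25, 1.0)}
bigrams = {"the cat": (0.6, None), "the dog": (0.3, None)}
text = to_arpa(unigrams, bigrams)
print(text)
```

Keeping the header counts and the per-order sections in separate tables, as the model.data and model.Ngram pieces above do, means each section can be loaded and looked up independently.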

Final Product

source code here