Frank Cheng
Jan 24, 2016
If you haven't tried the app yet, try it at https://biomystery.shinyapps.io/predNextWord/
Data processing:
Tokenize: wrote a get_tokens function that takes files and returns tokens
Get n-gram frequencies: used the data.table library, which processes the data quite fast
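The two data-processing steps can be sketched in a few lines. This is a minimal Python illustration, not the app's R code: the get_tokens below takes a string rather than files, the regular expression is an assumed tokenization rule, and ngram_counts is a hypothetical helper standing in for the data.table aggregation. It also ignores sentence boundaries, so some n-grams span two sentences.

```python
import re
from collections import Counter

def get_tokens(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes).
    The tokenization rule is an assumption, not the app's actual rule."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = get_tokens("The cat sat on the mat. The cat ran.")
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(unigrams[("the",)])       # frequency of the 1-gram "the"
print(bigrams[("the", "cat")])  # frequency of the 2-gram "the cat"
```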
Algorithm:
Apply Good-Turing discounting to 1-, 2-, and 3-grams with frequency < 10
Use Katz back-off to calculate p_kz(w3|w1,w2) and p_kz(w2|w1)
Store the model using the ARPA format
Predict: use the last two words of the input to find the w that maximizes p_kz(w|w1,w2); with one word, maximize p_kz(w|w1); with no input, maximize p_gt(w)
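The discounting and back-off steps above can be illustrated on a toy corpus. The sketch below is in Python rather than the app's R, and it stops at the bigram level to stay short; the trigram case adds one more back-off layer of the same shape. All function names are hypothetical. As a simplification, the Good-Turing estimate c* = (c + 1) * N_{c+1} / N_c is used only when it is defined and smaller than the raw count; otherwise the raw count is kept. Unigram probabilities here are plain relative frequencies rather than p_gt.

```python
from collections import Counter

def good_turing_counts(counts, k=10):
    """Replace each raw count c < k by c* = (c + 1) * N_{c+1} / N_c.
    Keeps the raw count when the estimate is unusable (a simplification)."""
    freq_of_freq = Counter(counts.values())  # N_c: number of n-grams seen c times
    adjusted = {}
    for gram, c in counts.items():
        c_star = c
        if c < k and freq_of_freq[c + 1] > 0:
            estimate = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
            if estimate < c:
                c_star = estimate
        adjusted[gram] = c_star
    return adjusted

def katz_bigram_prob(w1, w2, unigrams, bigrams, bigrams_gt):
    """p_kz(w2|w1): discounted estimate if the bigram was seen, else back off."""
    total = sum(unigrams.values())
    followers = [b for (a, b) in bigrams if a == w1]
    context = sum(bigrams[(w1, b)] for b in followers)
    if context == 0:                      # unseen context: plain unigram estimate
        return unigrams[w2] / total
    if (w1, w2) in bigrams:
        return bigrams_gt[(w1, w2)] / context
    # Probability mass freed by discounting, spread over unseen followers.
    left_over = 1.0 - sum(bigrams_gt[(w1, b)] / context for b in followers)
    unseen_mass = 1.0 - sum(unigrams[b] / total for b in followers)
    if unseen_mass <= 0:
        return 0.0
    return (left_over / unseen_mass) * unigrams[w2] / total

def predict_next(w1, unigrams, bigrams, bigrams_gt):
    """Return the word w with the highest p_kz(w|w1)."""
    return max(unigrams, key=lambda w: katz_bigram_prob(w, w, unigrams, bigrams, bigrams_gt)
               if False else katz_bigram_prob(w1, w, unigrams, bigrams, bigrams_gt))

# Toy corpus; n-grams are counted within each sentence.
sentences = ["i am", "i am", "i like", "i was", "you are", "you like"]
unigrams, bigrams = Counter(), Counter()
for s in sentences:
    words = s.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

bigrams_gt = good_turing_counts(bigrams)
print(predict_next("i", unigrams, bigrams, bigrams_gt))
print(katz_bigram_prob("i", "are", unigrams, bigrams, bigrams_gt))
```

In this toy corpus the bigrams seen once are discounted from 1 to 0.5, and the freed mass is what lets an unseen pair such as "i are" receive a small non-zero probability.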
\data: (stored in model.data): 1) ngram 1: number of 1-grams 2) ngram 2: number of 2-grams 3) ngram 3: number of 3-grams
\1-gram: (model.1gram) columns: -log10(p_kz(w)), 1-gram, back-off weight a for the higher-order gram
\2-gram: (model.2gram) columns: -log10(p_kz(w2|w1)), 2-gram, back-off weight a for the higher-order gram
\3-gram: (model.3gram) columns: -log10(p_kz(w3|w1,w2)), 3-gram (no back-off weight needed, since 3-grams are the highest order)
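To make the layout concrete, here is a small Python sketch that serializes a model into ARPA-style text. It is an illustration only: to_arpa and its input dictionaries are hypothetical, and the probabilities are made-up numbers. It follows the usual ARPA convention of writing log10 values (which are negative for probabilities below 1), so the sign may differ from the columns described above.

```python
import math

def to_arpa(ngram_probs, backoffs):
    """Serialize a model to ARPA-style text (hypothetical helper).

    ngram_probs: {order: {tuple_of_words: probability}}
    backoffs:    {tuple_of_words: back-off weight}, absent for the highest order
    """
    orders = sorted(ngram_probs)
    lines = ["\\data\\"]
    lines += ["ngram %d=%d" % (n, len(ngram_probs[n])) for n in orders]
    for n in orders:
        lines += ["", "\\%d-grams:" % n]
        for gram, p in sorted(ngram_probs[n].items()):
            fields = ["%.4f" % math.log10(p), " ".join(gram)]
            if gram in backoffs:  # highest-order n-grams carry no back-off weight
                fields.append("%.4f" % math.log10(backoffs[gram]))
            lines.append("\t".join(fields))
    lines += ["", "\\end\\"]
    return "\n".join(lines)

# Made-up probabilities and weights, for illustration only.
probs = {1: {("i",): 0.5, ("am",): 0.5}, 2: {("i", "am"): 1.0}}
backoffs = {("i",): 0.5, ("am",): 1.0}
arpa_text = to_arpa(probs, backoffs)
print(arpa_text)
```

Keeping one section per n-gram order mirrors the split into model.data, model.1gram, model.2gram, and model.3gram: each section can be loaded as its own table and looked up by the n-gram column.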
source code here