author: Jamieon
date: January 24th, 2016
Go to https://jamieon.shinyapps.io/word-predictor/ to try out the Shiny app presented here.
The application:
predicts the next word from the words typed by the user
shows the top 5 candidate words with their probabilities
is fast after loading the data
The following steps are taken in the data processing phase:
tokenize: a get_tokens function takes the input files and returns their tokens
n-gram frequencies: the data.table library is used to process the data and count the n-grams
The n-gram frequency data is then used as follows:
Good-Turing discounting is applied to 1-, 2-, and 3-grams with frequency below 10
Katz back-off is used to calculate p_kz(w3|w1,w2) and p_kz(w1|w2)
the model is stored in ARPA format
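The pipeline above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the app's R code: only the name get_tokens comes from the slides, and every other function name is hypothetical. It also makes two simplifying assumptions: the discount is the plain Good-Turing ratio d_r = (r+1) N_{r+1} / (r N_r) rather than Katz's full correction, and the unigram distribution used for back-off is left undiscounted. Only the bigram case is shown.

```python
import re
from collections import Counter


def get_tokens(text):
    """Lowercase the text and return its word tokens (assumed tokenization rule)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens, keyed by tuple."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def good_turing_discounts(counts, k=10):
    """Return {r: d_r} for counts 1 <= r < k; fall back to 1.0 (no discount)
    when the ratio is undefined or not strictly between 0 and 1."""
    n = Counter(counts.values())  # n[r] = number of n-grams seen exactly r times
    discounts = {}
    for r in range(1, k):
        d = (r + 1) * n[r + 1] / (r * n[r]) if n[r] else 1.0
        discounts[r] = d if 0.0 < d < 1.0 else 1.0
    return discounts


def katz_bigram(w1, w2, uni, bi, discounts):
    """Katz back-off estimate of p(w2 | w1) from unigram and bigram counts."""
    total = sum(uni.values())
    seen = {b[1]: c for b, c in bi.items() if b[0] == w1}
    history = sum(seen.values())
    if history == 0:  # w1 never seen as a history: use the unigram estimate
        return uni[(w2,)] / total

    def discounted(w):
        c = seen[w]
        return discounts.get(c, 1.0) * c / history

    if w2 in seen:
        return discounted(w2)
    # Probability mass freed by discounting, spread over unseen continuations.
    left_over = 1.0 - sum(discounted(w) for w in seen)
    unseen_mass = 1.0 - sum(uni[(w,)] for w in seen) / total
    if unseen_mass == 0.0:
        return 0.0
    return left_over / unseen_mass * uni[(w2,)] / total


def top_words(w1, uni, bi, discounts, k=5):
    """Return the k most probable next words after w1 with their probabilities."""
    scored = [(w[0], katz_bigram(w1, w[0], uni, bi, discounts)) for w in uni]
    return sorted(scored, key=lambda pair: -pair[1])[:k]
```

For example, after tokens = get_tokens("the cat sat on the mat the cat ate the fish"), building uni = ngram_counts(tokens, 1), bi = ngram_counts(tokens, 2) and discounts = good_turing_discounts(bi), the call top_words("the", uni, bi, discounts) ranks the candidate continuations of "the". The back-off weight is chosen so that the probabilities over the whole vocabulary sum to 1 for a given history.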