This is a web application built for the Capstone project of the Coursera Data Science Specialization, offered in cooperation with the company SwiftKey. It is a typing assistant powered by a predictive text model, capable of suggesting the next word for a phrase entered by the user. The application is hosted at shinyapps.io, and the text model is built on the RWeka n-gram tokenizer. The application can be accessed here.
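As an illustration, a 4-gram frequency table of the kind the model uses could be built with RWeka's `NGramTokenizer` roughly as sketched below; the toy sentences stand in for the cleaned corpus, and the application's actual preprocessing steps are not shown:

```r
library(RWeka)

# Toy sentences standing in for the cleaned corpus text.
text <- c("thanks for the follow", "thanks for the help")

# Split the text into 4-grams with the RWeka n-gram tokenizer.
tokens <- NGramTokenizer(text, Weka_control(min = 4, max = 4))

# Count how often each 4-gram occurs; these counts feed the prediction tables.
ngram_counts <- sort(table(tokens), decreasing = TRUE)
head(ngram_counts)
```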
The model is based on an n-gram hierarchy that gives preference to the highest-order matching n-gram and then applies Katz's back-off to a lower-order n-gram. The application predicts the next word in a sequence by calculating its conditional probability. The model relies on the Markov assumption: only the last three words of the input phrase are considered, so the model first tries to estimate the 4-gram probability and backs off stepwise to the unigram model if necessary. \[ P(w_{4} \mid w_{1},w_{2},w_{3}),\; P(w_{3} \mid w_{1},w_{2}),\; P(w_{2} \mid w_{1}),\; P(w_{1}) \]
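A minimal sketch of this stepwise back-off lookup is given below; the lookup-table names (`four_grams`, `tri_grams`, `bi_grams`, `uni_grams`) and their structure are assumptions for illustration, not the application's exact data structures:

```r
# Hypothetical lookup tables: named lists that map a history (the preceding
# words joined by spaces) to a data frame of candidate next words and their
# smoothed probabilities; uni_grams is a plain data frame of words and probs.
predict_next <- function(phrase, four_grams, tri_grams, bi_grams, uni_grams) {
  # Markov assumption: keep only the last three words of the input phrase.
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 3)

  # Try the highest-order n-gram first, then back off one order at a time.
  for (n in length(words):1) {
    history <- paste(tail(words, n), collapse = " ")
    lookup  <- list(bi_grams, tri_grams, four_grams)[[n]]
    hits    <- lookup[[history]]
    if (!is.null(hits)) {
      return(hits$word[which.max(hits$prob)])  # most probable continuation
    }
  }

  # Nothing matched at any order: fall back to the most frequent unigram.
  uni_grams$word[which.max(uni_grams$prob)]
}
```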
To handle data sparsity, the Kneser-Ney smoothing algorithm is implemented.
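For the bigram case, the Kneser-Ney estimate takes the form below, with an absolute discount \(d\) and a continuation probability that favours words seen after many different histories: \[ P_{KN}(w_{i} \mid w_{i-1}) = \frac{\max(c(w_{i-1},w_{i}) - d,\, 0)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w_{i}) \] where \( \lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\, |\{ w : c(w_{i-1},w) > 0 \}| \) is the probability mass freed by discounting and \( P_{cont}(w_{i}) = \frac{|\{ w' : c(w',w_{i}) > 0 \}|}{|\{ (w',w) : c(w',w) > 0 \}|} \) is the fraction of distinct bigram types that end in \(w_{i}\).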
The raw data, consisting of newspaper and magazine articles, personal and professional blogs, and Twitter updates, is provided by HC Corpora.
The static n-grams are built from a randomized sample of the raw dataset, drawn via a binomial distribution (about 5 % of the source files in total).
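Such a binomial line sample can be drawn as in the sketch below; the file name `en_US.twitter.txt` and the 5 % rate follow the description above, the rest is illustrative:

```r
set.seed(1234)  # make the random sample reproducible

# Read one of the raw source files and keep each line with probability 0.05.
lines <- readLines("en_US.twitter.txt", skipNul = TRUE)
keep  <- rbinom(length(lines), size = 1, prob = 0.05) == 1

writeLines(lines[keep], "en_US.twitter.sample.txt")
```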
Körner, Martin C. (Sept 1, 2013) “Implementation of Modified Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction”. Retrieved April 18, 2015 from:
https://west.uni-koblenz.de/sites/default/files/BachelorArbeit_MartinKoerner.pdf
Robin (Dec 15, 2009) “Markov Models: Overview”. Retrieved April 18, 2015 from:
http://language.worldofcomputing.net/pos-tagging/markov-models.html
Chambers, N. (Fall 2012) “Smoothing Language Models”. Retrieved April 18, 2015 from:
http://www.usna.edu/Users/cs/nchamber/courses/nlp/f12/slides/set4-smoothing.pdf