Application Introduction

This is a web application built for the Capstone project of the Coursera Data Science Specialization, offered in cooperation with the company SwiftKey. It is a typing assistant powered by a predictive text model, capable of suggesting the next word for a phrase the user has entered. The application is hosted at shinyapps.io, and the text model is based on the RWeka n-gram tokenizer. The application can be accessed here.

Input and Output

The application requires the following input from the user.

  • Select the complexity of the n-gram model from a drop-down menu
  • Toggle smoothing on or off
  • Enter a text string for prediction

The application produces the following output.

  • A prediction of the next word in the text string. The user can accept the prediction or keep adding words to the text string; the application reacts to the most recently entered words and offers a new prediction to accept or ignore, as sketched below.
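
Below is a minimal sketch of how the inputs described above and the reactive prediction could be wired together in Shiny. The widget labels and the predict_next() helper are assumptions for illustration, not the application's actual code.

```r
library(shiny)

ui <- fluidPage(
  # The three user inputs described above; labels are illustrative.
  selectInput("order", "N-gram model complexity",
              choices = c("Bigram" = 2, "Trigram" = 3, "Quadgram" = 4),
              selected = 4),
  checkboxInput("smoothing", "Use smoothing", value = TRUE),
  textInput("phrase", "Type your phrase"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    # Reacts to the latest words typed; predict_next() is a hypothetical
    # stand-in for the model lookup (a back-off version is sketched below).
    predict_next(input$phrase)
  })
}

shinyApp(ui, server)
```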

Assumptions and formulas used in the prediction

The model is based on an n-gram hierarchy, giving preference to the highest-order matching n-gram and applying Katz's back-off to a lower-order n-gram when no match is found. The application predicts the next word in a word sequence by calculating its conditional probability. The model rests on the Markov assumption: only the most recent words matter, so any sequence longer than four words is truncated and at most the last three words serve as context. For every input the model therefore first tries to calculate the 4-gram probability and backs off stepwise toward the unigram model if necessary: \[ P(w_{4}|w_{1},w_{2},w_{3}),\; P(w_{3}|w_{1},w_{2}),\; P(w_{2}|w_{1}),\; P(w_{1})\]
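
As an illustration of this back-off chain, here is a minimal sketch of such a lookup, assuming the n-gram tables are kept as a list of data frames (indexed by order) with columns `prefix`, `word`, and `prob`. For clarity it returns the highest-order match directly and omits Katz's discount weights; it is not the application's actual code.

```r
# predict_next: stepwise back-off from the 4-gram to the unigram model.
# `tables` is assumed to be a list of data frames, one per n-gram order,
# with columns `prefix`, `word` and `prob` (conditional probability).
predict_next <- function(phrase, tables) {
  words <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
  for (n in 4:2) {
    if (length(words) < n - 1) next          # not enough context yet
    prefix <- paste(tail(words, n - 1), collapse = " ")
    hits <- tables[[n]][tables[[n]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      # Highest-order matching n-gram wins (Katz's discounting omitted).
      return(hits$word[which.max(hits$prob)])
    }
  }
  # Back off all the way to the unigram model: the most probable word.
  tables[[1]]$word[which.max(tables[[1]]$prob)]
}
```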

To handle data sparsity, the Kneser-Ney smoothing algorithm is implemented.
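
For reference, the standard interpolated Kneser-Ney estimate for a bigram (shown here as a textbook formula, not code taken from the application) discounts every observed count by a fixed amount \(D\) and redistributes the saved probability mass according to how many distinct contexts a word completes: \[ P_{KN}(w_i|w_{i-1}) = \frac{\max(c(w_{i-1}w_i) - D,\, 0)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w_i) \] where \( \lambda(w_{i-1}) = \frac{D}{c(w_{i-1})}\, |\{w : c(w_{i-1}w) > 0\}| \) normalizes the redistributed mass, and the continuation probability \( P_{cont}(w_i) \) is proportional to the number of distinct words that precede \(w_i\) in the corpus.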

The raw data, consisting of newspapers, magazines, (personal and professional) blogs, and Twitter updates, is provided by HC corpora.

The static n-grams are built from a random sample of the raw dataset, drawn with a binomial distribution and covering in total about 5 % of the source files.
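
A minimal sketch of how such a sample could be drawn and fed to the RWeka tokenizer follows; the file names match the HC corpora distribution used in the Capstone, but the code itself is an assumption, not the application's actual build script.

```r
library(RWeka)

# Keep each line with probability 0.05 (a binomial coin flip per line),
# giving roughly a 5 % sample of every source file.
sample_lines <- function(path, rate = 0.05) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}

corpus <- c(sample_lines("en_US.blogs.txt"),
            sample_lines("en_US.news.txt"),
            sample_lines("en_US.twitter.txt"))

# Build the static 4-grams with the RWeka n-gram tokenizer.
quadgrams <- NGramTokenizer(paste(corpus, collapse = " "),
                            Weka_control(min = 4, max = 4))
head(sort(table(quadgrams), decreasing = TRUE))
```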

References

Körner, M. C. (Sept 1, 2013) “Implementation of Modified Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction”. Retrieved April 18, 2015 from:
https://west.uni-koblenz.de/sites/default/files/BachelorArbeit_MartinKoerner.pdf

Robin (Dec 15, 2009) “Markov models: overview”. Retrieved April 18, 2015 from:
http://language.worldofcomputing.net/pos-tagging/markov-models.html

Chambers, N. (Fall 2012) “Smoothing Language Models”. Retrieved April 18, 2015 from:
http://www.usna.edu/Users/cs/nchamber/courses/nlp/f12/slides/set4-smoothing.pdf