Report on the prediction algorithm built for the capstone project of the Coursera Data Analysis specialization

Elena Chernousova
07/11/2016

The app was built using the following:

  • Data provided by HC Corpora. The corpus can be found at http://www.corpora.heliohost.org/; roughly 2% of the data was used.
  • The quanteda library for quantitative analysis of textual data (a pipeline sketch follows this list)
  • Kneser-Ney smoothing applied to the probabilistic language model
  • The prediction algorithm uses at most two preceding words to predict the third (next) word
  • The result is presented as a web app
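
As a rough illustration of the pipeline described above, the sketch below samples about 2% of a corpus file and builds n-gram frequency tables with quanteda. The file name, the sampling seed, and the ngram_counts helper are illustrative assumptions, not the app's exact code.

    library(quanteda)

    # Read the raw text and keep roughly 2% of the lines (illustrative file name and seed)
    set.seed(42)
    lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
    sample_lines <- lines[rbinom(length(lines), 1, 0.02) == 1]

    # Tokenise, lower-case, and count unigrams, bigrams, and trigrams
    toks <- tokens(sample_lines, remove_punct = TRUE, remove_numbers = TRUE)
    toks <- tokens_tolower(toks)

    ngram_counts <- function(toks, n) {
      ng <- tokens_ngrams(toks, n = n, concatenator = " ")
      m  <- dfm(ng)
      sort(colSums(m), decreasing = TRUE)   # named vector: n-gram -> count
    }

    uni <- ngram_counts(toks, 1)
    bi  <- ngram_counts(toks, 2)
    tri <- ngram_counts(toks, 3)   # trigrams: two preceding words plus the predicted word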

Kneser-Ney algorithm

  • Building an N-gram model: a probability distribution over strings that attempts to reflect how frequently a string occurs as a sentence, based on counts in a training set.

  • Applying smoothing: probabilities are calculated taking word order and context into account. The unigram probability should be proportional not to the number of occurrences of a word, but to the number of different words that it follows (see the formula after this list).

  • Resource used: NLP Lunch Tutorial: Smoothing, Bill MacCartney, 21 April 2005, http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
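
For reference, the interpolated Kneser-Ney estimate for a bigram model (a standard formulation, consistent with the tutorial cited above) can be written as follows, where d is the absolute-discount parameter and c() denotes training-set counts:

    P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - d, 0)}{c(w_{i-1})} + \lambda(w_{i-1}) \, P_{cont}(w_i)

    \lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} \, |\{ w : c(w_{i-1} w) > 0 \}|

    P_{cont}(w_i) = \frac{|\{ w' : c(w' w_i) > 0 \}|}{|\{ (w', w'') : c(w' w'') > 0 \}|}

The continuation probability P_{cont}(w_i) is exactly the quantity described above: it counts the number of different words that w_i follows, normalised by the total number of distinct bigram types.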

Result - Shiny app

Screenshot: UI of the Shiny app

Shiny app manual

  • Enter one or two words into the text field to get a prediction.

  • Expected result: the app shows the five most probable next words if a prediction can be made from your input; otherwise it shows the most probable words computed regardless of the input (a minimal UI sketch follows the app link below).

https://wertic.shinyapps.io/wordsprediction/
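
A minimal sketch of such a Shiny front end is given below. The predict_next_words() function is a hypothetical stand-in for the app's actual Kneser-Ney lookup and is not the deployed code.

    library(shiny)

    # Hypothetical predictor: returns up to five candidate next words for the input phrase.
    # In the real app this would query the Kneser-Ney smoothed n-gram tables.
    predict_next_words <- function(phrase, n = 5) {
      rep("the", n)  # placeholder so the sketch runs
    }

    ui <- fluidPage(
      titlePanel("Next-word prediction"),
      textInput("phrase", "Enter one or two words:"),
      verbatimTextOutput("prediction")
    )

    server <- function(input, output) {
      output$prediction <- renderText({
        paste(predict_next_words(input$phrase), collapse = ", ")
      })
    }

    shinyApp(ui = ui, server = server)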