Report on the prediction algorithm built for the capstone project of the Coursera Data Analysis specialization

Elena Chernousova
07/11/2016

The app was built using the following:

  • Data provided by HC Corpora. The corpus can be found at http://www.corpora.heliohost.org/; roughly 2% of the data was used.
  • The quanteda library for quantitative analysis of textual data (a pipeline sketch follows this list)
  • Kneser-Ney smoothing applied to the probabilistic language model
  • The prediction algorithm uses at most two preceding words to predict the third (next) word
  • The result is presented as a web app
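
As a rough illustration of the pipeline described above, the sketch below samples about 2% of a corpus file and builds n-gram frequency tables with quanteda. The file name, the sampling seed, and the ngram_counts helper are illustrative assumptions, not the app's exact code.

    library(quanteda)

    # Read the raw text and keep roughly 2% of the lines (illustrative file name and seed)
    set.seed(42)
    lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
    sample_lines <- lines[rbinom(length(lines), 1, 0.02) == 1]

    # Tokenise, lower-case, and count unigrams, bigrams, and trigrams
    toks <- tokens(sample_lines, remove_punct = TRUE, remove_numbers = TRUE)
    toks <- tokens_tolower(toks)

    ngram_counts <- function(toks, n) {
      ng <- tokens_ngrams(toks, n = n, concatenator = " ")
      m  <- dfm(ng)
      sort(colSums(m), decreasing = TRUE)   # named vector: n-gram -> count
    }

    uni <- ngram_counts(toks, 1)
    bi  <- ngram_counts(toks, 2)
    tri <- ngram_counts(toks, 3)   # trigrams: two preceding words plus the predicted word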

Kneser-Ney algorithm

  • Building an N-gram model: a probability distribution over strings that attempts to reflect how frequently a string occurs as a sentence, based on counts in a training set.

  • Applying smoothing: probabilities are calculated taking word order and context into account. The unigram probability should be proportional not to the number of occurrences of a word, but to the number of different words that it follows (see the formula after this list).

  • Resource used: NLP Lunch Tutorial: Smoothing, Bill MacCartney, 21 April 2005, http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
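
For reference, the interpolated Kneser-Ney estimate for a bigram model (a standard formulation, consistent with the tutorial cited above) can be written as follows, where d is the absolute-discount parameter and c() denotes training-set counts:

    P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - d, 0)}{c(w_{i-1})} + \lambda(w_{i-1}) \, P_{cont}(w_i)

    \lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} \, |\{ w : c(w_{i-1} w) > 0 \}|

    P_{cont}(w_i) = \frac{|\{ w' : c(w' w_i) > 0 \}|}{|\{ (w', w'') : c(w' w'') > 0 \}|}

The continuation probability P_{cont}(w_i) is exactly the quantity described above: it counts the number of different words that w_i follows, normalised by the total number of distinct bigram types.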

Result - Shiny app

Screenshot: UI of the Shiny app

Shiny app manual

  • Enter one or two words into the text field to get a prediction.

  • Expected result: the app shows the five most probable next words if a prediction can be made from your input; otherwise it shows the most probable words computed regardless of the input (a minimal UI sketch follows the app link below).

https://wertic.shinyapps.io/wordsprediction/
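
A minimal sketch of such a Shiny front end is given below. The predict_next_words() function is a hypothetical stand-in for the app's actual Kneser-Ney lookup and is not the deployed code.

    library(shiny)

    # Hypothetical predictor: returns up to five candidate next words for the input phrase.
    # In the real app this would query the Kneser-Ney smoothed n-gram tables.
    predict_next_words <- function(phrase, n = 5) {
      rep("the", n)  # placeholder so the sketch runs
    }

    ui <- fluidPage(
      titlePanel("Next-word prediction"),
      textInput("phrase", "Enter one or two words:"),
      verbatimTextOutput("prediction")
    )

    server <- function(input, output) {
      output$prediction <- renderText({
        paste(predict_next_words(input$phrase), collapse = ", ")
      })
    }

    shinyApp(ui = ui, server = server)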