Text Prediction Model

CvP
November 19th, 2017

Strategy Principles

  • quanteda package used (RWeka turned out to be too slow and memory-consuming). 1- to 4-grams used, 60% of the corpus processed, stopwords not removed.
  • To make it possible to process a large enough share of the available documents, the texts were split into 20 blocks; each block was processed (cleaned up and tokenized) and the results were reassembled.
  • Pruning of low frequencies was carried out for each block (frequencies < 5 discarded). "Dictionaries" were saved and retrieved for n-grams 4 to 2 as 'data.table' objects.
  • Results (frequency databases for each n-gram) are based on indexed data.tables. These 'data.table' objects (approximately 6.3 MB) are uploaded to Shiny and loaded by the prediction algorithm.
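
The block-wise counting and pruning described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the original quanteda/data.table code; the function names, the simple `tokenize` clean-up, and the plain `Counter` standing in for a 'data.table' are all assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tokenize(text):
    """Hypothetical clean-up: lowercase and keep purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]

def count_block(lines, n, min_count=5):
    """Count n-grams in one block, dropping those seen fewer than min_count times."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(tokenize(line), n))
    return Counter({g: c for g, c in counts.items() if c >= min_count})

def build_dictionary(lines, n, n_blocks=20, min_count=5):
    """Split lines into blocks, count and prune each block, then sum the results."""
    size = max(1, -(-len(lines) // n_blocks))  # ceiling division
    total = Counter()
    for start in range(0, len(lines), size):
        total.update(count_block(lines[start:start + size], n, min_count))
    return total
```

One consequence of pruning per block, at least in this sketch, is that an n-gram that is frequent overall but appears fewer than five times in every individual block is discarded. That keeps memory use low at the cost of some coverage.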

Model considerations

  • Input string is properly pre-processed to allow a normalized response.
  • Model based on the 'Stupid Backoff' approach with lambda = 0.4, adapted by adding the scores of terms found in the different n-gram searches (rather than keeping only the score from the highest-order n-gram). For each n-gram 'dictionary', the score is the count of the matched n-gram divided by the count of the search terms (its first n-1 words) in the (n-1)-gram 'dictionary'.
  • A description of the base model can be found at http://www.aclweb.org/anthology/D07-1090.pdf
  • An i7 processor and 16 GB of memory were very helpful for processing the texts.
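
The adapted scoring can be sketched as follows. This is an illustrative Python sketch rather than the original implementation: the `predict` function, the `dicts` layout (n-gram order mapped to a count dictionary), and the choice to discount each lower order by one extra factor of lambda are assumptions, since the slides do not state exactly how lambda is applied when scores are summed.

```python
from collections import defaultdict

def predict(tokens, dicts, lam=0.4, max_n=4, top=10):
    """Rank candidate next words by summed, discounted n-gram scores.

    dicts maps an order n to a {n-gram tuple: count} dictionary
    (a hypothetical layout standing in for the indexed data.tables).
    """
    scores = defaultdict(float)
    top_n = min(max_n, len(tokens) + 1)        # longest order the input supports
    for n in range(top_n, 1, -1):
        context = tuple(tokens[-(n - 1):])     # last n-1 input words
        context_count = dicts.get(n - 1, {}).get(context, 0)
        if context_count == 0:
            continue                           # context unseen at this order
        weight = lam ** (top_n - n)            # assumed discount per backoff step
        for gram, count in dicts.get(n, {}).items():
            if gram[:-1] == context:
                # Add to the score from other orders instead of replacing it.
                scores[gram[-1]] += weight * count / context_count
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

# Toy dictionaries, invented purely for the example.
toy = {
    1: {("the",): 10, ("cat",): 4},
    2: {("the", "cat"): 4, ("cat", "sat"): 3, ("cat", "ran"): 1},
    3: {("the", "cat", "sat"): 2},
}
ranking = predict(["the", "cat"], toy)
```

With the toy counts, "sat" collects a trigram score and a discounted bigram score, so it ranks ahead of "ran", which is found only at the bigram level. The linear scan over each dictionary is for readability; a keyed lookup, as the indexed data.tables provide, would avoid it.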

Benchmarks

Text Prediction Utility

CvPred Utility

User guide

  • Enter the text (one or more words).
  • Inputs of two or three words, or longer, will give better results.
  • To trigger the prediction, press the 'GO' button.
  • A top-10 list of suggested next words will be shown.
  • The 'adapted' Stupid Backoff scores will be shown.
  • A bar chart shows the relative relevance of the different suggestions.
  • To run a new prediction, enter new text and press 'GO'.
  • Enjoy!