Text Prediction Model

CvP
November 19th, 2017

Strategy Principles


  • The quanteda package was used (RWeka turned out to be too slow and memory-consuming).
  • 1-grams to 4-grams were used; 60% of the corpus was processed, and stopwords were not removed.
  • Blocks were tracked so that the complementary 40% training block set could be pulled.
  • To make it possible to process a large enough portion of the available documents, the texts were broken down into 20 blocks, processed (cleaned up and tokenized), and the results reassembled.
  • Low frequencies were pruned in each block (frequencies < 5 discarded).
  • “Dictionaries” were saved and retrieved for n-grams 4 to 2 as 'data.table' objects.
  • After multiple optimizations, the complete process took circa 45 minutes on the machine detailed in this same document.
  • Results (frequency databases for each n-gram) are based on indexed data.tables.
  • These 'data.table' objects (approximately 6.3 MB) are uploaded to Shiny and loaded by the prediction algorithm.
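
The block-wise strategy above can be illustrated with a minimal Python sketch. It is not the R/quanteda code used for this project: the function names (`tokenize`, `ngram_counts`, `split_blocks`, `build_dictionary`), the simple regex tokenizer, and the contiguous block split are all assumptions made for illustration. It shows the general idea of counting n-grams per block, pruning frequencies below 5 inside each block, and merging what survives.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical cleaner: lowercase, keep letters and apostrophes.
    # Stopwords are deliberately kept.
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(lines, n):
    # Count the n-grams of every line in one block.
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def split_blocks(lines, n_blocks):
    # Contiguous blocks of (almost) equal size; an assumed splitting rule.
    size = max(1, -(-len(lines) // n_blocks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def build_dictionary(lines, n, n_blocks=20, min_freq=5):
    # Process block by block so only one block's counts are in memory,
    # prune low frequencies inside each block, then merge the survivors.
    total = Counter()
    for block in split_blocks(lines, n_blocks):
        counts = ngram_counts(block, n)
        pruned = {gram: c for gram, c in counts.items() if c >= min_freq}
        total.update(pruned)  # Counter.update adds counts together
    return total
```

One consequence of pruning per block, visible in this sketch, is that an n-gram appearing fewer than 5 times in a given block contributes nothing from that block, even if its total count across all blocks would have passed the threshold.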

Model considerations


  • The input string is pre-processed so that the response is normalized.
  • Execution time targets were established: acceptable below 100 ms, with optimum values around 10 ms.
  • The model is based on the 'Stupid backoff' approach with lambda = 0.4.
  • For each n-gram 'dictionary', the score is calculated as the count of the found chain divided by the count of the search term set in the (n-1)-gram 'dictionary'.
  • The model has been adapted to add up the scores of terms found in the different n-gram searches, rather than keeping only the score from the highest n-gram.
  • When no result is found, the term 'the' with a fictitious score of 0.1 is provided as output.
  • A description of the base model can be found at http://www.aclweb.org/anthology/D07-1090.pdf
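
The scoring rules listed above can be sketched in Python as follows. This is an illustrative reading of the bullets, not the application's actual R code: the `predict` function, the plain-dict layout of `dicts` (n mapped to a dictionary of n-gram string to count), and the whitespace tokenizer are hypothetical. In particular, applying lambda once per backoff step, counted from the highest order the input can support, is an assumption about how the 'adapted' scores are weighted before being summed.

```python
LAMBDA = 0.4  # backoff factor stated in the document


def predict(text, dicts, top=10, max_n=4):
    # dicts[n] maps an n-gram string ("i love you") to its count.
    tokens = text.lower().split()
    # Highest n-gram order usable with the available context words.
    top_n = min(max_n, len(tokens) + 1)
    scores = {}
    for n in range(top_n, 1, -1):
        prefix = " ".join(tokens[-(n - 1):])
        prefix_count = dicts[n - 1].get(prefix, 0)
        if prefix_count == 0:
            continue
        # Assumed weighting: one factor of lambda per backoff step.
        weight = LAMBDA ** (top_n - n)
        # Linear scan for clarity; an indexed table would do a keyed lookup.
        for gram, count in dicts[n].items():
            head, _, last = gram.rpartition(" ")
            if head == prefix:
                # Adapted model: add the scores found at every order.
                scores[last] = scores.get(last, 0.0) + weight * count / prefix_count
    if not scores:
        return [("the", 0.1)]  # fallback with the fictitious score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

For example, with toy dictionaries where "i love" occurs 4 times, "i love you" twice and "love you" 3 times out of 4 occurrences of "love", the word "you" receives 2/4 from the trigram search plus 0.4 × 3/4 from the bigram search.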


Stupid backoff
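
For reference, the base scoring rule from the paper linked above can be written as follows, using this document's lambda where the paper writes alpha. Here f(·) is an n-gram count and w_{i-k+1}^{i} denotes the word sequence from position i-k+1 to i; the version used in this application differs in that scores from several orders are added.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
\lambda \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad \lambda = 0.4
```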

Accuracy & Performance Considerations



Technical environment description:

Apple

Text Prediction Utility Application



Application view:

CvPred Utility



Use guide:

  • Input the text (>= 1 word)
  • Inputs of 2 or 3 words, or longer, will give better results
  • To trigger the prediction, press the 'GO' button
  • A top 10 list of suggested next words will show up
  • 'Adapted' Stupid backoff scores will be shown
  • A bar chart shows the relative relevance of the different suggestions
  • To perform a new execution, enter new text and press 'GO'