Text Prediction Model

CvP
November 19th, 2017

Model & Strategy Principles

  • The quanteda package was used (RWeka turned out to be too slow and memory-hungry).
  • 1-grams to 4-grams used; 60% of the corpus processed; stopwords not removed.
  • Texts split into 20 blocks, each cleaned up and tokenized separately.
  • Pruning of low frequencies performed for each block (frequencies < 5 discarded).
  • Block results (frequency databases based on indexed data.tables) merged and saved.
  • Model based on the 'Stupid backoff' approach with lambda = 0.4.
  • Model adapted to add up the scores a term obtains in the searches at different n-gram orders (rather than keeping only the score from the highest-order n-gram).
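
The block-wise processing described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the author's R/quanteda code: the function names (`ngrams`, `count_block`, `build_table`), the whitespace tokenizer and the way texts are assigned to blocks are all assumptions made for the example. Only the 1- to 4-gram range, the 20 blocks and the "< 5" pruning threshold come from the slide.

```python
from collections import Counter

MIN_COUNT = 5  # from the slide: frequencies < 5 are discarded within each block


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_block(texts, max_n=4, min_count=MIN_COUNT):
    """Tokenize one block, count its 1- to max_n-grams, then prune rare ones."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()  # stand-in for the real clean-up step
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return Counter({g: c for g, c in counts.items() if c >= min_count})


def build_table(texts, n_blocks=20, **kwargs):
    """Split texts into n_blocks, process each block, merge by summing counts."""
    merged = Counter()
    for b in range(n_blocks):
        # Hypothetical split: every n_blocks-th text goes to the same block.
        merged.update(count_block(texts[b::n_blocks], **kwargs))
    return merged
```

Note that pruning per block, before merging, also drops n-grams that are rare in every block but would pass the threshold in the full corpus; that is the price of keeping each block small enough to process.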

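The scoring step can be sketched in the same spirit. In standard Stupid backoff, a candidate word keeps the relative frequency from the highest-order n-gram in which it is found, and each step down to a shorter context multiplies the score by lambda. The `adapted=True` branch below is one plausible reading of the adaptation described on this slide (summing a word's weighted scores over all orders where it appears); the exact formula used in CvPred is not given in the source, so treat this as an assumption. The function name `predict` and the dictionary layout of `counts` are likewise hypothetical.

```python
from collections import defaultdict

LAMBDA = 0.4  # backoff factor stated on the slide


def predict(counts, text, top_k=10, max_n=4, lam=LAMBDA, adapted=True):
    """Rank next-word candidates; counts maps n-gram tuples to frequencies."""
    tokens = text.lower().split()
    total_unigrams = sum(c for g, c in counts.items() if len(g) == 1)
    scores = defaultdict(float)
    weight = 1.0
    # Try the longest usable context first, then back off one word at a time.
    for ctx_len in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - ctx_len:])
        denom = counts.get(context, 0) if ctx_len else total_unigrams
        if denom:
            # Linear scan for clarity; a real table would be indexed by context.
            for gram, c in counts.items():
                if len(gram) == ctx_len + 1 and gram[:ctx_len] == context:
                    word = gram[-1]
                    score = weight * c / denom
                    if adapted:
                        scores[word] += score      # assumed variant: sum over orders
                    elif word not in scores:
                        scores[word] = score       # standard: highest order wins
        weight *= lam  # every backoff step is discounted by lambda
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


counts = {("a",): 4, ("b",): 4, ("c",): 2, ("a", "b"): 3, ("a", "c"): 1}
print(predict(counts, "a", adapted=False))
print(predict(counts, "a", adapted=True))
```

With this toy table, the word "b" after "a" scores 3/4 under standard Stupid backoff, while the summed variant adds the discounted unigram term 0.4 × 4/10 on top of it. Words supported at several n-gram orders are therefore pushed further up the ranking.
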
Accuracy & Performance Considerations

  • Technical environment description

Hfoffani benchmark for CvPred

  • Hfoffani benchmarks

CvPred (Text Prediction Utility Application)

CvPred Utility

User guide:

  • Enter the text (at least 1 word)
  • Inputs of 2 or 3 words or longer give better results
  • To trigger the prediction, press the 'GO' button
  • A top 10 list of suggested next words will show up
  • 'Adapted' Stupid backoff scores will be shown
  • A bar chart shows the relative relevance of the different suggestions
  • To run a new prediction, enter new text and press 'GO'