Text Prediction Model

CvP
November 19th, 2017

Quanteda package used (RWeka turned out to be too slow and too memory-consuming).
1- to 4-grams used; 60% of the corpus processed; stopwords not removed.
Texts broken down into 20 blocks, each cleaned up and tokenized.
Pruning of low frequencies executed for each block (frequencies < 5 discarded).
Block results (frequency databases based on indexed data.tables) merged and saved.
Model based on 'Stupid backoff' approach with lambda = 0.4.
Model adapted to add up the scores a term receives at the different n-gram orders searched (rather than keeping only the score from the highest-order n-gram).
http://www.aclweb.org/anthology/D07-1090.pdf
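
The scoring idea above can be sketched as follows. This is a minimal illustration in Python, not the R/quanteda code behind CvPred: the function names count_ngrams and predict, the in-memory Counter, and the toy sentence are all assumptions made for the example. It also assumes that "adding scores" means summing, for each candidate word, the lambda-weighted relative frequency found at every n-gram order; the additive=False switch gives the usual Stupid Backoff behaviour for comparison.

```python
from collections import Counter, defaultdict

LAMBDA = 0.4  # backoff factor stated in the text
MAX_N = 4     # 1- to 4-grams, as in the text


def count_ngrams(tokens, max_n=MAX_N):
    """Return a Counter mapping n-gram tuples (n = 1..max_n) to frequencies."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict(counts, context, top=3, lam=LAMBDA, max_n=MAX_N, additive=True):
    """Rank candidate next words for `context` (a list of tokens).

    additive=True  : sum a word's weighted scores over all n-gram orders
                     (assumed reading of the adaptation described above).
    additive=False : keep only the score from the highest order in which
                     the word is found (plain Stupid Backoff).
    """
    total_unigrams = sum(c for g, c in counts.items() if len(g) == 1)
    scores = defaultdict(float)
    context = tuple(context)[-(max_n - 1):]  # at most max_n - 1 words of history
    weight = 1.0
    # Walk from the longest available history down to the empty history.
    for k in range(len(context), -1, -1):
        prefix = context[len(context) - k:]
        denom = counts[prefix] if k > 0 else total_unigrams
        if denom > 0:
            for gram, c in counts.items():
                if len(gram) == k + 1 and gram[:k] == prefix:
                    word = gram[-1]
                    score = weight * c / denom
                    if additive:
                        scores[word] += score
                    elif word not in scores:
                        scores[word] = score
        weight *= lam  # each backoff step is discounted by lambda
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


tokens = "the cat sat on the mat the cat ate the fish".split()
counts = count_ngrams(tokens)
print(predict(counts, ["the"]))
print(predict(counts, ["the"], additive=False))
```

In the toy sentence, "cat" follows "the" in 2 of 4 cases, so it scores 0.5 at the bigram level; the additive variant then adds 0.4 times its unigram relative frequency (2/11) on top. The linear scan over all n-grams is only for brevity; an indexed lookup table, such as the indexed data.tables mentioned above, would replace it in practice.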

Hfofani benchmark for CvPred

CvPred Utility

Use guide: