Text Prediction Model
CvP
November 19, 2017
Model & Strategy Principles
- Built with the quanteda package (RWeka proved too slow and memory-hungry).
- 1- to 4-grams used, 60% of the corpus processed, stopwords kept.
- Texts split into 20 blocks, each cleaned up and tokenized.
- Pruning of low frequencies applied to each block (frequencies < 5 discarded).
- Block results (frequency databases based on indexed data.tables) merged and saved.
- Model based on 'Stupid backoff' approach with lambda = 0.4.
- Model adapted to add up the scores a term obtains in the different n-gram searches (rather than keeping only the score from the highest-order n-gram).
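
The steps above can be illustrated with a minimal sketch. It is written in plain Python for brevity and is not the quanteda/data.table code behind the app. The function names, the whitespace tokenizer, the in-memory `Counter` tables, and the choice to weight each order by lambda raised to the number of backoff steps from the highest usable order are all assumptions made for illustration.

```python
from collections import Counter

LAMBDA = 0.4   # backoff factor stated in the slides
MAX_N = 4      # 1- to 4-grams
MIN_COUNT = 5  # counts below this are pruned in each block


def tokenize(text):
    # Placeholder clean-up (assumption): lowercase and split on whitespace.
    return text.lower().split()


def count_block(texts, min_count=MIN_COUNT):
    """Count all 1..MAX_N-grams in one block, then drop the rare ones."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for n in range(1, MAX_N + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return Counter({g: c for g, c in counts.items() if c >= min_count})


def build_model(blocks, min_count=MIN_COUNT):
    """Merge the pruned per-block counts into one frequency table."""
    merged = Counter()
    for block in blocks:
        merged.update(count_block(block, min_count))
    return merged


def adapted_scores(model, text):
    """Sum the weighted scores of a word over every n-gram order that matches."""
    tokens = tokenize(text)
    total_unigrams = sum(c for g, c in model.items() if len(g) == 1)
    highest = min(MAX_N, len(tokens) + 1)  # highest order the input allows
    scores = Counter()
    for n in range(highest, 0, -1):
        context = tuple(tokens[-(n - 1):]) if n > 1 else ()
        denom = model.get(context, 0) if n > 1 else total_unigrams
        if denom == 0:
            continue  # context never seen: nothing to add at this order
        weight = LAMBDA ** (highest - n)
        # Linear scan for clarity; an indexed table would avoid it.
        for gram, count in model.items():
            if len(gram) == n and gram[:-1] == context:
                scores[gram[-1]] += weight * count / denom
    return scores


if __name__ == "__main__":
    blocks = [["a b c", "a b d"], ["a b c"]]
    model = build_model(blocks, min_count=1)  # tiny toy data, so no pruning
    print(adapted_scores(model, "a b").most_common(3))
```

Note that pruning each block before merging means an n-gram seen fewer than 5 times in every block is lost even if its total count is higher; this trades some recall for a much smaller table.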
Accuracy & Performance Considerations
- Technical environment description

CvPred (Text Prediction Utility Application)
User guide:
- Enter the text (at least 1 word)
- Inputs of 2-3 words or longer give better results
- To trigger the prediction, press the 'GO' button
- A top 10 list of suggested next words will show up
- 'Adapted' Stupid backoff scores will be shown
- Bar chart shows the relative relevance of the different suggestions
- To run a new prediction, enter new text and press GO
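
As a rough illustration of how the top 10 list and the bar chart values could be derived from a table of scores, here is a minimal Python sketch. The function name `top_suggestions`, the sample scores, and the choice to express relative relevance as each score's share of the top-k total are assumptions; the app itself may normalize differently.

```python
def top_suggestions(scores, k=10):
    """Return up to k (word, score, share) tuples, best score first."""
    # Ties are broken alphabetically so the ordering is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    total = sum(score for _, score in ranked)
    return [(word, score, score / total if total else 0.0)
            for word, score in ranked]


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    sample = {"the": 0.5, "a": 0.3, "of": 0.2}
    for word, score, share in top_suggestions(sample):
        # Text stand-in for the bar chart: one '#' per 5% of relative relevance.
        print(f"{word:<6}{score:6.3f}  {'#' * round(share * 20)}")
```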