The quanteda package was used (RWeka turned out to be too slow and too memory-consuming). N-grams of order 1 to 4 were built, 60% of the corpus was processed, and stopwords were not removed.
To process a large enough share of the available documents, the texts were split into 20 blocks, each block was cleaned up and tokenized, and the per-block results were reassembled (a sketch follows below).
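A minimal sketch of this chunked counting step with quanteda and data.table; the function names (`count_ngrams_in_block`, `process_corpus`) and cleanup options are illustrative assumptions, not the original code:

```r
library(quanteda)
library(data.table)

## Count n-grams of order n in one block of texts
count_ngrams_in_block <- function(texts, n) {
  toks <- tokens(texts,
                 remove_punct   = TRUE,
                 remove_numbers = TRUE,
                 remove_symbols = TRUE)
  toks   <- tokens_tolower(toks)                 # stopwords deliberately kept
  ngrams <- tokens_ngrams(toks, n = n, concatenator = " ")
  freqs  <- colSums(dfm(ngrams))                 # term counts for this block
  data.table(ngram = names(freqs), count = as.integer(freqs))
}

## Split the sampled corpus into 20 blocks, count per block, then
## reassemble by summing the counts of identical n-grams
process_corpus <- function(texts, n, n_blocks = 20) {
  blocks <- split(texts, cut(seq_along(texts), n_blocks, labels = FALSE))
  counts <- rbindlist(lapply(blocks, count_ngrams_in_block, n = n))
  counts[, .(count = sum(count)), by = ngram]
}
```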
Pruning of low frequencies was executed for each block (frequencies < 5 discarded). "Dictionaries" for the 4- to 2-grams were saved and retrieved as 'data.table' objects.
Results (the frequency databases for each n-gram) are stored as indexed data.tables. These 'data.table' objects (approximately 6.3 MB) are uploaded to Shiny and loaded by the prediction algorithm (see the sketch below).
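A possible shape for the pruning and persistence step, assuming each dictionary keeps the (n-1)-gram prefix and the predicted word as separate indexed columns; the column and file names are assumptions for illustration:

```r
library(data.table)

## Prune rare n-grams, split each n-gram into prefix + predicted word,
## index on the prefix and save the dictionary for later lookup
prune_and_save <- function(dt, n, min_count = 5) {
  dt <- dt[count >= min_count]                    # discard frequencies < 5
  dt[, prefix := sub("\\s+\\S+$", "", ngram)]     # first (n-1) words
  dt[, word   := sub("^.*\\s",    "", ngram)]     # last word (the prediction)
  setkey(dt, prefix)                              # indexed lookup on the prefix
  saveRDS(dt, sprintf("dict_%dgram.rds", n))
  invisible(dt)
}

## In the Shiny app the dictionaries are reloaded once at start-up, e.g.:
## dict4 <- readRDS("dict_4gram.rds"); dict3 <- readRDS("dict_3gram.rds"); ...
```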
Model considerations
The input string is pre-processed in the same way as the corpus, so that look-ups against the dictionaries are normalized (see the sketch below).
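A sketch of that normalization, mirroring the corpus cleanup above; the function name `normalize_input` is illustrative:

```r
library(quanteda)

## Apply the same cleanup to the user input as was applied to the corpus
normalize_input <- function(text) {
  toks <- tokens(text,
                 remove_punct   = TRUE,
                 remove_numbers = TRUE,
                 remove_symbols = TRUE)
  toks <- tokens_tolower(toks)
  as.character(toks[[1]])          # cleaned words; the last ones form the search prefix
}

normalize_input("The cat sat on the")   # -> c("the", "cat", "sat", "on", "the")
```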
Model based on the 'Stupid Backoff' approach with lambda = 0.4, adapted by adding the scores of terms found in different n-gram searches rather than keeping only the score from the highest-order n-gram. For each n-gram 'dictionary' the score is the count of the matched chain divided by the count of the search prefix in the (n-1)-gram 'dictionary' (a sketch follows below).
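A minimal sketch of this adapted scoring, assuming a list `dicts` of keyed data.tables (orders 1 to 4) with columns (ngram, prefix, word, count) as produced above; names, layout and the back-off weighting detail are assumptions, not the original code:

```r
library(data.table)

## words: normalized input words; dicts: list of 1- to 4-gram dictionaries
predict_next <- function(words, dicts, lambda = 0.4) {
  scores <- list()
  for (n in 4:2) {
    if (length(words) < n - 1) next
    pfx  <- paste(tail(words, n - 1), collapse = " ")
    hits <- dicts[[n]][.(pfx), nomatch = 0L]          # keyed lookup on the prefix
    if (nrow(hits) == 0L) next
    ## denominator: count of the search prefix in the (n-1)-gram dictionary
    denom <- dicts[[n - 1]][ngram == pfx, sum(count)]
    if (denom == 0) next
    w <- lambda^(4 - n)          # back-off weight: 1, lambda, lambda^2
    scores[[length(scores) + 1L]] <- hits[, .(word, score = w * count / denom)]
  }
  if (length(scores) == 0L) return(data.table(word = character(), score = numeric()))
  ## adapted step: scores of the same word found at different n-gram orders are summed
  rbindlist(scores)[, .(score = sum(score)), by = word][order(-score)]
}
```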