quanteda package used (RWeka turned out to be too slow and memory-consuming).
1- to 4-grams used, 60% of the corpus processed, stopwords not removed.
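A minimal sketch of this step with quanteda, assuming a small in-memory character vector 'block_texts' (hypothetical name); the real run applies this block by block to the sampled corpus:

```r
library(quanteda)

block_texts <- c("this is a small sample of the training corpus",
                 "another sample sentence for the n-gram model")

# tokenize without removing stopwords (they are deliberately kept)
toks <- tokens(block_texts,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)

# build 1- to 4-grams and count them in a document-feature matrix
ngrams    <- tokens_ngrams(toks, n = 1:4, concatenator = " ")
ngram_dfm <- dfm(ngrams)
```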
Block tracking performed in order to pull the complementary 40% training block set.
To make it possible to process a large enough share of the available documents, texts were broken into pieces (20 blocks), processed (cleaned up and tokenized) and the results reassembled.
Pruning of low frequencies executed for each block (frequencies < 5 discarded).
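A hedged sketch of the block-wise counting and per-block pruning, assuming the sampled texts sit in a character vector 'sample_texts' (hypothetical name); the exact cleaning options used in the project may differ:

```r
library(quanteda)
library(data.table)

n_blocks <- 20
block_id <- cut(seq_along(sample_texts), breaks = n_blocks, labels = FALSE)

count_block <- function(txt) {
  toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
  ng   <- tokens_ngrams(toks, n = 1:4, concatenator = " ")
  freq <- colSums(dfm(ng))                       # n-gram counts for this block
  dt   <- data.table(term = names(freq), count = as.integer(freq))
  dt[count >= 5]                                 # per-block pruning of frequencies < 5
}

# process each block separately, then reassemble the frequency table
block_counts <- lapply(split(sample_texts, block_id), count_block)
freq_dt <- rbindlist(block_counts)[, .(count = sum(count)), by = term]
```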
"Dictionaries" saved and retrieved for 4- down to 2-grams as 'data.table' objects.
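One possible way to build and persist such a dictionary, shown for the 4-grams; the column names ('term', 'prefix', 'word', 'count') and the file name are assumptions, not the project's actual ones:

```r
library(data.table)

# take the 4-grams out of the combined frequency table and split each term
# into its 3-word search prefix and the predicted last word
dict_4gram <- freq_dt[lengths(strsplit(term, " ")) == 4]
dict_4gram[, prefix := sub(" [^ ]+$", "", term)]   # everything but the last word
dict_4gram[, word   := sub("^.* ",    "", term)]   # last word only

saveRDS(dict_4gram, "dict_4gram.rds")              # saved ...
dict_4gram <- readRDS("dict_4gram.rds")            # ... and retrieved later
```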
After multiple optimizations, the complete process took circa 45 minutes on the machine detailed in this same document.
Results (a frequency database for each n-gram order) are stored as indexed data.tables.
These 'data.table' objects (approximately 6.3 MB) are uploaded to Shiny and loaded by the prediction algorithm.
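A sketch of how these objects could be loaded once at app start-up and keyed for fast lookups; file and column names are assumptions:

```r
library(data.table)

# loaded once when the Shiny app starts (e.g. from global.R)
dict_2gram <- readRDS("dict_2gram.rds")
dict_3gram <- readRDS("dict_3gram.rds")
dict_4gram <- readRDS("dict_4gram.rds")

# keying on the search prefix turns each lookup into a binary search
setkey(dict_2gram, prefix)
setkey(dict_3gram, prefix)
setkey(dict_4gram, prefix)
```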
Model considerations
The input string is pre-processed so that the lookup works on normalized tokens.
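A minimal sketch of such a normalization step (the actual cleaning rules used by the app may differ):

```r
# lower-case, strip everything that is not a letter, an apostrophe or a space,
# collapse whitespace and return the individual tokens
normalize_input <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)
  x <- gsub("\\s+", " ", trimws(x))
  strsplit(x, " ")[[1]]
}

normalize_input("What a Wonderful   DAY!!")  # "what" "a" "wonderful" "day"
```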
Execution-time targets were established: acceptable below 100 ms, optimal around 10 ms.
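One way to check that target, assuming a prediction function like the 'predict_next_word()' sketched in the model section below:

```r
library(microbenchmark)

# time 100 predictions for a typical query
microbenchmark(predict_next_word(normalize_input("thanks for the")), times = 100L)
```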
Model based on 'Stupid backoff' approach with lambda = 0.4.
For each n-gram 'dictionary', the score is calculated as the count of the matched n-gram divided by the count of its search prefix (the first n-1 words) in the (n-1)-gram 'dictionary'.
The model has been adapted to add the scores of terms found in the different n-gram searches, rather than keeping only the score from the highest-order n-gram.
When no result is found, the term 'the' with a fictitious score of 0.1 is provided as output.
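A hedged sketch of the whole scoring step under these rules. The dictionary layout (columns 'prefix', 'word', 'count', plus 'term' in the lower-order table), the unigram table 'dict_1gram' used for the bigram denominator, and the exact way the 0.4 discount is applied at each back-off level are all assumptions:

```r
library(data.table)

lambda <- 0.4

# score one n-gram table: count of the matched chain divided by the count of
# its prefix in the (n-1)-gram table, multiplied by the back-off discount
score_ngram <- function(dict_n, dict_lower, prefix_words, discount) {
  prefix_key <- paste(prefix_words, collapse = " ")
  hits <- dict_n[prefix == prefix_key]
  if (nrow(hits) == 0L) return(NULL)
  prefix_count <- dict_lower[term == prefix_key, sum(count)]
  hits[, .(word, score = discount * count / prefix_count)]
}

predict_next_word <- function(input_tokens) {
  cand <- rbindlist(list(
    score_ngram(dict_4gram, dict_3gram, tail(input_tokens, 3), 1),
    score_ngram(dict_3gram, dict_2gram, tail(input_tokens, 2), lambda),
    score_ngram(dict_2gram, dict_1gram, tail(input_tokens, 1), lambda^2)
  ), use.names = TRUE)

  # fallback described above: 'the' with a fictitious score of 0.1
  if (is.null(cand) || nrow(cand) == 0L)
    return(data.table(word = "the", score = 0.1))

  # add up the scores a candidate collects across the different n-gram tables
  head(cand[, .(score = sum(score)), by = word][order(-score)], 3)
}
```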