The quanteda package was used (RWeka turned out to be too slow and too memory-consuming). N-grams of order 1 to 4 were built, 60% of the corpus was processed, and stopwords were not removed.
To process a large enough share of the available documents, the texts were split into 20 blocks, each block was cleaned up and tokenized, and the per-block results were reassembled (a sketch follows below).
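A minimal sketch of this chunked counting step with quanteda and data.table; the function names (`count_ngrams_in_block`, `process_corpus`) and cleanup options are illustrative assumptions, not the original code:

```r
library(quanteda)
library(data.table)

## Count n-grams of order n in one block of texts
count_ngrams_in_block <- function(texts, n) {
  toks <- tokens(texts,
                 remove_punct   = TRUE,
                 remove_numbers = TRUE,
                 remove_symbols = TRUE)
  toks   <- tokens_tolower(toks)                 # stopwords deliberately kept
  ngrams <- tokens_ngrams(toks, n = n, concatenator = " ")
  freqs  <- colSums(dfm(ngrams))                 # term counts for this block
  data.table(ngram = names(freqs), count = as.integer(freqs))
}

## Split the sampled corpus into 20 blocks, count per block, then
## reassemble by summing the counts of identical n-grams
process_corpus <- function(texts, n, n_blocks = 20) {
  blocks <- split(texts, cut(seq_along(texts), n_blocks, labels = FALSE))
  counts <- rbindlist(lapply(blocks, count_ngrams_in_block, n = n))
  counts[, .(count = sum(count)), by = ngram]
}
```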
Pruning of low frequencies was executed for each block (frequencies < 5 discarded). "Dictionaries" for the 4- to 2-grams were saved and retrieved as 'data.table' objects.
Results (the frequency databases for each n-gram) are stored as indexed data.tables. These 'data.table' objects (approximately 6.3 MB) are uploaded to Shiny and loaded by the prediction algorithm (see the sketch below).
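A possible shape for the pruning and persistence step, assuming each dictionary keeps the (n-1)-gram prefix and the predicted word as separate indexed columns; the column and file names are assumptions for illustration:

```r
library(data.table)

## Prune rare n-grams, split each n-gram into prefix + predicted word,
## index on the prefix and save the dictionary for later lookup
prune_and_save <- function(dt, n, min_count = 5) {
  dt <- dt[count >= min_count]                    # discard frequencies < 5
  dt[, prefix := sub("\\s+\\S+$", "", ngram)]     # first (n-1) words
  dt[, word   := sub("^.*\\s",    "", ngram)]     # last word (the prediction)
  setkey(dt, prefix)                              # indexed lookup on the prefix
  saveRDS(dt, sprintf("dict_%dgram.rds", n))
  invisible(dt)
}

## In the Shiny app the dictionaries are reloaded once at start-up, e.g.:
## dict4 <- readRDS("dict_4gram.rds"); dict3 <- readRDS("dict_3gram.rds"); ...
```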
Model considerations
The input string is pre-processed in the same way as the corpus, so that look-ups against the dictionaries are normalized (see the sketch below).
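A sketch of that normalization, mirroring the corpus cleanup above; the function name `normalize_input` is illustrative:

```r
library(quanteda)

## Apply the same cleanup to the user input as was applied to the corpus
normalize_input <- function(text) {
  toks <- tokens(text,
                 remove_punct   = TRUE,
                 remove_numbers = TRUE,
                 remove_symbols = TRUE)
  toks <- tokens_tolower(toks)
  as.character(toks[[1]])          # cleaned words; the last ones form the search prefix
}

normalize_input("The cat sat on the")   # -> c("the", "cat", "sat", "on", "the")
```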
Model based on the 'Stupid Backoff' approach with lambda = 0.4, adapted by adding the scores of terms found in different n-gram searches rather than keeping only the score from the highest-order n-gram. For each n-gram 'dictionary' the score is the count of the matched chain divided by the count of the search prefix in the (n-1)-gram 'dictionary' (a sketch follows below).
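A minimal sketch of this adapted scoring, assuming a list `dicts` of keyed data.tables (orders 1 to 4) with columns (ngram, prefix, word, count) as produced above; names, layout and the back-off weighting detail are assumptions, not the original code:

```r
library(data.table)

## words: normalized input words; dicts: list of 1- to 4-gram dictionaries
predict_next <- function(words, dicts, lambda = 0.4) {
  scores <- list()
  for (n in 4:2) {
    if (length(words) < n - 1) next
    pfx  <- paste(tail(words, n - 1), collapse = " ")
    hits <- dicts[[n]][.(pfx), nomatch = 0L]          # keyed lookup on the prefix
    if (nrow(hits) == 0L) next
    ## denominator: count of the search prefix in the (n-1)-gram dictionary
    denom <- dicts[[n - 1]][ngram == pfx, sum(count)]
    if (denom == 0) next
    w <- lambda^(4 - n)          # back-off weight: 1, lambda, lambda^2
    scores[[length(scores) + 1L]] <- hits[, .(word, score = w * count / denom)]
  }
  if (length(scores) == 0L) return(data.table(word = character(), score = numeric()))
  ## adapted step: scores of the same word found at different n-gram orders are summed
  rbindlist(scores)[, .(score = sum(score)), by = word][order(-score)]
}
```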