Additional filtering for scalability: the corpus of more than 7 million texts causes performance problems for a product with limited resources. The discounted Kneser-Ney smoothing criteria (http://mkoerner.de/media/bachelor-thesis.pdf) help with filtering by applying criteria on word variability, e.g. with prior words 3 and 4 fixed, the probability of word 5. The dataset is reduced to 100,000 lines.
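
To make the idea of filtering on word variability concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the filter actually used: it fixes a two-word context, counts the distinct words that follow it, and computes only the discounted higher-order term max(c - D, 0) / total of Kneser-Ney smoothing, leaving out the backoff and continuation terms. The function names, the discount D = 0.75, the threshold `min_prob`, and the sample lines are all hypothetical.

```python
from collections import Counter, defaultdict


def context_stats(lines):
    """Map each two-word context to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            followers[(w1, w2)][w3] += 1
    return followers


def variability(followers, context):
    """Number of distinct words observed after the fixed context."""
    return len(followers.get(context, ()))


def discounted_prob(followers, context, word, discount=0.75):
    """Discounted estimate max(c - D, 0) / total (higher-order term only)."""
    counts = followers.get(context)
    if not counts:
        return 0.0
    total = sum(counts.values())
    return max(counts[word] - discount, 0.0) / total


def prune(followers, min_prob=0.1, discount=0.75):
    """Keep only followers whose discounted estimate reaches min_prob (assumed criterion)."""
    kept = {}
    for context, counts in followers.items():
        strong = {
            word: count
            for word, count in counts.items()
            if discounted_prob(followers, context, word, discount) >= min_prob
        }
        if strong:
            kept[context] = strong
    return kept


# Hypothetical sample lines, for illustration only.
sample = ["the cat sat on the mat", "the cat sat down", "the cat ran"]
stats = context_stats(sample)
pruned = prune(stats)
```

In this sample the context ("the", "cat") has two distinct followers, and pruning drops the rarer one ("ran") because its discounted estimate falls below the assumed threshold. A filter of this kind shrinks the n-gram table; reducing the corpus itself to 100,000 lines is a separate step.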