Additional filtering for scalability: the corpus of more than 7 million texts causes performance problems for a product with limited resources. Criteria drawn from discounted Kneser-Ney smoothing (http://mkoerner.de/media/bachelor-thesis.pdf) help filter the data by word variability: for example, with the prior 4 words fixed, sequences whose 5th word shows high variation are retained, which increases the variety of word combinations. The dataset is reduced to 100,000 lines
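The variability criterion can be illustrated with a minimal sketch. It is not the original implementation: the function names, the `min_variety` threshold, and the rule of scoring a line by its most varied context are assumptions made for illustration. It counts, for each fixed context of prior words, how many distinct words follow it, then keeps the lines with the most varied contexts up to a line budget.

```python
from collections import defaultdict


def continuation_variety(lines, context_len=4):
    """Map each context (tuple of context_len words) to the set of
    distinct words observed immediately after it."""
    followers = defaultdict(set)
    for line in lines:
        words = line.lower().split()
        for i in range(len(words) - context_len):
            context = tuple(words[i:i + context_len])
            followers[context].add(words[i + context_len])
    return followers


def filter_by_variety(lines, context_len=4, min_variety=2, max_lines=100_000):
    """Keep lines containing at least one context with min_variety or more
    distinct next words (hypothetical threshold), most varied first."""
    lines = list(lines)  # the corpus is traversed twice
    followers = continuation_variety(lines, context_len)
    scored = []
    for idx, line in enumerate(lines):
        words = line.lower().split()
        best = 0
        for i in range(len(words) - context_len):
            context = tuple(words[i:i + context_len])
            best = max(best, len(followers[context]))
        if best >= min_variety:
            scored.append((best, idx, line))
    # Highest variety first; ties keep the original corpus order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [line for _, _, line in scored[:max_lines]]
```

With `context_len=4` and `max_lines=100_000` this mirrors the numbers quoted above; a real pipeline would probably stream the corpus rather than hold 7 million lines in memory.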
The training dataset is key: the algorithm can be extended with more sophisticated data filtering and additional data processing, such as word similarity…
Self-learning system: unknown words and texts can be stored and evaluated so that they become potential new entries for the training dataset in an automated fashion
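One way such a self-learning loop could work is sketched below. Everything in it is hypothetical: the `CandidateStore` class, its method names, and the `min_count` promotion rule are illustrative assumptions, not part of the existing product. Texts containing out-of-vocabulary words are stored, and a word is promoted once it has been seen often enough.

```python
from collections import Counter


class CandidateStore:
    """Collects unknown words and the texts they came from, and promotes
    frequently seen words to the vocabulary (hypothetical design)."""

    def __init__(self, vocabulary, min_count=3):
        self.vocabulary = set(vocabulary)
        self.min_count = min_count      # assumed promotion threshold
        self.unknown_counts = Counter()
        self.pending_texts = []

    def observe(self, text):
        """Record a text; return the words not yet in the vocabulary."""
        words = text.lower().split()
        unknown = [w for w in words if w not in self.vocabulary]
        if unknown:
            self.unknown_counts.update(unknown)
            self.pending_texts.append(text)
        return unknown

    def promote(self):
        """Move words seen at least min_count times into the vocabulary and
        return them with the stored texts that contain them."""
        promoted = {w for w, c in self.unknown_counts.items()
                    if c >= self.min_count}
        accepted, remaining = [], []
        for text in self.pending_texts:
            has_promoted = any(w in promoted for w in text.lower().split())
            (accepted if has_promoted else remaining).append(text)
        self.vocabulary |= promoted
        for word in promoted:
            del self.unknown_counts[word]
        self.pending_texts = remaining
        return promoted, accepted
```

The accepted texts would then be candidates for the next training run; the evaluation step before retraining is left open here, as it is in the point above.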
Specialization of the text prediction mechanism according to the type of texts being analyzed