Predicting Next Word

The working data set undergoes optimized random sampling, cleansing, and filtering to increase n-gram frequency rates and prediction probability.
The filtering process removes not only symbols but also hashtags, email addresses, emoticons, word elongations, numbers, ordinals, etc.
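The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the original rules: every pattern and the function name `clean_text` are assumptions, and elongations are simply collapsed to a single letter.

```python
import re

# Hypothetical patterns; the actual filtering rules are not given in the text.
EMAIL = re.compile(r"\S+@\S+\.\S+")
HASHTAG = re.compile(r"#\w+")
# Simple Western-style emoticons such as :) ;-( =D, standing alone between spaces.
EMOTICON = re.compile(r"(?<!\S)[:;=][\-o*']?[)(\]\[dDpP/\\|](?!\S)")
ORDINAL = re.compile(r"\b\d+(?:st|nd|rd|th)\b", re.IGNORECASE)
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
ELONGATION = re.compile(r"([a-z])\1{2,}")  # three or more of the same letter
SYMBOLS = re.compile(r"[^a-z'\s]")         # anything except letters, apostrophes, spaces


def clean_text(text):
    """Return a lowercased, filtered version of `text`."""
    text = EMAIL.sub(" ", text)
    text = HASHTAG.sub(" ", text)
    text = EMOTICON.sub(" ", text)
    text = text.lower()
    text = ORDINAL.sub(" ", text)       # ordinals go before plain numbers
    text = NUMBER.sub(" ", text)
    text = ELONGATION.sub(r"\1", text)  # "sooo" becomes "so"
    text = SYMBOLS.sub(" ", text)
    return " ".join(text.split())       # normalize whitespace


print(clean_text("Sooo happy :) #win mail me@x.com on the 3rd, 42 times!!!"))
```

The order of the substitutions matters in this sketch: emails, hashtags, and emoticons are removed before the generic symbol filter, which would otherwise leave fragments such as "win" or "x com" behind.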
Different dictionaries are created from combinations of n-gram order (one to five), minimum word size (one to five characters), and whether or not common stop words are removed. Weights are calculated as percentages.
A further filtering step reduces dictionary size, based on the extremely skewed histogram of repeated words and on a minimum number of repetitions for each word.
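One way to picture a single dictionary is the sketch below. It assumes that a weight is the percentage of an n-gram's count among all retained n-grams sharing the same preceding words; the text does not specify the base of the percentage. The function name `build_dictionary`, its parameters, and the small stop-word list are all hypothetical.

```python
from collections import Counter, defaultdict

# Illustrative stop-word list; the original list is not given.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in"}


def build_dictionary(sentences, n, min_chars=1, drop_stop_words=False, min_count=1):
    """Map each (n-1)-word context to {next_word: weight in percent}."""
    counts = Counter()
    for sentence in sentences:
        tokens = [
            w for w in sentence.split()
            if len(w) >= min_chars and not (drop_stop_words and w in STOP_WORDS)
        ]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1

    # Drop n-grams that repeat fewer than `min_count` times to shrink the dictionary.
    kept = {gram: c for gram, c in counts.items() if c >= min_count}

    # Total count per context, used as the base of the percentage (an assumption).
    totals = defaultdict(int)
    for gram, c in kept.items():
        totals[gram[:-1]] += c

    dictionary = defaultdict(dict)
    for gram, c in kept.items():
        dictionary[gram[:-1]][gram[-1]] = 100.0 * c / totals[gram[:-1]]
    return dict(dictionary)


corpus = ["i love you", "i love cats", "i love you"]
bigrams = build_dictionary(corpus, n=2)
pruned = build_dictionary(corpus, n=2, min_count=2)
```

Calling the function once per combination of `n`, `min_chars`, and `drop_stop_words` would yield the family of dictionaries described above.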
We create a training set from random sentences drawn from all of the corpora. For each sentence it includes 25 possible solutions, with information based on the different n-gram percentages, minimum characters per word, and part-of-speech tagging, forming a matrix.
This entire matrix of weights is used to train random forests and extreme gradient boosting models, which optimize the weighted hit value and, based on feature importance, determine which dictionary’s chosen word is used next.
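The rows of such a training matrix might be assembled as in the sketch below. It assumes that each dictionary contributes its single highest-weighted word as one candidate solution, and that the last word of a sentence is held out as the target; neither detail is stated in the text. Part-of-speech features and the model fitting itself are omitted, and the name `candidate_rows` and the row fields are hypothetical.

```python
def candidate_rows(sentence, dictionaries):
    """Build one feature row per dictionary for a sentence with its last word held out.

    `dictionaries` maps (n, min_chars) to {context_tuple: {word: weight_percent}}.
    """
    tokens = sentence.split()
    history, target = tokens[:-1], tokens[-1]
    rows = []
    for (n, min_chars), table in dictionaries.items():
        # Apply the same minimum-length filter the dictionary was built with.
        kept = [w for w in history if len(w) >= min_chars]
        # The last n-1 kept words form the lookup context (empty for unigrams).
        context = tuple(kept[-(n - 1):]) if n > 1 else ()
        options = table.get(context, {})
        if options:
            word, weight = max(options.items(), key=lambda kv: kv[1])
        else:
            word, weight = None, 0.0  # this dictionary has no suggestion
        rows.append({
            "n": n,
            "min_chars": min_chars,
            "candidate": word,
            "weight": weight,
            "hit": int(word == target),  # label: did this dictionary guess right?
        })
    return rows


# Toy dictionaries with made-up weights, for illustration only.
toy = {
    (2, 1): {("love",): {"you": 66.7, "cats": 33.3}},
    (3, 1): {("i", "love"): {"cats": 100.0}},
}
rows = candidate_rows("i love you", toy)
```

Rows like these, stacked over many sampled sentences, would form the matrix on which a classifier is trained, with the `hit` column as the label.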

Introduction