For that reason, unigrams, bigrams, trigrams, 4-grams and 5-grams are created with the N-gram package, and we decide to keep all of them, as long as each appears at least twice in the body of our data.
At the same time, in order to deal with profanity issues, we discard any N-grams that contain words from the “SwearWords.csv” list that can be found at www.bannedwordlist.com.
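As an illustration of this preprocessing step, here is a minimal Python sketch that builds the pruned, profanity-filtered frequency tables; the function name, the token-list representation and the `swear_words` set are our own assumptions, not the actual implementation:

```python
from collections import Counter

def build_ngram_tables(tokens, swear_words, max_n=5, min_count=2):
    """Build frequency tables for 1- to 5-grams, keeping only N-grams
    seen at least `min_count` times and containing no banned words."""
    tables = {}
    for n in range(1, max_n + 1):
        counts = Counter(tuple(tokens[i:i + n])
                         for i in range(len(tokens) - n + 1))
        tables[n] = {gram: c for gram, c in counts.items()
                     if c >= min_count
                     and not any(w in swear_words for w in gram)}
    return tables
```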
The 'Prediction Model' algorithm
The prediction model is based on an optimized Stupid Backoff (λ = 0.4) N-gram frequency algorithm.
The 5-grams are the first N-grams to be used: the algorithm takes the last four words the user has provided and looks up “probabilities” for the fifth word in the 5-gram frequency tables of our “train” text corpus, which serve as frequency dictionaries.
If no match is found, the 4-grams are used, taking into account the last three words of the user input.
If again no match is found, the algorithm continues the same procedure with the trigrams and the bigrams, until it eventually falls back to proposing the most frequent single words (unigrams) of our text corpus, regardless of the user input.
If, as is most often the case, the search finds one or more suggestions, the candidates coming from the lower-order N-gram frequency dictionaries are discounted by the back-off weight λ, i.e. multiplied by 0.4 for each level of back-off.
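The back-off loop can be sketched as follows, again as a hypothetical Python illustration of Stupid Backoff scoring over the frequency dictionaries built above; the λ = 0.4 discount per back-off level comes from the description above, while the function names and the use of relative frequencies within each order are our own assumptions:

```python
def predict_next(tables, context, lam=0.4, top_k=3):
    """Rank candidate next words with Stupid Backoff: start from
    5-grams (last four words of `context`) and back off to shorter
    N-grams, multiplying each lower order's scores by `lam`."""
    scores = {}
    weight = 1.0
    for n in range(5, 1, -1):              # 5-grams, 4-grams, ..., bigrams
        if len(context) >= n - 1:
            history = tuple(context[-(n - 1):])
            matches = {gram[-1]: c for gram, c in tables[n].items()
                       if gram[:-1] == history}
            total = sum(matches.values())  # proxy for the history count
            for word, count in matches.items():
                # keep the highest-order (least discounted) score per word
                scores.setdefault(word, weight * count / total)
        weight *= lam                      # discount the next lower order
    if not scores:                         # no match at any order:
        total = sum(tables[1].values())    # propose the top unigrams
        scores = {g[0]: weight * c / total
                  for g, c in tables[1].items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Calling `predict_next(tables, user_input.split())` on the tables from the previous sketch would then return the top-ranked candidate words for the user's input.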