July 26, 2018
N-Gram Next Word Predictor - Predictive Text Algorithm
Objectives
- This app predicts the next word(s) given one or more words as input
- For this, a large corpus of Twitter, news and blog content was analyzed
- We extracted N-grams from the corpus and used them to build the predictive model
- We also explored various models for improving the prediction accuracy and speed
Designing the Algorithm
- An N-gram model with a back-off strategy was used to build the predictor
- The dataset was cleaned and lower-cased; links, Twitter handles, emojis, punctuation, extra whitespace, numbers etc. were removed
- N-gram matrices from unigram to hexagram (1- to 6-gram) were extracted and sorted by frequency of occurrence
- The size of the model was reduced by dropping the least frequent N-grams
- Speed and memory usage were further optimized by dropping the least frequent bigrams and unigrams, since they did not appear to improve accuracy
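The preparation steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual code: the function names (`clean`, `count_ngrams`, `prune`), the regular expressions and the `min_count` threshold are all assumptions made for the example.

```python
import re
from collections import Counter

def clean(text):
    """Lower-case the text and strip links, handles, numbers, punctuation and emojis."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)                   # Twitter handles
    # Assumption: anything that is not a-z or whitespace (digits,
    # punctuation, emojis) is simply replaced by a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()  # split() also collapses extra whitespace

def count_ngrams(tokens, max_n=6):
    """Count every N-gram from unigram up to max_n (hexagram by default)."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def prune(counts, min_count=2):
    """Drop N-grams seen fewer than min_count times (hypothetical threshold)."""
    return {
        n: Counter({gram: c for gram, c in table.items() if c >= min_count})
        for n, table in counts.items()
    }
```

Sorting by frequency is then available through `Counter.most_common()` on each table. The threshold actually used to drop rare N-grams is not stated here, so `min_count=2` is only a placeholder.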
Predictive Algorithm for the app
- Input Word(s): text input box for user to type a phrase / word
- The typed words are detected and the next word(s) are predicted reactively
- The search iterates from the longest N-gram (hexagram) to the shortest (bigram)
- The last word of the matching N-gram is used as the predicted word
- Predictions are made using the longest, most frequent, matching N-gram
- If no match is found among the 6-grams through 2-grams, the app selects the most frequent word from the unigrams
- The user can configure the number of words the app should suggest
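The back-off lookup described above might look something like the sketch below. It is written in Python for illustration only. The `predict` function, the `tables` layout (a dict mapping N to `{ngram_tuple: count}`) and the parameter `k` are hypothetical, and filling the remaining suggestions from shorter N-grams when the longest match yields fewer than `k` words is an assumption.

```python
def predict(tables, phrase, k=3, max_n=6):
    """Suggest up to k next words, backing off from max_n-grams to bigrams.

    tables: {n: {ngram_tuple: count}}, an assumed layout for this sketch.
    """
    words = phrase.lower().split()
    suggestions = []
    for n in range(max_n, 1, -1):            # hexagram down to bigram
        if len(words) < n - 1:
            continue                         # not enough typed words for this order
        context = tuple(words[-(n - 1):])    # last n-1 typed words
        matches = [(gram, c) for gram, c in tables.get(n, {}).items()
                   if gram[:-1] == context]
        # Most frequent matching N-gram first; its last word is the prediction.
        for gram, _ in sorted(matches, key=lambda gc: -gc[1]):
            if gram[-1] not in suggestions:
                suggestions.append(gram[-1])
            if len(suggestions) == k:
                return suggestions
    if suggestions:
        return suggestions
    # No match in any 6- to 2-gram table: fall back to the most frequent unigrams.
    top = sorted(tables.get(1, {}).items(), key=lambda gc: -gc[1])
    return [gram[0] for gram, _ in top[:k]]
```

The linear scan over each table keeps the sketch short. A faster version would probably index each table by its context (the first N-1 words) so that every lookup is a single dictionary access.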
Application Interface