July 26, 2018

N-Gram Next Word Predictor - Predictive Text Algorithm

Objectives

  • This app predicts the next word(s) given one or more words as input
    • For this, a large corpus of Twitter, news and blog content was analyzed
    • We extracted N-grams from the corpus and used them to build the predictive model
    • We also explored various models for improving the prediction accuracy and speed

Designing the Algorithm

  • An N-gram model with a back-off strategy is used for prediction
  • The dataset was cleaned: text was lower-cased, and links, Twitter handles, emojis, punctuation, extra whitespace, numbers etc. were removed
  • Matrices from monograms (1-grams) to hexagrams (6-grams) were extracted and sorted by frequency of occurrence
  • The size of the model was reduced by dropping the least frequent N-grams
  • Speed and memory usage were further optimized by dropping the least frequent bigrams and monograms, since they did not appear to improve accuracy
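
The cleaning, extraction and pruning steps can be illustrated with a minimal Python sketch. It is not the app's actual code: the function names (clean_text, count_ngrams, prune), the regular expressions, the choice to keep apostrophes inside words, and the min_count threshold are all assumptions made for illustration.

```python
import re
from collections import Counter

def clean_text(text):
    """Lower-case and strip links, handles, digits, punctuation and extra whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)                   # Twitter handles
    # Assumption: keep only letters, apostrophes and whitespace, which also
    # drops digits, emojis and the remaining punctuation.
    text = re.sub(r"[^a-z'\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace

def count_ngrams(lines, max_n=6):
    """Return {n: Counter mapping n-gram tuples to counts} for n = 1..max_n."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean_text(line).split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                tables[n][tuple(tokens[i:i + n])] += 1
    return tables

def prune(tables, min_count=2):
    """Drop N-grams seen fewer than min_count times (hypothetical threshold)."""
    return {
        n: Counter({gram: c for gram, c in table.items() if c >= min_count})
        for n, table in tables.items()
    }

if __name__ == "__main__":
    tables = prune(count_ngrams(["The cat sat.", "The cat ran!"], max_n=3))
    # most_common() returns each table sorted by frequency, highest first
    print(tables[2].most_common())
```

Storing each N-gram as a tuple keyed in a Counter keeps the tables sortable by frequency and makes pruning a simple filter; the real app may use a different structure.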

Predictive Algorithm for the app

  • Input Word(s): text input box for user to type a phrase / word
  • The typed words are detected and the next word(s) are predicted reactively
  • The lookup iterates from the longest N-gram (hexagram) to the shortest (bigram)
  • The last word of the matching N-gram is used as the predicted word
  • Predictions are made using the longest, most frequent, matching N-gram
  • If no match is found in any of the 6-gram to 2-gram tables, the app selects the most frequent word from the monograms
  • User can configure the number of words the app should suggest
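
The back-off lookup described above can be sketched roughly as follows. This is an illustrative Python version under stated assumptions, not the app's implementation: the predict function, its parameters k and max_n, and the tables layout (a dict mapping n to {n-gram tuple: count}) are hypothetical. It also assumes that when the longest matching N-gram yields fewer than k words, shorter N-grams fill the remaining slots.

```python
def predict(tables, phrase, k=3, max_n=6):
    """Suggest up to k next words, backing off from max_n-grams to bigrams.

    tables: {n: {ngram_tuple: count}} (assumed layout, not the app's own).
    """
    tokens = phrase.lower().split()
    suggestions = []
    for n in range(max_n, 1, -1):            # longest N-gram first
        context = tuple(tokens[-(n - 1):])   # last n-1 typed words
        if len(context) < n - 1:
            continue                         # input too short for this order
        matches = [
            (gram, count)
            for gram, count in tables.get(n, {}).items()
            if gram[:-1] == context
        ]
        matches.sort(key=lambda item: -item[1])   # most frequent first
        for gram, _ in matches:
            word = gram[-1]                  # last word is the prediction
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    if suggestions:
        return suggestions
    # No match in any 6- to 2-gram table: fall back to the most frequent monograms.
    unigrams = sorted(tables.get(1, {}).items(), key=lambda item: -item[1])
    return [gram[0] for gram, _ in unigrams[:k]]
```

The linear scan over each table is kept for readability; indexing each table by its (n-1)-word prefix would make the lookup much faster for a large model.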

Application Interface