Text Prediction Product Based on N-gram Models

Siying Ruan
Nov. 26, 2020

Product Function

This is an application that predicts text based on the user's input. The training data is a random sample of the text files provided by SwiftKey. You can get phrase examples from GitHub.

  • Type in the text box, then click Predict to see possible words that follow the text.
  • Click Clear to clear your text input and keep the last phrase and its prediction for reference. You can then compare the results of the two phrases.
  • The predicted words are shown in their lemma form. You can compare the results by typing the same word in different tenses.

Major Components of the Product

  1. A dictionary selected from a unigram model by taking the 10 most used words in the sample text.
  2. A dictionary built from a bigram model and a trigram model trained on the sample text. Only the top 5 possible words by probability are kept.
  3. An algorithm that gets the possible next words for the given text. It can predict words based on the last one or two words.
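
The first two components can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual code: the names `tokenize` and `build_models` are hypothetical, the tokenizer is a plain whitespace split with no clean-up or lemmatization, and candidates are ranked by raw counts, which gives the same order as ranking by probability within one context.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase and split on whitespace only.
    # The real app also cleans the text and reduces words to lemmas.
    return text.lower().split()


def build_models(text, top_unigrams=10, top_next=5):
    """Return (unigram_top, bigram_dict, trigram_dict) for a sample text."""
    tokens = tokenize(text)

    # Component 1: the most used single words in the sample.
    unigram_top = [w for w, _ in Counter(tokens).most_common(top_unigrams)]

    # Count which word follows each one-word and two-word context.
    bigram_counts = defaultdict(Counter)
    trigram_counts = defaultdict(Counter)
    for i in range(len(tokens) - 1):
        bigram_counts[tokens[i]][tokens[i + 1]] += 1
    for i in range(len(tokens) - 2):
        trigram_counts[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1

    # Component 2: keep only the top candidates for each context.
    def prune(counts):
        return {ctx: [w for w, _ in c.most_common(top_next)]
                for ctx, c in counts.items()}

    return unigram_top, prune(bigram_counts), prune(trigram_counts)


unigram_top, bigram, trigram = build_models("the cat sat on the mat the cat ran")
print(bigram["the"])  # ['cat', 'mat']
```

Storing only a short ranked list per context keeps the dictionaries small, at the cost of dropping rarer continuations.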

Description of the Algorithm

  1. Look up the last two input words in the dictionary and get all possible following words.
  2. If there is no predicted text, display the most used words from the unigram dictionary.
  3. If the input text is more than one word long and there are fewer than 5 predicted words, get the possible following words from the last word as well.
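
The three steps above can be sketched as a small lookup with fallback. This is a hedged illustration in Python, not the app's implementation: the function name `predict`, the dictionary layout (a tuple of two words mapping to a ranked list, and a single word mapping to a ranked list), and the toy entries are all assumptions made for the example.

```python
def predict(text, unigram_top, bigram, trigram, max_words=5):
    """Suggest next words for `text` using hypothetical n-gram dictionaries."""
    words = text.lower().split()
    candidates = []

    # Step 1: look up the last two words in the trigram dictionary.
    if len(words) >= 2:
        candidates.extend(trigram.get((words[-2], words[-1]), []))

    # Step 3 (also the one-word case): if there are too few candidates,
    # add the words that follow the last word alone, skipping duplicates.
    if words and len(candidates) < max_words:
        for w in bigram.get(words[-1], []):
            if w not in candidates:
                candidates.append(w)

    # Step 2: if nothing was found, fall back to the most used words.
    if not candidates:
        return list(unigram_top)
    return candidates[:max_words]


# Toy dictionaries, invented for illustration only.
unigram_top = ["the", "to", "and"]
bigram = {"you": ["are", "can", "for"], "thank": ["you"]}
trigram = {("thank", "you"): ["for", "so", "very"]}

print(predict("Thank you", unigram_top, bigram, trigram))
# ['for', 'so', 'very', 'are', 'can']
```

In this sketch the trigram candidates always rank ahead of the bigram ones, on the assumption that a longer matching context is the more specific evidence.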

Limitations and Future Improvements

  1. The prediction is limited to the sample text. A future version of the app will allow users to supply their own text so they can get a set of predictions suited to their needs.
  2. The prediction is limited to a certain number of words. To keep the dictionary compact, the app only selects the top 5 words by probability. This balances prediction accuracy, dictionary size, and algorithm speed.
  3. Because users may occasionally see invalid words, the app should improve its clean-up code to keep only useful words.