Predictive Text Tool


Typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types *I went to the*, the keyboard presents three options for what the next word might be. For example, the three words might be *gym*, *store*, or *restaurant*.
We have created a predictive text product: a web interface backed by a predictive text model and an algorithm built by analyzing a large corpus of text documents.

Use it

Type or paste some words into the text box (Enter your text) and a prediction will appear below it (Main Prediction).


This is the most likely prediction, but it is normally not the only one. You can inspect the alternatives in the drop-down list (Select next word), which lists the most likely words first, followed by other possible words (if any).

You can select any word in this list and it will be added to your text, followed by a new automatic prediction. This operating mode is intended to resemble the way keyboards on mobile devices present options for what the next word might be.

Settings

  • Choose model: four models are available. ‘Blogs’, ‘News’, and ‘Twitter’ were each built from the corresponding text file provided in the Capstone, while ‘Global’ was built from a mix of all three and aims to be a ‘general’ model.
  • Options:
    • Use interpolation? selects which prediction algorithm to use: backoff with interpolation or simple backoff.
    • Final Sample? when checked, the predicted word is not the most likely one but the result of sampling with frequencies as weights. This option is just for fun, as it lets you get (more) bizarre phrases (a small sketch of this behaviour follows the list).
    • Detailed results? when checked, the detailed results table (Frequency List) is shown.
  • Max. Predictions: choose the maximum number of possible next words to show.
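
To illustrate the ‘Final Sample?’ option, here is a minimal Python sketch of frequency-weighted sampling; the candidate words and counts are invented for the example and are not taken from the app’s models.

```python
import random

# Hypothetical candidate words with their observed frequencies (illustrative only).
candidates = {"gym": 12, "store": 9, "restaurant": 4, "moon": 1}

# Default behaviour: always return the single most frequent candidate.
most_likely = max(candidates, key=candidates.get)

# "Final Sample?" behaviour: draw one candidate at random, using the
# frequencies as weights, so unlikely (bizarre) words occasionally appear.
sampled = random.choices(list(candidates), weights=list(candidates.values()), k=1)[0]

print("most likely:", most_likely)
print("sampled:    ", sampled)
```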

The algorithm I


  • N-gram: the simplest model that assigns probabilities to sentences and sequences of words.
    • An N-gram is a sequence of N words.
    • The N-gram probability is estimated by dividing the observed frequency of a particular sequence (the N-gram) by the observed frequency of its prefix (the corresponding (N-1)-gram); see the worked example after this list.
  • We have to deal with words we haven’t seen before, which we’ll call unknown words, or out-of-vocabulary (OOV) words. An open vocabulary system is one in which we model these potential unknown words in the test set by adding a pseudo-word, conventionally written `<UNK>`.
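
Concretely, for a trigram model (N = 3) the estimate is

$$P(w_n \mid w_{n-2}\,w_{n-1}) = \frac{C(w_{n-2}\,w_{n-1}\,w_n)}{C(w_{n-2}\,w_{n-1})}$$

For example, if *went to the* occurred 20 times in the training corpus and *went to the store* occurred 5 times, the estimated probability of *store* after *went to the* would be 5/20 = 0.25 (the counts here are made up for illustration).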

The algorithm II


  • To train the probabilities of the unknown word `<UNK>`, we choose a fixed vocabulary in advance (a Python sketch of this normalization step follows the list):
    1. Choose a vocabulary (word list) that is fixed in advance.
    2. Convert any word in the training set that is not in this vocabulary (any OOV word) to the unknown word token `<UNK>` in a text normalization step.
    3. Estimate the probabilities for `<UNK>` from its counts just like for any other regular word in the training set.
  • In backoff, we use the trigram if the evidence is sufficient, otherwise we use the bigram, otherwise the unigram. In other words, we only back off to a lower-order N-gram if we have zero evidence for a higher-order N-gram.
  • By contrast, in interpolation, we always mix the probability estimates from all the N-gram estimators, weighting and combining the trigram, bigram, and unigram counts (a second sketch below contrasts the two strategies).
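
Below is a minimal Python sketch of the normalization step above, assuming an already tokenized training corpus; the frequency cutoff and the `<UNK>` spelling are illustrative choices, not necessarily those used in this app’s models.

```python
from collections import Counter

UNK = "<UNK>"

def build_vocabulary(tokens, min_count=2):
    """Keep only words seen at least min_count times; everything else is OOV."""
    counts = Counter(tokens)
    return {word for word, c in counts.items() if c >= min_count}

def normalize(tokens, vocabulary):
    """Replace every OOV word with the <UNK> pseudo-word (text normalization step)."""
    return [tok if tok in vocabulary else UNK for tok in tokens]

# Illustrative corpus: 'lake' appears only once, so it is mapped to <UNK>.
corpus = ("i went to the store i went to the gym "
          "i went to the store i went to the gym i went to the lake").split()
vocab = build_vocabulary(corpus, min_count=2)
normalized = normalize(corpus, vocab)
print(normalized)
# <UNK> now has counts of its own, so its probabilities are estimated
# exactly like those of any other word in the training set.
```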
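
And a minimal sketch contrasting the two prediction strategies, using a simple ‘stupid backoff’ discount and fixed interpolation weights; the discount factor and the lambda weights are illustrative assumptions, not the values tuned for this app.

```python
from collections import Counter

# Illustrative corpus (assumed already normalized as in the previous sketch).
tokens = "i went to the store i went to the gym i went to the store".split()

uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

def backoff_prob(word, context, alpha=0.4):
    """Use the trigram if its count is non-zero; otherwise back off to the
    bigram, then to the unigram, discounting by alpha at each step."""
    w1, w2 = context
    if tri[(w1, w2, word)] > 0:
        return tri[(w1, w2, word)] / bi[(w1, w2)]
    if bi[(w2, word)] > 0:
        return alpha * bi[(w2, word)] / uni[w2]
    return alpha * alpha * uni[word] / len(tokens)

def interpolated_prob(word, context, lambdas=(0.6, 0.3, 0.1)):
    """Always mix the trigram, bigram, and unigram estimates with fixed weights."""
    w1, w2 = context
    l3, l2, l1 = lambdas
    p3 = tri[(w1, w2, word)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p2 = bi[(w2, word)] / uni[w2] if uni[w2] else 0.0
    p1 = uni[word] / len(tokens)
    return l3 * p3 + l2 * p2 + l1 * p1

print(backoff_prob("store", ("to", "the")))       # uses the trigram directly
print(interpolated_prob("store", ("to", "the")))  # mixes all three estimates
```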