Capstone Presentation: Improving Text Prediction

Mark Bulkeley
2016-10-01

Current State of Text Prediction

Current State of the World

  • Text prediction tools tend to target mobile devices, where users cannot touch-type
  • Approaches so far focus on surfacing likely words, but give the user no sense of which words deserve attention
  • Graphical environments can show the user richer information while speeding reaction times
  • Use a simple and fast model proven by years of research at Google

How Could It Be Better?

  • Provide user insight into the relative likelihood of words
  • Use color bars behind the words to help the user focus on the likely best choice
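One way the color bars could be sized is to scale each suggestion's score against the best suggestion, so the top word always gets a full-width bar. The sketch below is only an illustration: the function name `bar_widths` and the `max_width` parameter are assumptions, not part of the presented application.

```python
def bar_widths(suggestions, max_width=100):
    """Scale each score to a bar width relative to the best suggestion.

    suggestions: list of (word, score) pairs with non-negative scores.
    Returns a list of (word, width) pairs, where the best word gets max_width.
    """
    if not suggestions:
        return []
    best = max(score for _, score in suggestions)
    if best <= 0:
        # No usable signal: draw no bars rather than divide by zero.
        return [(word, 0) for word, _ in suggestions]
    return [(word, round(max_width * score / best)) for word, score in suggestions]


if __name__ == "__main__":
    for word, width in bar_widths([("cat", 0.5), ("mat", 0.25), ("the", 0.1)]):
        print(f"{word:>5} {'#' * (width // 5)}")
```

Scaling against the best score rather than the sum keeps the leading bar visually dominant even when many suggestions share the probability mass.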

Improving Text Prediction Model and Interface

Model Approach

  • Use a “Stupid Backoff” quadgram model
    • If no quadgram match is found, the model backs off to the trigram and multiplies the trigram score by an empirically derived factor of 0.4.
    • Likewise, if no trigram is found, the model looks for a bigram and then a unigram, applying a further factor of 0.4 at each step.
    • Always produces a word prediction, even if it falls back to the most frequent unigrams.
    • Computationally inexpensive, and results approach those of more complex algorithms such as Kneser-Ney smoothing
  • Useful details can be found in this reference: “Large Language Models in Machine Translation”, Google Inc.
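The backoff steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the presented application: the helper names (`build_counts`, `build_followers`, `predict`) are hypothetical, and the backoff is applied per candidate word, so each word keeps the score from the longest context in which it was seen.

```python
from collections import Counter, defaultdict

ALPHA = 0.4    # backoff factor quoted in the slides
MAX_ORDER = 4  # quadgram model


def build_counts(tokens, max_order=MAX_ORDER):
    """Count every n-gram of order 1..max_order in a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def build_followers(counts):
    """Index counts by context: context tuple -> Counter of next words."""
    followers = defaultdict(Counter)
    for gram, c in counts.items():
        followers[gram[:-1]][gram[-1]] = c
    return followers


def predict(context, followers, top_k=5, max_order=MAX_ORDER, alpha=ALPHA):
    """Return up to top_k (word, score) pairs, best first."""
    # Keep at most max_order - 1 words of context.
    context = tuple(context)[-(max_order - 1):] if max_order > 1 else ()
    scores = {}
    penalty = 1.0
    while True:
        nexts = followers.get(context)
        if nexts:
            total = sum(nexts.values())
            for word, c in nexts.items():
                # setdefault keeps the score from the longest matching context.
                scores.setdefault(word, penalty * c / total)
        if not context:
            break
        context = context[1:]  # back off to a shorter context
        penalty *= alpha       # and penalize by another factor of alpha
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    tokens = "the cat sat on the mat the cat ate".split()
    followers = build_followers(build_counts(tokens))
    print(predict(["sat", "on", "the"], followers))
```

Because the empty context always holds the unigram counts, the loop ends with at least the most frequent words as candidates, which matches the "always results in a prediction" property. The scores are relative frequencies scaled by the penalty, not normalized probabilities.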

Interface & Model Details

Interface Approach

Model

  • Build Approach
    • Quad-, tri-, bi- and unigrams were generated from 50% of the sample data, which took less than an hour of processing on a new PC
    • N-grams were pruned to save memory and speed up the return of results
  • Suitable Accuracy
    • Model was tested on a held-out sample; the next word was found in the top 5 suggested words in 28% of phrases, in the top 10 in 35% of phrases, and in the top 100 in 55% of phrases
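The pruning and accuracy steps above might look roughly like the following sketch. The slides do not say how the n-grams were pruned or how phrases were scored, so the count threshold `min_count`, the choice to always keep unigrams, and the `predict_fn` callable are all assumptions made for illustration.

```python
from collections import Counter


def prune(counts, min_count=2):
    """Drop n-grams seen fewer than min_count times.

    counts: Counter mapping n-gram tuples to frequencies.
    Unigrams are always kept (an assumption) so a fallback word exists.
    """
    return Counter({
        gram: c for gram, c in counts.items()
        if len(gram) == 1 or c >= min_count
    })


def top_k_hit_rates(phrases, predict_fn, ks=(5, 10, 100)):
    """Fraction of phrases whose final word is among the top-k suggestions.

    phrases: iterable of token lists; the last token is the word to predict.
    predict_fn: callable(context_tokens, top_k) -> ranked list of words.
    """
    hits = {k: 0 for k in ks}
    total = 0
    for tokens in phrases:
        if len(tokens) < 2:
            continue  # need at least one context word and one target
        *context, target = tokens
        ranked = predict_fn(context, max(ks))
        total += 1
        for k in ks:
            if target in ranked[:k]:
                hits[k] += 1
    return {k: (hits[k] / total if total else 0.0) for k in ks}
```

Reporting several values of k from one ranked list, as the slide does with 5, 10 and 100, shows how much accuracy is gained by offering the user a longer list of suggestions.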

Next Steps

Easy Improvements

  • Allow the user to press Tab followed by the number of the word they want
  • As Google does, begin to suggest words based on the initial letters typed by the user
  • Determine the right balance for users between the number of suggested words and the speed of phrase entry
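The first two improvements could be prototyped on top of any ranked suggestion list. The sketch below is an assumption about how that might work: `filter_by_prefix` and `select_by_number` are hypothetical helper names, and suggestions are assumed to arrive as (word, score) pairs sorted best first.

```python
def filter_by_prefix(ranked, prefix, top_k=5):
    """Keep the highest-ranked suggestions that start with the typed letters.

    ranked: list of (word, score) pairs, best first.
    prefix: the letters the user has typed so far (case-insensitive).
    """
    prefix = prefix.lower()
    matches = [(w, s) for w, s in ranked if w.lower().startswith(prefix)]
    return matches[:top_k]


def select_by_number(suggestions, number):
    """Return the word shown at the 1-based position, or None if out of range."""
    if 1 <= number <= len(suggestions):
        return suggestions[number - 1][0]
    return None


if __name__ == "__main__":
    ranked = [("the", 0.30), ("cat", 0.20), ("chair", 0.10), ("Cup", 0.05)]
    shown = filter_by_prefix(ranked, "c", top_k=2)
    print(shown)                       # suggestions after typing "c"
    print(select_by_number(shown, 2))  # word picked with Tab then 2
```

Filtering an already ranked list preserves the model's ordering, so the number of suggestions shown (`top_k`) can be tuned separately when balancing list length against entry speed.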

Hard Improvements

  • Use dynamic data exchange with the web client to draw on a wider corpus (i.e., increase the number of n-grams available to further improve the likelihood of completing the user's words)
  • Find a color scheme that avoids a distracting user interface but still gives the user immediate visual cues about which words to focus on