Capstone Presentation: Improving Text Prediction

Mark Bulkeley
2016-10-01

Text prediction tools mostly target mobile devices, where users cannot touch-type
Current approaches show the user likely words but give no sense of which words to focus on
Graphical environments can show the user richer information while speeding reaction times
Use a simple, fast model backed by published research at Google

Provide user insight into the relative likelihood of words
Use color bars behind the words to help the user focus on the likely best choice

Use a “Stupid Backoff” quadgram model
- If no matching quadgram is found, the model backs off to the trigram and penalizes the trigram score by an empirically derived factor of 0.4.
- Likewise, if no trigram is found, the model looks for a bigram and then a unigram, multiplying the score by an additional 0.4 at each step.
- The model always returns a word prediction, even if it is based only on the highest-frequency unigrams.
- Computationally inexpensive, and with enough training data its results approach those of more complex algorithms such as Kneser-Ney smoothing
Useful details can be found in this reference: Brants, Popat, Xu, Och and Dean, “Large Language Models in Machine Translation”, Google, Inc. (2007)
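
The backoff rule above can be sketched in a few lines of Python. This is a minimal illustration, not the app's actual code: the function names (`build_counts`, `score`, `predict`), the toy corpus and the dictionary-of-tuples storage are assumptions made for the example. Only the 0.4 factor and the quadgram-to-unigram backoff order come from the description above.

```python
from collections import Counter

ALPHA = 0.4  # backoff penalty quoted on this slide


def build_counts(tokens, max_n=4):
    """Count every n-gram of order 1..max_n, keyed by a tuple of words."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def score(word, context, counts, total_tokens, alpha=ALPHA):
    """Stupid Backoff score of `word` following `context` (a list of words)."""
    ctx = tuple(context)[-3:]  # a quadgram model uses at most 3 context words
    penalty = 1.0
    while ctx:
        seen = counts.get(ctx + (word,), 0)
        if seen > 0:
            # Relative frequency of the n-gram, scaled by the penalty so far.
            return penalty * seen / counts[ctx]
        ctx = ctx[1:]      # drop the oldest context word
        penalty *= alpha   # each backoff step costs another factor of 0.4
    # Unigram fallback: every word in the vocabulary gets a non-zero score.
    return penalty * counts.get((word,), 0) / total_tokens


def predict(context, counts, total_tokens, top_n=5):
    """Return the top_n (word, score) pairs, highest score first."""
    vocab = [gram[0] for gram in counts if len(gram) == 1]
    scored = [(w, score(w, context, counts, total_tokens)) for w in vocab]
    scored.sort(key=lambda ws: (-ws[1], ws[0]))
    return scored[:top_n]


tokens = "the cat sat on the mat the cat ate".split()
counts = build_counts(tokens)
print(predict(["on", "the"], counts, len(tokens), top_n=3))
```

In this toy corpus, "mat" is found as a trigram continuation of "on the" and is not penalized, while "cat" is only found after backing off to the bigram "the cat" and so carries one factor of 0.4. The scores are relative rankings rather than normalized probabilities.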

Enter text and hit submit in a simple Shiny interface
The app returns words sorted by relative likelihood; the bar behind each word provides an immediate visual cue of where to focus.
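
One way to turn model scores into bar lengths is to scale each score against the best candidate, so the top word always gets a full bar. The sketch below assumes that scaling rule; the slides do not say how the bars are sized, and the `bar_widths` helper and its parameters are hypothetical.

```python
def bar_widths(scores, top_n=5, max_width=100):
    """Map {word: score} to (word, bar width) pairs, best word first.

    Widths are percentages of the top score (an assumed scaling rule).
    """
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    if not ranked or ranked[0][1] <= 0:
        return []
    best = ranked[0][1]
    return [(word, round(max_width * s / best)) for word, s in ranked]


print(bar_widths({"mat": 0.5, "cat": 0.25, "sat": 0.1}, top_n=2))
```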

Build Approach
- Quadgrams, trigrams, bigrams and unigrams were generated from 50% of the sample data. This took less than an hour of processing on a new PC
- N-grams were pruned to save on memory and speed return of results
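
The slides do not say how the n-grams were pruned. A common and simple option is a count threshold, sketched below under that assumption; the `prune_ngrams` name and the `min_count` value are hypothetical. Unigrams are kept regardless of count so that backoff can always return a word.

```python
def prune_ngrams(counts, min_count=2):
    """Drop bigrams and longer n-grams seen fewer than min_count times.

    `counts` maps tuples of words to frequencies. Unigrams are always kept.
    """
    return {
        gram: c
        for gram, c in counts.items()
        if len(gram) == 1 or c >= min_count
    }


counts = {
    ("the",): 3,
    ("cat",): 1,
    ("the", "cat"): 2,
    ("cat", "sat"): 1,
}
print(prune_ngrams(counts))
```

Because any surviving n-gram occurs at least as often as its own prefix does, a threshold like this never removes a context count that a kept n-gram still needs.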
Suitable Accuracy
- The model was tested on a held-out sample; the next word was found in the top 5 suggested words in 28% of phrases, in the top 10 in 35% of phrases, and in the top 100 in 55% of phrases
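
A top-k hit rate like the one reported above can be computed as in the following sketch. It assumes each held-out phrase has been split into a context and the true next word, and that `predict` is any function returning a ranked list of words; both names, and the toy predictor in the usage lines, are illustrative and not taken from the project.

```python
def top_k_hit_rate(examples, predict, ks=(5, 10, 100)):
    """Fraction of examples whose true next word is in the top k predictions.

    `examples` is a list of (context, target_word) pairs.
    `predict(context)` must return words ranked from most to least likely.
    """
    hits = {k: 0 for k in ks}
    for context, target in examples:
        ranked = predict(context)
        for k in ks:
            if target in ranked[:k]:
                hits[k] += 1
    n = len(examples)
    return {k: (hits[k] / n if n else 0.0) for k in ks}


# Toy predictor that ignores the context and always returns the same ranking.
examples = [(["a"], "x"), (["b"], "y"), (["c"], "z"), (["d"], "q")]
print(top_k_hit_rate(examples, lambda ctx: ["x", "y", "z"], ks=(1, 2, 3)))
```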

Allow the user to hit Tab followed by the number of the word they want
Like Google, begin to suggest words based on the initial letters typed by the user
Determine the right balance for users between the number of suggested words and speed of phrase entry
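
Suggesting words from the initial letters could be layered on top of the existing ranking by filtering it with the typed prefix. The sketch below shows one possible shape for that step; `filter_by_prefix` is a hypothetical helper, not part of the current app.

```python
def filter_by_prefix(ranked_words, typed_prefix, top_n=5):
    """Keep the highest-ranked words that start with the typed letters.

    `ranked_words` is assumed to be sorted from most to least likely already.
    An empty prefix leaves the ranking unchanged apart from the top_n cut.
    """
    prefix = typed_prefix.lower()
    matches = [w for w in ranked_words if w.lower().startswith(prefix)]
    return matches[:top_n]


print(filter_by_prefix(["the", "that", "cat", "this"], "th", top_n=2))
```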

Use dynamic data exchange with the web client to draw on a wider corpus (i.e., increase the number of n-grams available to further improve the likelihood of completing the user's words)
Find a color scheme that won't distract the user but will still give immediate visual cues about which words to focus on