Word Prediction Project: Final App Overview

Keith Wheeles
April 4, 2016

Word Prediction App: Background

March 7, 2016: An assignment to develop a word prediction app was provided along with three data files. Together these files contain over 70 million words, which serve as the training and validation data for the app. Preprocessing code written in Python was used to build a dictionary of words and to count bigrams (two words appearing together) and trigrams (three words appearing together). The resulting tables of words and counts were further processed in R to assemble the final tables, optimized for use by the Shiny app. The app uses these counts to predict the next word that the user may type.
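
As a rough illustration of the counting step, here is a minimal Python sketch. It is not the project's actual preprocessor: the function name `count_ngrams`, the whitespace tokenization, and the sample sentence are assumptions made for this example.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (keys are n-tuples)."""
    # zip over n staggered views of the token list yields each n-gram once
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Hypothetical sample text; the real corpus holds over 70 million words.
tokens = "the cat sat on the cat mat".split()

unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)
trigrams = count_ngrams(tokens, 3)

# Most frequent bigram first, as a ranked table would store it
print(bigrams.most_common(1))
```

Ranking each table by decreasing count, as `most_common` does here, is the ordering the prediction step relies on.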

App Description and Instructions

Resulting app:

  • intent: intuitive to operate, similar to familiar mobile phone keyboards
  • mimics the lower case Swiftkey keyboard with a limited set of keys (thus limiting state transitions)
  • requires users to press on-screen “keys” rather than type directly on the computer keyboard
  • up to three possible word completions appear at the top of the screen and can be pressed to complete the word; these are updated as the user enters each new character
  • tracking metrics are provided at the bottom of the screen
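
One way the per-keystroke update of completions could work is to filter an already ranked candidate list by the characters typed so far. The sketch below assumes that approach; the source does not state how the app narrows its suggestions, and `completions` and the sample list are hypothetical names and data.

```python
def completions(ranked_candidates, typed_prefix, k=3):
    """Return the first k candidates that start with the typed characters.

    ranked_candidates is assumed to be sorted by decreasing frequency,
    so the order of the result preserves that ranking.
    """
    matches = [w for w in ranked_candidates if w.startswith(typed_prefix)]
    return matches[:k]

# Hypothetical ranked candidates for the next word
ranked = ["the", "to", "that", "this", "and", "time"]

print(completions(ranked, ""))    # nothing typed yet: top three overall
print(completions(ranked, "t"))   # after the user presses "t"
print(completions(ranked, "th"))  # after the user presses "h"
```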

Word Prediction Algorithm

Back-off model using trigram information if available, stepping back to bigram and then unigram if necessary

  • Initialize possibilities data frame (possdf) as empty
  • Add any applicable trigrams - look up using hash of last two words, ranked by decreasing frequency of appearance (4.5 million trigrams involving 950 thousand different combinations of the first two words are stored)
  • Add any applicable bigrams - look up using numeric key for last word, ranked by decreasing frequency of appearance (2.5 million bigrams involving 31 thousand different first words are stored)
  • If fewer than three entries are in possdf, add the “words” table to possdf, ranked by decreasing frequency of appearance (31 thousand words) [duplicates should be removed; not yet implemented]
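
The back-off steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the app's R code: the function `predict`, the dictionary-based tables, and the sample data are hypothetical. Unlike the prototype, this sketch also skips duplicates as it goes.

```python
def predict(last_words, trigrams, bigrams, words, k=3):
    """Return up to k candidates, backing off trigram -> bigram -> unigram.

    trigrams maps (w1, w2) to next words, bigrams maps w to next words,
    and words is the unigram list; each is ranked by decreasing frequency.
    """
    poss = []  # plays the role of possdf

    def extend(candidates):
        for w in candidates:
            if w not in poss:  # assumption: drop duplicates here
                poss.append(w)

    if len(last_words) >= 2:
        extend(trigrams.get(tuple(last_words[-2:]), []))
    if last_words:
        extend(bigrams.get(last_words[-1], []))
    if len(poss) < k:
        extend(words)  # fall back to the most frequent words overall
    return poss[:k]

# Hypothetical miniature tables
trigrams = {("i", "am"): ["not"]}
bigrams = {"am": ["not", "a"]}
words = ["the", "to", "and"]

print(predict(["i", "am"], trigrams, bigrams, words))
print(predict(["zzz"], trigrams, bigrams, words))
```

In the first call the trigram table supplies one candidate, the bigram table a second, and the word list fills the third slot. In the second call neither n-gram table matches, so all three suggestions come from the word list.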

User Experience of App

App “keyboard” implemented for:

  • “look and feel” - familiar interface
  • well defined state transitions (taking each event directly)
  • human engineering - slows typing down slightly, which masks some of the algorithm delay

The app is a full “proof of concept” prototype. Known limitations that further refinement could address:

  • duplicate suggestions or fewer than three suggestions appear under certain known and understood circumstances
  • no backspace key (the necessary state transition actions are not implemented)
  • limited to lower case and apostrophe keys
  • offensive words exist in the corpus text and may appear in suggestions

App Novelty

  • Preprocessor that organizes unigram, bigram and trigram counts was written in Python, which cut its development time to days
  • Integer representation of words is memory-efficient and faster, allowing use of a large number of trigrams and bigrams
  • Custom “keyboard” implemented to more closely mimic Swiftkey - including update of top three predictions with each user keystroke
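
To illustrate the integer representation, the Python sketch below assigns each word a numeric id and packs the ids of two words into a single integer key. The packing formula is an assumption chosen for this example; the source does not describe the exact hash used for the last two words, and `word_id`, `bigram_key`, and `trigram_key` are hypothetical names.

```python
# Hypothetical five-word vocabulary; the app stores about 31 thousand words.
vocab = ["the", "cat", "sat", "on", "mat"]
word_id = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

def bigram_key(last_word):
    """Numeric key for the last word."""
    return word_id[last_word]

def trigram_key(w1, w2):
    """Pack two word ids into one integer (assumed scheme: id1 * V + id2)."""
    return word_id[w1] * V + word_id[w2]

def decode(key):
    """Recover the two words from a packed trigram key."""
    return vocab[key // V], vocab[key % V]

key = trigram_key("cat", "sat")
print(key, decode(key))
```

Storing and comparing one integer per lookup instead of one or two strings is what keeps tables of millions of n-grams compact and quick to search.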

I hope you enjoyed the app and appreciate your time in reviewing it!