Using NLP to Read Minds

Jeff Spoelstra
10/07/2016

Project Goals & Model Design

  • Project Goals:

    • The goal is to create a model for look-ahead prediction/suggestion of a likely next word given a sequence of words entered by a user.
    • The target environment for running the model is a hand-held device such as a smartphone or tablet.
    • The model must be fast and use as little memory as possible while still being highly accurate.
  • Model Design:

    • The model is a combination of 2-gram, 3-gram, and 4-gram frequency lists along with R script code to determine the correct n-gram list to use to make a prediction.
    • For each n-gram size, the 10 most probable next words (in order of frequency of occurrence in the training text) were saved in the model.
    • 256,000 lines of data (ranging from a few words to several sentences each) were taken from samples of online news story text and blog post text to build the n-gram lists (i.e., to train the model).
    • Twitter text was not used because the grammar and vocabulary differed radically from the news/blog text.
    • Total size of all the n-gram data used by the model: 32,017,752 bytes.
    • The runtime version of the model does not require any R packages.
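
The design above can be illustrated with a small sketch. The model's actual R scripts are not shown in these slides, so this is a Python illustration under stated assumptions: the rule for choosing an n-gram list is assumed to be a simple back-off (try the 4-gram list first, then the 3-gram list, then the 2-gram list), and the names `build_tables` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict

TOP_K = 10  # the slides keep the 10 most frequent next words per n-gram size


def build_tables(lines, max_n=4, top_k=TOP_K):
    """Return {n: {context of n-1 words: next words, most frequent first}}."""
    counts = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for line in lines:
        words = line.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                counts[n][context][words[i + n - 1]] += 1
    return {
        n: {ctx: [w for w, _ in c.most_common(top_k)] for ctx, c in table.items()}
        for n, table in counts.items()
    }


def predict(tables, fragment):
    """Assumed back-off rule: longest matching context wins."""
    words = fragment.lower().split()
    for n in sorted(tables, reverse=True):
        context = tuple(words[-(n - 1):])
        # Skip this n-gram size if the fragment is too short for its context.
        if len(context) == n - 1 and context in tables[n]:
            return tables[n][context]
    return []  # no n-gram list matched the fragment


# Toy training text, made up for illustration only.
training = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat ran away",
]
tables = build_tables(training)
print(predict(tables, "the cat"))    # matched in the 3-gram list
print(predict(tables, "a dog sat"))  # backs off to the 2-gram list
```

The first element of the returned list plays the role of the prediction and the remaining elements the alternative suggestions. A real model would also need tokenization and cleanup steps that this sketch omits.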

Model Performance

  • Results compiled from a test run using 250,000 lines of blog post text:
    • 13.6% exact predictions - True Next Word (TNW) was the predicted word.
    • 16.0% close predictions - TNW was one of 9 suggested alternative words.
    • 50.2% of the close predictions suggested the TNW as one of the top 2 alternative words.
    • 48.0% of the time the model recognized predictable sequences of 2 or more words.
    • 4 predictions/second average rate.
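
For readers who want to reproduce this kind of tally, here is a minimal Python sketch of how exact and close rates could be counted. It is not the test harness used for the results above; the function name `score` and the input layout are assumptions. If the exact and close categories are disjoint, as the definitions suggest, the TNW appeared somewhere among the 10 suggestions in roughly 29.6% of cases.

```python
def score(results):
    """results: iterable of (suggestions, true_next_word) pairs.

    suggestions[0] is treated as the prediction and the rest as the
    alternative suggestions. Returns (exact_rate, close_rate) as
    fractions of all test cases.
    """
    exact = close = total = 0
    for suggestions, truth in results:
        total += 1
        if suggestions and suggestions[0] == truth:
            exact += 1  # TNW was the predicted word
        elif truth in suggestions[1:]:
            close += 1  # TNW was one of the alternatives
    return (exact / total, close / total) if total else (0.0, 0.0)


# Hypothetical test cases, made up for illustration only.
sample = [
    (["the", "a", "to"], "the"),  # exact
    (["the", "a", "to"], "to"),   # close
    (["the", "a", "to"], "dog"),  # miss
    ([], "cat"),                  # no prediction available
]
exact_rate, close_rate = score(sample)
print(exact_rate, close_rate)
```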

Recommendations

  • Improve prediction accuracy by creating separate models for each text source (news text, blog text, Twitter text, etc.) to overcome the grammar and vocabulary differences between text sources.
  • Separate models may each require less memory for data as well.
  • Use a larger list of probable next words in the model and use character-by-character lookup as the user types to drill down to the most probable next word.
  • Optimize the model algorithm and possibly re-code it in C to make it faster.
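
The character-by-character lookup recommended above could work along these lines. This is a hedged Python sketch of the idea, not part of the current model; `narrow_by_prefix` and its `limit` parameter are hypothetical names, and the candidate list is assumed to be ordered most probable first.

```python
def narrow_by_prefix(candidates, typed_prefix, limit=3):
    """Keep the most probable candidates that start with what was typed.

    candidates must already be ordered from most to least probable;
    the filter preserves that order.
    """
    prefix = typed_prefix.lower()
    matches = [w for w in candidates if w.lower().startswith(prefix)]
    return matches[:limit]


# Hypothetical ranked next-word list, made up for illustration only.
ranked = ["the", "to", "that", "a", "this", "time"]
for typed in ["", "t", "th", "ti"]:
    print(repr(typed), narrow_by_prefix(ranked, typed))
```

Each additional character shrinks the candidate set, so a larger stored list raises the chance that the intended word is still available after a few keystrokes.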

Using the Sample App

  • Full disclosure: the app doesn't really read minds. It's a cute context for showing the model in action.
  • App URL: https://jeffspoelstra.shinyapps.io/PredictText/
  • Enter a word sequence in the Sentence Fragment box and click on the Read My Mind button.
  • The model's prediction and alternative suggestions will appear below the button.

[Image: sample app]