Next Word Prediction in R

Neil Kutty
02/09/2018

Next Word Prediction App

The purpose of this app is to predict the next word a user may want to type, based on the words they have already typed.

It is powered by text data from Twitter, blogs, and news. The app predicts by searching for the user's text input within datasets of phrases extracted from this text data.

It is designed to be reactive and fast, so it could be applied to many modern uses. For example, this technology could enable autocomplete on website forms that accept user text input, such as search engine input boxes.

The logic of the core prediction function is based on Katz's back-off algorithm, a probability-based NLP prediction method.

Under The Hood

  • The text data is sampled and cleaned by removing profanity, numbers, special characters, punctuation, and extra whitespace, and by replacing contractions and abbreviations. It is then tokenized into 5-, 4-, 3-, 2-, and 1-grams.
  • For each gram-length, the frequency of each phrase is calculated. A dataframe of phrases and their frequencies is then created for each length.
  • These dataframes are used by the prediction function to determine the most probable next word the user may intend to type.
  • When the function can’t find a match among phrases one word longer than the input, it backs off: it shortens the input by one word and searches the dataframe of phrases one gram-length shorter.
  • If no match can be found in any of the dataframes, a sample of the most frequent unigrams is returned.
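
The steps above can be sketched in base R. This is a minimal illustration under stated assumptions, not the app's actual code: the names clean_text, build_ngrams, and predict_next and the toy corpus are hypothetical, the cleaning step skips profanity filtering and contraction replacement, the back-off step is assumed to drop the earliest word of the input, and the final fallback returns the top unigrams rather than a sample of them.

```r
# Hypothetical helper: lowercase, keep letters only, collapse whitespace.
# (Profanity filtering and contraction replacement are omitted here.)
clean_text <- function(x) {
  x <- gsub("[^a-z ]", " ", tolower(x))
  trimws(gsub("\\s+", " ", x))
}

# Hypothetical helper: count every n-word phrase and return a dataframe
# of phrases and frequencies, most frequent first.
build_ngrams <- function(sentences, n) {
  tokens <- strsplit(clean_text(sentences), " ", fixed = TRUE)
  grams <- as.character(unlist(lapply(tokens, function(words) {
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })))
  counts <- table(grams)
  df <- data.frame(phrase = as.character(names(counts)),
                   freq = as.integer(counts),
                   stringsAsFactors = FALSE)
  df[order(-df$freq, df$phrase), , drop = FALSE]
}

# Hypothetical back-off lookup: tables[[n]] holds the n-gram dataframe.
predict_next <- function(input, tables, top_unigrams = 3) {
  words <- strsplit(clean_text(input), " ", fixed = TRUE)[[1]]
  words <- tail(words, length(tables) - 1)  # longest usable context
  while (length(words) > 0) {
    tbl <- tables[[length(words) + 1]]      # phrases one word longer
    prefix <- paste0(paste(words, collapse = " "), " ")
    hits <- tbl[startsWith(tbl$phrase, prefix), , drop = FALSE]
    if (nrow(hits) > 0) {
      best <- hits$phrase[which.max(hits$freq)]
      return(sub(".* ", "", best))          # last word of the best phrase
    }
    words <- words[-1]                      # assumed: drop the earliest word
  }
  head(tables[[1]]$phrase, top_unigrams)    # fallback: frequent unigrams
}

# Toy corpus, for illustration only.
corpus <- c("Thanks for the follow!", "thanks for the support",
            "Thanks for the follow", "Looking forward to the weekend.",
            "Happy to see the follow :)")
tables <- lapply(1:5, function(n) build_ngrams(corpus, n))

predict_next("thanks for the", tables)  # "follow"
predict_next("please see the", tables)  # backs off to "see the": "follow"
```

In the second call no 4-gram starts with "please see the", so the lookup retries with "see the" against the trigram dataframe and finds "see the follow".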

How it Works

This flow chart describes how the prediction function works. [Figure: flow chart of the prediction function]

What Makes This App Great

  • It’s Reactive

    No need to click a submit button: the app reactively recalculates each time the input text box is updated.

  • It’s Fast

    The backend data is trimmed to exclude sparse phrases where possible, improving load time and calculation speed.

  • It’s Barebones

    The main predictNext function is written in base R. So while generating the n-gram dataframes requires use of R libraries, the actual brain making the predictions requires none.
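
As a rough illustration of the trimming mentioned under "It’s Fast", sparse phrases can be dropped from an n-gram dataframe with a simple frequency cutoff. This is a sketch under assumptions: the prune_ngrams name, the min_freq threshold, and the sample counts are hypothetical, and the source does not say which cutoff the app uses.

```r
# Hypothetical helper: keep only phrases seen at least min_freq times.
prune_ngrams <- function(ngram_df, min_freq = 2) {
  kept <- ngram_df[ngram_df$freq >= min_freq, , drop = FALSE]
  rownames(kept) <- NULL
  kept
}

# Made-up trigram counts, for illustration only.
trigram_counts <- data.frame(
  phrase = c("thanks for the", "for the follow", "see the follow",
             "to the weekend"),
  freq = c(3L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)

pruned <- prune_ngrams(trigram_counts, min_freq = 2)
nrow(pruned)  # 2
```

A smaller dataframe means less data to load at startup and fewer rows to scan on each lookup, at the cost of losing predictions for rare phrases.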

Instructions & Links

  • How to use the Next Word Prediction app:
    • Click the link here or below and wait for the app to load.
    • The app is ready when text appears below the text input box.
    • Type text into the text box and see the predicted next word appear below.
  • The text data used in the prediction app is provided by SwiftKey. It contains text from Twitter, news articles, and blogs.
  • Links