Next Word Prediction App

Yoav Pridor
February 2018


Data Science Specialization

JHU via Coursera.org

In a nutshell

People all over the world type in English. Anticipating their next word saves time and can also help with spelling :)
Objective: Predict the next word, in context, given some input text.
Method:
  • Initial data from news, blogs and Twitter.
  • All texts were cleaned (punctuation and profanity removed).
  • n-gram tables (lists of tokens and their frequencies) were derived from the corpora using the R package quanteda.
  • The n-gram tables were trimmed to cover 90% of tokens, for app performance.
  • A prediction model (function) was created, based on the Katz backoff algorithm.
  • A Shiny app was created that loads the n-gram tables, takes text input, and offers a next-word prediction.
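
As a rough illustration of the table-building and trimming steps in the method above, here is a minimal Python sketch. It is not the app's code (the app uses quanteda in R); the function names `ngram_table` and `trim_to_coverage` are hypothetical, and reading "90% of tokens" as "90% of all n-gram occurrences" is an assumption.

```python
from collections import Counter

def ngram_table(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def trim_to_coverage(table, coverage=0.9):
    """Keep the most frequent n-grams until they account for the given
    share of all occurrences (assumed meaning of the 90% cut-off)."""
    total = sum(table.values())
    kept, running = {}, 0
    for gram, count in table.most_common():
        if running >= coverage * total:
            break
        kept[gram] = count
        running += count
    return kept

# Toy corpus, already cleaned and tokenized.
tokens = "the cat sat on the mat and the cat ran".split()
bigrams = ngram_table(tokens, 2)
trimmed = trim_to_coverage(bigrams, coverage=0.5)
```

Dropping the rare tail shrinks the tables considerably, at the cost of losing predictions for unusual phrases.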
This is how it works

  • Type any text into the input window
  • Click “Submit”
  • The app will return up to 5 probable next words (in descending probability order)

Under the hood

These are the main stages in the prediction process:

    1. Load 6 data tables of n-grams (6-word, 5-word, 4-word, 3-word, 2-word and 1-word), each with frequencies.
    2. Get the user input (any number of words).
    3. If the input contains more than 5 words, keep only the last 5.
    4. Run the prediction function, a form of the Stupid Backoff algorithm:
      • For an input of n words, search the (n+1)-gram table for entries that start with the input.
      • If none are found, drop the first word and search the next lower-order n-gram table.
    5. When matches are found, return up to 5 of the most frequent matches.
    6. If no matches are found, return the 5 most frequent words in the unigram table.
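
The stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the app's R function: `predict_next` and the `tables` layout (a dict mapping n to a dict of n-gram tuples and frequencies) are hypothetical. Following the steps as listed, candidates are ranked by raw frequency, with no discounting or backoff weights.

```python
def predict_next(text, tables, max_order=6, top_k=5):
    """Back off from the longest usable context to shorter ones.

    tables: {n: {n-gram tuple: frequency}} (hypothetical layout).
    """
    # Keep at most the last (max_order - 1) words of the input.
    words = text.lower().split()[-(max_order - 1):]
    while words:
        table = tables.get(len(words) + 1, {})
        prefix = tuple(words)
        # Candidate next words: last token of every n-gram starting with the input.
        matches = {g[-1]: c for g, c in table.items() if g[:-1] == prefix}
        if matches:
            return sorted(matches, key=matches.get, reverse=True)[:top_k]
        words = words[1:]  # drop the first word and try the next lower order
    # Nothing matched: fall back to the most frequent unigrams.
    unigrams = tables.get(1, {})
    return [g[0] for g in sorted(unigrams, key=unigrams.get, reverse=True)[:top_k]]

# Toy tables for illustration only.
tables = {
    1: {("the",): 5, ("cat",): 3, ("sat",): 1},
    2: {("the", "cat"): 2, ("the", "mat"): 1, ("cat", "sat"): 1},
    3: {("the", "cat", "sat"): 1},
}
```

The linear scan over each table keeps the sketch short; a real app would more likely index each table by its prefix so a lookup does not touch every row.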

Next steps and possible improvements

    1. Possibly connect the algorithm to external dictionaries to improve prediction accuracy.
    2. Tackle additional languages.
    3. Optimize response time.