Next Word Prediction App

Yoav Pridor
February 2018

Data Science Specialization

JHU via Coursera.org

In a nutshell

People all over the world type in English. Anticipating their next word saves time and well… can also help with spelling :)

Objective: Predict the next word, in context, given some input text.
Method:
Initial data from news, blogs and Twitter.
All texts were cleaned (punctuation and profanity removed).
Deriving n-gram tables (lists of tokens and their frequencies) from the corpora, using the R package quanteda.
Trimming the n-gram tables to cover 90% of tokens, for app performance.
Creation of a prediction model (function) based on the Katz Backoff algorithm.
Creation of a Shiny app that loads the n-gram tables, takes text input, and offers a next-word prediction.
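
The table-building and trimming steps can be sketched as follows. This is a minimal illustration in Python, not the app's R/quanteda code. The function names, the tiny `PROFANITY` placeholder list, and the reading of "90% of tokens" as "the most frequent n-grams that together account for 90% of all occurrences" are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical placeholder; a real profanity list would be much longer.
PROFANITY = {"darn"}

def clean(text):
    """Lowercase the text, strip punctuation, and drop listed profanity."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in PROFANITY]

def ngram_table(docs, n):
    """Count every run of n consecutive tokens across all documents."""
    counts = Counter()
    for doc in docs:
        tokens = clean(doc)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def trim_to_coverage(counts, coverage=0.9):
    """Keep the most frequent n-grams until they account for `coverage`
    of all n-gram occurrences (assumed meaning of the 90% cut)."""
    total = sum(counts.values())
    kept, running = {}, 0
    for gram, freq in counts.most_common():
        if running >= coverage * total:
            break
        kept[gram] = freq
        running += freq
    return kept

docs = ["The cat sat.", "The cat ran!", "The dog sat."]
unigrams = ngram_table(docs, 1)
bigrams = ngram_table(docs, 2)
small_unigrams = trim_to_coverage(unigrams, coverage=0.5)
```

Trimming the long tail of rare n-grams shrinks the tables the app has to load, at the cost of never predicting the rarest continuations.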

This is how it works

Type any text into the input window

Click “Submit”

The app will return up to 5 probable next words (in descending probability order)

Under the hood:

These are the main stages in the prediction process:

Load 6 data tables of n-grams (6-word, 5-word, 4-word, 3-word, 2-word and 1-word), each with its frequencies.
Get the user input (any number of words).
If the input contains more than 5 words, keep only the last 5.
Run the prediction function, a form of the Stupid Backoff algorithm:

For an input of n words, search the (n+1)-gram table for entries that start with the input.
If none are found, drop the first word and search the next smaller n-gram table.

When matches are found, return up to 5 of the most frequent ones.
If no matches are found at any level, return the 5 most frequent words in the unigram table.
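
The backoff search described above can be sketched as follows. This is a minimal, frequency-only illustration in Python, not the app's R code. The `predict` function, the `tables` layout (a dict mapping n to a `Counter` of n-word tuples), and the constant `MAX_CONTEXT` are assumptions made for the example. It returns matches from the first table that has any, and does not rescale lower-order scores.

```python
from collections import Counter

# The largest table holds 6-grams: 5 context words plus the predicted word.
MAX_CONTEXT = 5

def predict(text, tables, top_k=5):
    """Return up to top_k candidate next words for the input text.

    tables: dict mapping n -> Counter of n-word tuples (hypothetical layout).
    """
    context = text.lower().split()[-MAX_CONTEXT:]
    while context:
        n = len(context) + 1
        prefix = tuple(context)
        # Collect the last word of every n-gram that starts with the context.
        matches = Counter({
            gram[-1]: freq
            for gram, freq in tables.get(n, {}).items()
            if gram[:-1] == prefix
        })
        if matches:
            return [word for word, _ in matches.most_common(top_k)]
        context = context[1:]  # back off: drop the first word and retry
    # Nothing matched at any level: fall back to the most frequent unigrams.
    return [gram[0] for gram, _ in tables[1].most_common(top_k)]

tables = {
    1: Counter({("the",): 5, ("cat",): 3, ("sat",): 2}),
    2: Counter({("the", "cat"): 3, ("the", "dog"): 1, ("cat", "sat"): 2}),
    3: Counter({("the", "cat", "sat"): 2}),
}
print(predict("the cat", tables))
print(predict("a the", tables))
print(predict("zzz", tables))
```

The linear scan over each table keeps the sketch short. A table keyed by its prefix would avoid scanning every entry on each lookup.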

Next steps and possible improvements

Possibly connect the algorithm to external dictionaries to improve prediction accuracy.
Support additional languages.
Optimize response time.
Link to My App. Let me know what you think!