Next Word Prediction

Nate Foulkes
February 2022

Capstone project for the Data Science Specialization hosted by Coursera

Administered by: Roger Peng, Jeff Leek, and Brian Caffo of Johns Hopkins

The Data Clean-up Process:

  • Students were provided with a zip file containing samples of news articles, blog posts, and Tweets.
  • Using the read.table() function, samples of each of the English data sets were read into R.
  • Using the Tidyverse packages, the text was tokenized twice: first into sentences for initial clean-up, then into words to build the n-grams.
  • A library of acceptable words was built from the AFINN lexicon in tidytext. This approach did a great job of language detection and profanity removal, and was much faster than creating a list of acceptable words with regular expressions.
  • Libraries of 5-grams, 4-grams, 3-grams, 2-grams, and 1-grams were created from the cleaned data and used as the prediction model with the Katz Back-Off algorithm.
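
The tokenize-then-count steps above can be sketched roughly as follows. This is a minimal illustration in base R rather than the Tidyverse/tidytext pipeline the project used; the sample text and the helper names (clean_words, ngrams, count_ngrams) are assumptions made for the example, and the lexicon-based word filter is not reproduced here.

```r
# Hypothetical sample text standing in for the news, blog, and Twitter samples.
lines <- c("I love data science and I love R.",
           "I love data analysis!")

# Pass 1: split each line into sentences at ., ! or ?.
sentences <- unlist(strsplit(lines, "[.!?]+\\s*"))

# Pass 2: lower-case a sentence, drop everything except letters,
# apostrophes, and spaces, then split it into words.
clean_words <- function(sentence) {
  s <- gsub("[^a-z' ]", " ", tolower(sentence))
  words <- strsplit(trimws(s), "\\s+")[[1]]
  words[nzchar(words)]
}

# Build every n-gram within one sentence (n-grams never cross sentences).
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams of order n over all sentences, most frequent first.
count_ngrams <- function(sentences, n) {
  grams <- unlist(lapply(sentences, function(s) ngrams(clean_words(s), n)))
  sort(table(grams), decreasing = TRUE)
}

# One frequency table per order, from 1-grams up to 5-grams.
ngram_counts <- lapply(1:5, function(n) count_ngrams(sentences, n))

print(head(ngram_counts[[2]], 3))
```

Tokenizing into sentences first keeps n-grams from spanning a sentence boundary, which is one reason to do the clean-up in two passes.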

Katz Back-Off Algorithm

Katz Back-Off

  • A generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram.
  • It accomplishes this estimation by backing off through progressively shorter history models under certain conditions.
  • Disclaimer: this is not a strict Katz Back-Off. Instead of computing Good-Turing discounts, the discounted count of each observed n-gram was assumed to be its frequency minus 0.5.
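
As a rough illustration of backing off with a fixed 0.5 discount, the sketch below handles only the bigram-to-unigram step. The toy counts and the function name katz_bigram are hypothetical, and the project's model runs from 5-grams down, so this should be read as a simplified instance of the idea rather than the app's implementation.

```r
# Hypothetical toy counts; a real model would use the n-gram libraries.
unigrams <- c(at = 3, the = 5, end = 3, beach = 2, time = 1)
bigrams  <- c("at the" = 3, "the end" = 2, "the beach" = 1)

# Probability of every candidate next word given one previous word.
# Observed bigrams get (count - d) / history count; the probability mass
# freed by the discount is shared among unseen words in proportion to
# their unigram probability.
katz_bigram <- function(prev, unigrams, bigrams, d = 0.5) {
  p_uni  <- unigrams / sum(unigrams)
  prefix <- paste0(prev, " ")
  seen   <- bigrams[startsWith(names(bigrams), prefix)]

  # History never observed: back off fully to the unigram model.
  if (length(seen) == 0) return(sort(p_uni, decreasing = TRUE))

  seen_words <- substring(names(seen), nchar(prefix) + 1)
  # Assumption: the history count is the sum of its observed bigram counts.
  history_count <- sum(seen)

  p_seen <- (seen - d) / history_count
  names(p_seen) <- seen_words

  alpha    <- 1 - sum(p_seen)  # mass left over after discounting
  unseen   <- p_uni[!(names(p_uni) %in% seen_words)]
  p_unseen <- alpha * unseen / sum(unseen)

  sort(c(p_seen, p_unseen), decreasing = TRUE)
}

p <- katz_bigram("the", unigrams, bigrams)
print(head(p, 3))  # top three candidates: end, the, beach
```

In this toy example "end" gets (2 - 0.5) / 3 = 0.5 and "beach" gets (1 - 0.5) / 3, leaving one third of the probability mass for words never seen after "the".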

Quantitative Performance

Time to predict the last word of a few random sentences (times in seconds):

  • 1. “Go on a romantic date at the”
  • 2. “Hey sunshine, can you follow me and make me the”
  • 3. “Ohhhhh #PointBreak is on tomorrow. Love that film and haven't seen it in quite some”
  • 4. “The guy in front of me just bought a pound of bacon, a bouquet, and a case of”
      word user.self sys.self elapsed
1      end      0.20        0    0.22
2 sandwich      0.20        0    0.20
3     time      0.17        0    0.17
4      the      0.18        0    0.18
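
A table with these columns can be produced with base R's system.time(), as in the sketch below. The predict_next function here is a hypothetical placeholder, not the project's prediction function, so the words and timings it yields are not the ones reported above.

```r
# Hypothetical stand-in for the prediction function; it simply echoes
# the last word of the phrase so the timing harness has something to run.
predict_next <- function(phrase) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  words[length(words)]
}

phrases <- c("Go on a romantic date at the",
             "Hey sunshine, can you follow me and make me the")

# Time each prediction and collect one row per phrase.
timings <- do.call(rbind, lapply(phrases, function(p) {
  t <- system.time(word <- predict_next(p))
  data.frame(word      = word,
             user.self = t[["user.self"]],  # CPU time spent in R code
             sys.self  = t[["sys.self"]],   # CPU time spent in system calls
             elapsed   = t[["elapsed"]])    # wall-clock time
}))

print(timings)
```

The elapsed column is usually the most relevant one for an interactive app, since it reflects the wait the user actually experiences.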

Predictive App

Directions

The app opens in a static state. The prediction algorithm will not run unless the user taps the predict button.

  • Enter a partial phrase of any number of words
  • Tap predict

The outcome will be the top three predictions for the next word.