Next Word

Justin Nafe
August 18, 2015

Introduction

NextWord by Justin Nafe (available on GitHub at “justinnafe/NextWord”) is an R package for building models that predict the next word. The package includes an example model, which powers the showcase Shiny app referenced on the last slide.

The application uses token frequencies and part-of-speech (POS) tags to predict the next word.

Model

Building the model is a multi-step process:

  • Clean the text
    • Remove profanity
    • Normalize casing
  • Tag the text with POS
  • Extract the tokens (1- to 4-grams)
  • Remove tokens where POS is unknown
  • Calculate the probabilities and sort them in descending order
  • Compress and store the model for efficient storage and retrieval
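
The counting and ranking steps above can be sketched roughly as follows. This is a minimal Python illustration, not the package's R code: the function names (`clean`, `extract_ngrams`, `build_model`) and the one-word `PROFANITY` list are hypothetical, and the POS tagging and compression steps are left out because they depend on an external tagger and storage format not described here.

```python
from collections import Counter, defaultdict

# Hypothetical placeholder; the package's actual profanity list is not shown here.
PROFANITY = {"darn"}


def clean(text):
    """Normalize casing and drop words that appear on the profanity list."""
    words = text.lower().split()
    return [w for w in words if w not in PROFANITY]


def extract_ngrams(words, max_n=4):
    """Count every 1- to max_n-gram in a list of words."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def build_model(text, max_n=4):
    """Map each context (tuple of preceding words) to its candidate next
    words, sorted by descending relative frequency."""
    counts = extract_ngrams(clean(text), max_n)
    grouped = defaultdict(list)
    for gram, count in counts.items():
        context, word = gram[:-1], gram[-1]
        grouped[context].append((word, count))
    model = {}
    for context, candidates in grouped.items():
        total = sum(count for _, count in candidates)
        # Highest count first; ties broken alphabetically for stable output.
        ranked = sorted(candidates, key=lambda wc: (-wc[1], wc[0]))
        model[context] = [(word, count / total) for word, count in ranked]
    return model


model = build_model("the cat sat on the mat and the cat ran")
print(model[("the",)])  # candidates that follow "the", most likely first
```

Keying the table by context means each lookup at prediction time is a single dictionary access, and the candidates are already in ranked order.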

Prediction

The prediction algorithm uses word frequencies and part-of-speech (POS) frequencies drawn from the blogs corpus.

  • The model contains 1- to 4-gram tables, sorted by the combined probability that the token and its POS will occur
  • The prediction algorithm backs off to a lower-order N-gram when the sequence is not found in the higher-order one
  • Accuracy is approximately 14%
  • Results show the next three most likely words
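
A backoff lookup of the kind described above might look like the sketch below. It is written in Python for illustration and is not the package's implementation: `TOY_MODEL`, its probabilities, and `predict_next` are invented for the example, the POS component of the score is not modeled, and the backoff is the simplest form (try the longest context, then shorter ones) with no discounting.

```python
# Toy model: context tuple -> candidate next words, already sorted by
# descending probability. All entries and values are invented.
TOY_MODEL = {
    ("thanks", "for", "the"): [("follow", 0.5), ("help", 0.3), ("support", 0.2)],
    ("for", "the"): [("first", 0.4), ("follow", 0.35), ("last", 0.25)],
    ("the",): [("same", 0.4), ("first", 0.35), ("best", 0.25)],
    (): [("the", 0.5), ("to", 0.3), ("and", 0.2)],
}


def predict_next(words, model, max_context=3, top_k=3):
    """Return up to top_k next-word candidates.

    Tries the longest available context first (up to max_context words,
    i.e. the 4-gram table) and backs off one word at a time, ending with
    the empty context (the unigram table).
    """
    words = [w.lower() for w in words]
    for n in range(min(max_context, len(words)), -1, -1):
        context = tuple(words[len(words) - n:])
        candidates = model.get(context)
        if candidates:
            return [word for word, _ in candidates[:top_k]]
    return []


print(predict_next(["Thanks", "for", "the"], TOY_MODEL))  # found in the 4-gram table
print(predict_next(["look", "at", "the"], TOY_MODEL))     # backs off to the 2-gram table
```

One design choice to note: this sketch returns as soon as any context matches, so it can return fewer than three words if the matching entry is short. Filling the remaining slots from lower-order tables would be a reasonable alternative.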

References