Next Word

Justin Nafe
August 18, 2015

Introduction

NextWord by Justin Nafe (available on GitHub at “justinnafe/NextWord”) is an R package for building models that predict the next word. The package includes an example model, which is used in the showcase Shiny app referenced on the last slide.

The application uses token frequencies and part-of-speech (POS) tags to predict the next word.

Model

Building the model is a multi-step process:

1. Clean the text
   - Remove profanity
   - Normalize casing
2. Tag the text with POS
3. Extract the tokens (1- to 4-gram tokens)
4. Remove tokens where the POS is unknown
5. Calculate the probabilities and sort in descending order
6. Compress and store the model for efficient storage and retrieval
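
The cleaning, token extraction, and probability steps above can be sketched in a few lines. This is an illustration in Python rather than the package's R code, and it rests on several assumptions: the profanity list is a hypothetical placeholder, the probability is taken to be the relative frequency of a token among tokens of the same length, and POS tagging and compression are left out because they depend on external tools.

```python
import re
from collections import Counter

# Hypothetical placeholder list; a real filter would use a fuller lexicon.
PROFANITY = {"darn"}

def clean(text):
    """Lowercase the text, split it into words, and drop listed profanity."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in PROFANITY]

def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_model(text, max_n=4):
    """Map each n in 1..max_n to a list of (ngram, probability) pairs,
    most probable first. Probability here is an assumed definition:
    the n-gram's count divided by the count of all n-grams of that length."""
    words = clean(text)
    model = {}
    for n in range(1, max_n + 1):
        counts = Counter(ngrams(words, n))
        total = sum(counts.values())
        table = [(gram, count / total) for gram, count in counts.items()]
        # Descending probability; ties are broken alphabetically.
        table.sort(key=lambda item: (-item[1], item[0]))
        model[n] = table
    return model

model = build_model("The cat sat on the mat. The cat ran, darn it.")
print(model[1][0])  # most frequent single word with its probability
print(model[2][0])  # most frequent two-word token with its probability
```

For simplicity the sketch lets tokens span sentence boundaries; a fuller version would probably split on sentences first.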

Prediction

The prediction algorithm uses the frequencies of words and their part-of-speech (POS) tags, both derived from the blogs corpus.

- The model contains 1- to 4-gram models, each sorted by the combined probability that the token and its POS will occur.
- The prediction algorithm backs off to a lower-order n-gram model if the sequence is not found in the higher-order one.
- Accuracy is approximately 14%.
- Results show the three most likely next words.
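
The backoff idea can be sketched as follows. This is an illustration in Python, not the package's R implementation: the `MODEL` table is hand-written toy data, `predict` is a hypothetical helper, and the choice to fill any remaining slots from shorter contexts when a longer context yields fewer than three words is an assumption.

```python
# Toy model, hand-written for illustration: context tuple -> candidate next
# words, already sorted from most to least probable. () is the unigram fallback.
MODEL = {
    ("thanks", "for", "the"): ["follow", "help"],
    ("for", "the"): ["first", "follow", "help", "rest"],
    ("the",): ["first", "same", "best"],
    (): ["the", "to", "and"],
}

def predict(words, model=MODEL, max_context=3, top_k=3):
    """Return up to top_k next-word candidates, backing off to shorter contexts."""
    words = [w.lower() for w in words]
    results = []
    # Try the longest context first (3 words for a 4-gram), then back off
    # one word at a time until the empty (unigram) context is reached.
    for size in range(min(max_context, len(words)), -1, -1):
        context = tuple(words[len(words) - size:])
        for candidate in model.get(context, []):
            if candidate not in results:
                results.append(candidate)
            if len(results) == top_k:
                return results
    return results

print(predict(["thanks", "for", "the"]))  # 4-gram match, topped up from the 3-gram
print(predict(["on", "the"]))             # no 3-gram match, backs off to the 2-gram
print(predict(["hello"]))                 # unseen word, backs off to unigrams
```

Storing each context's candidates pre-sorted keeps prediction to a few dictionary lookups, which fits the model's sort-then-store design described earlier.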

References

The Shiny app can be found at https://justinnafe.shinyapps.io/NextWordAnalysis
The NextWord project code can be found at https://github.com/justinnafe/NextWord
The data used for this application was sourced from HC Corpora (www.corpora.heliohost.org) for Coursera's Data Science Capstone project.