Next Word Model

Logan J Travis
2014-12-14

Johns Hopkins University Data Science Capstone on Coursera

Summary

Project Goal
Predict the next word from arbitrary input text.

My Goal
Develop a low-processing-power model that can easily add custom words and learn from user input. I worked within these constraints to attain that goal:

  • Tables or simpler data structures within the model; trees, custom classes, etc. complicate user updates.
  • Minimal calculation at run time, ideally lookup-only to allow hashing (see the sketch after this list).
  • Ordered prediction for multiple next words.
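
To make the lookup-only constraint concrete, here is a toy sketch (my own illustration, not the project's code) of a prediction table stored in a hashed R environment; the keys and entries are made up:

```r
# Toy illustration of the lookup-only constraint: predictions live in a
# hashed environment keyed by the preceding words, so predicting is a
# single keyed lookup with no run-time calculation.
lookup <- new.env(hash = TRUE)
lookup[["i want to"]]     <- c("go", "see", "be")   # toy entries; the real
lookup[["thanks for the"]] <- c("follow", "rt")     # table comes from the corpus

predictNext <- function(key) {
    mget(key, envir = lookup, ifnotfound = list(character(0)))[[1]]
}

predictNext("i want to")  # "go" "see" "be"
```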

Research

I recognized two problems when investigating the Hans Christensen Corpora (HC Corpora):

  1. My computing resources did not match the size of the data.
  2. Calculating n-grams to capture word order multiplied the problem.

[Figure: tdm2gramNoStop, a 2-gram term-document matrix with stop words removed]

Model Part 1

My research led me to develop a four-step model built from a 10% sample of the HC Corpora:

  1. Clean Input Text
    Uses the qdap package to replace abbreviations, contractions, symbols, and excess blank space.
  2. Predict Parts of Speech
    Tags parts of speech (POS) in the clean text, then matches the trailing POS trigram to a pre-calculated frequency matrix. The matrix proportions weight potential next words (both steps are sketched below).
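
A minimal sketch of these two steps in R, assuming qdap's replacement and pos() helpers and a hypothetical pre-computed table posTrigramFreq with columns trigram, nextPos, and prop (those names are mine, not the model's):

```r
library(qdap)

# Step 1 sketch: clean raw input with qdap's replacement helpers.
cleanInput <- function(text) {
    text <- replace_abbreviation(text)   # e.g. "Dr." -> "Doctor"
    text <- replace_contraction(text)    # e.g. "can't" -> "cannot"
    text <- replace_symbol(text)         # e.g. "%" -> "percent"
    trimws(gsub("\\s+", " ", text))      # collapse excess blank space
}

# Step 2 sketch: tag POS, then weight next-POS candidates by the trailing
# trigram's row in a pre-calculated frequency matrix (posTrigramFreq assumed).
weightNextPos <- function(cleanText, posTrigramFreq) {
    tags <- pos(cleanText)$POStagged$POStags[[1]]   # POS tags via qdap/openNLP
    key  <- paste(tail(tags, 3), collapse = " ")    # trailing POS trigram
    hits <- posTrigramFreq[posTrigramFreq$trigram == key, ]
    hits[order(hits$prop, decreasing = TRUE), c("nextPos", "prop")]
}
```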

Model Part 2

  1. Filter Sentences from Corpora
    Thanks again to the qdap package, I created a term document matrix with individual sentences as the terms. The model selects potential words from sentences with at least n - 1 matches, where n is the number of preceding words (set to 5, but see the next slide).
  2. Order Next Words
    Weighted by POS and filtered to similar sentences, the model returns an ordered list of next words (sketched below). I limited the range from 1 to 10 in the Shiny application.
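
A rough sketch of these two steps, assuming a character vector corpusSentences of pre-extracted sentences and a named vector posWeights carried over from the POS step (both names are mine, and the candidate pooling is simplified):

```r
# Sentence-filter sketch: keep sentences sharing at least n - 1 of the last
# n input words (n = 5 in the model), whose words become candidates.
filterSentences <- function(inputWords, corpusSentences, n = 5) {
    recent <- tolower(tail(inputWords, n))
    keep <- vapply(strsplit(tolower(corpusSentences), "\\s+"), function(words) {
        sum(recent %in% words) >= length(recent) - 1
    }, logical(1))
    corpusSentences[keep]
}

# Ordering sketch: count candidate next words from the filtered sentences,
# scale by the POS weights, and return the top k (1 to 10 in the Shiny app).
orderNextWords <- function(candidates, posWeights, k = 10) {
    freq <- table(candidates)
    wts  <- posWeights[names(freq)]
    wts[is.na(wts)] <- 1                 # unweighted fallback for unseen POS
    score <- sort(freq * wts, decreasing = TRUE)
    names(head(score, k))
}
```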

Results

In a few words? Not great!

I split my data into training and testing sets but, based on informal use of my Shiny application, anticipate accuracy below 10%. Give it a try.

In fact, my initial results made no sense. I have since adjusted model parameters, especially the next-POS and similar-sentence filters, to yield predictions that at least approach reason. I plan further tweaks after reviewing my fellow students' projects.