Capstone Word Prediction Project

Brad Dietz
1/18/2016

Summary

Predicting the next word a user will type is valuable, especially on the web and on mobile devices. The goal of this Capstone project is to build a model that predicts the next word a user will type. The model is accessible over the internet as a Shiny application.

This submission uses bi-, tri-, quad-, and penta-gram data and the backoff method to predict the next word a user will type. The original scope of the project was expanded to include two guesses for each prediction.

Instructions

Go to https://bdietz77.shinyapps.io/WordPredict/

The left half of the screen is the text input box. The model produces two predictions for whatever text you enter.

The right half of the screen shows the output of the model, which is where the two predictions are displayed.

Click the 'About' button in the right half of the screen for a short summary of the project.

Data (Construction of N-Gram tables)

100% of the data was processed to construct the N-Gram tables

  • scan was called inside a for loop, loading 1/10 of each of the 3 datasets per iteration
  • The loaded data was formatted and standardized
    • Convert to ASCII, remove punctuation and numbers, and convert to lowercase
  • Split the data into words using txt.to.words (stylo)
  • Use the make.ngrams (stylo) function to create uni, bi, tri, quad, and penta N-Gram tables
  • Remove N-Grams with counts of 1 or 2 to save space
  • Aggregate all the data and write a CSV file for each N-Gram table
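
The cleaning, splitting, counting, and filtering steps above can be sketched roughly as follows. This is a minimal illustration in base R, not the project's code: clean_text, to_words, make_ngrams, and count_ngrams are hypothetical stand-ins (the project itself used txt.to.words and make.ngrams from stylo), the sample lines and the output file name are made up, and the chunked scan loop is left out.

```r
# Hypothetical helper: convert to ASCII, lowercase, drop punctuation and numbers.
clean_text <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  x <- tolower(x)
  gsub("[^a-z\\s]", "", x, perl = TRUE)
}

# Hypothetical stand-in for stylo's txt.to.words: split on whitespace.
to_words <- function(x) {
  words <- unlist(strsplit(x, "\\s+"))
  words[nzchar(words)]
}

# Hypothetical stand-in for stylo's make.ngrams: n consecutive words per entry.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count each n-gram and keep only those seen at least min_count times
# (min_count = 3 drops counts of 1 and 2).
count_ngrams <- function(ngrams, min_count = 3) {
  counts <- table(ngrams)
  counts <- counts[counts >= min_count]
  data.frame(ngram = as.character(names(counts)),
             count = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Made-up sample data, for illustration only.
lines <- c("I like green eggs!", "I like green ham.",
           "I like green 2 tea", "You like ham")

bigrams <- unlist(lapply(lines, function(l) make_ngrams(to_words(clean_text(l)), 2)))
bigram_table <- count_ngrams(bigrams, min_count = 3)
print(bigram_table)  # keeps "i like" and "like green", each with count 3

# One CSV file per N-Gram table (the file name here is an assumption).
write.csv(bigram_table, file.path(tempdir(), "bigrams.csv"), row.names = FALSE)
```

Building N-Grams line by line, as here, keeps them from spanning two unrelated lines; whether the original pipeline did the same is not stated.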

Abridged Explanation of the Algorithm

Back Off Method

  • Load the Bi, Tri, Quad, and Penta N-Gram tables using fread (data.table)
  • User input is split into words using txt.to.words (stylo)
  • grep is run on the N-Gram tables, passing the last words of the user input
    • Example: The last 4 user input words are matched against the beginning of the entries in the Penta N-Gram table…
  • The grep results from the N-Gram tables are combined using rbind
  • The last word from each entry is selected using word (stringr)
  • The predicted word table is finally constructed by removing duplicates using unique
  • The two guesses are taken from the resulting table
  • If the predicted word table does not exist, the First Guess is 'the' and the Second Guess is 'to'
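
The back-off steps above can be sketched as follows. This is a minimal illustration in base R under several assumptions, not the deployed code: predict_next and the toy ngram_tables list are hypothetical, each table is taken to be a plain character vector already ordered with the most frequent entries first, c() and sub() stand in for rbind and word (stringr), and padding a single match with the default words is a choice made here rather than something the slides specify.

```r
# Hypothetical sketch of back-off prediction.
# ngram_tables: list named by N-Gram order ("5", "4", "3", "2"); each element
# is a character vector of N-Grams, assumed sorted by descending count.
predict_next <- function(input, ngram_tables, defaults = c("the", "to")) {
  # Clean and split the user input into words.
  cleaned <- gsub("[^a-z\\s]", "", tolower(input), perl = TRUE)
  words <- unlist(strsplit(trimws(cleaned), "\\s+"))
  words <- words[nzchar(words)]

  hits <- character(0)
  # Query the highest-order table first, then back off to lower orders.
  for (n in sort(as.integer(names(ngram_tables)), decreasing = TRUE)) {
    k <- n - 1                      # number of context words for this table
    if (length(words) < k) next     # not enough input for this order
    prefix <- paste(tail(words, k), collapse = " ")
    table_n <- ngram_tables[[as.character(n)]]
    # Anchored grep: entries that begin with the last k input words.
    hits <- c(hits, grep(paste0("^", prefix, " "), table_n, value = TRUE))
  }

  # Keep the last word of each match and remove duplicates.
  candidates <- unique(sub("^.* ", "", hits))

  # Fall back to (or pad with) the default guesses.
  unique(c(candidates, defaults))[1:2]
}

# Toy tables, for illustration only.
ngram_tables <- list(
  "3" = c("thanks for the", "thanks for your", "for the first"),
  "2" = c("for the", "for a", "the first")
)

print(predict_next("Thanks for", ngram_tables))  # "the" "your"
print(predict_next("zzz", ngram_tables))         # "the" "to"
```

Because higher-order matches are collected first, a word supported by a longer context outranks one found only in a shorter-context table.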