Capstone Word Prediction Project

Brad Dietz
1/18/2016

Summary

Predicting the next word a user will type is valuable, especially on the web and on mobile devices. The goal of this Capstone project is to build a model that makes this prediction. The model is accessible over the internet as a Shiny application.

This submission uses bi-, tri-, quad-, and penta-gram data and the back-off method to predict the next word a user will type. The original scope of the project was expanded to return 2 guesses for each prediction.

Instructions

Go to https://bdietz77.shinyapps.io/WordPredict/

The left half of the screen is the text input box. The model produces 2 predictions for whatever text you enter.

The right half of the screen shows the output of the model: the 2 predictions.

Click the 'About' button in the right half of the screen for a short summary of the project.

Data (Construction of N-Gram tables)

100% of the data was processed to construct the N-Gram tables

scan was called inside a for loop, loading 1/10 of each of the 3 datasets at a time
The loaded data was formatted and standardized
- Convert to ascii, remove punctuation and numbers, and convert to lowercase
Split the data into words using txt.to.words (stylo)
Use the make.ngrams (stylo) function to create uni, bi, tri, quad, and penta N-Gram tables
Remove N-Gram counts of 1 and 2 to save space
Aggregate all the data together and write a csv file for each N-Gram table
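
The steps above can be sketched as follows. This is a minimal Python illustration of the same idea, not the project's code, which used scan and the stylo functions in R. The names clean, ngrams and build_table, the min_count parameter, and the sample text are all assumptions made for the example.

```python
import re
import unicodedata
from collections import Counter


def clean(text):
    """Convert to ASCII, replace punctuation and digits with spaces, and lowercase."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    letters_only = re.sub(r"[^A-Za-z\s]", " ", ascii_text)
    return letters_only.lower()


def ngrams(words, n):
    """Return every run of n consecutive words as a space-joined string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_table(chunks, n, min_count=3):
    """Count n-grams over all chunks, then drop those seen fewer than min_count times."""
    counts = Counter()
    for chunk in chunks:  # each chunk stands in for one slice of a dataset
        counts.update(ngrams(clean(chunk).split(), n))
    # min_count=3 removes n-grams with a count of 1 or 2
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Made-up sample text, processed in two chunks
chunks = ["I want to go. I want to eat!", "They want to go; we want to go."]
bigrams = build_table(chunks, 2)
```

Counting across chunks before pruning matters: an N-Gram that appears only once or twice in each slice can still pass the threshold once the slices are aggregated.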

Abridged Explanation of the Algorithm

Back Off Method

Load the Bi, Tri, Quad, and Penta N-Gram tables using fread (data.table)

User input is split into words using txt.to.words (stylo)

grep is run on the N-Gram tables, passing the last words of the user input

Example: The last 4 user input words are matched against the beginning of the entries in the Penta N-Gram table…
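
As an illustration of this lookup, the sketch below anchors the last 4 input words at the start of each Penta N-Gram entry, in the way a grep pattern beginning with ^ would. It is written in Python rather than the project's R, and the table contents and the name match_prefix are invented for the example.

```python
import re


def match_prefix(table, words, k):
    """Return the entries of `table` that begin with the last k words of `words`."""
    if len(words) < k:
        return []
    prefix = " ".join(words[-k:])
    # Anchor at the start and require a trailing space so only whole words match.
    pattern = re.compile("^" + re.escape(prefix) + " ")
    return [entry for entry in table if pattern.search(entry)]


# Hypothetical Penta N-Gram entries
penta = [
    "at the end of the",
    "the end of the day",
    "the end of the world",
    "in the middle of the",
]

typed = "it was the end of the".split()
hits = match_prefix(penta, typed, 4)
```

Here hits holds the two entries that start with "the end of the"; entries that merely contain those words later in the string are not matched.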

The results from the N-Gram grep calls are combined using rbind

The last word from each entry is selected using word (stringr)

The predicted word table is finally constructed by removing duplicates using unique

The two guesses are taken from this table

If the predicted word table does not exist, the First Guess is 'the' and the Second Guess is 'to'
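
The whole back-off procedure might look like the sketch below. It is a simplified Python rendering under several assumptions, not the Shiny app's R code: the tables are plain lists keyed by N-Gram size, matches from the longer N-Grams are placed first, and a single match is padded with one of the default words. The name predict and the table contents are hypothetical.

```python
def predict(tables, text, defaults=("the", "to")):
    """Back off from the longest N-Gram table to the shortest; return two guesses."""
    words = text.lower().split()
    matches = []
    for n in sorted(tables, reverse=True):  # e.g. 5, 4, 3, 2
        k = n - 1  # number of context words for an n-gram
        if len(words) < k:
            continue
        prefix = " ".join(words[-k:]) + " "
        matches += [entry for entry in tables[n] if entry.startswith(prefix)]
    # Keep the last word of each match, dropping duplicates but preserving order.
    candidates = list(dict.fromkeys(entry.split()[-1] for entry in matches))
    # Assumption: pad with the default words when fewer than two candidates exist.
    for word in defaults:
        if len(candidates) >= 2:
            break
        if word not in candidates:
            candidates.append(word)
    return candidates[:2]


# Hypothetical tables keyed by N-Gram size
tables = {
    5: ["the end of the day"],
    4: ["end of the day", "end of the world"],
    3: ["of the year", "of the day"],
    2: ["the same", "the first"],
}

guesses = predict(tables, "At the end of the")
fallback = predict(tables, "hello zebra")
```

With these tables, guesses comes from the Penta and Quad matches, while fallback receives the two default words because no table contains a matching entry.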