The goal of the project is to build a word prediction model, similar to SwiftKey on mobile phones, which takes a word or phrase as input and outputs the predicted next word.
There were 3 data sources provided:
- Twitter data
- News story data
- Blog data
To create the training data for the model, I combined all three data sources into one corpus and took a 60% sample of it.
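A minimal sketch of this step in R is below; the file names, the seed, and the use of readLines() are assumptions about how the raw files were read, not details taken from the project.

```r
set.seed(1234)  # assumed seed, for reproducibility

# Read the three provided sources (file names are assumptions)
twitter <- readLines("en_US.twitter.txt", skipNul = TRUE, warn = FALSE)
news    <- readLines("en_US.news.txt",    skipNul = TRUE, warn = FALSE)
blogs   <- readLines("en_US.blogs.txt",   skipNul = TRUE, warn = FALSE)

# Combine all three sources and keep a 60% random sample of the lines
combined <- c(twitter, news, blogs)
train    <- sample(combined, size = round(0.6 * length(combined)))
```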
Next, I created four document-feature matrices (dfms), one per n-gram length (an n-gram is a group of consecutive words). I used quanteda to build n-grams of length 1, 2, 3, and 4. To process and clean the data, I applied word stemming (collapsing words that share the same root) and removed several categories of unwanted content from the text.
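The quanteda sketch below (building on the sample above) illustrates these steps; the specific removals shown (punctuation, numbers, symbols, and URLs) are assumed examples rather than the project's exact cleaning list.

```r
library(quanteda)

# Tokenize and clean; the removal options here are assumed examples
toks <- tokens(train,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)

# Word stemming: collapse words that share the same root
toks <- tokens_wordstem(toks)

# One dfm per n-gram length, from unigrams through quadrigrams
dfms <- lapply(1:4, function(n) dfm(tokens_ngrams(toks, n = n)))
names(dfms) <- c("unigram", "bigram", "trigram", "quadrigram")
```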
I took each of the dfms and used quanteda to create data frames of the top (most frequent) n-grams. In the interest of processing speed, I kept the top 5,000 unigrams and the top 5 million each of the bigrams, trigrams, and quadrigrams.
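One way to build those frequency tables is with quanteda's topfeatures(); the data-frame layout and object names below are illustrative choices, not taken from the project.

```r
# Keep the most frequent n-grams from a dfm as a frequency lookup table
top_ngrams <- function(d, n_keep) {
  tf <- topfeatures(d, n = n_keep)
  data.frame(ngram = names(tf), freq = unname(tf), stringsAsFactors = FALSE)
}

uni  <- top_ngrams(dfms$unigram,    5000)   # top 5,000 unigrams
bi   <- top_ngrams(dfms$bigram,     5e6)    # top 5 million bigrams
tri  <- top_ngrams(dfms$trigram,    5e6)    # top 5 million trigrams
quad <- top_ngrams(dfms$quadrigram, 5e6)    # top 5 million quadrigrams
```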
The prediction algorithm takes the input word or phrase and considers its final three (or fewer) words. With three words of context, it looks for the most frequent quadrigram that starts with those three words. If no match is found, it backs off: it drops the leftmost word of the context and searches the trigrams, then the bigrams, and finally falls back to the most frequent unigram. With two words of context, it starts at the trigrams and backs off in the same way, and so on.
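A sketch of that back-off lookup is below, assuming the uni/bi/tri/quad tables from the previous sketch, with n-grams stored as "_"-joined strings (quanteda's default concatenator); the function itself is illustrative, not the app's actual code.

```r
predict_next <- function(phrase, uni, bi, tri, quad) {
  # Keep the final three (or fewer) words of the input, lowercased
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 3)

  tables <- list(quad, tri, bi)   # longest context first

  while (length(words) > 0) {
    tbl    <- tables[[4 - length(words)]]   # 3 words -> quadrigrams, 2 -> trigrams, 1 -> bigrams
    prefix <- paste0(paste(words, collapse = "_"), "_")
    hits   <- tbl[startsWith(tbl$ngram, prefix), ]
    if (nrow(hits) > 0) {
      best <- hits$ngram[which.max(hits$freq)]   # most frequent matching n-gram
      return(sub(".*_", "", best))               # its last token is the prediction
    }
    words <- words[-1]   # drop the leftmost word and back off to a shorter n-gram
  }

  # No match at any level: fall back to the single most frequent unigram
  uni$ngram[which.max(uni$freq)]
}

# Example call (the phrase is arbitrary)
predict_next("thanks for the", uni, bi, tri, quad)
```

In practice the input phrase would also need the same cleaning and stemming applied to the training text before the lookup; that step is omitted here for brevity.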
The final product can be found here: https://debmartin06.shinyapps.io/Capstone_Word_Predictor/
It takes a minute or two for the data to load, but once loaded, the predictor works fairly quickly. You will know it has loaded when the loading status says "Ready!"
To use it, type or paste a word or phrase into the text box, and the predicted next word will appear beneath it.