Coursera Capstone: Word Predictor

Gavin
April 19, 2018

Improved text prediction methods are in high demand

Many tools are available, but online tools are not reactive.
Many users want a simple input box.
Current methods need updating because language keeps shifting.

My solution

Mined text from blogs, news, and Twitter posts.
Processed text, removed bad words, and identified n-grams.
Produced tables of 4-grams, 3-grams, and 2-grams.
The last words of the input sentence are matched against the start of known n-grams, and the final word of the matching n-gram is returned as the prediction.
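
As a rough illustration of the n-gram tables described above, here is a minimal Python sketch. It is not the code behind the tool; the function name `count_ngrams` and the toy sentence are assumptions made for this example.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy input standing in for the cleaned blog, news, and Twitter text.
tokens = "thanks for the follow thanks for the help".split()

bigrams = count_ngrams(tokens, 2)
trigrams = count_ngrams(tokens, 3)
fourgrams = count_ngrams(tokens, 4)
```

Each table maps an n-gram (as a tuple of words) to the number of times it was seen.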

Algorithm and novel feature

The algorithm is quite straightforward:

Checks whether the last 3 words of the sentence match the first 3 words of a known 4-gram and, if so, returns that 4-gram's last word.
Does the same with 3-grams and 2-grams (matching the last 2 words and the last word, respectively).
Returns the most common word in the database if no match is found.
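
The three steps above can be sketched in Python as a simple back-off lookup. This is an illustration under assumptions, not the tool's actual code: `predict_next`, the `tables` layout, and the toy entries are hypothetical, and each context is assumed to store a single predicted word (for example, its most frequent continuation).

```python
def predict_next(sentence, tables, fallback):
    """Back off from 3-word to 1-word contexts (hypothetical helper).

    tables maps a context length k (3, 2, 1) to a dict of
    {tuple of k words: predicted next word}.
    """
    words = sentence.lower().split()
    for k in (3, 2, 1):  # 4-gram, then 3-gram, then 2-gram lookup
        if len(words) >= k:
            context = tuple(words[-k:])
            if context in tables[k]:
                return tables[k][context]
    return fallback  # the most common word in the database

# Toy tables, invented for this example.
tables = {
    3: {("thanks", "for", "the"): "follow"},
    2: {("for", "the"): "first"},
    1: {("the",): "same"},
}

print(predict_next("Thanks for the", tables, fallback="the"))
```

Longer contexts are tried first because a match on more words is a more specific signal than a match on fewer.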

To save memory, I restricted the n-gram database to n-grams observed at least 5 times in the training dataset. This reduced its size roughly 100-fold and greatly reduced the running time.
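
The frequency cut-off can be sketched in a few lines of Python. The threshold of 5 comes from the text above; the function name `prune` and the toy counts are assumptions made for this example.

```python
from collections import Counter

def prune(counts, min_count=5):
    """Keep only n-grams seen at least min_count times (hypothetical helper)."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}

# Toy counts, invented for this example.
counts = Counter({
    ("for", "the"): 12,
    ("the", "follow"): 5,
    ("follow", "thanks"): 1,
})

kept = prune(counts)
```

Rare n-grams make up most of the distinct entries in such a table, so dropping them shrinks it sharply while keeping the most frequently seen n-grams.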

Tool usage

The tool is available here: https://gavinmdouglas.shinyapps.io/coursera_capstone/

Using this tool is extremely easy: simply type a sentence in the box at the above link and the predicted next word will be output below!