Coursera Capstone: Word Predictor

Gavin
April 19, 2018

Improved text prediction methods are in high demand

  • Many tools are available, but the online ones are not reactive.
  • Many users want a simple input box.
  • Current methods need updating because language keeps shifting.

My solution

  • Mined text from blogs, news, and Twitter posts.
  • Processed text, removed bad words, and identified n-grams.
  • Produced tables of 4-grams, 3-grams, and 2-grams.
  • The last words of the input sentence are matched against the start of known n-grams, and the final word of the matching n-gram is returned as the prediction.
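
The table-building step can be illustrated with a minimal Python sketch. It is not the code behind the tool: the function name `count_ngrams` and the toy sentence are assumptions made for illustration, and the real pipeline also cleans the text and filters bad words first.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy input standing in for the cleaned blog, news, and Twitter text.
tokens = "thanks for the follow thanks for the help".split()

bigrams = count_ngrams(tokens, 2)
trigrams = count_ngrams(tokens, 3)
fourgrams = count_ngrams(tokens, 4)
```

Each key is a tuple of words, so the leading words of an n-gram can be compared directly with the end of an input sentence.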

Algorithm and novel feature

The algorithm is quite straightforward:

  • Checks whether the last 3 words of the sentence match the first 3 words of a known 4-gram and, if so, returns that 4-gram's last word.
  • Otherwise does the same with 3-grams and 2-grams (using the last 2 words and the last 1 word, respectively).
  • Returns the most common word in the database if no match is found.
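
The back-off steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `predict_next`, the layout of `tables`, and the example entries are hypothetical, and the sketch assumes each context has already been reduced to a single predicted word.

```python
def predict_next(sentence, tables, fallback):
    """Back off from 4-grams to 3-grams to 2-grams, then to a default word.

    tables maps a context length (3, 2, 1) to a dict of
    {context tuple: predicted word}; this layout is an assumption.
    """
    words = sentence.lower().split()
    for k in (3, 2, 1):              # longest context first
        if len(words) < k:
            continue                 # sentence too short for this context
        context = tuple(words[-k:])
        if context in tables.get(k, {}):
            return tables[k][context]
    return fallback                  # most common word in the database

# Made-up entries for illustration only.
tables = {
    3: {("thanks", "for", "the"): "follow"},
    2: {("for", "the"): "first"},
    1: {("the",): "same"},
}

print(predict_next("Thanks for the", tables, "the"))
print(predict_next("wait for the", tables, "the"))
```

The first call matches a 4-gram context, while the second finds no 3-word match and falls back to the 2-word context.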

To save memory, I restricted the database to n-grams that were observed at least 5 times in the training dataset. This reduced its size roughly 100-fold and greatly reduced the running time.
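
A frequency cutoff like this one can be sketched in a few lines of Python. The `prune` helper and the counts below are invented for illustration; only the threshold of 5 comes from the description above.

```python
from collections import Counter

def prune(counts, min_count=5):
    """Keep only n-grams seen at least min_count times (hypothetical helper)."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}

# Made-up counts for illustration only.
counts = Counter({("of", "the"): 12, ("in", "a"): 5, ("purple", "yak"): 1})
kept = prune(counts)
```

Rare n-grams make up most of the distinct entries in natural text, which is why a small cutoff can shrink the table so much.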

Tool usage

The tool is available here: https://gavinmdouglas.shinyapps.io/coursera_capstone/

Using this tool is extremely easy: simply type a sentence in the box at the above link and the predicted next word will be output below!