Capstone Project: Word prediction

E. Chasen

3/26/2017

Word Prediction

The predictive text model and application were built from three datasets supplied by the Coursera Data Science Capstone class: a blog text file, a Twitter text file, and a news text file.

After combining and cleaning the three datasets, four new datasets of varying n-gram lengths were created: a bigram, trigram, fourgram, and fivegram dataset. An n-gram is a sequence of n words that appear next to each other in a body of text; a bigram, for example, is a pair of adjacent words.
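
As a rough illustration of how n-gram datasets can be derived from cleaned text, here is a minimal Python sketch. It is not the code used for this project; the helper name `ngrams` and the toy sentence are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy text standing in for the cleaned corpus (an assumption for illustration).
tokens = "thanks for the follow and thanks for the mention".split()

# Count how often each bigram and trigram occurs.
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(2))
```

The same function with n set to 4 or 5 would give the fourgram and fivegram counts.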

Algorithm explanation

Larger n-grams carry more context, so the algorithm first searches the fivegram dataset for a match to the end of the input text. If there are no matches in the fivegram dataset, the algorithm moves on to the fourgram dataset, then the trigram dataset, and lastly the bigram dataset.

If none of the n-gram datasets produces a match, the next-word prediction comes from the list of most frequent unigrams (single words).
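
The search order described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the app's actual implementation: the tables, their counts, the unigram list, and the function name `predict` are all hypothetical, and the sketch assumes that a lookup in an n-gram dataset matches the last n - 1 words of the input.

```python
from collections import Counter

# Hypothetical toy tables keyed by n-gram order. Each maps a context of
# n - 1 words to a Counter of the words observed after that context.
TABLES = {
    5: {("thanks", "for", "the", "follow"): Counter({"back": 3})},
    4: {("for", "the", "follow"): Counter({"up": 2, "back": 1})},
    3: {("at", "the"): Counter({"end": 4, "same": 3, "moment": 1})},
    2: {("the",): Counter({"first": 5, "same": 4})},
}

# Hypothetical list of the most frequent single words, most frequent first.
TOP_UNIGRAMS = ["the", "to", "and"]

def predict(text, k=1):
    """Return up to k candidate next words, trying the longest n-grams first."""
    words = text.lower().split()
    for n in (5, 4, 3, 2):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # input is too short to form a context for this order
        followers = TABLES[n].get(context)
        if followers:
            return [word for word, _ in followers.most_common(k)]
    # No n-gram table matched: fall back to the most frequent unigrams.
    return TOP_UNIGRAMS[:k]

print(predict("meet me at the", k=3))
print(predict("hello world", k=2))
```

In this sketch the first call finds no fivegram or fourgram match and is answered from the trigram table, while the second call matches nothing and falls through to the unigram list. Returning only the candidates from the first table that matches is a design choice of the sketch; a variant could fill any remaining slots from the lower-order tables.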

Shiny App

Press ‘submit’ or the Enter key to get the word prediction.
You can request 1 to 3 options for the next word.

https://emchasen.shinyapps.io/shinyapp/

Working example

https://emchasen.shinyapps.io/shinyapp/