App Description and Instructions
The goal of the capstone project was to create a word prediction algorithm and deploy it in a Shiny application. We were instructed to train our model on a collection of newspaper articles, blog posts, and Twitter feeds. An embedded live version of the app appears on the next slide.
INSTRUCTIONS:
- Enter your sentence in the input field and hit “Submit”.
- A primary suggestion for the next word will show up in the table to the right. There will be another table with supplementary suggestions beneath it.
- Read the DETAIL tab in the embedded app (Data/Sampling tab) to learn more about the text-mining process.
- Check out the EXPLORE tab in the embedded app to see the most frequent n-grams in the sampled Corpus.
Try the app for yourself; this is an embedded live version.
Algorithm
The algorithm works as follows:
- Clean the input sentence.
- Determine the length (n) of the cleaned sentence.
- If \( n \geq 3 \), search the 4-grams for entries whose first three words match words \( n-2 \), \( n-1 \), and \( n \) of the sentence.
- If there are no matches, back off to the 3-grams, and so on down to shorter n-grams.
- Return the top words in descending order of likelihood.
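The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the app's actual code: the function names (`clean`, `build_tables`, `predict`), the toy corpus, and the use of raw counts as a stand-in for likelihood are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict


def clean(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_tables(sentences, max_order=4):
    """Map each context (a tuple of 0 to max_order-1 words) to next-word counts."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        words = clean(sentence)
        for i, word in enumerate(words):
            for k in range(max_order):
                if i - k < 0:
                    break  # not enough preceding words for a context this long
                context = tuple(words[i - k:i])
                tables[context][word] += 1
    return tables


def predict(tables, sentence, top=3, max_order=4):
    """Try the longest available context first, then back off to shorter ones."""
    words = clean(sentence)
    longest = min(len(words), max_order - 1)
    for k in range(longest, -1, -1):
        context = tuple(words[len(words) - k:])
        candidates = tables.get(context)
        if candidates:
            # Most frequent continuations first (counts as a likelihood proxy).
            return [w for w, _ in candidates.most_common(top)]
    return []


# Hypothetical toy corpus, for illustration only.
corpus = [
    "I want to go home",
    "I want to go to the store",
    "We want to eat",
]
tables = build_tables(corpus)
print(predict(tables, "I want to"))     # matched on the 4-gram context
print(predict(tables, "They want to"))  # no 4-gram match, backs off to 3-grams
```

In this sketch the empty context holds plain word frequencies, so an input with no match at any level still returns the most common words in the corpus.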
Future Work
The algorithm currently relies on, at most, the last 3 words of the sentence.
A sentence also carries long-range context, so the last 3 words may say little about its broader intent. A bag-of-words approach should be explored to collect the words used earlier in the sentence and give the prediction more context.