David Hayes
(August 2015)
This presentation was created to support the corresponding Shiny interactive web application, Word Prediction, for the Coursera Data Science Capstone Project module, part of the Data Science Specialization.
The application can be accessed via the following link: https://davidhayes2.shinyapps.io/Word2/
The initial Exploratory Analysis can be accessed via: http://rpubs.com/DavidHayes2/94670
Interactive web front-end application deployed on the shinyapps.io web server.
Allows entry of a Phrase (multiple words).
The app searches previously constructed N-grams to predict the next word with sub-second performance. By default a single word is predicted; up to 3 predictions are shown if the checkbox is un-ticked.
N-grams were constructed by analysing over 100 million words across 3 corpora (News, Twitter and Blogs) to capture the relationships between words.
The application has an option to stop any of the top 15 profanity words from being displayed (if predicted). Run-time statistics can also be displayed.
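The display options above (a single prediction or up to three, with optional profanity filtering) could be sketched roughly as follows. This is an illustrative Python sketch, not the app's Shiny code: the `BLOCKED` set, the function name `top_predictions` and its parameters are hypothetical, and filtering before truncating the ranked list is an assumption.

```python
# Hypothetical stand-in for the app's list of 15 blocked words.
BLOCKED = {"darn", "heck"}


def top_predictions(candidates, single=True, filter_profanity=True):
    """Return the best candidate word, or up to three, ranked by count.

    candidates maps each candidate next word to its N-gram count.
    """
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    if filter_profanity:
        # Assumption: blocked words are removed before the list is cut down,
        # so a filtered word is replaced by the next-ranked candidate.
        ranked = [word for word in ranked if word not in BLOCKED]
    return ranked[:1] if single else ranked[:3]
```

With `single=True` the function mirrors the default one-word behaviour; passing `single=False` corresponds to un-ticking the checkbox.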
1. This application uses pre-built N-grams (N = 2, 3 and 4) to predict the next word.
2. The phrase is entered via a text input box. The last 3 words (or fewer) are extracted from the phrase, and any leading and trailing spaces are removed, along with any numbers or punctuation (except apostrophes).
3. The algorithm uses the Stupid Backoff approach, searching for a match starting with the 4-gram. If the N-gram count is zero (the entered phrase is not found), we back off to the (N-1)-grams, searching the 3-gram and then the 2-gram if required.
4. If a non-zero count is found, the predicted word is returned.
5. If no word is identified after searching the 4-, 3- and 2-grams, an arbitrary (escape) word is selected based upon the last word in the phrase.
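The steps above can be illustrated with a minimal Python sketch. It is not the app's actual Shiny code: the toy `NGRAMS` tables, the fixed `ESCAPE_WORD` and the function names are hypothetical, and lower-casing the input is an assumption. It implements the simplified backoff search described here (take the most frequent word at the highest order that matches) and does not compute discounted Stupid Backoff scores.

```python
import re

# Hypothetical toy tables: for each order n, a context tuple of n-1 words
# maps to {next_word: count}. The real tables come from the three corpora.
NGRAMS = {
    4: {("thanks", "for", "the"): {"follow": 12, "help": 7}},
    3: {("for", "the"): {"first": 30, "follow": 12}},
    2: {("the",): {"same": 50, "first": 30}},
}

# Placeholder fallback. The app picks an escape word based on the last word
# of the phrase; a single fixed word is used here for simplicity.
ESCAPE_WORD = "the"


def clean_phrase(phrase, max_words=3):
    """Drop digits and punctuation (except apostrophes), keep the last words."""
    text = re.sub(r"[^a-z' ]+", " ", phrase.lower())  # lower-casing is assumed
    return tuple(text.split()[-max_words:])


def predict(phrase):
    """Search the 4-gram table first, then back off to 3-grams and 2-grams."""
    words = clean_phrase(phrase)
    for n in (4, 3, 2):
        context = words[-(n - 1):]
        if len(context) < n - 1:
            continue  # phrase too short for this order, so back off
        candidates = NGRAMS[n].get(context)
        if candidates:  # non-zero count found at this order
            return max(candidates, key=candidates.get)
    return ESCAPE_WORD  # nothing matched in any table
```

For example, `predict("wait for the")` finds no 4-gram match for the full context and backs off to the 3-gram context `("for", "the")`.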