Word Prediction App
Bradley Boehmke
December 14, 2014
Overview
- This presentation highlights a word prediction algorithm and the Shiny application built on it
- The algorithm and app predict the most probable word to follow a sequence of text provided by the user
- The slides that follow will summarize:
  - The data used
  - The prediction algorithm applied
  - How the Shiny app works
Data
- Initial data was obtained from publicly available social media (Blog, News, & Twitter data). More details here
- Data contains over 4 million lines of text and over 100 million words. Stats here
- Preprocessing (see example of process here):
  - Sampled approximately 50% of the initial data
  - Removed all non-alphabetic characters (numbers, punctuation, special characters) and converted text to lowercase to eliminate case sensitivity
  - Removed profanity
  - Extracted sequences of words (2-, 3-, 4-, and 5-grams) and their frequencies
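The preprocessing steps above can be sketched in a few lines. This is an illustrative Python sketch, not the code behind the app: the function names (`sample_lines`, `clean`, `ngram_counts`), the placeholder `PROFANITY` set, and the choice to replace non-alphabetic characters with a space (rather than delete them) are all assumptions.

```python
import random
import re
from collections import Counter

# Hypothetical placeholder; the source does not say which profanity list was used.
PROFANITY = {"darn", "heck"}

def sample_lines(lines, fraction=0.5, seed=1):
    """Keep each line with probability `fraction` (roughly a 50% sample by default)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def clean(text, banned=PROFANITY):
    """Lowercase, replace non-alphabetic characters with spaces, drop banned words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [word for word in text.split() if word not in banned]

def ngram_counts(lines, sizes=(2, 3, 4, 5)):
    """Count every n-gram of the requested sizes, one line at a time."""
    counts = {n: Counter() for n in sizes}
    for line in lines:
        words = clean(line)
        for n in sizes:
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts

corpus = ["I love Data Science!", "I love data, 100%."]
counts = ngram_counts(corpus)
print(counts[2].most_common(2))
```

Counting n-grams per line keeps sequences from spanning two unrelated posts; whether the original pipeline did the same is not stated.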
Prediction Algorithm
- The approach applied is a Simple Backoff Algorithm
- User provides a character sequence, which is passed to the algorithm
- User input is preprocessed in a similar manner to the training data; if the sequence contains more than 4 words, only the final 4 words are kept
- Algorithm identifies the length of the user input and searches for a matching n-gram
- If a match is found, the model selects the most probable word that follows
- If no matching n-gram exists, the algorithm “backs off” by reducing the user input to an (n-1)-gram and searching again for a match
- If no match exists after backing off to the smallest possible n-gram, the algorithm searches for partial n-gram matches (e.g., “data xxx capstone course” and/or “data science xxx course”)
- If no partial matches exist, the algorithm predicts the most common single words found in the data
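The backoff steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the app's actual implementation: `build_model` and `predict` are hypothetical names, the input is assumed to be already preprocessed into a list of words, probability is approximated by raw frequency, and the partial-match step is interpreted as wildcarding one word of the full input at a time.

```python
from collections import Counter

def build_model(gram_counts):
    """Map each context (all words but the last) to a Counter of next words."""
    model = {}
    for gram, freq in gram_counts.items():
        model.setdefault(gram[:-1], Counter())[gram[-1]] += freq
    return model

def predict(words, model, unigrams, max_context=4, top=1):
    """Return the `top` most frequent next words using simple backoff."""
    full = tuple(words[-max_context:])  # keep only the final 4 words

    # Exact match, backing off by dropping the earliest word each time.
    context = full
    while context:
        if context in model:
            return [w for w, _ in model[context].most_common(top)]
        context = context[1:]

    # Partial match (assumed interpretation): ignore one position at a time.
    pooled = Counter()
    if len(full) >= 2:
        for skip in range(len(full)):
            for stored, nexts in model.items():
                if len(stored) == len(full) and all(
                    s == c for i, (s, c) in enumerate(zip(stored, full)) if i != skip
                ):
                    pooled.update(nexts)
    if pooled:
        return [w for w, _ in pooled.most_common(top)]

    # Last resort: the most common single words in the data.
    return [w for w, _ in unigrams.most_common(top)]

# Toy counts for illustration only; they are not taken from the real corpus.
grams = Counter({
    ("data", "science"): 3,
    ("data", "mining"): 1,
    ("science", "capstone"): 2,
    ("data", "science", "capstone"): 2,
    ("data", "mining", "capstone", "course"): 1,
})
unigrams = Counter({"the": 5, "data": 4, "science": 3})
model = build_model(grams)

print(predict(["big", "data"], model, unigrams))
print(predict(["data", "science", "capstone"], model, unigrams))
```

In the second call no exact context matches, so the prediction comes from the partial match “data xxx capstone course”. The linear scan over stored contexts is kept for readability; a real app would likely index the n-grams for faster lookup.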
Shiny App
- Two apps to choose from:
  - Full-scale model: App 1
  - Reduced-scale model (reduces the chance of the frozen gray screen issue on the shiny.io server): App 2
- Word Prediction Tab: Enter a phrase in the left-hand panel text box and click “Submit Sentence”. The most probable word to follow your input appears in blue, followed by the next five most probable words.
- Word Cloud Tab: The top 50 predicted words are displayed as a word cloud