Word Prediction App

Bradley Boehmke
December 14, 2014

Overview

  • This presentation highlights a word prediction algorithm and Shiny application product
  • The algorithm and app predict the most probable word to follow a sequence of text provided by the user
  • The slides that follow will summarize:
    • The data used
    • The prediction algorithm applied
    • How the Shiny app works

Data

  • Initial data was obtained from publicly available social media (Blog, News, & Twitter data). More details here
  • Data contains over 4 million lines of text and over 100 million words. Stats here
  • Preprocessing (see example of process here):
    • Sampled approximately 50% of the initial data
    • Removed all non-alphabetic characters (numbers, punctuation, special characters) and converted text to lowercase to eliminate case sensitivity
    • Removed profanity
    • Extracted sequences of words (2-, 3-, 4-, & 5-grams) and their frequencies
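
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used to build the app. The function names (clean_text, count_ngrams), the tiny PROFANITY set, and the sample lines are all assumptions made for the example, and the 50% sampling step is left out.

```python
import re
from collections import Counter

# Hypothetical placeholder list; the slides do not say which profanity list was used.
PROFANITY = {"darn", "heck"}

def clean_text(line):
    """Lowercase a line, keep only letters, and drop profane words."""
    line = line.lower()
    # Replace numbers, punctuation, and special characters with spaces.
    line = re.sub(r"[^a-z\s]", " ", line)
    return [word for word in line.split() if word not in PROFANITY]

def count_ngrams(lines, sizes=(2, 3, 4, 5)):
    """Return {n: Counter of n-gram tuples} for each requested n-gram size."""
    counts = {n: Counter() for n in sizes}
    for line in lines:
        words = clean_text(line)
        for n in sizes:
            # Slide a window of n words across the cleaned line.
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts

# Made-up sample lines standing in for the blog, news, and Twitter text.
sample = ["I love Data Science!", "I love data, darn it.", "Data science is fun 100%"]
ngrams = count_ngrams(sample)
print(ngrams[2].most_common(3))
```

Storing each n-gram as a tuple of words keeps the frequency tables easy to query by prefix later on. A real pipeline over millions of lines would likely need a more memory-efficient structure.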

Prediction Algorithm

  • The approach applied is a Simple Backoff Algorithm
  • The user provides a text sequence, which is passed to the algorithm
  • User input is preprocessed in a similar manner to the training data; if the sequence contains more than 4 words, only the final 4 words are kept
  • The algorithm determines the length of the user input and searches for a matching n-gram
  • If a match is found, the model selects the most probable word that follows
  • If no matching n-gram exists, the algorithm “backs off” by reducing the user input to an (n-1)-gram and searching again for a match
  • If no match exists after backing off to the smallest possible n-gram, the algorithm searches for partial n-gram matches (e.g., “data xxx capstone course” and/or “data science xxx course”)
  • If no partial matches exist, the algorithm predicts the most common single words found in the data
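
The backoff logic described above can be illustrated with a short Python sketch. It is a simplified stand-in, not the app's implementation. The names build_model and predict and the toy sentences are assumptions. The sketch assumes that backing off means dropping the earliest word of the input, it only lowercases and splits the input instead of applying the full preprocessing, and it omits the partial-match step, going straight from exact backoff to the most common single words.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=5):
    """Map each context (1 to max_n - 1 words) to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    unigrams = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                model[context][words[i + n - 1]] += 1
    return model, unigrams

def predict(model, unigrams, text, top=1):
    """Return the `top` most frequent next words using simple backoff."""
    # Keep only the final 4 words of the input.
    words = text.lower().split()[-4:]
    # Try the longest context first, then drop the earliest word and retry.
    for start in range(len(words)):
        followers = model.get(tuple(words[start:]))
        if followers:
            return [word for word, _ in followers.most_common(top)]
    # No context matched at any length: fall back to the most common single words.
    return [word for word, _ in unigrams.most_common(top)]

# Toy training text (made up for this example).
sentences = [
    "data science capstone course is fun",
    "data science is fun",
    "the capstone course is hard",
    "science is fun",
]
model, unigrams = build_model(sentences)

print(predict(model, unigrams, "I enjoy data science capstone course"))  # 4-word context matches
print(predict(model, unigrams, "unknown words science is"))              # backs off to a 2-word context
print(predict(model, unigrams, "zebra"))                                 # falls back to single words
```

Picking the follower with the highest raw count within the matched context is equivalent to picking the highest conditional probability for that context, so no explicit probabilities are computed here.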

Shiny App

  • Two apps to choose from:
    • Full scale model: App 1
    • Reduced scale model (reduces the chance of the frozen gray screen issue on the shiny.io server): App 2
  • Word Prediction Tab: Enter a phrase in the left-hand panel textbox and click “Submit Sentence”. The most probable word to follow your input appears in blue, along with the next five most probable words.
  • Word Cloud Tab: The top 50 predicted words are displayed as a word cloud