6/11/2018

Introduction

  • Goal: Create a 'shiny' app that predicts the next word, given a sequence of up to three words
  • Desirable Features of the App:
    • Highly stable
    • Minimal resource usage - use a small sample of the corpus
    • Usable on mobile phones and tablets

Data Cleaning

  • Randomly selected 4% of each data set (blogs, news, Twitter)
  • Combined and randomly shuffled the three samples together
  • Processed the combined sample to remove unwanted characters
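
The sampling and cleaning steps above can be sketched roughly as follows. This is a Python illustration only, not the code behind the app: the function names, the fixed seed, and the choice of which characters count as "unwanted" (everything except letters, apostrophes, and spaces) are assumptions made for the example.

```python
import random
import re

def sample_lines(lines, fraction=0.04, seed=0):
    """Return a random sample holding about `fraction` of the lines."""
    if not lines:
        return []
    rng = random.Random(seed)
    k = max(1, round(len(lines) * fraction))
    return rng.sample(lines, k)

def clean_line(line):
    """Lowercase the line and drop unwanted characters.

    Assumption: only letters, apostrophes, and spaces are kept.
    """
    line = line.lower()
    line = re.sub(r"[^a-z' ]+", " ", line)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", line).strip()

def build_corpus(sources, fraction=0.04, seed=0):
    """Sample each source, combine the samples, shuffle, then clean."""
    combined = []
    for lines in sources.values():
        combined.extend(sample_lines(lines, fraction, seed))
    random.Random(seed).shuffle(combined)
    cleaned = (clean_line(line) for line in combined)
    # Drop lines that are empty after cleaning.
    return [line for line in cleaned if line]
```

Here `sources` is assumed to be a dict such as `{"blogs": [...], "news": [...], "twitter": [...]}`, with one list of text lines per data set.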

Prediction Algorithm

  • Simple search method (back-off)
    • Use a database of frequencies for bigrams, trigrams, and quadgrams
    • Given the input text, use 'grep' to search the database for matching n-grams
      • Try the quadgrams first; if no match is found, fall back to trigrams
      • If the trigram search fails, try bigrams
      • If the bigram search fails, return 'the', the most common English word
    • The algorithm always returns the word with the highest frequency
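
The back-off search can be illustrated with a short Python sketch. It is not the app's implementation: where the app searches its frequency database with 'grep', this example uses in-memory dictionary lookups, and the names `build_tables` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=4):
    """For n = 2..max_n, map each (n-1)-word context to next-word counts."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        words = sentence.split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                tables[n][context][words[i + n - 1]] += 1
    return tables

def predict_next(text, tables, default="the"):
    """Back off from the longest n-gram to the shortest, then to `default`."""
    words = text.lower().split()
    for n in sorted(tables, reverse=True):  # quadgram, trigram, bigram
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough input words for this n-gram order
        candidates = tables[n].get(context)
        if candidates:
            # Return the most frequent continuation at this level.
            return candidates.most_common(1)[0][0]
    return default

tables = build_tables([
    "thanks for the help",
    "thanks for the help",
    "thanks for the follow",
    "at the end",
])
print(predict_next("thanks for the", tables))  # quadgram match
print(predict_next("see the", tables))         # falls back to the bigram
print(predict_next("zzz", tables))             # no match, default word
```

Searching the longest context first means a rarer but more specific match is preferred over a frequent but generic one; the default word is used only when every level fails.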

App Description

  • Simple Interface
    • Enter 1 to 3 words into the text box
    • Press 'Submit'
    • The app suggests the next word in the sequence
    • The app cleans the input by removing extra spaces (leading/trailing), punctuation, capitalization, and numbers
  • Uses a streamlined database of n-gram frequencies
    • Good stability, thoroughly tested
  • Tested on mobile phones and tablets
  • Try the app here
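
The input cleaning described in the interface notes above might look like the following Python sketch. The function name `normalize_input`, the decision to keep apostrophes inside words, and the truncation to the last three words are assumptions for illustration, not details taken from the app.

```python
import re

def normalize_input(text, max_words=3):
    """Lowercase the text, strip punctuation and numbers, and keep the last words."""
    text = text.lower()
    # Replace anything that is not a letter, apostrophe, or whitespace.
    text = re.sub(r"[^a-z'\s]", " ", text)
    # split() with no arguments also removes leading, trailing, and repeated spaces.
    words = text.split()
    return words[-max_words:]
```

For example, `normalize_input("  Thanks,   for THE 2 dogs!! ")` yields the list `["for", "the", "dogs"]`, which can then be passed to the n-gram lookup.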