6/11/2018
Introduction
- Goal: Create a Shiny app that predicts the next word, given a sequence of up to three words
- Desirable Features of the App:
- Highly stable
- Minimal resource usage - use a small sample of the corpus
- Usable on mobile phones and tablets
Data Cleaning
- Randomly selected 4% of each data set (blogs, news, Twitter)
- Combined the three samples and randomly shuffled them together
- Processed the combined sample to remove unwanted characters
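The sampling and cleaning steps above can be sketched as follows. This is a minimal illustration in Python, not the app's own code: the function names, the fixed seed, and the choice of which characters count as "unwanted" (everything except letters, apostrophes, and spaces) are all assumptions.

```python
import random
import re

def sample_lines(lines, fraction, rng):
    """Keep a random subset of lines, roughly `fraction` of the input."""
    k = round(len(lines) * fraction)
    return rng.sample(lines, k)

def clean_line(text):
    """Lowercase the text and keep only letters, apostrophes, and single spaces.

    Which characters are 'unwanted' is an assumption made for this sketch.
    """
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def build_sample(sources, fraction=0.04, seed=42):
    """Sample each source, combine the samples, shuffle, then clean every line."""
    rng = random.Random(seed)  # fixed seed (assumed) so the sample is reproducible
    combined = []
    for lines in sources.values():
        combined.extend(sample_lines(lines, fraction, rng))
    rng.shuffle(combined)
    cleaned = (clean_line(line) for line in combined)
    return [line for line in cleaned if line]  # drop lines left empty by cleaning
```

Here `sources` would be a dictionary such as `{"blogs": [...], "news": [...], "twitter": [...]}`, with one list of raw text lines per data set.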
Prediction Algorithm
- Simple search method (back-off)
- Uses a database of n-gram frequencies for bigrams, trigrams, and quadgrams (4-grams)
- Given a piece of text, uses 'grep' to search the database for entries that match the input words
- Tries the quadgram search first; if it finds no match, backs off to trigrams
- If the trigram search fails, tries bigrams
- If the bigram search also fails, returns 'the', the most common English word
- The algorithm always returns the candidate word with the highest frequency
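The back-off search above can be sketched as follows. This is an illustration in Python under stated assumptions, not the app's implementation: it replaces the 'grep' search with dictionary lookups on word prefixes, and the names `build_tables` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=4):
    """Map each (n-1)-word prefix to a Counter of following words, for n = 2..max_n."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        words = sentence.split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                prefix = tuple(words[i:i + n - 1])
                tables[n][prefix][words[i + n - 1]] += 1
    return tables

def predict_next(words, tables, default="the"):
    """Back off from quadgrams to trigrams to bigrams, then to the default word."""
    for n in sorted(tables, reverse=True):  # longest n-gram first
        if len(words) < n - 1:
            continue  # not enough input words for this n-gram size
        prefix = tuple(words[-(n - 1):])
        followers = tables[n].get(prefix)
        if followers:
            # Return the most frequent word seen after this prefix.
            return followers.most_common(1)[0][0]
    return default
```

For example, after `tables = build_tables(cleaned_lines)`, calling `predict_next(["thanks", "for", "the"], tables)` first looks for a quadgram starting with those three words, then a trigram starting with "for the", then a bigram starting with "the", and falls back to 'the' if nothing matches.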
App Description
- Simple Interface
- Enter 1-3 words into the text box
- Press 'Submit'
- App will suggest next word in the sequence
- App cleans the input: strips extra spaces (including leading/trailing), punctuation, and numbers, and normalizes capitalization
- Uses a streamlined database of n-gram frequencies
- Good stability, thoroughly tested
- Tested on mobile phones and tablets
- Try the app here
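The input cleaning described in this section can be sketched as follows. This is a Python illustration, not the app's code: the function name `prepare_input` is hypothetical, lowercasing is assumed as the way capitalization is normalized, and keeping only the last three words when more are entered is also an assumption.

```python
import re

def prepare_input(raw, max_words=3):
    """Normalize user input and return the words used for the n-gram lookup."""
    text = raw.lower()                      # normalize capitalization (assumed: lowercase)
    text = re.sub(r"[^a-z' ]+", " ", text)  # drop punctuation and numbers
    words = text.split()                    # split() also discards extra spaces
    return words[-max_words:]               # keep at most the last `max_words` words
```

With this sketch, an input such as `"  Thanks, for THE 2 "` is reduced to the three words "thanks", "for", "the" before the prediction step runs.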