Word Predictor
Ram Ravichandran
June 12, 2018
Project Goal
- Given three input words in English, predict the next word
- Create a Shiny Web Application for prediction
Background
- Analyze a large text corpus of US news, blogs, and Twitter feeds
- Exploratory Data Analysis
- Evaluate Prediction Approaches
- Select Katz Back-off N-gram Algorithm
- requires minimal computational and memory resources
Approach
- Sampled 5% of the data to reduce memory usage
- Cleaned the data
- e.g., remove special characters, offensive words
- Created n-grams and stored the data for quick retrieval by the Shiny app
- Implemented the Shiny app with UI and Prediction Component
- Inputs in the Shiny App go through the above cleaning process
- The Shiny app loads the pre-processed n-gram data
- boosts performance and reduces memory demands
- keeps the app responsive to user input
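The cleaning and n-gram creation steps above can be sketched as follows. This is a minimal illustration in Python, not the app's actual code (the app itself is built in R with Shiny). The function names `clean_text` and `build_ngram_counts` and the `OFFENSIVE_WORDS` placeholder list are assumptions made for this example, and the 5% sampling step is omitted.

```python
import re
from collections import Counter

# Hypothetical placeholder; the deck does not say which word list was used.
OFFENSIVE_WORDS = {"darn"}

def clean_text(text):
    """Lowercase, drop special characters, and remove listed offensive words."""
    text = text.lower()
    # Replace anything that is not a letter, apostrophe, or space with a space.
    text = re.sub(r"[^a-z' ]+", " ", text)
    return [w for w in text.split() if w not in OFFENSIVE_WORDS]

def build_ngram_counts(lines, max_n=4):
    """Return {n: Counter of n-word tuples -> frequency} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        words = clean_text(line)
        for n in range(1, max_n + 1):
            # Slide a window of n adjacent words across the cleaned line.
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts

counts = build_ngram_counts(["The cat sat.", "The cat ran!"])
print(counts[2][("the", "cat")])  # frequency of the bigram "the cat"
```

Running user input through the same `clean_text` function before lookup keeps the input consistent with the stored n-grams, which is the point of the shared cleaning step.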
Algorithm
- Given i input words, n = i + 1 (e.g., for i = 3, n = 4)
- N-gram: a sequence of n adjacent words found in the corpus
- Create n-grams from sample data (n = 4, 3, 2, 1)
- Employ the Katz Back-off Algorithm for n-grams to predict the next word:
1. Find the best match for the input in the n-grams (highest frequency).
2. If a match is found, return the predicted word (the last word of the matching n-gram). If not, remove the first word from the input and find the best match in the (n-1)-grams.
3. Repeat Step 2 until the unigrams are reached. If no match is found, return the most frequent unigram.
Instructions
Shiny Web App
Enter three input words in the input box and press the submit button. The predicted word will be displayed. 