JHU Capstone Project
Stephen O'Connell
4/22/2015
Algorithm - Model Construction
- Created an index of all words in all three corpora.
- Pre-processed the data by removing profanity, punctuation, numbers, and extra white space
- Created the 3-grams (trigrams) from the sampled data, counting their frequency of occurrence
- Evaluated all words in the nGrams for misspellings, removed any nGrams with misspellings
- Using the index of all words, converted nGram words to indexed values, e.g. 'could' = 99.
- Loaded the index and indexed nGrams, with frequency counts, into a data.table and set keys on the table
- Created a compressed Rdata file with the index and nGram model
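The construction steps above can be sketched roughly as follows. This is a minimal Python illustration, not the author's R/data.table code: the names `PROFANITY`, `preprocess`, and `build_model` are hypothetical, the profanity list is a placeholder, the index is built from the same lines as the nGrams, and the spell-check and compressed-file steps are omitted.

```python
import re
from collections import Counter

# Hypothetical placeholder; the real profanity list is not given in the slides.
PROFANITY = {"darn"}

def preprocess(text):
    """Lower-case, drop punctuation and numbers, collapse white space, remove profanity."""
    text = re.sub(r"[^a-z\s']", " ", text.lower())
    words = [w.strip("'") for w in text.split()]
    return [w for w in words if w and w not in PROFANITY]

def build_model(lines):
    """Return (word -> integer index, indexed 3-gram -> frequency count)."""
    index = {}
    counts = Counter()
    for line in lines:
        words = preprocess(line)
        # Assign each new word the next integer id; reuse the id for known words.
        ids = [index.setdefault(w, len(index) + 1) for w in words]
        # Slide a window of three ids over the line and count each 3-gram.
        for i in range(len(ids) - 2):
            counts[tuple(ids[i:i + 3])] += 1
    return index, counts

index, counts = build_model(["I could go now.", "I could go home 2 day!"])
```

Storing integer ids instead of strings keeps the nGram table compact, which is presumably the reason for the indexing step.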
Algorithm - Prediction
- Input text is pre-processed, removing profanity, punctuation, numbers, and extra white space
- The last two words in the phrase are converted to their indexed values
- The indexed values are used as keys to the nGram model returning all nGrams starting with the keyed words
- Result set is sorted by frequency in descending order
- Indexed values for predictions are converted back to words
- Top 4 words are returned to the UI
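A rough sketch of the lookup described above, in Python rather than the author's R. The function name `predict`, the toy index, and the toy counts are all hypothetical, and a linear scan over a dictionary stands in for the keyed data.table lookup. Profanity filtering is left out for brevity.

```python
import re

def predict(phrase, index, trigram_counts, top_n=4):
    """Return up to top_n next-word candidates for the last two words of phrase."""
    words = re.sub(r"[^a-z\s']", " ", phrase.lower()).split()
    if len(words) < 2:
        return []  # two words are needed to form the lookup key
    try:
        key = (index[words[-2]], index[words[-1]])
    except KeyError:
        return []  # a word that is not in the index cannot be looked up
    # Collect (count, third-word id) for every 3-gram starting with the key.
    matches = [(count, gram[2]) for gram, count in trigram_counts.items()
               if gram[:2] == key]
    matches.sort(key=lambda m: -m[0])  # most frequent first
    words_by_id = {i: w for w, i in index.items()}
    return [words_by_id[i] for _, i in matches[:top_n]]

# Toy model, invented for illustration only.
toy_index = {"i": 1, "could": 2, "go": 3, "be": 4, "not": 5, "have": 6, "see": 7}
toy_counts = {(1, 2, 3): 5, (1, 2, 4): 9, (1, 2, 5): 7,
              (1, 2, 6): 3, (1, 2, 7): 1, (2, 3, 4): 2}

suggestions = predict("Well, I could", toy_index, toy_counts)
```

With this toy data, the key for "i could" matches five 3-grams, and the four most frequent continuations are returned.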
Application - Usage
- Application is located at
http://saoconnell.shinyapps.io/jhu_dss/
- After the model is loaded, a Ready.To.Go.Message will appear below the phrase
- Clear the text field and start typing or paste a phrase into the text input box
- Input is continuously evaluated
- Pause briefly after completing a word for a prediction
- In tests, a prediction takes approximately 800 ms
Application - Usage (continued)
- An error will appear if a word is misspelled, i.e. the application won't predict for misspellings.
- Only correctly spelled words that appear in the sampled corpus are valid, i.e. a correctly spelled word is still rejected if it was not in the corpus.
- An error will occur if the phrase is too short; it needs at least 3 words
- Have fun!!