JHU Capstone Project

Stephen O'Connell
4/22/2015

Algorithm - Model Construction

  • Created an index of all words in all three corpuses.
  • Pre-processed the data by removing profanity, punctuation, numbers, and extra white space
  • Created 3-word nGrams from the sampled data and counted their frequency of occurrence
  • Checked every word in the nGrams for misspellings and removed any nGram containing a misspelled word
  • Using the index of all words, converted nGram words to indexed values, e.g. 'could' = 99.
  • Loaded the index and the indexed nGrams, with frequency counts, into data.table and set keys on the table
  • Created a compressed Rdata file with the index and nGram model
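
The construction steps above can be illustrated with a minimal sketch. It is written in Python rather than the R/data.table code the application uses, and it is not the author's implementation. The names PROFANITY, DICTIONARY, preprocess, and build_model are hypothetical, the word lists are tiny placeholders, and building the index after cleaning is an assumption. Writing the compressed file is omitted.

```python
import re
from collections import Counter

# Hypothetical placeholders: the real profanity list and spelling
# dictionary are not given in the source.
PROFANITY = {"darn"}
DICTIONARY = {"i", "could", "not", "agree", "more", "be", "happier"}

def preprocess(text):
    """Lowercase, replace punctuation and digits with spaces, drop profanity."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [w for w in text.split() if w not in PROFANITY]

def build_model(lines):
    # Index every word seen in the cleaned corpus: word -> integer id.
    tokenized = [preprocess(line) for line in lines]
    vocab = sorted({w for words in tokenized for w in words})
    index = {w: i for i, w in enumerate(vocab, start=1)}

    # Count 3-word nGrams, skipping any that contain a misspelled word
    # (here: a word missing from DICTIONARY), and store them as word ids.
    counts = Counter()
    for words in tokenized:
        for gram in zip(words, words[1:], words[2:]):
            if all(w in DICTIONARY for w in gram):
                counts[tuple(index[w] for w in gram)] += 1

    # Key the model on the first two word ids, roughly analogous to
    # setting keys on a table: (id1, id2) -> {id3: frequency}.
    model = {}
    for (a, b, c), n in counts.items():
        model.setdefault((a, b), {})[c] = n
    return index, model
```

Calling build_model on a list of text lines returns the word index and a lookup from each pair of leading word ids to the frequency of every word id that followed it.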

Algorithm - Prediction

  • Input text is pre-processed by removing profanity, punctuation, numbers, and extra white space
  • The last two words in the phrase are converted to their indexed values
  • The indexed values are used as keys into the nGram model, returning all nGrams that start with the keyed words
  • Result set is sorted by frequency in descending order
  • Indexed values for predictions are converted back to words
  • Top 4 words are returned to the UI
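
The prediction steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the application's R code. INDEX, WORDS, MODEL, PROFANITY, and predict are hypothetical names, the toy model holds made-up counts, and checking only the two key words against the index is an assumption. The 3-word minimum mirrors the limit the application states for input phrases.

```python
import re

# Hypothetical toy model: a word index plus nGram counts keyed on the
# ids of the first two words. The counts are made up for illustration.
INDEX = {"i": 1, "could": 2, "not": 3, "be": 4, "have": 5, "see": 6, "go": 7}
WORDS = {i: w for w, i in INDEX.items()}
MODEL = {(1, 2): {3: 12, 4: 7, 5: 5, 6: 2, 7: 1}}
PROFANITY = set()  # placeholder; the real list is not given in the source

def predict(phrase, top_n=4, min_words=3):
    # Clean the input the same way the training text was cleaned.
    text = re.sub(r"[^a-z\s]", " ", phrase.lower())
    words = [w for w in text.split() if w not in PROFANITY]
    if len(words) < min_words:
        raise ValueError("phrase is too short")

    # Convert the last two words to their indexed values.
    last_two = words[-2:]
    unknown = [w for w in last_two if w not in INDEX]
    if unknown:
        raise KeyError("not in the word index: " + ", ".join(unknown))
    key = (INDEX[last_two[0]], INDEX[last_two[1]])

    # Look up all nGrams starting with the key, sort by frequency in
    # descending order, and convert the top ids back to words.
    candidates = MODEL.get(key, {})
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return [WORDS[i] for i, _ in ranked[:top_n]]
```

With this toy model, predict("Well, I could") returns ['not', 'be', 'have', 'see'].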

Application - Usage

  • Application is located at http://saoconnell.shinyapps.io/jhu_dss/
  • After the model is loaded a Ready.To.Go.Message will appear below the phrase
  • Clear the text field and start typing or paste a phrase into the text input box
  • Input is continuously evaluated
  • Pause briefly after completing a word for a prediction
  • In tests, a prediction takes approximately 800 ms

Application - Usage (continued)

  • An error will appear if a word is misspelled; the application does not predict for misspelled words.
  • Only correctly spelled words that occur in the sampled corpus are valid, so a correctly spelled word is still rejected if it was not in the corpus.
  • An error will occur if the phrase is too short; it needs at least 3 words
  • Have fun!!