Data Science Capstone Final Project

December 14, 2014

Objectives & Methodology

  • The goal is to develop an app that takes a phrase entered by the user and predicts the next word.
  • My solution is based on three N-Gram models: a bi-gram model, a tri-gram model, and a 4-gram model.
  • For each phrase entered by the user, we run all three models and pick the word with the highest probability across them (see the sketch at the end of this list).
  • The models were trained on a sample of the English-language HC Corpora dataset provided by the course instructors.
  • This app was developed for the Data Science Specialization offered by Johns Hopkins University via Coursera.
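
To make the selection step concrete, here is a minimal R sketch. The three candidate tables are toy placeholders (not the app's actual model output); only the pooling-and-argmax logic reflects the approach described above.

    # Toy candidate tables: each model returns (word, probability) pairs.
    bigram_hits   <- data.frame(word = c("day", "time"), prob = c(0.12, 0.08))
    trigram_hits  <- data.frame(word = c("day", "week"), prob = c(0.30, 0.05))
    fourgram_hits <- data.frame(word = "day", prob = 0.45)

    # Pool all candidates and keep the word with the highest probability.
    candidates <- rbind(bigram_hits, trigram_hits, fourgram_hits)
    candidates[which.max(candidates$prob), ]
    #   word prob
    # 5  day 0.45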

How to Use the App

  1. Go to https://carlosmirandad.shinyapps.io/NLPShinnyApp
  2. If you see a message asking you to wait, please do so while the language models load (it won't take long). The message will disappear when the app is ready to use.
  3. You have two options: You can type a phrase in the text box or select a sample phrase from the list.
  4. After you have typed or selected your phrase, press the Submit button.
  5. The app will consult the models and give you the best prediction found for the next word (in the right pane).
  6. The app will also show how the prediction was made. You'll see the candidate words with their respective counts and probabilities.

Description of the Algorithm

The basic prediction algorithm is the “N-Gram Model”:

  • An “N-Gram” is a sequence of N consecutive words.
  • An “N-Gram Model” is a probabilistic language model that assumes the Markov property: it looks at the first N-1 words of each N-Gram and estimates the conditional probability of the last word. We use three such models (N = 2, 3, and 4).
  • How does the model work? It counts how often each word follows the (N-1)-word prefix in the “training corpus” and divides that by the total occurrences of the prefix itself. The result is the conditional probability (which the app displays for you; see the sketch after this list).
  • The process is memoryless (the “Markov property”), so it ignores everything before the last N-1 words of the phrase. Although imperfect, this is efficient and adequate for this purpose.
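
Concretely, the estimate is P(word | prefix) = count(prefix followed by word) / count(prefix). The following R sketch illustrates this counting on a toy three-sentence corpus; it is an illustration of the idea, not the app's actual training code.

    # Toy bi-gram example: estimate P("cat" | "the") from word-pair counts.
    corpus <- c("the cat sat", "the cat ran", "the dog sat")
    words  <- unlist(strsplit(tolower(corpus), " "))
    # Consecutive word pairs (for simplicity, pairs spanning sentence
    # boundaries are not excluded in this toy example).
    pairs  <- paste(head(words, -1), tail(words, -1))
    sum(pairs == "the cat") / sum(words == "the")  # 2/3, about 0.67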

More Details about the Algorithm

The app processes your phrase before looking it up in the N-Gram Models (a rough sketch of these steps follows the list below):

  • The start of the phrase is marked with the token [START].
  • Numbers and times are replaced with the tokens [NUMBER] and [HOUR], since their position carries information even when their exact values do not.
  • Dashes and apostrophes that are part of words (e.g. don't or e-mail) are retained but other special characters are removed.
  • Text is converted to lower case (e.g. “The” becomes “the”).
  • N-Grams with low predictive value are removed, and other minor transformations are applied to improve accuracy.
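
The R sketch below illustrates cleanup steps of this kind. The regular expressions are rough approximations written for this example, not the app's exact rules.

    # Illustrative preprocessing, roughly following the steps above.
    clean_phrase <- function(phrase) {
      x <- tolower(phrase)                             # "The" -> "the"
      x <- gsub("[^a-z0-9:' -]", " ", x)               # drop other special characters
      x <- gsub("\\b\\d{1,2}:\\d{2}\\b", "[HOUR]", x)  # times such as 5:30
      x <- gsub("\\b[0-9]+\\b", "[NUMBER]", x)         # standalone numbers
      x <- gsub(" +", " ", trimws(x))                  # collapse repeated spaces
      paste("[START]", x)                              # mark the start of the phrase
    }

    clean_phrase("The meeting at 5:30 on March 3 won't change!")
    # "[START] the meeting at [HOUR] on march [NUMBER] won't change"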

Thank you for using my word prediction app!