Get next word application Coursera JH data science capstone

Paul Barry
20 May 2017

The goal of this application

The application lets the user type in a phrase and responds with a list of suggested next words, ordered from most to least likely. It can also suggest the next two words.

  • The user types their phrase into the input box
  • The user clicks on the “Get Prediction” button
  • A list of suggested single next words appears
  • By choosing a second tab, the user can see a list of suggested next two words

How this works

The application uses an n-gram model, where an n-gram is a sequence of n words; for example, “the cow jumped” is a 3-gram. It draws on a store of 1-grams, 2-grams, 3-grams, 4-grams and 5-grams gleaned from a selection of tweets, blogs and news feeds. It tries to match the last n-1 words of the user's phrase against the first n-1 words of an n-gram, and the n-th (last) word of the matching n-gram becomes the prediction. A long phrase is first cut down to its last 4 words, which are matched against the available 5-grams. If no match is found, the phrase is cut down to its last 3 words and matched against 4-grams, and so on. If more than one n-gram matches, the next word is taken from the most frequent matching n-gram.
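The back-off lookup described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the application's actual code: the names `build_tables` and `predict_next`, the whitespace tokenisation and the in-memory dictionary layout are all assumptions made for the example, and no smoothing is applied.

```python
from collections import Counter, defaultdict

MAX_N = 5  # the text mentions 1-grams up to 5-grams


def build_tables(sentences, max_n=MAX_N):
    """Map each context of up to max_n - 1 words to a Counter of following words.

    Hypothetical helper: the empty context () holds the plain 1-gram counts.
    """
    tables = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()  # assumed tokenisation: split on whitespace
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                tables[context][words[i + n - 1]] += 1
    return tables


def predict_next(phrase, tables, max_n=MAX_N, top_k=3):
    """Try the longest usable context first, then back off one word at a time."""
    words = phrase.lower().split()
    longest = min(len(words), max_n - 1)
    for k in range(longest, -1, -1):
        context = tuple(words[len(words) - k:])  # last k words; () when k == 0
        followers = tables.get(context)
        if followers:
            # most frequent continuation first
            return [word for word, _ in followers.most_common(top_k)]
    return []
```

For instance, with tables built from a handful of sentences, `predict_next("yesterday the cow jumped over the", tables)` would first look up the context "cow jumped over the" among the 5-grams and only fall back to shorter contexts if nothing matches.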

Some technical details

  • Only a small random portion (5%) of the database is used, because building the tables of n-grams can take a very long time
  • In this so-called “Katz back-off model”, Simple Good-Turing smoothing [1] is used to shift some probability from known words to unknown words (words not encountered in the initial database)
  • Alternatives to Simple Good-Turing smoothing include Kneser-Ney smoothing

[1]: W. A. Gale, Good-Turing Smoothing without Tears, Journal of Quantitative Linguistics, 2 (1995), 217-237
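
To give a feel for how Good-Turing estimation reserves probability for unseen words, here is a minimal sketch of the basic Good-Turing adjustment, r* = (r + 1) N_{r+1} / N_r, where N_r is the number of distinct items seen exactly r times. It is not the full Simple Good-Turing procedure, which additionally smooths the N_r values (by fitting a line to log N_r against log r) before applying the formula. The function name `good_turing` and the fallback to the raw count when N_{r+1} is zero are assumptions made for this example.

```python
from collections import Counter


def good_turing(counts):
    """Return (adjusted_counts, unseen_mass) for a mapping item -> raw count.

    Hypothetical helper illustrating basic, unsmoothed Good-Turing.
    """
    total = sum(counts.values())
    freq_of_freq = Counter(counts.values())  # N_r: number of items seen r times
    adjusted = {}
    for item, r in counts.items():
        n_r = freq_of_freq[r]
        n_r_plus_1 = freq_of_freq.get(r + 1, 0)
        if n_r_plus_1:
            adjusted[item] = (r + 1) * n_r_plus_1 / n_r
        else:
            # assumption of this sketch: keep the raw count when N_{r+1} is zero
            adjusted[item] = float(r)
    # probability mass reserved for unseen items: N_1 / N
    unseen_mass = freq_of_freq.get(1, 0) / total if total else 0.0
    return adjusted, unseen_mass
```

The `unseen_mass` value is the share of probability set aside for words that never appeared in the sample, which is what lets a back-off model assign a non-zero probability to them.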

The application looks like this

[Screenshot of the application]

The application is available at https://pbarry.shinyapps.io/Predict_New/