Next-Word Prediction with N-Grams

Jeremy Beck
January 2015

JHU's Coursera Data Science Specialization Capstone Project

The Task:

Small keys in mobile keyboards make typing difficult. If we can predict the next word a user wants to type, they can simply select the word, without having to type the entire thing. To prototype this functionality, I designed an application that, given user input, predicts the next word in the phrase.

The requirements for this usage suggest that the app should:

  • Be responsive
  • Run in a small memory footprint (suitable for mobile devices)

For the purpose of this project, shinyapps.io was used to deploy the app.

My Approach:

The app uses a 30% sample of Twitter posts, News articles, and Blog posts as a basis for predictions, and N-grams ranging from unigrams to trigrams. The processing pipeline and data are available from an earlier report

The prediction algorithm is an implementation of Modified Kneser-Ney Smoothing/Interpolation for trigrams.

  • Considers the possible contexts a word completes to calculate probabilities, not just raw counts
  • Uses probabilies for all trigrams, bigrams, and unigrams to come up with a probability for each word
  • Probabilities were precalculated for all N-Grams in the sample to keep the app quick
  • Trigrams and Bigrams seen less than 3 times were removed to speed up querying

Using the Application:

plot of chunk unnamed-chunk-1

  1. Type a phrase in the text input box (Keyboard Indicator Icon)
  2. Predictions are generated in real time, there is no “submit” button
  3. The next word prediction appears in the box to the right of the input
  4. The confidence level appears below the input and prediction.
  5. Check the “App Details” tab to get more information of what's going on.

Behind the Scenes:

  1. Clean the input phrase according to the processing pipeline
  2. Use the last 2 words in the phrase to search trigram list
    • If matches are found in trigram list, skip to step
  3. Search Bigram list using last word in input
    • If matches are found, skip to step 4
  4. If no matches are found in N-gram lists, sample frequent unigrams list
  5. Sort results by precalculated \( P_{KN} \) probabilities in descending order.
  6. Divide \( P_{KN} \) of top hit by cumulative \( P_{KN} \) of top 5 hits, and report as confidence level.

See the app in action Thanks for looking!