Next word prediction with Natural Language Processing

Julia Hoffman
November 20, 2017

Introduction


This application is a capstone project (final assignment) for the Johns Hopkins University Data Science Specialization course and its corporate partner, SwiftKey.

  • The goal of this project was to build an interactive Shiny application in R that predicts the next word from user input.

  • The model was built on the HC Corpora data (English-language sentences collected from blogs, news, and Twitter), which feeds the algorithm that predicts the next word in a phrase.

  • The application can be found here: http://kurowska.shinyapps.io/final

Data preparation


  • An N-gram model was created from the text sources. The R packages tm and RWeka were used to clean the dataset and to create the N-gram tables.

  • The text was converted to lower case, and extra white space and special characters were removed. Profanities were not removed.

  • The dataset was split into N-grams (1- to 4-grams) with cumulative frequencies. Only the most frequent N-grams (frequency > 3) were retained in the database. This pruning speeds up the application, but it also reduces the accuracy of the algorithm.

  • The same cleaning steps are applied to the phrase entered by the user.
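
The preparation steps above can be sketched as follows. This is a minimal illustration in Python, not the project's R code (which used tm and RWeka). The function names `clean`, `ngram_counts` and `prune` are hypothetical, and the regular expression is only one assumed reading of "special characters" (here it also drops digits and punctuation other than apostrophes).

```python
import re
from collections import Counter

MIN_COUNT = 4   # "frequency > 3", as described above
MAX_ORDER = 4   # 1- to 4-grams

def clean(text):
    """Lower-case the text, drop special characters, collapse white space."""
    text = text.lower()
    # Assumption: anything other than a-z, apostrophe or space is "special".
    text = re.sub(r"[^a-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def ngram_counts(sentences, max_order=MAX_ORDER):
    """Return {order: Counter mapping word tuples to frequencies}."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for sentence in sentences:
        words = clean(sentence).split()
        for n in range(1, max_order + 1):
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts

def prune(counts, min_count=MIN_COUNT):
    """Keep only N-grams seen at least min_count times."""
    return {n: Counter({g: c for g, c in table.items() if c >= min_count})
            for n, table in counts.items()}
```

Applying `clean` to the user's phrase before lookup keeps the input consistent with the stored N-grams, which is why the same function would be reused at prediction time.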

Algorithm description

The “Stupid Backoff” algorithm looks at the highest-order N-grams that match the end of the user-entered phrase and, if needed, “backs off” with a discount to lower-order N-grams until the highest-scoring match is found. The discount weight, \( \alpha \), is heuristically set to a fixed value of 0.4 rather than being estimated from the data, which reduces the complexity of the algorithm.
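
A minimal Python sketch of this scoring scheme is shown below. It follows the usual per-word formulation of Stupid Backoff: a candidate is scored by its relative frequency at the highest order where it was seen, multiplied by \( \alpha \) once for every order that had to be skipped. It is not the application's actual code. The function name `stupid_backoff` and the `counts` structure (a dictionary mapping each order n to a table of word tuples and their frequencies) are assumptions made for illustration.

```python
ALPHA = 0.4  # fixed back-off weight, as stated above

def stupid_backoff(context, counts, max_order=4, alpha=ALPHA, top=10):
    """Score candidate next words for a list of cleaned context words.

    counts[n] maps n-word tuples to frequencies (assumed structure).
    Returns up to `top` (word, score) pairs, best first.
    """
    scores = {}
    weight = 1.0
    # Start with the longest usable history and shorten it step by step.
    for n in range(min(max_order, len(context) + 1), 1, -1):
        history = tuple(context[-(n - 1):])
        history_count = counts[n - 1].get(history, 0)
        if history_count:
            # A linear scan keeps the sketch short; a real lookup would be indexed.
            for gram, c in counts[n].items():
                # A word already scored at a higher order keeps that score.
                if gram[:-1] == history and gram[-1] not in scores:
                    scores[gram[-1]] = weight * c / history_count
        weight *= alpha  # one discount per back-off step
    # Last resort: unigram relative frequencies.
    total = sum(counts[1].values())
    for (word,), c in counts[1].items():
        scores.setdefault(word, weight * c / total)
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top]
```

For example, `stupid_backoff(["i", "love"], counts)` first looks for trigrams beginning with "i love", then bigrams beginning with "love" at weight 0.4, then unigrams at weight 0.16. The scores are not probabilities, since they are not normalized.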

The accuracy of the algorithm is about 20%. It is limited by the discarded low-frequency N-grams and by the maximum N-gram order (4-grams).

Possible improvements include:

  • fixing typos and merged words (a feature seen in the Twitter data),
  • adding smoothing for unseen N-grams (such as Good-Turing or Kneser-Ney),
  • returning the most frequent words when no match is found,
  • parallelizing the lookup process for faster performance.

The Shiny application

The next word prediction application can be found here: http://kurowska.shinyapps.io/final

  • wait for the application to load,
  • type in the phrase for which you would like to predict the next word.

The results are shown in two forms: as a table of up to ten highest-scoring words, and as a word cloud in which the highest-scoring words appear largest.