Predict Next Word

Jonathan Friedman
June 2, 2017

Predict Next Word App

App Screenshot

The Predict Next Word App uses natural language processing to predict the next word of a user's sentence or phrase, based on word combinations found in a corpus containing hundreds of thousands of texts. The app:

  • reads a phrase or sentence
  • applies an algorithm that identifies similar word sequences
  • suggests the five most likely subsequent words based on those similar occurrences

The app is simple to use and takes between 1 and 2 seconds to generate next-word predictions.

Data Sources and Processing

Data for this project comes from a corpus called HC Corpora, which was collected from publicly available sources by a web crawler. Three types of sources are included:

  • tweets
  • blog posts
  • news stories

A few data cleaning steps were taken to try to eliminate non-words and non-English words, such as removing words that contain numbers or symbols other than the letters A-Z.
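The cleaning rule above could be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the function name `clean_tokens` and the choice to split on whitespace and lowercase everything are assumptions.

```python
import re

# Assumption: a token is kept only if it consists solely of the letters A-Z
# (case-insensitive). The app's actual rules may differ.
LETTERS_ONLY = re.compile(r"[A-Za-z]+")

def clean_tokens(text):
    """Split on whitespace, lowercase, and drop tokens with digits or symbols."""
    tokens = text.lower().split()
    return [t for t in tokens if LETTERS_ONLY.fullmatch(t)]

print(clean_tokens("Call me at 5pm or email bob@example.com today"))
```

Note that a rule this strict also drops words with attached punctuation or apostrophes (for example "don't"), so a real pipeline would probably strip punctuation first.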

Because the N-Gram tables were becoming quite large, I kept only N-Grams with counts exceeding three.
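As a rough sketch of that pruning step, assuming the N-Grams are counted from a flat list of tokens (the helper names `count_ngrams` and `prune` are hypothetical):

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prune(counts, min_count=3):
    """Keep only n-grams whose count exceeds min_count."""
    return {gram: c for gram, c in counts.items() if c > min_count}

tokens = "a b a b a b a b a c".split()
bigrams = count_ngrams(tokens, 2)
print(prune(bigrams))  # ('a', 'c') occurs once and is dropped
```

Pruning rare N-Grams trades a little coverage for much smaller tables, which matters when the tables have to be loaded by an interactive app.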

Data Exploration

The app's algorithm looks for similar word sequences in a foundation of word combinations. The following were the most common word combinations (2-grams, 3-grams, and 4-grams) with stop words removed. The app itself includes stop words; they were removed only for this exploration, to give greater insight into commonly used words.
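One way this exploration might be done is sketched below. It assumes stop words are removed before the N-Grams are formed, which the text does not specify, and it uses a deliberately tiny, hypothetical stop-word list.

```python
from collections import Counter

# Hypothetical stop-word list for illustration only; a real list is much longer.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "is", "it"}

def top_ngrams_without_stop_words(tokens, n, k=10):
    """Drop stop words, then return the k most common n-grams with counts."""
    content = [t for t in tokens if t not in STOP_WORDS]
    grams = Counter(tuple(content[i:i + n]) for i in range(len(content) - n + 1))
    return grams.most_common(k)

tokens = "the new york times and the new york post".split()
print(top_ngrams_without_stop_words(tokens, 2, k=1))
```

Under this assumption, words that were separated by a stop word in the original text become adjacent, so some of the resulting N-Grams never appeared verbatim in the corpus.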

[Plots: most common 2-grams, 3-grams, and 4-grams]

Prediction Algorithm

The prediction algorithm uses a Stupid Backoff approach developed by Google researchers. The paper can be accessed at http://www.aclweb.org/anthology/D07-1090.pdf

[Figures: Stupid Backoff paper; Stupid Backoff]
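For reference, the scoring rule from the paper can be written as follows, where f(·) is the frequency of a word sequence in the corpus, N is the corpus size, and α is the backoff factor. The paper uses α = 0.4; the value used by the app is not stated here.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
  \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
  \alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}
```

The scores are relative frequencies rather than normalized probabilities, which is why the method is cheap to compute over very large N-Gram tables.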

The algorithm itself applies the logic above. For a three-word phrase, the algorithm identifies all 4-grams whose first three words are the three-word phrase, and calculates scores by dividing the 4-gram frequencies by the total number of occurrences of the three-word phrase. It does the same for 3-grams and 2-grams that match the end of the user-defined phrase, penalizing each for less precisely matching the user-defined phrase.
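The scoring described above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the app's implementation: the names `build_counts` and `predict` are hypothetical, the penalty of 0.4 per backoff step is the value suggested in the paper, and the final unigram fallback is omitted because the text only mentions 4-grams, 3-grams, and 2-grams.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor from the paper; the app's value is not stated

def build_counts(tokens, max_n=4):
    """Count all 1-grams through max_n-grams, keyed by tuple of words."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict(phrase, counts, max_n=4, top_k=5):
    """Score candidate next words with Stupid Backoff and return the top_k."""
    words = phrase.lower().split()
    scores = {}
    penalty = 1.0
    # Try the longest usable context first, then back off to shorter ones.
    for ctx_len in range(min(max_n - 1, len(words)), 0, -1):
        context = tuple(words[-ctx_len:])
        ctx_count = counts.get(context, 0)
        if ctx_count:
            # Linear scan for clarity; a real table would be indexed by context.
            for gram, c in counts.items():
                if len(gram) == ctx_len + 1 and gram[:-1] == context:
                    # Keep only the score from the longest matching context.
                    scores.setdefault(gram[-1], penalty * c / ctx_count)
        penalty *= ALPHA  # each backoff step is penalized once more
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

corpus = ("going to new york going to new york going to new jersey "
          "we love new orleans").split()
for word, score in predict("going to new", build_counts(corpus)):
    print(word, round(score, 3))
```

In this toy corpus, "york" and "jersey" are scored from the 4-gram table, while "orleans" only appears after backing off twice to the 2-gram table and is penalized accordingly.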

Using the App

Using the App could not be simpler. Enter your phrase and press the Generate next Word button. On the right-hand side, the most likely next word appears at the top, and the next four most likely words appear below it. Below are the results for the phrase “going to new”.

Users can also dig deeper into the N-Grams utilized by the algorithm by navigating to the N-Gram tab and searching the N-Gram tables.

It's a simple app to use, and it was a lot of fun to develop!