Jonathan Friedman
June 2, 2017
The Predict Next Word App uses natural language processing to predict the next word of a user's sentence or phrase, based on word combinations found in a corpus containing hundreds of thousands of texts.
The app is simple to use and takes 1 to 2 seconds to generate next-word predictions.
Data for this project is from a corpus called HC Corpora. Data came from publicly available sources via a web crawler. Three types of sources are included:
A few data cleaning steps were taken to try to eliminate non-words and non-English words, such as removing words that contain numbers or symbols other than the letters A-Z.
Because the N-Gram tables were becoming quite large, I filtered for N-Grams with counts exceeding three.
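The cleaning and filtering steps described above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual code: the function names (`clean_tokens`, `count_ngrams`, `filter_ngrams`) are hypothetical, and the sketch assumes that a dropped word simply disappears, so its neighbours become adjacent.

```python
import re
from collections import Counter


def clean_tokens(text):
    """Lowercase the text, split on whitespace, and keep only tokens
    made entirely of the letters a-z (assumed cleaning rule)."""
    tokens = text.lower().split()
    return [t for t in tokens if re.fullmatch(r"[a-z]+", t)]


def count_ngrams(token_lists, n):
    """Count every run of n consecutive tokens across all texts."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def filter_ngrams(counts, min_count=3):
    """Keep only N-Grams whose count exceeds min_count."""
    return {ngram: c for ngram, c in counts.items() if c > min_count}


# Toy corpus for illustration only.
texts = ["going to new york"] * 4 + ["going to 2nd street"]
token_lists = [clean_tokens(t) for t in texts]
bigrams = filter_ngrams(count_ngrams(token_lists, 2))
```

In this toy corpus, "2nd" is dropped because it contains a digit, and the bigram "to street" is filtered out because it occurs only once.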
The app's algorithm looks for similar word sequences in a foundation of word combinations. The following were the most common word combinations, or 2grams, 3grams, and 4grams, with stop words removed. The app itself included stop words, but they were deleted for exploratory purposes to obtain greater insight into commonly used words.
The prediction algorithm uses a Stupid Backoff approach developed by Google researchers. The paper can be accessed at http://www.aclweb.org/anthology/D07-1090.pdf
The algorithm itself applies the logic above. For a three-word phrase, the algorithm identifies all 4-grams whose first three words match the phrase and calculates scores by dividing each 4-gram's frequency by the total number of occurrences of the three-word phrase. It does the same for 3-grams and 2-grams that match the end of the user-defined phrase, penalizing each for matching less of the phrase.
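The scoring logic can be sketched as below. This is an illustrative Python version under stated assumptions, not the app's implementation: the function name `stupid_backoff` and the single `ngram_counts` dictionary are hypothetical, and the penalty of 0.4 per backoff step is the value suggested in the Stupid Backoff paper, since the value used by the app is not stated here.

```python
ALPHA = 0.4  # backoff penalty suggested in the paper; the app's value is an assumption


def stupid_backoff(phrase, ngram_counts, top_k=5, alpha=ALPHA):
    """Rank candidate next words for `phrase`.

    ngram_counts maps tuples of words (1 to 4 words long) to their counts.
    """
    words = phrase.lower().split()[-3:]  # use at most the last three words
    scores = {}
    penalty = 1.0
    # Start with the longest context, then back off to shorter ones.
    for start in range(len(words)):
        context = tuple(words[start:])
        context_count = ngram_counts.get(context, 0)
        if context_count > 0:
            # Linear scan for clarity; a real table would be indexed by context.
            for ngram, count in ngram_counts.items():
                if len(ngram) == len(context) + 1 and ngram[:-1] == context:
                    candidate = ngram[-1]
                    # Keep the score from the longest matching context only.
                    if candidate not in scores:
                        scores[candidate] = penalty * count / context_count
        penalty *= alpha  # each backoff step is penalized again
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Made-up counts for illustration only.
counts = {
    ("going", "to", "new", "york"): 6,
    ("going", "to", "new", "places"): 2,
    ("going", "to", "new"): 8,
    ("to", "new", "york"): 10,
    ("to", "new", "heights"): 5,
    ("to", "new"): 20,
    ("new", "york"): 50,
    ("new", "car"): 10,
    ("new",): 100,
}
result = stupid_backoff("going to new", counts)
```

With these made-up counts, "york" ranks first because it completes a 4-gram, while "heights" and "car" are found only after backing off to shorter contexts and are penalized accordingly.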
Using the app could not be simpler. Enter your phrase and press the Generate next Word button. On the right-hand side, the predicted next word appears at the top, and the next four most likely words appear below it. Below are the results for the phrase “going to new”.
Users can also dig deeper into the N-Grams utilized by the algorithm by navigating to the N-Gram tab and searching the N-Gram tables.
It's a simple app to use, and it was a lot of fun to develop!