Kishore Mamidi
June 16, 2017
An application to predict next word
The goal of this application is to build a model, using natural language processing tools, that predicts likely next words given the words already typed in a sentence.
There are many uses for this application including
This application was developed as part of the capstone project for the Data Science Specialization offered by Johns Hopkins University via Coursera. To learn more about the project, visit the course page.
This app uses sample data from blogs, news articles, and tweets that were downloaded from the course repository.
The following cleaning operations were performed on the raw data:
Once the data was cleaned, n-gram frequency tables (n = 1 to 5) were generated with the tidytext package. Because term frequencies follow Zipf’s law, a large share of distinct n-grams occur only once, so n-grams with a frequency of one were pruned to reduce data size and improve performance.
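The counting and pruning step can be illustrated with a minimal sketch. The app itself builds its tables in R with tidytext; the Python below is only an illustration under assumptions: the function names `ngram_counts` and `prune_singletons`, the whitespace tokenizer, and the three-sentence toy corpus are all hypothetical and do not come from the app.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams of order n over lowercased, whitespace-split sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def prune_singletons(counts):
    """Drop every n-gram that was seen exactly once."""
    return Counter({gram: c for gram, c in counts.items() if c > 1})

# Toy corpus (hypothetical); the real app uses blogs, news articles, and tweets.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "a dog sat on the mat",
]

# One pruned frequency table per order, n = 1 to 5.
tables = {n: prune_singletons(ngram_counts(corpus, n)) for n in range(1, 6)}
```

On this toy corpus, the bigram ("sat", "on") survives with a count of 3, while ("the", "sofa") is seen once and is pruned.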
This application uses a 5-gram probabilistic model and applies the Stupid Backoff algorithm to rank next-word candidates.
The Stupid Backoff algorithm can be summarized as follows:
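For reference, the scoring rule in the original Stupid Backoff formulation (Brants et al., 2007) can be written as below. Here f(·) is an n-gram's count in the frequency tables, and α is a fixed backoff factor, set to 0.4 in that paper. Whether this app uses the same α value is an assumption, not something stated here.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
\alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
```

The recursion ends at unigrams, where the score is the word's relative frequency, S(w_i) = f(w_i) / N, with N the total number of words in the training data.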
The Stupid Backoff implementation in this app starts with up to the last four words typed and tries to find 5-grams that begin with those four words. If fewer than the maximum number of predictions are found, the algorithm proceeds to match the last three words against the 4-gram table, and so on, until it has found the defined number of results to return.
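The lookup described above can be sketched roughly as follows. This is not the app's code (the app is written in R); it is a simplified Python illustration that fills the result list from the highest available order downward, without applying a backoff weight. The function name `predict_next`, its parameters, the fallback all the way down to unigrams, and the toy tables are all assumptions made for the example.

```python
from collections import Counter

def predict_next(tables, text, max_results=3, max_order=5):
    """Return up to max_results candidate next words, highest n-gram order first.

    tables maps an order n to a Counter of n-gram tuples (hypothetical layout).
    """
    tokens = text.lower().split()
    results = []
    for order in range(max_order, 0, -1):
        context_len = order - 1
        if context_len > len(tokens):
            continue  # not enough typed words for this order
        context = tuple(tokens[len(tokens) - context_len:])
        table = tables.get(order, {})
        # Keep n-grams whose first n-1 words equal the typed context.
        matches = [(gram[-1], count) for gram, count in table.items()
                   if gram[:-1] == context]
        # Most frequent first; ties broken alphabetically for determinism.
        matches.sort(key=lambda pair: (-pair[1], pair[0]))
        for word, _ in matches:
            if word not in results:
                results.append(word)
                if len(results) == max_results:
                    return results
    return results

# Toy frequency tables (hypothetical counts).
tables = {
    3: Counter({("thanks", "for", "the"): 4, ("thanks", "for", "your"): 2}),
    2: Counter({("for", "the"): 9, ("for", "a"): 5, ("for", "your"): 3}),
    1: Counter({("the",): 50, ("a",): 30}),
}

suggestions = predict_next(tables, "many thanks for")
```

With these toy tables, the trigram table supplies "the" and "your", and the bigram table then contributes "a" to reach the three requested results.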
The word prediction application can be accessed here. To use the app
In a future iteration, the app can be further optimized by