Richard A. Lent
2 April 2019
Coursera Data Science Capstone Project
N-grams are sequences of words in a body of text, or corpus. The “n” in n-gram refers to the number of words in the n-gram. Thus a 2-gram is a sequence of two words, a 3-gram is a sequence of three words, and so on. The n-gram predictive text model reduces the complexity of natural language to the most recent few words typed by a user, and uses those words to predict the next word in the sequence (see n-gram).
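As a rough illustration of the definition above, the following Python sketch splits a sentence into n-grams and picks out the last few words a user typed. It is not the app's own code (the app was developed in R). The function names `ngrams` and `last_words` are hypothetical, and lowercasing plus whitespace splitting is an assumed, simplified form of tokenization.

```python
def ngrams(text, n):
    """Return every n-gram in `text` as a tuple of n consecutive words.

    Assumption: words are separated by whitespace and case is ignored.
    """
    words = text.lower().split()
    # Slide a window of width n across the word list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def last_words(text, k):
    """Return the k most recent words of `text` (k >= 1) as a tuple."""
    words = text.lower().split()
    return tuple(words[-k:])


if __name__ == "__main__":
    sentence = "The quick brown fox"
    print(ngrams(sentence, 2))      # all 2-grams in the sentence
    print(ngrams(sentence, 3))      # all 3-grams in the sentence
    print(last_words(sentence, 2))  # the two most recent words
```

A text shorter than n words simply yields no n-grams, so the first function returns an empty list in that case.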
After obtaining input from the user, the app searches an internal data table for a matching n-gram. The data table contains over five million n-grams produced from a text corpus of blog postings, Twitter posts, and news feeds. If a match is found, the corresponding predicted word is returned and the backoff history is displayed. If no match is found, the first word is removed from the n-gram and the search is repeated (the backoff model). This continues until a match is found or the n-gram has been reduced to a single word that still has no match. In that case, the most common word in the English language, “the,” is returned as the “best guess.”
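The backoff search described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the app's actual implementation (which was written in R): the function `predict_next`, the constant `DEFAULT_WORD`, and the three-entry `toy_table` are all hypothetical, and the lookup table is assumed to map a tuple of preceding words to a single predicted word.

```python
DEFAULT_WORD = "the"  # fallback "best guess" when nothing matches


def predict_next(words, table, default=DEFAULT_WORD):
    """Predict the next word with a simple backoff search.

    `words` is the list of words typed so far; `table` maps a tuple of
    words to a predicted next word (an assumed layout for illustration).
    Returns (prediction, history), where history lists every n-gram
    that was searched, longest first.
    """
    context = tuple(w.lower() for w in words)
    history = []
    while context:
        history.append(context)
        if context in table:
            return table[context], history
        # No match: drop the first word and search again.
        context = context[1:]
    # Even the single remaining word had no match.
    return default, history


if __name__ == "__main__":
    # Toy stand-in for the real table of several million n-grams.
    toy_table = {
        ("thanks", "for", "the"): "follow",
        ("for", "the"): "first",
        ("new",): "york",
    }
    print(predict_next(["thanks", "for", "the"], toy_table))
    print(predict_next(["waiting", "for", "the"], toy_table))
    print(predict_next(["purple", "zebra"], toy_table))
```

Returning the list of searched n-grams alongside the prediction mirrors the backoff history that the app displays. A plain dictionary keeps each search to a single hash lookup, so the cost per prediction grows with the number of words typed rather than with the size of the table.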
For more information on the development of this web app, including R code for data processing and implementation of the word prediction model, see Milestone Report: Predictive Text Modeling Using N-grams.