Jeff Register
6 June 2021
Johns Hopkins University Data Science Specialization
Capstone Project
The objective of the Johns Hopkins University Data Science capstone project is to create a “next word” predictor, similar (but not identical) to how a mobile phone or Google search “suggests” the next word when you are typing.
The Next Word Predictor (link) is a web-based application which an English-speaking user can interact with the prediction model.
The application has 2 key requirements for the Next Word Predictor to be considered a success:
reasonably accurate (i.e., the suggested next word are rationally expected for an English-speaking person)
responds quickly so that the user is not waiting for a response from the application
Using the Next Word Predictor is easy.
Simply start typing in the text box
App suggests up to 3 words which to continue the user's thought.
Suggested words appear below the text box and are in bold blue. The phrase which the user has typed appears before the suggested word
In addition, a frequency percentage appears next to each suggestion which shows percentage of time that the suggested next word appears in this context.
To build the predicting model, extracts of three (3) English text sources provided by JHU and Swiftkey are used as the representative “normal” English language/speech.
These raw text sources are input into the R natural language processing (NLP) quanteda package to “normalize” the text for analysis.
N-grams represent sequences of tokens (words) which occur within text extracts. 2-gram, 3-gram and 4-gram sequences have been generated.
A simple Markov model is the prediction engine (see figure).
The last n-1 words are taken from the user input string and used to match with the search terms.
1. If a match is found, the highest Kneser-Ney estimates are selected and returned.
2. If no match, then the number of words from user input string is reduced by 1 and the search is repeated.