This report outlines the methodology for building a word prediction application using Natural Language Processing techniques as part of the Coursera Data Science Specialization.
A Stupid Backoff smoothing strategy was used to calculate a ‘score’ for each candidate word, as follows:
if the row's n-gram model was 5
    score = matched 5-gram count / input 4-gram count
else if the row's n-gram model was 4
    score = 0.4 * matched 4-gram count / input 3-gram count
else if the row's n-gram model was 3
    score = 0.4 * 0.4 * matched 3-gram count / input 2-gram count
else if the row's n-gram model was 2
    score = 0.4 * 0.4 * 0.4 * matched 2-gram count / input 1-gram count
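The scoring rules above can be sketched in Python as follows. This is an illustrative sketch only, not the application's own code: the function name `stupid_backoff_scores`, the `counts` dictionary keyed by token tuples, and the choice to keep the highest-order score when a word matches at several orders are all assumptions made for the example.

```python
BACKOFF = 0.4  # backoff factor taken from the scoring rules above


def stupid_backoff_scores(context, counts, max_order=5):
    """Score candidate next words for `context` (a list of tokens).

    `counts` (hypothetical structure) maps n-gram tuples of length
    1..max_order to their observed frequency.
    """
    scores = {}
    # Try the longest n-gram order first, then back off to shorter ones.
    for order in range(max_order, 1, -1):
        prefix = tuple(context[-(order - 1):])
        if len(prefix) < order - 1:
            continue  # not enough typed words for this order
        prefix_count = counts.get(prefix, 0)
        if prefix_count == 0:
            continue  # the input (order - 1)-gram was never observed
        # One factor of 0.4 for every level backed off from max_order.
        penalty = BACKOFF ** (max_order - order)
        for ngram, count in counts.items():
            if len(ngram) == order and ngram[:-1] == prefix:
                # Assumption: keep the score from the highest order that matched.
                scores.setdefault(ngram[-1], penalty * count / prefix_count)
    return scores


# Toy counts, invented purely for illustration.
toy_counts = {
    ("thanks", "for", "the"): 4,
    ("thanks", "for", "the", "follow"): 3,
    ("i",): 4,
    ("i", "love"): 2,
    ("i", "like"): 1,
}
four_gram_scores = stupid_backoff_scores(["thanks", "for", "the"], toy_counts)
bigram_scores = stupid_backoff_scores(["i"], toy_counts)
```

With these toy counts, "follow" is matched by a 4-gram and so carries a single 0.4 factor, while "love" and "like" are matched only by bigrams and carry three.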
The prediction model was evaluated using the Benchmark.R tool (see references for source).
Initial prediction accuracy was quite high, but predictions were also quite slow. Restricting the model to 1- to 3-grams halved the search time but also reduced accuracy by 10%.
To use the application, navigate to the following URL
To use the application, start typing text.
When no results are found, the three most common words in the English language (‘the’, ‘be’, ‘to’) are returned as a response.
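A minimal sketch of this fallback behaviour in Python is shown here. It is illustrative only: the function name `top_predictions`, the limit of three returned words in the normal case, and the alphabetical tie-breaking are assumptions, not details confirmed by the application.

```python
DEFAULT_WORDS = ("the", "be", "to")  # fallback words named in the text above


def top_predictions(scores, n=3, defaults=DEFAULT_WORDS):
    """Return up to `n` words ranked by score, or the defaults if `scores` is empty."""
    if not scores:
        return list(defaults[:n])
    # Highest score first; ties broken alphabetically (an assumption).
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


no_match = top_predictions({})
ranked_match = top_predictions({"a": 0.1, "b": 0.5, "c": 0.5, "d": 0.01})
```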
Click on the green side menu for visual display options.