This report outlines the methodology for building a word prediction application using Natural Language Processing techniques as part of the Coursera Data Science Specialization.
The milestone report describes the initial approach to building a series of ngram models from a range of text documents.
A Stupid Backoff smoothing strategy was used to calculate a ‘score’ for each candidate word as follows:
- if the matched row came from the 5-gram model: score = matched 5-gram count / input 4-gram count
- if the matched row came from the 4-gram model: score = 0.4 * matched 4-gram count / input 3-gram count
- if the matched row came from the 3-gram model: score = 0.4 * 0.4 * matched 3-gram count / input 2-gram count
- if the matched row came from the 2-gram model: score = 0.4 * 0.4 * 0.4 * matched 2-gram count / input 1-gram count
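As a rough illustration of the rules above, the score calculation might be expressed as the small R function below. The function name, argument names, and the example counts are assumptions for illustration only; the actual implementation works against precomputed ngram frequency tables.

```r
# Minimal sketch of the Stupid Backoff score, assuming the matched ngram's
# count and the count of the input's (n-1)-gram prefix have already been
# looked up in the precomputed frequency tables (names here are illustrative).
stupid_backoff_score <- function(ngram_order, matched_count, input_count,
                                 alpha = 0.4, highest_order = 5) {
  # Apply the 0.4 back-off penalty once for each level below the highest model
  penalty <- alpha ^ (highest_order - ngram_order)
  penalty * matched_count / input_count
}

# Example: a candidate matched in the 3-gram model
stupid_backoff_score(ngram_order = 3, matched_count = 12, input_count = 120)
# 0.4 * 0.4 * 12 / 120 = 0.016
```

The table below shows example scores produced in this way for a single candidate word.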
| ngram | predicted | score |
|---|---|---|
| 4 | you | 0.2 |
| 3 | you | 0.1 |
| 2 | you | 0.05 |
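The scores are then used to rank the candidate words. A minimal sketch of that selection step, assuming the scored rows are collected into a data frame shaped like the table above:

```r
# Illustrative scored candidates, mirroring the table above
candidates <- data.frame(
  ngram     = c(4, 3, 2),
  predicted = c("you", "you", "you"),
  score     = c(0.2, 0.1, 0.05),
  stringsAsFactors = FALSE
)

# Keep the best score per candidate word, then order candidates by that score
ranked <- aggregate(score ~ predicted, data = candidates, FUN = max)
ranked <- ranked[order(-ranked$score), ]
head(ranked$predicted, 3)  # top suggestions, best first
```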
The prediction model was evaluated using the Benchmark.R tool (see references for source).
Initial prediction accuracy was quite high, but response times were slow. Restricting the search to the 1- to 3-gram models halved the search time but reduced accuracy by 10%.
To use the application, navigate to the following URL:
https://chrismckelt.shinyapps.io/datascience-capstone/
Once the application loads, start typing text to receive next-word predictions.
When no matching results are found, the three most common English words (‘the’, ‘be’, ‘to’) are returned instead.
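This fallback can be expressed as a small guard around the prediction result; the snippet below is a sketch of that behaviour, not the app's actual code.

```r
# Three most common English words, returned when the ngram lookup finds nothing
default_words <- c("the", "be", "to")

with_fallback <- function(predictions) {
  if (length(predictions) == 0) default_words else predictions
}

with_fallback(character(0))  # no matches -> "the" "be" "to"
with_fallback("you")         # matches pass through unchanged
```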
Click on the green side menu for visual display options.