This report outlines the methodology for building a word prediction application using Natural Language Processing (NLP) techniques as part of the Coursera Data Science Specialization.
The essence of the Capstone project is to create an application that uses NLP techniques and predictive analytics and that, like SwiftKey’s applications, takes in a phrase and returns the predicted next word.
The project is developed in partnership with SwiftKey, a company well known for its predictive text analytics. As of March 1, 2016, SwiftKey became part of the Microsoft family of products. SwiftKey applications are used on Android and iOS, where they use NLP techniques to anticipate and offer next-word choices while the user types. Microsoft’s Word Flow Technology is another example of NLP in action.
Additionally, the work presented in this project follows the tenets of reproducible research and all code is available in an open-source repository to enable readers to review the approach, reproduce the results, and collaborate to enhance the model.
The milestone report outlines the initial approach to building a series of ngram models from a range of text documents.
The application was developed in R using a number of packages and the Shiny web framework.
The sections below outline the methodology used to build, predict with, and evaluate the application.
A prediction function then takes a sentence as input and executes the steps shown in the following figure.
[Figure: steps executed by the prediction function]
With a data table containing the ngram model, sentence, frequency and predicted word, the top 3 most probable words are predicted using a Stupid Backoff smoothing strategy.
A pseudocode description of how the ‘score’ for each word is calculated follows:
if the row's ngram model is 5
    score = matched 5-gram count / input 4-gram count
else if the row's ngram model is 4
    score = 0.4 * matched 4-gram count / input 3-gram count
else if the row's ngram model is 3
    score = 0.4 * 0.4 * matched 3-gram count / input 2-gram count
else if the row's ngram model is 2
    score = 0.4 * 0.4 * 0.4 * matched 2-gram count / input 1-gram count
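The scoring rule in the pseudocode above can be written as a single formula: the discount is 0.4 raised to the number of steps backed off from the 5-gram model. The following is a minimal sketch in Python for illustration only (the application itself is written in R); the function name `stupid_backoff_score` and its parameters are hypothetical and not taken from the project's code.

```python
# Illustrative sketch only; names are hypothetical, not the author's R code.
BACKOFF = 0.4    # discount applied once per step down from the top model
MAX_ORDER = 5    # highest ngram order used in the pseudocode


def stupid_backoff_score(order: int, matched_count: int, context_count: int) -> float:
    """Score one candidate row.

    order         -- ngram model the row was matched in (2 to 5)
    matched_count -- count of the matched ngram (context + predicted word)
    context_count -- count of the input (order - 1)-gram used as context
    """
    if not 2 <= order <= MAX_ORDER:
        raise ValueError("order must be between 2 and 5")
    if context_count <= 0:
        raise ValueError("context_count must be positive")
    # 5-gram: 0.4**0 = 1, 4-gram: 0.4**1, 3-gram: 0.4**2, 2-gram: 0.4**3
    discount = BACKOFF ** (MAX_ORDER - order)
    return discount * matched_count / context_count
```

For instance, a row matched in the 4-gram model with a matched count of 5 and a context count of 10 would be scored as 0.4 * 5 / 10.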
Finally, rows that predict the same word are grouped and their scores are summed.
For example, if the predicted word ‘you’ was found in ngram 4 (and thus also in ngram 3 and 2), the rows may look like this:
| ngram | predicted | score |
|---|---|---|
| 4 | you | 0.2 |
| 3 | you | 0.1 |
| 2 | you | 0.05 |
The total score for the predicted word ‘you’ is 0.2 + 0.1 + 0.05 = 0.35.
Once the scores have been summed for each word, the 3 words with the highest totals are selected.
When no results are found, the 3 most common words in the English language (‘the’, ‘be’, ‘to’) are returned instead.
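The grouping, ranking, and fallback steps described above can be sketched as follows. This is an illustrative Python sketch, not the project's R implementation: the function name `top_predictions`, the `(word, score)` row format, and the alphabetical tie-break are assumptions. When fewer than 3 distinct words are matched, this sketch simply returns the ones it has.

```python
from collections import defaultdict

# Default response described in the text when nothing is matched.
FALLBACK_WORDS = ["the", "be", "to"]


def top_predictions(rows, k=3):
    """Sum scores per predicted word and return the k best words.

    rows -- iterable of (predicted_word, score) pairs, one per matched row
    """
    totals = defaultdict(float)
    for word, score in rows:
        totals[word] += score          # group and sum identical words
    if not totals:
        return FALLBACK_WORDS[:k]      # no matches: fall back to common words
    # Highest total first; ties broken alphabetically (an assumption).
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# The 'you' rows from the table, plus one hypothetical competing word.
rows = [("you", 0.2), ("you", 0.1), ("you", 0.05), ("are", 0.3)]
print(top_predictions(rows))   # ['you', 'are']
print(top_predictions([]))     # ['the', 'be', 'to']
```

Here ‘you’ ranks first because its summed score of 0.35 exceeds the single 0.3 score of the hypothetical competing word ‘are’.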
The prediction model was evaluated using the Benchmark.R tool (see references for source).
Initial prediction accuracy was quite high, but predictions were slow. Restricting the search to the 1-3 ngram models cut the prediction time in half and reduced accuracy by only 10%.
To use the application, navigate to the following URL:
https://chrismckelt.shinyapps.io/datascience-capstone/
Then start typing text into the input box.
For access to the code please contact the author using one of the contact links on the site.