Next Word Prediction
Leo Timmermans
September 4th, 2017
Presentation for the Coursera Data Science Specialisation Capstone Project
Algorithm for predicting the next word:
NGram model with the stupid backoff method,
deployed in a Shiny web application
Go to the app
The model - part 1
Inputs:
Steps taken:
- Clean the text files.
- Extract NGrams.
- Create frequency counts of NGrams.
- Add NGrams from the Corpus of Contemporary American English.
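The steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code behind the app: the function names (`clean`, `extract_ngrams`, `count_ngrams`), the cleaning rule (lowercase, keep only letters and apostrophes) and the sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase the text and keep only runs of letters and apostrophes (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())

def extract_ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def count_ngrams(lines, max_n=5):
    """Build one frequency table per NGram order, from 1 up to max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        words = clean(line)
        for n in range(1, max_n + 1):
            counts[n].update(extract_ngrams(words, n))
    return counts

# Hypothetical sample input, standing in for the cleaned text files.
sample = ["I love data science.", "I love data, and I love R!"]
counts = count_ngrams(sample, max_n=3)
```

Because each table is a `Counter`, frequency counts from an additional source could be merged in with `counts[n].update(other_counts)`, which adds the counts together instead of overwriting them.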
The model - part 2
Steps in algorithm
- Clean the input string.
- Count the number of words in the cleaned string to identify the highest possible NGram.
- Use stupid backoff to find the best matching prediction. Stupid backoff starts with the highest NGram; if the frequency of the NGram is less than 5, it goes back one NGram, and backs off again if needed.
- Calculate the probability based on the highest NGram and the number of times a backoff was needed.
- Return the predicted words, sorted by probability. The maximum number of words to return is defined by the input.
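The algorithm steps above might look roughly like this in Python. It is a sketch under stated assumptions, not the app's implementation: the frequency threshold of 5 comes from the slide, but the penalty of 0.4 per backoff (a commonly used value for stupid backoff), the cleaning rule, the way the score is normalised, and the toy frequency tables are all assumptions made for the example.

```python
import re
from collections import Counter

ALPHA = 0.4     # assumed penalty applied once per backoff (not stated in the slides)
MIN_COUNT = 5   # frequency threshold from the slides

def clean(text):
    """Lowercase the text and keep only runs of letters and apostrophes (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())

def predict(text, counts, max_words=3, max_n=5):
    """Return up to max_words (word, score) pairs, best first.

    counts maps an NGram order n to a Counter of n-word tuples.
    """
    words = clean(text)
    scores = {}
    backoffs = 0
    # Start at the highest order the input allows, then back off one order at a time.
    for n in range(min(len(words) + 1, max_n), 0, -1):
        context = tuple(words[len(words) - (n - 1):])
        table = counts.get(n, Counter())
        matches = {g: c for g, c in table.items() if g[:-1] == context}
        total = sum(matches.values())
        for gram, c in matches.items():
            # Keep only frequent NGrams; a word found at a higher order keeps that score.
            if c >= MIN_COUNT and gram[-1] not in scores:
                scores[gram[-1]] = (ALPHA ** backoffs) * c / total
        if len(scores) >= max_words:
            break
        backoffs += 1
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:max_words]

# Hypothetical toy frequency tables, for illustration only.
toy_counts = {
    1: Counter({("the",): 50, ("day",): 30, ("time",): 20}),
    2: Counter({("the", "day"): 6, ("the", "time"): 2, ("the", "end"): 12}),
    3: Counter({("of", "the", "day"): 7, ("of", "the", "year"): 3}),
}
result = predict("At the end of the", toy_counts, max_words=3, max_n=3)
```

In this toy run the three-word table supplies one prediction above the threshold, so the sketch backs off to two-word and then single-word counts to fill the remaining slots, multiplying the score by the penalty at each backoff.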
The model - part 3
Results in the app
- Predicted words and probabilities are returned to the app with a maximum number of words defined by the input value.
- The model returns the predictions in less than a second. The time needed to calculate the results is shown in the app.
- Results are shown in four ways: 3 boxes with the top 3 results, a bar chart, a table (including probabilities) and a word cloud.
- The highest NGram used to return results is the five-gram.
Check out the app here.