Next Word Prediction

Leo Timmermans
September 4th, 2017

Presentation for the Coursera Data Science Specialisation Capstone Project

Over 500MB in text files containing:
- over 4 million lines of text.
- over 76 million words.
The 1 million most frequent 2-, 3-, 4- and 5-grams from the Corpus of Contemporary American English.

Clean the input string.
Count the number of words in the cleaned string to identify the highest possible NGram.
Use stupid backoff to find the best matching prediction: start with the highest NGram; if the frequency of that NGram is less than 5, back off to the next lower NGram, and back off again if needed.
Calculate the probability from the highest matching NGram and the number of times a back off was needed.
Return the predicted words, sorted by probability; the maximum number of words to return is defined by the input.
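
The steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual code: the cleaning rule, the layout of the lookup tables, the function names and the 0.4 back-off discount (the value commonly used with stupid backoff) are all assumptions. Only the frequency threshold of 5 and the maximum order of 5 come from the slides. The context count is approximated by the sum of the stored continuation counts.

```python
import re

ALPHA = 0.4     # back-off discount; assumed, not stated in the slides
MIN_FREQ = 5    # frequency threshold, from the slides
MAX_ORDER = 5   # highest NGram order, from the slides


def clean(text):
    """Lowercase and keep letters, apostrophes and spaces (an assumed rule)."""
    return re.sub(r"[^a-z' ]+", " ", text.lower()).split()


def predict(text, tables, max_words=3):
    """Return up to max_words (word, score) pairs, best first.

    tables[n] maps an (n-1)-word context tuple to {next_word: count};
    this layout is hypothetical.
    """
    words = clean(text)
    # The number of cleaned words caps the highest usable NGram order.
    order = min(len(words) + 1, MAX_ORDER)
    scores = {}
    backoffs = 0
    while order >= 2:
        context = tuple(words[-(order - 1):])
        candidates = tables.get(order, {}).get(context, {})
        total = sum(candidates.values())
        for word, count in candidates.items():
            # Keep the score from the highest order that found the word.
            if count >= MIN_FREQ and word not in scores:
                scores[word] = (ALPHA ** backoffs) * count / total
        if len(scores) >= max_words:
            break
        order -= 1      # back off to the next lower NGram
        backoffs += 1   # each back off discounts the score again
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:max_words]


# Toy tables, invented for illustration only.
toy_tables = {
    3: {("thank", "you"): {"for": 10, "very": 6, "so": 2}},
    2: {("you",): {"are": 20, "for": 10, "can": 10}},
}
top = predict("Thank you!", toy_tables, max_words=3)
```

With the toy tables, "so" is dropped at the 3-gram level because its count is below 5, so the sketch backs off once to the 2-grams to fill the third slot, and the ranking comes out as "for", "very", "are".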

Predicted words and probabilities are returned to the app with a maximum number of words defined by the input value.
The model returns predictions in less than a second. The time needed to calculate the results is shown in the app.
Results are shown in 3 boxes (the top 3 results), a bar chart, a table (including probabilities) and a Word Cloud.
The highest NGram used to return results is the Five-Gram.

Check out the app here.