Word Predictor

Coursera Data Science Capstone

Application WordPredictor is trying to guess next word while you are typing into input box.
The application was developed as a capstone project of Coursera Data Science Specialization led by Johns Hopkins University.
The data used for the project was taken from Coursera project site.
The list of bad-words used for data cleaning was downloaded from Carnegie Mellon University.

The application uses Stupid Backoff algorithm.
Based on the last four words in input it searches in stored five-words combinations (5-grams) and finds the most probable five words as output candidate with their probability taken as a score.
Next it searches the last three input words in stored four-words combinations (4-grams) and finds the most probable five words with their probability. The probability is multiplied by constant \(0.4\) and taken as a score of the next output candidates.
The same is done for the last two words and the last word with multiplication constant \(0.4^2\) and \(0.4^3\).
At the end the algorithm returns five words from output candidates with the highest scores.

The application data were cleaned in a few steps:

On the input only letters and apostrophe was considered for the next processing.
The input text was split into words and only the words which cover 95% of the text were considered. The words with the lowest probability which cover together 5% of the input text and bad-words were substituted by a special token.
The combinations of 2-5 words (n-grams) were created and the probability was computed for all the word combinations from the cleaned words.
And all those word combinations were again cleaned - the combinations with the lowest probability were removed to reduce file size.

The application reads text, removes numbers and non ASCII characters and parse input into words.
Words are compared with internal vocabulary (1-gram) then if an input word does not match with the vocabulary, the application tries to guess the word. If the guess is not successful then the special token is used instead of the input word.
The new prediction is chosen by the algorithm based on the last four words of the text.
If the last input word does not match the vocabulary then the guess of the word is added to output candidates.
The output can be extended by the most probable words if the number of output candidates is not sufficient.

The application was tested on 100,000 randomly chosen chunks of text from source files.
The success rate of the application was 18.11%.
Link to the application.
Simply type your sentence in the input box and the application will propose five most probable words. Enjoy!
The source code can be found in github.