Next Word Prediction Tool

George Pipis
2016-05-20

This presentation briefly describes the algorithm behind the "Next Word Prediction Tool".

The data were obtained from here. The zip file contains data for four languages, but this project uses only the English data. The English data consist of three txt files with text from blogs, news and Twitter. Because the algorithm must return results quickly, I took a small sample of around 50k lines in total as the training dataset.

Development of the Predictive Algorithm

Step 1 Clean the sample by removing special characters and punctuation and by converting the text to lower case

Step 2 Create the Unigrams, Bigrams, Trigrams and Four-grams

Step 3 Apply a simple Katz back-off algorithm, which is based on n-grams

Step 4 Return the predicted next word, along with a table of other probable words and their estimated probabilities
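
Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration in Python, not the code used for the tool: the function names `clean` and `ngram_counts`, the choice to keep apostrophes, and the two sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

def clean(text):
    """Lower-case the text and split it into words.

    Assumption: only letters, apostrophes and spaces are kept;
    digits, punctuation and other special characters become spaces.
    """
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return text.split()

def ngram_counts(lines, n):
    """Count every n-gram (stored as a tuple of words) in the cleaned lines."""
    counts = Counter()
    for line in lines:
        words = clean(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical two-line sample standing in for the ~50k-line training sample.
sample = ["I love New York!", "I love new shoes, really."]

# One frequency table per n-gram order: 1 = unigrams ... 4 = four-grams.
tables = {n: ngram_counts(sample, n) for n in range(1, 5)}
```

Storing each n-gram as a tuple keyed in a `Counter` keeps lookups cheap, which matters because the tool has to answer quickly.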

The Predictive Algorithm

The user enters a phrase in a text box and clicks the Predict button
The algorithm applies the same cleaning process to the input phrase
Based on the length of the input, the algorithm searches for a matching N-gram in the database. For a sentence of 5 words, it considers the last 3 words and searches the 4-grams; if there is no match, it considers the last 2 words and searches the 3-grams, and so on.
If there is a match, it returns the last word of the most frequent N-gram, along with a table of up to six most probable Next Words and their respective estimated probabilities
It also returns a word cloud of the most probable “Next Words” as well as the most probable N-grams
If there is no match, it returns the most probable word in the training dataset
The application can be found here.
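
The lookup described above can be sketched as follows. This is a simplified illustration in Python under stated assumptions, not the tool's actual code: the names `build_tables` and `predict` and the toy corpus are hypothetical, and the estimated probabilities are plain relative frequencies within the matched N-gram order. A full Katz back-off would also apply discounting, which is omitted here.

```python
from collections import Counter

def build_tables(sentences, max_n=4):
    """Build one n-gram frequency table per order from tokenized sentences."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for words in sentences:
        for n in tables:
            for i in range(len(words) - n + 1):
                tables[n][tuple(words[i:i + n])] += 1
    return tables

def predict(words, tables, top=6):
    """Return up to `top` (next_word, estimated_probability) pairs, best first.

    Starts from the longest usable context (last 3 words for 4-grams) and
    backs off to shorter contexts until some n-gram matches.
    """
    max_n = max(tables)
    for n in range(min(max_n, len(words) + 1), 1, -1):
        context = tuple(words[-(n - 1):])
        matches = Counter({gram[-1]: count
                           for gram, count in tables[n].items()
                           if gram[:-1] == context})
        if matches:
            total = sum(matches.values())
            return [(w, c / total) for w, c in matches.most_common(top)]
    # No n-gram matched: fall back to the most frequent single word.
    total = sum(tables[1].values())
    return [(gram[0], c / total) for gram, c in tables[1].most_common(1)]

# Hypothetical, already-cleaned toy corpus.
corpus = [
    ["i", "love", "new", "york"],
    ["i", "love", "new", "shoes"],
    ["i", "love", "new", "york", "city"],
    ["the", "sky", "is", "blue"],
    ["brand", "new"],
]
tables = build_tables(corpus)

# No 4-gram starts with "really love new", so the search backs off to 3-grams.
print(predict(["we", "really", "love", "new"], tables))
# Nothing matches at any order, so the most frequent word is returned.
print(predict(["unseen", "words"], tables))
```

Scanning every n-gram for each query is fine for a toy corpus. For a real training sample, indexing each table by its context would make the lookup a single dictionary access.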