George Pipis
2016-05-20
Data obtained from here. The zip file has data for four languages, but this project uses only English. The English data consists of three txt files, with text from blogs, news and Twitter. Because the algorithm should return results quickly, I took a small sample of around 50k lines in total as the training dataset.
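One possible way to draw such a sample without loading whole files into memory is reservoir sampling. The sketch below is illustrative only: the function name `sample_lines`, the fixed seed and the toy input are assumptions, not the method actually used in this project.

```python
import random

def sample_lines(lines, k, seed=0):
    """Reservoir-sample k items from any iterable of lines (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir

# Toy stand-in for a corpus file; a real run would pass an open file handle.
toy_corpus = [f"line {n}" for n in range(1000)]
sample = sample_lines(toy_corpus, 50)
```

Because the function accepts any iterable, the same call would work on an open text file, with `k` chosen so that the three sources add up to roughly 50k lines.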
Step 1 Clean the sample by removing special characters and punctuation and converting the text to lower case
Step 2 Create the unigrams, bigrams, trigrams and four-grams
Step 3 Apply a simple Katz's Back-off Algorithm which is based on n-grams
Step 4 Return the next predicted word, along with a table of other probable words and their estimated probabilities
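The four steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the names `clean`, `build_ngrams` and `predict` are hypothetical, apostrophes are kept during cleaning by assumption, and the back-off is a simplified version that uses plain relative frequencies with no discounting, so it is not full Katz back-off.

```python
import re
from collections import Counter, defaultdict

def clean(text):
    """Step 1: lower-case, drop everything except letters, apostrophes, spaces."""
    return re.sub(r"[^a-z' ]+", " ", text.lower()).split()

def build_ngrams(lines, max_n=4):
    """Step 2: counts[n][context][word], where context is a tuple of n-1 words."""
    counts = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                counts[n][tuple(gram[:-1])][gram[-1]] += 1
    return counts

def predict(counts, phrase, top=5, max_n=4):
    """Steps 3-4: try the longest context first, back off to shorter ones.

    Simplified back-off: probabilities are relative frequencies within the
    first n-gram order that has seen the context (no discounting).
    """
    tokens = clean(phrase)
    for n in range(max_n, 0, -1):
        context = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # phrase is too short for this n-gram order
        followers = counts[n].get(context)
        if followers:
            total = sum(followers.values())
            return [(w, c / total) for w, c in followers.most_common(top)]
    return []

# Toy training data, invented for illustration.
corpus = ["I love data science", "I love data", "I love music!"]
counts = build_ngrams(corpus)
table = predict(counts, "I love")        # trigram context ("i", "love") is found
fallback = predict(counts, "zzz qqq")    # unseen context, backs off to unigrams
```

The first entry of the returned list is the predicted next word, and the remaining entries form the table of alternatives with their estimated probabilities. A fuller Katz implementation would discount the higher-order counts and redistribute the freed probability mass to the lower-order models.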