Somnath Maji
February 24, 2018
This project attempts to predict the next word(s) as text is typed. Users are welcome to visit the project website. As the user types in the text box, the predicted next word(s) are displayed on the right.
This is the capstone project of the Data Science Specialization stream, in collaboration with SwiftKey.
The purpose of this project is to build a data product driven by the text the user enters.
The project uses data from HC Corpora.
Data Processing & Analysis:
Three data sets were created, one each for blogs, news and Twitter.
The data were cleaned: unusual characters, profanity, URLs, punctuation, numbers and extra whitespace were removed, and the text was converted to lower case, among other steps.
Some exploratory analysis was carried out.
N-gram files (biGram, triGram and quadGram) were created by tokenizing the cleaned sample corpus file.
Only a small sample of the corpus was used, to keep performance acceptable.
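The cleaning and n-gram steps above can be illustrated with a minimal Python sketch. It is not the project's own code: the function names `clean` and `ngram_counts`, the regular expressions, and the placeholder profanity list are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical placeholder; the project's actual profanity list is not given.
PROFANITY = {"badword"}

def clean(text):
    """Lower-case the text, drop URLs, numbers, punctuation and profanity,
    and return the remaining words as a list of tokens."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # URLs
    text = re.sub(r"[^a-z' ]+", " ", text)  # digits, punctuation, other characters
    words = [w.strip("'") for w in text.split()]  # split() also collapses whitespace
    return [w for w in words if w and w not in PROFANITY]

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = clean("Visit https://example.com NOW!!! It's 2018, and the cat sat on the mat.")
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
quadgrams = ngram_counts(tokens, 4)
```

Each counter maps a tuple of words to the number of times that sequence occurs, which is the raw material a back-off model needs.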
Katz's back-off algorithm is used to predict the next word from a training dataset.
Overall, the calculations involved in this algorithm are inexpensive, yet the predictions are quite accurate if the training set is large.
Essentially, if the n-gram has been seen more than k times in training, the conditional probability of a word given its history is proportional to the maximum likelihood estimate of that n-gram. Otherwise, the conditional probability is equal to the back-off conditional probability of the (n - 1)-gram.
For a fuller explanation of the model, see the Wikipedia article: https://en.wikipedia.org/wiki/Katz%27s_back-off_model
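The back-off idea can be sketched in Python for the simplest case, backing off from bigrams to unigrams. This is a simplified illustration rather than the application's implementation: it uses a fixed discount `d` in place of the Good-Turing discount that Katz's model specifies, and the names `train`, `backoff_prob` and `predict` are hypothetical.

```python
from collections import Counter

def train(tokens):
    """Count unigrams and bigrams in a token list."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def backoff_prob(word, prev, unigrams, bigrams, k=0, d=0.5):
    """Estimate P(word | prev), backing off from bigrams to unigrams.

    Assumption: a fixed discount d (0 < d < 1) stands in for the
    Good-Turing discount used in Katz's model.
    """
    total = sum(unigrams.values())
    following = {w: c for (p, w), c in bigrams.items() if p == prev}
    context_count = sum(following.values())
    if context_count == 0:
        # prev was never seen as a context: use the unigram estimate
        return unigrams[word] / total
    # Bigrams seen more than k times keep a discounted ML estimate
    seen = {w: c for w, c in following.items() if c > k}
    if word in seen:
        return (seen[word] - d) / context_count
    # Probability mass freed by discounting, shared among the other words
    leftover = 1.0 - sum(c - d for c in seen.values()) / context_count
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen) / total
    if unseen_mass == 0:
        return 0.0
    return leftover * (unigrams[word] / total) / unseen_mass

def predict(prev, unigrams, bigrams, top=3):
    """Return the most probable next words after prev."""
    ranked = sorted(unigrams,
                    key=lambda w: (-backoff_prob(w, prev, unigrams, bigrams), w))
    return ranked[:top]

unigrams, bigrams = train("the cat sat on the mat the cat ran".split())
suggestions = predict("the", unigrams, bigrams)
```

The leftover mass is rescaled so that the probabilities over the vocabulary still sum to one for a given context. A production version would precompute the per-context tables instead of rescanning the counts on every call.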
The Word Predictor Shiny application: https://myshiny-project.shinyapps.io/WordPredictor/
Challenges Faced during preparation of this data product.
Thank you for trying this data product.