Jitender Kumar
2 July 2016
This presentation is part of the Coursera Data Science Specialization. The task is to apply Natural Language Processing (NLP) concepts to predict the next word a user will type.
For the capstone course we were given the SwiftKey data sets, and a Shiny application has been developed using NLP concepts and techniques.
The objective of this presentation is to demonstrate the prediction methodology and the resulting application.
A four-level n-gram dictionary model (unigram through quadgram), combined with Stupid Backoff for scoring the next-word probability, has been used. The main steps are as follows:
A few points:
The cleaned and processed input string is searched through the gram tables, starting with the 4-gram. If the next word is found in the quadgram, trigram, or bigram table, then: Probability of the predicted next word = Discount * (word frequency in the subset found)
Discount: fixed for now, but Good-Turing discounting could also be estimated and used instead.
Gram-wise discounts: Quadgram 0.4, Trigram 0.3, Bigram 0.2, Unigram 0.1
Word frequency: calculated dynamically using the formula
word frequency / sum(frequency of all words in the subset found)
There are various approaches, but I found the one above simple to implement while still giving good accuracy; a minimal sketch of the scoring appears below.
The following diagram depicts the application architecture and flow:
The gram tables have a structure like the one below:
The application is hosted at
https://kjitender.shinyapps.io/Word_Prediction_App/
I would like to make the following improvements in the next version of the application:
I hope you liked the application and the methodology.
All suggestions and feedback are welcome at kumar[dot]jitender[at]gmail.com