Srikanth Patloori
28 July 2018
The next-word prediction model is an application, built using ANLP, that is trained on a large amount of text data to predict the next word from the preceding words.
Click Here for Shiny APP
It is very difficult to build a model from 4+ million lines of text on commodity hardware, so only 15% of the data was used as training data.
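One simple way to draw such a sample is to keep each line independently with a fixed probability. The sketch below is in Python and is only an illustration: the function name `sample_lines`, the seed, and the toy corpus are assumptions, not part of the application or of ANLP.

```python
import random


def sample_lines(lines, fraction=0.15, seed=42):
    """Keep each line independently with probability `fraction`.

    A fixed seed makes the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


# Toy corpus standing in for the real text files (hypothetical data).
corpus = [f"line {i}" for i in range(10_000)]
training = sample_lines(corpus)
print(len(training))  # roughly 15% of the corpus
```

Sampling per line keeps memory use low, because the full corpus never has to be shuffled or copied.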
Data cleaning is one of the major tasks and challenges in NLP. Natural language can be written in many forms, and every writer uses it differently.
The following transformations were applied to clean the data:
Remove non-English characters, remove punctuation, remove symbols and numbers, convert to lower case, replace contractions and abbreviations, remove text within brackets, remove profanity.
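The transformations above can be sketched as a single Python function. This is a minimal illustration, not the ANLP implementation: the contraction table and profanity list are tiny placeholders, dropping non-ASCII characters is only a rough stand-in for removing non-English characters, and the order of the steps is an assumption.

```python
import re

# Placeholder lookup tables; a real application would use much larger lists.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is"}
PROFANITY = {"darn"}


def clean_text(text):
    """Apply the cleaning steps to one line of text."""
    text = text.lower()
    # Remove text within round, square, or curly brackets.
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}", " ", text)
    # Normalise curly apostrophes so the contraction lookup can match.
    text = text.replace("\u2019", "'")
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    # Drop non-ASCII characters (rough proxy for non-English characters).
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation, symbols, and numbers.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Remove profanity and collapse repeated whitespace.
    words = [word for word in text.split() if word not in PROFANITY]
    return " ".join(words)


print(clean_text("I can't wait (really!) for 2 darn Days\u2026"))
# i cannot wait for days
```

Contractions are expanded before punctuation is stripped; otherwise the apostrophe would be lost and "can't" could no longer be recognised.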
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.
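As a small illustration of this definition, the Python sketch below extracts word-level n-grams from a token list. The helper name `ngrams` is hypothetical and is not taken from ANLP.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```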
I faced many challenges while building this application, so I developed ANLP, a library that provides enough functionality to build NLP applications such as next-word prediction and information retrieval.
Current functionalities:
Read and sample text data
Clean data
Build N-gram model
Predict next word using Backoff algorithm
The ANLP library is open source and available on CRAN: click Here
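The model-building and prediction steps can be sketched in Python as follows. This is a simplified backoff without discounting, written only to illustrate the idea; it is not ANLP's code, and the names `build_models` and `predict_next` and the toy sentences are assumptions.

```python
from collections import Counter, defaultdict


def build_models(sentences, max_n=3):
    """For each n in 1..max_n, map a context of n-1 words to counts of the next word."""
    models = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                models[n][context][tokens[i + n - 1]] += 1
    return models


def predict_next(models, text):
    """Try the longest context first and back off to shorter ones when it is unseen."""
    tokens = text.split()
    for n in sorted(models, reverse=True):
        # The unigram model (n == 1) uses the empty context.
        context = tuple(tokens[-(n - 1):]) if n > 1 else ()
        followers = models[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None


models = build_models(["i love new york", "i love new shoes", "we love new york"])
print(predict_next(models, "they love new"))  # york (trigram context "love new")
print(predict_next(models, "brand new"))      # york (backs off to bigram context "new")
```

When no context of any length has been seen, the sketch falls back to the most frequent single word; it returns `None` only if the models are empty.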
Many improvements are still possible, both in performance and in prediction accuracy.