Next word prediction model: Capstone Project

Srikanth Patloori
28 July 2018

The next word prediction model is an application trained on a large amount of text data to predict the next word from the preceding words, using the ANLP library.

Click Here for Shiny APP

Data preparation and cleaning

Building a model from 4+ million lines of text is very difficult on commodity hardware, so 15% of the data was used as training data.
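A 15% sample like the one described above could be drawn as in this minimal sketch. It is written in Python for illustration only and is not the ANLP implementation; the function name `sample_lines`, the fixed seed, and the stand-in corpus are all assumptions.

```python
import random

def sample_lines(lines, fraction=0.15, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Stand-in corpus (hypothetical); the real data has 4+ million lines.
corpus = [f"line {i}" for i in range(10_000)]
sample = sample_lines(corpus)
print(len(sample), "of", len(corpus), "lines kept")
```

Sampling line by line means the full corpus never has to be held in memory when the lines are read from a file iterator, which suits commodity hardware.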

Data cleaning is one of the major tasks and challenges in NLP. Natural language can be written in many forms, and every writer uses it differently.

The following transformations were applied to clean the data:

  • Remove non-English characters
  • Remove punctuation
  • Remove symbols and numbers
  • Convert to lower case
  • Replace contractions and abbreviations
  • Remove text within brackets
  • Remove profanity
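The cleaning steps listed above can be sketched as a single function. This is an illustrative Python sketch, not the ANLP code: the `CONTRACTIONS` and `PROFANITY` tables are tiny hypothetical placeholders, and the order of the steps is an assumption (contractions are expanded before punctuation is stripped so that apostrophes are still present).

```python
import re

# Tiny illustrative tables (assumptions); a real application needs far larger ones.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is"}
PROFANITY = {"darn"}  # placeholder word list

def clean_text(text):
    text = text.lower()                                          # convert to lower case
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}", " ", text)  # remove text within brackets
    text = text.replace("\u2019", "'")                           # normalise curly apostrophes
    for short, full in CONTRACTIONS.items():                     # replace contractions
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    # Drop everything except a-z and whitespace: non-English characters,
    # punctuation, symbols and numbers all go in one pass.
    text = re.sub(r"[^a-z\s]", " ", text)
    words = [w for w in text.split() if w not in PROFANITY]      # remove profanity
    return " ".join(words)

print(clean_text("I can't wait (really!) for 2018, it's 東京 time"))
```

Note that the single character filter is blunt: an accented word such as "café" would be cut to "caf", so a production cleaner may prefer to transliterate or drop such words whole.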

N-grams and backoff

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.

  • The data was tokenized to generate term-frequency tables of n-grams.
  • Five tables were generated: unigram, bigram, trigram, quadrigram, and five-gram (n = 1 to 5).
  • Each table has two attributes:
  1. Word (the n-gram itself)
  2. Frequency of that n-gram in the data
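As a rough illustration of these tables, the sketch below counts n-grams for n = 1 to 5 over a toy token sequence. It is generic Python, not the ANLP implementation; the helper name `ngram_table` and the sample sentence are assumptions.

```python
from collections import Counter

def ngram_table(tokens, n):
    """Count every contiguous run of n tokens (the n-gram and its frequency)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy token sequence (hypothetical); real input would be the cleaned sample.
tokens = "the cat sat on the mat the cat ran".split()

# One table per n, from unigrams (n = 1) up to five-grams (n = 5).
tables = {n: ngram_table(tokens, n) for n in range(1, 6)}

for ngram, freq in tables[2].most_common(3):
    print(" ".join(ngram), freq)
```

This toy version treats the whole input as one sequence; on real text the counting would usually be done per line or per sentence so that n-grams do not span unrelated sentences.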

ANLP library

I faced many challenges building this application, so I developed ANLP, a library with enough functionality to build NLP applications such as next word prediction and information retrieval.

Current functionalities:

  • Read and sample text data
  • Clean data
  • Build N-gram model
  • Predict next word using Backoff algorithm

ANLP library is available on CRAN (open source): click Here
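The idea behind backoff prediction can be sketched as follows: look up the longest available context first and, if it was never seen, drop the oldest word and try again. This is the simplest form of backoff, with raw counts and no discounting, written in Python for illustration; it is not the ANLP API, and `build_model` and `predict_next` are hypothetical names.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=5):
    """Map each context (tuple of preceding words) to a Counter of next words."""
    model = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            model[tuple(context)][word] += 1
    return model

def predict_next(model, words, max_n=5):
    """Try the longest context first, then back off to shorter ones."""
    context = tuple(words[-(max_n - 1):]) if max_n > 1 else ()
    while True:
        candidates = model.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
        if not context:
            return None          # the model is empty
        context = context[1:]    # drop the oldest word and retry

# Toy training data (hypothetical).
tokens = "the cat sat on the mat the cat ran".split()
model = build_model(tokens)

print(predict_next(model, ["sat", "on"]))   # trigram context was seen
print(predict_next(model, ["dog", "on"]))   # backs off to the context "on"
```

When no context matches at all, the empty context is reached and the most frequent single word is returned, so the function always produces a prediction for a non-empty model.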

Future Scope

Many improvements could still be made to enhance performance and accuracy:

  • Recurrent Neural Networks
  • Parallel processing
  • Real time processing
  • Interpolation
  • Predict punctuation and contractions
  • Remove stop words
  • Stemming
  • Personalized prediction