Predict Next Word

Hari Prasad
March-07-2017

Objective & Final Product

This project builds a product that predicts the next word based on the words entered so far (you may have already seen this in modern mobile keyboards). The following link takes you to the application.

  • Application: This link takes you to shinyapps, where the application is hosted.
    • How it works: Type a word followed by a space, and the predicted words appear below. By default the GloVe algorithm is selected; it linearly combines the input word vectors and searches the word vector space for the vectors closest to that combination by cosine similarity. The user can also select the Ngram algorithm. (Note: Please give it some time to load.)

Data Load and Cleansing

  • A 7% random sample of the Twitter, blog and news data is used.
  • The data has been cleansed by removing special characters, running spell checks, and removing stop words and repeated phrases.
  • The cleansed data is then used to build the word prediction models.
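
The sampling and cleansing steps above can be sketched roughly as follows. This is only an illustration in Python, not the project's actual code: the function names, the tiny stop word list and the fixed seed are all assumptions, spell checking is left out, and "repeated phrases" is simplified to collapsing immediately repeated words.

```python
import random
import re

# Tiny illustrative stop word list (an assumption; the real list is not given).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}


def sample_lines(lines, fraction=0.07, seed=42):
    """Return a random sample holding roughly `fraction` of the lines."""
    rng = random.Random(seed)
    k = max(1, round(len(lines) * fraction))
    return rng.sample(lines, k)


def clean_line(line):
    """Lowercase, strip special characters, drop stop words and repeats."""
    text = line.lower()
    # Replace anything that is not a letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Simplified stand-in for repeated-phrase removal:
    # collapse a word that immediately repeats itself.
    deduped = []
    for token in tokens:
        if not deduped or deduped[-1] != token:
            deduped.append(token)
    return " ".join(deduped)
```

For example, `clean_line("The cat, the CAT!! sat sat on the mat.")` yields `"cat sat mat"` with the stop word list shown here.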

Model Evaluation: GloVe Algorithm

  • GloVe is an unsupervised learning algorithm for obtaining vector representations of words. The text2vec package provides a GloVe implementation.
  • The word vectors of the query are linearly combined, and the vocabulary is searched for the word vectors nearest to that combination. The word with the best cosine similarity is treated as the most probable next word, the second best as the next candidate, and so on.
  • The application lets the user choose between GloVe and the simple Ngram backoff model.
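
The nearest-vector lookup described above might look like the sketch below. It is written in Python rather than with text2vec, the three-dimensional vectors are made up for illustration (real GloVe vectors are learned from the corpus), and the linear combination is assumed to be a plain sum; the weights actually used are not stated.

```python
import math

# Hand-made toy vectors (hypothetical); real ones come from a trained GloVe model.
TOY_VECTORS = {
    "good": [0.9, 0.1, 0.0],
    "morning": [0.8, 0.3, 0.1],
    "night": [0.7, 0.2, 0.4],
    "pizza": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_glove(query, vectors, top_n=3):
    """Rank vocabulary words by similarity to the summed query vectors."""
    known = [w for w in query.lower().split() if w in vectors]
    if not known:
        return []
    dim = len(next(iter(vectors.values())))
    # Linear combination of the query word vectors (assumed: unweighted sum).
    combined = [sum(vectors[w][i] for w in known) for i in range(dim)]
    scored = [
        (cosine_similarity(combined, vec), word)
        for word, vec in vectors.items()
        if word not in known
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:top_n]]
```

With these toy vectors, `predict_glove("good", TOY_VECTORS)` ranks "morning" first, then "night", then "pizza".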

Model Evaluation: Simple Ngram Backoff Model

  • The cleansed data is loaded to create ngrams of order 2 to 6.
  • An algorithm checks the frequency of ngrams matching the query and the corresponding nth word, starting with the 6-gram; if no match is found, it backs off to the next lower order, all the way down to the bigram.
  • The most frequently occurring word is predicted first, followed by the others.
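
A minimal sketch of such a backoff lookup is shown below, again in Python and under assumptions: the function names and the nested-dictionary table layout are illustrative choices, and raw counts are used without any smoothing or discounting.

```python
from collections import Counter, defaultdict


def build_ngram_tables(sentences, min_n=2, max_n=6):
    """Map each order n to {context of n-1 words: Counter of next words}."""
    tables = {n: defaultdict(Counter) for n in range(min_n, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables


def predict_backoff(query, tables, top_n=3):
    """Try the highest order first and back off until a context matches."""
    tokens = query.lower().split()
    for n in sorted(tables, reverse=True):
        if len(tokens) < n - 1:
            continue  # query is too short for this order
        context = tuple(tokens[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            # Most frequent continuation first.
            return [word for word, _ in followers.most_common(top_n)]
    return []
```

Built from the three sentences "i love new york", "i love new shoes" and "i love new york city", the query "i love new" matches at the 4-gram level and returns "york" before "shoes", while "brand new" finds no trigram match and backs off to the bigram context "new".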