Predict Next Word

by Sasa Pavkovic

An application for predicting the next word from a sequence of entered words, using a previously built corpus.

The application is built in the statistical programming environment R and published with Shiny, a framework that reads R code and renders interactive web pages from it.

Prediction model

Natural language processing (NLP) techniques are used for building the prediction model.

The prediction model is based on term frequency (TF) tables built from the provided data sources (News, Twitter, Blogs).

N-grams (unigrams, bigrams, trigrams, and 4-grams) are built from these corpora. The prediction algorithm then works on the TF tables to predict the next word.
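A minimal sketch of how such term frequency tables can be built in base R is shown below; the file name, cleaning rules, and function name are illustrative assumptions, not the app's actual code.

```r
# Build a sorted n-gram frequency table from a character vector of lines.
# Cleaning rules (lowercase, keep letters and apostrophes) are assumptions.
# For simplicity the corpus is treated as one word stream, so n-grams may
# cross line boundaries.
build_ngram_freq <- function(lines, n) {
  text  <- gsub("[^a-z' ]+", " ", tolower(lines))
  words <- unlist(strsplit(text, "\\s+"))
  words <- words[nzchar(words)]
  if (length(words) < n) return(table(character(0)))
  grams <- vapply(seq_len(length(words) - n + 1),
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)
}

lines    <- readLines("en_US.news.txt", warn = FALSE)  # assumed corpus file
unigrams <- build_ngram_freq(lines, 1)
bigrams  <- build_ngram_freq(lines, 2)
head(bigrams, 5)  # five most frequent bigrams
```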

Prediction algorithm

The prediction algorithm takes an input sequence of words (part of a sentence) and prepares it for prediction.

It then uses the n-gram frequencies to compute the most probable next word.

Simple interpolation is used to combine these estimates, and some of the lower-frequency words are replaced with an UNK token. This allows the algorithm to make predictions even when the input contains out-of-vocabulary words.
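The sketch below illustrates fixed-weight interpolation over the unigram, bigram, and trigram tables from the previous sketch; the lambda weights, the predict_next name, and the assumption that rare words were already replaced by "UNK" when the tables were built are all illustrative, not the app's exact method.

```r
# Fixed-weight interpolation of trigram, bigram, and unigram estimates.
# Assumes the frequency tables were built with rare words replaced by "UNK".
# Scoring every vocabulary word is simple but slow; it is only a sketch.
predict_next <- function(context, uni, bi, tri,
                         lambdas = c(0.6, 0.3, 0.1), k = 5) {
  ctx   <- tail(strsplit(tolower(context), "\\s+")[[1]], 2)
  vocab <- names(uni)
  ctx   <- ifelse(ctx %in% vocab, ctx, "UNK")  # map OOV context words to UNK
  score <- function(w) {
    p3 <- 0
    if (length(ctx) == 2) {
      num <- tri[paste(c(ctx, w), collapse = " ")]
      den <- bi[paste(ctx, collapse = " ")]
      if (!is.na(num) && !is.na(den)) p3 <- num / den
    }
    num <- bi[paste(c(tail(ctx, 1), w), collapse = " ")]
    den <- uni[tail(ctx, 1)]
    p2  <- if (!is.na(num) && !is.na(den)) num / den else 0
    p1  <- uni[w] / sum(uni)
    unname(lambdas[1] * p3 + lambdas[2] * p2 + lambdas[3] * p1)
  }
  probs <- vapply(vocab, score, numeric(1))
  head(sort(probs, decreasing = TRUE), k)  # top-k candidates, best first
}
```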

How does the app work?

You can access the Shiny app at https://scenthr.shinyapps.io/PredictNextword

First, select one of the data sources and click the “Confirm dataset selection” button. It takes a few seconds to load the prediction model for the selected dataset.

Then enter a sentence in the textbox below and click the “Confirm sentence selection” button. After a short delay, the prediction algorithm shows up to 5 possible next words on the right-hand side.

The highest-probability word is shown first, at the left of that list.
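To make the flow concrete, here is a minimal Shiny sketch of the interaction described above; the button labels follow the description, but the input IDs, the saved-model file layout, and the predict_next helper from the earlier sketch are assumptions.

```r
library(shiny)

ui <- fluidPage(
  selectInput("dataset", "Data source", c("News", "Twitter", "Blogs")),
  actionButton("loadBtn", "Confirm dataset selection"),
  textInput("sentence", "Enter a sentence"),
  actionButton("predictBtn", "Confirm sentence selection"),
  textOutput("predictions")
)

server <- function(input, output) {
  # Load the prediction model only when the dataset is confirmed.
  model <- eventReactive(input$loadBtn, {
    readRDS(paste0("model_", tolower(input$dataset), ".rds"))  # assumed layout
  })
  output$predictions <- renderText({
    input$predictBtn                       # re-run on each confirmation
    isolate({
      req(model(), input$sentence)
      words <- predict_next(input$sentence,
                            model()$uni, model()$bi, model()$tri)
      paste(names(words), collapse = " | ")  # highest probability first
    })
  })
}

shinyApp(ui, server)
```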

What's next?

The prediction algorithm suffers from low accuracy, even on sequences of words taken from the corpus. This is due to the size restriction (100 MB) on the prediction model. Next steps would be to:

  • reduce the model size by using a programming environment that allows finer control of memory consumption
  • use grammar to make the model smaller and more precise, e.g. enhance the model itself with synonyms
  • expand and unify the corpus to include longer n-grams