Next Word Predictor

Rajesh Vikraman
27-Oct-2017

Purpose and Basic Data Processing

This Shiny app takes a phrase (multiple words) as input and, when the user clicks Submit, predicts the next word. A slider selects how many candidate next words to display, from a minimum of 1 to a maximum of 50, along with their probabilities of occurrence.

The source data for the app are the three text files below:

  • A collection of blogs with 37,570,839 words
  • A collection of news with 2,651,432 words
  • A collection of Twitter messages with 30,451,170 words

N-gram models were developed from a data corpus built from 80% of each of the three data sources.
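As a rough illustration of drawing the 80% sample, here is a minimal Python sketch. It is not the app's R code: the function name `sample_lines`, the fixed seed and the line-level sampling are assumptions made for the example, since the source does not say how the 80% was selected.

```python
import random

def sample_lines(lines, fraction=0.8, seed=42):
    """Return a random subset holding `fraction` of the lines, in original order."""
    rng = random.Random(seed)                  # fixed seed so the sample is repeatable
    k = int(len(lines) * fraction)             # number of lines to keep
    keep = sorted(rng.sample(range(len(lines)), k))
    return [lines[i] for i in keep]

# Hypothetical stand-in for one of the three source files
blogs = [f"blog line {i}" for i in range(10)]
training = sample_lines(blogs)
print(len(training), "of", len(blogs), "lines kept")
```

The same call would be repeated for the news and Twitter files, and the three samples combined into one corpus.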

Data Processing and Building Ngram Language Models

The quanteda package in R was used for this purpose.

While creating the tokens, the data was cleaned to remove numbers, punctuation, symbols, separators and Twitter characters.

To ensure that n-grams are not formed across sentences, end-of-sentence symbols such as ., ! and ? were replaced with the term “eos”, and all n-grams containing “eos” were later removed. To save space, only n-grams that occurred at least twice, on two separate lines, were kept. The document-feature matrix created for 2-grams to 5-grams was converted into a data table holding the frequency of each n-gram.
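The cleaning and counting steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the app's quanteda code: the regular expressions, the helper names (`tokenize`, `ngrams`, `count_ngrams`) and the reading of the frequency filter as "count of at least 2 and present in at least 2 lines" are all choices made for the example.

```python
import re
from collections import Counter

EOS = "eos"  # sentence-boundary marker, as described in the text

def tokenize(line):
    """Lowercase, mark sentence ends with EOS, drop numbers, symbols and Twitter tags."""
    text = line.lower()
    text = re.sub(r"[.!?]+", f" {EOS} ", text)   # sentence boundaries -> marker
    text = re.sub(r"[@#]\w+", " ", text)          # Twitter handles and hashtags
    text = re.sub(r"[^a-z' ]+", " ", text)        # numbers, symbols, other punctuation
    return text.split()

def ngrams(tokens, n):
    """Yield n-grams that do not contain the sentence-boundary marker."""
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if EOS not in gram:
            yield " ".join(gram)

def count_ngrams(lines, n, min_count=2, min_lines=2):
    """Count n-grams, keeping those seen at least twice and in at least two lines."""
    total, line_freq = Counter(), Counter()
    for line in lines:
        grams = list(ngrams(tokenize(line), n))
        total.update(grams)            # overall frequency
        line_freq.update(set(grams))   # number of lines containing the n-gram
    return {g: c for g, c in total.items()
            if c >= min_count and line_freq[g] >= min_lines}

lines = ["I love New York. New York is big!", "We love New York", "New ideas"]
print(count_ngrams(lines, 2))
```

In this toy input the bigram "york new" is never produced, because the two words sit on either side of a sentence boundary.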

Data Sparsity and Smoothing Algorithm

In NLP, data sparsity describes the phenomenon of not observing enough data in a corpus to model language accurately. To address this, the Kneser-Ney smoothing technique was used.

Kneser-Ney smoothing [KN95] is a modified version of absolute discounting. The idea is to optimize the calculation of the lower-order n-gram probabilities in case the higher-order n-gram was unseen in the corpus. It was therefore originally a backoff smoothing algorithm [KN95]. The high-level motivation is that the backoff version of absolute discounting does not take into account the information that the higher-order n-gram was unseen when it backs off and calculates the lower-order probability [KN95]. For details, see https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing
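To make the role of continuation counts concrete, here is a minimal Python sketch of Kneser-Ney smoothing for bigrams only, in the interpolated form given on the Wikipedia page linked above. The source does not state which variant, discount value or implementation the app uses, so the function name `kneser_ney_bigram`, the discount of 0.75 and the toy counts are assumptions for illustration.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Return p(word, context) using interpolated Kneser-Ney for bigrams."""
    context_total = Counter()       # how often each context word is followed by anything
    followers = defaultdict(set)    # distinct words seen after a context word
    preceders = defaultdict(set)    # distinct words seen before a word
    for (v, w), c in bigram_counts.items():
        context_total[v] += c
        followers[v].add(w)
        preceders[w].add(v)
    n_types = len(bigram_counts)    # number of distinct bigrams

    def prob(word, context):
        # Continuation probability: in how many distinct contexts does `word` appear?
        p_cont = len(preceders.get(word, ())) / n_types
        total = context_total.get(context, 0)
        if total == 0:
            return p_cont           # unseen context: continuation probability only
        discounted = max(bigram_counts.get((context, word), 0) - discount, 0) / total
        weight = discount * len(followers[context]) / total   # mass freed by discounting
        return discounted + weight * p_cont

    return prob

# Toy counts: "francisco" is frequent but only ever follows "san"
counts = {("san", "francisco"): 5, ("new", "york"): 3, ("new", "car"): 1,
          ("red", "car"): 1, ("my", "car"): 1}
p = kneser_ney_bigram(counts)
for w in ("york", "car", "francisco"):
    print(w, round(p(w, "new"), 4))
```

Although "francisco" has the highest raw count, it receives the smallest probability after "new" because it continues only one distinct context, whereas "car" follows three.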

How the App Works

When a phrase is entered, the app takes the last four words and searches the 5-grams for a match. If no match is found, it takes the last three words and searches the 4-gram model. If there is still no match, the model backs off to trigrams and then bigrams.
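The lookup order described above can be sketched as follows. This is a hedged Python illustration, not the app's implementation: the function `predict_next`, the nested-dictionary layout of `models` and the example scores are hypothetical, and `top_k` stands in for the slider value (1 to 50).

```python
def predict_next(phrase, models, top_k=3):
    """Try 5-grams first, then back off to 4-, 3- and 2-grams.

    `models` maps order n -> {tuple of the n-1 context words: {next_word: score}}.
    Returns the order that matched and the top_k (word, score) pairs.
    """
    words = phrase.lower().split()
    for n in (5, 4, 3, 2):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue                    # phrase too short for this order
        candidates = models.get(n, {}).get(context)
        if candidates:
            ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
            return n, ranked[:top_k]
    return None, []                     # no match at any order

# Hypothetical toy models with made-up scores
models = {
    5: {("at", "the", "end", "of"): {"the": 0.9, "a": 0.05}},
    3: {("end", "of"): {"the": 0.6}},
    2: {("of",): {"the": 0.3, "course": 0.1}},
}
print(predict_next("We met at the end of", models))
print(predict_next("kind of", models))
```

The first call matches at the 5-gram level; the second has too few words for the higher orders, finds no trigram match and falls through to the bigram table.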

The probability displayed with each next word indicates whether the match came from a higher-order or a lower-order n-gram. Higher-order matches have a probability closer to 1. Lower-order matches have much lower probabilities because continuation counts are used in the calculation.

Click the link to access the app: https://cvrajesh.shinyapps.io/Nextwordpredict/

The memory requirement is 250 MB, and each next-word prediction takes 10-15 seconds due to the complexity of the Kneser-Ney algorithm.