Next Word Prediction using GLM

Narendra S. Shukla
October 2016

Analyzing the Corpus

  • The data set used for this analysis came from HC Corpora, http://www.corpora.heliohost.org

  • The original data set consisted of 899,288 blogs, 1,010,242 news articles and 2,360,148 tweets

  • Exploratory data analysis was performed on 10k blogs, news articles and tweets (from HC Corpora) using the “tm” and “RWeka” packages

  • The full report of that analysis is available at http://rpubs.com/NarenShukla/TextDataMining

  • This exploratory analysis laid a solid foundation for building the Next Word Prediction (NWP) algorithm later on

  • Performance bottlenecks appeared with the “tm” and “RWeka” packages when scaling to larger volumes

  • The “stylo” package was therefore used instead to build the model, for faster performance

Building the Language Model

  • 25% of the corpus (blogs, news and tweets) was loaded with the “stylo” package, and unigram, bigram, trigram and quadgram frequency lists were built from it in about one hour

  • The pre-processing applied was: remove non-English words, remove swear words, remove punctuation, and convert to lower case

  • English stop-words were NOT removed and stemming was NOT applied; keeping stop-words and the original word forms plays a vital role in the next word prediction (NWP) logic

  • The main “stylo” functions used were load.corpus.and.parse, delete.stop.words and make.frequency.list (a pipeline sketch is given at the end of this section)

  • To speed up performance further, each n-gram table was pruned to the entries whose frequency was greater than the mean frequency for that n-gram order

  • The final unigram, bigram, trigram and quadgram tables were written to CSV files, which are fed to the NWP algorithm at run time

  • The NWP (Next Word Prediction) algorithm reads these files in, loads them into data.tables and uses setkey to index the word column of each n-gram table (a loading sketch also follows at the end of this section)

  • These measures ensured that the NWP algorithm returned results in a reasonable amount of time
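  • A minimal sketch of how this frequency-list pipeline might look with the “stylo” package is shown below. It assumes the sampled text files sit in a local corpus/ directory, and it omits the profanity and non-English filtering steps; the exact arguments, helper name and file names are illustrative, not the app's actual code

    library(stylo)

    # Assumption: plain-text files (the 25% sample of blogs, news and tweets)
    # live in the "corpus/" directory
    build_ngram_freqs <- function(ngram.size) {
      parsed <- load.corpus.and.parse(files = "all",
                                      corpus.dir = "corpus",
                                      features = "w",           # word features
                                      ngram.size = ngram.size)  # 1 = unigrams, 2 = bigrams, ...
      # Frequency list over the whole parsed corpus, as raw counts
      freqs <- make.frequency.list(parsed, value = TRUE, relative = FALSE)
      data.frame(word = names(freqs), freq = as.numeric(freqs),
                 stringsAsFactors = FALSE)
    }

    # Build the unigram..quadgram tables
    ngrams <- lapply(1:4, build_ngram_freqs)
    names(ngrams) <- c("unigrams", "bigrams", "trigrams", "quadgrams")

    # Keep only entries whose frequency exceeds the mean for that n-gram order,
    # then write each table to CSV for the run-time NWP algorithm
    for (nm in names(ngrams)) {
      tbl <- ngrams[[nm]]
      tbl <- tbl[tbl$freq > mean(tbl$freq), ]
      write.csv(tbl, paste0(nm, ".csv"), row.names = FALSE)
    }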
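  • A corresponding run-time loading step, assuming the CSV names above, might look like this (again an illustrative sketch rather than the published code)

    library(data.table)

    # Read each pre-computed n-gram table and index it on the word column,
    # so that phrase look-ups are fast at prediction time
    unigrams  <- fread("unigrams.csv")
    bigrams   <- fread("bigrams.csv")
    trigrams  <- fread("trigrams.csv")
    quadgrams <- fread("quadgrams.csv")

    setkey(unigrams,  word)
    setkey(bigrams,   word)
    setkey(trigrams,  word)
    setkey(quadgrams, word)

    # Example look-up: all quadgrams beginning with "hope you get"
    quadgrams[word %like% "^hope you get "]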

The Next Word Prediction (NWP) Algorithm

  • A basic n-gram model estimates the conditional probability of the n-th word from the maximum likelihood (ML) estimate over the preceding (n-1)-word sequence

  • Smoothing methods “reduce” the probability mass of observed sequences and “assign” reasonable probabilities to sequences that did not occur in the corpus

  • Various smoothing methods were considered, such as “backoff”, “interpolation”, “Good-Turing estimation”, “absolute discounting” and “Kneser-Ney smoothing”

  • The final choice was “Generalized Language Models (GLM)” based on the “t sequences” (Typology Language Models) of the Jugend forscht project

  • This model splits a word sequence \( w_{i-n+1}^{i} \) into \( (n-1) \) different t sequences, \( (t_{i-n+1}^{i}, t_{i-n+2}^{i}, ..., t_{i-1}^{i}) \), calculates the conditional probability for each of them, and then aggregates the results (the formulas are sketched at the end of this section)

  • GLM reduces data sparsity by replacing every word in a sequence (except the first and last) with a wildcard (*) word. It therefore yields better results for unseen sequences than the other smoothing methods considered, including Kneser-Ney

  • Reference - https://west.uni-koblenz.de/sites/default/files/BachelorArbeit_MartinKoerner.pdf
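  • To make the estimates above concrete, the plain maximum-likelihood estimate and its wildcard (t sequence) variant can be written as follows; the notation follows the referenced thesis, while the simple averaging in the last line reflects how this app combines the individual estimates (based on the worked example in the next section, not a formula from the thesis)

    \( P_{ML}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i})}{c(w_{i-n+1}^{i-1})} \)

    \( P_{ML}(w_i \mid w_j\, * \cdots *) = \frac{c(w_j\, * \cdots *\, w_i)}{c(w_j\, * \cdots *)}, \quad j = i-n+1, \ldots, i-1 \)

    \( P(w_i \mid w_{i-n+1}^{i-1}) \approx \frac{1}{n-1} \sum_{j=i-n+1}^{i-1} P_{ML}(w_i \mid w_j\, * \cdots *) \)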

The Next Word Prediction (NWP) Shiny App

  • Let's say you are given the phrase (hope you get) and you have to determine the probability of the word (to). The GLM t-sequence model works this way (an R sketch of the same lookup follows the example),

  • P(ML) (to|hope * *) = c(hope * * to)/c(hope * *)   # Derived from (Quadgram) and (Trigram)
    P(ML) (to|you *) = c(you * to)/c(you *)            # Derived from (Trigram) and (Bigram)
    P(ML) (to|get) = c(get to)/c(get)                  # Derived from (Bigram) and (Unigram)
    
    P(ML) (to|hope you get) = mean(P(ML) (to|hope * *), P(ML) (to|you *), P(ML) (to|get))
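  • A hypothetical R sketch of that lookup against the keyed data.tables loaded earlier could look like the following. The regular-expression wildcard matching, the column names and the helper functions are assumptions made for illustration, not the app's actual implementation

    library(data.table)

    # Count all n-grams in a table matching a pattern such as "hope \\S+ \\S+ to";
    # assumes each table has columns word (the full n-gram) and freq
    count_pattern <- function(dt, pattern) {
      sum(dt[grepl(pattern, word)]$freq)
    }

    # P_ML(next word | context word followed by wildcards) for one t sequence
    t_sequence_prob <- function(context_word, n_wildcards, next_word,
                                numer_dt, denom_dt) {
      wild  <- paste(rep("\\S+", n_wildcards), collapse = " ")
      hist  <- if (n_wildcards > 0) paste(context_word, wild) else context_word
      numer <- count_pattern(numer_dt, paste0("^", hist, " ", next_word, "$"))
      denom <- count_pattern(denom_dt, paste0("^", hist, "$"))
      if (denom == 0) NA_real_ else numer / denom
    }

    # Worked example: P(to | hope you get) as the mean of the three t sequences
    p_to <- mean(c(
      t_sequence_prob("hope", 2, "to", quadgrams, trigrams),  # hope * * to
      t_sequence_prob("you",  1, "to", trigrams,  bigrams),   # you * to
      t_sequence_prob("get",  0, "to", bigrams,   unigrams)   # get to
    ), na.rm = TRUE)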
    
  • The app can be accessed by URL https://narenshukla.shinyapps.io/predictNextWordApp/

  • The code is available at https://github.com/NarenShukla/predictNextWordApp

  • Enter the sentence for which you want the app to predict the next word, then press the (Fetch Results) button. The app first uses the last 3 words of the sentence to predict the next word; if that fails, it falls back to the last 2 words, and finally to the last word alone

  • This triggers the back-end logic. You will see the message “Running Prediction Logic”, and once it completes, the results are displayed in the main panel

  • Two types of results are shown: a word cloud of the predicted words, and the predicted words with their probabilities and percentages in descending order. The app also lists some of the phrases previously tried out (a minimal illustrative sketch of such an interface is given at the end of this section)

  • I am sure you will be amazed by this App !!!
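  • For readers curious how such an interface is typically wired up, below is a deliberately minimal, hypothetical Shiny sketch with a text input, a (Fetch Results) button, a word cloud and a results table. It is not the published app's code (that lives in the GitHub repository above) and assumes a predict_next_word() helper returning a data frame of candidate words and probabilities

    library(shiny)
    library(wordcloud)

    # Hypothetical stand-in for the real GLM prediction logic described above
    predict_next_word <- function(sentence) {
      data.frame(word = c("to", "a", "the"), prob = c(0.5, 0.3, 0.2))
    }

    ui <- fluidPage(
      titlePanel("Next Word Prediction"),
      sidebarLayout(
        sidebarPanel(
          textInput("sentence", "Enter a sentence:", value = "hope you get"),
          actionButton("fetch", "Fetch Results")
        ),
        mainPanel(
          plotOutput("cloud"),     # word cloud of predicted words
          tableOutput("results")   # words with probabilities, descending
        )
      )
    )

    server <- function(input, output) {
      preds <- eventReactive(input$fetch, {
        res <- predict_next_word(input$sentence)
        res[order(-res$prob), ]
      })
      output$results <- renderTable(preds())
      output$cloud   <- renderPlot({
        p <- preds()
        wordcloud(p$word, p$prob, scale = c(4, 0.5))
      })
    }

    shinyApp(ui, server)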