Next Word Prediction using GLM

Narendra S. Shukla
October 2016

Analyzing the Corpus

  • The data set used for this analysis came from HC Corpora, http://www.corpora.heliohost.org

  • The original data set consisted of 899,288 blogs, 1,010,242 news articles and 2,360,148 tweets

  • Exploratory data analysis was performed on 10k blogs, news articles and tweets (from HC Corpora) using the “tm” and “RWeka” packages

  • The full report of that analysis is available at http://rpubs.com/NarenShukla/TextDataMining

  • This exploratory analysis laid a solid foundation for building the Next Word Prediction (NWP) algorithm later on

  • Performance bottlenecks appeared with the “tm” and “RWeka” packages when scaling to larger volumes

  • The “stylo” package was therefore used instead to build the model, for faster performance

Building the Language Model

  • 25% of the corpus (blogs, news and tweets) was loaded with the “stylo” package, and unigram, bigram, trigram and quadgram frequency lists were built from it in about one hour

  • The pre-processing applied was: remove non-English words, remove swear words, remove punctuation, and convert to lower case

  • English stop-words were NOT removed and stemming was NOT applied; keeping stop-words and the original word forms plays a vital role in the next word prediction (NWP) logic

  • The main “stylo” functions used were load.corpus.and.parse, delete.stop.words and make.frequency.list (a pipeline sketch is given at the end of this section)

  • To speed up performance further, each n-gram table was pruned to the entries whose frequency was greater than the mean frequency for that n-gram order

  • The final unigram, bigram, trigram and quadgram tables were written to CSV files, which are fed to the NWP algorithm at run time

  • The NWP (Next Word Prediction) algorithm reads these files in, loads them into data.tables and uses setkey to index the word column of each n-gram table (a loading sketch also follows at the end of this section)

  • These measures ensured that the NWP algorithm returned results in a reasonable amount of time
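  • A minimal sketch of how this frequency-list pipeline might look with the “stylo” package is shown below. It assumes the sampled text files sit in a local corpus/ directory, and it omits the profanity and non-English filtering steps; the exact arguments, helper name and file names are illustrative, not the app's actual code

    library(stylo)

    # Assumption: plain-text files (the 25% sample of blogs, news and tweets)
    # live in the "corpus/" directory
    build_ngram_freqs <- function(ngram.size) {
      parsed <- load.corpus.and.parse(files = "all",
                                      corpus.dir = "corpus",
                                      features = "w",           # word features
                                      ngram.size = ngram.size)  # 1 = unigrams, 2 = bigrams, ...
      # Frequency list over the whole parsed corpus, as raw counts
      freqs <- make.frequency.list(parsed, value = TRUE, relative = FALSE)
      data.frame(word = names(freqs), freq = as.numeric(freqs),
                 stringsAsFactors = FALSE)
    }

    # Build the unigram..quadgram tables
    ngrams <- lapply(1:4, build_ngram_freqs)
    names(ngrams) <- c("unigrams", "bigrams", "trigrams", "quadgrams")

    # Keep only entries whose frequency exceeds the mean for that n-gram order,
    # then write each table to CSV for the run-time NWP algorithm
    for (nm in names(ngrams)) {
      tbl <- ngrams[[nm]]
      tbl <- tbl[tbl$freq > mean(tbl$freq), ]
      write.csv(tbl, paste0(nm, ".csv"), row.names = FALSE)
    }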
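  • A corresponding run-time loading step, assuming the CSV names above, might look like this (again an illustrative sketch rather than the published code)

    library(data.table)

    # Read each pre-computed n-gram table and index it on the word column,
    # so that phrase look-ups are fast at prediction time
    unigrams  <- fread("unigrams.csv")
    bigrams   <- fread("bigrams.csv")
    trigrams  <- fread("trigrams.csv")
    quadgrams <- fread("quadgrams.csv")

    setkey(unigrams,  word)
    setkey(bigrams,   word)
    setkey(trigrams,  word)
    setkey(quadgrams, word)

    # Example look-up: all quadgrams beginning with "hope you get"
    quadgrams[word %like% "^hope you get "]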

The Next Word Prediction (NWP) Algorithm

  • A basic n-gram model estimates the conditional probability of the n-th word from the maximum likelihood (ML) estimate over the preceding (n-1)-word sequence

  • Smoothing methods “reduce” the probability mass of observed sequences and “assign” reasonable probabilities to sequences that did not occur in the corpus

  • Various smoothing methods were considered, such as “backoff”, “interpolation”, “Good-Turing estimation”, “absolute discounting” and “Kneser-Ney smoothing”

  • The final choice was “Generalized Language Models (GLM)” based on the “t sequences” (Typology Language Models) of the Jugend forscht project

  • This model splits a word sequence \( w_{i-n+1}^{i} \) into \( (n-1) \) different t sequences, \( (t_{i-n+1}^{i}, t_{i-n+2}^{i}, ..., t_{i-1}^{i}) \), calculates the conditional probability for each of them, and then aggregates the results (the formulas are sketched at the end of this section)

  • GLM reduces data sparsity by replacing every word in a sequence (except the first and last) with a wildcard (*) word. It therefore yields better results for unseen sequences than the other smoothing methods considered, including Kneser-Ney

  • Reference - https://west.uni-koblenz.de/sites/default/files/BachelorArbeit_MartinKoerner.pdf
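  • To make the estimates above concrete, the plain maximum-likelihood estimate and its wildcard (t sequence) variant can be written as follows; the notation follows the referenced thesis, while the simple averaging in the last line reflects how this app combines the individual estimates (based on the worked example in the next section, not a formula from the thesis)

    \( P_{ML}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i})}{c(w_{i-n+1}^{i-1})} \)

    \( P_{ML}(w_i \mid w_j\, * \cdots *) = \frac{c(w_j\, * \cdots *\, w_i)}{c(w_j\, * \cdots *)}, \quad j = i-n+1, \ldots, i-1 \)

    \( P(w_i \mid w_{i-n+1}^{i-1}) \approx \frac{1}{n-1} \sum_{j=i-n+1}^{i-1} P_{ML}(w_i \mid w_j\, * \cdots *) \)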

The Next Word Prediction (NWP) Shiny App

  • Let's say you are given the phrase (hope you get) and you have to determine the probability of the word (to). The GLM t-sequence model works this way (an R sketch of the same lookup follows the example),

  • P(ML) (to|hope * *) = c(hope * * to)/c(hope * *)   # Derived from (Quadgram) and (Trigram)
    P(ML) (to|you *) = c(you * to)/c(you *)            # Derived from (Trigram) and (Bigram)
    P(ML) (to|get) = c(get to)/c(get)                  # Derived from (Bigram) and (Unigram)
    
    P(ML) (to|hope you get) = mean(P(ML) (to|hope * *), P(ML) (to|you *), P(ML) (to|get))
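  • A hypothetical R sketch of that lookup against the keyed data.tables loaded earlier could look like the following. The regular-expression wildcard matching, the column names and the helper functions are assumptions made for illustration, not the app's actual implementation

    library(data.table)

    # Count all n-grams in a table matching a pattern such as "hope \\S+ \\S+ to";
    # assumes each table has columns word (the full n-gram) and freq
    count_pattern <- function(dt, pattern) {
      sum(dt[grepl(pattern, word)]$freq)
    }

    # P_ML(next word | context word followed by wildcards) for one t sequence
    t_sequence_prob <- function(context_word, n_wildcards, next_word,
                                numer_dt, denom_dt) {
      wild  <- paste(rep("\\S+", n_wildcards), collapse = " ")
      hist  <- if (n_wildcards > 0) paste(context_word, wild) else context_word
      numer <- count_pattern(numer_dt, paste0("^", hist, " ", next_word, "$"))
      denom <- count_pattern(denom_dt, paste0("^", hist, "$"))
      if (denom == 0) NA_real_ else numer / denom
    }

    # Worked example: P(to | hope you get) as the mean of the three t sequences
    p_to <- mean(c(
      t_sequence_prob("hope", 2, "to", quadgrams, trigrams),  # hope * * to
      t_sequence_prob("you",  1, "to", trigrams,  bigrams),   # you * to
      t_sequence_prob("get",  0, "to", bigrams,   unigrams)   # get to
    ), na.rm = TRUE)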
    
  • The app can be accessed by URL https://narenshukla.shinyapps.io/predictNextWordApp/

  • The code is available at https://github.com/NarenShukla/predictNextWordApp

  • Enter the sentence for which you want the app to predict the next word, then press the (Fetch Results) button. The app first uses the last 3 words of the sentence to predict the next word; if that fails, it falls back to the last 2 words, and finally to the last word alone

  • This triggers the back-end logic. You will see the message “Running Prediction Logic”, and once it completes, the results are displayed in the main panel

  • Two types of results are shown: a word cloud of the predicted words, and the predicted words with their probabilities and percentages in descending order. The app also lists some of the phrases previously tried out (a minimal illustrative sketch of such an interface is given at the end of this section)

  • I am sure you will be amazed by this App !!!
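  • For readers curious how such an interface is typically wired up, below is a deliberately minimal, hypothetical Shiny sketch with a text input, a (Fetch Results) button, a word cloud and a results table. It is not the published app's code (that lives in the GitHub repository above) and assumes a predict_next_word() helper returning a data frame of candidate words and probabilities

    library(shiny)
    library(wordcloud)

    # Hypothetical stand-in for the real GLM prediction logic described above
    predict_next_word <- function(sentence) {
      data.frame(word = c("to", "a", "the"), prob = c(0.5, 0.3, 0.2))
    }

    ui <- fluidPage(
      titlePanel("Next Word Prediction"),
      sidebarLayout(
        sidebarPanel(
          textInput("sentence", "Enter a sentence:", value = "hope you get"),
          actionButton("fetch", "Fetch Results")
        ),
        mainPanel(
          plotOutput("cloud"),     # word cloud of predicted words
          tableOutput("results")   # words with probabilities, descending
        )
      )
    )

    server <- function(input, output) {
      preds <- eventReactive(input$fetch, {
        res <- predict_next_word(input$sentence)
        res[order(-res$prob), ]
      })
      output$results <- renderTable(preds())
      output$cloud   <- renderPlot({
        p <- preds()
        wordcloud(p$word, p$prob, scale = c(4, 0.5))
      })
    }

    shinyApp(ui, server)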