The product is developed using an English corpus (a large set of text) that contains news articles, blogs, and tweets.
The combined corpus, before cleaning, contains 4.3 million lines and 102.4 million words (big data territory).
There are about 1 million words in the English language as of Jan 1, 2014*.
Everyday conversation requires roughly 2,500-3,000 actively used words to communicate effectively (with the passive vocabulary of words not used often, the total is about 15k-20k).
Statistical Modeling
Naturally spoken English words are highly correlated, which means the probability of the next word in a sentence depends on the preceding words.
Because English words are highly correlated, the probability of seeing any word w given the previous words w1, w2, ..., wn-1 can be estimated with the Markov assumption [Martin Christian Körner].
As an example, given the sentence “the quick brown fox”, the probability of “fox” following “the quick brown” can be estimated as follows.
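A minimal illustration of the maximum-likelihood estimate in R (the counts below are made up for illustration, not taken from the actual corpus):

# P(fox | the quick brown) = count("the quick brown fox") / count("the quick brown")
count_quick_brown_fox <- 12   # hypothetical count of "the quick brown fox"
count_quick_brown     <- 20   # hypothetical count of "the quick brown"
p_fox <- count_quick_brown_fox / count_quick_brown
p_fox                         # 0.6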
Maximum Likelihood (ML) and Smoothing for Missing Tokens
n-grams (word tokens) and their respective frequencies of occurrence were tabulated from the sampled data (those seen in the corpus). The Good-Turing discount smoothing technique was intended to fill in values for unseen tokens; however, due to the limitations of my home PC, it was left out (the permutations of unseen unigram words alone were too much data to process on a home PC).
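As an illustration, a minimal sketch of how the n-gram frequency tables might be tabulated with the stringi package, assuming sample_lines holds the 1% corpus sample (function and object names here are illustrative, not the actual production code):

library(stringi)
# Split the sampled lines into lowercase word tokens and count n-grams.
make_ngrams <- function(lines, n) {
  words <- unlist(stri_split_regex(stri_trans_tolower(lines), "\\s+"))
  words <- words[words != ""]
  if (length(words) < n) return(table(character(0)))
  grams <- sapply(seq_len(length(words) - n + 1),
                  function(i) paste(words[i:(i + n - 1)], collapse = " "))
  sort(table(grams), decreasing = TRUE)
}
unigrams <- make_ngrams(sample_lines, 1)
bigrams  <- make_ngrams(sample_lines, 2)
trigrams <- make_ngrams(sample_lines, 3)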
After stacking the n-gram tokens, a back-off technique was applied, prioritizing the trigram, then the bigram, and finally the unigram. (A lambda value was selected after a few trials for the best outcome.)
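A minimal sketch of the back-off idea, assuming the trigram/bigram/unigram tables above and a lambda penalty chosen by trial (an illustration of the technique, not the exact application code):

# Look for trigrams that start with the last two words typed; if none are
# found, back off to bigrams (discounted by lambda), then to unigrams.
backoff_scores <- function(w1, w2, lambda = 0.4) {
  tri_prefix <- paste(w1, w2, "")
  tri <- trigrams[substring(names(trigrams), 1, nchar(tri_prefix)) == tri_prefix]
  if (length(tri) > 0) return(tri)
  bi_prefix <- paste(w2, "")
  bi <- bigrams[substring(names(bigrams), 1, nchar(bi_prefix)) == bi_prefix]
  if (length(bi) > 0) return(lambda * bi)
  lambda^2 * unigrams
}
# head(backoff_scores("i", "love"), 3) gives the three highest-scoring continuations.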
Data Cleaning and preparation steps
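The cleaning steps themselves are shown in the diagram on slide 7; a minimal sketch of this kind of cleaning with the tm package might look like the following (the transformations listed are typical ones and are assumed, not copied from the actual code):

library(tm)
# Build a corpus from the sampled lines and apply standard cleaning steps.
clean_corpus <- VCorpus(VectorSource(sample_lines))
clean_corpus <- tm_map(clean_corpus, content_transformer(tolower))
clean_corpus <- tm_map(clean_corpus, removeNumbers)
clean_corpus <- tm_map(clean_corpus, removePunctuation)
clean_corpus <- tm_map(clean_corpus, stripWhitespace)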
How the Application Works
The data is indexed and listed in the order in which words are most frequently used. The application returns the most frequent use of a word in a sentence in naturally spoken English.
For example, if there are 10 sentences that start with 'I love ', the three results returned will be the words that most frequently follow 'I love ', for example 'I love you', 'I love him', or 'I love her' rather than 'I love cars' or 'I love cows'.
The application goes one step further and returns predictions based on the position of the word in the sentence, as it is most frequently used.
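A minimal sketch of this lookup, assuming the n-gram counts have been reshaped into a data frame with columns prefix, next_word, and freq (these column and object names are illustrative assumptions):

library(dplyr)
# Return the k most frequent words observed after the given phrase.
top_predictions <- function(ngram_df, phrase, k = 3) {
  ngram_df %>%
    filter(prefix == phrase) %>%
    arrange(desc(freq)) %>%
    head(k)
}
# Example: top_predictions(ngram_df, "i love") might rank
# "you", "him", and "her" ahead of "cars" or "cows".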
Application Features
Application Usage
High-Level End-to-End View
The app is deployed on shinyapps.io and can be found here.
Libraries used to develop the product
library(tm)
library(openNLP)
library(stringi)
library(dplyr)
library(stringr)
library(qdap)
library(shiny)
library(shinythemes)
R version 3.2.3 (2015-12-10)
UNIX command-line functions with regular expressions were used to help with some of the cleaning.
Challenges developing the product
The combined corpus, before cleaning, contains 4.3 million lines and 102.4 million words. (This is big data territory.) Because the model was built on a home computer with limited processing power and memory, a very small sample (1%) was used, and a conditional-probability statistical model was applied to predict the next word.
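A minimal sketch of that sampling step (the file name is illustrative; the actual corpus files and seed may differ):

set.seed(1234)                       # for a reproducible sample
lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
keep  <- rbinom(length(lines), size = 1, prob = 0.01) == 1
sample_lines <- lines[keep]          # roughly 1% of the raw lines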
Some of the packages were too slow at creating n-grams and others were too slow at cleaning the data; for these steps, UNIX command-line tools were used.
Even with 1% of the corpus, the trigram/bigram Good-Turing smoothing for unseen data exploded beyond the means of a personal computer to process.
Image Credit
Conditional Probability with Markov Assumption formula image (slide 5)
Martin Christian Körner: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction.
Maximum Likelihood and Example (slide 6)
Martin Christian Körner: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction.
Word Cloud (slide 2) - Abiyu Giday
High Level (slide 3 & 9) - Abiyu Giday
Data Cleaning Steps (slide 7) - Abiyu Giday
Application features (slide 9-11) - Abiyu Giday
References
John Fry - Lecture slides - Bigrams and Trigrams, Boise State University.
Martin Christian Körner - Implementation of Modified Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction. Universität Koblenz-Landau.
Timothy P. Jurka, Loren Collingwood, Amber E. Boydstun, Emiliano Grossman, and Wouter van Atteveldt - RTextTools: A Supervised Learning Package for Text Classification.