Arash Amoozegar
June 2016
Natural Language Processing with N-gram modeling
Unigram, bigram, trigram, and quadgram
Stupid Backoff
Guess My Next Word, developed in R and deployed with the Shiny package, is a web application that uses an N-gram Natural Language Processing model to predict the next word in a sentence. The prediction is based on a single sampled corpus built from three sources of text: Twitter, News, and Blogs.
The raw data set includes three English-language corpus files from Twitter, News, and Blogs. All three files are cleaned by removing numbers, punctuation, extra whitespace, Twitter handles and tags, email and website addresses, and profanity. Due to processing limitations, a 1 MB random sample is drawn from the main 556 MB corpus file.
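The cleaning and sampling steps above can be sketched roughly as follows. This is a Python illustration, not the app's R code: the regular expressions, the lowercasing step, the placeholder `PROFANITY` set, and the `sample_lines` helper are all assumptions, since the source does not say which word list or sampling procedure was used.

```python
import random
import re

# Hypothetical placeholder list; the source does not name the profanity list it uses.
PROFANITY = {"darn", "heck"}

def clean_line(text):
    """Lowercase a line (an assumption) and strip the noise categories listed above."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)               # email addresses
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # website addresses
    text = re.sub(r"[@#]\w+", " ", text)               # Twitter handles and tags
    text = re.sub(r"\d+", " ", text)                   # numbers
    text = re.sub(r"[^a-z'\s]", " ", text)             # punctuation and other symbols
    words = [w.strip("'") for w in text.split()]       # split() also collapses whitespace
    return " ".join(w for w in words if w and w not in PROFANITY)

def sample_lines(lines, max_bytes, seed=0):
    """Shuffle the lines and keep them until the byte budget would be exceeded."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    kept, size = [], 0
    for line in shuffled:
        n = len(line.encode("utf-8"))
        if size + n > max_bytes:
            break
        kept.append(line)
        size += n
    return kept
```

With a budget of `max_bytes=1_000_000`, `sample_lines` would return roughly the 1 MB sample described above; the order of the regular expressions matters, because email addresses must be removed before the handle pattern strips their `@` part.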
N-gram tokens (unigram, bigram, trigram, and quadgram) were generated from the cleaned random sample. Frequency tables were then built by counting how often each N-gram occurs in the sample corpus. These frequency tables are the basis of the prediction algorithm.
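A minimal Python sketch of building such frequency tables is shown here for illustration; the app itself is written in R, and the function names `ngram_counts` and `build_tables` are hypothetical. It assumes each cleaned line is tokenized by splitting on whitespace.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count every run of n consecutive tokens across the cleaned sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def build_tables(sentences, max_n=4):
    """Return one frequency table per order: 1 (unigram) through 4 (quadgram)."""
    return {n: ngram_counts(sentences, n) for n in range(1, max_n + 1)}
```

For example, `build_tables(["i love you", "i love cats"])[2]` counts the bigram `("i", "love")` twice and each of `("love", "you")` and `("love", "cats")` once.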
When the user enters an incomplete sentence, the app searches the frequency tables for the most probable next word. A backoff model starts the search at the quadgram table and falls back through the trigram table to the bigram table when no match is found.
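The backoff search can be illustrated with the Python sketch below, which follows the general Stupid Backoff scheme named in the title: a candidate is scored by its relative frequency at the highest order where it appears, and each step down to a lower order multiplies the score by a fixed factor. The factor 0.4 is the value commonly used with Stupid Backoff and is an assumption here, as are the function names; the source does not state the app's actual settings, and the app itself is written in R.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the source does not give the app's value

def build_tables(sentences, max_n=4):
    """Frequency tables keyed by order n, each mapping token tuples to counts."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in tables:
            for i in range(len(tokens) - n + 1):
                tables[n][tuple(tokens[i:i + n])] += 1
    return tables

def predict_next(tables, text, max_n=4, alpha=ALPHA):
    """Return the highest-scoring next word, or None if no table matches."""
    tokens = text.lower().split()
    scores = {}
    weight = 1.0
    top = min(max_n, len(tokens) + 1)  # highest order the input can support
    for n in range(top, 1, -1):        # quadgram, then trigram, then bigram
        context = tuple(tokens[len(tokens) - (n - 1):])
        context_count = tables[n - 1][context]
        if context_count:
            # A linear scan keeps the sketch short; a real app would index by context.
            for gram, count in tables[n].items():
                word = gram[-1]
                if gram[:-1] == context and word not in scores:
                    scores[word] = weight * count / context_count
        weight *= alpha                # penalize every step down to a lower order
    return max(scores, key=scores.get) if scores else None
```

Because the search stops at the bigram table, as described above, an input whose last word never appears in the sample yields `None` rather than a fallback to the most frequent unigram.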
Guess My Next Word Application
Please use the above link to access the application. After it loads, type your incomplete sentence into the text box, click the Submit button, and wait for the application to return the most probable next word.
