Dynamic Word Predictor

author: Thomas V Joseph

date: 26th April 2015

The Dynamic Word Predictor is a self-learning application built in RStudio and deployed to the shinyapps.io cloud.

The application has the following features:

- Three-tier corpora: a trigram corpus, a bigram corpus and a unigram corpus
- Self-learning prediction: the application learns from user input and from the user's word preferences
- A word prediction model that combines Markov chain, backoff and Kneser-Ney smoothing principles
- Deployed with Shiny dashboards

The base corpus was extracted by sampling from the given data samples.
The corpus was cleaned and split into three basic data frames (trigram, bigram and unigram).
Each data frame holds the words of the respective n-gram and the probability of that n-gram.
- For example, the trigram data frame has 4 columns: the first three hold the first, second and third words of the trigram
- The fourth column holds the probability of the trigram
The probability of each n-gram was calculated based on the Markov chain principle.
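
The four-column layout described above can be illustrated with a minimal sketch. It is written in Python rather than the application's R code, and it assumes the simplest Markov-style estimate, count(w1 w2 w3) / count(w1 w2); the function name `trigram_table` and the toy sentence are hypothetical, and the application's actual calculation may differ.

```python
from collections import Counter

def trigram_table(tokens):
    """Return rows (w1, w2, w3, p), assuming p = count(w1 w2 w3) / count(w1 w2)."""
    triples = list(zip(tokens, tokens[1:], tokens[2:]))
    trigram_counts = Counter(triples)
    # Count each two-word context only where it starts a trigram,
    # so the probabilities for one context sum to 1.
    context_counts = Counter((w1, w2) for w1, w2, _ in triples)
    rows = [
        (w1, w2, w3, n / context_counts[(w1, w2)])
        for (w1, w2, w3), n in trigram_counts.items()
    ]
    # Group rows by context, most probable third word first.
    return sorted(rows, key=lambda r: (r[0], r[1], -r[3]))

# Toy corpus, for illustration only.
tokens = "i like green tea and i like green apples and i like black tea".split()
table = trigram_table(tokens)

for row in table:
    print(row)
```

Each row corresponds to one line of the trigram data frame: three word columns followed by the probability column.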

The application flow works as follows:
- When a user enters a string of words, it is first looked up in the dynamic frequency data frame (part of the self-learning module, explained in subsequent slides)
- If the words are not found in the dynamic frequency table, the lookup backs off to the trigram, bigram and unigram data frames, in that order
- The unigram model is weighted based on the Kneser-Ney smoothing principle
- Words are predicted from the n-grams with the highest probability in the data frame that matched
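
The lookup order above can be sketched as follows. This is a Python illustration under stated assumptions, not the application's R code: the names `build_tables`, `top` and `predict` are hypothetical, raw counts stand in for probabilities (within a single context they give the same ranking), and the unigram weight uses the Kneser-Ney continuation idea (how many distinct words precede a word) without the discounting of a full Kneser-Ney model.

```python
from collections import Counter, defaultdict

def build_tables(tokens):
    """Build trigram, bigram and unigram lookup tables from a token list."""
    tri, bi = defaultdict(Counter), defaultdict(Counter)
    preceders = defaultdict(set)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        tri[(w1, w2)][w3] += 1
    for w1, w2 in zip(tokens, tokens[1:]):
        bi[w1][w2] += 1
        preceders[w2].add(w1)
    # Continuation weight: distinct preceding words / distinct bigram types.
    n_types = sum(len(s) for s in preceders.values())
    uni = {w: len(s) / n_types for w, s in preceders.items()}
    return tri, bi, uni

def top(scores, k):
    """Return the k best-scoring words, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]

def predict(text, dynamic, tri, bi, uni, k=3):
    """Try the dynamic table first, then back off trigram -> bigram -> unigram."""
    words = text.lower().split()
    ctx = tuple(words[-2:])
    if len(ctx) == 2 and ctx in dynamic:
        return "dynamic", top(dynamic[ctx], k)
    if len(ctx) == 2 and ctx in tri:
        return "trigram", top(tri[ctx], k)
    if words and words[-1] in bi:
        return "bigram", top(bi[words[-1]], k)
    return "unigram", top(uni, k)

# Toy data, for illustration only.
tokens = "i like green tea and i like green apples and i like black tea".split()
tri, bi, uni = build_tables(tokens)
dynamic = {("i", "like"): Counter({"coffee": 2})}

print(predict("i like", dynamic, tri, bi, uni))       # found in the dynamic table
print(predict("i like", {}, tri, bi, uni))            # falls through to trigrams
print(predict("really like", {}, tri, bi, uni))       # backs off to bigrams
print(predict("something unseen", {}, tri, bi, uni))  # backs off to unigrams
```

Returning the name of the table that matched alongside the predictions makes the backoff path easy to inspect while experimenting.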

The application has a built-in capability to learn from the user's habits of sentence construction.
This is achieved by building a separate data frame from the input text.
The input text is broken down into trigrams on the fly and stored in that separate data frame.
The frequency of each trigram is also calculated from the user input.
Predictions are based on the trigrams with the highest calculated frequency.
In the overall application flow, user input is first searched in the self-learned data frame and only then in the other data frames.
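
The self-learning step described above can be sketched in a few lines. This is a Python illustration, not the application's R code; the class name `DynamicTrigrams` and its methods are hypothetical, and it assumes simple whitespace tokenisation and lower-casing, which the application may handle differently.

```python
from collections import Counter, defaultdict

class DynamicTrigrams:
    """Trigram frequencies learned from user input (illustrative sketch)."""

    def __init__(self):
        # Maps a two-word context to a Counter of the words that followed it.
        self.freq = defaultdict(Counter)

    def learn(self, text):
        """Break the input into trigrams and update their frequencies."""
        words = text.lower().split()
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            self.freq[(w1, w2)][w3] += 1

    def predict(self, text):
        """Return the most frequent learned word after the last two words, or None."""
        ctx = tuple(text.lower().split()[-2:])
        if len(ctx) < 2 or ctx not in self.freq:
            return None  # caller would fall back to the static data frames
        return self.freq[ctx].most_common(1)[0][0]

learner = DynamicTrigrams()
learner.learn("see you next week")
learner.learn("see you next week")
learner.learn("see you next month")

print(learner.predict("talk to you next"))
print(learner.predict("good morning"))
```

A `None` result signals that the input was not found in the learned table, which is the point at which the lookup would move on to the trigram, bigram and unigram data frames.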

The application is available at https://tvjoseph.shinyapps.io/shiny8/
- Enter the text in the box labelled “Your Text Goes here”
- As the text is typed, the word currently being typed is displayed.
- Once the current word has been entered, the first, second and third predicted words are displayed.

References and Citations
- The trigram and bigram corpora were also supplemented with the following corpus: Davies, Mark. (2011). N-grams data from the Corpus of Contemporary American English (COCA). Downloaded from http://www.ngrams.info