Data Science Specialization by Johns Hopkins University
2nd January 2017
Tom Checkiewicz
Project Objective
The main purpose of the project is to use Natural Language Processing algorithms to build a text prediction application that effectively utilizes source text data sets (emails, tweets, news) to run a predictive algorithm and generate predictions in a computationally limited environment, with little or no impact on the user experience of the front-end application.
The application utilizes a discounting method and Katz back-off as its key Natural Language Processing algorithms to generate and rank the predictions.
The source data sets have been cleansed and prepared using the tm package. This included removing non-Latin and special characters, numbers, duplicate words, and punctuation, and converting all text to lower case. The corpus has also been stemmed and stripped of redundant white space.
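As an illustration, a minimal sketch of such a cleaning pipeline with tm might look as follows (the variable names are assumptions, not the project's actual code; the raw text lines are assumed to be loaded into raw_text):

    library(tm)

    # Build a corpus from the raw text lines
    corpus <- VCorpus(VectorSource(raw_text))
    # Drop non-Latin and special characters
    corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, to = "ASCII", sub = "")))
    corpus <- tm_map(corpus, content_transformer(tolower))  # remove capitalization
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)               # redundant white space
    corpus <- tm_map(corpus, stemDocument)                  # stemming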
The corpus has been tokenized into 1- to 3-grams using the RWeka package.
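A sketch of this tokenization step, assuming the cleaned corpus object from the previous step:

    library(RWeka)

    # Tokenizer producing all 1- to 3-grams
    ngram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
    tdm <- TermDocumentMatrix(corpus, control = list(tokenize = ngram_tokenizer))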
The source data set had to be sampled to reduce the time and processing power required to generate the N-gram files. The goal was to find an acceptable trade-off between sample size, user experience, predictive performance and the available run-time environment.
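For illustration only, sampling roughly 5% of the lines could look like this (the sampling rate and file name here are assumptions, not the project's actual figures):

    set.seed(1234)  # reproducible sample
    lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
    # Keep each line with probability 0.05
    sampled <- lines[rbinom(length(lines), size = 1, prob = 0.05) == 1]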
To overcome the single-core processing limitation of R, the doParallel package has been applied to enable multi-core processing and to reduce the processing time.
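A minimal sketch of enabling the multi-core backend (the core-count choice is an assumption):

    library(doParallel)

    cl <- makeCluster(detectCores() - 1)  # leave one core for the OS
    registerDoParallel(cl)
    # ... n-gram generation with foreach(...) %dopar% ... goes here ...
    stopCluster(cl)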
Application Front End
The application has been deployed on shinyapps.io servers. It has been my intention and design imperative to make the application interface as simple and intuitive as possible. It is interactive and reacts to user input in real time. Simply start typing your sentence and click one of the buttons below whenever a prediction turns out to be correct.
The applied algorithm uses the n-gram concept to transform the source text data into sequences of word occurrences, listed and ordered by frequency of appearance. The algorithm applies an interactive mechanism which reacts in real time to every single character the user types. This has been achieved by using reactive expressions in Shiny.
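A minimal sketch of the reactive wiring in Shiny (all names, including the predict_next_word() helper, are illustrative rather than the app's actual code):

    library(shiny)

    server <- function(input, output) {
      # Re-evaluates automatically on every keystroke in the text input
      predictions <- reactive({
        req(input$user_text)
        predict_next_word(input$user_text)  # hypothetical prediction helper
      })
      output$suggestions <- renderText(paste(predictions(), collapse = " | "))
    }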
Application logic
Now the question is: what happens when there is no n-gram that matches the sequence of the typed text?
The applied discounting method calculates the probability distribution of the n-grams and deducts a pre-set discount (0.5 in our case) from the individual n-gram counts. This in turn allows us to calculate the "missing probability mass": the "discounted" value that can be allocated to unseen n-grams, i.e. divided among the words whose count is zero for a given context.
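A toy example of the discounting arithmetic with a discount of 0.5 (the counts are invented for illustration):

    d <- 0.5
    # Invented bigram counts for the context word "eat"
    counts <- c("eat food" = 3, "eat lunch" = 2, "eat here" = 1)
    p_discounted <- (counts - d) / sum(counts)  # discounted probabilities
    beta <- 1 - sum(p_discounted)               # missing probability mass, here 0.25
    # beta is redistributed among words unseen after "eat" via back-off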
The above discounting method has been combined with the Katz back-off mechanism, which deals with unseen n-grams by allocating the missing probability mass to the (n-1)-gram table and generating predictions based on the word probability distributions of the (n-1)-grams. The concept is to match the word sequence against the higher-order n-grams and, if no match is found, back off to a lower-order n-gram recursively, down to the unigram with the highest probability of occurrence.
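A hedged sketch of the recursive back-off, assuming the n-gram tables are stored as a list of data frames with prefix, word and prob columns sorted by probability (a design assumption, not the project's actual data structure):

    # context: character vector of the user's last typed words
    # tables:  list of data frames, tables[[n]] holding the n-gram rows
    predict_next_word <- function(context, tables, n = 3) {
      if (n == 1) {
        return(head(tables[[1]]$word, 3))  # base case: top unigrams
      }
      prefix <- paste(tail(context, n - 1), collapse = " ")
      hits <- tables[[n]][tables[[n]]$prefix == prefix, ]
      if (nrow(hits) > 0) {
        return(head(hits$word[order(-hits$prob)], 3))
      }
      predict_next_word(context, tables, n - 1)  # back off to (n-1)-grams
    }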
The details of the applied method can be found here: https://youtu.be/hsHw9F3UuAQ
Reference documentation for Katz back-off algorithm can be found here: https://en.wikipedia.org/wiki/Katz's_back-off_model
Good-Turing discounting method description can be found here: http://www.cs.cornell.edu/courses/cs4740/2014sp/lectures/smoothing+backoff.pdf
The application can be accessed and viewed here: