Data Science Capstone Project Presentation

Data Science Specialization by Johns Hopkins University
2nd January 2017

Tom Checkiewicz

Project Objective

The main purpose of the project is to use Natural Language Processing algorithms to build an application with text prediction capability which effectively utilizes the source text data sets (emails, tweets, news) to run the predictive algorithm and generate predictions in a computationally limited environment, with no or very limited impact on the user experience of the front-end application.


Application Outline

The application utilizes a discounting method and Katz back-off as the key Natural Language Processing algorithms to generate and rank the predictions.

The source data sets have been cleansed and prepared using the tm package. This included removing non-Latin and special characters, numbers, duplicate words and punctuation, and converting all text to lower case. The corpus has also been stemmed and stripped of redundant white space.
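
A minimal sketch of this cleaning step with tm might look as follows; the object name raw_lines and the exact order of transformations are assumptions, not the project's actual code.

```r
library(tm)

# Build a corpus from the sampled raw text lines (raw_lines is assumed)
corpus <- VCorpus(VectorSource(raw_lines))

# Strip non-Latin / special characters, then apply the standard tm transformations
removeNonLatin <- content_transformer(function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = ""))
corpus <- tm_map(corpus, removeNonLatin)
corpus <- tm_map(corpus, content_transformer(tolower))  # drop capitalization
corpus <- tm_map(corpus, removeNumbers)                 # drop numbers
corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
corpus <- tm_map(corpus, stemDocument)                  # stemming
corpus <- tm_map(corpus, stripWhitespace)               # collapse redundant white space
```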

The corpus has been tokenized into 1- to 3-grams using the RWeka package.
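
Continuing from the cleaning sketch above, the 1- to 3-gram tokenization with RWeka could be done roughly as follows; the tokenizer settings are an assumption.

```r
library(RWeka)

# Tokenizer producing all 1-, 2- and 3-grams
Tokenizer_1to3 <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))

# Term-document matrix over the n-grams, reduced to sorted frequency counts
tdm   <- TermDocumentMatrix(corpus, control = list(tokenize = Tokenizer_1to3))
freqs <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(freqs)
```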

The source data set had to be sampled to reduce the time and computing power required to generate the n-gram files; the sampling rate was chosen as an acceptable trade-off between sample size, user experience, predictive performance metrics and the available run-time environment.
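
For illustration only, a sampling step along these lines could produce the raw_lines object assumed in the cleaning sketch above; the file path and the 5% rate are assumptions.

```r
set.seed(1234)

# Read one of the source files and keep a random 5% sample of its lines
twitter   <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
raw_lines <- sample(twitter, size = round(0.05 * length(twitter)))
```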
To overcome R's single-core processing limitation, the doParallel package has been used to enable multi-core processing and reduce the processing time.
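
A hedged sketch of such a doParallel setup is shown below; the chunking scheme and the build_ngrams() helper are hypothetical.

```r
library(doParallel)  # also attaches foreach and parallel

# Register a cluster using all but one core
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

# Split the sampled lines into chunks and process them in parallel;
# build_ngrams() is a hypothetical helper that cleans and tokenizes one chunk
chunks <- split(raw_lines, cut(seq_along(raw_lines), 4, labels = FALSE))
ngram_parts <- foreach(chunk = chunks, .packages = c("tm", "RWeka")) %dopar% {
  build_ngrams(chunk)
}

stopCluster(cl)
```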

Application Front End
The application has been deployed on shinyapps.io servers. It has been my intention and design imperative to make the application interface as simple and as intuitive as possible. It is interactive and reacts to the user input in real time. Simply start typing your sentence and click one of the buttons below the text box if any of the predictions turns out to be correct.
[Screenshot of the application front end]

Applied Predictive Algorithm

The applied algorithm uses the n-gram concept to transform the source text data into sets of word sequences listed and ordered by frequency of occurrence. The algorithm applies an interactive mechanism which reacts in real time to every single character typed by the user; this has been achieved with reactive expressions in shiny, as illustrated in the sketch below.
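
A minimal shiny server sketch of this reactive behaviour is shown below; the input/output IDs and the predict_next_word() function are assumptions rather than the application's actual code.

```r
library(shiny)

server <- function(input, output, session) {
  # Re-computed automatically on every keystroke in the text box
  predictions <- reactive({
    predict_next_word(input$user_text)  # hypothetical prediction function
  })

  # Render one action button per predicted word
  output$suggestions <- renderUI({
    lapply(predictions(), function(w) actionButton(inputId = paste0("btn_", w), label = w))
  })
}
```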

Application logic

  1. When a user types the first letter of the phrase, the algorithm generates unigram-based predictions which start with the typed letter.
  2. The same process takes place for each subsequent letter of the first word of the sentence until the space key is hit.
  3. Pressing the space key informs the algorithm that the user has finished the first word and switches the source of the predictions to the bigram file. Once the first word is known, the algorithm looks for the most frequent bigrams starting with that word.
  4. When the user types the first letter of the second word, the algorithm identifies the most frequent bigrams which start with the first word and whose second word starts with the letter(s) typed so far, until the space key is hit again.
  5. Pressing the space key after the second word switches the source of the predictions to the trigram file. Once the first and second words are known, the algorithm looks for the most frequent trigrams starting with those two words.
  6. When the user types the first letter of the third word, the algorithm identifies the most frequent trigrams which start with the first and second words and whose third word starts with the letter(s) typed so far, until the space key is hit.
  7. The application has a mechanism which limits the n-gram processing to three words: if the sentence typed by the user contains more than three words, only the last three are used for n-gram processing (see the sketch after this list).
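
The step-by-step behaviour above can be summarised in the simplified sketch below; the n-gram data frames (uni_df, bi_df, tri_df) and their column names (w1, w2, pred, freq) are assumptions about how the n-gram files are stored.

```r
# Return the most frequent candidates for the text typed so far.
predict_candidates <- function(typed, uni_df, bi_df, tri_df, top_n = 3) {
  finished <- grepl("\\s$", typed)                    # did the user just hit space?
  words    <- unlist(strsplit(trimws(tolower(typed)), "\\s+"))
  if (length(words) == 0) return(character(0))        # nothing typed yet
  words    <- tail(words, 3)                          # keep at most the last three words
  prefix   <- if (finished) "" else tail(words, 1)    # partially typed word, if any
  context  <- if (finished) words else head(words, -1)

  if (length(context) >= 2) {          # two known words -> trigram file
    hits <- subset(tri_df, w1 == context[length(context) - 1] &
                           w2 == context[length(context)] &
                           startsWith(pred, prefix))
  } else if (length(context) == 1) {   # one known word -> bigram file
    hits <- subset(bi_df, w1 == context & startsWith(pred, prefix))
  } else {                             # nothing finished yet -> unigram file
    hits <- subset(uni_df, startsWith(pred, prefix))
  }
  head(hits[order(-hits$freq), "pred"], top_n)
}
```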

Now the question is: what happens when there is no n-gram that matches the sequence of the typed text?

Discounting Method and Katz back-off mechanism

  • The applied discounting method calculates the probability distribution of the n-grams and deducts a pre-set value (0.5 in our case) from the individual n-gram counts. The "discounted" mass is then collected as the "Missing Probability Mass" parameter, which can be divided among the words whose count is zero for the given context word.

  • The above discounting method has been combined with the Katz back-off mechanism, which deals with unseen n-grams by allocating the missing probability mass to the (n-1)-gram table and generating the predictions based on the word probability distributions of the (n-1)-grams. The concept is to match the word sequence against the higher-order n-grams and, if no match is found, back off recursively to lower-order n-grams, down to the unigram level, picking the words with the highest probability of occurrence. A simplified sketch of this fallback chain is given after this list.
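
The sketch below reuses the assumed n-gram tables from the earlier sketch; the 0.5 discount is applied to the counts, and the normalising alpha weight that the full Katz estimator uses to redistribute the missing probability mass is omitted for brevity.

```r
discount <- 0.5  # pre-set discount deducted from each observed n-gram count

# Predict the next word given the last two typed words, backing off from
# trigrams to bigrams to the most frequent unigrams when no match is found.
backoff_predict <- function(prev2, prev1, uni_df, bi_df, tri_df, top_n = 3) {
  tri_hits <- subset(tri_df, w1 == prev2 & w2 == prev1)
  if (nrow(tri_hits) > 0) {
    tri_hits$score <- (tri_hits$freq - discount) / sum(tri_hits$freq)
    return(head(tri_hits[order(-tri_hits$score), "pred"], top_n))
  }
  bi_hits <- subset(bi_df, w1 == prev1)               # back off to bigrams
  if (nrow(bi_hits) > 0) {
    bi_hits$score <- (bi_hits$freq - discount) / sum(bi_hits$freq)
    return(head(bi_hits[order(-bi_hits$score), "pred"], top_n))
  }
  head(uni_df[order(-uni_df$freq), "pred"], top_n)    # last resort: top unigrams
}
```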

The details of the applied method can be found here: https://youtu.be/hsHw9F3UuAQ

Reference documentation for Katz back-off algorithm can be found here: https://en.wikipedia.org/wiki/Katz's_back-off_model

Good-Turing discounting method description can be found here: http://www.cs.cornell.edu/courses/cs4740/2014sp/lectures/smoothing+backoff.pdf

The application can be accessed and viewed here:

https://oli2.shinyapps.io/Predict_Next_Word/

Conclusions

  1. The overall UX and predictive performance of the application satisfy the project requirements.
  2. Before the n-gram model was applied, other models such as Naive Bayes and Maximum Entropy were assessed, but the application performance was not satisfactory.
  3. The prediction performance is limited by the inability to process larger input data sets on a single-machine configuration.
  4. The data cleaning and preparation process must be well thought through (the necessary transformations and their sequence).
  5. The current implementation could be extended with a grammar/spell-checker and self-learning modules; however, this is out of scope for this project.
