DS Capstone Project Presentation

Jitender Kumar
2 July 2016

Executive Summary and Objective

This presentation is part of the Coursera Data Science Specialization. The task is to apply Natural Language Processing (NLP) concepts to predict the next word a user will type.

For the capstone course we were given SwiftKey data sets, and a Shiny application has been developed using NLP concepts and techniques.

The objective of this presentation is to demonstrate:

  • Model and Steps
  • Methodology
  • Architecture and Flow
  • How to use the application

Model and Steps

A four n-gram dictionary model, together with Stupid Backoff smoothing, is used to calculate the probability of the next word. The main steps are:

  • A data sample from the HC Corpora is tokenized into quadgrams, trigrams, bigrams and unigrams; the n-grams and their frequencies are stored in data frames (see the sketch after this list).
  • The input string is searched through the n-gram tables; if there is no match at order n, the model backs off to order n-1.
  • If a match is found, a fixed discount is applied to the probability of occurrence.
  • If no gram matches at all, the top words from the unigram table are returned.
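As a rough illustration of the first step, here is a minimal sketch in R of building n-gram frequency tables from cleaned text. The function name `build_ngrams` and the exact tokenization are assumptions for illustration, not the application's actual code.

```r
# Minimal sketch (assumed helper): build an n-gram frequency table
# from a character vector of cleaned sentences.
build_ngrams <- function(sentences, n) {
  grams <- unlist(lapply(strsplit(tolower(sentences), "\\s+"), function(w) {
    if (length(w) < n) return(character(0))
    # slide a window of n words across the sentence
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  # frequency table as a data frame, most frequent first
  df <- as.data.frame(table(grams), stringsAsFactors = FALSE)
  names(df) <- c("ngram", "freq")
  df[order(-df$freq), ]
}

sample_text <- c("this is a test sentence", "this is another test")
head(build_ngrams(sample_text, 2))
```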

A few points:

  • Katz suggested discounting the probability when backing off to lower-order grams.
  • Good-Turing discounting is a good method for estimating the discount; I plan to employ it in a later version.
  • Kneser-Ney smoothing was also considered, but it is slower to respond and its overall prediction was not better than the backoff method.

Methodology

The cleaned and processed input string is searched through the grams, starting with the quadgram (a sketch of an assumed cleaning step follows). If the next word is found in the quadgram, trigram or bigram, then: probability of the predicted next word = discount × occurrence of the word in the subset found.
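The deck does not show the cleaning step itself; the following is a minimal sketch of one plausible preprocessing routine in R. The function name `clean_input` and the specific regular expressions are assumptions, not the application's actual code.

```r
# Minimal sketch (assumed): lower-case the input, strip digits and
# punctuation except apostrophes, and split into a vector of words.
clean_input <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)
  x <- gsub("\\s+", " ", trimws(x))
  strsplit(x, " ")[[1]]
}

clean_input("Hello, World!! It's 2016...")
#> [1] "hello" "world" "it's"
```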

Discount: fixed for now, but Good-Turing discounting could be estimated and used instead.

Gram-wise: quadgram 0.4, trigram 0.3, bigram 0.2, unigram 0.1.

Word frequency: it is calculated dynamically, using the following formula:

freq(word) / sum(freq of all words in the subset found)

For example, if the matched bigram subset contains the candidate word 6 times out of 10 occurrences in total, the predicted probability is 0.2 × 6/10 = 0.12.

There are various approaches, but I found the one above simple to implement, and it gives results with very good accuracy. A minimal sketch of the backoff prediction logic follows.
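Below is a hedged sketch in R of the backoff prediction described above. It assumes the gram tables are a list of data frames indexed by order, each with columns `ngram` and `freq`; the function name `predict_next` and that layout are illustrative, not the application's actual code, while the discounts come from the values listed earlier.

```r
predict_next <- function(input_words, tables,
                         discounts = c(0.1, 0.2, 0.3, 0.4)) {
  for (n in 4:2) {
    if (length(input_words) < n - 1) next
    prefix <- paste(tail(input_words, n - 1), collapse = " ")
    tbl <- tables[[n]]
    # keep rows whose first n-1 words match the end of the input
    hits <- tbl[startsWith(tbl$ngram, paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      # discount * relative frequency within the matched subset
      hits$prob <- discounts[n] * hits$freq / sum(hits$freq)
      hits$word <- sub(".*\\s", "", hits$ngram)  # last word is the prediction
      return(head(hits[order(-hits$prob), c("word", "prob")], 3))
    }
  }
  # no match in any higher-order gram: fall back to the top unigrams
  uni <- tables[[1]]
  uni$prob <- discounts[1] * uni$freq / sum(uni$freq)
  uni$word <- uni$ngram
  head(uni[order(-uni$prob), c("word", "prob")], 3)
}
```

Together with the earlier `build_ngrams` sketch, `tables <- lapply(1:4, function(n) build_ngrams(corpus, n))` would produce inputs of this shape.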

App Data flow and diagram

The following diagram depicts the application architecture and flow:

[Diagram: application architecture and data flow]

The grams have a structure like the one below:

[Diagram: n-gram data frame structure]
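In case the image does not display, here is an illustrative example (dummy values, assumed column names) of the data-frame shape used for the grams:

```r
# Illustrative only: assumed shape of a stored gram table (dummy values).
bigram_df <- data.frame(
  ngram = c("of the", "in the", "to be"),
  freq  = c(3201, 2988, 1754),
  stringsAsFactors = FALSE
)
str(bigram_df)
#> 'data.frame': 3 obs. of 2 variables:
#>  $ ngram: chr  "of the" "in the" "to be"
#>  $ freq : num  3201 2988 1754
```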

Links and Future improvements

The application is hosted at

https://kjitender.shinyapps.io/Word_Prediction_App/

The following improvements are planned for the next version of the application:

  • Apply Kneser-Ney smoothing and compare the results with backoff smoothing.
  • Ability to select a word from the predicted output and append it to the input string.

I hope you like the application and the methodology.

All suggestions and feedback are welcome at kumar[dot]jitender[at]gmail.com