Apoorv Saxena
The Capstone Project of the specialization has us working with text corpora from blogs, Twitter, and the news, and we're asked to build a predictive model for text input.
While reading about different n-gram and back-off models, I came across Kneser-Ney Smoothing and Katz's Back-off Model and thought that they would be a decent place to start my foray into the world of NLP.
Katz's Back-off Model is a generative model used in language modeling to estimate the conditional probability of a word given its history, i.e. the previous few words. When the higher-order n-gram has not been observed, the model "backs off" to a lower-order n-gram and scales that estimate by a back-off weight. One side effect is that probability estimates can change suddenly on adding more data, because the back-off algorithm may then select a different order of n-gram model on which to base the estimate.
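The app itself is built with Shiny, but the core back-off idea is easy to sketch. The snippet below is only an illustrative Python sketch, not my actual implementation: it uses a single fixed discount in place of the Good-Turing discounts of the full Katz formulation, it does not renormalize the back-off weight over unseen words, and names like `build_ngram_counts` and `backoff_prob` are placeholders.

```python
from collections import Counter

def build_ngram_counts(tokens, max_order=3):
    """Count every n-gram up to max_order, keyed by a tuple of words."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def backoff_prob(word, history, counts, total_tokens, discount=0.5):
    """Back-off probability of `word` given `history` (a list of words).

    If the full n-gram was seen, return its discounted relative frequency;
    otherwise recurse on a shorter history, scaled by the probability mass
    that the discount set aside for unseen continuations.
    """
    if not history:
        # Base case: unigram relative frequency (zero for unseen words).
        return counts[(word,)] / total_tokens

    hist = tuple(history)
    ngram = hist + (word,)
    if counts[ngram] > 0:
        # Seen n-gram: discounted maximum-likelihood estimate.
        return (counts[ngram] - discount) / counts[hist]

    if counts[hist] == 0:
        # The history itself was never seen: back off with no scaling.
        return backoff_prob(word, history[1:], counts, total_tokens, discount)

    # Mass freed up by discounting every observed continuation of `hist`.
    seen_continuations = sum(1 for g in counts
                             if len(g) == len(hist) + 1 and g[:-1] == hist)
    alpha = discount * seen_continuations / counts[hist]
    return alpha * backoff_prob(word, history[1:], counts, total_tokens, discount)

tokens = "the cat sat on the mat the cat lay on the rug".split()
counts = build_ngram_counts(tokens)
print(backoff_prob("mat", ["on", "the"], counts, len(tokens)))  # seen trigram
print(backoff_prob("cat", ["on", "the"], counts, len(tokens)))  # backs off to the bigram "the cat"
```

The second call shows the behaviour described above: "on the cat" was never seen, so the estimate comes from the bigram "the cat", scaled by the back-off weight.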
Due to limited compute, I could only train on about 15,000 lines from each of the sources, resulting in rather low accuracy.
The model was implemented as a Shiny web app.
There is further scope for improvement in speed, and in applying a smoothing technique such as Good-Turing estimation.
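For reference, the basic Good-Turing idea is to replace a raw count c with the adjusted count c* = (c + 1) * N(c+1) / N(c), where N(c) is the number of distinct n-grams observed exactly c times. A rough sketch of that adjustment (again illustrative Python with a hypothetical function name, and without the frequency-of-frequency smoothing that a real Simple Good-Turing implementation would need) might look like this:

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts):
    """Replace each raw count c with c* = (c + 1) * N(c+1) / N(c),
    where N(c) is how many distinct n-grams were seen exactly c times.
    Counts with no observed N(c+1) are left unadjusted in this sketch."""
    freq_of_freq = Counter(ngram_counts.values())
    adjusted = {}
    for ngram, c in ngram_counts.items():
        if freq_of_freq.get(c + 1, 0) > 0:
            adjusted[ngram] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            adjusted[ngram] = float(c)
    return adjusted
```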
You can access my web app here.