Aman Bhagat
05/10/2020
This project aims to build a product that predicts the next word as the user types.
In this capstone we apply data science to the area of natural language processing.
The language model is trained on a small sample of a corpus scraped from news articles, Twitter, and blogs.
I used a maximum likelihood estimator with Kneser-Ney smoothing for the prediction.
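As a sketch of that pipeline (the file names, sampling rate, and tokenizer below are illustrative assumptions, not the project's actual setup), the sampled corpus can be reduced to unigram and bigram counts:

```python
import random
import re
from collections import Counter

def sample_lines(path, rate=0.05, seed=42):
    """Keep roughly `rate` of the lines in a corpus file."""
    rng = random.Random(seed)
    with open(path, encoding="utf-8") as f:
        return [line for line in f if rng.random() < rate]

def tokenize(line):
    """Lowercase and keep simple word tokens."""
    return re.findall(r"[a-z']+", line.lower())

# Illustrative file names standing in for the news, Twitter, and blog sources.
lines = []
for path in ["news.txt", "twitter.txt", "blogs.txt"]:
    lines += sample_lines(path)

unigrams, bigrams = Counter(), Counter()
for line in lines:
    tokens = tokenize(line)
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
```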
Kneser-Ney smoothing is an algorithm that adjusts n-gram weights (through discounting) using the continuation counts of lower-order n-grams.
Consider the sentence:

I can't see without my reading ___.

A model based on raw frequency would present "Francisco" as the suggested ending, because "Francisco" appears more often than "glasses" in the training text.
However, even though "Francisco" appears more often than "glasses", "Francisco" rarely occurs outside of the context of "San Francisco". Thus, instead of counting how often a word appears, the Kneser-Ney algorithm counts how many distinct bigram types a word completes (e.g., "prescription glasses", "reading glasses", "small glasses" vs. only "San Francisco").
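To make this concrete, here is a minimal sketch (with toy counts, not real corpus data) contrasting a raw count with a continuation count:

```python
from collections import Counter

# Toy bigram counts; in the real model these come from the sampled corpus.
bigrams = Counter({
    ("san", "francisco"): 120,      # "francisco" is frequent overall...
    ("reading", "glasses"): 4,      # ...but "glasses" follows many contexts
    ("prescription", "glasses"): 3,
    ("small", "glasses"): 2,
})

def raw_count(word):
    """Total occurrences of `word` as a bigram completion."""
    return sum(c for (_, w2), c in bigrams.items() if w2 == word)

def continuation_count(word):
    """Number of distinct bigram types that `word` completes."""
    return sum(1 for (_, w2) in bigrams if w2 == word)

print(raw_count("francisco"), continuation_count("francisco"))  # 120 1
print(raw_count("glasses"), continuation_count("glasses"))      # 9 3
```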
The Kneser-Ney general formula for a bigram model is:
\[ P_{KN}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1} w')} + \lambda(w_{i-1})\; P_{\text{cont}}(w_i) \]
where \( \delta \) is a fixed discount, \( \lambda(w_{i-1}) = \dfrac{\delta}{\sum_{w'} c(w_{i-1} w')} \bigl|\{ w' : c(w_{i-1} w') > 0 \}\bigr| \) normalizes the probability mass freed by discounting, and \( P_{\text{cont}}(w_i) \) is the continuation probability, i.e., the fraction of distinct bigram types that end in \( w_i \).
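A minimal sketch of this bigram formula (assuming bigram counts stored in a `collections.Counter` as above; the discount value and helper names are illustrative, and the project's actual model recurses over higher-order n-grams):

```python
from collections import Counter

def continuation_prob(w, bigrams):
    """P_cont(w): fraction of distinct bigram types that end in w."""
    types_ending_in_w = sum(1 for (_, w2) in bigrams if w2 == w)
    return types_ending_in_w / len(bigrams)

def kn_bigram_prob(w_prev, w, bigrams, delta=0.75):
    """Interpolated Kneser-Ney probability P(w | w_prev) for a bigram model."""
    prev_total = sum(c for (w1, _), c in bigrams.items() if w1 == w_prev)
    if prev_total == 0:
        return continuation_prob(w, bigrams)  # unseen context: back off fully

    # Discounted maximum-likelihood term.
    discounted = max(bigrams[(w_prev, w)] - delta, 0) / prev_total

    # lambda(w_prev): mass freed by discounting, spread over the
    # distinct words observed after w_prev.
    follower_types = sum(1 for (w1, _) in bigrams if w1 == w_prev)
    lam = (delta / prev_total) * follower_types

    return discounted + lam * continuation_prob(w, bigrams)
```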
Here are the steps to use the application:
This was a significant educational experience in handling and processing large textual data. There is still a lot of work to be done in optimizing the model's accuracy and execution time. This was my simple take on the Kneser-Ney algorithm in a fully recursive form, and I learned how to explore algorithms to optimize predictive power.