Text Prediction App

Aman Bhagat
05/10/2020

Overview

  • This project aims to build a product that predicts the next word as the user types.

  • In this capstone, we apply data science to the area of natural language processing.

  • The language model is built on a small percentage of a corpus scraped from news, Twitter, and blogs.

  • I have used a maximum likelihood estimator with Kneser-Ney smoothing for the prediction.

Language Modelling

  • Kneser-Ney smoothing is an algorithm that adjusts the weights (through discounting) by using the continuation counts of lower-order n-grams.

  • In the sentence below, “Francisco” is presented as the suggested ending because it appears more often than “glasses” in some text.

I can't see without my reading __Francisco__

  • However, even though “Francisco” appears more often than “glasses”, “Francisco” rarely occurs outside of the context of “San Francisco”. Thus, instead of observing how often a word appears, the Kneser-Ney algorithm takes into account how often a word completes a bigram type (e.g., “prescription glasses”, “reading glasses”, “small glasses” vs. “San Francisco”).

  • The general Kneser-Ney formula for a bigram model is:

\[ P_{abs}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1} w')} + \alpha\; P_{abs}(w_i) \]

  • Here \( \delta \) refers to a fixed discount value, and \( \alpha \) is a normalizing constant.
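The difference between raw frequency and continuation count can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy corpus, the function names, and the discount of 0.75. It also assumes the common Kneser-Ney choices of taking the lower-order term to be the continuation probability and setting \( \alpha \) to \( \delta \) times the number of distinct words following the previous word, divided by that word's total bigram count.

```python
from collections import Counter

# Toy corpus (hypothetical): "francisco" is frequent but always follows "san".
corpus = [
    "we flew to san francisco",
    "san francisco is foggy",
    "i love san francisco",
    "fog rolls over san francisco",
    "she lost her reading glasses",
    "he wears prescription glasses",
    "those are small glasses",
]
tokens = [sentence.split() for sentence in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))


def continuation_count(word):
    """Number of distinct words that precede `word` in the corpus."""
    return sum(1 for (_, w) in bigrams if w == word)


def p_continuation(word):
    """Share of all distinct bigram types that end in `word`."""
    return continuation_count(word) / len(bigrams)


def p_kn_bigram(word, prev, delta=0.75):
    """Bigram probability with a fixed discount and a continuation back-off."""
    context_total = sum(c for (p, _), c in bigrams.items() if p == prev)
    if context_total == 0:
        # Unseen previous word: rely on the continuation probability alone.
        return p_continuation(word)
    followers = sum(1 for (p, _) in bigrams if p == prev)
    alpha = delta * followers / context_total  # mass freed by discounting
    discounted = max(bigrams[(prev, word)] - delta, 0) / context_total
    return discounted + alpha * p_continuation(word)


for w in ("francisco", "glasses"):
    print(w, unigrams[w], continuation_count(w), p_kn_bigram(w, "reading"))
```

In this toy corpus "francisco" occurs more often than "glasses", yet it completes only one bigram type, so the continuation term favours "glasses" after "reading".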

Application Workflow

  • The application first generates a candidate list using the bigram model and the last word entered by the user.
  • It then takes the candidate words one at a time and calculates each word's probability in combination with the input string.
  • The application calls the PKN() function to calculate the probability for the highest order, then calls PKN() again for the next lower order; the recursion terminates when the number of words reaches 1.
  • This gives the best approximation of the probability from each of the n-gram models.
  • Finally, it returns a dataframe of words and probabilities in decreasing order of probability.
  • The top 5 words are then displayed in the UI, and the rest can be shown as a word cloud if the user wants.
  • The n-gram Kneser-Ney formula is shown below and can be used as a reference for the workflow of the application.
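For reference, the interpolated n-gram Kneser-Ney recursion is commonly stated in the following form (given here as the standard textbook form, not taken from the application's source):

\[ P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \dfrac{\max(c_{KN}(w_{i-n+1}^{i}) - \delta, 0)}{\sum_{w'} c_{KN}(w_{i-n+1}^{i-1} w')} + \lambda(w_{i-n+1}^{i-1})\; P_{KN}(w_i \mid w_{i-n+2}^{i-1}) \]

Here \( c_{KN} \) is the raw count at the highest order and the continuation count at lower orders, and \( \lambda \) is \( \delta \) times the number of distinct words seen after the context, divided by the context's total count.

The workflow above can be sketched in Python under stated assumptions. The application's own PKN() code is not reproduced in this document, so the class `KneserNeyPredictor`, the methods `pkn` and `predict`, the trigram order, the discount of 0.75, and the toy corpus are all hypothetical, and a sorted list of tuples stands in for the dataframe.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def context_stats(table):
    """Map each context to [total count, number of distinct next words]."""
    stats = defaultdict(lambda: [0, 0])
    for gram, count in table.items():
        entry = stats[gram[:-1]]
        entry[0] += count
        entry[1] += 1
    return stats


class KneserNeyPredictor:
    """Hypothetical sketch; assumes order >= 2 and a non-empty corpus."""

    def __init__(self, sentences, order=3, delta=0.75):
        self.order, self.delta = order, delta
        tokenized = [s.lower().split() for s in sentences]
        # Raw counts, used at the highest order of a query.
        self.raw = {n: Counter(g for t in tokenized for g in ngrams(t, n))
                    for n in range(1, order + 1)}
        # Continuation counts, used at lower orders: the number of distinct
        # words that precede each n-gram in the corpus.
        self.cont = {n: Counter(g[1:] for g in self.raw[n + 1])
                     for n in range(1, order)}
        self.raw_stats = {n: context_stats(t) for n, t in self.raw.items()}
        self.cont_stats = {n: context_stats(t) for n, t in self.cont.items()}
        # Bigram followers, used to generate candidates from the last word.
        self.followers = defaultdict(set)
        for prev, word in self.raw[2]:
            self.followers[prev].add(word)

    def pkn(self, word, context, top=True):
        """Recursive probability of `word` after the tuple `context`."""
        if not context:
            # The recursion ends at the unigram continuation probability.
            return self.cont[1][(word,)] / len(self.raw[2])
        n = len(context) + 1
        table = self.raw if top else self.cont
        stats = self.raw_stats if top else self.cont_stats
        entry = stats[n].get(context)
        if entry is None:
            # Unseen context: all probability mass goes to the lower order.
            return self.pkn(word, context[1:], top=False)
        total, types = entry
        discounted = max(table[n][context + (word,)] - self.delta, 0) / total
        lam = self.delta * types / total
        return discounted + lam * self.pkn(word, context[1:], top=False)

    def predict(self, text, k=5):
        """Return up to k (word, probability) pairs, most probable first."""
        tokens = text.lower().split()
        if not tokens:
            return []
        context = tuple(tokens[-(self.order - 1):])
        candidates = self.followers.get(tokens[-1], set())
        scored = [(w, self.pkn(w, context)) for w in candidates]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


sentences = [
    "i can not see without my reading glasses",
    "she lost her reading glasses again",
    "he wears prescription glasses",
    "we flew to san francisco",
    "san francisco is foggy",
    "i love san francisco",
    "my reading list is long",
]
model = KneserNeyPredictor(sentences)
print(model.predict("i can not see without my reading"))
```

Passing `top=False` on every recursive call is what switches the lower orders from raw counts to continuation counts, which keeps the probabilities over the vocabulary summing to 1 for a given context.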

[Figure: Application UI]

Shiny Application

Here are the steps to use the application:

  1. Enter your sentence in the input box.
  2. Press the Predict button to get the top 5 predicted words.
  3. Wait a few seconds for the list of next words to appear.
  4. Click the Show Wordcloud button to see a word cloud of possible words.

Conclusion

This was a significant educational experience in handling and processing large textual data. A lot of work remains to be done in optimising the model's accuracy and execution time. This was my simple take on the Kneser-Ney algorithm, implemented in a fully recursive manner. I learned how to explore algorithms to optimize predictive power.
