Text Prediction Model: Katz Backoff With Good-Turing Discounting

Lee Gang
25 March 2020

Executive Summary

Objective:

To create a Proof of Concept (PoC) for a Text Prediction Model with: (i) small package size; (ii) fast prediction speed; and (iii) accurate predictions.

Results

  • Successfully built a text prediction model using Katz Backoff with Good-Turing Discounting;
  • N-grams up to tri-grams are used;
  • Achieved a small size (32MB), fast prediction speed (average 1.24 seconds per prediction), and fairly accurate predictions (perplexity = 98.04; see the sketch after this list).
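For context, perplexity is the inverse geometric mean of the probabilities the model assigns to the words of a held-out test text; lower is better. A minimal sketch of the calculation, where the probability values are placeholders rather than the actual evaluation data:

    import math

    def perplexity(word_probabilities):
        # Perplexity is exp of the average negative log-probability:
        # PP = exp(-(1/N) * sum(log p_i)); lower means the model is less "surprised".
        n = len(word_probabilities)
        log_sum = sum(math.log(p) for p in word_probabilities)
        return math.exp(-log_sum / n)

    # Placeholder probabilities assigned by the model to each word of a test text.
    print(round(perplexity([0.01, 0.02, 0.005, 0.015]), 2))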

How The Model Works

Process Flow

  • Given a series of text as input, retrieve the last two words
  • Search for all records with these last two words in the tri-gram frequency table
  • Calculate the probability for each candidate last word in the tri-gram frequency table (a sketch of these lookup and discounting steps follows this list)
  • Search for all records with the last word in the bi-gram frequency table
  • Calculate the probability for each candidate last word in the bi-gram frequency table
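A minimal sketch of the tri-gram lookup and Good-Turing discounting steps above, assuming a simple in-memory frequency table; the toy counts, table layout, and function names are illustrative and not the model's actual implementation:

    from collections import Counter

    # Toy tri-gram frequency table: {(w1, w2, w3): count}.
    trigram_counts = {
        ("i", "want", "to"): 2,
        ("i", "want", "a"): 1,
        ("i", "want", "some"): 1,
        ("you", "want", "it"): 1,
    }

    def good_turing_discount(counts):
        # Good-Turing: replace each raw count c with c* = (c + 1) * N_{c+1} / N_c,
        # where N_c is the number of n-grams seen exactly c times. When N_{c+1}
        # is zero (common for large c on small tables), keep the raw count.
        freq_of_freqs = Counter(counts.values())
        return {ng: (c + 1) * freq_of_freqs[c + 1] / freq_of_freqs[c]
                    if freq_of_freqs[c + 1] else c
                for ng, c in counts.items()}

    def trigram_probabilities(last_two_words, counts):
        # Keep only the records whose first two words match the input, then turn
        # discounted counts into conditional probabilities of the last word.
        matches = {ng: c for ng, c in counts.items() if ng[:2] == last_two_words}
        discounted = good_turing_discount(counts)
        total = sum(matches.values())  # raw count of the two-word prefix
        probs = {ng[2]: discounted[ng] / total for ng in matches}
        # The mass not assigned to observed words is "left over" for backoff.
        leftover = 1.0 - sum(probs.values())
        return probs, leftover

    print(trigram_probabilities(("i", "want"), trigram_counts))

The same lookup applies to the bi-gram table, using only the last word as the prefix.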

How The Model Works (Cont'd)

Process Flow (Cont'd)

  • Multiply the probability from the bi-gram frequency table by the leftover probability from the tri-gram frequency table to obtain the estimated probability
  • Calculate the probability of the mono-grams
  • Multiply the mono-gram probability by the leftover probabilities from the bi-gram and tri-gram frequency tables to obtain the estimated probability
  • Combine all the candidate last words with their estimated probabilities
  • Return the top 3 words with the highest estimated probabilities as the prediction (a sketch of this combination step follows this list).
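A minimal sketch of the combination and ranking steps above, assuming the discounted conditional probabilities and the leftover masses have already been computed as in the previous sketch; the numbers below are illustrative placeholders only:

    def combine_backoff(tri_probs, bi_probs, uni_probs, beta_tri, beta_bi, top_n=3):
        # Start from the tri-gram candidates; the higher-order estimate always wins.
        combined = dict(tri_probs)
        # Words only seen at the bi-gram level get the tri-gram leftover mass.
        for word, p in bi_probs.items():
            combined.setdefault(word, beta_tri * p)
        # Words only seen at the mono-gram level get both leftover masses.
        for word, p in uni_probs.items():
            combined.setdefault(word, beta_tri * beta_bi * p)
        # Return the top-N candidates by estimated probability.
        return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

    # Illustrative values only: discounted P(w | last two words), P(w | last word),
    # P(w), and the leftover masses from the tri-gram and bi-gram tables.
    tri_probs = {"to": 0.50, "a": 0.20}
    bi_probs  = {"to": 0.45, "a": 0.25, "some": 0.10}
    uni_probs = {"the": 0.06, "to": 0.04, "a": 0.03}
    beta_tri, beta_bi = 0.30, 0.20

    print(combine_backoff(tri_probs, bi_probs, uni_probs, beta_tri, beta_bi))

Keeping the higher-order estimate whenever it exists and only backing off for unseen words is what distinguishes Katz backoff from simple interpolation of the three models.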

How To Use The App

Link To App

Text Prediction Model Demo Application

How To Use

  • Open the application via the above URL.
  • Enter any words or sentences into the input box.
  • Click on the “Predict Next Word” button.
  • A table will be displayed, showing the top 3 words predicted, with their corresponding estimated probabilities.
  • The total prediction time will also be displayed.

Future Work And Deployments

Depending on the specific purpose of the text prediction model, the training of the model can be adjusted by weighting each of the different sources of the corpus, with higher weights given to corpora similar to the intended use (e.g. a contract/legal corpus for text prediction in contract/legal documents).
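A minimal sketch of how such weighting could be applied when building the frequency tables; the corpus names, weights, and counts are assumptions for illustration only:

    from collections import Counter

    # Placeholder corpus names, weights, and counts; in practice the weights
    # would be tuned for the target domain.
    corpus_weights = {"legal": 3.0, "news": 1.0, "blogs": 0.5}
    corpus_trigram_counts = {
        "legal": Counter({("hereinafter", "referred", "to"): 40}),
        "news":  Counter({("according", "to", "the"): 120}),
        "blogs": Counter({("i", "want", "to"): 90}),
    }

    def weighted_counts(per_corpus_counts, weights):
        # Scale each corpus's counts by its weight before merging, so corpora
        # closer to the intended use dominate the combined frequency tables.
        merged = Counter()
        for corpus, counts in per_corpus_counts.items():
            for ngram, count in counts.items():
                merged[ngram] += weights[corpus] * count
        return merged

    print(weighted_counts(corpus_trigram_counts, corpus_weights).most_common(3))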

User-specific learning can also be included if the model is deployed on a local device (e.g. a mobile phone). Text input from the device can be processed, split into the respective n-grams, and added to the existing n-gram tables. Over time, the user's language and word preferences will be captured by the algorithm, providing more accurate and relevant predictions.
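A minimal sketch of such an on-device update, assuming the n-gram tables are held as simple counters; the function and table names are hypothetical:

    from collections import Counter

    # Existing tables (toy values); on a device these would be the shipped tables.
    trigram_counts = Counter({("i", "want", "to"): 4})
    bigram_counts = Counter({("want", "to"): 6})

    def update_ngram_table(text, counts, n):
        # Split the user's input into words and add every n-gram to the table.
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1

    # Each new input from the user nudges the tables towards their own phrasing.
    update_ngram_table("I want to eat dinner now", trigram_counts, n=3)
    update_ngram_table("I want to eat dinner now", bigram_counts, n=2)
    print(trigram_counts.most_common(3))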