WordPredict

Raj Maddali
Sep 12, 2017

Executive Summary

  • WordPredict is a small-footprint word predictor.
  • Three distinct corpora were used (blogs, Twitter, news).
  • Currently English only; however, the approach can be adapted to other languages and corpora.
  • WordPredict is built on the Shiny platform.
  • Enjoy!

WordPredict - Preparation

  • The simple GUI overlays a complex framework of preprocessed data.
  • Corpora are first processed with quanteda to produce n-grams (n = 1, 2, 3, 4).
  • N-grams are hashed to reduce the memory footprint.
  • Low-frequency n-grams are excluded for performance.
  • Modified Kneser-Ney probabilities are computed on the compressed n-gram tables.
  • Reference: Stanley Chen and Joshua Goodman, "An empirical study of smoothing techniques for language modeling," Computer Speech & Language, 1999.
  • Final predictions are unhashed back to their English words for presentation.
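
The preparation steps above can be sketched as follows. This is a minimal illustration in Python, not the app's quanteda-based code: the function names, the whitespace tokenizer, the `min_count` threshold, and the use of sequential integer ids as a stand-in for the hashing step are all assumptions made for the example.

```python
from collections import Counter

def ngrams(ids, n):
    """Return all n-grams (as tuples) from a sequence of word ids."""
    return [tuple(ids[i:i + n]) for i in range(len(ids) - n + 1)]

def build_tables(sentences, max_n=4, min_count=2):
    """Count 1- to max_n-grams over integer word ids and drop rare ones."""
    vocab = {}  # word -> integer id (assumed stand-in for the hashing step)
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive tokenizer, for illustration only
        ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
        for n in range(1, max_n + 1):
            counts[n].update(ngrams(ids, n))
    # Exclude low-frequency n-grams (threshold is a hypothetical choice).
    pruned = {n: {g: c for g, c in table.items() if c >= min_count}
              for n, table in counts.items()}
    id_to_word = {i: w for w, i in vocab.items()}  # used to "unhash" predictions
    return pruned, vocab, id_to_word

sentences = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "a dog sat on the mat",
]
tables, vocab, id_to_word = build_tables(sentences)
```

Storing tuples of small integers instead of repeated strings is one way to shrink the tables; `id_to_word` maps a predicted id back to its word for display.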

WordPredict - Architecture & Prediction

  • The Kneser-Ney algorithm is used to compute next-word predictions.
  • \[ P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(N_{1+}(* w_{i-n+1}^{i}) - D,\ 0)}{N_{1+}(* w_{i-n+1}^{i-1} *)} \] \[ + \frac{D}{N_{1+}(* w_{i-n+1}^{i-1} *)} \, N_{1+}(w_{i-n+1}^{i-1} *) \, P_{KN}(w_i \mid w_{i-n+2}^{i-1}) \]
  • where D is the discount and
  • \[ N_{1+}(* w_{i-n+1}^{i}) = |\{w_{i-n} : C(w_{i-n}^{i}) > 0\}| \]
  • \[ N_{1+}(* w_{i-n+1}^{i-1} *) = |\{(w_{i-n}, w_{i}) : C(w_{i-n}^{i}) > 0\}| = \sum_{w_i} N_{1+}(* w_{i-n+1}^{i}) \]
  • \[ N_{1+}(w_{i-n+1}^{i-1} *) = |\{w_{i} : C(w_{i-n+1}^{i}) > 0\}| \]
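
To make the recursion concrete, here is a hedged sketch of interpolated Kneser-Ney for the bigram case only. It follows the standard formulation in Chen and Goodman, where the highest order uses raw counts C and the lower (unigram) order uses the continuation counts N_{1+}. The single fixed discount `discount=0.75`, the function names, and the toy counts are assumptions for illustration; the modified variant used by the app (with multiple discounts, up to 4-grams) is not reproduced here.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Return p(word, prev): interpolated Kneser-Ney bigram probability."""
    context_total = Counter()     # C(prev *): total count of bigrams starting with prev
    followers = defaultdict(set)  # distinct words seen after prev  -> N1+(prev *)
    preceders = defaultdict(set)  # distinct words seen before word -> N1+(* word)
    for (prev, word), c in bigram_counts.items():
        context_total[prev] += c
        followers[prev].add(word)
        preceders[word].add(prev)
    n_bigram_types = len(bigram_counts)  # N1+(* *)

    def p_continuation(word):
        # Share of distinct bigram types that end in `word`.
        return len(preceders.get(word, ())) / n_bigram_types

    def p(word, prev):
        total = context_total.get(prev, 0)
        if total == 0:
            return p_continuation(word)  # unseen context: use the lower order only
        discounted = max(bigram_counts.get((prev, word), 0) - discount, 0) / total
        backoff_weight = discount * len(followers[prev]) / total
        return discounted + backoff_weight * p_continuation(word)

    return p

# Toy counts (made up for the example).
bigrams = Counter({
    ("san", "francisco"): 5,
    ("new", "york"): 3,
    ("new", "glasses"): 1,
    ("reading", "glasses"): 2,
    ("my", "glasses"): 1,
})
p = kneser_ney_bigram(bigrams)
words = {w for (_, w) in bigrams}
ranked = sorted(words, key=lambda w: p(w, "new"), reverse=True)
```

In this toy data "francisco" is the most frequent word overall but follows only "san", so its continuation probability is low and it ranks last after "new", which is the behavior the continuation counts are designed to produce.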

  • Additionally, a simple back-off variant based on maximum-likelihood estimates (MLE) is used to estimate missing words within the Kneser-Ney framework.
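
One plausible reading of the back-off step, sketched under stated assumptions: try the longest available context first, and if no stored n-gram continues it, drop the oldest context word and retry, scoring candidates by their MLE relative frequency. The function names, `max_n`, `top_k`, and the tie-breaking rule are hypothetical, and the sketch stands alone rather than being wired into Kneser-Ney tables as in the app.

```python
from collections import Counter

def count_ngrams(tokens, max_n=3):
    """Count all n-grams of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict_backoff(counts, context, max_n=3, top_k=3):
    """Back off from the longest usable context until some candidate is found."""
    context = tuple(context)
    for k in range(min(len(context), max_n - 1), -1, -1):
        suffix = context[len(context) - k:]  # last k words; empty tuple when k == 0
        # Linear scan keeps the sketch short; a real table would index by context.
        candidates = {g[-1]: c for g, c in counts.items()
                      if len(g) == k + 1 and g[:k] == suffix}
        if candidates:
            total = sum(candidates.values())
            # Highest count first; ties broken alphabetically for determinism.
            ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
            return [(w, c / total) for w, c in ranked[:top_k]]
    return []

tokens = "the cat sat on the mat and the cat ate".split()
counts = count_ngrams(tokens)
seen = predict_backoff(counts, ["on", "the"])         # trigram context is present
unseen = predict_backoff(counts, ["purple", "cat"])   # backs off to the bigram context
unknown = predict_backoff(counts, ["purple", "dog"])  # backs off to unigrams
```

The ranked candidate list is also the kind of data a word cloud of alternative predictions could be drawn from.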

WordPredict - Usage and Demo

  • WordPredict
  • Enter your sentence in the text area and press the predict button
  • The right part of the screen displays your prediction along with a word cloud of other possibilities
  • https://rajm.shinyapps.io/WordPredict/