TextSeer - Word Prediction Engine

Peng Zhao
April 3, 2015

Data Science Specialization Capstone Project
-Coursera
-Johns Hopkins Bloomberg School of Public Health
-SwiftKey

The source code for the project is available on GitHub
The app: TextSeer

TextSeer Highlights

Try TextSeer now!

  • Responds to text input in real time
  • Displays the top 30 candidate words with weighted scores
  • WordCloud display for fast visualization of suggested words
  • Handles punctuation and special characters in user input
  • The language model occupies only 99.8 MB (an average of 80.5 bytes per N-gram), keeping load time and prediction latency low

Instructions

The Underlying Algorithm

  • The model was trained on a random 5% sample of the English HC Corpora, containing documents from Twitter, blogs, and news feeds (descriptive statistics).
  • Each N-gram collection (N = 1 to 4) is stored in a hash table, built with R's data.table and keyed on spooky.32 hashes for fast lookup (see the first sketch below).
  • TextSeer employs a Markov process based on a modified “Stupid Backoff” algorithm (Brants et al., 2007). For each order of input N-gram (4-grams down to 1-grams), it computes the maximum likelihood estimate (MLE) of every word that follows; these MLEs are summed across all matching N-gram orders, where k is the highest matching order, to obtain each word's weight W (see the second sketch below). \[ W_{word}=\sum_{n=1}^{k}(MLE_{word})_{n} \]
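
As a rough sketch of what that keyed N-gram storage might look like, assuming R's data.table and the spooky.32 hash from the hashFunction package (the table layout, column names, and values are illustrative assumptions, not the app's actual schema):

```r
library(data.table)
library(hashFunction)   # provides spooky.32()

# Illustrative 4-gram table: each row maps the hash of a three-word
# prefix to a word that follows it and that word's MLE.
# All names and values here are assumptions for illustration.
ngrams4 <- data.table(
  prefix_hash = sapply(c("i want to", "i want a"), spooky.32),
  next_word   = c("go", "cookie"),
  mle         = c(0.31, 0.12)
)
setkey(ngrams4, prefix_hash)   # keyed column enables binary-search lookup

# Retrieve candidate next words for a prefix
ngrams4[J(spooky.32("i want to")), next_word]
```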

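A hypothetical sketch of the weighting step follows, assuming each N-gram lookup returns a small table of candidate words with their MLEs (the function name, column names, and MLE values are invented for illustration):

```r
library(data.table)

# Sum MLE scores per candidate word across all matching N-gram orders,
# following the formula W_word = sum over n of (MLE_word)_n.
score_candidates <- function(candidate_tables) {
  all_cands <- rbindlist(candidate_tables)          # stack all orders
  weights   <- all_cands[, .(W = sum(mle)), by = word]
  head(weights[order(-W)], 30)                      # top 30, as in the app
}

# Toy example: lookup results from a trigram and a bigram table
trigram <- data.table(word = c("go", "see"),       mle = c(0.45, 0.15))
bigram  <- data.table(word = c("go", "be", "see"), mle = c(0.20, 0.10, 0.05))
score_candidates(list(trigram, bigram))
# "go" accumulates W = 0.45 + 0.20 = 0.65 and ranks first
```
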
Further Exploration

  1. Context information is discarded for N-grams where N > 4.
  2. The average bytes per N-gram could be reduced further by storing unique words as integer IDs in the higher-order N-gram tables (see the sketch after this list).
  3. Implementing interpolation (the Kneser-Ney algorithm) may improve prediction accuracy but reduce performance.
  4. A table of known contractions would allow their use in the model.
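
A minimal sketch of the integer-encoding idea in item 2, assuming a single shared vocabulary vector (all names and values are hypothetical):

```r
library(data.table)

# Store each unique word once and refer to it by integer ID,
# instead of repeating strings in every N-gram row.
vocab   <- c("i", "want", "to", "go")           # unique words, stored once
word_id <- setNames(seq_along(vocab), vocab)    # word -> integer ID

# N-gram rows become compact integer columns
ngrams <- data.table(
  w1 = word_id[c("i", "want")],
  w2 = word_id[c("want", "to")],
  w3 = word_id[c("to", "go")]
)

# Decoding an ID back to a word is a cheap vector lookup
vocab[ngrams$w3]   # "to" "go"
```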

The app: TextSeer